Why You Must NOT Merge Species FASTA Databases in Cross-Species Proteomics (Shared Peptide Problem)

In cross-species LC-MS/MS proteomics (Porcine ECM vs Mouse Matrigel, human vs mouse cells, host-pathogen, etc.), merging the species FASTA files into one search database breaks protein quantification. This is the shared-peptide ambiguity problem. Here's why it happens, how to set up species-separated searches in MaxQuant and DIA-NN, and how to handle peptides that genuinely match both species.

The Single Most Common Cross-Species Proteomics Mistake

You have two sets of LC-MS/MS samples: Porcine esophageal ECM (n=5) and Mouse Matrigel (n=3). You want to compare protein composition. The instinct is to open MaxQuant, load both the Sus scrofa and Mus musculus FASTA files into one search, and hit run.

Don't. This breaks the quantification at the peptide-to-protein assignment step, and the failure mode is silent — your numbers come out, they just don't mean what you think.

This guide walks through why a merged-FASTA search fails in cross-species comparisons, the correct species-separated search workflow in MaxQuant and DIA-NN, and what to do with the small fraction of peptides that legitimately match both species. It's the principle that underpins the entire cross-species ECM proteomics workflow described in the Park et al. 2026 reproduction post, but worth its own focused write-up.

Why Merging Breaks Cross-Species Quantification

Tryptic peptides are often species-shared

Many tryptic peptides — particularly from highly conserved proteins like histones, ribosomal subunits, actin, GAPDH — are identical across mammalian species. Pig actin and mouse actin share most tryptic peptides at the sequence level.

In a single-species search, this is fine. The peptide maps unambiguously to "this species' actin."

In a merged Pig+Mouse FASTA search, the same MS/MS spectrum matches:

  • ACTB_PIG
  • ACTB_MOUSE

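The search engine has to decide which protein "owns" the peptide. When the two entries cannot be told apart at all, MaxQuant collapses them into one mixed-species protein group; when each also has peptides of its own, the shared peptide becomes a razor peptide and goes to the protein group with the most identified peptides. DIA-NN's protein inference makes a comparable parsimony-style choice. Either way, the assignment is arbitrary from a biological standpoint.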
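To see how much overlap a conserved protein produces, here is a minimal in-silico digest sketch. The two sequences are invented toy strings that differ by one residue, not real UniProt entries, and `tryptic_peptides` is a hypothetical helper that applies the Trypsin/P rule (cleave after every K or R) with a minimum peptide length of 7.

```python
import re

def tryptic_peptides(sequence, missed_cleavages=0, min_length=7):
    """In-silico Trypsin/P digest: cleave after every K or R."""
    # Zero-width split after K/R (needs Python 3.7+); drop the empty tail.
    fragments = [f for f in re.split(r"(?<=[KR])", sequence) if f]
    peptides = set()
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` extra fragments onto fragment i.
        for j in range(i, min(i + missed_cleavages + 1, len(fragments))):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides

# Toy sequences (assumption: invented for illustration, one E->D substitution).
PIG_TOY = "MSTAGKVIDLEEAARAGFAGDDAPRLLTEAPLNPKGYSFTTTAER"
MOUSE_TOY = "MSTAGKVIDLDEAARAGFAGDDAPRLLTEAPLNPKGYSFTTTAER"

pig = tryptic_peptides(PIG_TOY)
mouse = tryptic_peptides(MOUSE_TOY)

shared = pig & mouse        # peptides a merged search cannot attribute
pig_only = pig - mouse      # peptides that identify the pig protein
mouse_only = mouse - pig    # peptides that identify the mouse protein

print("shared:", sorted(shared))
print("pig only:", sorted(pig_only))
print("mouse only:", sorted(mouse_only))
```

With a single substitution, three of the four quantifiable peptides are shared, which is the situation a merged search has to resolve by tie-breaker.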

What goes wrong downstream

When you compute the protein-level quantity:

  • Porcine sample searched against the Sus scrofa FASTA only: every peptide maps to a _PIG protein → quantification correct
  • Mouse sample searched against the Mus musculus FASTA only: every peptide maps to a _MOUSE protein → quantification correct
  • Either sample in the merged search: peptides matching both species are assigned to one or the other by tie-breaker → some intensity that belongs to Porcine ACTB is credited to Mouse ACTB (or vice versa), depending on the run's overall ID counts

The result: your protein quantities are wrong by a variable, sample-dependent amount. The volcano plot looks fine. Nothing visibly breaks. But the fold changes you publish reflect peptide assignment artifacts as much as biology.
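The effect can be illustrated with a deliberately simplified model. This is not MaxQuant's or DIA-NN's actual inference code: `razor_owner` is a hypothetical stand-in that hands the shared-peptide signal to whichever protein has more unique peptide IDs in the run, and the intensities are made-up numbers.

```python
def razor_owner(run_level_unique_counts):
    """Pick the protein with the most unique peptide IDs (ties: alphabetical)."""
    return max(sorted(run_level_unique_counts), key=run_level_unique_counts.get)

def reported_intensities(unique_signal, shared_signal, run_level_unique_counts):
    """Add the shared-peptide signal to whichever protein owns it in this run."""
    totals = dict(unique_signal)
    totals[razor_owner(run_level_unique_counts)] += shared_signal
    return totals

# One porcine sample: biologically, all of this signal is porcine actin.
pig_sample_unique = {"ACTB_PIG": 4_000_000, "ACTB_MOUSE": 0}
pig_sample_shared = 6_000_000

# Same sample, two merged searches that differ only in run-level ID counts.
pig_wins = reported_intensities(
    pig_sample_unique, pig_sample_shared, {"ACTB_PIG": 3, "ACTB_MOUSE": 2}
)
mouse_wins = reported_intensities(
    pig_sample_unique, pig_sample_shared, {"ACTB_PIG": 2, "ACTB_MOUSE": 3}
)

print(pig_wins)    # shared signal credited to the pig protein
print(mouse_wins)  # shared signal credited to the mouse protein
```

The sample itself is identical in both cases; only the tie-breaker outcome changes, yet the reported porcine ACTB intensity differs by more than twofold.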

Why it's silent

This is the dangerous part. No software warns you. Both MaxQuant and DIA-NN will happily run a merged search and produce a proteinGroups.txt or report.pg_matrix.tsv. The numbers look reasonable. The downstream statistics run. You publish — and the result isn't reproducible.

The Correct Setup — Species-Separated Searches

General principle

Run one search per species. Each search uses only that species' FASTA. Then combine the per-species protein quantities for cross-species comparison at the ortholog level, not at the peptide level.

Porcine samples (5 files) → search w/ Sus scrofa FASTA only   → porcine protein matrix
Mouse samples (3 files)   → search w/ Mus musculus FASTA only → mouse protein matrix
                                                ↓
                                   Join via BioMart 1:1 ortholog table
                                                ↓
                                   Cross-species comparison
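
The join step in this workflow can be sketched in plain Python. Everything here is illustrative: the protein IDs, the quantities, and the ortholog pairs are hypothetical, and `one_to_one` simply drops any pair whose pig or mouse ID appears more than once, which is one reasonable reading of "1:1 ortholog".

```python
from collections import Counter

def one_to_one(pairs):
    """Keep only ortholog pairs where each ID appears exactly once on its side."""
    pig_counts = Counter(pig_id for pig_id, _ in pairs)
    mouse_counts = Counter(mouse_id for _, mouse_id in pairs)
    return [
        (pig_id, mouse_id)
        for pig_id, mouse_id in pairs
        if pig_counts[pig_id] == 1 and mouse_counts[mouse_id] == 1
    ]

def join_by_ortholog(pig_quant, mouse_quant, pairs):
    """Pair up per-species protein quantities through the 1:1 ortholog table."""
    rows = []
    for pig_id, mouse_id in one_to_one(pairs):
        # Keep only orthologs quantified in both species-separated searches.
        if pig_id in pig_quant and mouse_id in mouse_quant:
            rows.append((pig_id, mouse_id, pig_quant[pig_id], mouse_quant[mouse_id]))
    return rows

# Hypothetical per-species outputs (e.g. mean log2 iBAQ per protein).
pig_quant = {"PIG_COL1A1": 28.1, "PIG_LAMB1": 24.3, "PIG_HIST": 30.0}
mouse_quant = {"MOUSE_COL1A1": 22.4, "MOUSE_LAMB1": 29.8,
               "MOUSE_HISTA": 27.0, "MOUSE_HISTB": 26.5}

# Hypothetical ortholog table; PIG_HIST maps one-to-many and is dropped.
ortholog_pairs = [
    ("PIG_COL1A1", "MOUSE_COL1A1"),
    ("PIG_LAMB1", "MOUSE_LAMB1"),
    ("PIG_HIST", "MOUSE_HISTA"),
    ("PIG_HIST", "MOUSE_HISTB"),
]

joined = join_by_ortholog(pig_quant, mouse_quant, ortholog_pairs)
for row in joined:
    print(row)
```

No peptide ever crosses the species boundary here; the only link between the two searches is the ortholog table.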

MaxQuant — concrete setup

In the MaxQuant GUI, set up one run per species. The FASTA list under Global parameters → Sequences applies to every raw file in a run, and parameter groups only vary group-specific settings (enzyme, modifications, instrument type), so a per-species database requires a separate run. Confirm this in your MaxQuant version.

  1. Porcine run, Raw files tab: load the 5 porcine raw files only
  2. Group-specific parameters:
    • Standard variable/fixed mods (Oxidation M, Acetyl N-term; Carbamidomethyl C)
    • Trypsin/P, 2 missed cleavages
  3. Global parameters → Sequences:
    • FASTA: sus_scrofa_UP000008227.fasta only
  4. Global parameters → Protein quantification:
    • iBAQ: ON
    • iBAQ log fit: ON
  5. Match between runs (MBR): enabled; with one species per run, identifications can only be transferred within that species
  6. Mouse run: repeat steps 1-5 in a second MaxQuant run with the 3 mouse raw files and mus_musculus_UP000000589.fasta only, keeping modifications and enzyme identical

Output: two proteinGroups.txt files, one per run and species, with clean species-specific quantification.

DIA-NN — concrete setup

DIA-NN has no parameter groups; run one invocation per species:

# Porcine run
diann --f porcine_1.d --f porcine_2.d ... \
      --fasta sus_scrofa_UP000008227.fasta \
      --lib porcine_lib.predicted \
      --out porcine_report.tsv \
      --threads 16

# Mouse run (separate invocation)
diann --f mouse_1.d --f mouse_2.d ... \
      --fasta mus_musculus_UP000000589.fasta \
      --lib mouse_lib.predicted \
      --out mouse_report.tsv \
      --threads 16

For library-free DIA-NN: also predict the spectral library separately per species.

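Critically: do not let MBR operate across species. DIA-NN's MBR (--reanalyse) builds an empirical library from all runs in one invocation and re-searches them with it, so keeping each species in its own invocation confines MBR to that species. Do not pass one species' empirical library to the other species' run.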
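
If the two invocations are generated by a script, the separation can be checked before anything runs. The sketch below only assembles argument lists from the flags already shown above (--f, --fasta, --lib, --out, --threads); `diann_command` and `check_species_separated` are hypothetical helpers, and the raw file names are placeholders.

```python
import shlex

def diann_command(raw_files, fasta, library, out, threads=16):
    """Assemble one DIA-NN invocation that sees exactly one species' FASTA."""
    args = ["diann"]
    for raw in raw_files:
        args += ["--f", raw]
    args += ["--fasta", fasta, "--lib", library, "--out", out,
             "--threads", str(threads)]
    return args

def check_species_separated(commands):
    """Raise if any raw file, FASTA, or library is reused across invocations."""
    seen = {}
    for species, args in commands.items():
        # Values follow their flag, so pair each flag with the next argument.
        for flag, value in zip(args, args[1:]):
            if flag in ("--f", "--fasta", "--lib"):
                if value in seen and seen[value] != species:
                    raise ValueError(f"{value} used for {seen[value]} and {species}")
                seen[value] = species

commands = {
    "porcine": diann_command(
        ["porcine_1.d", "porcine_2.d"], "sus_scrofa_UP000008227.fasta",
        "porcine_lib.predicted", "porcine_report.tsv"),
    "mouse": diann_command(
        ["mouse_1.d", "mouse_2.d"], "mus_musculus_UP000000589.fasta",
        "mouse_lib.predicted", "mouse_report.tsv"),
}

check_species_separated(commands)
for species, args in commands.items():
    print(f"# {species}\n{shlex.join(args)}")
```

The printed lines can be pasted into a shell or handed to a job scheduler; the check fails loudly if a FASTA or library is accidentally shared between the two runs.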

What About Peptides That Genuinely Match Both Species?

A minority of peptides (often 10-30%) is truly identical across species. Three approaches are defensible:

Option A — Drop shared peptides (most conservative)

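After species-separated quantification, quantify each protein only from peptides whose sequence does not occur in the other species' proteome. Note that "unique" as reported by a single-species search means unique within that species' FASTA; a peptide can be unique within the pig reference and still be identical to its mouse ortholog, so cross-species uniqueness has to be checked against both FASTA files.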

  • Pro: cleanest biological interpretation
  • Con: lower protein coverage, especially for highly conserved proteins
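
One way to flag cross-species peptides is to digest both FASTA files in silico and look each identified peptide up in both sets. The sketch below assumes Trypsin/P with 2 missed cleavages and a minimum length of 7 (matching the search settings above), treats I and L as indistinguishable, and uses tiny invented FASTA strings in place of the real proteome files.

```python
import re

def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def digest(sequence, missed_cleavages=2, min_length=7):
    """Trypsin/P digest; I is rewritten to L because MS/MS cannot separate them."""
    fragments = [f for f in re.split(r"(?<=[KR])", sequence) if f]
    peptides = set()
    for i in range(len(fragments)):
        for j in range(i, min(i + missed_cleavages + 1, len(fragments))):
            peptides.add("".join(fragments[i:j + 1]))
    return {p.replace("I", "L") for p in peptides if len(p) >= min_length}

def proteome_peptides(fasta_text):
    """Union of digested peptides over every entry in one FASTA."""
    peptides = set()
    for _, sequence in read_fasta(fasta_text):
        peptides |= digest(sequence)
    return peptides

def classify(peptide, pig_peptides, mouse_peptides):
    """Label a peptide as shared, pig_unique, mouse_unique, or unmatched."""
    key = peptide.replace("I", "L")
    in_pig, in_mouse = key in pig_peptides, key in mouse_peptides
    if in_pig and in_mouse:
        return "shared"
    if in_pig:
        return "pig_unique"
    if in_mouse:
        return "mouse_unique"
    return "unmatched"

# Toy FASTA text (assumption: invented entries, not real proteome records).
PIG_FASTA = ">toy_pig_1\nMSTAGKVIDLEEAARAGFAGDDAPR\nLLTEAPLNPK\n"
MOUSE_FASTA = ">toy_mouse_1\nMSTAGKVIDLDEAARAGFAGDDAPRLLTEAPLNPK\n"

pig_set = proteome_peptides(PIG_FASTA)
mouse_set = proteome_peptides(MOUSE_FASTA)

for peptide in ["AGFAGDDAPR", "VIDLEEAAR", "VIDLDEAAR", "PEPTIDEK"]:
    print(peptide, classify(peptide, pig_set, mouse_set))
```

For real proteomes, read the two FASTA files from disk and pass their contents to `proteome_peptides`; the resulting sets can then filter the peptide table of each species-separated search.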

Option B — Use shared peptides only for ID, not for quantification

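Allow shared peptides to contribute to protein identification (boost confidence) but quantify proteins from species-unique peptides only. In MaxQuant, the "Peptides for quantification" setting (Unique vs. Unique + razor) controls which peptides enter quantification, and proteinGroups.txt reports the corresponding counts in its "Unique peptides" and "Razor + unique peptides" columns.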

  • Pro: better ID coverage, clean quantification
  • Con: more bookkeeping
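
The bookkeeping for Option B amounts to keeping two counts per protein. A minimal sketch, assuming you already have a per-protein peptide table and a set of cross-species peptide sequences (both hypothetical here, as is the `summarize_protein` helper):

```python
def summarize_protein(peptides, shared):
    """Count all peptides for identification, sum only species-unique ones.

    peptides: {sequence: intensity} for one protein in one sample
    shared:   set of sequences that also occur in the other species
    """
    unique = {seq: inten for seq, inten in peptides.items() if seq not in shared}
    return {
        "n_peptides_for_id": len(peptides),
        "n_peptides_for_quant": len(unique),
        # No species-unique peptide means no defensible quantity.
        "intensity": sum(unique.values()) if unique else None,
    }

# Hypothetical peptide intensities for one porcine protein.
actb_pig = {"AGFAGDDAPR": 5_000_000, "LLTEAPLNPK": 1_000_000, "VIDLEEAAR": 4_000_000}
cross_species = {"AGFAGDDAPR", "LLTEAPLNPK"}

summary = summarize_protein(actb_pig, cross_species)
print(summary)
```

Reporting both counts alongside the intensity makes it easy to see which proteins rest on a single unique peptide.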

Option C — Accept shared-peptide quantification but report it

If you must include shared peptides (e.g., for histones where unique peptides are rare), report it explicitly in Methods and note it as a limitation in Discussion. Reviewers in the proteomics field will accept this if it's transparent.

For most cross-species ECM, immune, or signaling proteomics, Option A or B is preferred — most proteins of interest (collagens, laminins, cytokines) have plenty of species-unique peptides.

Common Variants of the Same Mistake

The shared-peptide problem appears in multiple cross-species contexts:

Host-pathogen proteomics

Infected human cells + bacterial pathogen. Many ribosomal and metabolic proteins share peptides. Same fix: search human-only and pathogen-only separately, then analyze the host and pathogen proteomes as two distinct outputs.

Patient-derived xenograft (PDX) tumors

Human tumor cells grown in immunocompromised mouse → biopsied tissue contains both human and mouse cells (stroma, vasculature). Same fix: separate human and mouse searches; the relative ID counts also report on tumor cellularity vs stromal infiltration.

Mixed-species cell culture (co-culture)

Same principle. Separate searches.

Spike-in standards

If you add a spike-in standard from a non-target species (e.g., a yeast lysate in human samples), search the spike-in FASTA separately too, or quantify it only via dedicated targeted analysis tools (Skyline).

Quick Sanity Checks for Existing Datasets

If you inherited a cross-species dataset and aren't sure how it was searched:

  1. Open proteinGroups.txt — look at the Protein IDs column. If you see entries like sp|P12345|ACTB_PIG;sp|P67890|ACTB_MOUSE, the search merged FASTAs. The quantification is suspect.
  2. Check the Fasta headers column — should be one species' UniProt entries per protein group in a properly separated search.
  3. Compare top-100 quantified proteins between samples — in a merged search you'll see suspiciously consistent "shared" proteins across species samples; in a clean separated search, conserved proteins still appear but their per-species intensities are independent.
  4. In DIA-NN reports — the Protein.Group column similarly reveals whether assignments crossed species boundaries.
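
Check 1 can be scripted. The sketch below scans the Protein IDs column of a tab-separated proteinGroups.txt for groups that mix UniProt species suffixes such as _PIG and _MOUSE. It assumes the identifiers kept the UniProt entry name (as in the example IDs above); if your parse rule kept accessions only, point `column` at Fasta headers and adapt the parsing. The two demo rows are invented.

```python
import csv
import io
import re

SPECIES_SUFFIX = re.compile(r"_([A-Z0-9]+)$")

def species_in_group(protein_ids):
    """Collect UniProt species suffixes (PIG, MOUSE, ...) from one protein group."""
    species = set()
    for entry in protein_ids.split(";"):
        entry = entry.strip()
        # Skip MaxQuant contaminant and decoy entries.
        if entry.startswith(("CON__", "REV__")):
            continue
        # UniProt-style IDs look like sp|P12345|ACTB_PIG; take the last field.
        name = entry.split("|")[-1]
        match = SPECIES_SUFFIX.search(name)
        if match:
            species.add(match.group(1))
    return species

def mixed_species_groups(handle, column="Protein IDs"):
    """Return the protein groups whose members come from more than one species."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row[column] for row in reader if len(species_in_group(row[column])) > 1]

# Invented two-row stand-in for proteinGroups.txt.
demo = (
    "Protein IDs\tIntensity\n"
    "sp|P12345|ACTB_PIG;sp|P67890|ACTB_MOUSE\t1000\n"
    "sp|Q00001|COL1A1_PIG\t500\n"
)

mixed = mixed_species_groups(io.StringIO(demo))
print(f"{len(mixed)} mixed-species protein group(s)")
for group in mixed:
    print(group)
```

For a real file, replace the `io.StringIO(demo)` argument with `open("proteinGroups.txt", newline="")`. Any non-empty result means the search saw more than one species' FASTA.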

If the dataset was merged-FASTA searched, the only fix is re-running the searches species-separated. There's no post-hoc statistical correction.

What This Means for Reproducibility

When you publish cross-species proteomics, your Methods should explicitly state:

Database searches were performed separately for each species, using Sus scrofa UniProt UP000008227 for porcine samples and Mus musculus UniProt UP000000589 for mouse samples (downloaded YYYY-MM-DD). Match-between-runs was restricted to within-species comparisons. Cross-species comparisons were performed at the ortholog level using Ensembl BioMart Release [N] 1:1 ortholog tables.

This sentence is small but it's how reviewers and future replicators know your quantification is valid. Its absence is the easiest red flag for a careful reviewer.

FAQ

Q: My samples are all the same species but include trace contamination from another species. Same rules?
Mostly yes. If you suspect contamination (e.g., mouse cells in a human-only experiment), run an additional QC search against a contaminant database (mouse FASTA + cRAP) and inspect what gets identified there. This is useful for QC, but the main quantification should still come from the target-species search only.

Q: Can I post-process to fix a merged search?
No statistically valid way. You can filter to species-unique peptides retrospectively, but the FDR was computed against the merged database, so significance estimates are off. Re-running the search separately is the only clean fix.

Q: What about closely related strains, e.g., two different mouse strains?
For nearly identical references, separate searches add complexity for minimal benefit. One unified FASTA per species is fine. The shared-peptide problem only matters when species (or significantly diverged reference proteomes) are mixed.

Q: Does this also apply to RNA-seq cross-species analysis?
Conceptually yes: read-mapping ambiguity is the analogue. But standard practice in cross-species RNA-seq already aligns to species-specific references separately and joins at the ortholog level. The MS proteomics community has been slower to standardize because tools allow merged searches without warning.

Q: How do I find Ensembl BioMart 1:1 orthologs to join species after separated searches?
That's a separate workflow; see BioMart Pig ↔ Mouse 1:1 Ortholog Mapping for Cross-Species Proteomics (R + Python Tutorial).

Closing — The One Sentence

In cross-species LC-MS/MS proteomics: one search per species, join at the ortholog level afterward, never merge species FASTA into a single search. The shared-peptide problem makes merged-FASTA quantification silently wrong, and there's no post-hoc fix.


References:

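  • Cox, J. & Mann, M. (2008). MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology, 26, 1367-1372.
  • Demichev, V. et al. (2020). DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nature Methods, 17, 41-44.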
  • Park, S. M. et al. (2026). Decellularized esophageal ECM. (Yonsei, Cho Lab)