Reproducing Park et al. 2026: Three Iterations of a Cross-Species ECM Proteomics Pipeline



Quick Answer (TL;DR)

Three concrete iteration failures in reproducing a cross-species ECM proteomics pipeline, with their fixes:

Iteration 1 (circular logic): simulated data built from the target paper's results validated against itself — looked perfect, proved nothing. Fix: validate only against independent data (e.g., PRIDE PXD023694).
Iteration 2 (pseudocount artifact): adding +1e-6 to handle missing values caused log2 fold changes to explode (±10) for proteins detected in one group only. Fix: use valid-value filtering (≥k of n detected per group), no pseudocounts.
Iteration 3 (gene-symbol ortholog mapping): matching by uppercased gene symbol got ~24% coverage with silent paralog collapsing. Fix: Ensembl BioMart Release 111 one-to-one ortholog table for ~50-60% verified mapping.

The final pipeline validated against PRIDE PXD023694 reproduced the canonical Matrigel basement membrane signature (LAMB1, NID1, LAMC1, HSPG2) and Porcine ECM collagen signature (COL1A1, COL6A1, BGN). Total time: 2-3 days, half of which was debugging the above traps.

Definition

Cross-species ECM proteomics is LC-MS/MS-based comparison of extracellular matrix protein composition between species (typically porcine or human decellularized tissue vs mouse Matrigel) used to evaluate biological similarity for regenerative medicine scaffolds. It requires species-separated database searches via MaxQuant or DIA-NN followed by ortholog-level integration via Ensembl BioMart, because shared tryptic peptides between mammalian species cause silent quantification errors in merged-FASTA searches. Reference dataset for benchmarking: PRIDE PXD023694.

The Starting Point — A One-Line Request

"Can you reproduce the Park et al. 2026 methodology (esophagus EEM vs Matrigel) on our data?"

A collaborator's one-line message kicked this off. They were comparing extracellular matrix (ECM) protein composition between decellularized porcine esophagus and Matrigel, and a recent paper by Park et al. 2026 from the Cho lab at Yonsei University provided the methodological template. The request was simple. The work was not.

This post is a working record of that 2-3 day analysis — specifically, the three traps I fell into while building a reproducible cross-species multi-omics pipeline, and why none of them were visible in simulation alone. In a year flooded with AI-generated technical writing, I wanted to document what actually breaks when you run the code, not what an abstract says should work.

Background — Why Cross-Species ECM Proteomics

Decellularized ECM scaffolds from porcine esophagus, uterus, or stomach are increasingly used in regenerative medicine. The two most common scaffold options are:

Matrigel (Corning): derived from mouse Engelbreth-Holm-Swarm tumor; rich in basement membrane proteins (laminin, collagen IV)
Tissue-specific EEM (Extracted ECM): from animal organs after decellularization; retains the actual ECM composition of the source tissue

The question that matters before any clinical application: how different are these two compositions, quantitatively? Park et al. 2026 answered this for esophagus; Cha et al. 2024 for uterus. Both came from the same group with consistent methodology.

The standard workflow:

LC-MS/MS for protein quantification (MaxQuant + iBAQ)
iBAQ → riBAQ (relative iBAQ = iBAQ / Σ iBAQ) normalization
Cross-species mapping via Ensembl BioMart 1:1 orthologs (Sus scrofa ↔ Mus musculus)
Welch's t-test + BH-FDR for DEP (Differentially Expressed Proteins) classification
Visualization (Volcano, PCA, GO enrichment)

The principles are textbook. The traps were in the details.
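As a minimal sketch of the riBAQ step in the workflow above: each sample's iBAQ values are divided by that sample's total, so every column sums to 1. The function name `to_ribaq`, the sample names, and the toy values are hypothetical and only illustrate the formula riBAQ = iBAQ / Σ iBAQ.

```python
import pandas as pd

def to_ribaq(ibaq: pd.DataFrame) -> pd.DataFrame:
    """Convert an iBAQ table (proteins x samples) to riBAQ, per sample."""
    # Divide each column by its own total so every sample sums to 1.
    return ibaq.div(ibaq.sum(axis=0), axis=1)

# Hypothetical toy values, for illustration only.
ibaq = pd.DataFrame(
    {"EEM_1": [300.0, 100.0, 0.0], "EEM_2": [150.0, 50.0, 50.0]},
    index=["COL1A1", "BGN", "LAMB1"],
)
ribaq = to_ribaq(ibaq)
print(ribaq)
```

Because the normalization is per sample, riBAQ values are comparable across runs with different total intensities, which is what makes the later cross-species comparison possible at all.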

Trap #1 — v1: Circular Logic in the Simulation Pipeline

The collaborator's raw data hadn't arrived yet, but they wanted to see a working demo of the pipeline. So I built v1 using simulated data.

The structure of v1, in retrospect, was self-defeating:

I read Park et al. 2026 Fig 3 and manually compiled an esophagus ECM marker catalog (COL1A1, FN1, COL3A1, etc.)
For the Mouse Matrigel side, I baseline-seeded known basement membrane proteins (LAMB1, NID1, LAMC1, HSPG2, ...)
Generated simulated riBAQ values
Ran the analysis pipeline → "Matches Park's paper exactly!" ✓

Initially I was satisfied. Then, while reviewing the figures, something felt off. The simulation results were too clean — even the top 5 GO categories matched the paper's order exactly.

It hit me when I traced the logic back. Of course they matched. The catalog had been populated from the paper's results. The analysis output was reproducing the paper's findings because I had encoded those findings as input. This wasn't validation — it was circular logic, the analytical equivalent of an echo chamber.

After this realization, I demoted v1 to "methodology demo + MaxQuant guide + FAQ" status. Real validation would have to come from genuinely independent data — public datasets in PRIDE and proper ortholog mapping via Ensembl BioMart (the path that would eventually become v2.2).

Lesson: When building a pipeline demo with simulated data, if your input distribution is informed by the target results, your validation collapses to tautology. A simulation is only useful for validation if it either (1) samples from a distribution that was not derived from the target results or (2) uses ground-truth data that is independent of the validation criteria.
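The circularity can be reproduced in a few lines. This is a toy sketch, not the v1 code: the marker list, the 16x boost, the background protein names, and the random seed are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Markers copied from the "target paper" (here: a hypothetical short list).
seeded_markers = ["COL1A1", "FN1", "COL3A1"]
background = [f"BG{i:02d}" for i in range(50)]
proteins = seeded_markers + background

# Simulate two groups of 4 samples with identical background distributions.
group_a = pd.DataFrame(rng.lognormal(0.0, 0.3, (len(proteins), 4)), index=proteins)
group_b = pd.DataFrame(rng.lognormal(0.0, 0.3, (len(proteins), 4)), index=proteins)

# Encode the paper's finding as input: boost the seeded markers 16x in group A.
group_a.loc[seeded_markers] *= 16.0

# "Analysis": rank proteins by log2 fold change of group means.
log2_fc = np.log2(group_a.mean(axis=1) / group_b.mean(axis=1))
top_hits = log2_fc.sort_values(ascending=False).head(len(seeded_markers))

# The top hits are the seeded markers by construction, not by discovery.
print(sorted(top_hits.index))
```

The analysis "recovers" exactly what was put in, so agreement with the paper says nothing about whether the pipeline would find the same signal in real samples.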

Trap #2 — v2.0: Pseudocount Artifacts in One-Sided Detection

After scrapping v1, I built v2.0 with more realistic simulation: random sampling, detection limit modeling, missing value patterns.

Plotting the results, the figures looked wrong again. The Volcano plot had proteins sitting in implausibly extreme positions — log2(FC) values of ±10 or more for many genes.

The root cause was a familiar pattern:

Group A (Porcine EEM, n=5): detected in 5/5 samples (mean riBAQ ~0.001)
Group B (Mouse MAT, n=3): detected in 0/3 samples → log transform → -∞

To handle the zeros, I had added a pseudocount (e.g., 1e-6) to every value. But when the pseudocount is much smaller than the actual detection limit, the fold change calculation has an artificially small denominator and the ratio explodes:

# Wrong code (v2.0)
log2_fc = np.log2((meanA + 1e-6) / (meanB + 1e-6))
# If meanB is truly 0 → log2(meanA / 1e-6) → very large value

Some of these "exploded" proteins were biologically meaningful (collagens that should only appear in EEM, for instance). But the analysis pipeline can't distinguish real one-sided signal from noise.
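A short worked example makes the artifact concrete. It uses the group A mean from the case above (riBAQ ~0.001) and a group B mean of zero, then varies only the pseudocount; the two alternative pseudocount values are arbitrary choices for comparison.

```python
import numpy as np

mean_a = 1e-3   # group A mean riBAQ, as in the example above
mean_b = 0.0    # group B: not detected in any sample

# The "fold change" is set entirely by the pseudocount choice.
results = {}
for pseudo in (1e-6, 1e-5, 1e-4):
    results[pseudo] = np.log2((mean_a + pseudo) / (mean_b + pseudo))
    print(f"pseudocount={pseudo:g}  log2FC={results[pseudo]:.2f}")
# pseudocount=1e-06  log2FC=9.97
# pseudocount=1e-05  log2FC=6.66
# pseudocount=0.0001  log2FC=3.46
```

The same protein lands at log2(FC) ≈ 10, 6.7, or 3.5 depending on a constant that has no connection to the instrument's detection limit, which is why these points cannot be interpreted quantitatively.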

v2.1 fix — valid value filtering:

A protein enters quantitative comparison only if it is above detection in at least 3 of 4 samples in each group (in general, ≥ n−1 of n)
Proteins detected in only one group are reported separately as "qualitative only"
No pseudocounts; the quantitative analysis operates on quantifiable proteins only

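# Corrected (v2.1)
# df: proteins x samples riBAQ table; sample columns contain 'EEM' or 'MAT'
mask_a = (df.filter(like='EEM') > 0).sum(axis=1) >= 3   # 3 of 5
mask_b = (df.filter(like='MAT') > 0).sum(axis=1) >= 3   # 3 of 3
quantifiable = df[mask_a & mask_b]
# Passes the filter in exactly one group; proteins failing both are dropped
qualitative_only = df[mask_a ^ mask_b]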

After this change the Volcano distribution looked sane — most log2(FC) values within ±5, and the remaining extreme values were genuine EEM- or MAT-specific proteins worth examining.

Lesson: Pseudocounts are convenient but dangerous in LC-MS proteomics. Detection limits vary per sample and per peptide, and the line between "truly zero" and "below detection" is fuzzy. Valid-value filtering (k-out-of-n detection) is closer to standard practice; it is a routine filtering step in tools like Perseus and Proteus.

Trap #3 — Ortholog Mapping: Catalog-Based vs Authoritative BioMart

Through v2.1, ortholog mapping was limited to a curated catalog of 178 proteins, with naive gene-symbol uppercase matching (Porcine COL1A1 ↔ Mouse Col1a1).

This gave roughly a 24% mapping rate. I told the collaborator "real data will map at much higher rates," but I hadn't actually verified that against any real data.

v2.2 switched to the official Ensembl BioMart Release 111 Pig ↔ Mouse 1:1 ortholog table:

# BioMart 1:1 orthologs (Sus scrofa ↔ Mus musculus, Release 111)
import pandas as pd

ortholog_table = pd.read_csv('biomart_pig_mouse_one2one_r111.tsv', sep='\t')
# 13,476 1:1 pairs

Result: all the key ECM markers (COL1A1, COL6A1, LAMB1, LAMC1, NID1, FN1, BGN, HSPG2, ...) passed BioMart verification. Mapping rates rose to 50-60% range for general comparisons.

The important point: gene symbol uppercase matching works for the majority case, but cross-species ortholog relationships aren't always 1:1. GAPDH has a paralog (GAPDHS) and many pseudogenes; some immune-related genes have ambiguous orthology across species. Using an authoritative ortholog database is the safer default for reproducible cross-species analyses.

Lesson: Symbol-uppercase matching works for ~80% of cases but generates false positives/negatives in the remaining 20%. For cross-species comparisons, Ensembl BioMart or InParanoid ortholog tables should be the default.
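A minimal sketch of table-based mapping, assuming a simplified ortholog table. The column names (`pig_symbol`, `mouse_symbol`, `homology_type`) and the genes `GENEX` and `NOVEL1` are hypothetical; a real BioMart export uses its own headers and is better keyed on Ensembl gene IDs than on symbols. The value `ortholog_one2one` is Ensembl's label for 1:1 pairs.

```python
import pandas as pd

# Toy stand-in for a BioMart export; column names here are hypothetical.
orthologs = pd.DataFrame(
    {
        "pig_symbol": ["COL1A1", "LAMB1", "GENEX", "GENEX"],
        "mouse_symbol": ["Col1a1", "Lamb1", "Genex1", "Genex2"],
        "homology_type": [
            "ortholog_one2one",
            "ortholog_one2one",
            "ortholog_one2many",
            "ortholog_one2many",
        ],
    }
)

pig_hits = pd.DataFrame({"pig_symbol": ["COL1A1", "LAMB1", "GENEX", "NOVEL1"]})

# Keep only 1:1 pairs, then left-join so unmapped proteins stay visible.
one2one = orthologs[orthologs["homology_type"] == "ortholog_one2one"]
mapped = pig_hits.merge(
    one2one[["pig_symbol", "mouse_symbol"]],
    on="pig_symbol",
    how="left",
    validate="many_to_one",  # fails loudly if a pig gene maps to >1 mouse gene
)

unmapped = mapped[mapped["mouse_symbol"].isna()]["pig_symbol"].tolist()
print(mapped)
print("unmapped:", unmapped)
```

The left join plus `validate` is the design choice that matters here: ambiguous or missing orthologs show up as explicit unmapped rows instead of being collapsed silently, which is the failure mode of symbol matching.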

Independent Validation — Pilot on PRIDE PXD023694

Even after three simulation iterations, the question "does the same pattern appear in real data?" was still open. The collaborator's raw data hadn't arrived.

The solution: download PRIDE dataset PXD023694, published by the same Cho lab. It had the exact same cross-species design (Porcine ECM vs Mouse Matrigel), the same LC-MS/MS setup, and the MaxQuant output already shared — a directly comparable reference project.

I downloaded three files and fed them into my pipeline as-is:

proteinGroups_matrigel.txt — Mouse Matrigel (n=4, 0.7 MB, 680 proteins)
proteinGroups_intestine_unreviewd.txt — Porcine intestine ECM (n=4, 1.8 MB, 1,082 proteins)
proteinGroups_Stomach_unreviewd.txt — Porcine stomach ECM (n=4, 2.0 MB, 836 proteins)
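Before any quantification, files like these need the usual MaxQuant row filters. The sketch below uses a tiny in-memory table instead of the real files, so the protein rows and the `iBAQ S1` column are made up; the three flag column names follow recent MaxQuant output and may differ in older versions (for example `Contaminant`).

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a MaxQuant proteinGroups.txt (tab-separated).
toy_tsv = (
    "Gene names\tiBAQ S1\tReverse\tPotential contaminant\tOnly identified by site\n"
    "Lamb1\t1200\t\t\t\n"
    "Nid1\t800\t\t\t\n"
    "REV__Xyz\t50\t+\t\t\n"
    "Krt1\t300\t\t+\t\n"
    "Abc1\t10\t\t\t+\n"
)
pg = pd.read_csv(io.StringIO(toy_tsv), sep="\t")

# MaxQuant flags these rows with "+"; keep only unflagged rows.
flag_cols = ["Reverse", "Potential contaminant", "Only identified by site"]
flagged = pg[flag_cols].eq("+").any(axis=1)
clean = pg[~flagged].reset_index(drop=True)
print(clean["Gene names"].tolist())
```

For a real file, the `io.StringIO(...)` argument would be replaced by the path to the downloaded proteinGroups file.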

⚠️ Caveat: the PRIDE data covers intestine and stomach ECM, while the collaborator works on esophagus. Same GI tract, similar ECM composition, but not identical. Park 2026's actual esophagus data isn't deposited in PRIDE.

Pilot results

Comparing intestine ↔ matrigel:

998 (porcine intestine) + 672 (mouse matrigel) → 285 shared (by gene-symbol matching)
131 DEPs (81 up in intestine ECM, 50 up in Matrigel)
PC1 47.01%, PC2 14.02% (PCA)

The key finding: the top up-regulated proteins matched the simulation predictions remarkably well.

Direction	Real-data top markers	Simulation prediction	Overlap
Porcine ECM up	COL1A1, COL14A1, COL6A1, COL6A2, BGN, COL12A1	COL1A1, FN1, COL3A1, COL6A1, BGN	COL1A1, COL6A1, BGN ✓
Mouse MAT up	LAMB1, NID1, LAMC1, HSPG2	LAMB1, NID1, LAMC1, HSPG2	exact ✓

The basement membrane top-4 (LAMB1, NID1, LAMC1, HSPG2) appeared in identical order in both the simulation and the real data, matching Park et al. 2026 Fig 3's Top 10 Up in MAT. So while the simulation's circular logic problem was real, the underlying biology it encoded was correctly captured, as confirmed against real data that played no part in building the simulation.

Noise that only real data revealed

What the simulation didn't have, and real data did:

Nuclear protein contaminants in Matrigel: EWSR1, RUVBL2, HCFC1. Matrigel is from EHS tumor, and residual nuclear material persists through purification. A known issue I hadn't modeled.
Smooth muscle remnants in porcine ECM: TPM4, MYL9 (and ACTA2/MYH11 in the stomach samples). Evidence that decellularization isn't 100% complete — which itself becomes a decellularization efficiency QC metric.
Matrigel batch-to-batch variation: in PCA, MAT_4 separated along PC2. Well-known industry issue.

Catching these in the pilot let me pre-warn the collaborator: "your real data will show these kinds of noise too" — instead of being surprised by them after the fact.

The Final Pipeline

The workflow that stabilized after three iterations:

1. MaxQuant search (species-specific UniProt FASTA, iBAQ enabled)
   ↓
2. Filter proteinGroups.txt (remove Reverse / Contaminant / Only-by-site)
   ↓
3. iBAQ → riBAQ normalization (per-sample mole fraction)
   ↓
4. Valid-value filter (proteins with ≥ 3/4 detection in each group)
   ↓
5. Ensembl BioMart 1:1 ortholog mapping (not gene-symbol matching)
   ↓
6. Welch's t-test (unequal variance) + BH-FDR
   ↓
7. DEP criteria: FC ≥ 2 (or ≤ 0.5) AND adj.p < 0.05
   ↓
8. Visualization: Volcano, PCA, Heatmap, GO enrichment (Fisher's exact)
   ↓
9. Cross-species transferability (cited against HPM Kim 2014 + HPA Uhlén 2015)

Total: ~1,500 lines of Python across 7 modular scripts, designed so the collaborator can run the entire pipeline unchanged on raw data the moment it arrives.
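Steps 6 and 7 can be sketched as follows. This is an illustrative version, not the project's scripts: the helper names `bh_adjust` and `call_deps` and all toy riBAQ values are hypothetical, the t-test is assumed to run on log2 values, and the inputs are assumed to contain no zeros or missing values after the valid-value filter.

```python
import numpy as np
import pandas as pd
from scipy import stats

def bh_adjust(pvals: np.ndarray) -> np.ndarray:
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity from the largest p-value downward, cap at 1.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

def call_deps(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    """Welch's t-test per protein on log2 riBAQ, then BH-FDR and DEP calls."""
    _, p_val = stats.ttest_ind(
        np.log2(a.to_numpy()), np.log2(b.to_numpy()), axis=1, equal_var=False
    )
    out = pd.DataFrame(
        {"log2_fc": np.log2(a.mean(axis=1) / b.mean(axis=1)), "p": p_val},
        index=a.index,
    )
    out["adj_p"] = bh_adjust(out["p"].to_numpy())
    # FC >= 2 or <= 0.5 is the same as |log2 FC| >= 1.
    out["dep"] = (out["log2_fc"].abs() >= 1.0) & (out["adj_p"] < 0.05)
    return out

# Hypothetical riBAQ values, 4 samples per group.
idx = ["COL1A1", "LAMB1", "ACTB"]
eem = pd.DataFrame(
    [[0.040, 0.044, 0.038, 0.042],
     [0.0020, 0.0022, 0.0019, 0.0021],
     [0.010, 0.011, 0.009, 0.010]],
    index=idx,
)
mat = pd.DataFrame(
    [[0.0040, 0.0050, 0.0045, 0.0042],
     [0.030, 0.033, 0.028, 0.031],
     [0.010, 0.009, 0.011, 0.0105]],
    index=idx,
)
result = call_deps(eem, mat)
print(result)
```

Requiring both the fold-change and the adjusted p-value condition keeps out large but noisy differences as well as tiny but consistent ones.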

MaxQuant Settings — The Most-Asked Questions

The two most consequential MaxQuant settings for cross-species comparison:

1. Separate searches per species (do not merge):

Porcine samples → Sus scrofa UniProt (UP000008227) FASTA only
Mouse samples → Mus musculus UniProt (UP000000589) FASTA only
Merging the two FASTAs into one search creates shared-peptide ambiguity and breaks species discrimination

2. Match-between-runs (MBR) scope restriction:

MBR is OK between samples of the same species
MBR across porcine ↔ mouse is a false-identification risk

Other standard settings:

Variable mod: Oxidation (M), Acetyl (Protein N-term)
Fixed mod: Carbamidomethyl (C)
Enzyme: Trypsin/P, max 2 missed cleavages
PSM FDR: 0.01, Protein FDR: 0.01
iBAQ: ✅ ON (Advanced menu)
iBAQ log fit: ✅ ON

A detailed step-by-step is in the project report.

Lessons That Generalize Beyond This Project

Things from this 2-3 day project that apply broadly to cross-species multi-omics work:

Simulation validates algorithms, not biology. In novel-discovery contexts where ground truth is unknown, the "accuracy" of simulation results is unmeasurable. True validation requires (a) independent real data, (b) known positive/negative controls, or (c) orthogonal methods (Western blot, IHC).
Pseudocount traps are common in LC-MS proteomics, especially in sparse-data settings like cross-species comparisons or PTM analysis. Valid-value filtering is the standard.
Gene-symbol matching vs authoritative ortholog DBs is a trade-off that scales with dataset size. Small catalogs (<200) tolerate quick matching; large-scale comparisons need BioMart or InParanoid.
Public data (PRIDE, GEO) is a force multiplier. Without Park's lab depositing the related dataset in PRIDE, this validation would have been impossible. Worth contributing your own data back for the same reason.
The 2-3 day clock is misleading. Writing the analysis code was the easy part; more than half the time was spent debugging whether the results were real. Validation costs more than implementation.

Closing Checklist — Reproducible Cross-Species Multi-Omics

The checklist this project crystallized:

Is the simulated data's ground truth independent of the validation criteria? (Avoids circular logic)
Are 0/n detection proteins handled without pseudocounts? (Valid-value filter)
Is ortholog mapping done via an authoritative DB? (BioMart / InParanoid)
Are FASTA searches per-species, separate? (Avoids shared-peptide ambiguity)
Is MBR restricted to within-species? (Avoids false matches)
Has the pipeline been validated on independent real data (PRIDE etc.)? (Avoids synthetic-only conclusions)
Within-group Pearson r > 0.87? (Park et al. quality threshold)
DEP criterion is (FC ≥ 2) AND (adj.p < 0.05), not either alone?

Useful both for analysts running cross-species comparisons and for collaborators who commission the work.
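The within-group correlation item is easy to automate. The sketch below assumes the correlation is computed on log2 riBAQ values, which the source does not specify; the function name `min_within_group_r` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def min_within_group_r(ribaq: pd.DataFrame) -> float:
    """Smallest pairwise Pearson r between samples of one group."""
    # Correlate on log2 scale; zeros become NaN and are skipped pairwise.
    log_vals = np.log2(ribaq.where(ribaq > 0))
    corr = log_vals.corr(method="pearson")
    # Ignore the diagonal (self-correlation is always 1).
    off_diag = corr.to_numpy()[~np.eye(len(corr), dtype=bool)]
    return float(off_diag.min())

# Hypothetical riBAQ values for three Matrigel replicates.
mat = pd.DataFrame(
    {
        "MAT_1": [0.40, 0.20, 0.10, 0.05],
        "MAT_2": [0.38, 0.22, 0.09, 0.06],
        "MAT_3": [0.41, 0.19, 0.11, 0.04],
    },
    index=["LAMB1", "NID1", "LAMC1", "HSPG2"],
)
r_min = min_within_group_r(mat)
print(f"min within-group r = {r_min:.3f}, passes: {r_min > 0.87}")
```

Reporting the minimum pairwise value rather than the mean means a single outlier replicate is enough to fail the check.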

References:

Park, S. M. et al. (2026). Decellularized esophageal extracellular matrix as a scaffold for esophageal tissue engineering. (Yonsei Univ., Cho Lab)
Cha, S. et al. (2024). Comparative proteomic analysis of uterine ECM vs Matrigel.
Kim, M. S. et al. (2014). A draft map of the human proteome. Nature, 509, 575-581. (HPM)
Uhlén, M. et al. (2015). Tissue-based map of the human proteome. Science, 347, 1260419. (HPA)
Ensembl BioMart: https://www.ensembl.org/biomart
PRIDE dataset PXD023694