Reanalyzing PRIDE PXD023694 — Matrigel Nuclear Contaminants (EWSR1, RUVBL2) You'll Find in Real Data
A practical reanalysis of PRIDE dataset PXD023694 (Cho lab cross-species ECM) — what proteins actually come out of mouse Matrigel vs porcine ECM, the basement membrane signature (LAMB1/NID1/LAMC1/HSPG2) that replicates Park et al. 2026, and the contamination patterns (Matrigel nuclear proteins EWSR1/RUVBL2, porcine smooth muscle remnants TPM4/MYL9) that simulations miss but real data shows.
Why I Reanalyzed PRIDE PXD023694
After running a cross-species ECM proteomics reproduction of Park et al. 2026 on simulated data three times in a row (the full story is in the first post listed under Related posts), I needed real data to actually validate the pipeline. The collaborator's raw files weren't ready yet, and Park 2026's specific esophagus data isn't in PRIDE, but the Cho lab deposited a similar cross-species ECM dataset at PXD023694 (porcine intestine and stomach ECM vs mouse Matrigel).
Same experimental design. Same LC-MS/MS setup. Same lab. Different tissues (intestine, stomach) instead of esophagus — close enough as a positive control for the pipeline.
This post documents:
- What PXD023694 contains and how to download just the parts you need
- Running the same DIA-NN/MaxQuant-style downstream analysis on real proteinGroups output
- What signatures replicate vs the simulation (basement membrane Top-4, ECM collagen)
- What contamination you only see in real data (Matrigel nuclear proteins, porcine smooth muscle remnants)
- The Methods sentence you can use if you're doing similar reanalysis
If you're working on ECM proteomics, this is the dataset to validate against. If you're benchmarking a cross-species pipeline, PXD023694 is essentially a free ground-truth comparator.
What PXD023694 Contains
Dataset: Park / Cho lab, deposited 2020 (referenced by Park 2026)
URL: https://www.ebi.ac.uk/pride/archive/projects/PXD023694
Title: Comparative proteomic analysis of porcine and mouse decellularized extracellular matrix
Files: raw .raw files (Thermo Q Exactive HF-X) + MaxQuant-processed proteinGroups.txt
Three sample sets:
| Sample set | n | Description |
|---|---|---|
| Mouse Matrigel | 4 | Standard Corning Matrigel batches |
| Porcine intestine ECM | 4 | Decellularized small intestine |
| Porcine stomach ECM | 4 | Decellularized stomach |
For reproducing the basic cross-species comparison, you don't need the raw files (each ~1 GB). The already-processed proteinGroups.txt files in the dataset are small (~4.5 MB total) and let you run the entire downstream pipeline directly.
Download strategy — just the processed files
Use the PRIDE FTP:
ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/01/PXD023694/
The relevant files (sizes approximate):
- proteinGroups_matrigel.txt: 0.7 MB, 680 proteins
- proteinGroups_intestine_unreviewd.txt: 1.8 MB, 1,082 proteins
- proteinGroups_Stomach_unreviewd.txt: 2.0 MB, 836 proteins
Download via wget:
wget ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/01/PXD023694/proteinGroups_matrigel.txt
wget ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/01/PXD023694/proteinGroups_intestine_unreviewd.txt
wget ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/01/PXD023694/proteinGroups_Stomach_unreviewd.txt
(The unreviewd spelling is how the files are named in the deposit, and it refers to the protein database used. Keep it exactly as written; it is not a typo to fix.)
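If you would rather fetch the files from Python than from wget, a minimal standard-library sketch follows. The base URL and filenames are the ones listed above; the helper names `pride_urls` and `download_all` and the `dest` argument are illustrative, and nothing is downloaded until `download_all` is called.

```python
import os
import urllib.request

BASE = "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/01/PXD023694/"
FILES = [
    "proteinGroups_matrigel.txt",
    "proteinGroups_intestine_unreviewd.txt",  # spelling as deposited
    "proteinGroups_Stomach_unreviewd.txt",
]

def pride_urls(base=BASE, files=FILES):
    """Return the full URL for each processed file."""
    return [base + name for name in files]

def download_all(dest="."):
    """Fetch each file into dest, skipping files that already exist."""
    os.makedirs(dest, exist_ok=True)
    for url in pride_urls():
        target = os.path.join(dest, url.rsplit("/", 1)[-1])
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)

for u in pride_urls():
    print(u)
```

Skipping files that already exist makes the helper safe to re-run from a notebook.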
What's already processed
The proteinGroups.txt files came from MaxQuant 1.6.10.43 with:
- Species-separated FASTA searches (Sus scrofa for porcine, Mus musculus for mouse)
- iBAQ enabled
- Standard variable/fixed modifications (Oxidation M, Acetyl N-term; Carbamidomethyl C)
- Trypsin/P, 2 missed cleavages
So you can plug these directly into your downstream pipeline without re-running MaxQuant — which saves a day of compute on raw files you don't need.
Loading and Filtering
Standard MaxQuant cleanup applies. Reverse hits, potential contaminants (rows flagged in the "Potential contaminant" column), and "Only identified by site" entries all need to go:
import pandas as pd
import numpy as np

def load_pg(path):
    """Load a MaxQuant proteinGroups.txt and drop flagged rows."""
    df = pd.read_csv(path, sep='\t', low_memory=False)
    # Standard MaxQuant filter; skip a flag column if the file lacks it
    for flag in ['Reverse', 'Potential contaminant', 'Only identified by site']:
        if flag in df.columns:
            df = df[df[flag] != '+']
    return df.reset_index(drop=True)

mat = load_pg('proteinGroups_matrigel.txt')                # 680 → ~672 proteins
intest = load_pg('proteinGroups_intestine_unreviewd.txt')  # 1,082 → ~998 proteins
stom = load_pg('proteinGroups_Stomach_unreviewd.txt')      # 836 proteins before filtering
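To see how much each filter removes, it can help to count flagged rows before dropping them. A minimal sketch on a toy table, assuming the flag columns use `+` the way MaxQuant writes them; `count_flags` is an illustrative helper and the toy rows are made up.

```python
import pandas as pd

FLAG_COLS = ["Reverse", "Potential contaminant", "Only identified by site"]

def count_flags(df, flag_cols=FLAG_COLS):
    """Count rows marked '+' in each MaxQuant flag column that is present."""
    return {c: int((df[c] == "+").sum()) for c in flag_cols if c in df.columns}

# Toy table: one reverse hit, one contaminant, no 'Only identified by site' column
toy = pd.DataFrame({
    "Protein IDs": ["P1", "REV__P2", "CON__P3", "P4"],
    "Reverse": [None, "+", None, None],
    "Potential contaminant": [None, None, "+", None],
})
print(count_flags(toy))  # {'Reverse': 1, 'Potential contaminant': 1}
```

Running this on each real file before filtering gives a quick record of where the removed rows came from.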
Extract gene symbols (UniProt headers carry them as GN=Gene1):
import re

def first_gene_symbol(s):
    if not isinstance(s, str):
        return None
    m = re.search(r'GN=(\S+)', s)
    return m.group(1) if m else None

for df in [mat, intest, stom]:
    df['Gene'] = df['Fasta headers'].apply(first_gene_symbol)
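A quick check of what the regex returns on different header shapes can save confusion later. The headers below are hypothetical strings in UniProt format (the accessions are placeholders, not entries from the dataset). They show two behaviors worth knowing: a protein group with several headers yields only the first symbol, and a header without `GN=` yields `None`.

```python
import re

def first_gene_symbol(s):
    """Return the first GN= token in a FASTA header string, or None."""
    if not isinstance(s, str):
        return None
    m = re.search(r"GN=(\S+)", s)
    return m.group(1) if m else None

# Hypothetical headers for illustration only
single = "sp|X00001|EX1_MOUSE Example protein OS=Mus musculus OX=10090 GN=Lamb1 PE=1 SV=1"
grouped = ("tr|X00002|EX2_PIG Example protein OS=Sus scrofa OX=9823 GN=COL1A1 PE=1 SV=1;"
           "tr|X00003|EX3_PIG Example protein OS=Sus scrofa OX=9823 GN=COL1A2 PE=1 SV=1")
no_gene = "tr|X00004|EX4_PIG Uncharacterized protein OS=Sus scrofa OX=9823 PE=4 SV=1"

print(first_gene_symbol(single))   # Lamb1
print(first_gene_symbol(grouped))  # COL1A1
print(first_gene_symbol(no_gene))  # None
```

Rows that come back as `None` cannot take part in gene-symbol matching, so it is worth counting them per file before the merge.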
Identify the iBAQ columns
iBAQ columns are named like iBAQ Sample1, iBAQ Sample2, etc. Extract them:
mat_ibaq_cols = [c for c in mat.columns if c.startswith('iBAQ ') and c != 'iBAQ peptides']
intest_ibaq_cols = [c for c in intest.columns if c.startswith('iBAQ ') and c != 'iBAQ peptides']
For PXD023694 each sample set has 4 iBAQ columns.
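A small guard can confirm the column count before any normalization. This is a sketch with a hypothetical helper name (`ibaq_columns`) and toy column names; the only assumption taken from the text is that each sample set should yield 4 per-sample iBAQ columns.

```python
import pandas as pd

def ibaq_columns(df, expected=None):
    """Per-sample iBAQ columns, excluding the 'iBAQ peptides' count column."""
    cols = [c for c in df.columns if c.startswith("iBAQ ") and c != "iBAQ peptides"]
    if expected is not None and len(cols) != expected:
        raise ValueError(f"expected {expected} iBAQ columns, found {len(cols)}: {cols}")
    return cols

# Toy header row; a bare 'iBAQ' column is excluded because the prefix
# test requires the trailing space
toy = pd.DataFrame(columns=["Protein IDs", "iBAQ", "iBAQ peptides",
                            "iBAQ S1", "iBAQ S2", "iBAQ S3", "iBAQ S4"])
print(ibaq_columns(toy, expected=4))  # ['iBAQ S1', 'iBAQ S2', 'iBAQ S3', 'iBAQ S4']
```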
Normalize to riBAQ
For cross-sample comparison, normalize iBAQ to relative iBAQ (riBAQ) = per-sample mole fraction:
def to_ribaq(df, ibaq_cols):
    out = df[['Gene'] + ibaq_cols].copy()
    for c in ibaq_cols:
        total = out[c].sum()
        out[c] = out[c] / total if total > 0 else 0
    return out
mat_ri = to_ribaq(mat, mat_ibaq_cols)
intest_ri = to_ribaq(intest, intest_ibaq_cols)
Now each column sums to 1 and proteins are comparable across samples.
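In other words, for protein i in a given sample, riBAQ_i = iBAQ_i / Σ_j iBAQ_j. A toy check with made-up intensities confirms the per-column sums; the function body repeats the one above so the snippet runs on its own.

```python
import pandas as pd

def to_ribaq(df, ibaq_cols):
    """Divide each iBAQ column by its own total (per-sample mole fraction)."""
    out = df[["Gene"] + ibaq_cols].copy()
    for c in ibaq_cols:
        total = out[c].sum()
        out[c] = out[c] / total if total > 0 else 0
    return out

# Made-up intensities for three proteins in two samples
toy = pd.DataFrame({
    "Gene": ["A", "B", "C"],
    "iBAQ S1": [2.0, 6.0, 2.0],
    "iBAQ S2": [0.0, 5.0, 15.0],
})
ri = to_ribaq(toy, ["iBAQ S1", "iBAQ S2"])
print(ri["iBAQ S1"].tolist())  # [0.2, 0.6, 0.2]
print(ri["iBAQ S2"].tolist())  # [0.0, 0.25, 0.75]
```

Note that the fractions are relative to whatever rows remain after filtering, so the filter step should be identical across the files you compare.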
Joining Mouse and Porcine via Gene Symbol
For a quick exploration, gene-symbol matching (uppercase) is acceptable. For a publication, use BioMart 1:1 ortholog mapping per the BioMart Tutorial.
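mat_ri['GENE_UPPER'] = mat_ri['Gene'].str.upper()
intest_ri['GENE_UPPER'] = intest_ri['Gene'].str.upper()
# Drop rows without a symbol before merging (pandas matches NaN keys to each other)
# and tag iBAQ columns explicitly, since merge suffixes only apply to clashing names
mat_m = mat_ri.dropna(subset=['GENE_UPPER']).rename(columns={c: f'{c}_mat' for c in mat_ibaq_cols})
int_m = intest_ri.dropna(subset=['GENE_UPPER']).rename(columns={c: f'{c}_intest' for c in intest_ibaq_cols})
shared = mat_m.merge(int_m, on='GENE_UPPER', suffixes=('_mat', '_intest'))
print(f"Shared by gene symbol: {len(shared)}")
# Real-data result: ~285 shared proteins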
That's 285 shared proteins out of 998 porcine and 672 mouse: about 29% of the porcine list (and about 42% of the mouse list), consistent with the expected 20-30% mapping rate for cross-species ECM proteomics.
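The overlap can be expressed against either list, which is worth stating explicitly when reporting it. A short arithmetic sketch using the approximate counts quoted above (the Jaccard index is added here as one more way to summarize the same numbers):

```python
# Approximate counts quoted in the text
shared_n, porcine_n, mouse_n = 285, 998, 672

frac_porcine = shared_n / porcine_n
frac_mouse = shared_n / mouse_n
# Jaccard index: shared / union of the two lists
jaccard = shared_n / (porcine_n + mouse_n - shared_n)

print(f"of porcine list: {frac_porcine:.1%}")  # 28.6%
print(f"of mouse list:   {frac_mouse:.1%}")    # 42.4%
print(f"Jaccard index:   {jaccard:.1%}")       # 20.6%
```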
Differential Abundance (Welch's t-test + BH-FDR)
from scipy import stats
from statsmodels.stats.multitest import multipletests

mat_cols = [c for c in shared.columns if c.startswith('iBAQ') and c.endswith('_mat')]
int_cols = [c for c in shared.columns if c.startswith('iBAQ') and c.endswith('_intest')]

rows = []
for _, r in shared.iterrows():
    a = np.log2(r[int_cols].astype(float).replace(0, np.nan).dropna().values + 1e-12)
    b = np.log2(r[mat_cols].astype(float).replace(0, np.nan).dropna().values + 1e-12)
    if len(a) < 3 or len(b) < 3:
        continue
    t, p = stats.ttest_ind(a, b, equal_var=False)
    rows.append({'Gene': r['GENE_UPPER'], 'log2FC': a.mean() - b.mean(), 'p': p})

dep = pd.DataFrame(rows)
dep['FDR'] = multipletests(dep['p'], method='fdr_bh')[1]
dep['DEP'] = (dep['FDR'] < 0.05) & (dep['log2FC'].abs() >= 1)
print(f"DEPs: {dep['DEP'].sum()}") # ~131
You'll get ~131 DEPs total — about 81 up in intestine ECM, 50 up in Matrigel.
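Splitting the significant hits by direction is a matter of the log2FC sign. A minimal sketch on a toy table, assuming the same column names as `dep` above (`Gene`, `log2FC`, `FDR`, `DEP`); the values are invented for illustration and are not results from PXD023694.

```python
import pandas as pd

def split_by_direction(dep):
    """Split significant rows into (intestine-up, Matrigel-up) by log2FC sign."""
    sig = dep[dep["DEP"]]
    return sig[sig["log2FC"] > 0], sig[sig["log2FC"] < 0]

# Toy values for illustration only
dep_toy = pd.DataFrame({
    "Gene":   ["COL1A1", "LAMB1", "NID1", "ACTB"],
    "log2FC": [3.2, -4.1, -2.7, 0.3],
    "FDR":    [0.001, 0.002, 0.010, 0.600],
})
dep_toy["DEP"] = (dep_toy["FDR"] < 0.05) & (dep_toy["log2FC"].abs() >= 1)

up_int, up_mat = split_by_direction(dep_toy)
print(len(up_int), len(up_mat))  # 1 2
```

Positive log2FC means intestine-up here because the fold change was computed as intestine minus Matrigel.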
What Replicates the Simulation (and Park 2026)
Matrigel basement membrane Top-4
Sort dep for Matrigel-up (negative log2FC in the above setup), check the top:
| Gene | Function |
|---|---|
| LAMB1 | Laminin β1 — basement membrane core |
| NID1 | Nidogen-1 — basement membrane bridge |
| LAMC1 | Laminin γ1 — basement membrane core |
| HSPG2 | Perlecan — basement membrane proteoglycan |
This is the canonical basement membrane signature. Matches Park 2026 Fig 3 Top-10 Up in MAT. Matches the simulation predictions. Matches biology — Matrigel is the EHS basement membrane extract.
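The signature can double as a regression check on the downstream code. The sketch below uses two hypothetical helpers: `top_matrigel_up` lists the most Matrigel-enriched significant hits, and `missing_from_matrigel_up` reports which of the four expected genes are absent. The toy table is invented; on real output you would pass `dep`.

```python
import pandas as pd

BM_SIGNATURE = ["LAMB1", "NID1", "LAMC1", "HSPG2"]

def top_matrigel_up(dep, k=4):
    """Gene symbols of the k most Matrigel-enriched significant hits."""
    sig = dep[dep["DEP"] & (dep["log2FC"] < 0)]
    return sig.sort_values("log2FC").head(k)["Gene"].tolist()

def missing_from_matrigel_up(dep, expected=BM_SIGNATURE):
    """Return expected genes that are not significant Matrigel-up hits."""
    hits = set(dep.loc[dep["DEP"] & (dep["log2FC"] < 0), "Gene"])
    return [g for g in expected if g not in hits]

# Toy values for illustration only
dep_toy = pd.DataFrame({
    "Gene":   ["LAMB1", "NID1", "LAMC1", "HSPG2", "COL1A1", "EWSR1"],
    "log2FC": [-5.0, -4.5, -4.8, -3.9, 3.0, -2.0],
    "DEP":    [True, True, True, True, True, False],
})
print(top_matrigel_up(dep_toy))           # ['LAMB1', 'LAMC1', 'NID1', 'HSPG2']
print(missing_from_matrigel_up(dep_toy))  # []
```

An empty list from the second helper is the outcome to expect; any gene it returns is a prompt to recheck the filtering, merge, or sign convention.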
Porcine ECM collagen signature
Sort for porcine-up:
| Gene | Function |
|---|---|
| COL1A1 | Type I collagen α1 — most abundant tissue ECM |
| COL6A1, COL6A2 | Type VI collagen — pericellular |
| COL14A1, COL12A1 | FACIT collagens |
| BGN | Biglycan — small leucine-rich proteoglycan |
| DCN | Decorin (frequently seen) |
Tissue-derived ECM scaffolds → strong interstitial collagen signature (fibrillar type I plus type VI and FACIT collagens). Expected and reproduced.
What Real Data Shows That Simulations Don't
Three categories of "noise" that only appear when you run the pipeline on actual deposited data:
1. Matrigel nuclear protein contaminants
Mouse Matrigel is extracted from Engelbreth-Holm-Swarm (EHS) tumors. Tumor cells contain nuclei, and the extraction does not fully exclude intracellular proteins:
| Gene | Localization | Why it's there |
|---|---|---|
| EWSR1 | Nucleus, RNA-binding | EHS tumor nuclear residue |
| RUVBL2 | Nucleus, ATPase | Tumor nuclear remnant |
| HCFC1 | Nucleus, transcription regulator | Tumor nuclear remnant |
| ERP29 | ER chaperone | Tumor secretory residue |
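If you're not expecting them, EWSR1 jumping to the top of "Matrigel-specific" looks like a finding. It isn't: it's intracellular carryover from the tumor tissue that Matrigel is extracted from. Matrigel characterization work (Hughes et al. 2010 and follow-ups) reports intracellular proteins alongside the matrix components, so this is now a recognized characteristic.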
Practical handling: in your Methods, note that nuclear/secretory residues are an expected Matrigel feature and either (a) filter them out of ECM-specific comparisons by gene ontology, or (b) report them transparently as scaffold characterization.
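Option (a) can be sketched as a simple term-based classifier. Everything here is an assumption for illustration: the term sets are a small hand-picked subset, the annotations are hand-written toy values rather than GO lookups, and `classify_hit` is a hypothetical helper. ECM terms are checked first because many secreted matrix proteins also carry ER annotations.

```python
ECM_TERMS = {"extracellular matrix", "basement membrane", "collagen trimer"}
INTRACELLULAR_TERMS = {"nucleus", "nucleoplasm", "chromatin", "endoplasmic reticulum"}

def classify_hit(terms):
    """Label a protein from its annotation terms; ECM terms take priority."""
    terms = {t.lower() for t in terms}
    if terms & ECM_TERMS:
        return "ecm"
    if terms & INTRACELLULAR_TERMS:
        return "intracellular"
    return "unclassified"

# Hand-written toy annotations, not retrieved from GO
toy_annotations = {
    "LAMB1": ["Basement membrane", "Extracellular matrix"],
    "EWSR1": ["Nucleus"],
    "RUVBL2": ["Nucleus", "Nucleoplasm"],
    "ERP29": ["Endoplasmic reticulum"],
    "UNKNOWN1": [],
}
labels = {gene: classify_hit(terms) for gene, terms in toy_annotations.items()}
print(labels)
```

Keeping the "unclassified" bucket separate avoids silently dropping proteins that simply lack annotation.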
2. Porcine ECM smooth muscle remnants
Decellularization protocols target nuclear DNA and cytoplasmic protein but don't fully remove cytoskeletal residues:
| Gene | Cell type marker | Significance |
|---|---|---|
| TPM4 | Smooth muscle tropomyosin | Decellularization completeness indicator |
| MYL9 | Smooth muscle myosin light chain | Same |
| ACTA2 (more in stomach samples) | Alpha smooth muscle actin | Same |
| MYH11 (more in stomach samples) | Smooth muscle myosin heavy chain | Same |
Stomach ECM retains more residual smooth muscle protein than intestine ECM, which in turn retains more than esophagus ECM (the tissue used in the Park 2026 system). That ordering is useful: the summed abundance of these markers works as an internal QC metric for decellularization efficiency.
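One way to turn this into a number is to sum the riBAQ of the marker genes in each sample. A sketch under stated assumptions: the marker list is the four genes from the table above, `smooth_muscle_fraction` is a hypothetical helper, and the toy fractions are invented.

```python
import pandas as pd

SMC_MARKERS = ["TPM4", "MYL9", "ACTA2", "MYH11"]

def smooth_muscle_fraction(ri, sample_cols, markers=SMC_MARKERS):
    """Sum of riBAQ over smooth muscle marker genes, per sample column."""
    mask = ri["Gene"].str.upper().isin(markers)
    return ri.loc[mask, sample_cols].sum()

# Toy riBAQ table for illustration only; the row without a symbol is ignored
toy_ri = pd.DataFrame({
    "Gene": ["COL1A1", "Tpm4", "MYL9", None],
    "S1": [0.70, 0.10, 0.05, 0.15],
    "S2": [0.90, 0.02, 0.03, 0.05],
})
frac = smooth_muscle_fraction(toy_ri, ["S1", "S2"])
print(frac.round(3).to_dict())  # {'S1': 0.15, 'S2': 0.05}
```

Comparing this per-sample fraction between stomach and intestine files gives a direct readout of the difference described above.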
3. Matrigel batch variation
PCA of the four Matrigel replicates shows one sample (often MAT_4) separating along PC2. This is the well-known Matrigel batch-to-batch variation. It's why downstream Matrigel users worry about lot-to-lot reproducibility.
For PXD023694 specifically: you can quantitatively see it. For your own Matrigel work: budget for it. Buy from a single lot if your experiment is sensitive, or use a defined-composition substrate such as recombinant laminin if you can. Note that Geltrex and Cultrex Reduced Growth Factor are themselves EHS-derived basement membrane extracts, so they are not defined-composition alternatives.
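A replicate-level PCA needs nothing beyond NumPy. The sketch below computes scores by SVD of the mean-centered samples × proteins matrix. The input is synthetic (four replicates, with the fourth deliberately shifted), so on this toy data the outlier shows up on PC1 rather than PC2. For real data you would pass log2 riBAQ values with missing proteins handled first.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """PCA by SVD. X is samples x proteins; returns (scores, explained ratio)."""
    Xc = X - X.mean(axis=0)                      # center each protein
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

# Synthetic log-scale abundances: 4 replicates x 5 proteins, small noise
rng = np.random.default_rng(0)
base = np.array([-3.0, -5.0, -4.0, -6.0, -7.0])
X = base + rng.normal(0, 0.05, size=(4, 5))
X[3] += np.array([1.0, -1.0, 1.0, -1.0, 1.0])    # shift the fourth replicate

scores, explained = pca_scores(X)
outlier = int(np.argmax(np.abs(scores[:, 0])))
print(outlier)  # 3 (zero-based index of the shifted replicate)
```

The sign of each component is arbitrary, which is why the outlier is picked by absolute score.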
Methods Statement You Can Use
For reanalysis posts/papers using PXD023694:
"PRIDE dataset PXD023694 was used as an independent test set. The pre-processed proteinGroups.txt files for Matrigel (n=4) and porcine intestine ECM (n=4) were downloaded from the PRIDE FTP (https://www.ebi.ac.uk/pride/archive/projects/PXD023694, accessed YYYY-MM-DD). Reverse hits, MaxQuant potential contaminants, and 'only identified by site' entries were removed. iBAQ values were normalized to per-sample mole fraction (riBAQ). Cross-species protein groups were matched at the gene symbol level [or by Ensembl BioMart Release N 1:1 orthologs]. Differential abundance was computed by Welch's t-test on log2-transformed riBAQ with Benjamini-Hochberg FDR correction (significant: FDR<0.05, |log2FC|≥1)."
FAQ
Q: Do I need to download the raw .raw files?
No, for reanalysis at the protein level the proteinGroups.txt files are sufficient. You only need raw files if you want to redo the database search with a different version/parameters.
Q: Why does Park 2026 not deposit the actual esophagus data?
Various reasons: sometimes raw data is held back for ongoing follow-up studies, sometimes there are embargo periods. PXD023694 is the same lab's earlier, related deposit and serves as a close proxy.
Q: My Matrigel results are different from yours. What's different?
Likely differences: (a) MaxQuant version (search engine improvements), (b) FASTA database release (UniProt has updated since 2021), (c) different parameter group settings, (d) statistical filters. The qualitative signatures (LAMB1/NID1/LAMC1/HSPG2 in MAT; collagens in ECM) should reproduce regardless.
Q: Can I publish a paper using this dataset?
Yes. PRIDE-deposited data is intended for reuse. Cite the original dataset DOI and the Park 2026 / Cho lab papers. Don't claim original data generation; do claim original analysis.
Q: Is there an esophagus equivalent on PRIDE?
As of 2026, none specifically from the Cho lab. Other groups have deposited general esophagus tissue proteomics (search PRIDE for "esophagus") but not the decellularized ECM workflow. PXD023694 (intestine) is the closest proxy.
Q: How do I tell true Matrigel-up biology from EWSR1-type contamination?
Rule of thumb: ECM-functional gene ontology (extracellular matrix, basement membrane, collagen catabolism) → real biology. Nuclear / RNA-binding / chromatin GO terms → contamination. When in doubt, check published Matrigel characterization papers (Hughes et al. 2010 Proteomics and follow-ups).
Q: Is "intestine vs esophagus" close enough as a positive control?
Functionally yes for the ECM signature replication question (collagens vs basement membrane), but tissue-specific ECM differences exist (esophagus stratified squamous epithelium vs intestine columnar epithelium). For specific markers you'd want the actual esophagus data when it becomes available.
Closing — Why Reanalyzing PRIDE Datasets Matters
Most published proteomics datasets sit in PRIDE unused after the original paper. Reanalysis serves three purposes:
- Pipeline validation — confirms your method works on real data, not just simulations or your own (potentially biased) samples
- Independent replication — strengthens the field's confidence in core findings (Matrigel ≠ tissue ECM, in this case)
- New insights from old data — your pipeline, your statistical lens, sometimes finds patterns the original analysis missed
PXD023694 specifically is one of the most reusable cross-species ECM datasets available. The basement membrane signature replicates cleanly. The contamination patterns are educational. If you're starting cross-species ECM proteomics, run your pipeline through this dataset first — if you can't reproduce LAMB1/NID1/LAMC1/HSPG2 at the top of Matrigel-up, something is broken in your downstream code.
Related posts:
- Reproducing Park et al. 2026 — Cross-Species ECM Proteomics, Three Iterations
- Why You Must NOT Merge Species FASTA Databases in Cross-Species Proteomics
- BioMart Pig ↔ Mouse 1:1 Ortholog Mapping for Cross-Species Proteomics
- From DIA-NN Output to Paper Draft: AI-Assisted Proteomics Workflow
References:
- Park, S. M. et al. (2026). Decellularized esophageal ECM. Yonsei, Cho Lab.
- PRIDE dataset: https://www.ebi.ac.uk/pride/archive/projects/PXD023694
- Hughes, C. S., Postovit, L. M., Lajoie, G. A. (2010). Matrigel: a complex protein mixture required for optimal growth of cell culture. Proteomics, 10, 1886-1890.
- Perez-Riverol, Y. et al. (2022). The PRIDE database resources in 2022. Nucleic Acids Research, 50, D543-D552.