Reanalyzing PRIDE PXD023694 — Matrigel Nuclear Contaminants (EWSR1, RUVBL2) You'll Find in Real Data

Why I Reanalyzed PRIDE PXD023694

After running a cross-species ECM proteomics reproduction of Park et al. 2026 on simulated data three times in a row (covered in a separate post), I needed real data to actually validate the pipeline. The collaborator's raw files weren't ready yet, and Park 2026's specific esophagus data isn't in PRIDE. The Cho lab did, however, deposit a similar cross-species ECM dataset at PXD023694 (porcine intestine and stomach ECM vs mouse Matrigel).

Same experimental design. Same LC-MS/MS setup. Same lab. Different tissues (intestine, stomach) instead of esophagus — close enough as a positive control for the pipeline.

This post documents:

What PXD023694 contains and how to download just the parts you need
Running the same DIA-NN/MaxQuant-style downstream analysis on real proteinGroups output
What signatures replicate vs the simulation (basement membrane Top-4, ECM collagen)
What contamination you only see in real data (Matrigel nuclear proteins, porcine smooth muscle remnants)
The Methods sentence you can use if you're doing similar reanalysis

If you're working on ECM proteomics, this is the dataset to validate against. If you're benchmarking a cross-species pipeline, PXD023694 is essentially a free ground-truth comparator.

What PXD023694 Contains

Dataset: Park / Cho lab, deposited 2020 (referenced by Park 2026)
URL: https://www.ebi.ac.uk/pride/archive/projects/PXD023694
Title: Comparative proteomic analysis of porcine and mouse decellularized extracellular matrix
Files: raw .raw files (Thermo Q Exactive HF-X) + MaxQuant-processed proteinGroups.txt

Three sample sets:

Sample set	n	Description
Mouse Matrigel	4	Standard Corning Matrigel batches
Porcine intestine ECM	4	Decellularized small intestine
Porcine stomach ECM	4	Decellularized stomach

For reproducing the basic cross-species comparison, you don't need the raw files (each ~1 GB). The already-processed proteinGroups.txt files in the dataset are small (~4.5 MB total) and let you run the entire downstream pipeline directly.

Download strategy — just the processed files

Use the PRIDE FTP:

ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/01/PXD023694/

The relevant files (sizes approximate):

proteinGroups_matrigel.txt — 0.7 MB, 680 proteins
proteinGroups_intestine_unreviewd.txt — 1.8 MB, 1,082 proteins
proteinGroups_Stomach_unreviewd.txt — 2.0 MB, 836 proteins

Download via wget:

wget ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/01/PXD023694/proteinGroups_matrigel.txt
wget ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/01/PXD023694/proteinGroups_intestine_unreviewd.txt
wget ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/01/PXD023694/proteinGroups_Stomach_unreviewd.txt

(unreviewd — yes, with that spelling — refers to the protein database used; not a typo to fix.)
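
If you would rather stay in Python than shell out to wget, the same three downloads can be sketched with the standard library. This is a minimal sketch, not a tested mirror of the wget commands: the helper names `build_urls` and `download_all` and the `pxd023694` output folder are made up for illustration, and it assumes the FTP path and file names listed above are still current.

```python
from pathlib import Path
from urllib.request import urlretrieve

# FTP location and file names as listed in this post
BASE_URL = "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2021/01/PXD023694/"
FILES = [
    "proteinGroups_matrigel.txt",
    "proteinGroups_intestine_unreviewd.txt",
    "proteinGroups_Stomach_unreviewd.txt",
]

def build_urls(base_url=BASE_URL, files=FILES):
    """Return (filename, full URL) pairs for the processed tables."""
    return [(name, base_url + name) for name in files]

def download_all(dest_dir="pxd023694", skip_existing=True):
    """Fetch each table into dest_dir (hypothetical folder name); return local paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, url in build_urls():
        target = dest / name
        if not (skip_existing and target.exists()):
            urlretrieve(url, target)  # urllib understands ftp:// URLs
        paths.append(target)
    return paths

# Listing the URLs needs no network access; call download_all() to fetch.
for name, url in build_urls():
    print(name, "<-", url)
```

Calling `download_all()` once is enough; with `skip_existing=True` a second call reuses the files already on disk.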

What's already processed

The proteinGroups.txt files came from MaxQuant 1.6.10.43 with:

Species-separated FASTA searches (Sus scrofa for porcine, Mus musculus for mouse)
iBAQ enabled
Standard variable/fixed modifications (Oxidation M, Acetyl N-term; Carbamidomethyl C)
Trypsin/P, 2 missed cleavages

So you can plug these directly into your downstream pipeline without re-running MaxQuant — which saves a day of compute on raw files you don't need.

Loading and Filtering

Standard MaxQuant cleanup applies. Reverse (decoy) hits, potential contaminants (entries matched to MaxQuant's built-in contaminant list), and "Only identified by site" entries all need to go:

import pandas as pd
import numpy as np

def load_pg(path):
    """Load a MaxQuant proteinGroups.txt and drop flagged rows."""
    df = pd.read_csv(path, sep='\t', low_memory=False)
    # Standard MaxQuant filter. Skip any flag column the file lacks:
    # df.get(col, '') would return a plain string and break the boolean mask.
    for col in ('Reverse', 'Potential contaminant', 'Only identified by site'):
        if col in df.columns:
            df = df[df[col] != '+']
    return df.reset_index(drop=True)

mat = load_pg('proteinGroups_matrigel.txt')                # 680 → ~672 proteins
intest = load_pg('proteinGroups_intestine_unreviewd.txt')  # 1,082 → ~998 proteins
stom = load_pg('proteinGroups_Stomach_unreviewd.txt')      # 836 proteins before filtering
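
To see where the dropped rows go, it helps to count each flag separately before filtering. The sketch below does that on a five-row toy table; `filter_report` is a hypothetical helper, and the toy rows are invented purely to exercise the three flag columns.

```python
import pandas as pd

FLAG_COLUMNS = ["Reverse", "Potential contaminant", "Only identified by site"]

def filter_report(df, flag_columns=FLAG_COLUMNS):
    """Count rows flagged '+' in each MaxQuant flag column that is present."""
    report = {"total": len(df)}
    keep = pd.Series(True, index=df.index)
    for col in flag_columns:
        if col in df.columns:
            flagged = df[col].eq("+")   # missing values compare as False
            report[col] = int(flagged.sum())
            keep &= ~flagged
    report["kept"] = int(keep.sum())
    return report

# Toy table (invented): one decoy, one contaminant, one site-only hit, two clean rows
toy = pd.DataFrame({
    "Protein IDs": ["P1", "REV__P2", "CON__P3", "P4", "P5"],
    "Reverse": [None, "+", None, None, None],
    "Potential contaminant": [None, None, "+", None, None],
    "Only identified by site": [None, None, None, "+", None],
})
report = filter_report(toy)
print(report)
```

Run on the real tables, the same report tells you whether a large drop comes from decoys, contaminants, or site-only identifications.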

Extract gene symbols (UniProt headers carry them as GN=Gene1):

import re
def first_gene_symbol(s):
    if not isinstance(s, str): return None
    m = re.search(r'GN=(\S+)', s)
    return m.group(1) if m else None

for df in [mat, intest, stom]:
    df['Gene'] = df['Fasta headers'].apply(first_gene_symbol)
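
A protein group can contain several FASTA entries, so taking only the first GN= field can hide a second gene. The following sketch collects every distinct symbol in a header cell. `all_gene_symbols` is a hypothetical helper and the header string uses made-up accessions; only the GN= layout is taken from the UniProt header format described above.

```python
import re

# Stop at whitespace or ';' so a symbol at the end of an entry is not glued to the next one
GN_PATTERN = re.compile(r"GN=([^\s;]+)")

def all_gene_symbols(fasta_headers):
    """Return the distinct GN= symbols in a 'Fasta headers' cell, in order of appearance."""
    if not isinstance(fasta_headers, str):
        return []
    seen = []
    for symbol in GN_PATTERN.findall(fasta_headers):
        if symbol not in seen:
            seen.append(symbol)
    return seen

# Invented two-entry header (accessions X00001/X00002 are placeholders)
header = (
    "sp|X00001|AAA_MOUSE Example protein A OS=Mus musculus OX=10090 GN=Lamb1 PE=1 SV=1;"
    "tr|X00002|X00002_MOUSE Example protein B OS=Mus musculus OX=10090 GN=Nid1 PE=1 SV=1"
)
symbols = all_gene_symbols(header)
print(symbols)
```

Groups that return more than one symbol are worth flagging, since a gene-symbol join will only ever see the first one.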

Identify the iBAQ columns

iBAQ columns are named like iBAQ Sample1, iBAQ Sample2, etc. Extract them:

mat_ibaq_cols = [c for c in mat.columns if c.startswith('iBAQ ') and c != 'iBAQ peptides']
intest_ibaq_cols = [c for c in intest.columns if c.startswith('iBAQ ') and c != 'iBAQ peptides']

For PXD023694 each sample set has 4 iBAQ columns.

Normalize to riBAQ

For cross-sample comparison, normalize iBAQ to relative iBAQ (riBAQ) = per-sample mole fraction:

def to_ribaq(df, ibaq_cols):
    out = df[['Gene'] + ibaq_cols].copy()
    for c in ibaq_cols:
        total = out[c].sum()
        out[c] = out[c] / total if total > 0 else 0
    return out

mat_ri = to_ribaq(mat, mat_ibaq_cols)
intest_ri = to_ribaq(intest, intest_ibaq_cols)

Now each column sums to 1 and proteins are comparable across samples.
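
A tiny worked example makes the normalization concrete. The iBAQ numbers below are invented; the point is only that each column is divided by its own total, so the result depends on which proteins are still in the table when you normalize (filter first, then normalize).

```python
import numpy as np
import pandas as pd

# Invented iBAQ intensities for three proteins in two samples
toy = pd.DataFrame({
    "Gene": ["Lamb1", "Nid1", "Hspg2"],
    "iBAQ S1": [600.0, 300.0, 100.0],
    "iBAQ S2": [50.0, 25.0, 25.0],
})
ibaq_cols = ["iBAQ S1", "iBAQ S2"]

# Divide every column by its own sum (vectorized form of the loop above)
ribaq = toy.copy()
ribaq[ibaq_cols] = toy[ibaq_cols] / toy[ibaq_cols].sum(axis=0)

# Each sample column now sums to 1
col_sums = ribaq[ibaq_cols].sum(axis=0)
assert np.allclose(col_sums, 1.0)
print(ribaq)
```

Here Lamb1 is 0.6 of sample S1 and 0.5 of sample S2, even though its raw iBAQ differs twelvefold between the two.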

Joining Mouse and Porcine via Gene Symbol

For a quick exploration, gene-symbol matching (uppercase) is acceptable. For a publication, use BioMart 1:1 ortholog mapping per the BioMart Tutorial.

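mat_ri['GENE_UPPER'] = mat_ri['Gene'].str.upper()
intest_ri['GENE_UPPER'] = intest_ri['Gene'].str.upper()

# Tag iBAQ columns explicitly (merge() only suffixes colliding names) and drop
# rows without a gene symbol (pandas would join NaN keys to each other)
mat_m = mat_ri.rename(columns={c: c + '_mat' for c in mat_ibaq_cols}).dropna(subset=['GENE_UPPER'])
intest_m = intest_ri.rename(columns={c: c + '_intest' for c in intest_ibaq_cols}).dropna(subset=['GENE_UPPER'])

shared = mat_m.merge(intest_m, on='GENE_UPPER', suffixes=('_mat', '_intest'))
print(f"Shared by gene symbol: {len(shared)}")
# Real-data result: ~285 shared proteins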

That's 285 shared proteins out of 998 porcine and 672 mouse protein groups: about 29% of the porcine set and 42% of the mouse set. The porcine figure is consistent with the expected 20-30% mapping rate for cross-species ECM proteomics.
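
The overlap percentage depends on which denominator you pick, so it is worth computing all of them. The arithmetic below uses the counts quoted in this post and assumes one row per gene symbol in each table.

```python
# Counts quoted in this post (post-filter protein groups and shared symbols)
n_shared, n_porcine, n_mouse = 285, 998, 672

frac_porcine = n_shared / n_porcine                    # share of porcine groups matched
frac_mouse = n_shared / n_mouse                        # share of mouse groups matched
jaccard = n_shared / (n_porcine + n_mouse - n_shared)  # shared / union

print(f"porcine: {frac_porcine:.1%}, mouse: {frac_mouse:.1%}, Jaccard: {jaccard:.1%}")
# prints "porcine: 28.6%, mouse: 42.4%, Jaccard: 20.6%"
```

When you report an overlap rate in a Methods section, state the denominator next to it.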

Differential Abundance (Welch's t-test + BH-FDR)

from scipy import stats
from statsmodels.stats.multitest import multipletests

mat_cols = [c for c in shared.columns if c.startswith('iBAQ') and c.endswith('_mat')]
int_cols = [c for c in shared.columns if c.startswith('iBAQ') and c.endswith('_intest')]

rows = []
for _, r in shared.iterrows():
    a = np.log2(r[int_cols].astype(float).replace(0, np.nan).dropna().values + 1e-12)
    b = np.log2(r[mat_cols].astype(float).replace(0, np.nan).dropna().values + 1e-12)
    if len(a) < 3 or len(b) < 3:
        continue
    t, p = stats.ttest_ind(a, b, equal_var=False)
    rows.append({'Gene': r['GENE_UPPER'], 'log2FC': a.mean() - b.mean(), 'p': p})

dep = pd.DataFrame(rows)
dep['FDR'] = multipletests(dep['p'], method='fdr_bh')[1]
dep['DEP'] = (dep['FDR'] < 0.05) & (dep['log2FC'].abs() >= 1)
print(f"DEPs: {dep['DEP'].sum()}")  # ~131

You'll get ~131 DEPs total — about 81 up in intestine ECM, 50 up in Matrigel.
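
Splitting the significant hits by direction is a short filter on the `dep` table. The sketch below assumes the same `Gene`, `log2FC`, and `FDR` columns and the same sign convention as the loop above (positive log2FC means higher in intestine ECM). `split_by_direction` is a hypothetical helper and the four toy rows are invented values, not results from the dataset.

```python
import pandas as pd

def split_by_direction(dep, fdr_cutoff=0.05, lfc_cutoff=1.0):
    """Return (up in intestine ECM, up in Matrigel), strongest effect first."""
    sig = dep[(dep["FDR"] < fdr_cutoff) & (dep["log2FC"].abs() >= lfc_cutoff)]
    up_intestine = sig[sig["log2FC"] > 0].sort_values("log2FC", ascending=False)
    up_matrigel = sig[sig["log2FC"] < 0].sort_values("log2FC")
    return up_intestine, up_matrigel

# Invented example rows
toy_dep = pd.DataFrame({
    "Gene": ["COL1A1", "LAMB1", "NID1", "ACTB"],
    "log2FC": [3.2, -4.1, -2.5, 0.3],
    "FDR": [0.001, 0.0005, 0.01, 0.60],
})
up_int, up_mat = split_by_direction(toy_dep)
print("Up in intestine ECM:", list(up_int["Gene"]))
print("Up in Matrigel:", list(up_mat["Gene"]))
```

On the real table, `len(up_int)` and `len(up_mat)` give the two direction counts directly.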

What Replicates the Simulation (and Park 2026)

Matrigel basement membrane Top-4

Sort dep for Matrigel-up (negative log2FC in the above setup), check the top:

Gene	Function
LAMB1	Laminin β1 — basement membrane core
NID1	Nidogen-1 — basement membrane bridge
LAMC1	Laminin γ1 — basement membrane core
HSPG2	Perlecan — basement membrane proteoglycan

This is the canonical basement membrane signature. Matches Park 2026 Fig 3 Top-10 Up in MAT. Matches the simulation predictions. Matches biology — Matrigel is the EHS basement membrane extract.
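
A quick way to run this check programmatically is to rank every gene by log2FC and look up where the four markers land. This is a sketch under the sign convention used above (most negative log2FC is most Matrigel-enriched); `marker_ranks` is a hypothetical helper and the toy log2FC values are invented.

```python
import pandas as pd

BM_MARKERS = ["LAMB1", "NID1", "LAMC1", "HSPG2"]

def marker_ranks(dep, markers=BM_MARKERS):
    """1-based rank of each marker when sorted by ascending log2FC; None if absent."""
    ranked = dep.sort_values("log2FC").reset_index(drop=True)
    rank_of = {gene: i + 1 for i, gene in enumerate(ranked["Gene"])}
    return {m: rank_of.get(m) for m in markers}

# Invented values, only to show the lookup
toy_dep = pd.DataFrame({
    "Gene": ["COL1A1", "EWSR1", "HSPG2", "LAMC1", "NID1", "LAMB1"],
    "log2FC": [5.0, -4.0, -4.5, -5.0, -5.5, -6.0],
})
ranks = marker_ranks(toy_dep)
print(ranks)
```

A marker that returns None never made it into the shared table, which usually points to a gene-symbol mapping problem rather than a statistics problem.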

Porcine ECM collagen signature

Sort for porcine-up:

Gene	Function
COL1A1	Type I collagen α1 — most abundant tissue ECM
COL6A1, COL6A2	Type VI collagen — pericellular
COL14A1, COL12A1	FACIT collagens
BGN	Biglycan — small leucine-rich proteoglycan
DCN	Decorin (frequently seen)

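Tissue-derived ECM scaffolds → strong interstitial collagen signature: fibrillar type I collagen plus the pericellular (type VI) and fibril-associated (FACIT) collagens. Expected and reproduced.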

What Real Data Shows That Simulations Don't

Three categories of "noise" that only appear when you run the pipeline on actual deposited data:

1. Matrigel nuclear protein contaminants

Mouse Matrigel is extracted from EHS (Engelbreth-Holm-Swarm) mouse tumors. Tumor cells contain nuclei, and the extraction does not fully exclude intracellular proteins:

Gene	Localization	Why it's there
EWSR1	Nucleus, RNA-binding	EHS tumor nuclear residue
RUVBL2	Nucleus, ATPase	Tumor nuclear remnant
HCFC1	Nucleus, transcription regulator	Tumor nuclear remnant
ERP29	ER chaperone	Tumor secretory residue

If you're not expecting them, EWSR1 jumping to the top of "Matrigel-specific" looks like a finding. It isn't: it's carry-over from the tumor source material. Matrigel characterization work such as Hughes et al. (2010) describes Matrigel as a complex protein mixture that goes well beyond its basement membrane components, so this is a recognized characteristic.

Practical handling: in your Methods, note that nuclear/secretory residues are an expected Matrigel feature and either (a) filter them out of ECM-specific comparisons by gene ontology, or (b) report them transparently as scaffold characterization.
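
Option (a) can be sketched as a simple annotation step. The lookup below is hand-curated from the table above and stands in for a real gene ontology cellular-component annotation, which you would substitute in practice; `annotate_compartment` and the column names it adds are hypothetical.

```python
import pandas as pd

# Hand-curated from the table above; a stand-in for GO cellular-component terms
INTRACELLULAR_FLAGS = {
    "EWSR1": "nucleus",
    "RUVBL2": "nucleus",
    "HCFC1": "nucleus",
    "ERP29": "endoplasmic reticulum",
}

def annotate_compartment(dep, flags=INTRACELLULAR_FLAGS):
    """Add a compartment flag and a boolean marking rows kept for ECM comparisons."""
    out = dep.copy()
    out["compartment_flag"] = out["Gene"].map(flags)
    out["ecm_candidate"] = out["compartment_flag"].isna()
    return out

# Invented log2FC values
toy_dep = pd.DataFrame({
    "Gene": ["LAMB1", "EWSR1", "NID1", "ERP29"],
    "log2FC": [-6.0, -5.8, -5.5, -3.0],
})
annotated = annotate_compartment(toy_dep)
ecm_only = annotated[annotated["ecm_candidate"]]
print(list(ecm_only["Gene"]))
```

Keeping the flag as a column, rather than deleting rows, also covers option (b): the flagged proteins stay in the table for transparent reporting.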

2. Porcine ECM smooth muscle remnants

Decellularization protocols target nuclear DNA and cytoplasmic protein but don't fully remove cytoskeletal residues:

Gene	Cell type marker	Significance
TPM4	Smooth muscle tropomyosin	Decellularization completeness indicator
MYL9	Smooth muscle myosin light chain	Same
ACTA2 (more in stomach samples)	Alpha smooth muscle actin	Same
MYH11 (more in stomach samples)	Smooth muscle myosin heavy chain	Same

Stomach ECM has more residual smooth muscle protein than intestine ECM. Esophagus, the tissue in the Park 2026 system, is expected to retain less than either. This is useful: the summed abundance of these markers works as an internal QC metric for decellularization efficiency.
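
One way to turn these markers into a number is to sum their riBAQ per sample, which gives the fraction of total protein signal attributable to smooth muscle remnants. This is a sketch, not an established metric: `smooth_muscle_fraction` is a hypothetical helper, the marker list is the one from the table above, and the toy values are invented.

```python
import pandas as pd

SMC_MARKERS = ["TPM4", "MYL9", "ACTA2", "MYH11"]

def smooth_muscle_fraction(ribaq, sample_cols, markers=SMC_MARKERS):
    """Per-sample summed riBAQ of the smooth muscle markers."""
    is_marker = ribaq["Gene"].str.upper().isin(markers)  # missing symbols count as False
    return ribaq.loc[is_marker, sample_cols].sum(axis=0)

# Invented riBAQ table with two samples; columns already sum to 1
toy = pd.DataFrame({
    "Gene": ["COL1A1", "ACTA2", "MYH11", None],
    "S1": [0.90, 0.05, 0.03, 0.02],
    "S2": [0.97, 0.01, 0.01, 0.01],
})
frac = smooth_muscle_fraction(toy, ["S1", "S2"])
print(frac)
```

Comparing this fraction between the stomach and intestine tables is a direct test of the ordering described above.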

3. Matrigel batch variation

PCA of the four Matrigel replicates shows one sample (often MAT_4) separating along PC2. This is the well-known Matrigel batch-to-batch variation. It's why downstream Matrigel users worry about lot-to-lot reproducibility.
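
A minimal PCA for this check needs only NumPy. The sketch below log-transforms riBAQ, keeps proteins quantified in every sample, centers each protein, and takes the SVD. `pca_scores` is a hypothetical helper, the sample names are placeholders, and the toy matrix is random data with one deliberately perturbed sample, so it says nothing about which component separates the outlier in the real dataset.

```python
import numpy as np
import pandas as pd

def pca_scores(ribaq, sample_cols, n_components=2):
    """PCA of samples on log2 riBAQ, using proteins quantified in every sample."""
    complete = ribaq[sample_cols].replace(0, np.nan).dropna()
    logged = np.log2(complete.to_numpy(dtype=float)).T   # samples x proteins
    centered = logged - logged.mean(axis=0)              # center each protein
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    names = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(scores, index=sample_cols, columns=names), explained[:n_components]

# Random toy data: 50 proteins, 4 replicates, the last one perturbed on 10 proteins
rng = np.random.default_rng(0)
cols = ["MAT_1", "MAT_2", "MAT_3", "MAT_4"]
base = rng.lognormal(size=(50, 1))
values = base * rng.lognormal(sigma=0.1, size=(50, 4))
values[:10, 3] *= 4.0
toy = pd.DataFrame(values / values.sum(axis=0), columns=cols)

scores, explained = pca_scores(toy, cols)
print(scores.round(2))
print("explained variance:", explained.round(2))
```

With only four samples there are at most three informative components, so read the plot as a sanity check rather than a formal outlier test.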

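For PXD023694 specifically: you can quantitatively see it. For your own Matrigel work: budget for it. Buy from a single lot if your experiment is sensitive, or use a defined-composition alternative such as recombinant laminin substrates if you can. Note that Geltrex and Cultrex Reduced Growth Factor BME are themselves EHS-derived basement membrane extracts, so they are alternative suppliers rather than defined-composition substitutes.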

Methods Statement You Can Use

For reanalysis posts/papers using PXD023694:

"PRIDE dataset PXD023694 was used as an independent test set. The pre-processed proteinGroups.txt files for Matrigel (n=4) and porcine intestine ECM (n=4) were downloaded from the PRIDE FTP (https://www.ebi.ac.uk/pride/archive/projects/PXD023694, accessed YYYY-MM-DD). Reverse hits, MaxQuant potential contaminants, and 'only identified by site' entries were removed. iBAQ values were normalized to per-sample mole fraction (riBAQ). Cross-species protein groups were matched at the gene symbol level [or by Ensembl BioMart Release N 1:1 orthologs]. Differential abundance was computed by Welch's t-test on log2-transformed riBAQ with Benjamini-Hochberg FDR correction (significant: FDR<0.05, |log2FC|≥1)."

FAQ

Q: Do I need to download the raw .raw files?

No, for reanalysis at the protein level the proteinGroups.txt files are sufficient. You only need raw files if you want to redo the database search with a different version/parameters.

Q: Why does Park 2026 not deposit the actual esophagus data?

Various reasons: sometimes raw data is held back for ongoing follow-up studies, sometimes embargo periods. PXD023694 is the same lab's earlier, related deposit and serves as a close proxy.

Q: My matrigel results are different from yours. What's different?

Likely differences: (a) MaxQuant version (search engine improvements), (b) FASTA database release (UniProt has updated since 2021), (c) different parameter group settings, (d) statistical filters. The qualitative signatures (LAMB1/NID1/LAMC1/HSPG2 in MAT; collagens in ECM) should reproduce regardless.

Q: Can I publish a paper using this dataset?

Yes. PRIDE-deposited data is intended for reuse. Cite the original dataset DOI and the Park 2026 / Cho lab papers. Don't claim original data generation; do claim original analysis.

Q: Is there an esophagus equivalent on PRIDE?

As of 2026, none specifically from the Cho lab. Other groups have deposited general esophagus tissue proteomics (search PRIDE for "esophagus") but not the decellularized ECM workflow. PXD023694 (intestine) is the closest proxy.

Q: How do I tell true Matrigel-up biology from EWSR1-type contamination?

Rule of thumb: ECM-functional gene ontology (extracellular matrix, basement membrane, collagen catabolism) → real biology. Nuclear / RNA-binding / chromatin GO terms → contamination. When in doubt, check published Matrigel characterization papers (Hughes et al. 2010 Proteomics and follow-ups).

Q: Is "intestine vs esophagus" close enough as a positive control?

Functionally yes for the ECM signature replication question (collagens vs basement membrane), but tissue-specific ECM differences exist (esophagus stratified squamous epithelium vs intestine columnar epithelium). For specific markers you'd want the actual esophagus data when it becomes available.

Closing — Why Reanalyzing PRIDE Datasets Matters

Most published proteomics datasets sit in PRIDE unused after the original paper. Reanalysis serves three purposes:

Pipeline validation — confirms your method works on real data, not just simulations or your own (potentially biased) samples
Independent replication — strengthens the field's confidence in core findings (Matrigel ≠ tissue ECM, in this case)
New insights from old data — your pipeline, your statistical lens, sometimes finds patterns the original analysis missed

PXD023694 specifically is one of the most reusable cross-species ECM datasets available. The basement membrane signature replicates cleanly. The contamination patterns are educational. If you're starting cross-species ECM proteomics, run your pipeline through this dataset first — if you can't reproduce LAMB1/NID1/LAMC1/HSPG2 at the top of Matrigel-up, something is broken in your downstream code.

References:

Park, S. M. et al. (2026). Decellularized esophageal ECM. Yonsei, Cho Lab.
PRIDE dataset: https://www.ebi.ac.uk/pride/archive/projects/PXD023694
Hughes, C. S., Postovit, L. M., Lajoie, G. A. (2010). Matrigel: a complex protein mixture required for optimal growth of cell culture. Proteomics, 10, 1886-1890.
Perez-Riverol, Y. et al. (2022). The PRIDE database resources in 2022. Nucleic Acids Research, 50, D543-D552.