BioMart Pig ↔ Mouse 1:1 Ortholog Mapping for Cross-Species Proteomics (R + Python Tutorial)
Step-by-step tutorial: download the Ensembl BioMart Release 111 Sus scrofa ↔ Mus musculus 1:1 ortholog table and join it to your per-species protein quantification in R (biomaRt) and Python (pandas). Why gene-symbol matching is fragile, how to handle one-to-many and many-to-many orthologs, and how to validate your mapping against UniProt.
Why Gene-Symbol Matching Isn't Good Enough
If you run cross-species LC-MS/MS proteomics (porcine ECM vs. mouse Matrigel, human vs. mouse cells, host-pathogen), you eventually need to join two species' protein tables to compare them. The fast-and-loose way is gene-symbol matching with uppercase normalization:
df['GENE_UPPER'] = df['Genes'].str.upper()
merged = porcine_df.merge(mouse_df, on='GENE_UPPER')
This works for ~80% of well-annotated proteins. It fails silently for:
- Paralogs: GAPDH has paralogs in human and mouse (GAPDHS, GAPDH-like). A naive uppercase match can collapse them.
- Symbol drift: gene symbols change between annotations; mouse Hba1 vs human HBA1 look identical but aren't always 1:1 orthologs.
- Many-to-many: some gene families (immunoglobulins, MHC, olfactory receptors) have lineage-specific expansions.
- Lineage-specific genes: one species has a gene and the other doesn't.
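One way to see how quietly the symbol merge drops rows is to run it as an outer join with pandas' `indicator=True`. This is a minimal sketch on invented toy tables (the gene lists and intensities are illustrative, not real data):

```python
import pandas as pd

# Toy tables; symbols and intensities are invented for illustration.
porcine_df = pd.DataFrame({"Genes": ["COL1A1", "FN1", "SLA-1"], "pig_int": [9.1, 8.4, 7.0]})
mouse_df = pd.DataFrame({"Genes": ["Col1a1", "Fn1", "H2-K1"], "mouse_int": [8.8, 8.9, 6.5]})

for df in (porcine_df, mouse_df):
    df["GENE_UPPER"] = df["Genes"].str.upper()

# An outer join keeps the rows that an inner join would drop without warning.
audit = porcine_df.merge(
    mouse_df, on="GENE_UPPER", how="outer",
    suffixes=("_pig", "_mouse"), indicator=True,
)
print(audit["_merge"].value_counts())

unmatched = audit.loc[audit["_merge"] != "both", "GENE_UPPER"].tolist()
print("Unmatched symbols:", unmatched)
```

The audit only exposes rows that failed to match. It cannot tell you whether a matched pair is a true ortholog, which is the gap the BioMart table fills.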
The robust alternative is the Ensembl BioMart 1:1 ortholog table: the set of orthologs that Ensembl Compara flags as "one-to-one" between a species pair. This guide shows the exact workflow for Sus scrofa ↔ Mus musculus at Release 111, with R and Python implementations.
Related: Why You Must NOT Merge Species FASTA Databases in Cross-Species Proteomics covers the upstream principle — why you have two separate species tables that need joining in the first place.
Step 1 — What Is "1:1 Ortholog" in BioMart
Ensembl Compara classifies cross-species gene relationships:
| Type | Meaning |
|---|---|
| ortholog_one2one | One gene in each species, single common ancestor |
| ortholog_one2many | One gene → multiple copies in the other species (lineage expansion) |
| ortholog_many2many | Multiple in both (gene family expansion in both lineages) |
| paralog | Within-species duplication |
| apparent_ortholog_one2one | Special case, less stringent |
For cross-species quantitative proteomics comparison, only ortholog_one2one is safe. The others require case-by-case handling.
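Before filtering, it can be worth tallying how many rows fall into each class so you know what the strict filter discards. A small sketch, assuming the web-UI column name `Mouse homology type`; the IDs in the toy table are placeholders:

```python
import pandas as pd

# Toy BioMart-style export; gene IDs are placeholders, not real accessions.
bm = pd.DataFrame({
    "Gene stable ID": ["P1", "P2", "P2", "P3", "P4"],
    "Mouse gene stable ID": ["M1", "M2", "M3", "M4", None],
    "Mouse homology type": [
        "ortholog_one2one", "ortholog_one2many",
        "ortholog_one2many", "ortholog_one2one", None,
    ],
})

# Tally every class, including genes with no mouse homolog (NaN).
type_counts = bm["Mouse homology type"].value_counts(dropna=False)
print(type_counts)

n_one2one = int((bm["Mouse homology type"] == "ortholog_one2one").sum())
print(f"Kept by a strict 1:1 filter: {n_one2one} of {len(bm)} rows")
```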
Step 2 — Download via BioMart Web UI
The fastest path:
- Go to https://www.ensembl.org/biomart/martview
- CHOOSE DATABASE: Ensembl Genes 111 (or current release)
- CHOOSE DATASET: Pig genes (Sus scrofa)
- Filters: none
- Attributes:
  - Features → GENE → Gene stable ID, Gene name, UniProtKB/Swiss-Prot ID
  - HOMOLOGUES → Orthologues → expand "Mouse Orthologues":
    - Mouse gene stable ID
    - Mouse gene name
    - Mouse homology type (this is where you'll filter to ortholog_one2one)
    - % identity (optional, useful for QC)
- Results → Compressed: No → Format: TSV → Go
Save as biomart_pig_mouse_r111.tsv. Expect ~25,000-30,000 rows (most pig genes have mouse counterparts, including some non-1:1).
Filter to 1:1 in your downstream code:
import pandas as pd
bm = pd.read_csv('biomart_pig_mouse_r111.tsv', sep='\t')
bm_one2one = bm[bm['Mouse homology type'] == 'ortholog_one2one'].copy()
print(f"1:1 orthologs: {len(bm_one2one):,}") # typically ~13,000-14,000
For the Pig (Sus scrofa) ↔ Mouse (Mus musculus) pair at Release 111, you'll get roughly 13,476 1:1 pairs. Numbers shift slightly with each release.
Step 3 — Programmatic Download (R, biomaRt)
If you want the download reproducible in a script:
library(biomaRt)
# Pig (Sus scrofa) as the source
pig <- useEnsembl(biomart = "genes", dataset = "sscrofa_gene_ensembl", version = 111)
# Pull orthologs to mouse
orthos <- getBM(
  attributes = c(
    "ensembl_gene_id",
    "external_gene_name",
    "uniprotswissprot",
    "mmusculus_homolog_ensembl_gene",
    "mmusculus_homolog_associated_gene_name",
    "mmusculus_homolog_orthology_type",
    "mmusculus_homolog_perc_id"
  ),
  mart = pig
)
# Filter to 1:1
one2one <- subset(orthos, mmusculus_homolog_orthology_type == "ortholog_one2one")
write.table(one2one, "biomart_pig_mouse_one2one_r111.tsv",
            sep = "\t", row.names = FALSE, quote = FALSE)
Caveats:
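- Ensembl mirror availability fluctuates, and useEnsembl() sometimes times out. Use host = "https://www.ensembl.org" or specify a mirror (mirror = "asia"). In recent biomaRt versions the mirror argument is ignored when version is also set, so a pinned release is always fetched from the main site.
- For reproducibility, pin the version = argument. Don't rely on "current."
- Record the release number and download date in your Methods.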
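The last caveat is easy to forget months later, so one option is to save a small provenance record next to the ortholog table at download time. The sketch below is an assumption about how you might do that; the function name and field names are hypothetical, not part of biomaRt or BioMart:

```python
import json
from datetime import date


def provenance_record(release, source_dataset, target_prefix, homology_filter):
    """Bundle the facts a Methods section needs into one dictionary."""
    return {
        "ensembl_release": release,
        "source_dataset": source_dataset,
        "target_attribute_prefix": target_prefix,
        "homology_filter": homology_filter,
        "download_date": date.today().isoformat(),
    }


record = provenance_record(
    111, "sscrofa_gene_ensembl", "mmusculus_homolog_", "ortholog_one2one"
)
# In practice, write this string to a file stored beside the TSV.
provenance_json = json.dumps(record, indent=2)
print(provenance_json)
```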
Step 4 — Programmatic Download (Python, pybiomart)
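from pybiomart import Server
# This host serves the current release. To pin Release 111, point at the
# matching Ensembl archive host instead (see the archive list on ensembl.org).
server = Server(host='https://www.ensembl.org')
mart = server['ENSEMBL_MART_ENSEMBL']
dataset = mart['sscrofa_gene_ensembl']
# use_attr_names=True returns attribute names as column headers. The default is
# display names such as 'Mouse homology type', which would not match Step 5.
orthos = dataset.query(
    attributes=[
        'ensembl_gene_id',
        'external_gene_name',
        'uniprotswissprot',
        'mmusculus_homolog_ensembl_gene',
        'mmusculus_homolog_associated_gene_name',
        'mmusculus_homolog_orthology_type',
        'mmusculus_homolog_perc_id',
    ],
    use_attr_names=True,
)
one2one = orthos[orthos['mmusculus_homolog_orthology_type'] == 'ortholog_one2one']
one2one.to_csv('biomart_pig_mouse_one2one_r111.tsv', sep='\t', index=False)
print(f"Wrote {len(one2one):,} 1:1 orthologs")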
pybiomart works but is unmaintained as of 2026. Alternatives:
- gget (https://github.com/pachterlab/gget): modern, fast, has gget.search() and gget.info() for Ensembl lookups; check its documentation for what it returns about orthologs
- Direct HTTP query to https://rest.ensembl.org/homology/... (full control, slower)
Step 5 — Join to Your Proteomics Quantification Tables
After species-separated MaxQuant/DIA-NN searches, you have:
- porcine_pg_matrix.tsv (Porcine proteins, columns: Genes, sample_1, sample_2, ...)
- mouse_pg_matrix.tsv (Mouse proteins, columns: Genes, sample_1, sample_2, ...)
Join via the BioMart table:
import pandas as pd
porcine = pd.read_csv('porcine_pg_matrix.tsv', sep='\t')
mouse = pd.read_csv('mouse_pg_matrix.tsv', sep='\t')
bm = pd.read_csv('biomart_pig_mouse_one2one_r111.tsv', sep='\t')
# Keep only the ortholog mapping columns
bm_map = bm[['external_gene_name', 'mmusculus_homolog_associated_gene_name']].rename(
    columns={
        'external_gene_name': 'pig_gene',
        'mmusculus_homolog_associated_gene_name': 'mouse_gene',
    }
)
# Drop rows with a missing symbol and collapse repeated pairs
# (extra attributes such as UniProt IDs can repeat the same gene pair)
bm_map = bm_map.dropna().drop_duplicates().reset_index(drop=True)
# Add an ortholog ID
bm_map['ortho_id'] = range(len(bm_map))
# Map porcine proteins to ortholog IDs
porcine_mapped = porcine.merge(
    bm_map[['pig_gene', 'ortho_id']],
    left_on='Genes', right_on='pig_gene', how='inner'
)
# Map mouse proteins to ortholog IDs
mouse_mapped = mouse.merge(
    bm_map[['mouse_gene', 'ortho_id']],
    left_on='Genes', right_on='mouse_gene', how='inner'
)
# Inner-join on ortho_id: porcine and mouse rows are now aligned by orthology
combined = porcine_mapped.merge(
    mouse_mapped, on='ortho_id', suffixes=('_pig', '_mouse')
)
print(f"Porcine proteins quantified: {len(porcine):,}")
print(f"Mouse proteins quantified: {len(mouse):,}")
print(f"After 1:1 ortholog join: {len(combined):,} ortholog pairs")
What to expect (realistic numbers from PRIDE PXD023694-style data)
- Porcine intestine ECM after QC: ~1,000 proteins
- Mouse Matrigel after QC: ~700 proteins
- 1:1 ortholog pairs after join: ~280-300
Why so few? Two compounding effects:
- Not every protein is detected in both species (biological + technical reasons)
- Some proteins lack 1:1 orthologs (paralog expansions, lineage-specific genes)
This is normal. Against the ~1,000 porcine proteins above, 280-300 pairs is a mapping rate of roughly 30%, which is realistic for ECM proteomics with current Ensembl annotation.
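To see which of the two effects dominates in your own data, you can count where proteins drop out at each stage. This is a sketch on toy inputs; in practice `bm_map` and the two `Genes` columns would come from Step 5, and `LOC100` is a made-up symbol:

```python
import pandas as pd

# Toy inputs standing in for the Step 5 tables.
bm_map = pd.DataFrame({
    "pig_gene": ["COL1A1", "FN1", "LAMB1"],
    "mouse_gene": ["Col1a1", "Fn1", "Lamb1"],
})
porcine_genes = pd.Series(["COL1A1", "FN1", "LOC100"])
mouse_genes = pd.Series(["Col1a1", "Lamb1"])

pig_has_ortholog = porcine_genes.isin(bm_map["pig_gene"])
mouse_has_ortholog = mouse_genes.isin(bm_map["mouse_gene"])
# A pair survives the join only if both partners were quantified.
both = bm_map["pig_gene"].isin(porcine_genes) & bm_map["mouse_gene"].isin(mouse_genes)

report = {
    "pig_quantified": len(porcine_genes),
    "pig_with_1to1_ortholog": int(pig_has_ortholog.sum()),
    "mouse_quantified": len(mouse_genes),
    "mouse_with_1to1_ortholog": int(mouse_has_ortholog.sum()),
    "pairs_detected_in_both": int(both.sum()),
}
for stage, count in report.items():
    print(f"{stage}: {count}")
```

A large gap between "quantified" and "with 1:1 ortholog" points to the orthology filter; a large gap between "with 1:1 ortholog" and "pairs detected in both" points to detection differences between the samples.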
Step 6 — Validate Your Mapping
After joining, sanity-check before running statistics:
Check 1 — known orthologs are present
known_orthos = [
    ('COL1A1', 'Col1a1'),  # collagen I alpha 1
    ('LAMB1', 'Lamb1'),    # laminin beta 1
    ('GAPDH', 'Gapdh'),    # housekeeping
    ('ACTB', 'Actb'),      # actin
    ('FN1', 'Fn1'),        # fibronectin
]
# Loop variables end in _sym so they don't overwrite the `mouse` DataFrame from Step 5
for pig_sym, mouse_sym in known_orthos:
    found = ((bm_map['pig_gene'] == pig_sym) & (bm_map['mouse_gene'] == mouse_sym)).any()
    print(f"  {pig_sym} ↔ {mouse_sym}: {'✓' if found else '✗ MISSING'}")
If any of these are missing, recheck your BioMart download (wrong filter? wrong release? incomplete columns?).
Check 2 — duplicate handling
A "1:1" ortholog table should have at most one row per pig gene and per mouse gene. Verify:
print("Pig genes appearing >1x:", (bm_map['pig_gene'].value_counts() > 1).sum())
print("Mouse genes appearing >1x:", (bm_map['mouse_gene'].value_counts() > 1).sum())
# Both should be 0 for true 1:1
If non-zero, your filter let some non-1:1 entries through.
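A non-zero count does not always mean a bad filter. BioMart returns one row per attribute combination, so a gene with several Swiss-Prot accessions can repeat the same gene pair. The sketch below, on a toy table with placeholder accessions, lists the offending rows and checks whether plain de-duplication resolves them:

```python
import pandas as pd

# Toy 1:1 table where one gene carries two Swiss-Prot accessions (placeholders).
bm = pd.DataFrame({
    "external_gene_name": ["COL1A1", "FN1", "FN1"],
    "uniprotswissprot": ["ACC1", "ACC2", "ACC3"],
    "mmusculus_homolog_associated_gene_name": ["Col1a1", "Fn1", "Fn1"],
})
pairs = bm[["external_gene_name", "mmusculus_homolog_associated_gene_name"]].rename(
    columns={
        "external_gene_name": "pig_gene",
        "mmusculus_homolog_associated_gene_name": "mouse_gene",
    }
)

# Show every row involved in a repeated pig gene.
dup_rows = pairs[pairs.duplicated("pig_gene", keep=False)]
print(dup_rows)

# Identical pairs collapse; anything still repeated is a genuine non-1:1 entry.
dedup = pairs.drop_duplicates()
still_ambiguous = bool(
    dedup["pig_gene"].duplicated().any() or dedup["mouse_gene"].duplicated().any()
)
print("Ambiguous after de-duplication:", still_ambiguous)
```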
Check 3 — protein groups with multiple gene names
MaxQuant/DIA-NN sometimes outputs Genes like COL1A1;COL1A2 (protein group has multiple inferred genes). Decide upfront:
- Split and map each → conservative
- Use only the first → simpler, slightly lossy
- Drop multi-gene protein groups → cleanest but loses data
For ECM proteomics with collagens and laminins, multi-gene groups are common. Often the simplest defensible choice is to drop them in the cross-species comparison and report it as a limitation.
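The three options can be sketched in pandas as follows, assuming a semicolon-separated `Genes` column as in the example above (toy values):

```python
import pandas as pd

pg = pd.DataFrame({
    "Genes": ["COL1A1;COL1A2", "FN1", "LAMB1;LAMB2"],
    "sample_1": [10.0, 8.0, 6.0],  # toy intensities
})
is_multi = pg["Genes"].str.contains(";", regex=False)

# Option 1: split and map each gene (one row per gene, intensity repeated).
split_each = pg.assign(Genes=pg["Genes"].str.split(";")).explode("Genes", ignore_index=True)

# Option 2: keep only the first gene of each group.
first_only = pg.assign(Genes=pg["Genes"].str.split(";").str[0])

# Option 3: drop multi-gene groups entirely.
dropped = pg[~is_multi].reset_index(drop=True)

print(len(split_each), len(first_only), len(dropped))
```

Note that option 1 repeats the same intensity on several rows, so downstream statistics need to account for that non-independence.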
Common Pitfalls
"I'm using the wrong species pair"
BioMart asks for "the dataset" (e.g., Sus scrofa) and the homologue species attributes (Mouse Orthologues). The attribute prefix mmusculus_homolog_* is what specifies mouse. Make sure you're not pulling hsapiens_homolog_* (human) by accident — easy mistake when copy-pasting query code.
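A cheap guard against this slip is to check the attribute list before sending the query. The helper below is hypothetical (not part of biomaRt or pybiomart) and relies only on the `<species>_homolog_` naming pattern described above:

```python
attributes = [
    "ensembl_gene_id",
    "external_gene_name",
    "mmusculus_homolog_ensembl_gene",
    "hsapiens_homolog_orthology_type",  # deliberate copy-paste slip
]


def wrong_species_attributes(attrs, expected_prefix="mmusculus_homolog_"):
    """Return homolog attributes that do not target the expected species."""
    return [a for a in attrs if "_homolog_" in a and not a.startswith(expected_prefix)]


offenders = wrong_species_attributes(attributes)
print("Unexpected homolog attributes:", offenders)
```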
"Old Ensembl release"
useEnsembl(version = 90) from a tutorial pulls 2017 annotations. Always check which release is current on the Ensembl site rather than trusting a tutorial, this one included: Release 111, used here, dates from January 2024. Use the current release unless you have a specific reproducibility reason to pin an older one, and if you pin, document it.
"Symbol case mismatch in your data"
Mouse genes are conventionally lowercase except first letter (Col1a1). Human and pig conventionally uppercase (COL1A1). BioMart returns them in their native conventions. When joining to your proteomics data, check which case convention MaxQuant used and either match exactly or apply .str.upper() on both sides consistently.
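One way to apply the normalization consistently is to build a separate uppercase key on both sides and leave the original symbol columns untouched. A sketch on toy tables; the column name `mouse_key` is an illustrative choice:

```python
import pandas as pd

bm_map = pd.DataFrame({"pig_gene": ["COL1A1", "FN1"], "mouse_gene": ["Col1a1", "Fn1"]})
# Toy quantification table with inconsistent symbol case.
mouse = pd.DataFrame({"Genes": ["COL1A1", "fn1"], "sample_1": [8.8, 8.9]})

# An exact-match join silently finds nothing here.
exact = mouse.merge(bm_map, left_on="Genes", right_on="mouse_gene", how="inner")

# Build the same normalized key on both sides, then join on it.
bm_map["mouse_key"] = bm_map["mouse_gene"].str.upper()
mouse["mouse_key"] = mouse["Genes"].str.upper()
keyed = mouse.merge(bm_map, on="mouse_key", how="inner")

print(f"exact match: {len(exact)} rows, normalized key: {len(keyed)} rows")
```

The orthology itself still comes from the BioMart table; the uppercase key only reconciles spelling between your data and the same species' symbols.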
"Some highly conserved proteins missing"
Truly identical sequences across species sometimes lack a "1:1 ortholog" assignment due to annotation gaps. Workaround: cross-reference with OrthoDB or InParanoid if a critical protein is missing from BioMart.
"ortholog_one2one is too strict"
For some studies you may want to include apparent_ortholog_one2one or even ortholog_one2many (collapsing the many-side via summation). Decide based on biological question; default to strict 1:1 unless you have a reason.
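If you do include ortholog_one2many, collapsing the many-side by summation might look like the sketch below. It assumes linear-scale intensities (summing log-transformed values would not be meaningful) and uses placeholder gene names:

```python
import pandas as pd

# Toy one2many mapping: one pig gene with two mouse co-orthologs (placeholder names).
o2m = pd.DataFrame({"pig_gene": ["GENEA", "GENEA"], "mouse_gene": ["Genea1", "Genea2"]})
mouse = pd.DataFrame({
    "Genes": ["Genea1", "Genea2"],
    "sample_1": [100.0, 50.0],  # linear-scale intensities
    "sample_2": [80.0, 40.0],
})

sample_cols = ["sample_1", "sample_2"]
# Sum the co-orthologs per pig gene; min_count=1 keeps all-missing groups as NaN.
collapsed = (
    mouse.merge(o2m, left_on="Genes", right_on="mouse_gene")
         .groupby("pig_gene")[sample_cols]
         .sum(min_count=1)
)
print(collapsed)
```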
Alternatives to BioMart
For most cross-species proteomics, BioMart is the right default. Alternatives worth knowing:
- InParanoid (https://inparanoid.sbc.su.se/): all-vs-all approach, good for less-studied species
- OrthoDB (https://www.orthodb.org/): hierarchical orthology, broader species coverage
- OMA (https://omabrowser.org/): high precision, longer compute
- eggNOG (http://eggnog5.embl.de/): function-aware orthology
Use these when BioMart is missing your species pair or when you need more conservative orthology.
FAQ
Q: Can I use gene-symbol matching for a quick exploration?
Yes, for exploration. But once you commit to publishing, switch to BioMart 1:1. Reviewers in cross-species fields increasingly expect it to be mentioned explicitly in Methods.
Q: What if my data uses Ensembl gene IDs, not gene names?
Even better — Ensembl IDs are stable and unambiguous. Join via ensembl_gene_id ↔ mmusculus_homolog_ensembl_gene and skip the gene name layer entirely.
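A sketch of that ID-based join, assuming each quantification table has a `gene_id` column (a hypothetical name) and using placeholder IDs rather than real accessions:

```python
import pandas as pd

# Toy tables keyed by Ensembl gene IDs (placeholders, not real accessions).
bm = pd.DataFrame({
    "ensembl_gene_id": ["ENSSSCG_A", "ENSSSCG_B"],
    "mmusculus_homolog_ensembl_gene": ["ENSMUSG_A", "ENSMUSG_B"],
})
porcine = pd.DataFrame({"gene_id": ["ENSSSCG_A", "ENSSSCG_B"], "pig_s1": [9.0, 7.5]})
mouse = pd.DataFrame({"gene_id": ["ENSMUSG_A"], "mouse_s1": [8.2]})

# Pig table -> ortholog table -> mouse table, with no gene names involved.
combined = (
    porcine.merge(bm, left_on="gene_id", right_on="ensembl_gene_id")
           .merge(mouse, left_on="mmusculus_homolog_ensembl_gene",
                  right_on="gene_id", suffixes=("_pig", "_mouse"))
)
print(combined[["gene_id_pig", "gene_id_mouse", "pig_s1", "mouse_s1"]])
```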
Q: Does this work for protein IDs (UniProt) instead of gene IDs?
Indirectly. BioMart can provide UniProt IDs as an attribute (uniprotswissprot), and you can join via those. Caveat: some proteins have multiple UniProt accessions per gene (isoforms), so an intermediate gene-level mapping is usually cleaner.
Q: How often should I re-download the ortholog table?
Ensembl publishes several releases per year. For active long-term projects, re-pull each release and re-validate key orthologs. For a single paper, pin the release at the start of analysis and stick with it through publication.
Q: My biomaRt R call is failing with TLS errors.
This is a common issue with corporate proxies and Ensembl mirror outages. For testing only, try httr::set_config(httr::config(ssl_verifypeer = 0L)), which disables certificate verification, or switch mirrors via mirror = "asia". For a one-off analysis, the web UI download is the simpler route; reserve the scripted download for cases where you need reproducibility.
Q: Does this matter for closely related strains within a species?
Generally no. Two C57BL/6 mouse substrains use the same annotation, so no ortholog mapping is needed. Cross-strain or cross-population studies use SNP-based approaches, not orthology.
Closing — The Workflow in One Block
1. Run species-separated proteomics searches (MaxQuant/DIA-NN)
2. Download Ensembl BioMart 1:1 ortholog table (pin release)
3. Filter to ortholog_one2one
4. Sanity-check known orthologs are present
5. Inner-join per-species protein tables via ortholog ID
6. Run cross-species statistics on the joined table
7. Report: BioMart release number + download date + 1:1 filter in Methods
Used this way, cross-species ortholog mapping becomes a transparent, reproducible step instead of a hidden source of errors. The gene-symbol shortcut is fine for exploration; BioMart 1:1 is what survives peer review.
Related posts:
- Reproducing Park et al. 2026 — Cross-Species ECM Proteomics, Three Iterations
- Why You Must NOT Merge Species FASTA Databases in Cross-Species Proteomics
- From DIA-NN Output to Paper Draft: AI-Assisted Proteomics Workflow
- LC-MS/MS Proteomics: Complete Workflow Guide 2026