
BioMart Pig ↔ Mouse 1:1 Ortholog Mapping for Cross-Species Proteomics (R + Python Tutorial)

Step-by-step tutorial: download the Ensembl BioMart Release 111 Sus scrofa ↔ Mus musculus 1:1 ortholog table and join it to your per-species protein quantification in R (biomaRt) and Python (pandas). Why gene-symbol matching is fragile, how to handle one-to-many and many-to-many orthologs, and how to validate your mapping against UniProt.


Why Gene-Symbol Matching Isn't Good Enough

If you run cross-species LC-MS/MS proteomics (porcine ECM vs mouse Matrigel, human vs mouse cells, host-pathogen), you eventually need to join the two species' protein tables to compare them. The fast-and-loose way is gene-symbol matching with uppercase normalization:

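# Build the same uppercase key on both species tables before merging
porcine_df['GENE_UPPER'] = porcine_df['Genes'].str.upper()
mouse_df['GENE_UPPER'] = mouse_df['Genes'].str.upper()
merged = porcine_df.merge(mouse_df, on='GENE_UPPER', suffixes=('_pig', '_mouse'))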

This works for ~80% of well-annotated proteins. It fails silently for:

  • Paralogs: GAPDH has paralogs and related loci (GAPDHS, GAPDH-like genes). A symbol match cannot tell whether the two rows it pairs are true orthologs or different members of the same family
  • Symbol drift: gene symbols change between annotation releases, and two symbols that look identical after uppercasing are not guaranteed to be 1:1 orthologs (alpha-globin genes such as HBA1 are a typical trap)
  • Many-to-many: some gene families (immunoglobulins, MHC, olfactory receptors) have lineage-specific expansions
  • Lineage-specific genes: one species has the gene and the other doesn't
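
To see how much a symbol-only join drops without warning, one option is an outer merge with pandas' `indicator=True`. This is a minimal sketch on made-up toy tables; the gene lists and intensities are hypothetical and only illustrate the bookkeeping:

```python
import pandas as pd

# Hypothetical toy tables; real inputs would be the per-species protein matrices.
porcine_df = pd.DataFrame({"Genes": ["COL1A1", "FN1", "SLA-1"],
                           "intensity": [10.0, 8.0, 5.0]})
mouse_df = pd.DataFrame({"Genes": ["Col1a1", "Fn1", "H2-K1"],
                         "intensity": [9.0, 7.5, 4.0]})

for df in (porcine_df, mouse_df):
    df["GENE_UPPER"] = df["Genes"].str.upper()

# An outer merge keeps unmatched rows; the _merge column records their origin.
audit = porcine_df.merge(mouse_df, on="GENE_UPPER", how="outer",
                         suffixes=("_pig", "_mouse"), indicator=True)
counts = audit["_merge"].value_counts()
print(counts)
```

Rows tagged `left_only` or `right_only` are the ones a plain inner merge would discard silently. An unmatched symbol is not proof of a missing ortholog, only a row the symbol match cannot resolve.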

The robust alternative is the Ensembl BioMart 1:1 ortholog table: the set of orthologs that Ensembl Compara infers from gene trees and flags as "one-to-one" between a species pair. This guide shows the exact workflow for Sus scrofa ↔ Mus musculus at Release 111 (released January 2024), with R and Python implementations.

Related: Why You Must NOT Merge Species FASTA Databases in Cross-Species Proteomics covers the upstream principle — why you have two separate species tables that need joining in the first place.

Step 1 — What Is "1:1 Ortholog" in BioMart

Ensembl Compara classifies cross-species gene relationships:

Type                         Meaning
ortholog_one2one             One gene in each species, single common ancestor
ortholog_one2many            One gene → multiple copies in the other species (lineage expansion)
ortholog_many2many           Multiple in both (gene family expansion in both lineages)
paralog                      Within-species duplication
apparent_ortholog_one2one    Special case, less stringent

For cross-species quantitative proteomics comparison, only ortholog_one2one is safe. The others require case-by-case handling.
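
Before filtering, it can help to tally how many rows fall into each relationship class. The sketch below assumes the web-UI column header `Mouse homology type`; the rows themselves are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical rows mimicking a BioMart export (placeholder gene IDs).
bm = pd.DataFrame({
    "Gene stable ID": ["g1", "g2", "g3", "g3", "g4"],
    "Mouse homology type": ["ortholog_one2one", "ortholog_one2one",
                            "ortholog_one2many", "ortholog_one2many", None],
})

# Count each class, including genes with no mouse homolog (missing value).
type_counts = bm["Mouse homology type"].value_counts(dropna=False)
print(type_counts)

# Strict subset used for quantitative comparison.
strict = bm[bm["Mouse homology type"] == "ortholog_one2one"]
```

Keeping the tally next to the filtered table makes it easy to state in Methods how many genes were excluded and why.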

Step 2 — Download via BioMart Web UI

The fastest path:

  1. Go to https://www.ensembl.org/biomart/martview
  2. CHOOSE DATABASE: Ensembl Genes 111 (or current release)
  3. CHOOSE DATASET: Pig genes (Sus scrofa)
  4. Filters: none
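  5. Attributes: switch the attribute page from Features to Homologues (BioMart does not let a single query mix attributes from different pages), then select:
    • GENE → Gene stable ID, Gene name
    • ORTHOLOGUES → expand "Mouse Orthologues":
      • Mouse gene stable ID
      • Mouse gene name
      • Mouse homology type (this is where you'll filter to ortholog_one2one)
      • % identity (optional, useful for QC)
    • UniProtKB/Swiss-Prot ID sits on the Features page; if you need it, export it in a separate query and join on Gene stable ID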
  6. Results → Compressed: No → Format: TSV → Go

Save as biomart_pig_mouse_r111.tsv. Expect ~25,000-30,000 rows (most pig genes have mouse counterparts, including some non-1:1).

Filter to 1:1 in your downstream code:

import pandas as pd

bm = pd.read_csv('biomart_pig_mouse_r111.tsv', sep='\t')
bm_one2one = bm[bm['Mouse homology type'] == 'ortholog_one2one'].copy()
print(f"1:1 orthologs: {len(bm_one2one):,}")  # typically ~13,000-14,000

For the Pig (Sus scrofa) ↔ Mouse (Mus musculus) pair at Release 111, you'll get roughly 13,476 1:1 pairs. Numbers shift slightly with each release.
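
If you exported the percent-identity column, a simple QC pass is to flag low-identity pairs rather than trust every 1:1 call equally. This is a sketch under assumptions: the column name follows the biomaRt attribute `mmusculus_homolog_perc_id` (the web UI header differs), the rows are hypothetical, and the 50% cutoff is an arbitrary illustration, not an Ensembl recommendation:

```python
import pandas as pd

# Hypothetical 1:1 rows; perc_id is the percent identity reported for the pair.
one2one = pd.DataFrame({
    "external_gene_name": ["COL1A1", "FN1", "GENE_X"],
    "mmusculus_homolog_perc_id": [91.2, 88.7, 34.5],
})

MIN_PERC_ID = 50.0  # arbitrary illustrative cutoff

# Flag rather than drop, so the decision stays visible in downstream tables.
one2one["low_identity"] = one2one["mmusculus_homolog_perc_id"] < MIN_PERC_ID
flagged = one2one.loc[one2one["low_identity"], "external_gene_name"].tolist()
print(flagged)
```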

Step 3 — Programmatic Download (R, biomaRt)

If you want the download reproducible in a script:

library(biomaRt)

# Pig (Sus scrofa) as the source
pig <- useEnsembl(biomart = "genes", dataset = "sscrofa_gene_ensembl", version = 111)

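# Pull orthologs to mouse. All attributes must come from the same BioMart
# attribute page (Homologues). UniProt accessions such as "uniprotswissprot"
# sit on the Features page and need a separate getBM() call joined on
# ensembl_gene_id.
orthos <- getBM(
  attributes = c(
    "ensembl_gene_id",
    "external_gene_name",
    "mmusculus_homolog_ensembl_gene",
    "mmusculus_homolog_associated_gene_name",
    "mmusculus_homolog_orthology_type",
    "mmusculus_homolog_perc_id"
  ),
  mart = pig
)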

# Filter to 1:1
one2one <- subset(orthos, mmusculus_homolog_orthology_type == "ortholog_one2one")
write.table(one2one, "biomart_pig_mouse_one2one_r111.tsv",
            sep = "\t", row.names = FALSE, quote = FALSE)

Caveats:

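  • Ensembl availability fluctuates and useEnsembl() sometimes times out. For the current release you can set host = "https://www.ensembl.org" or pick a mirror (mirror = "asia"). Archived releases requested with version = are served from the main site only, so biomaRt ignores mirror when version is set.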
  • For reproducibility, pin the version = argument. Don't rely on "current."
  • Record the release number and download date in your Methods.
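
One lightweight way to keep the release number and download date attached to the table is a small JSON sidecar file. The helper below is a hypothetical sketch (the function name and the field names are illustrative, not part of biomaRt or Ensembl tooling):

```python
import json
import tempfile
from datetime import date
from pathlib import Path

def write_provenance(table_path, release, dataset, out_dir="."):
    """Write a JSON sidecar describing where an ortholog table came from."""
    record = {
        "table": str(table_path),
        "ensembl_release": release,
        "dataset": dataset,
        "filter": "ortholog_one2one",
        "download_date": date.today().isoformat(),
    }
    out = Path(out_dir) / (Path(table_path).stem + ".provenance.json")
    out.write_text(json.dumps(record, indent=2))
    return out

# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    sidecar = write_provenance("biomart_pig_mouse_one2one_r111.tsv",
                               111, "sscrofa_gene_ensembl", out_dir=tmp)
    loaded = json.loads(sidecar.read_text())
print(loaded)
```

The same fields can then be copied into the Methods section verbatim.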

Step 4 — Programmatic Download (Python, pybiomart)

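from pybiomart import Server

# www.ensembl.org serves the current release; to pin Release 111, point host
# at the matching Ensembl archive site instead.
server = Server(host='https://www.ensembl.org')
mart = server['ENSEMBL_MART_ENSEMBL']
dataset = mart['sscrofa_gene_ensembl']

# use_attr_names=True keeps the internal attribute names as column headers,
# matching the biomaRt output that Step 5 reads. UniProt accessions sit on a
# different attribute page and need a separate query.
orthos = dataset.query(
    attributes=[
        'ensembl_gene_id',
        'external_gene_name',
        'mmusculus_homolog_ensembl_gene',
        'mmusculus_homolog_associated_gene_name',
        'mmusculus_homolog_orthology_type',
        'mmusculus_homolog_perc_id',
    ],
    use_attr_names=True,
)

one2one = orthos[orthos['mmusculus_homolog_orthology_type'] == 'ortholog_one2one']
one2one.to_csv('biomart_pig_mouse_one2one_r111.tsv', sep='\t', index=False)
print(f"Wrote {len(one2one):,} 1:1 orthologs")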

pybiomart works but is unmaintained as of 2026. Alternatives:

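  • gget (https://github.com/pachterlab/gget): modern and fast. gget.search() and gget.info() look up Ensembl genes and their metadata, which is handy for spot-checking individual IDs, but they do not return the Compara homology table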
  • Direct HTTP query to https://rest.ensembl.org/homology/... (full control, slower)

Step 5 — Join to Your Proteomics Quantification Tables

After species-separated MaxQuant/DIA-NN searches, you have:

  • porcine_pg_matrix.tsv (Porcine proteins, columns: Genes, sample_1, sample_2, ...)
  • mouse_pg_matrix.tsv (Mouse proteins, columns: Genes, sample_1, sample_2, ...)

Join via the BioMart table:

import pandas as pd

porcine = pd.read_csv('porcine_pg_matrix.tsv', sep='\t')
mouse = pd.read_csv('mouse_pg_matrix.tsv', sep='\t')
bm = pd.read_csv('biomart_pig_mouse_one2one_r111.tsv', sep='\t')

# Keep only essential ortholog mapping columns
bm_map = bm[['external_gene_name', 'mmusculus_homolog_associated_gene_name']].rename(
    columns={
        'external_gene_name': 'pig_gene',
        'mmusculus_homolog_associated_gene_name': 'mouse_gene',
    }
)
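# Drop rows without a gene name and collapse repeated pig/mouse pairs
bm_map = bm_map.dropna().drop_duplicates().reset_index(drop=True)

# Add an ortholog ID
bm_map['ortho_id'] = range(len(bm_map))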

# Map porcine proteins to ortholog IDs
porcine_mapped = porcine.merge(
    bm_map[['pig_gene', 'ortho_id']],
    left_on='Genes', right_on='pig_gene', how='inner'
)

# Map mouse proteins to ortholog IDs
mouse_mapped = mouse.merge(
    bm_map[['mouse_gene', 'ortho_id']],
    left_on='Genes', right_on='mouse_gene', how='inner'
)

# Inner-join on ortho_id — now porcine and mouse rows are aligned by orthology
combined = porcine_mapped.merge(
    mouse_mapped, on='ortho_id', suffixes=('_pig', '_mouse')
)

print(f"Porcine proteins quantified: {len(porcine):,}")
print(f"Mouse proteins quantified: {len(mouse):,}")
print(f"After 1:1 ortholog join: {len(combined):,} ortholog pairs")

What to expect (realistic numbers from PRIDE PXD023694-style data)

  • Porcine intestine ECM after QC: ~1,000 proteins
  • Mouse Matrigel after QC: ~700 proteins
  • 1:1 ortholog pairs after join: ~280-300

Why so few? Two compounding effects:

  1. Not every protein is detected in both species (biological + technical reasons)
  2. Some proteins lack 1:1 orthologs (paralog expansions, lineage-specific genes)

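This is normal. Relative to the ~1,000 porcine proteins, that is a mapping rate of roughly 28-30% (about 40% relative to the ~700 mouse proteins), which is realistic for ECM proteomics with current Ensembl annotation.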
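
The mapping rate depends on which species you use as the denominator. A short worked calculation with the approximate counts quoted above (the helper name `mapping_rate` is just for illustration):

```python
def mapping_rate(n_pairs, n_quantified):
    """Fraction of one species' quantified proteins that survive the 1:1 join."""
    return n_pairs / n_quantified

# Approximate counts from the example: porcine ~1,000, mouse ~700, pairs ~280-300.
for n_pairs in (280, 300):
    pig_rate = mapping_rate(n_pairs, 1000)
    mouse_rate = mapping_rate(n_pairs, 700)
    print(f"{n_pairs} pairs: {pig_rate:.0%} of porcine, {mouse_rate:.0%} of mouse")
```

Reporting both rates avoids ambiguity when readers compare your numbers with their own.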

Step 6 — Validate Your Mapping

After joining, sanity-check before running statistics:

Check 1 — known orthologs are present

known_orthos = [
    ('COL1A1', 'Col1a1'),  # collagen I alpha 1
    ('LAMB1',  'Lamb1'),   # laminin beta 1
    ('GAPDH',  'Gapdh'),   # housekeeping
    ('ACTB',   'Actb'),    # actin
    ('FN1',    'Fn1'),     # fibronectin
]
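# Loop variables are named *_sym so they don't overwrite the mouse DataFrame from Step 5
for pig_sym, mouse_sym in known_orthos:
    found = ((bm_map['pig_gene'] == pig_sym) & (bm_map['mouse_gene'] == mouse_sym)).any()
    print(f"  {pig_sym} ↔ {mouse_sym}: {'✓' if found else '✗ MISSING'}")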

If any of these are missing, recheck your BioMart download (wrong filter? wrong release? incomplete columns?).

Check 2 — duplicate handling

A "1:1" ortholog table should have at most one row per pig gene and per mouse gene. Verify:

print("Pig genes appearing >1x:", (bm_map['pig_gene'].value_counts() > 1).sum())
print("Mouse genes appearing >1x:", (bm_map['mouse_gene'].value_counts() > 1).sum())
# Both should be 0 for true 1:1

If non-zero, your filter let some non-1:1 entries through.
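
When the counts are non-zero, listing the offending rows usually shows the cause quickly. A minimal sketch, assuming the `pig_gene` / `mouse_gene` column names from Step 5 and hypothetical gene names:

```python
import pandas as pd

# Hypothetical mapping in which one mouse gene appears twice.
bm_map = pd.DataFrame({
    "pig_gene": ["COL1A1", "GENE_A", "GENE_B"],
    "mouse_gene": ["Col1a1", "Gene_m", "Gene_m"],
})

# keep=False marks every member of a duplicated group, not just the repeats.
dup_mask = (bm_map.duplicated("pig_gene", keep=False)
            | bm_map.duplicated("mouse_gene", keep=False))
offenders = bm_map[dup_mask]
clean_map = bm_map[~dup_mask]
print(offenders)
```

Whether to drop the offenders or resolve them by hand is a judgment call; the sketch only separates them from the clean pairs.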

Check 3 — protein groups with multiple gene names

MaxQuant/DIA-NN sometimes outputs Genes like COL1A1;COL1A2 (protein group has multiple inferred genes). Decide upfront:

  • Split and map each → conservative
  • Use only the first → simpler, slightly lossy
  • Drop multi-gene protein groups → cleanest but loses data

For ECM proteomics with collagens and laminins, multi-gene groups are common. Often the simplest defensible choice is to drop them in the cross-species comparison and report it as a limitation.
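
The three options can be sketched in pandas as follows, assuming a semicolon-delimited `Genes` column as described above (the intensities are hypothetical):

```python
import pandas as pd

# Hypothetical protein-group table with a semicolon-delimited Genes column.
pg = pd.DataFrame({
    "Genes": ["COL1A1;COL1A2", "FN1", "LAMB1;LAMB2"],
    "sample_1": [100.0, 50.0, 20.0],
})
is_multi = pg["Genes"].str.contains(";", regex=False)

# Option 1: split and map each gene (one row per gene, intensity repeated).
split_each = pg.assign(Genes=pg["Genes"].str.split(";")).explode("Genes")

# Option 2: keep only the first gene of each group.
first_only = pg.assign(Genes=pg["Genes"].str.split(";").str[0])

# Option 3: drop multi-gene groups entirely.
dropped = pg[~is_multi]

print(len(split_each), len(first_only), len(dropped))
```

Note that option 1 repeats the group intensity on every split row, so summing intensities afterwards would double-count.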

Common Pitfalls

"I'm using the wrong species pair"

BioMart asks for "the dataset" (e.g., Sus scrofa) and the homologue species attributes (Mouse Orthologues). The attribute prefix mmusculus_homolog_* is what specifies mouse. Make sure you're not pulling hsapiens_homolog_* (human) by accident — easy mistake when copy-pasting query code.

"Old Ensembl release"

useEnsembl(version = 90) from a tutorial pulls 2017 annotations. Check which release is current on the Ensembl website before you start (Release 111, used in this guide, dates from January 2024). Use the current release unless you have a specific reproducibility reason to pin an older one, and if you pin, document it.

"Symbol case mismatch in your data"

Mouse genes are conventionally lowercase except first letter (Col1a1). Human and pig conventionally uppercase (COL1A1). BioMart returns them in their native conventions. When joining to your proteomics data, check which case convention MaxQuant used and either match exactly or apply .str.upper() on both sides consistently.
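
One way to apply the same normalization on both sides is a dedicated join key, leaving the original symbols untouched. A minimal sketch with hypothetical rows; `norm_symbol` is an illustrative helper, not a library function:

```python
import pandas as pd

# Hypothetical mapping and mouse table with inconsistent casing and whitespace.
bm_map = pd.DataFrame({"mouse_gene": ["Col1a1", "Fn1"], "ortho_id": [0, 1]})
mouse = pd.DataFrame({"Genes": ["COL1A1", "Fn1 "], "sample_1": [9.0, 7.5]})

def norm_symbol(s):
    """Uppercase and strip whitespace so both sides follow one convention."""
    return s.str.strip().str.upper()

bm_map["key"] = norm_symbol(bm_map["mouse_gene"])
mouse["key"] = norm_symbol(mouse["Genes"])

# validate= raises an error if the key is duplicated on either side.
mouse_mapped = mouse.merge(bm_map[["key", "ortho_id"]], on="key",
                           how="inner", validate="one_to_one")
print(mouse_mapped[["Genes", "ortho_id"]])
```

The `validate="one_to_one"` argument is a cheap guard: if uppercasing makes two distinct symbols collide, the merge fails loudly instead of duplicating rows.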

"Some highly conserved proteins missing"

Truly identical sequences across species sometimes lack a "1:1 ortholog" assignment due to annotation gaps. Workaround: cross-reference with OrthoDB or InParanoid if a critical protein is missing from BioMart.

"ortholog_one2one is too strict"

For some studies you may want to include apparent_ortholog_one2one or even ortholog_one2many (collapsing the many-side via summation). Decide based on biological question; default to strict 1:1 unless you have a reason.
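
Collapsing the many-side by summation could look like the sketch below. The gene names and intensities are hypothetical, and summing assumes linear (not log-transformed) intensities:

```python
import pandas as pd

# Hypothetical case: one pig gene with two mouse co-orthologs (ortholog_one2many).
pairs = pd.DataFrame({
    "pig_gene": ["GENE_P", "GENE_P", "COL1A1"],
    "mouse_gene": ["Gene_m1", "Gene_m2", "Col1a1"],
})
mouse = pd.DataFrame({
    "Genes": ["Gene_m1", "Gene_m2", "Col1a1"],
    "sample_1": [30.0, 12.0, 90.0],
})

# Attach the pig gene to each mouse row, then sum the many-side per pig gene.
collapsed = (
    mouse.merge(pairs, left_on="Genes", right_on="mouse_gene")
         .groupby("pig_gene", as_index=False)["sample_1"].sum()
)
print(collapsed)
```

Whether summation is biologically meaningful depends on the gene family, so treat it as one possible convention and state it in Methods.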

Alternatives to BioMart

For most cross-species proteomics, BioMart is the right default. Alternatives worth knowing include OrthoDB and InParanoid, both mentioned above. Use them when BioMart is missing your species pair or when you need more conservative orthology.

FAQ

Q: Can I use gene-symbol matching for a quick exploration? Yes, for exploration. But once you commit to publishing, switch to BioMart 1:1 — reviewers in cross-species fields increasingly expect it explicitly mentioned in Methods.

Q: What if my data uses Ensembl gene IDs, not gene names? Even better: Ensembl IDs are stable and unambiguous. Join via ensembl_gene_id ↔ mmusculus_homolog_ensembl_gene and skip the gene name layer entirely.

Q: Does this work for protein IDs (UniProt) instead of gene IDs? Indirectly. BioMart can provide UniProt IDs as an attribute (uniprotswissprot), and you can join via those. Caveat: some genes map to more than one UniProt accession (for example, isoform entries), so an intermediate gene-level mapping is usually cleaner.

Q: How often should I re-download the ortholog table? Ensembl releases ~4× per year. For active long-term projects, re-pull each release and re-validate key orthologs. For a single paper, pin the release at start of analysis and stick with it through publication.

Q: My biomaRt R call is failing with TLS errors. This is a common issue with corporate proxies and Ensembl mirror outages. Try httr::set_config(httr::config(ssl_verifypeer = 0L)) for testing only (it disables certificate verification), or use a different mirror via mirror = "asia". For a one-off download, the web UI is the simplest route; keep the scripted download for when you need reproducibility.

Q: Does this matter for closely related strains within a species? Generally no. Two C57BL/6 mouse substrains use the same annotation — no ortholog mapping needed. Cross-strain or cross-population studies use SNP-based approaches, not orthology.
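
The Ensembl-ID join mentioned in the FAQ can be sketched as below. The IDs are hypothetical placeholders (real ones look like ENSSSCG... for pig and ENSMUSG... for mouse), and the `gene_id` column name in the proteomics tables is an assumption:

```python
import pandas as pd

# Hypothetical ortholog table keyed on Ensembl gene IDs (placeholder values).
bm = pd.DataFrame({
    "ensembl_gene_id": ["PIG_G1", "PIG_G2"],
    "mmusculus_homolog_ensembl_gene": ["MOUSE_G1", "MOUSE_G2"],
})
porcine = pd.DataFrame({"gene_id": ["PIG_G1", "PIG_G2"], "sample_1": [10.0, 8.0]})
mouse = pd.DataFrame({"gene_id": ["MOUSE_G2", "MOUSE_G9"], "sample_1": [7.5, 3.0]})

# Pig table -> ortholog table -> mouse table, all on stable IDs.
combined = (
    porcine.merge(bm, left_on="gene_id", right_on="ensembl_gene_id")
           .merge(mouse, left_on="mmusculus_homolog_ensembl_gene",
                  right_on="gene_id", suffixes=("_pig", "_mouse"))
)
print(combined[["gene_id_pig", "gene_id_mouse", "sample_1_pig", "sample_1_mouse"]])
```

Only the pair quantified in both species survives the two inner merges, which mirrors the symbol-based join in Step 5 without any case handling.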

Closing — The Workflow in One Block

1. Run species-separated proteomics searches (MaxQuant/DIA-NN)
2. Download Ensembl BioMart 1:1 ortholog table (pin release)
3. Filter to ortholog_one2one
4. Sanity-check known orthologs are present
5. Inner-join per-species protein tables via ortholog ID
6. Run cross-species statistics on the joined table
7. Report: BioMart release number + download date + 1:1 filter in Methods

Used this way, cross-species ortholog mapping becomes a transparent, reproducible step instead of a hidden source of errors. The gene-symbol shortcut is fine for exploration; BioMart 1:1 is what survives peer review.


References:

  • Smedley, D. et al. (2015). The BioMart community portal. Nucleic Acids Research, 43, W589-W598.
  • Cunningham, F. et al. (2024). Ensembl 2025. Nucleic Acids Research (current annual release).
  • Sonnhammer, E. L. et al. (2015). InParanoid 8. Nucleic Acids Research, 43, D234-D239.
  • Kuzniar, A. et al. (2008). The quest for orthologs. Trends in Genetics, 24, 539-551.
