GO and Pathway Enrichment Analysis: A Complete Practical Guide (clusterProfiler, fgsea, MSigDB)
Comprehensive guide to Gene Ontology and Pathway enrichment analysis. Compare clusterProfiler, topGO, g:Profiler, fgsea, and learn to avoid the background gene set trap that invalidates most analyses. Includes ORA vs GSEA decision framework.
📚 Series Navigation · This is Part 2 of the post-DEG functional analysis series. Part 1 — PPI + Hub Analysis · Part 2 — GO + Pathway Enrichment (current) · Part 3 — TF Activity + Biomarker Integration (coming)
After identifying Hub proteins from a PPI network, the next obvious question is: "What are these genes actually doing biologically?" GO enrichment and Pathway enrichment are the two standard tools that answer this. Yet despite decades of use, most published enrichment analyses contain at least one critical methodological error — usually invisible until a reviewer asks pointed questions. This guide focuses on the practical decisions and statistical pitfalls that separate publication-quality enrichment analyses from misleading ones.
TL;DR
- The background gene set (universe) is the single most consequential decision — get it wrong and 90% of your "significant" results are spurious.
- ORA (Over-Representation Analysis) is fast and intuitive when you have a defined gene list with a clear cutoff; GSEA is more powerful when you have a continuous ranking and want to detect subtle, coordinated changes.
- GO redundancy is real and unavoidable — use
clusterProfiler::simplify()or topGO's elim algorithm to collapse semantically similar terms. - Always report database versions, multiple testing correction method, background definition, and cutoffs. Reviewers will ask.
Why GO and Pathway Enrichment Matter
A list of differentially expressed genes is just a list. It doesn't tell you which biological functions are perturbed, which signaling pathways are activated, or which cellular compartments are affected. Functional enrichment analysis transforms a gene list into biological narrative.
The two main analytical frameworks address slightly different questions:
- GO Enrichment: Which biological processes, molecular functions, or cellular components are over-represented in my gene list compared to the genome background?
- Pathway Enrichment: Which curated signaling cascades or metabolic pathways are over-represented?
Both rest on the same statistical foundation (testing whether observed overlap with a predefined gene set exceeds what's expected by chance), but they use different reference annotations and yield different types of biological insight.
Gene Ontology: Structure and Statistics
The DAG Structure
Gene Ontology is not a tree. It's a Directed Acyclic Graph (DAG) — meaning a single GO term can have multiple parent terms. This has practical consequences:
GO:0000278 (mitotic cell cycle)
is_a → GO:0007049 (cell cycle)
is_a → GO:0009987 (cellular process)
is_a → GO:0008150 (biological_process)
When a gene is annotated to a specific term, it is implicitly annotated to all ancestor terms via "true path rule." This means you may see overlap between a parent term ("cell cycle") and its child ("mitotic cell cycle") in your results. Both can be statistically significant from the same gene set.
The Three GO Namespaces
GO terms are partitioned into three orthogonal hierarchies:
- BP (Biological Process) — Multi-step processes: mitotic cell cycle, apoptosis, immune response, blood vessel formation
- MF (Molecular Function) — Molecular activities of single gene products: DNA binding, kinase activity, transporter activity
- CC (Cellular Component) — Subcellular locations: nucleus, mitochondrion, ribosome, plasma membrane
For most functional analyses, BP is the most informative, followed by MF. CC is useful for confirming subcellular localization patterns or identifying compartment-specific responses.
Evidence Codes: Not All Annotations Are Equal
Each gene-to-GO-term annotation comes with an evidence code indicating how it was assigned. Major categories:
| Code | Meaning | Reliability |
|---|---|---|
| EXP, IDA, IMP, IGI, IPI, IEP | Experimental evidence | ⭐⭐⭐ Highest |
| TAS, IC | Curator-assigned, traceable | ⭐⭐ High |
| ISS, ISO, ISA, ISM | Sequence similarity to annotated proteins | ⭐⭐ Medium |
| IEA | Automated annotation (algorithmic) | ⭐ Caution — most annotations fall here |
Practical filtering: For conservative analyses (especially in regulated/clinical contexts), exclude IEA. For maximum coverage in exploratory analyses, include it. Document your choice.
When using org.Hs.eg.db, querying GOALL automatically includes ancestor terms — convenient but watch for double-counting in downstream analyses.
GO Enrichment Tools Compared
| Tool | Interface | Key Strengths | When to Use |
|---|---|---|---|
| clusterProfiler | R | De facto standard, ORA + GSEA, rich visualization | Default choice for most analyses |
| topGO | R | DAG-aware elim/weight algorithms, automatic redundancy reduction | When term redundancy is a major concern |
| g:Profiler / gprofiler2 | R / Web / Python | Most current databases, easy custom backgrounds, multi-query | Recent gene sets, comparative analyses |
| DAVID | Web | Classic, easy to use, broad tool integration | Quick preliminary scans (caveat: update lag) |
| Enrichr | Web / R | 200+ gene set libraries beyond GO | Exploratory analyses, transcription factor enrichment |
| GOseq | R | RNA-seq length bias correction | Specifically for RNA-seq gene lists |
Complete clusterProfiler ORA Workflow
library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)
# Convert your DEG symbols to Entrez IDs
deg_entrez <- bitr(
deg_symbols,
fromType = "SYMBOL",
toType = "ENTREZID",
OrgDb = org.Hs.eg.db
)$ENTREZID
# Universe = genes detected in your experiment (CRITICAL)
universe_entrez <- bitr(
detected_genes, # NOT all human genes!
fromType = "SYMBOL",
toType = "ENTREZID",
OrgDb = org.Hs.eg.db
)$ENTREZID
# Run GO BP enrichment
ego <- enrichGO(
gene = deg_entrez,
universe = universe_entrez,
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.10,
readable = TRUE
)
# Reduce redundancy (semantic similarity-based clustering)
ego_simp <- simplify(
ego,
cutoff = 0.7,
by = "p.adjust",
select_fun = min
)
# Visualization options
dotplot(ego_simp, showCategory = 20)
barplot(ego_simp, showCategory = 20)
emapplot(pairwise_termsim(ego_simp), showCategory = 30)
cnetplot(
ego_simp,
categorySize = "pvalue",
foldChange = log2fc_named_vector
)
heatplot(ego_simp, foldChange = log2fc_named_vector)
treeplot(pairwise_termsim(ego_simp))
# Save results
write.csv(as.data.frame(ego_simp), "go_bp_results.csv", row.names = FALSE)
The Background Gene Set Trap
This deserves its own section because it's the single most common methodological error in published enrichment analyses.
Why Background Matters
Enrichment is fundamentally a question of proportion: "Is the proportion of my gene list belonging to category X higher than expected?" The expectation is computed from your background (universe). If your background is wrong, your null hypothesis is wrong.
The Wrong Way (Don't Do This)
# WRONG: Using the entire human genome as universe
ego <- enrichGO(
gene = deg_entrez,
universe = NULL, # Defaults to all annotated genes — bad!
OrgDb = org.Hs.eg.db,
...
)
The problem: If your RNA-seq experiment only detected 12,000 of 20,000 human genes, including the undetected 8,000 in the background creates a biased reference. The undetected genes typically include rarely expressed genes (olfactory receptors, testis-specific genes, etc.), which inflates the apparent enrichment of commonly expressed pathways.
Real-world consequence: Analyses with this error consistently show "immune response" or "metabolic process" enrichment regardless of actual experimental context, because those categories happen to overlap with commonly-expressed genes.
The Right Way
# CORRECT: Use only detected genes as universe
detected_genes <- rownames(rnaseq_counts)[
rowSums(rnaseq_counts >= 10) >= 3
]
universe_entrez <- bitr(
detected_genes,
fromType = "SYMBOL",
toType = "ENTREZID",
OrgDb = org.Hs.eg.db
)$ENTREZID
ego <- enrichGO(
gene = deg_entrez,
universe = universe_entrez, # Properly scoped
OrgDb = org.Hs.eg.db,
ont = "BP",
...
)
For proteomics, the universe is proteins detected in your MS experiment. For ChIP-seq peak-to-gene analyses, it's genes within reachable distance of your peak set. The principle is the same: the background should match what was actually testable in your experiment.
Pathway Database Comparison
| Database | Coverage | License | Strengths |
|---|---|---|---|
| KEGG | ~550 pathways | Free for academic, paid for commercial | Classic, well-known, includes metabolism |
| Reactome | ~2,500 pathways | Fully open | Hierarchical, reaction-level detail, modern |
| WikiPathways | ~1,300 pathways | Open, community-contributed | Disease-specific pathways, regularly updated |
| MSigDB | ~35,000 gene sets | Free for academic | Hallmark, C2 (curated), C5 (GO), C7 (immune), C8 (cell signature) |
| BioCarta, PID | Subsumed in MSigDB C2 | - | Legacy reference |
For most analyses, Reactome + MSigDB Hallmark provides excellent coverage and interpretability. Add KEGG if metabolic pathway focus is needed.
ORA vs GSEA: When to Use Which
This is the second-most-common confusion in enrichment analysis. Both methods test enrichment, but they answer different questions.
ORA (Over-Representation Analysis)
Statistical test: Fisher's exact test or hypergeometric distribution
Input: A defined gene list (e.g., DEGs at FDR < 0.05) + background
Question: "Are members of pathway X over-represented in my gene list compared to expectation?"
Strengths:
- Conceptually simple
- Computationally fast
- Easy to communicate
Weaknesses:
- Sensitive to your DEG cutoff choice (one gene above/below threshold can change results)
- Loses information about the magnitude and direction of expression change
- Cannot detect coordinated subtle changes
GSEA (Gene Set Enrichment Analysis)
Statistical test: Kolmogorov-Smirnov-like enrichment score, with permutation-based significance
Input: All genes ranked by some statistic (e.g., signed -log10(p) × log2FC)
Question: "Are pathway X members enriched at the top or bottom of my ranked gene list?"
Strengths:
- No arbitrary cutoff
- Captures coordinated subtle changes
- Provides directional information (up vs down regulation)
- Identifies "leading edge" — the actual genes driving the enrichment signal
Weaknesses:
- More parameters to choose (weighted vs classic, permutation type)
- Slower, especially with many gene sets
- Harder to communicate to non-specialists
Practical Decision Framework
Use ORA when:
- You have a small, well-defined gene list (< 500 genes)
- Your analysis includes a clear cutoff (e.g., FDR < 0.05)
- You need quick exploratory results
Use GSEA when:
- You have a continuous ranking (DESeq2 stat, limma t-stat)
- Effects are subtle but coordinated (e.g., aging, mild perturbations)
- You want direction-aware results (up- vs down-regulation)
- Reviewers explicitly ask for it (increasingly common)
Best practice: Run both. ORA for the headline result, GSEA for the deeper analysis with leading edge identification.
Complete fgsea Workflow
library(fgsea)
library(msigdbr)
library(dplyr)
library(ggplot2)
# Get MSigDB Hallmark gene sets for human
m_df <- msigdbr(species = "Homo sapiens", category = "H")
hallmark_pathways <- split(m_df$gene_symbol, m_df$gs_name)
# Construct ranked gene list (signed -log10p × log2FC)
deg_results$rank <- sign(deg_results$log2FoldChange) *
-log10(deg_results$pvalue + 1e-300)
ranks <- setNames(deg_results$rank, deg_results$gene_symbol)
ranks <- sort(ranks, decreasing = TRUE)
# Run fgsea (multi-threaded, fast)
fgsea_res <- fgsea(
pathways = hallmark_pathways,
stats = ranks,
minSize = 15,
maxSize = 500,
eps = 0
)
# Examine top pathways
fgsea_res %>%
arrange(padj) %>%
head(20) %>%
select(pathway, NES, padj, leadingEdge)
# Visualize a single pathway's running enrichment score
plotEnrichment(
hallmark_pathways[["HALLMARK_E2F_TARGETS"]],
ranks
) + ggtitle("E2F Targets — Cell Cycle Regulation")
# Multi-pathway summary plot
top_up <- fgsea_res[ES > 0][order(padj)][1:10, pathway]
top_down <- fgsea_res[ES < 0][order(padj)][1:10, pathway]
top_all <- c(top_up, top_down)
plotGseaTable(
hallmark_pathways[top_all],
ranks,
fgsea_res,
gseaParam = 0.5
)
Pitfalls in Pathway Result Interpretation
Beyond background selection, several other issues commonly trip up enrichment analyses.
Pathway Size Bias
Larger pathways (e.g., "metabolism," "signal transduction") are easier to enrich simply because they contain more genes. When reading results, always look at NES alongside gene set size — a pathway with NES = 1.8 and 30 genes is more impressive than NES = 2.0 and 500 genes.
Redundancy Between Parent and Child Pathways
In hierarchical databases like Reactome, both "Cell Cycle" and "Mitotic Cell Cycle" frequently appear in top results. They're not independent findings. Use:
clusterProfiler::simplify()for GO terms- Manual review of leading edge gene overlap for pathway results
- The
enrichplot::emapplotto visualize redundancy clusters
Direction Ambiguity in ORA
ORA doesn't distinguish between up- and down-regulated genes — it lumps them together. Run separate ORAs for up and down DEGs when direction matters (almost always). GSEA naturally provides directional NES values.
Gene-Level Redundancy
A single gene like TP53 belongs to dozens of pathways. When the same pathway always shows up, check whether it's actually the same set of leading-edge genes driving multiple results. If yes, treat them as one finding.
Tissue Context Mismatches
If your analysis returns "neurogenesis" enrichment from a blood sample, something is wrong. Common causes:
- Wrong background (universe contains tissue-irrelevant genes)
- Misclassified gene set
- Genuine tissue-specific function of a generally-expressed gene
Always sanity-check enrichment hits against your sample biology.
Topology-Aware Pathway Analysis
Standard ORA/GSEA treats pathways as flat gene lists, ignoring the actual biological topology (gene regulatory relationships, signaling direction). For more sophisticated analyses:
- SPIA (Signaling Pathway Impact Analysis): Uses KEGG pathway topology, computes signed impact factors
- PathwayPCA, CePa: Account for gene-gene dependencies within pathways
- PROGENy: Infers signaling pathway activity from downstream target gene expression patterns (14 major pathways)
- DoRothEA + viper: Transcription factor activity inference (covered in Part 3 of this series)
PROGENy is particularly valuable as a quick second-opinion: it produces a single activity score per pathway per sample, which can be visualized as a heatmap across conditions and immediately reveals which signaling pathways differ between groups.
Recommended Analysis Workflow
For a standard RNA-seq differential expression analysis:
- First pass — clusterProfiler ORA: KEGG + Reactome on DEG list with proper background → quick "big picture"
- Second pass — fgsea GSEA: MSigDB Hallmark + C2 (curated) on ranked list → subtle signal detection with directional NES
- Third pass — PROGENy: Sample-level signaling activity for heatmap visualization
- Fourth pass — cross-reference with PPI Hubs (from Part 1): Identify Hub proteins that appear in leading edges → high-priority candidates
- Optional — SPIA: When pathway topology matters (e.g., signaling cascade activation/inhibition direction)
Reporting Standards for Publication
A reviewer-proof enrichment analysis report should include:
- Database versions: e.g., "MSigDB v2023.2.Hs", "GO release 2024-01-17", "KEGG v110"
- Tool versions: clusterProfiler 4.10, fgsea 1.28
- Background definition: How you defined the universe (with citation/justification)
- Multiple testing correction: BH, Bonferroni, or FDR — and why
- Cutoffs: Both p-value/FDR and effect size (NES) thresholds
- Leading edge tables: Supplementary table with the actual genes driving each enrichment
- Session info: R sessionInfo() output for full reproducibility
Following these standards reduces reviewer pushback significantly and makes your analysis easier to reproduce or extend later.
What's Next
You now have a list of Hub proteins (Part 1) and a ranked list of biological processes / pathways (Part 2). The third pillar of post-DEG functional analysis asks a deeper question: "Which transcription factors are driving these expression changes?"
➡️ Coming next: TF Activity Inference and Biomarker Integration — how DoRothEA, viper, SCENIC, and ChEA3 reveal the regulatory programs behind your DEGs, and how to integrate Hub + Pathway + TF results into validated biomarker candidates.
Further Reading
- Subramanian, A. et al. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102(43).
- Khatri, P., Sirota, M. & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2).
- Wu, T. et al. (2021). clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation, 2(3).
- Korotkevich, G. et al. (2021). Fast gene set enrichment analysis. bioRxiv. (fgsea methods paper)
- Liberzon, A. et al. (2015). The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Systems, 1(6).