Bioinformatics

GO and Pathway Enrichment Analysis: A Complete Practical Guide (clusterProfiler, fgsea, MSigDB)

Comprehensive guide to Gene Ontology and Pathway enrichment analysis. Compare clusterProfiler, topGO, g:Profiler, fgsea, and learn to avoid the background gene set trap that invalidates most analyses. Includes ORA vs GSEA decision framework.

13 min read
Tags: GO · Gene Ontology · Pathway Enrichment · GSEA · ORA · clusterProfiler · fgsea · MSigDB · Reactome · KEGG · bioinformatics

📚 Series Navigation · This is Part 2 of the post-DEG functional analysis series. Part 1 — PPI + Hub Analysis · Part 2 — GO + Pathway Enrichment (current) · Part 3 — TF Activity + Biomarker Integration (coming)

After identifying Hub proteins from a PPI network, the next obvious question is: "What are these genes actually doing biologically?" GO enrichment and Pathway enrichment are the two standard tools that answer this. Yet despite decades of use, most published enrichment analyses contain at least one critical methodological error — usually invisible until a reviewer asks pointed questions. This guide focuses on the practical decisions and statistical pitfalls that separate publication-quality enrichment analyses from misleading ones.

TL;DR

  • The background gene set (universe) is the single most consequential decision — get it wrong and many of your "significant" results will be spurious.
  • ORA (Over-Representation Analysis) is fast and intuitive when you have a defined gene list with a clear cutoff; GSEA is more powerful when you have a continuous ranking and want to detect subtle, coordinated changes.
  • GO redundancy is real and unavoidable — use clusterProfiler::simplify() or topGO's elim algorithm to collapse semantically similar terms.
  • Always report database versions, multiple testing correction method, background definition, and cutoffs. Reviewers will ask.

Why GO and Pathway Enrichment Matter

A list of differentially expressed genes is just a list. It doesn't tell you which biological functions are perturbed, which signaling pathways are activated, or which cellular compartments are affected. Functional enrichment analysis transforms a gene list into a biological narrative.

The two main analytical frameworks address slightly different questions:

  • GO Enrichment: Which biological processes, molecular functions, or cellular components are over-represented in my gene list compared to the genome background?
  • Pathway Enrichment: Which curated signaling cascades or metabolic pathways are over-represented?

Both rest on the same statistical foundation (testing whether observed overlap with a predefined gene set exceeds what's expected by chance), but they use different reference annotations and yield different types of biological insight.
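That shared foundation is the hypergeometric test (equivalently, a one-sided Fisher's exact test). A minimal base-R sketch with made-up counts (200 DEGs from a 12,000-gene universe, a 400-gene category, 15 genes overlapping):

```r
# Toy counts (hypothetical, for illustration only)
N <- 12000  # universe: genes detected in the experiment
K <- 400    # genes in the category (within the universe)
n <- 200    # size of the DEG list
k <- 15     # DEGs that fall in the category

# P(overlap >= k) under the hypergeometric null
p_hyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# The same one-sided p-value from Fisher's exact test on the 2x2 table
tab <- matrix(c(k, K - k, n - k, N - K - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

all.equal(p_hyper, p_fisher)  # the two formulations agree
```

The expected overlap here is 200 × 400 / 12000 ≈ 6.7 genes, so 15 is a clear excess; every ORA tool discussed below runs some version of this calculation.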

Gene Ontology: Structure and Statistics

The DAG Structure

Gene Ontology is not a tree. It's a Directed Acyclic Graph (DAG) — meaning a single GO term can have multiple parent terms. This has practical consequences:

GO:0000278 (mitotic cell cycle)
  is_a → GO:0007049 (cell cycle)
    is_a → GO:0009987 (cellular process)
      is_a → GO:0008150 (biological_process)

When a gene is annotated to a specific term, it is implicitly annotated to all ancestor terms via the "true path rule." This means you may see overlap between a parent term ("cell cycle") and its child ("mitotic cell cycle") in your results. Both can be statistically significant from the same gene set.
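The propagation implied by the true path rule can be sketched in a few lines of base R. This toy example hard-codes the is_a chain above (a real GO term can have several parents, so actual tooling walks a DAG, not a single chain):

```r
# Toy is_a parent map for the chain shown above
parents <- list(
  "GO:0000278" = "GO:0007049",  # mitotic cell cycle -> cell cycle
  "GO:0007049" = "GO:0009987",  # cell cycle -> cellular process
  "GO:0009987" = "GO:0008150"   # cellular process -> biological_process
)

# Propagate a direct annotation to every ancestor (true path rule)
annotate_with_ancestors <- function(term, parents) {
  out <- term
  while (!is.null(parents[[term]])) {
    term <- parents[[term]]
    out <- c(out, term)
  }
  out
}

annotate_with_ancestors("GO:0000278", parents)
# A gene annotated to "mitotic cell cycle" counts for all four terms, root included
```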

The Three GO Namespaces

GO terms are partitioned into three orthogonal hierarchies:

  • BP (Biological Process) — Multi-step processes: mitotic cell cycle, apoptosis, immune response, blood vessel formation
  • MF (Molecular Function) — Molecular activities of single gene products: DNA binding, kinase activity, transporter activity
  • CC (Cellular Component) — Subcellular locations: nucleus, mitochondrion, ribosome, plasma membrane

For most functional analyses, BP is the most informative, followed by MF. CC is useful for confirming subcellular localization patterns or identifying compartment-specific responses.

Evidence Codes: Not All Annotations Are Equal

Each gene-to-GO-term annotation comes with an evidence code indicating how it was assigned. Major categories:

| Code | Meaning | Reliability |
|---|---|---|
| EXP, IDA, IMP, IGI, IPI, IEP | Experimental evidence | ⭐⭐⭐ Highest |
| TAS, IC | Curator-assigned, traceable | ⭐⭐ High |
| ISS, ISO, ISA, ISM | Sequence similarity to annotated proteins | ⭐⭐ Medium |
| IEA | Automated annotation (algorithmic) | ⭐ Caution — most annotations fall here |

Practical filtering: For conservative analyses (especially in regulated/clinical contexts), exclude IEA. For maximum coverage in exploratory analyses, include it. Document your choice.

When using org.Hs.eg.db, querying GOALL automatically includes ancestor terms — convenient but watch for double-counting in downstream analyses.

GO Enrichment Tools Compared

| Tool | Interface | Key Strengths | When to Use |
|---|---|---|---|
| clusterProfiler | R | De facto standard, ORA + GSEA, rich visualization | Default choice for most analyses |
| topGO | R | DAG-aware elim/weight algorithms, automatic redundancy reduction | When term redundancy is a major concern |
| g:Profiler / gprofiler2 | R / Web / Python | Most current databases, easy custom backgrounds, multi-query | Recent gene sets, comparative analyses |
| DAVID | Web | Classic, easy to use, broad tool integration | Quick preliminary scans (caveat: update lag) |
| Enrichr | Web / R | 200+ gene set libraries beyond GO | Exploratory analyses, transcription factor enrichment |
| GOseq | R | RNA-seq length bias correction | Specifically for RNA-seq gene lists |

Complete clusterProfiler ORA Workflow

library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)

# Convert your DEG symbols to Entrez IDs
deg_entrez <- bitr(
  deg_symbols,
  fromType = "SYMBOL",
  toType   = "ENTREZID",
  OrgDb    = org.Hs.eg.db
)$ENTREZID

# Universe = genes detected in your experiment (CRITICAL)
universe_entrez <- bitr(
  detected_genes,  # NOT all human genes!
  fromType = "SYMBOL",
  toType   = "ENTREZID",
  OrgDb    = org.Hs.eg.db
)$ENTREZID

# Run GO BP enrichment
ego <- enrichGO(
  gene          = deg_entrez,
  universe      = universe_entrez,
  OrgDb         = org.Hs.eg.db,
  ont           = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff  = 0.05,
  qvalueCutoff  = 0.10,
  readable      = TRUE
)

# Reduce redundancy (semantic similarity-based clustering)
ego_simp <- simplify(
  ego,
  cutoff      = 0.7,
  by          = "p.adjust",
  select_fun  = min
)

# Visualization options
dotplot(ego_simp, showCategory = 20)
barplot(ego_simp, showCategory = 20)
emapplot(pairwise_termsim(ego_simp), showCategory = 30)
cnetplot(
  ego_simp,
  categorySize = "pvalue",
  foldChange   = log2fc_named_vector
)
heatplot(ego_simp, foldChange = log2fc_named_vector)
treeplot(pairwise_termsim(ego_simp))

# Save results
write.csv(as.data.frame(ego_simp), "go_bp_results.csv", row.names = FALSE)

The Background Gene Set Trap

This deserves its own section because it's the single most common methodological error in published enrichment analyses.

Why Background Matters

Enrichment is fundamentally a question of proportion: "Is the proportion of my gene list belonging to category X higher than expected?" The expectation is computed from your background (universe). If your background is wrong, your null hypothesis is wrong.

The Wrong Way (Don't Do This)

# WRONG: Using the entire human genome as universe
ego <- enrichGO(
  gene          = deg_entrez,
  universe      = NULL,  # Defaults to all annotated genes — bad!
  OrgDb         = org.Hs.eg.db,
  ...
)

The problem: If your RNA-seq experiment only detected 12,000 of 20,000 human genes, including the undetected 8,000 in the background creates a biased reference. The undetected genes typically include rarely expressed genes (olfactory receptors, testis-specific genes, etc.), which inflates the apparent enrichment of commonly expressed pathways.

Real-world consequence: Analyses with this error consistently show "immune response" or "metabolic process" enrichment regardless of actual experimental context, because those categories happen to overlap with commonly-expressed genes.
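The size of the distortion is visible from the hypergeometric numbers alone. Suppose a hypothetical 300-gene "immune response" set of which only 250 genes are expressed in your tissue, and your 200 DEGs overlap it by 8 genes:

```r
n_deg   <- 200  # DEG list size
overlap <- 8    # DEGs falling in the gene set

# Correct universe: 12,000 detected genes; 250 set members are detected
p_right <- phyper(overlap - 1, 250, 12000 - 250, n_deg, lower.tail = FALSE)

# Wrong universe: all ~20,000 genes; full 300-gene set
p_wrong <- phyper(overlap - 1, 300, 20000 - 300, n_deg, lower.tail = FALSE)

round(c(detected_background = p_right, genome_background = p_wrong), 4)
# Same 8-gene overlap, but the genome-wide background makes it look far more surprising
```

With the detected-gene background the expected overlap is about 4.2 genes and the result is marginal; against the whole genome the expectation drops to 3 and the same 8 genes clear conventional significance.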

The Right Way

# CORRECT: Use only detected genes as universe
detected_genes <- rownames(rnaseq_counts)[
  rowSums(rnaseq_counts >= 10) >= 3
]

universe_entrez <- bitr(
  detected_genes,
  fromType = "SYMBOL",
  toType   = "ENTREZID",
  OrgDb    = org.Hs.eg.db
)$ENTREZID

ego <- enrichGO(
  gene          = deg_entrez,
  universe      = universe_entrez,  # Properly scoped
  OrgDb         = org.Hs.eg.db,
  ont           = "BP",
  ...
)

For proteomics, the universe is proteins detected in your MS experiment. For ChIP-seq peak-to-gene analyses, it's genes within reachable distance of your peak set. The principle is the same: the background should match what was actually testable in your experiment.

Pathway Database Comparison

| Database | Coverage | License | Strengths |
|---|---|---|---|
| KEGG | ~550 pathways | Free for academic, paid for commercial | Classic, well-known, includes metabolism |
| Reactome | ~2,500 pathways | Fully open | Hierarchical, reaction-level detail, modern |
| WikiPathways | ~1,300 pathways | Open, community-contributed | Disease-specific pathways, regularly updated |
| MSigDB | ~35,000 gene sets | Free for academic | Hallmark, C2 (curated), C5 (GO), C7 (immune), C8 (cell type signatures) |
| BioCarta, PID | Subsumed in MSigDB C2 | - | Legacy reference |

For most analyses, Reactome + MSigDB Hallmark provides excellent coverage and interpretability. Add KEGG if metabolic pathway focus is needed.

ORA vs GSEA: When to Use Which

This is the second-most-common confusion in enrichment analysis. Both methods test enrichment, but they answer different questions.

ORA (Over-Representation Analysis)

Statistical test: Fisher's exact test or hypergeometric distribution

Input: A defined gene list (e.g., DEGs at FDR < 0.05) + background

Question: "Are members of pathway X over-represented in my gene list compared to expectation?"

Strengths:

  • Conceptually simple
  • Computationally fast
  • Easy to communicate

Weaknesses:

  • Sensitive to your DEG cutoff choice (one gene above/below threshold can change results)
  • Loses information about the magnitude and direction of expression change
  • Cannot detect coordinated subtle changes

GSEA (Gene Set Enrichment Analysis)

Statistical test: Kolmogorov-Smirnov-like enrichment score, with permutation-based significance

Input: All genes ranked by some statistic (e.g., -log10(p) signed by the direction of log2FC, or the DESeq2/limma test statistic)

Question: "Are pathway X members enriched at the top or bottom of my ranked gene list?"

Strengths:

  • No arbitrary cutoff
  • Captures coordinated subtle changes
  • Provides directional information (up vs down regulation)
  • Identifies "leading edge" — the actual genes driving the enrichment signal

Weaknesses:

  • More parameters to choose (weighted vs classic, permutation type)
  • Slower, especially with many gene sets
  • Harder to communicate to non-specialists
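The running enrichment score behind all of this is simple to compute. Here is a base-R sketch of the classic (unweighted) variant, stepping up at each gene-set hit and down at each miss (fgsea's default additionally weights each hit by the magnitude of its ranking statistic):

```r
# Classic (unweighted) GSEA running score for one gene set
running_es <- function(ranked_genes, gene_set) {
  hit    <- ranked_genes %in% gene_set
  n_hit  <- sum(hit)
  n_miss <- length(ranked_genes) - n_hit
  # step up at each hit, step down at each miss
  steps <- ifelse(hit, 1 / n_hit, -1 / n_miss)
  cumsum(steps)
}

set.seed(1)
genes <- paste0("g", 1:1000)
# Put most of a 30-gene set near the top of the ranking
gs <- c(genes[sample(1:100, 25)], genes[sample(500:1000, 5)])
rs <- running_es(genes, gs)
max(abs(rs))  # enrichment score: maximum deviation of the running sum
```

Because the up- and down-steps are balanced, the running sum always returns to zero at the end of the list; a set concentrated near the top produces a large early peak, which is exactly what plotEnrichment() draws.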

Practical Decision Framework

Use ORA when:

  • You have a small, well-defined gene list (< 500 genes)
  • Your analysis includes a clear cutoff (e.g., FDR < 0.05)
  • You need quick exploratory results

Use GSEA when:

  • You have a continuous ranking (DESeq2 stat, limma t-stat)
  • Effects are subtle but coordinated (e.g., aging, mild perturbations)
  • You want direction-aware results (up- vs down-regulation)
  • Reviewers explicitly ask for it (increasingly common)

Best practice: Run both. ORA for the headline result, GSEA for the deeper analysis with leading edge identification.

Complete fgsea Workflow

library(fgsea)
library(msigdbr)
library(dplyr)
library(ggplot2)

# Get MSigDB Hallmark gene sets for human
m_df <- msigdbr(species = "Homo sapiens", category = "H")
hallmark_pathways <- split(m_df$gene_symbol, m_df$gs_name)

# Construct ranked gene list: -log10(p) signed by the direction of log2FC
deg_results$rank <- sign(deg_results$log2FoldChange) *
                    -log10(deg_results$pvalue + 1e-300)
ranks <- setNames(deg_results$rank, deg_results$gene_symbol)
ranks <- sort(ranks, decreasing = TRUE)

# Run fgsea (multi-threaded, fast)
fgsea_res <- fgsea(
  pathways = hallmark_pathways,
  stats    = ranks,
  minSize  = 15,
  maxSize  = 500,
  eps      = 0
)

# Examine top pathways
fgsea_res %>%
  arrange(padj) %>%
  head(20) %>%
  select(pathway, NES, padj, leadingEdge)

# Visualize a single pathway's running enrichment score
plotEnrichment(
  hallmark_pathways[["HALLMARK_E2F_TARGETS"]],
  ranks
) + ggtitle("E2F Targets — Cell Cycle Regulation")

# Multi-pathway summary plot
top_up   <- fgsea_res[ES > 0][order(padj)][1:10, pathway]
top_down <- fgsea_res[ES < 0][order(padj)][1:10, pathway]
top_all  <- c(top_up, top_down)

plotGseaTable(
  hallmark_pathways[top_all],
  ranks,
  fgsea_res,
  gseaParam = 0.5
)

Pitfalls in Pathway Result Interpretation

Beyond background selection, several other issues commonly trip up enrichment analyses.

Pathway Size Bias

Larger pathways (e.g., "metabolism," "signal transduction") are easier to enrich simply because they contain more genes. When reading results, always look at NES alongside gene set size — a pathway with NES = 1.8 and 30 genes is more impressive than NES = 2.0 and 500 genes.
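The size effect is visible directly in the hypergeometric tail. In this toy comparison both sets show the same two-fold enrichment over expectation, yet only the large one reaches significance:

```r
N <- 12000; n_deg <- 200

# Small set: 30 genes, expected overlap 0.5, observed 1 (two-fold)
p_small <- phyper(1 - 1, 30, N - 30, n_deg, lower.tail = FALSE)

# Large set: 600 genes, expected overlap 10, observed 20 (two-fold)
p_large <- phyper(20 - 1, 600, N - 600, n_deg, lower.tail = FALSE)

c(small = p_small, large = p_large)
# Identical fold-enrichment, wildly different p-values
```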

Redundancy Between Parent and Child Pathways

In hierarchical databases like Reactome, both "Cell Cycle" and "Mitotic Cell Cycle" frequently appear in top results. They're not independent findings. Use:

  • clusterProfiler::simplify() for GO terms
  • Manual review of leading edge gene overlap for pathway results
  • enrichplot::emapplot() to visualize redundancy clusters

Direction Ambiguity in ORA

ORA doesn't distinguish between up- and down-regulated genes — it lumps them together. Run separate ORAs for up and down DEGs when direction matters (almost always). GSEA naturally provides directional NES values.
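Splitting by direction is mechanical: partition the DEG table on the sign of the fold change and run each sublist through its own ORA. A sketch assuming a hypothetical deg data frame with a log2fc column (the enrichGO calls mirror the workflow above):

```r
set.seed(42)
# Hypothetical DEG table for illustration
deg <- data.frame(
  gene   = paste0("gene", 1:100),
  log2fc = rnorm(100, sd = 2)
)

# Partition on the sign of the fold change
up_genes   <- deg$gene[deg$log2fc > 0]
down_genes <- deg$gene[deg$log2fc < 0]

# Each sublist then gets its own enrichment call, e.g.
# ego_up   <- enrichGO(gene = up_genes,   universe = universe_entrez, ...)
# ego_down <- enrichGO(gene = down_genes, universe = universe_entrez, ...)
length(up_genes) + length(down_genes)  # every DEG lands in exactly one list
```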

Gene-Level Redundancy

A single gene like TP53 belongs to dozens of pathways. When the same pathway always shows up, check whether it's actually the same set of leading-edge genes driving multiple results. If yes, treat them as one finding.

Tissue Context Mismatches

If your analysis returns "neurogenesis" enrichment from a blood sample, something is wrong. Common causes:

  • Wrong background (universe contains tissue-irrelevant genes)
  • Misclassified gene set
  • Genuine tissue-specific function of a generally-expressed gene

Always sanity-check enrichment hits against your sample biology.

Topology-Aware Pathway Analysis

Standard ORA/GSEA treats pathways as flat gene lists, ignoring the actual biological topology (gene regulatory relationships, signaling direction). For more sophisticated analyses:

  • SPIA (Signaling Pathway Impact Analysis): Uses KEGG pathway topology, computes signed impact factors
  • PathwayPCA, CePa: Account for gene-gene dependencies within pathways
  • PROGENy: Infers signaling pathway activity from downstream target gene expression patterns (14 major pathways)
  • DoRothEA + viper: Transcription factor activity inference (covered in Part 3 of this series)

PROGENy is particularly valuable as a quick second-opinion: it produces a single activity score per pathway per sample, which can be visualized as a heatmap across conditions and immediately reveals which signaling pathways differ between groups.

A Recommended End-to-End Workflow

For a standard RNA-seq differential expression analysis:

  1. First pass — clusterProfiler ORA: KEGG + Reactome on DEG list with proper background → quick "big picture"
  2. Second pass — fgsea GSEA: MSigDB Hallmark + C2 (curated) on ranked list → subtle signal detection with directional NES
  3. Third pass — PROGENy: Sample-level signaling activity for heatmap visualization
  4. Fourth pass — cross-reference with PPI Hubs (from Part 1): Identify Hub proteins that appear in leading edges → high-priority candidates
  5. Optional — SPIA: When pathway topology matters (e.g., signaling cascade activation/inhibition direction)

Reporting Standards for Publication

A reviewer-proof enrichment analysis report should include:

  • Database versions: e.g., "MSigDB v2023.2.Hs", "GO release 2024-01-17", "KEGG v110"
  • Tool versions: clusterProfiler 4.10, fgsea 1.28
  • Background definition: How you defined the universe (with citation/justification)
  • Multiple testing correction: BH, Bonferroni, or FDR — and why
  • Cutoffs: Both p-value/FDR and effect size (NES) thresholds
  • Leading edge tables: Supplementary table with the actual genes driving each enrichment
  • Session info: R sessionInfo() output for full reproducibility

Following these standards reduces reviewer pushback significantly and makes your analysis easier to reproduce or extend later.

What's Next

You now have a list of Hub proteins (Part 1) and a ranked list of biological processes / pathways (Part 2). The third pillar of post-DEG functional analysis asks a deeper question: "Which transcription factors are driving these expression changes?"

➡️ Coming next: TF Activity Inference and Biomarker Integration — how DoRothEA, viper, SCENIC, and ChEA3 reveal the regulatory programs behind your DEGs, and how to integrate Hub + Pathway + TF results into validated biomarker candidates.

Further Reading

  • Subramanian, A. et al. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102(43).
  • Khatri, P., Sirota, M. & Butte, A. J. (2012). Ten years of pathway analysis: Current approaches and outstanding challenges. PLoS Computational Biology, 8(2).
  • Wu, T. et al. (2021). clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation, 2(3).
  • Korotkevich, G. et al. (2021). Fast gene set enrichment analysis. bioRxiv. (fgsea methods paper)
  • Liberzon, A. et al. (2015). The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Systems, 1(6).
