Bioinformatics

DESeq2 vs edgeR vs limma-voom: Complete Comparison for RNA-seq Differential Expression

In-depth comparison of the three standard RNA-seq differential expression tools: DESeq2, edgeR, and limma-voom. Covers statistical models, when to use each, performance benchmarks, and a complete decision framework with R code examples.

12 min read
Tags: RNA-seq, DESeq2, edgeR, limma, differential expression, DEG analysis, transcriptomics, Bioconductor, statistics

If you've started learning RNA-seq analysis, you've probably hit the same wall every researcher does: there are three "standard" tools for differential expression — DESeq2, edgeR, and limma-voom — and the documentation for each insists it's the best. Which should you actually use? This guide cuts through the noise with a practical, statistics-grounded comparison: how each tool models your data, when one outperforms the others, and a clear decision framework for choosing the right tool for your specific experiment.

TL;DR

  • All three tools work well for standard two-group comparisons with 6-20 samples per group — differences are minor, not worth obsessing over.
  • DESeq2: Best general-purpose default; conservative, well-documented, easiest for beginners.
  • edgeR: Slightly more sensitive for small sample sizes (3-5/group); excellent for complex experimental designs.
  • limma-voom: Fastest, most flexible model formulas, best for large studies (>50 samples) and continuous covariates.
  • Never use FPKM/TPM (or any pre-normalized values) as input — all three require raw integer counts.
  • If you suspect unrecorded technical confounders, estimate hidden factors with SVA or RUVSeq and include them in the design.

Why Differential Expression Analysis Is Statistically Hard

RNA-seq differential expression seems simple — compare gene expression between two groups, run a statistical test, pick the significant genes. But three properties of count data make it surprisingly difficult:

  1. Counts are non-negative integers, not normally distributed: Standard t-tests assume normality, which fails badly for low-count genes.
  2. Mean-variance dependency: Genes with higher mean expression have higher variance — and the relationship is non-linear. Standard parametric tests assume constant variance.
  3. Small sample sizes: Most RNA-seq experiments have 3-10 replicates per condition, far too few to estimate gene-specific variance accurately.
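
The mean-variance problem is easy to see in a quick base-R simulation (a standalone sketch, not part of any DE package):

```r
# Simulate a low- and a high-expression gene from a negative binomial
# with the same dispersion (size = 1/dispersion)
set.seed(1)
low  <- rnbinom(1000, mu = 5,    size = 10)
high <- rnbinom(1000, mu = 5000, size = 10)

# Under a constant-variance (or Poisson, variance = mean) assumption
# both ratios would be ~1; the NB quadratic term mu^2/size makes the
# high-expression gene's variance vastly exceed its mean
var(low)  / mean(low)
var(high) / mean(high)
```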

DESeq2, edgeR, and limma-voom are all responses to these challenges. Each takes a slightly different statistical approach, leading to slightly different results — but the fundamental problem they solve is the same: how to detect real expression changes when individual genes have unreliable variance estimates from few replicates.

Statistical Foundations

DESeq2: Negative Binomial GLM with Wald or LRT Tests

DESeq2 (Love, Huber & Anders, 2014) models read counts using a Negative Binomial distribution:

K_ij ~ NB(μ_ij, α_i)

Where K_ij is the observed count for gene i in sample j, μ_ij is its expected value (proportional to library size and the gene's true expression), and α_i is the gene-specific dispersion parameter capturing biological variability.

The key insight: DESeq2 shares dispersion information across genes. Instead of estimating each gene's variance independently from few replicates, it pools information across genes to estimate a "trend" (dispersion as a function of mean expression), then shrinks individual gene dispersions toward this trend. This dramatically improves variance estimation accuracy.

For testing, DESeq2 fits a Generalized Linear Model and uses either:

  • Wald test (default): Tests a single coefficient (e.g., treatment vs control)
  • Likelihood Ratio Test (LRT): Compares two model fits, useful for testing multiple coefficients simultaneously
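
The LRT idea can be sketched in base R with a Poisson GLM on one simulated gene (a conceptual stand-in, not DESeq2's actual negative binomial machinery):

```r
# Fit a full and a reduced model, then compare their deviances
set.seed(2)
condition <- factor(rep(c("control", "treated"), each = 5))
y <- rpois(10, lambda = ifelse(condition == "treated", 40, 20))

full    <- glm(y ~ condition, family = poisson)
reduced <- glm(y ~ 1,         family = poisson)   # drops the term of interest

# The deviance difference is the LRT statistic, chi-squared under H0
anova(reduced, full, test = "Chisq")
```

DESeq2 does the same comparison per gene when you call DESeq with test = "LRT" and a reduced design formula.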

edgeR: Negative Binomial with Empirical Bayes Dispersion

edgeR (Robinson, McCarthy & Smyth, 2010) also uses a Negative Binomial model but with a different dispersion estimation strategy. Three variants:

  • Common dispersion: Single dispersion shared across all genes (rarely used, too simple)
  • Trended dispersion: Like DESeq2's mean-dispersion trend
  • Tagwise dispersion: Gene-specific, with empirical Bayes shrinkage toward the trend (recommended default)

edgeR offers two main testing approaches:

  • Exact test (exactTest): For simple two-group comparisons, computationally fast
  • Quasi-likelihood F test (glmQLFTest): For complex designs, uses quasi-likelihood to better account for dispersion uncertainty (recommended for most analyses)

limma-voom: Linear Modeling with Mean-Variance Trend

limma-voom (Law, Chen, Shi & Smyth, 2014) takes a fundamentally different approach. Instead of modeling counts directly, it:

  1. Transforms counts to log2 counts-per-million (CPM)
  2. Computes the mean-variance trend across all genes
  3. Estimates gene/sample-specific weights from this trend
  4. Fits linear models with weights (like classical limma for microarrays)
  5. Uses empirical Bayes moderation of standard errors

The advantage: linear models are extremely flexible — limma can handle continuous covariates, complex interactions, and time-course designs more naturally than negative binomial GLMs. The disadvantage: the count-to-CPM transformation discards some information, especially for low-count genes.

When Each Tool Has an Advantage

Small Sample Sizes (3 per group)

For very small experiments (3 vs 3), benchmarks show:

  • edgeR with QL test: Often the most sensitive while controlling false positives
  • DESeq2: Most conservative, may miss some real DEGs
  • limma-voom: Generally underperforms here because few samples make the mean-variance trend less reliable

Recommendation: For 3 vs 3 experiments, edgeR (glmQLFit + glmQLFTest) tends to give the most balanced results. Always validate top hits via qPCR.

Standard Sample Sizes (6-20 per group)

For typical experiments, all three perform similarly. Differences in DEG lists are usually 5-15% — most "core" DEGs are detected by all three. The choice often comes down to:

  • Familiarity / lab tradition
  • Specific design needs (interaction effects, continuous covariates)
  • Output format preferences

Recommendation: DESeq2 is a solid default. The shrunken log2FoldChange values it provides via lfcShrink() are particularly useful for visualization and downstream ranking.

Large Studies (>50 samples)

For larger cohorts, especially clinical studies with continuous covariates (age, BMI, tumor stage), limma-voom has clear advantages:

  • Much faster computation
  • More flexible model formulas
  • Better handling of mixed continuous/categorical covariates
  • Can correct for known confounders elegantly via design matrix

Recommendation: Use limma-voom. For traditional case/control with categorical groups, DESeq2 or edgeR still work but limma is cleaner.

Single-Cell RNA-seq (Pseudobulk Analysis)

For pseudobulk-aggregated single-cell data (recommended approach for cell-type-specific DE in scRNA-seq):

  • DESeq2 or edgeR: Standard choices, performance comparable
  • limma-voom: Works but the variance-weight assumption may break with high zero-inflation
  • Specialized tools (e.g., muscat): Built specifically for pseudobulk differential expression across cell types
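
The aggregation step itself is simple. A minimal base-R sketch, where sc_counts and cell_sample are made-up stand-ins for a real single-cell count matrix (one cell type) and its per-cell sample labels:

```r
# Made-up single-cell matrix: 20 genes x 300 cells, each cell labeled
# by its sample of origin
set.seed(3)
sc_counts   <- matrix(rpois(20 * 300, 2), nrow = 20,
                      dimnames = list(paste0("gene", 1:20), NULL))
cell_sample <- sample(paste0("sample", 1:4), 300, replace = TRUE)

# rowsum() sums rows by group, so transpose to sum cells within samples;
# the result is a genes x samples matrix of raw counts, usable as "bulk"
pseudobulk <- t(rowsum(t(sc_counts), group = cell_sample))
dim(pseudobulk)   # 20 genes x 4 samples
```

The pseudobulk matrix then goes into DESeq2 or edgeR exactly like a bulk count matrix, with one column per biological sample.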

Practical Code Comparison

Same input data, three tools — see how the workflows compare.

Setup (Common to All)

library(SummarizedExperiment)

# Your count matrix: genes (rows) × samples (columns), integer counts
counts <- as.matrix(read.csv("counts.csv", row.names = 1))

# Sample metadata
coldata <- read.csv("samples.csv", row.names = 1)
coldata$condition <- factor(coldata$condition, levels = c("control", "treated"))

DESeq2 Workflow

library(DESeq2)

# Construct DESeq2 dataset
dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData   = coldata,
  design    = ~ condition
)

# Pre-filter very low-count genes (optional but recommended)
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]

# Run differential expression analysis
dds <- DESeq(dds)

# Extract results
res <- results(dds, contrast = c("condition", "treated", "control"))

# Shrink log2 fold changes for better visualization/ranking
res_shrunk <- lfcShrink(
  dds,
  coef = "condition_treated_vs_control",
  type = "apeglm"
)

# Order by adjusted p-value
res_ordered <- res_shrunk[order(res_shrunk$padj), ]
summary(res_ordered)

# Visualization
plotMA(res_shrunk, ylim = c(-5, 5))

edgeR Workflow

library(edgeR)

# Construct DGEList
dge <- DGEList(counts = counts, group = coldata$condition)

# Filter low-count genes (edgeR's recommended approach)
keep <- filterByExpr(dge, group = coldata$condition)
dge <- dge[keep, , keep.lib.sizes = FALSE]

# TMM normalization (edgeR standard)
dge <- calcNormFactors(dge, method = "TMM")

# Design matrix
design <- model.matrix(~ condition, data = coldata)

# Estimate dispersions (tagwise with trend)
dge <- estimateDisp(dge, design)

# Fit GLM with quasi-likelihood
fit <- glmQLFit(dge, design)

# Test for differential expression
qlf <- glmQLFTest(fit, coef = 2)  # coef 2 = condition effect

# Top DEGs
top_degs <- topTags(qlf, n = Inf, sort.by = "PValue")
summary(decideTests(qlf))

# Visualization
plotMD(qlf)

limma-voom Workflow

library(limma)
library(edgeR)  # for DGEList and filterByExpr

# Same filtering approach as edgeR
dge <- DGEList(counts = counts, group = coldata$condition)
keep <- filterByExpr(dge, group = coldata$condition)
dge <- dge[keep, , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge, method = "TMM")

# Design matrix
design <- model.matrix(~ condition, data = coldata)

# voom transformation: counts → log2(CPM) with weights
v <- voom(dge, design, plot = TRUE)

# Linear model fitting
fit <- lmFit(v, design)

# Empirical Bayes moderation
fit <- eBayes(fit)

# Top DEGs
top_degs <- topTable(fit, coef = 2, number = Inf, sort.by = "P")
summary(decideTests(fit, p.value = 0.05))

# Visualization
plotSA(fit, main = "Mean-variance trend")

Three Outputs Compared

For a typical 8 vs 8 comparison, expect:

  • All three to identify ~80-85% of the same top DEGs (FDR < 0.05, |log2FC| > 1)
  • DESeq2 typically reports slightly fewer DEGs than edgeR
  • limma-voom occasionally finds genes the others miss (and vice versa) because of its different statistical model
  • Effect size estimates (log2FC) are similar after DESeq2's lfcShrink() is applied
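
Agreement between two DEG lists is plain set arithmetic; the gene symbols below are placeholders for real results()/topTags() output:

```r
# Placeholder gene symbols standing in for two tools' DEG lists
deseq2_hits <- c("TP53", "MYC", "EGFR", "BRCA1")
edger_hits  <- c("TP53", "MYC", "EGFR", "KRAS")

shared  <- intersect(deseq2_hits, edger_hits)
jaccard <- length(shared) / length(union(deseq2_hits, edger_hits))
jaccard   # 3 shared out of 5 distinct genes = 0.6
```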

Critical Pre-Analysis Steps (Apply to All Three)

The biggest factor in DE analysis quality isn't tool choice — it's whether you've done these prerequisites correctly.

1. Use Raw Counts, Never FPKM/TPM

All three tools require integer count data as input. Common mistake: feeding FPKM, TPM, or already-normalized values produces nonsense results.

If you only have FPKM/TPM, re-quantify from the raw FASTQ files using salmon, kallisto, or STAR + htseq-count. There is no statistically sound way to "convert" FPKM back to counts.
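
A quick sanity check before running any of the three tools (note that estimated counts imported via tximport are a documented exception and may be non-integer):

```r
# Raw counts should be non-negative integers; values like 12.37 suggest
# FPKM/TPM or normalized data slipped in
counts_mat <- matrix(c(0, 5, 12, 130), nrow = 2)   # stand-in for your matrix
is_raw <- all(counts_mat >= 0) && all(counts_mat == floor(counts_mat))
is_raw   # FALSE would mean the input is not raw counts
```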

2. Quality Control Plots Before DE

Always run before testing:

  • PCA plot on vst()-transformed counts (DESeq2) or cpm(dge, log = TRUE) values (edgeR/limma) to detect outliers and batch effects
  • Sample distance heatmap to identify mislabeled samples
  • Library size visualization to spot failed sequencing runs
  • Count distribution plots (boxplots, density curves) per sample

If PCA shows obvious batch effects or outliers, address them before running DE — otherwise your DEGs may be technical artifacts.
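
A minimal base-R version of the PCA check, using a manual log2 CPM on simulated counts (the vst() plus plotPCA() route in DESeq2 is the more polished option):

```r
# Simulated 200-gene x 6-sample matrix standing in for real counts
set.seed(4)
counts_mat <- matrix(rpois(200 * 6, 50), nrow = 200,
                     dimnames = list(NULL, paste0("s", 1:6)))

# Manual log2 CPM: scale each sample by its library size, then log
log_cpm <- log2(t(t(counts_mat) / colSums(counts_mat)) * 1e6 + 1)

# PCA with samples as rows; inspect PC1/PC2 variance and plot(pca$x)
# colored by batch/condition to spot structure
pca <- prcomp(t(log_cpm))
summary(pca)$importance[2, 1:2]
```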

3. Batch Effect Correction

If samples were processed in batches (different days, kits, sequencing runs), include batch as a covariate:

# DESeq2
design = ~ batch + condition  # condition tested while controlling for batch

# edgeR / limma
design <- model.matrix(~ batch + condition, data = coldata)

For unknown technical confounders, use SVA (svaseq) or RUVSeq to estimate hidden surrogate variables, then include them in the design.

4. Sensible Filtering

Pre-filter genes with very low counts before DE:

  • Light filtering (suggested in the DESeq2 vignette): Keep genes with ≥10 counts in ≥3 samples
  • Stricter filtering (edgeR filterByExpr): Considers minimum group size and library composition
  • Don't over-filter: Removing too many genes inflates the multiple testing penalty for the rest

5. Check Multiple Testing Correction

All three tools use Benjamini-Hochberg (BH) FDR by default, and for most analyses this is appropriate. For very strict applications (clinical biomarkers), consider Bonferroni (more stringent). The qvalue package can add power when a substantial fraction of genes is truly differentially expressed, because it estimates the proportion of true nulls rather than assuming it is 1.
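
The BH adjustment the tools apply is the same p.adjust(..., method = "BH") available in base R; a small demo on made-up p-values:

```r
p <- c(0.001, 0.008, 0.02, 0.04, 0.3, 0.9)   # made-up raw p-values
p_bh   <- p.adjust(p, method = "BH")
p_bonf <- p.adjust(p, method = "bonferroni")

sum(p_bh < 0.05)     # BH keeps more genes at the same threshold...
sum(p_bonf < 0.05)   # ...than the stricter Bonferroni
```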

Common Pitfalls Across All Three Tools

Treating log2FoldChange Magnitudes as Reliable for Low-Expression Genes

A gene with 1 count in control vs 10 in treated has a calculated log2FC = 3.3, but this is meaningless statistical noise. DESeq2's lfcShrink() partially solves this by shrinking small effects toward zero. Always use shrunken estimates for ranking and visualization.
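
A base-R illustration of the problem: two genes with the same true 10-fold change, resampled at low and high depth (the 0.5 pseudocount is only there to avoid log of zero):

```r
# Same true 10-fold change, resampled 5000 times at two depths
set.seed(5)
lfc_low  <- log2((rpois(5000, 10) + 0.5) / (rpois(5000, 1) + 0.5))
lfc_high <- log2((rpois(5000, 10000) + 0.5) / (rpois(5000, 1000) + 0.5))

sd(lfc_low)    # wide spread: individual estimates are unreliable
sd(lfc_high)   # tight around the true log2(10) ~ 3.32
```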

Confusing Statistical Significance with Biological Importance

A gene with FDR = 1e-50 but log2FC = 0.05 is statistically significant but probably biologically irrelevant. Always require both an FDR threshold AND a fold-change threshold (e.g., |log2FC| > 1, FDR < 0.05).
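
Applying both thresholds is a one-liner on a results table; the tiny data frame below mimics DESeq2's results() column names:

```r
res <- data.frame(
  gene           = c("g1", "g2", "g3"),
  log2FoldChange = c(0.05, 2.1, -1.8),
  padj           = c(1e-50, 0.001, 0.2)
)

# g1 fails the fold-change bar, g3 fails the FDR bar
hits <- subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
hits$gene
```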

Multiple Testing Across Multiple Comparisons

If you run 5 separate condition comparisons, the FDR within each is controlled — but the family-wise error rate across all 5 is not. For studies with many comparisons, consider hierarchical FDR control or a study-wise correction.
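
One simple study-wise option, sketched in base R: pool the raw p-values from every comparison and run BH once across the pooled set, so the FDR is controlled over all tests in the study rather than within each comparison:

```r
# Raw p-values from two separate comparisons (made-up numbers)
p_cmp1 <- c(0.001, 0.04)
p_cmp2 <- c(0.002, 0.3)

# BH once over the pooled set controls the FDR study-wide
pooled <- p.adjust(c(p_cmp1, p_cmp2), method = "BH")
split(pooled, rep(c("cmp1", "cmp2"), each = 2))
```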

Forgetting to Set Reference Level

The reference level of your factor determines which group is the "control" in fold change calculations:

# Without explicit levels R orders alphabetically; "control" happens to
# sort before "treated" here, but e.g. "knockout" vs "wildtype" would
# silently make knockout the reference
coldata$condition <- factor(coldata$condition)

# Always set explicitly to avoid surprises:
coldata$condition <- factor(coldata$condition, levels = c("control", "treated"))

If your fold changes are in the wrong direction, this is usually why.

Decision Framework Summary

Here's a simple decision tree for choosing your DE tool:

Start with: How many samples per group?

  • 3 per group: edgeR (glmQLFTest) or DESeq2 — both work, edgeR may be slightly more sensitive
  • 5-20 per group: DESeq2 (sensible default) or edgeR
  • 20+ per group: limma-voom (faster, more flexible)
  • 50+ per group with continuous covariates (age, BMI, etc.): limma-voom (clearly best)

Then check: Is your design simple or complex?

  • Two-group comparison: Any of the three works
  • Multi-factor (interaction terms, batch + condition + sex): All three handle this; limma's syntax is cleanest
  • Time course or continuous predictor: limma-voom strongly preferred
  • Pseudobulk single-cell: DESeq2 or edgeR

Finally: Are you new to RNA-seq?

  • Yes: Use DESeq2 — best documentation, largest user community, most tutorials, friendly error messages

Beyond Differential Expression

DEG analysis is the start, not the end. Once you have your DEG list:

  1. Quality check: Do top DEGs make biological sense? Use volcano plot, MA plot, heatmap of top 50.
  2. Functional analysis: GO + Pathway enrichment to understand biological themes (covered in Part 2 of our functional analysis series)
  3. Network analysis: PPI Hub identification to find regulatory drivers (Part 1 of our series)
  4. TF activity: DoRothEA + viper to identify upstream regulators (Part 3)
  5. Validation: qPCR, Western blot, IHC for top candidates

Conclusion

Choosing between DESeq2, edgeR, and limma-voom is less consequential than people think. All three are mature, well-validated tools that produce comparable results for typical experiments. What matters far more:

  • Correct experimental design and sample size
  • Proper raw count input (not FPKM/TPM)
  • Thoughtful batch effect handling
  • Meaningful filtering thresholds
  • Appropriate fold-change cutoffs alongside FDR

Pick a tool, learn its idioms well, and focus your energy on the biology and the experimental design. Then move to the downstream functional analysis that turns a DEG list into biological insight.

Further Reading

  • Love, M. I., Huber, W. & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12).
  • Robinson, M. D., McCarthy, D. J. & Smyth, G. K. (2010). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1).
  • Law, C. W., Chen, Y., Shi, W. & Smyth, G. K. (2014). voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biology, 15(2).
  • Soneson, C. & Delorenzi, M. (2013). A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics, 14(1).
  • Schurch, N. J. et al. (2016). How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA, 22(6).
