Differential Expression Analysis in Proteomics: A Complete R Pipeline (limma, t-test, ANOVA)
Step-by-step guide to differential expression analysis in proteomics using R. Covers limma, Welch's t-test, Mann-Whitney, ANOVA with Tukey HSD, BH correction, and volcano plots. With real code and statistical diagnostics.
Differential expression (DE) analysis is the core step in any quantitative proteomics experiment: which proteins are significantly different between conditions? I built a complete DE pipeline from scratch (for BioAI Market); here's everything I learned about doing it properly.
Overview: Which Statistical Test to Use?
The right test depends on your data:
| Scenario | Recommended Method | Why |
|---|---|---|
| 2 groups, ≥3 replicates | limma (R) | Gold standard — empirical Bayes moderated t-test |
| 2 groups, ≥3 replicates, no R | Welch's t-test | Robust to unequal variance |
| 2 groups, small n or non-normal | Mann-Whitney U | Non-parametric, no normality assumption |
| 3+ groups | ANOVA + Tukey HSD | Tests all pairwise comparisons |
| 2 groups, equal variance confirmed | Student's t-test | Only if Levene's test confirms homoscedasticity |
Rule of thumb: If you have R available and ≥3 replicates, always use limma. It borrows information across proteins to stabilize variance estimates, which is especially important in proteomics where replicates are often limited.
Step 1: Data Preprocessing
Before any statistical test, preprocess your expression matrix:
Filtering
# Remove proteins with too many missing values
# Keep proteins detected in at least 70% of samples
min_valid <- ncol(data) * 0.7
data_filtered <- data[rowSums(!is.na(data)) >= min_valid, ]
cat(sprintf("Kept %d / %d proteins\n", nrow(data_filtered), nrow(data)))
Imputation
# For proteomics, common approaches:
# 1. MinProb — replace NA with low-abundance values
library(imputeLCMD)
data_imputed <- impute.MinProb(data_filtered, q = 0.01)
# 2. KNN imputation
library(impute)
data_imputed <- impute.knn(as.matrix(data_filtered), k = 5)$data
# 3. Simple minimum/2 (quick and dirty)
for (i in 1:nrow(data_filtered)) {
  row_min <- min(data_filtered[i, ], na.rm = TRUE)
  data_filtered[i, is.na(data_filtered[i, ])] <- row_min / 2
}
Log2 Transformation
# If data is not already log-transformed
data_log <- log2(data_imputed)
Normalization
# Median normalization (proteomics gold standard)
col_medians <- apply(data_log, 2, median, na.rm = TRUE)
global_median <- median(col_medians)
data_norm <- sweep(data_log, 2, col_medians - global_median)
Step 2: Statistical Diagnostics
Before choosing a test, diagnose your data:
Normality Test (Shapiro-Wilk)
# Test normality for each protein
shapiro_results <- apply(data_norm, 1, function(row) {
  tryCatch(shapiro.test(row)$p.value, error = function(e) NA)
})
# What fraction of proteins pass normality?
normal_fraction <- mean(shapiro_results > 0.05, na.rm = TRUE)
cat(sprintf("%.1f%% of proteins pass normality test (p > 0.05)\n",
            normal_fraction * 100))
# Typically 60-80% in proteomics data
Variance Homogeneity (Levene's Test)
library(car)
levene_results <- apply(data_norm, 1, function(row) {
  df <- data.frame(value = row, group = group_labels)
  tryCatch(leveneTest(value ~ group, data = df)$`Pr(>F)`[1],
           error = function(e) NA)
})
equal_var_fraction <- mean(levene_results > 0.05, na.rm = TRUE)
cat(sprintf("%.1f%% of proteins have equal variance\n",
            equal_var_fraction * 100))
Decision Logic
# n_groups, n_replicates, r_available describe your design and environment;
# normal_fraction and equal_var_fraction come from the diagnostics above
if (n_groups > 2) {
  method <- "ANOVA + Tukey HSD"
} else if (r_available && n_replicates >= 3) {
  method <- "limma"  # Always prefer limma when available
} else if (normal_fraction < 0.5) {
  method <- "Mann-Whitney U"
} else if (equal_var_fraction < 0.5) {
  method <- "Welch's t-test"
} else {
  method <- "Welch's t-test"  # Safe default even when variances look equal
}
Step 3A: limma (Recommended for 2-Group Comparison)
library(limma)
# Create design matrix
group <- factor(c(rep("Control", 3), rep("Treatment", 3)))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
# Create contrast
contrast <- makeContrasts(Treatment - Control, levels = design)
# Fit model
fit <- lmFit(data_norm, design)
fit2 <- contrasts.fit(fit, contrast)
fit3 <- eBayes(fit2)
# Extract results
results <- topTable(fit3, number = Inf, sort.by = "none")
# Add significance annotation
results$significant <- abs(results$logFC) > 1 & results$adj.P.Val < 0.05
results$regulation <- ifelse(!results$significant, "NS",
                             ifelse(results$logFC > 0, "Up", "Down"))
cat(sprintf("Significant: %d (%d up, %d down)\n",
            sum(results$significant),
            sum(results$regulation == "Up"),
            sum(results$regulation == "Down")))
Why limma is Gold Standard for Proteomics
- Empirical Bayes moderation — Shrinks protein-level variance toward a common prior, solving the "small n" problem
- Handles missing data patterns — Robust to the missingness common in proteomics
- Multiple testing built in —
adj.P.Val uses BH correction by default
- Battle-tested — Thousands of publications since 2004
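The moderation idea itself is simple enough to sketch. Below is a minimal Python illustration (not limma's actual implementation) of a moderated t-test for one protein: the sample variance is shrunk toward a prior, and the test gains the prior's degrees of freedom. The prior parameters d0 and s0² are assumed given here — limma estimates them empirically across all proteins.

```python
import numpy as np
from scipy import stats

def moderated_t(x, y, d0=4.0, s0_sq=0.05):
    """Sketch of limma-style variance moderation for one protein.

    x, y: replicate log2 intensities for the two groups.
    d0, s0_sq: prior degrees of freedom and prior variance (assumed
    given here; limma estimates them from the whole protein matrix).
    """
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    # Pooled sample variance for this protein
    s_sq = (np.var(x, ddof=1) * (n1 - 1) + np.var(y, ddof=1) * (n2 - 1)) / df
    # Empirical Bayes shrinkage: weighted average of prior and sample variance
    s_tilde_sq = (d0 * s0_sq + df * s_sq) / (d0 + df)
    t = (np.mean(y) - np.mean(x)) / np.sqrt(s_tilde_sq * (1 / n1 + 1 / n2))
    # The moderated t has d0 + df degrees of freedom
    p = 2 * stats.t.sf(abs(t), d0 + df)
    return t, p
```

With only 3 replicates per group, the shrinkage keeps one unluckily tiny sample variance from producing a huge spurious t-statistic — exactly the "small n" failure mode of the ordinary t-test.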
Step 3B: Welch's t-test (Python Alternative)
If R is not available:
from scipy import stats
from statsmodels.stats.multitest import multipletests
import numpy as np
def welch_ttest_de(data, protein_ids, control_idx, treatment_idx,
                   fc_threshold=2.0, p_threshold=0.05):
    """Per-protein Welch's t-test with BH correction.

    data: numpy array (proteins × samples), already log2 transformed.
    fc_threshold is a fold change (2.0 → |log2FC| > 1).
    """
    results = []
    p_values = []
    for i in range(data.shape[0]):
        control = data[i, control_idx]
        treatment = data[i, treatment_idx]
        # Remove NaN
        control = control[~np.isnan(control)]
        treatment = treatment[~np.isnan(treatment)]
        if len(control) < 2 or len(treatment) < 2:
            continue
        log2fc = np.mean(treatment) - np.mean(control)  # Data already log2
        _, pval = stats.ttest_ind(treatment, control, equal_var=False)
        results.append({
            'protein': protein_ids[i],
            'log2FC': log2fc,
            'pvalue': pval,
        })
        p_values.append(pval)
    # BH correction (CRITICAL — never skip this!)
    _, adj_pvals, _, _ = multipletests(p_values, method='fdr_bh')
    for i, r in enumerate(results):
        r['adj_pvalue'] = adj_pvals[i]
        r['significant'] = (abs(r['log2FC']) > np.log2(fc_threshold)
                            and r['adj_pvalue'] < p_threshold)
    return results
⚠️ CRITICAL: Always apply multiple testing correction (BH/FDR). With 5000+ proteins tested, you'll get hundreds of false positives at p < 0.05 without it.
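To see why this matters, here is a small simulation (seeded, pure-noise data — no true differences at all). Testing 5000 "proteins" at raw p < 0.05 yields hundreds of hits by chance alone; after BH correction, essentially none survive:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Pure-noise "experiment": 5000 proteins, 3 vs 3 replicates, no real signal
rng = np.random.default_rng(42)
null_data = rng.normal(size=(5000, 6))

# Welch's t-test for every protein at once (axis=1 → one test per row)
_, pvals = stats.ttest_ind(null_data[:, :3], null_data[:, 3:],
                           axis=1, equal_var=False)
raw_hits = int((pvals < 0.05).sum())   # ~250 expected by chance alone
reject, _, _, _ = multipletests(pvals, method='fdr_bh')
bh_hits = int(reject.sum())            # typically 0 on pure noise
print(f"raw p < 0.05: {raw_hits}, BH-significant: {bh_hits}")
```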
Step 3C: ANOVA + Tukey HSD (3+ Groups)
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests
from scipy.stats import tukey_hsd
def anova_de(data, groups, p_threshold=0.05):
    """
    Vectorized ANOVA for proteomics — much faster than per-protein loops.
    Args:
        data: numpy array (proteins × samples), log2 transformed
        groups: list of group labels per sample
    """
    unique_groups = sorted(set(groups))
    n_groups = len(unique_groups)
    # Group indices
    group_idx = {g: [i for i, x in enumerate(groups) if x == g]
                 for g in unique_groups}
    # Vectorized F-statistic computation
    grand_mean = np.nanmean(data, axis=1, keepdims=True)
    n_total = data.shape[1]
    # SSB (between-group sum of squares)
    ssb = np.zeros(data.shape[0])
    for g, idx in group_idx.items():
        group_mean = np.nanmean(data[:, idx], axis=1)
        ssb += len(idx) * (group_mean - grand_mean.flatten()) ** 2
    # SSW (within-group sum of squares)
    ssw = np.zeros(data.shape[0])
    for g, idx in group_idx.items():
        group_mean = np.nanmean(data[:, idx], axis=1, keepdims=True)
        ssw += np.nansum((data[:, idx] - group_mean) ** 2, axis=1)
    # F-statistic
    df_between = n_groups - 1
    df_within = n_total - n_groups  # approximate when NaNs are present
    f_stat = (ssb / df_between) / (ssw / df_within + 1e-10)
    p_values = stats.f.sf(f_stat, df_between, df_within)
    # BH correction FIRST, then Tukey only for significant
    _, adj_pvals, _, _ = multipletests(p_values, method='fdr_bh')
    significant_mask = adj_pvals < p_threshold
    print(f"ANOVA significant: {significant_mask.sum()} / {len(adj_pvals)}")
    # Tukey HSD only for significant proteins (saves massive compute time)
    # Limit to top 100 by F-statistic for speed
    results = {'f_stat': f_stat, 'adj_pvalue': adj_pvals, 'tukey': {}}
    sig_indices = np.where(significant_mask)[0]
    top_sig = sig_indices[np.argsort(f_stat[sig_indices])[::-1][:100]]
    for idx in top_sig:
        # tukey_hsd does not accept NaN — drop missing values per group
        group_data = [data[idx, group_idx[g]] for g in unique_groups]
        group_data = [v[~np.isnan(v)] for v in group_data]
        results['tukey'][idx] = tukey_hsd(*group_data)
    return results
Key optimization: BH correction first, then Tukey only for significant proteins. On a 9-group, 5000-protein dataset, this reduced runtime from 215 seconds to 8 seconds.
Step 4: Volcano Plot
library(ggplot2)
ggplot(results, aes(x = logFC, y = -log10(adj.P.Val), color = regulation)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("Up" = "#e74c3c", "Down" = "#3498db", "NS" = "#bdc3c7")) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray40") +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "gray40") +
  labs(
    title = "Volcano Plot — Differential Expression",
    x = "Log2 Fold Change",
    y = "-Log10 Adjusted P-value"
  ) +
  theme_minimal() +
  annotate("text", x = max(results$logFC) * 0.8, y = max(-log10(results$adj.P.Val)) * 0.9,
           label = sprintf("%d Up", sum(results$regulation == "Up")), color = "#e74c3c") +
  annotate("text", x = min(results$logFC) * 0.8, y = max(-log10(results$adj.P.Val)) * 0.9,
           label = sprintf("%d Down", sum(results$regulation == "Down")), color = "#3498db")
Step 5: Pathway Enrichment
After DE, feed significant proteins into pathway analysis:
library(clusterProfiler)
library(org.Hs.eg.db)
# Convert gene symbols to ENTREZ IDs
sig_genes <- results$gene_symbol[results$significant]
ego <- enrichGO(
  gene = sig_genes,
  OrgDb = org.Hs.eg.db,
  keyType = "SYMBOL",
  ont = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05
)
# KEGG pathways
gene_entrez <- bitr(sig_genes, fromType = "SYMBOL",
                    toType = "ENTREZID", OrgDb = org.Hs.eg.db)
ekegg <- enrichKEGG(
  gene = gene_entrez$ENTREZID,
  organism = "hsa",
  pAdjustMethod = "BH"
)
If Your IDs are UniProt Accessions
# Auto-detect and convert
if (grepl("^[A-Z][0-9A-Z]{5}", sig_genes[1])) {
  sig_genes_uniprot <- sig_genes
  # Option 1: UniProt → Symbol conversion
  mapped <- bitr(sig_genes_uniprot, fromType = "UNIPROT",
                 toType = "SYMBOL", OrgDb = org.Hs.eg.db)
  sig_genes <- mapped$SYMBOL
  # Option 2: enrichGO accepts UNIPROT keys directly
  ego <- enrichGO(gene = sig_genes_uniprot,
                  OrgDb = org.Hs.eg.db,
                  keyType = "UNIPROT", ont = "BP")
}
Common Mistakes to Avoid
1. No Multiple Testing Correction
# WRONG — thousands of false positives
significant <- results$P.Value < 0.05
# RIGHT — BH-corrected
significant <- results$adj.P.Val < 0.05
2. Using Log2FC Threshold Without P-value
# WRONG — large fold changes can be noise
significant <- abs(results$logFC) > 1
# RIGHT — both thresholds
significant <- abs(results$logFC) > 1 & results$adj.P.Val < 0.05
3. Student's t-test Instead of Welch's
Proteomics data almost always has unequal variance between groups. Always use equal_var=False (Welch's) unless Levene's test explicitly confirms equal variance.
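A quick way to see the difference (simulated data, same true mean in both groups): Student's pooled variance is dominated by the large, noisy group, so the test statistic for the small, tight group is computed against the wrong variance. Welch's test uses each group's own variance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Same true mean in both groups, but very different variance and group size —
# the scenario where Student's pooled-variance t-test misbehaves
a = rng.normal(10, 0.2, size=3)    # tight replicates
b = rng.normal(10, 2.0, size=12)   # noisy replicates

_, p_student = stats.ttest_ind(a, b, equal_var=True)   # Student's
_, p_welch = stats.ttest_ind(a, b, equal_var=False)    # Welch's
print(f"Student: p = {p_student:.3f}, Welch: p = {p_welch:.3f}")
```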
4. Not Log-Transforming
Raw LFQ intensities span orders of magnitude. Statistical tests assume roughly normal distributions. Log2 transformation is essential.
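A two-line check makes the point (simulated log-normal intensities standing in for raw LFQ values): the raw distribution is heavily right-skewed, while the log2-transformed one is roughly symmetric:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
# Simulated raw LFQ-like intensities: log-normal, spanning orders of magnitude
raw = rng.lognormal(mean=20, sigma=1.5, size=5000)
logged = np.log2(raw)  # roughly symmetric after transformation

print(f"raw skew: {skew(raw):.1f}, log2 skew: {skew(logged):.2f}")
```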
5. Ignoring Missing Data Patterns
Proteins missing in one condition but present in another may be the most biologically interesting! Consider "missing not at random" (MNAR) imputation methods.
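Before imputing anything, it is worth flagging these "on/off" proteins explicitly — they never reach a t-test (too few valid values in one group) and silently disappear from the results. A hypothetical helper (`one_condition_proteins` is my own name, not from any library) might look like:

```python
import numpy as np

def one_condition_proteins(data, control_idx, treatment_idx, min_valid=2):
    """Flag proteins quantified in one condition but absent in the other.

    data: proteins × samples array with NaN for missing values.
    Returns boolean masks (only_in_control, only_in_treatment).
    """
    valid_c = (~np.isnan(data[:, control_idx])).sum(axis=1)
    valid_t = (~np.isnan(data[:, treatment_idx])).sum(axis=1)
    only_treatment = (valid_c == 0) & (valid_t >= min_valid)
    only_control = (valid_t == 0) & (valid_c >= min_valid)
    return only_control, only_treatment
```

Report these proteins in a separate table rather than forcing them through imputation plus a t-test.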
Complete Minimal Pipeline
# 1. Load & preprocess
data <- read.csv("protein_matrix.csv", row.names = 1)
data_log <- log2(data + 1)
col_med <- apply(data_log, 2, median)
data_norm <- sweep(data_log, 2, col_med - median(col_med))
# 2. limma DE
library(limma)
group <- factor(c(rep("Control", 3), rep("Treatment", 3)))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
fit <- eBayes(contrasts.fit(lmFit(data_norm, design),
                            makeContrasts(Treatment - Control, levels = design)))
res <- topTable(fit, number = Inf, sort.by = "none")
res$sig <- abs(res$logFC) > 1 & res$adj.P.Val < 0.05
# 3. Results
cat(sprintf("%d significant (%d up, %d down) out of %d tested\n",
            sum(res$sig),
            sum(res$sig & res$logFC > 0),
            sum(res$sig & res$logFC < 0),
            nrow(res)))
Tools That Do This For You
If you want a GUI instead of code:
- BioAI Market — Our platform (full pipeline with visualizations)
- Perseus — Max Planck's desktop tool
- Proteome Discoverer — Thermo Fisher's commercial solution
- MSstats — R/Bioconductor package specifically for MS proteomics
This guide is based on building the analysis pipeline for BioAI Market. The platform implements all these methods with automatic statistical diagnostics and method recommendation. Check it out if you want a no-code solution.