
Differential Expression Analysis in Proteomics: A Complete R Pipeline (limma, t-test, ANOVA)

Step-by-step guide to differential expression analysis in proteomics using R. Covers limma, Welch's t-test, Mann-Whitney, ANOVA with Tukey HSD, BH correction, and volcano plots. With real code and statistical diagnostics.


Differential expression (DE) analysis is the core step in any quantitative proteomics experiment: which proteins are significantly different between conditions? I built a complete DE pipeline from scratch (for BioAI Market), and here's everything I learned about doing it properly.

[Figure: proteomics data analysis workflow]

Overview: Which Statistical Test to Use?

The right test depends on your data:

| Scenario | Recommended Method | Why |
| --- | --- | --- |
| 2 groups, ≥3 replicates | limma (R) | Gold standard — empirical Bayes moderated t-test |
| 2 groups, ≥3 replicates, no R | Welch's t-test | Robust to unequal variance |
| 2 groups, small n or non-normal | Mann-Whitney U | Non-parametric, no normality assumption |
| 3+ groups | ANOVA + Tukey HSD | Tests all pairwise comparisons |
| 2 groups, equal variance confirmed | Student's t-test | Only if Levene's test confirms homoscedasticity |

Rule of thumb: If you have R available and ≥3 replicates, always use limma. It borrows information across proteins to stabilize variance estimates, which is especially important in proteomics where replicates are often limited.

Step 1: Data Preprocessing

Before any statistical test, preprocess your expression matrix:

Filtering

# Remove proteins with too many missing values
# Keep proteins detected in at least 70% of samples
min_valid <- ncol(data) * 0.7
data_filtered <- data[rowSums(!is.na(data)) >= min_valid, ]
cat(sprintf("Kept %d / %d proteins\n", nrow(data_filtered), nrow(data)))

Imputation

# For proteomics, common approaches:
# 1. MinProb — replace NA with low-abundance values
library(imputeLCMD)
data_imputed <- impute.MinProb(data_filtered, q = 0.01)

# 2. KNN imputation
library(impute)
data_imputed <- impute.knn(as.matrix(data_filtered), k = 5)$data

# 3. Simple minimum/2 (quick and dirty)
for (i in 1:nrow(data_filtered)) {
  row_min <- min(data_filtered[i, ], na.rm = TRUE)
  data_filtered[i, is.na(data_filtered[i, ])] <- row_min / 2
}

Log2 Transformation

# If data is not already log-transformed
data_log <- log2(data_imputed)

Normalization

# Median normalization (proteomics gold standard)
col_medians <- apply(data_log, 2, median, na.rm = TRUE)
global_median <- median(col_medians)
data_norm <- sweep(data_log, 2, col_medians - global_median)

Step 2: Statistical Diagnostics

Before choosing a test, diagnose your data:

Normality Test (Shapiro-Wilk)

# Test normality for each protein
shapiro_results <- apply(data_norm, 1, function(row) {
  tryCatch(shapiro.test(row)$p.value, error = function(e) NA)
})

# What fraction of proteins pass normality?
normal_fraction <- mean(shapiro_results > 0.05, na.rm = TRUE)
cat(sprintf("%.1f%% of proteins pass normality test (p > 0.05)\n",
            normal_fraction * 100))
# Typically 60-80% in proteomics data

Variance Homogeneity (Levene's Test)

library(car)

levene_results <- apply(data_norm, 1, function(row) {
  df <- data.frame(value = row, group = group_labels)
  tryCatch(leveneTest(value ~ group, data = df)$`Pr(>F)`[1],
           error = function(e) NA)
})

equal_var_fraction <- mean(levene_results > 0.05, na.rm = TRUE)
cat(sprintf("%.1f%% of proteins have equal variance\n",
            equal_var_fraction * 100))

Decision Logic

# n_groups, r_available, n_replicates come from your experimental design
if (n_groups > 2) {
  method <- "ANOVA + Tukey HSD"
} else if (r_available && n_replicates >= 3) {
  method <- "limma"  # Always prefer limma when available
} else if (normal_fraction < 0.5) {
  method <- "Mann-Whitney U"
} else {
  method <- "Welch's t-test"  # Safe default: matches Student's when variances are equal
}

Step 3A: limma (R)
library(limma)

# Create design matrix
group <- factor(c(rep("Control", 3), rep("Treatment", 3)))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# Create contrast
contrast <- makeContrasts(Treatment - Control, levels = design)

# Fit model
fit <- lmFit(data_norm, design)
fit2 <- contrasts.fit(fit, contrast)
fit3 <- eBayes(fit2)

# Extract results
results <- topTable(fit3, number = Inf, sort.by = "none")

# Add significance annotation
results$significant <- abs(results$logFC) > 1 & results$adj.P.Val < 0.05
results$regulation <- ifelse(!results$significant, "NS",
                      ifelse(results$logFC > 0, "Up", "Down"))

cat(sprintf("Significant: %d (%d up, %d down)\n",
    sum(results$significant),
    sum(results$regulation == "Up"),
    sum(results$regulation == "Down")))

Why limma is Gold Standard for Proteomics

  1. Empirical Bayes moderation — Shrinks protein-level variance toward a common prior, solving the "small n" problem
  2. Handles missing data patterns — Robust to the missingness common in proteomics
  3. Multiple testing built in — adj.P.Val uses BH correction by default
  4. Battle-tested — Thousands of publications since 2004
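
The moderation in point 1 can be sketched in a few lines of Python. This is a toy illustration, not limma itself: limma estimates the prior variance and prior degrees of freedom empirically from all proteins, whereas here they are fixed by assumption.

```python
import numpy as np

def moderated_variance(sample_vars, df, prior_var, prior_df):
    # Posterior variance has the same form as limma's:
    # s2_post = (d0 * s0^2 + d * s2) / (d0 + d)
    return (prior_df * prior_var + df * sample_vars) / (prior_df + df)

rng = np.random.default_rng(0)
# 1000 "proteins" with only 3 replicates each: raw variance estimates are noisy
data = rng.normal(0.0, 1.0, size=(1000, 3))
s2 = data.var(axis=1, ddof=1)  # per-protein variance, df = 2
s2_mod = moderated_variance(s2, df=2, prior_var=1.0, prior_df=4)

print(f"raw variances:       {s2.min():.3f} to {s2.max():.3f}")
print(f"moderated variances: {s2_mod.min():.3f} to {s2_mod.max():.3f}")
```

Extreme variance estimates get pulled toward the prior, so a protein that looked artificially stable (tiny variance by chance) no longer produces an inflated t-statistic.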

Step 3B: Welch's t-test (Python Alternative)

If R is not available:

from scipy import stats
from statsmodels.stats.multitest import multipletests
import numpy as np

def welch_ttest_de(data, protein_ids, control_idx, treatment_idx,
                   fc_threshold=2.0, p_threshold=0.05):
    # fc_threshold is a linear fold change (2.0 → |log2FC| > 1)
    results = []
    p_values = []

    for i in range(data.shape[0]):
        control = data[i, control_idx]
        treatment = data[i, treatment_idx]

        # Remove NaN
        control = control[~np.isnan(control)]
        treatment = treatment[~np.isnan(treatment)]

        if len(control) < 2 or len(treatment) < 2:
            continue

        log2fc = np.mean(treatment) - np.mean(control)  # Already log2
        _, pval = stats.ttest_ind(control, treatment, equal_var=False)

        results.append({
            'protein': protein_ids[i],
            'log2FC': log2fc,
            'pvalue': pval,
        })
        p_values.append(pval)

    # BH correction (CRITICAL — never skip this!)
    _, adj_pvals, _, _ = multipletests(p_values, method='fdr_bh')
    for i, r in enumerate(results):
        r['adj_pvalue'] = adj_pvals[i]
        r['significant'] = abs(r['log2FC']) > np.log2(fc_threshold) and r['adj_pvalue'] < p_threshold

    return results

⚠️ CRITICAL: Always apply multiple testing correction (BH/FDR). With 5000+ proteins tested, you'll get hundreds of false positives at p < 0.05 without it.
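
The scale of the problem is easy to simulate: test 5000 proteins where nothing is truly different, and raw p < 0.05 still flags a few hundred of them. A quick sketch (pure-null synthetic data, assumed 3 vs 3 design):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
n_proteins = 5000

# Pure-null data: both "conditions" come from the same distribution,
# so every positive is a false positive
control = rng.normal(20.0, 1.0, size=(n_proteins, 3))
treatment = rng.normal(20.0, 1.0, size=(n_proteins, 3))

_, pvals = stats.ttest_ind(control, treatment, axis=1, equal_var=False)
reject_bh, adj_pvals, _, _ = multipletests(pvals, method='fdr_bh')

print(f"raw p < 0.05:   {(pvals < 0.05).sum()} false positives")
print(f"BH adj. < 0.05: {reject_bh.sum()} false positives")
```

On null data the raw count hovers around 5% of 5000 (roughly 250 proteins), while BH correction brings it to essentially zero.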

Step 3C: ANOVA + Tukey HSD (3+ Groups)

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests
from scipy.stats import tukey_hsd

def anova_de(data, groups, p_threshold=0.05):
    """
    Vectorized ANOVA for proteomics — much faster than per-protein loops.

    Args:
        data: numpy array (proteins × samples), log2 transformed
        groups: list of group labels per sample
    """
    unique_groups = sorted(set(groups))
    n_groups = len(unique_groups)

    # Group indices
    group_idx = {g: [i for i, x in enumerate(groups) if x == g]
                 for g in unique_groups}

    # Vectorized F-statistic computation
    grand_mean = np.nanmean(data, axis=1, keepdims=True)
    n_total = data.shape[1]

    # SSB (between-group sum of squares)
    ssb = np.zeros(data.shape[0])
    for g, idx in group_idx.items():
        group_mean = np.nanmean(data[:, idx], axis=1)
        ssb += len(idx) * (group_mean - grand_mean.flatten()) ** 2

    # SSW (within-group sum of squares)
    ssw = np.zeros(data.shape[0])
    for g, idx in group_idx.items():
        group_mean = np.nanmean(data[:, idx], axis=1, keepdims=True)
        ssw += np.nansum((data[:, idx] - group_mean) ** 2, axis=1)

    # F-statistic
    df_between = n_groups - 1
    df_within = n_total - n_groups
    f_stat = (ssb / df_between) / (ssw / df_within + 1e-10)
    p_values = stats.f.sf(f_stat, df_between, df_within)

    # BH correction FIRST, then Tukey only for significant
    _, adj_pvals, _, _ = multipletests(p_values, method='fdr_bh')

    significant_mask = adj_pvals < p_threshold
    print(f"ANOVA significant: {significant_mask.sum()} / {len(adj_pvals)}")

    # Tukey HSD only for significant proteins (saves massive compute time)
    # Limit to top 100 by F-statistic for speed
    sig_indices = np.where(significant_mask)[0]
    top_sig = sig_indices[np.argsort(f_stat[sig_indices])[::-1][:100]]

    results = []
    for idx in top_sig:
        # tukey_hsd does not accept NaN — run this on imputed data
        group_data = [data[idx, group_idx[g]] for g in unique_groups]
        tukey = tukey_hsd(*group_data)
        results.append({
            'index': idx,
            'f_stat': f_stat[idx],
            'adj_pvalue': adj_pvals[idx],
            'tukey_pvalues': tukey.pvalue,  # pairwise p-value matrix
        })

    return results

Key optimization: BH correction first, then Tukey only for significant proteins. On a 9-group, 5000-protein dataset, this reduced runtime from 215 seconds to 8 seconds.

Step 4: Volcano Plot

library(ggplot2)

ggplot(results, aes(x = logFC, y = -log10(adj.P.Val), color = regulation)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_manual(values = c("Up" = "#e74c3c", "Down" = "#3498db", "NS" = "#bdc3c7")) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray40") +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "gray40") +
  labs(
    title = "Volcano Plot — Differential Expression",
    x = "Log2 Fold Change",
    y = "-Log10 Adjusted P-value"
  ) +
  theme_minimal() +
  annotate("text", x = max(results$logFC) * 0.8, y = max(-log10(results$adj.P.Val)) * 0.9,
           label = sprintf("%d Up", sum(results$regulation == "Up")), color = "#e74c3c") +
  annotate("text", x = min(results$logFC) * 0.8, y = max(-log10(results$adj.P.Val)) * 0.9,
           label = sprintf("%d Down", sum(results$regulation == "Down")), color = "#3498db")

Step 5: Pathway Enrichment

After DE, feed significant proteins into pathway analysis:

library(clusterProfiler)
library(org.Hs.eg.db)

# Significant proteins (gene symbols); ENTREZ conversion for KEGG happens below
sig_genes <- results$gene_symbol[results$significant]

ego <- enrichGO(
  gene = sig_genes,
  OrgDb = org.Hs.eg.db,
  keyType = "SYMBOL",
  ont = "BP",
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05
)

# KEGG pathways
gene_entrez <- bitr(sig_genes, fromType = "SYMBOL",
                     toType = "ENTREZID", OrgDb = org.Hs.eg.db)

ekegg <- enrichKEGG(
  gene = gene_entrez$ENTREZID,
  organism = "hsa",
  pAdjustMethod = "BH"
)

If Your IDs are UniProt Accessions

# Auto-detect UniProt accessions and convert to symbols
if (grepl("^[A-Z][0-9A-Z]{5}", sig_genes[1])) {
  sig_genes_uniprot <- sig_genes  # keep the original accessions

  # UniProt → Symbol conversion
  mapped <- bitr(sig_genes_uniprot, fromType = "UNIPROT",
                 toType = "SYMBOL", OrgDb = org.Hs.eg.db)
  sig_genes <- mapped$SYMBOL

  # Alternatively, enrichGO accepts UniProt IDs directly
  ego <- enrichGO(gene = sig_genes_uniprot,
                  OrgDb = org.Hs.eg.db,
                  keyType = "UNIPROT", ont = "BP")
}

Common Mistakes to Avoid

1. No Multiple Testing Correction

# WRONG — thousands of false positives
significant <- results$P.Value < 0.05

# RIGHT — BH-corrected
significant <- results$adj.P.Val < 0.05

2. Using Log2FC Threshold Without P-value

# WRONG — large fold changes can be noise
significant <- abs(results$logFC) > 1

# RIGHT — both thresholds
significant <- abs(results$logFC) > 1 & results$adj.P.Val < 0.05

3. Student's t-test Instead of Welch's

Proteomics data almost always has unequal variance between groups. Always use equal_var=False (Welch's) unless Levene's test explicitly confirms equal variance.
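
The difference is easy to see on toy numbers (assumed for illustration, not real intensities): a small noisy group next to a tight larger group makes the pooled-variance Student's test overconfident, while Welch's wider p-value reflects the actual uncertainty.

```python
import numpy as np
from scipy import stats

# Six tight control replicates vs. three noisy treatment replicates
control = np.array([10.0, 10.1, 9.9, 10.05, 9.95, 10.0])
treatment = np.array([12.0, 8.0, 13.0])

_, p_student = stats.ttest_ind(control, treatment, equal_var=True)
_, p_welch = stats.ttest_ind(control, treatment, equal_var=False)

print(f"Student's t-test: p = {p_student:.3f}")
print(f"Welch's t-test:   p = {p_welch:.3f}")
```

Student's test pools the two variances, letting the six quiet control replicates mask how unreliable the treatment mean really is; Welch's keeps the variances separate and returns a larger, more honest p-value here.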

4. Not Log-Transforming

Raw LFQ intensities span orders of magnitude. Statistical tests assume roughly normal distributions. Log2 transformation is essential.
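
A quick check on simulated data (log-normal intensities as a stand-in for raw LFQ values) shows why: the raw distribution is extremely right-skewed, and log2 makes it roughly symmetric.

```python
import numpy as np

rng = np.random.default_rng(7)
# Simulated raw LFQ-style intensities spanning orders of magnitude
raw = rng.lognormal(mean=20.0, sigma=2.0, size=10000)
log2_data = np.log2(raw)

def skewness(x):
    # Moment-based skewness: 0 for a symmetric distribution
    return float(((x - x.mean()) ** 3).mean() / x.std() ** 3)

print(f"raw skewness:  {skewness(raw):.1f}")
print(f"log2 skewness: {skewness(log2_data):.2f}")
```

If your own matrix already looks symmetric with values in the tens (not millions), it has most likely been log-transformed upstream — transforming twice is as damaging as not transforming at all.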

5. Ignoring Missing Data Patterns

Proteins missing in one condition but present in another may be the most biologically interesting! Consider "missing not at random" (MNAR) imputation methods.
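
Before imputing, it's worth flagging these presence/absence candidates explicitly, since naive global imputation erases them. A minimal sketch (column indices and the `min_detected` threshold are illustrative assumptions):

```python
import numpy as np

def flag_condition_specific(data, control_idx, treatment_idx, min_detected=3):
    # Count non-missing values per protein in each condition
    ctrl_n = (~np.isnan(data[:, control_idx])).sum(axis=1)
    trt_n = (~np.isnan(data[:, treatment_idx])).sum(axis=1)
    # Detected in one condition, completely absent in the other
    only_in_treatment = (ctrl_n == 0) & (trt_n >= min_detected)
    only_in_control = (trt_n == 0) & (ctrl_n >= min_detected)
    return only_in_treatment, only_in_control

# Toy 2-protein matrix: protein 0 is detected only under treatment
data = np.array([
    [np.nan, np.nan, np.nan, 21.0, 21.5, 20.8],   # MNAR candidate
    [18.0, 18.2, 17.9, 18.1, 18.3, 18.0],         # fully observed
])
only_trt, only_ctrl = flag_condition_specific(data, [0, 1, 2], [3, 4, 5])
print("only in treatment:", np.where(only_trt)[0])   # → [0]
print("only in control:  ", np.where(only_ctrl)[0])  # → []
```

Report these proteins separately (e.g. as "on/off" hits) rather than letting a MinProb-imputed fold change stand in for what is really a detection event.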

Complete Minimal Pipeline

# 1. Load & preprocess
data <- read.csv("protein_matrix.csv", row.names = 1)
data_log <- log2(data + 1)
col_med <- apply(data_log, 2, median)
data_norm <- sweep(data_log, 2, col_med - median(col_med))

# 2. limma DE
library(limma)
group <- factor(c(rep("Control", 3), rep("Treatment", 3)))
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
fit <- eBayes(contrasts.fit(lmFit(data_norm, design),
              makeContrasts(Treatment - Control, levels = design)))
res <- topTable(fit, number = Inf, sort.by = "none")
res$sig <- abs(res$logFC) > 1 & res$adj.P.Val < 0.05

# 3. Results
cat(sprintf("%d significant (%d up, %d down) out of %d tested\n",
    sum(res$sig),
    sum(res$sig & res$logFC > 0),
    sum(res$sig & res$logFC < 0),
    nrow(res)))

Tools That Do This For You

If you want a GUI instead of code:

  • BioAI Market — Our platform (full pipeline with visualizations)
  • Perseus — Max Planck's desktop tool
  • Proteome Discoverer — Thermo Fisher's commercial solution
  • MSstats — R/Bioconductor package specifically for MS proteomics

This guide is based on building the analysis pipeline for BioAI Market. The platform implements all these methods with automatic statistical diagnostics and method recommendation. Check it out if you want a no-code solution.
