Imputing Missing Values in Proteomics — knn vs minDet vs MNAR — What Actually Works

The Single Decision That Reshapes Your Volcano Plot

You load proteinGroups.txt or report.pg_matrix.tsv. Half your intensity matrix is NA. Now what?

The choice you make here (leave NAs, knn impute, minDet impute, or MNAR-aware imputation) can change the number of differentially expressed proteins you call by 30-50%; in the comparison further down, the DEP count ranges from 397 to 583 depending on the method. There's no universally right answer, but there are clearly wrong defaults that ship in many tutorials.

This guide is the practical decision tree: what each imputation method does, where each fails, and which to pick for which data type (DIA-NN vs DDA, n=3 vs n=30, sparse vs dense matrix).

It's the imputation-specific sibling of the cross-species ECM pillar, which covered why pseudocounts are the worst possible imputation choice.

Why Proteomics Has So Many NAs

Three independent causes of missingness, often mixed in one dataset:

1. MCAR — Missing Completely At Random

Random instrumental noise, spray instability, occasional MS/MS misassignment. Not correlated with abundance. Knn or mean imputation handles these well.

In modern LC-MS proteomics, MCAR is rare — most missingness has structure.

2. MNAR — Missing Not At Random

A protein wasn't detected because its concentration was below detection limit. The "NA" carries real information: this sample had low abundance.

For DIA: most MNAR is in the low-intensity tail. For DDA: heavy MNAR throughout because of the data-dependent selection bias.

3. MAR — Missing At Random (conditional)

Missingness depends on observed covariates (e.g., a particular instrument batch had different sensitivity). Less common in proteomics than MCAR/MNAR, but exists.

Practical takeaway: in proteomics, missingness is mostly MNAR (low-abundance below detection), some MCAR (instrumental noise). Imputation methods designed for MCAR (knn, mean, median) introduce bias when applied to MNAR data — they pull genuinely low-abundance points toward the dataset average.
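One way to see which mechanism dominates in a given matrix is to correlate each protein's detection rate with its mean observed intensity: a clearly positive correlation points to MNAR. The following is a minimal sketch on simulated data. The simulation parameters and the helper name detection_intensity_cor are illustrative assumptions, not part of any package.

```r
# Simulate a log2 intensity matrix: 500 proteins x 8 samples (illustrative values)
set.seed(1)
n_prot <- 500
n_samp <- 8
true_mean <- rnorm(n_prot, mean = 25, sd = 2)
mat <- matrix(rnorm(n_prot * n_samp, mean = rep(true_mean, n_samp), sd = 0.5),
              nrow = n_prot)

# MCAR: every value has the same 20% chance of going missing
mat_mcar <- mat
mat_mcar[runif(length(mat)) < 0.2] <- NA

# MNAR: the lower the intensity, the higher the chance of going missing
mat_mnar <- mat
p_miss <- plogis(-1.5 * (mat - 23))
mat_mnar[runif(length(mat)) < p_miss] <- NA

# Hypothetical helper: Spearman correlation between per-protein detection rate
# and per-protein mean observed intensity
detection_intensity_cor <- function(m) {
  detect_rate <- rowMeans(!is.na(m))
  mean_int <- rowMeans(m, na.rm = TRUE)
  ok <- is.finite(mean_int)   # drop proteins that were never detected
  cor(detect_rate[ok], mean_int[ok], method = "spearman")
}

cor_mcar <- detection_intensity_cor(mat_mcar)  # expected to be near 0
cor_mnar <- detection_intensity_cor(mat_mnar)  # expected to be clearly positive
```

On real data, replace the simulated matrices with your own log2 matrix; a scatter plot of the two quantities is usually more informative than the single correlation value.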

Option 0 — Leave NAs Alone (The Underrated Default)

Before reaching for imputation, ask: can the analysis tolerate NAs directly?

Many tools can:

limma and DEqMS: handle NAs natively — t-statistic uses available observations per protein
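Welch's t-test in R/Python: R's t.test() drops NAs on its own and is Welch by default; in Python, scipy.stats.ttest_ind needs equal_var=False and nan_policy="omit"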
MSstats: handles NAs with explicit model

When this works:

Apply a valid-value filter (e.g., require ≥3/5 detection per group)
Pass the matrix-with-NAs directly to your statistical test
Don't impute

Strengths: zero bias from imputation, statistically rigorous

When it doesn't work:

Tools that don't handle NAs (PCA, clustering, machine learning — they typically need complete matrices)
One-sided detection (detected in group A, not B): no t-statistic possible; need either qualitative-only reporting or imputation

For a sizable fraction of proteomics analyses, leaving NAs alone + reporting one-sided proteins separately is the right answer — and it's underused because tutorials default to imputation.
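The filter-and-split step can be sketched in base R as follows. This is a minimal illustration for a two-group design; the helper name classify_detection, the status labels, and the toy values are assumptions made for the example.

```r
# Hypothetical helper for a two-group design; `mat` is a log2 matrix with NAs,
# proteins in rows and samples in columns
classify_detection <- function(mat, group, min_valid = 3) {
  group <- factor(group)
  stopifnot(nlevels(group) == 2, length(group) == ncol(mat))
  n_a <- rowSums(!is.na(mat[, group == levels(group)[1], drop = FALSE]))
  n_b <- rowSums(!is.na(mat[, group == levels(group)[2], drop = FALSE]))
  quant <- n_a >= min_valid & n_b >= min_valid
  # one-sided: well detected in one group, completely absent in the other
  one_sided <- (n_a >= min_valid & n_b == 0) | (n_b >= min_valid & n_a == 0)
  ifelse(quant, "quantitative", ifelse(one_sided, "one_sided", "filtered_out"))
}

# Toy example: 4 proteins, 4 samples per group
toy <- rbind(
  p1 = c(20, 21, 20, 22,  19, 20, 21, 20),  # detected everywhere
  p2 = c(24, 25, NA, 24,  NA, NA, NA, NA),  # detected only in group A
  p3 = c(18, NA, NA, 19,  NA, 18, 17, NA),  # too sparse in both groups
  p4 = c(22, 23, 22, 21,  NA, NA, NA, 20)   # sparse, but not absent, in group B
)
group <- rep(c("A", "B"), each = 4)
status <- classify_detection(toy, group)

mat_quant <- toy[status == "quantitative", , drop = FALSE]     # goes to the statistical test
one_sided_ids <- rownames(toy)[status == "one_sided"]          # reported qualitatively
```

Whether a protein like p4 (one stray detection in the second group) should count as one-sided is a judgment call; the strict rule here sends it to the filtered-out bin.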

Option 1 — k-Nearest Neighbors (knn)

For each protein with NAs, find the k most similar proteins (Euclidean distance in detected samples), borrow their values to fill in.

Mathematically reasonable for MCAR. Treats missingness as a random event that nearby data should predict.

library(impute)
imputed <- impute.knn(mat, k = 10)$data

The MNAR problem: if a protein is missing because it's truly low-abundance, its "nearest neighbors" are chosen from the samples where it was detected, and they may be proteins that stay abundant in the samples where it was not. knn then imputes values that are too high, which systematically biases low-abundance proteins upward.

Use knn when:

You're confident missingness is primarily MCAR (rare in proteomics)
You need a complete matrix for PCA or clustering as a quick exploratory step (with caveats noted)

Don't use knn for: differential expression statistics. The MNAR bias inflates type I error.

Option 2 — minDet (Minimum Detected) Imputation

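For each sample, replace NAs with a value at or below that sample's detection limit. The deterministic form uses a fixed value (e.g., the minimum observed value, or some quantile of the lower tail); the Perseus-style form in the code here draws from a narrow normal distribution shifted below the observed mean. Naming varies between tools: the imputeLCMD package calls the fixed-value form MinDet and a stochastic form MinProb, while this guide uses "minDet" for the whole family.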

# Perseus-style minDet: replace NAs in one sample (column) with random values
# drawn from a normal distribution centered `downshift` SDs below the mean of
# the observed values, with its SD shrunk to `width` times the observed SD
impute_min_det <- function(x, downshift = 1.8, width = 0.3) {
  na_idx <- is.na(x)
  obs <- x[!na_idx]
  mu <- mean(obs) - downshift * sd(obs)
  sigma <- width * sd(obs)
  x[na_idx] <- rnorm(sum(na_idx), mean = mu, sd = sigma)
  x
}
set.seed(42)  # random draws: fix the seed so reruns give the same matrix
mat_imputed <- apply(mat, 2, impute_min_det)  # mat: log2 intensities, samples in columns

(The MaxQuant/Perseus defaults are downshift = 1.8 standard deviations, width = 0.3 standard deviations.)

Strengths: respects MNAR — assumes missing means "below detection," fills with plausible low values

Weaknesses:

The downshift parameter is somewhat arbitrary
Adds noise (random draws); reruns give slightly different results — for reproducibility, set a random seed
Can over-impute if some missingness is actually MCAR

Use minDet when:

You need a complete matrix and your missingness is dominantly MNAR (most LC-MS proteomics)
Default for many proteomics pipelines (Perseus uses this style)
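The fixed-value form from the definition at the top of this section (fill with each sample's minimum or a low quantile) has no random draws and so needs no seed. A minimal sketch, where the function name impute_min_quantile and the default q = 0.01 are assumptions chosen for illustration:

```r
# Hypothetical helper: fill each sample's NAs with a low quantile of that
# sample's observed values (q = 0 gives the per-sample minimum)
impute_min_quantile <- function(mat, q = 0.01) {
  apply(mat, 2, function(x) {
    x[is.na(x)] <- quantile(x, probs = q, na.rm = TRUE)
    x
  })
}

# Toy log2 matrix: 5 proteins x 2 samples
toy <- cbind(s1 = c(20, 22, NA, 25, 30),
             s2 = c(NA, 21, 19, NA, 28))
filled <- impute_min_quantile(toy, q = 0)
# s1's NA becomes 20 (minimum of s1); both NAs in s2 become 19 (minimum of s2)
```

The trade-off against the random-draw version: every imputed value in a sample is identical, which removes rerun-to-rerun variation but underestimates variance even more strongly for proteins with several missing values.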

Option 3 — Explicit MNAR Modeling

Statistical methods that explicitly model the missingness mechanism and impute accordingly. Implementations:

MSImpute (R/Bioconductor): proteomics-specific, handles MCAR + MNAR mixed
MICE with logistic missingness model: general framework, configurable
ProteoMM: mixture model for proteomics

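# MSImpute example (mat: log2 matrix, features in rows, samples in columns)
library(msImpute)
# "v2-mnar" needs the sample groups; adjust to your own column order.
# msImpute also expects each row to have at least 4 observed values.
group <- factor(c(rep("Treat", 4), rep("Ctrl", 4)))
mat_imp <- msImpute(mat, method = "v2-mnar", group = group)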

Strengths: theoretically the best calibrated approach — explicitly distinguishes MCAR from MNAR proteins

Weaknesses:

Slower (model fitting per protein)
More parameters to set; getting them wrong can be worse than minDet
Sometimes overfits with very small n

Use explicit MNAR when: you have moderate n (≥5/group), care about subtle effects in low-abundance proteins, and have time to validate the imputation isn't introducing artifacts.

What About Mean / Median Imputation?

Filling NAs with the mean (or median) of the protein's observed values. Don't use this for proteomics.

Reasons:

Treats all missing as MCAR (wrong for proteomics)
Compresses variance (artificially)
Inflates type I error in DEP analysis (Lazar et al., 2016)

Mean/median imputation is the imputation people reach for when in doubt; it's specifically the one to avoid.
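The first two reasons are easy to reproduce on a single protein. The sketch below uses made-up log2 intensities and an assumed detection limit of 20; nothing in it comes from a real dataset.

```r
# True log2 intensities of one low-abundance protein in 10 samples (illustrative)
truth <- c(18.2, 18.9, 19.3, 19.6, 19.9, 20.1, 20.4, 20.8, 21.3, 21.9)

# MNAR: values below an assumed detection limit are not observed
limit <- 20
observed <- ifelse(truth < limit, NA, truth)

# Per-protein mean imputation: fill NAs with the mean of the observed values
imputed <- observed
imputed[is.na(imputed)] <- mean(observed, na.rm = TRUE)

bias <- mean(imputed) - mean(truth)   # about +0.86 log2 units: protein looks too abundant
sd_ratio <- sd(imputed) / sd(truth)   # below 1: variance is compressed
```

An overestimated mean combined with an underestimated variance is exactly the combination that makes a t-statistic look more significant than it should.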

What About Pseudocounts (e.g., +1e-6)?

Absolutely don't. See the cross-species ECM pillar for the full failure analysis. Pseudocounts make fold changes explode when applied to NAs, producing volcano-plot artifacts that look like findings.

If you find yourself reaching for + 1e-6 in log2((A + 1e-6) / (B + 1e-6)), stop and reach for valid-value filter + minDet instead.
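The arithmetic behind the explosion is short. In this sketch the raw intensities (1e6 in group A, an assumed low-tail value of 1e4 for the minDet-style fill) are illustrative numbers only.

```r
pseudo <- 1e-6
a <- 1e6   # raw intensity in group A (illustrative)
b <- 0     # protein not detected in group B, stored as 0

# Pseudocount version: the ratio is dominated by the arbitrary 1e-6
lfc_pseudo <- log2((a + pseudo) / (b + pseudo))   # about 39.9

# Same protein with a low but plausible fill value instead
# (assumption: this sample's low tail sits near a raw intensity of 1e4)
b_fill <- 1e4
lfc_fill <- log2(a / b_fill)                      # about 6.6
```

The pseudocount result depends almost entirely on the size of the pseudocount rather than on the data: changing 1e-6 to 1e-3 would shift the fold change by roughly 10 log2 units.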

Empirical Comparison on Real Data

A cross-species ECM proteomics dataset (n=4 per group, ~4,000 quantified proteins, ~22% NA rate after valid-value filter, mostly MNAR):

| Method | DEPs (adj.p < 0.05, \|log2FC\| ≥ 1) | Extreme DEPs (\|log2FC\| > 5) and likely origin |
|---|---|---|
| Leave NAs (limma) | 397 | 8 (all biological) |
| knn (k=10) | 411 | 24 (mix of artifacts) |
| minDet (Perseus defaults) | 405 | 9 (mostly biological) |
| MSImpute (v2-mnar) | 398 | 7 (biological) |
| Mean imputation | 462 | 35 (mostly artifacts) |
| Pseudocount +1e-6 | 583 | 96 (dominantly artifacts) |

Pattern: leave-NAs and MNAR-aware methods (minDet, MSImpute) cluster around 400 DEPs with few spurious extreme fold changes. knn inflates a bit. Mean/pseudocount inflate dramatically with many artifacts.

The list of "true" DEPs is roughly consistent across the conservative methods (~80-90% overlap). Mean and pseudocount add many false positives but also miss some true signals.

Recommended Workflow by Scenario

Standard DIA-NN / MaxQuant analysis, n=3-5

Filter to valid values (≥3/n detected per group)
Run limma or DEqMS with NAs left in — these tools handle them
Report one-sided detected proteins separately as "qualitative only"
If a complete matrix is needed for PCA/heatmap: impute with minDet, note in figure caption

Larger study, n=10+

Same as above, but you can also try MSImpute as a comparison. If results converge, your data is well-behaved. If they diverge, dig in.
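A simple way to check whether two methods converge is to compare the sets of proteins each one calls significant. A minimal sketch, where dep_overlap is a hypothetical helper and the protein IDs are placeholders:

```r
# Hypothetical helper: overlap between two sets of DEP identifiers
dep_overlap <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  shared <- intersect(a, b)
  c(shared  = length(shared),
    only_a  = length(setdiff(a, b)),
    only_b  = length(setdiff(b, a)),
    jaccard = length(shared) / length(union(a, b)))
}

# Placeholder IDs standing in for the significant proteins from each analysis
deps_leave_na <- c("P1", "P2", "P3", "P4")
deps_msimpute <- c("P2", "P3", "P4", "P5")
overlap <- dep_overlap(deps_leave_na, deps_msimpute)
# shared = 3, only_a = 1, only_b = 1, jaccard = 0.6
```

The proteins in only_a and only_b are the ones worth inspecting by hand: check how many of their values were imputed and whether the fold change survives without them.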

Machine learning downstream (training a classifier on proteins)

Need a complete matrix. Options:

Filter aggressively (e.g., 100% detection across all samples) — loses many proteins but keeps data clean
minDet imputation — pragmatic default
MSImpute — if you can afford the compute
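For the aggressive-filter option it helps to see how many proteins survive at different detection thresholds before committing to one. A short sketch; filter_by_detection is a hypothetical helper and the toy matrix is made up:

```r
# Hypothetical helper: keep proteins detected in at least `min_frac` of all samples
filter_by_detection <- function(mat, min_frac = 1) {
  mat[rowMeans(!is.na(mat)) >= min_frac, , drop = FALSE]
}

# Toy log2 matrix: 3 proteins x 4 samples
toy <- rbind(
  p1 = c(20, 21, 22, 20),   # detected in 4/4
  p2 = c(18, NA, 19, 18),   # detected in 3/4
  p3 = c(NA, NA, 25, NA)    # detected in 1/4
)
complete <- filter_by_detection(toy)         # 100% detection: p1 only
relaxed  <- filter_by_detection(toy, 0.75)   # at least 75%: p1 and p2
```

Only the 100% version yields a matrix with no NAs; any looser threshold still needs an imputation step before the classifier sees the data.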

Sparse data (DDA, >50% NAs)

This is the hardest case. Most imputation methods break down. Options:

Switch to DIA acquisition if possible (substantially less missingness)
Restrict analysis to proteins with strong detection (e.g., ≥80% across samples)
Use MSstats which handles peptide-level data with explicit missing-data model

Code — End-to-End Example with valid-value + minDet

library(limma)

# Load and clean (MaxQuant proteinGroups.txt)
pg <- read.table("proteinGroups.txt", sep = "\t", header = TRUE, quote = "",
                 comment.char = "")
# %in% treats empty/NA flags as "not flagged", so no rows turn into NA
pg <- pg[!(pg$Reverse %in% "+") & !(pg$Potential.contaminant %in% "+"), ]

lfq_cols <- grep("^LFQ.intensity", colnames(pg), value = TRUE)
mat <- as.matrix(pg[, lfq_cols])
rownames(mat) <- pg$Protein.IDs   # keep IDs so results can be traced back
mat[mat == 0] <- NA               # MaxQuant writes missing LFQ values as 0
mat <- log2(mat)

# Define groups from the column names
# (assumes every LFQ column name contains either "Treat" or "Ctrl")
groupA <- grep("Treat", lfq_cols)
groupB <- grep("Ctrl", lfq_cols)
group <- factor(ifelse(grepl("Treat", lfq_cols), "Treat", "Ctrl"),
                levels = c("Ctrl", "Treat"))

# Valid-value filter
valid_A <- rowSums(!is.na(mat[, groupA])) >= 3   # 3 of 4
valid_B <- rowSums(!is.na(mat[, groupB])) >= 3
mat_quant <- mat[valid_A & valid_B, ]

# Optional: minDet imputation per sample if you need complete matrix
set.seed(42)
mat_imputed <- apply(mat_quant, 2, function(x) {
  na_idx <- is.na(x)
  obs <- x[!na_idx]
  mu <- mean(obs) - 1.8 * sd(obs)
  sigma <- 0.3 * sd(obs)
  x[na_idx] <- rnorm(sum(na_idx), mean = mu, sd = sigma)
  x
})

# Statistical test: for limma, use mat_quant (NAs intact); for ML, use mat_imputed
# The design is built from `group`, so it follows the actual column order
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)   # "Ctrl", "Treat"
contrast <- makeContrasts(Treat - Ctrl, levels = design)
fit <- lmFit(mat_quant, design)   # NA-aware
fit <- contrasts.fit(fit, contrast)
fit <- eBayes(fit)
results <- topTable(fit, coef = 1, number = Inf, adjust.method = "BH")

For DEqMS with peptide counts, see limma vs DEqMS for Proteomics.

FAQ

Q: My collaborator says to use mean imputation. Is that ever OK?

For preliminary exploratory analysis where you just need to see clusters, fine. For published differential expression results, no: mean imputation has documented bias that inflates type I error. Show them the Lazar et al. 2016 paper.

Q: Perseus' default: should I keep it?

Perseus' "replace missing values from normal distribution" step is minDet-style imputation with defaults `downshift = 1.8` and `width = 0.3`. For DDA data this is reasonable. For DIA, where missingness is much lower, apply valid-value filtering first and impute only what is left.

Q: How do I tell if my data is MCAR vs MNAR?

Plot the relationship between detection frequency and mean intensity per protein. If they're correlated (less detected = lower intensity), missingness is MNAR. In proteomics this correlation is almost always strong.

Q: Does DIA-NN have less missing data than MaxQuant?

Yes. DIA-NN's spectral library + Match-Between-Runs reduces missingness significantly (often to <10% per protein across n=4 samples). DDA outputs frequently have 20-40% missingness. This is a major reason DIA has become dominant.

Q: Should I impute before or after log-transformation?

After. Impute on log-transformed values so the imputed distribution matches the log-normal nature of MS intensities. Imputing on raw intensity creates further bias.

Q: Should imputation be done before or after batch correction?

Both orderings have proponents. The safer default: filter to valid values first, batch-correct on the matrix-with-NAs (limma's `removeBatchEffect()` handles NAs), then impute if needed for downstream steps that can't handle NAs.

Q: What about Random Forest imputation (e.g., missForest)?

Powerful for many domains but slow and sometimes overfits in proteomics. Some papers report success but it's not standard. For most cases, the simpler minDet or MSImpute is fine.

Q: My results change every time I rerun. What's going on?

Imputation methods that draw random values (Perseus-style minDet, MICE) are non-deterministic by default. Call `set.seed()` before imputation for reproducibility. Knn and mean imputation are deterministic.

Closing — The Decision Tree

NAs in your proteomics matrix?
├── Statistical test only needed → leave NAs (limma/DEqMS handle them) + report
│                                  one-sided proteins separately as qualitative
├── PCA/heatmap/ML downstream needed → impute with minDet (set.seed)
└── Subtle effects in low-abundance proteins matter → MSImpute or explicit MNAR

Never use mean imputation or pseudocounts for proteomics. Avoid knn unless your missingness is verified MCAR. Default to valid-value filter + leave-NAs for stats, minDet only when complete matrix is required.

References:

Lazar, C. et al. (2016). Accounting for the multiple natures of missing values in label-free quantitative proteomics. J Proteome Research, 15, 1116-1125.
Webb-Robertson, B.-J. et al. (2015). Review, evaluation, and discussion of the challenges of missing value imputation for mass spectrometry-based proteomics. J Proteome Research, 14, 1993-2001.
Hediyeh-Zadeh, S. et al. (2023). MSImpute: estimation of missing peptide intensity data in label-free quantitative mass spectrometry. Molecular & Cellular Proteomics, 22, 100558.
Tyanova, S. et al. (2016). The Perseus computational platform for comprehensive analysis of (prote)omics data. Nature Methods, 13, 731-740.