Imputing Missing Values in Proteomics — knn vs minDet vs MNAR — What Actually Works
Proteomics datasets are full of NAs, and how you handle them can change which proteins land on your differentially expressed protein (DEP) list. This guide compares four real-world options: leaving NAs alone, k-nearest neighbors (knn), minimum detected (minDet), and explicit MNAR (Missing Not At Random) imputation. It closes with practical recommendations for DIA-NN and MaxQuant outputs.
The Single Decision That Reshapes Your Volcano Plot
You load proteinGroups.txt or report.pg_matrix.tsv. Half your intensity matrix is NA. Now what?
The choice you make here — leave NAs, knn impute, minDet impute, or MNAR-aware imputation — can change your differential expression results by 30-50%. There's no universally right answer, but there are clearly wrong defaults that ship in many tutorials.
This guide is the practical decision tree: what each imputation method does, where each fails, and which to pick for which data type (DIA-NN vs DDA, n=3 vs n=30, sparse vs dense matrix).
It's the imputation-specific sibling of the cross-species ECM pillar, which covered why pseudocounts are the worst possible imputation choice.
Why Proteomics Has So Many NAs
Three independent causes of missingness, often mixed in one dataset:
1. MCAR — Missing Completely At Random
Random instrumental noise, spray instability, occasional MS/MS misassignment. Not correlated with abundance. Knn or mean imputation handles these well.
In modern LC-MS proteomics, MCAR is rare — most missingness has structure.
2. MNAR — Missing Not At Random
A protein wasn't detected because its concentration was below the detection limit. The "NA" carries real information: this sample had low abundance.
For DIA: most MNAR is in the low-intensity tail. For DDA: heavy MNAR throughout because of the data-dependent selection bias.
3. MAR — Missing At Random (conditional)
Missingness depends on observed covariates (e.g., a particular instrument batch had different sensitivity). Less common in proteomics than MCAR/MNAR, but exists.
Practical takeaway: in proteomics, missingness is mostly MNAR (low-abundance below detection), some MCAR (instrumental noise). Imputation methods designed for MCAR (knn, mean, median) introduce bias when applied to MNAR data — they pull genuinely low-abundance points toward the dataset average.
Option 0 — Leave NAs Alone (The Underrated Default)
Before reaching for imputation, ask: can the analysis tolerate NAs directly?
Many tools can:
- limma and DEqMS: handle NAs natively — t-statistic uses available observations per protein
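- Welch's t-test in R/Python: R's t.test() drops NAs on its own (na.rm = TRUE is only needed for helpers such as mean() and sd()); SciPy's ttest_ind needs nan_policy = "omit"
- MSstats: handles NAs with an explicit model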
When this works:
- Apply a valid-value filter (e.g., require ≥3/5 detection per group)
- Pass the matrix-with-NAs directly to your statistical test
- Don't impute
Strengths: zero bias from imputation, statistically rigorous
When it doesn't work:
- Tools that don't handle NAs (PCA, clustering, machine learning — they typically need complete matrices)
- One-sided detection (detected in group A, not B): no t-statistic possible; need either qualitative-only reporting or imputation
For a sizable fraction of proteomics analyses, leaving NAs alone + reporting one-sided proteins separately is the right answer — and it's underused because tutorials default to imputation.
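As a rough sketch of the leave-NAs route, the snippet below applies a valid-value filter, sets one-sided proteins aside, and runs a per-protein Welch test straight on the matrix with NAs. The toy matrix, the protein and sample names, and the `min_valid` threshold are all made up for illustration; only base R is used.

```r
# Toy log2 intensity matrix: 5 proteins x 8 samples (4 per group).
# All names and values are illustrative, not from a real dataset.
mat <- matrix(
  c(20, 21, 20, 22,   23, 24, 23, 25,   # P1: complete in both groups
    18, NA, 19, 18,   20, 21, NA, 20,   # P2: 3 of 4 in both groups
    17, 18, 17, 16,   NA, NA, NA, NA,   # P3: detected only in group A
    NA, NA, 15, NA,   16, NA, NA, 15,   # P4: too sparse everywhere
    NA, NA, NA, NA,   19, 20, 19, NA),  # P5: detected only in group B
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("P", 1:5),
                  c(paste0("A", 1:4), paste0("B", 1:4)))
)
group <- rep(c("A", "B"), each = 4)
min_valid <- 3  # assumed threshold: at least 3 of 4 per group

# Count detections per protein within each group
n_A <- rowSums(!is.na(mat[, group == "A", drop = FALSE]))
n_B <- rowSums(!is.na(mat[, group == "B", drop = FALSE]))

# Quantifiable: enough values in both groups
quantifiable <- n_A >= min_valid & n_B >= min_valid
# One-sided: well detected in one group, never in the other
one_sided <- (n_A >= min_valid & n_B == 0) | (n_B >= min_valid & n_A == 0)

mat_quant <- mat[quantifiable, , drop = FALSE]
qualitative <- rownames(mat)[one_sided]

# Welch t-test per protein; t.test() drops NAs within each group
p_values <- apply(mat_quant, 1, function(x) {
  t.test(x[group == "A"], x[group == "B"])$p.value
})
```

Here P1 and P2 go to the statistical test, P3 and P5 are reported as qualitative only, and P4 is dropped. No value in the matrix is ever invented.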
Option 1 — k-Nearest Neighbors (knn)
For each protein with NAs, find the k most similar proteins (Euclidean distance in detected samples), borrow their values to fill in.
Mathematically reasonable for MCAR. Treats missingness as a random event that nearby data should predict.
library(impute)
imputed <- impute.knn(mat, k = 10)$data
The MNAR problem: if a protein is missing because it's truly low-abundance, its "nearest neighbors" might be high-abundance proteins with similar expression patterns elsewhere, so knn imputes values that are too high. The result is a systematic upward bias for low-abundance proteins.
Use knn when:
- You're confident missingness is primarily MCAR (rare in proteomics)
- You need a complete matrix for PCA or clustering as a quick exploratory step (with caveats noted)
Don't use knn for: differential expression statistics. The MNAR bias inflates type I error.
Option 2 — minDet (Minimum Detected) Imputation
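For each sample, replace NAs with a value at or below that sample's detection limit (e.g., the minimum observed value, or some quantile of the lower tail). The variant shown here is the stochastic Perseus-style one: instead of filling in one fixed value, it draws from a narrow normal distribution shifted down from the observed values.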
# Perseus-style minDet: replace NAs with random values drawn from
# a normal distribution centered `downshift` SDs below the mean of the
# sample's observed values, with a small spread (`width` SDs)
impute_min_det <- function(x, downshift = 1.8, width = 0.3) {
  na_idx <- is.na(x)
  obs <- x[!na_idx]                      # observed values only
  mu <- mean(obs) - downshift * sd(obs)  # center of the imputed distribution
  sigma <- width * sd(obs)               # spread of the imputed distribution
  x[na_idx] <- rnorm(sum(na_idx), mean = mu, sd = sigma)
  return(x)
}
set.seed(42)  # random draws: fix the seed for reproducible output
mat_imputed <- apply(mat, 2, impute_min_det)  # one column = one sample
(The MaxQuant/Perseus defaults are downshift = 1.8 standard deviations, width = 0.3 standard deviations.)
Strengths: respects MNAR — assumes missing means "below detection," fills with plausible low values
Weaknesses:
- The downshift parameter is somewhat arbitrary
- Adds noise (random draws); reruns give slightly different results — for reproducibility, set a random seed
- Can over-impute if some missingness is actually MCAR
Use minDet when:
- You need a complete matrix and your missingness is dominantly MNAR (most LC-MS proteomics)
- Default for many proteomics pipelines (Perseus uses this style)
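For comparison, here is a minimal sketch of the deterministic flavor, where every NA in a sample gets the same low value. The helper name `impute_min_det_fixed` and its `q` parameter are hypothetical, and the 2-sample matrix is invented for the example.

```r
# Hypothetical helper: fill NAs in one sample with a fixed low quantile
# of that sample's observed values (q = 0 means the observed minimum).
impute_min_det_fixed <- function(x, q = 0.01) {
  if (all(is.na(x))) return(x)  # nothing observed: leave the sample untouched
  x[is.na(x)] <- quantile(x, probs = q, na.rm = TRUE, names = FALSE)
  x
}

# Toy log2 matrix, filled column by column: 4 proteins x 2 samples
mat <- matrix(c(20, 22, NA, 25,
                18, NA, 21, 24),
              ncol = 2,
              dimnames = list(paste0("P", 1:4), c("S1", "S2")))

# Apply per sample (column); q = 0 uses each sample's minimum
mat_filled <- apply(mat, 2, impute_min_det_fixed, q = 0)
```

The deterministic version needs no seed and is fully reproducible, but it stacks all imputed values on a single point per sample. That shrinks the variance more than the random-draw version does, so the choice between the two is a trade-off, not a clear win.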
Option 3 — Explicit MNAR Modeling
Statistical methods that explicitly model the missingness mechanism and impute accordingly. Implementations:
- MSImpute (R/Bioconductor): proteomics-specific, handles MCAR + MNAR mixed
- MICE with logistic missingness model: general framework, configurable
- ProteoMM: mixture model for proteomics
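# msImpute example (the package name is lowercase "msImpute")
# The "v2-mnar" method uses the experimental group of each sample;
# check ?msImpute for the exact arguments in your installed version
library(msImpute)
group <- factor(c(rep("Treat", 4), rep("Ctrl", 4)))  # must match the column order of mat
mat_imp <- msImpute(mat, method = "v2-mnar", group = group)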
Strengths: theoretically the best calibrated approach — explicitly distinguishes MCAR from MNAR proteins
Weaknesses:
- Slower (model fitting per protein)
- More parameters to set; getting them wrong can be worse than minDet
- Sometimes overfits with very small n
Use explicit MNAR when: you have moderate n (≥5/group), care about subtle effects in low-abundance proteins, and have time to validate the imputation isn't introducing artifacts.
What About Mean / Median Imputation?
Filling NAs with the mean (or median) of the protein's observed values. Don't use this for proteomics.
Reasons:
- Treats all missing as MCAR (wrong for proteomics)
- Compresses variance (artificially)
- Inflates type I error in DEP analysis (Lazar et al., 2016)
Mean/median imputation is the imputation people reach for when in doubt; it's specifically the one to avoid.
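A small simulation makes both failure modes visible. This is only a sketch under stated assumptions: one hypothetical protein, 200 simulated samples, and a hard detection limit at log2 = 19.5, none of which come from a real dataset.

```r
set.seed(1)

# One hypothetical protein measured in 200 samples (log2 scale)
true_vals <- rnorm(200, mean = 20, sd = 1)

# Assumed hard detection limit: everything below it becomes NA (pure MNAR)
lod <- 19.5
observed <- ifelse(true_vals < lod, NA, true_vals)

# Mean imputation: fill every NA with the mean of the observed values
mean_imputed <- observed
mean_imputed[is.na(observed)] <- mean(observed, na.rm = TRUE)

# Upward bias: the imputed mean sits above the true mean
bias <- mean(mean_imputed) - mean(true_vals)

# Variance compression: the imputed SD is smaller than the true SD
sd_ratio <- sd(mean_imputed) / sd(true_vals)
```

With left-censored data, `bias` comes out positive and `sd_ratio` falls below 1. A test statistic computed on such values sees a protein that looks both more abundant and less variable than it really is.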
What About Pseudocounts (e.g., +1e-6)?
Absolutely don't. See the cross-species ECM pillar for the full failure analysis. Pseudocounts make fold changes explode when applied to NAs, producing volcano-plot artifacts that look like findings.
If you find yourself reaching for + 1e-6 in log2((A + 1e-6) / (B + 1e-6)), stop and reach for valid-value filter + minDet instead.
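The arithmetic is worth seeing once. The intensities below are hypothetical round numbers; the pseudocount is the 1e-6 from the formula above.

```r
pseudo <- 1e-6

# Case 1: detected in both groups (hypothetical raw intensities)
A_both <- 1e6
B_both <- 5e5
lfc_both <- log2((A_both + pseudo) / (B_both + pseudo))
# essentially log2(2) = 1: the pseudocount is harmless here

# Case 2: detected in A, missing in B (exported as 0)
A_only <- 1e6
B_missing <- 0
lfc_missing <- log2((A_only + pseudo) / (B_missing + pseudo))
# roughly log2(1e12), i.e. just under 40: a fold change set by the pseudocount
```

In the second case the "fold change" is determined almost entirely by the arbitrary choice of 1e-6. Swap in 1e-3 and the same protein drops by about 10 log2 units.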
Empirical Comparison on Real Data
A cross-species ECM proteomics dataset (n=4 per group, ~4,000 quantified proteins, ~22% NA rate after valid-value filter, mostly MNAR):
| Method | DEPs (adj.p<0.05, abs(log2FC)≥1) | Spurious extreme DEPs (abs(log2FC)>5) |
|---|---|---|
| Leave NAs (limma) | 397 | 8 (all biological) |
| knn (k=10) | 411 | 24 (mix of artifacts) |
| minDet (Perseus defaults) | 405 | 9 (mostly biological) |
| MSImpute (v2-mnar) | 398 | 7 (biological) |
| Mean imputation | 462 | 35 (mostly artifacts) |
| Pseudocount +1e-6 | 583 | 96 (dominantly artifacts) |
Pattern: leave-NAs and MNAR-aware methods (minDet, MSImpute) cluster around 400 DEPs with few spurious extreme fold changes. knn inflates a bit. Mean/pseudocount inflate dramatically with many artifacts.
The list of "true" DEPs is roughly consistent across the conservative methods (~80-90% overlap). Mean and pseudocount add many false positives but also miss some true signals.
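To check overlap on your own data, something like the following sketch may help. The DEP lists are invented placeholders, and the Jaccard index is just one reasonable way to define overlap, not necessarily the measure behind the percentages quoted above.

```r
# Hypothetical DEP lists from three imputation strategies
dep_lists <- list(
  leave_na = c("P1", "P2", "P3", "P4", "P5"),
  min_det  = c("P1", "P2", "P3", "P4", "P6"),
  mean_imp = c("P1", "P2", "P7", "P8", "P9", "P10")
)

# Jaccard index: shared proteins divided by all proteins in either list
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Pairwise overlap matrix (rows and columns named after the methods)
overlap <- sapply(dep_lists, function(a) {
  sapply(dep_lists, function(b) jaccard(a, b))
})

# Proteins called by only one method deserve a manual look
only_mean <- setdiff(dep_lists$mean_imp,
                     union(dep_lists$leave_na, dep_lists$min_det))
```

Proteins that appear under one method only, like `only_mean` here, are the first candidates for imputation artifacts.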
Recommended Workflow by Scenario
Standard DIA-NN / MaxQuant analysis, n=3-5
- Filter to valid values (≥3/n detected per group)
- Run limma or DEqMS with NAs left in — these tools handle them
- Report one-sided detected proteins separately as "qualitative only"
- If a complete matrix is needed for PCA/heatmap: impute with minDet, note in figure caption
Larger study, n=10+
Same as above, but you can also try MSImpute as a comparison. If results converge, your data is well-behaved. If they diverge, dig in.
Machine learning downstream (training a classifier on proteins)
Need a complete matrix. Options:
- Filter aggressively (e.g., 100% detection across all samples) — loses many proteins but keeps data clean
- minDet imputation — pragmatic default
- MSImpute — if you can afford the compute
Sparse data (DDA, >50% NAs)
This is the hardest case. Most imputation methods break down. Options:
- Switch to DIA acquisition if possible (substantially less missingness)
- Restrict analysis to proteins with strong detection (e.g., ≥80% across samples)
- Use MSstats which handles peptide-level data with explicit missing-data model
Code — End-to-End Example with valid-value + minDet
library(limma)

# Load and clean
pg <- read.table("proteinGroups.txt", sep = "\t", header = TRUE,
                 quote = "", comment.char = "", stringsAsFactors = FALSE)
# Drop decoys and contaminants; %in% treats an NA flag as "not flagged"
pg <- pg[!(pg$Reverse %in% "+") & !(pg$Potential.contaminant %in% "+"), ]
lfq_cols <- grep("^LFQ.intensity", colnames(pg), value = TRUE)
mat <- as.matrix(pg[, lfq_cols])
rownames(mat) <- pg$Protein.IDs  # keep protein IDs in the results table
mat[mat == 0] <- NA              # MaxQuant writes "not detected" as 0
mat <- log2(mat)

# Define groups (assumes every LFQ column is either Treat or Ctrl)
groupA <- grep("Treat", lfq_cols)
groupB <- grep("Ctrl", lfq_cols)

# Valid-value filter
valid_A <- rowSums(!is.na(mat[, groupA])) >= 3  # 3 of 4
valid_B <- rowSums(!is.na(mat[, groupB])) >= 3
mat_quant <- mat[valid_A & valid_B, ]

# Optional: minDet imputation per sample if you need a complete matrix
set.seed(42)
mat_imputed <- apply(mat_quant, 2, function(x) {
  na_idx <- is.na(x)
  obs <- x[!na_idx]
  mu <- mean(obs) - 1.8 * sd(obs)
  sigma <- 0.3 * sd(obs)
  x[na_idx] <- rnorm(sum(na_idx), mean = mu, sd = sigma)
  x
})

# Statistical test: for limma, use mat_quant (NAs intact); for ML, use mat_imputed
# Build the design from the column names so it always matches the column order
group <- factor(ifelse(seq_along(lfq_cols) %in% groupA, "Treat", "Ctrl"),
                levels = c("Ctrl", "Treat"))
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)
contrast <- makeContrasts(Treat - Ctrl, levels = design)
fit <- lmFit(mat_quant, design)  # NA-aware
fit <- contrasts.fit(fit, contrast)
fit <- eBayes(fit)
results <- topTable(fit, coef = 1, number = Inf, adjust.method = "BH")
For DEqMS with peptide counts, see limma vs DEqMS for Proteomics.
FAQ
Q: My collaborator says to use mean imputation. Is that ever OK?
For preliminary exploratory analysis where you just need to see clusters, fine. For published differential expression results, no: mean imputation has documented bias that inflates type I error. Show them the Lazar et al. 2016 paper.
Q: Perseus' default — should I keep it?
Perseus' default is minDet-style imputation (downshift = 1.8, width = 0.3). For DDA data this is reasonable. For DIA, apply a valid-value filter first: missingness is much lower, so far fewer values need imputing at all.
Q: How do I tell if my data is MCAR vs MNAR?
Plot the relationship between detection frequency and mean intensity per protein. If they're correlated (less detected = lower intensity), missingness is MNAR. In proteomics this correlation is almost always strong.
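A minimal version of that diagnostic, run here on simulated data so it is self-contained. The detection limit of log2 = 20 and all simulation settings are assumptions; on real data you would start from your own log2 matrix at the `detect_rate` line.

```r
set.seed(7)
n_prot <- 500
n_samp <- 8

# Simulated log2 intensities: each protein has its own abundance level
abundance <- rnorm(n_prot, mean = 22, sd = 2)
mat <- matrix(rnorm(n_prot * n_samp, mean = rep(abundance, n_samp), sd = 0.5),
              nrow = n_prot)

# Assumed detection limit: values below it go missing (MNAR by construction)
mat[mat < 20] <- NA

# Per-protein detection frequency and mean observed intensity
detect_rate <- rowMeans(!is.na(mat))
mean_int <- rowMeans(mat, na.rm = TRUE)

# Proteins never detected have no mean intensity; drop them
keep <- detect_rate > 0

# Rank correlation: clearly positive values point to MNAR
rho <- cor(detect_rate[keep], mean_int[keep], method = "spearman")

# Visual check (uncomment when working interactively):
# plot(mean_int[keep], detect_rate[keep],
#      xlab = "Mean log2 intensity", ylab = "Detection frequency")
```

If `rho` is near zero on your data, missingness looks closer to MCAR. A clearly positive value means low-intensity proteins are the ones going missing.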
Q: Does DIA-NN have less missing data than MaxQuant?
Yes. DIA-NN's spectral library + Match-Between-Runs reduces missingness significantly (often to <10% per protein across n=4 samples). DDA outputs frequently have 20-40% missingness. This is a major reason DIA has become dominant.
Q: Should I impute before or after log-transformation?
After. Impute on log-transformed values so the imputed distribution matches the log-normal nature of MS intensities. Imputing on raw intensity creates further bias.
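One way to see why, using the Perseus-style downshift as the example: on the raw scale the standard deviation is dominated by a few very intense proteins, so "mean minus 1.8 SD" can land below zero. The simulated intensities are an assumption, chosen only to look roughly log-normal.

```r
set.seed(3)

# Simulated intensities: normal on the log2 scale, log-normal on the raw scale
log2_int <- rnorm(2000, mean = 25, sd = 2)
raw_int <- 2^log2_int

# Center of the imputed distribution computed on each scale
mu_log <- mean(log2_int) - 1.8 * sd(log2_int)  # a plausible low log2 intensity
mu_raw <- mean(raw_int) - 1.8 * sd(raw_int)    # negative: not a valid intensity

# A negative raw intensity cannot be log-transformed afterwards
raw_center_is_usable <- mu_raw > 0
```

On the log2 scale the center lands a little above 21, a believable low-abundance value. On the raw scale it is negative, so the imputed values would be meaningless before any statistics are run.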
Q: Should imputation be done before or after batch correction?
Both orderings have proponents. The safer default: filter to valid values first, batch-correct on the matrix-with-NAs (limma's removeBatchEffect() handles NAs), then impute if needed for downstream that can't handle NAs.
Q: What about Random Forest imputation (e.g., missForest)?
Powerful for many domains but slow and sometimes overfits in proteomics. Some papers report success but it's not standard. For most cases, the simpler minDet or MSImpute is fine.
Q: My results change every time I rerun — what's going on?
Imputation methods that draw random values (minDet, MICE) are non-deterministic by default. Set set.seed() before imputation for reproducibility. Knn and mean are deterministic.
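A quick way to convince yourself, sketched with a hypothetical helper `impute_downshift` that mirrors the Perseus-style draw used earlier in this post. The six-value sample is invented.

```r
# Hypothetical helper: Perseus-style downshifted random imputation
impute_downshift <- function(x, downshift = 1.8, width = 0.3) {
  na_idx <- is.na(x)
  obs <- x[!na_idx]
  x[na_idx] <- rnorm(sum(na_idx),
                     mean = mean(obs) - downshift * sd(obs),
                     sd = width * sd(obs))
  x
}

x <- c(21.3, NA, 24.8, 22.1, NA, 23.5)  # toy log2 intensities for one sample

set.seed(42)
run1 <- impute_downshift(x)

set.seed(42)
run2 <- impute_downshift(x)  # same seed: same draws

run3 <- impute_downshift(x)  # no reseed: the generator has moved on

same_with_seed <- identical(run1, run2)
same_without_seed <- identical(run1, run3)
```

`same_with_seed` is TRUE and `same_without_seed` is FALSE. Put the `set.seed()` call immediately before the imputation step, since any random call in between shifts the draws.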
Closing — The Decision Tree
NAs in your proteomics matrix?
├── Statistical test only needed → leave NAs (limma/DEqMS handle them) + report
│ one-sided proteins separately as qualitative
├── PCA/heatmap/ML downstream needed → impute with minDet (set.seed)
└── Subtle effects in low-abundance proteins matter → MSImpute or explicit MNAR
Never use mean imputation or pseudocounts for proteomics. Avoid knn unless your missingness is verified MCAR. Default to valid-value filter + leave-NAs for stats, minDet only when complete matrix is required.
Related posts:
- limma vs DEqMS for Proteomics — When to Use Which
- Reproducing Park et al. 2026 — Cross-Species ECM Proteomics, Three Iterations
- From DIA-NN Output to Paper Draft: AI-Assisted Proteomics Workflow
- Statistical Test Selection Guide — t-test, limma, ANOVA
- FragPipe vs MaxQuant — 2026 Real Speed Benchmarks
References:
- Lazar, C. et al. (2016). Accounting for the multiple natures of missing values in label-free quantitative proteomics. J Proteome Research, 15, 1116-1125.
- Webb-Robertson, B.-J. et al. (2015). Review, evaluation, and discussion of the challenges of missing value imputation for mass spectrometry-based proteomics. J Proteome Research, 14, 1993-2001.
- Hediyeh-Zadeh, S. et al. (2023). MSImpute: estimation of missing peptide intensity data. Briefings in Bioinformatics.
- Tyanova, S. et al. (2016). The Perseus computational platform for comprehensive analysis of (prote)omics data. Nature Methods, 13, 731-740.