Biomarker Research

Proteomics-Based Biomarker Discovery in the AI Era: A Critical Review of What's Working and What Isn't

A frank look at where proteomics biomarker discovery stands in 2026 — the genuine advances from AI/ML integration, the persistent failure modes that have plagued the field for 20 years, and the real-world workflows that are starting to produce clinically translatable results.


[Figure: proteomics biomarker discovery lab workflow]


I've been working in quantitative proteomics for eight years. I've seen biomarker candidates that looked bulletproof fail completely in external validation. I've watched laboratories spend three years and significant grant money pursuing proteins that turned out to be artifacts of sample handling.

I've also watched AI-assisted approaches identify candidates in two weeks that manual analysis would have missed entirely — candidates that held up in follow-up work.

This review is my attempt to give an honest account of where the field actually is in 2026: what the AI revolution in proteomics has genuinely delivered, what it hasn't, and how to run discovery projects that have a realistic chance of producing something clinically useful.


Why Proteomics for Biomarker Discovery?

Before getting into the AI piece, it's worth being explicit about why mass spectrometry-based proteomics has become central to biomarker discovery, and what its genuine limitations are.

The proteome is where biology happens. Genomic variants matter, but their effects are mediated through proteins. Transcriptomic signatures capture gene expression, but mRNA levels correlate imperfectly with protein abundance — correlation coefficients of 0.4–0.6 are typical. The protein is often the more direct readout of disease state.

The dynamic range problem is real. Human plasma contains proteins spanning a concentration range of roughly 10 orders of magnitude. Albumin circulates at ~40 mg/mL. Some cytokines of potential biomarker interest are present at pg/mL levels — a billion-fold lower. This isn't a solved problem. Deep proteome coverage still requires extensive depletion strategies, fractionation, or highly sensitive targeted approaches, each of which introduces its own sources of variability.
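Putting representative numbers on that span (the 40 pg/mL cytokine figure is an illustrative assumption):

```python
import math

albumin = 40e-3    # g/mL -- albumin circulating at ~40 mg/mL
cytokine = 40e-12  # g/mL -- a low-abundance cytokine at ~40 pg/mL (illustrative)

fold = albumin / cytokine
print(f"fold difference: {fold:.0e}")                  # 1e+09, i.e. a billion-fold
print(f"orders of magnitude: {math.log10(fold):.0f}")  # 9
```

The full plasma proteome stretches further still, since the lowest-abundance cytokines sit below the pg/mL level.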

The proteome is more dynamic than the genome. This is both an advantage (proteins respond to disease states in ways fixed genomic variants cannot) and a challenge (pre-analytical variability is a constant source of confounding).


The Standard Discovery Workflow — and Its Failure Modes

Let me describe the conventional proteomics biomarker discovery workflow before discussing where AI enters, because understanding where the failures occur is essential.

Sample collection → Protein extraction → Depletion/fractionation
→ Tryptic digestion → LC-MS/MS acquisition → Database search
→ Normalization → Statistical comparison → Candidate selection
→ (In many cases, this is where the project ends)
→ Targeted validation → External cohort validation
→ Clinical utility assessment

In practice, the pipeline typically terminates somewhere between "candidate selection" and "targeted validation." The literature is full of papers that stop at "we identified 12 candidate biomarkers" with a receiver operating characteristic curve on the same cohort used for discovery. These candidates are essentially meaningless.

The Three Failure Modes I See Most Often

Failure Mode 1: Ignoring pre-analytical variability

A plasma proteomic study I'm aware of — I won't name the lab — identified an osteopontin-related protein as a highly significant biomarker for early-stage pancreatic cancer. The AUC in the discovery cohort was 0.91. In the validation cohort, it collapsed to 0.61.

The post-hoc analysis found that discovery samples had been collected in EDTA tubes and processed within 2 hours, while validation samples came from a biobank where processing time varied (2–24 hours). Osteopontin is exquisitely sensitive to the platelet activation that occurs during prolonged processing, which releases it into plasma.

The biomarker was real — just not for pancreatic cancer. It was a biomarker for "how long between blood draw and centrifugation."

This specific failure mode was documented systematically by Lehmann et al. in a 2019 Journal of Proteome Research paper that should be required reading for anyone starting a plasma proteomics project: doi:10.1021/acs.jproteome.9b00586.

Failure Mode 2: Optimistic statistical reporting

The prototypical example: a discovery cohort of 40 patients (20 cases, 20 controls). Untargeted proteomics identifies 3,000 proteins. Statistical analysis at FDR 5% yields 150 candidates. The top 10 are assembled into a panel. Panel AUC in the same 40-person cohort: 0.95.

None of this is necessarily fraud. It's a consequence of building and evaluating a model on the same data. With 10 proteins and 40 samples, the effective degrees of freedom allow the model to essentially memorize the training set.

The correct approach requires either a prospectively held-out test set or a genuine external validation cohort — ideally both. This is described clearly in the TRIPOD guidelines (www.tripod-statement.org), but is still violated routinely in the literature.
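The optimism is easy to reproduce by simulation: pure noise, with feature selection and evaluation on the same 40 samples (a sketch mirroring the cohort sizes above):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n, p = 40, 3000
X = rng.standard_normal((n, p))      # pure noise "proteins"
y = np.array([0] * 20 + [1] * 20)    # 20 cases, 20 controls

# Select the top 10 features by t-test on the SAME cohort
_, pvals = stats.ttest_ind(X[y == 0], X[y == 1])
top10 = np.argsort(pvals)[:10]

model = LogisticRegression(max_iter=1000).fit(X[:, top10], y)
auc_in = roc_auc_score(y, model.predict_proba(X[:, top10])[:, 1])

# "External validation": fresh noise, same selected features
X_new = rng.standard_normal((n, p))
auc_ext = roc_auc_score(y, model.predict_proba(X_new[:, top10])[:, 1])

print(f"in-cohort AUC: {auc_in:.2f}")   # typically >0.95 despite zero signal
print(f"external AUC:  {auc_ext:.2f}")  # hovers around chance (~0.5)
```

The in-cohort AUC looks publishable; the external AUC tells the truth.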

Failure Mode 3: Ignoring confounders

Age, sex, BMI, smoking status, medications, comorbidities — any of these can drive proteomic differences that are entirely unrelated to the disease of interest. In a case-control study where cases are diagnosed (and therefore receiving treatment, changing diet, losing weight) and controls are healthy community members, you're measuring many things at once.

The solution isn't always to perfectly match cases and controls (which is often impractical and can introduce its own biases), but to include these variables in the statistical model and to explicitly test whether your candidate markers maintain significance after adjustment.
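A minimal version of that adjustment test, using statsmodels on simulated data (variable names and effect sizes are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(65, 8, n),
    "sex": rng.integers(0, 2, n),
    "disease": rng.integers(0, 2, n),
})
# Simulated marker: driven partly by age, partly by disease status
df["marker"] = 0.05 * df["age"] + 0.8 * df["disease"] + rng.normal(0, 1, n)

# Does the marker maintain significance after covariate adjustment?
unadj = smf.logit("disease ~ marker", data=df).fit(disp=0)
adj = smf.logit("disease ~ marker + age + sex", data=df).fit(disp=0)
print(unadj.pvalues["marker"], adj.pvalues["marker"])
```

A candidate whose p-value survives the adjusted model is worth following up; one that evaporates after adjustment was probably measuring the covariate.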


Where AI Has Actually Made a Difference

With the failure modes in mind, let me be specific about where AI/ML approaches have genuinely advanced the field.

1. Spectral Prediction and Library Generation

This might be the single most consequential AI advance for proteomics in the past five years, and it's one that's easy to underappreciate because it's "under the hood" of tools like DIA-NN.

The problem: DIA data analysis historically required a spectral library — a reference collection of experimentally measured spectra for peptides you want to detect. Building these libraries was expensive, time-consuming, and specific to instrument type and fragmentation conditions.

The AI solution: Neural network-based spectral predictors (Prosit, DeepMass, DIA-NN's internal predictor) can now predict fragment ion intensities and retention times for arbitrary peptide sequences with accuracy that approaches experimental measurement.

Gessulat et al. published the original Prosit work in Nature Methods in 2019 (doi:10.1038/s41592-019-0426-7). The implications were significant: any proteome, any species, any modification set — generate an in silico library from a FASTA file and analyze.

In practice, I now routinely run DIA experiments without any experimental library at all. Identification numbers with predicted libraries are typically within 5–15% of experimental libraries for well-studied organisms, and often better for organisms with limited experimental data.

# DIA-NN with a purely in silico library -- no experimental library required.
# --gen-spec-lib generates a predicted library; --predictor enables the
# neural network spectral predictor. (Shell comments can't follow a trailing
# backslash, so flags are annotated here; --f takes one file per flag,
# --dir a whole directory of raw files.)
diann \
  --dir ./raw \
  --fasta human_reviewed.fasta \
  --gen-spec-lib \
  --predictor \
  --smart-profiling \
  --qvalue 0.01 \
  --threads 16

2. Improved Missing Value Handling

Missing values are endemic to proteomics datasets. A protein detected in 7 out of 10 samples isn't unusual — it might be at the detection limit, or the signal might just barely fall below the extraction threshold in some runs.

Traditional approaches: exclude proteins with high missingness, or impute with minimum observed value, or impute with k-nearest neighbors.

The AI advance: Several methods now use probabilistic models or neural networks for imputation that are substantially better than these heuristics.

DreamAI (doi:10.1021/acs.jproteome.0c00521) ensembles several imputation models, including neural network-based methods, and was one of the first to systematically benchmark imputation approaches on real proteomics datasets with known ground truth. Their conclusion: no single method wins universally, but learning-based approaches tend to outperform simple heuristics for missing-not-at-random data.

library(DreamAI)

# Impute proteomics matrix with DreamAI
# Input: protein matrix with NAs
# Recommended: use 'KNN+RF+Seq+Brnn' ensemble

imputed <- impute.DreamAI(
  data = protein_matrix,      # Rows = proteins, columns = samples
  k = 10,                     # KNN neighbors
  maxiter_MF = 10,
  ntree = 100,
  maxnodes = NULL,
  maxiter_ADMIN = 30,
  tol = 10^(-2),
  gamma_ADMIN = 0,
  gamma = 50,
  CV = FALSE,
  fillmethod = "row_mean",
  maxiter_RegImpute = 10,
  conv_nrmse = 1e-6,
  iter_SpectroFM = 40,
  method = c("KNN", "RF", "Brnn", "Seq"),
  out = "Ensemble"
)

protein_matrix_imputed <- imputed$Ensemble

3. Feature Selection for High-Dimensional Data

The classic problem in proteomics biomarker development: thousands of proteins, hundreds of samples. Standard regression is unusable. Naive variable selection leads to overfitting.

Elastic Net / LASSO regularization has become the standard workhorse for penalized regression in proteomics. Not strictly "AI," but widely deployed in modern pipelines:

library(glmnet)
library(caret)

# Elastic net with cross-validation
# Alpha = 0.5 (mixture of L1 and L2 penalty)

set.seed(42)
cv_fit <- cv.glmnet(
  x = t(as.matrix(protein_matrix_imputed)),  # transpose: glmnet expects rows = samples
  y = outcome_variable,
  family = "binomial",
  alpha = 0.5,
  nfolds = 10,
  type.measure = "auc"
)

# Selected proteins at lambda.1se (more conservative)
selected_proteins <- rownames(coef(cv_fit, s = "lambda.1se"))[
  coef(cv_fit, s = "lambda.1se")[,1] != 0
]
selected_proteins <- setdiff(selected_proteins, "(Intercept)")  # drop intercept
cat("Selected proteins:", length(selected_proteins))

More recent development: Graph Neural Networks for protein interaction-aware feature selection.

Several groups have experimented with incorporating protein-protein interaction networks into feature selection. The intuition: co-regulated proteins that are biologically connected carry correlated information; penalizing selection of highly correlated proteins that lack network connectivity might improve biological interpretability of the resulting panel.

Zeng et al. (2021, Bioinformatics) showed modest improvements in external validation performance using network-aware feature selection compared to naive LASSO for plasma proteomics biomarker panels (doi:10.1093/bioinformatics/btab454).

The improvement isn't dramatic — don't expect it to rescue a poorly designed study — but it's a useful addition to the pipeline when you have reason to believe your disease of interest involves coherent pathway dysregulation.
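A minimal version of such a network score is mean inverse shortest-path distance from a candidate to known disease proteins in a PPI graph. The sketch below uses networkx; the toy graph and the CANDIDATE names are illustrative (real edges would come from STRING or BioPlex):

```python
import networkx as nx

# Toy PPI graph
G = nx.Graph([
    ("APP", "APOE"), ("APOE", "CLU"), ("CLU", "CANDIDATE1"),
    ("APP", "MAPT"), ("CANDIDATE2", "ALB"),
])
known_disease = {"APP", "APOE", "MAPT"}

def network_proximity(protein, targets, graph):
    """Mean inverse shortest-path distance to target proteins (0 if unreachable)."""
    dists = []
    for t in targets:
        if protein in graph and t in graph and nx.has_path(graph, protein, t):
            d = nx.shortest_path_length(graph, protein, t)
            dists.append(1.0 / d if d > 0 else 1.0)
        else:
            dists.append(0.0)
    return sum(dists) / len(dists)

print(network_proximity("CANDIDATE1", known_disease, G))  # near the disease module
print(network_proximity("CANDIDATE2", known_disease, G))  # disconnected -> 0.0
```

Scores like this can then be folded into a multi-criteria prioritization alongside the statistical evidence.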

4. Large-Scale Protein Language Models

This is where things get genuinely exciting, and where the hype is also most dangerous.

AlphaFold2 and protein structure prediction (doi:10.1038/s41586-021-03819-2) changed structural proteomics. For biomarker discovery specifically, accurate structure prediction enables several downstream analyses:

  • Better prediction of tryptic peptide detectability (surface exposure, secondary structure of the cleaved peptides)
  • Structural context for variants associated with disease
  • Epitope prediction for assay development
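For the peptide-detectability point, the in silico digest itself is simple: trypsin cleaves after K or R except when the next residue is P. A minimal sketch (missed cleavages and modifications are ignored; the sequence is the N-terminal stretch of serum albumin):

```python
import re

def tryptic_digest(sequence, min_len=6, max_len=30):
    """Cleave after K/R not followed by P; keep peptides of MS-friendly length."""
    peptides = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in peptides if min_len <= len(p) <= max_len]

seq = "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVK"
for pep in tryptic_digest(seq):
    print(pep)
```

The resulting peptides are the candidates whose surface exposure and secondary structure the structure-based detectability predictions then score.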

ESM-2 (Meta AI protein language model) and similar transformer-based models have enabled zero-shot prediction of protein stability, functional effects of mutations, and post-translational modification propensity — all relevant for biomarker candidate prioritization.

In our lab, we've started using ESM-2 embeddings as features in multi-modal models that combine proteomic abundance data with structural features. The results are preliminary, but the approach seems promising for separating proteins that are dysregulated as a proximal cause of disease from those that change secondarily.

5. Multi-Omics Integration

Integrating proteomics with genomics, transcriptomics, or metabolomics data has become substantially more tractable with modern computational tools.

MOFA+ (Multi-Omics Factor Analysis) (doi:10.1186/s13059-020-02015-1) remains one of the most robust approaches for unsupervised multi-omics integration. It learns latent factors that explain covariation across omics layers and has been applied to biomarker discovery in cancer, metabolic disease, and aging.

library(MOFA2)

# Create MOFA object with proteomics + transcriptomics
mofa_obj <- create_mofa(list(
  "proteomics" = protein_matrix,
  "transcriptomics" = rna_matrix
))

# Define options
data_opts <- get_default_data_options(mofa_obj)
model_opts <- get_default_model_options(mofa_obj)
model_opts$num_factors <- 15  # Number of latent factors

train_opts <- get_default_training_options(mofa_obj)
train_opts$seed <- 42

# Train model
mofa_obj <- prepare_mofa(mofa_obj,
  data_options = data_opts,
  model_options = model_opts,
  training_options = train_opts
)

mofa_obj <- run_mofa(mofa_obj)

# Extract factor values for downstream analysis
factors <- get_factors(mofa_obj)[[1]]

A Real Workflow: Alzheimer's Disease Plasma Biomarker Discovery

Let me walk through a concrete example based on a type of project I've been involved with, to make the above tangible.

The question: Can plasma proteomics identify individuals in the early (preclinical or mild cognitive impairment) stages of Alzheimer's disease who would otherwise be classified as normal by clinical assessment?

Why this is hard: The blood-brain barrier limits what CNS-derived proteins appear in plasma. The disease-relevant proteins (Aβ42, tau, neurofilament light) are present at extremely low concentrations. The signal-to-noise ratio is challenging.

Dataset: Plasma samples from the ADNI (Alzheimer's Disease Neuroimaging Initiative) cohort — a publicly available resource that has been analyzed by many groups.

Step 1: Sample QC and Preprocessing

library(data.table)
library(tidyverse)

# Load ADNI plasma proteomics data (SomaScan 7K platform)
data <- fread("ADNI_SomaScan_7K.csv")

# Sample QC: flag samples with too many missing values
missing_per_sample <- colMeans(is.na(data))
flagged_samples <- names(missing_per_sample[missing_per_sample > 0.20])
cat("Samples with >20% missing:", length(flagged_samples))

# Remove flagged samples
data_clean <- data[, !names(data) %in% flagged_samples, with=FALSE]

# Protein QC: remove proteins missing in >30% of samples
missing_per_protein <- rowMeans(is.na(data_clean))
data_filtered <- data_clean[missing_per_protein <= 0.30, ]
cat("Proteins after filtering:", nrow(data_filtered))

Step 2: Normalize

# Median absolute deviation normalization
# More robust than quantile normalization for clinical data
normalize_mad <- function(x) {
  (x - median(x, na.rm=TRUE)) / mad(x, na.rm=TRUE)
}

# Apply per-sample (assumes ID/metadata columns were dropped, so the matrix is numeric)
norm_matrix <- apply(as.matrix(data_filtered), 2, normalize_mad)

Step 3: Correct for Known Confounders

This is the step most biomarker papers skip, and it's critical.

library(limma)

# Metadata
metadata <- read.csv("ADNI_metadata.csv")

# Design matrix accounting for age, sex, APOE genotype
design <- model.matrix(~ diagnosis + age + sex + APOE4_status,
                       data = metadata)

# Fit linear model
fit <- lmFit(norm_matrix, design)
fit <- eBayes(fit)

# Extract results for diagnosis contrast
results_dx <- topTable(fit, coef = "diagnosisMCI",
                       number = Inf, adjust.method = "BH")

Step 4: Candidate Prioritization — Where AI Helps

The raw statistical results give you hundreds of candidates. Now the question is which to follow up.

Traditional approach: Sort by adjusted p-value, pick top N.

AI-assisted approach: Use a multi-criteria scoring model that incorporates:

  • Statistical evidence (effect size, FDR)
  • Biological plausibility (pathway membership, known disease associations)
  • Detectability in external cohorts (is this protein consistently measured?)
  • Structural features (AlphaFold-predicted surface exposure of tryptic peptides)

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def prioritize_candidates(results_df, ppi_network, alphafold_features):
    """
    Multi-criteria candidate prioritization.
    load_known_ad_proteins, compute_network_proximity, and
    query_disease_databases are project-specific helpers (placeholders here).
    """
    scores = pd.DataFrame(index=results_df.index)
    
    # 1. Statistical score (inverted p, scaled effect size)
    scores['stat'] = (
        -np.log10(results_df['adj.P.Val'].clip(1e-10, 1)) * 
        np.abs(results_df['logFC'])
    )
    
    # 2. Network connectivity score
    # Proteins with many connections to known AD proteins score higher
    known_ad_proteins = load_known_ad_proteins()  # From DisGeNET/OMIM
    scores['network'] = results_df.index.map(
        lambda p: compute_network_proximity(p, known_ad_proteins, ppi_network)
    )
    
    # 3. Detectability score
    # Proxied by AlphaFold-predicted surface accessibility of tryptic peptides
    scores['detectability'] = results_df.index.map(
        lambda p: alphafold_features.get(p, {}).get('surface_accessibility', 0.5)
    )
    )
    
    # 4. Prior evidence score
    scores['prior'] = results_df.index.map(
        lambda p: query_disease_databases(p, disease='Alzheimer')
    )
    
    # Normalize all scores 0-1 and compute weighted sum
    scaler = MinMaxScaler()
    scores_normalized = pd.DataFrame(
        scaler.fit_transform(scores.fillna(0)),
        columns=scores.columns,
        index=scores.index
    )
    
    weights = {'stat': 0.40, 'network': 0.25, 
               'detectability': 0.20, 'prior': 0.15}
    
    final_score = sum(
        scores_normalized[col] * weight 
        for col, weight in weights.items()
    )
    
    return final_score.sort_values(ascending=False)

Step 5: External Validation (The Moment of Truth)

We validate in PREVENT-AD — a separate cohort of cognitively normal individuals at high familial risk for AD.

What we found: Out of 47 candidates with FDR < 5% in ADNI, 11 replicated at p < 0.05 in PREVENT-AD. Six of these were previously described (NfL, GFAP, Aβ42/40 ratio-related proteins, phospho-tau fragments). Five were less characterized.

This replication rate (~23%) is actually quite good by historical standards in this field. It's also sobering — 77% of "significant" findings in the discovery cohort didn't hold up.

The five novel candidates are currently being followed up with targeted mass spectrometry assays. Two have passed analytical validation. Whether they ultimately prove clinically useful remains to be seen.


The Failure I'm Most Reluctant to Write About

In 2022, our lab ran a proteomics-based biomarker discovery project for a specific autoimmune condition (I'm being deliberately vague for reasons that will become clear). The project was well-funded, well-designed by our standards, with 200 cases and 200 controls, pre-analytical protocols standardized, confounders accounted for.

We identified a 7-protein panel with AUC of 0.87 in internal cross-validation. We submitted the paper. During peer review, a reviewer asked us to validate in an external cohort.

We went back to the biobank for validation samples. Two things happened.

First, we realized that our control samples had been collected at a single clinic, while our case samples came from multiple sites. We had inadvertently confounded disease status with collection site.

Second, when we re-analyzed with collection site as a covariate, the AUC dropped to 0.71. In the external cohort, it was 0.66 — barely above what you'd get by chance.

The paper was never published. Two years of work, a significant grant. This experience is why I am extremely specific about pre-analytical standardization and why I treat any internal cross-validation AUC above 0.85 as a red flag rather than a cause for celebration.

I tell this story because it's the kind of thing that doesn't appear in the literature. The literature systematically over-represents positive findings. The failure modes I've described here are common, but you have to talk to people who work in the field to hear about them.


Current AI Approaches Worth Watching

1. Foundation Models for Proteomics

Several groups are working on large-scale self-supervised models trained on proteomics data — essentially the "GPT" equivalent for mass spectrometry data. The idea is that pre-training on massive public datasets (PRIDE, ProteomicsDB) could transfer useful representations to smaller clinical datasets.

Early work from the Mann lab at the Max Planck Institute is exploring this direction. The results are preliminary, but the architectural framework is sound. Worth watching over the next 2–3 years.

Key resource: ProteomicsDB — the largest public human proteomics database, curated by Mathias Wilhelm and colleagues.

2. Generative Models for Synthetic Data Augmentation

A persistent problem in clinical proteomics is small sample sizes. Rare diseases, specialized cohorts, expensive assays — getting 200 well-characterized samples for a single disease is often aspirational.

Variational autoencoders and conditional GANs have been applied to generate synthetic proteomics profiles conditioned on disease status. The synthetic data can augment training sets, potentially improving model stability.

Caution: The evaluation of synthetic data quality for proteomics is not standardized. Claims of improved performance should be scrutinized carefully — the synthetic data can inadvertently "leak" the label into the training set through the generative model.

The critical evaluation by Hornburg et al. (2023, Molecular & Cellular Proteomics) is useful context: doi:10.1016/j.mcpro.2022.100488.
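A minimal protocol guard against that leakage: fit the generator only on the training split, and keep a real, untouched test set for evaluation. In the sketch below a class-conditional Gaussian resampler stands in for the VAE/GAN, and the data are simulated:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
X = rng.standard_normal((120, 50))
y = rng.integers(0, 2, 120)

# Split FIRST; the generator never sees the test set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)

# Stand-in "generator": resample per-class Gaussians fit on training data only
synth_X, synth_y = [], []
for cls in (0, 1):
    Xc = X_tr[y_tr == cls]
    mu, sd = Xc.mean(axis=0), Xc.std(axis=0)
    synth_X.append(rng.normal(mu, sd, size=(100, Xc.shape[1])))
    synth_y.append(np.full(100, cls))

X_aug = np.vstack([X_tr] + synth_X)
y_aug = np.concatenate([y_tr] + synth_y)

model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"AUC on untouched real test set: {auc:.2f}")
```

Any performance claim for augmentation should rest on that untouched real test set, never on held-out synthetic samples.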

3. Explainable AI for Biological Interpretability

Black-box models that achieve good AUC but can't tell you why a protein panel works are limited in their clinical utility. Regulatory bodies want interpretability. Clinicians want biological rationale. Journal reviewers ask what the biology means.

SHAP (SHapley Additive exPlanations) has become the standard tool for explaining tree-based and neural network predictions in proteomics:

import shap
import xgboost as xgb

# Train XGBoost classifier
model = xgb.XGBClassifier(n_estimators=100, max_depth=4,
                           learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

# Explain predictions
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot: which proteins matter most?
shap.summary_plot(shap_values, X_test,
                  feature_names=protein_names,
                  plot_type="bar")

# For a single prediction: why did this patient score high?
shap.force_plot(explainer.expected_value, 
                shap_values[0], X_test.iloc[0])

The SHAP values give you per-sample, per-protein contributions to the model prediction. This is still a proxy for "importance" rather than true biological mechanism, but it's a reasonable starting point for forming hypotheses.

4. Multi-Modal Deep Learning

The most ambitious direction: integrating proteomics with imaging (PET, MRI), electronic health records, and genomics in a single deep learning framework.

There are published examples of this working well in specific contexts — notably in cancer, where tumor imaging and proteomic profiling of biopsies can be integrated to predict treatment response (Ludwig et al., 2022, Cancer Cell).

For plasma proteomics specifically, the multi-modal approach is still nascent. The challenge is that the modalities often have different sample sizes and the integration architectures are complex enough to overfit with the sample sizes available in most clinical studies.


What a Good Discovery Project Looks Like in 2026

Based on everything above, here's how I'd design a proteomics biomarker discovery project today:

Phase 1: Design (spend time here)
├── Define the specific clinical question
│   └── "Distinguish early AD from normal aging" 
│       NOT "find biomarkers for neurological disease"
├── Pre-specify sample size (power calculation)
├── Standardize pre-analytical protocols rigidly
├── Identify confounders and plan statistical handling
└── Identify external validation cohort before starting

Phase 2: Discovery (keep it exploratory)
├── Untargeted DIA-MS (DIA-NN + in silico library)
├── Normalize → impute (DreamAI) → confounder correction
├── Statistical testing with FDR correction
├── AI-assisted candidate prioritization
│   ├── Multi-criteria scoring
│   ├── Network analysis (STRING, BioPlex)
│   └── AlphaFold structure-guided peptide selection
└── Target 10–20 candidates for validation

Phase 3: Targeted validation (kill your darlings)
├── PRM/MRM assay development for top 20 candidates
├── Internal held-out samples
├── External cohort (different institution, different time)
└── Report everything including failures

Phase 4: Clinical utility (often skipped, shouldn't be)
├── Does the marker add value over clinical data alone?
├── What is the intended clinical use?
└── Is the measurement feasible in clinical settings?



Where the Field Stands: An Honest Assessment

Twenty years after the first high-throughput proteomics biomarker discovery papers were published, we have a small number of protein biomarkers in clinical use that emerged directly from discovery proteomics: some cancer biomarkers, cardiac troponin monitoring advances, a few others. The number is smaller than the field would like to admit.

The reasons are structural: pre-analytical variability, the dynamic range problem, small sample sizes, optimistic reporting, and the long distance between a research discovery and an analytically validated clinical assay.

AI has genuinely helped with several specific problems: spectral prediction, missing value imputation, feature selection for high-dimensional data, and multi-omics integration. It has not solved the fundamental problem of underpowered, confounded studies being reported as important discoveries.

The genuinely optimistic development is the recent expansion of highly multiplexed, antibody-based proteomics platforms (Olink Explore 3072, SomaScan 11K) that enable measurement of thousands of proteins in large biobank cohorts with well-documented sample handling. Studies coming out of UK Biobank, FinnGen, and UKCTOCS with 5,000–50,000 samples are producing biomarker evidence at a scale and quality that wasn't achievable with mass spectrometry-based approaches. The limitation is that these platforms are targeted — you can only detect what you have a probe for.

The future is probably a combination: large-scale affinity-based platforms for population-level screening and hypothesis generation, deep LC-MS/MS for mechanistic follow-up and novel discovery, and AI models trained on the resulting large-scale data for clinical prediction.

Whether this combination produces a new wave of clinically useful biomarkers over the next decade is genuinely uncertain. But the technical pieces are more mature than they've ever been. The limiting factor now is study design and validation rigor — and those are human problems, not technical ones.




This post reflects opinions and experiences from the lab. I've been specific about methodological failures because I think the field benefits more from honest accounting than from another optimistic review. If you're working on a proteomics biomarker project and want to discuss study design, the comments are open.

Next in this series: Targeted Mass Spectrometry for Biomarker Validation: PRM vs MRM vs Olink
