
DIA-NN Complete Tutorial 2026: From Raw Files to Quantified Proteome

Step-by-step DIA-NN tutorial covering installation, spectral library generation, parameter optimization, and downstream analysis. Real benchmarks, common errors, and practical tips from 8 months of daily use.


[Figure: DIA-NN proteomics workflow overview]

Why DIA-NN Has Become the Standard for DIA Proteomics

When DIA-NN was first released, many labs were skeptical. MaxQuant was the established gold standard. Why switch?

Eight months and hundreds of samples later, here's my honest assessment: DIA-NN is faster, equally accurate, and free. For most DIA proteomics workflows, it has become my default tool.

This tutorial covers everything you need to go from raw .raw files to a quantified protein matrix — including the parameters that matter, the ones that don't, and the mistakes I made so you don't have to.

What is DIA-NN?

DIA-NN (Data-Independent Acquisition by Neural Networks) is an open-source software for analyzing DIA mass spectrometry data. It was developed by Vadim Demichev and colleagues.

Key features:

  • Neural network-based peptide scoring
  • In silico spectral library generation (no experimental library required)
  • Single-pass analysis (no iterative processing)
  • GPU acceleration support
  • Free for academic and commercial use

Current version: 1.8.1 (as of March 2026)
Download: https://github.com/vdemichev/DiaNN
License: Apache 2.0

System Requirements and Installation

Minimum Requirements

OS: Windows 10/11 (64-bit) or Linux (Ubuntu 18.04+)
CPU: 4 cores minimum, 8+ recommended
RAM: 16 GB minimum, 32 GB recommended
Storage: 100 GB+ for raw files and temporary files
GPU: Optional but significantly speeds up library generation
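
Before installing, a quick shell check against these minimums can save a failed run later. This is a sketch for Linux hosts (it reads /proc/meminfo); the thresholds simply mirror the list above.

```shell
# Pre-flight check against the minimum requirements (Linux only)
cores=$(nproc)
ram_gb=$(awk '/MemTotal/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)

echo "CPU cores: $cores (minimum 4, 8+ recommended)"
echo "RAM: ${ram_gb} GB (minimum 16, 32 recommended)"

if [ "$cores" -lt 4 ]; then echo "WARNING: fewer than 4 cores"; fi
if [ "$ram_gb" -lt 16 ]; then echo "WARNING: less than 16 GB RAM"; fi
```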

Installation (Windows)

# Download installer from GitHub releases
# Run DIA-NN-1.8.1.exe
# Default installation path: C:\DIA-NN\

# Verify installation
"C:\DIA-NN\DIA-NN.exe" --version

Installation (Linux)

# Download binary
wget https://github.com/vdemichev/DiaNN/releases/download/1.8.1/diann-1.8.1.tar.gz
tar -xzf diann-1.8.1.tar.gz
cd diann-1.8.1

# Make executable
chmod +x diann

# Add to PATH
echo 'export PATH="/path/to/diann-1.8.1:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Test
diann --version

Understanding DIA-NN Analysis Modes

DIA-NN offers three main analysis approaches. Choosing the right one is the first decision you need to make.

Mode 1: In Silico Spectral Library from FASTA

diann \
  --f /data/raw/run1.raw \
  --f /data/raw/run2.raw \
  --fasta /databases/human_reviewed.fasta \
  --out /results/report.tsv \
  --gen-spec-lib \
  --predictor \
  --threads 16

When to use:

  • No existing spectral library
  • First time analyzing a sample type
  • Want the most up-to-date peptide predictions

Pros: No library preparation needed; always uses the latest neural network models
Cons: Slightly longer first run

Mode 2: Existing Spectral Library

diann \
  --f /data/raw/run1.raw \
  --f /data/raw/run2.raw \
  --lib /libraries/human_library.tsv \
  --out /results/report.tsv \
  --threads 16

When to use:

  • You have a validated library from previous experiments
  • Consistency with previous datasets is required

Mode 3: Direct FASTA Search (no library generation)

diann \
  --f /data/raw/run1.raw \
  --f /data/raw/run2.raw \
  --fasta /databases/human_reviewed.fasta \
  --out /results/report.tsv \
  --fasta-search \
  --threads 16

When to use:

  • No library, want to skip library generation
  • Quick initial assessment
  • Novel proteome with no published library

Step-by-Step: Full DIA-NN Workflow

Step 1: Prepare Your Input Files

Raw files:

  • Supported formats: .raw (Thermo), .d (Bruker), .wiff (Sciex), .mzML
  • Place all files from the same experiment in one folder
  • Consistent naming convention helps downstream analysis
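
A quick pre-flight check of the raw-file folder catches mixed formats early. This is a generic sketch; the /data/raw path is just an example, and `.d` entries (Bruker) are directories, which find counts the same way.

```shell
# Count the raw files per format before starting the analysis
RAW_DIR=${RAW_DIR:-/data/raw}

for ext in raw d wiff mzML; do
  n=$(find "$RAW_DIR" -maxdepth 1 -name "*.${ext}" 2>/dev/null | wc -l)
  if [ "$n" -gt 0 ]; then
    echo "${ext}: ${n} file(s)"
  fi
done
```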

FASTA database:

# Download UniProt human reviewed proteins
wget "https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=reviewed:true+AND+organism_id:9606" \
  -O human_reviewed_2026.fasta

# Check entry count
grep -c "^>" human_reviewed_2026.fasta
# Should be ~20,000 entries

Step 2: Configure Your Analysis

For a standard human plasma proteomics run:

diann \
  --f /data/raw/sample1.raw \
  --f /data/raw/sample2.raw \
  --f /data/raw/sample3.raw \
  --fasta /databases/human_reviewed_2026.fasta \
  --out /results/experiment1/report.tsv \
  --out-lib /results/experiment1/library.tsv \
  --gen-spec-lib \
  --predictor \
  --smart-profiling \
  --peak-center \
  --no-ifs-removal \
  --qvalue 0.01 \
  --min-fr-mz 200 \
  --max-fr-mz 1800 \
  --min-pr-mz 300 \
  --max-pr-mz 1800 \
  --min-pr-charge 2 \
  --max-pr-charge 4 \
  --missed-cleavages 1 \
  --var-mods 2 \
  --var-mod UniMod:35,15.994915,M \
  --var-mod UniMod:1,42.010565,*n \
  --reanalyse \
  --threads 16 \
  --verbose 1

Step 3: Understanding the Key Parameters

Parameters that matter most:

--qvalue — FDR threshold at precursor level

Default: 0.01 (1% FDR)
More stringent: 0.005
More permissive: 0.05 (for method development, not publication)
Recommendation: Always use 0.01 for publication-quality data

--threads — CPU threads to use

Set to number of physical cores (not hyperthreads)
For 8-core CPU: --threads 8
Diminishing returns above 16

--missed-cleavages — Tryptic missed cleavages allowed

0: No missed cleavages (fastest, least IDs)
1: Standard (good balance, recommended)
2: More IDs but more FP risk and longer runtime
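
To make the trade-off concrete, here's a toy digestion count. This is a simplification for illustration only: it cleaves after every K/R, ignores the usual "not before proline" exception, and the sequence is made up.

```shell
# Toy illustration of missed cleavages: cleave after every K or R
seq="MKTAYIAKQRQISFVK"   # made-up sequence for illustration

# Fully tryptic peptides (0 missed cleavages): split after each K/R
peptides=$(echo "$seq" | sed 's/\([KR]\)/\1\n/g' | grep -c .)
echo "0 missed cleavages: $peptides peptides"

# Allowing 1 missed cleavage adds every pair of adjacent fragments
echo "1 missed cleavage adds: $((peptides - 1)) extra peptides"
```

For this sequence, the fully tryptic peptides are MK, TAYIAK, QR, and QISFVK; each allowed missed cleavage joins adjacent fragments, so the candidate space (and runtime) grows with the setting.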

--gen-spec-lib + --predictor — Generate in silico library

Always use together
--gen-spec-lib: generates the library
--predictor: uses neural network for RT and ion intensity prediction

--smart-profiling — Adaptive retention time profiling

Almost always beneficial — leave it on
Helps when retention times vary across samples

--reanalyse — Second-pass analysis

Improves sensitivity ~10-15%
Adds ~50% to runtime
Worth it for final analysis, can skip for quick QC

Step 4: Running the Analysis

# Create output directory
mkdir -p /results/experiment1

# Run DIA-NN (full command). DIA-NN expects one --f flag per run,
# so build the flag list from the folder first
FILES=""
for f in /data/raw/*.raw; do FILES="$FILES --f $f"; done

diann \
  $FILES \
  --fasta /databases/human_reviewed_2026.fasta \
  --out /results/experiment1/report.tsv \
  --out-lib /results/experiment1/library.tsv \
  --gen-spec-lib \
  --predictor \
  --smart-profiling \
  --peak-center \
  --no-ifs-removal \
  --qvalue 0.01 \
  --reanalyse \
  --threads 16 \
  --verbose 1 2>&1 | tee /results/experiment1/diann.log
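
Once the run finishes, it's worth scanning the saved log before opening the results. The exact wording of DIA-NN's log messages varies between versions, so treat these grep patterns as a starting point rather than a definitive check.

```shell
# Scan the saved log for anything suspicious
LOG=/results/experiment1/diann.log

if [ -f "$LOG" ]; then
  grep -iE "error|warning" "$LOG" | head -20   # anything suspicious?
  tail -5 "$LOG"                                # did the run reach the end?
else
  echo "log not found: $LOG"
fi
```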

Expected runtime (rough estimates):

10 samples, 2h gradient: ~15-20 minutes
50 samples, 2h gradient: ~45-60 minutes
100 samples, 2h gradient: ~1.5-2 hours
(with --reanalyse enabled, add ~50%)

Step 5: Interpreting the Output

Key output files:

report.tsv           — Main results (precursor-level, long format)
report.pr_matrix.tsv — Precursor quantification matrix
report.pg_matrix.tsv — Protein group quantification matrix
report.stats.tsv     — Per-run summary statistics
library.tsv          — Generated spectral library

Quick QC check:

library(data.table)
library(ggplot2)

# Load results
report <- fread("/results/experiment1/report.tsv")

# Check identification numbers per sample
ids_per_run <- report[Q.Value < 0.01, 
                       .(n_proteins = uniqueN(Protein.Group),
                         n_precursors = .N), 
                       by = Run]
print(ids_per_run)

# Expected for good human plasma data:
# n_proteins: 500-800 (depleted) or 300-500 (neat)
# n_precursors: 5,000-20,000

# Visualize
ggplot(ids_per_run, aes(x = Run, y = n_proteins)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Proteins per Run", y = "Protein Groups")

Check quantification quality:

# Load protein matrix
prot_matrix <- fread("/results/experiment1/report.pg_matrix.tsv")

# Missing value percentage per sample
sample_cols <- grep("^/", names(prot_matrix), value = TRUE)
mv_pct <- sapply(sample_cols, function(col) {
  sum(is.na(prot_matrix[[col]])) / nrow(prot_matrix) * 100
})

cat("Missing values per sample:\n")
print(round(mv_pct, 1))
# Good data: <30% missing values
# Problematic: >50% missing values
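
If R isn't handy, the same missing-value check can be done from the shell. This sketch assumes the pg_matrix layout described above: a handful of annotation columns (five, in the 1.8.x output I've seen) followed by one quantity column per run, with missing values left as empty fields.

```shell
# Per-column missing-value percentage in the protein group matrix
MATRIX=report.pg_matrix.tsv

if [ -f "$MATRIX" ]; then
  awk -F'\t' 'NR > 1 {
    for (i = 6; i <= NF; i++) if ($i == "" || $i == "NA") miss[i]++
    rows++
  }
  END {
    for (i = 6; i <= NF; i++)
      printf "column %d: %.1f%% missing\n", i, 100 * miss[i] / rows
  }' "$MATRIX"
fi
```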

Common Errors and Solutions

Error: "No MS2 spectra found"

Cause: Wrong file format or corrupted raw file

# Verify the file is readable by converting it to mzML with
# msconvert (ProteoWizard), then analyse the converted file
msconvert problem_file.raw --mzML -o /tmp/test/
diann --f /tmp/test/problem_file.mzML ...

Error: "Library is empty"

Cause: FASTA parsing issue or overly restrictive parameters

# Verify FASTA format
head -4 your_database.fasta
# Should look like:
# >sp|P12345|PROT_HUMAN Protein name OS=Homo sapiens
# MKTLLLTLVVVTIVCLDLGYTPQVKQLNHYQELLKSELDPYIYKIVPNLVSGFTVQLKK
# >sp|P67890|ANOT_HUMAN Another protein...

# Try relaxed parameters first
diann ... --min-peptide-length 6 --max-peptide-length 30 --missed-cleavages 2
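
Two more FASTA sanity checks that catch problems the header inspection above misses. They assume UniProt-style `>db|ACCESSION|NAME` headers, as in the example.

```shell
# Duplicate accessions: should print nothing
grep "^>" your_database.fasta | cut -d'|' -f2 | sort | uniq -d

# Sequence lines with characters outside the 20 amino acids
# (plus the B/J/O/U/X/Z ambiguity codes)
grep -v "^>" your_database.fasta | grep -n "[^ACDEFGHIKLMNPQRSTVWYBJOUXZ]" | head
```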

Low Identification Rate (<500 proteins from human cell line)

Systematic troubleshooting:

# Step 1: Check mass accuracy
# Add --individual-mass-acc to let DIA-NN calibrate per run
diann ... --individual-mass-acc

# Step 2: Broaden m/z range
diann ... --min-pr-mz 200 --max-pr-mz 2000

# Step 3: Increase FDR threshold for testing
diann ... --qvalue 0.05  # Never use for publication

# Step 4: Check if DDA library helps
# Run a DDA search first, generate library, then use for DIA

Memory Errors

# Limit RAM usage
diann ... --max-ram 32  # Set to available RAM in GB

# Process in smaller batches
# Split samples into groups of 20-30
for i in {1..5}; do
  start=$(( (i-1)*20 + 1 ))
  end=$(( i*20 ))
  FILES=""
  for f in $(ls /data/raw/*.raw | sed -n "${start},${end}p"); do
    FILES="$FILES --f $f"
  done
  diann $FILES ... --out /results/batch_${i}/report.tsv
done

Downstream Analysis in R

Load and Prepare Data

library(data.table)
library(tidyverse)

# Load protein matrix
prot <- fread("report.pg_matrix.tsv")

# Extract sample columns and protein IDs
sample_cols <- grep("\\.raw$|\\.mzML$", names(prot), value = TRUE)
proteins <- prot$Protein.Group

# Create clean matrix
mat <- as.matrix(prot[, ..sample_cols])
rownames(mat) <- proteins

# Log2 transform
mat_log <- log2(mat + 1)

# Sample metadata
metadata <- data.frame(
  sample = sample_cols,
  condition = c(rep("control", 5), rep("treatment", 5)),
  batch = c(1,1,1,2,2,1,1,2,2,2)
)

Normalization

library(limma)

# Quantile normalization
mat_norm <- normalizeBetweenArrays(mat_log, method = "quantile")

# Or median normalization (less aggressive)
mat_med <- sweep(mat_log, 2, apply(mat_log, 2, median, na.rm=TRUE))
mat_med <- mat_med + median(mat_log, na.rm=TRUE)

# Check normalization
boxplot(mat_norm, las=2, main="After Normalization")

Differential Expression

# Using limma for differential analysis
design <- model.matrix(~ condition + batch, data = metadata)

# Handle missing values
mat_imputed <- mat_norm
mat_imputed[is.na(mat_imputed)] <- min(mat_norm, na.rm=TRUE) - 1

fit <- lmFit(mat_imputed, design)
fit <- eBayes(fit)

results <- topTable(fit, coef = "conditiontreatment", 
                   number = Inf, adjust.method = "BH")

# Significant proteins
sig <- results[results$adj.P.Val < 0.05 & abs(results$logFC) > 1, ]
cat("Significant proteins:", nrow(sig), "\n")

Volcano Plot

library(ggplot2)
library(ggrepel)

# Prepare data
volcano_data <- results %>%
  mutate(
    significance = case_when(
      adj.P.Val < 0.05 & logFC > 1  ~ "Up",
      adj.P.Val < 0.05 & logFC < -1 ~ "Down",
      TRUE ~ "NS"
    ),
    neg_log10_p = -log10(adj.P.Val),
    label = ifelse(significance != "NS" & neg_log10_p > 5, 
                   rownames(.), "")
  )

ggplot(volcano_data, aes(x = logFC, y = neg_log10_p, 
                          color = significance, label = label)) +
  geom_point(alpha = 0.6, size = 1.5) +
  geom_text_repel(size = 3, max.overlaps = 20) +
  scale_color_manual(values = c("Up" = "#E74C3C", 
                                 "Down" = "#3498DB", 
                                 "NS" = "#95A5A6")) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed", alpha = 0.5) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", alpha = 0.5) +
  labs(title = "Differential Protein Expression",
       x = "log2 Fold Change",
       y = "-log10 Adjusted P-value") +
  theme_classic()

DIA-NN vs MaxQuant: When to Use Which

After 8 months of using both:

| Criterion     | DIA-NN              | MaxQuant        |
|---------------|---------------------|-----------------|
| Speed         | ⭐⭐⭐⭐⭐ (10x faster)  | ⭐⭐              |
| DIA data      | ⭐⭐⭐⭐⭐              | ⭐⭐⭐             |
| DDA data      | ⭐⭐                  | ⭐⭐⭐⭐⭐          |
| SILAC         | ⭐⭐⭐                 | ⭐⭐⭐⭐⭐          |
| Memory usage  | ⭐⭐⭐⭐                | ⭐⭐              |
| Cost          | Free                | Free (academic) |
| Documentation | ⭐⭐⭐                 | ⭐⭐⭐⭐            |
| Community     | Growing             | Large, mature   |

Use DIA-NN for: DIA experiments (obvious), large sample sets, when speed matters.
Use MaxQuant for: DDA experiments, SILAC quantification, when you need the mature ecosystem.

Tips from 8 Months of Daily Use

1. Always use --reanalyse for final analysis. The ~15% sensitivity improvement is worth the extra runtime.

2. --smart-profiling + --peak-center is your default combo. There is almost never a reason to turn these off.

3. Generate the library once, reuse it. If you're running multiple batches of the same sample type, generate the library once and reuse it with --lib.

4. Check report.stats.tsv first. Before diving into the results, always check the run-level statistics for outliers.
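
A minimal way to do that scan from the shell; the `Proteins.Identified` column name matches the DIA-NN stats output I've worked with, but check your own file's header first.

```shell
# Print the five runs with the fewest protein IDs from report.stats.tsv
awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Proteins.Identified") c = i }
  NR > 1 && c { print $1, $c }
' report.stats.tsv | sort -k2 -n | head -5
```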

5. Test the --no-ifs-removal flag. For most human samples, disabling interference removal helps identification, but test both ways.

6. The FASTA database matters. Use reviewed (Swiss-Prot) entries only; adding TrEMBL massively inflates the search space without a proportional gain in IDs for well-studied organisms.


Questions about your specific workflow? Leave a comment below — I respond to all of them.

For downstream analysis: see our R tutorial for proteomics data visualization.
