DIA-NN Complete Tutorial 2026: From Raw Files to Quantified Proteome
Step-by-step DIA-NN tutorial covering installation, spectral library generation, parameter optimization, and downstream analysis. Real benchmarks, common errors, and practical tips from 8 months of daily use.
Why DIA-NN Has Become the Standard for DIA Proteomics
When DIA-NN was first released, many labs were skeptical. MaxQuant was the established gold standard. Why switch?
Eight months and hundreds of samples later, here's my honest assessment: DIA-NN is faster, equally accurate, and free. For most DIA proteomics workflows, it has become my default tool.
This tutorial covers everything you need to go from raw .raw files to a quantified protein matrix — including the parameters that matter, the ones that don't, and the mistakes I made so you don't have to.
What is DIA-NN?
DIA-NN (Data-Independent Acquisition by Neural Networks) is an open-source software for analyzing DIA mass spectrometry data. It was developed by Vadim Demichev and colleagues.
Key features:
- Neural network-based peptide scoring
- In silico spectral library generation (no experimental library required)
- Fast single-pass search (with an optional second pass for extra sensitivity via --reanalyse)
- GPU acceleration support
- Free for academic and commercial use
Current version: 1.8.1 (as of March 2026)
Download: https://github.com/vdemichev/DiaNN
License: Apache 2.0
System Requirements and Installation
Minimum Requirements
OS: Windows 10/11 (64-bit) or Linux (Ubuntu 18.04+)
CPU: 4 cores minimum, 8+ recommended
RAM: 16 GB minimum, 32 GB recommended
Storage: 100 GB+ for raw files and temporary files
GPU: Optional but significantly speeds up library generation
Installation (Windows)
# Download installer from GitHub releases
# Run DIA-NN-1.8.1.exe
# Default installation path: C:\DIA-NN\
# Verify installation
"C:\DIA-NN\DIA-NN.exe" --version
Installation (Linux)
# Download binary
wget https://github.com/vdemichev/DiaNN/releases/download/1.8.1/diann-1.8.1.tar.gz
tar -xzf diann-1.8.1.tar.gz
cd diann-1.8.1
# Make executable
chmod +x diann
# Add to PATH
echo 'export PATH="/path/to/diann-1.8.1:$PATH"' >> ~/.bashrc
source ~/.bashrc
# Test
diann --version
Understanding DIA-NN Analysis Modes
DIA-NN offers three main analysis approaches. Choosing the right one is the first decision you need to make.
Mode 1: In Silico Spectral Library (Recommended for Most Labs)
diann \
--dir /data/raw \
--fasta /databases/human_reviewed.fasta \
--out /results/report.tsv \
--gen-spec-lib \
--predictor \
--threads 16
When to use:
- No existing spectral library
- First time analyzing a sample type
- Want the most up-to-date peptide predictions
Pros: No library preparation needed; always uses the latest neural network models
Cons: Slightly longer first run
Mode 2: Existing Spectral Library
diann \
--dir /data/raw \
--lib /libraries/human_library.tsv \
--out /results/report.tsv \
--threads 16
When to use:
- You have a validated library from previous experiments
- Consistency with previous datasets is required
Mode 3: Library-Free Search
diann \
--dir /data/raw \
--fasta /databases/human_reviewed.fasta \
--out /results/report.tsv \
--fasta-search \
--threads 16
When to use:
- No library, want to skip library generation
- Quick initial assessment
- Novel proteome with no published library
Step-by-Step: Full DIA-NN Workflow
Step 1: Prepare Your Input Files
Raw files:
- Supported formats: .raw (Thermo), .d (Bruker), .wiff (Sciex), .mzML
- Place all files from the same experiment in one folder
- Consistent naming convention helps downstream analysis
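One practical note on naming: spaces in raw-file names are a frequent source of shell-quoting trouble when passing files to DIA-NN. A minimal rename sketch (directory and file names below are invented for the example):

```shell
# Sketch: replace spaces in raw-file names with underscores so that
# shell globs and --f arguments stay intact.
dir=$(mktemp -d)
touch "$dir/Sample 1 repA.raw" "$dir/Sample 2 repB.raw"

for f in "$dir"/*.raw; do
  clean="${f// /_}"          # bash substitution: all spaces -> underscores
  if [ "$f" != "$clean" ]; then
    mv "$f" "$clean"
  fi
done

ls "$dir"
```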
FASTA database:
# Download UniProt human reviewed proteins
wget "https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=(reviewed:true)+AND+(organism_id:9606)" \
-O human_reviewed_2026.fasta
# Check entry count
grep -c "^>" human_reviewed_2026.fasta
# Should be ~20,000 entries
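Beyond the raw entry count, it's worth checking for duplicate accessions, which can distort protein grouping. A quick sketch on a toy FASTA (run the same pipeline on your downloaded database):

```shell
# Sketch: flag duplicate accessions in a FASTA. The toy file here
# stands in for a real database.
cat > /tmp/toy.fasta <<'EOF'
>sp|P12345|PROT_HUMAN Example protein OS=Homo sapiens
MKTLLLTLVVVTIVCLDLGYT
>sp|P67890|ANOT_HUMAN Another protein OS=Homo sapiens
PQVKQLNHYQELLKSELDPYI
>sp|P12345|PROT_HUMAN Duplicated entry OS=Homo sapiens
MKTLLLTLVVVTIVCLDLGYT
EOF

# Pull the accession (second '|' field) from each header, report repeats
grep '^>' /tmp/toy.fasta | cut -d'|' -f2 | sort | uniq -d
# → P12345
```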
Step 2: Configure Your Analysis
For a standard human plasma proteomics run:
diann \
--f /data/raw/sample1.raw \
--f /data/raw/sample2.raw \
--f /data/raw/sample3.raw \
--fasta /databases/human_reviewed_2026.fasta \
--out /results/experiment1/report.tsv \
--out-lib /results/experiment1/library.tsv \
--gen-spec-lib \
--predictor \
--smart-profiling \
--peak-center \
--no-ifs-removal \
--qvalue 0.01 \
--min-fr-mz 200 \
--max-fr-mz 1800 \
--min-pr-mz 300 \
--max-pr-mz 1800 \
--min-pr-charge 2 \
--max-pr-charge 4 \
--missed-cleavages 1 \
--var-mods 2 \
--var-mod UniMod:35,15.994915,M \
--var-mod UniMod:1,42.010565,*n \
--reanalyse \
--threads 16 \
--verbose 1
Step 3: Understanding the Key Parameters
Parameters that matter most:
--qvalue — FDR threshold at precursor level
Default: 0.01 (1% FDR)
More stringent: 0.005
More permissive: 0.05 (for method development, not publication)
Recommendation: Always use 0.01 for publication-quality data
--threads — CPU threads to use
Set to number of physical cores (not hyperthreads)
For 8-core CPU: --threads 8
Diminishing returns above 16
--missed-cleavages — Tryptic missed cleavages allowed
0: No missed cleavages (fastest, least IDs)
1: Standard (good balance, recommended)
2: More IDs but more FP risk and longer runtime
--gen-spec-lib + --predictor — Generate in silico library
Always use together
--gen-spec-lib: generates the library
--predictor: uses neural network for RT and ion intensity prediction
--smart-profiling — Adaptive retention time profiling
Almost always beneficial — leave it on
Helps when retention times vary across samples
--reanalyse — Second-pass analysis
Improves sensitivity ~10-15%
Adds ~50% to runtime
Worth it for final analysis, can skip for quick QC
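Putting these recommendations together, a quick-QC run might look like the sketch below: --reanalyse is dropped for speed and the command is assembled into an array so you can inspect it before executing. Paths and file names are illustrative.

```shell
# Sketch of a quick-QC invocation following the parameter notes above.
# --reanalyse is omitted for speed; everything else stays close to the
# recommended settings. Review the printed command before running it.
cmd=(diann
  --f /data/raw/sample1.raw
  --f /data/raw/sample2.raw
  --fasta /databases/human_reviewed_2026.fasta
  --out /results/qc/report.tsv
  --gen-spec-lib
  --predictor
  --smart-profiling
  --qvalue 0.01
  --threads 16)

printf '%s ' "${cmd[@]}"; echo   # print the assembled command for review
# To execute for real: "${cmd[@]}"
```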
Step 4: Running the Analysis
# Create output directory
mkdir -p /results/experiment1
# Run DIA-NN (full command)
diann \
--dir /data/raw \
--fasta /databases/human_reviewed_2026.fasta \
--out /results/experiment1/report.tsv \
--out-lib /results/experiment1/library.tsv \
--gen-spec-lib \
--predictor \
--smart-profiling \
--peak-center \
--no-ifs-removal \
--qvalue 0.01 \
--reanalyse \
--threads 16 \
--verbose 1 2>&1 | tee /results/experiment1/diann.log
Expected runtime (rough estimates):
10 samples, 2h gradient: ~15-20 minutes
50 samples, 2h gradient: ~45-60 minutes
100 samples, 2h gradient: ~1.5-2 hours
(with --reanalyse enabled, add ~50%)
Step 5: Interpreting the Output
Key output files:
report.tsv — Main results (precursor-level)
report.pr_matrix.tsv — Precursor quantification matrix
report.pg_matrix.tsv — Protein group quantification matrix (MaxLFQ)
report.stats.tsv — Per-run statistics
library.tsv — Generated spectral library
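Before opening R, a one-line structural check on the protein-group matrix tells you how many samples and protein groups you got. The toy matrix below stands in for a real report.pg_matrix.tsv; it has two annotation columns, while the real file carries a few more, so adjust the NF offset to match:

```shell
# Sketch: count sample columns and protein-group rows in a pg matrix.
# Toy data; the real report.pg_matrix.tsv has more annotation columns.
printf 'Protein.Group\tGenes\t/data/raw/s1.raw\t/data/raw/s2.raw\n' >  /tmp/pg_matrix.tsv
printf 'P12345\tGENE1\t1021.5\t998.7\n'                             >> /tmp/pg_matrix.tsv
printf 'P67890\tGENE2\tNA\t455.1\n'                                 >> /tmp/pg_matrix.tsv

awk -F'\t' 'NR==1 {print NF-2, "sample columns"}
            END   {print NR-1, "protein groups"}' /tmp/pg_matrix.tsv
# → 2 sample columns
# → 2 protein groups
```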
Quick QC check:
library(data.table)
library(ggplot2)
# Load results
report <- fread("/results/experiment1/report.tsv")
# Check identification numbers per sample
ids_per_run <- report[Q.Value < 0.01,
.(n_proteins = uniqueN(Protein.Group),
n_precursors = .N),
by = Run]
print(ids_per_run)
# Expected for good human plasma data:
# n_proteins: 500-800 (depleted) or 300-500 (neat)
# n_precursors: 5,000-20,000
# Visualize
ggplot(ids_per_run, aes(x = Run, y = n_proteins)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Proteins per Run", y = "Protein Groups")
Check quantification quality:
# Load protein matrix
prot_matrix <- fread("/results/experiment1/report.pg_matrix.tsv")
# Missing value percentage per sample
sample_cols <- grep("\\.raw$|\\.mzML$", names(prot_matrix), value = TRUE)
mv_pct <- sapply(sample_cols, function(col) {
sum(is.na(prot_matrix[[col]])) / nrow(prot_matrix) * 100
})
cat("Missing values per sample:\n")
print(round(mv_pct, 1))
# Good data: <30% missing values
# Problematic: >50% missing values
Common Errors and Solutions
Error: "No MS2 spectra found"
Cause: Wrong file format or corrupted raw file
# Check whether the raw file is readable by converting it to mzML
# with msconvert (ProteoWizard); corrupted files fail at this step
msconvert problem_file.raw --mzML -o /tmp/test/
diann --f /tmp/test/problem_file.mzML ...
Error: "Library is empty"
Cause: FASTA parsing issue or overly restrictive parameters
# Verify FASTA format
head -4 your_database.fasta
# Should look like:
# >sp|P12345|PROT_HUMAN Protein name OS=Homo sapiens
# MKTLLLTLVVVTIVCLDLGYTPQVKQLNHYQELLKSELDPYIYKIVPNLVSGFTVQLKK
# >sp|P67890|ANOT_HUMAN Another protein...
# Try relaxed parameters first
diann ... --min-pep-len 6 --max-pep-len 30 --missed-cleavages 2
Low Identification Rate (<500 proteins from human cell line)
Systematic troubleshooting:
# Step 1: Check mass accuracy
# Add --individual-mass-acc to let DIA-NN calibrate per run
diann ... --individual-mass-acc
# Step 2: Broaden m/z range
diann ... --min-pr-mz 200 --max-pr-mz 2000
# Step 3: Increase FDR threshold for testing
diann ... --qvalue 0.05 # Never use for publication
# Step 4: Check if DDA library helps
# Run a DDA search first, generate library, then use for DIA
Memory Errors
# Limit RAM usage
diann ... --max-ram 32 # Set to available RAM in GB
# Process in smaller batches
# Split samples into groups of 20-30
for i in {1..5}; do
start=$(( (i-1)*20 + 1 ))
end=$(( i*20 ))
files=$(ls /data/raw/*.raw | sed -n "${start},${end}p" | while read -r f; do printf -- '--f %s ' "$f"; done)
diann $files ... --out /results/batch_${i}/report.tsv
done
Downstream Analysis in R
Load and Prepare Data
library(data.table)
library(tidyverse)
# Load protein matrix
prot <- fread("report.pg_matrix.tsv")
# Extract sample columns and protein IDs
sample_cols <- grep("\\.raw$|\\.mzML$", names(prot), value = TRUE)
proteins <- prot$Protein.Group
# Create clean matrix
mat <- as.matrix(prot[, ..sample_cols])
rownames(mat) <- proteins
# Log2 transform
mat_log <- log2(mat + 1)
# Sample metadata
metadata <- data.frame(
sample = sample_cols,
condition = c(rep("control", 5), rep("treatment", 5)),
batch = factor(c(1, 1, 1, 2, 2, 1, 1, 2, 2, 2))
)
Normalization
library(limma)
# Quantile normalization
mat_norm <- normalizeBetweenArrays(mat_log, method = "quantile")
# Or median normalization (less aggressive)
mat_med <- sweep(mat_log, 2, apply(mat_log, 2, median, na.rm=TRUE))
mat_med <- mat_med + median(mat_log, na.rm=TRUE)
# Check normalization
boxplot(mat_norm, las=2, main="After Normalization")
Differential Expression
# Using limma for differential analysis
design <- model.matrix(~ condition + batch, data = metadata)
# Handle missing values
mat_imputed <- mat_norm
mat_imputed[is.na(mat_imputed)] <- min(mat_norm, na.rm=TRUE) - 1
fit <- lmFit(mat_imputed, design)
fit <- eBayes(fit)
results <- topTable(fit, coef = "conditiontreatment",
number = Inf, adjust.method = "BH")
# Significant proteins
sig <- results[results$adj.P.Val < 0.05 & abs(results$logFC) > 1, ]
cat("Significant proteins:", nrow(sig), "\n")
Volcano Plot
library(ggplot2)
library(ggrepel)
# Prepare data
volcano_data <- results %>%
mutate(
significance = case_when(
adj.P.Val < 0.05 & logFC > 1 ~ "Up",
adj.P.Val < 0.05 & logFC < -1 ~ "Down",
TRUE ~ "NS"
),
neg_log10_p = -log10(adj.P.Val),
label = ifelse(significance != "NS" & neg_log10_p > 5,
rownames(.), "")
)
ggplot(volcano_data, aes(x = logFC, y = neg_log10_p,
color = significance, label = label)) +
geom_point(alpha = 0.6, size = 1.5) +
geom_text_repel(size = 3, max.overlaps = 20) +
scale_color_manual(values = c("Up" = "#E74C3C",
"Down" = "#3498DB",
"NS" = "#95A5A6")) +
geom_vline(xintercept = c(-1, 1), linetype = "dashed", alpha = 0.5) +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", alpha = 0.5) +
labs(title = "Differential Protein Expression",
x = "log2 Fold Change",
y = "-log10 Adjusted P-value") +
theme_classic()
DIA-NN vs MaxQuant: When to Use Which
After 8 months of using both:
| Criterion | DIA-NN | MaxQuant |
|---|---|---|
| Speed | ⭐⭐⭐⭐⭐ (10x faster) | ⭐⭐ |
| DIA data | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| DDA data | ⭐⭐ | ⭐⭐⭐⭐⭐ |
| SILAC | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Memory usage | ⭐⭐⭐⭐ | ⭐⭐ |
| Cost | Free | Free (academic) |
| Documentation | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Community | Growing | Large, mature |
Use DIA-NN for: DIA experiments (obvious), large sample sets, when speed matters
Use MaxQuant for: DDA experiments, SILAC quantification, when you need the mature ecosystem
Tips from 8 Months of Daily Use
1. Always use --reanalyse for final analysis
The ~10-15% sensitivity improvement is worth the extra runtime.
2. --smart-profiling + --peak-center is your default combo
There is almost never a reason to turn these off.
3. Generate the library once, reuse it
If you're running multiple batches of the same sample type, generate the library once and reuse with --lib.
4. Check report.stats.tsv first
Before diving into results, always check run-level statistics for outliers.
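A quick way to do that from the shell, sketched on a toy file that mimics a few report.stats.tsv columns (the real file has many more):

```shell
# Sketch: scan per-run statistics for outliers using toy data in place
# of a real report.stats.tsv.
printf 'File.Name\tPrecursors.Identified\tProteins.Identified\n' >  /tmp/stats.tsv
printf '/data/raw/s1.raw\t41210\t4805\n'                         >> /tmp/stats.tsv
printf '/data/raw/s2.raw\t40987\t4771\n'                         >> /tmp/stats.tsv
printf '/data/raw/s3.raw\t12033\t1502\n'                         >> /tmp/stats.tsv

# Align columns for reading; runs far below the rest (s3 here) need a look
column -t /tmp/stats.tsv || cat /tmp/stats.tsv  # fall back if column is absent

# Lowest precursor count across runs — a crude outlier flag
awk -F'\t' 'NR==2 {min=$2} NR>2 && $2<min {min=$2} END {print min}' /tmp/stats.tsv
# → 12033
```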
5. Test --no-ifs-removal both ways
This flag disables DIA-NN's interference subtraction during quantification. Its effect is sample-dependent, so benchmark runs with and without it before making either setting your default.
6. FASTA database matters
Always use reviewed (Swiss-Prot) entries only. Adding TrEMBL massively inflates the search space without proportional ID gains for well-studied organisms.
Questions about your specific workflow? Leave a comment below — I respond to all of them.
For downstream analysis: see our R tutorial for proteomics data visualization.