How to Analyze Mass Spectrometry Data — A Complete Beginner's Guide
Step-by-step guide to analyzing mass spectrometry data for proteomics. Learn about raw data processing, database searching, quantification, and statistical analysis.
Introduction
You've just finished running your proteomics experiment. The mass spectrometer has generated gigabytes of raw data files. Now what? Analyzing mass spectrometry (MS) data can feel overwhelming for beginners, but it follows a logical workflow that, once understood, becomes second nature.
This guide walks you through the complete MS data analysis pipeline — from raw files to biological insights — with practical advice on software, parameters, and common pitfalls.
Understanding Your Raw Data
What's in a Raw File?
Mass spectrometry raw files contain thousands to hundreds of thousands of mass spectra. Each spectrum is essentially a plot of signal intensity versus mass-to-charge ratio (m/z). There are two main types:
- MS1 (Survey Scans): Show all peptide ions detected at a given time point. Think of this as an aerial photo of a crowded city — you can see everything but can't identify individuals.
- MS2 (Fragmentation Scans): Show the fragment ions of a selected peptide. This is like zooming in on one person and checking their ID.
Raw File Formats
Different instrument vendors use different file formats:
- Thermo: .raw
- Bruker: .d (folder)
- Sciex: .wiff
- Waters: .raw (folder)
- Agilent: .d
Most analysis software can read these directly, or you can convert them to open formats like mzML using tools like msconvert from ProteoWizard.
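If you need to do the conversion yourself, it's a one-liner. Here it is wrapped in R (the language used for the examples throughout this guide); a minimal sketch assuming ProteoWizard is installed, msconvert is on your PATH, and the file name is a placeholder:

```r
# Convert a vendor raw file to mzML with ProteoWizard's msconvert
# (hypothetical file name; "-o" sets the output directory)
system2("msconvert", args = c("sample01.raw", "--mzML", "-o", "mzml_out"))
```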
Step 1: Quality Control
Before diving into analysis, always check your data quality:
Run QC Metrics
- Total ion chromatogram (TIC): Should show a smooth, consistent profile across your LC gradient
- Number of MS2 scans: Low numbers may indicate instrument issues
- Peptide identification rate: Typically 20-50% of MS2 scans should yield identifications
- Mass accuracy: Should be within the expected range for your instrument (typically < 5 ppm for Orbitrap)
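As a quick illustration, here is how you might eyeball the TIC and scan counts yourself. This is a minimal R sketch using the Bioconductor mzR package, assuming the run has been converted to mzML (the file name is a placeholder):

```r
# Inspect per-spectrum metadata from an mzML file with mzR
# (install via BiocManager::install("mzR"))
library(mzR)

ms  <- openMSfile("sample01.mzML")
hdr <- header(ms)                 # one row of metadata per spectrum

table(hdr$msLevel)                # MS1 vs. MS2 scan counts

ms1 <- hdr[hdr$msLevel == 1, ]
plot(ms1$retentionTime / 60, ms1$totIonCurrent, type = "l",
     xlab = "Retention time (min)", ylab = "Total ion current",
     main = "Total ion chromatogram")
```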
Tools for QC
- RawMeat: Quick visualization of Thermo .raw files
- QCloud: Automated quality control for proteomics core facilities
- PTXQC: R package that generates QC reports from MaxQuant output
Step 2: Database Searching
The core of MS data analysis is database searching — matching your experimental spectra against theoretical spectra predicted from a protein sequence database.
How Database Searching Works
- You provide a protein database (usually from UniProt)
- The software performs an in silico digestion of all proteins using the same enzyme you used experimentally (typically trypsin)
- For each experimental MS2 spectrum, the software finds the best-matching theoretical spectrum
- A score reflects how well the experimental and theoretical spectra match
- False Discovery Rate (FDR) is calculated using a decoy database (reversed or shuffled sequences)
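To make step 2 concrete, here is a toy in silico tryptic digestion in R (cut after K or R, but not before P). A real search engine does this for every database entry and then predicts fragment masses for each peptide; the sequence below is arbitrary:

```r
# Toy trypsin digestion: cleave after K/R unless the next residue is P
digest_trypsin <- function(protein, missed = 2) {
  pieces <- strsplit(gsub("(?<=[KR])(?!P)", " ", protein, perl = TRUE), " ")[[1]]
  out <- character(0)
  for (m in 0:missed) {                       # up to `missed` missed cleavages
    if (length(pieces) - m < 1) next
    for (i in seq_len(length(pieces) - m)) {
      out <- c(out, paste(pieces[i:(i + m)], collapse = ""))
    }
  }
  unique(out)
}

digest_trypsin("MKWVTFISLLLLFSSAYSRGVFRR", missed = 1)
```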
Choosing Your Search Engine
| Software | Speed | Best For | Cost |
|---|---|---|---|
| MaxQuant/Andromeda | Moderate | DDA, label-free, SILAC | Free |
| MSFragger | Very fast | Large datasets, open searches | Free |
| Comet | Fast | General-purpose DDA | Free |
| SEQUEST | Moderate | Thermo instruments | Commercial |
| Mascot | Moderate | Established labs | Commercial |
Critical Search Parameters
Setting the right parameters is crucial:
- Enzyme: Trypsin/P (most common), with up to 2 missed cleavages
- Fixed modifications: Carbamidomethylation of cysteine (if you used iodoacetamide)
- Variable modifications: Oxidation of methionine, N-terminal acetylation
- Mass tolerance: 10-20 ppm for MS1, 0.02-0.05 Da for high-resolution MS2
- FDR threshold: 1% at both peptide and protein levels (standard in the field)
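The decoy-based FDR logic from step 5 above is simple enough to sketch: at any score cutoff, the FDR is estimated as the number of decoy hits divided by the number of target hits above that cutoff. The PSM scores below are simulated, not real search output:

```r
# Target-decoy FDR estimation on simulated PSM scores
set.seed(1)
psms <- data.frame(
  score    = c(rnorm(900, mean = 30, sd = 8),   # targets: higher scores
               rnorm(100, mean = 15, sd = 5)),  # decoys: lower scores
  is_decoy = rep(c(FALSE, TRUE), c(900, 100))
)

estimate_fdr <- function(cutoff) {
  hits <- psms[psms$score >= cutoff, ]
  sum(hits$is_decoy) / sum(!hits$is_decoy)      # decoys / targets
}

# Most permissive cutoff that keeps the estimated FDR at or below 1%
cutoffs <- sort(unique(psms$score))
min(cutoffs[sapply(cutoffs, estimate_fdr) <= 0.01])
```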
The FASTA Database
Your search database should include:
- The proteome of your organism (e.g., human: ~20,400 reviewed entries from UniProt)
- Common contaminants (keratins, trypsin, BSA) — MaxQuant includes a contaminant database automatically
- Decoy sequences for FDR calculation (usually generated automatically)
Pro tip: Using too large a database reduces sensitivity. Stick to the relevant organism's reviewed (Swiss-Prot) proteome unless you have a specific reason to include TrEMBL entries.
Step 3: Quantification
After identifying proteins, you need to quantify them. The method depends on your experimental design:
Label-Free Quantification (LFQ)
The simplest approach — no special reagents needed:
- Intensity-based: Sums the MS1 precursor intensities of each protein's peptides (see the sketch after this list)
- Spectral counting: Counts the number of MS2 spectra per protein
- MaxLFQ: MaxQuant's algorithm that provides accurate, normalized protein intensities
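As a deliberately simplified illustration of the intensity-based idea, here is a toy aggregation of peptide intensities to the protein level in R. Real algorithms like MaxLFQ are far more sophisticated (pairwise peptide ratios, normalization across runs), and all values here are made up:

```r
# Toy intensity-based quantification: sum peptide intensities per protein
peptides <- data.frame(
  protein   = c("P1", "P1", "P1", "P2", "P2"),
  intensity = c(2.1e7, 8.5e6, 1.2e7, 5.0e6, 7.3e6)
)
aggregate(intensity ~ protein, data = peptides, FUN = sum)
```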
Labeling-Based Quantification
- TMT/iTRAQ: Isobaric chemical labels that allow multiplexing up to 18 samples in one run
- SILAC: Metabolic labeling with heavy amino acids (best for cell culture)
- Dimethyl labeling: Cost-effective chemical labeling (2- or 3-plex)
DIA Quantification
For DIA data, tools like DIA-NN or Spectronaut extract ion chromatograms from the comprehensive MS2 data:
- Typically provides more complete quantification with fewer missing values
- Requires either a spectral library or a library-free approach based on predicted spectra
- Generally offers better reproducibility than DDA
Step 4: Statistical Analysis
Raw protein intensities need statistical processing before you can draw biological conclusions.
Normalization
- Median normalization: Adjusts for systematic differences in total protein loading
- Quantile normalization: Makes intensity distributions identical across samples
- Variance-stabilizing normalization (VSN): Handles the intensity-dependent variance typical of MS data
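Median normalization is easy to sketch in R: shift each sample so that all samples share the same median log2 intensity. The matrix below is simulated:

```r
# Median normalization on a simulated log2 protein-by-sample matrix
set.seed(2)
mat <- matrix(rnorm(60, mean = 25, sd = 2), nrow = 20,
              dimnames = list(paste0("prot", 1:20), paste0("s", 1:3)))
mat[, 2] <- mat[, 2] + 1                      # simulate a loading difference

meds     <- apply(mat, 2, median, na.rm = TRUE)
mat_norm <- sweep(mat, 2, meds - mean(meds))  # subtract per-sample offset
apply(mat_norm, 2, median)                    # medians now aligned
```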
Missing Value Handling
Missing values are a major challenge in proteomics. Common approaches:
- Filtering: Remove proteins with too many missing values (e.g., require values in at least 70% of samples per group)
- Imputation: Replace missing values using methods like:
  - MinProb (draws from a low-intensity distribution; see the sketch below)
  - KNN (k-nearest neighbors)
  - QRILC (quantile regression imputation of left-censored data)
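MinProb-style imputation fits in a few lines of R: draw replacements from a normal distribution shifted toward the low end of the observed intensities. The 1.8 SD down-shift and 0.3 SD width below mirror Perseus' defaults; the data are made up:

```r
# MinProb-style imputation: sample missing values from a down-shifted
# normal distribution (missing values are mostly low-abundance)
impute_minprob <- function(x, shift = 1.8, width = 0.3) {
  mu <- mean(x, na.rm = TRUE)
  s  <- sd(x, na.rm = TRUE)
  x[is.na(x)] <- rnorm(sum(is.na(x)), mean = mu - shift * s, sd = width * s)
  x
}

x <- c(24.1, NA, 26.3, 25.0, NA, 23.8)        # log2 intensities with gaps
impute_minprob(x)
```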
Differential Expression Analysis
To find proteins that change between conditions:
- t-test (two groups) or ANOVA (multiple groups) with multiple testing correction (Benjamini-Hochberg)
- limma (R package): Empirical Bayes moderated t-test, excellent for small sample sizes
- Perseus: User-friendly platform for statistical analysis of proteomics data
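Here is a minimal limma sketch for a two-group (3 vs. 3) comparison on a simulated log2 intensity matrix; in a real analysis the matrix would come from your MaxQuant/Perseus output after filtering, normalization, and imputation:

```r
# Moderated t-test with limma on simulated data (3 control vs. 3 treated)
library(limma)

set.seed(3)
mat <- matrix(rnorm(600, mean = 25, sd = 2), nrow = 100,
              dimnames = list(paste0("prot", 1:100),
                              c("ctrl1", "ctrl2", "ctrl3",
                                "trt1",  "trt2",  "trt3")))

group  <- factor(rep(c("ctrl", "trt"), each = 3))
design <- model.matrix(~ group)          # intercept + treatment effect

fit <- eBayes(lmFit(mat, design))        # empirical Bayes moderation
topTable(fit, coef = "grouptrt", adjust.method = "BH", number = 10)
```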
Visualization
Key plots for proteomics data:
- Volcano plot: Shows fold change vs. statistical significance
- Heatmap: Displays protein expression patterns across samples
- PCA plot: Reveals overall sample clustering and outliers
- Correlation matrix: Assesses reproducibility between replicates
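A basic volcano plot takes only a few lines of base R. This continues the simulated limma example above; with real data you would expect clear outliers rather than pure noise:

```r
# Volcano plot: log2 fold change vs. -log10 adjusted p-value
res <- topTable(fit, coef = "grouptrt", number = Inf)
sig <- res$adj.P.Val < 0.05 & abs(res$logFC) > 1

plot(res$logFC, -log10(res$adj.P.Val), pch = 20,
     col = ifelse(sig, "red", "grey60"),
     xlab = "log2 fold change", ylab = "-log10 adjusted p-value")
abline(h = -log10(0.05), v = c(-1, 1), lty = 2)
```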
Step 5: Biological Interpretation
Pathway and Gene Ontology Analysis
- Gene Ontology (GO) enrichment: Identifies overrepresented biological processes, molecular functions, or cellular components
- KEGG pathway analysis: Maps proteins to metabolic and signaling pathways
- Reactome: Curated pathway database with excellent visualization
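Under the hood, most enrichment tools run a Fisher's exact (or hypergeometric) test per term: is the term more common among your significant proteins than in the background? A toy 2x2 example with made-up counts:

```r
# Fisher's exact test for one GO term (all counts are hypothetical)
#                in term   not in term
# significant        20            80
# background        200          4700
counts <- matrix(c(20, 80, 200, 4700), nrow = 2, byrow = TRUE,
                 dimnames = list(c("significant", "background"),
                                 c("in_term", "not_in_term")))
fisher.test(counts, alternative = "greater")$p.value
```

Enrichment tools repeat this test for thousands of terms and then apply multiple testing correction across all of them.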
Network Analysis
- STRING: Database of known and predicted protein-protein interactions
- Cytoscape: Visualizes interaction networks
- ClueGO: Cytoscape plugin for GO/pathway visualization
Integration with Other Omics
For the fullest picture, integrate proteomics with:
- Transcriptomics: Compare mRNA and protein levels
- Metabolomics: Link protein changes to metabolic consequences
- Phosphoproteomics: Understand signaling cascades
A Practical Workflow Example
Here's a concrete workflow for a typical DDA label-free experiment:
1. Convert raw files → mzML (if needed)
2. MaxQuant search
- Human UniProt FASTA + contaminants
- Trypsin, 2 missed cleavages
- Carbamidomethyl (C) fixed
- Oxidation (M), Acetyl (N-term) variable
- LFQ enabled, match between runs ON
3. Perseus analysis
- Filter out reverse hits, potential contaminants, and "only identified by site" entries
- Log2 transform LFQ intensities
- Filter: valid values in ≥70% of at least one group
- Impute missing values (MinProb)
- t-test with Benjamini-Hochberg correction
- Volcano plot (S0=0.1, FDR=0.05)
4. Enrichment analysis
- GO/KEGG enrichment of significant proteins
- Fisher's exact test with FDR correction
Common Mistakes to Avoid
- Not using contaminant databases: Keratins from skin contamination can dominate your results
- Mishandling missing values: Simply removing every protein with any missing value is too aggressive
- No multiple testing correction: Without it, you'll have many false positives
- Using the wrong database: Make sure your FASTA matches your organism and is up to date
- Skipping QC: A failed LC run can ruin your entire dataset
- Over-interpreting single experiments: Always validate key findings with biological replicates (minimum n=3)
Recommended Learning Resources
- MaxQuant Summer School: Free annual training course (maxquant.org)
- Computational Proteomics course on Coursera
- Bioconductor proteomics workflows: Excellent R-based tutorials
- Nature Protocols: Published step-by-step proteomics analysis guides
Conclusion
Analyzing mass spectrometry data follows a clear pipeline: quality control → database searching → quantification → statistical analysis → biological interpretation. While the details can be complex, the fundamental logic is straightforward.
Start with well-established tools like MaxQuant and Perseus for DDA data, or DIA-NN for DIA data. Focus on understanding the principles behind each step rather than just clicking buttons. And always, always include proper quality control and statistical rigor.
The proteomics community is welcoming and collaborative. Don't hesitate to reach out on forums like the MaxQuant Google Group or the Proteomics subreddit when you get stuck.