Single-Cell RNA-seq Analysis: Complete Guide from Raw Data to Cell Type Annotation (2026)
End-to-end practical guide to scRNA-seq analysis using Seurat and Scanpy. Covers QC, normalization, batch correction, clustering, marker gene identification, cell type annotation, and trajectory analysis with code examples.
Single-cell RNA-seq has transformed transcriptomics from "average expression of millions of cells" to "individual cell-by-cell measurement." But the analytical pipeline is complex: a typical experiment generates 10,000+ cells × 20,000+ genes of sparse, noisy count data that requires multiple sequential analytical decisions. This guide walks through every step from raw Cell Ranger output to annotated cell type clusters — with the practical defaults that work, the parameter choices that matter, and the pitfalls that derail many first scRNA-seq projects.
TL;DR
- Use Seurat (R) if you're R-native; Scanpy (Python) if you prefer Python or work with massive datasets (>500k cells).
- Aggressive QC filtering is essential: mitochondrial fraction < 20%, 200-7,500 genes per cell, and dedicated doublet detection.
- Normalization matters less than people think — SCTransform (Seurat) or sc.pp.normalize_total + log1p (Scanpy) both work fine for most data.
- Batch correction is non-optional for multi-sample studies — use Harmony or scVI.
- Cell type annotation requires both automated tools (SingleR, ScType) and manual marker review — automated alone is insufficient.
Why Single-Cell RNA-seq Is Different from Bulk
Bulk RNA-seq measures average expression across millions of cells in a tissue sample. Single-cell RNA-seq (scRNA-seq) measures individual cells. The differences in data and analysis are profound:
| Property | Bulk RNA-seq | scRNA-seq |
|---|---|---|
| Samples | 10s to 100s | 1,000 to 1,000,000+ cells |
| Read depth per sample/cell | 20-50M reads | 1k-200k reads |
| Genes detected | 15,000-20,000 | 1,000-5,000 per cell |
| Drop-out (zeros) | Minimal | 70-95% of values are zero |
| Heterogeneity | Averaged out | Preserved — the whole point |
| Statistical methods | Standard (negative binomial) | Specialized, sparsity-aware |
| Computational cost | Modest | Significant (RAM, runtime) |
The high drop-out (genes that should be expressed registering as zero) is the central analytical challenge. Most scRNA-seq methods are designed around managing sparse data robustly.
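To make the sparsity concrete, here is a small pure-Python sketch (toy numbers, not real data) computing the fraction of zero entries in a count matrix:

```python
# Toy count matrix: 4 cells x 6 genes (rows = cells). Real scRNA-seq
# matrices are ~10,000 x ~20,000 and stored sparse for exactly this reason.
counts = [
    [0, 3, 0, 0, 1, 0],
    [5, 0, 0, 2, 0, 0],
    [0, 0, 0, 0, 4, 0],
    [1, 0, 2, 0, 0, 0],
]

n_values = sum(len(row) for row in counts)
n_zeros = sum(value == 0 for row in counts for value in row)
dropout_fraction = n_zeros / n_values  # fraction of zero entries

print(f"{dropout_fraction:.0%} of values are zero")  # 71% here; 70-95% is typical
```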
Choosing Your Tool: Seurat vs Scanpy
Both are mature, well-supported, and produce comparable results. The choice usually comes down to language preference and dataset size.
Seurat (R)
- Strengths: Comprehensive, excellent for visualization, large active community in biology, good integration with ggplot2
- Weaknesses: Memory-intensive for large datasets, slower than Scanpy on huge data
- Best for: Standard scRNA-seq analyses up to ~200k cells, R-native workflows
Scanpy (Python)
- Strengths: Memory-efficient, fast, scales well to >1M cells, integrates with broader Python ML ecosystem (PyTorch, scikit-learn)
- Weaknesses: Less mature visualization, Python-only
- Best for: Large datasets, integration with deep learning models, computational pipelines
For most labs, Seurat is the practical default unless you're working with 500k+ cells or need to integrate with Python-based tools.
End-to-End Analysis Pipeline
A standard scRNA-seq analysis follows roughly these stages:
Raw FASTQ files
↓
Cell Ranger (10x Genomics) or alternative (Salmon Alevin)
↓
Count matrix (cells × genes)
↓
Quality control (filter low-quality cells, doublets, mitochondrial%)
↓
Normalization (SCTransform or log-normalize)
↓
Feature selection (highly variable genes)
↓
Dimensionality reduction (PCA → UMAP/tSNE)
↓
Clustering (Leiden or Louvain)
↓
Batch correction (if multi-sample) — Harmony, scVI
↓
Marker gene identification (FindAllMarkers / rank_genes_groups)
↓
Cell type annotation (SingleR, ScType, manual)
↓
Optional: Trajectory analysis, RNA velocity, regulon analysis
Step 1: From FASTQ to Count Matrix
For 10x Genomics data, Cell Ranger is the standard. It handles:
- Barcode demultiplexing (cell identification)
- Read alignment to reference genome (STAR-based)
- UMI counting per gene per cell
- Outputs filtered count matrix
cellranger count \
--id=sample_01 \
--transcriptome=/path/to/refdata-gex-GRCh38-2024-A \
--fastqs=/path/to/fastqs \
--sample=Sample01 \
--localcores=16 \
--localmem=64
Alternative: Salmon Alevin (imported with tximeta) for faster, lighter-weight quantification of droplet data. For plate-based protocols such as SMART-seq2, per-cell bulk-style quantification (Salmon or STAR) is the usual route.
The output you'll work with is typically the filtered_feature_bc_matrix (10x format) — a sparse matrix of cells × genes with raw UMI counts.
Step 2: Loading Data and Initial Object Creation
Seurat
library(Seurat)
library(dplyr)
# Load 10x output
counts <- Read10X(data.dir = "filtered_feature_bc_matrix/")
# Create Seurat object with basic filtering
seurat_obj <- CreateSeuratObject(
counts = counts,
project = "MyExperiment",
min.cells = 3, # Gene detected in at least 3 cells
min.features = 200 # Cell expresses at least 200 genes
)
# Basic stats
seurat_obj
# An object of class Seurat
# 18847 features across 8523 samples within 1 assay
Scanpy
import scanpy as sc
import pandas as pd
import numpy as np
# Load 10x output
adata = sc.read_10x_mtx(
"filtered_feature_bc_matrix/",
var_names="gene_symbols",
cache=True
)
# Basic filtering
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
print(adata)
# AnnData object with n_obs × n_vars = 8523 × 18847
Step 3: Quality Control — The Most Important Step
Bad QC = bad analysis. Three metrics matter most:
Mitochondrial Percentage
High mitochondrial gene content indicates dying or stressed cells (membrane integrity is lost, cytoplasmic mRNAs leak out, mitochondrial RNAs become disproportionately abundant).
Total Gene Count Per Cell
- Too low (e.g., <200): Empty droplets or dying cells
- Too high (e.g., >7500): Likely doublets — two cells captured in one droplet
UMI Count Per Cell
- Too low: Low-quality capture
- Too high: Doublets
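The three thresholds above amount to a simple per-cell predicate. A minimal pure-Python sketch (threshold values are the defaults used in this guide; tune them per dataset):

```python
def passes_qc(n_genes, pct_mito,
              min_genes=200, max_genes=7500, max_pct_mito=20.0):
    """Return True if a cell passes the basic QC thresholds used in this guide."""
    return (min_genes < n_genes < max_genes) and (pct_mito < max_pct_mito)

# (n_genes_detected, percent_mitochondrial) for a few hypothetical cells
cells = [(1500, 5.0),   # healthy cell -> keep
         (120, 3.0),    # empty droplet / debris -> drop
         (9000, 4.0),   # likely doublet -> drop
         (2000, 35.0)]  # dying cell (high mito) -> drop

kept = [c for c in cells if passes_qc(*c)]
print(kept)  # [(1500, 5.0)]
```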
Seurat QC
# Calculate mitochondrial percentage
seurat_obj[["percent.mt"]] <- PercentageFeatureSet(
seurat_obj,
pattern = "^MT-" # Human; use "^mt-" for mouse
)
# Visualize
VlnPlot(
seurat_obj,
features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
ncol = 3
)
FeatureScatter(seurat_obj, "nCount_RNA", "percent.mt")
FeatureScatter(seurat_obj, "nCount_RNA", "nFeature_RNA")
# Filter
seurat_obj <- subset(
seurat_obj,
subset = nFeature_RNA > 200 &
nFeature_RNA < 7500 &
percent.mt < 20
)
Scanpy QC
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(
adata,
qc_vars=["mt"],
percent_top=None,
log1p=False,
inplace=True
)
# Visualize
sc.pl.violin(
adata,
["n_genes_by_counts", "total_counts", "pct_counts_mt"],
jitter=0.4, multi_panel=True
)
sc.pl.scatter(adata, "total_counts", "pct_counts_mt")
sc.pl.scatter(adata, "total_counts", "n_genes_by_counts")
# Filter
adata = adata[
    (adata.obs.n_genes_by_counts > 200)
    & (adata.obs.n_genes_by_counts < 7500)
    & (adata.obs.pct_counts_mt < 20)
].copy()  # single boolean mask; .copy() materializes the AnnData view
Doublet Detection
Beyond simple thresholds, dedicated doublet detection tools find heterotypic doublets that look "normal" by basic QC. Recommended:
- DoubletFinder (Seurat) — Simulates artificial doublets and identifies real ones by similarity
- Scrublet (Python, integrates with Scanpy)
- scDblFinder (R) — Alternative to DoubletFinder
Always run doublet detection before downstream analysis. Doublet rates scale with cell loading: commonly quoted 10x figures are roughly 0.8-1% per 1,000 cells captured, so a 10,000-cell run carries on the order of 8-10% doublets to remove.
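That loading rule of thumb is easy to write down directly. A quick sketch (the rate-per-1,000-cells figure is a commonly quoted 10x approximation, not a vendor spec):

```python
def expected_doublet_rate(n_cells_loaded, rate_per_1000=0.008):
    """Approximate fraction of doublets in a droplet run.

    rate_per_1000: commonly quoted 10x rule of thumb (~0.8-1% per
    1,000 cells captured); an approximation, not a guarantee.
    """
    return rate_per_1000 * (n_cells_loaded / 1000)

for n in (1_000, 5_000, 10_000):
    print(f"{n} cells -> ~{expected_doublet_rate(n):.1%} doublets")
```

This is also the number you pass to tools like Scrublet as expected_doublet_rate.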
Step 4: Normalization
Two main approaches; differences are typically small.
LogNormalize (Standard)
Divide counts by cell total counts, multiply by 10,000, log-transform:
# Seurat
seurat_obj <- NormalizeData(
seurat_obj,
normalization.method = "LogNormalize",
scale.factor = 10000
)
# Scanpy
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
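What NormalizeData / normalize_total + log1p actually compute per matrix entry is simple enough to write out by hand. A pure-Python sketch of the formula (not the library code itself):

```python
import math

def log_normalize(count, cell_total, scale_factor=10_000):
    """log1p of counts-per-10k: log(1 + count / cell_total * scale_factor)."""
    return math.log1p(count / cell_total * scale_factor)

# One cell with 5,000 total UMIs; a gene with 10 of them in that cell
value = log_normalize(10, 5_000)
print(round(value, 4))  # log1p(20) ~= 3.0445
```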
SCTransform (Seurat) / scTransform-equivalent
Uses regularized negative binomial regression to remove technical variability, often improves downstream analysis:
seurat_obj <- SCTransform(
seurat_obj,
vars.to.regress = "percent.mt",
verbose = FALSE
)
For most analyses, LogNormalize is fine. SCTransform may give slightly cleaner results for complex datasets.
Step 5: Feature Selection — Highly Variable Genes
Most genes show little cell-to-cell variation. Focusing on the most variable genes (typically 2000-3000) speeds computation and reduces noise.
# Seurat
seurat_obj <- FindVariableFeatures(
seurat_obj,
selection.method = "vst",
nfeatures = 2000
)
top10 <- head(VariableFeatures(seurat_obj), 10)
VariableFeaturePlot(seurat_obj)
# Scanpy
sc.pp.highly_variable_genes(
adata,
min_mean=0.0125,
max_mean=3,
min_disp=0.5
)
sc.pl.highly_variable_genes(adata)
adata = adata[:, adata.var.highly_variable]
Step 6: Dimensionality Reduction
After feature selection, each cell is still a point in a ~2,000-dimensional gene space, which is impossible to inspect directly. Reduction to 2D/3D for visualization happens in two steps:
- PCA: Reduce to 30-50 principal components — captures most biological variance
- UMAP / tSNE: Reduce PCA to 2D for visualization
# Seurat
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj, features = VariableFeatures(seurat_obj))
# Choose number of PCs (look for "elbow")
ElbowPlot(seurat_obj, ndims = 50)
# UMAP
seurat_obj <- RunUMAP(seurat_obj, dims = 1:30)
DimPlot(seurat_obj, reduction = "umap")
# Scanpy
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, n_comps=50)
sc.pl.pca_variance_ratio(adata, log=True)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)
sc.tl.umap(adata)
sc.pl.umap(adata)
UMAP has largely replaced tSNE as the default: it is faster and tends to preserve global structure better (though see Pitfall 5 on over-interpreting either embedding).
Step 7: Clustering
Cells in the UMAP space are clustered into discrete groups, which (after annotation) become cell types.
Leiden (Recommended) and Louvain Clustering
# Seurat (Louvain by default; for Leiden, set algorithm = 4)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:30)
seurat_obj <- FindClusters(seurat_obj, resolution = 0.5, algorithm = 4)
DimPlot(seurat_obj, reduction = "umap", label = TRUE)
# Scanpy
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color="leiden")
The resolution parameter controls cluster granularity:
- Low (0.1-0.3): Few large clusters (broad cell types)
- Medium (0.4-0.8): Standard, ~10-20 clusters
- High (1.0-2.0): Many fine clusters (cell states, subpopulations)
Practical advice: Try multiple resolutions and pick the one where cluster boundaries match obvious biological transitions.
Step 8: Batch Correction (Multi-Sample Studies)
When you have samples from different patients, days, or batches, technical variation can dominate biological signal. The telltale sign: PCA/UMAP separates cells by batch instead of by cell type.
Harmony (Fast, Effective)
library(harmony)
seurat_obj <- RunHarmony(
seurat_obj,
group.by.vars = "batch", # Your batch variable
reduction = "pca",
dims.use = 1:30
)
# Use Harmony embedding for downstream analysis
seurat_obj <- RunUMAP(seurat_obj, reduction = "harmony", dims = 1:30)
seurat_obj <- FindNeighbors(seurat_obj, reduction = "harmony", dims = 1:30)
seurat_obj <- FindClusters(seurat_obj, resolution = 0.5)
scVI (Deep Learning, Best for Complex Datasets)
scVI uses a variational autoencoder for batch correction and is particularly effective for very large or complex multi-batch datasets:
import scvi
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata, n_layers=2, n_latent=30)
model.train()
adata.obsm["X_scVI"] = model.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
Step 9: Cluster Marker Identification
Find genes that distinguish each cluster — these become the basis for cell type annotation.
# Seurat
all_markers <- FindAllMarkers(
seurat_obj,
only.pos = TRUE,
min.pct = 0.25,
logfc.threshold = 0.25
)
top_markers <- all_markers %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC)
# Visualize top markers
DoHeatmap(seurat_obj, features = top_markers$gene)
# Scanpy
sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
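The key columns these tools report (Seurat's avg_log2FC, pct.1, pct.2) can be computed by hand for a toy gene. A simplified pure-Python sketch on already-normalized values (real tools add statistical tests and work at scale):

```python
import math

def marker_stats(expr_in_cluster, expr_in_rest, pseudocount=1.0):
    """Avg log2 fold-change and detection fractions for one gene.

    expr_*: per-cell normalized expression values. Mirrors the spirit of
    Seurat's avg_log2FC / pct.1 / pct.2 columns (a simplification).
    """
    mean_in = sum(expr_in_cluster) / len(expr_in_cluster)
    mean_out = sum(expr_in_rest) / len(expr_in_rest)
    log2fc = math.log2((mean_in + pseudocount) / (mean_out + pseudocount))
    pct_in = sum(v > 0 for v in expr_in_cluster) / len(expr_in_cluster)
    pct_out = sum(v > 0 for v in expr_in_rest) / len(expr_in_rest)
    return log2fc, pct_in, pct_out

log2fc, pct1, pct2 = marker_stats([3.0, 2.5, 0.0, 4.0], [0.0, 0.5, 0.0, 0.0])
print(f"log2FC={log2fc:.2f}, pct.1={pct1:.2f}, pct.2={pct2:.2f}")
```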
Step 10: Cell Type Annotation
This is where biology meets computation. Three complementary strategies:
Automated Annotation Tools
- SingleR: Reference-based annotation using bulk RNA-seq references
- ScType: Marker-database-based, supports many tissues
- CellTypist: Hierarchical classification with confidence scores
- Azimuth (Seurat): Reference-mapped annotation for several common tissues
library(SingleR)
library(celldex)
# Load reference (Human Primary Cell Atlas)
ref <- celldex::HumanPrimaryCellAtlasData()
# Run SingleR
predictions <- SingleR(
test = GetAssayData(seurat_obj, layer = "data"),  # use slot = "data" on Seurat v4
ref = ref,
labels = ref$label.main
)
seurat_obj$SingleR.labels <- predictions$labels
DimPlot(seurat_obj, group.by = "SingleR.labels", label = TRUE)
Manual Marker-Based Annotation
For each cluster, examine the top marker genes and look up known cell type markers:
# Visualize known markers for various cell types
FeaturePlot(seurat_obj,
features = c("CD3E", # T cells
"CD19", # B cells
"CD14", # Monocytes
"FCGR3A", # NK cells (also CD16+ monocytes)
"PPBP")) # Platelets
Common immune cell markers (PBMC reference):
- T cells: CD3E, CD3D, CD3G, IL7R
- CD8 T cells: CD8A, CD8B
- B cells: CD19, CD79A, MS4A1
- NK cells: KLRD1, NKG7, GNLY
- Monocytes: CD14, FCGR3A, CD68
- Dendritic cells: FCER1A, CST3, HLA-DRA
- Platelets: PPBP, PF4
Best Practice: Hybrid Approach
- Run automated annotation (SingleR or Azimuth) for initial labels
- Examine top markers per cluster manually
- Cross-reference with literature-known markers for your tissue
- Consider sub-clustering ambiguous regions (e.g., resolution ↑ for T cells to distinguish CD4/CD8/Treg)
- Validate by gene set scoring with tissue-specific signatures
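Step 5 above (gene set scoring) reduces, at its simplest, to averaging a signature's expression per cluster. A minimal pure-Python sketch with toy numbers (tools like Seurat's AddModuleScore or sc.tl.score_genes also subtract a matched control gene set):

```python
# Mean normalized expression per cluster, keyed by gene (toy values)
cluster_means = {
    "cluster_0": {"CD3E": 2.1, "IL7R": 1.8, "MS4A1": 0.1},
    "cluster_1": {"CD3E": 0.1, "IL7R": 0.2, "MS4A1": 2.4},
}
t_cell_signature = ["CD3E", "IL7R"]

def signature_score(means, signature):
    """Average expression of a marker signature (0.0 for missing genes)."""
    return sum(means.get(g, 0.0) for g in signature) / len(signature)

scores = {c: signature_score(m, t_cell_signature) for c, m in cluster_means.items()}
print(scores)  # cluster_0 scores high for the T-cell signature
```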
Optional Advanced Analyses
Trajectory Analysis
For developmental or differentiation studies:
- Monocle3 — Pseudotime trajectory inference
- Slingshot — Lineage inference
- PAGA + scVelo — Combined topology + RNA velocity
RNA Velocity
Predict future cell states from spliced/unspliced mRNA ratios:
- scVelo (Python) — Standard tool
- velocyto — Original implementation
Requires reprocessing FASTQ files with velocyto run to get spliced/unspliced counts.
Regulon Analysis (SCENIC)
Identify TF-driven gene programs (covered in detail in Part 3 of our functional analysis series):
- pySCENIC — Python implementation, scales well
- SCENIC (R) — Original implementation
Cell-Cell Communication
Predict signaling between cell types:
- CellChat (R) — Most popular
- CellPhoneDB (Python) — Database-backed
Computational Considerations
scRNA-seq analyses are computationally heavy. Rough requirements:
| Dataset Size | RAM | Recommended Setup |
|---|---|---|
| <10k cells | 16 GB | Laptop sufficient |
| 10-100k cells | 32-64 GB | Workstation |
| 100k-500k cells | 128 GB | Server, consider scVI |
| >500k cells | 256+ GB | HPC cluster; scVI and/or out-of-core batch processing required |
For very large datasets, consider:
- Scanpy over Seurat (better memory efficiency)
- scVI for batch correction (scales linearly)
- Subset to highly variable genes early
- Use sparse matrix representations throughout
Common Pitfalls
Pitfall 1: Inadequate QC Filtering
Failing to filter dying cells (high mito%) or doublets contaminates clusters with low-quality "ghost" cells that don't correspond to real biology.
Pitfall 2: Ignoring Batch Effects
Running multi-sample analysis without batch correction → clusters reflect batch differences, not cell types. Always check UMAP colored by batch before clustering.
Pitfall 3: Over-Trusting Automated Annotations
SingleR can be confidently wrong — always verify with manual marker review.
Pitfall 4: Wrong Resolution Parameter
Picking resolution = 0.5 by default without exploration may miss biological variation (under-clustering) or create artificial subdivisions (over-clustering).
Pitfall 5: Treating UMAP as a Map
UMAP distances are not biologically meaningful in any quantitative sense. Two distant clusters in UMAP may be biologically more similar than two adjacent ones. UMAP is a visualization aid, not a metric space.
Summary: A Realistic Workflow Timeline
For a typical 30k cell, 4-sample scRNA-seq dataset on a 64GB workstation:
- Cell Ranger (per sample, parallel): 4-6 hours
- Loading + QC + filtering: 30 min
- Normalization + HVG selection: 20 min
- PCA + UMAP + clustering: 30 min
- Batch correction (Harmony): 20 min
- Marker gene + annotation: 1-2 hours
- Manual review and refinement: 4-8 hours (the big one!)
Plan accordingly. The "wow, 30k cells in one UMAP" moment comes fast; working out what each cluster actually means biologically is the real time investment.
Conclusion
Single-cell RNA-seq analysis isn't a single technique — it's a sequential pipeline of decisions, each affecting the final result. Master the standard workflow first (10x → Seurat/Scanpy → cluster → annotate), then add advanced analyses (trajectory, velocity, communication) as your scientific questions demand.
Most importantly: biology should drive analysis choices, not the other way around. A perfect statistical pipeline that produces clusters without biological meaning is a failed analysis. Always cross-reference your findings with known biology and validate critical claims experimentally.
Further Reading
- Hao, Y. et al. (2024). Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nature Biotechnology. (Seurat v5)
- Wolf, F. A. et al. (2018). SCANPY: Large-scale single-cell gene expression data analysis. Genome Biology, 19(1).
- Korsunsky, I. et al. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 16(12).
- Lopez, R. et al. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12). (scVI)
- Heumos, L. et al. (2023). Best practices for single-cell analysis across modalities. Nature Reviews Genetics, 24(8).