Single-Cell RNA-seq Analysis: Complete Guide from Raw Data to Cell Type Annotation (2026)
End-to-end practical guide to scRNA-seq analysis using Seurat and Scanpy. Covers QC, normalization, batch correction, clustering, marker gene identification, cell type annotation, and trajectory analysis with code examples.
Single-cell RNA-seq has transformed transcriptomics from "average expression of millions of cells" to "individual cell-by-cell measurement." But the analytical pipeline is complex: a typical experiment generates 10,000+ cells × 20,000+ genes of sparse, noisy count data that requires multiple sequential analytical decisions. This guide walks through every step from raw Cell Ranger output to annotated cell type clusters — with the practical defaults that work, the parameter choices that matter, and the pitfalls that derail many first scRNA-seq projects.
TL;DR
- Use Seurat (R) if you're R-native; Scanpy (Python) if you prefer Python or work with massive datasets (>500k cells).
- Aggressive QC filtering is essential: mitochondrial fraction < 20%, 200-7,500 genes per cell, and dedicated doublet detection.
- Normalization matters less than people think — SCTransform (Seurat) or sc.pp.normalize_total + log1p (Scanpy) both work fine for most data.
- Batch correction is non-optional for multi-sample studies — use Harmony or scVI.
- Cell type annotation requires both automated tools (SingleR, ScType) and manual marker review — automated alone is insufficient.
Why Single-Cell RNA-seq Is Different from Bulk
Bulk RNA-seq measures average expression across millions of cells in a tissue sample. Single-cell RNA-seq (scRNA-seq) measures individual cells. The differences in data and analysis are profound:
| Property | Bulk RNA-seq | scRNA-seq |
|---|---|---|
| Samples | 10s to 100s | 1,000 to 1,000,000+ cells |
| Read depth per sample/cell | 20-50M reads | 1k-200k reads |
| Genes detected | 15,000-20,000 | 1,000-5,000 per cell |
| Drop-out (zeros) | Minimal | 70-95% of values are zero |
| Heterogeneity | Averaged out | Preserved — the whole point |
| Statistical methods | Standard (negative binomial) | Specialized, sparsity-aware |
| Computational cost | Modest | Significant (RAM, runtime) |
The high drop-out (genes that should be expressed registering as zero) is the central analytical challenge. Most scRNA-seq methods are designed around managing sparse data robustly.
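To make the sparsity concrete, here is a small pure-Python sketch (toy numbers, not real data) computing the fraction of zero entries in a count matrix:

```python
# Toy count matrix: 4 cells x 6 genes (rows = cells). Real scRNA-seq
# matrices are ~10,000 x ~20,000 and stored sparse for exactly this reason.
counts = [
    [0, 3, 0, 0, 1, 0],
    [5, 0, 0, 2, 0, 0],
    [0, 0, 0, 0, 4, 0],
    [1, 0, 2, 0, 0, 0],
]

n_values = sum(len(row) for row in counts)
n_zeros = sum(value == 0 for row in counts for value in row)
dropout_fraction = n_zeros / n_values  # fraction of zero entries

print(f"{dropout_fraction:.0%} of values are zero")  # 71% here; 70-95% is typical
```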
Choosing Your Tool: Seurat vs Scanpy
Both are mature, well-supported, and produce comparable results. The choice usually comes down to language preference and dataset size.
Seurat (R)
- Strengths: Comprehensive, excellent for visualization, large active community in biology, good integration with ggplot2
- Weaknesses: Memory-intensive for large datasets, slower than Scanpy on huge data
- Best for: Standard scRNA-seq analyses up to ~200k cells, R-native workflows
Scanpy (Python)
- Strengths: Memory-efficient, fast, scales well to >1M cells, integrates with broader Python ML ecosystem (PyTorch, scikit-learn)
- Weaknesses: Less mature visualization, Python-only
- Best for: Large datasets, integration with deep learning models, computational pipelines
For most labs, Seurat is the practical default unless you're working with 500k+ cells or need to integrate with Python-based tools.
End-to-End Analysis Pipeline
A standard scRNA-seq analysis follows roughly these stages:
Raw FASTQ files
↓
Cell Ranger (10x Genomics) or alternative (Salmon Alevin)
↓
Count matrix (cells × genes)
↓
Quality control (filter low-quality cells, doublets, mitochondrial%)
↓
Normalization (SCTransform or log-normalize)
↓
Feature selection (highly variable genes)
↓
Dimensionality reduction (PCA → UMAP/tSNE)
↓
Clustering (Leiden or Louvain)
↓
Batch correction (if multi-sample) — Harmony, scVI
↓
Marker gene identification (FindAllMarkers / rank_genes_groups)
↓
Cell type annotation (SingleR, ScType, manual)
↓
Optional: Trajectory analysis, RNA velocity, regulon analysis
Step 1: From FASTQ to Count Matrix
For 10x Genomics data, Cell Ranger is the standard. It handles:
- Barcode demultiplexing (cell identification)
- Read alignment to reference genome (STAR-based)
- UMI counting per gene per cell
- Outputs filtered count matrix
cellranger count \
--id=sample_01 \
--transcriptome=/path/to/refdata-gex-GRCh38-2024-A \
--fastqs=/path/to/fastqs \
--sample=Sample01 \
--localcores=16 \
--localmem=64
Alternative: Salmon Alevin (imported with tximeta) for faster, lighter-weight quantification of droplet data. For plate-based protocols such as SMART-seq2, per-cell bulk-style quantification (Salmon or STAR) is the usual route.
The output you'll work with is typically the filtered_feature_bc_matrix (10x format) — a sparse matrix of cells × genes with raw UMI counts.
Step 2: Loading Data and Initial Object Creation
Seurat
library(Seurat)
library(dplyr)
# Load 10x output
counts <- Read10X(data.dir = "filtered_feature_bc_matrix/")
# Create Seurat object with basic filtering
seurat_obj <- CreateSeuratObject(
counts = counts,
project = "MyExperiment",
min.cells = 3, # Gene detected in at least 3 cells
min.features = 200 # Cell expresses at least 200 genes
)
# Basic stats
seurat_obj
# An object of class Seurat
# 18847 features across 8523 samples within 1 assay
Scanpy
import scanpy as sc
import pandas as pd
import numpy as np
# Load 10x output
adata = sc.read_10x_mtx(
"filtered_feature_bc_matrix/",
var_names="gene_symbols",
cache=True
)
# Basic filtering
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
print(adata)
# AnnData object with n_obs × n_vars = 8523 × 18847
Step 3: Quality Control — The Most Important Step
Bad QC = bad analysis. Three metrics matter most:
Mitochondrial Percentage
High mitochondrial gene content indicates dying or stressed cells (membrane integrity is lost, cytoplasmic mRNAs leak out, mitochondrial RNAs become disproportionately abundant).
Total Gene Count Per Cell
- Too low (e.g., <200): Empty droplets or dying cells
- Too high (e.g., >7500): Likely doublets — two cells captured in one droplet
UMI Count Per Cell
- Too low: Low-quality capture
- Too high: Doublets
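The three thresholds above amount to a simple per-cell predicate. A minimal pure-Python sketch (threshold values are the defaults used in this guide; tune them per dataset):

```python
def passes_qc(n_genes, pct_mito,
              min_genes=200, max_genes=7500, max_pct_mito=20.0):
    """Return True if a cell passes the basic QC thresholds used in this guide."""
    return (min_genes < n_genes < max_genes) and (pct_mito < max_pct_mito)

# (n_genes_detected, percent_mitochondrial) for a few hypothetical cells
cells = [(1500, 5.0),   # healthy cell -> keep
         (120, 3.0),    # empty droplet / debris -> drop
         (9000, 4.0),   # likely doublet -> drop
         (2000, 35.0)]  # dying cell (high mito) -> drop

kept = [c for c in cells if passes_qc(*c)]
print(kept)  # [(1500, 5.0)]
```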
Seurat QC
# Calculate mitochondrial percentage
seurat_obj[["percent.mt"]] <- PercentageFeatureSet(
seurat_obj,
pattern = "^MT-" # Human; use "^mt-" for mouse
)
# Visualize
VlnPlot(
seurat_obj,
features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
ncol = 3
)
FeatureScatter(seurat_obj, "nCount_RNA", "percent.mt")
FeatureScatter(seurat_obj, "nCount_RNA", "nFeature_RNA")
# Filter
seurat_obj <- subset(
seurat_obj,
subset = nFeature_RNA > 200 &
nFeature_RNA < 7500 &
percent.mt < 20
)
Scanpy QC
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(
adata,
qc_vars=["mt"],
percent_top=None,
log1p=False,
inplace=True
)
# Visualize
sc.pl.violin(
adata,
["n_genes_by_counts", "total_counts", "pct_counts_mt"],
jitter=0.4, multi_panel=True
)
sc.pl.scatter(adata, "total_counts", "pct_counts_mt")
sc.pl.scatter(adata, "total_counts", "n_genes_by_counts")
# Filter
adata = adata[
    (adata.obs.n_genes_by_counts > 200)
    & (adata.obs.n_genes_by_counts < 7500)
    & (adata.obs.pct_counts_mt < 20)
].copy()  # single boolean mask; .copy() materializes the AnnData view
Doublet Detection
Beyond simple thresholds, dedicated doublet detection tools find heterotypic doublets that look "normal" by basic QC. Recommended:
- DoubletFinder (Seurat) — Simulates artificial doublets and identifies real ones by similarity
- Scrublet (Python, integrates with Scanpy)
- scDblFinder (R) — Alternative to DoubletFinder
Always run doublet detection before downstream analysis. Doublet rates scale with cell loading: commonly quoted 10x figures are roughly 0.8-1% per 1,000 cells captured, so a 10,000-cell run carries on the order of 8-10% doublets to remove.
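That loading rule of thumb is easy to write down directly. A quick sketch (the rate-per-1,000-cells figure is a commonly quoted 10x approximation, not a vendor spec):

```python
def expected_doublet_rate(n_cells_loaded, rate_per_1000=0.008):
    """Approximate fraction of doublets in a droplet run.

    rate_per_1000: commonly quoted 10x rule of thumb (~0.8-1% per
    1,000 cells captured); an approximation, not a guarantee.
    """
    return rate_per_1000 * (n_cells_loaded / 1000)

for n in (1_000, 5_000, 10_000):
    print(f"{n} cells -> ~{expected_doublet_rate(n):.1%} doublets")
```

This is also the number you pass to tools like Scrublet as expected_doublet_rate.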
Step 4: Normalization
Two main approaches; differences are typically small.
LogNormalize (Standard)
Divide counts by cell total counts, multiply by 10,000, log-transform:
# Seurat
seurat_obj <- NormalizeData(
seurat_obj,
normalization.method = "LogNormalize",
scale.factor = 10000
)
# Scanpy
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
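What NormalizeData / normalize_total + log1p actually compute per matrix entry is simple enough to write out by hand. A pure-Python sketch of the formula (not the library code itself):

```python
import math

def log_normalize(count, cell_total, scale_factor=10_000):
    """log1p of counts-per-10k: log(1 + count / cell_total * scale_factor)."""
    return math.log1p(count / cell_total * scale_factor)

# One cell with 5,000 total UMIs; a gene with 10 of them in that cell
value = log_normalize(10, 5_000)
print(round(value, 4))  # log1p(20) ~= 3.0445
```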
SCTransform (Seurat) / scTransform-equivalent
Uses regularized negative binomial regression to remove technical variability, often improves downstream analysis:
seurat_obj <- SCTransform(
seurat_obj,
vars.to.regress = "percent.mt",
verbose = FALSE
)
For most analyses, LogNormalize is fine. SCTransform may give slightly cleaner results for complex datasets.
Step 5: Feature Selection — Highly Variable Genes
Most genes show little cell-to-cell variation. Focusing on the most variable genes (typically 2000-3000) speeds computation and reduces noise.
# Seurat
seurat_obj <- FindVariableFeatures(
seurat_obj,
selection.method = "vst",
nfeatures = 2000
)
top10 <- head(VariableFeatures(seurat_obj), 10)
VariableFeaturePlot(seurat_obj)
# Scanpy
sc.pp.highly_variable_genes(
adata,
min_mean=0.0125,
max_mean=3,
min_disp=0.5
)
sc.pl.highly_variable_genes(adata)
adata = adata[:, adata.var.highly_variable]
Step 6: Dimensionality Reduction
After feature selection, each cell is still a point in a ~2,000-dimensional gene space, which is impossible to inspect directly. Reduction to 2D/3D for visualization happens in two steps:
- PCA: Reduce to 30-50 principal components — captures most biological variance
- UMAP / tSNE: Reduce PCA to 2D for visualization
# Seurat
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj, features = VariableFeatures(seurat_obj))
# Choose number of PCs (look for "elbow")
ElbowPlot(seurat_obj, ndims = 50)
# UMAP
seurat_obj <- RunUMAP(seurat_obj, dims = 1:30)
DimPlot(seurat_obj, reduction = "umap")
# Scanpy
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, n_comps=50)
sc.pl.pca_variance_ratio(adata, log=True)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)
sc.tl.umap(adata)
sc.pl.umap(adata)
UMAP has largely replaced tSNE as the default: it is faster and tends to preserve global structure better (though see Pitfall 5 on over-interpreting either embedding).
Step 7: Clustering
Cells in the UMAP space are clustered into discrete groups, which (after annotation) become cell types.
Leiden (Recommended) and Louvain Clustering
# Seurat (Louvain by default; for Leiden, set algorithm = 4)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:30)
seurat_obj <- FindClusters(seurat_obj, resolution = 0.5, algorithm = 4)
DimPlot(seurat_obj, reduction = "umap", label = TRUE)
# Scanpy
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color="leiden")
The resolution parameter controls cluster granularity:
- Low (0.1-0.3): Few large clusters (broad cell types)
- Medium (0.4-0.8): Standard, ~10-20 clusters
- High (1.0-2.0): Many fine clusters (cell states, subpopulations)
Practical advice: Try multiple resolutions and pick the one where cluster boundaries match obvious biological transitions.
Step 8: Batch Correction (Multi-Sample Studies)
When you have samples from different patients, days, or batches, technical variation can dominate biological signal. The telltale sign: PCA/UMAP separates cells by batch instead of by cell type.
Harmony (Fast, Effective)
library(harmony)
seurat_obj <- RunHarmony(
seurat_obj,
group.by.vars = "batch", # Your batch variable
reduction = "pca",
dims.use = 1:30
)
# Use Harmony embedding for downstream analysis
seurat_obj <- RunUMAP(seurat_obj, reduction = "harmony", dims = 1:30)
seurat_obj <- FindNeighbors(seurat_obj, reduction = "harmony", dims = 1:30)
seurat_obj <- FindClusters(seurat_obj, resolution = 0.5)
scVI (Deep Learning, Best for Complex Datasets)
scVI uses a variational autoencoder for batch correction and is particularly effective for very large or complex multi-batch datasets:
import scvi
scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata, n_layers=2, n_latent=30)
model.train()
adata.obsm["X_scVI"] = model.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
Step 9: Cluster Marker Identification
Find genes that distinguish each cluster — these become the basis for cell type annotation.
# Seurat
all_markers <- FindAllMarkers(
seurat_obj,
only.pos = TRUE,
min.pct = 0.25,
logfc.threshold = 0.25
)
top_markers <- all_markers %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC)
# Visualize top markers
DoHeatmap(seurat_obj, features = top_markers$gene)
# Scanpy
sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
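The key columns these tools report (Seurat's avg_log2FC, pct.1, pct.2) can be computed by hand for a toy gene. A simplified pure-Python sketch on already-normalized values (real tools add statistical tests and work at scale):

```python
import math

def marker_stats(expr_in_cluster, expr_in_rest, pseudocount=1.0):
    """Avg log2 fold-change and detection fractions for one gene.

    expr_*: per-cell normalized expression values. Mirrors the spirit of
    Seurat's avg_log2FC / pct.1 / pct.2 columns (a simplification).
    """
    mean_in = sum(expr_in_cluster) / len(expr_in_cluster)
    mean_out = sum(expr_in_rest) / len(expr_in_rest)
    log2fc = math.log2((mean_in + pseudocount) / (mean_out + pseudocount))
    pct_in = sum(v > 0 for v in expr_in_cluster) / len(expr_in_cluster)
    pct_out = sum(v > 0 for v in expr_in_rest) / len(expr_in_rest)
    return log2fc, pct_in, pct_out

log2fc, pct1, pct2 = marker_stats([3.0, 2.5, 0.0, 4.0], [0.0, 0.5, 0.0, 0.0])
print(f"log2FC={log2fc:.2f}, pct.1={pct1:.2f}, pct.2={pct2:.2f}")
```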
Step 10: Cell Type Annotation
This is where biology meets computation. Three complementary strategies:
Automated Annotation Tools
- SingleR: Reference-based annotation using bulk RNA-seq references
- ScType: Marker-database-based, supports many tissues
- CellTypist: Hierarchical classification with confidence scores
- Azimuth (Seurat): Reference-mapped annotation for several common tissues
library(SingleR)
library(celldex)
# Load reference (Human Primary Cell Atlas)
ref <- celldex::HumanPrimaryCellAtlasData()
# Run SingleR
predictions <- SingleR(
test = GetAssayData(seurat_obj, layer = "data"),  # use slot = "data" on Seurat v4
ref = ref,
labels = ref$label.main
)
seurat_obj$SingleR.labels <- predictions$labels
DimPlot(seurat_obj, group.by = "SingleR.labels", label = TRUE)
Manual Marker-Based Annotation
For each cluster, examine the top marker genes and look up known cell type markers:
# Visualize known markers for various cell types
FeaturePlot(seurat_obj,
features = c("CD3E", # T cells
"CD19", # B cells
"CD14", # Monocytes
"FCGR3A", # NK cells (also CD16+ monocytes)
"PPBP")) # Platelets
Common immune cell markers (PBMC reference):
- T cells: CD3E, CD3D, CD3G, IL7R
- CD8 T cells: CD8A, CD8B
- B cells: CD19, CD79A, MS4A1
- NK cells: KLRD1, NKG7, GNLY
- Monocytes: CD14, FCGR3A, CD68
- Dendritic cells: FCER1A, CST3, HLA-DRA
- Platelets: PPBP, PF4
Best Practice: Hybrid Approach
- Run automated annotation (SingleR or Azimuth) for initial labels
- Examine top markers per cluster manually
- Cross-reference with literature-known markers for your tissue
- Consider sub-clustering ambiguous regions (e.g., resolution ↑ for T cells to distinguish CD4/CD8/Treg)
- Validate by gene set scoring with tissue-specific signatures
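Step 5 above (gene set scoring) reduces, at its simplest, to averaging a signature's expression per cluster. A minimal pure-Python sketch with toy numbers (tools like Seurat's AddModuleScore or sc.tl.score_genes also subtract a matched control gene set):

```python
# Mean normalized expression per cluster, keyed by gene (toy values)
cluster_means = {
    "cluster_0": {"CD3E": 2.1, "IL7R": 1.8, "MS4A1": 0.1},
    "cluster_1": {"CD3E": 0.1, "IL7R": 0.2, "MS4A1": 2.4},
}
t_cell_signature = ["CD3E", "IL7R"]

def signature_score(means, signature):
    """Average expression of a marker signature (0.0 for missing genes)."""
    return sum(means.get(g, 0.0) for g in signature) / len(signature)

scores = {c: signature_score(m, t_cell_signature) for c, m in cluster_means.items()}
print(scores)  # cluster_0 scores high for the T-cell signature
```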
Optional Advanced Analyses
Trajectory Analysis
For developmental or differentiation studies:
- Monocle3 — Pseudotime trajectory inference
- Slingshot — Lineage inference
- PAGA + scVelo — Combined topology + RNA velocity
RNA Velocity
Predict future cell states from spliced/unspliced mRNA ratios:
- scVelo (Python) — Standard tool
- velocyto — Original implementation
Requires reprocessing FASTQ files with velocyto run to get spliced/unspliced counts.
Regulon Analysis (SCENIC)
Identify TF-driven gene programs (covered in detail in Part 3 of our functional analysis series):
- pySCENIC — Python implementation, scales well
- SCENIC (R) — Original implementation
Cell-Cell Communication
Predict signaling between cell types:
- CellChat (R) — Most popular
- CellPhoneDB (Python) — Database-backed
Computational Considerations
scRNA-seq analyses are computationally heavy. Rough requirements:
| Dataset Size | RAM | Recommended Setup |
|---|---|---|
| <10k cells | 16 GB | Laptop sufficient |
| 10-100k cells | 32-64 GB | Workstation |
| 100k-500k cells | 128 GB | Server, consider scVI |
| >500k cells | 256+ GB | HPC cluster; scVI and/or out-of-core batch processing required |
For very large datasets, consider:
- Scanpy over Seurat (better memory efficiency)
- scVI for batch correction (scales linearly)
- Subset to highly variable genes early
- Use sparse matrix representations throughout
Common Pitfalls
Pitfall 1: Inadequate QC Filtering
Failing to filter dying cells (high mito%) or doublets contaminates clusters with low-quality "ghost" cells that don't correspond to real biology.
Pitfall 2: Ignoring Batch Effects
Running multi-sample analysis without batch correction → clusters reflect batch differences, not cell types. Always check UMAP colored by batch before clustering.
Pitfall 3: Over-Trusting Automated Annotations
SingleR can be confidently wrong — always verify with manual marker review.
Pitfall 4: Wrong Resolution Parameter
Picking resolution = 0.5 by default without exploration may miss biological variation (under-clustering) or create artificial subdivisions (over-clustering).
Pitfall 5: Treating UMAP as a Map
UMAP distances are not biologically meaningful in any quantitative sense. Two distant clusters in UMAP may be biologically more similar than two adjacent ones. UMAP is a visualization aid, not a metric space.
Summary: A Realistic Workflow Timeline
For a typical 30k cell, 4-sample scRNA-seq dataset on a 64GB workstation:
- Cell Ranger (per sample, parallel): 4-6 hours
- Loading + QC + filtering: 30 min
- Normalization + HVG selection: 20 min
- PCA + UMAP + clustering: 30 min
- Batch correction (Harmony): 20 min
- Marker gene + annotation: 1-2 hours
- Manual review and refinement: 4-8 hours (the big one!)
Plan accordingly. The "wow, 30k cells in one UMAP" moment comes fast; working out what each cluster actually means biologically is the real time investment.
Conclusion
Single-cell RNA-seq analysis isn't a single technique — it's a sequential pipeline of decisions, each affecting the final result. Master the standard workflow first (10x → Seurat/Scanpy → cluster → annotate), then add advanced analyses (trajectory, velocity, communication) as your scientific questions demand.
Most importantly: biology should drive analysis choices, not the other way around. A perfect statistical pipeline that produces clusters without biological meaning is a failed analysis. Always cross-reference your findings with known biology and validate critical claims experimentally.
Further Reading
- Hao, Y. et al. (2024). Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nature Biotechnology. (Seurat v5)
- Wolf, F. A. et al. (2018). SCANPY: Large-scale single-cell gene expression data analysis. Genome Biology, 19(1).
- Korsunsky, I. et al. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 16(12).
- Lopez, R. et al. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12). (scVI)
- Heumos, L. et al. (2023). Best practices for single-cell analysis across modalities. Nature Reviews Genetics, 24(8).