Bioinformatics

Single-Cell RNA-seq Analysis: Complete Guide from Raw Data to Cell Type Annotation (2026)

End-to-end practical guide to scRNA-seq analysis using Seurat and Scanpy. Covers QC, normalization, batch correction, clustering, marker gene identification, cell type annotation, and trajectory analysis with code examples.

·13 min read
#scRNA-seq #single-cell #Seurat #Scanpy #cell type annotation #UMAP #clustering #Cell Ranger #transcriptomics

Single-cell RNA-seq has transformed transcriptomics from "average expression of millions of cells" to "individual cell-by-cell measurement." But the analytical pipeline is complex: a typical experiment generates 10,000+ cells × 20,000+ genes of sparse, noisy count data that requires multiple sequential analytical decisions. This guide walks through every step from raw Cell Ranger output to annotated cell type clusters — with the practical defaults that work, the parameter choices that matter, and the pitfalls that derail many first scRNA-seq projects.

TL;DR

  • Use Seurat (R) if you're R-native; Scanpy (Python) if you prefer Python or work with massive datasets (>500k cells).
  • Aggressive QC filtering is essential: mitochondrial % < 20, genes per cell between 200 and 7,500, plus dedicated doublet detection.
  • Normalization matters less than people think — SCTransform (Seurat) or sc.pp.normalize_total + log1p (Scanpy) both work fine for most data.
  • Batch correction is non-optional for multi-sample studies — use Harmony or scVI.
  • Cell type annotation requires both automated tools (SingleR, ScType) and manual marker review — automated alone is insufficient.

Why Single-Cell RNA-seq Is Different from Bulk

Bulk RNA-seq measures average expression across millions of cells in a tissue sample. Single-cell RNA-seq (scRNA-seq) measures individual cells. The differences in data and analysis are profound:

| Property | Bulk RNA-seq | scRNA-seq |
|---|---|---|
| Samples | 10s to 100s | 1,000 to 1,000,000+ cells |
| Read depth per sample/cell | 20-50M reads | 1k-200k reads |
| Genes detected | 15,000-20,000 | 1,000-5,000 per cell |
| Drop-out (zeros) | Minimal | 70-95% of values are zero |
| Heterogeneity | Averaged out | Preserved — the whole point |
| Statistical methods | Negative binomial, sufficient | Specialized (zero-inflated, sparse) |
| Computational cost | Modest | Significant (RAM, runtime) |

The high drop-out (genes that should be expressed registering as zero) is the central analytical challenge. Most scRNA-seq methods are designed around managing sparse data robustly.
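To get a feel for that sparsity, here is a minimal numpy/scipy sketch on simulated Poisson counts (toy data, not a real dataset) showing both the drop-out rate and why sparse storage matters:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
# Toy UMI count matrix: 1,000 cells x 2,000 genes with a low Poisson rate,
# which mimics the heavy drop-out of real scRNA-seq data (~80% zeros here)
dense = rng.poisson(0.2, size=(1000, 2000))
counts = sparse.csr_matrix(dense)

dropout = 1.0 - counts.nnz / (counts.shape[0] * counts.shape[1])
print(f"drop-out rate: {dropout:.1%}")

# Sparse (CSR) storage keeps only the non-zero entries plus their indices
dense_mb = dense.nbytes / 1e6
sparse_mb = (counts.data.nbytes + counts.indices.nbytes
             + counts.indptr.nbytes) / 1e6
print(f"dense: {dense_mb:.0f} MB, sparse: {sparse_mb:.1f} MB")
```

This is why both Seurat and Scanpy keep the count matrix in sparse form from the moment it is loaded.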

Choosing Your Tool: Seurat vs Scanpy

Both are mature, well-supported, and produce comparable results. The choice usually comes down to language preference and dataset size.

Seurat (R)

  • Strengths: Comprehensive, excellent for visualization, large active community in biology, good integration with ggplot2
  • Weaknesses: Memory-intensive for large datasets, slower than Scanpy on huge data
  • Best for: Standard scRNA-seq analyses up to ~200k cells, R-native workflows

Scanpy (Python)

  • Strengths: Memory-efficient, fast, scales well to >1M cells, integrates with broader Python ML ecosystem (PyTorch, scikit-learn)
  • Weaknesses: Less mature visualization, Python-only
  • Best for: Large datasets, integration with deep learning models, computational pipelines

For most labs, Seurat is the practical default unless you're working with 500k+ cells or need to integrate with Python-based tools.

End-to-End Analysis Pipeline

A standard scRNA-seq analysis follows roughly these stages:

Raw FASTQ files
    ↓
Cell Ranger (10x Genomics) or alternative (Salmon Alevin)
    ↓
Count matrix (cells × genes)
    ↓
Quality control (filter low-quality cells, doublets, mitochondrial%)
    ↓
Normalization (SCTransform or log-normalize)
    ↓
Feature selection (highly variable genes)
    ↓
Dimensionality reduction (PCA → UMAP/tSNE)
    ↓
Clustering (Leiden or Louvain)
    ↓
Batch correction (if multi-sample) — Harmony, scVI
    ↓
Marker gene identification (FindAllMarkers / rank_genes_groups)
    ↓
Cell type annotation (SingleR, ScType, manual)
    ↓
Optional: Trajectory analysis, RNA velocity, regulon analysis

Step 1: From FASTQ to Count Matrix

For 10x Genomics data, Cell Ranger is the standard. It handles:

  • Barcode demultiplexing (cell identification)
  • Read alignment to reference genome (STAR-based)
  • UMI counting per gene per cell
  • Outputs filtered count matrix
cellranger count \
  --id=sample_01 \
  --transcriptome=/path/to/refdata-gex-GRCh38-2024-A \
  --fastqs=/path/to/fastqs \
  --sample=Sample01 \
  --localcores=16 \
  --localmem=64

Alternative: Salmon Alevin + tximeta for faster, lighter-weight quantification; it also supports non-10x droplet chemistries (e.g., Drop-seq). For full-length plate-based protocols like SMART-seq2, standard per-cell Salmon/STAR quantification is the usual route.

The output you'll work with is typically the filtered_feature_bc_matrix (10x format) — a sparse matrix of cells × genes with raw UMI counts.

Step 2: Loading Data and Initial Object Creation

Seurat

library(Seurat)
library(dplyr)

# Load 10x output
counts <- Read10X(data.dir = "filtered_feature_bc_matrix/")

# Create Seurat object with basic filtering
seurat_obj <- CreateSeuratObject(
  counts        = counts,
  project       = "MyExperiment",
  min.cells     = 3,      # Gene detected in at least 3 cells
  min.features  = 200     # Cell expresses at least 200 genes
)

# Basic stats
seurat_obj
# An object of class Seurat
# 18847 features across 8523 samples within 1 assay

Scanpy

import scanpy as sc
import pandas as pd
import numpy as np

# Load 10x output
adata = sc.read_10x_mtx(
    "filtered_feature_bc_matrix/",
    var_names="gene_symbols",
    cache=True
)

# Basic filtering
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

print(adata)
# AnnData object with n_obs × n_vars = 8523 × 18847

Step 3: Quality Control — The Most Important Step

Bad QC = bad analysis. Three metrics matter most:

Mitochondrial Percentage

High mitochondrial gene content indicates dying or stressed cells (membrane integrity is lost, cytoplasmic mRNAs leak out, mitochondrial RNAs become disproportionately abundant).

Total Gene Count Per Cell

  • Too low (e.g., <200): Empty droplets or dying cells
  • Too high (e.g., >7500): Likely doublets — two cells captured in one droplet

UMI Count Per Cell

  • Too low: Low-quality capture
  • Too high: Doublets

Seurat QC

# Calculate mitochondrial percentage
seurat_obj[["percent.mt"]] <- PercentageFeatureSet(
  seurat_obj,
  pattern = "^MT-"   # Human; use "^mt-" for mouse
)

# Visualize
VlnPlot(
  seurat_obj,
  features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
  ncol = 3
)

FeatureScatter(seurat_obj, "nCount_RNA", "percent.mt")
FeatureScatter(seurat_obj, "nCount_RNA", "nFeature_RNA")

# Filter
seurat_obj <- subset(
  seurat_obj,
  subset = nFeature_RNA > 200 &
           nFeature_RNA < 7500 &
           percent.mt < 20
)

Scanpy QC

adata.var["mt"] = adata.var_names.str.startswith("MT-")  # Human; use "mt-" for mouse
sc.pp.calculate_qc_metrics(
    adata,
    qc_vars=["mt"],
    percent_top=None,
    log1p=False,
    inplace=True
)

# Visualize
sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4, multi_panel=True
)
sc.pl.scatter(adata, "total_counts", "pct_counts_mt")
sc.pl.scatter(adata, "total_counts", "n_genes_by_counts")

# Filter
# Filter (combine conditions and copy to avoid chained AnnData views)
adata = adata[
    (adata.obs.n_genes_by_counts > 200)
    & (adata.obs.n_genes_by_counts < 7500)
    & (adata.obs.pct_counts_mt < 20)
].copy()

Doublet Detection

Beyond simple thresholds, dedicated doublet detection tools find heterotypic doublets that look "normal" by basic QC. Recommended:

  • DoubletFinder (Seurat) — Simulates artificial doublets and identifies real ones by similarity
  • Scrublet (Python, integrates with Scanpy)
  • scDblFinder (R) — Alternative to DoubletFinder

Always run doublet detection before downstream analysis. A common rule of thumb for 10x data is that the doublet rate grows by roughly 1% per 1,000 cells loaded, so a 10k-cell capture carries ~10% doublets to remove.
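The rule of thumb above translates to simple arithmetic; a sketch (the 1%-per-1,000-cells rate is an approximation, and actual rates vary by platform and loading concentration):

```python
def expected_doublet_rate(n_cells, rate_per_1000=0.01):
    """Approximate doublet rate using the ~1% per 1,000 cells rule of thumb."""
    return n_cells / 1000 * rate_per_1000

for n in (1_000, 5_000, 10_000):
    rate = expected_doublet_rate(n)
    print(f"{n:>6} cells -> ~{rate:.0%} doublets (~{int(n * rate)} cells)")
```

This is also the number you would pass to a tool like Scrublet as its expected doublet rate.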

Step 4: Normalization

Two main approaches; differences are typically small.

LogNormalize (Standard)

Divide counts by cell total counts, multiply by 10,000, log-transform:

# Seurat
seurat_obj <- NormalizeData(
  seurat_obj,
  normalization.method = "LogNormalize",
  scale.factor = 10000
)
# Scanpy
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
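Both calls implement the same arithmetic; a numpy sketch of what happens under the hood (toy counts, illustrative only):

```python
import numpy as np

counts = np.array([[10, 0, 90],
                   [ 5, 5,  0]], dtype=float)   # 2 cells x 3 genes

# 1. Scale each cell to the same total (here 10,000 UMIs)
per_cell_total = counts.sum(axis=1, keepdims=True)
scaled = counts / per_cell_total * 10_000

# 2. log1p: log(x + 1) keeps zeros at zero and tames the dynamic range
lognorm = np.log1p(scaled)

print(lognorm.round(2))
```

Note that zeros stay exactly zero after both steps, which preserves the sparse structure of the matrix.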

SCTransform (Seurat)

Uses regularized negative binomial regression to remove technical variability, often improving downstream analysis (recent Scanpy versions ship an analogous Pearson-residual approach as sc.experimental.pp.normalize_pearson_residuals):

seurat_obj <- SCTransform(
  seurat_obj,
  vars.to.regress = "percent.mt",
  verbose = FALSE
)

For most analyses, LogNormalize is fine. SCTransform may give slightly cleaner results for complex datasets.

Step 5: Feature Selection — Highly Variable Genes

Most genes show little cell-to-cell variation. Focusing on the most variable genes (typically 2000-3000) speeds computation and reduces noise.

# Seurat
seurat_obj <- FindVariableFeatures(
  seurat_obj,
  selection.method = "vst",
  nfeatures = 2000
)
top10 <- head(VariableFeatures(seurat_obj), 10)
VariableFeaturePlot(seurat_obj)
# Scanpy
sc.pp.highly_variable_genes(
    adata,
    min_mean=0.0125,
    max_mean=3,
    min_disp=0.5
)
sc.pl.highly_variable_genes(adata)
adata = adata[:, adata.var.highly_variable]
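The Scanpy call above bins genes by mean expression and keeps those with high normalized dispersion; a simplified numpy sketch of the core idea (variance-to-mean ratio on simulated data, skipping the binning step that the real implementation uses):

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_genes = 500, 100
X = rng.poisson(1.0, size=(n_cells, n_genes)).astype(float)
# Plant 5 genuinely variable genes: their expression differs strongly by cell
X[:, :5] *= rng.choice([0.1, 10.0], size=(n_cells, 1))

mean = X.mean(axis=0)
var = X.var(axis=0)
dispersion = var / (mean + 1e-12)   # variance-to-mean ratio

n_top = 5
hvg_idx = np.argsort(dispersion)[-n_top:]
print(sorted(hvg_idx))
```

Pure Poisson noise gives a dispersion near 1, so the planted genes stand out clearly; real HVG selection refines this by normalizing dispersion within mean-expression bins.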

Step 6: Dimensionality Reduction

Even after restricting to ~2,000 highly variable genes, each cell is a point in a ~2,000-dimensional space, which is impossible to interpret directly. Reduction to 2D/3D for visualization happens in two steps:

  1. PCA: Reduce to 30-50 principal components — captures most biological variance
  2. UMAP / tSNE: Reduce PCA to 2D for visualization
# Seurat
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj, features = VariableFeatures(seurat_obj))

# Choose number of PCs (look for "elbow")
ElbowPlot(seurat_obj, ndims = 50)

# UMAP
seurat_obj <- RunUMAP(seurat_obj, dims = 1:30)
DimPlot(seurat_obj, reduction = "umap")
# Scanpy
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, n_comps=50)
sc.pl.pca_variance_ratio(adata, log=True)

sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)
sc.tl.umap(adata)
sc.pl.umap(adata)

UMAP has largely replaced tSNE as the default: it is faster and tends to preserve more global structure, though distances in either embedding should not be over-interpreted.

Step 7: Clustering

Cells are grouped into discrete clusters on the neighbor graph built in PCA space (not on the UMAP coordinates); after annotation, these clusters become cell types.

# Seurat (Louvain by default; algorithm = 4 selects Leiden,
# which requires the leidenalg Python package available via reticulate)
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:30)
seurat_obj <- FindClusters(seurat_obj, resolution = 0.5, algorithm = 4)
DimPlot(seurat_obj, reduction = "umap", label = TRUE)
# Scanpy
sc.tl.leiden(adata, resolution=0.5)
sc.pl.umap(adata, color="leiden")

The resolution parameter controls cluster granularity:

  • Low (0.1-0.3): Few large clusters (broad cell types)
  • Medium (0.4-0.8): Standard, ~10-20 clusters
  • High (1.0-2.0): Many fine clusters (cell states, subpopulations)

Practical advice: Try multiple resolutions and pick the one where cluster boundaries match obvious biological transitions.

Step 8: Batch Correction (Multi-Sample Studies)

When samples come from different patients, days, or batches, technical variation can dominate the biological signal. The telltale sign: the PCA/UMAP embedding separates by batch instead of by cell type.

Harmony (Fast, Effective)

library(harmony)

seurat_obj <- RunHarmony(
  seurat_obj,
  group.by.vars = "batch",  # Your batch variable
  reduction = "pca",
  dims.use = 1:30
)
# Use Harmony embedding for downstream analysis
seurat_obj <- RunUMAP(seurat_obj, reduction = "harmony", dims = 1:30)
seurat_obj <- FindNeighbors(seurat_obj, reduction = "harmony", dims = 1:30)
seurat_obj <- FindClusters(seurat_obj, resolution = 0.5)

scVI (Deep Learning, Best for Complex Datasets)

scVI uses a variational autoencoder for batch correction and is particularly effective for very large or complex multi-batch datasets:

import scvi

scvi.model.SCVI.setup_anndata(adata, batch_key="batch")
model = scvi.model.SCVI(adata, n_layers=2, n_latent=30)
model.train()
adata.obsm["X_scVI"] = model.get_latent_representation()

sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)

Step 9: Cluster Marker Identification

Find genes that distinguish each cluster — these become the basis for cell type annotation.

# Seurat
all_markers <- FindAllMarkers(
  seurat_obj,
  only.pos = TRUE,
  min.pct = 0.25,
  logfc.threshold = 0.25
)

top_markers <- all_markers %>%
  group_by(cluster) %>%
  slice_max(avg_log2FC, n = 10)

# Visualize top markers
DoHeatmap(seurat_obj, features = top_markers$gene)
# Scanpy
sc.tl.rank_genes_groups(adata, "leiden", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)

Step 10: Cell Type Annotation

This is where biology meets computation. Three complementary strategies:

Automated Annotation Tools

  • SingleR: Reference-based annotation using bulk RNA-seq references
  • ScType: Marker-database-based, supports many tissues
  • CellTypist: Hierarchical classification with confidence scores
  • Azimuth (Seurat): Reference-mapped annotation for several common tissues
library(SingleR)
library(celldex)

# Load reference (Human Primary Cell Atlas)
ref <- celldex::HumanPrimaryCellAtlasData()

# Run SingleR
predictions <- SingleR(
  test = GetAssayData(seurat_obj, layer = "data"),  # "slot" in Seurat v4
  ref = ref,
  labels = ref$label.main
)

seurat_obj$SingleR.labels <- predictions$labels
DimPlot(seurat_obj, group.by = "SingleR.labels", label = TRUE)

Manual Marker-Based Annotation

For each cluster, examine the top marker genes and look up known cell type markers:

# Visualize known markers for various cell types
FeaturePlot(seurat_obj,
            features = c("CD3E",     # T cells
                         "CD19",     # B cells (MS4A1 often detects better)
                         "CD14",     # Monocytes
                         "NKG7",     # NK cells
                         "FCGR3A",   # CD16+ monocytes (also NK)
                         "PPBP"))    # Platelets

Common immune cell markers (PBMC reference):

  • T cells: CD3E, CD3D, CD3G, IL7R
  • CD8 T cells: CD8A, CD8B
  • B cells: CD19, CD79A, MS4A1
  • NK cells: KLRD1, NKG7, GNLY
  • Monocytes: CD14, FCGR3A, CD68
  • Dendritic cells: FCER1A, CST3, HLA-DRA
  • Platelets: PPBP, PF4
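A toy version of what marker-database tools like ScType do internally: score each candidate cell type by the overlap between a cluster's top markers and known marker sets. The marker lists are abbreviated from the table above; the cluster's gene list is hypothetical:

```python
# Known marker sets (abbreviated from the PBMC list above)
MARKERS = {
    "T cells":   {"CD3E", "CD3D", "CD3G", "IL7R"},
    "B cells":   {"CD19", "CD79A", "MS4A1"},
    "NK cells":  {"KLRD1", "NKG7", "GNLY"},
    "Monocytes": {"CD14", "FCGR3A", "CD68"},
}

def score_cluster(top_genes, marker_sets=MARKERS):
    """Rank candidate cell types by the fraction of their markers
    found among a cluster's top differentially expressed genes."""
    top = set(top_genes)
    scores = {ct: len(top & genes) / len(genes)
              for ct, genes in marker_sets.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical top markers for one cluster
print(score_cluster(["CD3E", "IL7R", "CD3D", "LTB", "TRAC"]))
```

Real tools weight markers by specificity and expression level rather than simple overlap, but the ranking logic is the same.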

Best Practice: Hybrid Approach

  1. Run automated annotation (SingleR or Azimuth) for initial labels
  2. Examine top markers per cluster manually
  3. Cross-reference with literature-known markers for your tissue
  4. Consider sub-clustering ambiguous regions (e.g., resolution ↑ for T cells to distinguish CD4/CD8/Treg)
  5. Validate by gene set scoring with tissue-specific signatures

Optional Advanced Analyses

Trajectory Analysis

For developmental or differentiation studies:

  • Monocle3 — Pseudotime trajectory inference
  • Slingshot — Lineage inference
  • PAGA + scVelo — Combined topology + RNA velocity

RNA Velocity

Predict future cell states from spliced/unspliced mRNA ratios:

  • scVelo (Python) — Standard tool
  • velocyto — Original implementation

Requires reprocessing FASTQ files with velocyto run to get spliced/unspliced counts.
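The steady-state model behind these tools can be sketched in a few lines: fit the splicing/degradation ratio γ assuming most cells sit near equilibrium (u ≈ γ·s), then read velocity off the residual u − γ·s. This is simulated single-gene toy data, not a replacement for scVelo:

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy spliced (s) and unspliced (u) counts for one gene across 200 cells
s = rng.uniform(0, 10, size=200)
gamma_true = 0.4
u = gamma_true * s + rng.normal(0, 0.1, size=200)

# Steady-state assumption u ~ gamma * s: fit gamma by least squares
gamma_hat = (u @ s) / (s @ s)

# Velocity: positive -> gene being induced, negative -> being repressed
velocity = u - gamma_hat * s
print(f"gamma_hat = {gamma_hat:.2f}")
```

scVelo's dynamical model generalizes this by fitting the full transcription/splicing/degradation kinetics per gene instead of assuming steady state.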

Regulon Analysis (SCENIC)

Identify TF-driven gene programs (covered in detail in Part 3 of our functional analysis series):

  • pySCENIC — Python implementation, scales well
  • SCENIC (R) — Original implementation

Cell-Cell Communication

Predict signaling between cell types:

  • CellChat (R) — Most popular
  • CellPhoneDB (Python) — Database-backed

Computational Considerations

scRNA-seq analyses are computationally heavy. Rough requirements:

| Dataset Size | RAM | Recommended Setup |
|---|---|---|
| <10k cells | 16 GB | Laptop sufficient |
| 10-100k cells | 32-64 GB | Workstation |
| 100k-500k cells | 128 GB | Server; consider scVI |
| >500k cells | 256+ GB | HPC cluster; scVI / batch processing effectively mandatory |
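The RAM column can be sanity-checked with quick arithmetic; a sketch estimating raw sparse count-matrix memory under an assumed CSR layout (one value plus one column index per non-zero entry; the 8% density is an illustrative guess):

```python
def sparse_matrix_gb(n_cells, n_genes, density=0.08,
                     value_bytes=4, index_bytes=4):
    """Rough CSR memory estimate: values + column indices per non-zero,
    plus one 8-byte row pointer per cell."""
    nnz = n_cells * n_genes * density
    return (nnz * (value_bytes + index_bytes) + (n_cells + 1) * 8) / 1e9

for n in (10_000, 100_000, 500_000, 1_000_000):
    print(f"{n:>9} cells: ~{sparse_matrix_gb(n, 20_000):.1f} GB (counts alone)")
```

Note this covers only the raw counts; dense intermediates (scaled data, PCA) and neighbor graphs multiply real memory use several-fold, which is why the table's figures are much larger.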

For very large datasets, consider:

  • Scanpy over Seurat (better memory efficiency)
  • scVI for batch correction (scales linearly)
  • Subset to highly variable genes early
  • Use sparse matrix representations throughout

Common Pitfalls

Pitfall 1: Inadequate QC Filtering

Failing to filter dying cells (high mito%) or doublets contaminates clusters with low-quality "ghost" cells that don't correspond to real biology.

Pitfall 2: Ignoring Batch Effects

Running multi-sample analysis without batch correction → clusters reflect batch differences, not cell types. Always check UMAP colored by batch before clustering.

Pitfall 3: Over-Trusting Automated Annotations

SingleR can be confidently wrong — always verify with manual marker review.

Pitfall 4: Wrong Resolution Parameter

Picking resolution = 0.5 by default without exploration may miss biological variation (under-clustering) or create artificial subdivisions (over-clustering).

Pitfall 5: Treating UMAP as a Map

UMAP distances are not biologically meaningful in any quantitative sense. Two distant clusters in UMAP may be biologically more similar than two adjacent ones. UMAP is a visualization aid, not a metric space.

Summary: A Realistic Workflow Time

For a typical 30k cell, 4-sample scRNA-seq dataset on a 64GB workstation:

  • Cell Ranger (per sample, parallel): 4-6 hours
  • Loading + QC + filtering: 30 min
  • Normalization + HVG selection: 20 min
  • PCA + UMAP + clustering: 30 min
  • Batch correction (Harmony): 20 min
  • Marker gene + annotation: 1-2 hours
  • Manual review and refinement: 4-8 hours (the big one!)

Plan accordingly. The "wow, 30k cells in one UMAP" moment comes fast; the "what does each cluster actually mean biologically" part is the real time investment.

Conclusion

Single-cell RNA-seq analysis isn't a single technique — it's a sequential pipeline of decisions, each affecting the final result. Master the standard workflow first (10x → Seurat/Scanpy → cluster → annotate), then add advanced analyses (trajectory, velocity, communication) as your scientific questions demand.

Most importantly: biology should drive analysis choices, not the other way around. A perfect statistical pipeline that produces clusters without biological meaning is a failed analysis. Always cross-reference your findings with known biology and validate critical claims experimentally.

Further Reading

  • Hao, Y. et al. (2024). Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nature Biotechnology. (Seurat v5)
  • Wolf, F. A. et al. (2018). SCANPY: Large-scale single-cell gene expression data analysis. Genome Biology, 19(1).
  • Korsunsky, I. et al. (2019). Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 16(12).
  • Lopez, R. et al. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods, 15(12). (scVI)
  • Heumos, L. et al. (2023). Best practices for single-cell analysis across modalities. Nature Reviews Genetics, 24(8).
