RNA-seq Analysis Pipeline 2026: Complete Guide from Raw Reads to Biological Insight
End-to-end RNA-seq analysis tutorial covering quality control, alignment, quantification, differential expression with DESeq2, pathway analysis, and visualization. All code included, all common errors addressed.
What This Tutorial Covers
RNA-seq has become the workhorse of functional genomics. Despite decades of use, newcomers still struggle with the same issues: quality control decisions, normalization choices, statistical model selection, and interpreting pathway results.
This guide covers the complete pipeline:
- Raw data QC and preprocessing
- Alignment and quantification
- Differential expression analysis (DESeq2)
- Visualization
- Pathway and gene set enrichment analysis
- Common errors and how to avoid them
All code is tested and reproducible. We use a real public dataset throughout.
Dataset and Tools
Example Dataset
We'll use a publicly available dataset: GSE183947 (breast cancer cell lines, treatment vs control, 3 replicates each).
# Download from SRA
prefetch SRR15660539 SRR15660540 SRR15660541 \
SRR15660542 SRR15660543 SRR15660544
# Convert to FASTQ
fastq-dump --split-files --gzip SRR15660539
Required Tools
# Quality control
fastqc --version # >= 0.11.9
multiqc --version # >= 1.14
# Alignment
STAR --version # >= 2.7.10
# Quantification
featureCounts -v # Subread >= 2.0.3
# R packages
# Install via Bioconductor
BiocManager::install(c("DESeq2", "ggplot2", "ComplexHeatmap",
"clusterProfiler", "EnhancedVolcano",
"pheatmap", "org.Hs.eg.db"))
Step 1: Quality Control
Never skip QC. It saves enormous time later.
# Run FastQC on all samples
mkdir -p qc/fastqc
fastqc raw_data/*.fastq.gz \
--outdir qc/fastqc \
--threads 8
# Aggregate with MultiQC
multiqc qc/fastqc/ --outdir qc/multiqc
What to Look For in FastQC
Per base sequence quality:
Good: Phred scores > 28 across entire read length
Warning: Scores drop at 3' end (normal for Illumina)
Action needed: Scores < 20 in first 10-15 bases (primer contamination?)
Sequence duplication levels:
Acceptable: < 50% for most RNA-seq
High (>80%): May indicate PCR overamplification
Note: Highly expressed genes will naturally show high duplication
Overrepresented sequences:
Adapter contamination: Trim with Trimmomatic or fastp
rRNA contamination: Consider rRNA depletion in next experiment
Trimming (When Necessary)
# fastp is faster than Trimmomatic and auto-detects adapters
fastp \
--in1 sample1_R1.fastq.gz \
--in2 sample1_R2.fastq.gz \
--out1 trimmed/sample1_R1.fastq.gz \
--out2 trimmed/sample1_R2.fastq.gz \
--json trimmed/sample1_fastp.json \
--html trimmed/sample1_fastp.html \
--thread 8 \
--qualified_quality_phred 20 \
--length_required 50
# For all samples in loop
for sample in sample1 sample2 sample3 sample4 sample5 sample6; do
fastp \
--in1 raw_data/${sample}_R1.fastq.gz \
--in2 raw_data/${sample}_R2.fastq.gz \
--out1 trimmed/${sample}_R1.fastq.gz \
--out2 trimmed/${sample}_R2.fastq.gz \
--json trimmed/${sample}_fastp.json \
--thread 8 2>&1 | tee logs/${sample}_fastp.log
done
Step 2: Alignment with STAR
Build Genome Index (Once)
# Download reference genome and annotation
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.110.gtf.gz
# Build STAR index (requires ~30GB RAM, ~1 hour)
mkdir -p genome/star_index_GRCh38
STAR \
--runMode genomeGenerate \
--genomeDir genome/star_index_GRCh38 \
--genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--sjdbGTFfile Homo_sapiens.GRCh38.110.gtf \
--runThreadN 16 \
--genomeSAindexNbases 14
Align Samples
# Alignment function
align_sample() {
SAMPLE=$1
STAR \
--runMode alignReads \
--genomeDir genome/star_index_GRCh38 \
--readFilesIn trimmed/${SAMPLE}_R1.fastq.gz trimmed/${SAMPLE}_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI NM MD \
--outFileNamePrefix aligned/${SAMPLE}_ \
--runThreadN 8 \
--outFilterMultimapNmax 1 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--quantMode GeneCounts \
2>&1 | tee logs/${SAMPLE}_STAR.log
}
# Run all samples
for sample in sample1 sample2 sample3 sample4 sample5 sample6; do
align_sample $sample
done
Check Alignment Rates
# Extract alignment rates from STAR logs
for f in logs/*_STAR.log; do
echo -n "$f: "
grep "Uniquely mapped reads %" $f
done
# Good alignment rates:
# >75% uniquely mapped for human RNA-seq
# 60-75%: acceptable but investigate
# <60%: something is wrong (wrong genome? rRNA contamination?)
Step 3: Quantification with featureCounts
# Count reads per gene for all samples at once
featureCounts \
-T 8 \
-s 2 \
-p \
-a Homo_sapiens.GRCh38.110.gtf \
-o counts/all_samples_counts.txt \
aligned/sample1_Aligned.sortedByCoord.out.bam \
aligned/sample2_Aligned.sortedByCoord.out.bam \
aligned/sample3_Aligned.sortedByCoord.out.bam \
aligned/sample4_Aligned.sortedByCoord.out.bam \
aligned/sample5_Aligned.sortedByCoord.out.bam \
aligned/sample6_Aligned.sortedByCoord.out.bam
Important flag: -s 2
This specifies strandedness. Wrong strandedness is a common mistake.
-s 0: Unstranded-s 1: Stranded (forward)-s 2: Stranded (reverse) — most common for TruSeq libraries
# Check strandedness with RSeQC
infer_experiment.py \
-r genome/hg38.bed \
-i aligned/sample1_Aligned.sortedByCoord.out.bam
# Output interpretation:
# "1++,1--,2+-,2-+" 50%/50% → Unstranded (-s 0)
# "1++,1--,2+-,2-+" 90%/10% → Stranded forward (-s 1)
# "1++,1--,2+-,2-+" 10%/90% → Stranded reverse (-s 2)
Step 4: Differential Expression with DESeq2
This is the core statistical analysis step.
Load and Prepare Data
library(DESeq2)
library(tidyverse)
# Load count matrix
raw_counts <- read.table("counts/all_samples_counts.txt",
header = TRUE, skip = 1, row.names = 1)
# featureCounts output has metadata columns — remove them
count_matrix <- raw_counts[, 6:ncol(raw_counts)]
# Clean up column names (remove path and suffix)
colnames(count_matrix) <- gsub("aligned/|_Aligned.*", "",
colnames(count_matrix))
# Create metadata
metadata <- data.frame(
sample = c("sample1", "sample2", "sample3",
"sample4", "sample5", "sample6"),
condition = c("control", "control", "control",
"treatment", "treatment", "treatment"),
row.names = c("sample1", "sample2", "sample3",
"sample4", "sample5", "sample6")
)
# Verify order matches
all(colnames(count_matrix) == rownames(metadata))
# Must be TRUE before proceeding
Create DESeq2 Object
# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(
countData = count_matrix,
colData = metadata,
design = ~ condition
)
# Set reference level (control should be reference)
dds$condition <- relevel(dds$condition, ref = "control")
# Pre-filter: remove genes with very low counts
# Keeps genes with at least 10 reads in at least 3 samples
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]
cat("Genes after filtering:", nrow(dds), "\n")
# Typical: 15,000-25,000 genes retained
Run DESeq2
# Run the full DESeq2 pipeline
# This does: estimation of size factors → dispersion → negative binomial GLM
dds <- DESeq(dds)
# Extract results
results <- results(dds,
contrast = c("condition", "treatment", "control"),
alpha = 0.05, # FDR threshold for summary
lfcThreshold = 0) # No LFC filter at this stage
# Summary
summary(results)
# Output shows:
# - Number of up/downregulated genes at FDR 5%
# - Number of outliers
# Add gene symbols if needed
library(org.Hs.eg.db)
results$symbol <- mapIds(org.Hs.eg.db,
keys = rownames(results),
column = "SYMBOL",
keytype = "ENSEMBL",
multiVals = "first")
LFC Shrinkage (Important for Downstream Analysis)
# Raw LFC estimates are noisy for low-count genes
# Shrinkage improves rankings and downstream analyses
library(apeglm)
# apeglm shrinkage — current best practice
results_shrunk <- lfcShrink(dds,
coef = "condition_treatment_vs_control",
type = "apeglm")
# Use shrunk results for:
# - Ranking genes by effect size
# - Heatmaps
# - Machine learning inputs
# Use unshrunken results for:
# - Volcano plots (personal preference)
# - Exact fold change reporting
Step 5: Visualization
PCA Plot (Quality Check)
library(ggplot2)
# Variance-stabilizing transformation for visualization
vsd <- vst(dds, blind = FALSE)
# PCA
pca_data <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)
pct_var <- round(100 * attr(pca_data, "percentVar"))
ggplot(pca_data, aes(PC1, PC2, color = condition, label = name)) +
geom_point(size = 4) +
ggrepel::geom_text_repel(size = 3) +
xlab(paste0("PC1: ", pct_var[1], "% variance")) +
ylab(paste0("PC2: ", pct_var[2], "% variance")) +
theme_classic() +
ggtitle("PCA of RNA-seq samples")
What to look for:
- Samples should cluster by condition, not by batch or other factors
- Outlier samples will appear far from their group
- If batch clusters more than condition → apply batch correction
Volcano Plot
library(EnhancedVolcano)
EnhancedVolcano(results,
lab = results$symbol,
x = 'log2FoldChange',
y = 'padj',
title = 'Treatment vs Control',
subtitle = 'DESeq2 differential expression',
pCutoff = 0.05,
FCcutoff = 1,
pointSize = 2,
labSize = 3,
col = c('grey', 'grey', 'grey', '#E74C3C'),
legendPosition = 'right')
Heatmap of Top Differentially Expressed Genes
library(ComplexHeatmap)
library(circlize)
# Get top 50 significant genes
sig_genes <- results %>%
as.data.frame() %>%
filter(!is.na(padj), padj < 0.05, abs(log2FoldChange) > 1) %>%
arrange(padj) %>%
head(50) %>%
rownames()
# Extract VST values for these genes
heatmap_mat <- assay(vsd)[sig_genes, ]
# Scale by row (z-score)
heatmap_scaled <- t(scale(t(heatmap_mat)))
# Color annotation
col_ha <- HeatmapAnnotation(
Condition = metadata$condition,
col = list(Condition = c("control" = "#3498DB",
"treatment" = "#E74C3C"))
)
# Color scale
col_fun <- colorRamp2(c(-2, 0, 2), c("#3498DB", "white", "#E74C3C"))
# Draw heatmap
Heatmap(heatmap_scaled,
name = "z-score",
top_annotation = col_ha,
col = col_fun,
show_row_names = FALSE,
show_column_names = TRUE,
clustering_method_rows = "ward.D2",
clustering_method_columns = "ward.D2",
column_title = "Top 50 DE Genes")
Step 6: Pathway Analysis
Gene Set Enrichment Analysis (GSEA)
library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)
# Prepare ranked gene list (by -log10(p) × sign(LFC))
gene_list <- -log10(results$pvalue) * sign(results$log2FoldChange)
names(gene_list) <- results$symbol
gene_list <- gene_list[!is.na(gene_list) & !is.na(names(gene_list))]
gene_list <- sort(gene_list, decreasing = TRUE)
# Run GSEA with KEGG pathways
gsea_kegg <- gseKEGG(
geneList = gene_list,
organism = "hsa",
keyType = "kegg",
minGSSize = 10,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE
)
# Dot plot of top pathways
dotplot(gsea_kegg, showCategory = 15,
title = "GSEA - KEGG Pathways")
Over-Representation Analysis (ORA)
# Get significant upregulated genes
up_genes <- results %>%
as.data.frame() %>%
filter(!is.na(padj), padj < 0.05, log2FoldChange > 1) %>%
rownames()
# Convert ENSEMBL to ENTREZ
up_entrez <- bitr(up_genes,
fromType = "ENSEMBL",
toType = "ENTREZID",
OrgDb = org.Hs.eg.db)
# GO enrichment
go_up <- enrichGO(
gene = up_entrez$ENTREZID,
OrgDb = org.Hs.eg.db,
ont = "BP", # Biological Process
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
readable = TRUE
)
# Visualize
barplot(go_up, showCategory = 20) +
ggtitle("GO:BP Enrichment - Upregulated Genes")
Common Errors and Solutions
Error: "size factors > 10x different between samples"
# Check size factors
sizeFactors(dds)
# If any are > 5 or < 0.2, investigate
# Possible causes:
# 1. One sample has very different library size → check total counts
colSums(counts(dds))
# 2. Global expression shift (real biology or normalization failure)
# 3. Wrong condition assignment (sample swap!)
# Verify with PCA first — outlier will be obvious
plotPCA(vsd, intgroup = "condition")
Error: DESeq2 converges slowly or fails
# Common for count matrices with many zeros
# Solution 1: More aggressive pre-filtering
keep <- rowSums(counts(dds) >= 20) >= (ncol(dds)/2)
dds <- dds[keep, ]
# Solution 2: Use betaPrior = FALSE (default in recent versions)
dds <- DESeq(dds, betaPrior = FALSE)
# Solution 3: For very small N (2 per group), use LRT instead
dds <- DESeq(dds, test = "LRT", reduced = ~ 1)
Warning: "Rows with fewer than 2 non-zero samples"
# This is just a warning, not an error
# Those genes will have NA in results — expected
# Your pre-filtering should minimize these
# Check how many NAs
sum(is.na(results$padj))
# Should be small fraction of total genes
Complete Pipeline Summary
# Shell pipeline script
#!/bin/bash
set -e
# 1. QC
fastqc raw_data/*.fastq.gz --outdir qc/fastqc --threads 8
multiqc qc/fastqc -o qc/multiqc
# 2. Trim
for sample in $(cat sample_list.txt); do
fastp -i raw_data/${sample}_R1.fastq.gz \
-I raw_data/${sample}_R2.fastq.gz \
-o trimmed/${sample}_R1.fastq.gz \
-O trimmed/${sample}_R2.fastq.gz \
--thread 8
done
# 3. Align
for sample in $(cat sample_list.txt); do
STAR --runMode alignReads \
--genomeDir genome/star_index \
--readFilesIn trimmed/${sample}_R1.fastq.gz trimmed/${sample}_R2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix aligned/${sample}_ \
--runThreadN 8 \
--quantMode GeneCounts
done
# 4. Count
featureCounts -T 8 -s 2 -p \
-a annotation/genes.gtf \
-o counts/counts.txt \
aligned/*_Aligned.sortedByCoord.out.bam
echo "Pipeline complete. Proceed to R for DESeq2 analysis."
Continue to differential expression analysis in R: the DESeq2 code above covers the complete workflow.
For downstream multi-omics integration: Proteomics + Transcriptomics Integration Guide