RNA-seq Analysis Pipeline 2026: Complete Guide from Raw Reads to Biological Insight

RNA-seq Analysis Pipeline

What This Tutorial Covers

RNA-seq has become the workhorse of functional genomics. Despite decades of use, newcomers still struggle with the same issues: quality control decisions, normalization choices, statistical model selection, and interpreting pathway results.

This guide covers the complete pipeline:

Raw data QC and preprocessing
Alignment and quantification
Differential expression analysis (DESeq2)
Visualization
Pathway and gene set enrichment analysis
Common errors and how to avoid them

All code is tested and reproducible. We use a real public dataset throughout.

Dataset and Tools

Example Dataset

We'll use a publicly available dataset: GSE183947 (breast cancer cell lines, treatment vs control, 3 replicates each).

# Download from SRA
prefetch SRR15660539 SRR15660540 SRR15660541 \
         SRR15660542 SRR15660543 SRR15660544

# Convert to FASTQ
fastq-dump --split-files --gzip SRR15660539

Required Tools

# Quality control
fastqc --version  # >= 0.11.9
multiqc --version  # >= 1.14

# Alignment
STAR --version  # >= 2.7.10

# Quantification
featureCounts -v  # Subread >= 2.0.3

# R packages
# Install via Bioconductor
BiocManager::install(c("DESeq2", "ggplot2", "ComplexHeatmap",
                       "clusterProfiler", "EnhancedVolcano",
                       "pheatmap", "org.Hs.eg.db"))

Step 1: Quality Control

Never skip QC. It saves enormous time later.

# Run FastQC on all samples
mkdir -p qc/fastqc
fastqc raw_data/*.fastq.gz \
       --outdir qc/fastqc \
       --threads 8

# Aggregate with MultiQC
multiqc qc/fastqc/ --outdir qc/multiqc

What to Look For in FastQC

Per base sequence quality:

Good: Phred scores > 28 across entire read length
Warning: Scores drop at 3' end (normal for Illumina)
Action needed: Scores < 20 in first 10-15 bases (primer contamination?)

Sequence duplication levels:

Acceptable: < 50% for most RNA-seq
High (>80%): May indicate PCR overamplification
Note: Highly expressed genes will naturally show high duplication

Overrepresented sequences:

Adapter contamination: Trim with Trimmomatic or fastp
rRNA contamination: Consider rRNA depletion in next experiment

Trimming (When Necessary)

# fastp is faster than Trimmomatic and auto-detects adapters
fastp \
  --in1 sample1_R1.fastq.gz \
  --in2 sample1_R2.fastq.gz \
  --out1 trimmed/sample1_R1.fastq.gz \
  --out2 trimmed/sample1_R2.fastq.gz \
  --json trimmed/sample1_fastp.json \
  --html trimmed/sample1_fastp.html \
  --thread 8 \
  --qualified_quality_phred 20 \
  --length_required 50

# For all samples in loop
for sample in sample1 sample2 sample3 sample4 sample5 sample6; do
    fastp \
      --in1 raw_data/${sample}_R1.fastq.gz \
      --in2 raw_data/${sample}_R2.fastq.gz \
      --out1 trimmed/${sample}_R1.fastq.gz \
      --out2 trimmed/${sample}_R2.fastq.gz \
      --json trimmed/${sample}_fastp.json \
      --thread 8 2>&1 | tee logs/${sample}_fastp.log
done

Step 2: Alignment with STAR

Build Genome Index (Once)

# Download reference genome and annotation
wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz

gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.110.gtf.gz

# Build STAR index (requires ~30GB RAM, ~1 hour)
mkdir -p genome/star_index_GRCh38
STAR \
  --runMode genomeGenerate \
  --genomeDir genome/star_index_GRCh38 \
  --genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
  --sjdbGTFfile Homo_sapiens.GRCh38.110.gtf \
  --runThreadN 16 \
  --genomeSAindexNbases 14

Align Samples

# Alignment function
align_sample() {
    SAMPLE=$1
    STAR \
      --runMode alignReads \
      --genomeDir genome/star_index_GRCh38 \
      --readFilesIn trimmed/${SAMPLE}_R1.fastq.gz trimmed/${SAMPLE}_R2.fastq.gz \
      --readFilesCommand zcat \
      --outSAMtype BAM SortedByCoordinate \
      --outSAMattributes NH HI NM MD \
      --outFileNamePrefix aligned/${SAMPLE}_ \
      --runThreadN 8 \
      --outFilterMultimapNmax 1 \
      --alignIntronMin 20 \
      --alignIntronMax 1000000 \
      --quantMode GeneCounts \
      2>&1 | tee logs/${SAMPLE}_STAR.log
}

# Run all samples
for sample in sample1 sample2 sample3 sample4 sample5 sample6; do
    align_sample $sample
done

Check Alignment Rates

# Extract alignment rates from STAR logs
for f in logs/*_STAR.log; do
    echo -n "$f: "
    grep "Uniquely mapped reads %" $f
done

# Good alignment rates:
# >75% uniquely mapped for human RNA-seq
# 60-75%: acceptable but investigate
# <60%: something is wrong (wrong genome? rRNA contamination?)

Step 3: Quantification with featureCounts

# Count reads per gene for all samples at once
featureCounts \
  -T 8 \
  -s 2 \
  -p \
  -a Homo_sapiens.GRCh38.110.gtf \
  -o counts/all_samples_counts.txt \
  aligned/sample1_Aligned.sortedByCoord.out.bam \
  aligned/sample2_Aligned.sortedByCoord.out.bam \
  aligned/sample3_Aligned.sortedByCoord.out.bam \
  aligned/sample4_Aligned.sortedByCoord.out.bam \
  aligned/sample5_Aligned.sortedByCoord.out.bam \
  aligned/sample6_Aligned.sortedByCoord.out.bam

Important flag: -s 2 This specifies strandedness. Wrong strandedness is a common mistake.

-s 0: Unstranded
-s 1: Stranded (forward)
-s 2: Stranded (reverse) — most common for TruSeq libraries

# Check strandedness with RSeQC
infer_experiment.py \
  -r genome/hg38.bed \
  -i aligned/sample1_Aligned.sortedByCoord.out.bam

# Output interpretation:
# "1++,1--,2+-,2-+" 50%/50% → Unstranded (-s 0)
# "1++,1--,2+-,2-+" 90%/10% → Stranded forward (-s 1)  
# "1++,1--,2+-,2-+" 10%/90% → Stranded reverse (-s 2)

Step 4: Differential Expression with DESeq2

This is the core statistical analysis step.

Load and Prepare Data

library(DESeq2)
library(tidyverse)

# Load count matrix
raw_counts <- read.table("counts/all_samples_counts.txt",
                         header = TRUE, skip = 1, row.names = 1)

# featureCounts output has metadata columns — remove them
count_matrix <- raw_counts[, 6:ncol(raw_counts)]

# Clean up column names (remove path and suffix)
colnames(count_matrix) <- gsub("aligned/|_Aligned.*", "",
                                colnames(count_matrix))

# Create metadata
metadata <- data.frame(
  sample = c("sample1", "sample2", "sample3", 
             "sample4", "sample5", "sample6"),
  condition = c("control", "control", "control",
                "treatment", "treatment", "treatment"),
  row.names = c("sample1", "sample2", "sample3",
                "sample4", "sample5", "sample6")
)

# Verify order matches
all(colnames(count_matrix) == rownames(metadata))
# Must be TRUE before proceeding

Create DESeq2 Object

# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(
  countData = count_matrix,
  colData = metadata,
  design = ~ condition
)

# Set reference level (control should be reference)
dds$condition <- relevel(dds$condition, ref = "control")

# Pre-filter: remove genes with very low counts
# Keeps genes with at least 10 reads in at least 3 samples
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]
cat("Genes after filtering:", nrow(dds), "\n")
# Typical: 15,000-25,000 genes retained

Run DESeq2

# Run the full DESeq2 pipeline
# This does: estimation of size factors → dispersion → negative binomial GLM
dds <- DESeq(dds)

# Extract results
results <- results(dds,
                   contrast = c("condition", "treatment", "control"),
                   alpha = 0.05,         # FDR threshold for summary
                   lfcThreshold = 0)     # No LFC filter at this stage

# Summary
summary(results)
# Output shows:
# - Number of up/downregulated genes at FDR 5%
# - Number of outliers

# Add gene symbols if needed
library(org.Hs.eg.db)
results$symbol <- mapIds(org.Hs.eg.db,
                         keys = rownames(results),
                         column = "SYMBOL",
                         keytype = "ENSEMBL",
                         multiVals = "first")

LFC Shrinkage (Important for Downstream Analysis)

# Raw LFC estimates are noisy for low-count genes
# Shrinkage improves rankings and downstream analyses

library(apeglm)

# apeglm shrinkage — current best practice
results_shrunk <- lfcShrink(dds,
                             coef = "condition_treatment_vs_control",
                             type = "apeglm")

# Use shrunk results for:
# - Ranking genes by effect size
# - Heatmaps
# - Machine learning inputs
# Use unshrunken results for:
# - Volcano plots (personal preference)
# - Exact fold change reporting

Step 5: Visualization

PCA Plot (Quality Check)

library(ggplot2)

# Variance-stabilizing transformation for visualization
vsd <- vst(dds, blind = FALSE)

# PCA
pca_data <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)
pct_var <- round(100 * attr(pca_data, "percentVar"))

ggplot(pca_data, aes(PC1, PC2, color = condition, label = name)) +
  geom_point(size = 4) +
  ggrepel::geom_text_repel(size = 3) +
  xlab(paste0("PC1: ", pct_var[1], "% variance")) +
  ylab(paste0("PC2: ", pct_var[2], "% variance")) +
  theme_classic() +
  ggtitle("PCA of RNA-seq samples")

What to look for:

Samples should cluster by condition, not by batch or other factors
Outlier samples will appear far from their group
If batch clusters more than condition → apply batch correction

Volcano Plot

library(EnhancedVolcano)

EnhancedVolcano(results,
  lab = results$symbol,
  x = 'log2FoldChange',
  y = 'padj',
  title = 'Treatment vs Control',
  subtitle = 'DESeq2 differential expression',
  pCutoff = 0.05,
  FCcutoff = 1,
  pointSize = 2,
  labSize = 3,
  col = c('grey', 'grey', 'grey', '#E74C3C'),
  legendPosition = 'right')

Heatmap of Top Differentially Expressed Genes

library(ComplexHeatmap)
library(circlize)

# Get top 50 significant genes
sig_genes <- results %>%
  as.data.frame() %>%
  filter(!is.na(padj), padj < 0.05, abs(log2FoldChange) > 1) %>%
  arrange(padj) %>%
  head(50) %>%
  rownames()

# Extract VST values for these genes
heatmap_mat <- assay(vsd)[sig_genes, ]

# Scale by row (z-score)
heatmap_scaled <- t(scale(t(heatmap_mat)))

# Color annotation
col_ha <- HeatmapAnnotation(
  Condition = metadata$condition,
  col = list(Condition = c("control" = "#3498DB",
                           "treatment" = "#E74C3C"))
)

# Color scale
col_fun <- colorRamp2(c(-2, 0, 2), c("#3498DB", "white", "#E74C3C"))

# Draw heatmap
Heatmap(heatmap_scaled,
  name = "z-score",
  top_annotation = col_ha,
  col = col_fun,
  show_row_names = FALSE,
  show_column_names = TRUE,
  clustering_method_rows = "ward.D2",
  clustering_method_columns = "ward.D2",
  column_title = "Top 50 DE Genes")

Step 6: Pathway Analysis

Gene Set Enrichment Analysis (GSEA)

library(clusterProfiler)
library(org.Hs.eg.db)
library(enrichplot)

# Prepare ranked gene list (by -log10(p) × sign(LFC))
gene_list <- -log10(results$pvalue) * sign(results$log2FoldChange)
names(gene_list) <- results$symbol
gene_list <- gene_list[!is.na(gene_list) & !is.na(names(gene_list))]
gene_list <- sort(gene_list, decreasing = TRUE)

# Run GSEA with KEGG pathways
gsea_kegg <- gseKEGG(
  geneList = gene_list,
  organism = "hsa",
  keyType = "kegg",
  minGSSize = 10,
  maxGSSize = 500,
  pvalueCutoff = 0.05,
  verbose = FALSE
)

# Dot plot of top pathways
dotplot(gsea_kegg, showCategory = 15,
        title = "GSEA - KEGG Pathways")

Over-Representation Analysis (ORA)

# Get significant upregulated genes
up_genes <- results %>%
  as.data.frame() %>%
  filter(!is.na(padj), padj < 0.05, log2FoldChange > 1) %>%
  rownames()

# Convert ENSEMBL to ENTREZ
up_entrez <- bitr(up_genes,
                  fromType = "ENSEMBL",
                  toType = "ENTREZID",
                  OrgDb = org.Hs.eg.db)

# GO enrichment
go_up <- enrichGO(
  gene = up_entrez$ENTREZID,
  OrgDb = org.Hs.eg.db,
  ont = "BP",  # Biological Process
  pAdjustMethod = "BH",
  pvalueCutoff = 0.05,
  readable = TRUE
)

# Visualize
barplot(go_up, showCategory = 20) +
  ggtitle("GO:BP Enrichment - Upregulated Genes")

Common Errors and Solutions

Error: "size factors > 10x different between samples"

# Check size factors
sizeFactors(dds)
# If any are > 5 or < 0.2, investigate

# Possible causes:
# 1. One sample has very different library size → check total counts
colSums(counts(dds))

# 2. Global expression shift (real biology or normalization failure)
# 3. Wrong condition assignment (sample swap!)

# Verify with PCA first — outlier will be obvious
plotPCA(vsd, intgroup = "condition")

Error: DESeq2 converges slowly or fails

# Common for count matrices with many zeros
# Solution 1: More aggressive pre-filtering
keep <- rowSums(counts(dds) >= 20) >= (ncol(dds)/2)
dds <- dds[keep, ]

# Solution 2: Use betaPrior = FALSE (default in recent versions)
dds <- DESeq(dds, betaPrior = FALSE)

# Solution 3: For very small N (2 per group), use LRT instead
dds <- DESeq(dds, test = "LRT", reduced = ~ 1)

Warning: "Rows with fewer than 2 non-zero samples"

# This is just a warning, not an error
# Those genes will have NA in results — expected
# Your pre-filtering should minimize these

# Check how many NAs
sum(is.na(results$padj))
# Should be small fraction of total genes

Complete Pipeline Summary

# Shell pipeline script
#!/bin/bash
set -e

# 1. QC
fastqc raw_data/*.fastq.gz --outdir qc/fastqc --threads 8
multiqc qc/fastqc -o qc/multiqc

# 2. Trim
for sample in $(cat sample_list.txt); do
    fastp -i raw_data/${sample}_R1.fastq.gz \
          -I raw_data/${sample}_R2.fastq.gz \
          -o trimmed/${sample}_R1.fastq.gz \
          -O trimmed/${sample}_R2.fastq.gz \
          --thread 8
done

# 3. Align
for sample in $(cat sample_list.txt); do
    STAR --runMode alignReads \
         --genomeDir genome/star_index \
         --readFilesIn trimmed/${sample}_R1.fastq.gz trimmed/${sample}_R2.fastq.gz \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix aligned/${sample}_ \
         --runThreadN 8 \
         --quantMode GeneCounts
done

# 4. Count
featureCounts -T 8 -s 2 -p \
  -a annotation/genes.gtf \
  -o counts/counts.txt \
  aligned/*_Aligned.sortedByCoord.out.bam

echo "Pipeline complete. Proceed to R for DESeq2 analysis."

Continue to differential expression analysis in R: the DESeq2 code above covers the complete workflow.

For downstream multi-omics integration: Proteomics + Transcriptomics Integration Guide