Genomic Data Analysis in R: A Beginner’s Guide
Why R for Genomic Data Analysis?
R has established itself as the dominant programming language for genomic data analysis. The Bioconductor project — a repository of over 2,200 specialized packages for biological data analysis — provides an unmatched ecosystem for processing, analyzing, and visualizing omics data. From RNA-seq differential expression to single-cell analysis, from ChIP-seq peak calling to variant annotation, R/Bioconductor covers virtually every genomic analysis task.
This guide introduces the essential R tools and workflows for genomic data analysis, targeting biologists and bioinformaticians who are beginning their journey with computational genomics.
Setting Up Your R Environment
Installing R and RStudio
Start by installing R from CRAN (https://cran.r-project.org/) and RStudio Desktop (https://posit.co/downloads/). RStudio provides an integrated development environment with code editing, visualization, and project management features that make R programming much more productive.
Installing Bioconductor
Bioconductor packages are installed using the BiocManager package:
```r
# Install BiocManager
install.packages("BiocManager")

# Install core Bioconductor packages
BiocManager::install(c(
  "DESeq2",           # Differential expression
  "edgeR",            # Differential expression
  "limma",            # Linear models for microarray/RNA-seq
  "GenomicRanges",    # Genomic interval operations
  "clusterProfiler",  # Pathway enrichment
  "org.Hs.eg.db",     # Human gene annotations
  "EnhancedVolcano",  # Volcano plots
  "ComplexHeatmap"    # Advanced heatmaps
))
```
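Before installing, it can be useful to check which packages are already present. The following is a minimal base-R sketch; the helper name `missing_packages` is hypothetical and not part of BiocManager, and the install call is left commented out because it needs network access.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed.
# system.file(package = p) returns "" when package p cannot be found.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
  pkgs[!installed]
}

wanted <- c("DESeq2", "edgeR", "limma", "GenomicRanges")
to_install <- missing_packages(wanted)

# Report what is missing; installing requires BiocManager and network access.
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # BiocManager::install(to_install)   # uncomment to actually install
}
```

Checking first avoids reinstalling large packages on every run of a setup script.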
Essential Tidyverse Packages
The tidyverse ecosystem complements Bioconductor for data manipulation and visualization:
```r
install.packages(c("tidyverse", "ggplot2", "dplyr", "readr", "patchwork"))
```
RNA-seq Analysis with DESeq2
Understanding the RNA-seq Workflow
A typical RNA-seq analysis workflow consists of:
- Quality control: FastQC and MultiQC assess read quality
- Alignment: STAR or HISAT2 maps reads to the reference genome
- Quantification: featureCounts or Salmon counts reads per gene
- Differential expression: DESeq2 or edgeR identifies differentially expressed genes
- Functional analysis: Pathway enrichment reveals biological meaning
Loading Count Data
DESeq2 takes a count matrix (genes × samples) and a sample information table as input:
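```r
library(DESeq2)

# Read count matrix (genes x samples); "counts.csv" is a placeholder file name
counts <- as.matrix(read.csv("counts.csv", row.names = 1))

# Read sample information; the file name and the "condition" column are assumptions
coldata <- read.csv("sample_info.csv", row.names = 1)
coldata$condition <- factor(coldata$condition)

# Build the DESeq2 dataset from both inputs
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition)
```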
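DESeq2 expects the columns of the count matrix and the rows of the sample table to describe the same samples in the same order. The sketch below checks and fixes the order on toy data; the sample names, counts, and the `condition` column are invented for illustration.

```r
# Toy count matrix: 4 genes x 4 samples (all values invented).
counts_toy <- matrix(
  c(10, 0, 25, 3,
    12, 1, 30, 0,
    50, 2, 5, 4,
    45, 0, 8, 6),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), c("ctrl_1", "ctrl_2", "trt_1", "trt_2"))
)

# Sample table deliberately listed in a different order than the matrix columns.
coldata_toy <- data.frame(
  condition = factor(c("treated", "control", "treated", "control")),
  row.names = c("trt_1", "ctrl_1", "trt_2", "ctrl_2")
)

# Every sample must appear in both objects.
stopifnot(setequal(colnames(counts_toy), rownames(coldata_toy)))

# Reorder the sample table so its rows follow the matrix columns.
coldata_toy <- coldata_toy[colnames(counts_toy), , drop = FALSE]
all(rownames(coldata_toy) == colnames(counts_toy))
```

Running this kind of check before building the dataset guards against silently assigning conditions to the wrong samples.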
Running Differential Expression Analysis
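```r
# Filter low-count genes: keep genes with at least 10 reads in at least 3 samples
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]

# Fit the model and extract results (the "treated" and "control" level names are assumptions)
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "treated", "control"))

# Significant genes: adjusted p-value < 0.05 and |log2 fold change| > 1
sig_genes <- subset(as.data.frame(res), padj < 0.05 & abs(log2FoldChange) > 1)
```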
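The low-count filter can be tried on a small matrix without DESeq2. This is a base-R sketch with invented counts; the thresholds (at least 10 reads in at least 3 samples) follow the filter used in this section.

```r
# Toy counts: 4 genes x 6 samples (values invented for illustration).
m <- rbind(
  geneA = c(0, 2, 1, 0, 3, 1),        # low everywhere
  geneB = c(15, 20, 12, 0, 1, 2),     # >= 10 reads in three samples
  geneC = c(100, 0, 0, 0, 0, 50),     # >= 10 reads in only two samples
  geneD = c(30, 40, 35, 28, 33, 31)   # high everywhere
)

# m >= 10 is a logical matrix; rowSums counts the TRUE values per gene.
n_samples_passing <- rowSums(m >= 10)

# Keep genes that pass the read threshold in at least 3 samples.
keep <- n_samples_passing >= 3
filtered <- m[keep, , drop = FALSE]
rownames(filtered)
```

Note that geneC is removed even though its total count is high, because the filter asks for support across several samples, not a large sum.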
Visualization
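```r
# MA plot
plotMA(res, ylim = c(-5, 5))

# Volcano plot
library(EnhancedVolcano)
EnhancedVolcano(res,
  lab = rownames(res),
  x = 'log2FoldChange',
  y = 'pvalue',
  title = 'Treated vs Control',
  pCutoff = 0.05,
  FCcutoff = 1
)

# PCA plot on variance-stabilized counts (the "condition" column name is an assumption)
vsd <- vst(dds, blind = FALSE)
plotPCA(vsd, intgroup = "condition")
```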
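To see what a sample-level PCA computes, the idea can be sketched in base R. This is a simplified stand-in, not the DESeq2 procedure: it uses `log2(x + 1)` instead of a variance-stabilizing transformation, and the counts are simulated.

```r
set.seed(1)

# Toy matrix: 50 genes x 6 samples with simulated Poisson counts.
toy <- matrix(
  rpois(300, lambda = 50),
  nrow = 50,
  dimnames = list(paste0("gene", 1:50), paste0("s", 1:6))
)

# Simple log transform (an assumption made for this sketch).
log_counts <- log2(toy + 1)

# prcomp expects observations in rows, so transpose to samples x genes.
pca <- prcomp(t(log_counts))

# Proportion of variance explained by each principal component.
pct_var <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * pct_var[1:2], 1)

# Sample coordinates on the first two components, ready for plotting.
head(pca$x[, 1:2])
```

In real data, samples from the same condition are expected to sit close together on the leading components; strong separation by another variable can point to a batch effect.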
Gene Set Enrichment Analysis
Over-Representation Analysis (ORA)
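```r
library(clusterProfiler)
library(org.Hs.eg.db)

# Convert gene symbols to Entrez IDs
# (assumes the row names of sig_genes are gene symbols)
sig_symbols <- rownames(sig_genes)
entrez <- bitr(sig_symbols, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Hs.eg.db)

# GO over-representation test on the converted IDs (the "BP" ontology is an assumed choice)
ego <- enrichGO(gene = entrez$ENTREZID, OrgDb = org.Hs.eg.db, ont = "BP")
```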
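Over-representation analysis is usually framed as a hypergeometric test: given a list of significant genes, is the overlap with a pathway larger than expected by chance? The base-R sketch below works through one pathway with invented numbers; the variable names are chosen for illustration.

```r
# Invented numbers for illustration.
N <- 20000  # genes in the background (universe)
M <- 200    # background genes annotated to the pathway
n <- 500    # significant genes
k <- 15     # significant genes annotated to the pathway

# Overlap expected by chance.
expected <- n * M / N

# P(X >= k) when drawing n genes from N, of which M belong to the pathway.
p_hyper <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test on the 2x2 table.
tab <- matrix(c(k, n - k, M - k, N - M - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

c(expected = expected, p_hyper = p_hyper, p_fisher = p_fisher)
```

Because one such test is run per pathway, the resulting p-values still need multiple-testing correction before interpretation.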
Gene Set Enrichment Analysis (GSEA)
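```r
# Rank all genes by log2FoldChange
gene_list <- res$log2FoldChange
names(gene_list) <- rownames(res)
gene_list <- sort(na.omit(gene_list), decreasing = TRUE)
```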
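The intuition behind GSEA is a running sum over the ranked list: it steps up at genes in the set and down at genes outside it, and the enrichment score is the largest deviation from zero. The sketch below is a simplified, unweighted version for intuition only; the helper `enrichment_score` is hypothetical, the fold changes are invented, and package implementations typically weight each step by the ranking statistic.

```r
# Simplified, unweighted enrichment score (hypothetical helper, for intuition only).
enrichment_score <- function(ranked_genes, gene_set) {
  in_set <- ranked_genes %in% gene_set
  n_hit <- sum(in_set)
  n_miss <- length(ranked_genes) - n_hit
  stopifnot(n_hit > 0, n_miss > 0)
  # Step up for genes in the set, down for all others; the steps sum to zero.
  steps <- ifelse(in_set, 1 / n_hit, -1 / n_miss)
  running <- cumsum(steps)
  # The score is the running-sum value furthest from zero.
  running[which.max(abs(running))]
}

# Toy ranking: named log2 fold changes, sorted from most up- to most down-regulated.
lfc <- c(g1 = 2.5, g2 = -1.8, g3 = 0.4, g4 = 1.9, g5 = -0.2, g6 = -3.0)
ranked <- names(sort(lfc, decreasing = TRUE))

es_top <- enrichment_score(ranked, c("g1", "g4"))     # set concentrated at the top
es_bottom <- enrichment_score(ranked, c("g2", "g6"))  # set concentrated at the bottom
c(es_top = es_top, es_bottom = es_bottom)
```

A positive score indicates a set concentrated among up-regulated genes, a negative score a set concentrated among down-regulated genes.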
Working with Genomic Ranges
GenomicRanges Package
GenomicRanges is the foundational Bioconductor package for representing and manipulating genomic intervals:
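```r
library(GenomicRanges)

# Create genomic ranges (coordinates are illustrative)
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(100, 500, 1000), end = c(200, 600, 1500)),
  strand = c("+", "-", "+")
)
```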
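A core operation on genomic intervals is overlap detection. The rule for closed intervals can be sketched in base R as below; `intervals_overlap` is a hypothetical helper written for illustration, and the coordinates are invented. In practice GenomicRanges provides `findOverlaps()` for this, with its own optimized implementation.

```r
# Hypothetical helper: TRUE where closed intervals [start1, end1] and
# [start2, end2] share at least one position.
intervals_overlap <- function(start1, end1, start2, end2) {
  start1 <= end2 & start2 <= end1
}

# Toy peaks and one query region on the same chromosome (coordinates invented).
peaks_toy <- data.frame(start = c(100, 500, 1000), end = c(200, 600, 1500))
q_start <- 150
q_end <- 550

hits <- intervals_overlap(peaks_toy$start, peaks_toy$end, q_start, q_end)
peaks_toy[hits, ]
```

The comparison assumes all intervals lie on the same chromosome; a real implementation also has to match sequence names and, optionally, strand.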
Reading Genomic Data Files
```r
# Read BED files
library(rtracklayer)
peaks <- import("peaks.bed")  # "peaks.bed" is a placeholder file name
```
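BED files store 0-based, half-open intervals, while GRanges objects use 1-based, closed intervals; rtracklayer handles this conversion on import. The base-R sketch below makes the conversion explicit on an invented three-line BED snippet, which can help when reading such files by hand.

```r
# Invented BED content: chrom, 0-based start, end (exclusive), name.
bed_text <- "chr1\t99\t200\tpeak1
chr1\t499\t600\tpeak2
chr2\t999\t1500\tpeak3"

bed <- read.table(
  text = bed_text,
  sep = "\t",
  col.names = c("chrom", "start0", "end", "name"),
  stringsAsFactors = FALSE
)

# 0-based half-open [start0, end) becomes 1-based closed [start0 + 1, end].
bed$start <- bed$start0 + 1L

# Interval width in bases; identical to end - start0 in BED terms.
bed$width <- bed$end - bed$start + 1L
bed[, c("chrom", "start", "end", "width", "name")]
```

Forgetting this shift is a common source of off-by-one errors when mixing BED coordinates with 1-based formats such as GFF or VCF.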
Single-Cell RNA-seq Analysis with Seurat
Basic Seurat Workflow
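```r
library(Seurat)
library(dplyr)

# Load 10X Genomics data ("filtered_feature_bc_matrix/" is a placeholder path)
data <- Read10X(data.dir = "filtered_feature_bc_matrix/")
seurat_obj <- CreateSeuratObject(counts = data)

# Keep cells with a plausible number of detected genes
# (the upper bound of 5000 is an assumed value; tune it per dataset)
seurat_obj <- subset(seurat_obj, subset = nFeature_RNA > 200 & nFeature_RNA < 5000)

# Normalize, select variable genes, reduce dimensions, and cluster
seurat_obj <- NormalizeData(seurat_obj) %>%
  FindVariableFeatures(nfeatures = 2000) %>%
  ScaleData() %>%
  RunPCA(npcs = 30) %>%
  FindNeighbors(dims = 1:20) %>%
  FindClusters(resolution = 0.5) %>%
  RunUMAP(dims = 1:20)

# Visualization
DimPlot(seurat_obj, reduction = "umap", label = TRUE)
FeaturePlot(seurat_obj, features = c("CD3D", "CD14", "MS4A1", "NKG7"))

# Find cluster markers and plot the top 5 per cluster
markers <- FindAllMarkers(seurat_obj)
top_markers <- markers %>%
  group_by(cluster) %>%
  top_n(5, avg_log2FC)
DoHeatmap(seurat_obj, features = top_markers$gene)
```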
Best Practices for Reproducible Genomic Analysis
Project Organization
- Use RStudio Projects to organize analyses
- Maintain separate directories for raw data, processed data, scripts, and results
- Never modify raw data files
Version Control and Reproducibility
- Use Git for version control of analysis scripts
- Record R and package versions with sessionInfo() or renv
- Use R Markdown or Quarto for literate programming, which combines code, results, and narrative
- Consider using Snakemake or Nextflow for pipeline management
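Recording versions can be as simple as saving the output of sessionInfo() next to the results. A minimal sketch follows; the file name `session_info.txt` is an assumption, and a temporary directory is used here so the example does not write into a project.

```r
# Write the R version and attached package versions to a text file.
# "session_info.txt" is an assumed name; in a project, a results directory
# would be a more natural location than tempdir().
out_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), out_file)

# Inspect the first few recorded lines.
head(readLines(out_file), 3)

# With renv, renv::snapshot() records package versions in renv.lock instead.
```

Saving this file at the end of every analysis run makes it possible to tell later which package versions produced a given result.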
Common Pitfalls to Avoid
- Not filtering low-count genes: Genes with very low counts add noise without information
- Ignoring batch effects: Include batch as a covariate in the design formula for differential expression; use ComBat, limma::removeBatchEffect, or Harmony when batch-corrected values are needed, for example for visualization or clustering
- Ignoring multiple testing: Always use adjusted p-values (FDR/BH correction), never raw p-values
- Circular analysis: Don't use the same data for feature selection and validation
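The effect of multiple-testing correction can be seen with base R's p.adjust(). The p-values below are invented for illustration.

```r
# Invented raw p-values for five genes.
p_raw <- c(0.001, 0.008, 0.039, 0.041, 0.6)

# Benjamini-Hochberg adjustment controls the false discovery rate.
p_adj <- p.adjust(p_raw, method = "BH")

# Compare how many genes pass a 0.05 cutoff before and after adjustment.
n_raw <- sum(p_raw < 0.05)
n_adj <- sum(p_adj < 0.05)
data.frame(p_raw, p_adj)
```

With thousands of genes tested, the gap between the two counts grows quickly, which is why the `padj` column of a DESeq2 result is the one to threshold on.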
Conclusion
R and Bioconductor provide a comprehensive and mature ecosystem for genomic data analysis. This guide covers the fundamentals, but the field is vast. As you progress, explore specialized packages for your specific analysis needs — whether that's methylation analysis with minfi, ChIP-seq with DiffBind, or spatial transcriptomics with Seurat v5 or Giotto. The Bioconductor community provides excellent vignettes, workshops, and support forums to help you on your journey. Start with a real dataset, follow the workflows above, and you'll be analyzing genomic data with confidence in no time.