Genomic Data Analysis in R: A Beginner’s Guide

Why R for Genomic Data Analysis?

R is one of the most widely used programming languages for genomic data analysis. The Bioconductor project, a repository of more than 2,200 specialized packages for biological data analysis, provides a broad ecosystem for processing, analyzing, and visualizing omics data. From RNA-seq differential expression to single-cell analysis, and from ChIP-seq peak calling to variant annotation, R/Bioconductor covers most routine genomic analysis tasks.

This guide introduces the essential R tools and workflows for genomic data analysis, targeting biologists and bioinformaticians who are beginning their journey with computational genomics.

Setting Up Your R Environment

Installing R and RStudio

Start by installing R from CRAN (https://cran.r-project.org/) and RStudio Desktop (https://posit.co/downloads/). RStudio provides an integrated development environment with code editing, visualization, and project management features that make R programming much more productive.

Installing Bioconductor

Bioconductor packages are installed using the BiocManager package:

```r
# Install BiocManager
install.packages("BiocManager")

# Install core Bioconductor packages
BiocManager::install(c(
  "DESeq2",           # Differential expression
  "edgeR",            # Differential expression
  "limma",            # Linear models for microarray/RNA-seq
  "GenomicRanges",    # Genomic interval operations
  "clusterProfiler",  # Pathway enrichment
  "org.Hs.eg.db",     # Human gene annotations
  "EnhancedVolcano",  # Volcano plots
  "ComplexHeatmap"    # Advanced heatmaps
))
```

Essential Tidyverse Packages

The tidyverse ecosystem complements Bioconductor for data manipulation and visualization:

```r
# tidyverse already bundles ggplot2, dplyr, and readr
install.packages(c("tidyverse", "patchwork"))
```

RNA-seq Analysis with DESeq2

Understanding the RNA-seq Workflow

A typical RNA-seq analysis workflow consists of:

  • Quality control: FastQC and MultiQC assess read quality

  • Alignment: STAR or HISAT2 maps reads to the reference genome

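  • Quantification: featureCounts counts reads per gene, while Salmon estimates transcript abundances that are then summarized to gene level (for example with tximport)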

  • Differential expression: DESeq2 or edgeR identifies differentially expressed genes

  • Functional analysis: Pathway enrichment reveals biological meaning

Loading Count Data

DESeq2 takes a count matrix (genes × samples) and a sample information table as input:

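```r
library(DESeq2)

# Read count matrix (genes x samples); both file names are placeholders
counts <- as.matrix(read.csv("counts.csv", row.names = 1))

# Sample table; assumed here to contain a `condition` column
coldata <- read.csv("sample_info.csv", row.names = 1)
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
                              design = ~ condition)
```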
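One detail worth checking before building the dataset: DESeq2 expects the rows of the sample table to be in the same order as the columns of the count matrix, and it does not reorder them for you. The base-R sketch below shows one way to check and fix the order. The toy matrix, the sample table, and every name in it are made up for illustration.

```r
# Toy count matrix: 4 genes x 4 samples (values are made up)
counts_toy <- matrix(
  c(10, 0, 25, 3,
    12, 1, 30, 0,
    50, 2, 28, 1,
    45, 0, 31, 4),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3", "s4"))
)

# Sample table deliberately listed in a different order
coldata_toy <- data.frame(
  condition = factor(c("treated", "control", "treated", "control")),
  row.names = c("s3", "s1", "s4", "s2")
)

# Every sample must be described exactly once
stopifnot(setequal(rownames(coldata_toy), colnames(counts_toy)))

# Reorder the sample table to follow the matrix columns
coldata_toy <- coldata_toy[colnames(counts_toy), , drop = FALSE]
aligned <- identical(rownames(coldata_toy), colnames(counts_toy))
```

After the reordering step `aligned` is `TRUE`, and the pair can be passed to `DESeqDataSetFromMatrix()` safely.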

Running Differential Expression Analysis

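```r
# Filter low-count genes: keep genes with >= 10 reads in at least 3 samples
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]

# Fit the model and extract treated vs control results
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "treated", "control"))

# Significant genes; the padj < 0.05 cutoff is an assumed threshold
sig_genes <- subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
```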
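To see what the low-count filter does, here is a small base-R sketch on a made-up matrix. It assumes the rule of keeping genes with at least 10 reads in at least 3 samples. Those thresholds are a convention to adapt to your smallest group size, not a fixed requirement.

```r
# Toy counts: 3 genes x 4 samples (values are made up)
toy <- matrix(
  c(0,  2,  1,  0,    # geneA: low everywhere
    15, 22, 9,  30,   # geneB: >= 10 reads in three samples
    12, 3,  40, 5),   # geneC: >= 10 reads in only two samples
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:4))
)

# toy >= 10 is a logical matrix; rowSums counts the TRUE values per gene
n_samples_ok <- rowSums(toy >= 10)

# Keep genes that pass the threshold in at least 3 samples
keep_toy <- n_samples_ok >= 3
filtered <- toy[keep_toy, , drop = FALSE]
```

Only `geneB` survives here. The same expression applied to `counts(dds)` filters a real DESeq2 dataset.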

Visualization

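```r
# MA plot
plotMA(res, ylim = c(-5, 5))

# Volcano plot
# Note: y = "pvalue" plots raw p-values; use padj when calling genes significant
library(EnhancedVolcano)
EnhancedVolcano(res,
  lab = rownames(res),
  x = "log2FoldChange",
  y = "pvalue",
  title = "Treated vs Control",
  pCutoff = 0.05,
  FCcutoff = 1
)

# PCA plot on variance-stabilized counts
vsd <- vst(dds, blind = FALSE)
plotPCA(vsd, intgroup = "condition")
```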
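For intuition about the PCA step, the sketch below mimics in base R roughly what `plotPCA()` does: pick the most variable genes (500 by default in DESeq2), then run `prcomp()` with samples as rows. The data are simulated and the variable names are illustrative, so treat this as an approximation of the idea rather than DESeq2's own code.

```r
set.seed(42)

# Simulated log-scale expression: 200 genes x 6 samples (made-up data)
expr <- matrix(rnorm(200 * 6, mean = 8, sd = 1), nrow = 200,
               dimnames = list(paste0("gene", 1:200),
                               paste0(rep(c("ctrl", "trt"), each = 3), 1:3)))

# Shift the first 20 genes upward in the treated samples
expr[1:20, 4:6] <- expr[1:20, 4:6] + 3

# Keep the 50 most variable genes, then run PCA with samples as rows
row_var <- apply(expr, 1, var)
top <- order(row_var, decreasing = TRUE)[1:50]
pca <- prcomp(t(expr[top, ]))

# Fraction of variance explained by each component
percent_var <- pca$sdev^2 / sum(pca$sdev^2)

# PC1 scores per sample; the two groups should fall on opposite sides
pc1 <- pca$x[, "PC1"]
```

If samples from the same condition do not group together on the first components of your real data, check for batch effects or sample swaps before running the differential expression step.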

Gene Set Enrichment Analysis

Over-Representation Analysis (ORA)

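```r
library(clusterProfiler)
library(org.Hs.eg.db)

# Convert gene symbols to Entrez IDs (assumes rownames are gene symbols)
sig_symbols <- rownames(sig_genes)
entrez <- bitr(sig_symbols, fromType = "SYMBOL", toType = "ENTREZID",
               OrgDb = org.Hs.eg.db)

# GO over-representation test; the "BP" ontology is an assumed choice
ego <- enrichGO(entrez$ENTREZID, OrgDb = org.Hs.eg.db, ont = "BP",
                pAdjustMethod = "BH", readable = TRUE)
```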
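Conceptually, over-representation analysis asks whether a gene set contains more significant genes than expected by chance, which is a one-sided hypergeometric test. The base-R sketch below works through a single gene set with made-up numbers. It illustrates the statistic only and is not clusterProfiler's internal code.

```r
# Made-up numbers for a single gene set
n_background <- 15000  # genes tested (the "universe")
n_in_set     <- 200    # background genes annotated to the set
n_sig        <- 500    # significant genes
n_overlap    <- 20     # significant genes that are in the set

# Expected overlap if significant genes were drawn at random
expected <- n_sig * n_in_set / n_background

# P(overlap >= n_overlap) under the hypergeometric distribution
p_ora <- phyper(n_overlap - 1, n_in_set, n_background - n_in_set, n_sig,
                lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test on a 2x2 table
tab <- matrix(c(n_overlap,         n_in_set - n_overlap,
                n_sig - n_overlap, n_background - n_in_set - n_sig + n_overlap),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

The choice of background matters: restricting the universe to genes that were actually tested, rather than the whole genome, usually gives more conservative and more defensible p-values.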

Gene Set Enrichment Analysis (GSEA)

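```r
# Rank all genes by log2FoldChange (named vector, decreasing order)
gene_list <- res$log2FoldChange
names(gene_list) <- rownames(res)
gene_list <- sort(gene_list, decreasing = TRUE)  # sort() also drops NA values

# GSEA on GO terms; keyType assumes the names are gene symbols
gsea_res <- gseGO(geneList = gene_list, OrgDb = org.Hs.eg.db,
                  keyType = "SYMBOL", ont = "BP")
```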

Working with Genomic Ranges

GenomicRanges Package

GenomicRanges is the foundational Bioconductor package for representing and manipulating genomic intervals:

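```r
library(GenomicRanges)

# Create genomic ranges (coordinates are illustrative)
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(100, 500, 200), end = c(300, 800, 400)),
  strand = c("+", "-", "+")
)

# Basic interval operations
width(gr)   # length of each interval
reduce(gr)  # merge overlapping or adjacent intervals
```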

Reading Genomic Data Files

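```r
# Read BED files ("peaks.bed" is a placeholder path)
# rtracklayer is installed with BiocManager::install("rtracklayer")
library(rtracklayer)
peaks <- import("peaks.bed", format = "BED")
```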

Single-Cell RNA-seq Analysis with Seurat

Basic Seurat Workflow

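```r
# Seurat is on CRAN: install.packages("Seurat")
library(Seurat)
library(dplyr)

# Load 10X Genomics data (directory path is a placeholder)
data <- Read10X(data.dir = "filtered_feature_bc_matrix/")
seurat_obj <- CreateSeuratObject(counts = data)

# QC filter; the upper bound is an assumed value, tune it per dataset
seurat_obj <- subset(seurat_obj,
                     subset = nFeature_RNA > 200 & nFeature_RNA < 5000)

# Normalize, reduce dimensions, cluster, and embed
seurat_obj <- NormalizeData(seurat_obj) %>%
  FindVariableFeatures(nfeatures = 2000) %>%
  ScaleData() %>%
  RunPCA(npcs = 30) %>%
  FindNeighbors(dims = 1:20) %>%
  FindClusters(resolution = 0.5) %>%
  RunUMAP(dims = 1:20)

# Visualization
DimPlot(seurat_obj, reduction = "umap", label = TRUE)
FeaturePlot(seurat_obj, features = c("CD3D", "CD14", "MS4A1", "NKG7"))

# Find cluster markers
markers <- FindAllMarkers(seurat_obj, only.pos = TRUE)
top_markers <- markers %>%
  group_by(cluster) %>%
  top_n(5, avg_log2FC)
DoHeatmap(seurat_obj, features = top_markers$gene)
```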

Best Practices for Reproducible Genomic Analysis

Project Organization

  • Use RStudio Projects to organize analyses

  • Maintain separate directories for raw data, processed data, scripts, and results

  • Never modify raw data files
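The layout above can be set up in a few lines of base R. This is a minimal sketch: the directory names are one common convention rather than a requirement, and it writes under a temporary directory so it is safe to run as-is.

```r
# Project root under a temporary directory (swap in your own path)
project_root <- file.path(tempdir(), "rnaseq_project")

# Separate locations for raw data, processed data, scripts, and results
subdirs <- c("data/raw", "data/processed", "scripts", "results")

# Create each directory, including missing parents
for (d in file.path(project_root, subdirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Confirm that the skeleton exists
created <- dir.exists(file.path(project_root, subdirs))
```

Scripts then read from `data/raw` and write only to `data/processed` or `results`, which keeps the raw files untouched.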

Version Control and Reproducibility

  • Use Git for version control of analysis scripts

  • Record R and package versions with sessionInfo() or renv

  • Use R Markdown or Quarto for literate programming — combining code, results, and narrative

  • Consider using Snakemake or Nextflow for pipeline management
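Recording the session can be as simple as writing the output of `sessionInfo()` next to your results. The sketch below uses a temporary file whose name is arbitrary. The renv calls are shown only as comments because they modify the project they run in.

```r
# Write the R version and attached package versions to a text file
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(utils::capture.output(utils::sessionInfo()), session_file)

# Read it back to confirm that something was recorded
session_lines <- readLines(session_file)

# With renv installed, these calls pin package versions in a lockfile:
# renv::init()      # set up a project-local library
# renv::snapshot()  # record the current package versions in renv.lock
```

Saving this file at the end of each analysis run makes it possible to tell later which package versions produced a given result.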

Common Pitfalls to Avoid

  • Not filtering low-count genes: Genes with very low counts add noise without information

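  • Ignoring batch effects: For differential expression, include batch as a covariate in the design formula (for example ~ batch + condition); use ComBat, limma::removeBatchEffect, or Harmony when you need corrected values for visualization, clustering, or integration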

  • Multiple testing: Always use adjusted p-values (FDR/BH correction), never raw p-values

  • Circular analysis: Don't use the same data for feature selection and validation
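The multiple-testing point can be illustrated with base R's `p.adjust()`, which implements the Benjamini-Hochberg procedure that DESeq2 uses by default for its `padj` column. The p-values below are simulated with no true effect, so every gene that passes a cutoff is a false positive.

```r
set.seed(1)

# 10,000 genes with no true effect: p-values are uniform on [0, 1]
p_raw <- runif(10000)

# Benjamini-Hochberg adjustment controls the false discovery rate
p_bh <- p.adjust(p_raw, method = "BH")

# Raw cutoff: roughly 5% of genes pass by chance alone
n_raw_hits <- sum(p_raw < 0.05)

# Adjusted cutoff: typically no genes pass when nothing is truly changing
n_bh_hits <- sum(p_bh < 0.05)
```

Comparing `n_raw_hits` with `n_bh_hits` shows why a list of genes filtered on raw p-values can contain hundreds of entries even when nothing is differentially expressed.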

Conclusion

R and Bioconductor provide a comprehensive and mature ecosystem for genomic data analysis. This guide covers the fundamentals, but the field is vast. As you progress, explore specialized packages for your specific analysis needs — whether that's methylation analysis with minfi, ChIP-seq with DiffBind, or spatial transcriptomics with Seurat v5 or Giotto. The Bioconductor community provides excellent vignettes, workshops, and support forums to help you on your journey. Start with a real dataset, follow the workflows above, and you'll be analyzing genomic data with confidence in no time.
