Machine Learning in Systems Biology: Methods, Applications, and Best Practices
The Intersection of Machine Learning and Systems Biology
Systems biology aims to understand biological systems as integrated wholes. Machine learning provides the computational tools to extract patterns from the massive, complex datasets that systems biology generates. Together, they form a powerful partnership: systems biology provides the biological context and data, while machine learning provides the analytical frameworks to turn data into knowledge.
This article surveys the key machine learning methods used in systems biology, their applications, and practical guidelines for applying them effectively to biological data.
Supervised Learning in Systems Biology
Classification: Disease Diagnosis and Subtyping
Classification algorithms predict categorical outcomes from molecular data. In systems biology, common classification tasks include:
- Disease diagnosis: Distinguishing cancer from normal tissue based on gene expression profiles
- Subtype classification: Assigning tumors to molecular subtypes (e.g., PAM50 breast cancer subtypes)
- Drug response prediction: Predicting whether a patient will respond to a specific treatment
- Cell type classification: Annotating cell types in single-cell RNA-seq data
Key Classification Algorithms
Random Forests: Ensemble of decision trees that vote on the classification. Random forests handle high-dimensional data well, provide feature importance rankings, and are relatively robust to overfitting. In genomics, they're widely used for gene expression classification and feature selection.
Support Vector Machines (SVMs): Find the optimal hyperplane separating classes in high-dimensional space. SVMs with radial basis function (RBF) kernels are effective for non-linear classification of omics data. They work well with small sample sizes common in biological studies.
Gradient Boosting (XGBoost/LightGBM): Sequentially builds decision trees, each correcting errors of the previous ones. These models often achieve top performance in biological prediction tasks and provide SHAP-based interpretability.
Deep Neural Networks: Multi-layer networks that learn hierarchical representations. While powerful, DNNs typically require more samples than traditional methods and are prone to overfitting on small biological datasets. Transfer learning and careful regularization can mitigate these issues.
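To make this concrete, here is a minimal scikit-learn sketch of random forest classification on an expression matrix, with AUROC estimated by stratified cross-validation. The matrix and labels below are random placeholders standing in for a real samples-by-genes dataset.

```python
# Minimal sketch: random forest classification of a gene expression matrix.
# `X` and `y` are random placeholders; data loading is omitted.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # placeholder: samples x genes
y = rng.integers(0, 2, size=100)   # placeholder: tumor/normal labels

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"AUROC: {scores.mean():.3f} +/- {scores.std():.3f}")

# Feature importance ranking (indices of top genes by mean decrease in impurity)
clf.fit(X, y)
top_genes = np.argsort(clf.feature_importances_)[::-1][:20]
```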
Regression: Quantitative Predictions
Regression models predict continuous outcomes:
- Gene expression prediction: Predicting mRNA levels from epigenomic features (histone marks, DNA methylation)
- Drug sensitivity prediction: Predicting IC50 values from genomic features of cancer cell lines (see the sketch after this list)
- Protein abundance prediction: Estimating protein levels from mRNA expression
- Survival prediction: Predicting time-to-event outcomes (overall survival, progression-free survival)
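As an illustration of regularized regression in the p >> n setting, the sketch below fits an elastic net to predict a continuous drug sensitivity readout (e.g., log IC50) from genomic features. The data are placeholders; the pipeline structure is the point.

```python
# Minimal sketch: elastic net regression of drug sensitivity on genomic features.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))   # placeholder: cell lines x genomic features
y = rng.normal(size=200)           # placeholder: log IC50 values

model = make_pipeline(
    StandardScaler(),                                      # scale features before penalization
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, max_iter=10000),
)
model.fit(X, y)
```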
Unsupervised Learning for Biological Discovery
Clustering
Clustering algorithms group samples or features without predefined labels, revealing hidden structure in biological data:
- k-means clustering: Fast and simple, but requires specifying the number of clusters. Use the elbow method, silhouette analysis, or gap statistic to determine optimal k (see the sketch after this list).
- Hierarchical clustering: Builds a dendrogram of nested clusters. Ward's method with Euclidean distance is a good default for gene expression data. Commonly used for heatmap visualization.
- DBSCAN: Density-based clustering that finds arbitrarily shaped clusters and identifies outliers. Useful for single-cell data where clusters have irregular shapes.
- Leiden/Louvain: Graph-based community detection algorithms, standard for single-cell RNA-seq clustering. They operate on k-nearest neighbor graphs and scale well to millions of cells.
- Consensus clustering: Runs clustering multiple times with subsampling to identify robust clusters. The ConsensusClusterPlus R package is widely used for cancer subtyping.
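A minimal sketch of two of these approaches: silhouette analysis to choose k for k-means, followed by hierarchical clustering with Ward linkage. The input matrix is a placeholder for, say, samples by top variable genes.

```python
# Minimal sketch: silhouette-guided k-means and Ward hierarchical clustering.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))  # placeholder: samples x top variable genes

# Scan candidate k and report the silhouette score for each
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))

# Hierarchical clustering with Ward linkage (Euclidean distance)
ward_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)
```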
Dimensionality Reduction
Reducing high-dimensional omics data to interpretable low-dimensional representations:
- PCA (Principal Component Analysis): Linear method that captures maximum variance. Essential for quality control, batch effect detection, and initial data exploration. Always plot PC1 vs PC2 as a first step.
- t-SNE: Non-linear method that preserves local structure. Excellent for visualizing clusters in single-cell data. The perplexity parameter should be tuned (typical range: 5-50). Does not preserve global distances.
- UMAP: Non-linear method that is generally faster than t-SNE and often preserves global structure better. The default choice for single-cell data visualization. Key parameters: n_neighbors (local structure) and min_dist (point spread). A combined PCA-then-UMAP sketch follows this list.
- Autoencoders: Neural network-based non-linear dimensionality reduction. Variational autoencoders (VAEs) such as scVI learn probabilistic latent spaces useful for data integration and batch correction.
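The sketch below chains PCA and UMAP the way single-cell workflows typically do: reduce to ~50 principal components first, then embed in two dimensions. It assumes the umap-learn package is installed; the input matrix is a placeholder for log-normalized expression.

```python
# Minimal sketch: PCA for QC and denoising, then UMAP for 2D visualization.
# Assumes the umap-learn package is installed.
import numpy as np
from sklearn.decomposition import PCA
import umap

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2000))   # placeholder: cells x genes (log-normalized)

pcs = PCA(n_components=50).fit_transform(X)   # plot PC1 vs PC2 for quality control
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(pcs)
```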
Deep Learning for Biological Sequences
Convolutional Neural Networks (CNNs)
When applied to biology, CNNs treat DNA or protein sequences as one-dimensional signals. Applications include:
- Transcription factor binding prediction: DeepBind and related models predict TF binding from DNA sequence, learning sequence motifs directly from data (a simplified architecture is sketched after this list).
- Regulatory element prediction: Basenji and Enformer predict gene expression, chromatin accessibility, and histone modifications from DNA sequence across cell types.
- Splice site prediction: SpliceAI predicts splicing from pre-mRNA sequence with remarkable accuracy.
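A simplified, DeepBind-style architecture in PyTorch illustrates the idea: one-hot encoded DNA enters a 1D convolution whose filters act as learned motif detectors, followed by global max pooling and a linear scoring layer. The layer sizes are illustrative, not the published model.

```python
# Minimal sketch of a DeepBind-style 1D CNN for TF binding prediction.
import torch
import torch.nn as nn

class TFBindingCNN(nn.Module):
    def __init__(self, n_filters=64, motif_len=15):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel_size=motif_len)  # 4 channels = A/C/G/T
        self.fc = nn.Linear(n_filters, 1)

    def forward(self, x):               # x: (batch, 4, sequence_length), one-hot DNA
        h = torch.relu(self.conv(x))    # filters act as motif detectors
        h = h.max(dim=2).values         # global max pooling over positions
        return self.fc(h).squeeze(-1)   # one binding logit per sequence

model = TFBindingCNN()
dna = torch.zeros(8, 4, 200)            # placeholder batch of one-hot sequences
logits = model(dna)
```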
Transformer Models
Transformers have revolutionized biological sequence analysis:
- Protein language models: ESM-2 (Meta AI) and ProtTrans learn general-purpose protein representations from millions of sequences. These embeddings capture structural, functional, and evolutionary information. Fine-tuning on downstream tasks achieves state-of-the-art performance for function prediction, structure prediction, and variant effect prediction (an embedding-extraction sketch follows this list).
- DNA language models: Nucleotide Transformer and DNABERT learn representations of genomic sequences, enabling transfer learning for regulatory genomics tasks.
- scGPT: A generative pre-trained transformer for single-cell biology that performs cell annotation, gene perturbation prediction, and multi-omics integration.
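As one way to use these models, the sketch below extracts per-protein embeddings from a small ESM-2 checkpoint via the Hugging Face transformers library. The checkpoint name and the mean-pooling step are illustrative assumptions; consult the model documentation for recommended usage.

```python
# Minimal sketch: per-protein embeddings from an ESM-2 checkpoint (assumed checkpoint name).
import torch
from transformers import AutoModel, AutoTokenizer

name = "facebook/esm2_t6_8M_UR50D"        # small ESM-2 model, used here for illustration
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"]     # toy protein sequence
batch = tokenizer(seqs, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**batch)
embeddings = out.last_hidden_state.mean(dim=1)  # mean-pooled per-protein vectors
```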
Graph Neural Networks (GNNs)
GNNs operate on graph-structured biological data:
- Protein structure prediction: AlphaFold2 uses a specialized graph transformer (Evoformer) to predict protein structures from sequences and multiple sequence alignments.
- Drug-target interaction: GNNs predict binding affinity between molecular graphs (drugs) and protein graphs (targets).
- Gene regulatory network inference: GNNs learn regulatory relationships from single-cell expression data, often outperforming traditional methods such as GENIE3.
- Pathway analysis: Graph-based models incorporate pathway structure (KEGG, Reactome) as inductive bias for disease classification and drug response prediction.
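A minimal sketch of the core pattern, assuming PyTorch Geometric is installed: a two-layer graph convolutional network over a gene-gene graph, where node features might be expression values and edges might come from a pathway or protein-protein interaction database. All tensors here are placeholders.

```python
# Minimal sketch: two-layer GCN over a gene-gene graph (PyTorch Geometric assumed installed).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GeneGCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, n_classes)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))   # message passing over graph edges
        return self.conv2(h, edge_index)        # per-node logits

x = torch.randn(100, 16)                        # placeholder node features (e.g., expression)
edge_index = torch.randint(0, 100, (2, 400))    # placeholder edges (e.g., from a PPI database)
logits = GeneGCN(16, 32, 2)(x, edge_index)
```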
Practical Guidelines for ML in Systems Biology
Data Preprocessing
- Normalization: Always normalize omics data appropriately before ML. Log-transform count data, use a variance-stabilizing transformation (VST) for RNA-seq, and median-center proteomics data.
- Batch correction: If data come from multiple batches, correct batch effects before analysis. ComBat, Harmony, and scVI handle this well.
- Missing data: Handle missing values explicitly. kNN imputation or random forest imputation (missForest) are good general-purpose methods.
- Feature scaling: Standardize features (zero mean, unit variance) for methods sensitive to scale (SVMs, neural networks, PCA). Tree-based methods don't require scaling. A pipeline combining these steps is sketched after this list.
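These steps compose naturally in a scikit-learn Pipeline, which also guarantees they are fit only on training data during cross-validation. A minimal sketch, assuming count-like input features:

```python
# Minimal sketch: log-transform, scale, and classify inside one Pipeline.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([
    ("log", FunctionTransformer(np.log1p)),            # log-transform count data
    ("scale", StandardScaler()),                        # zero mean, unit variance
    ("clf", SVC(kernel="rbf", class_weight="balanced")),
])
```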
Model Evaluation
- Cross-validation: Use stratified k-fold cross-validation (k = 5 or 10) for performance estimation. For small datasets, leave-one-out cross-validation provides lower bias but higher variance.
- Nested cross-validation: When performing hyperparameter tuning, use nested CV to prevent information leakage. The inner loop tunes parameters; the outer loop estimates performance (see the sketch after this list).
- Independent test sets: Whenever possible, validate on a completely independent dataset, ideally from a different institution or study.
- Metrics: Use AUROC for balanced datasets and AUPRC for imbalanced datasets. Report confidence intervals using bootstrapping.
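A minimal nested cross-validation sketch with scikit-learn, using placeholder data: GridSearchCV handles the inner tuning loop and cross_val_score provides the outer performance estimate.

```python
# Minimal sketch: nested CV (inner loop tunes C, outer loop estimates performance).
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 500))    # placeholder features
y = rng.integers(0, 2, size=120)   # placeholder labels

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10]}, cv=inner, scoring="roc_auc")
nested_scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
print(f"Nested CV AUROC: {nested_scores.mean():.3f}")
```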
Interpretability
- Feature importance: Use permutation importance, SHAP values, or integrated gradients to understand model decisions (see the sketch after this list).
- Biological validation: Map important features to biological knowledge. Do important genes belong to relevant pathways? Are they expressed in relevant tissues?
- Attention visualization: For transformer models, attention weights can highlight which parts of a sequence drive predictions.
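A minimal sketch of permutation importance on a held-out split, with placeholder data and hypothetical gene names, showing how top-ranked features can be pulled out for downstream biological interpretation:

```python
# Minimal sketch: permutation importance on a held-out test split.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 300))                              # placeholder features
y = rng.integers(0, 2, size=150)                             # placeholder labels
genes = np.array([f"gene_{i}" for i in range(X.shape[1])])   # hypothetical gene names

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
imp = permutation_importance(clf, X_te, y_te, n_repeats=20, random_state=0)
print(genes[np.argsort(imp.importances_mean)[::-1][:10]])    # top 10 features for follow-up
```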
Common Pitfalls
- Data leakage: The most dangerous pitfall. Ensure no information from test samples leaks into training. Feature selection, normalization, and imputation must be performed within CV folds (see the sketch after this list).
- Batch confounding: If disease status is confounded with batch, the model will learn batch effects rather than biology. Always check for confounding.
- Class imbalance: Many biological classification problems have imbalanced classes. Use stratified sampling, class weights, or SMOTE to address this.
- p >> n problem: When features vastly outnumber samples (common in omics), regularization is essential. LASSO, elastic net, and tree-based methods handle this naturally.
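To avoid the leakage pitfall in practice, wrap feature selection (and any other data-dependent step) in a Pipeline so it is re-fit inside every fold. A minimal sketch with placeholder p >> n data:

```python
# Minimal sketch: feature selection performed inside each CV fold via a Pipeline.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))   # placeholder: p >> n omics matrix
y = rng.integers(0, 2, size=100)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=100)),   # refit on each training fold only
    ("clf", LogisticRegression(penalty="l2", max_iter=1000, class_weight="balanced")),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean())
```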
Software and Resources
Python
- scikit-learn: Comprehensive ML library with a consistent API for classification, regression, clustering, and preprocessing
- PyTorch/TensorFlow: Deep learning frameworks with biological model implementations
- Scanpy: Single-cell analysis with integrated ML tools
- DeepChem: ML for drug discovery and molecular property prediction
R
- caret/tidymodels: ML workflow frameworks
- glmnet: Regularized linear models
- ranger: Fast random forest implementation
- Seurat: Single-cell analysis with built-in ML components
Conclusion
Machine learning has become an indispensable tool in the systems biologist's toolkit. From supervised classification of disease states to unsupervised discovery of cellular subtypes, from deep learning on biological sequences to graph neural networks on interaction networks, ML methods are generating insights that would be impossible with traditional statistical approaches. The key to success is combining computational sophistication with biological knowledge, rigorous evaluation, and careful attention to the unique challenges of biological data. As biological datasets grow larger and ML methods more powerful, this intersection will continue to produce transformative discoveries in our understanding of living systems.