Can an LLM Run an RNA-seq Analysis on Its Own? Building ARIA, a Decision-Aware Transcriptome Framework
A companion to our preprint: reproducible pipelines run RNA-seq steps reliably, but the decisions between steps still need an expert. ARIA puts an LLM in that reasoning seat across 8 decision points — and on 4 public datasets it recovered paired designs, technical covariates, and known biology, with cross-method agreement r > 0.99.
Preprint (Research Square, v1, not yet peer-reviewed): https://doi.org/10.21203/rs.3.rs-9500973/v1 · Code (open source, MIT): github.com/shoo99/ARIA. This describes a research framework and is not a clinical or production tool.
TL;DR (Quick Answer)
Reproducible RNA-seq pipelines (Nextflow, nf-core/rnaseq) solve execution: they run the same steps the same way every time. What they don't solve is the decisions between the steps. Is the QC good enough? Is this a paired design? With few DEGs, should enrichment switch from ORA to GSEA? Those still need an expert. ARIA puts a large language model (LLM) in that reasoning seat.
- What it is — an open-source framework where an LLM acts as the reasoning engine across 8 Decision Points (QC, DE strategy, design recognition, signature detection, cross-method validation, interpretation, sensitivity analysis, reporting), combining hard rule-based thresholds with LLM contextual reasoning.
- Does it actually work? On 4 public datasets across 3 species it detected paired designs (+21–47% DEGs), caught technical covariates like library type (+4–30% DEGs), cross-validated across DESeq2/edgeR/limma-voom (r > 0.99), and recovered known biology (7/7 dexamethasone-responsive genes; 9/11 FMRP targets; 15/16 tissue markers).
- How it stays honest — execution is rule-based and deterministic; the LLM reasons and interprets; every LLM prompt and response is logged for reproducibility; cross-method validation is a built-in guardrail; hallucination risk is addressed explicitly.
The point isn't to replace bioinformaticians or the tools they trust — it's to automate the adaptive decision layer that currently sits between fixed pipeline steps.
The real bottleneck in RNA-seq isn't execution
A bulk RNA-seq analysis is a chain: QC → alignment (STAR) → quantification (Salmon) → normalization → differential expression (DESeq2/edgeR/limma) → enrichment (GSEA/ORA) → interpretation. Workflow managers like Nextflow and community pipelines like nf-core/rnaseq make that chain reproducible and scalable.
But they execute a fixed path decided before the data is examined. When a dataset yields unexpectedly few DEGs, a fixed pipeline keeps going with ORA — and can miss pathway-level signal that GSEA would catch. The expertise to adapt — "this looks paired, model the pairing"; "library type is confounded, add it as a covariate"; "this DEG count says switch strategy" — is exactly what stays manual. That's the gap ARIA targets.
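The ORA-to-GSEA switch described above can be written as a small rule. This is a minimal sketch, not ARIA's actual logic: the function name `choose_enrichment` and the default cutoff of 50 DEGs are assumptions made for illustration.

```python
def choose_enrichment(n_degs: int, min_degs_for_ora: int = 50) -> str:
    """Pick an enrichment method from the number of DEGs.

    ORA tests a gene list for over-representation, so it needs enough
    genes passing the cutoff. GSEA ranks all genes and can still be run
    when few pass. The default threshold is a hypothetical placeholder.
    """
    if n_degs < 0:
        raise ValueError("n_degs must be non-negative")
    return "ORA" if n_degs >= min_degs_for_ora else "GSEA"
```

A fixed pipeline hard-codes one branch of this function; an adaptive layer evaluates it after the DEG count is known.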
What ARIA does: 8 decision points, rules + reasoning
ARIA wraps the standard toolchain (DESeq2 v1.46, edgeR v4.4, limma-voom, fgsea, WGCNA, STRING) and inserts an LLM reasoning layer at 8 decision points:
- QC assessment — read the quality metrics, decide pass/adapt.
- DE strategy — pick the modeling approach for the data at hand.
- Design recognition — detect paired/blocked designs from the metadata.
- Signature detection — spot technical covariates (e.g., library type).
- Cross-method validation — run DESeq2/edgeR/limma-voom and compare.
- Interpretation — contextualize hits biologically.
- Sensitivity analysis — sweep a matrix of padj × log-fold-change cutoffs.
- Reporting — assemble the result.
Crucially, execution is rule-based (e.g., the sensitivity step deterministically tests padj cutoffs of 0.01 / 0.05 / 0.1 crossed with log-fold-change cutoffs of 0 / 0.5 / 1 / 2); the LLM's job is to reason about and interpret those outputs, not to invent numbers.
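The deterministic sweep can be sketched in a few lines. The cutoff values below are the ones listed above; everything else is an assumption for illustration, including the function name `sensitivity_grid`, the input format, and the comparison conventions (padj strictly below the cutoff, absolute log fold change at or above it).

```python
from itertools import product

PADJ_CUTOFFS = (0.01, 0.05, 0.1)
LFC_CUTOFFS = (0.0, 0.5, 1.0, 2.0)

def sensitivity_grid(results):
    """Count DEGs for every padj x |log fold change| cutoff pair.

    `results` is an iterable of (padj, lfc) tuples, one per gene.
    Genes whose padj is None (e.g. filtered out) are never counted.
    """
    rows = list(results)
    grid = {}
    for padj_cut, lfc_cut in product(PADJ_CUTOFFS, LFC_CUTOFFS):
        grid[(padj_cut, lfc_cut)] = sum(
            1
            for padj, lfc in rows
            if padj is not None and padj < padj_cut and abs(lfc) >= lfc_cut
        )
    return grid
```

Three padj cutoffs crossed with four fold-change cutoffs give 12 DEG counts per dataset. The code produces those counts; the LLM only reads them.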
Results: it recovered designs, covariates, and known biology
We benchmarked on four public datasets of escalating difficulty:
| Dataset | Species | Notable feature | DEGs |
|---|---|---|---|
| SEQC (GSE49712) | Human | reference / tissue markers | 10,430 |
| Airway (GSE52778) | Human | paired design | 951 |
| Fmr1 KO (GSE180135) | Mouse | FMRP targets | 398 |
| Pasilla (GSE18508) | Drosophila | mixed library types | 224 |
- Design recognition mattered most. Detecting the paired structure in Airway increased DEG detection by 21–47% — over a thousand genes a naive analysis would miss, from a single adaptive decision.
- Covariate detection (library type) added 4–30% DEGs.
- Cross-method agreement was near-perfect (r > 0.99 across DESeq2/edgeR/limma-voom) — a guardrail that the reasoning didn't drift from established statistics.
- Known biology came back: 7/7 dexamethasone-responsive genes (Himes et al. 2014) in Airway; 9/11 FMRP translational targets in Fmr1 KO; 15/16 tissue markers in SEQC (the one "miss," BCL2, was brain-enriched here despite a cancer-marker label — arguably a correct call).
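To make the design-recognition result concrete, here is one simple heuristic for spotting a pairing factor in sample metadata. It is a sketch under stated assumptions, not ARIA's implementation: the function name `find_pairing_columns`, the list-of-dicts metadata format, and the rule itself (every level of a column holds exactly one sample per condition) are all illustrative.

```python
from collections import defaultdict

def find_pairing_columns(samples, condition_col):
    """Return metadata columns that look like a pairing/blocking factor.

    A column qualifies when each of its levels contains exactly one
    sample from every condition, e.g. each donor contributes one
    treated and one untreated sample.
    """
    if not samples:
        return []
    conditions = {s[condition_col] for s in samples}
    if len(conditions) < 2:
        return []
    candidates = []
    for col in (c for c in samples[0] if c != condition_col):
        by_level = defaultdict(list)
        for s in samples:
            by_level[s[col]].append(s[condition_col])
        if len(by_level) < 2:
            continue  # a constant column cannot define pairs
        if all(
            len(conds) == len(conditions) and set(conds) == conditions
            for conds in by_level.values()
        ):
            candidates.append(col)
    return candidates
```

A column flagged this way would then enter the model as a blocking term, for example a DESeq2 design of the form `~ donor + treatment` rather than `~ treatment` alone (the column names here are hypothetical).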
How an LLM pipeline stays trustworthy
The obvious worry with "LLM runs the analysis" is hallucination and irreproducibility. ARIA's design answers both directly:
- Numbers come from tools, not the model. Statistics are computed by DESeq2/edgeR/limma; the LLM reasons over the results.
- Everything is logged. All LLM prompts and responses are recorded, so a run can be audited and reproduced.
- Cross-method validation is a built-in check: if three DE methods agree at r > 0.99, the reported statistics do not hinge on which method the reasoning layer favored.
- Hallucination risk is treated as a first-class design concern, not an afterthought.
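The cross-method check in the list above can be sketched as a pairwise correlation. The text does not say which quantity r is computed over, so this example assumes per-gene log fold changes on the genes shared by all methods; the function names and the input format are also assumptions for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)

def cross_method_agreement(lfc_by_method, min_r=0.99):
    """Compare every pair of methods on their shared genes.

    `lfc_by_method` maps method name -> {gene: log fold change}.
    Returns {(method_a, method_b): (r, passed)}.
    """
    shared = sorted(set.intersection(*(set(v) for v in lfc_by_method.values())))
    report = {}
    for a, b in combinations(sorted(lfc_by_method), 2):
        r = pearson(
            [lfc_by_method[a][g] for g in shared],
            [lfc_by_method[b][g] for g in shared],
        )
        report[(a, b)] = (r, r > min_r)
    return report
```

Any pair that fails the threshold is a signal to stop and inspect the analysis rather than pass the numbers on to interpretation.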
Honest Limitations
- Preprint, not peer-reviewed — conclusions are provisional.
- LLM reasoning adds cost and some run-to-run variability. Logging and rule-based execution mitigate this, but it remains a real tradeoff versus a fixed pipeline.
- Benchmarked on public bulk RNA-seq — single-cell and more exotic designs aren't covered here.
- Not a replacement for expert judgment — it automates routine adaptive decisions; novel or ambiguous situations still need a human.
- Recovery, not infallibility — e.g., 9/11 FMRP targets, not 11/11; treat outputs as strong drafts to verify.
FAQ
Q: Is this just "ChatGPT does my RNA-seq"?
No. The statistics are computed by standard, validated tools; the LLM only makes and explains the decisions between steps (design, covariates, strategy, interpretation). It's a reasoning layer, not a number generator.
Q: How does it avoid hallucinated results?
Execution is rule-based and deterministic, the LLM reasons over tool outputs (not raw data), every prompt/response is logged for audit, and three DE methods are cross-checked (r > 0.99 here).
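As an illustration of what prompt/response logging can look like, here is a minimal append-only audit log in JSON Lines format. The record fields, the function name `log_llm_call`, and the file layout are assumptions for this sketch and are not taken from the ARIA codebase.

```python
import hashlib
import json
from datetime import datetime, timezone

def log_llm_call(log_path, decision_point, prompt, response):
    """Append one prompt/response pair to a JSON Lines audit log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "decision_point": decision_point,
        "prompt": prompt,
        "response": response,
        # A content hash makes it easy to compare exchanges across runs.
        "sha256": hashlib.sha256(
            (prompt + "\n" + response).encode("utf-8")
        ).hexdigest(),
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

One line per call keeps the log easy to diff and to replay decision by decision.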
Q: What did the adaptive decisions actually buy?
Concretely: recognizing the Airway paired design recovered 21–47% more DEGs — a thousand-plus genes a fixed, design-blind run would have missed.
Q: Can I run it on my own data?
Yes — it's open source (MIT) at github.com/shoo99/ARIA, built on DESeq2/edgeR/limma-voom/fgsea with public benchmark datasets included.
Resources
- Preprint (Research Square, v1): https://doi.org/10.21203/rs.3.rs-9500973/v1
- Code (MIT): github.com/shoo99/ARIA
- Toolchain: STAR, Salmon, DESeq2, edgeR, limma-voom, fgsea; pipeline context: nf-core/rnaseq.