Can an LLM Run an RNA-seq Analysis on Its Own? Building ARIA, a Decision-Aware Transcriptome Framework

An LLM reasoning layer over the RNA-seq workflow

Companion to our preprint. Preprint (Research Square, v1, not yet peer-reviewed): https://doi.org/10.21203/rs.3.rs-9500973/v1 · Code (open source, MIT): github.com/shoo99/ARIA. This describes a research framework and is not a clinical or production tool.

TL;DR (Quick Answer)

Reproducible RNA-seq pipelines (Nextflow, nf-core/rnaseq) solve execution — they run the same steps the same way every time. What they don't solve is the decisions between the steps: is the QC good enough? Is this a paired design? Few DEGs — switch from ORA to GSEA? Those still need an expert. ARIA puts a Large Language Model in that reasoning seat.

What it is — an open-source framework where an LLM acts as the reasoning engine across 8 Decision Points (QC, DE strategy, design recognition, signature detection, cross-method validation, interpretation, sensitivity analysis, reporting), combining hard rule-based thresholds with LLM contextual reasoning.
Does it actually work? On 4 public datasets across 3 species it detected paired designs (+21–47% DEGs), caught technical covariates like library type (+4–30% DEGs), cross-validated across DESeq2/edgeR/limma-voom (r > 0.99), and recovered known biology (7/7 dexamethasone-responsive genes; 9/11 FMRP targets; 15/16 tissue markers).
How it stays honest — execution is rule-based and deterministic; the LLM reasons and interprets; every LLM prompt and response is logged for reproducibility; cross-method validation is a built-in guardrail; hallucination risk is addressed explicitly.

The point isn't to replace bioinformaticians or the tools they trust — it's to automate the adaptive decision layer that currently sits between fixed pipeline steps.

The real bottleneck in RNA-seq isn't execution

A bulk RNA-seq analysis is a chain: QC → alignment (STAR) → quantification (Salmon) → normalization → differential expression (DESeq2/edgeR/limma) → enrichment (GSEA/ORA) → interpretation. Workflow managers like Nextflow and community pipelines like nf-core/rnaseq make that chain reproducible and scalable.

But they execute a fixed path decided before the data is examined. When a dataset yields unexpectedly few DEGs, a fixed pipeline keeps going with ORA — and can miss pathway-level signal that GSEA would catch. The expertise to adapt — "this looks paired, model the pairing"; "library type is confounded, add it as a covariate"; "this DEG count says switch strategy" — is exactly what stays manual. That's the gap ARIA targets.
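
To make the "this DEG count says switch strategy" idea concrete, here is a minimal sketch of such a rule. It is not ARIA's code: the function name, the dataclass, and above all the cutoff of 100 genes are assumptions made for illustration, since this post does not state the threshold ARIA uses.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the post does not say at which DEG count the
# strategy switches, so 100 is only a placeholder value.
MIN_DEGS_FOR_ORA = 100


@dataclass(frozen=True)
class EnrichmentDecision:
    method: str  # "ORA" or "GSEA"
    reason: str  # short rationale to store alongside the results


def choose_enrichment(n_degs: int,
                      min_degs_for_ora: int = MIN_DEGS_FOR_ORA) -> EnrichmentDecision:
    """Pick an enrichment method from the number of significant genes."""
    if n_degs < 0:
        raise ValueError("n_degs must be non-negative")
    if n_degs >= min_degs_for_ora:
        return EnrichmentDecision(
            "ORA",
            f"{n_degs} DEGs >= {min_degs_for_ora}: the gene list is large "
            "enough for over-representation analysis",
        )
    return EnrichmentDecision(
        "GSEA",
        f"{n_degs} DEGs < {min_degs_for_ora}: rank all genes instead of "
        "thresholding them",
    )


print(choose_enrichment(951).method)  # ORA
print(choose_enrichment(12).method)   # GSEA
```

A fixed pipeline hard-codes one branch of this function. The adaptive layer is what evaluates the condition after the DE results exist and records the reason next to the choice.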

What ARIA does: 8 decision points, rules + reasoning

ARIA wraps the standard toolchain (DESeq2 v1.46, edgeR v4.4, limma-voom, fgsea, WGCNA, STRING) and inserts an LLM reasoning layer at 8 decision points:

QC assessment — read the quality metrics, decide pass/adapt.
DE strategy — pick the modeling approach for the data at hand.
Design recognition — detect paired/blocked designs from the metadata.
Signature detection — spot technical covariates (e.g., library type).
Cross-method validation — run DESeq2/edgeR/limma-voom and compare.
Interpretation — contextualize hits biologically.
Sensitivity analysis — sweep a matrix of padj × log-fold-change cutoffs.
Reporting — assemble the result.
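
Design recognition (decision point 3) can be pictured as a check on the sample sheet. The sketch below is an illustration under assumptions, not ARIA's implementation: the column names "subject" and "condition", the toy donor labels, and the rule "every subject appears exactly once under every condition" are all hypothetical. The returned strings follow the usual R formula convention for a blocked design.

```python
from collections import defaultdict


def detect_paired_design(samples, subject_key="subject", condition_key="condition"):
    """Return True when every subject was measured once under every condition."""
    conditions = {s[condition_key] for s in samples}
    seen = defaultdict(list)
    for s in samples:
        seen[s[subject_key]].append(s[condition_key])
    # Pairing needs at least two conditions and at least two subjects.
    if len(conditions) < 2 or len(seen) < 2:
        return False
    return all(sorted(c) == sorted(conditions) for c in seen.values())


def design_formula(samples, subject_key="subject", condition_key="condition"):
    """Suggest a model formula: block on subject only when the design is paired."""
    if detect_paired_design(samples, subject_key, condition_key):
        return f"~ {subject_key} + {condition_key}"
    return f"~ {condition_key}"


# Toy metadata (made up for this example).
paired = [
    {"sample": "s1", "subject": "donorA", "condition": "untreated"},
    {"sample": "s2", "subject": "donorA", "condition": "treated"},
    {"sample": "s3", "subject": "donorB", "condition": "untreated"},
    {"sample": "s4", "subject": "donorB", "condition": "treated"},
]
unpaired = [
    {"sample": "s1", "subject": "donorA", "condition": "untreated"},
    {"sample": "s2", "subject": "donorB", "condition": "treated"},
    {"sample": "s3", "subject": "donorC", "condition": "untreated"},
    {"sample": "s4", "subject": "donorD", "condition": "treated"},
]

print(design_formula(paired))    # ~ subject + condition
print(design_formula(unpaired))  # ~ condition
```

Real metadata is messier than this (free-text columns, inconsistent labels), which is where contextual reasoning is meant to help. A deterministic check like this one can still serve as the rule-based half of the decision.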

Crucially, execution is rule-based (e.g., the sensitivity step deterministically tests padj cutoffs of 0.01 / 0.05 / 0.1 crossed with log-fold-change cutoffs of 0 / 0.5 / 1 / 2); the LLM's job is to reason about and interpret those outputs, not to invent numbers.
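
The sensitivity sweep is simple enough to sketch in a few lines. The cutoff values below are the ones quoted above; everything else is an assumption for illustration: the function name, the input shape, the use of the absolute log2 fold change, the strict "<" on padj, and the toy genes.

```python
from itertools import product

PADJ_CUTOFFS = (0.01, 0.05, 0.1)     # values quoted in the text
LFC_CUTOFFS = (0.0, 0.5, 1.0, 2.0)   # values quoted in the text


def sensitivity_grid(results, padj_cutoffs=PADJ_CUTOFFS, lfc_cutoffs=LFC_CUTOFFS):
    """Count genes passing each (padj, |log2FC|) cutoff pair.

    results: iterable of (gene, log2fc, padj) tuples; padj may be None
    for genes that were filtered before multiple-testing correction.
    """
    rows = list(results)
    grid = {}
    for padj_cut, lfc_cut in product(padj_cutoffs, lfc_cutoffs):
        grid[(padj_cut, lfc_cut)] = sum(
            1
            for _, lfc, padj in rows
            if padj is not None and padj < padj_cut and abs(lfc) >= lfc_cut
        )
    return grid


# Toy DE table (made-up numbers).
toy_results = [
    ("geneA", 2.5, 0.001),
    ("geneB", -1.2, 0.03),
    ("geneC", 0.4, 0.08),
    ("geneD", 0.1, None),
]

grid = sensitivity_grid(toy_results)
print(len(grid))          # 12 cutoff combinations
print(grid[(0.05, 1.0)])  # 2
```

Because the grid is computed deterministically, the LLM receives a table of 12 counts to interpret and has no opportunity to produce the counts itself.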

Results: it recovered designs, covariates, and known biology

We benchmarked on four public datasets of escalating difficulty:

Dataset	Species	Notable feature	DEGs
SEQC (GSE49712)	Human	reference / tissue markers	10,430
Airway (GSE52778)	Human	paired design	951
Fmr1 KO (GSE180135)	Mouse	FMRP targets	398
Pasilla (GSE18508)	Drosophila	mixed library types	224

Design recognition mattered most. Detecting the paired structure in Airway increased DEG detection by 21–47% — over a thousand genes a naive analysis would miss, from a single adaptive decision.
Covariate detection (library type) added 4–30% DEGs.
Cross-method agreement was near-perfect (r > 0.99 across DESeq2/edgeR/limma-voom) — a guardrail that the reasoning didn't drift from established statistics.
Known biology came back: 7/7 dexamethasone-responsive genes (Himes et al. 2014) in Airway; 9/11 FMRP translational targets in Fmr1 KO; 15/16 tissue markers in SEQC (the one "miss," BCL2, was brain-enriched here despite a cancer-marker label — arguably a correct call).
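
For readers who want to see what a cross-method agreement check can look like, here is a minimal sketch. It assumes that r is a Pearson correlation of log2 fold changes over the genes shared by two methods. The post does not specify which quantity is correlated, so treat that as an assumption, along with the function names and the toy values.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Plain Pearson correlation for two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def cross_method_agreement(lfc_by_method, threshold=0.99):
    """Compare every pair of methods on the genes they share.

    lfc_by_method: {method_name: {gene: log2fc}}
    Returns {(method_a, method_b): (r, passed)}.
    """
    report = {}
    for a, b in combinations(sorted(lfc_by_method), 2):
        shared = sorted(set(lfc_by_method[a]) & set(lfc_by_method[b]))
        r = pearson_r(
            [lfc_by_method[a][g] for g in shared],
            [lfc_by_method[b][g] for g in shared],
        )
        report[(a, b)] = (r, r > threshold)
    return report


# Toy log2 fold changes (made up, not taken from any benchmark).
toy_lfc = {
    "DESeq2": {"g1": 2.0, "g2": -1.0, "g3": 0.5, "g4": -3.0},
    "edgeR": {"g1": 2.1, "g2": -1.05, "g3": 0.45, "g4": -2.9},
    "limma-voom": {"g1": 1.95, "g2": -0.98, "g3": 0.52, "g4": -3.05},
}

report = cross_method_agreement(toy_lfc)
for pair, (r, passed) in report.items():
    print(pair, round(r, 4), "pass" if passed else "FAIL")
```

A failing pair would be a signal to stop and inspect the model specification before any interpretation is written.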

How an LLM pipeline stays trustworthy

The obvious worry with "LLM runs the analysis" is hallucination and irreproducibility. ARIA's design answers both directly:

Numbers come from tools, not the model. Statistics are computed by DESeq2/edgeR/limma; the LLM reasons over the results.
Everything is logged. All LLM prompts and responses are recorded, so a run can be audited and reproduced.
Cross-method validation is a built-in check: when three DE methods agree at r > 0.99, that is evidence the reported statistics are robust to the choice of method and were not distorted by the reasoning layer.
Hallucination risk is treated as a first-class design concern, not an afterthought.
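
As an illustration of the "everything is logged" point, a prompt/response audit trail can be as simple as an append-only JSON Lines file. This is a sketch under assumptions and does not show ARIA's actual log format: the field names, the file name, the model label, and the helper functions are all hypothetical.

```python
import hashlib
import json
import os
import tempfile
import time


def log_llm_call(path, decision_point, prompt, response, model="unspecified-model"):
    """Append one prompt/response pair to a JSON Lines audit log."""
    record = {
        "timestamp": time.time(),
        "decision_point": decision_point,
        "model": model,
        "prompt": prompt,
        "response": response,
        # The hash lets a reviewer confirm the stored prompt was not edited later.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_log(path):
    """Load every logged call, in the order it was written."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "llm_audit.jsonl")
    log_llm_call(
        log_path,
        "design_recognition",
        "Is this design paired?",
        "Yes: each donor appears in both arms.",
    )
    entries = read_log(log_path)

print(entries[0]["decision_point"])  # design_recognition
```

An append-only log keeps the order of decisions intact, so a reviewer can replay what the model was asked at each decision point and what it answered.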

Honest limitations

Preprint, not peer-reviewed — conclusions are provisional.
LLM reasoning adds cost and some run-to-run variability — mitigated by logging and rule-based execution, but it's a real tradeoff vs a fixed pipeline.
Benchmarked on public bulk RNA-seq — single-cell and more exotic designs aren't covered here.
Not a replacement for expert judgment — it automates routine adaptive decisions; novel or ambiguous situations still need a human.
Recovery, not infallibility — e.g., 9/11 FMRP targets, not 11/11; treat outputs as strong drafts to verify.

FAQ

Q: Is this just "ChatGPT does my RNA-seq"?

No. The statistics are computed by standard, validated tools; the LLM only makes and explains the decisions between steps (design, covariates, strategy, interpretation). It's a reasoning layer, not a number generator.

Q: How does it avoid hallucinated results?

Execution is rule-based and deterministic, the LLM reasons over tool outputs (not raw data), every prompt/response is logged for audit, and three DE methods are cross-checked (r > 0.99 here).

Q: What did the adaptive decisions actually buy?

Concretely: recognizing the Airway paired design recovered 21–47% more DEGs — a thousand-plus genes a fixed, design-blind run would have missed.

Q: Can I run it on my own data?

Yes — it's open source (MIT) at github.com/shoo99/ARIA, built on DESeq2/edgeR/limma-voom/fgsea with public benchmark datasets included.

Resources

Preprint (Research Square, v1): https://doi.org/10.21203/rs.3.rs-9500973/v1
Code (MIT): github.com/shoo99/ARIA
Toolchain: STAR, Salmon, DESeq2, edgeR, limma-voom, fgsea; pipeline context: nf-core/rnaseq.
