Can an LLM Run an RNA-seq Analysis on Its Own? Building ARIA, a Decision-Aware Transcriptome Framework

A companion to our preprint: reproducible pipelines run RNA-seq steps reliably, but the decisions between steps still need an expert. ARIA puts an LLM in that reasoning seat across 8 decision points — and on 4 public datasets it recovered paired designs, technical covariates, and known biology, with cross-method agreement r > 0.99.

[Figure: An LLM reasoning layer over the RNA-seq workflow]

Preprint (Research Square, v1, not yet peer-reviewed): https://doi.org/10.21203/rs.3.rs-9500973/v1 · Code (open source, MIT): github.com/shoo99/ARIA. This post describes a research framework, not a clinical or production tool.

TL;DR (Quick Answer)

Reproducible RNA-seq pipelines (Nextflow, nf-core/rnaseq) solve execution — they run the same steps the same way every time. What they don't solve is the decisions between the steps: is the QC good enough? Is this a paired design? Few DEGs — switch from ORA to GSEA? Those still need an expert. ARIA puts a Large Language Model in that reasoning seat.

  • What it is — an open-source framework where an LLM acts as the reasoning engine across 8 Decision Points (QC, DE strategy, design recognition, signature detection, cross-method validation, interpretation, sensitivity analysis, reporting), combining hard rule-based thresholds with LLM contextual reasoning.
  • Does it actually work? On 4 public datasets across 3 species it detected paired designs (+21–47% DEGs), caught technical covariates like library type (+4–30% DEGs), cross-validated across DESeq2/edgeR/limma-voom (r > 0.99), and recovered known biology (7/7 dexamethasone-responsive genes; 9/11 FMRP targets; 15/16 tissue markers).
  • How it stays honest — execution is rule-based and deterministic; the LLM reasons and interprets; every LLM prompt and response is logged for reproducibility; cross-method validation is a built-in guardrail; hallucination risk is addressed explicitly.

The point isn't to replace bioinformaticians or the tools they trust — it's to automate the adaptive decision layer that currently sits between fixed pipeline steps.

The real bottleneck in RNA-seq isn't execution

A bulk RNA-seq analysis is a chain: QC → alignment (STAR) → quantification (Salmon) → normalization → differential expression (DESeq2/edgeR/limma) → enrichment (GSEA/ORA) → interpretation. Workflow managers like Nextflow and community pipelines like nf-core/rnaseq make that chain reproducible and scalable.

But they execute a fixed path decided before the data is examined. When a dataset yields unexpectedly few DEGs, a fixed pipeline keeps going with ORA — and can miss pathway-level signal that GSEA would catch. The expertise to adapt — "this looks paired, model the pairing"; "library type is confounded, add it as a covariate"; "this DEG count says switch strategy" — is exactly what stays manual. That's the gap ARIA targets.
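
The "few DEGs, switch strategy" decision above can be sketched as a plain rule. This is only an illustration of the idea, not ARIA's actual logic: the function name `choose_enrichment` and the default threshold of 50 DEGs are assumptions invented for this example.

```python
def choose_enrichment(n_degs: int, min_degs_for_ora: int = 50) -> str:
    """Pick an enrichment strategy from the number of significant DEGs.

    ORA tests a gene list for over-representation, so it needs a list of
    reasonable size. GSEA ranks all genes and does not depend on a
    significance cutoff. The default threshold is a hypothetical value,
    not one taken from ARIA.
    """
    if n_degs < 0:
        raise ValueError("n_degs must be non-negative")
    return "ORA" if n_degs >= min_degs_for_ora else "GSEA"


# Toy calls: a large DEG list keeps ORA, a short one falls back to GSEA.
print(choose_enrichment(951))
print(choose_enrichment(12))
```

A fixed pipeline hard-codes one branch of this function before seeing the data; the adaptive layer evaluates it after the DE step has run.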

What ARIA does: 8 decision points, rules + reasoning

ARIA wraps the standard toolchain (DESeq2 v1.46, edgeR v4.4, limma-voom, fgsea, WGCNA, STRING) and inserts an LLM reasoning layer at 8 decision points:

  1. QC assessment — read the quality metrics, decide pass/adapt.
  2. DE strategy — pick the modeling approach for the data at hand.
  3. Design recognition — detect paired/blocked designs from the metadata.
  4. Signature detection — spot technical covariates (e.g., library type).
  5. Cross-method validation — run DESeq2/edgeR/limma-voom and compare.
  6. Interpretation — contextualize hits biologically.
  7. Sensitivity analysis — sweep a matrix of padj × log-fold-change cutoffs.
  8. Reporting — assemble the result.
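
Design recognition (step 3) can be approximated with a metadata check. The sketch below assumes a sample sheet with `subject` and `condition` columns; those column names, and the functions `looks_paired` and `suggest_design`, are hypothetical and are not ARIA's API. The returned strings follow the design-formula style used by DESeq2.

```python
from collections import defaultdict


def looks_paired(samples, subject_key="subject", condition_key="condition"):
    """Return True if every subject has exactly one sample per condition."""
    conditions = sorted({s[condition_key] for s in samples})
    if len(conditions) < 2:
        return False
    by_subject = defaultdict(list)
    for s in samples:
        by_subject[s[subject_key]].append(s[condition_key])
    # A subject is "complete" when its conditions match the full set once each.
    return all(sorted(seen) == conditions for seen in by_subject.values())


def suggest_design(samples):
    """Suggest a design formula: block on subject when the data look paired."""
    return "~ subject + condition" if looks_paired(samples) else "~ condition"


# Toy metadata (made up for illustration).
paired = [
    {"subject": "donorA", "condition": "control"},
    {"subject": "donorA", "condition": "treated"},
    {"subject": "donorB", "condition": "control"},
    {"subject": "donorB", "condition": "treated"},
]
unpaired = [
    {"subject": "s1", "condition": "control"},
    {"subject": "s2", "condition": "control"},
    {"subject": "s3", "condition": "treated"},
    {"subject": "s4", "condition": "treated"},
]
print(suggest_design(paired))
print(suggest_design(unpaired))
```

A rule like this only covers clean cases. Incomplete pairs or ambiguous column names are where contextual reasoning over the metadata would have to take over.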

Crucially, execution is rule-based (e.g., the sensitivity step deterministically tests padj cutoffs of 0.01 / 0.05 / 0.1 crossed with log-fold-change cutoffs of 0 / 0.5 / 1 / 2); the LLM's job is to reason about and interpret those outputs, not to invent numbers.
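
The sensitivity sweep described above is easy to sketch. The cutoff values come from the text; everything else is an assumption made for illustration: the function name `sensitivity_grid`, the input shape, the toy data, and the choice of `padj < cutoff` with `|log2FC| >= cutoff` as the filter.

```python
from itertools import product

PADJ_CUTOFFS = (0.01, 0.05, 0.1)   # from the text
LFC_CUTOFFS = (0, 0.5, 1, 2)       # from the text


def sensitivity_grid(results):
    """Count DEGs for every padj x |log2FC| cutoff combination.

    `results` is an iterable of (padj, log2_fold_change) pairs, one per gene.
    A padj of None (e.g. a gene filtered out by the DE tool) never counts.
    """
    results = list(results)
    grid = {}
    for padj_cut, lfc_cut in product(PADJ_CUTOFFS, LFC_CUTOFFS):
        grid[(padj_cut, lfc_cut)] = sum(
            1
            for padj, lfc in results
            if padj is not None and padj < padj_cut and abs(lfc) >= lfc_cut
        )
    return grid


# Toy per-gene results (made up for illustration).
toy = [(0.001, 2.5), (0.03, 0.7), (0.08, -1.2), (0.5, 3.0), (None, 0.1)]
grid = sensitivity_grid(toy)
for (padj_cut, lfc_cut), n in sorted(grid.items()):
    print(f"padj < {padj_cut}, |log2FC| >= {lfc_cut}: {n} DEGs")
```

The grid is fully determined by the tool output, so the same inputs always give the same 12 counts. The reasoning layer only has to interpret how the counts change across the grid.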

Results: it recovered designs, covariates, and known biology

We benchmarked on four public datasets of escalating difficulty:

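Dataset             | Species    | Notable feature            | DEGs
--------------------|------------|----------------------------|-------
SEQC (GSE49712)     | Human      | reference / tissue markers | 10,430
Airway (GSE52778)   | Human      | paired design              | 951
Fmr1 KO (GSE180135) | Mouse      | FMRP targets               | 398
Pasilla (GSE18508)  | Drosophila | mixed library types        | 224
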
  • Design recognition mattered most. Detecting the paired structure in Airway increased DEG detection by 21–47% — over a thousand genes a naive analysis would miss, from a single adaptive decision.
  • Covariate detection (library type) added 4–30% DEGs.
  • Cross-method agreement was near-perfect (r > 0.99 across DESeq2/edgeR/limma-voom), a check that the results hold across established statistical methods rather than depending on any one of them.
  • Known biology came back: 7/7 dexamethasone-responsive genes (Himes et al. 2014) in Airway; 9/11 FMRP translational targets in Fmr1 KO; 15/16 tissue markers in SEQC (the one "miss," BCL2, was brain-enriched here despite a cancer-marker label — arguably a correct call).
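
The cross-method check in the list above can be sketched as a pairwise correlation. This post does not say which quantity r is computed over, so the sketch assumes a Pearson correlation of per-gene log2 fold changes on the genes shared by two methods. The function names and the toy values are made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def pairwise_agreement(lfc_by_method):
    """Correlate log2FC between every pair of methods on their shared genes."""
    out = {}
    for a, b in combinations(sorted(lfc_by_method), 2):
        shared = sorted(lfc_by_method[a].keys() & lfc_by_method[b].keys())
        out[(a, b)] = pearson(
            [lfc_by_method[a][g] for g in shared],
            [lfc_by_method[b][g] for g in shared],
        )
    return out


# Toy log2 fold changes (made up, not benchmark values).
toy_lfc = {
    "DESeq2": {"g1": 2.0, "g2": -1.0, "g3": 0.5, "g4": -2.5},
    "edgeR": {"g1": 2.1, "g2": -0.9, "g3": 0.4, "g4": -2.6},
    "limma-voom": {"g1": 1.9, "g2": -1.1, "g3": 0.6, "g4": -2.4},
}
agreement = pairwise_agreement(toy_lfc)
for pair, r in agreement.items():
    print(pair, round(r, 4))
```

A pair falling below a chosen threshold would be the signal to stop and inspect the run before interpreting any hits.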

How an LLM pipeline stays trustworthy

The obvious worry with "LLM runs the analysis" is hallucination and irreproducibility. ARIA's design answers both directly:

  • Numbers come from tools, not the model. Statistics are computed by DESeq2/edgeR/limma; the LLM reasons over the results.
  • Everything is logged. All LLM prompts and responses are recorded, so a run can be audited and reproduced.
  • Cross-method validation is a built-in check: when three DE methods agree at r > 0.99, the result is unlikely to be an artifact of any single method or of the reasoning layer's choice among them.
  • Hallucination risk is treated as a first-class design concern, not an afterthought.
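
The logging point above can be sketched as an append-only JSON Lines audit trail. The post states that prompts and responses are recorded but not how, so everything here is an assumption for illustration: the function `log_llm_call`, the record fields, and the example prompt, response, and model name.

```python
import hashlib
import io
import json
from datetime import datetime, timezone


def log_llm_call(stream, decision_point, prompt, response, model):
    """Append one prompt/response pair to a JSON Lines audit log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "decision_point": decision_point,
        "model": model,
        "prompt": prompt,
        "response": response,
        # The hash lets a later audit confirm the prompt was not altered.
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
    }
    stream.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


# An in-memory buffer stands in for a log file in this example.
log = io.StringIO()
log_llm_call(
    log,
    decision_point="design_recognition",
    prompt="Metadata has 4 donors, each with control and treated samples.",
    response="Paired design; model the donor as a blocking factor.",
    model="example-model",
)
entries = [json.loads(line) for line in log.getvalue().splitlines()]
print(entries[0]["decision_point"], entries[0]["prompt_sha256"][:12])
```

One record per line keeps the log easy to diff between runs, which is useful when the same dataset is analyzed twice and the decisions need to be compared.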

Honest Limitations

  1. Preprint, not peer-reviewed — conclusions are provisional.
  2. LLM reasoning adds cost and some run-to-run variability. Logging and rule-based execution mitigate this, but it is a real tradeoff versus a fixed pipeline.
  3. Benchmarked on public bulk RNA-seq — single-cell and more exotic designs aren't covered here.
  4. Not a replacement for expert judgment — it automates routine adaptive decisions; novel or ambiguous situations still need a human.
  5. Recovery, not infallibility — e.g., 9/11 FMRP targets, not 11/11; treat outputs as strong drafts to verify.

FAQ

Q: Is this just "ChatGPT does my RNA-seq"?

No. The statistics are computed by standard, validated tools; the LLM only makes and explains the decisions between steps (design, covariates, strategy, interpretation). It's a reasoning layer, not a number generator.

Q: How does it avoid hallucinated results?

Execution is rule-based and deterministic, the LLM reasons over tool outputs (not raw data), every prompt/response is logged for audit, and three DE methods are cross-checked (r > 0.99 here).

Q: What did the adaptive decisions actually buy?

Concretely: recognizing the Airway paired design recovered 21–47% more DEGs — a thousand-plus genes a fixed, design-blind run would have missed.

Q: Can I run it on my own data?

Yes — it's open source (MIT) at github.com/shoo99/ARIA, built on DESeq2/edgeR/limma-voom/fgsea with public benchmark datasets included.
