
Transcription Factor Activity Inference and Biomarker Integration: A Practical Workflow

Complete guide to TF activity inference using DoRothEA, viper, SCENIC, and ChEA3. Learn why TF expression alone is misleading, how to integrate PPI Hub + Pathway + TF results into validated biomarker candidates, and the REMARK/TRIPOD reporting standards.


📚 Series Navigation · This is Part 3 (final) of the post-DEG functional analysis series. Part 1 — PPI + Hub Analysis · Part 2 — GO + Pathway Enrichment · Part 3 — TF Activity + Biomarker Integration (current)

A list of differentially expressed genes describes what changed. The previous two parts of this series identified the structurally important Hub proteins and the functionally enriched biological pathways. This third part asks the deeper, often-skipped question: "What regulatory program caused these changes?" Transcription factor (TF) activity inference reveals the upstream drivers of your observed expression patterns — and integrating TF results with PPI Hubs and pathway enrichment is what separates exploratory bioinformatics from validated biomarker discovery.

TL;DR

  • TF analysis splits into two complementary questions: (1) target gene enrichment (which TF's target list is over-represented in DEGs?) and (2) activity inference (is this TF actually activated, regardless of its own mRNA level?).
  • Never use TF expression as a proxy for TF activity — many TFs are regulated by phosphorylation, dimerization, or nuclear translocation, not transcription.
  • Use only DoRothEA confidence levels A, B, and C; levels D and E include too many low-quality target predictions.
  • Motif matching ≠ actual binding — always cross-validate with ATAC-seq or ChIP-seq for the relevant tissue.
  • Genuine biomarker candidates emerge from intersection: PPI Hub ∩ Pathway leading edge ∩ active TF target.

Why TF Activity Matters

When 500 genes change expression, they didn't change independently. Most coordinated expression changes are driven by a relatively small set of transcription factors that activate or repress hundreds of target genes. Identifying these TF "master regulators" provides:

  • Mechanistic insight: Why did these genes change? What signal triggered the cascade?
  • Therapeutic targets: TFs sit upstream of many genes — modulating one TF can affect entire transcriptional programs (compare: the success of estrogen receptor antagonists in breast cancer).
  • Biomarker candidates: TF activity scores often correlate with disease state more robustly than individual gene expression.

The challenge: TF activity is rarely visible in TF mRNA expression. NF-κB sits inactive in the cytoplasm bound to IκB; activation requires IκB degradation and nuclear translocation, with no mRNA change. STAT proteins require phosphorylation-induced dimerization. FOXO is regulated by AKT-mediated phosphorylation that excludes it from the nucleus. This means a TF can be highly active without its own mRNA changing at all.

The solution: infer TF activity from the expression of its known target genes. If TF X normally activates 100 target genes and 80 of them are upregulated in your samples, TF X is probably active — even if TF X's own mRNA is unchanged.
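To make the idea concrete, here is a minimal sketch that scores a hypothetical TF from the z-scores of its targets using a signed mean. This is a simplification for illustration, not the viper algorithm; the gene names, z-scores, and the helper `signed_mean_score` are all invented.

```r
# Toy illustration: score a TF from the expression changes of its targets.
# All values are made up; this is a signed mean, not the viper algorithm.

# Per-gene z-scores (e.g., disease vs control) for one contrast
z <- c(GENE1 = 2.1, GENE2 = 1.8, GENE3 = 1.5, GENE4 = -0.2, GENE5 = 0.1,
       TFX = 0.0)  # the TF's own mRNA is unchanged

# Hypothetical regulon of TFX: +1 = activated target, -1 = repressed target
regulon <- data.frame(
  target = c("GENE1", "GENE2", "GENE3", "GENE4"),
  mor    = c(1, 1, 1, 1)
)

signed_mean_score <- function(z, regulon) {
  # Keep only targets that were actually measured
  reg <- regulon[regulon$target %in% names(z), ]
  # Multiply each target's z-score by its mode of regulation, then average
  mean(z[reg$target] * reg$mor)
}

score <- signed_mean_score(z, regulon)
print(score)        # activity estimate derived from the targets
print(z[["TFX"]])   # the TF's own mRNA shows no change
```

Three of the four targets are clearly up, so the target-based score is positive even though the TF's own z-score is zero.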

Two Complementary Analytical Questions

Question 1: TF Target Enrichment

"Among my DEGs, is any TF's target gene set over-represented?"

This is essentially the same as pathway enrichment (Part 2), but using TF-target gene sets instead of pathways. Tools and databases:

  • ChEA3 — Integrates ChIP-seq, co-expression, and curated TF-target relationships; provides integrative ranking
  • Enrichr TF libraries — Multiple sets including TRRUST, ENCODE, ChEA
  • TRRUST v2 — Manually curated from literature, with activation/repression labels
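Under the hood, this kind of enrichment usually reduces to a 2x2 contingency test. The sketch below runs a one-sided Fisher's exact test for a single TF in base R; the gene universe, DEG list, target set, and the helper `tf_enrichment` are simulated placeholders, not output from ChEA3 or Enrichr.

```r
# Over-representation of one TF's targets among DEGs (one-sided Fisher test).
# Universe, DEG list, and target set are simulated placeholders.
set.seed(1)
universe <- paste0("G", 1:2000)                      # background: all tested genes
degs     <- sample(universe, 200)                    # hypothetical DEG list
targets  <- c(sample(degs, 30),                      # 30 targets that are DEGs
              sample(setdiff(universe, degs), 70))   # 70 targets that are not

tf_enrichment <- function(degs, targets, universe) {
  # Restrict both lists to the background so counts are comparable
  in_deg    <- universe %in% intersect(degs, universe)
  in_target <- universe %in% intersect(targets, universe)
  # 2x2 table: rows = target yes/no, columns = DEG yes/no
  tab <- table(factor(in_target, levels = c(TRUE, FALSE)),
               factor(in_deg,    levels = c(TRUE, FALSE)))
  ft <- fisher.test(tab, alternative = "greater")
  list(overlap    = sum(in_deg & in_target),
       odds_ratio = unname(ft$estimate),
       p_value    = ft$p.value)
}

res <- tf_enrichment(degs, targets, universe)
print(res)
```

When testing many TFs, collect the p-values and correct them together, for example with `p.adjust(p, method = "BH")`.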

Question 2: Activity Inference

"Was each TF actually active in my samples — based on its targets' expression patterns?"

This produces a quantitative TF activity score per sample, enabling group comparisons (e.g., disease vs control) and clustering analyses. Tools:

  • DoRothEA + viper — De facto standard for human/mouse TF activity inference
  • SCENIC — Specialized for single-cell RNA-seq with regulon discovery
  • decoupleR — Modern R package providing multiple inference methods (viper, ulm, mlm, wmean, gsva) in one interface

TF-Target Database Comparison

| Database / Tool | Source | Coverage / Notes |
|---|---|---|
| DoRothEA | ChIP-seq + motif + literature + co-expression | ~470 human TFs, confidence levels A-E |
| TRRUST v2 | Manually curated from literature | ~800 human TFs with activation/repression labels |
| ChEA3 | ChIP-seq + ARCHS4 co-expression | Integrative ranking, web + API access |
| RegNetwork | TF-TF, TF-miRNA, miRNA-TF integration | Multi-layer regulatory networks |
| JASPAR / HOCOMOCO | Position weight matrices (PWMs) | For de novo promoter scanning |
| ENCODE / Cistrome | Raw ChIP-seq experiments | Tissue/cell-line-specific binding evidence |

For most analyses, DoRothEA restricted to confidence levels A, B, and C offers the best balance of coverage and reliability. The lower-confidence levels (D, E) include too many predicted-but-not-validated TF-target relationships.

DoRothEA + viper: The Standard Workflow

The most widely used TF activity inference combines DoRothEA's curated TF-target network with the viper algorithm, which infers protein activity from gene expression patterns.

library(dorothea)
library(decoupleR)
library(dplyr)

# Load DoRothEA, filter to high-confidence regulons
data(dorothea_hs, package = "dorothea")

regulons <- dorothea_hs %>%
  filter(confidence %in% c("A", "B", "C"))

# Your expression matrix: rows = genes, columns = samples
# Values should be normalized (vst, voom, or rlog)
# Row names must be gene symbols so they match the DoRothEA targets
# Keep all expressed genes (not only the DEGs) and all samples

tf_activity <- decoupleR::run_viper(
  mat      = expr_mat_normalized,
  network  = regulons,
  .source  = "tf",
  .target  = "target",
  .mor     = "mor",       # mode of regulation: +1 activator, -1 repressor
  minsize  = 5            # require at least 5 targets
)

# tf_activity is long-format: one row per (statistic, TF, sample) with its score
# Pivot to wide (TFs x samples) for easier analysis
tf_wide <- tf_activity %>%
  filter(statistic == "viper") %>%
  select(source, condition, score) %>%
  tidyr::pivot_wider(names_from = condition, values_from = score) %>%
  tibble::column_to_rownames("source") %>%
  as.matrix()

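# Identify differentially active TFs between groups
# The group labels must follow the column order of tf_wide;
# this vector assumes 5 control samples followed by 5 disease samples
group <- factor(c(rep("control", 5), rep("disease", 5)),
                levels = c("control", "disease"))
stopifnot(length(group) == ncol(tf_wide))
library(limma)
design <- model.matrix(~group)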
fit <- lmFit(tf_wide, design)
fit <- eBayes(fit)
top_tfs <- topTable(fit, coef = 2, number = Inf, sort.by = "p")

# Filter to most differentially active TFs
significant_tfs <- top_tfs %>%
  filter(adj.P.Val < 0.05, abs(logFC) > 0.5) %>%
  arrange(adj.P.Val)

Interpreting viper NES Scores

viper outputs a Normalized Enrichment Score (NES) per TF per sample:

  • Positive NES: TF is more active than baseline (target genes enriched in the upregulated tail)
  • Negative NES: TF is less active or actively repressed
  • Magnitude: |NES| > 2 typically indicates strong, biologically meaningful activity differences

The mor (mode of regulation) field is critical: viper accounts for whether a TF activates or represses each target. A TF that mainly activates targets will have positive NES when those targets are upregulated; a repressor TF will have positive NES when its targets are downregulated. Always interpret direction carefully.
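A quick toy check of the sign logic may help. The snippet below uses a signed mean as a stand-in for viper's enrichment statistic (an assumption made for brevity) and invented target z-scores; the helper `activity` is hypothetical.

```r
# Toy check of how mode of regulation (mor) flips the sign of an activity score.
# Values are invented; the signed mean stands in for viper's enrichment statistic.
target_z <- c(T1 = -2.0, T2 = -1.5, T3 = -1.0)   # all targets go down

activity <- function(z, mor) mean(z * mor)

as_activator <- activity(target_z, mor = c(1, 1, 1))      # TF activates its targets
as_repressor <- activity(target_z, mor = c(-1, -1, -1))   # TF represses its targets

print(c(activator = as_activator, repressor = as_repressor))
```

The same downregulated targets yield a negative score if the TF is an activator and a positive score if it is a repressor, which is why a mislabeled `mor` column silently inverts the conclusion.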

Motif-Based De Novo TF Inference

Beyond known TF-target relationships, you can discover regulatory candidates by scanning DEG promoters for enriched binding motifs. This is especially useful when DoRothEA coverage is limited (non-model organisms, novel cell types).

Tools:

  • HOMER (findMotifs.pl) — Classic, fast, supports both known and de novo motif discovery
  • MEME Suite (AME, FIMO, SEA) — Academic standard for rigorous motif analysis
  • RcisTarget — SCENIC's motif enrichment engine, also usable standalone
  • PscanChIP, i-cisTarget — Web-based interfaces

The Motif vs Actual Binding Trap

A motif match is a predicted binding site, not a confirmed binding event. Chromatin state determines whether a TF can actually access and bind a motif. Always cross-validate with:

  • ATAC-seq / DNase-seq — Open chromatin regions where TFs can access
  • ChIP-seq in the relevant tissue (ENCODE, Cistrome, ChIP-Atlas)
  • PhastCons / GERP conservation scores — Motifs in conserved regions are more likely to be functional

A practical approach: prioritize motifs that fall within open chromatin (ATAC-seq peaks), in a conserved genomic region, and within ±2kb of a DEG's TSS. This combination dramatically reduces false positives compared to motif scanning alone.
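One way to express that filter, assuming motif hits have already been annotated with TSS distance, ATAC-seq overlap, and a conservation score, is sketched below. The table, column names, the helper `prioritize_motifs`, and the conservation cutoff of 0.5 are illustrative assumptions, not values from any tool.

```r
# Prioritize motif hits using open chromatin, conservation, and TSS distance.
# The table, column names, and thresholds are hypothetical placeholders.
motif_hits <- data.frame(
  motif        = c("TF_A", "TF_B", "TF_C", "TF_D", "TF_A"),
  gene         = c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"),
  dist_to_tss  = c(-350, 1200, 4800, -150, 900),     # bp relative to the TSS
  in_atac_peak = c(TRUE, TRUE, TRUE, FALSE, TRUE),   # overlaps an ATAC-seq peak
  phastcons    = c(0.92, 0.35, 0.88, 0.95, 0.81)     # mean conservation score
)

prioritize_motifs <- function(hits, max_dist = 2000, min_cons = 0.5) {
  keep <- abs(hits$dist_to_tss) <= max_dist &   # within +/- 2 kb of the TSS
    hits$in_atac_peak &                         # in open chromatin
    hits$phastcons >= min_cons                  # in a conserved region
  hits[keep, , drop = FALSE]
}

prioritized <- prioritize_motifs(motif_hits)
print(prioritized)
```

Only hits that pass all three filters survive; in this toy table that leaves the rows for GENE_A and GENE_E.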

SCENIC: Single-Cell TF Analysis

For single-cell RNA-seq data, SCENIC (Single-Cell rEgulatory Network Inference and Clustering) provides cell-type-specific TF activity inference:

  1. GRNBoost2 / GENIE3 — Co-expression-based regulon construction (TF → candidate targets)
  2. RcisTarget — Filter to direct targets via motif enrichment in target promoters
  3. AUCell — Compute regulon activity score per cell using gene set enrichment
  4. Visualization — Overlay regulon activity on UMAP/tSNE for cell-type-specific patterns
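To give a feel for step 3, the sketch below computes a simplified AUCell-style score in base R: rank the genes within one cell, then measure how much of the regulon is recovered near the top of that ranking. It is an approximation written for illustration, not the AUCell package; the function `regulon_auc`, the 5% rank cutoff default, and the toy cells are assumptions.

```r
# Simplified AUCell-style score: how strongly a regulon's genes are
# concentrated among the top-ranked genes of a single cell.
# Illustrative approximation only, not the AUCell package itself.
regulon_auc <- function(expr_cell, regulon_genes, top_frac = 0.05) {
  present <- intersect(regulon_genes, names(expr_cell))
  if (length(present) == 0) return(NA_real_)
  # Rank genes within the cell, 1 = highest expression (ties broken by order)
  ranks <- rank(-expr_cell, ties.method = "first")
  max_rank <- ceiling(top_frac * length(expr_cell))
  hit_ranks <- ranks[present]
  hit_ranks <- hit_ranks[hit_ranks <= max_rank]
  # Area under the recovery curve up to max_rank
  area <- sum(max_rank - hit_ranks + 1)
  # Largest possible area: all regulon genes sit at the very top
  n_best <- min(length(present), max_rank)
  max_area <- sum(max_rank - seq_len(n_best) + 1)
  area / max_area
}

# Toy data: 100 genes, two cells, a hypothetical 3-gene regulon
set.seed(7)
genes <- paste0("G", 1:100)
regulon_genes <- c("G1", "G2", "G3")
cell_on <- setNames(runif(100, 0, 1), genes)
cell_on[regulon_genes] <- c(10, 9, 8)    # regulon genes are the top 3
cell_off <- setNames(runif(100, 1, 2), genes)
cell_off[regulon_genes] <- 0             # regulon genes are the bottom 3

auc_on  <- regulon_auc(cell_on, regulon_genes)
auc_off <- regulon_auc(cell_off, regulon_genes)
print(c(on = auc_on, off = auc_off))
```

Because the score depends only on within-cell ranks, it is insensitive to per-cell scaling, which is one reason rank-based scoring suits sparse single-cell data.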

Critical caveats for scRNA-seq:

  • Drop-out is severe — 70-90% of gene expression values are zero per cell, making co-expression unstable
  • Imputation tradeoffs — Methods like MAGIC or SAVER reduce drop-out but can introduce smoothing artifacts. Generally safer to run SCENIC on raw counts and validate findings on imputed data.
  • Sample size matters — SCENIC works best with ≥1,000 cells per cell type
  • Memory requirements — pySCENIC on 100k cells can require 64GB+ RAM

Integration: From Multi-Source Evidence to Biomarker Candidates

The real power of multi-pillar analysis emerges from integration. A protein with strong evidence across all three pillars (PPI Hub + Pathway leading edge + active TF target) is far more likely to be a real biological signal than one with evidence from a single source.

RNA-seq quality control + normalization (DESeq2 vst or limma voom)
  ↓
Differential expression analysis → ranked DEG list
  ↓
┌─────────────────┬─────────────────┬─────────────────┐
│ PPI Hub         │ Pathway/GO      │ TF Activity     │
│ Analysis        │ Enrichment      │ Inference       │
│ (Part 1)        │ (Part 2)        │ (Part 3)        │
└─────────────────┴─────────────────┴─────────────────┘
  ↓                ↓                  ↓
Top centrality    Leading edge       Active TF
candidates        gene lists         target lists
  ↓                ↓                  ↓
        Compute pairwise/triple intersection
                       ↓
        Tier 1 biomarker candidates (multi-evidence)
                       ↓
        Cox regression, ROC analysis (TCGA, GEO)
                       ↓
        Independent cohort validation
                       ↓
        Experimental validation (qPCR, IHC, KO/KD)
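The intersection step can be sketched in a few lines of base R. The three gene lists below are placeholders standing in for the outputs of Parts 1 to 3, and the tier names follow the diagram; treating two-pillar genes as a second tier is an assumption added here.

```r
# Intersect the three evidence pillars and tier genes by how many support them.
# Gene lists are placeholders for the real outputs of Parts 1-3.
evidence <- list(
  hub     = c("G1", "G2", "G3", "G4", "G5"),   # top-centrality PPI hubs
  pathway = c("G2", "G3", "G4", "G6", "G7"),   # pathway leading-edge genes
  tf      = c("G3", "G4", "G5", "G7", "G8")    # targets of active TFs
)

all_genes <- sort(unique(unlist(evidence)))
# Logical membership matrix: rows = genes, columns = pillars
membership <- sapply(evidence, function(x) all_genes %in% x)
rownames(membership) <- all_genes

n_pillars <- rowSums(membership)
tier1 <- names(n_pillars)[n_pillars == 3]   # supported by all three pillars
tier2 <- names(n_pillars)[n_pillars == 2]   # supported by exactly two

print(membership)
print(list(tier1 = tier1, tier2 = tier2))
```

Keeping the full membership matrix, rather than only the final intersection, makes it easy to report which pillar each candidate is missing.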

Biomarker Candidate Selection Matrix

| Selection Criterion | Tool / Method | Recommended Cutoff |
|---|---|---|
| Differential expression | DESeq2 / edgeR / limma | \|log2FC\| > 1, FDR < 0.05 |
| Network centrality | cytoHubba MCC | Top 10% of network |
| Functional relevance | GO BP / Reactome enrichment | In disease-relevant terms (FDR < 0.05) |
| Regulatory evidence | DoRothEA + viper | Direct target of significantly active TF |
| Prognostic association | Cox regression on TCGA | HR p < 0.05, validated in independent cohort |
| Clinical applicability | Tissue/blood measurability | Detectable by qPCR, ELISA, or IHC |

A protein meeting 5 of 6 criteria is a strong biomarker candidate worth investing experimental resources to validate. Meeting all 6 is exceptional.
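A simple way to apply the matrix is to record each criterion as a TRUE/FALSE column and count how many each candidate meets. The candidates and their flags below are invented, and the "lower priority" label for fewer than five criteria is an assumption added for completeness.

```r
# Count how many of the six selection criteria each candidate meets.
# Candidates and their TRUE/FALSE flags are invented for illustration.
candidates <- data.frame(
  gene       = c("CAND_A", "CAND_B", "CAND_C"),
  deg        = c(TRUE, TRUE, TRUE),      # differential expression
  hub        = c(TRUE, TRUE, FALSE),     # network centrality
  pathway    = c(TRUE, TRUE, TRUE),      # functional relevance
  tf_target  = c(TRUE, FALSE, FALSE),    # regulatory evidence
  prognostic = c(TRUE, TRUE, FALSE),     # prognostic association
  measurable = c(TRUE, TRUE, TRUE)       # clinical applicability
)

criteria <- setdiff(names(candidates), "gene")
candidates$n_met <- rowSums(candidates[, criteria])

# 6 criteria = exceptional, 5 = strong; anything lower is deprioritized
candidates$tier <- cut(candidates$n_met,
                       breaks = c(-Inf, 4, 5, 6),
                       labels = c("lower priority", "strong", "exceptional"))

print(candidates[order(-candidates$n_met), c("gene", "n_met", "tier")])
```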

Validation Strategy

In silico validation is necessary but not sufficient. A robust biomarker validation plan combines:

Computational Validation

  • Independent cohort replication: Test in TCGA, GEO, ArrayExpress datasets that weren't used for discovery
  • Survival analysis tools: Kaplan-Meier plotter, GEPIA2, OncoLnc for cancer biomarkers
  • Meta-analysis: Combine evidence across multiple published cohorts (MetaDE, ComBat for batch correction)
  • Cell line / functional databases: CCLE, DepMap for mechanistic plausibility
  • Single-cell validation: Check expression in relevant cell types via Tabula Sapiens, Human Cell Atlas
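For the replication step, a per-gene Cox model is the usual starting point. The sketch below fits one with the `survival` package on simulated data, so the effect size and censoring rates are arbitrary choices for illustration; in a real analysis the expression values and follow-up data would come from the independent cohort.

```r
# Cox regression of survival on one candidate's expression (simulated data).
# The simulated effect size and censoring are arbitrary, for illustration only.
library(survival)

set.seed(42)
n <- 200
expr <- rnorm(n)                                        # standardized expression
event_time  <- rexp(n, rate = 0.1 * exp(0.7 * expr))    # higher expression, higher hazard
censor_time <- rexp(n, rate = 0.05)                     # independent censoring
obs_time <- pmin(event_time, censor_time)
status   <- as.integer(event_time <= censor_time)       # 1 = event observed

fit <- coxph(Surv(obs_time, status) ~ expr)

hazard_ratio <- unname(exp(coef(fit)))
p_value <- summary(fit)$coefficients["expr", "Pr(>|z|)"]
ci <- exp(confint(fit))                                 # 95% CI on the HR scale

print(c(HR = hazard_ratio, p = p_value))
print(ci)
```

Reporting the hazard ratio with its confidence interval, not just the p-value, matches the effect-size expectation in the reporting checklist later in this post.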

Experimental Validation

  • qRT-PCR confirmation: First step for top 20 candidates from your bioinformatics pipeline
  • Western blot or ELISA: Protein-level confirmation
  • siRNA / shRNA / CRISPR knockdown: Functional impact when target is suppressed
  • CRISPRa overexpression: Phenotypic effect of activation
  • Tissue microarrays + IHC: Spatial context in patient samples
  • Liquid biopsy applications: Plasma ddPCR, NanoString, or proteomic measurement for blood-based biomarkers

Reporting Standards for Reproducibility

Biomarker discovery papers face increasingly strict reporting requirements from journals and reviewers. Adhere to:

  • REMARK guidelines — REporting recommendations for tumor MARKer prognostic studies
  • TRIPOD — Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis

Practical checklist for any biomarker manuscript:

  • All database versions specified (STRING v12, MSigDB v2023.2, DoRothEA version, etc.)
  • Background gene set definition stated explicitly (Part 2 covered why this matters)
  • Multiple testing correction method named (BH / Bonferroni)
  • Both p-value/FDR and effect size cutoffs reported
  • Leading edge / contributing gene lists in supplementary materials
  • Pre-registration in OSF or similar (increases reviewer confidence)
  • Independent cohort details (not the discovery cohort): number of samples, source, demographics
  • Code and processed data deposited in GitHub + Zenodo for reproducibility
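The version-tracking items are easy to automate. As a minimal sketch, the snippet below records package versions and the full session information to a file; the package vector and output path are placeholders to replace with the packages and location your pipeline actually uses.

```r
# Record package versions and session details for the methods section.
# `pkgs` and the output path are placeholders; list your pipeline's packages.
pkgs <- c("stats", "utils")

versions <- vapply(pkgs,
                   function(p) as.character(packageVersion(p)),
                   character(1))
version_table <- data.frame(package = pkgs, version = versions,
                            row.names = NULL)
print(version_table)

# Save the full session information next to the analysis outputs
out_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), out_file)
```

Database releases (STRING, MSigDB, DoRothEA) are not captured by `sessionInfo()` and still need to be noted by hand.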

Real Workflow Example: Breast Cancer Biomarker Discovery

To make this concrete, here's how the three-pillar approach plays out in practice:

  1. Discovery cohort: 200 ER+ breast cancer + 50 normal breast tissue (TCGA BRCA subset)
  2. DEG analysis: DESeq2 → 487 genes with FDR < 0.05 and |log2FC| > 1
  3. PPI Hub Analysis (Part 1): STRING + cytoHubba → 18 Tier 1 Hub candidates
  4. Pathway Enrichment (Part 2): clusterProfiler GSEA on KEGG + Reactome → cell cycle, DNA repair, p53 signaling pathways enriched (FDR < 0.001)
  5. TF Activity Inference (Part 3): DoRothEA + viper → MYC, E2F1, FOXM1 significantly more active in tumor (NES > 2.5, FDR < 0.001)
  6. Integration: Cross-reference: Of 18 Hub candidates, 14 appear in cell cycle leading edge, and 8 of those are direct targets of active MYC/E2F1
  7. Tier 1 biomarker candidates: 8 proteins with evidence across all three pillars
  8. Independent cohort validation: 6 of 8 show significant prognostic association (Cox HR p < 0.05) in the independent METABRIC cohort
  9. Experimental validation plan: qPCR confirmation in 50 paired tumor/normal → IHC in 200-sample tissue microarray → siRNA knockdown phenotype in MCF-7 cells

This focused approach replaces the typical "long list of 100+ candidates" with a manageable, well-justified shortlist that's far more likely to yield experimentally and clinically meaningful results.

Connecting Back to the Series

This three-part series outlined a complete post-DEG functional analysis workflow:

  • Part 1: Identify structurally important proteins via PPI network centrality
  • Part 2: Identify functionally relevant pathways and biological processes via enrichment
  • Part 3: Identify causal regulatory programs via TF activity inference, then integrate everything into prioritized biomarker candidates

The unifying principle is convergence: a finding supported by a single line of evidence is interesting; a finding supported by three independent lines of evidence is publishable; one supported by computational and experimental validation is translatable.


Further Reading

  • Garcia-Alonso, L. et al. (2019). Benchmark and integration of resources for the estimation of human transcription factor activities. Genome Research, 29(8). (DoRothEA validation paper)
  • Aibar, S. et al. (2017). SCENIC: Single-cell regulatory network inference and clustering. Nature Methods, 14(11).
  • Keenan, A. B. et al. (2019). ChEA3: Transcription factor enrichment analysis by orthogonal omics integration. Nucleic Acids Research, 47(W1).
  • Han, H. et al. (2018). TRRUST v2: An expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Research, 46(D1).
  • Altman, D. G. et al. (2012). Reporting recommendations for tumor marker prognostic studies (REMARK). PLoS Medicine, 9(5).
  • Collins, G. S. et al. (2015). Transparent reporting of a multivariable prediction model (TRIPOD). Annals of Internal Medicine, 162(1).
