Transcription Factor Activity Inference and Biomarker Integration: A Practical Workflow

📚 Series Navigation · This is Part 3 (final) of the post-DEG functional analysis series. Part 1 — PPI + Hub Analysis · Part 2 — GO + Pathway Enrichment · Part 3 — TF Activity + Biomarker Integration (current)

A list of differentially expressed genes describes what changed. The previous two parts of this series identified the structurally important Hub proteins and the functionally enriched biological pathways. This third part asks the deeper, often-skipped question: "What regulatory program caused these changes?" Transcription factor (TF) activity inference reveals the upstream drivers of your observed expression patterns — and integrating TF results with PPI Hubs and pathway enrichment is what separates exploratory bioinformatics from validated biomarker discovery.

TL;DR

TF analysis splits into two complementary questions: (1) target gene enrichment (which TF's target list is over-represented in DEGs?) and (2) activity inference (is this TF actually activated, regardless of its own mRNA level?).
Never use TF expression as a proxy for TF activity — many TFs are regulated by phosphorylation, dimerization, or nuclear translocation, not transcription.
DoRothEA confidence A/B/C only — confidence D and E include too many low-quality target predictions.
Motif matching ≠ actual binding — always cross-validate with ATAC-seq or ChIP-seq for the relevant tissue.
Genuine biomarker candidates emerge from intersection: PPI Hub ∩ Pathway leading edge ∩ active TF target.

Why TF Activity Matters

When 500 genes change expression, they didn't change independently. Most coordinated expression changes are driven by a relatively small set of transcription factors that activate or repress hundreds of target genes. Identifying these TF "master regulators" provides:

Mechanistic insight: Why did these genes change? What signal triggered the cascade?
Therapeutic targets: TFs sit upstream of many genes — modulating one TF can affect entire transcriptional programs (compare: the success of estrogen receptor antagonists in breast cancer).
Biomarker candidates: TF activity scores often correlate with disease state more robustly than individual gene expression.

The challenge: TF activity is rarely visible in TF mRNA expression. NF-κB sits inactive in the cytoplasm bound to IκB; activation requires IκB degradation and nuclear translocation, with no mRNA change. STAT proteins require phosphorylation-induced dimerization. FOXO is regulated by AKT-mediated phosphorylation that excludes it from the nucleus. This means a TF can be highly active without its own mRNA changing at all.

The solution: infer TF activity from the expression of its known target genes. If TF X normally activates 100 target genes and 80 of them are upregulated in your samples, TF X is probably active — even if TF X's own mRNA is unchanged.
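The intuition can be sketched in a few lines of base R. All numbers below are simulated, `TF_X` is a hypothetical factor, and the "activity" is just a scaled mean of target z-scores, a deliberately crude stand-in for what viper or decoupleR compute.

```r
# Toy illustration with simulated numbers: infer TF activity from targets,
# not from the TF's own mRNA.
set.seed(1)

# Per-gene z-scores (disease vs control) for 1,000 background genes
background <- rnorm(1000, mean = 0, sd = 1)

# 100 known targets of the hypothetical TF_X: 80 shifted upward, 20 unchanged
targets <- c(rnorm(80, mean = 1.5, sd = 1), rnorm(20, mean = 0, sd = 1))

# TF_X's own mRNA barely moves
tf_own_z <- 0.1

# Crude activity score: mean target z-score relative to background,
# scaled by its standard error if targets behaved like background genes
activity_z <- (mean(targets) - mean(background)) /
  (sd(background) / sqrt(length(targets)))

# Fraction of targets with a positive z-score
frac_up <- mean(targets > 0)

cat(sprintf("TF_X own mRNA z-score:    %.2f\n", tf_own_z))
cat(sprintf("Fraction of targets up:   %.2f\n", frac_up))
cat(sprintf("Target-based activity z:  %.1f\n", activity_z))
```

The TF's own z-score stays near zero while the target-based score is large, which is the situation described above for NF-κB, STAT, and FOXO.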

Two Complementary Analytical Questions

Question 1: TF Target Enrichment

"Among my DEGs, is any TF's target gene set over-represented?"

This is essentially the same as pathway enrichment (Part 2), but using TF-target gene sets instead of pathways. Tools and databases:

ChEA3 — Integrates ChIP-seq, co-expression, and curated TF-target relationships; provides integrative ranking
Enrichr TF libraries — Multiple sets including TRRUST, ENCODE, ChEA
TRRUST v2 — Manually curated from literature, with activation/repression labels
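Under the hood, target enrichment for a single TF usually reduces to a hypergeometric (one-sided Fisher) test. The sketch below uses base R's `phyper`; the counts are hypothetical and only illustrate the arithmetic, not the ranking scheme of any particular tool listed above.

```r
# Over-representation of one TF's target set among DEGs.
# All counts are hypothetical.
N <- 15000   # background: genes expressed and tested
K <- 300     # targets of the TF within the background
n <- 500     # DEGs
k <- 30      # DEGs that are also targets of the TF

# P(X >= k) when drawing n genes from N, of which K are targets
p_value <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# Overlap expected by chance, and fold enrichment over it
expected <- n * K / N
fold     <- k / expected

cat(sprintf("expected overlap: %.0f, observed: %d, fold: %.1f, p = %.2e\n",
            expected, k, fold, p_value))
```

When many TFs are tested this way, the resulting p-values still need multiple testing correction (for example `p.adjust(p, method = "BH")`), and the choice of background `N` matters as much here as it does for pathway enrichment.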

Question 2: Activity Inference

"Was each TF actually active in my samples — based on its targets' expression patterns?"

This produces a quantitative TF activity score per sample, enabling group comparisons (e.g., disease vs control) and clustering analyses. Tools:

DoRothEA + viper — De facto standard for human/mouse TF activity inference
SCENIC — Specialized for single-cell RNA-seq with regulon discovery
decoupleR — Modern R package providing multiple inference methods (viper, ulm, mlm, wmean, gsva) in one interface

TF-Target Database Comparison

Database / Tool	Source	Coverage / Notes
DoRothEA	ChIP-seq + motif + literature + co-expression	~1,500 human TFs across all levels (1,541 in the original publication; far fewer at A-C), confidence levels A-E
TRRUST v2	Manually curated from literature	~800 human TFs with activation/repression labels
ChEA3	ChIP-seq + ARCHS4 co-expression	Integrative ranking, web + API access
RegNetwork	TF-TF, TF-miRNA, miRNA-TF integration	Multi-layer regulatory networks
JASPAR / HOCOMOCO	Position weight matrices (PWMs)	For de novo promoter scanning
ENCODE / Cistrome	Raw ChIP-seq experiments	Tissue/cell-line-specific binding evidence

For most analyses, DoRothEA restricted to confidence levels A, B, and C offers the best balance of coverage and reliability. The lower-confidence levels (D, E) include too many predicted-but-not-validated TF-target relationships.

DoRothEA + viper: The Standard Workflow

The most widely used approach to TF activity inference combines DoRothEA's curated TF-target network with the viper algorithm, which infers protein activity from the expression of each regulator's targets. The code below calls viper through decoupleR's run_viper wrapper.

library(dorothea)
library(decoupleR)
library(dplyr)
library(limma)

# Load DoRothEA, filter to high-confidence regulons
data(dorothea_hs, package = "dorothea")

regulons <- dorothea_hs %>%
  filter(confidence %in% c("A", "B", "C"))

# Your expression matrix: rows = genes (gene symbols, matching the DoRothEA
# target names), columns = samples
# Should be normalized (vst, voom, or rlog)
# Pass all expressed genes, not only the DEGs: viper scores each regulon
# against the full expression profile of every sample

tf_activity <- decoupleR::run_viper(
  mat      = expr_mat_normalized,
  network  = regulons,
  .source  = "tf",
  .target  = "target",
  .mor     = "mor",       # mode of regulation: +1 activator, -1 repressor
  minsize  = 5            # require at least 5 targets
)

# tf_activity is long-format with one row per (TF, sample, score)
# Pivot to wide for easier analysis
tf_wide <- tf_activity %>%
  filter(statistic == "viper") %>%
  select(source, condition, score) %>%
  tidyr::pivot_wider(names_from = condition, values_from = score) %>%
  tibble::column_to_rownames("source") %>%
  as.matrix()

# Keep samples in the same order as the expression matrix so that the
# columns line up with `group` below
tf_wide <- tf_wide[, colnames(expr_mat_normalized), drop = FALSE]

# Identify differentially active TFs between groups
# Placeholder design: 5 controls followed by 5 disease samples;
# replace with your own sample annotation
group <- factor(c(rep("control", 5), rep("disease", 5)))
stopifnot(length(group) == ncol(tf_wide))
design <- model.matrix(~group)
fit <- lmFit(tf_wide, design)
fit <- eBayes(fit)
top_tfs <- topTable(fit, coef = 2, number = Inf, sort.by = "p") %>%
  tibble::rownames_to_column("tf")   # keep TF names as an explicit column

# Filter to most differentially active TFs
significant_tfs <- top_tfs %>%
  filter(adj.P.Val < 0.05, abs(logFC) > 0.5) %>%
  arrange(adj.P.Val)

Interpreting viper NES Scores

viper outputs a Normalized Enrichment Score (NES) per TF per sample:

Positive NES: TF is more active than baseline (target genes enriched in the upregulated tail)
Negative NES: TF is less active or actively repressed
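Magnitude: as a rule of thumb, |NES| > 2 indicates a strong activity difference. NES behaves roughly like a z-score, so this is close to a nominal p < 0.05 and should be read together with the group-level statistics, not as proof of biological relevance on its own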

The mor (mode of regulation) field is critical: viper accounts for whether a TF activates or represses each target. A TF that mainly activates targets will have positive NES when those targets are upregulated; a repressor TF will have positive NES when its targets are downregulated. Always interpret direction carefully.
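A toy calculation makes the sign logic concrete. The two regulons (`ACT`, `REP`) and the fold changes are hypothetical, and the signed weighted mean used here is a simplification chosen for readability, not the viper algorithm itself.

```r
# How mode of regulation (mor) flips the interpretation of target changes.
# Hypothetical log2 fold changes for six target genes
lfc <- c(g1 = 2.0, g2 = 1.5, g3 = 1.0, g4 = -1.8, g5 = -1.2, g6 = -2.0)

# Hypothetical regulons: ACT activates g1-g3, REP represses g4-g6
regulons <- data.frame(
  tf     = c("ACT", "ACT", "ACT", "REP", "REP", "REP"),
  target = c("g1", "g2", "g3", "g4", "g5", "g6"),
  mor    = c(1, 1, 1, -1, -1, -1),
  stringsAsFactors = FALSE
)

# Signed contribution of each target: mor * log2FC
regulons$signed <- regulons$mor * lfc[regulons$target]

# Per-TF score: sum(mor * lfc) / sum(|mor|)
score <- tapply(regulons$signed, regulons$tf, sum) /
  tapply(abs(regulons$mor), regulons$tf, sum)

print(round(score, 2))
```

Both scores come out positive: the activator because its targets went up, the repressor because its targets went down. Ignoring `mor` would have given the repressor a negative score and the opposite conclusion.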

Motif-Based De Novo TF Inference

Beyond known TF-target relationships, you can discover regulatory candidates by scanning DEG promoters for enriched binding motifs. This is especially useful when DoRothEA coverage is limited (non-model organisms, novel cell types).

Tools:

HOMER (findMotifs.pl) — Classic, fast, supports both known and de novo motif discovery
MEME Suite (AME, FIMO, SEA) — Academic standard for rigorous motif analysis
RcisTarget — SCENIC's motif enrichment engine, also usable standalone
PscanChIP, i-cisTarget — Web-based interfaces

The Motif vs Actual Binding Trap

A motif match is a predicted binding site, not a confirmed binding event. Chromatin state determines whether a TF can actually access and bind a motif. Always cross-validate with:

ATAC-seq / DNase-seq — Open chromatin regions where TFs can access
ChIP-seq in the relevant tissue (ENCODE, Cistrome, ChIP-Atlas)
PhastCons / GERP conservation scores — Motifs in conserved regions are more likely to be functional

A practical approach: prioritize motifs that fall within open chromatin (ATAC-seq peaks), in a conserved genomic region, and within ±2kb of a DEG's TSS. This combination dramatically reduces false positives compared to motif scanning alone.
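As a minimal sketch of that filter, assume the motif hits have already been annotated with distance to the nearest DEG TSS, ATAC-seq peak overlap, and a conservation score. The column names, the example rows, and the conservation cutoff of 0.5 are assumptions made for illustration.

```r
# Hypothetical motif hits, annotated upstream (for example by overlapping
# with ATAC-seq peaks and a PhastCons track)
hits <- data.frame(
  motif_id     = c("m1", "m2", "m3", "m4", "m5"),
  dist_to_tss  = c(-350, 5200, 1200, -1900, 80),   # bp, signed
  in_atac_peak = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  phastcons    = c(0.92, 0.88, 0.95, 0.40, 0.75),  # 0-1 conservation score
  stringsAsFactors = FALSE
)

max_dist      <- 2000   # +/- 2 kb around the TSS
min_phastcons <- 0.5    # assumed cutoff; tune per dataset

# A hit is kept only if it passes all three filters
keep <- abs(hits$dist_to_tss) <= max_dist &
  hits$in_atac_peak &
  hits$phastcons >= min_phastcons

prioritized <- hits[keep, ]
print(prioritized$motif_id)
```

Here `m2` fails on distance, `m3` on chromatin accessibility, and `m4` on conservation, so only `m1` and `m5` remain.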

SCENIC: Single-Cell TF Analysis

For single-cell RNA-seq data, SCENIC (Single-Cell rEgulatory Network Inference and Clustering) provides cell-type-specific TF activity inference:

GRNBoost2 / GENIE3 — Co-expression-based regulon construction (TF → candidate targets)
RcisTarget — Filter to direct targets via motif enrichment in target promoters
AUCell — Compute regulon activity score per cell using gene set enrichment
Visualization — Overlay regulon activity on UMAP/tSNE for cell-type-specific patterns
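To give a feel for the scoring step, here is a simplified AUCell-style calculation for a single cell in base R. It is a sketch of the idea (area under the recovery curve of regulon genes within the top-ranked genes), not the AUCell implementation: the function name `auc_score`, the 5% cutoff used as default, and the deterministic tie handling are all assumptions of this example.

```r
# Simplified AUCell-style score for one cell (illustrative only).
# expr:     named numeric vector of expression values for one cell
# gene_set: character vector of regulon genes
# top_frac: fraction of top-ranked genes used for the recovery curve
auc_score <- function(expr, gene_set, top_frac = 0.05) {
  ranked   <- names(sort(expr, decreasing = TRUE))
  max_rank <- max(1, ceiling(top_frac * length(ranked)))
  hits     <- ranked[seq_len(max_rank)] %in% gene_set

  # Recovery curve: regulon genes recovered at each rank cutoff
  recovery <- cumsum(hits)

  # Best case: every regulon gene sits at the very top of the ranking
  n_best <- min(length(gene_set), max_rank)
  best   <- cumsum(c(rep(1, n_best), rep(0, max_rank - n_best)))

  sum(recovery) / sum(best)
}

set.seed(42)
genes   <- paste0("g", 1:200)
regulon <- genes[1:10]

# Two simulated cells with sparse Poisson counts
cell_off <- setNames(rpois(200, lambda = 1), genes)
cell_on  <- setNames(rpois(200, lambda = 1), genes)
cell_on[regulon] <- cell_on[regulon] + 20   # regulon strongly expressed

score_on  <- auc_score(cell_on, regulon)
score_off <- auc_score(cell_off, regulon)
cat(sprintf("regulon on: %.2f, regulon off: %.2f\n", score_on, score_off))
```

Because the score depends only on within-cell ranks, it needs no cross-cell normalization, but with many tied zero counts the ranking near the cutoff becomes arbitrary, which is one reason drop-out matters for this step.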

Critical caveats for scRNA-seq:

Drop-out is severe: in droplet-based data, often 70-90% of the expression values in a cell are zero, which makes co-expression estimates unstable
Imputation tradeoffs: Methods like MAGIC or SAVER reduce drop-out but can introduce smoothing artifacts. It is generally safer to run SCENIC on raw counts and then check whether the findings hold on imputed data.
Sample size matters: SCENIC works best with ≥1,000 cells per cell type
Memory requirements: pySCENIC on 100k cells can require 64 GB+ RAM

Integration: From Multi-Source Evidence to Biomarker Candidates

The real power of multi-pillar analysis emerges from integration. A protein with strong evidence across all three pillars (PPI Hub + Pathway leading edge + active TF target) is far more likely to be a real biological signal than one with evidence from a single source.

Recommended Integration Workflow

RNA-seq quality control + normalization (DESeq2 vst or limma voom)
  ↓
Differential expression analysis → ranked DEG list
  ↓
┌─────────────────┬─────────────────┬─────────────────┐
│ PPI Hub         │ Pathway/GO      │ TF Activity     │
│ Analysis        │ Enrichment      │ Inference       │
│ (Part 1)        │ (Part 2)        │ (Part 3)        │
└─────────────────┴─────────────────┴─────────────────┘
  ↓                ↓                  ↓
Top centrality    Leading edge       Active TF
candidates        gene lists         target lists
  ↓                ↓                  ↓
        Compute pairwise/triple intersection
                       ↓
        Tier 1 biomarker candidates (multi-evidence)
                       ↓
        Cox regression, ROC analysis (TCGA, GEO)
                       ↓
        Independent cohort validation
                       ↓
        Experimental validation (qPCR, IHC, KO/KD)
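The intersection step in the diagram above could be sketched as follows in base R. The gene identifiers are placeholders, not results, and the three vectors stand in for the outputs of Parts 1 to 3.

```r
# Placeholder gene lists from the three pillars
evidence <- list(
  hub     = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  pathway = c("geneA", "geneB", "geneC", "geneD", "geneF"),
  tf      = c("geneA", "geneB", "geneC", "geneF", "geneE")
)

# Triple intersection: candidates supported by all three pillars
tier1 <- Reduce(intersect, evidence)

# Pairwise intersections, named after the pillars they combine
pairs    <- combn(names(evidence), 2, simplify = FALSE)
pairwise <- lapply(pairs, function(p) {
  intersect(evidence[[p[1]]], evidence[[p[2]]])
})
names(pairwise) <- sapply(pairs, paste, collapse = "&")

print(tier1)
print(lengths(pairwise))
```

Reporting the pairwise overlaps alongside the triple intersection shows which pillar is the limiting one, which is useful when the Tier 1 list comes out empty or very short.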

Biomarker Candidate Selection Matrix

Selection Criterion	Tool / Method	Recommended Cutoff
Differential expression	DESeq2 / edgeR / limma	|log2FC| > 1, FDR < 0.05
Network centrality	cytoHubba MCC	Top 10% of network
Functional relevance	GO BP / Reactome enrichment	In disease-relevant terms (FDR < 0.05)
Regulatory evidence	DoRothEA + viper	Direct target of significantly active TF
Prognostic association	Cox regression on TCGA	HR p < 0.05, validated in independent cohort
Clinical applicability	Tissue/blood measurability	Detectable by qPCR, ELISA, or IHC

A protein meeting 5 of 6 criteria is a strong biomarker candidate worth investing experimental resources to validate. Meeting all 6 is exceptional.
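One way to apply the matrix is to turn each criterion into a logical column and count how many each candidate meets. The table below is a hypothetical example: the column names and values are invented, and `hub_rank_frac` is assumed to be the candidate's centrality rank divided by network size (so 0.10 means top 10%).

```r
# Hypothetical candidate table with one column per evidence type
candidates <- data.frame(
  gene             = c("geneA", "geneB", "geneC"),
  log2fc           = c(2.1, -1.4, 0.8),
  fdr              = c(0.001, 0.03, 0.01),
  hub_rank_frac    = c(0.05, 0.08, 0.30),
  in_enriched_term = c(TRUE, TRUE, TRUE),
  active_tf_target = c(TRUE, FALSE, TRUE),
  cox_p            = c(0.01, 0.20, 0.04),
  assay_available  = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# One logical column per criterion, using the cutoffs from the matrix
criteria <- with(candidates, cbind(
  deg        = abs(log2fc) > 1 & fdr < 0.05,
  hub        = hub_rank_frac <= 0.10,
  functional = in_enriched_term,
  regulatory = active_tf_target,
  prognostic = cox_p < 0.05,
  clinical   = assay_available
))

candidates$n_met  <- rowSums(criteria)
candidates$strong <- candidates$n_met >= 5

print(candidates[, c("gene", "n_met", "strong")])
```

Keeping the per-criterion matrix, not just the count, makes it easy to report which evidence is missing for each candidate.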

Validation Strategy

In silico validation is necessary but not sufficient. A robust biomarker validation plan combines:

Computational Validation

Independent cohort replication: Test in TCGA, GEO, ArrayExpress datasets that weren't used for discovery
Survival analysis tools: Kaplan-Meier plotter, GEPIA2, OncoLnc for cancer biomarkers
Meta-analysis: Combine evidence across multiple published cohorts (MetaDE, ComBat for batch correction)
Cell line / functional databases: CCLE, DepMap for mechanistic plausibility
Single-cell validation: Check expression in relevant cell types via Tabula Sapiens, Human Cell Atlas
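For the prognostic check in an independent cohort, a univariable Cox model is the usual starting point. The sketch below uses `coxph` and `Surv` from the survival package on simulated data; the cohort size, effect size, and censoring rate are arbitrary choices for illustration, and a real analysis would add clinical covariates and test the proportional hazards assumption.

```r
library(survival)

# Simulated validation cohort: higher expression -> higher hazard
set.seed(7)
n    <- 200
expr <- rnorm(n)                                   # standardized expression
time_event <- rexp(n, rate = 0.1 * exp(0.7 * expr))
time_cens  <- rexp(n, rate = 0.05)                 # random censoring

cohort <- data.frame(
  time   = pmin(time_event, time_cens),
  status = as.integer(time_event <= time_cens),   # 1 = event, 0 = censored
  expr   = expr
)

# Univariable Cox proportional hazards model
fit <- coxph(Surv(time, status) ~ expr, data = cohort)

hr      <- exp(coef(fit))            # hazard ratio per 1 SD of expression
ci      <- exp(confint(fit))         # 95% confidence interval
p_value <- summary(fit)$coefficients["expr", "Pr(>|z|)"]

cat(sprintf("HR = %.2f (95%% CI %.2f-%.2f), p = %.2g\n",
            hr, ci[1, 1], ci[1, 2], p_value))
```

Report the hazard ratio with its confidence interval, not only the p-value, so that effect sizes can be compared between the discovery and validation cohorts.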

Experimental Validation

qRT-PCR confirmation: First step for top 20 candidates from your bioinformatics pipeline
Western blot or ELISA: Protein-level confirmation
siRNA / shRNA / CRISPR knockdown: Functional impact when target is suppressed
CRISPRa overexpression: Phenotypic effect of activation
Tissue microarrays + IHC: Spatial context in patient samples
Liquid biopsy applications: Plasma ddPCR, NanoString, or proteomic measurement for blood-based biomarkers

Reporting Standards for Reproducibility

Biomarker discovery papers face increasingly strict reporting requirements from journals and reviewers. Adhere to:

REMARK guidelines — REporting recommendations for tumor MARKer prognostic studies
TRIPOD — Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis

Practical checklist for any biomarker manuscript:

All database versions specified (STRING v12, MSigDB v2023.2, DoRothEA version, etc.)
Background gene set definition stated explicitly (Part 2 covered why this matters)
Multiple testing correction method named (BH / Bonferroni)
Both p-value/FDR and effect size cutoffs reported
Leading edge / contributing gene lists in supplementary materials
Pre-registration in OSF or similar (increases reviewer confidence)
Independent cohort details (not the discovery cohort): number of samples, source, demographics
Code and processed data deposited in GitHub + Zenodo for reproducibility

Real Workflow Example: Breast Cancer Biomarker Discovery

To make this concrete, here's how the three-pillar approach plays out in practice:

Discovery cohort: 200 ER+ breast cancer + 50 normal breast tissue (TCGA BRCA subset)
DEG analysis: DESeq2 → 487 genes with FDR < 0.05 and |log2FC| > 1
PPI Hub Analysis (Part 1): STRING + cytoHubba → 18 Tier 1 Hub candidates
Pathway Enrichment (Part 2): clusterProfiler ORA and GSEA on KEGG + Reactome → cell cycle, DNA repair, p53 signaling pathways enriched (FDR < 0.001); leading-edge genes taken from the GSEA results
TF Activity Inference (Part 3): DoRothEA + viper → MYC, E2F1, FOXM1 significantly more active in tumor (NES > 2.5, FDR < 0.001)
Integration: Of the 18 Hub candidates, 14 appear in the cell cycle leading edge, and 8 of those are direct targets of active MYC/E2F1
Tier 1 biomarker candidates: 8 proteins with evidence across all three pillars
Independent validation: 6 of 8 show significant prognostic association (Cox HR p < 0.05) in the independent METABRIC cohort
Experimental validation plan: qPCR confirmation in 50 paired tumor/normal → IHC in 200-sample tissue microarray → siRNA knockdown phenotype in MCF-7 cells

This focused approach replaces the typical "long list of 100+ candidates" with a manageable, well-justified shortlist that's far more likely to yield experimentally and clinically meaningful results.

Connecting Back to the Series

This three-part series outlined a complete post-DEG functional analysis workflow:

Part 1: Identify structurally important proteins via PPI network centrality
Part 2: Identify functionally relevant pathways and biological processes via enrichment
Part 3: Identify causal regulatory programs via TF activity inference, then integrate everything into prioritized biomarker candidates

The unifying principle is convergence: a finding supported by a single line of evidence is interesting; a finding supported by three independent lines of evidence is publishable; one supported by computational and experimental validation is translatable.
