PPI Network Construction and Hub Protein Analysis: A Practical Guide for Researchers

📚 Series Navigation · This is Part 1 of a 3-part deep dive into post-DEG functional analysis. Part 1 — PPI + Hub Analysis (current) · Part 2 — GO + Pathway Enrichment (coming) · Part 3 — TF Activity + Biomarker Integration (coming)

After running differential expression analysis (DESeq2, edgeR, or limma), you face a familiar question: "Out of these 500 differentially expressed genes, which ones actually matter?" Protein-protein interaction (PPI) network analysis is one of the most direct and widely used first steps toward an answer. This guide walks through each step of constructing reliable PPI networks and identifying biologically meaningful Hub proteins, including the tool versions, statistical caveats, and practical pitfalls that bioinformatics tutorials rarely mention.

TL;DR

STRING is the standard PPI database, but you must filter to confidence ≥ 700 and select only experiments + databases evidence channels for reliable analysis.
Identify Hub proteins through the intersection of multiple centrality metrics (Degree, Betweenness, MCC), never a single one.
Use cytoHubba in Cytoscape for ranking — but always cross-check the top candidates against literature to detect study bias.
Beware: PPI databases reflect "possible interactions," not "current interactions" — tissue-specific filtering with HPA or GTEx is critical.

Why PPI Network Analysis Matters

A single gene rarely causes a phenotype on its own. Cellular function emerges from networks of interacting proteins forming complexes, signaling cascades, and metabolic pathways. Published estimates of the human interactome range from roughly 130,000 to 650,000 binary interactions among approximately 20,000 proteins, depending on the estimation method.

When you have a list of differentially expressed genes (DEGs) from RNA-seq or proteomics, three biological insights become accessible only through network analysis:

Hub identification: Which proteins interact with many others and likely orchestrate the response?
Functional modules: Which subsets of DEGs work together as complexes or pathways?
Drug targets: Which proteins, if perturbed, would have the broadest cascading effect?

Early analyses by Barabási, Albert, and colleagues reported that biological networks are approximately scale-free: most proteins interact with a few partners, but a small set of "Hub" proteins interact with many. These Hubs are statistically enriched for essential genes and disease-associated genes, making them prime targets for both basic research and therapeutic development.

PPI Database Comparison

Choosing the right PPI database is the single most consequential decision in your network analysis. Here's how the major sources compare:

Database	Evidence Type	Size	Best Use Case
STRING	Physical + functional + text-mining	~59M proteins, ~12.5k organisms (v12.0)	Initial exploration, broad coverage
BioGRID	Experimentally validated, strict curation	2M+ interactions, multi-organism	High-quality network analysis
IntAct (EBI)	Manual literature curation	1.5M interactions	Conservative, regulatory-grade analysis
HuRI	Yeast two-hybrid based human interactome	53k interactions	Experimentally validated subset
HPRD	Human-only, manual curation	Legacy DB, ~40k interactions	Historical comparison
MINT	Molecular interactions, manual	Older DB	Legacy reference

Critical STRING Caveats

STRING is by far the most popular PPI database, but a substantial fraction of its edges come from text-mining, which detects co-mentions in the literature without verifying a physical interaction. A high combined score can therefore rest largely on co-mention rather than on experimental evidence. Best practices:

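Set the confidence score to ≥ 700 (high) or ≥ 900 (highest) to filter unreliable edges. STRING's download files and the STRINGdb R package use this 0-1000 scale; the web interface shows the same cutoffs as 0.700 and 0.900.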
In the STRING settings, uncheck text-mining and select only experiments and databases channels.
When reporting results in publications, always state the score threshold and channels used.
For non-model organisms with sparse PPI data, consider ≥ 400 (medium) but caveat the analysis accordingly.
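
The channel filter above can also be applied outside the web interface. The following is a minimal sketch in base R, assuming you have a table with the per-channel columns found in STRING's "protein.links.detailed" download (experimental, database, textmining, combined_score). The protein IDs and scores are invented, filter_by_channels() is a hypothetical helper, and the rule it applies (keep an edge if either curated channel alone reaches the cutoff) is a simplification, not STRING's own score recombination.

```r
# Toy rows mimicking the per-channel columns of STRING's detailed links file.
# Scores are integers on the 0-1000 scale; IDs and values are invented.
links <- data.frame(
  protein1       = c("A", "A", "B", "C"),
  protein2       = c("B", "C", "C", "D"),
  experimental   = c(850L, 0L, 300L, 0L),
  database       = c(0L, 900L, 500L, 0L),
  textmining     = c(200L, 950L, 100L, 980L),
  combined_score = c(870L, 995L, 640L, 980L)
)

# Simplified rule (an assumption, not STRING's recombination formula):
# keep an edge if at least one of the chosen channels reaches the cutoff.
filter_by_channels <- function(links, cutoff = 700,
                               channels = c("experimental", "database")) {
  best <- do.call(pmax, unname(as.list(links[channels])))
  links[best >= cutoff, , drop = FALSE]
}

kept <- filter_by_channels(links)
print(kept[, c("protein1", "protein2", "experimental", "database")])
```

In this toy table the C-D edge has a combined score of 980 but is supported only by text-mining, so it is dropped even though it would pass a combined-score cutoff of 700.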

Database Selection by Organism

For human studies, combining STRING with BioGRID gives broader coverage than either source alone. For mouse, STRING is often sufficient. For non-model organisms (e.g., plants, microbes), STRING is often the only practical option, but with smaller coverage and higher uncertainty per interaction.

Tool Stack and Workflow

The most common analytical workflow combines:

STRINGdb (R) or STRING REST API — Pull raw network data
igraph (R) / NetworkX (Python) — Compute centrality metrics programmatically
Cytoscape — Interactive visualization, plugin-based analysis (cytoHubba, MCODE, ClueGO)
Gephi — Optional, for very large networks (>5,000 nodes) where Cytoscape becomes slow

A typical analysis pipeline:

DEGs (filtered, mapped to gene symbols)
    ↓
Map to STRING IDs via the STRINGdb map() method (string_db$map())
    ↓
Retrieve subnetwork at confidence ≥ 700
    ↓
Compute centrality metrics in igraph (Degree, Betweenness, PageRank); compute MCC in cytoHubba
    ↓
Identify intersection of top-ranked candidates across metrics
    ↓
Visualize in Cytoscape with continuous color/size mapping
    ↓
Cross-validate against literature (PubMed, GeneCards) for study bias

Centrality Metrics: Choosing the Right Hub Definition

There is no single correct definition of "Hub." Different centrality metrics capture different aspects of network importance, and biological insight comes from looking at multiple together.

Metric	Mathematical Definition	Biological Interpretation
Degree centrality	Number of edges per node	Direct interaction partners — simplest Hub measure
Betweenness	Frequency on shortest paths	"Information flow bottleneck" — signaling pathway critical points
Closeness	Inverse of mean shortest distance	Nodes that influence the entire network rapidly
Eigenvector / PageRank	Connectedness to important nodes	Influential nodes connected to other influential nodes
Bottleneck (BN)	Top percentile of betweenness	Strongly correlated with essential genes
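MCC (Maximal Clique Centrality)	Sum of (|C| − 1)! over all maximal cliques C containing the node (introduced with cytoHubba)	Reported as the best predictor of essential proteins in the yeast PPI benchmark of Chin et al. (BMC Syst Biol 2014)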
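
MCC is the one metric in this table that igraph does not provide as a built-in function. The sketch below computes it from igraph's max_cliques(), assuming the definition in Chin et al. (2014): a node's score is the sum of (|C| - 1)! over the maximal cliques C that contain it. The function name mcc_scores() and the toy graph are illustrative; this is not cytoHubba's own code, and its output should be spot-checked against cytoHubba on a real network before being relied on.

```r
library(igraph)

# Hypothetical helper: Maximal Clique Centrality for every node.
# Each maximal clique C (size >= 2) adds (|C| - 1)! to each of its members.
mcc_scores <- function(g) {
  scores <- setNames(numeric(vcount(g)), V(g)$name)
  for (clique in max_cliques(g, min = 2)) {
    idx <- as.integer(clique)
    scores[idx] <- scores[idx] + factorial(length(idx) - 1)
  }
  scores
}

# Toy graph: a triangle A-B-C with a pendant node D attached to C
g_toy <- graph_from_literal(A-B, B-C, A-C, C-D)
mcc <- mcc_scores(g_toy)
print(mcc)
```

Here the triangle contributes 2! = 2 to A, B, and C, and the C-D edge contributes 1! = 1 to C and D, so C scores highest. Because the factorial grows quickly, nodes in large cliques dominate the ranking, and enumerating maximal cliques can be slow on dense networks.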

Practical Hub Selection Strategy

In real research workflows, biologists rarely rely on a single metric. The recommended approach:

Compute Degree, Betweenness, and MCC for the entire DEG-derived subnetwork
Take the top N candidates by each metric (N = 30 to 50 is a reasonable range for a network of a few hundred nodes)
Find the intersection: these proteins rank highly across multiple criteria
The intersection is often 10–30 proteins, which become your Tier 1 Hub candidates
For Tier 2, consider proteins ranked highly in 2 of 3 metrics
Validate Tier 1 candidates against literature, KO/KD phenotype databases, and clinical data
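
The tiering logic above can be sketched in a few lines of base R. The gene names and top-N lists below are placeholders, and tier_hubs() is a hypothetical helper: Tier 1 holds genes present in every list, Tier 2 holds genes present in all but one (2 of 3 when three metrics are used).

```r
# Hypothetical top-N lists, one per metric (gene names are placeholders)
top_lists <- list(
  degree      = c("G1", "G2", "G3", "G4"),
  betweenness = c("G2", "G3", "G5", "G1"),
  mcc         = c("G3", "G1", "G6", "G4")
)

# Count in how many lists each gene appears, then split into tiers
tier_hubs <- function(top_lists) {
  votes <- table(unlist(lapply(top_lists, unique)))
  counts <- setNames(as.vector(votes), names(votes))
  n <- length(top_lists)
  list(
    tier1 = sort(names(counts)[counts == n]),
    tier2 = sort(names(counts)[counts == n - 1])
  )
}

tiers <- tier_hubs(top_lists)
print(tiers)
```

Counting votes rather than chaining intersect() calls keeps the Tier 2 set available without a second pass and extends naturally if you add a fourth metric.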

R Code: Comprehensive Hub Analysis

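library(STRINGdb)
library(igraph)

# deg_df: your DEG table; it must contain a "gene_symbol" column
# (other columns such as log2FC or padj are carried along by map()).

# Initialize STRING database (human, high confidence).
# Note: score_threshold filters on the COMBINED score, which still includes
# text-mining evidence; channel-level filtering needs per-channel scores.
string_db <- STRINGdb$new(
  version = "12.0",
  species = 9606,
  score_threshold = 700,
  input_directory = ""   # "" caches downloads in a temporary directory
)

# Map DEG gene symbols to STRING IDs
mapped <- string_db$map(
  deg_df,
  "gene_symbol",
  removeUnmappedRows = TRUE
)
hits <- unique(mapped$STRING_id)

# Retrieve the subnetwork induced by the mapped DEGs
g <- string_db$get_subnetwork(hits)

# Clean network: remove self-loops and duplicate edges.
# edge.attr.comb = "max" keeps the highest combined_score of merged edges;
# igraph's default combination setting would discard that attribute.
g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE,
              edge.attr.comb = "max")

# Compute multiple centrality metrics.
# Vertex names are STRING IDs, so map them back to gene symbols.
hub_df <- data.frame(
  string_id = V(g)$name,
  gene      = mapped$gene_symbol[match(V(g)$name, mapped$STRING_id)],
  degree    = degree(g),
  btw       = betweenness(g, normalized = TRUE),
  close     = closeness(g, normalized = TRUE),
  eigen     = eigen_centrality(g)$vector,
  pagerank  = page_rank(g)$vector
)

# Identify Hub candidates via intersection.
# igraph has no built-in MCC, so PageRank stands in as the third metric here;
# compute MCC in cytoHubba for the Degree/Betweenness/MCC combination.
top_n <- 30
top_by <- function(metric) head(hub_df$gene[order(-hub_df[[metric]])], top_n)
top_genes <- Reduce(intersect, lapply(c("degree", "btw", "pagerank"), top_by))

# Save results
write.csv(hub_df, "hub_centrality_metrics.csv", row.names = FALSE)
writeLines(top_genes, "hub_intersection_top30.txt")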

Cytoscape: The Visualization and Module Discovery Powerhouse

While R is great for computation, Cytoscape excels at interactive exploration and module discovery. Essential plugins for PPI analysis:

cytoHubba — The Hub Identification Standard

cytoHubba implements 12 different Hub-ranking algorithms in a single Cytoscape plugin:

MCC (Maximal Clique Centrality) — Best for essential protein identification
DMNC (Density of Maximum Neighborhood Component)
MNC (Maximum Neighborhood Component)
Degree — Simple but effective baseline
EPC (Edge Percolated Component)
BottleNeck — Identifies signaling bottlenecks
EcCentricity (based on the greatest shortest-path distance from a node to any other node; central nodes have small eccentricity)
Closeness, Radiality, Betweenness, Stress, ClusteringCoefficient

Best practice: Run MCC + Degree + Betweenness simultaneously, take the top 10 from each, and analyze the union (broad candidates) and intersection (high-confidence candidates).

MCODE — Module and Complex Detection

MCODE (Molecular Complex Detection) identifies dense subgraphs that often correspond to known protein complexes or functional modules. Two parameters matter most:

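Node Score Cutoff (default 0.2): Controls how far a node's score may deviate from the seed node's score for the node to join a cluster. Lower values give smaller, denser modules; raising it (e.g., to 0.4) gives larger, looser ones
K-Core (default 2): Filters out clusters that lack a core in which every node has at least k connections; higher values leave fewer, more densely connected modules

Modules with an MCODE score above roughly 4 often correspond to known complexes (ribosome, proteasome, spliceosome). MCODE-discovered modules are good candidates for follow-up experimental validation.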
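
MCODE itself runs inside Cytoscape, but the idea behind its K-Core parameter can be illustrated with igraph's coreness() function. The sketch below is a plain k-core decomposition on a toy graph, not MCODE's scoring or cluster-expansion algorithm, and the node names are placeholders.

```r
library(igraph)

# Toy graph: a fully connected group of four nodes (A-D) plus a tail node E
g_toy <- graph_from_literal(A-B, A-C, A-D, B-C, B-D, C-D, D-E)

# coreness(v) = largest k such that v belongs to a subgraph in which
# every node has at least k neighbours within that subgraph
core <- coreness(g_toy)
print(core)

# Keep only the nodes that sit in a 3-core (analogous to raising K-Core)
dense_part <- induced_subgraph(g_toy, V(g_toy)[core >= 3])
print(V(dense_part)$name)
```

Raising the threshold removes the loosely attached node E and leaves only the tightly connected group, which is the same intuition behind using a higher K-Core value in MCODE.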

ClueGO + CluePedia — Functional Overlay

These plugins overlay GO term and pathway membership directly onto your network. Particularly powerful for identifying functional Hubs — proteins that bridge multiple biological processes — versus topological Hubs that are merely well-connected.

stringApp — Direct STRING Integration

The stringApp plugin lets you query STRING directly from Cytoscape, without exporting/importing. It also auto-applies STRING's confidence-based edge styling, which is useful for presentation figures.

Common Pitfalls and How to Avoid Them

The biggest mistakes in PPI Hub analysis are not technical errors but conceptual blind spots. Here are the most common — and most consequential — pitfalls.

Pitfall 1: Study Bias

The problem: Well-studied proteins (TP53, EGFR, AKT1, BRCA1) appear as Hubs in nearly every PPI analysis simply because they have been investigated more — generating more recorded interactions in literature databases.

The solution: When TP53 appears as a top Hub, ask: "Is this a novel finding, or just confirmation of well-known biology?" Check the year of first publication and citation count. Genuinely novel Hub findings often involve less-cited proteins that emerged from your specific experimental context.
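
One rough way to make this check quantitative is to compare Hub rank against how heavily each protein has been studied. The sketch below assumes you have already collected a publication count per gene (for example from a PubMed search); the gene names and all numbers are invented, and the median-split rule for flagging candidates is an arbitrary choice for illustration.

```r
# Invented example: degree in your network vs. publication count per gene
hubs <- data.frame(
  gene   = c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E", "GENE_F"),
  degree = c(40, 35, 30, 12, 10, 8),
  pubs   = c(12000, 9000, 150, 8000, 900, 60)
)

# A strong rank correlation suggests Hub status tracks research attention
rho <- cor(hubs$degree, hubs$pubs, method = "spearman")
print(rho)

# Flag well-connected but little-studied proteins (median split, arbitrary)
understudied <- hubs$gene[hubs$degree > median(hubs$degree) &
                          hubs$pubs < median(hubs$pubs)]
print(understudied)
```

A high correlation does not prove that a Hub is an artifact, but proteins that are highly connected despite few publications are the ones least likely to owe their rank to study bias.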

Pitfall 2: Ignoring Tissue Specificity

The problem: General PPI databases aggregate interactions across all tissues and conditions. A protein that's a Hub in liver may be irrelevant in your skin sample.

The solution: Filter your gene list to proteins actually expressed in your tissue of interest. Use:

Human Protein Atlas (HPA) — Tissue-specific protein expression
GTEx — Tissue-specific RNA expression across 54 tissues
CCLE / DepMap — Cell-line-specific data for cancer studies

A reasonable threshold: only retain proteins with measurable expression (TPM > 1 or equivalent) in your tissue.
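
As a minimal sketch of that filter, assume you have exported a named vector of TPM values for your tissue (for example from GTEx or HPA). The gene names and values below are invented, and filter_expressed() is a hypothetical helper.

```r
# Invented TPM values for the tissue of interest
tissue_tpm <- c(GENE_A = 35.2, GENE_B = 0.4, GENE_C = 1.0, GENE_D = 12.7)

# Hub candidates from the network analysis (GENE_E has no expression record)
candidates <- c("GENE_A", "GENE_B", "GENE_C", "GENE_E")

# Keep candidates with TPM strictly above the threshold; genes without
# expression data are set aside rather than silently kept or dropped
filter_expressed <- function(genes, tpm, min_tpm = 1) {
  known <- genes[genes %in% names(tpm)]
  list(
    expressed = known[tpm[known] > min_tpm],
    no_data   = setdiff(genes, names(tpm))
  )
}

res <- filter_expressed(candidates, tissue_tpm)
print(res)
```

Reporting the genes without expression data separately makes it easier to spot identifier-mapping problems, which otherwise look like tissue-specific absence.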

Pitfall 3: Static Snapshot of Dynamic Interactions

The problem: PPI databases represent "possible" interactions, not "current" or "condition-specific" interactions. A protein-protein interaction may only occur during specific cell cycle phases, post-translational modification states, or stress conditions.

The solution: For dynamic insight, complement static PPI with:

Co-IP / AP-MS under your experimental conditions
Cross-linking mass spectrometry (XL-MS) for in-vivo interaction snapshots
BioID / TurboID proximity labeling for context-specific interactomes

Pitfall 4: Self-loops and Duplicate Edges

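The problem: Source databases can contain self-loops (protein A interacts with protein A, either a genuine homodimer or a curation error) and duplicate edges (the same pair reported more than once, or listed in both directions). Both inflate centrality metrics if left in the graph.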

The solution: Always run simplify(g, remove.multiple = TRUE, remove.loops = TRUE) after building your igraph object.
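
A small toy graph shows how much these artifacts can distort degree. The edge list and the score column below are invented. The sketch also passes edge.attr.comb so that the highest score of merged edges is kept, since igraph's default combination setting discards such attributes.

```r
library(igraph)

# Invented edge list: A-B appears twice, and A has a self-loop
edges <- data.frame(
  from  = c("A", "A", "A", "B"),
  to    = c("B", "B", "A", "C"),
  score = c(700, 900, 999, 800)
)
g_raw <- graph_from_data_frame(edges, directed = FALSE)

# Merge duplicates, drop loops, and keep the maximum score per merged edge
g_clean <- simplify(g_raw, remove.multiple = TRUE, remove.loops = TRUE,
                    edge.attr.comb = list(score = "max"))

deg_raw   <- degree(g_raw)
deg_clean <- degree(g_clean)
print(rbind(raw = deg_raw, clean = deg_clean))
print(E(g_clean)$score)
```

In the raw graph, node A appears to be the best-connected node purely because of the duplicate edge and the self-loop; after cleaning, B is.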

Pitfall 5: Ignoring Network Topology Properties

The problem: Comparing absolute centrality values across different studies is meaningless because networks differ in size and density. A node with degree 50 in a small network may be more "Hub-like" than a node with degree 100 in a huge network.

The solution: Use normalized centrality metrics or percentile rankings within your specific network. Several igraph functions, including degree(), betweenness(), and closeness(), support this with normalized = TRUE.
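
To make the degree-50 versus degree-100 comparison concrete, the base R sketch below normalizes degree by the maximum possible degree (n - 1) and computes within-network percentile ranks. The network sizes of 200 and 5,000 nodes are assumed for illustration, and both helper functions are hypothetical.

```r
# Degree as a fraction of the maximum possible degree in the network
normalized_degree <- function(degree, n_nodes) degree / (n_nodes - 1)

# Fraction of nodes with a value less than or equal to each node's value
percentile_rank <- function(x) rank(x, ties.method = "max") / length(x)

small <- normalized_degree(50, 200)    # degree 50 in a 200-node network
large <- normalized_degree(100, 5000)  # degree 100 in a 5,000-node network
print(c(small = small, large = large))

# Percentile ranks within one (invented) network
degrees <- c(P1 = 3, P2 = 10, P3 = 10, P4 = 50)
pct <- percentile_rank(degrees)
print(pct)
```

Under these assumed sizes, the degree-50 node touches about a quarter of its network while the degree-100 node touches about 2% of its own, which is why raw degree values should not be compared across studies.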

Visualization Best Practices

A great Hub network visualization conveys multiple data dimensions simultaneously through visual encoding:

Node size = Degree centrality (visual emphasis on Hubs)
Node color = log2 fold change (red = up, blue = down, white = neutral)
Node border = Statistical significance (thick = high significance)
Edge thickness = Confidence score
Edge color = Evidence type (experimental vs functional)

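In Cytoscape, this is done in the Style panel, using Continuous Mapping for numeric attributes (degree, fold change, confidence score) and Discrete Mapping for categorical ones such as evidence type. The result is a publication-quality figure that communicates network topology, expression direction, and statistical confidence at a glance.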

For larger networks (>200 nodes), consider:

Force-directed layout (Prefuse Force Directed or yFiles Organic) for natural clustering
Hide edges below confidence threshold in display (keep them in data)
Subnetwork extraction showing only top Hubs and their immediate neighbors

Real-World Workflow Example

Consider a typical breast cancer transcriptomics study:

Input: 487 DEGs from MCF-7 vs MCF-10A (FDR < 0.05, |log2FC| > 1)
PPI Construction: STRING confidence ≥ 700, experiments + databases only → 312 proteins map to STRING, yielding a network of 287 nodes and 1,892 edges
Centrality Analysis: Top 30 by Degree, Betweenness, MCC → intersection of 18 candidates
Tier 1 Hubs (intersection): TP53, MYC, EGFR, ESR1, BRCA1, AKT1, MAPK1, CCND1, ...
Tissue Filter (HPA breast tissue): All 18 retained — biologically plausible
Literature Cross-check: 14/18 already well established in breast cancer (expected, and a sign that study bias may be inflating their rank); 4 novel candidates, which become priority targets
MCODE Module Discovery: 3 dense modules — DNA repair complex, cell cycle module, ER signaling
Validation Plan: qPCR on the 4 novel candidates → siRNA knockdown phenotype → IP-MS for interactome confirmation

What's Next

A good Hub list answers "which proteins" — but not "which biological functions" or "which signaling pathways." For that, you need:

GO enrichment: What biological processes are over-represented in your Hub set?
Pathway enrichment (KEGG, Reactome): Which signaling cascades dominate?

➡️ Next in the series: GO Annotation and Pathway Enrichment Analysis, the second pillar of post-DEG functional analysis, including the critical "background gene set" trap that can invalidate careless enrichment results.