Network Biology

PPI Network Construction and Hub Protein Analysis: A Practical Guide for Researchers

Complete practical guide to building Protein-Protein Interaction networks and identifying Hub proteins. Compare STRING, BioGRID, IntAct databases, learn centrality metrics (Degree, Betweenness, MCC), and avoid common analysis pitfalls.

#PPI · #protein-protein interaction · #network analysis · #STRING · #Cytoscape · #Hub protein · #centrality · #cytoHubba · #bioinformatics · #systems biology

📚 Series Navigation · This is Part 1 of a 3-part deep dive into post-DEG functional analysis. Part 1 — PPI + Hub Analysis (current) · Part 2 — GO + Pathway Enrichment (coming) · Part 3 — TF Activity + Biomarker Integration (coming)

After running differential expression analysis (DESeq2, edgeR, or limma), you face a familiar question: "Out of these 500 differentially expressed genes, which ones actually matter?" Protein-Protein Interaction (PPI) network analysis is the most direct, validated first step toward an answer. This guide walks through every step of constructing reliable PPI networks and identifying biologically meaningful Hub proteins — with the specific tool versions, statistical caveats, and practical pitfalls that bioinformatics tutorials rarely mention.

TL;DR

  • STRING is the standard PPI database, but you must filter to confidence ≥ 700 and select only experiments + databases evidence channels for reliable analysis.
  • Identify Hub proteins through the intersection of multiple centrality metrics (Degree, Betweenness, MCC), never a single one.
  • Use cytoHubba in Cytoscape for ranking — but always cross-check the top candidates against literature to detect study bias.
  • Beware: PPI databases reflect "possible interactions," not "current interactions" — tissue-specific filtering with HPA or GTEx is critical.

Why PPI Network Analysis Matters

A single gene rarely causes a phenotype on its own. Cellular function emerges from networks of interacting proteins forming complexes, signaling cascades, and metabolic pathways. The human interactome is estimated to contain 300,000 to 650,000 binary interactions among approximately 20,000 proteins.

When you have a list of differentially expressed genes (DEGs) from RNA-seq or proteomics, three biological insights become accessible only through network analysis:

  1. Hub identification: Which proteins interact with many others and likely orchestrate the response?
  2. Functional modules: Which subsets of DEGs work together as complexes or pathways?
  3. Drug targets: Which proteins, if perturbed, would have the broadest cascading effect?

The classic Albert-Barabási analysis reported that biological networks follow an approximately scale-free topology: most proteins interact with a few partners, but a small set of "Hub" proteins interact with many. These Hubs are statistically enriched for essential genes and disease-associated genes, making them prime targets for both basic research and therapeutic development.

PPI Database Comparison

Choosing the right PPI database is the single most consequential decision in your network analysis. Here's how the major sources compare:

| Database | Evidence Type | Size | Best Use Case |
|---|---|---|---|
| STRING | Physical + functional + text-mining | 67M proteins, 14k organisms | Initial exploration, broad coverage |
| BioGRID | Experimentally validated, strict curation | 2M+ interactions, multi-organism | High-quality network analysis |
| IntAct (EBI) | Manual literature curation | 1.5M interactions | Conservative, regulatory-grade analysis |
| HuRI | Yeast two-hybrid based human interactome | 53k interactions | Experimentally validated subset |
| HPRD | Human-only, manual curation | Legacy DB, ~40k interactions | Historical comparison |
| MINT | Molecular interactions, manual | Older DB | Legacy reference |

Critical STRING Caveats

STRING is by far the most popular PPI database, but a substantial fraction of its edges come from text-mining, which detects co-mentions in literature without verifying actual physical interaction. A high combined score can therefore reflect frequent co-citation rather than physical binding. Best practices:

  • Set confidence score ≥ 700 (high) or ≥ 900 (highest) to filter unreliable edges.
  • In the STRING settings, uncheck text-mining and select only experiments and databases channels.
  • When reporting results in publications, always state the score threshold and channels used.
  • For non-model organisms with sparse PPI data, consider ≥ 400 (medium) but caveat the analysis accordingly.
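When working from a STRING detailed-links download rather than the web interface, one way to apply the "experiments + databases only" rule is to recombine just those two channel scores. The following is a minimal sketch, assuming the prior-corrected combination scheme that STRING documents (prior of 0.041) and the column names `experimental` and `database` used in the detailed links files. The function name `combine_channels` and the toy data frame are illustrative, not part of any package.

```r
# Recombine selected STRING channel scores (0-1000 scale) into one score.
# Assumption: STRING's documented scheme, i.e. remove the prior from each
# channel, combine as 1 - prod(1 - s), then add the prior back once.
combine_channels <- function(links, channels, prior = 0.041) {
  s <- as.matrix(links[, channels, drop = FALSE]) / 1000
  s <- (pmax(s, prior) - prior) / (1 - prior)   # prior-free channel scores
  combined <- 1 - apply(1 - s, 1, prod)         # combine channels per edge
  round(1000 * (combined * (1 - prior) + prior))
}

# Toy stand-in for a real detailed-links table (values are placeholders).
links <- data.frame(
  protein1     = c("P1", "P1", "P2"),
  protein2     = c("P2", "P3", "P3"),
  experimental = c(800, 600, 0),
  database     = c(0, 500, 0),
  textmining   = c(300, 900, 950)
)

# Score each edge from experiments + databases only, then keep edges >= 700.
links$score_exp_db <- combine_channels(links, c("experimental", "database"))
high_conf <- links[links$score_exp_db >= 700, ]
print(high_conf[, c("protein1", "protein2", "score_exp_db")])
```

In this toy table the third edge is supported by text-mining alone, so it drops out once the score is rebuilt from the two curated channels.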

Database Selection by Organism

For human studies, STRING + BioGRID combined gives the most comprehensive coverage. For mouse, STRING is often sufficient. For non-model organisms (e.g., plants, microbes), STRING is often the only practical option, but with smaller coverage and higher uncertainty per interaction.

Tool Stack and Workflow

The most common analytical workflow combines:

  1. STRINGdb (R) or STRING REST API — Pull raw network data
  2. igraph (R) / NetworkX (Python) — Compute centrality metrics programmatically
  3. Cytoscape — Interactive visualization, plugin-based analysis (cytoHubba, MCODE, ClueGO)
  4. Gephi — Optional, for very large networks (>5,000 nodes) where Cytoscape becomes slow

A typical analysis pipeline:

DEGs (filtered, mapped to gene symbols)
    ↓
Map to STRING IDs via STRINGdb::map()
    ↓
Retrieve subnetwork at confidence ≥ 700
    ↓
Compute centrality metrics in igraph (Degree, Betweenness, PageRank); compute MCC in cytoHubba
    ↓
Identify intersection of top-ranked candidates across metrics
    ↓
Visualize in Cytoscape with continuous color/size mapping
    ↓
Cross-validate against literature (PubMed, GeneCards) for study bias

Centrality Metrics: Choosing the Right Hub Definition

There is no single correct definition of "Hub." Different centrality metrics capture different aspects of network importance, and biological insight comes from looking at multiple together.

| Metric | Mathematical Definition | Biological Interpretation |
|---|---|---|
| Degree centrality | Number of edges per node | Direct interaction partners; simplest Hub measure |
| Betweenness | Frequency on shortest paths | "Information flow bottleneck"; signaling pathway critical points |
| Closeness | Inverse of mean shortest distance | Nodes that influence the entire network rapidly |
| Eigenvector / PageRank | Connectedness to important nodes | Influential nodes connected to other influential nodes |
| Bottleneck (BN) | Top percentile of betweenness | Strongly correlated with essential genes |
| MCC (Maximal Clique Centrality) | Sum of (\|C\| − 1)! over all maximal cliques C containing the node (cytoHubba-specific algorithm) | Best at identifying essential proteins (Chin et al., BMC Syst Biol 2014) |
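MCC is the one metric in this table that igraph does not provide directly. The sketch below follows the definition given by Chin et al. (2014), where a node's MCC is the sum of (|C| − 1)! over every maximal clique C that contains it, and uses igraph's `max_cliques()`. The function name `mcc_scores` is hypothetical, the input is assumed to be an undirected graph, and cytoHubba's own implementation may handle edge cases differently.

```r
library(igraph)

# MCC(v) = sum over maximal cliques C containing v of (|C| - 1)!
# Assumes an undirected graph. Isolated nodes get a score of 0 here.
mcc_scores <- function(g) {
  g <- simplify(g)                          # drop self-loops and duplicates
  scores <- numeric(vcount(g))
  names(scores) <- V(g)$name
  for (cl in max_cliques(g, min = 2)) {     # every maximal clique with >= 2 nodes
    idx <- as.integer(cl)                   # vertex ids of the clique members
    scores[idx] <- scores[idx] + factorial(length(idx) - 1)
  }
  scores
}

# Toy network: triangle A-B-C plus a pendant node D attached to C.
g_toy <- graph_from_literal(A-B, A-C, B-C, C-D)
mcc <- mcc_scores(g_toy)
print(mcc)
```

Here C belongs to the triangle (contributing 2! = 2) and to the edge C-D (contributing 1! = 1), so it scores 3 while A and B score 2. Because the score grows factorially with clique size, very large cliques can overflow double precision; ranking on a log scale is one workaround.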

Practical Hub Selection Strategy

In real research workflows, biologists rarely rely on a single metric. The recommended approach:

  1. Compute Degree, Betweenness, and MCC for the entire DEG-derived subnetwork
  2. Take the top 50 candidates by each metric
  3. Find the intersection — these proteins rank highly across multiple criteria
  4. The intersection is often on the order of 10–30 proteins, depending on network size and the cutoff used. These are your Tier 1 Hub candidates
  5. For Tier 2, consider proteins ranked highly in 2 of 3 metrics
  6. Validate Tier 1 candidates against literature, KO/KD phenotype databases, and clinical data
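The tiering steps above can be sketched in a few lines of base R. This is an illustration under stated assumptions: the function name `tier_hubs`, the column names, and the toy scores are all hypothetical, and ties at the cutoff are broken by row order, which may matter in small networks.

```r
# Assign Hub tiers from per-metric top-N lists.
# Tier 1: in the top N for every metric. Tier 2: in all but one.
tier_hubs <- function(scores, metrics, top_n) {
  top_sets <- lapply(metrics, function(m) {
    head(scores$gene[order(-scores[[m]])], top_n)
  })
  votes  <- table(unlist(top_sets))
  genes  <- names(votes)
  n_hits <- as.integer(votes)
  list(
    tier1 = sort(genes[n_hits == length(metrics)]),
    tier2 = sort(genes[n_hits == length(metrics) - 1])
  )
}

# Toy centrality table (placeholder values, not real results).
scores <- data.frame(
  gene   = paste0("G", 1:6),
  degree = c(10, 9, 8, 2, 1, 3),
  btw    = c(0.90, 0.80, 0.10, 0.70, 0.20, 0.05),
  mcc    = c(50, 40, 30, 1, 2, 20)
)

tiers <- tier_hubs(scores, c("degree", "btw", "mcc"), top_n = 3)
print(tiers)
```

With real data, `scores` would hold one row per protein, with MCC values exported from cytoHubba and merged in by gene name.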

R Code: Comprehensive Hub Analysis

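library(STRINGdb)
library(igraph)

# deg_df is assumed to exist: a data frame of DEGs with a "gene_symbol" column.

# Initialize STRING database (human, high confidence).
# Note: score_threshold filters on the combined score across all evidence
# channels; it does not restrict the network to specific channels.
string_db <- STRINGdb$new(
  version = "12.0",
  species = 9606,
  score_threshold = 700,
  input_directory = ""
)

# Map DEG gene symbols to STRING IDs
mapped <- string_db$map(
  deg_df,
  "gene_symbol",
  removeUnmappedRows = TRUE
)
hits <- unique(mapped$STRING_id)

# Retrieve subnetwork
g <- string_db$get_subnetwork(hits)

# Clean network: remove self-loops and duplicate edges, keeping the highest
# combined_score among duplicates (assumes that edge attribute name)
g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE,
              edge.attr.comb = list(combined_score = "max", "ignore"))

# Compute multiple centrality metrics.
# Vertex names are STRING IDs, so map them back to gene symbols.
hub_df <- data.frame(
  string_id = V(g)$name,
  gene      = mapped$gene_symbol[match(V(g)$name, mapped$STRING_id)],
  degree    = degree(g),
  btw       = betweenness(g, normalized = TRUE),
  close     = closeness(g, normalized = TRUE),  # can be NaN for isolated nodes
  eigen     = eigen_centrality(g)$vector,
  pagerank  = page_rank(g)$vector
)

# Identify Hub candidates via intersection.
# igraph has no MCC function, so PageRank stands in as the third metric here;
# MCC can be computed in cytoHubba and merged into hub_df.
top_n <- 30
top_genes <- Reduce(intersect, list(
  head(hub_df[order(-hub_df$degree),    "gene"], top_n),
  head(hub_df[order(-hub_df$btw),       "gene"], top_n),
  head(hub_df[order(-hub_df$pagerank),  "gene"], top_n)
))

# Save results
write.csv(hub_df, "hub_centrality_metrics.csv", row.names = FALSE)
writeLines(top_genes, "hub_intersection_top30.txt")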

Cytoscape: The Visualization and Module Discovery Powerhouse

While R is great for computation, Cytoscape excels at interactive exploration and module discovery. Essential plugins for PPI analysis:

cytoHubba — The Hub Identification Standard

cytoHubba implements 12 different Hub-ranking algorithms in a single Cytoscape plugin:

  • MCC (Maximal Clique Centrality): Best for essential protein identification
  • DMNC (Density of Maximum Neighborhood Component)
  • MNC (Maximum Neighborhood Component)
  • Degree: Simple but effective baseline
  • EPC (Edge Percolated Component)
  • BottleNeck: Identifies signaling bottlenecks
  • EcCentricity: Based on the greatest shortest-path distance from a node to any other node
  • Closeness, Radiality, Betweenness, Stress, ClusteringCoefficient

Best practice: Run MCC + Degree + Betweenness simultaneously, take the top 10 from each, and analyze the union (broad candidates) and intersection (high-confidence candidates).

MCODE — Module and Complex Detection

MCODE (Molecular Complex Detection) identifies dense subgraphs that often correspond to known protein complexes or functional modules. Two parameters matter most:

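  • Node Score Cutoff (default 0.2): Controls how far a node's score may deviate from the seed node's score and still join the cluster. Smaller values give smaller, tighter clusters; larger values admit more nodes
  • K-Core (default 2): Filters out clusters that lack a sufficiently connected core; higher values leave fewer, denser clusters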

Modules with an MCODE score above 4 often correspond to known complexes (ribosome, proteasome, spliceosome). MCODE-discovered modules are good candidates for follow-up experimental validation.
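MCODE itself runs inside Cytoscape, but the k-core idea behind its K-Core parameter can be explored in igraph. The sketch below is not MCODE and will not reproduce its clusters; it only shows, on a toy graph, how `coreness()` separates a densely connected core from loosely attached nodes.

```r
library(igraph)

# Toy network: a 4-protein clique (A-D) with a two-protein tail (E, F).
g_core <- graph_from_literal(A-B, A-C, A-D, B-C, B-D, C-D, D-E, E-F)

# coreness() returns, for each node, the largest k such that the node
# belongs to a subgraph in which every node has degree >= k.
core <- coreness(g_core)

# Keep only the 3-core: nodes with coreness >= 3.
k <- 3
dense <- induced_subgraph(g_core, V(g_core)[core >= k])

print(V(dense)$name)
print(edge_density(dense))
```

Raising `k` discards the tail proteins E and F and leaves the fully connected core, which mirrors how a higher K-Core setting favors denser modules.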

ClueGO + CluePedia — Functional Overlay

These plugins overlay GO term and pathway membership directly onto your network. They are particularly useful for distinguishing functional Hubs (proteins that bridge multiple biological processes) from topological Hubs that are merely well-connected.

stringApp — Direct STRING Integration

The stringApp plugin lets you query STRING directly from Cytoscape, without exporting/importing. It also auto-applies STRING's confidence-based edge styling, which is useful for presentation figures.

Common Pitfalls and How to Avoid Them

The biggest mistakes in PPI Hub analysis are not technical errors but conceptual blind spots. Here are the most common — and most consequential — pitfalls.

Pitfall 1: Study Bias

The problem: Well-studied proteins (TP53, EGFR, AKT1, BRCA1) appear as Hubs in nearly every PPI analysis simply because they have been investigated more — generating more recorded interactions in literature databases.

The solution: When TP53 appears as a top Hub, ask: "Is this a novel finding, or just confirmation of well-known biology?" Check the year of first publication and citation count. Genuinely novel Hub findings often involve less-cited proteins that emerged from your specific experimental context.
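One rough way to quantify study bias is to compare each Hub's connectivity with how heavily it has been published on. The sketch below assumes you have collected a publication count per gene yourself (the `pubmed_hits` column is hypothetical, and every number in the table is a placeholder for illustration, not a real measurement).

```r
# Placeholder values for illustration only, not real measurements.
hubs <- data.frame(
  gene        = c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"),
  degree      = c(42, 35, 30, 28, 12),
  pubmed_hits = c(12000, 9500, 150, 7000, 40)
)

# Rank correlation between connectivity and literature volume.
# A value near 1 suggests degree largely tracks how well-studied a gene is.
rho <- cor(hubs$degree, hubs$pubmed_hits, method = "spearman")

# Flag hubs with below-median literature volume as less-studied candidates.
hubs$less_studied <- hubs$pubmed_hits < median(hubs$pubmed_hits)

print(rho)
print(hubs[hubs$less_studied, c("gene", "degree", "pubmed_hits")])
```

A well-connected but rarely cited protein, like GENE_C in this toy table, is the kind of candidate worth a closer look.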

Pitfall 2: Ignoring Tissue Specificity

The problem: General PPI databases aggregate interactions across all tissues and conditions. A protein that's a Hub in liver may be irrelevant in your skin sample.

The solution: Filter your gene list to proteins actually expressed in your tissue of interest. Use:

  • Human Protein Atlas (HPA) — Tissue-specific protein expression
  • GTEx — Tissue-specific RNA expression across 54 tissues
  • CCLE / DepMap — Cell-line-specific data for cancer studies

A reasonable threshold: only retain proteins with measurable expression (TPM > 1 or equivalent) in your tissue.
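The expression filter can be sketched as follows, assuming a tissue expression table with one row per gene and a `tpm` column (for example, exported from GTEx or HPA). The object names and values are placeholders.

```r
# Hypothetical inputs: hub candidates plus a tissue expression table.
hub_candidates <- c("GENE_A", "GENE_B", "GENE_C", "GENE_D")
tissue_expr <- data.frame(
  gene = c("GENE_A", "GENE_B", "GENE_C", "GENE_E"),
  tpm  = c(35.2, 0.4, 1.0, 88.0)
)

# Genes with measurable expression in the tissue of interest (TPM > 1).
tpm_cutoff <- 1
expressed <- tissue_expr$gene[tissue_expr$tpm > tpm_cutoff]

# Split candidates into retained and removed sets.
kept    <- intersect(hub_candidates, expressed)
dropped <- setdiff(hub_candidates, expressed)

print(kept)
print(dropped)
```

Note that a candidate missing from the expression table (GENE_D here) is dropped along with the low-expression genes, so it is worth reporting the dropped set and checking whether absences are due to identifier mismatches.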

Pitfall 3: Static Snapshot of Dynamic Interactions

The problem: PPI databases represent "possible" interactions, not "current" or "condition-specific" interactions. A protein-protein interaction may only occur during specific cell cycle phases, post-translational modification states, or stress conditions.

The solution: For dynamic insight, complement static PPI with:

  • Co-IP / AP-MS under your experimental conditions
  • Cross-linking mass spectrometry (XL-MS) for in-vivo interaction snapshots
  • BioID / TurboID proximity labeling for context-specific interactomes

Pitfall 4: Self-loops and Duplicate Edges

The problem: Source databases can contain self-loops (a protein recorded as interacting with itself, for example a homodimer) and duplicate edges (the same pair reported by more than one source). Both inflate degree and other centrality metrics.

The solution: Always run simplify(g, remove.multiple = TRUE, remove.loops = TRUE) after building your igraph object.
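A toy example shows the size of the effect. The edge list is made up, and `combined_score` is assumed as the edge attribute name; the `edge.attr.comb` argument keeps the highest score when duplicates are merged, since igraph's default would otherwise discard that attribute.

```r
library(igraph)

# Toy edge list with one duplicated pair (A-B twice) and one self-loop (A-A).
edges <- data.frame(
  from           = c("A", "A", "A", "A"),
  to             = c("B", "B", "A", "C"),
  combined_score = c(700, 900, 950, 800)
)
g_raw <- graph_from_data_frame(edges, directed = FALSE)

# Merge duplicates and drop loops, keeping the best score per pair.
g_clean <- simplify(
  g_raw,
  remove.multiple = TRUE,
  remove.loops    = TRUE,
  edge.attr.comb  = list(combined_score = "max", "ignore")
)

print(degree(g_raw)["A"])     # duplicate edge and self-loop both counted
print(degree(g_clean)["A"])   # two distinct partners remain
print(E(g_clean)$combined_score)
```

In the raw graph, A has degree 5 (the self-loop counts twice); after cleaning it has degree 2, which matches its two real partners.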

Pitfall 5: Ignoring Network Topology Properties

The problem: Comparing absolute centrality values across different studies is meaningless because networks differ in size and density. A node with degree 50 in a small network may be more "Hub-like" than a node with degree 100 in a huge network.

The solution: Use normalized centrality metrics or percentile rankings within your specific network. In igraph, degree(), betweenness(), and closeness() accept normalized = TRUE.
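As a small illustration on a toy star graph, normalized degree divides the raw count by n − 1 (the maximum possible number of partners), and a percentile rank places each node relative to the rest of the same network. The variable names are illustrative.

```r
library(igraph)

# Toy star network: node 1 is connected to the other four nodes.
g_star <- make_star(5, mode = "undirected")

raw_deg  <- degree(g_star)
norm_deg <- degree(g_star, normalized = TRUE)   # raw degree / (n - 1)

# Percentile rank within this network (tied nodes share the upper rank).
pct_rank <- rank(raw_deg, ties.method = "max") / vcount(g_star)

print(data.frame(raw_deg, norm_deg, pct_rank))
```

The normalized value and the percentile answer slightly different questions: the first says how close a node is to the theoretical maximum, the second how it compares with its peers in this particular network.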

Visualization Best Practices

A great Hub network visualization conveys multiple data dimensions simultaneously through visual encoding:

  • Node size = Degree centrality (visual emphasis on Hubs)
  • Node color = log2 fold change (red = up, blue = down, white = neutral)
  • Node border = Statistical significance (thick = high significance)
  • Edge thickness = Confidence score
  • Edge color = Evidence type (experimental vs functional)

In Cytoscape, this is done through Continuous Mapping in the Style panel. The result is a Figure 1-quality visualization that communicates network topology, expression direction, and statistical confidence in a single glance.
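If you prefer to precompute the visual attributes and import them as node columns, the same encoding can be sketched in base R. The column names, the 20 to 80 size range, and the ±2 log2FC clamp are arbitrary choices for illustration, and the node values are placeholders.

```r
# Toy node table (placeholder values).
nodes <- data.frame(
  gene   = c("GENE_A", "GENE_B", "GENE_C"),
  degree = c(40, 10, 5),
  log2fc = c(2.5, 0, -1.0)
)

# Node size: rescale degree linearly to a 20-80 display range.
# (Assumes at least two distinct degree values.)
size_range <- c(20, 80)
deg_scaled <- (nodes$degree - min(nodes$degree)) / diff(range(nodes$degree))
nodes$size <- size_range[1] + deg_scaled * diff(size_range)

# Node color: clamp log2FC to [-2, 2], then map blue -> white -> red.
lfc_limit <- 2
lfc  <- pmax(pmin(nodes$log2fc, lfc_limit), -lfc_limit)
ramp <- colorRamp(c("blue", "white", "red"))
nodes$color <- rgb(ramp((lfc + lfc_limit) / (2 * lfc_limit)),
                   maxColorValue = 255)

print(nodes)
```

Clamping the fold change keeps a single extreme gene from washing out the color scale for everything else. The resulting table can be written with `write.csv()` and imported into Cytoscape as node attributes.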

For larger networks (>200 nodes), consider:

  • Force-directed layout (Prefuse Force Directed or yFiles Organic) for natural clustering
  • Hide edges below confidence threshold in display (keep them in data)
  • Subnetwork extraction showing only top Hubs and their immediate neighbors

Real-World Workflow Example

Consider a typical breast cancer transcriptomics study:

  1. Input: 487 DEGs from MCF-7 vs MCF-10A (FDR < 0.05, |log2FC| > 1)
  2. PPI Construction: STRING confidence ≥ 700, experiments + databases only → 312 proteins map to STRING, yielding a network of 287 nodes and 1,892 edges
  3. Centrality Analysis: Top 30 by Degree, Betweenness, MCC → intersection of 18 candidates
  4. Tier 1 Hubs (intersection): TP53, MYC, EGFR, ESR1, BRCA1, AKT1, MAPK1, CCND1, ...
  5. Tissue Filter (HPA breast tissue): All 18 retained — biologically plausible
  6. Literature Cross-check: 14/18 already known in breast cancer (consistent with study bias); 4 novel candidates, which become priority targets
  7. MCODE Module Discovery: 3 dense modules — DNA repair complex, cell cycle module, ER signaling
  8. Validation Plan: qPCR on the 4 novel candidates → siRNA knockdown phenotype → IP-MS for interactome confirmation

What's Next

A good Hub list answers "which proteins" — but not "which biological functions" or "which signaling pathways." For that, you need:

  • GO enrichment: What biological processes are over-represented in your Hub set?
  • Pathway enrichment (KEGG, Reactome): Which signaling cascades dominate?

➡️ Next in the series: GO Annotation and Pathway Enrichment Analysis, the second pillar of post-DEG functional analysis, including the critical "background gene set" trap that can invalidate careless enrichment results.

Further Reading

  • Chin, C.-H. et al. (2014). cytoHubba: Identifying hub objects and sub-networks from complex interactome. BMC Systems Biology, 8(S4).
  • Szklarczyk, D. et al. (2023). The STRING database in 2023. Nucleic Acids Research, 51(D1).
  • Bader, G. D. & Hogue, C. W. V. (2003). An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4(1).
  • Albert, R. (2005). Scale-free networks in cell biology. Journal of Cell Science, 118(21).
  • Vidal, M., Cusick, M. E. & Barabási, A.-L. (2011). Interactome networks and human disease. Cell, 144(6).