STRING Database Tutorial: Step-by-Step Guide to Protein Network Analysis (2026)

The STRING database is one of the most widely used protein-protein interaction (PPI) resources in molecular biology. With coverage of roughly 59 million proteins across more than 12,000 organisms in v12.0, it's a common starting point for any analysis that asks "which proteins interact with my gene of interest?" or "what's the network structure of my differential expression results?" This tutorial walks through STRING from first principles to advanced workflows: the web interface, the R package, evidence channel filtering, and integration with downstream tools.

TL;DR

STRING integrates physical and functional protein interactions from 7 evidence channels.
Critical setting: Set confidence ≥ 700 (high) and consider disabling text-mining for rigorous analyses.
The web interface is well suited to exploring a few proteins; the STRINGdb R package is the better choice for programmatic analysis of large gene lists.
STRING's coexpression and text-mining channels are useful but less reliable than experimental and database channels.
Always document your STRING version, confidence threshold, and active channels in publications.

What Is STRING?

STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a database that aggregates protein-protein interaction (PPI) evidence from multiple sources and assigns confidence scores to each predicted interaction. Unlike databases that contain only direct experimental interactions (BioGRID, IntAct), STRING includes both:

Physical interactions: Direct binding, complex membership
Functional interactions: Proteins that work together but may not physically touch (shared pathways, co-expression)

This dual scope makes STRING powerful, but it also demands careful interpretation. A high-confidence STRING edge doesn't necessarily mean physical binding; it may mean strong functional coupling.

STRING Evidence Channels

STRING aggregates evidence from seven channels, each contributing to the final confidence score:

Channel	Source	Reliability
Experiments	Direct experimental data (Y2H, AP-MS, BioID, etc.)	⭐⭐⭐ Highest
Databases	Curated databases (KEGG, BioCyc, Reactome)	⭐⭐⭐ High
Coexpression	Co-regulated expression across many studies	⭐⭐ Medium
Neighborhood	Conserved gene neighborhoods in genomes	⭐⭐ Medium
Fusion	Gene fusion events across species	⭐⭐ Medium
Co-occurrence	Phylogenetic profile similarity	⭐ Lower
Text-mining	Co-mentions in scientific literature	⭐ Variable

Key practical principle: For rigorous physical interaction analysis, use only Experiments + Databases channels. For broader functional coupling analysis, all channels can be appropriate but must be reported.

Why Text-Mining Is Both Useful and Risky

Text-mining detects when two proteins are co-mentioned in scientific abstracts. This captures plausible biological associations that haven't been experimentally validated, but it also creates artifacts:

Two famous proteins (TP53 and MYC) frequently co-occur in literature simply because both are heavily studied → STRING infers an interaction
Methods papers using protein A as a "control" alongside protein B can create text-mining edges with no biological meaning
Reviews discussing pathways generate hundreds of weak text-mining edges

Recommendation: For network construction in a publication, default to Experiments + Databases only. Use text-mining as a hypothesis generator, not as evidence.

Confidence Scoring

STRING combines evidence from active channels into a single confidence score (0-1000):

Score Range	Label	Interpretation
150-399	Low	Many false positives, exploratory only
400-699	Medium	Reasonable for hypothesis generation
700-899	High	Recommended threshold for most analyses
900-1000	Highest	For analyses requiring maximum specificity

The confidence score combines evidence probabilistically — an edge with low scores in many channels can still reach high confidence if the channels independently support the interaction.
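
As a rough illustration of this probabilistic combination, the sketch below follows the scheme STRING describes for its combined score: remove a prior from each channel score, combine the channels as independent evidence, then add the prior back once. The function name `combine_string_scores` and the example scores are hypothetical, the prior of 0.041 is an assumption taken from STRING's published description, and the homology correction STRING applies to some channels is left out, so the result will not match the official combined score exactly.

```r
# Hypothetical helper: combine per-channel scores (0-1000 scale) into one score.
# Simplified sketch: omits STRING's homology correction for some channels.
combine_string_scores <- function(channel_scores, prior = 0.041) {
  s <- channel_scores / 1000
  # Remove the prior from each channel; scores below the prior contribute nothing
  s_no_prior <- pmax((s - prior) / (1 - prior), 0)
  # Probability that at least one channel reflects a real association
  combined <- 1 - prod(1 - s_no_prior)
  # Add the prior back once
  combined <- combined * (1 - prior) + prior
  round(combined * 1000)
}

# Three channels, each only "medium" on its own (illustrative values)
scores <- c(experimental = 450, database = 500, coexpression = 400)
combined <- combine_string_scores(scores)
print(combined)
```

With these illustrative inputs, no single channel reaches 700, yet the combined value lands in the high-confidence band, which is the behavior described above.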

Practical defaults:

Initial exploration: 400 (medium) to see broader patterns
Network analysis for publication: 700 (high)
Stringent analysis (clinical biomarkers): 900 (highest)
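
The bands in the table above can be applied to a vector of scores with base R's `cut()`. This is a small sketch; the function name `score_to_label` and the example scores are hypothetical, and scores below 150 come back as `NA` because they fall outside the listed bands.

```r
# Map combined scores (0-1000) to the confidence labels from the table above
score_to_label <- function(score) {
  cut(
    score,
    breaks = c(150, 400, 700, 900, 1000),
    labels = c("low", "medium", "high", "highest"),
    right = FALSE,          # intervals are [150, 400), [400, 700), ...
    include.lowest = TRUE   # with right = FALSE this closes the top band at 1000
  )
}

example_scores <- c(180, 400, 699, 700, 950, 1000)
labels <- score_to_label(example_scores)
print(data.frame(score = example_scores, label = labels))
```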

Using STRING — Web Interface Walkthrough

The web interface (https://string-db.org) is excellent for exploring a small to medium number of proteins (up to 200-500).

Step 1: Single Protein Lookup

Start at the home page, choose "Single Protein," enter your protein name (e.g., "TP53"), select organism ("Homo sapiens"), and submit. You'll get:

A network view of TP53's predicted interactors
Default settings: medium confidence (400), all evidence channels
Edge colors indicate evidence types

Step 2: Adjust Confidence and Channels

The crucial step: click "Settings" (bottom-left) and:

Set Minimum required interaction score to 0.700 (high)
Under "Active interaction sources," uncheck "Text-mining"
Under "Network type," choose "physical subnetwork" if you want only physical interactions

Re-render the network. You'll typically see a much sparser, more reliable network.

Step 3: Multiple Protein Analysis

For your DEG list, choose "Multiple Proteins," paste your gene symbols (one per line, up to 2000 in web interface), and submit. STRING will:

Map names to proteins
Show the network among your input genes
Provide functional enrichment (GO, KEGG, Reactome, WikiPathways, Pfam, InterPro, and others) automatically

The "Analysis" tab shows enrichment results — a quick alternative to running clusterProfiler separately.

Step 4: Export Network Data

Click "Exports" to download:

Network image (PNG, SVG, PDF)
Interaction table (TSV, JSON, XML)
Cytoscape format for further analysis

STRINGdb R Package — Programmatic Workflow

For analyses with thousands of proteins or batch processing, the R package is essential.

Installation

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("STRINGdb")

Basic Workflow

library(STRINGdb)

# Initialize STRING database
# version: specify explicitly for reproducibility
# species: NCBI Taxonomy ID (9606 = human)
# score_threshold: 700 = high confidence
# input_directory: "" uses a temporary directory; set a path to cache downloads between sessions
string_db <- STRINGdb$new(
  version          = "12.0",
  species          = 9606,
  score_threshold  = 700,
  input_directory  = "",
  protocol         = "https"
)

# Load your gene list
deg_df <- read.csv("my_degs.csv")
# Should have columns: gene_symbol, log2FC, padj (or similar)

# Map gene symbols to STRING IDs
mapped <- string_db$map(
  deg_df,
  "gene_symbol",
  removeUnmappedRows = TRUE,
  takeFirst          = TRUE
)

# Check mapping success rate
cat("Mapped", nrow(mapped), "of", nrow(deg_df), "genes\n")
# Typically 85-95% mapping; investigate failures

# Get STRING IDs for downstream
hits <- mapped$STRING_id
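
To follow up on mapping failures, it helps to list the symbols that did not map. The sketch below uses toy stand-ins for `deg_df` and `mapped` (the values and STRING IDs are placeholders, not real identifiers) so that it runs without a STRING connection; with real data, only the last four statements are needed.

```r
# Toy stand-ins for the objects above (hypothetical values)
deg_df <- data.frame(
  gene_symbol = c("TP53", "EGFR", "LOC123", "MYC"),
  log2FC      = c(1.8, -2.1, 1.2, 2.5)
)
mapped <- data.frame(
  gene_symbol = c("TP53", "EGFR", "MYC"),
  log2FC      = c(1.8, -2.1, 2.5),
  STRING_id   = c("9606.ENSP_A", "9606.ENSP_B", "9606.ENSP_C")  # placeholder IDs
)

# Symbols present in the input but absent after mapping
unmapped <- setdiff(deg_df$gene_symbol, mapped$gene_symbol)
mapping_rate <- 100 * length(unique(mapped$gene_symbol)) /
  length(unique(deg_df$gene_symbol))

cat(sprintf("Mapping rate: %.1f%%\n", mapping_rate))
cat("Unmapped symbols:", paste(unmapped, collapse = ", "), "\n")
```

Unmapped symbols are often outdated aliases or non-coding loci, so checking them against current gene nomenclature is a reasonable first step.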

Network Plotting

# Plot interactions among your hits (uses STRING's built-in plotter)
string_db$plot_network(
  head(hits, 200),   # First 200 hits for clarity; sort `mapped` by padj beforehand to plot the top 200
  add_link       = FALSE,    # Set TRUE to get a clickable link
  add_summary    = TRUE
)

# Add colored halos around nodes by log2FC direction
mapped$color <- ifelse(mapped$log2FC > 0, "red", "blue")
payload_id <- string_db$post_payload(
  mapped$STRING_id,
  colors = mapped$color
)
string_db$plot_network(hits, payload_id = payload_id)

Functional Enrichment via STRING

# Get GO BP enrichment for your gene set
enrich_bp <- string_db$get_enrichment(
  hits,
  category = "Process"
)

head(enrich_bp[order(enrich_bp$p_value), ], 20)

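# Other categories: "Function" (MF), "Component" (CC), "KEGG", "RCTM" (Reactome)
# Category names follow the STRING API; run with category = "All" and inspect the category column to confirm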
enrich_kegg <- string_db$get_enrichment(hits, category = "KEGG")

Subnetwork Extraction for Downstream Analysis

# Get the subnetwork as an igraph object
g <- string_db$get_subnetwork(hits)

# Now use full igraph functionality (covered in our PPI/Hub guide)
library(igraph)
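# Vertex names are STRING IDs; look up the original symbols for readability
hub_df <- data.frame(
  string_id = V(g)$name,
  gene      = mapped$gene_symbol[match(V(g)$name, mapped$STRING_id)],
  degree    = degree(g),
  btw       = betweenness(g, normalized = TRUE),
  pagerank  = page_rank(g)$vector
)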

write.csv(hub_df, "string_hub_centrality.csv", row.names = FALSE)
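
One common way to use a centrality table like this is to keep only the genes that rank near the top on several metrics at once. The sketch below uses a toy `hub_df` with hypothetical values so that it runs on its own; `top_n` and the helper `top_by` are illustrative names, and the cutoff is arbitrary.

```r
# Toy centrality table standing in for hub_df (hypothetical values)
hub_df <- data.frame(
  gene     = c("A", "B", "C", "D", "E", "F"),
  degree   = c(12, 9, 3, 10, 2, 7),
  btw      = c(0.30, 0.05, 0.02, 0.25, 0.01, 0.20),
  pagerank = c(0.26, 0.17, 0.08, 0.23, 0.05, 0.21)
)

top_n <- 3

# Return the top_n genes for one metric, highest value first
top_by <- function(metric) {
  head(hub_df$gene[order(hub_df[[metric]], decreasing = TRUE)], top_n)
}

# Genes ranked in the top N by every metric
consensus_hubs <- Reduce(intersect, lapply(c("degree", "btw", "pagerank"), top_by))
print(consensus_hubs)
```

Requiring agreement across metrics is a conservative choice: it trades sensitivity for candidates that are less dependent on any single definition of centrality.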

Filtering by Evidence Channel

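# Per-channel scores are only returned when the object is created with
# link_data = "detailed"; the default returns combined_score only
string_db_detailed <- STRINGdb$new(
  version = "12.0", species = 9606, score_threshold = 700,
  input_directory = "", protocol = "https", link_data = "detailed"
)
interactions <- string_db_detailed$get_interactions(hits)

# Keep edges with strong experimental or curated-database support
# (check names(interactions) if the channel columns differ in your version)
physical_high <- interactions[
  interactions$experimental >= 700 |
  interactions$database >= 700,
]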

# Build a network from this strict subset
library(igraph)
g_strict <- graph_from_data_frame(
  physical_high[, c("from", "to")],
  directed = FALSE
)
g_strict <- simplify(g_strict)

Integration with Cytoscape

While R/igraph is great for analysis, Cytoscape excels at interactive visualization. The integration:

Method 1: stringApp Plugin

Install stringApp from Cytoscape App Manager. Then in Cytoscape:

File → Import → Network from Public Databases → STRING protein query
Paste your gene list, select organism
Cytoscape auto-fetches STRING data and applies STRING's edge styling
All Cytoscape analysis tools (cytoHubba, MCODE, ClueGO) immediately work

Method 2: Export from STRINGdb

# Export edge list for Cytoscape import
write.csv(
  interactions[, c("from", "to", "combined_score")],
  "string_edges_for_cytoscape.csv",
  row.names = FALSE
)

# In Cytoscape: File → Import → Network from File → select your CSV

Common STRING Analysis Patterns

Pattern 1: Functional Module Discovery

For a DEG list, find tightly interconnected sub-modules:

library(STRINGdb)
library(igraph)

# Get the network
g <- string_db$get_subnetwork(hits)
g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE)

# Find communities
communities <- cluster_louvain(g)

# Examine community membership
membership_df <- data.frame(
  gene = V(g)$name,
  community = membership(communities)
)

# Run enrichment within each community
for (comm in unique(membership_df$community)) {
  comm_genes <- membership_df$gene[membership_df$community == comm]
  if (length(comm_genes) >= 5) {
    enrich <- string_db$get_enrichment(comm_genes, category = "Process")
    cat("Community", comm, "(", length(comm_genes), "genes):\n")
    print(head(enrich[order(enrich$p_value), c("description", "p_value")], 5))
    cat("\n")
  }
}

Pattern 2: Hub Protein Identification

(See our dedicated guide on PPI Network Hub Analysis for complete workflow.)

Pattern 3: Drug Target Network

For a known drug target, expand to its functional partners:

# Start with one drug target
drug_target <- "EGFR"
target_string_id <- string_db$mp(drug_target)

# Get all interactors
neighbors <- string_db$get_neighbors(target_string_id)

# Get pairwise interactions among neighbors (and target)
all_proteins <- c(target_string_id, neighbors)
interactions <- string_db$get_interactions(all_proteins)

# Build network and analyze
g <- graph_from_data_frame(
  interactions[, c("from", "to")],
  directed = FALSE
)
plot(g, vertex.size = 5, vertex.label.cex = 0.8)

Critical Pitfalls and How to Avoid Them

Pitfall 1: Defaulting to Medium Confidence

The STRING default is 400 (medium), which produces many spurious edges. Always set an explicit confidence threshold (700 for most analyses) and report it.

Pitfall 2: Including Text-Mining for Network Analysis

Text-mining inflates connectivity around well-studied proteins, creating apparent "Hubs" that are publication artifacts. Disable text-mining for centrality/Hub analysis.

Pitfall 3: Using STRING for Tissue-Specific Analysis Without Filtering

STRING aggregates interactions across all conditions and tissues. A protein-protein interaction documented in liver may be irrelevant in your skin sample. Filter your gene list to proteins actually expressed in your tissue (HPA, GTEx).
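
A minimal sketch of that filtering step, assuming you have already exported a per-gene expression table for your tissue (for example from HPA or GTEx). The table, the gene names, and the 1 TPM cutoff are all hypothetical placeholders.

```r
# Hypothetical expression table, e.g. median TPM per gene in the tissue of interest
tissue_expr <- data.frame(
  gene_symbol = c("TP53", "EGFR", "ALB", "KRT14", "MYC"),
  tpm         = c(25.3, 40.1, 0.2, 310.0, 0.8)
)
deg_symbols <- c("TP53", "ALB", "KRT14", "MYC", "GENE_X")

min_tpm <- 1  # assumed cutoff; choose one suited to your data
expressed <- tissue_expr$gene_symbol[tissue_expr$tpm >= min_tpm]

# Keep only DEGs with evidence of expression in this tissue
deg_expressed <- intersect(deg_symbols, expressed)
dropped <- setdiff(deg_symbols, deg_expressed)
print(deg_expressed)
print(dropped)
```

Genes missing from the expression table (like `GENE_X` here) are dropped along with low-expression genes, so it is worth reviewing the dropped list before moving on.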

Pitfall 4: Confusing Functional and Physical Interactions

Many STRING edges mean "function together in the same pathway," not "physically bind." If your downstream interpretation requires physical binding (e.g., for AP-MS validation planning), restrict the analysis to the physical subnetwork (network_type = "physical" in STRINGdb) or to experiments-only evidence.

Pitfall 5: Not Recording STRING Version

STRING is updated periodically (major releases have appeared roughly every two years) with new evidence. An analysis run against STRING v11 may give different results from v12. Always specify the version in publications for reproducibility.

Pitfall 6: Over-Trusting STRING Enrichment for Custom Backgrounds

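STRING's built-in enrichment uses the entire genome as background by default, which is often inappropriate when only a subset of genes could have been detected in your experiment. Either supply your own background (the web interface accepts one, and STRINGdb provides set_background()) or, for rigorous enrichment, use clusterProfiler with a custom universe (covered in our GO/Pathway Enrichment guide).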
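
To see why the background matters, the sketch below computes an over-representation p-value with base R's hypergeometric distribution for the same overlap under two backgrounds. The helper name `enrich_p` and all counts are illustrative assumptions, not output from STRING or clusterProfiler.

```r
# Over-representation p-value: P(X >= k) under the hypergeometric distribution
enrich_p <- function(k, n_query, n_term, n_background) {
  # k: query genes annotated to the term
  # n_query: size of the query set
  # n_term: genes annotated to the term within the background
  # n_background: total genes in the background
  phyper(k - 1, n_term, n_background - n_term, n_query, lower.tail = FALSE)
}

# Same overlap, two different backgrounds (all numbers are illustrative)
p_genome    <- enrich_p(k = 15, n_query = 200, n_term = 300, n_background = 20000)
p_expressed <- enrich_p(k = 15, n_query = 200, n_term = 250, n_background = 8000)

cat(sprintf("Whole-genome background:   p = %.2e\n", p_genome))
cat(sprintf("Expressed-gene background: p = %.2e\n", p_expressed))
```

With the smaller, expressed-gene background the term is proportionally more common, so the same overlap is less surprising and the p-value is larger. A whole-genome background can therefore overstate enrichment.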

STRING Version History and Release Notes

Major updates to be aware of:

STRING v12.0 (2023): Covers 12,535 organisms; adds support for building networks from user-uploaded genomes
STRING v11.5 (2021): Coverage expanded to 14,094 organisms; added a separate scoring mode for physical interactions
STRING v11.0 (2019): Coverage expanded to 5,090 organisms; enrichment analysis for genome-wide user datasets
STRING v10 (2015): Coverage of about 2,000 organisms, integrated across the tree of life

For longitudinal studies, lock to a specific version for consistency.

Beyond STRING: When to Use Other Databases

STRING is excellent but not always the right choice:

For physical interactions only: BioGRID (more strict curation), IntAct (manual literature curation)
For human interactome focus: HuRI (yeast two-hybrid systematic mapping)
For pathway-specific analysis: Reactome, KEGG (focus on signaling cascades)
For drug-target analysis: DrugBank, ChEMBL, IUPHAR
For structural interactions: PDB, Interactome3D

In practice, combining STRING with one of these (e.g., STRING for breadth, BioGRID for confirming physical interactions) gives the most defensible analyses.

Reporting STRING Analyses for Publication

A reviewer-proof methods section should include:

STRING version (e.g., "STRING v12.0")
Species ID (e.g., "Homo sapiens, NCBI Taxonomy ID 9606")
Confidence threshold (e.g., "interactions with combined score ≥ 700")
Active evidence channels (e.g., "experimental and database evidence only; text-mining and homology-based predictions excluded")
Analysis date (databases evolve)
Citations: Szklarczyk et al. (2023) for STRING

Example methods text: "Protein-protein interactions among differentially expressed genes were retrieved using STRING v12.0 (Szklarczyk et al., 2023) for Homo sapiens (NCBI Taxonomy 9606), restricted to interactions with combined confidence score ≥ 700 from experimental and database evidence channels (text-mining excluded). The resulting network was analyzed in Cytoscape v3.10 with the cytoHubba plugin for centrality-based Hub identification."
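
One way to keep the reported settings in sync with the analysis is to store them once and generate the methods sentence from that record. This is only a sketch: the list name `string_params` and its fields are hypothetical, and the wording of the sentence should be adapted to your journal.

```r
# Analysis settings recorded once and reused (values mirror the example above)
string_params <- list(
  version   = "12.0",
  species   = "Homo sapiens",
  taxon_id  = 9606,
  threshold = 700,
  channels  = c("experimental", "database"),
  run_date  = format(Sys.Date(), "%Y-%m-%d")
)

# Build the sentence from the stored values so text and analysis cannot drift apart
methods_text <- sprintf(
  paste(
    "Interactions were retrieved from STRING v%s for %s (NCBI Taxonomy %d),",
    "restricted to combined score >= %d using the %s channels (accessed %s)."
  ),
  string_params$version, string_params$species, string_params$taxon_id,
  string_params$threshold,
  paste(string_params$channels, collapse = " and "),
  string_params$run_date
)
cat(methods_text, "\n")
```

The same list can be passed to the code that creates the STRINGdb object, so a change in threshold or version shows up in both places.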

Putting It All Together: A Standard Workflow

For most RNA-seq differential expression follow-up:

Filter DEGs: |log2FC| > 1, FDR < 0.05 → your input gene list
STRING via R: Use STRINGdb with confidence ≥ 700, experiments+database channels
Network construction: Get subnetwork as igraph object
Centrality analysis: Compute Degree, Betweenness, MCC; intersect top candidates
Module discovery: Louvain or MCODE clustering for functional sub-networks
Functional enrichment: GO BP, Reactome on full set and per module (use clusterProfiler with proper background)
Visualization: Export to Cytoscape for publication-quality figures
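
The first step of this workflow can be sketched in base R as follows. The data frame is a toy example with hypothetical values, and the column names `gene_symbol`, `log2FC`, and `padj` are assumed to match your differential expression output.

```r
# Toy differential expression results (hypothetical values)
deg_df <- data.frame(
  gene_symbol = c("TP53", "EGFR", "MYC", "GAPDH", "KRAS"),
  log2FC      = c(1.8, -2.4, 0.6, 0.1, 1.3),
  padj        = c(0.001, 0.0004, 0.01, 0.9, 0.20)
)

# Step 1 of the workflow: |log2FC| > 1 and FDR < 0.05
# The NA check guards against genes without an adjusted p-value
sig <- subset(deg_df, !is.na(padj) & abs(log2FC) > 1 & padj < 0.05)
input_genes <- unique(sig$gene_symbol)
print(input_genes)
```

The resulting `input_genes` vector is what would be passed on to the mapping step described earlier.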

This is the workflow detailed across our three-part series on functional analysis: PPI Hubs (Part 1) → GO/Pathway (Part 2) → TF Activity + Biomarkers (Part 3).

Conclusion

STRING is one of those rare bioinformatics tools that's both extremely powerful and extremely easy to misuse. The defaults (medium confidence, all channels) produce noisy networks that look impressive but contain many spurious edges. Used correctly, with explicit confidence thresholds, deliberate channel selection, and version documentation, STRING is a solid foundation for most post-DEG functional analysis.

Master the basic web interface for exploration. Use STRINGdb in R for serious analysis. Always integrate downstream with Cytoscape for visualization and clusterProfiler for proper enrichment. And always, always document your settings.