STRING Database Tutorial: Step-by-Step Guide to Protein Network Analysis (2026)
Comprehensive STRING database tutorial covering web interface, R STRINGdb package, confidence scoring, evidence channels, network visualization, and integration with Cytoscape and downstream functional analysis.
The STRING database is one of the most widely used protein-protein interaction (PPI) resources in molecular biology. Version 12.0 covers roughly 59 million proteins across more than 12,000 organisms (the earlier v11.5 release listed about 67 million proteins across 14,000+ organisms), which makes it the default starting point for any analysis that asks "which proteins interact with my gene of interest?" or "what's the network structure of my differential expression results?" This tutorial walks through STRING from first principles to advanced workflows: the web interface, the R package, evidence channel filtering, and integration with downstream tools.
TL;DR
- STRING integrates physical and functional protein interactions from 7 evidence channels.
- Critical setting: Set confidence ≥ 700 (high) and consider disabling text-mining for rigorous analyses.
- The web interface is great for exploring a few proteins; use the STRINGdb R package for programmatic analysis of large gene lists.
- STRING's coexpression and text-mining channels are useful but less reliable than experimental and database channels.
- Always document your STRING version, confidence threshold, and active channels in publications.
What Is STRING?
STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a database that aggregates protein-protein interaction (PPI) evidence from multiple sources and assigns confidence scores to each predicted interaction. Unlike databases that contain only direct experimental interactions (BioGRID, IntAct), STRING includes both:
- Physical interactions: Direct binding, complex membership
- Functional interactions: Proteins that work together but may not physically touch (shared pathways, co-expression)
This dual scope makes STRING powerful, but it also means results need careful interpretation. A high-confidence STRING edge doesn't necessarily mean physical binding; it may mean strong functional coupling.
STRING Evidence Channels
STRING aggregates evidence from seven channels, each contributing to the final confidence score:
| Channel | Source | Reliability |
|---|---|---|
| Experiments | Direct experimental data (Y2H, AP-MS, BioID, etc.) | ⭐⭐⭐ Highest |
| Databases | Curated databases (KEGG, BioCyc, Reactome) | ⭐⭐⭐ High |
| Coexpression | Co-regulated expression across many studies | ⭐⭐ Medium |
| Neighborhood | Conserved gene neighborhoods in genomes | ⭐⭐ Medium |
| Fusion | Gene fusion events across species | ⭐⭐ Medium |
| Co-occurrence | Phylogenetic profile similarity | ⭐ Lower |
| Text-mining | Co-mentions in scientific literature | ⭐ Variable |
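Key practical principle: For rigorous physical interaction analysis, use the physical subnetwork or restrict the evidence to the Experiments + Databases channels, keeping in mind that the Databases channel also includes curated pathway co-membership. For broader functional coupling analysis, all channels can be appropriate, but the active channels must be reported.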
Why Text-Mining Is Both Useful and Risky
Text-mining detects when two proteins are co-mentioned in scientific abstracts. This captures plausible biological associations that haven't been experimentally validated, but it also creates artifacts:
- Two famous proteins (TP53 and MYC) frequently co-occur in literature simply because both are heavily studied → STRING infers an interaction
- Methods papers using protein A as a "control" alongside protein B can create text-mining edges with no biological meaning
- Reviews discussing pathways generate hundreds of weak text-mining edges
Recommendation: For network construction in a publication, default to Experiments + Databases only. Use text-mining as a hypothesis generator, not as evidence.
Confidence Scoring
STRING combines evidence from active channels into a single confidence score (0-1000):
| Score Range | Label | Interpretation |
|---|---|---|
| 150-399 | Low | Many false positives, exploratory only |
| 400-699 | Medium | Reasonable for hypothesis generation |
| 700-899 | High | Recommended threshold for most analyses |
| 900-1000 | Highest | For analyses requiring maximum specificity |
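The confidence score combines evidence probabilistically: each channel score is treated as the probability that the association is real, and the combined score is roughly the probability that at least one channel is right. As a result, an edge with only modest scores in several independent channels can still reach high confidence.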
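The combination can be sketched in a few lines of R. This is a minimal sketch, assuming the scheme used in STRING's published score-combination script: a prior of 0.041 is removed from each channel, the corrected scores are combined as 1 - prod(1 - s), and the prior is added back once. The function name `combine_channel_scores` is made up for this example, and the extra homology correction STRING applies to the co-occurrence and text-mining channels is left out, so the result will not match STRING exactly for every edge.

```r
# Sketch of STRING-style score combination (scores on the 0-1000 scale).
# Assumption: prior = 0.041; homology correction is omitted.
combine_channel_scores <- function(scores, prior = 0.041) {
  s <- scores / 1000                           # rescale to probabilities
  s <- pmax(s - prior, 0) / (1 - prior)        # remove the prior from each channel
  combined <- 1 - prod(1 - s)                  # at least one channel is right
  combined <- combined * (1 - prior) + prior   # add the prior back once
  round(combined * 1000)
}

# Four channels that are individually only "medium" evidence
weak_channels <- c(coexpression = 450, experimental = 500,
                   database = 400, textmining = 480)
combined_weak <- combine_channel_scores(weak_channels)
print(combined_weak)
```

With these invented inputs, none of which reaches 700 on its own, the combined score lands well above the high-confidence cut-off, which is exactly why the set of active channels matters as much as the threshold.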
Practical defaults:
- Initial exploration: 400 (medium) to see broader patterns
- Network analysis for publication: 700 (high)
- Stringent analysis (clinical biomarkers): 900 (highest)
Using STRING — Web Interface Walkthrough
The web interface (https://string-db.org) is excellent for exploring a small to medium number of proteins (up to 200-500).
Step 1: Single Protein Lookup
Start at the home page, choose "Single Protein," enter your protein name (e.g., "TP53"), select organism ("Homo sapiens"), and submit. You'll get:
- A network view of TP53's predicted interactors
- Default settings: medium confidence (400), all evidence channels
- Edge colors indicate evidence types
Step 2: Adjust Confidence and Channels
The crucial step: click "Settings" (bottom-left) and:
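- Set Minimum required interaction score to 0.700 (high); the web interface shows scores on a 0-1 scale, so 0.700 corresponds to 700 in the table above and in the R package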
- Under "Active interaction sources," uncheck "Text-mining"
- Under "Network type," choose "physical subnetwork" if you want only physical interactions
Re-render the network. You'll typically see a much sparser, more reliable network.
Step 3: Multiple Protein Analysis
For a list of differentially expressed genes (DEGs), choose "Multiple Proteins," paste your gene symbols (one per line, up to 2000 in the web interface), and submit. STRING will:
- Map names to proteins
- Show the network among your input genes
- Provide functional enrichment (GO, KEGG, Reactome, WikiPathways, Pfam, InterPro, and others) automatically
The "Analysis" tab shows enrichment results — a quick alternative to running clusterProfiler separately.
Step 4: Export Network Data
Click "Exports" to download:
- Network image (PNG, SVG, PDF)
- Interaction table (TSV, JSON, XML)
- Cytoscape format for further analysis
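The same interaction table can also be requested without the browser through STRING's REST API. The sketch below only assembles the request URL. `build_string_network_url` is a hypothetical helper, and the parameter names (`identifiers`, `species`, `required_score`, `network_type`) are those listed for the `network` endpoint on the STRING API help page at the time of writing, so verify them against the current documentation before relying on them.

```r
# Hypothetical helper: build a STRING REST API URL for the network among a set of proteins.
build_string_network_url <- function(genes, species = 9606, required_score = 700,
                                     network_type = "functional") {
  stopifnot(length(genes) > 0, required_score >= 0, required_score <= 1000)
  # Identifiers are separated by an encoded carriage return (%0d)
  ids <- paste(vapply(genes, utils::URLencode, character(1), reserved = TRUE),
               collapse = "%0d")
  paste0("https://string-db.org/api/tsv/network",
         "?identifiers=", ids,
         "&species=", species,
         "&required_score=", required_score,
         "&network_type=", network_type)
}

network_url <- build_string_network_url(c("TP53", "MDM2", "CDKN1A"))
print(network_url)
# edges <- read.delim(network_url)   # uncomment to fetch the TSV (needs network access)
```

Note that the API takes the score on the 0-1000 scale, unlike the 0-1 slider in the web settings.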
STRINGdb R Package — Programmatic Workflow
For analyses with thousands of proteins or batch processing, the R package is essential.
Installation
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("STRINGdb")
Basic Workflow
library(STRINGdb)
# Initialize the STRING database object
# version = "12.0"       : specify the version explicitly for reproducibility
# species = 9606         : NCBI Taxonomy ID for human
# score_threshold = 700  : high confidence
# input_directory = ""   : downloads go to a temporary directory;
#                          set a real path to cache files between sessions
string_db <- STRINGdb$new(
  version = "12.0",
  species = 9606,
  score_threshold = 700,
  input_directory = "",
  protocol = "https"
)
# Load your gene list
deg_df <- read.csv("my_degs.csv")
# Should have columns: gene_symbol, log2FC, padj (or similar)
# Map gene symbols to STRING IDs
mapped <- string_db$map(
  deg_df,
  "gene_symbol",
  removeUnmappedRows = TRUE,
  takeFirst = TRUE
)
# Check mapping success rate
cat("Mapped", nrow(mapped), "of", nrow(deg_df), "genes\n")
# Typically 85-95% mapping; investigate failures
# Get STRING IDs for downstream
hits <- mapped$STRING_id
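To investigate mapping failures, it helps to list the symbols that did not receive a STRING ID. The sketch below uses two small invented data frames (`toy_degs` and `toy_mapped`, with placeholder IDs) standing in for `deg_df` and the output of `string_db$map()`.

```r
# Toy stand-ins for deg_df and the mapped table; the STRING IDs are placeholders.
toy_degs <- data.frame(
  gene_symbol = c("TP53", "MDM2", "LOC101927", "CDKN1A"),
  log2FC      = c(1.8, -1.2, 2.5, 1.1)
)
toy_mapped <- data.frame(
  gene_symbol = c("TP53", "MDM2", "CDKN1A"),
  STRING_id   = c("9606.ENSP_A", "9606.ENSP_B", "9606.ENSP_C")
)

# Symbols that did not receive a STRING ID
unmapped <- setdiff(as.character(toy_degs$gene_symbol),
                    as.character(toy_mapped$gene_symbol))
mapping_rate <- 1 - length(unmapped) / nrow(toy_degs)

cat(sprintf("Mapping rate: %.0f%%\n", 100 * mapping_rate))
print(unmapped)
```

Unmapped symbols are often outdated aliases, non-coding loci, or identifiers from a different namespace; updating them to current official symbols before mapping usually recovers some of them.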
Network Plotting
# Plot interactions among your hits (uses STRING's built-in plotter)
string_db$plot_network(
  head(hits, 200),    # First 200 hits for clarity; sort `mapped` first if you want the top genes
  add_link = FALSE,   # Set TRUE to get a clickable link
  add_summary = TRUE
)
# Color nodes by log2FC direction (hex codes: red = up, blue = down)
mapped$color <- ifelse(mapped$log2FC > 0, "#FF0000", "#0000FF")
payload_id <- string_db$post_payload(
  mapped$STRING_id,
  colors = mapped$color
)
string_db$plot_network(hits, payload_id = payload_id)
Functional Enrichment via STRING
# Get GO BP enrichment for your gene set
enrich_bp <- string_db$get_enrichment(
  hits,
  category = "Process"
)
head(enrich_bp[order(enrich_bp$p_value), ], 20)
# Other categories: "Function" (MF), "Component" (CC), "KEGG", "RCTM" (Reactome)
enrich_kegg <- string_db$get_enrichment(hits, category = "KEGG")
Subnetwork Extraction for Downstream Analysis
# Get the subnetwork as an igraph object
g <- string_db$get_subnetwork(hits)
# Now use full igraph functionality (covered in our PPI/Hub guide)
library(igraph)
hub_df <- data.frame(
  gene = V(g)$name,
  degree = degree(g),
  btw = betweenness(g, normalized = TRUE),
  pagerank = page_rank(g)$vector
)
write.csv(hub_df, "string_hub_centrality.csv", row.names = FALSE)
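A common next step is to keep only the genes that rank highly under several centrality measures at once. The sketch below shows one way to do that on an invented table shaped like `hub_df`; the helper `top_k` and the cut-off `k = 3` are illustrative choices, not part of STRINGdb or igraph.

```r
# Toy centrality table shaped like hub_df; the values are invented.
toy_hubs <- data.frame(
  gene     = c("A", "B", "C", "D", "E", "F"),
  degree   = c(12, 9, 3, 10, 2, 7),
  btw      = c(0.30, 0.05, 0.22, 0.18, 0.01, 0.02),
  pagerank = c(0.25, 0.20, 0.08, 0.22, 0.05, 0.10)
)

# Return the genes ranked in the top k for one measure
top_k <- function(df, measure, k) {
  ranked <- as.character(df$gene)[order(df[[measure]], decreasing = TRUE)]
  ranked[seq_len(min(k, nrow(df)))]
}

measures <- c("degree", "btw", "pagerank")
top_lists <- lapply(measures, function(m) top_k(toy_hubs, m, k = 3))

# Keep only genes that are top-ranked under every measure
consensus_hubs <- Reduce(intersect, top_lists)
print(consensus_hubs)
```

Requiring agreement across measures is stricter than ranking by degree alone and tends to drop nodes that are highly connected but peripheral to the flow of the network.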
Filtering by Evidence Channel
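# Per-channel scores are only returned when the STRINGdb object is created
# with link_data = "detailed"; the default returns combined_score only
# (check colnames(interactions) for your package version)
string_db_detailed <- STRINGdb$new(
  version = "12.0",
  species = 9606,
  score_threshold = 700,
  link_data = "detailed",
  input_directory = ""
)
interactions <- string_db_detailed$get_interactions(hits)
# Filter to edges with high-confidence experimental or curated-database evidence
physical_high <- interactions[
  interactions$experimental >= 700 |
    interactions$database >= 700,
]
# Build a network from this strict subset
library(igraph)
g_strict <- graph_from_data_frame(
  physical_high[, c("from", "to")],
  directed = FALSE
)
g_strict <- simplify(g_strict)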
Integration with Cytoscape
While R/igraph is great for analysis, Cytoscape excels at interactive visualization. The integration:
Method 1: stringApp Plugin
Install stringApp from Cytoscape App Manager. Then in Cytoscape:
- File → Import → Network from Public Databases → STRING protein query
- Paste your gene list, select organism
- Cytoscape auto-fetches STRING data and applies STRING's edge styling
- All Cytoscape analysis tools (cytoHubba, MCODE, ClueGO) immediately work
Method 2: Export from STRINGdb
# Export edge list for Cytoscape import
write.csv(
  interactions[, c("from", "to", "combined_score")],
  "string_edges_for_cytoscape.csv",
  row.names = FALSE
)
# In Cytoscape: File → Import → Network from File → select your CSV
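Because the exported edge list identifies proteins by STRING ID, a matching node table makes the Cytoscape network easier to read (gene symbols as labels, log2FC for node color). The sketch below builds one from invented stand-ins for the `interactions` and `mapped` tables; the IDs are placeholders and the file is written to a temporary directory.

```r
# Toy stand-ins for `interactions` and `mapped`; IDs are placeholders, not real STRING IDs.
cy_edges <- data.frame(
  from = c("9606.P1", "9606.P1", "9606.P2"),
  to   = c("9606.P2", "9606.P3", "9606.P3"),
  combined_score = c(950, 720, 810)
)
cy_mapped <- data.frame(
  STRING_id   = c("9606.P1", "9606.P2", "9606.P3", "9606.P4"),
  gene_symbol = c("GENE1", "GENE2", "GENE3", "GENE4"),
  log2FC      = c(2.1, -1.4, 1.2, 3.0)
)

# Keep one row per node that actually appears in the exported edge list
node_ids <- unique(c(as.character(cy_edges$from), as.character(cy_edges$to)))
node_table <- cy_mapped[cy_mapped$STRING_id %in% node_ids, ]
node_table <- node_table[!duplicated(node_table$STRING_id), ]

node_file <- file.path(tempdir(), "string_nodes_for_cytoscape.csv")
write.csv(node_table, node_file, row.names = FALSE)
```

In Cytoscape, the node table can then be loaded with File → Import → Table from File, using the STRING_id column as the key.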
Common STRING Analysis Patterns
Pattern 1: Functional Module Discovery
For a DEG list, find tightly interconnected sub-modules:
library(STRINGdb)
library(igraph)
# Get the network
g <- string_db$get_subnetwork(hits)
g <- simplify(g, remove.multiple = TRUE, remove.loops = TRUE)
# Find communities (Louvain is randomized, so fix the seed for reproducibility)
set.seed(42)
communities <- cluster_louvain(g)
# Examine community membership
membership_df <- data.frame(
  gene = V(g)$name,
  community = as.integer(membership(communities))
)
# Run enrichment within each community
for (comm in unique(membership_df$community)) {
  comm_genes <- membership_df$gene[membership_df$community == comm]
  if (length(comm_genes) >= 5) {
    enrich <- string_db$get_enrichment(comm_genes, category = "Process")
    cat("Community", comm, "(", length(comm_genes), "genes):\n")
    # Column names follow recent STRINGdb releases; check colnames(enrich) if this fails
    print(head(enrich[order(enrich$p_value), c("description", "p_value")], 5))
    cat("\n")
  }
}
Pattern 2: Hub Protein Identification
(See our dedicated guide on PPI Network Hub Analysis for complete workflow.)
Pattern 3: Drug Target Network
For a known drug target, expand to its functional partners:
# Start with one drug target
drug_target <- "EGFR"
target_string_id <- string_db$mp(drug_target)
# Get all interactors
neighbors <- string_db$get_neighbors(target_string_id)
# Get pairwise interactions among neighbors (and target)
all_proteins <- c(target_string_id, neighbors)
interactions <- string_db$get_interactions(all_proteins)
# Build network and analyze
g <- graph_from_data_frame(
  interactions[, c("from", "to")],
  directed = FALSE
)
plot(g, vertex.size = 5, vertex.label.cex = 0.8)
Critical Pitfalls and How to Avoid Them
Pitfall 1: Defaulting to Medium Confidence
The STRING default is 400 (medium), which produces many spurious edges. Always set an explicit confidence threshold (700 for most analyses) and report it.
Pitfall 2: Including Text-Mining for Network Analysis
Text-mining inflates connectivity around well-studied proteins, creating apparent "Hubs" that are publication artifacts. Disable text-mining for centrality/Hub analysis.
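The effect is easy to see on a toy example. The edge table below is invented, and for simplicity an edge is kept when its best active channel reaches 700 (a stand-in for the real combined score); `node_degree` is a small helper written for this sketch.

```r
# Invented edge table with per-channel scores (0-1000); not real STRING data.
tm_edges <- data.frame(
  from         = c("TP53", "TP53", "TP53", "TP53", "GENE_X"),
  to           = c("MDM2", "GENE_X", "GENE_Y", "GENE_Z", "GENE_Y"),
  experimental = c(900, 0, 0, 0, 750),
  database     = c(800, 0, 0, 0, 0),
  textmining   = c(950, 820, 760, 710, 300),
  stringsAsFactors = FALSE
)

# Degree of each node given a logical vector selecting the edges to keep
node_degree <- function(edges, keep) {
  table(c(edges$from[keep], edges$to[keep]))
}

# Simplification: keep an edge if its best active channel is >= 700
with_tm    <- pmax(tm_edges$experimental, tm_edges$database, tm_edges$textmining) >= 700
without_tm <- pmax(tm_edges$experimental, tm_edges$database) >= 700

deg_with    <- node_degree(tm_edges, with_tm)
deg_without <- node_degree(tm_edges, without_tm)
print(deg_with[["TP53"]])
print(deg_without[["TP53"]])
```

In this toy network the well-studied protein looks like a hub only while text-mining is active; most of its edges disappear once the channel is switched off.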
Pitfall 3: Using STRING for Tissue-Specific Analysis Without Filtering
STRING aggregates interactions across all conditions and tissues. A protein-protein interaction documented in liver may be irrelevant in your skin sample. Filter your gene list to proteins actually expressed in your tissue (HPA, GTEx).
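A minimal sketch of that filtering step, assuming you have exported a per-gene expression table for your tissue (for example median TPM from HPA or GTEx). The table, the gene list, and the TPM ≥ 1 cut-off below are all illustrative assumptions.

```r
# Invented expression table for the tissue of interest.
tissue_expr <- data.frame(
  gene_symbol = c("TP53", "MDM2", "ALB", "KRT14", "CDKN1A"),
  tpm         = c(35.2, 12.8, 0.1, 410.0, 22.4),
  stringsAsFactors = FALSE
)
deg_symbols <- c("TP53", "ALB", "KRT14", "CDKN1A", "NOVEL1")

# Assumption: TPM >= 1 counts as "expressed"; choose a cut-off suited to your data
min_tpm <- 1
expressed <- tissue_expr$gene_symbol[tissue_expr$tpm >= min_tpm]

# Genes with no expression record are dropped here; you may prefer to keep and flag them
deg_expressed <- intersect(deg_symbols, expressed)
print(deg_expressed)
```

Run the filter before mapping to STRING so that the network, hub ranking, and enrichment are all computed on the tissue-relevant gene set.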
Pitfall 4: Confusing Functional and Physical Interactions
Many STRING edges represent "function together in the same pathway," not "physically bind." If your downstream interpretation requires physical binding (e.g., for AP-MS validation planning), filter to the physical subnetwork (in the web settings, or network_type = "physical" when creating the STRINGdb object) or to experiments-only evidence.
Pitfall 5: Not Recording STRING Version
STRING is updated periodically with new evidence (major releases have appeared every one to two years). An analysis run against STRING v11 may give different results from v12. Always specify the version in publications for reproducibility.
Pitfall 6: Over-Trusting STRING Enrichment for Custom Backgrounds
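STRING's built-in enrichment uses the entire genome as background by default, which is often inappropriate when only a subset of genes could have been detected in your experiment. Either supply your own background (the web interface accepts one, and STRINGdb provides set_background()) or, for rigorous enrichment, use clusterProfiler with a custom universe (covered in our GO/Pathway Enrichment guide).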
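To see why the background matters, consider a one-sided hypergeometric (over-representation) test for a single term. The numbers below are invented, and `enrich_p` is a small helper written for this sketch using base R's `phyper`.

```r
# One-sided hypergeometric (over-representation) test for a single term.
enrich_p <- function(hits_in_term, n_hits, term_size, universe_size) {
  phyper(hits_in_term - 1, term_size, universe_size - term_size,
         n_hits, lower.tail = FALSE)
}

# Invented numbers: 15 of 200 DEGs fall in the term.
# Whole-genome background: 300 of 20,000 genes carry the annotation
p_genome <- enrich_p(15, 200, term_size = 300, universe_size = 20000)
# Custom background: 250 of the 8,000 genes detected in the experiment carry it
p_custom <- enrich_p(15, 200, term_size = 250, universe_size = 8000)

print(c(genome = p_genome, custom = p_custom))
```

With the whole genome as background the term looks far more enriched than it does against the genes that were actually measurable, so the genome-wide p-value overstates the evidence.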
STRING Version History and Release Notes
Major updates to be aware of:
- STRING v12.0 (2023): Covers 12,535 organisms and about 59 million proteins; adds support for annotating user-uploaded proteomes
- STRING v11.5 (2021): Expanded coverage to 14,094 organisms; introduced the physical subnetwork scoring mode
- STRING v11 (2019): Major coverage expansion (about 5,000 organisms) and enrichment analysis for genome-wide datasets
- STRING v10 (2015): Coverage of roughly 2,000 organisms with improved transfer of interaction evidence between species
For longitudinal studies, lock to a specific version for consistency.
Beyond STRING: When to Use Other Databases
STRING is excellent but not always the right choice:
- For physical interactions only: BioGRID (more strict curation), IntAct (manual literature curation)
- For human interactome focus: HuRI (yeast two-hybrid systematic mapping)
- For pathway-specific analysis: Reactome, KEGG (focus on signaling cascades)
- For drug-target analysis: DrugBank, ChEMBL, IUPHAR
- For structural interactions: PDB, Interactome3D
In practice, combining STRING with one of these (e.g., STRING for breadth, BioGRID for confirming physical interactions) gives the most defensible analyses.
Reporting STRING Analyses for Publication
A reviewer-proof methods section should include:
- STRING version (e.g., "STRING v12.0")
- Species ID (e.g., "Homo sapiens, NCBI Taxonomy ID 9606")
- Confidence threshold (e.g., "interactions with combined score ≥ 700")
- Active evidence channels (e.g., "experimental and database evidence only; text-mining and homology-based predictions excluded")
- Analysis date (databases evolve)
- Citations: Szklarczyk et al. (2023) for STRING
Example methods text: "Protein-protein interactions among differentially expressed genes were retrieved using STRING v12.0 (Szklarczyk et al., 2023) for Homo sapiens (NCBI Taxonomy 9606), restricted to interactions with combined confidence score ≥ 700 from experimental and database evidence channels (text-mining excluded). The resulting network was analyzed in Cytoscape v3.10 with the cytoHubba plugin for centrality-based Hub identification."
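One way to keep the write-up and the analysis in sync is to store the settings in a single object and generate the reporting sentence from it. The sketch below is an illustration only; the list `string_settings` and its field names are hypothetical, not part of STRINGdb.

```r
# Collect the settings once and reuse them for both the analysis and the write-up.
string_settings <- list(
  version   = "12.0",
  species   = 9606,
  threshold = 700,
  channels  = c("experimental", "database"),
  run_date  = format(Sys.Date(), "%Y-%m-%d")
)

methods_text <- sprintf(
  "STRING v%s (NCBI Taxonomy %d), combined score >= %d, channels: %s; accessed %s.",
  string_settings$version,
  string_settings$species,
  string_settings$threshold,
  paste(string_settings$channels, collapse = ", "),
  string_settings$run_date
)
cat(methods_text, "\n")
```

The same list can supply `version`, `species`, and `score_threshold` when the STRINGdb object is created, so the reported values cannot drift from the ones actually used.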
Putting It All Together: A Standard Workflow
For most RNA-seq differential expression follow-up:
- Filter DEGs: |log2FC| > 1, FDR < 0.05 → your input gene list
- STRING via R: Use STRINGdb with confidence ≥ 700, experiments + database channels
- Network construction: Get subnetwork as igraph object
- Centrality analysis: Compute Degree, Betweenness, MCC; intersect top candidates
- Module discovery: Louvain or MCODE clustering for functional sub-networks
- Functional enrichment: GO BP, Reactome on full set and per module (use clusterProfiler with proper background)
- Visualization: Export to Cytoscape for publication-quality figures
This is the workflow detailed across our three-part series on functional analysis: PPI Hubs (Part 1) → GO/Pathway (Part 2) → TF Activity + Biomarkers (Part 3).
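The first step of the workflow, filtering the differential expression table, can be sketched in base R. The table below is invented and its column names follow the `deg_df` example used earlier; adjust them to your own results file.

```r
# Invented differential expression results.
toy_deg <- data.frame(
  gene_symbol = c("G1", "G2", "G3", "G4", "G5"),
  log2FC      = c(2.3, -1.7, 0.4, 1.5, -3.0),
  padj        = c(0.001, 0.03, 0.0001, 0.20, NA),
  stringsAsFactors = FALSE
)

# Keep genes with |log2FC| > 1 and FDR < 0.05; rows with missing padj are dropped
keep <- !is.na(toy_deg$padj) & abs(toy_deg$log2FC) > 1 & toy_deg$padj < 0.05
input_genes <- toy_deg$gene_symbol[keep]
print(input_genes)
```

Handling missing adjusted p-values explicitly avoids NA entries slipping into the gene list passed to STRING.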
Conclusion
STRING is one of those rare bioinformatics tools that's both extremely powerful and extremely easy to misuse. The defaults (medium confidence, all channels) produce noisy networks that look impressive but contain many spurious edges. Used correctly, with explicit confidence thresholds, deliberate channel selection, and version documentation, STRING is a solid foundation for post-DEG functional analysis.
Master the basic web interface for exploration. Use STRINGdb in R for serious analysis. Always integrate downstream with Cytoscape for visualization and clusterProfiler for proper enrichment. And always, always document your settings.
Further Reading
- Szklarczyk, D. et al. (2023). The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Research, 51(D1), D638-D646.
- Snel, B. et al. (2000). STRING: A web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Research, 28(18), 3442-3444. (Original STRING paper)
- Doncheva, N. T. et al. (2019). Cytoscape stringApp: Network analysis and visualization of proteomics data. Journal of Proteome Research, 18(2), 623-632.
- Chin, C.-H. et al. (2014). cytoHubba: Identifying hub objects and sub-networks from complex interactome. BMC Systems Biology, 8(Suppl 4), S11.