
Top 10 Bioinformatics Databases Every Researcher Should Know


Why Bioinformatics Databases Matter


Bioinformatics databases are the backbone of modern biological research. They store, organize, and provide access to the vast amounts of molecular data generated by the global research community. Knowing which databases to use — and how to use them effectively — can dramatically accelerate your research, from hypothesis generation to experimental validation.

This guide covers the top 10 bioinformatics databases that every researcher in systems biology, genomics, proteomics, and computational biology should know. For each database, we describe what it contains, how to access it, and practical tips for getting the most out of it.

1. UniProt — The Protein Knowledge Base

What It Contains

UniProt (Universal Protein Resource) is the most comprehensive and widely used protein database. It consists of two main sections:

  • Swiss-Prot: Manually reviewed and annotated entries (~570,000 proteins). Each entry contains curated information about protein function, domains, PTMs, subcellular localization, interactions, and disease associations.

  • TrEMBL: Computationally annotated entries (~250 million proteins) from automated translation of nucleotide sequences.

How to Use It

Search by protein name, gene name, or accession number at uniprot.org. Use the advanced search to filter by organism, function, subcellular location, or PTM type. The "Retrieve/ID mapping" tool converts between identifier systems (UniProt, Ensembl, RefSeq, PDB). The programmatic API enables batch queries:

    # Python example: fetch protein info from UniProt
    import requests

    url = "https://rest.uniprot.org/uniprotkb/P04637.json"
    response = requests.get(url)
    data = response.json()
    print(f"Protein: {data['proteinDescription']['recommendedName']['fullName']['value']}")
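The single-entry request above can be extended to a batch search. The following is a minimal sketch, assuming the UniProt REST search endpoint with its `query`, `fields`, `format`, and `size` parameters; the helper names `build_uniprot_search_url` and `parse_tsv` are hypothetical and exist only for illustration.

```python
from urllib.parse import urlencode

UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_uniprot_search_url(genes, organism_id=9606, reviewed=True, size=25):
    """Build one search URL covering several gene names (hypothetical helper)."""
    gene_clause = " OR ".join(f"gene:{g}" for g in genes)
    query = f"({gene_clause}) AND organism_id:{organism_id}"
    if reviewed:
        query += " AND reviewed:true"  # restrict to Swiss-Prot entries
    params = {
        "query": query,
        "fields": "accession,gene_names,protein_name",
        "format": "tsv",
        "size": size,
    }
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


def parse_tsv(text):
    """Turn a TSV response body into a list of dicts keyed by the header row."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    return [dict(zip(header, rec)) for rec in records]
```

In practice the URL would be passed to `requests.get(...)` and the response text to `parse_tsv`. Keeping URL construction separate from the network call makes the query easy to inspect before sending it.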

Pro Tips

  • Always use Swiss-Prot reviewed entries for curated, reliable annotations

  • Download reference proteomes (optionally with one protein sequence per gene) for proteomics database searches

  • Check the "Expression" tab for tissue-specific expression data

  • Use UniProt's disease annotations to link proteins to human diseases

2. NCBI Gene / GenBank — The Genomic Reference

What It Contains

NCBI (National Center for Biotechnology Information) hosts multiple interconnected databases. NCBI Gene provides gene-centric information for all sequenced organisms. GenBank contains all publicly available nucleotide sequences. Together with RefSeq (curated reference sequences), they form the foundation of genomic reference data.

Key NCBI Resources

  • Gene: Gene summaries, genomic context, expression, pathways, and interactions

  • PubMed: Biomedical literature (36+ million citations)

  • GEO: Gene Expression Omnibus — public repository for functional genomics data

  • SRA: Sequence Read Archive — raw sequencing data

  • ClinVar: Clinical significance of genetic variants

  • dbSNP: Single nucleotide polymorphisms and short variants

How to Use It

Search across all NCBI databases simultaneously using Entrez at ncbi.nlm.nih.gov. Use the Entrez Programming Utilities (E-utilities) for programmatic access. The BLAST suite searches for sequence similarity across GenBank.
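As a rough illustration of programmatic access, the sketch below builds an E-utilities `esearch` request and reads the ID list from its JSON response. The helper names are hypothetical, and the example assumes the JSON layout (`esearchresult` containing `idlist`) that `esearch` returns with `retmode=json`.

```python
import json
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Build an esearch URL for one Entrez database (hypothetical helper)."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def extract_ids(esearch_json_text):
    """Return the list of record IDs from an esearch JSON response body."""
    payload = json.loads(esearch_json_text)
    return payload.get("esearchresult", {}).get("idlist", [])


# Example query: the human TP53 gene record in NCBI Gene
url = build_esearch_url("gene", "TP53[Gene Name] AND Homo sapiens[Organism]")
```

The returned IDs can then be passed to `esummary` or `efetch` to retrieve the records themselves. NCBI asks heavy users to register an API key and to respect its request rate limits.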

Pro Tips

  • Use the NCBI Datasets tool for downloading genome assemblies and annotations

  • GEO is invaluable for finding public omics datasets for validation or meta-analysis

  • Set up MyNCBI email alerts for new publications on your genes of interest

3. Ensembl — Genome Browser and Annotation

What It Contains

Ensembl provides genome assemblies, gene annotations, and comparative genomics data for vertebrates and other eukaryotes. It offers comprehensive gene models including alternative transcripts, regulatory features, and variation data. Ensembl's BioMart tool enables complex queries across multiple data types.

How to Use It

Browse genomes at ensembl.org. Use BioMart for bulk data retrieval — for example, extracting all protein-coding genes on chromosome 21 with their associated GO terms and HGNC symbols.

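    # R example: Query Ensembl with biomaRt
    # (the call after `ensembl` is an assumed completion using biomaRt's useEnsembl())
    library(biomaRt)
    ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")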

Pro Tips

  • Use Ensembl VEP (Variant Effect Predictor) to annotate genetic variants with predicted functional consequences

  • The Ensembl REST API enables programmatic access from any language

  • Ensembl Compara provides ortholog and paralog mappings across species
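For the REST API mentioned in the tips above, a minimal Python sketch might look like the following. It assumes the `lookup/symbol` endpoint and the response fields `display_name`, `id`, `seq_region_name`, `start`, `end`, and `biotype`; both helper names are hypothetical.

```python
from urllib.parse import quote

ENSEMBL_REST = "https://rest.ensembl.org"


def build_lookup_url(symbol, species="homo_sapiens"):
    """Build a gene-symbol lookup URL that requests a JSON response."""
    return (
        f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"
        "?content-type=application/json"
    )


def summarize_gene(record):
    """Format the main fields of a decoded lookup response as one line."""
    return "{name} ({gene_id}) chr{chrom}:{start}-{end} [{biotype}]".format(
        name=record.get("display_name", "?"),
        gene_id=record.get("id", "?"),
        chrom=record.get("seq_region_name", "?"),
        start=record.get("start", "?"),
        end=record.get("end", "?"),
        biotype=record.get("biotype", "?"),
    )
```

Fetching `build_lookup_url("BRCA2")` with any HTTP client and passing the decoded JSON to `summarize_gene` would print a one-line summary of the gene's location and biotype.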

4. STRING — Protein-Protein Interaction Networks

What It Contains

STRING (Search Tool for Retrieval of Interacting Genes/Proteins) integrates known and predicted protein-protein interactions. It combines experimental data, computational predictions, text mining, and database imports to provide confidence-scored interaction networks for over 67 million proteins across 14,000+ organisms.

How to Use It

Enter a protein or gene list at string-db.org. STRING generates an interactive network visualization with functional enrichment analysis. Adjust the confidence threshold (0.15 low to 0.9 highest) and evidence types to control network density.

Pro Tips

  • Use the "Multiple Proteins" search to analyze your gene list as a network

  • Export networks to Cytoscape for advanced visualization and analysis

  • The STRING API and Cytoscape stringApp enable programmatic access

  • Filter by evidence type: "Experiments" for physical interactions, "Databases" for curated pathway data
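The confidence threshold and the API can be combined in a short sketch. It assumes the STRING API's `network` method, which takes carriage-return-separated `identifiers`, an NCBI taxonomy ID as `species`, and a `required_score` on a 0 to 1000 scale; `build_network_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"


def build_network_url(identifiers, species=9606, confidence=0.7, output_format="tsv"):
    """Build a STRING network request URL for a list of protein names.

    `confidence` uses the 0-1 scale shown in the web interface and is
    converted to the 0-1000 scale expected by `required_score`.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    params = {
        "identifiers": "\r".join(identifiers),  # carriage-return separated list
        "species": species,
        "required_score": int(round(confidence * 1000)),
    }
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"
```

For example, `build_network_url(["TP53", "MDM2"], confidence=0.9)` requests only interactions at the highest confidence level described above.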

5. KEGG — Pathway Maps and Molecular Networks

What It Contains

KEGG (Kyoto Encyclopedia of Genes and Genomes) provides manually drawn pathway maps linking genes, proteins, and metabolites. KEGG covers metabolic pathways, signaling pathways, disease pathways, and drug action pathways. Each pathway map provides a visual overview of molecular interactions with links to relevant genes, compounds, and diseases.

Key KEGG Databases

  • KEGG PATHWAY: ~550 reference pathway maps

  • KEGG DISEASE: Human disease entries with associated genes and pathways

  • KEGG DRUG: Drug information with target and pathway annotations

  • KEGG COMPOUND: Small molecule metabolites and chemical compounds

  • KEGG ORTHOLOGY (KO): Functional orthologs across organisms

Pro Tips

  • Use KEGG Mapper to color pathway maps with your experimental data

  • KEGG pathway identifiers (e.g., hsa04010 for MAPK signaling) are used by most enrichment analysis tools

  • The KEGG REST API provides programmatic access: https://rest.kegg.jp/get/hsa04010/kgml
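The KGML document returned by that URL is plain XML, so it can be read with the standard library. The sketch below assumes the usual KGML layout, in which a `pathway` root element carries a `title` attribute and each `entry` of `type="gene"` lists its gene identifiers in a space-separated `name` attribute; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def kgml_url(pathway_id):
    """Return the KEGG REST URL for a pathway's KGML file."""
    return f"https://rest.kegg.jp/get/{pathway_id}/kgml"


def genes_in_kgml(kgml_text):
    """Return (pathway title, sorted unique gene identifiers) from KGML text."""
    root = ET.fromstring(kgml_text)
    genes = set()
    for entry in root.iter("entry"):
        if entry.get("type") == "gene":
            # One entry may group several genes, e.g. "hsa:5594 hsa:5595"
            genes.update(entry.get("name", "").split())
    return root.get("title"), sorted(genes)
```

Downloading `kgml_url("hsa04010")` and passing the text to `genes_in_kgml` would give the gene members of the MAPK signaling map, which can then be intersected with an experimental gene list.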

6. Gene Ontology (GO) — Standardized Gene Function Annotation

What It Contains

The Gene Ontology provides a standardized vocabulary for describing gene and protein function across all organisms. GO terms are organized into three hierarchical ontologies:

  • Biological Process (BP): The larger biological objectives (e.g., "apoptotic process," "DNA repair")

  • Molecular Function (MF): Biochemical activities (e.g., "protein kinase activity," "DNA binding")

  • Cellular Component (CC): Subcellular localization (e.g., "nucleus," "mitochondrial matrix")

How to Use It

GO enrichment analysis is one of the most common downstream analyses in bioinformatics. Tools include:

  • clusterProfiler (R): Comprehensive enrichment analysis with beautiful visualizations

  • g:Profiler: Web-based enrichment analysis supporting multiple organisms

  • DAVID: Classic web tool for functional annotation clustering

  • Enrichr: Fast web-based enrichment with diverse gene set libraries
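To make the idea behind these tools concrete, here is a minimal sketch of an over-representation test for a single GO term using the hypergeometric distribution. This is a common formulation of enrichment testing, not a reproduction of any particular tool's implementation, and the parameter names are illustrative.

```python
from math import comb


def enrichment_p_value(background_size, term_size, list_size, overlap):
    """One-sided hypergeometric p-value for over-representation.

    background_size: number of annotated genes in the background (N)
    term_size:       background genes annotated with the GO term (K)
    list_size:       genes in the list being tested (n)
    overlap:         genes in the list annotated with the term (k)
    Returns P(X >= k) when n genes are drawn without replacement.
    """
    upper = min(term_size, list_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(term_size, i) * comb(background_size - term_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total
```

Because many GO terms are tested at once, the resulting p-values are normally corrected for multiple testing (for example with the Benjamini-Hochberg procedure) before interpretation.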

Pro Tips

  • Use "Biological Process" for functional interpretation of gene lists; "Cellular Component" for localization

  • GO terms are hierarchical — results at intermediate specificity are often most informative

  • Consider semantic similarity tools (GOSemSim) to reduce redundancy in GO enrichment results

7. GEO / ArrayExpress — Public Omics Data Repositories

What They Contain

GEO (Gene Expression Omnibus, NCBI) and ArrayExpress (EMBL-EBI) are the primary public repositories for functional genomics data. They host microarray, RNA-seq, ChIP-seq, ATAC-seq, and other omics datasets from published studies. Together they contain over 200,000 experiments and millions of samples.

How to Use Them

Search by disease, tissue, organism, or technology. GEO datasets (GDS) provide curated, analysis-ready data. GEO Series (GSE) contain complete experiment submissions. The GEOquery R package enables direct data download into R:

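    library(GEOquery)
    # Assumed completion: the accession was lost in the original; substitute your own GSE ID
    gse <- getGEO("GSExxxxx", GSEMatrix = TRUE)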
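Series metadata can also be reached from Python without extra packages. The sketch below assumes GEO's accession display URL (`acc.cgi` with `targ`, `form`, and `view` parameters) and the SOFT text convention that attribute lines start with `!` and use ` = ` as a separator; the helper names are hypothetical and the accession is supplied by the caller.

```python
from urllib.parse import urlencode

GEO_ACC = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def build_geo_soft_url(accession):
    """Build a URL that returns a GEO record's SOFT-format metadata as text."""
    params = {"acc": accession, "targ": "self", "form": "text", "view": "quick"}
    return f"{GEO_ACC}?{urlencode(params)}"


def parse_soft_header(text):
    """Collect '!key = value' attribute lines into a dict of lists.

    Keys such as Series_sample_id repeat once per sample, so every value
    is stored in a list.
    """
    fields = {}
    for line in text.splitlines():
        if line.startswith("!") and " = " in line:
            key, value = line[1:].split(" = ", 1)
            fields.setdefault(key, []).append(value)
    return fields
```

This covers metadata only. For expression matrices, GEOquery or a dedicated Python package remains the more convenient route.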

Pro Tips

  • Use GEO2R for quick online differential expression analysis of GEO datasets

  • Download raw data (FASTQ files) from SRA for reanalysis with current pipelines

  • Check the ARCHS4 resource for uniformly processed RNA-seq data from GEO

8. PDB / AlphaFold DB — Protein Structures

What They Contain

The Protein Data Bank (PDB) contains ~220,000 experimentally determined 3D structures of proteins, nucleic acids, and complexes solved by X-ray crystallography, cryo-EM, and NMR. The AlphaFold Protein Structure Database provides AI-predicted structures for over 200 million proteins, covering most sequences catalogued in UniProt.

How to Use Them

Search PDB at rcsb.org by protein name, gene, or organism. View structures interactively using Mol* or PyMOL. Access AlphaFold predictions at alphafold.ebi.ac.uk. Each AlphaFold model includes per-residue confidence scores (pLDDT) indicating prediction reliability.

Pro Tips

  • Check pLDDT scores for AlphaFold models: >90 is very reliable, 70-90 is good, 50-70 is low confidence, and <50 is unreliable (often disordered regions)

  • Use PDB structures for molecular docking and drug design when available; fall back to AlphaFold models for proteins with no experimental structure

  • The PDBe (European PDB) provides enhanced annotations and analysis tools
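The pLDDT check can be automated. AlphaFold model files in PDB format store the per-residue pLDDT in the B-factor column, so a short parser is enough. This is a sketch that assumes a single-chain model in fixed-column PDB format and uses the four confidence bands from the AlphaFold database; the function names are hypothetical.

```python
from collections import Counter


def plddt_band(score):
    """Map a pLDDT score (0-100) to a confidence band."""
    if score > 90:
        return "very high"
    if score >= 70:
        return "confident"
    if score >= 50:
        return "low"
    return "very low"


def residue_plddt(pdb_text):
    """Read pLDDT per residue from the B-factor column of C-alpha atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residue_number = int(line[22:26])      # PDB columns 23-26
            scores[residue_number] = float(line[60:66])  # PDB columns 61-66
    return scores


def band_fractions(scores):
    """Return the fraction of residues falling in each confidence band."""
    counts = Counter(plddt_band(s) for s in scores.values())
    total = sum(counts.values())
    return {band: n / total for band, n in counts.items()} if total else {}
```

A model with a large "very low" fraction is likely dominated by disordered regions and should be treated with caution for docking or structure-based design.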

9. Reactome — Curated Biological Pathways

What It Contains

Reactome is a free, open-source, curated database of biological pathways and reactions. Unlike KEGG's manually drawn maps, Reactome models pathways as detailed reaction networks with specific molecular participants, modifications, and regulatory relationships. It covers ~2,700 human pathways including metabolism, signaling, gene expression, cell cycle, immune system, and disease processes.

How to Use It

Browse pathways at reactome.org. The "Analyze Data" tool performs pathway enrichment analysis on gene or protein lists. The pathway browser provides interactive visualization with drill-down from high-level categories to individual reactions.
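The same enrichment analysis can be scripted. The sketch below assumes Reactome's Analysis Service accepts a plain-text list of identifiers posted to `identifiers/projection` and returns JSON with a `pathways` list whose items carry `stId`, `name`, and `entities.fdr`. Treat the endpoint and field names as assumptions to verify against the service documentation; the helper names are hypothetical.

```python
from urllib.request import Request

REACTOME_ANALYSIS = "https://reactome.org/AnalysisService/identifiers/projection"


def build_analysis_request(identifiers):
    """Prepare (but do not send) a POST request with one identifier per line."""
    body = "\n".join(identifiers).encode("utf-8")
    return Request(
        REACTOME_ANALYSIS,
        data=body,
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def top_pathways(result, max_fdr=0.05):
    """Return (stable ID, name, FDR) for pathways passing the FDR cutoff."""
    hits = []
    for pathway in result.get("pathways", []):
        fdr = pathway.get("entities", {}).get("fdr")
        if fdr is not None and fdr <= max_fdr:
            hits.append((pathway.get("stId"), pathway.get("name"), fdr))
    return sorted(hits, key=lambda hit: hit[2])
```

Sending the request with `urllib.request.urlopen` and decoding the JSON body would give the `result` dictionary expected by `top_pathways`.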

Pro Tips

  • Reactome's pathway hierarchy is excellent for identifying which broad biological processes are affected

  • Use the "Analyze Data" tool for pathway enrichment; it is fast and produces publication-ready visualizations

  • The ReactomeFIViz Cytoscape app integrates Reactome pathways with network analysis

  • Reactome pathways are used by many enrichment tools (clusterProfiler, GSEA, Enrichr)

10. Human Protein Atlas — Protein Expression Maps

What It Contains

The Human Protein Atlas (HPA) aims to map all human proteins using a combination of antibody-based imaging, transcriptomics, and proteomics. It provides:

  • Tissue Atlas: Protein expression and localization across 44 normal human tissues

  • Cell Atlas: Subcellular protein localization in cultured cells using immunofluorescence

  • Pathology Atlas: Protein expression in 17 major cancer types with survival analysis

  • Blood Atlas: Proteins detected in blood (plasma and blood cells)

  • Brain Atlas: Detailed protein expression in brain regions

  • Single Cell Atlas: Single-cell RNA expression across human tissues

How to Use It

Search for any human gene at proteinatlas.org. Each gene page provides immunohistochemistry images, RNA expression data across tissues, subcellular localization images, and cancer survival plots. The downloadable datasets enable bulk analysis.
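As an example of bulk analysis, the sketch below filters a downloaded tissue expression table for one gene. It assumes a tab-separated file with `Gene name`, `Tissue`, and `Level` columns and level labels such as "Medium" and "High". Column names may differ between HPA releases, so check the header of the file you download; the function name is hypothetical.

```python
import csv
import io


def tissues_with_expression(tsv_text, gene_name, levels=("Medium", "High")):
    """Return sorted tissues where the gene's protein level is in `levels`.

    Assumes columns named 'Gene name', 'Tissue', and 'Level'.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return sorted(
        {
            row["Tissue"]
            for row in reader
            if row["Gene name"] == gene_name and row["Level"] in levels
        }
    )
```

The same pattern works on a file handle: read the unzipped download with `open(path, newline="")` and pass its contents to the function.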

Pro Tips

  • Use the Pathology Atlas to check if your protein of interest is prognostic in specific cancers

  • The Cell Atlas subcellular localization data can validate or complement your proteomics experiments

  • Check the Blood Atlas when designing blood-based biomarker studies — is your target detectable in plasma?

  • The "Tissue specificity" classification helps identify tissue-enriched proteins

Honorable Mentions

Several other databases deserve recognition:

  • DrugBank: Comprehensive drug and drug target database

  • COSMIC: Catalogue of Somatic Mutations in Cancer

  • ClinicalTrials.gov: Registry of clinical studies

  • PRIDE: Proteomics data repository (a member of the ProteomeXchange consortium)

  • MetaboLights: Metabolomics data repository

  • BioModels: Repository of computational models in systems biology

  • PhosphoSitePlus: Post-translational modification database

Building an Effective Database Strategy

For a Typical Research Project

  • Literature review: PubMed for publications, GEO/ArrayExpress for public datasets

  • Gene/protein annotation: UniProt for protein info, NCBI Gene for genomic context, Ensembl for genome-level data

  • Expression data: Human Protein Atlas for tissue expression, GTEx for RNA expression across tissues

  • Pathway analysis: KEGG and Reactome for pathway context, GO for functional annotation

  • Network analysis: STRING for interaction networks

  • Structural biology: PDB for experimental structures, AlphaFold DB for predicted structures

Conclusion

Bioinformatics databases are essential tools for modern biological research. Mastering these ten databases will equip you to annotate genes, explore pathways, analyze networks, access public datasets, and contextualize your experimental findings. The databases described here are actively maintained, widely used by the research community, and free to query online, although some (KEGG, for example) license bulk downloads. Bookmark them, explore their APIs, and integrate them into your daily research workflow. In the age of big data biology, knowing where to find and how to use biological knowledge is just as important as generating new data.

📚 Reference database: Nature

