OptimusKG

Introduction

OptimusKG is a modern multimodal knowledge graph with type-specific metadata across biomedical domains.

Highlights

  • A biomedical knowledge graph spanning molecular, anatomical, clinical, and environmental modalities.
  • Integrates 65 heterogeneous resources, grounded in 18 ontologies and controlled vocabularies, using the BioCypher framework and the Biolink Model.
  • Contains 190,531 nodes across 10 entity types, 21,813,816 edges across 27 relation types, and 67,249,863 property instances encoding 110,276,843 values across 150 distinct property keys.
  • Independently validated using PaperQA3, a multimodal agent that retrieves and reasons over scientific literature.
  • Reproducible, deterministic, and infrastructure-agnostic data pipeline with parallel execution.
  • Distributed as Apache Parquet files and downloadable via the optimuskg Python client.

OptimusKG is developed at the Zitnik Lab, Harvard Medical School.

Using OptimusKG

OptimusKG is available via Harvard Dataverse. The graph can be programmatically accessed using the Python client, available on PyPI:

uv add optimuskg

The client fetches files from the gold layer with local caching, and supports loading the graph either as Polars DataFrames or as a NetworkX MultiDiGraph:

import optimuskg

# Download a specific file and store it locally
local_path = optimuskg.get_file("nodes/gene.parquet")

# Load a single Parquet file as a Polars DataFrame
drugs = optimuskg.load_parquet("nodes/drug.parquet")

# Load nodes and edges as Polars DataFrames
# Set lcc=True to load only the largest connected component
nodes, edges = optimuskg.load_graph(lcc=True)

# Load the graph as a NetworkX MultiDiGraph with metadata
# Set lcc=True to load only the largest connected component
G = optimuskg.load_networkx(lcc=True)
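
To make the lcc=True option more concrete, the following is a minimal sketch of what "largest connected component" means for a directed edge list. It uses only the standard library and a toy edge list with hypothetical identifiers. It is not the client's implementation, and it assumes edge direction is ignored when computing components (weak connectivity), which the client may or may not do.

```python
from collections import defaultdict, deque


def largest_connected_component(edges):
    """Return the node set of the largest weakly connected component.

    `edges` is an iterable of (source, target) pairs. Nodes that appear
    in no edge are not considered in this sketch.
    """
    # Build an undirected adjacency map; parallel edges collapse into one.
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)

    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy edge list; the identifiers are hypothetical, not real OptimusKG IDs.
edges = [
    ("gene:A", "disease:X"),
    ("drug:D", "gene:A"),
    ("gene:A", "disease:X"),  # parallel edge, as a multigraph allows
    ("gene:B", "disease:Y"),  # separate, smaller component
]

lcc_nodes = largest_connected_component(edges)
# Keep only the edges whose endpoints both lie in the largest component.
lcc_edges = [(s, t) for s, t in edges if s in lcc_nodes and t in lcc_nodes]
```

In this toy example the smaller gene:B / disease:Y component is dropped, while both parallel edges between gene:A and disease:X are kept.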

Downloads are cached by default in platformdirs.user_cache_dir("optimuskg") (~/Library/Caches/optimuskg on macOS, ~/.cache/optimuskg on Linux). The cache location can be overridden via the $OPTIMUSKG_CACHE_DIR environment variable or programmatically with optimuskg.set_cache_dir(path).
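
The lookup described above can be sketched with the standard library alone. The function below is a hypothetical helper, not part of the optimuskg API. It assumes the environment variable takes effect whenever it is set and non-empty, and it only reproduces the macOS and Linux defaults quoted above (it ignores Windows and any XDG overrides that platformdirs itself would honor).

```python
import os
import sys
from pathlib import Path


def resolve_cache_dir(env=None, platform=None, home=None):
    """Illustrative cache-directory lookup (hypothetical helper).

    The arguments default to the real process environment, platform,
    and home directory; they are parameters only to make the sketch
    easy to try out with other values.
    """
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    home = Path.home() if home is None else Path(home)

    # An explicit override wins over the platform default.
    override = env.get("OPTIMUSKG_CACHE_DIR")
    if override:
        return Path(override)

    # Defaults as documented for macOS and Linux.
    if platform == "darwin":
        return home / "Library" / "Caches" / "optimuskg"
    return home / ".cache" / "optimuskg"


print(resolve_cache_dir())
```

Printing the resolved path this way can be a quick check of where downloaded Parquet files are expected to land before fetching a large graph.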

To target a different dataset (e.g., a pre-release), set the $OPTIMUSKG_DOI environment variable or use optimuskg.set_doi("doi:10.xxxx/XXXX").
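
When a different dataset is needed only for part of a script, one option is to scope the environment variable with a small context manager. The sketch below is a hypothetical helper built on the standard library, not something the client provides. It assumes the client reads $OPTIMUSKG_DOI when a download is requested rather than once at import time. The DOI is the same placeholder as above.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(**overrides):
    """Temporarily set environment variables, restoring them on exit."""
    # Remember the previous values (None means the variable was unset).
    previous = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for name, value in previous.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value


with temporary_env(OPTIMUSKG_DOI="doi:10.xxxx/XXXX"):
    # Calls into the client, such as optimuskg.load_graph(), would go here.
    doi_inside = os.environ["OPTIMUSKG_DOI"]
```

If the client instead reads the variable at import time, calling optimuskg.set_doi(...) directly is the safer route.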
