OptimusKG Client

Introduction

The optimuskg PyPI package is a thin client for loading the OptimusKG biomedical knowledge graph from Harvard Dataverse. The client resolves Dataverse file IDs automatically, caches downloads locally, and returns the graph as Polars DataFrames or a NetworkX MultiDiGraph.

Looking for function signatures and docstrings? Jump to the API Reference.

Installation

uv add optimuskg

Quick Start

import optimuskg

# Download a specific file and store it locally
local_path = optimuskg.get_file("nodes/gene.parquet")

# Load a single Parquet file as a Polars DataFrame
drugs = optimuskg.load_parquet("nodes/drug.parquet")

# Load nodes and edges as Polars DataFrames
# Set lcc=True to load only the largest connected component
nodes, edges = optimuskg.load_graph(lcc=True)

# Load the graph as a NetworkX MultiDiGraph with metadata
# Set lcc=True to load only the largest connected component
G = optimuskg.load_networkx(lcc=True)

See get_file, load_parquet, load_graph, and load_networkx on the reference page for parameters and return types.

Loader overview

Files

get_file downloads a file if it is not already cached and returns its local path. Paths follow the layout of the catalog in the source repository:

optimuskg.get_file("nodes.parquet")                               # full nodes table
optimuskg.get_file("edges.parquet")                               # full edges table
optimuskg.get_file("largest_connected_component_nodes.parquet")   # LCC nodes table
optimuskg.get_file("largest_connected_component_edges.parquet")   # LCC edges table
optimuskg.get_file("nodes/gene.parquet")                          # only Gene nodes
optimuskg.get_file("edges/disease_gene.parquet")                  # only DIS-GEN edges

Pass force=True to re-download the file even if it is already cached.
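The cache-then-download behavior described above can be pictured with a small standalone sketch. This is not the client's implementation: fetch_cached, the fake downloader, and the temporary cache directory are hypothetical stand-ins, and nothing here touches the network.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fetch_cached(
    relative_path: str,
    cache_dir: Path,
    download: Callable[[str], bytes],
    force: bool = False,
) -> Path:
    """Return the local path for relative_path, downloading only when needed.

    Hypothetical helper that mirrors the documented get_file behavior.
    """
    local_path = cache_dir / relative_path
    if force or not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        local_path.write_bytes(download(relative_path))
    return local_path


# Stand-in downloader that records each call instead of using the network.
calls: list[str] = []


def fake_download(relative_path: str) -> bytes:
    calls.append(relative_path)
    return b"placeholder-bytes"


cache_dir = Path(tempfile.mkdtemp())
first = fetch_cached("nodes/gene.parquet", cache_dir, fake_download)
second = fetch_cached("nodes/gene.parquet", cache_dir, fake_download)  # cache hit
third = fetch_cached("nodes/gene.parquet", cache_dir, fake_download, force=True)

print(len(calls))  # 2: the initial download plus the forced re-download
```

The second call returns the cached path without invoking the downloader; only force=True triggers another fetch.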

DataFrames

load_parquet calls get_file, then polars.read_parquet. Any extra keyword arguments are forwarded directly to polars.read_parquet.

drugs = optimuskg.load_parquet("nodes/drug.parquet")
only_ids = optimuskg.load_parquet("nodes/drug.parquet", columns=["id"])

load_graph returns (nodes, edges) as Polars DataFrames. Pass lcc=True to get only the largest connected component.

nodes, edges = optimuskg.load_graph()          # full graph
nodes, edges = optimuskg.load_graph(lcc=True)  # LCC only

DataFrame schemas are documented in Nodes and Edges.

NetworkX

load_networkx builds a networkx.MultiDiGraph on top of load_graph:

G = optimuskg.load_networkx(lcc=True)
print(G.number_of_nodes(), G.number_of_edges())

# Filter by node label
genes = [n for n, attrs in G.nodes(data=True) if attrs["label"] == "GEN"]

# Filter by relation type
expression_edges = [
    (u, v) for u, v, attrs in G.edges(data=True)
    if attrs["relation"] == "EXPRESSION_PRESENT"
]

The properties JSON strings are parsed and merged into the node and edge attribute dicts. Set parse_properties=False to skip parsing and keep the raw string.

The graph is always a MultiDiGraph regardless of edge directionality. Call G.to_undirected() if you need an undirected copy.
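To make the effect of parse_properties concrete, here is a minimal sketch of merging a JSON properties string into an attribute dict. merge_properties and the symbol key are hypothetical, and how the real client handles key collisions or the leftover properties key is not specified here.

```python
import json


def merge_properties(row: dict, parse_properties: bool = True) -> dict:
    """Build an attribute dict from a row, optionally expanding `properties`.

    Hypothetical helper; not part of the optimuskg API.
    """
    attrs = dict(row)
    if parse_properties and isinstance(attrs.get("properties"), str):
        # Replace the raw JSON string with its decoded key/value pairs.
        attrs.update(json.loads(attrs.pop("properties")))
    return attrs


row = {"label": "GEN", "properties": '{"symbol": "TP53"}'}

parsed = merge_properties(row)                      # properties expanded
raw = merge_properties(row, parse_properties=False)  # raw string kept

print(parsed)
print(raw)
```

With parsing enabled, attributes can be read directly (parsed["symbol"]); with it disabled, the caller decodes raw["properties"] when needed.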

Configuration

Cache directory

Downloads are cached by default in platformdirs.user_cache_dir("optimuskg") (~/Library/Caches/optimuskg on macOS, ~/.cache/optimuskg on Linux, and C:\Users\<User>\AppData\Local\optimuskg\optimuskg on Windows). Files within the cache are keyed by <doi_slug>/<dataset_version>/<relative_path>, so a new Dataverse release invalidates the cache automatically.
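The cache key layout explains why a new release never collides with old files. The sketch below composes such a path; doi_slug is a hypothetical slug rule and the version strings are placeholders, so the client's real directory names may differ.

```python
from pathlib import PurePosixPath


def doi_slug(doi: str) -> str:
    """Hypothetical slug: drop the 'doi:' prefix and replace separators."""
    return doi.removeprefix("doi:").replace("/", "_").replace(".", "_")


def cache_path(
    cache_root: str, doi: str, dataset_version: str, relative_path: str
) -> PurePosixPath:
    """Compose <cache_root>/<doi_slug>/<dataset_version>/<relative_path>."""
    return PurePosixPath(cache_root) / doi_slug(doi) / dataset_version / relative_path


# Placeholder version strings; only the fact that they differ matters.
old = cache_path("/data/optimuskg-cache", "doi:10.7910/DVN/IYNGEV", "1.0", "nodes/gene.parquet")
new = cache_path("/data/optimuskg-cache", "doi:10.7910/DVN/IYNGEV", "2.0", "nodes/gene.parquet")

print(old)
print(new)
```

Because the dataset version is part of the path, the same relative file resolves to a different location after a release, and the old copy is simply never looked up again.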

You can override the cache directory from code with set_cache_dir:

optimuskg.set_cache_dir("/data/optimuskg-cache")

Or with an environment variable:

export OPTIMUSKG_CACHE_DIR=/data/optimuskg-cache

Pointing at a different dataset

By default the client targets the published OptimusKG dataset at doi:10.7910/DVN/IYNGEV. To target a different dataset (e.g. a pre-release or private one), override the DOI from code with set_doi:

optimuskg.set_doi("doi:10.7910/DVN/EXAMPLE")

Or with an environment variable:

export OPTIMUSKG_DOI=doi:10.7910/DVN/EXAMPLE

For a non-Harvard Dataverse installation, override the server URL from code with set_server:

optimuskg.set_server("https://dataverse.example.org")

Or with an environment variable:

export OPTIMUSKG_SERVER=https://dataverse.example.org
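The three environment variables can be read together as a settings lookup with fallbacks. The sketch below is one plausible reading, not the client's code: resolve_settings is hypothetical, the default server URL is assumed to be the public Harvard Dataverse address, and how the client prioritizes set_* calls relative to environment variables is not documented on this page.

```python
import os
from typing import Mapping, Optional

DEFAULT_DOI = "doi:10.7910/DVN/IYNGEV"            # default DOI named in the text
DEFAULT_SERVER = "https://dataverse.harvard.edu"  # assumed default server URL


def resolve_settings(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Read OPTIMUSKG_* variables, falling back to defaults (hypothetical)."""
    env = os.environ if environ is None else environ
    return {
        "doi": env.get("OPTIMUSKG_DOI", DEFAULT_DOI),
        "server": env.get("OPTIMUSKG_SERVER", DEFAULT_SERVER),
        # None stands for "use the platform default cache directory".
        "cache_dir": env.get("OPTIMUSKG_CACHE_DIR"),
    }


defaults = resolve_settings({})
custom = resolve_settings(
    {
        "OPTIMUSKG_DOI": "doi:10.7910/DVN/EXAMPLE",
        "OPTIMUSKG_SERVER": "https://dataverse.example.org",
        "OPTIMUSKG_CACHE_DIR": "/data/optimuskg-cache",
    }
)

print(defaults)
print(custom)
```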

Agent Skill

OptimusKG ships an agent skill that teaches AI agents how to use the optimuskg client. With the skill installed, you can ask your agent to load, filter, and analyze the graph in natural language, and it already knows the client's API, file layout, and schema.

Installing the skill

Add the skill to any project with the skills CLI:

npx skills add https://github.com/mims-harvard/optimuskg --skill optimuskg

What the skill does

The skill bundles everything an agent needs to be productive with OptimusKG:

  • Loaders: when to reach for get_file, load_parquet, load_graph, or load_networkx, and what each returns.
  • Graph schema: the node type codes, the node and edge columns, and how properties are expanded into typed columns.
  • Common patterns: filtering DataFrames or a NetworkX graph by entity type and relation for analysis, Graph-RAG, or ML.
  • Configuration: caching, and pointing the client at a different Dataverse release or server via set_doi / set_server or environment variables.
  • Citation and license: reminders to cite OptimusKG and to respect each source dataset's terms.

Using it

Once installed, describe what you want in natural language and the agent applies the skill automatically. For example:

Load the largest connected component as a NetworkX graph and list the drugs associated with Alzheimer's disease.

The agent installs optimuskg if needed, picks the right loader, filters by the correct node labels and relations, and caches the downloads, following the conventions documented in the skill.
