OptimusKG Client

Introduction

The optimuskg PyPI package is a thin client for loading the OptimusKG biomedical knowledge graph from Harvard Dataverse. The client resolves Dataverse file IDs automatically, caches downloads locally, and returns the graph as Polars DataFrames or a NetworkX MultiDiGraph.

Looking for function signatures and docstrings? Jump to the API Reference.

Installation

uv add optimuskg

Quick Start

import optimuskg

# Download a specific file and store it locally
local_path = optimuskg.get_file("nodes/gene.parquet")

# Load a single Parquet file as a Polars DataFrame
drugs = optimuskg.load_parquet("nodes/drug.parquet")

# Load nodes and edges as Polars DataFrames
# Set lcc=True to load only the largest connected component
nodes, edges = optimuskg.load_graph(lcc=True)

# Load the graph as a NetworkX MultiDiGraph with metadata
# Set lcc=True to load only the largest connected component
G = optimuskg.load_networkx(lcc=True)

See get_file, load_parquet, load_graph, and load_networkx on the reference page for parameters and return types.

Loader overview

Files

get_file downloads the file if not already cached and returns its local path. Paths follow the layout of the catalog in the source repository:

optimuskg.get_file("nodes.parquet")                               # full nodes table
optimuskg.get_file("edges.parquet")                               # full edges table
optimuskg.get_file("largest_connected_component_nodes.parquet")   # LCC nodes table
optimuskg.get_file("largest_connected_component_edges.parquet")   # LCC edges table
optimuskg.get_file("nodes/gene.parquet")                          # only Gene nodes
optimuskg.get_file("edges/disease_gene.parquet")                  # only DIS-GEN edges

Pass force=True to re-download the file even if it is already cached.
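The interaction between the cache and the force flag can be pictured with a minimal sketch. This is not the client's implementation: fetch_cached, the download callable, and the temporary cache directory are hypothetical stand-ins, used only to show the usual "download when missing or forced" pattern.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fetch_cached(
    relative_path: str,
    cache_dir: Path,
    download: Callable[[str], bytes],
    force: bool = False,
) -> Path:
    """Return the local path for relative_path, downloading only when needed."""
    local_path = cache_dir / relative_path
    if force or not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        local_path.write_bytes(download(relative_path))
    return local_path


# Hypothetical downloader that records how often it is called.
calls: list[str] = []


def fake_download(relative_path: str) -> bytes:
    calls.append(relative_path)
    return b"placeholder bytes"


cache_dir = Path(tempfile.mkdtemp())
first = fetch_cached("nodes/gene.parquet", cache_dir, fake_download)
second = fetch_cached("nodes/gene.parquet", cache_dir, fake_download)  # cache hit
third = fetch_cached("nodes/gene.parquet", cache_dir, fake_download, force=True)
print(len(calls))  # 2: the initial download and the forced re-download
```

All three calls return the same local path; only the first and the forced call reach the downloader.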

DataFrames

load_parquet calls get_file and then polars.read_parquet. Extra keyword arguments are forwarded directly to polars.read_parquet.

drugs = optimuskg.load_parquet("nodes/drug.parquet")
only_ids = optimuskg.load_parquet("nodes/drug.parquet", columns=["id"])

load_graph returns (nodes, edges) as Polars DataFrames. Pass lcc=True to get only the largest connected component.

nodes, edges = optimuskg.load_graph()          # full graph
nodes, edges = optimuskg.load_graph(lcc=True)  # LCC only

DataFrame schemas are documented in Nodes and Edges.

NetworkX

load_networkx builds a networkx.MultiDiGraph on top of load_graph:

G = optimuskg.load_networkx(lcc=True)
print(G.number_of_nodes(), G.number_of_edges())

# Filter by node label
genes = [n for n, attrs in G.nodes(data=True) if attrs["label"] == "GEN"]

# Filter by relation type
expression_edges = [
    (u, v) for u, v, attrs in G.edges(data=True)
    if attrs["relation"] == "EXPRESSION_PRESENT"
]

The properties JSON strings are parsed and merged into the node and edge attribute dicts. Set parse_properties=False to skip this step and keep the raw string.

The graph is always a MultiDiGraph, regardless of edge directionality. Call G.to_undirected() if you need an undirected copy.
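
As a rough illustration of what parsing and merging properties means, here is a minimal sketch using only the standard library. The merge_properties helper, the example row, and its property key are hypothetical, and how the real client resolves a clash between a property key and an existing attribute is not documented here; in this sketch the parsed value wins.

```python
import json


def merge_properties(row: dict, parse_properties: bool = True) -> dict:
    """Build an attribute dict from a row, optionally expanding its JSON properties."""
    attrs = {key: value for key, value in row.items() if key != "properties"}
    raw = row.get("properties")
    if raw is None:
        return attrs
    if parse_properties:
        attrs.update(json.loads(raw))  # parsed keys become top-level attributes
    else:
        attrs["properties"] = raw  # keep the raw JSON string untouched
    return attrs


# Hypothetical node row; real column names and property keys may differ.
row = {"id": "example:1", "label": "GEN", "properties": '{"symbol": "ABC1"}'}

parsed = merge_properties(row)
unparsed = merge_properties(row, parse_properties=False)

print(parsed)    # {'id': 'example:1', 'label': 'GEN', 'symbol': 'ABC1'}
print(unparsed)  # {'id': 'example:1', 'label': 'GEN', 'properties': '{"symbol": "ABC1"}'}
```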

Configuration

Cache directory

Downloads are cached by default in platformdirs.user_cache_dir("optimuskg") (~/Library/Caches/optimuskg on macOS, ~/.cache/optimuskg on Linux, and C:\Users\<User>\AppData\Local\optimuskg\optimuskg on Windows). Files within the cache are keyed by <doi_slug>/<dataset_version>/<relative_path>, so a new Dataverse release invalidates the cache automatically.
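
To see why a new release never collides with older cached files, consider a minimal sketch of the key layout. The cache_key helper, the slug rule, and the version strings "1.0" and "2.0" are assumptions for illustration only; the client's actual slug format may differ.

```python
from pathlib import PurePosixPath


def cache_key(doi: str, dataset_version: str, relative_path: str) -> PurePosixPath:
    """Compose <doi_slug>/<dataset_version>/<relative_path> for a cached file."""
    # Assumed slug rule: drop the "doi:" prefix and replace "/" and "." with "_".
    slug = doi.removeprefix("doi:").replace("/", "_").replace(".", "_")
    return PurePosixPath(slug) / dataset_version / relative_path


old = cache_key("doi:10.7910/DVN/IYNGEV", "1.0", "nodes/gene.parquet")
new = cache_key("doi:10.7910/DVN/IYNGEV", "2.0", "nodes/gene.parquet")

print(old)  # 10_7910_DVN_IYNGEV/1.0/nodes/gene.parquet
print(new)  # 10_7910_DVN_IYNGEV/2.0/nodes/gene.parquet
```

Because the version is part of the key, files from a newer release land in a separate directory and the older copies are simply no longer looked up.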

Override the cache directory from code with set_cache_dir:

optimuskg.set_cache_dir("/data/optimuskg-cache")

Or set the OPTIMUSKG_CACHE_DIR environment variable:

export OPTIMUSKG_CACHE_DIR=/data/optimuskg-cache

Pointing at a different dataset

By default, the client targets the published OptimusKG dataset at doi:10.7910/DVN/IYNGEV. To target a different dataset (for example, a pre-release or private one), override the DOI from code with set_doi:

optimuskg.set_doi("doi:10.7910/DVN/EXAMPLE")

Or set the OPTIMUSKG_DOI environment variable:

export OPTIMUSKG_DOI=doi:10.7910/DVN/EXAMPLE

For a non-Harvard Dataverse installation, override the server URL from code with set_server:

optimuskg.set_server("https://dataverse.example.org")

Or set the OPTIMUSKG_SERVER environment variable:

export OPTIMUSKG_SERVER=https://dataverse.example.org
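
The page does not state which setting wins when both a setter and an environment variable are used. One common arrangement, shown here purely as an assumption, is that a value set from code takes priority, then the environment variable, then the built-in default. The resolve helper and the DEFAULTS table are hypothetical; the default server shown is the public Harvard Dataverse address.

```python
import os
from typing import Mapping, Optional

# Defaults used by this sketch only.
DEFAULTS = {
    "OPTIMUSKG_DOI": "doi:10.7910/DVN/IYNGEV",
    "OPTIMUSKG_SERVER": "https://dataverse.harvard.edu",
}


def resolve(
    name: str,
    override: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick a setting: code override first, then environment, then default."""
    env = os.environ if environ is None else environ
    if override is not None:
        return override
    return env.get(name, DEFAULTS[name])


# Simulated environments so the example does not depend on the real shell.
empty_env: dict = {}
custom_env = {"OPTIMUSKG_SERVER": "https://dataverse.example.org"}

print(resolve("OPTIMUSKG_SERVER", environ=empty_env))   # https://dataverse.harvard.edu
print(resolve("OPTIMUSKG_SERVER", environ=custom_env))  # https://dataverse.example.org
print(resolve("OPTIMUSKG_DOI", override="doi:10.7910/DVN/EXAMPLE", environ=custom_env))
# doi:10.7910/DVN/EXAMPLE
```

If the precedence matters for your setup, check it against the API Reference rather than relying on this sketch.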
