Introduction
Python client for loading the OptimusKG biomedical knowledge graph from Harvard Dataverse.
The optimuskg PyPI package is a thin client for loading the OptimusKG biomedical knowledge graph from Harvard Dataverse. The client resolves Dataverse file IDs automatically, caches downloads locally, and returns the graph as Polars DataFrames or a NetworkX MultiDiGraph.
Looking for function signatures and docstrings? Jump to the API Reference.
Installation
uv add optimuskg
Quick Start
import optimuskg
# Download a specific file and store it locally
local_path = optimuskg.get_file("nodes/gene.parquet")
# Load a single Parquet file as a Polars DataFrame
drugs = optimuskg.load_parquet("nodes/drug.parquet")
# Load nodes and edges as Polars DataFrames
# Set lcc=True to load only the largest connected component
nodes, edges = optimuskg.load_graph(lcc=True)
# Load the graph as a NetworkX MultiDiGraph with metadata
# Set lcc=True to load only the largest connected component
G = optimuskg.load_networkx(lcc=True)
See get_file, load_parquet, load_graph, and load_networkx on the reference page for parameters and return types.
Loader overview
Files
get_file downloads the file if not already cached and returns its local path. Paths follow the layout of the catalog in the source repository:
optimuskg.get_file("nodes.parquet") # full nodes table
optimuskg.get_file("edges.parquet") # full edges table
optimuskg.get_file("largest_connected_component_nodes.parquet") # LCC nodes table
optimuskg.get_file("largest_connected_component_edges.parquet") # LCC edges table
optimuskg.get_file("nodes/gene.parquet") # only Gene nodes
optimuskg.get_file("edges/disease_gene.parquet") # only DIS-GEN edges
Pass force=True to re-download the file even if it is already cached.
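The cache-or-download behaviour described in this section can be illustrated with a minimal, standard-library sketch. It is not the optimuskg implementation: fetch_cached and the download callback are hypothetical names, and the real client fetches from Dataverse rather than calling a stub.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fetch_cached(
    relative_path: str,
    cache_root: Path,
    download: Callable[[str], bytes],
    force: bool = False,
) -> Path:
    """Return the local path for relative_path, downloading only when needed.

    Hypothetical helper: it mimics the documented get_file behaviour
    (reuse the cached copy unless force=True), not its actual code.
    """
    target = cache_root / relative_path
    if force or not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(download(relative_path))
    return target


# Stub downloader that records every call instead of touching the network.
calls = []


def fake_download(relative_path: str) -> bytes:
    calls.append(relative_path)
    return b"parquet-bytes-placeholder"


cache_root = Path(tempfile.mkdtemp())
first = fetch_cached("nodes/gene.parquet", cache_root, fake_download)
second = fetch_cached("nodes/gene.parquet", cache_root, fake_download)  # cache hit
downloads_without_force = len(calls)
fetch_cached("nodes/gene.parquet", cache_root, fake_download, force=True)  # re-download
downloads_with_force = len(calls)
```

In this sketch the second call returns the same path without invoking the downloader, and only the force=True call triggers another download.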
DataFrames
load_parquet calls get_file and then polars.read_parquet. Any extra keyword arguments are forwarded directly to polars.read_parquet.
drugs = optimuskg.load_parquet("nodes/drug.parquet")
only_ids = optimuskg.load_parquet("nodes/drug.parquet", columns=["id"])
load_graph returns (nodes, edges) as Polars DataFrames. Pass lcc=True to get only the largest connected component.
nodes, edges = optimuskg.load_graph() # full graph
nodes, edges = optimuskg.load_graph(lcc=True) # LCC only
NetworkX
load_networkx builds a networkx.MultiDiGraph on top of load_graph:
G = optimuskg.load_networkx(lcc=True)
print(G.number_of_nodes(), G.number_of_edges())
# Filter by node label
genes = [n for n, attrs in G.nodes(data=True) if attrs["label"] == "GEN"]
# Filter by relation type
expression_edges = [
    (u, v) for u, v, attrs in G.edges(data=True)
    if attrs["relation"] == "EXPRESSION_PRESENT"
]
The properties JSON strings are parsed and merged into the node/edge attribute dicts. Set parse_properties=False to skip this and keep the raw string.
The graph is always a MultiDiGraph regardless of edge directionality. Call G.to_undirected() if you need an undirected view.
Configuration
Cache directory
Downloads are cached by default in platformdirs.user_cache_dir("optimuskg") (~/Library/Caches/optimuskg on macOS, ~/.cache/optimuskg on Linux, and C:\Users\<User>\AppData\Local\optimuskg\optimuskg on Windows). Files within the cache are keyed by <doi_slug>/<dataset_version>/<relative_path>, so a new Dataverse release invalidates the cache automatically.
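To make the key layout concrete, the sketch below assembles a cache path from its three parts. It is illustrative only: cache_path and doi_slug are hypothetical helpers, the version strings 1.0 and 2.0 are made-up values, and the slug rule (replacing ":" and "/" with "_") is an assumption, since the way the client derives <doi_slug> is not specified here.

```python
from pathlib import PurePosixPath


def doi_slug(doi: str) -> str:
    # Assumed slug rule: make the DOI safe to use as a single directory name.
    return doi.replace(":", "_").replace("/", "_")


def cache_path(
    cache_root: str, doi: str, dataset_version: str, relative_path: str
) -> PurePosixPath:
    # Layout: <cache_root>/<doi_slug>/<dataset_version>/<relative_path>
    return PurePosixPath(cache_root) / doi_slug(doi) / dataset_version / relative_path


# Same file, two hypothetical dataset versions.
old = cache_path("/data/optimuskg-cache", "doi:10.7910/DVN/IYNGEV", "1.0", "nodes/gene.parquet")
new = cache_path("/data/optimuskg-cache", "doi:10.7910/DVN/IYNGEV", "2.0", "nodes/gene.parquet")
print(old)
print(new)
```

Because the version is part of the path, files from a new release land in a different directory, so copies cached for an older release are never picked up by mistake.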
Override the cache directory from code with set_cache_dir:
optimuskg.set_cache_dir("/data/optimuskg-cache")
Or set the OPTIMUSKG_CACHE_DIR environment variable:
export OPTIMUSKG_CACHE_DIR=/data/optimuskg-cache
Pointing at a different dataset
By default the client targets the published OptimusKG dataset at doi:10.7910/DVN/IYNGEV. To target a different version (e.g. a pre-release or private dataset), override the DOI from code with set_doi:
optimuskg.set_doi("doi:10.7910/DVN/EXAMPLE")
Or set the OPTIMUSKG_DOI environment variable:
export OPTIMUSKG_DOI=doi:10.7910/DVN/EXAMPLE
For a non-Harvard Dataverse installation, override the server URL from code with set_server:
optimuskg.set_server("https://dataverse.example.org")
Or set the OPTIMUSKG_SERVER environment variable:
export OPTIMUSKG_SERVER=https://dataverse.example.org
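Each setting in this section can come from code or from the environment. The sketch below shows one plausible way such a lookup could be resolved. The precedence used here (explicit value, then environment variable, then default) is an assumption rather than documented behaviour, resolve_setting is a hypothetical helper that is not part of optimuskg, and doi:10.7910/DVN/OTHER is a placeholder DOI.

```python
import os
from typing import Mapping, Optional

DEFAULT_DOI = "doi:10.7910/DVN/IYNGEV"  # default dataset named in this document


def resolve_setting(
    explicit: Optional[str],
    env_name: str,
    default: str,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Pick a setting value. The precedence order is assumed, not documented."""
    if explicit is not None:  # value set from code, e.g. via set_doi(...)
        return explicit
    if environ.get(env_name):  # value exported in the shell
        return environ[env_name]
    return default


# Use a plain dict instead of the real environment to keep the demo repeatable.
fake_env = {"OPTIMUSKG_DOI": "doi:10.7910/DVN/EXAMPLE"}

from_default = resolve_setting(None, "OPTIMUSKG_DOI", DEFAULT_DOI, environ={})
from_env = resolve_setting(None, "OPTIMUSKG_DOI", DEFAULT_DOI, environ=fake_env)
from_code = resolve_setting(
    "doi:10.7910/DVN/OTHER", "OPTIMUSKG_DOI", DEFAULT_DOI, environ=fake_env
)
```

If the order in which the client applies these sources matters for your setup, check the API Reference rather than relying on this sketch.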