Introduction
Python client for loading the OptimusKG biomedical knowledge graph from Harvard Dataverse.
The optimuskg PyPI package is a thin client for loading the OptimusKG biomedical knowledge graph from Harvard Dataverse. The client resolves Dataverse file IDs automatically, caches downloads locally, and returns the graph as Polars DataFrames or a NetworkX MultiDiGraph.
Looking for function signatures and docstrings? Jump to the API Reference.
Installation
uv add optimuskg
Quick Start
import optimuskg
# Download a specific file and store it locally
local_path = optimuskg.get_file("nodes/gene.parquet")
# Load a single Parquet file as a Polars DataFrame
drugs = optimuskg.load_parquet("nodes/drug.parquet")
# Load nodes and edges as Polars DataFrames
# Set lcc=True to load only the largest connected component
nodes, edges = optimuskg.load_graph(lcc=True)
# Load the graph as a NetworkX MultiDiGraph with metadata
# Set lcc=True to load only the largest connected component
G = optimuskg.load_networkx(lcc=True)
See get_file, load_parquet, load_graph, and load_networkx on the reference page for parameters and return types.
Loader overview
Files
get_file downloads the file if not already cached and returns its local path. Paths follow the layout of the catalog in the source repository:
optimuskg.get_file("nodes.parquet") # full nodes table
optimuskg.get_file("edges.parquet") # full edges table
optimuskg.get_file("largest_connected_component_nodes.parquet") # LCC nodes table
optimuskg.get_file("largest_connected_component_edges.parquet") # LCC edges table
optimuskg.get_file("nodes/gene.parquet") # only Gene nodes
optimuskg.get_file("edges/disease_gene.parquet") # only DIS-GEN edges
Pass force=True to re-download the file even if it is already cached.
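The download-if-missing behaviour, including force=True, can be pictured with a small stand-alone sketch. This is not the client's implementation: cached_fetch and its download callback are hypothetical names used only for illustration, and the Dataverse file ID resolution that the real client performs is left out.

```python
from pathlib import Path
from typing import Callable


def cached_fetch(
    relative_path: str,
    cache_dir: Path,
    download: Callable[[str], bytes],
    force: bool = False,
) -> Path:
    """Return the local path for relative_path, downloading only when needed.

    Hypothetical helper: neither this function nor `download` is part of
    the optimuskg API.
    """
    local_path = cache_dir / relative_path
    if force or not local_path.exists():
        # Create parent folders such as "nodes/" before writing the file.
        local_path.parent.mkdir(parents=True, exist_ok=True)
        local_path.write_bytes(download(relative_path))
    return local_path
```

With this shape, a second call for the same path skips the download callback entirely, while force=True always invokes it and overwrites the cached copy.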
DataFrames
load_parquet calls get_file and then polars.read_parquet. Any extra keyword arguments are forwarded directly to polars.read_parquet.
drugs = optimuskg.load_parquet("nodes/drug.parquet")
only_ids = optimuskg.load_parquet("nodes/drug.parquet", columns=["id"])
load_graph returns (nodes, edges) as Polars DataFrames. Pass lcc=True to get only the largest connected component.
nodes, edges = optimuskg.load_graph() # full graph
nodes, edges = optimuskg.load_graph(lcc=True) # LCC only
NetworkX
load_networkx builds a networkx.MultiDiGraph on top of load_graph:
G = optimuskg.load_networkx(lcc=True)
print(G.number_of_nodes(), G.number_of_edges())
# Filter by node label
genes = [n for n, attrs in G.nodes(data=True) if attrs["label"] == "GEN"]
# Filter by relation type
expression_edges = [
    (u, v) for u, v, attrs in G.edges(data=True)
    if attrs["relation"] == "EXPRESSION_PRESENT"
]
The properties JSON strings are parsed and merged into the node/edge attribute dicts. Set parse_properties=False to skip this and keep the raw string.
The returned graph is always a MultiDiGraph, regardless of edge directionality. Call G.to_undirected() if you need an undirected view.
Configuration
Cache directory
Downloads are cached by default in platformdirs.user_cache_dir("optimuskg") (~/Library/Caches/optimuskg on macOS, ~/.cache/optimuskg on Linux, and C:\Users\<User>\AppData\Local\optimuskg\optimuskg on Windows). Files within the cache are keyed by <doi_slug>/<dataset_version>/<relative_path>, so a new Dataverse release invalidates the cache automatically.
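To make the cache key layout concrete, here is a minimal sketch of how such a path could be composed. The cache_entry helper, the slug string, and the version strings are all hypothetical; how the client actually derives the DOI slug and version is not specified here.

```python
from pathlib import Path


def cache_entry(
    cache_root: str, doi_slug: str, dataset_version: str, relative_path: str
) -> Path:
    """Compose <cache_root>/<doi_slug>/<dataset_version>/<relative_path>."""
    return Path(cache_root) / doi_slug / dataset_version / relative_path


# Hypothetical slug and version values, chosen only to show the layout.
old = cache_entry("/data/optimuskg-cache", "example-doi-slug", "1.0", "nodes/gene.parquet")
new = cache_entry("/data/optimuskg-cache", "example-doi-slug", "2.0", "nodes/gene.parquet")

# Same file, different release: the two paths differ, so a lookup for the
# new release never finds the copy cached for the old one.
assert old != new
```

Because the version is part of the path, no explicit invalidation step is needed. Under this layout, files from an earlier release would presumably stay on disk until they are removed by hand.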
Override the cache directory from code with set_cache_dir:
optimuskg.set_cache_dir("/data/optimuskg-cache")
Or by using an environment variable:
export OPTIMUSKG_CACHE_DIR=/data/optimuskg-cache
Pointing at a different dataset
By default the client targets the published OptimusKG dataset at doi:10.7910/DVN/IYNGEV. To target a different version (e.g. a pre-release or private dataset), override the DOI from code with set_doi:
optimuskg.set_doi("doi:10.7910/DVN/EXAMPLE")
Or by using an environment variable:
export OPTIMUSKG_DOI=doi:10.7910/DVN/EXAMPLE
For a non-Harvard Dataverse installation, override the server URL from code with set_server:
optimuskg.set_server("https://dataverse.example.org")
Or by using an environment variable:
export OPTIMUSKG_SERVER=https://dataverse.example.org
Agent Skill
OptimusKG ships an agent skill that teaches AI agents how to use the optimuskg client. With the skill installed, you can ask your agent to load, filter, and analyze the graph in natural language, and it already knows the client's API, file layout, and schema.
Installing the skill
Add the skill to any project with the skills CLI:
npx skills add https://github.com/mims-harvard/optimuskg --skill optimuskg
What the skill does
The skill bundles everything an agent needs to be productive with OptimusKG:
- Loaders: when to reach for get_file, load_parquet, load_graph, or load_networkx, and what each returns.
- Graph schema: the node type codes, the node and edge columns, and how properties are expanded into typed columns.
- Common patterns: filtering DataFrames or a NetworkX graph by entity type and relation for analysis, Graph-RAG, or ML.
- Configuration: caching, and pointing the client at a different Dataverse release or server via set_doi / set_server or environment variables.
- Citation and license: reminders to cite OptimusKG and to respect each source dataset's terms.
Using it
Once installed, describe what you want in natural language and the agent applies the skill automatically. For example:
Load the largest connected component as a NetworkX graph and list the drugs associated with Alzheimer's disease.
The agent installs optimuskg if needed, picks the right loader, filters by the correct node labels and relations, and caches the downloads, following the conventions documented in the skill.
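For comparison, a hand-written version of this kind of request combines the node-label and relation filters shown in the NetworkX section. The sketch below is illustrative only: edges_between_labels is a hypothetical helper, not part of the client, and it assumes only that nodes carry a label attribute and edges a relation attribute, as in the earlier examples. The label codes for drugs and diseases are not listed on this page, so they are left as placeholders.

```python
from typing import Any


def edges_between_labels(G: Any, source_label: str, target_label: str) -> list:
    """Return (u, v, relation) for edges from source_label nodes to target_label nodes.

    Hypothetical helper. It relies only on G.nodes(data=True) and
    G.edges(data=True), which a networkx.MultiDiGraph provides.
    """
    # Map each node ID to its label; .get avoids a KeyError on unlabeled nodes.
    labels = {node: attrs.get("label") for node, attrs in G.nodes(data=True)}
    return [
        (u, v, attrs.get("relation"))
        for u, v, attrs in G.edges(data=True)
        if labels.get(u) == source_label and labels.get(v) == target_label
    ]


# Possible usage (placeholders, since the actual label codes are not given here):
# G = optimuskg.load_networkx(lcc=True)
# pairs = edges_between_labels(G, "<drug label>", "<disease label>")
```

Since the graph is directed, an association may be stored in either direction, so it can be worth running the helper with the two labels swapped as well.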