OptimusKG

Introduction

OptimusKG is an opinionated, production-ready data pipeline for constructing, validating, and maintaining biomedical knowledge graphs following software engineering best practices.

Principles

  • Ready-to-use: Comes with pre-built processing nodes that unify many biomedical data sources into a single knowledge graph.
  • Reproducible: All data transformations are deterministic, validated with checksums, and infrastructure-agnostic.
  • Extensible: A superset of the Kedro framework (hosted by the Linux Foundation), providing a uniform project template, dataset abstraction, configuration management, and pipeline assembly.
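
To make the checksum validation behind the Reproducible principle concrete, here is a minimal sketch of how a downloaded file could be checked against a known digest. It is an illustration only, not OptimusKG's implementation: the helper names `sha256_of` and `verify_checksum` are hypothetical, and the choice of SHA-256 is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large source files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected: str) -> None:
    """Raise ValueError if the file's digest differs from the expected one."""
    actual = sha256_of(path)
    if actual != expected.lower():
        raise ValueError(
            f"Checksum mismatch for {path}: expected {expected}, got {actual}"
        )
```

Failing loudly on a mismatch means a silently changed upstream file stops the run instead of producing a different graph.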

Architecture at a Glance

| Component  | Description |
|------------|-------------|
| Catalog    | The single source of truth for all datasets, their schemas, format, and metadata. |
| Dataset    | An abstraction that handles file formats, storage locations, and persistence logic. |
| Node       | A pure Python function whose output follows solely from its input values. |
| Pipeline   | A sequence of nodes wired into a DAG-based workflow. |
| Layer      | Follows the medallion architecture: landing, bronze, silver, and gold. |
| Parameters | Constants for filtering data across the construction process. |
| Provider   | An abstraction for versioned, automatic data downloads from external sources. |
| Hook       | Injects custom behavior into the execution flow (e.g., checksum validation). |
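
The relationship between nodes, pipelines, the catalog, and parameters can be sketched with the Python standard library alone. The example below is not the Kedro or OptimusKG API: `Node`, `run_pipeline`, the dataset names, and the gene-filtering functions are all hypothetical, and a plain dictionary stands in for the catalog. It only shows how pure functions wired by dataset names resolve into a DAG execution order.

```python
from graphlib import TopologicalSorter
from typing import Callable, NamedTuple


class Node(NamedTuple):
    """A pure function plus the names of the datasets it reads and writes."""
    func: Callable
    inputs: tuple[str, ...]
    output: str


def run_pipeline(nodes: list[Node], catalog: dict) -> dict:
    """Run nodes in dependency order; the catalog dict stands in for a real catalog."""
    producers = {n.output: n for n in nodes}
    # Each output depends on those inputs that are produced by another node.
    graph = {n.output: {i for i in n.inputs if i in producers} for n in nodes}
    data = dict(catalog)  # copy so the caller's catalog is not mutated
    for name in TopologicalSorter(graph).static_order():
        n = producers[name]
        data[name] = n.func(*(data[i] for i in n.inputs))
    return data


def normalize_ids(raw_genes: list[str]) -> list[str]:
    """Bronze to silver: trim, upper-case, and deduplicate identifiers."""
    return sorted({g.strip().upper() for g in raw_genes})


def filter_genes(genes: list[str], min_length: int) -> list[str]:
    """Silver to gold: keep identifiers at least min_length characters long."""
    return [g for g in genes if len(g) >= min_length]


# Nodes are declared out of order on purpose; the DAG determines execution order.
nodes = [
    Node(filter_genes, ("silver_genes", "min_length"), "gold_genes"),
    Node(normalize_ids, ("bronze_genes",), "silver_genes"),
]
catalog = {"bronze_genes": [" tp53", "BRCA1 ", "tp53", "il6"], "min_length": 4}
result = run_pipeline(nodes, catalog)
print(result["gold_genes"])  # ['BRCA1', 'TP53']
```

Because each node depends only on its inputs, rerunning the pipeline on the same catalog yields the same outputs, which is what makes the DAG reproducible.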
