Architecture Overview

How OptimusKG is structured and how its components work together.

OptimusKG is built on top of the Kedro framework and follows a medallion architecture pattern to transform raw biomedical data into a unified knowledge graph.

Core Components

The pipeline is composed of the following components:

Component      Description
Layers         Medallion architecture: landing, bronze, silver, gold
Catalog        Dataset definitions with schemas, checksums, and metadata
Pipelines      DAG-based workflows composed of pure Python functions
Hooks          Lifecycle callbacks for validation and data downloads
Runners        Execution engines for parallel and dry-run modes
Datasets       Custom dataset types for OWL, XML, SQL dumps, etc.
Providers      Automatic data download from external sources
Configuration  Environment-based settings management
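
The Catalog and Hooks entries mention checksums and validation. As a rough illustration of that idea, the sketch below checks a file against a recorded SHA-256 digest using only the Python standard library. The catalog_entry dictionary, the function names, and the choice of SHA-256 are assumptions made for this example; the actual OptimusKG catalog format and hook implementation may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_checksum(path: Path, expected: str) -> None:
    """Raise ValueError if the file does not match the expected digest."""
    actual = sha256_of(path)
    if actual != expected:
        raise ValueError(f"checksum mismatch for {path.name}")


mismatch_detected = False

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a raw file in the landing layer.
    raw_file = Path(tmp) / "example.tsv"
    raw_file.write_bytes(b"gene\tdisease\nBRCA1\tbreast cancer\n")

    # Hypothetical catalog entry: a path plus the digest recorded for it.
    catalog_entry = {"filepath": raw_file, "sha256": sha256_of(raw_file)}

    # A matching digest passes silently.
    validate_checksum(catalog_entry["filepath"], catalog_entry["sha256"])

    # A wrong digest is rejected before any downstream step would run.
    try:
        validate_checksum(catalog_entry["filepath"], "0" * 64)
    except ValueError:
        mismatch_detected = True
```

A check like this would typically run before a dataset is loaded, so that a corrupted or changed download fails fast instead of propagating into later layers.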

Data Flow

External Sources -> Landing -> Bronze -> Silver -> Gold -> Export (CSV/Parquet/Neo4j)
  1. Landing: Raw data files downloaded from external sources via Providers
  2. Bronze: Extracted and standardized Polars DataFrames
  3. Silver: Consolidated entities and relationships across sources
  4. Gold: Final knowledge graph exported via BioCypher
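
The flow above can be sketched as a small DAG of pure functions. This is a toy model only: it uses the standard-library graphlib module and plain dictionaries in place of Kedro and Polars, and every function name, record, and the "associated_with" edge type is hypothetical.

```python
from graphlib import TopologicalSorter


def land():
    # Landing: raw rows as they might arrive from two external sources.
    return {
        "source_a": [{"Gene": " brca1 ", "Disease": "Breast Cancer"}],
        "source_b": [
            {"gene_symbol": "BRCA1", "disease_name": "breast cancer"},
            {"gene_symbol": "TP53", "disease_name": "li-fraumeni syndrome"},
        ],
    }


def to_bronze(landing):
    # Bronze: standardize column names and value formatting per source.
    rows = []
    for r in landing["source_a"]:
        rows.append({"gene": r["Gene"].strip().upper(),
                     "disease": r["Disease"].strip().lower()})
    for r in landing["source_b"]:
        rows.append({"gene": r["gene_symbol"].strip().upper(),
                     "disease": r["disease_name"].strip().lower()})
    return rows


def to_silver(bronze):
    # Silver: consolidate relationships, dropping cross-source duplicates.
    return sorted({(r["gene"], r["disease"]) for r in bronze})


def to_gold(silver):
    # Gold: node and edge lists ready for export.
    nodes = sorted({name for edge in silver for name in edge})
    edges = [{"source": g, "target": d, "type": "associated_with"}
             for g, d in silver]
    return {"nodes": nodes, "edges": edges}


# Each step maps to (function, names of the upstream steps it consumes).
steps = {
    "landing": (land, []),
    "bronze": (to_bronze, ["landing"]),
    "silver": (to_silver, ["bronze"]),
    "gold": (to_gold, ["silver"]),
}


def run(steps):
    # Resolve execution order from the dependency graph, then call each step.
    graph = {name: set(deps) for name, (_, deps) in steps.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = steps[name]
        results[name] = func(*(results[d] for d in deps))
    return results


results = run(steps)
```

Because each step is a pure function of its upstream outputs, the execution order follows from the declared dependencies rather than from the order in which the steps are written.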

Technology Stack

  • Kedro - Pipeline orchestration and project structure
  • Polars - All data transformations (not Pandas)
  • BioCypher - Knowledge graph schema and export
  • Docker - Runs the Neo4j database used for graph exploration
