Architecture Overview

OptimusKG is built on top of the Kedro framework and follows a medallion architecture pattern to transform raw biomedical data into a unified knowledge graph.

Core Components

The pipeline is composed of the following components:

Component	Description
Layers	Medallion architecture: landing, bronze, silver, gold
Catalog	Dataset definitions with schemas, checksums, and metadata
Pipelines	DAG-based workflows composed of pure Python functions
Hooks	Lifecycle callbacks for validation and data downloads
Runners	Execution engines for parallel and dry-run modes
Datasets	Custom dataset types for OWL, XML, SQL dumps, etc.
Providers	Automatic data download from external sources
Configuration	Environment-based settings management
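To make the "DAG-based workflows composed of pure Python functions" idea concrete, here is a minimal sketch that uses only the Python standard library. It does not use Kedro's API. The node functions (parse, dedupe, count), the dataset names, and the run helper are all hypothetical, chosen only to show how a catalog of named datasets and a set of pure functions can be resolved into an execution order.

```python
from graphlib import TopologicalSorter

# Hypothetical node functions: pure, so the output depends only on the inputs.
def parse(raw: str) -> list[str]:
    return [line.strip() for line in raw.splitlines() if line.strip()]

def dedupe(names: list[str]) -> list[str]:
    return sorted(set(names))

def count(names: list[str]) -> int:
    return len(names)

# Each node is (function, input dataset names, output dataset name).
# The dataset names are illustrative, not taken from the OptimusKG catalog.
NODES = {
    "parse": (parse, ["raw_genes"], "bronze_genes"),
    "dedupe": (dedupe, ["bronze_genes"], "silver_genes"),
    "count": (count, ["silver_genes"], "gene_count"),
}

def run(nodes, catalog):
    """Order nodes by their dataset dependencies, then execute them."""
    # Map each output dataset to the node that produces it.
    producers = {out: name for name, (_, _, out) in nodes.items()}
    # A node depends on every node that produces one of its inputs.
    graph = {
        name: {producers[i] for i in inputs if i in producers}
        for name, (_, inputs, _) in nodes.items()
    }
    data = dict(catalog)
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        func, inputs, output = nodes[name]
        data[output] = func(*(data[i] for i in inputs))
    return order, data

order, data = run(NODES, {"raw_genes": "BRCA1\nTP53\nBRCA1\n"})
print(order)
print(data["gene_count"])
```

Because the nodes never touch files or global state, the execution order can be derived entirely from the dataset names, which is the property that makes parallel and dry-run execution modes possible in principle.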

Data Flow

External Sources -> Landing -> Bronze -> Silver -> Gold -> Export (CSV/Parquet/Neo4j)

Landing: Raw data files downloaded from external sources via Providers
Bronze: Extracted and standardized Polars DataFrames
Silver: Consolidated entities and relationships across sources
Gold: Final knowledge graph exported via BioCypher
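The layer responsibilities above can be sketched as a chain of small functions. This is an illustration under stated assumptions, not the OptimusKG implementation: it uses plain Python lists and the standard library csv module instead of Polars DataFrames so that it runs without dependencies, and the two sources, their contents, the function names, and the "associated_with" edge label are all made up.

```python
import csv
import io

# Hypothetical landing-layer payloads from two made-up sources.
LANDING = {
    "source_a": "gene,disease\nBRCA1,breast cancer\nTP53,li-fraumeni syndrome\n",
    "source_b": "GENE;DISEASE\nbrca1;Breast Cancer\n",
}

def to_bronze(raw: str, delimiter: str) -> list[dict[str, str]]:
    """Bronze: parse one raw file and standardize its column names."""
    reader = csv.DictReader(io.StringIO(raw), delimiter=delimiter)
    return [{key.lower(): value for key, value in row.items()} for row in reader]

def to_silver(tables: list[list[dict[str, str]]]) -> list[tuple[str, str]]:
    """Silver: normalize values and deduplicate records across sources."""
    pairs = {
        (row["gene"].upper(), row["disease"].lower())
        for table in tables
        for row in table
    }
    return sorted(pairs)

def to_gold(pairs: list[tuple[str, str]]) -> dict[str, list]:
    """Gold: shape the consolidated records as graph nodes and edges."""
    nodes = sorted({gene for gene, _ in pairs} | {disease for _, disease in pairs})
    edges = [(gene, "associated_with", disease) for gene, disease in pairs]
    return {"nodes": nodes, "edges": edges}

bronze = [
    to_bronze(LANDING["source_a"], ","),
    to_bronze(LANDING["source_b"], ";"),
]
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold["edges"])
```

Here the BRCA1 record appears in both sources with different casing and delimiters, and the silver step collapses it into a single relationship before the gold step builds the graph.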

Technology Stack

Kedro - Pipeline orchestration and project structure
Polars - All data transformations (not Pandas)
BioCypher - Knowledge graph schema and export
Docker - Runs the Neo4j database used for graph exploration
