Architecture Overview
How OptimusKG is structured and how its components work together.
OptimusKG is built on top of the Kedro framework and follows a medallion architecture pattern to transform raw biomedical data into a unified knowledge graph.
Core Components
The pipeline is composed of the following components:
| Component | Description |
|---|---|
| Layers | Medallion architecture: landing, bronze, silver, gold |
| Catalog | Dataset definitions with schemas, checksums, and metadata |
| Pipelines | DAG-based workflows composed of pure Python functions |
| Hooks | Lifecycle callbacks for validation and data downloads |
| Runners | Execution engines for parallel and dry-run modes |
| Datasets | Custom dataset types for OWL, XML, SQL dumps, etc. |
| Providers | Automatic data download from external sources |
| Configuration | Environment-based settings management |
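
To make the "DAG-based workflows composed of pure Python functions" idea concrete, here is a minimal sketch that uses only the Python standard library. It is not the Kedro API and not OptimusKG code. The node functions (`extract`, `standardize`, `count_entities`), the `NODES` mapping, and the `run` helper are all hypothetical names chosen for illustration. Each node declares its input and output dataset names, and the execution order is derived from those declarations.

```python
from graphlib import TopologicalSorter

# Hypothetical pure node functions: output depends only on the inputs.
def extract(raw):
    return [line.strip().split(",") for line in raw]

def standardize(rows):
    return [{"id": r[0].upper(), "name": r[1].lower()} for r in rows]

def count_entities(records):
    return len({rec["id"] for rec in records})

# Each node: (function, input dataset names, output dataset name).
# The declaration order is deliberately shuffled; the DAG decides run order.
NODES = {
    "count": (count_entities, ["records"], "n_entities"),
    "extract": (extract, ["raw"], "rows"),
    "standardize": (standardize, ["rows"], "records"),
}

def run(nodes, initial_data):
    """Resolve node order from dataset dependencies, then execute each node."""
    # Map every produced dataset back to the node that produces it.
    producers = {out: name for name, (_, _, out) in nodes.items()}
    # A node depends on whichever nodes produce its inputs.
    graph = {
        name: {producers[i] for i in ins if i in producers}
        for name, (_, ins, _) in nodes.items()
    }
    data = dict(initial_data)
    for name in TopologicalSorter(graph).static_order():
        func, ins, out = nodes[name]
        data[out] = func(*(data[i] for i in ins))
    return data

result = run(NODES, {"raw": ["a1,Aspirin\n", "a1,ASPIRIN\n", "b2,Ibuprofen\n"]})
print(result["n_entities"])
```

Because the functions have no side effects, the same node can be re-run, tested in isolation, or scheduled in parallel once its inputs exist, which is the property a pipeline framework relies on.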
Data Flow
External Sources -> Landing -> Bronze -> Silver -> Gold -> Export (CSV/Parquet/Neo4j)
- Landing: Raw data files downloaded from external sources via Providers
- Bronze: Extracted and standardized Polars DataFrames
- Silver: Consolidated entities and relationships across sources
- Gold: Final knowledge graph exported via BioCypher
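
The layer progression above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not OptimusKG code: it uses dictionaries instead of Polars DataFrames, handles entities only (no relationships), and does not call BioCypher. The source names, column names, and the `to_bronze`, `to_silver`, and `to_gold` helpers are hypothetical.

```python
# Landing: raw payloads as they might arrive from two hypothetical sources.
landing = {
    "source_a": "gene_id,symbol\n1,tp53\n2,brca1\n",
    "source_b": "gene_id,symbol\n2,BRCA1\n3,EGFR\n",
}

def to_bronze(raw_text, source):
    """Extract rows from raw text and standardize field names and casing."""
    header, *lines = raw_text.strip().splitlines()
    columns = header.split(",")
    rows = []
    for line in lines:
        record = dict(zip(columns, line.split(",")))
        rows.append({
            "id": record["gene_id"],
            "symbol": record["symbol"].upper(),
            "source": source,
        })
    return rows

def to_silver(bronze_tables):
    """Consolidate the same entity across sources into a single record."""
    merged = {}
    for rows in bronze_tables:
        for row in rows:
            entity = merged.setdefault(
                row["id"],
                {"id": row["id"], "symbol": row["symbol"], "sources": []},
            )
            entity["sources"].append(row["source"])
    return sorted(merged.values(), key=lambda e: e["id"])

def to_gold(entities):
    """Shape consolidated entities into graph nodes ready for export."""
    return {
        "nodes": [
            {"id": f"gene:{e['id']}", "label": "Gene", "symbol": e["symbol"]}
            for e in entities
        ]
    }

bronze = [to_bronze(text, name) for name, text in landing.items()]
silver = to_silver(bronze)
gold = to_gold(silver)
print(len(gold["nodes"]), "nodes")
```

Each layer only reads from the layer before it, so a change in one source's raw format is contained in its Bronze step and does not ripple into Silver or Gold.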
Technology Stack
- Kedro - Pipeline orchestration and project structure
- Polars - All data transformations (not Pandas)
- BioCypher - Knowledge graph schema and export
- Docker - Runs the Neo4j database used for graph exploration