
Medallion Layers

The four-layer data architecture from raw data to knowledge graph.

OptimusKG follows the medallion architecture pattern, organizing data into four layers of increasing refinement: Landing, Bronze, Silver, and Gold.

Landing

Raw data files, exactly as downloaded from external sources. No transformations are applied. Files are stored in data/landing/, organized by source.

The Origin Hook automatically downloads data into this layer when datasets are first accessed.
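To make the "organized by source" layout concrete, here is a minimal sketch that groups landing files by source. It assumes each source gets its own top-level directory under data/landing/. That layout, and the helper name files_by_source, are illustrative assumptions and not part of OptimusKG.

```python
from collections import defaultdict
from pathlib import Path


def files_by_source(landing_dir: Path) -> dict[str, list[str]]:
    """Group landing files by the name of their top-level source directory.

    Assumes a layout of <landing_dir>/<source>/..., which is a guess at how
    "organized by source" is realized on disk.
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for path in sorted(landing_dir.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(landing_dir)
        if len(relative.parts) < 2:
            continue  # skip files that do not sit under a source directory
        grouped[relative.parts[0]].append(relative.as_posix())
    return dict(grouped)
```

Calling files_by_source(Path("data/landing")) would return one entry per source directory, with file paths relative to the landing root.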

Bronze

Location: optimuskg/pipelines/bronze/

Extracts and standardizes raw data from each source. Each data source has its own node file under nodes/. Outputs standardized Polars DataFrames stored as Parquet in data/bronze/.

Key responsibilities:

  • Parse source-specific file formats (CSV, XML, OWL, SQL dumps)
  • Normalize column names to snake_case
  • Extract relevant fields into a consistent schema
  • Store as typed Parquet files with schema metadata
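
The column-name normalization step above can be sketched in plain Python. This is one possible snake_case rule, not the project's actual implementation. The function names to_snake_case and rename_mapping are hypothetical.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a column name such as 'GeneSymbol' or 'Gene ID' to snake_case."""
    # Split a lowercase letter or digit followed by an uppercase letter:
    # 'geneSymbol' -> 'gene_Symbol'.
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name.strip())
    # Split an acronym followed by a capitalized word:
    # 'HGNCSymbol' -> 'HGNC_Symbol'.
    name = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", name)
    # Collapse any run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


def rename_mapping(columns: list[str]) -> dict[str, str]:
    """Build an old-name -> new-name mapping for a list of column names."""
    return {column: to_snake_case(column) for column in columns}
```

With Polars, a mapping like this could be passed to DataFrame.rename before writing the frame to Parquet. How the bronze nodes actually perform the rename is not specified here.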

Silver

Location: optimuskg/pipelines/silver/

Consolidates entities across multiple sources into unified node and edge tables. This is where cross-source reconciliation happens.

Node types: Gene, Drug, Disease, Protein, Anatomy, Pathway, Phenotype, Exposure, Biological Process, Cellular Component, Molecular Function

Edge types: ~26 relationship types connecting the node types above.
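
As an illustration of cross-source consolidation, the sketch below merges per-source records that share an identifier into one node per entity. It assumes the sources already agree on an "id" field and that the first non-null value should win. The real reconciliation rules are not described on this page, so the function name, field names, and merge policy are all assumptions.

```python
def consolidate_nodes(records_by_source: dict[str, list[dict]]) -> list[dict]:
    """Merge per-source records sharing an 'id' into one node per entity.

    Hypothetical policy: the first non-null value for a field wins, and
    later sources only fill in fields that are still missing.
    """
    merged: dict[str, dict] = {}
    for source, records in records_by_source.items():
        for record in records:
            node = merged.setdefault(record["id"], {"id": record["id"], "sources": []})
            node["sources"].append(source)  # keep provenance for each node
            for key, value in record.items():
                if key != "id" and node.get(key) is None and value is not None:
                    node[key] = value
    return list(merged.values())
```

Tracking the contributing sources on each node keeps provenance visible after the merge, which helps when two sources disagree about a field.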

Gold

Location: optimuskg/pipelines/gold/

Exports the final knowledge graph via BioCypher in configured formats (CSV, Parquet, Neo4j-JSONL). The exported graph is written to data/gold/formats/.
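
The idea of writing one graph in several configured formats can be sketched with the standard library alone. This does not use BioCypher and does not reproduce its output layout. The function export_nodes, the file names, and the JSONL record shape are assumptions made for illustration.

```python
import csv
import json
from pathlib import Path


def export_nodes(nodes: list[dict], out_dir: Path, formats: list[str]) -> list[Path]:
    """Write the node table once per requested format; return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for fmt in formats:
        if fmt == "csv":
            path = out_dir / "nodes.csv"
            # Use the union of all keys so every node fits the same header.
            fieldnames = sorted({key for node in nodes for key in node})
            with path.open("w", newline="", encoding="utf-8") as handle:
                writer = csv.DictWriter(handle, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerows(nodes)
        elif fmt == "jsonl":
            path = out_dir / "nodes.jsonl"
            with path.open("w", encoding="utf-8") as handle:
                for node in nodes:
                    # One JSON object per line.
                    handle.write(json.dumps(node, sort_keys=True) + "\n")
        else:
            raise ValueError(f"unsupported format: {fmt}")
        written.append(path)
    return written
```

For example, export_nodes(nodes, Path("data/gold/formats"), ["csv", "jsonl"]) would write both files into the gold output directory.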
