Catalog

The single source of truth for all datasets in the pipeline.

The catalog is a collection of YAML files in conf/base/catalog/ that defines every dataset in the pipeline. Each entry specifies the dataset's type, filepath, schema, and metadata.

Structure

Catalog files are organized by layer and source:

conf/base/catalog/
  landing/
    opentargets.yml
    drugbank.yml
    ...
  bronze/
    opentargets.yml
    drugbank.yml
    ...
  silver/
    nodes.yml
    edges.yml
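
The layout above can be inspected programmatically. The following is a minimal sketch and not part of OptimusKG: the helper name catalog_files_by_layer is hypothetical, and it assumes only that each layer is an immediate subdirectory of the catalog root containing .yml files.

```python
from __future__ import annotations

from collections import defaultdict
from pathlib import Path


def catalog_files_by_layer(catalog_root: str | Path) -> dict[str, list[str]]:
    """Group catalog YAML files by layer directory (hypothetical helper)."""
    root = Path(catalog_root)
    by_layer: dict[str, list[str]] = defaultdict(list)
    # Assumption: each immediate subdirectory (landing/, bronze/, silver/)
    # is one layer, and each .yml file inside it is one catalog file.
    for path in sorted(root.glob("*/*.yml")):
        by_layer[path.parent.name].append(path.name)
    return dict(by_layer)
```

Calling catalog_files_by_layer("conf/base/catalog") from the project root would return a mapping from layer name to the catalog files in that layer.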

Entry Format

A typical catalog entry:

bronze.opentargets.disease:
  type: optimuskg.datasets.polars.ParquetDataset
  filepath: data/bronze/opentargets/disease.parquet
  schema:
    disease_id: Utf8
    disease_name: Utf8
    therapeutic_areas: List(Utf8)
  metadata:
    checksum: "abc123..."
    origin:
      provider: http
      url: https://...

Key Fields

Field              Description
type               The dataset class (e.g., ParquetDataset, OWLDataset)
filepath           Path to the data file, relative to the project root
schema             Column names and Polars types for typed datasets
metadata.checksum  BLAKE2b checksum for file integrity validation
metadata.origin    Download configuration for automatic data retrieval
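
To make the field list concrete, here is a hedged sketch of a sanity check over one entry once it has been loaded into a Python dict. The function check_entry and the REQUIRED_FIELDS tuple are illustrative names, not OptimusKG APIs, and treating type and filepath as mandatory is an assumption based on the example entry rather than a documented rule.

```python
# Assumption: every entry carries at least these two fields.
REQUIRED_FIELDS = ("type", "filepath")


def check_entry(name: str, entry: dict) -> list:
    """Return a list of problems found in one catalog entry (illustrative only)."""
    problems = [
        f"{name}: missing '{field}'" for field in REQUIRED_FIELDS if field not in entry
    ]
    # schema is optional; when present, each column should map to a type name string.
    for column, dtype in entry.get("schema", {}).items():
        if not isinstance(dtype, str):
            problems.append(f"{name}: schema type for '{column}' should be a string")
    return problems


# The example entry from this page, written as the dict a YAML loader would produce.
entry = {
    "type": "optimuskg.datasets.polars.ParquetDataset",
    "filepath": "data/bronze/opentargets/disease.parquet",
    "schema": {
        "disease_id": "Utf8",
        "disease_name": "Utf8",
        "therapeutic_areas": "List(Utf8)",
    },
    "metadata": {
        "checksum": "abc123...",
        "origin": {"provider": "http", "url": "https://..."},
    },
}

print(check_entry("bronze.opentargets.disease", entry))
```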

Checksums

Every dataset with a metadata.checksum field is validated by the Checksum Hook before being loaded. If the checksum doesn't match, the pipeline raises an error.
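
The validation step can be pictured roughly as follows. This is a sketch under stated assumptions, not the Checksum Hook's actual code: the function names are hypothetical, the digest uses hashlib's default BLAKE2b size (64 bytes) because the page does not state the size or encoding OptimusKG uses, and the error type is chosen for illustration.

```python
import hashlib


def blake2b_hexdigest(path, chunk_size=1 << 20):
    """Compute a BLAKE2b hex digest of a file, reading it in chunks."""
    # Assumption: default digest size; the project's setting is not documented here.
    digest = hashlib.blake2b()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected):
    """Raise if the file's digest differs from the expected checksum."""
    actual = blake2b_hexdigest(path)
    if actual != expected:
        raise ValueError(
            f"Checksum mismatch for {path}: expected {expected}, got {actual}"
        )
```

Reading in chunks keeps memory use flat for large Parquet files, which matters more than the exact chunk size chosen here.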

To update checksums after modifying a node:

uv run cli sync-catalog --dataset <catalog_id>

Never delete the checksum property from a catalog entry. If the code that generates a file changes, rerun the node and sync the catalog.
