Catalog
The single source of truth for all datasets in the pipeline.
The catalog is a collection of YAML files in conf/base/catalog/ that defines every dataset in the pipeline. Each entry specifies the dataset's type, filepath, schema, and metadata.
Structure
Catalog files are organized by layer and source:
    conf/base/catalog/
      landing/
        opentargets.yml
        drugbank.yml
        ...
      bronze/
        opentargets.yml
        drugbank.yml
        ...
      silver/
        nodes.yml
        edges.yml

Entry Format
A typical catalog entry:
    bronze.opentargets.disease:
      type: optimuskg.datasets.polars.ParquetDataset
      filepath: data/bronze/opentargets/disease.parquet
      schema:
        disease_id: Utf8
        disease_name: Utf8
        therapeutic_areas: List(Utf8)
      metadata:
        checksum: "abc123..."
        origin:
          provider: http
          url: https://...

Key Fields
| Field | Description |
|---|---|
| type | The dataset class (e.g., ParquetDataset, OWLDataset) |
| filepath | Path to the data file relative to project root |
| schema | Column names and Polars types for typed datasets |
| metadata.checksum | BLAKE2b checksum for file integrity validation |
| metadata.origin | Download configuration for automatic data retrieval |
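To make the field layout concrete, here is a minimal sketch that checks one catalog entry, already parsed into a Python dict, against the fields in the table above. The function name `check_entry`, the constants, and the choice of which fields are mandatory are all assumptions made for illustration. They are not part of the pipeline's code.

```python
from typing import Any

# Field names come from the Key Fields table. Treating `type` and `filepath`
# as mandatory and the rest as optional is an assumption of this sketch.
REQUIRED_FIELDS = ("type", "filepath")
OPTIONAL_FIELDS = ("schema", "metadata")


def check_entry(catalog_id: str, entry: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in one parsed catalog entry."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not entry.get(field):
            problems.append(f"{catalog_id}: missing '{field}'")
    unknown = set(entry) - set(REQUIRED_FIELDS) - set(OPTIONAL_FIELDS)
    for field in sorted(unknown):
        problems.append(f"{catalog_id}: unexpected field '{field}'")
    metadata = entry.get("metadata") or {}
    if "checksum" not in metadata:
        problems.append(f"{catalog_id}: no metadata.checksum")
    return problems


# The example entry from this page, written as the dict a YAML parser would
# produce. Elided values are kept elided.
entry = {
    "type": "optimuskg.datasets.polars.ParquetDataset",
    "filepath": "data/bronze/opentargets/disease.parquet",
    "schema": {
        "disease_id": "Utf8",
        "disease_name": "Utf8",
        "therapeutic_areas": "List(Utf8)",
    },
    "metadata": {
        "checksum": "abc123...",
        "origin": {"provider": "http", "url": "https://..."},
    },
}

print(check_entry("bronze.opentargets.disease", entry))
```

A check like this could run in CI to catch entries that lost their checksum, though the pipeline may already enforce this elsewhere.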
Checksums
Every dataset with a metadata.checksum field is validated by the Checksum Hook before being loaded. If the checksum doesn't match, the pipeline raises an error.
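The validation described above can be sketched with Python's standard `hashlib.blake2b`. This is an illustration only, not the Checksum Hook's actual implementation. The function names, the default 64-byte digest size, the hex encoding, and the use of `ValueError` are assumptions.

```python
import hashlib
from pathlib import Path


def blake2b_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex BLAKE2b digest of a file, reading it in chunks."""
    # Default digest size (64 bytes); the size the pipeline uses is an assumption.
    digest = hashlib.blake2b()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected: str) -> None:
    """Raise ValueError if the file's digest differs from the catalog value."""
    actual = blake2b_checksum(path)
    if actual != expected:
        raise ValueError(
            f"Checksum mismatch for {path}: expected {expected}, got {actual}"
        )
```

Reading in fixed-size chunks keeps memory use bounded for large data files.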
To update checksums after modifying a node:
    uv run cli sync-catalog --dataset <catalog_id>

Never delete the checksum property from a catalog entry. If the code that generates a file changes, rerun the node to regenerate the file, then sync the catalog so the stored checksum matches the new output.