Custom Datasets

OptimusKG defines custom dataset types in optimuskg/datasets/. Each one extends Kedro's AbstractVersionedDataset.

ParquetDataset

File: optimuskg/datasets/polars/parquet_dataset.py

Reads and writes Apache Parquet files using Polars. Supports schema validation via YAML catalog definitions with custom type parsing.
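The idea of parsing type strings and checking them against a schema can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration: the type names (Int64, Utf8, List(...)), the nested-tuple representation, and the helper names parse_type and validate_schema are hypothetical and are not taken from the OptimusKG code, which works with Polars data types and YAML catalog entries.

```python
import re

# Hypothetical primitive type names, chosen only for this illustration.
PRIMITIVES = {"Int64", "Float64", "Utf8", "Boolean"}


def parse_type(spec: str):
    """Parse a type string such as 'Int64' or 'List(Int64)' into a nested tuple."""
    spec = spec.strip()
    match = re.fullmatch(r"List\((.+)\)", spec)
    if match:
        # Recurse so that nested lists like List(List(Int64)) are supported.
        return ("List", parse_type(match.group(1)))
    if spec in PRIMITIVES:
        return spec
    raise ValueError(f"unknown type: {spec!r}")


def validate_schema(expected: dict, actual: dict) -> list:
    """Return a list of mismatch messages; an empty list means the schemas agree."""
    problems = []
    for column, spec in expected.items():
        want = parse_type(spec)
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != want:
            problems.append(f"{column}: expected {want}, got {actual[column]}")
    return problems


# 'expected' plays the role of a catalog schema definition,
# 'actual' the schema observed on the loaded data.
expected = {"id": "Utf8", "scores": "List(Float64)"}
actual = {"id": "Utf8", "scores": ("List", "Int64")}
problems = validate_schema(expected, actual)
```

In this sketch the check collects every mismatch instead of stopping at the first one, which tends to make schema errors easier to fix in a single pass.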

JsonDataset

File: optimuskg/datasets/polars/json_dataset.py

Reads and writes JSON files using Polars. Used for data sources that provide JSON exports.

OWLDataset

File: optimuskg/datasets/owl_dataset.py

Reads OWL (Web Ontology Language) files using owlready2. Used for loading biomedical ontologies such as the Gene Ontology, MONDO, and the Human Phenotype Ontology.

LXMLDataset

File: optimuskg/datasets/lxml_dataset.py

Reads XML files using lxml. Used for data sources that provide XML exports (e.g., DrugBank).
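As a rough illustration of turning an XML export into flat records, the sketch below uses the standard library's xml.etree.ElementTree as a stand-in; lxml.etree exposes a largely compatible API for these basic calls. The element and attribute names (drugs, drug, id, name) and the parse_drugs helper are hypothetical and do not reflect the real DrugBank schema or the LXMLDataset implementation.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; real exports are larger and typically namespaced.
SAMPLE = """<drugs>
  <drug id="D1"><name>Aspirin</name></drug>
  <drug id="D2"><name>Ibuprofen</name></drug>
</drugs>"""


def parse_drugs(xml_text: str) -> list:
    """Return one dict per <drug> element with its id attribute and name text."""
    root = ET.fromstring(xml_text)
    return [
        {"id": drug.get("id", ""), "name": drug.findtext("name", default="")}
        for drug in root.iter("drug")
    ]


records = parse_drugs(SAMPLE)
```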

ZipDataset

File: optimuskg/datasets/zip_dataset.py

Reads ZIP archives and automatically extracts their contents for downstream processing.
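A minimal sketch of archive extraction with the standard library's zipfile module is shown below. The extract_members helper and the sample file names are assumptions for illustration only; the actual ZipDataset may extract to disk or return a different structure.

```python
import io
import zipfile


def extract_members(archive_bytes: bytes) -> dict:
    """Return a mapping of member name to raw bytes for every file in the archive."""
    contents = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep only files
            contents[info.filename] = archive.read(info)
    return contents


# Build a small archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("drugs.csv", "id,name\n1,aspirin\n")
    archive.writestr("notes/readme.txt", "example")

members = extract_members(buffer.getvalue())
```

Returning the members as an in-memory mapping keeps the sketch self-contained; for large archives, streaming each member or extracting to a temporary directory would likely be a better fit.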

SQLDumpQueryDataset

File: optimuskg/datasets/sqldump_query_dataset/

Loads a SQL dump file into a temporary database, which runs in a Docker container, and executes queries against it. Used for data sources distributed as SQL dumps (e.g., DrugCentral).
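The load-then-query pattern can be sketched with an in-memory SQLite database from the standard library. This is a deliberate substitution: SQLite stands in for the Docker-hosted database so the example runs anywhere, and the table, the rows, and the query_dump helper are hypothetical, not the DrugCentral schema or the OptimusKG implementation.

```python
import sqlite3

# A tiny made-up dump; real dumps are far larger and use their own SQL dialect.
DUMP = """
CREATE TABLE structures (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO structures (id, name) VALUES (1, 'aspirin');
INSERT INTO structures (id, name) VALUES (2, 'ibuprofen');
"""


def query_dump(dump_sql: str, query: str) -> list:
    """Load a SQL dump into a throwaway database and return the query's rows."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.executescript(dump_sql)  # replay the dump statements
        return connection.execute(query).fetchall()
    finally:
        connection.close()  # the temporary database is discarded here


rows = query_dump(DUMP, "SELECT name FROM structures ORDER BY id")
```

Because the database exists only for the duration of the call, nothing persists between loads, which mirrors the temporary nature of the database described above.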