Custom Datasets
Dataset types for handling various biomedical data formats.
OptimusKG extends Kedro's AbstractVersionedDataset with custom dataset types in optimuskg/datasets/.
ParquetDataset
File: optimuskg/datasets/polars/parquet_dataset.py
Reads and writes Apache Parquet files using Polars. Supports schema validation via YAML catalog definitions with custom type parsing.
JsonDataset
File: optimuskg/datasets/polars/json_dataset.py
Reads and writes JSON files using Polars. Used for data sources that provide JSON exports.
OWLDataset
File: optimuskg/datasets/owl_dataset.py
Reads OWL (Web Ontology Language) files using owlready2. Used for loading biomedical ontologies such as the Gene Ontology, MONDO, and the Human Phenotype Ontology.
LXMLDataset
File: optimuskg/datasets/lxml_dataset.py
Reads XML files using lxml. Used for data sources that provide XML exports (e.g., DrugBank).
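For orientation, here is a minimal sketch of reading XML with lxml. The sample document and the `drug_names` helper are invented for illustration; real exports such as DrugBank's use their own schema and XML namespaces, which this sketch ignores.

```python
from lxml import etree

# Invented sample; not the structure of any real data source.
SAMPLE = b"""<drugs>
  <drug><name>Aspirin</name></drug>
  <drug><name>Ibuprofen</name></drug>
</drugs>"""


def drug_names(xml_bytes: bytes) -> list:
    """Return the text of each <name> child of every <drug> element."""
    root = etree.fromstring(xml_bytes)
    return [drug.findtext("name") for drug in root.iter("drug")]
```

For a file on disk, `etree.parse(path).getroot()` yields the same kind of root element as `etree.fromstring` does for bytes.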
ZipDataset
File: optimuskg/datasets/zip_dataset.py
Reads compressed ZIP archives and automatically extracts their contents for downstream processing.
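The extraction step can be sketched with the standard library's `zipfile` module. The helper name `extract_archive` is hypothetical, and the actual `ZipDataset` may handle paths and return values differently.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(archive: Path, target_dir: Path) -> list:
    """Extract all members of a ZIP archive; return the extracted file paths."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        # Skip directory entries, which end with "/" in the member list.
        return sorted(
            target_dir / name for name in zf.namelist() if not name.endswith("/")
        )
```

Downstream code can then open the returned paths with whichever reader suits the extracted file format.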
SQLDumpQueryDataset
File: optimuskg/datasets/sqldump_query_dataset/
Loads a SQL dump file into a temporary database, which runs in a Docker container, and executes queries against it. Used for data sources distributed as SQL dumps (e.g., DrugCentral).
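The load-then-query pattern can be illustrated without Docker by substituting an in-memory SQLite database for the containerized one. This is a deliberate simplification, not how `SQLDumpQueryDataset` works: dumps written in another SQL dialect will generally not load into SQLite unchanged. The `DUMP` contents and the `query_dump` helper are invented for the example.

```python
import sqlite3

# Invented miniature dump; real dumps are far larger and dialect-specific.
DUMP = """
CREATE TABLE drug (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO drug (id, name) VALUES (1, 'aspirin');
INSERT INTO drug (id, name) VALUES (2, 'ibuprofen');
"""


def query_dump(dump_sql: str, query: str) -> list:
    """Load a SQL dump into a throwaway database and run one query on it."""
    conn = sqlite3.connect(":memory:")  # temporary: discarded on close
    try:
        conn.executescript(dump_sql)
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

Because the database exists only for the duration of the call, nothing persists between loads, which mirrors the temporary-database idea described above.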