Graph Schema
Nodes
Node types and their schema in OptimusKG.
OptimusKG encodes 10 node types spanning molecular, clinical, anatomical, and environmental entities.

| Label | Type | Count |
|---|---|---|
| GEN | Gene | 61,306 |
| DIS | Disease | 36,345 |
| BPO | Biological Process | 25,754 |
| PHE | Phenotype | 19,341 |
| DRG | Drug | 16,766 |
| ANA | Anatomy | 13,120 |
| MFN | Molecular Function | 10,161 |
| CCO | Cellular Component | 4,052 |
| PWY | Pathway | 2,805 |
| EXP | Exposure | 881 |

All nodes share the same base schema:

| Field | Type | Description |
|---|---|---|
| id | String | Globally unique node identifier in CURIE format (e.g. ENSG00000139618) |
| label | String | Node type 3-letter abbreviation (e.g. GEN, DRG) |
| properties | String | JSON-encoded type-specific properties. Expanded to a native Struct in per-type parquet files. |

In the stratified per-type Parquet files (`nodes/<type>.parquet`), `properties` is expanded into native typed columns as a Polars Struct.
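
As a minimal sketch of working with the base schema, the snippet below parses the JSON-encoded `properties` string of a single row into a Python dict. The row is hand-written for illustration (only the `id` and `symbol` examples come from this page), and `decode_node` is a hypothetical helper, not part of OptimusKG.

```python
import json

# Hand-written row following the base schema (id, label, properties).
# The property values are illustrative, not read from an OptimusKG file.
row = {
    "id": "ENSG00000141510",
    "label": "GEN",
    "properties": '{"symbol": "TP53", "biotype": "protein_coding"}',
}


def decode_node(row: dict) -> dict:
    """Return a copy of the row with `properties` parsed from JSON into a dict."""
    decoded = dict(row)
    decoded["properties"] = json.loads(row["properties"])
    return decoded


node = decode_node(row)
print(node["label"], node["properties"]["symbol"])
```

In the per-type files this step should not be needed, since `properties` is already a typed Struct there.
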
Gene

| Field | Type | Description |
|---|---|---|
| id | String | Node identifier in CURIE format (e.g. ENSG00000141510) |
| label | String | Node type abbreviation (GEN) |
| properties | Struct | Gene-specific properties |
| symbol | String | Official HGNC gene symbol (e.g. TP53) |
| name | String | Full gene name |
| biotype | String | Gene biotype (e.g. protein_coding, lncRNA) |
| genomic_location | Struct | Chromosomal coordinates |
| chromosome | String | Chromosome name |
| start | Int64 | Start position (0-based) |
| end | Int64 | End position |
| strand | Int32 | Strand (+1 forward, -1 reverse) |
| transcription_start_site | Int64 | Transcription start site position |
| canonical_transcript | Struct | Canonical transcript details |
| id | String | Ensembl transcript ID |
| chromosome | String | Chromosome name |
| start | Int64 | Start position |
| end | Int64 | End position |
| strand | String | Strand |
| canonical_exons | List[String] | Canonical exon coordinates |
| transcript_ids | List[String] | All associated Ensembl transcript IDs |
| alternative_genes | List[String] | Alternative gene entries at the same locus |
| function_descriptions | List[String] | Functional descriptions |
| synonyms | List[Struct] | General gene synonyms |
| label | String | Synonym label |
| source | String | Source database |
| symbol_synonyms | List[Struct] | Alternative gene symbols |
| label | String | Symbol label |
| source | String | Source database |
| name_synonyms | List[Struct] | Alternative gene names |
| label | String | Name label |
| source | String | Source database |
| obsolete_symbols | List[Struct] | Deprecated gene symbols |
| label | String | Symbol label |
| source | String | Source database |
| obsolete_names | List[Struct] | Deprecated gene names |
| label | String | Name label |
| source | String | Source database |
| subcellular_locations | List[Struct] | Subcellular localization annotations |
| location | String | Location name |
| source | String | Source database |
| term_sl | String | Subcellular location ontology term |
| label_sl | String | Subcellular location label |
| target_class | List[Struct] | Drug target class classification |
| id | Int64 | Target class ID |
| label | String | Target class label |
| level | String | Hierarchy level |
| target_enabling_package | Struct | Target enabling package annotation |
| target_from_source_id | String | Source target ID |
| description | String | Package description |
| therapeutic_area | String | Therapeutic area |
| url | String | Reference URL |
| tractability | List[Struct] | Drug tractability assessments per modality |
| modality | String | Drug modality (e.g. sm, ab, pr) |
| id | String | Tractability category ID |
| value | Boolean | Tractability assessment value |
| constraint_scores | List[Struct] | Evolutionary constraint scores (e.g. pLI, LOEUF) |
| constraint_type | String | Score type (e.g. lof, mis) |
| score | Float32 | Constraint score |
| exp | Float32 | Expected variant count |
| obs | Int32 | Observed variant count |
| oe | Float32 | Observed/expected ratio |
| oe_lower | Float32 | O/E 90% CI lower bound |
| oe_upper | Float32 | O/E 90% CI upper bound |
| upper_rank | Int32 | Upper rank (gnomAD) |
| upper_bin | Int32 | Upper bin (10-bin) |
| upper_bin6 | Int32 | Upper bin (6-bin) |
| hallmarks_attributes | List[Struct] | Cancer hallmark attributes (Cancer Gene Census) |
| pmid | Int64 | PubMed ID of supporting reference |
| description | String | Hallmark description |
| attribute_name | String | Attribute name |
| cancer_hallmarks | List[Struct] | Associated cancer hallmarks |
| pmid | Int64 | PubMed ID of supporting reference |
| description | String | Hallmark description |
| impact | String | Functional impact (promotes/suppresses) |
| label | String | Hallmark label |
| associated_proteins | List[Struct] | Associated UniProt protein entries |
| id | String | UniProt accession |
| source | String | Source database |
| xrefs | List[Struct] | Cross-references to external databases |
| id | String | External identifier |
| source | String | Database name |
| chemical_probes | List[Struct] | Chemical probe annotations (Probes & Drugs) |
| homologues | List[Struct] | Ortholog and paralog information |
| species_id | String | NCBI taxonomy ID |
| species_name | String | Species name |
| homology_type | String | Homology type (ortholog/paralog) |
| target_gene_id | String | Target gene identifier |
| is_high_confidence | String | High-confidence flag |
| target_gene_symbol | String | Target gene symbol |
| query_percentage_identity | Float64 | Query % sequence identity |
| target_percentage_identity | Float64 | Target % sequence identity |
| priority | Int32 | Priority rank |
| safety_liabilities | List[Struct] | Safety liability annotations (OpenTargets) |
| sources | Struct | Provenance of this node |
| direct | List[String] | Datasets that directly contributed this entity |
| indirect | List[String] | Datasets that referenced this entity |

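
The List[Struct] fields are typically consumed by filtering on one of their sub-fields. The sketch below assumes that `constraint_type`, `oe`, `oe_lower` and `oe_upper` are members of each `constraint_scores` entry, and that `modality`, `id` and `value` are members of each `tractability` entry, as the listing order suggests. The gene symbol, category IDs and all numbers are made up, and both helper functions are hypothetical.

```python
from typing import Optional

# Made-up gene properties shaped like the fields listed above.
# Assumption: sub-fields nest under their List[Struct] parent in listing order.
gene_properties = {
    "symbol": "GENE1",
    "constraint_scores": [
        {"constraint_type": "lof", "oe": 0.25, "oe_lower": 0.125, "oe_upper": 0.5},
        {"constraint_type": "mis", "oe": 0.75, "oe_lower": 0.5, "oe_upper": 1.0},
    ],
    "tractability": [
        {"modality": "sm", "id": "example_category_a", "value": True},
        {"modality": "ab", "id": "example_category_b", "value": True},
        {"modality": "pr", "id": "example_category_c", "value": False},
    ],
}


def constraint_by_type(properties: dict, constraint_type: str) -> Optional[dict]:
    """Return the first constraint entry of the given type, or None if absent."""
    for entry in properties.get("constraint_scores") or []:
        if entry.get("constraint_type") == constraint_type:
            return entry
    return None


def tractable_modalities(properties: dict) -> list:
    """Return the sorted modalities that have at least one positive assessment."""
    entries = properties.get("tractability") or []
    return sorted({e["modality"] for e in entries if e.get("value")})


lof = constraint_by_type(gene_properties, "lof")
print(lof["oe_upper"] if lof else None)
print(tractable_modalities(gene_properties))
```

Both helpers treat a missing or null list as empty, which keeps them usable on genes that lack these annotations.
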
Drug

| Field | Type | Description |
|---|---|---|
| id | String | Node identifier in CURIE format (e.g. DB00001) |
| label | String | Node type abbreviation (DRG) |
| properties | Struct | Drug-specific properties |
| name | String | Primary drug name |
| type | String | Drug type (e.g. small molecule, biologic) |
| description | String | Drug description |
| synonyms | List[String] | Drug synonyms |
| trade_names | List[String] | Commercial trade names |
| accession_numbers | List[String] | Database accession numbers |
| source_ids | List[String] | Source-specific identifiers |
| struct_id | String | DrugCentral structure ID |
| cd_id | String | DrugCentral compound ID |
| inchi_key | String | Hashed InChIKey identifier |
| inchi | String | IUPAC InChI string |
| canonical_smiles | String | Canonical SMILES string |
| cd_formula | String | Molecular formula |
| cd_mol_weight | Float64 | Molecular weight (Da) |
| mol_file_base64 | String | Base64-encoded MOL file |
| mol_image_base64 | String | Base64-encoded 2D structure image |
| calculated_log_p | Float64 | Calculated LogP (lipophilicity) |
| alogs | Float64 | ALogS (aqueous solubility estimate) |
| tpsa | Float64 | Topological polar surface area (Ų) |
| lipinski | Float64 | Lipinski rule of five score |
| aromatic_carbons | Int32 | Number of aromatic carbon atoms |
| sp3_count | Int32 | Number of sp³ carbon atoms |
| sp2_count | Int32 | Number of sp² carbon atoms |
| sp_count | Int32 | Number of sp carbon atoms |
| halogen_count | Int32 | Number of halogen atoms |
| hetero_sp2_count | Int32 | Number of heteroaromatic sp² atoms |
| rotatable_bonds | Int32 | Number of rotatable bonds |
| o_n | Int32 | H-bond acceptors (O + N atom count) |
| oh_nh | Int32 | H-bond donors (OH + NH group count) |
| rgb | Float64 | RGB color value for structure rendering |
| enhanced_stereo | Boolean | Has enhanced stereochemistry annotation |
| is_approved | Boolean | Currently approved for clinical use |
| has_been_withdrawn | Boolean | Has been withdrawn from market |
| black_box_warning | Boolean | Carries an FDA black box warning |
| year_of_first_approval | Int64 | Year of first regulatory approval |
| maximum_clinical_trial_phase | Float64 | Highest clinical trial phase reached |
| status | String | Regulatory status |
| fda_labels | Int32 | Number of associated FDA drug labels |
| number_of_formulations | Int32 | Number of approved formulations |
| chemical_abstracts_service_number | String | CAS registry number |
| unique_ingredient_identifier | String | FDA UNII identifier |
| mrdef | String | MeSH pharmacological action definition |
| sources | Struct | Provenance of this node |
| direct | List[String] | Datasets that directly contributed this entity |
| indirect | List[String] | Datasets that referenced this entity |

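
Several of the physicochemical fields above are the inputs to the classic rule of five. The sketch below counts violations using the standard thresholds (molecular weight over 500 Da, LogP over 5, more than 5 H-bond donors, more than 10 H-bond acceptors). It is an illustration only: how the `lipinski` column itself is computed is not stated here, so this function may not reproduce it. The two compounds are made up, and `lipinski_violations` is a hypothetical helper.

```python
from typing import Optional

# Standard rule-of-five upper limits, keyed by the drug property names above.
THRESHOLDS = {
    "cd_mol_weight": 500.0,   # molecular weight (Da)
    "calculated_log_p": 5.0,  # lipophilicity
    "oh_nh": 5,               # H-bond donors
    "o_n": 10,                # H-bond acceptors
}


def lipinski_violations(drug: dict) -> Optional[int]:
    """Count rule-of-five violations; return None if any input is missing."""
    violations = 0
    for field, limit in THRESHOLDS.items():
        value = drug.get(field)
        if value is None:
            return None  # cannot evaluate the rule without all four inputs
        if value > limit:
            violations += 1
    return violations


# Made-up property values for two hypothetical compounds.
small = {"cd_mol_weight": 250.0, "calculated_log_p": 2.0, "oh_nh": 1, "o_n": 4}
large = {"cd_mol_weight": 900.0, "calculated_log_p": 6.5, "oh_nh": 3, "o_n": 9}

print(lipinski_violations(small), lipinski_violations(large))
```

Returning `None` for incomplete records avoids reporting zero violations for a drug whose descriptors are simply absent.
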
Disease

| Field | Type | Description |
|---|---|---|
| id | String | Node identifier in CURIE format (e.g. MONDO:0005148) |
| label | String | Node type abbreviation (DIS) |
| properties | Struct | Disease-specific properties |
| name | String | Disease name |
| description | String | Disease description |
| code | String | Primary ontology code |
| xrefs | List[String] | Cross-references to external databases (ICD-10, OMIM, etc.) |
| parents | List[String] | Parent disease terms in the ontology hierarchy |
| children | List[String] | Child disease terms in the ontology hierarchy |
| ancestors | List[String] | All ancestor terms (transitive parents) |
| descendants | List[String] | All descendant terms (transitive children) |
| exact_synonyms | List[String] | Exact synonym labels |
| related_synonyms | List[String] | Related synonym labels |
| narrow_synonyms | List[String] | Narrow synonym labels |
| broad_synonyms | List[String] | Broad synonym labels |
| obsolete_terms | List[String] | Deprecated ontology terms |
| obsolete_xrefs | List[String] | Deprecated cross-references |
| therapeutic_areas | List[String] | Associated therapeutic area codes |
| is_leaf | Boolean | True if this term has no children |
| concept_ids | List[String] | UMLS concept IDs |
| concept_names | List[String] | UMLS concept names |
| umls_cui | String | Primary UMLS Concept Unique Identifier |
| snomed_full_names | List[String] | SNOMED CT full concept names |
| snomed_concept_ids | List[String] | SNOMED CT concept IDs |
| cui_semantic_type | String | UMLS semantic type of the primary CUI |
| sources | Struct | Provenance of this node |
| direct | List[String] | Datasets that directly contributed this entity |
| indirect | List[String] | Datasets that referenced this entity |

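
The hierarchy fields relate to each other in a simple way: `ancestors` is the transitive closure of `parents`, and `is_leaf` is true exactly when `children` is empty. The sketch below derives both from a direct-parent map. The term IDs are placeholders rather than real ontology terms, and the helper functions are hypothetical; the graph ships these values precomputed, so this is only meant to make the definitions concrete.

```python
# Placeholder ontology: each term maps to its direct parents.
# EX:4 has two parents, so the hierarchy is a DAG rather than a tree.
parents = {
    "EX:1": [],
    "EX:2": ["EX:1"],
    "EX:3": ["EX:1"],
    "EX:4": ["EX:2", "EX:3"],
}


def ancestors(term_id: str, parents: dict) -> set:
    """Collect all transitive parents of a term with an iterative traversal."""
    seen = set()
    stack = list(parents.get(term_id, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already reached through another path
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


def children_of(parents: dict) -> dict:
    """Invert the parent map to obtain the direct children of every term."""
    children = {term: [] for term in parents}
    for term, direct_parents in parents.items():
        for parent in direct_parents:
            children.setdefault(parent, []).append(term)
    return children


children = children_of(parents)
is_leaf = {term: not kids for term, kids in children.items()}

print(sorted(ancestors("EX:4", parents)))
print(is_leaf["EX:4"], is_leaf["EX:1"])
```

The `seen` set matters because a term with several parents can reach the same ancestor along more than one path.
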
Phenotype

| Field | Type | Description |
|---|---|---|
| id | String | Node identifier in CURIE format (e.g. HP:0001250) |
| label | String | Node type abbreviation (PHE) |
| properties | Struct | Phenotype-specific properties |
| name | String | Phenotype name |
| description | String | Phenotype description |
| code | String | Primary ontology code |
| type | String | Phenotype type classification |
| xrefs | List[String] | Cross-references to external databases |
| parents | List[String] | Parent phenotype terms in the ontology hierarchy |
| children | List[String] | Child phenotype terms in the ontology hierarchy |
| ancestors | List[String] | All ancestor terms (transitive parents) |
| descendants | List[String] | All descendant terms (transitive children) |
| exact_synonyms | List[String] | Exact synonym labels |
| related_synonyms | List[String] | Related synonym labels |
| narrow_synonyms | List[String] | Narrow synonym labels |
| broad_synonyms | List[String] | Broad synonym labels |
| obsolete_terms | List[String] | Deprecated ontology terms |
| obsolete_xrefs | List[String] | Deprecated cross-references |
| concept_ids | List[String] | UMLS concept IDs |
| concept_names | List[String] | UMLS concept names |
| umls_cui | String | Primary UMLS Concept Unique Identifier |
| snomed_full_names | List[String] | SNOMED CT full concept names |
| snomed_concept_ids | List[String] | SNOMED CT concept IDs |
| cui_semantic_type | String | UMLS semantic type of the primary CUI |
| ontology | Struct | Source ontology metadata |
| title | String | Ontology title |
| description | String | Ontology description |
| license | String | Ontology license |
| version | String | Ontology version |
| sources | Struct | Provenance of this node |
| direct | List[String] | Datasets that directly contributed this entity |
| indirect | List[String] | Datasets that referenced this entity |

Anatomy

| Field | Type | Description |
|---|---|---|
| id | String | Node identifier in CURIE format (e.g. UBERON:0000948) |
| label | String | Node type abbreviation (ANA) |
| properties | Struct | Entity-specific properties |
| name | String | Entity name |
| definition | String | Ontology definition |
| xrefs | List[String] | Cross-references to external databases |
| synonyms | List[String] | Synonym labels |
| ontology | Struct | Source ontology metadata |
| title | String | Ontology title |
| description | String | Ontology description |
| license | String | Ontology license |
| version | String | Ontology version |
| sources | Struct | Provenance of this node |
| direct | List[String] | Datasets that directly contributed this entity |
| indirect | List[String] | Datasets that referenced this entity |

Pathway

| Field | Type | Description |
|---|---|---|
| id | String | Node identifier in CURIE format (e.g. R-HSA-109582) |
| label | String | Node type abbreviation (PWY) |
| properties | Struct | Pathway-specific properties |
| name | String | Pathway name |
| species | String | Species name (e.g. Homo sapiens) |
| sources | Struct | Provenance of this node |
| direct | List[String] | Datasets that directly contributed this entity |
| indirect | List[String] | Datasets that referenced this entity |

Biological Process

| Field | Type | Description |
|---|---|---|
| id | String | Node identifier in CURIE format (e.g. GO:0006915) |
| label | String | Node type abbreviation (BPO) |
| properties | Struct | Entity-specific properties |
| name | String | Entity name |
| definition | String | Ontology definition |
| xrefs | List[String] | Cross-references to external databases |
| synonyms | List[String] | Synonym labels |
| ontology | Struct | Source ontology metadata |
| title | String | Ontology title |
| description | String | Ontology description |
| license | String | Ontology license |
| version | String | Ontology version |
| sources | Struct | Provenance of this node |
| direct | List[String] | Datasets that directly contributed this entity |
| indirect | List[String] | Datasets that referenced this entity |

Cellular Component

| Field | Type | Description |
|---|---|---|
| id | String | Node identifier in CURIE format (e.g. GO:0005737) |
| label | String | Node type abbreviation (CCO) |
| properties | Struct | Entity-specific properties |
| name | String | Entity name |
| definition | String | Ontology definition |
| xrefs | List[String] | Cross-references to external databases |
| synonyms | List[String] | Synonym labels |
| ontology | Struct | Source ontology metadata |
| title | String | Ontology title |
| description | String | Ontology description |
| license | String | Ontology license |
| version | String | Ontology version |
| sources | Struct | Provenance of this node |
| direct | List[String] | Datasets that directly contributed this entity |
| indirect | List[String] | Datasets that referenced this entity |

Molecular Function

| Field | Type | Description |
|---|---|---|
| id | String | Node identifier in CURIE format (e.g. GO:0003677) |
| label | String | Node type abbreviation (MFN) |
| properties | Struct | Entity-specific properties |
| name | String | Entity name |
| definition | String | Ontology definition |
| xrefs | List[String] | Cross-references to external databases |
| synonyms | List[String] | Synonym labels |
| ontology | Struct | Source ontology metadata |
| title | String | Ontology title |
| description | String | Ontology description |
| license | String | Ontology license |
| version | String | Ontology version |
| sources | Struct | Provenance of this node |
| direct | List[String] | Datasets that directly contributed this entity |
| indirect | List[String] | Datasets that referenced this entity |

Exposure

| Field | Type | Description |
|---|---|---|
| id | String | Node identifier in CURIE format (e.g. CTD:D001564) |
| label | String | Node type abbreviation (EXP) |
| properties | Struct | Exposure-specific properties |
| name | String | Exposure name |
| source_categories | List[String] | Exposure source categories (e.g. chemical, biological) |
| source_details | String | Additional source details |
| sources | Struct | Provenance of this node |
| direct | List[String] | Datasets that directly contributed this entity |
| indirect | List[String] | Datasets that referenced this entity |