
Nodes

Node types and their schema in OptimusKG.

OptimusKG encodes 10 node types spanning molecular, clinical, anatomical, and environmental entities.

Label  Type                Count
GEN    Gene                61,306
DIS    Disease             36,345
BPO    Biological Process  25,754
PHE    Phenotype           19,341
DRG    Drug                16,766
ANA    Anatomy             13,120
MFN    Molecular Function  10,161
CCO    Cellular Component  4,052
PWY    Pathway             2,805
EXP    Exposure            881
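
The abbreviations and counts listed above can be kept as a small lookup table in code. The following is a minimal Python sketch; the names `NODE_TYPES` and `type_name` are illustrative and are not part of OptimusKG itself.

```python
# Label -> (type name, node count), copied from the table above.
NODE_TYPES = {
    "GEN": ("Gene", 61_306),
    "DIS": ("Disease", 36_345),
    "BPO": ("Biological Process", 25_754),
    "PHE": ("Phenotype", 19_341),
    "DRG": ("Drug", 16_766),
    "ANA": ("Anatomy", 13_120),
    "MFN": ("Molecular Function", 10_161),
    "CCO": ("Cellular Component", 4_052),
    "PWY": ("Pathway", 2_805),
    "EXP": ("Exposure", 881),
}


def type_name(label: str) -> str:
    """Return the full node type name for a 3-letter label (hypothetical helper)."""
    return NODE_TYPES[label][0]


# Total number of nodes across all ten types.
total_nodes = sum(count for _, count in NODE_TYPES.values())
```

A lookup like this is handy for turning the `label` column into readable names when summarizing query results.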

All nodes share the same base schema:

id (String): Globally unique node identifier in CURIE format (e.g. ENSG00000139618)
label (String): Node type 3-letter abbreviation (e.g. GEN, DRG)
properties (String): JSON-encoded type-specific properties. Expanded to a native Struct in per-type parquet files.

In the stratified per-type parquet files (nodes/<type>.parquet), properties is not a JSON string: it is stored as a native Polars Struct whose fields are typed, as listed for each node type below.
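
Because `properties` is a JSON-encoded string in the base schema, it has to be decoded before individual fields can be read. The sketch below uses only Python's standard `json` module; the `row` dictionary is a hand-written stand-in for one record (it is not read from an actual OptimusKG file), and `decode_node` is a hypothetical helper name.

```python
import json

# Hypothetical base-schema row: `properties` is still a JSON string here.
row = {
    "id": "ENSG00000141510",
    "label": "GEN",
    "properties": '{"symbol": "TP53", "biotype": "protein_coding"}',
}


def decode_node(row: dict) -> dict:
    """Return a copy of the row with `properties` parsed from JSON into a dict."""
    decoded = dict(row)
    decoded["properties"] = json.loads(row["properties"])
    return decoded


node = decode_node(row)
print(node["properties"]["symbol"])
```

In the per-type parquet files this decoding step should not be needed, since the fields are already typed.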


Gene

id (String): Node identifier in CURIE format (e.g. ENSG00000141510)
label (String): Node type abbreviation (GEN)
properties (Struct): Gene-specific properties
  symbol (String): Official HGNC gene symbol (e.g. TP53)
  name (String): Full gene name
  biotype (String): Gene biotype (e.g. protein_coding, lncRNA)
  genomic_location (Struct): Chromosomal coordinates
    chromosome (String): Chromosome name
    start (Int64): Start position (0-based)
    end (Int64): End position
    strand (Int32): Strand (+1 forward, -1 reverse)
  transcription_start_site (Int64): Transcription start site position
  canonical_transcript (Struct): Canonical transcript details
    id (String): Ensembl transcript ID
    chromosome (String): Chromosome name
    start (Int64): Start position
    end (Int64): End position
    strand (String): Strand
  canonical_exons (List[String]): Canonical exon coordinates
  transcript_ids (List[String]): All associated Ensembl transcript IDs
  alternative_genes (List[String]): Alternative gene entries at the same locus
  function_descriptions (List[String]): Functional descriptions
  synonyms (List[Struct]): General gene synonyms
    label (String): Synonym label
    source (String): Source database
  symbol_synonyms (List[Struct]): Alternative gene symbols
    label (String): Symbol label
    source (String): Source database
  name_synonyms (List[Struct]): Alternative gene names
    label (String): Name label
    source (String): Source database
  obsolete_symbols (List[Struct]): Deprecated gene symbols
    label (String): Symbol label
    source (String): Source database
  obsolete_names (List[Struct]): Deprecated gene names
    label (String): Name label
    source (String): Source database
  subcellular_locations (List[Struct]): Subcellular localization annotations
    location (String): Location name
    source (String): Source database
    term_sl (String): Subcellular location ontology term
    label_sl (String): Subcellular location label
  target_class (List[Struct]): Drug target class classification
    id (Int64): Target class ID
    label (String): Target class label
    level (String): Hierarchy level
  target_enabling_package (Struct): Target enabling package annotation
    target_from_source_id (String): Source target ID
    description (String): Package description
    therapeutic_area (String): Therapeutic area
    url (String): Reference URL
  tractability (List[Struct]): Drug tractability assessments per modality
    modality (String): Drug modality (e.g. sm, ab, pr)
    id (String): Tractability category ID
    value (Boolean): Tractability assessment value
  constraint_scores (List[Struct]): Evolutionary constraint scores (e.g. pLI, LOEUF)
    constraint_type (String): Score type (e.g. lof, mis)
    score (Float32): Constraint score
    exp (Float32): Expected variant count
    obs (Int32): Observed variant count
    oe (Float32): Observed/expected ratio
    oe_lower (Float32): O/E 90% CI lower bound
    oe_upper (Float32): O/E 90% CI upper bound
    upper_rank (Int32): Upper rank (gnomAD)
    upper_bin (Int32): Upper bin (10-bin)
    upper_bin6 (Int32): Upper bin (6-bin)
  hallmarks_attributes (List[Struct]): Cancer hallmark attributes (Cancer Gene Census)
    pmid (Int64): PubMed ID of supporting reference
    description (String): Hallmark description
    attribute_name (String): Attribute name
  cancer_hallmarks (List[Struct]): Associated cancer hallmarks
    pmid (Int64): PubMed ID of supporting reference
    description (String): Hallmark description
    impact (String): Functional impact (promotes/suppresses)
    label (String): Hallmark label
  associated_proteins (List[Struct]): Associated UniProt protein entries
    id (String): UniProt accession
    source (String): Source database
  xrefs (List[Struct]): Cross-references to external databases
    id (String): External identifier
    source (String): Database name
  chemical_probes (List[Struct]): Chemical probe annotations (Probes & Drugs)
  homologues (List[Struct]): Ortholog and paralog information
    species_id (String): NCBI taxonomy ID
    species_name (String): Species name
    homology_type (String): Homology type (ortholog/paralog)
    target_gene_id (String): Target gene identifier
    is_high_confidence (String): High-confidence flag
    target_gene_symbol (String): Target gene symbol
    query_percentage_identity (Float64): Query % sequence identity
    target_percentage_identity (Float64): Target % sequence identity
    priority (Int32): Priority rank
  safety_liabilities (List[Struct]): Safety liability annotations (OpenTargets)
  sources (Struct): Provenance of this node
    direct (List[String]): Datasets that directly contributed this entity
    indirect (List[String]): Datasets that referenced this entity
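
Several gene properties are lists of structs, so reading one value usually means scanning the list for a matching entry. The sketch below works on a plain Python dict shaped like the Gene properties above. All numeric values and the `category-a` ID are invented for illustration, and the helper names `constraint` and `tractable_modalities` are hypothetical.

```python
from typing import Optional

# Hypothetical gene properties; values are placeholders, not real annotations.
gene_props = {
    "symbol": "TP53",
    "constraint_scores": [
        {"constraint_type": "lof", "oe": 0.2, "oe_upper": 0.35},
        {"constraint_type": "mis", "oe": 0.8, "oe_upper": 0.9},
    ],
    "tractability": [
        {"modality": "sm", "id": "category-a", "value": True},
        {"modality": "ab", "id": "category-a", "value": False},
    ],
}


def constraint(props: dict, constraint_type: str) -> Optional[dict]:
    """Return the first constraint entry of the given type, or None if absent."""
    for entry in props.get("constraint_scores") or []:
        if entry["constraint_type"] == constraint_type:
            return entry
    return None


def tractable_modalities(props: dict) -> set:
    """Return the modalities with at least one positive tractability assessment."""
    return {t["modality"] for t in props.get("tractability") or [] if t["value"]}


lof = constraint(gene_props, "lof")
print(lof["oe_upper"] if lof else None)
print(tractable_modalities(gene_props))
```

The `or []` guards assume that a list field may be null for some genes; whether that actually occurs in the data is not stated here.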

Drug

id (String): Node identifier in CURIE format (e.g. DB00001)
label (String): Node type abbreviation (DRG)
properties (Struct): Drug-specific properties
  name (String): Primary drug name
  type (String): Drug type (e.g. small molecule, biologic)
  description (String): Drug description
  synonyms (List[String]): Drug synonyms
  trade_names (List[String]): Commercial trade names
  accession_numbers (List[String]): Database accession numbers
  source_ids (List[String]): Source-specific identifiers
  struct_id (String): DrugCentral structure ID
  cd_id (String): DrugCentral compound ID
  inchi_key (String): Hashed InChIKey identifier
  inchi (String): IUPAC InChI string
  canonical_smiles (String): Canonical SMILES string
  cd_formula (String): Molecular formula
  cd_mol_weight (Float64): Molecular weight (Da)
  mol_file_base64 (String): Base64-encoded MOL file
  mol_image_base64 (String): Base64-encoded 2D structure image
  calculated_log_p (Float64): Calculated LogP (lipophilicity)
  alogs (Float64): ALogS (aqueous solubility estimate)
  tpsa (Float64): Topological polar surface area (Ų)
  lipinski (Float64): Lipinski rule of five score
  aromatic_carbons (Int32): Number of aromatic carbon atoms
  sp3_count (Int32): Number of sp³ carbon atoms
  sp2_count (Int32): Number of sp² carbon atoms
  sp_count (Int32): Number of sp carbon atoms
  halogen_count (Int32): Number of halogen atoms
  hetero_sp2_count (Int32): Number of heteroaromatic sp² atoms
  rotatable_bonds (Int32): Number of rotatable bonds
  o_n (Int32): H-bond acceptors (O + N atom count)
  oh_nh (Int32): H-bond donors (OH + NH group count)
  rgb (Float64): RGB color value for structure rendering
  enhanced_stereo (Boolean): Has enhanced stereochemistry annotation
  is_approved (Boolean): Currently approved for clinical use
  has_been_withdrawn (Boolean): Has been withdrawn from market
  black_box_warning (Boolean): Carries an FDA black box warning
  year_of_first_approval (Int64): Year of first regulatory approval
  maximum_clinical_trial_phase (Float64): Highest clinical trial phase reached
  status (String): Regulatory status
  fda_labels (Int32): Number of associated FDA drug labels
  number_of_formulations (Int32): Number of approved formulations
  chemical_abstracts_service_number (String): CAS registry number
  unique_ingredient_identifier (String): FDA UNII identifier
  mrdef (String): MeSH pharmacological action definition
  sources (Struct): Provenance of this node
    direct (List[String]): Datasets that directly contributed this entity
    indirect (List[String]): Datasets that referenced this entity
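
The physicochemical fields above are enough to evaluate Lipinski's rule of five directly (molecular weight at most 500 Da, LogP at most 5, at most 5 H-bond donors, at most 10 H-bond acceptors). The sketch below counts violations from a properties dict. It is an independent illustration and is not necessarily how the stored `lipinski` field is computed; the two sample dicts contain invented values, and the function assumes the four fields are non-null.

```python
def rule_of_five_violations(props: dict) -> int:
    """Count Lipinski rule-of-five violations from drug properties."""
    checks = [
        props["cd_mol_weight"] > 500,   # molecular weight above 500 Da
        props["calculated_log_p"] > 5,  # LogP above 5
        props["oh_nh"] > 5,             # more than 5 H-bond donors
        props["o_n"] > 10,              # more than 10 H-bond acceptors
    ]
    return sum(checks)


# Hypothetical property values, for illustration only.
small_molecule = {"cd_mol_weight": 180.16, "calculated_log_p": 1.2, "oh_nh": 1, "o_n": 4}
large_molecule = {"cd_mol_weight": 900.0, "calculated_log_p": 6.5, "oh_nh": 6, "o_n": 12}

print(rule_of_five_violations(small_molecule))
print(rule_of_five_violations(large_molecule))
```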

Disease

id (String): Node identifier in CURIE format (e.g. MONDO:0005148)
label (String): Node type abbreviation (DIS)
properties (Struct): Disease-specific properties
  name (String): Disease name
  description (String): Disease description
  code (String): Primary ontology code
  xrefs (List[String]): Cross-references to external databases (ICD-10, OMIM, etc.)
  parents (List[String]): Parent disease terms in the ontology hierarchy
  children (List[String]): Child disease terms in the ontology hierarchy
  ancestors (List[String]): All ancestor terms (transitive parents)
  descendants (List[String]): All descendant terms (transitive children)
  exact_synonyms (List[String]): Exact synonym labels
  related_synonyms (List[String]): Related synonym labels
  narrow_synonyms (List[String]): Narrow synonym labels
  broad_synonyms (List[String]): Broad synonym labels
  obsolete_terms (List[String]): Deprecated ontology terms
  obsolete_xrefs (List[String]): Deprecated cross-references
  therapeutic_areas (List[String]): Associated therapeutic area codes
  is_leaf (Boolean): True if this term has no children
  concept_ids (List[String]): UMLS concept IDs
  concept_names (List[String]): UMLS concept names
  umls_cui (String): Primary UMLS Concept Unique Identifier
  snomed_full_names (List[String]): SNOMED CT full concept names
  snomed_concept_ids (List[String]): SNOMED CT concept IDs
  cui_semantic_type (String): UMLS semantic type of the primary CUI
  sources (Struct): Provenance of this node
    direct (List[String]): Datasets that directly contributed this entity
    indirect (List[String]): Datasets that referenced this entity
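
The `ancestors` field is described as the transitive closure of `parents`, and `is_leaf` as the absence of children. The sketch below shows how those relationships fit together on a toy hierarchy; the `EX:` identifiers are invented placeholders, and `ancestors_of` and `is_leaf` are hypothetical helpers, not OptimusKG functions.

```python
# Toy hierarchy: term -> list of direct parents (identifiers are placeholders).
parents = {
    "EX:1": [],
    "EX:2": ["EX:1"],
    "EX:3": ["EX:1"],
    "EX:4": ["EX:2", "EX:3"],
}


def ancestors_of(term: str, parents: dict) -> set:
    """Collect all transitive parents of a term with an iterative traversal."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def is_leaf(term: str, parents: dict) -> bool:
    """A term is a leaf when no other term lists it as a parent."""
    return all(term not in direct for direct in parents.values())


print(sorted(ancestors_of("EX:4", parents)))
print(is_leaf("EX:4", parents), is_leaf("EX:2", parents))
```

Since the schema already stores `ancestors`, `descendants`, and `is_leaf`, a traversal like this is mainly useful for sanity-checking those fields against `parents` and `children`.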

Phenotype

id (String): Node identifier in CURIE format (e.g. HP:0001250)
label (String): Node type abbreviation (PHE)
properties (Struct): Phenotype-specific properties
  name (String): Phenotype name
  description (String): Phenotype description
  code (String): Primary ontology code
  type (String): Phenotype type classification
  xrefs (List[String]): Cross-references to external databases
  parents (List[String]): Parent phenotype terms in the ontology hierarchy
  children (List[String]): Child phenotype terms in the ontology hierarchy
  ancestors (List[String]): All ancestor terms (transitive parents)
  descendants (List[String]): All descendant terms (transitive children)
  exact_synonyms (List[String]): Exact synonym labels
  related_synonyms (List[String]): Related synonym labels
  narrow_synonyms (List[String]): Narrow synonym labels
  broad_synonyms (List[String]): Broad synonym labels
  obsolete_terms (List[String]): Deprecated ontology terms
  obsolete_xrefs (List[String]): Deprecated cross-references
  concept_ids (List[String]): UMLS concept IDs
  concept_names (List[String]): UMLS concept names
  umls_cui (String): Primary UMLS Concept Unique Identifier
  snomed_full_names (List[String]): SNOMED CT full concept names
  snomed_concept_ids (List[String]): SNOMED CT concept IDs
  cui_semantic_type (String): UMLS semantic type of the primary CUI
  ontology (Struct): Source ontology metadata
    title (String): Ontology title
    description (String): Ontology description
    license (String): Ontology license
    version (String): Ontology version
  sources (Struct): Provenance of this node
    direct (List[String]): Datasets that directly contributed this entity
    indirect (List[String]): Datasets that referenced this entity

Anatomy

id (String): Node identifier in CURIE format (e.g. UBERON:0000948)
label (String): Node type abbreviation (ANA)
properties (Struct): Entity-specific properties
  name (String): Entity name
  definition (String): Ontology definition
  xrefs (List[String]): Cross-references to external databases
  synonyms (List[String]): Synonym labels
  ontology (Struct): Source ontology metadata
    title (String): Ontology title
    description (String): Ontology description
    license (String): Ontology license
    version (String): Ontology version
  sources (Struct): Provenance of this node
    direct (List[String]): Datasets that directly contributed this entity
    indirect (List[String]): Datasets that referenced this entity

Pathway

id (String): Node identifier in CURIE format (e.g. R-HSA-109582)
label (String): Node type abbreviation (PWY)
properties (Struct): Pathway-specific properties
  name (String): Pathway name
  species (String): Species name (e.g. Homo sapiens)
  sources (Struct): Provenance of this node
    direct (List[String]): Datasets that directly contributed this entity
    indirect (List[String]): Datasets that referenced this entity

Biological Process

id (String): Node identifier in CURIE format (e.g. GO:0006915)
label (String): Node type abbreviation (BPO)
properties (Struct): Entity-specific properties
  name (String): Entity name
  definition (String): Ontology definition
  xrefs (List[String]): Cross-references to external databases
  synonyms (List[String]): Synonym labels
  ontology (Struct): Source ontology metadata
    title (String): Ontology title
    description (String): Ontology description
    license (String): Ontology license
    version (String): Ontology version
  sources (Struct): Provenance of this node
    direct (List[String]): Datasets that directly contributed this entity
    indirect (List[String]): Datasets that referenced this entity

Cellular Component

id (String): Node identifier in CURIE format (e.g. GO:0005737)
label (String): Node type abbreviation (CCO)
properties (Struct): Entity-specific properties
  name (String): Entity name
  definition (String): Ontology definition
  xrefs (List[String]): Cross-references to external databases
  synonyms (List[String]): Synonym labels
  ontology (Struct): Source ontology metadata
    title (String): Ontology title
    description (String): Ontology description
    license (String): Ontology license
    version (String): Ontology version
  sources (Struct): Provenance of this node
    direct (List[String]): Datasets that directly contributed this entity
    indirect (List[String]): Datasets that referenced this entity

Molecular Function

id (String): Node identifier in CURIE format (e.g. GO:0003677)
label (String): Node type abbreviation (MFN)
properties (Struct): Entity-specific properties
  name (String): Entity name
  definition (String): Ontology definition
  xrefs (List[String]): Cross-references to external databases
  synonyms (List[String]): Synonym labels
  ontology (Struct): Source ontology metadata
    title (String): Ontology title
    description (String): Ontology description
    license (String): Ontology license
    version (String): Ontology version
  sources (Struct): Provenance of this node
    direct (List[String]): Datasets that directly contributed this entity
    indirect (List[String]): Datasets that referenced this entity

Exposure

id (String): Node identifier in CURIE format (e.g. CTD:D001564)
label (String): Node type abbreviation (EXP)
properties (Struct): Exposure-specific properties
  name (String): Exposure name
  source_categories (List[String]): Exposure source categories (e.g. chemical, biological)
  source_details (String): Additional source details
  sources (Struct): Provenance of this node
    direct (List[String]): Datasets that directly contributed this entity
    indirect (List[String]): Datasets that referenced this entity
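
Every node type carries the same `sources` struct, so provenance filters can be written once and reused across types. The sketch below assumes `sources` is nested inside `properties` (an assumption based on the three-column base schema) and uses invented node IDs and dataset names; `directly_from` is a hypothetical helper.

```python
# Hypothetical decoded nodes; IDs and dataset names are placeholders.
nodes = [
    {"id": "n1", "properties": {"sources": {"direct": ["dataset_a"], "indirect": ["dataset_b"]}}},
    {"id": "n2", "properties": {"sources": {"direct": [], "indirect": ["dataset_a"]}}},
    {"id": "n3", "properties": {"sources": {"direct": ["dataset_a", "dataset_b"], "indirect": []}}},
]


def directly_from(nodes: list, dataset: str) -> list:
    """Return IDs of nodes that the given dataset contributed directly."""
    return [
        node["id"]
        for node in nodes
        if dataset in node["properties"]["sources"]["direct"]
    ]


print(directly_from(nodes, "dataset_a"))
```

Swapping `"direct"` for `"indirect"` gives the nodes that a dataset only references.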
