Generalizability
OHDSI Study Protocol: OHDSI_Study_Protocol_v1.0
Collaborators: Anna Ostropolets and Patrick Ryan
This study aims to evaluate and characterize the generalizability, or coverage, of the OMOP vocabulary terms included in the OMOP2OBO mapping set relative to the OMOP vocabulary terms used at the Observational Health Data Sciences and Informatics (OHDSI) Concept Prevalence study sites.
As described here, the Concept Prevalence study was designed to provide researchers with additional context regarding the frequency at which different clinical codes occur across the OHDSI research network:
We want to study the usage patterns of Concepts across different OMOP CDM instances. This in itself could be useful information to answer many questions, but we have a concrete reason: For any one medical entity, the granularity of codes captured in a data source can vary greatly. For example, Chronic Kidney Disorder stage II can be coded as ICD9 code 585.2 Chronic kidney disease, Stage II (mild); 585.9 Chronic kidney disease, unspecified or even as 586 Renal failure, unspecified. However, this information is key for any cohort definition. Currently, researchers have no way of knowing whether a certain concept with high granularity is even available for selection, or whether they have to use a generic concept in combination with some auxiliary information to define the cohort correctly. Each data source instance is a black box and knowledge about the distribution of the concepts is limited to the very instance researchers have access to. But OHDSI Network Studies are dependent on cohort definitions that work across the network.
The main research question is: how does the coverage of the OMOP vocabulary terms present in the OMOP2OBO mappings differ across the OHDSI Concept Prevalence study sites?
The specific aims of this study are as follows:
- Examine OMOP2OBO coverage across the Concept Prevalence sites by identifying:
  - OMOP vocabulary terms that exist in OMOP2OBO and in one or more sites.
  - OMOP vocabulary terms present only in OMOP2OBO and in none of the Concept Prevalence sites.
  - OMOP vocabulary terms present only in one or more of the sites.
- Demonstrate the potential for [molecular] biological inference of OMOP2OBO by characterizing differences in OBO ontology term enrichment across the Concept Prevalence sites when varying different aspects of data provenance (e.g. site type, clinical specialty, and site location).
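The three groups in the first aim amount to set operations over concept ids. The following is a minimal sketch, not the study's actual code; the function name `partition_coverage`, the dictionary keys, and the toy ids are all assumptions made for illustration.

```python
from typing import Dict, Set


def partition_coverage(
    omop2obo: Set[int], site_concepts: Dict[str, Set[int]]
) -> Dict[str, Set[int]]:
    """Split concept ids into the three coverage groups named in the first aim."""
    # Union of every concept id observed at any site.
    any_site = set().union(*site_concepts.values())
    return {
        "overlap": omop2obo & any_site,        # in OMOP2OBO and at one or more sites
        "omop2obo_only": omop2obo - any_site,  # in OMOP2OBO but at no site
        "site_only": any_site - omop2obo,      # at one or more sites, not in OMOP2OBO
    }


# Toy concept ids (hypothetical, not real OMOP identifiers).
groups = partition_coverage(
    omop2obo={1, 2, 3},
    site_concepts={"site_a": {2, 3, 4}, "site_b": {3, 5}},
)
print(groups)
```

Computing the union once keeps the comparison independent of the number of sites; per-site coverage could be obtained by calling the same function with a single-site dictionary.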
Study Sites
In addition to the Concept Prevalence study sites (n=22), data were obtained from two independent academic medical centers. High-level descriptions of each site, including the total number of records and concepts, are provided below.
Database | Type | Location | Record Count | Concept Count |
---|---|---|---|---|
Ajou University Database (Ajou) | EHR | Non-US | 30,238,709 | 6,055 |
Australian Electronic practice based research network (AU-ePBRN) | EHR | Non-US | 11,658,378 | 5,027 |
Columbia University Medical Center Database (CUMC) | EHR | US | 938,078,465 | 21,502 |
IBM MarketScan Commercial Database (CCAE) | CLAIMS | US | 12,649,562,658 | 31,570 |
IBM MarketScan Medicare Supplemental Database (MDCR) | CLAIMS | US | 2,770,787,154 | 25,121 |
IBM MarketScan Multi-State Medicaid Database (MDCD) | CLAIMS | US | 4,283,172,117 | 19,133 |
IQVIA Disease Analyzer (DA) France | EHR | Non-US | 39,632,134 | 3,423 |
IQVIA Disease Analyzer (DA) Germany | EHR | Non-US | 851,853,377 | 9,276 |
IQVIA Longitudinal Patient Data (LPD) Australia | EHR | Non-US | 56,940,803 | 5,833 |
IQVIA US Ambulatory EMR (AmbEMR) | EHR | US | 10,634,058,375 | 62,161 |
IQVIA US Hospital Charge Data Master (CDM) | EHR | US | 4,857,228,360 | 19,352 |
IQVIA US LRxDx Open Claims (Open Claims) | CLAIMS | US | 71,678,847,042 | 20,083 |
Japan Medical Data Center database (JMDC) | EHR | Non-US | 1,184,325,523 | 6,833 |
Korea National Health Insurance Service / National Sample Cohort (NHIS/NSC Korea) | CLAIMS | Non-US | 323,096,899 | 6,667 |
Medical Information Mart for Intensive Care III (MIMIC3) | EHR | US | 124,127,038 | 3,781 |
Optum De-Identified Clinformatics Data-Mart-Database— Socio-Economic Status (SES) | CLAIMS | US | 13,369,194,028 | 36,943 |
Optum De-Identified Clinformatics Data-Mart-Database—Date of Death (DOD) | CLAIMS | US | 9,716,879,363 | 34,853 |
Optum De-identified Electronic Health Record Dataset (PANTHER) | EHR | US | 27,894,204,112 | 59,777 |
Premier Healthcare Database (PREMIER) | CLAIMS | US | 16,794,698,039 | 18,903 |
Stanford Medicine Research Data Repository (STaRR) | EHR | US | 416,175,821 | 11,161 |
The Healthcare Cost and Utilization Project Nationwide Inpatient Sample (HCUP) | EHR | US | 744,807,853 | 9,391 |
Tufts Medical Center Database (Tufts) | EHR | US | 66,863,985 | 21,118 |
UCHealth | EHR | US | 1,215,613,326 | 19,073 |
USC PScanner | EHR | US | 29,703,213 | 11,476 |
Data
For each data site, standard concepts used at least once in practice were obtained from the Condition Occurrence (i.e. SNOMED-CT), Drug Exposure (i.e. ingredient-level; RxNorm), and Measurement (i.e. LOINC) tables. The total frequency was obtained for every concept. Consistent with the Concept Prevalence study, concepts occurring fewer than 10 times were ignored, and all remaining concepts occurring fewer than 100 times were assigned a count of 100.
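The count thresholds described above can be sketched as follows. This is an illustrative assumption about how the rule could be applied, not the code used by the study; `apply_count_thresholds` and its parameter names are hypothetical, and the concept ids are toy values.

```python
from typing import Dict


def apply_count_thresholds(
    counts: Dict[int, int], drop_below: int = 10, floor: int = 100
) -> Dict[int, int]:
    """Drop concepts seen fewer than `drop_below` times and raise any
    remaining count below `floor` up to `floor`."""
    return {
        concept_id: max(count, floor)
        for concept_id, count in counts.items()
        if count >= drop_below
    }


# Toy counts keyed by hypothetical concept ids.
raw = {101: 3, 102: 10, 103: 57, 104: 100, 105: 2500}
masked = apply_count_thresholds(raw)
print(masked)
```

A consequence of this rule is that the minimum reported frequency for any retained concept is 100, which matches the lower bound of the frequency ranges reported in the results.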
SQL Query: OMOP2OBO_ConceptPrevalence_ErrorAnalysis.sql
An error analysis was performed to help provide insight into the Concept Prevalence study concepts that were not covered by the OMOP2OBO mapping sets. The OMOP2OBO mapping set was created from the OMOP common data model (CDM) v5.0, which contained vocabulary concepts with a timestamp of June 26, 2018. Given how quickly the vocabulary changes, we hypothesized that some of the concepts that we were unable to cover could be brand new concepts and/or concepts which have been updated or replaced by pre-existing concepts.
To perform this analysis, the following SQL query was run against a current version of the OMOP CDM:
-- Relationships with a valid_start_date after the OMOP2OBO vocabulary
-- snapshot (2018-06-26) and before 2020-10-17.
SELECT DISTINCT
  r.relationship_id,
  c1.concept_id AS SOURCE_CONCEPT_ID,
  c1.concept_name AS SOURCE_CONCEPT_LABEL,
  c2.concept_id AS TARGET_CONCEPT_ID,
  c2.concept_name AS TARGET_CONCEPT_LABEL
FROM
  sandbox-omop.oct_2020.concept_relationship r
  JOIN sandbox-omop.oct_2020.concept c1 ON c1.concept_id = r.concept_id_1
  JOIN sandbox-omop.oct_2020.concept c2 ON c2.concept_id = r.concept_id_2
WHERE
  -- Restrict to concepts listed in the three merged condition,
  -- medication (ingredient), and measurement tables.
  r.concept_id_1 IN (
    SELECT concept_id FROM sandbox-tc.CHCO_DeID_Oct2018.OMOP2OBO_Conditions_Concepts_Merged
    UNION DISTINCT
    SELECT ingredient_concept_id FROM sandbox-tc.CHCO_DeID_Oct2018.OMOP2OBO_Medications_Concepts_Merged
    UNION DISTINCT
    SELECT concept_id FROM sandbox-tc.CHCO_DeID_Oct2018.OMOP2OBO_Measurements_Concepts_Merged
  )
  AND r.relationship_id IN ("Concept replaced by", "Maps to", "Concept same_as from", "Concept poss_eq from", "Concept was_a from", "Is a")
  -- Keep only relationships created between the two vocabulary versions.
  AND (r.valid_start_date > '2018-06-26' AND r.valid_start_date < '2020-10-17')
ORDER BY r.relationship_id;
The relationship_id column describes how two OMOP concept_ids are related. The relationship_ids included in the query above were chosen so that they allow us to identify two scenarios: (1) Newly Added Concepts: concepts that did not exist in the version of the OMOP CDM used to create the OMOP2OBO mappings, but that do exist in the current CDM; (2) Updated Concepts: concepts that existed in the version of the OMOP CDM used to create the OMOP2OBO mappings, but which have been updated and now exist under a new concept_id. The table below organizes the OMOP CDM relationship_ids by scenario.
Scenario Type | Relationship_ID |
---|---|
Newly Added Concepts | Maps to |
Newly Added Concepts | Concept poss_eq from (synonyms) |
Newly Added Concepts | Concept same_as from (synonyms) |
Newly Added Concepts | Concept was_a from (concept type) |
Newly Added Concepts | Is a (concept type) |
Replaced Concept | Concept replaced by |
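The scenario table above can be applied to the query output with a simple lookup. The sketch below assumes each result row is available as a dictionary keyed by the query's column names (`relationship_id`, `SOURCE_CONCEPT_ID`); the helper names, the `"Unclassified"` fallback, and the toy rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Lookup built directly from the scenario table.
SCENARIO_BY_RELATIONSHIP = {
    "Maps to": "Newly Added Concepts",
    "Concept poss_eq from": "Newly Added Concepts",
    "Concept same_as from": "Newly Added Concepts",
    "Concept was_a from": "Newly Added Concepts",
    "Is a": "Newly Added Concepts",
    "Concept replaced by": "Replaced Concept",
}


def group_by_scenario(rows: Iterable[dict]) -> Dict[str, List[int]]:
    """Group source concept ids by scenario type."""
    grouped = defaultdict(list)
    for row in rows:
        # Relationships outside the table fall into an assumed catch-all bucket.
        scenario = SCENARIO_BY_RELATIONSHIP.get(row["relationship_id"], "Unclassified")
        grouped[scenario].append(row["SOURCE_CONCEPT_ID"])
    return dict(grouped)


# Toy rows with hypothetical concept ids.
rows = [
    {"relationship_id": "Maps to", "SOURCE_CONCEPT_ID": 11},
    {"relationship_id": "Concept replaced by", "SOURCE_CONCEPT_ID": 12},
    {"relationship_id": "Is a", "SOURCE_CONCEPT_ID": 13},
]
scenarios = group_by_scenario(rows)
print(scenarios)
```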
Analysis
We used this information to categorize uncovered concepts (i.e. concepts included in the Concept Prevalence data sets, but missing from the OMOP2OBO mapping set). Specifically, for each clinical domain we obtained three lists: (1) uncovered concepts in the error analysis data; (2) uncovered concepts in the OMOP2OBO mapping data, but ineligible for mapping; and (3) uncovered concepts that are truly unable to be accounted for by existing data sources. For lists 1 and 2, we aimed to explain the uncovered concepts by categorizing them according to an explanation for their missingness (i.e. concept present in a newer OMOP vocabulary or replaced concept). For all the lists, we also obtained prevalence information for each concept as frequency of use within and across the Concept Prevalence data sites. The concept prevalence was used as a metric to measure the importance of each uncovered concept.
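One way to derive the within-site and across-site prevalence of uncovered concepts is sketched below. The exact aggregation used in the study is not stated here, so the choice of "number of sites" and "mean frequency over the sites where the concept occurs" is an assumption, as are the function name `summarize_prevalence` and the toy data.

```python
from statistics import mean
from typing import Dict, Set


def summarize_prevalence(
    site_counts: Dict[str, Dict[int, int]], uncovered: Set[int]
) -> Dict[int, dict]:
    """For each uncovered concept, report how many sites use it and its
    mean frequency over those sites."""
    summary = {}
    for concept_id in uncovered:
        # Collect this concept's count from every site that reports it.
        counts = [c[concept_id] for c in site_counts.values() if concept_id in c]
        if counts:  # skip concepts that occur at no site
            summary[concept_id] = {
                "n_sites": len(counts),
                "mean_frequency": mean(counts),
            }
    return summary


# Toy per-site counts keyed by hypothetical concept ids.
site_counts = {"site_a": {1: 100, 2: 300}, "site_b": {1: 500}}
prevalence = summarize_prevalence(site_counts, uncovered={1, 2, 9})
print(prevalence)
```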
Results are presented below by clinical domain. Overall, the OMOP vocabulary terms included in the OMOP2OBO mapping set provided exceptional coverage, which differed both by Concept Prevalence study site and clinical domain.
Missing Condition Concepts (n=441)
- 367 found in a newer version of OMOP CDM (v6.0.0)
- 74 truly missing from 58.2% of sites
- Concepts occurred in an average of 2.74 sites (1-14)
- Concepts had a mean frequency of 5,320.06 (100-100,483)
Most Frequently Missing Condition Concepts
- Increased fluid intake
- Disease caused by 2019-nCoV
- Polycystic ovary syndrome
- Saddle embolus of the pulmonary artery with acute cor pulmonale
- Adjustment disorder with mixed anxiety and depressed mood
Missing Ingredient Concepts (n=95)
- 5 found in a newer version of OMOP CDM (v6.0.0)
- 90 truly missing from 58.2% of sites
- Concepts occurred in an average of 2.66 sites (1-14)
- Concepts had a mean frequency of 3,361.15 (100-175,551.29)
Most Frequently Missing Ingredient Concepts
- hepatitis A virus strain CR 326F antigen inactivated
- erenumab
- fremanezumab
- galcanezumab
- baloxavir marboxil
Missing Measurement Concepts (n=20,893)
- 13 found in a newer version of OMOP CDM (v6.0.0)
- 20,722 truly missing from 58.2% of sites
- Concepts occurred in an average of 2.82 sites (1-14)
- Concepts had a mean frequency of 218,874.03 (100-121,984,682)
Most Frequently Missing Measurement Concepts
- Pulse intensity of Unspecified artery palpation
- Penicillin G potassium [Mass] of Dose
- Sodium [Moles/volume] in Saliva (oral fluid)
- Cotinine/Creatinine [Mass Ratio] in Urine
- Chloride [Moles/volume] in Saliva (oral fluid)