diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md
new file mode 100644
index 0000000000..14ee5dc505
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/README.md
@@ -0,0 +1,152 @@
+# Importing Medical Subject Headings (MeSH) data from NCBI
+
+## Table of Contents
+
+1. [About the Dataset](#about-the-dataset)
+   1. [Download Data](#download-data)
+   2. [Overview](#overview)
+   3. [Notes and Caveats](#notes-and-caveats)
+   4. [dcid Generation](#dcid-generation)
+   5. [License](#license)
+2. [About the import](#about-the-import)
+   1. [Artifacts](#artifacts)
+      1. [Scripts](#scripts)
+      2. [tMCF Files](#tmcf-files)
+   2. [Import Procedure](#import-procedure)
+   3. [Tests](#tests)
+
+## About the Dataset
+
+“The Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information”. Data Commons includes the Concept, Descriptor, Qualifier, Supplementary Concept Record, and Term elements of MeSH as described [here](https://www.nlm.nih.gov/mesh/xml_data_elements.html). More information about the dataset can be found on the official National Center for Biotechnology Information (NCBI) [website](https://www.ncbi.nlm.nih.gov/mesh/). This dataset is updated annually on the first of January.
+PubChem is one of the largest repositories of chemical compound information. It is mapped to many other medical ontologies, including MeSH. More information about compound IDs and other properties can be found on the official PubChem [website](https://pubchemdocs.ncbi.nlm.nih.gov/compounds).
+
+### Download Data
+
+All the terminology referenced in the MeSH data can be downloaded in various formats [here](https://www.nlm.nih.gov/databases/download/mesh.html). To represent the entirety of the MeSH ontology in Biomedical Data Commons we download all four xml files from MeSH: `desc.xml`, `pa.xml`, `qual.xml`, and `supp.xml`. We also download from PubChem the mapping file between PubChem compound IDs (CIDs) and the MeSH unique IDs of the corresponding MeSH Descriptors or Supplementary Concept Records (`CID-MeSH.csv`). All files required for this import can be downloaded by running [`download.sh`](download.sh).
+
+### Overview
+
+MeSH provides the vocabulary thesaurus used for indexing articles for PubMed. In addition, the scripts map PubChem compound IDs to the corresponding MeSH Descriptor and Supplementary Concept Record nodes. In this import we use the four MeSH xml files to define MeSH Concepts, Descriptors, Qualifiers, Supplementary Concept Records, and Terms as individual nodes and to maintain the links between these node types as indicated in the table below. An overview of the MeSH record types can be found [here](https://www.nlm.nih.gov/mesh/intro_record_types.html).
+
+| Node Type | Property | Property Value Range (Out Link Node Type) |
+| --- | --- | --- |
+| MeSHConcept | preferredConcept | MeSHConcept |
+| MeSHConcept | parent | MeSHDescriptor |
+| MeSHConcept | hasMeSHQualifier | MeSHQualifier |
+| MeSHDescriptor | sameAs | ChemicalCompound |
+| MeSHDescriptor | mechanismOfAction | MeSHDescriptor |
+| MeSHDescriptor | specializationOf | MeSHDescriptor |
+| MeSHDescriptor | hasMeSHQualifier | MeSHQualifier |
+| MeSHSupplementaryConceptRecord | mechanismOfAction | MeSHDescriptor |
+| MeSHSupplementaryConceptRecord | parent | MeSHDescriptor |
+| MeSHTerm | parent | MeSHConcept |
+
+### Notes and Caveats
+
+The total file size of all original downloaded files for this import is ~1.1 GB. The files from MeSH are in XML format, so we use the Python package `xml.etree.ElementTree` to read them into pandas DataFrames for further processing. Please note that extracting the information from the XML tags and converting it into well-formatted csv files involves several computationally heavy steps, whose performance depends on the RAM of the user's system. Special care needs to be taken when traversing the XML tree to ensure that the properties at each level are associated with the appropriate MeSH node type. As part of this process, we generate a separate csv + tmcf pair for each node type from each file, with an additional mapping csv + tmcf pair to bring in mappings between node types as necessary. The total file size for all sixteen formatted csvs is ~135 MB. Finally, we decided not to include `LexicalTag` or `IsPermutedTermYN` as properties for MeSHTerms from the `qual.xml` file because for all Terms the property value was `NON` or `False`, respectively, and thus these properties did not contain any additional information.
+
+The `pa.xml` file provides information on the pharmacological action (mechanismOfAction) of MeSHDescriptor and MeSHSupplementaryConceptRecord nodes; it covers only the subset of MeSH records to which this applies. Therefore, MeSHDescriptor and MeSHSupplementaryConceptRecord nodes that are listed in `pa.xml` as having a mechanismOfAction linking them to MeSHDescriptor nodes are additionally noted as being of the Drug node type.
+
+### dcid Generation
+The dcids for all MeSHRecordType nodes (MeSHConcept, MeSHDescriptor, MeSHQualifier, MeSHSupplementaryConceptRecord, and MeSHTerm) are generated from the MeSH unique IDs with the `bio/` prefix. MeSH unique IDs start with a letter followed by a string of numbers, with the starting letter indicating the MeSH record type. The mapping of MeSH record type to the first letter of its unique ID is indicated below. In addition to using the MeSH unique ID to generate the dcid, the unique ID is recorded as the value of the `identifier` property for all MeSHRecordType nodes.
+
+| Node Type | Starting Letter for MeSH unique ID |
+| --- | --- |
+| MeSHConcept | M |
+| MeSHDescriptor | D |
+| MeSHQualifier | Q |
+| MeSHSupplementaryConceptRecord | C |
+| MeSHTerm | T |
+
+The dcids for ChemicalCompounds were generated from the PubChem compound ID with the chem prefix: `chem/CID`. The PubChem compound ID is a string of numbers, so we add this prefix to the front of the ID as part of the dcid to provide context. The PubChem compound ID is also separately stored as a string value of the property `pubChemCompoundID`.
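+
+For illustration, the sketch below restates these dcid conventions in Python. The helper functions are hypothetical and exist only for this example; the actual `format_mesh_*.py` scripts build the same strings as pandas columns.
+
+```python
+# Mapping of MeSH record type by the first letter of the MeSH unique ID,
+# as listed in the table above.
+MESH_TYPE_BY_FIRST_LETTER = {
+    'M': 'MeSHConcept',
+    'D': 'MeSHDescriptor',
+    'Q': 'MeSHQualifier',
+    'C': 'MeSHSupplementaryConceptRecord',
+    'T': 'MeSHTerm',
+}
+
+
+def mesh_dcid(mesh_unique_id):
+    # e.g. 'D012345' -> 'bio/D012345'; the unique ID is also kept as `identifier`
+    assert mesh_unique_id[0] in MESH_TYPE_BY_FIRST_LETTER, 'unexpected MeSH record type'
+    return 'bio/' + mesh_unique_id
+
+
+def pubchem_compound_dcid(cid):
+    # e.g. 12345 -> 'chem/CID12345'; the CID is also stored as `pubChemCompoundID`
+    return 'chem/CID' + str(cid)
+```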
+
+### License
+
+Any works found on National Library of Medicine (NLM) Web sites may be freely used or reproduced without permission in the U.S. More information about the license can be found [here](https://www.nlm.nih.gov/web_policies.html).
+
+## About the import
+
+### Artifacts
+
+#### Scripts
+
+##### Bash Scripts
+
+[`download.sh`](scripts/download.sh) downloads the desc, pa, qual, and supp xml files from MeSH as well as the CID-MeSH mapping file from PubChem.
+
+[`run.sh`](scripts/run.sh) converts raw data from MeSH into csv files formatted for import into the Data Commons knowledge graph.
+
+[`tests.sh`](scripts/tests.sh) runs standard tests on CSV + tMCF pairs to check for proper formatting.
+
+##### Python Scripts
+
+[`format_mesh_desc.py`](scripts/format_mesh_desc.py) converts the original xml into eight formatted csv files, each of which can be imported alongside its matching tMCF.
+
+[`format_mesh_pa.py`](scripts/format_mesh_pa.py) converts the original xml file into two formatted csv files, each of which can be imported alongside its matching tMCF.
+
+[`format_mesh_qual.py`](scripts/format_mesh_qual.py) converts the original xml into four formatted csv files, each of which can be imported alongside its matching tMCF.
+
+[`format_mesh_supp.py`](scripts/format_mesh_supp.py) converts the MeSH supplementary record file into a csv mapping records to MeSH descriptor IDs,
+and maps the MeSH supplementary records to PubChem compound IDs, resulting in a second, separate csv.
+
+#### tMCF Files
+
+The tMCF files that map each column in the corresponding CSV file to the appropriate property can be found [here](tMCFs). They include:
+
+[`mesh_desc_concept.tmcf`](tMCFs/mesh_desc_concept.tmcf) contains the tmcf mapping to the csv of concept nodes generated from the mesh desc file.
+
+[`mesh_desc_concept_mapping.tmcf`](tMCFs/mesh_desc_concept_mapping.tmcf) contains the tmcf mapping to the csv of the links of concept nodes to descriptor nodes generated from the mesh desc file.
+
+[`mesh_desc_descriptor.tmcf`](tMCFs/mesh_desc_descriptor.tmcf) contains the tmcf mapping to the csv of descriptor nodes generated from the mesh desc file.
+
+[`mesh_desc_descriptor_mapping.tmcf`](tMCFs/mesh_desc_descriptor_mapping.tmcf) contains the tmcf mapping to the csv of descriptor node links to parent (more general) descriptor nodes from the mesh desc file.
+
+[`mesh_desc_qualifier.tmcf`](tMCFs/mesh_desc_qualifier.tmcf) contains the tmcf mapping to the csv of qualifier nodes generated from the mesh desc file.
+
+[`mesh_desc_qualifier_mapping.tmcf`](tMCFs/mesh_desc_qualifier_mapping.tmcf) contains the tmcf mapping to the csv of descriptor node links to qualifier nodes generated from the mesh desc file.
+
+[`mesh_desc_term.tmcf`](tMCFs/mesh_desc_term.tmcf) contains the tmcf mapping to the csv of term nodes generated from the mesh desc file.
+
+[`mesh_desc_term_mapping.tmcf`](tMCFs/mesh_desc_term_mapping.tmcf) contains the tmcf mapping to the csv of term node links to concept nodes from the mesh desc file.
+
+[`mesh_pharmacological_action_descriptor.tmcf`](tMCFs/mesh_pharmacological_action_descriptor.tmcf) contains the tmcf mapping to the csv of pharmacological actions of mesh descriptors to the appropriate mesh descriptor nodes from the mesh pa file.
+
+[`mesh_pharmacological_action_record.tmcf`](tMCFs/mesh_pharmacological_action_record.tmcf) contains the tmcf mapping to the csv of pharmacological actions of mesh supplementary concept records to the appropriate mesh descriptor nodes from the mesh pa file.
+
+[`mesh_pubchem_mapping.tmcf`](tMCFs/mesh_pubchem_mapping.tmcf) contains the tmcf mapping to the csv of PubChem compound IDs (CIDs) to MeSH Supplementary Records from the `CID-MeSH.csv` and the mesh supp file.
+
+[`mesh_qual_concept.tmcf`](tMCFs/mesh_qual_concept.tmcf) contains the tmcf mapping to the csv of concept nodes generated from the mesh qual file.
+
+[`mesh_qual_concept_mapping.tmcf`](tMCFs/mesh_qual_concept_mapping.tmcf) contains the tmcf mapping to the csv of mappings of concept nodes to other mesh node types generated from the mesh qual file.
+
+[`mesh_qual_qualifier.tmcf`](tMCFs/mesh_qual_qualifier.tmcf) contains the tmcf mapping to the csv of qualifier nodes generated from the mesh qual file.
+
+[`mesh_qual_term.tmcf`](tMCFs/mesh_qual_term.tmcf) contains the tmcf mapping to the csv of term nodes generated from the mesh qual file.
+
+[`mesh_record.tmcf`](tMCFs/mesh_record.tmcf) contains the tmcf mapping to the csv of supplementary record nodes generated from the mesh supp file.
+
+### Import Procedure
+
+Download the most recent versions of all mesh files (desc, pa, qual, and supp) and the PubChem file that maps CIDs to MeSH Supplementary Records:
+
+```bash
+sh download.sh
+```
+
+Generate the cleaned CSVs for each input file:
+
+```bash
+sh run.sh
+```
+
+### Tests
+
+The first step of `tests.sh` is to download the Data Commons java import tool and store it in a `tmp` directory. This assumes that the user has the Java Runtime Environment (JRE) installed. The tool is described in the Data Commons documentation of the [import pipeline](https://github.com/datacommonsorg/import/). The releases of the tool can be viewed [here](https://github.com/datacommonsorg/import/releases/). Here we download version `0.1-alpha.1k` and apply it to check our csv + tmcf import. It evaluates whether all schema used in the import are present in the graph and whether all referenced nodes are present in the graph, along with other checks that issue fatal errors, errors, or warnings when a check fails. Please note that empty tokens for some columns are expected, as this reflects the original data. All referenced nodes are created as part of the same csv + tmcf import pair, therefore any Existence Missing Reference warnings can be ignored.
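+
+Each csv + tmcf pair is linted with an invocation of the following form (this mirrors the commands in `tests.sh`, shown here for the descriptor pair; the generated report directory is renamed after each run):
+
+```bash
+java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_descriptor.tmcf CSVs/mesh_desc_descriptor.csv
+mv dc_generated mesh_desc_descriptor
+```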
+ +To run tests: + +```bash +sh tests.sh +``` + +This will generate an output file for the results of the tests on each csv + tmcf pair diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/download.sh b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/download.sh new file mode 100644 index 0000000000..d4c0de277c --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/download.sh @@ -0,0 +1,18 @@ +#!/bin/bash + +mkdir -p input; cd input + +# downloads the mesh xml file +curl -o mesh-desc.xml https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2024.xml + +# downloads the mesh pharmacological action xml file +curl -o mesh-pa.xml https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/pa2024.xml + +# downloads the mesh qualifier xml file +curl -o mesh-qual.xml https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/qual2024.xml + +# downloads the mesh record xml file +curl -o mesh-supp.xml https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/supp2024.xml + +# downloads the pubchem compound ID and name csv file +curl -o mesh-pubchem.csv https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-MeSH diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_desc.py b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_desc.py new file mode 100644 index 0000000000..fbfff0046a --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_desc.py @@ -0,0 +1,409 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +''' +Author: Suhana Bedi +Date: 09/17/2021 +Name: format_mesh_desc.py +Edited By: Samantha Piekos +Last Modified: 03/11/24 +Description: converts nested .xml to .csv and further breaks down the csv +into five different csvs, each describing relations between terms, qualifiers, +descriptors and concepts with an additional file mapping descriptors to +qualifiers. 
+@file_input: input .xml downloaded from NCBI +@file_output: five formatted csv files ready for import into data commons kg + with corresponding tmcf file +''' + + +# set up environment +import sys +import pandas as pd +import numpy as np +from xml.etree.ElementTree import parse + + +# declare universal variables +FILEPATH_MESH_CONCEPT = 'CSVs/mesh_desc_concept.csv' +FILEPATH_MESH_CONCEPT_MAPPING = 'CSVs/mesh_desc_concept_mapping.csv' +FILEPATH_MESH_DESCRIPTOR = 'CSVs/mesh_desc_descriptor.csv' +FILEPATH_MESH_DESCRIPTOR_MAPPING = 'CSVs/mesh_desc_descriptor_mapping.csv' +FILEPATH_MESH_QUALIFIER = 'CSVs/mesh_desc_qualifier.csv' +FILEPATH_MESH_QUALIFIER_MAPPING = 'CSVs/mesh_desc_qualifier_mapping.csv' +FILEPATH_MESH_TERM = 'CSVs/mesh_desc_term.csv' +FILEPATH_MESH_TERM_MAPPING = 'CSVs/mesh_desc_term_mapping.csv' + + +def format_mesh_xml(mesh_xml): + """ + Parses the xml file and converts it to a csv with + required columns + Args: + mesh_xml = xml file to be parsed + Returns: + pandas df after parsing + """ + document = parse(mesh_xml) + d = [] + ## column names for parsed xml tags + dfcols = [ + 'DescriptorID', 'DescriptorName', 'DateCreated-Year', + 'DateCreated-Month', 'DateCreated-Day', 'DateRevised-Year', + 'DateRevised-Month', 'DateRevised-Day', 'DateEstablished-Year', + 'DateEstablished-Month', 'DateEstablished-Day', 'QualifierID', + 'QualifierName', 'QualifierAbbreviation', 'ConceptID', 'ConceptName', + 'ScopeNote', 'TermID', 'TermName', 'TreeNumber', 'NLMClassificationNumber' + ] + df = pd.DataFrame(columns=dfcols) + for item in document.iterfind('DescriptorRecord'): + ## parses the Descriptor ID + d1 = item.findtext('DescriptorUI') + ## parses the Descriptor Name + elem = item.find(".//DescriptorName") + d1_name = elem.findtext("String") + ## parses the Date of Creation + date_created = item.find(".//DateCreated") + if date_created is None: + d1_created_year = np.nan + d1_created_month = np.nan + d1_created_day = np.nan + else: + d1_created_year = date_created.findtext("Year") + d1_created_month = date_created.findtext("Month") + d1_created_day = date_created.findtext("Day") + ## parses the Date of Revision + date_revised = item.find(".//DateRevised") + if date_revised is None: + d1_revised_year = np.nan + d1_revised_month = np.nan + d1_revised_day = np.nan + else: + d1_revised_year = date_revised.findtext("Year") + d1_revised_month = date_revised.findtext("Month") + d1_revised_day = date_revised.findtext("Day") + ## parses the Date of Establishment + date_established = item.find(".//DateEstablished") + if date_established is None: + d1_established_year = np.nan + d1_established_month = np.nan + d1_established_day = np.nan + else: + d1_established_year = date_established.findtext("Year") + d1_established_month = date_established.findtext("Month") + d1_established_day = date_established.findtext("Day") + tree_list = item.find(".//TreeNumberList") + if tree_list is None: + tree_num = np.nan + else: + tree_num = [] + for i in range(len(tree_list)): + ## parses the Tree Number + tree_num.append(tree_list.findtext("TreeNumber")) + ## parses the NLM Classification Number + nlm_num = item.findtext("NLMClassificationNumber") + if nlm_num is None: + nlm_num = np.nan + quantifier_list = item.find(".//AllowableQualifiersList") + qualID = [] + qual_name = [] + qual_abbr = [] + if quantifier_list is None: + qualID.append(np.nan) + qual_name.append(np.nan) + qual_abbr.append(np.nan) + else: + l1 = quantifier_list.findall(".//AllowableQualifier") + for i in range(len(l1)): + l2 = 
l1[i].find(".//QualifierReferredTo") + ## parses the Qualifier ID + qualID.append(l2.findtext("QualifierUI")) + ## parses the Qualifier Name + l3 = l2.find(".//QualifierName") + qual_name.append(l3.findtext("String")) + ## parses the Qualifier Abbreviation + qual_abbr.append(l1[i].findtext("Abbreviation")) + + concept_list = item.find(".//ConceptList") + if concept_list is None: + conceptID = np.nan + conceptName = np.nan + scopeNote = np.nan + termUI = np.nan + termName = np.nan + else: + c1 = concept_list.findall(".//Concept") + conceptID = [] + conceptName = [] + scopeNote = [] + termUI = [] + termName = [] + for i in range(len(c1)): + ## parses the Concept ID + conceptID.append(c1[i].findtext("ConceptUI")) + ## parses the Scope Note + scopeNote.append(c1[i].findtext("ScopeNote")) + ## parses the Concept Name + c2 = c1[i].find(".//ConceptName") + conceptName.append(c2.findtext("String")) + c3 = c1[i].find(".//TermList") + c4 = c3.findall(".//Term") + subtermUI = [] + subtermName = [] + for j in range(len(c4)): + ## parses the Term ID + subtermUI.append(c4[j].findtext("TermUI")) + subtermName.append(c4[j].findtext("String")) + termUI.append(subtermUI) + termName.append(subtermName) + d.append({'DescriptorID':d1, 'DescriptorName':d1_name, 'DateCreated-Year':d1_created_year, +'DateCreated-Month':d1_created_month, 'DateCreated-Day':d1_created_day, 'DateRevised-Year':d1_revised_year, +'DateRevised-Month':d1_revised_month, 'DateRevised-Day':d1_revised_day, 'DateEstablished-Year':d1_established_year, +'DateEstablished-Month':d1_established_month, 'DateEstablished-Day':d1_established_day, +'QualifierID':qualID, 'QualifierName':qual_name, 'QualifierAbbreviation':qual_abbr, +'ConceptID':conceptID, 'ConceptName':conceptName, 'ScopeNote':scopeNote, 'TermID':termUI, +'TermName':termName, 'TreeNumber':tree_num, 'NLMClassificationNumber':nlm_num}) + + df = pd.DataFrame(d) + return df + + +def date_modify(df): + """ + Modifies the dates in a df, into an ISO format + Args: + df = df with date columns + Returns: + df with modified date columns + + """ + df['DateCreated'] = df['DateCreated-Year'].astype( + str) + "-" + df['DateCreated-Month'].astype( + str) + "-" + df['DateCreated-Day'].astype(str) + df['DateRevised'] = df['DateRevised-Year'].astype( + str) + "-" + df['DateRevised-Month'].astype( + str) + "-" + df['DateRevised-Day'].astype(str) + df['DateEstablished'] = df['DateEstablished-Year'].astype( + str) + "-" + df['DateEstablished-Month'].astype( + str) + "-" + df['DateEstablished-Day'].astype(str) + ## adds quotes from text type columns and replaces "nan" with np.nan + col_names_quote = ['DateCreated', 'DateRevised', 'DateEstablished'] + for col in col_names_quote: + df[col] = df[col].replace(["nan-nan-nan"],np.nan) + ## drop repetitive column values + df = df.drop(columns=[ + 'DateCreated-Year', 'DateCreated-Month', 'DateCreated-Day', + 'DateRevised-Year', 'DateRevised-Month', 'DateRevised-Day', + 'DateEstablished-Year', 'DateEstablished-Month', 'DateEstablished-Day' + ]) + return df + + +def is_not_none(x): + # check if value exists + if pd.isna(x): + return False + return True + + +def format_text_strings(df, col_names): + """ + Converts missing values to numpy nan value and adds outside quotes + to strings (excluding np.nan). Applies change to columns specified in col_names. 
+ """ + + for col in col_names: + df[col] = df[col].str.rstrip() # Remove trailing whitespace + df[col] = df[col].replace([''],np.nan) # replace missing values with np.nan + + # Quote only string values + mask = df[col].apply(is_not_none) + df.loc[mask, col] = '"' + df.loc[mask, col].astype(str) + '"' + + return df + + +def write_decriptor_df_to_csvs(df): + # write descriptor node info to a csv + df_descriptor = df.drop(columns=['DescriptorParentID']).drop_duplicates() + df_descriptor.to_csv(FILEPATH_MESH_DESCRIPTOR, doublequote=False, escapechar='\\') + # write descriptor mapping info to csv file + df_mapping = df[['Descriptor_dcid', 'DescriptorParentID']].dropna().drop_duplicates() + df_mapping.to_csv(FILEPATH_MESH_DESCRIPTOR_MAPPING, doublequote=False, escapechar='\\') + return + + +def format_descriptor_df(df): + # prepares csv specific to descriptor nodes and their properties + # drop columns not required for the descriptor file + df = df.drop(columns=[ + 'QualifierID', 'QualifierName', 'QualifierAbbreviation', 'ConceptID', + 'ConceptName', 'TermID', 'TermName' + ]) + # retrieve first value from ScopeNote list + df['ScopeNote'] = df['ScopeNote'].str[0] + # explode the TreeNumber column + df = df.explode('TreeNumber') + # create descriptor dcid + df['Descriptor_dcid'] = 'bio/' + df['DescriptorID'].astype(str) + # add quotes from text type columns and replaces "nan" with np.nan + col_names_quote = ['DescriptorName', 'ScopeNote'] + df = format_text_strings(df, col_names_quote) + # replace missing names with ID + df['DescriptorName'] = df['DescriptorName'].fillna(df['DescriptorID']) + # retrieve the descriptor parent ID using tree number + df['DescriptorParentID'] = df['TreeNumber'].str[:-4] + map_dict = dict(zip(df['TreeNumber'], df['Descriptor_dcid'])) + df = df.replace({"DescriptorParentID": map_dict}) + df["DescriptorParentID"] = np.where(df['DescriptorParentID'].str[0] == "b", df["DescriptorParentID"], np.nan) + # write descriptor data to csv files + write_decriptor_df_to_csvs(df) + return + + +def format_qualifier_df(df): + # prepares a csv specific to qualifier nodes and their properties + df = df.drop(columns=[ + 'DescriptorID', 'DescriptorName', 'ConceptID', 'ConceptName', + 'ScopeNote', 'TermID', 'TermName', 'TreeNumber', + 'NLMClassificationNumber', 'DateCreated', 'DateRevised', + 'DateEstablished' + ]) + # Explode the Qualifier columns + explode_cols = ['QualifierID', 'QualifierName', 'QualifierAbbreviation'] + df = df.explode(explode_cols) + # remove missing qualifier rows + df = df[df['QualifierID'].notna()] + # add quotes from text type columns and replaces "nan" with np.nan + col_names_quote = ['QualifierName'] + df = format_text_strings(df, col_names_quote) + # replace missing names with ID + df['QualifierName'] = df['QualifierName'].fillna(df['QualifierID']) + # create qualifier dcids + df['Qualifier_dcid'] = 'bio/' + df['QualifierID'].astype(str) + # drop duplicate rows + df = df.drop_duplicates() + # write df to csv file + df.to_csv(FILEPATH_MESH_QUALIFIER, doublequote=False, escapechar='\\') + return + + +def format_qualifier_mapping_df(df): + # processes a csv containing the mappings between descriptors and qualifiers + # drops columns not required for the qualifier file + df = df.drop(columns=[ + 'DescriptorName', 'ConceptID', 'ConceptName', 'ScopeNote', + 'TermID', 'TermName', 'TreeNumber', 'NLMClassificationNumber', + 'QualifierName', 'QualifierAbbreviation', 'DateCreated', + 'DateRevised', 'DateEstablished' + ]) + # Explode the Qualifier ID column + df = 
df.explode('QualifierID') + # drop duplicate rows and rows with missing values + df = df.dropna() + df = df.drop_duplicates() + # create qualifier and descriptor dcids + df['Qualifier_dcid'] = 'bio/' + df['QualifierID'].astype(str) + df['Descriptor_dcid'] = 'bio/' + df['DescriptorID'].astype(str) + # write df to csv file + df.to_csv(FILEPATH_MESH_QUALIFIER_MAPPING, doublequote=False, escapechar='\\') + return + + +def write_concpet_df_to_csvs(df): + # write descriptor node info to a csv + df_concept = df.drop(columns=['DescriptorID']) + df_concept.to_csv(FILEPATH_MESH_CONCEPT, doublequote=False, escapechar='\\') + # write descriptor mapping info to csv file + df_mapping = df[['Concept_dcid', 'DescriptorID']].dropna().drop_duplicates() + df_mapping['Descriptor_dcid'] = 'bio/' + df_mapping['DescriptorID'].astype(str) # generate Descriptor dcid + df_mapping.to_csv(FILEPATH_MESH_CONCEPT_MAPPING, doublequote=False, escapechar='\\') + return + + +def format_concept_df(df): + # writes df specific to concept nodes and properties + df = df.drop(columns=[ + 'DescriptorName', 'QualifierID', 'QualifierName', + 'QualifierAbbreviation', 'TermID', 'TermName', 'TreeNumber', + 'NLMClassificationNumber', 'DateCreated', 'DateRevised', + 'DateEstablished' + ]) + # explode on Concept columns + explode_cols = ['ConceptID', 'ConceptName', 'ScopeNote'] + df = df.explode(explode_cols) + # reformat missing values remove and trailing white space in ScopeNote + df['ScopeNote'] = df['ScopeNote'].replace('None', '') + # adds quotes from text type columns and replaces "nan" with np.nan + col_names_quote = ['ConceptName', 'ScopeNote'] + df = format_text_strings(df, col_names_quote) + # replace missing names with ID + df['ConceptName'] = df['ConceptName'].fillna(df['ConceptID']) + # generates concept and descriptor dcids + df['Concept_dcid'] = 'bio/' + df['ConceptID'].astype(str) + # write df to csvs + write_concpet_df_to_csvs(df) + return + + +def write_term_df_to_csvs(df): + # write descriptor node info to a csv + df_term = df.drop(columns=['ConceptID']).drop_duplicates() + df_term.to_csv(FILEPATH_MESH_TERM, doublequote=False, escapechar='\\') + # write descriptor mapping info to csv file + df_mapping = df[['ConceptID', 'Term_dcid']].dropna().drop_duplicates() + df_mapping['Concept_dcid'] = 'bio/' + df_mapping['ConceptID'].astype(str) # generate Concept dcid + df_mapping.to_csv(FILEPATH_MESH_TERM_MAPPING, doublequote=False, escapechar='\\') + return + + +def format_term_df(df): + # prepares csv specific to term nodes and their properties + df = df.drop(columns=[ + 'QualifierID', 'QualifierName', 'QualifierAbbreviation', 'ScopeNote', + 'DescriptorName', 'DescriptorID', 'TreeNumber', 'NLMClassificationNumber', + 'DateCreated', 'DateRevised', 'DateEstablished', 'ConceptName' + ]) + # explode on concept and term and then again on term columns + explode_cols = ['ConceptID', 'TermID', 'TermName'] + df = df.explode(explode_cols) + explode_cols_2 = ['TermID', 'TermName'] + df = df.explode(explode_cols_2) + # add quotes from text type columns and replaces "nan" with np.nan + col_names_quote = ['TermName'] + df = format_text_strings(df, col_names_quote) + # replace missing names with ID + df['TermName'] = df['TermName'].fillna(df['TermID']) + # generate term dcids + df['Term_dcid'] = 'bio/' + df['TermID'].astype(str) + # write df to csvs + write_term_df_to_csvs(df) + return + + +def main(): + # read in file + file_input = sys.argv[1] + # convert xml to pandas df + df = format_mesh_xml(file_input) + df = date_modify(df) + # format 
csvs corresponding to different mesh node types + format_descriptor_df(df) + format_qualifier_df(df) + format_qualifier_mapping_df(df) + format_concept_df(df) + format_term_df(df) + + +if __name__ == "__main__": + main() diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_pa.py b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_pa.py new file mode 100644 index 0000000000..d033f782af --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_pa.py @@ -0,0 +1,162 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +''' +Author: Samantha Piekos +Date: 03/06/24 +Name: format_mesh_pa.py +Description: converts nested .xml to .csv and further breaks down the csv +into a csv about the pharmacological actions associated with drugs. +@file_input: input .xml downloaded from NCBI +@file_output: formatted csv files ready for import into data commons kg + with corresponding tmcf file +''' + +# set up environment +#from lxml import etree +import xml.etree.ElementTree as ET +import numpy as np +import pandas as pd +import sys + + +# declare universal variables +FILEPATH_OUTPUT_PREFIX = 'CSVs/mesh_pharmacological_action_' + + +def extract_data_from_xml(xml_filepath): + """ + extract data on descriptors and substances from the xml file + and store in list + """ + # read in xml data + with open(xml_filepath, 'r') as file: + xml_data = file.read() + root = ET.fromstring(xml_data) + data = [] # List to store extracted data + + for action in root.findall('PharmacologicalAction'): + descriptor_ui = action.find('DescriptorReferredTo/DescriptorUI').text + descriptor_name = action.find('DescriptorReferredTo/DescriptorName/String').text + + record_ui_data = [] + record_name_data = [] + for substance in action.find('PharmacologicalActionSubstanceList'): + record_ui = substance.find('RecordUI').text + record_name = substance.find('RecordName/String').text + record_name = record_name.strip('^') # remove bad character + record_ui_data.append(record_ui) + record_name_data.append(record_name) + + data.append({'DescriptorUI': descriptor_ui, 'DescriptorName': descriptor_name,\ + 'RecordUI': record_ui_data, 'RecordName': record_name_data}) + + return data + + +def format_mesh_xml(xml_data): + """ + Parses the xml file and converts it to a csv with + required columns + Args: + xml_data = xml file to be parsed + Returns: + pandas df after parsing + """ + # parse xml file + data = extract_data_from_xml(xml_data) + # initiate pandas df + df = pd.DataFrame(data) + # Explode the 'Substances' column + df = df.explode(['RecordUI', 'RecordName']) + # Reset the index + df = df.reset_index(drop=True) + return df + + +def is_not_none(x): + # check if value exists + if pd.isna(x): + return False + return True + + +def format_text_strings(df, col_names): + """ + Converts missing values to numpy nan value and adds outside quotes + to strings (excluding np.nan). Applies change to columns specified in col_names. 
+ """ + + for col in col_names: + df[col] = df[col].str.rstrip() # Remove trailing whitespace + df[col] = df[col].replace([''],np.nan) # replace missing values with np.nan + + # Quote only string values + mask = df[col].apply(is_not_none) + df.loc[mask, col] = '"' + df.loc[mask, col].astype(str) + '"' + + return df + + +def get_first_letter(data_type): + # returns the first letter in the mesh unique id based on data type + if data_type == 'descriptor': + return 'D' + if data_type == 'record': + return 'C' + print('Warning! Unexpected MeSH data type in RecordUI column!') + return + + +def generate_mesh_type_specific_csv(df, data_type): + # get expected first letter of RecordUI for mesh data type of interest + first_letter = get_first_letter(data_type) + # filter for rows containing RecordUIs that are the data type of interest + df = df[df['RecordUI'].str[0] == first_letter] + # save df to csv + filepath_output = FILEPATH_OUTPUT_PREFIX + data_type + '.csv' + df.to_csv(filepath_output, doublequote=False, escapechar='\\') + return + + +def format_pharmacological_action_df(df): + """ + Formats strings and dcids for import into the kg + """ + # adds quotes from text type columns and replaces "nan" with qualifier ID + col_names_quote = ['DescriptorName', 'RecordName'] + df = format_text_strings(df, col_names_quote) + # replace missing names with ID + df['DescriptorName'] = df['DescriptorName'].fillna(df['DescriptorUI']) + df['RecordName'] = df['RecordName'].fillna(df['RecordUI']) + # create descriptor dcids and dcids for corresponding descriptor or records + df['Descriptor_dcid'] = 'bio/' + df['DescriptorUI'].astype(str) + df['dcid'] = 'bio/' + df['RecordUI'].astype(str) + # drops the duplicate rows + df = df.drop_duplicates() + # create csvs mapping pharamacological actions to descriptors or supplemntar records + generate_mesh_type_specific_csv(df, 'descriptor') + generate_mesh_type_specific_csv(df, 'record') + + +def main(): + # read in file + file_input = sys.argv[1] + # convert xml to pandas df + df = format_mesh_xml(file_input) + # format csvs for ingestion into biomedical data commons + format_pharmacological_action_df(df) + + +if __name__ == "__main__": + main() diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_qual.py b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_qual.py new file mode 100644 index 0000000000..b9f6f61b6e --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_qual.py @@ -0,0 +1,314 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +''' +Author: Samantha Piekos +Date: 04/02/24 +Name: format_mesh_qual.py +Description: converts nested .xml to .csv and further breaks down the csv +into four csvs containing invormation on qualifiers, concepts, terms or +concept mappings to other concepts. 
+@file_output: formatted csv files ready for import into data commons kg + with corresponding tmcf file +''' + +# set up environment +import numpy as np +import pandas as pd +import sys +import xml.etree.ElementTree as ET + +# declare universal variables +FILEPATH_MESH_CONCEPT = 'CSVs/mesh_qual_concept.csv' +FILEPATH_MESH_CONCEPT_MAPPING = 'CSVs/mesh_qual_concept_mapping.csv' +FILEPATH_MESH_QUALIFIER = 'CSVs/mesh_qual_qualifier.csv' +FILEPATH_MESH_TERM = 'CSVs/mesh_qual_term.csv' + + +def parse_date(date_element, col): + # extract date elements from xml and format + # return as string YYYY-MM-DD value + if date_element is not None: + year = date_element.find('Year').text + month = date_element.find('Month').text + day = date_element.find('Day').text + date = ('-').join([year, month, day]) + return date + return None + + +def parse_tree_list(tree): + # extract all tree numbers and store in list + tree_data = [] + if tree.find('TreeNumber'): + for tree in action.find('TreeNumberList'): + if tree.find('TreeNumber') is not None: + tree_number = tree.find('TreeNumber').text + tree_data.append(tree_number) + else: + tree_data.append('') + return tree_data + + +def parse_associated_concepts(concept): + # extract all concept relationship pairs storing pairs as lists + # return all pairs in list (nested list) + list_concepts = [] + if concept.find('ConceptRelationList') is not None: + for relation in concept.find('ConceptRelationList'): + concept1 = relation.find('Concept1UI').text + concept2 = relation.find('Concept2UI').text + list_concepts.append([concept1, concept2]) + return list_concepts + + +def parse_booleans(tree, query): + # extract boolean values and convert to boolean + data_str = tree.get(query) + data = data_str == 'Y' + return data + + +def handle_potentially_missing_col(data, col): + # extract data element that may be missing from xml + if data.find(col) is not None: + return data.find(col).text + return '' + + +def parse_terms(concept): + # store all terms data in dictonary with values for terms + # associated with a given concept stored in lists as values + terms = { + 'TermUI': [], 'TermName': [], 'Abbreviation': [],\ + 'Display': [], 'DateCreated': [],\ + 'is_concept_preferred_term': [], 'is_permuted_term': [],\ + 'is_record_preferred_term': [] + } + for term in concept.find('TermList'): + terms['TermUI'].append(term.find('TermUI').text) + terms['TermName'].append(term.find('String').text) + terms['DateCreated'].append(parse_date(term.find('DateCreated'), 'DateCreated')) + terms['Abbreviation'].append(handle_potentially_missing_col(term, 'Abbreviation')) + terms['Display'].append(handle_potentially_missing_col(term, 'EntryVersion')) + terms['is_concept_preferred_term'].append(parse_booleans(term, 'ConceptPreferredTermYN')) + terms['is_permuted_term'].append(parse_booleans(term, 'ConceptPermutedYN')) + terms['is_record_preferred_term'].append(parse_booleans(term, 'ConceptRecordTermYN')) + return terms + + +def format_mesh_xml(xml_filepath): + """ + extract data on descriptors and substances from the xml file + and store in list + """ + # read in xml data + with open(xml_filepath, 'r') as file: + xml_data = file.read() + root = ET.fromstring(xml_data) + data = [] # List to store extracted data + + for action in root.findall('QualifierRecord'): + # parse qualifier data + qualifier_ui = action.find('QualifierUI').text + qualifier_name = action.find('QualifierName/String').text + annotation = action.find('Annotation').text + history_note = action.find('HistoryNote').text + tree_list 
= action.find('TreeNumberList') + tree_data = [number.text for number in tree_list.findall('TreeNumber')] + tree_data = ','.join(tree_data) + + # parse dates + date_created = parse_date(action.find('DateCreated'), 'DateCreated') + date_revised = parse_date(action.find('DateRevised'), 'DateRevised') + date_established = parse_date(action.find('DateEstablished'), 'DateEstablished') + + # parse concept info + concept_ui = [] + concept_name = [] + scope_note = [] + associated_concepts = [] + is_preferred_concept = [] + terms = [] + for concept in action.find('ConceptList'): + concept_ui.append(concept.find('ConceptUI').text) + concept_name.append(concept.find('ConceptName/String').text) + associated_concepts.append(parse_associated_concepts(concept)) + is_preferred_concept.append(parse_booleans(concept, 'PreferredConceptYN')) + terms.append(parse_terms(concept)) + scope_note.append(handle_potentially_missing_col(concept, 'ScopeNote')) + + data.append({ + 'QualifierUI': qualifier_ui, 'QualifierName': qualifier_name, + 'Annotation': annotation, 'HistoryNote': history_note, + 'TreeNumber': tree_data, 'DateCreated': date_created, + 'DateRevised': date_revised, 'DateEstablished': date_established, + 'ConceptUI': concept_ui, 'ConceptName': concept_name, + 'ScopeNote': scope_note, 'AssociatedConcepts': associated_concepts, + 'IsPreferredConcept': is_preferred_concept, 'Terms': terms + }) + + return pd.DataFrame(data) + + +def is_not_none(x): + # check if value exists + if pd.isna(x): + return False + return True + + +def format_text_strings(df, col_names): + """ + Converts missing values to numpy nan value and adds outside quotes + to strings (excluding np.nan). Applies change to columns specified in col_names. + """ + + for col in col_names: + df[col] = df[col].str.rstrip() # Remove trailing whitespace + df[col] = df[col].replace([''],np.nan) # replace missing values with np.nan + + # Quote only string values + mask = df[col].apply(is_not_none) + df.loc[mask, col] = '"' + df.loc[mask, col].astype(str) + '"' + + return df + + +def format_qualifier_df(df): + # create csv specific to qualifiers and their properties + # drop columns not required for the qualifier file + df = df.drop(columns=[ + 'ConceptUI', 'ConceptName', 'ScopeNote', 'AssociatedConcepts', + 'IsPreferredConcept', 'Terms' + ]) + # remove missing qualifier rows + df = df[df['QualifierUI'].notna()] + # adds quotes from text type columns and replaces "nan" with qualifier ID + col_names_quote = ['QualifierName', 'Annotation', 'HistoryNote'] + df = format_text_strings(df, col_names_quote) + # replace missing names with ID + df['QualifierName'] = df['QualifierName'].fillna(df['QualifierUI']) + # creates qualifier dcids + df['Qualifier_dcid'] = 'bio/' + df['QualifierUI'].astype(str) + # drops the duplicate rows + df = df.drop_duplicates() + # write df to csv + df.to_csv(FILEPATH_MESH_QUALIFIER, doublequote=False, escapechar='\\') + return df + + +def format_concept_df(df): + # create csv specific to concept nodes and their properties + # drop columns not required for the qualifier file + df = df.drop(columns=[ + 'QualifierName', 'Annotation', 'HistoryNote', 'TreeNumber', + 'DateCreated', 'DateRevised', 'DateEstablished', 'Terms', + 'AssociatedConcepts' + ]) + # remove missing concept rows + df = df[df['ConceptUI'].notna()] + # Explode the Concept columns + explode_cols = ['ConceptUI', 'ConceptName', 'ScopeNote', 'IsPreferredConcept'] + df = df.explode(explode_cols) + # adds quote from text type columns and replaces "nan" with qualifier ID + 
col_names_quote = ['ConceptName', 'ScopeNote'] + df = format_text_strings(df, col_names_quote) + # replace missing names with ID + df['ConceptName'] = df['ConceptName'].fillna(df['ConceptUI']) + # create qualifier and concept dcids + df['Concept_dcid'] = 'bio/' + df['ConceptUI'].astype(str) + df['Qualifier_dcid'] = 'bio/' + df['QualifierUI'].astype(str) + # drop the duplicate rows + df = df.drop_duplicates() + # write df to csv + df.to_csv(FILEPATH_MESH_CONCEPT, doublequote=False, escapechar='\\') + return df + + +def format_concept_relations_df(df): + # create csv specific to mapping concept to other mesh data types + # drop columns not required for the qualifier mapping file + df = df.drop(columns=[ + 'QualifierName', 'Annotation', 'TreeNumber', 'HistoryNote', + 'DateCreated', 'DateRevised', 'DateEstablished', 'Terms', + 'ScopeNote', 'ConceptName' + ]) + # remove missing concept rows + df = df[df['ConceptUI'].notna()] + # Explode the Concept columns + explode_cols = ['ConceptUI', 'IsPreferredConcept', 'AssociatedConcepts'] + df = df.explode(explode_cols) + df = df.explode('AssociatedConcepts') + df = df.explode('AssociatedConcepts') + df[df['ConceptUI'] != df['AssociatedConcepts']] + df = df[df['IsPreferredConcept'] == False] + # create qualifier and concept dcids + df['Concept_dcid'] = 'bio/' + df['ConceptUI'].astype(str) + df['Preferred_Concept_dcid'] = 'bio/' + df['AssociatedConcepts'].astype(str) + # drop the duplicate rows and extra columns + df = df.drop(['QualifierUI', 'IsPreferredConcept', 'ConceptUI', 'AssociatedConcepts'], axis=1) + df = df.drop_duplicates() + rows_to_drop = df['Concept_dcid'] == df['Preferred_Concept_dcid'] + df = df[~rows_to_drop] + # write df to csv + df.to_csv(FILEPATH_MESH_CONCEPT_MAPPING, doublequote=False, escapechar='\\') + return df + + +def format_terms_df(df): + # create formatted csv specific to Term nodes and thier properties + # drop columns not required for the qualifier file + df = df.drop(columns=[ + 'QualifierName', 'Annotation', 'HistoryNote', 'TreeNumber', + 'DateCreated', 'DateRevised', 'DateEstablished', 'ScopeNote', + 'ConceptName', 'IsPreferredConcept', 'AssociatedConcepts' + ]) + # remove missing concept rows + df = df[df['ConceptUI'].notna()] + # Explode the Concept columns + explode_cols = ['ConceptUI', 'Terms'] + df = df.explode(explode_cols).reset_index() + df2 = pd.json_normalize(df['Terms']) + df = pd.concat([df.drop(['Terms'], axis=1), df2], axis=1) + df = df.drop(['index'], axis=1) + explode_cols = list(df2.columns) + df = df.explode(explode_cols) + # adds quotes from text type columns and replaces "nan" with qualifier ID + col_names_quote = ['TermName', 'Abbreviation', 'Display'] + df = format_text_strings(df, col_names_quote) + # creates qualifier and concept dcids + df['Concept_dcid'] = 'bio/' + df['ConceptUI'].astype(str) + df['Term_dcid'] = 'bio/' + df['TermUI'].astype(str) + # drops the duplicate rows and extra columns + df = df.drop(['QualifierUI', 'is_permuted_term'], axis=1) + df = df.drop_duplicates() + # write df to csv + df.to_csv(FILEPATH_MESH_TERM, doublequote=False, escapechar='\\') + return + + +def main(): + # read in file + file_input = sys.argv[1] + # convert xml file to pandas df + df = format_mesh_xml(file_input) + # format CSV files for each level of the xml file + format_qualifier_df(df) + format_concept_df(df) + format_concept_relations_df(df) + format_terms_df(df) + + +if __name__ == "__main__": + main() diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_supp.py 
b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_supp.py new file mode 100644 index 0000000000..e5bd649639 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/format_mesh_supp.py @@ -0,0 +1,174 @@ +# Copyright 2022 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +''' +Author: Suhana Bedi +Date: 08/02/2022 +Edited By: Samantha Piekos +Last Edited: 03/06/2024 +Name: format_mesh_supp.py +Description: converts nested .xml to .csv and the csv entails the relationship between +the descriptor record ID and descriptor ID for MESH terms +@file_input: input .xml downloaded from NCBI +''' + + +# set up environment +import sys +import pandas as pd +import numpy as np +from xml.etree.ElementTree import parse + + +# declare universal variables +FILEPATH_MESH_PUBCHEM_MAPPING = 'CSVs/mesh_pubchem_mapping.csv' +FILEPATH_RECORD = 'CSVs/mesh_record.csv' + +def read_mesh_record(mesh_record_xml): + """ + Parses the xml file and converts it to a csv with + required columns + Args: + mesh_xml = xml file to be parsed + Returns: + df = pandas df after parsing + """ + document = parse(mesh_record_xml) + d = [] + dfcols = [ + 'RecordID', 'RecordName', 'DateCreated-Year', + 'DateCreated-Month', 'DateCreated-Day', 'DateRevised-Year', + 'DateRevised-Month', 'DateRevised-Day', 'DescriptorID' + ] + df = pd.DataFrame(columns=dfcols) + for item in document.iterfind('SupplementalRecord'): + d1 = item.findtext('SupplementalRecordUI') + elem = item.find(".//SupplementalRecordName") + d1_name = elem.findtext("String") + date_created = item.find(".//DateCreated") + if date_created is None: + d1_created_year = np.nan + d1_created_month = np.nan + d1_created_day = np.nan + else: + d1_created_year = date_created.findtext("Year") + d1_created_month = date_created.findtext("Month") + d1_created_day = date_created.findtext("Day") + date_revised = item.find(".//DateRevised") + if date_revised is None: + d1_revised_year = np.nan + d1_revised_month = np.nan + d1_revised_day = np.nan + else: + d1_revised_year = date_revised.findtext("Year") + d1_revised_month = date_revised.findtext("Month") + d1_revised_day = date_revised.findtext("Day") + heading_list = item.find(".//HeadingMappedToList") + headID = [] + if heading_list is None: + headID.append(np.nan) + else: + l1 = heading_list.findall(".//HeadingMappedTo") + for i in range(len(l1)): + l2 = l1[i].find(".//DescriptorReferredTo") + headID.append(l2.findtext("DescriptorUI")) + d.append({'RecordID':d1, 'RecordName':d1_name, 'DateCreated-Year':d1_created_year, 'DateCreated-Month':d1_created_month, 'DateCreated-Day':d1_created_day, + 'DateRevised-Year':d1_revised_year, 'DateRevised-Month':d1_revised_month, 'DateRevised-Day':d1_revised_day, + 'DescriptorID':headID}) + df = pd.DataFrame(d) + return df + + +def format_dates(df): + """ + Modifies the dates in a df, into an ISO format + Args: + df1 = df with date columns + Returns: + df with modified date columns + + """ + df['DateCreated'] = df['DateCreated-Year'].astype( + str) + "-" 
+ df['DateCreated-Month'].astype( + str) + "-" + df['DateCreated-Day'].astype(str) + df['DateRevised'] = df['DateRevised-Year'].astype( + str) + "-" + df['DateRevised-Month'].astype( + str) + "-" + df['DateRevised-Day'].astype(str) + col_names_quote = ['DateCreated', 'DateRevised'] + ## adds quotes from text type columns and replaces "nan" with np.nan + for col in col_names_quote: + df[col] = df[col].replace(["nan-nan-nan"],np.nan) + ## drop repetitive column values + df = df.drop(columns=[ + 'DateCreated-Year', 'DateCreated-Month', 'DateCreated-Day', + 'DateRevised-Year', 'DateRevised-Month', 'DateRevised-Day' + ]) + return df + + +def format_record_csv(df): + """ + Formats the MESH record ID, record name and corresponding descriptor IDs and DCIDs + Args: + df: pandas dataframe with zipped and unformatted descriptor IDs + + Returns: + df : pandas dataframe with formatted and unzipped descriptor IDs corresponding to record ID + """ + # Explode the DescriptorID column + df = df.explode('DescriptorID') + # Clean up DescriptorID values (remove leading/trailing '*') + df['DescriptorID'] = df['DescriptorID'].str.strip('*') + ## removes special characters from the descriptor column + df['DescriptorID'] = df['DescriptorID'].str.replace(r'\W', '') + ## puts quotes around record name string values + df['RecordName'] = '"' + df.RecordName + '"' + ## generates record and descriptor dcids + df['Record_dcid'] = 'bio/' + df['RecordID'].astype(str) + df['Descriptor_dcid'] = 'bio/' + df['DescriptorID'].astype(str) + df.to_csv(FILEPATH_RECORD, doublequote=False, escapechar='\\') + return df + + +def format_pubchem_mesh_mapping(pubchem_file, df_mesh): + # read in pubchem mesh mapping csv file + df_pubchem = pd.read_csv(pubchem_file, on_bad_lines='skip', sep='\t', header = None, names = ['CID', 'CompoundName']) + # seperate compound name as own column + df_pubchem['CompoundName'] = '"' + df_pubchem.CompoundName + '"' + # merge with mesh record df on names + df_match = pd.merge(df_mesh, df_pubchem, left_on='RecordName', right_on='CompoundName', how = 'inner') + # filter for desired columns in output csv + df_match = df_match.filter(['CID', 'RecordID', 'RecordName', 'Record_dcid'], axis=1) + # format compound dcids + df_match['CID_dcid'] = 'chem/CID' + df_match['CID'].astype(str) + # drop duplicates + df_match = df_match.drop_duplicates() + # write df to csv + df_match.to_csv(FILEPATH_MESH_PUBCHEM_MAPPING, doublequote=False, escapechar='\\') + return + + +def main(): + # read in files + file_input = sys.argv[1] + file_pubchem = sys.argv[2] + # convert mesh record xml file to pandas df + df = read_mesh_record(file_input) + df = format_dates(df) + df_mesh = format_record_csv(df) + # create pubchem mesh mapping csv and mesh record csv + format_pubchem_mesh_mapping(file_pubchem, df_mesh) + + +if __name__ == "__main__": + main() diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/run.sh b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/run.sh new file mode 100644 index 0000000000..77c66a9f6c --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/run.sh @@ -0,0 +1,19 @@ +#!/bin/bash + +mkdir -p CSVs + +# extracts the mesh descriptor, term, concept, qualifier terms into 4 csvs +python3 scripts/format_mesh_desc.py input/mesh-desc.xml +echo "MeSH descriptor file processed" + +# extracts pharmacological actions associated with substances +python3 scripts/format_mesh_pa.py input/mesh-pa.xml +echo "MeSH pharmacological action file processed" + +# extracts qualifier 
data +python3 scripts/format_mesh_qual.py input/mesh-qual.xml +echo "MeSH qualifier file processed" + +# extracts the mesh records and maps it with pubchem IDs +python3 scripts/format_mesh_supp.py input/mesh-supp.xml input/mesh-pubchem.csv +echo "MeSH record file and pubchem mappings processed" diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/tests.sh b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/tests.sh new file mode 100644 index 0000000000..4ae2f6b959 --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/scripts/tests.sh @@ -0,0 +1,85 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Author: Samantha Piekos +Date: 03/11/2024 +Name: tests +Description: This file runs the Data Commons Java tool to run standard +tests on tmcf + CSV pairs for the NIH NLM MeSH import. This assumes that +the user has Java Remote Environment (JRE) installed, which is needed to +locally install Data Commons test tool (v. 0.1-alpha.1k) prior to calling +the tool to evaluate tmcf + CSV pairs. +""" + +#!/bin/bash + +# download data commons java test tool version 0.1-alpha.1k +mkdir -p tmp; cd tmp +wget https://github.com/datacommonsorg/import/releases/download/0.1-alpha.1k/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar +cd .. 
+ +# run tests on desc file csv + tmcf pairs +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_concept.tmcf CSVs/mesh_desc_concept.csv +mv dc_generated mesh_desc_concept + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_concept_mapping.tmcf CSVs/mesh_desc_concept_mapping.csv +mv dc_generated mesh_desc_concept_mapping + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_descriptor.tmcf CSVs/mesh_desc_descriptor.csv +mv dc_generated mesh_desc_descriptor + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_descriptor_mapping.tmcf CSVs/mesh_desc_descriptor_mapping.csv +mv dc_generated mesh_desc_descriptor_mapping + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_qualifier.tmcf CSVs/mesh_desc_qualifier.csv +mv dc_generated mesh_desc_qualifier + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_qualifier_mapping.tmcf CSVs/mesh_desc_qualifier_mapping.csv +mv dc_generated mesh_desc_qualifier_mapping + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_term.tmcf CSVs/mesh_desc_term.csv +mv dc_generated mesh_desc_term + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_desc_term_mapping.tmcf CSVs/mesh_desc_term_mapping.csv +mv dc_generated mesh_desc_term_mapping + + +# run tests on pa file csv + tmcf pairs +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_pharmacological_action_descriptor.tmcf CSVs/mesh_pharmacological_action_descriptor.csv +mv dc_generated mesh_pharmacological_action_descriptor + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_pharmacological_action_record.tmcf CSVs/mesh_pharmacological_action_record.csv +mv dc_generated mesh_pharmacological_action_record + + +# run tests on qual file csv + tmcf pairs +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_qual_concept.tmcf CSVs/mesh_qual_concept.csv +mv dc_generated mesh_qual_concept + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_qual_concept_mapping.tmcf CSVs/mesh_qual_concept_mapping.csv +mv dc_generated mesh_qual_concept_mapping + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_qual_qualifier.tmcf CSVs/mesh_qual_qualifier.csv +mv dc_generated mesh_qual_qualifier + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_qual_term.tmcf CSVs/mesh_qual_term.csv +mv dc_generated mesh_qual_term + + +# run tests on record and pubchem files csv + tmcf pairs +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_record.tmcf CSVs/mesh_record.csv +mv dc_generated mesh_record + +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCFs/mesh_pubchem_mapping.tmcf CSVs/mesh_pubchem_mapping.csv +mv dc_generated mesh_pubchem_mapping diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept.tmcf new file mode 100644 index 0000000000..e418966a3e --- /dev/null +++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept.tmcf 
@@ -0,0 +1,6 @@
+Node: E:mesh_desc_concept->E1
+typeOf: dcs:MeSHConcept
+dcid: C:mesh_desc_concept->Concept_dcid
+name: C:mesh_desc_concept->ConceptName
+identifier: C:mesh_desc_concept->ConceptID
+scopeNote: C:mesh_desc_concept->ScopeNote
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept_mapping.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept_mapping.tmcf
new file mode 100644
index 0000000000..cd44c39840
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_concept_mapping.tmcf
@@ -0,0 +1,8 @@
+Node: E:mesh_desc_concept->E1
+typeOf: dcs:MeSHDescriptor
+dcid: C:mesh_desc_concept->Descriptor_dcid
+
+Node: E:mesh_desc_concept->E2
+typeOf: dcs:MeSHConcept
+dcid: C:mesh_desc_concept->Concept_dcid
+parent: E:mesh_desc_concept->E1
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor.tmcf
new file mode 100644
index 0000000000..11f073a786
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor.tmcf
@@ -0,0 +1,11 @@
+Node: E:mesh_desc_descriptor->E1
+typeOf: dcs:MeSHDescriptor
+dcid: C:mesh_desc_descriptor->Descriptor_dcid
+name: C:mesh_desc_descriptor->DescriptorName
+dateCreated: C:mesh_desc_descriptor->DateCreated
+dateRevised: C:mesh_desc_descriptor->DateRevised
+dateEstablished: C:mesh_desc_descriptor->DateEstablished
+identifier: C:mesh_desc_descriptor->DescriptorID
+medicalSubjectHeadingTreeNumber: C:mesh_desc_descriptor->TreeNumber
+nationalLibraryOfMedicineClassificationNumber: C:mesh_desc_descriptor->NLMClassificationNumber
+scopeNote: C:mesh_desc_descriptor->ScopeNote
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor_mapping.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor_mapping.tmcf
new file mode 100644
index 0000000000..06124cb138
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_descriptor_mapping.tmcf
@@ -0,0 +1,8 @@
+Node: E:mesh_desc_descriptor->E1
+typeOf: dcs:MeSHDescriptor
+dcid: C:mesh_desc_descriptor->DescriptorParentID
+
+Node: E:mesh_desc_descriptor->E2
+typeOf: dcs:MeSHDescriptor
+dcid: C:mesh_desc_descriptor->Descriptor_dcid
+specializationOf: E:mesh_desc_descriptor->E1
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_qualifier.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_qualifier.tmcf
new file mode 100644
index 0000000000..49bb698c79
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_qualifier.tmcf
@@ -0,0 +1,6 @@
+Node: E:mesh_desc_qualifier->E1
+typeOf: dcs:MeSHQualifier
+dcid: C:mesh_desc_qualifier->Qualifier_dcid
+name: C:mesh_desc_qualifier->QualifierName
+identifier: C:mesh_desc_qualifier->QualifierID
+abbreviation: C:mesh_desc_qualifier->QualifierAbbreviation
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_qualifier_mapping.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_qualifier_mapping.tmcf
new file mode 100644
index 0000000000..96c863be34
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_qualifier_mapping.tmcf
@@ -0,0 +1,10 @@
+Node: E:mesh_desc_descriptor_qualifier_mapping->E1
+typeOf: dcs:MeSHQualifier
+dcid: C:mesh_desc_descriptor_qualifier_mapping->Qualifier_dcid
+identifier: C:mesh_desc_descriptor_qualifier_mapping->QualifierID
+
+Node: E:mesh_desc_descriptor_qualifier_mapping->E2
+typeOf: dcs:MeSHDescriptor
+dcid: C:mesh_desc_descriptor_qualifier_mapping->Descriptor_dcid
+identifier: C:mesh_desc_descriptor_qualifier_mapping->DescriptorID
+hasMeSHQualifier: E:mesh_desc_descriptor_qualifier_mapping->E1
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_term.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_term.tmcf
new file mode 100644
index 0000000000..e1d1dbb173
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_term.tmcf
@@ -0,0 +1,5 @@
+Node: E:mesh_desc_term->E2
+typeOf: dcs:MeSHTerm
+dcid: C:mesh_desc_term->Term_dcid
+name: C:mesh_desc_term->TermName
+identifier: C:mesh_desc_term->TermID
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_term_mapping.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_term_mapping.tmcf
new file mode 100644
index 0000000000..a9c4494b6d
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_desc_term_mapping.tmcf
@@ -0,0 +1,9 @@
+Node: E:mesh_desc_term->E1
+typeOf: dcs:MeSHConcept
+dcid: C:mesh_desc_term->Concept_dcid
+identifier: C:mesh_desc_term->ConceptID
+
+Node: E:mesh_desc_term->E2
+typeOf: dcs:MeSHTerm
+dcid: C:mesh_desc_term->Term_dcid
+parent: E:mesh_desc_term->E1
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_descriptor.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_descriptor.tmcf
new file mode 100644
index 0000000000..712948c4ee
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_descriptor.tmcf
@@ -0,0 +1,14 @@
+Node: E:mesh_pharmacological_action_descriptor->E1
+typeOf: dcs:MeSHDescriptor
+dcid: C:mesh_pharmacological_action_descriptor->Descriptor_dcid
+name: C:mesh_pharmacological_action_descriptor->DescriptorName
+identifier: C:mesh_pharmacological_action_descriptor->DescriptorUI
+
+Node: E:mesh_pharmacological_action_descriptor->E2
+typeOf: dcs:MeSHDescriptor
+typeOf: schema:Drug
+dcid: C:mesh_pharmacological_action_descriptor->dcid
+name: C:mesh_pharmacological_action_descriptor->RecordName
+identifier: C:mesh_pharmacological_action_descriptor->RecordUI
+mechanismOfAction: E:mesh_pharmacological_action_descriptor->E1
+ 
\ No newline at end of file
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_record.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_record.tmcf
new file mode 100644
index 0000000000..f915fca71b
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pharmacological_action_record.tmcf
@@ -0,0 +1,12 @@
+Node: E:mesh_pharmacological_action_record->E1
+typeOf: dcs:MeSHDescriptor
+dcid: C:mesh_pharmacological_action_record->Descriptor_dcid
+identifier: C:mesh_pharmacological_action_record->DescriptorUI
+
+Node: E:mesh_pharmacological_action_record->E2
+typeOf: dcs:MeSHSupplementaryConceptRecord
+typeOf: schema:Drug
+dcid: C:mesh_pharmacological_action_record->dcid
+name: C:mesh_pharmacological_action_record->RecordName
+identifier: C:mesh_pharmacological_action_record->RecordUI
+mechanismOfAction: E:mesh_pharmacological_action_record->E1
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pubchem_mapping.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pubchem_mapping.tmcf
new file mode 100644
index 0000000000..7d81a05c26
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_pubchem_mapping.tmcf
@@ -0,0 +1,11 @@
+Node: E:mesh_pubchem_mapping->E1
+typeOf: dcs:ChemicalCompound
+dcid: C:mesh_pubchem_mapping->CID_dcid
+pubChemCompoundID: C:mesh_pubchem_mapping->CID
+
+Node: E:mesh_pubchem_mapping->E2
+typeOf: dcs:MeSHSupplementaryConceptRecord
+dcid: C:mesh_pubchem_mapping->Record_dcid
+name: C:mesh_pubchem_mapping->RecordName
+identifier: C:mesh_pubchem_mapping->RecordID
+sameAs: E:mesh_pubchem_mapping->E1
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf
new file mode 100644
index 0000000000..82bc358222
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept.tmcf
@@ -0,0 +1,13 @@
+Node: E:mesh_qual_concept->E1
+typeOf: dcs:MeSHQualifier
+dcid: C:mesh_qual_concept->Qualifier_dcid
+identifier: C:mesh_qual_concept->QualifierUI
+
+Node: E:mesh_qual_concept->E2
+typeOf: dcs:MeSHConcept
+dcid: C:mesh_qual_concept->Concept_dcid
+name: C:mesh_qual_concept->ConceptName
+hasMeSHQualifier: E:mesh_qual_concept->E1
+identifier: C:mesh_qual_concept->ConceptUI
+isPreferredConcept: C:mesh_qual_concept->IsPreferredConcept
+scopeNote: C:mesh_qual_concept->ScopeNote
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept_mapping.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept_mapping.tmcf
new file mode 100644
index 0000000000..9bf2fb73e3
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_concept_mapping.tmcf
@@ -0,0 +1,8 @@
+Node: E:mesh_qual_concept_mapping->E1
+typeOf: dcs:MeSHConcept
+dcid: C:mesh_qual_concept_mapping->Preferred_Concept_dcid
+
+Node: E:mesh_qual_concept_mapping->E2
+typeOf: dcs:MeSHConcept
+dcid: C:mesh_qual_concept_mapping->Concept_dcid
+preferredMeSHConcept: E:mesh_qual_concept_mapping->E1
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_qualifier.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_qualifier.tmcf
new file mode 100644
index 0000000000..cb2d81c7e8
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_qualifier.tmcf
@@ -0,0 +1,11 @@
+Node: E:mesh_qual_qualifier->E1
+typeOf: dcs:MeSHQualifier
+dcid: C:mesh_qual_qualifier->Qualifier_dcid
+name: C:mesh_qual_qualifier->QualifierName
+dateCreated: C:mesh_qual_qualifier->DateCreated
+dateRevised: C:mesh_qual_qualifier->DateRevised
+dateEstablished: C:mesh_qual_qualifier->DateEstablished
+description: C:mesh_qual_qualifier->Annotation
+identifier: C:mesh_qual_qualifier->QualifierUI
+note: C:mesh_qual_qualifier->HistoryNote
+medicalSubjectHeadingTreeNumber: C:mesh_qual_qualifier->TreeNumber
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf
new file mode 100644
index 0000000000..74df8e9bba
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_qual_term.tmcf
@@ -0,0 +1,16 @@
+Node: E:mesh_qual_term->E1
+typeOf: dcs:MeSHConcept
+dcid: C:mesh_qual_term->Concept_dcid
+identifier: C:mesh_qual_term->ConceptUI
+
+Node: E:mesh_qual_term->E2
+typeOf: dcs:MeSHTerm
+dcid: C:mesh_qual_term->Term_dcid
+name: C:mesh_qual_term->TermName
+abbreviation: C:mesh_qual_term->Abbreviation
+abbreviation: C:mesh_qual_term->Display
+dateCreated: C:mesh_qual_term->DateCreated
+identifier: C:mesh_qual_term->TermUI
+isConceptPreferredTerm: C:mesh_qual_term->is_concept_preferred_term
+isRecordPreferredTerm: C:mesh_qual_term->is_record_preferred_term
+parent: E:mesh_qual_term->E1
diff --git a/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_record.tmcf b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_record.tmcf
new file mode 100644
index 0000000000..65e18bdc9a
--- /dev/null
+++ b/scripts/biomedical/NIH_NLM/Medical_Subject_Headings/tMCFs/mesh_record.tmcf
@@ -0,0 +1,13 @@
+Node: E:mesh_record->E1
+typeOf: dcs:MeSHDescriptor
+dcid: C:mesh_record->Descriptor_dcid
+identifier: C:mesh_record->DescriptorID
+
+Node: E:mesh_record->E2
+typeOf: dcs:MeSHSupplementaryConceptRecord
+dcid: C:mesh_record->Record_dcid
+identifier: C:mesh_record->RecordID
+name: C:mesh_record->RecordName
+dateCreated: C:mesh_record->DateCreated
+dateRevised: C:mesh_record->DateRevised
+parent: E:mesh_record->E1