This repository contains instructions and code to create concept tables of Dutch medical names, such as primary names, synonyms, abbreviations, and common misspellings. By basing a concept table on UMLS, which includes MeSH, MedDRA, ICD-10 and ICPC, and adding SNOMED CT and a few manually added names, a comprehensive set of words commonly used in Dutch medical language is generated. Workflows for creating SNOMED and HPO based concept tables are also in this repository.
The resulting concept tables can be used in named entity recognition and linking methods, such as MedCAT, to identify entities in Dutch medical text.
Ontology | Number of concepts | Number of names | Primary source |
---|---|---|---|
Dutch UMLS | 254835 | 574475 | UMLS 2022AB |
Dutch UMLS with English drug names | 367913 | 754326 | UMLS 2022AB |
Dutch SNOMED | 230277 | 521118 | SNOMED CT Netherlands Edition September 2022 v1.0 |
Dutch HPO | 13360 | 29164 | Dutch HPO translations |
Data and licenses should be acquired from UMLS Terminology Services and SNOMED MLDS.
- Download pre-made Dutch MedCAT models
- Folder structure
- Output format
- Data-flow
- Generate UMLS concept table
- Generate SNOMED concept table
- Generate HPO concept table
- Generate MedCAT models
To download a pre-made Dutch MedCAT model based on the Dutch UMLS concept table generated in this repository, navigate to the MedCAT source repository (https://github.com/CogStack/MedCAT) and click on the link for the Dutch model. This requires authentication via UMLS Terminology Services to verify the user's UMLS license.
The methods of this repository require a number of input files, which should be downloaded by the user, and intermediate and output files are generated. This repository contains a folder structure that can be used for storing these files. The contents of these folders, except for 05_CustomChanges
, are added not tracked by Git, which makes it easier to replace input files or recreate output files.
dutch-medical-concepts
└───01_Download
└───02_ExtractSubset
└───03_SqlDB
└───04_ConceptDB
└───05_CustomChanges
cui | name | ontologies | name_status | type_ids |
---|---|---|---|---|
C0000001 | kanker | ONTOLOGY1|ONTOLOGY2 | P | T001 |
C0000001 | neoplasma maligne | ONTOLOGY1 | A | T001 |
C0000002 | longkanker | ONTOLOGY1 | P | T001 |
C0000003 | kleincellige longkanker | ONTOLOGY1 | P | T001 |
C0000004 | chirurgie | ONTOLOGY1 | P | T002 |
C0000004 | chirurgische verrichting | ONTOLOGY1 | A | T002 |
C0000005 | chirurgie | ONTOLOGY1 | P | T003 |
C0000005 | specialisme chirurgie | ONTOLOGY1 | A | T003 |
See https://github.com/CogStack/MedCAT/tree/master/examples for a detailed explanation of the columns.
I'm not sure whether the UMLS license allows for publishing snippets of UMLS data for demonstration purposes, so this repository uses mock data in the examples.
To download UMLS, visit the NIH National Library of Medicine website. You'll have to apply for a license before you can download the files on https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html. In the following description I downloaded Full Release (umls-2022AB-full.zip)
. The advantage over UMLS Metathesaurus Full Subset
is that the Full Release includes MetamorphoSys which makes it possible to create a subset of UMLS prior loading the data in a SQL database. This significantly decreases the required disk space and processing time.
After decompressing the *-full.zip
file, go into the folder (e.g. 01_Download/2022AB-full
in this example) and decompress mmsys.zip
. Afterwards, move the files in the new mmsys
folder one level up, so they are in 2022AB-full
. Next, run MetamorphoSys (./run_mac.sh
on macOS)
MetamorphoSys is used to install a subset of UMLS. During the installation process it is possible to select multiple sources, and thereby to craft a specific subset for your use case. In our case, our primary goal is to select the Dutch terms and we add some English sources for concept categories for common used English names in Dutch (such as drug names).
- Select
Install UMLS
. - Select destination directory, for example
./02_ExtractSubset/2022AB/
- Keep
Metathesaurus
checked, and uncheckSemantic Network
andSPECIALIST Lexicon & Lexical Tools
. SelectOK
. - Select
New Configuration...
, clickAccept
and clickOk
. TheDefault Subset
does not matter because we are making our own subset in the next step. - In the
Output Options
tab, selectMySQL 5.6
underSelect database
. - In the
Source List
tab, SelectSelect sources to INCLUDE in subset
. Sort the sources on the language column and at least select the 7 Dutch sources. To select multiple sources, hold the CMD key on macOS. In the popup window that will ask if you also want to include related sources, clickCancel
. The Dutch sources I included were:
Source name | Source ID |
---|---|
ICD10, Dutch Translation, 200403 | ICD10DUT_200403 |
ICPC2-ICD10 Thesaurus, Dutch Translation | ICPC2ICD10DUT_200412 |
ICPC2E Dutch | ICPC2EDUT_200203 |
ICPC, Dutch Translation, 1993 | ICPCDUT_1993 |
LOINC Linguistic Variant - Dutch, Netherlands | LNC-NL-NL_273 |
MedDRA Dutch | MDRDUT25_0 |
MeSH Dutch | MSHDUT2005 |
More information on the sources can be found at: https://www.nlm.nih.gov/research/umls/sourcereleasedocs/index.html
- For drug names some commonly used synonyms in Dutch langague are missing in these vocabularies. Therefore I also selected the following English sources:
ATC_2022_22_09_06
DRUGBANK5.0_2022_08_01
RXNORM_20AA_220906F
.
- Also, it's useful to include other categories useful for remapping source ontologies (such as Dutch SNOMED -> English SNOMED -> UMLS). The selection of Dutch and English drug names is done in a Jupyter Notebook in a later step, so it's okay to include some more non-Dutch sources in this step. At UMCU we included the following sources:
SNOMEDCT_US_2022_09_01
(required for adding Dutch SNOMED terms in a later step)HPO2020_10_12
MTH
- In the
Suppressibility
tab, make sure the obsolete terms are suppressed (LO
for LOINC,OL
for MedDRA; see https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html). In my configuration, I unsuppressed the "Abbreviation in any source vocabulary" (AB
) concepts from MedDRA. - On macOS, in the top bar, select
Advanced Suppressibility Options
and check all checkboxes. This makes sure the suppressed terms are excluded from the subset. - On macOS, in the top bar, select
Done
andBegin Subset
. This process takes 5-10 minutes. - Save the SubSet log for future reference.
To select only the columns required for the target list of terms, first put all the resulting subset in a SQL DB.
# Create local .env file
cp .env-example .env
# Set local file paths & MySQL root password in .env
# Set MySQL loading config settings
vim ./02_ExtractSubset/2022AB/META/populate_mysql_db.sh
MYSQL_HOME=/usr
user=root
password=${MYSQL_ROOT_PASSWORD}
db_name=${MYSQL_DATABASE}
# Start MySQL container in Docker
docker-compose up -d
# Enter docker container
docker exec -it umls bash
# Execute mysql loading script
cd /src_files/2022AB/META/
bash populate_mysql_db.sh
The official documentation for loading UMLS in a MySQL DB can be found at here.
Some source vocabularies contain concept types that are not useful for entity linking. Also, Dutch vocabularies in UMLS do not contain many drug names. Use dutch-umls_to_concept-table.ipynb to:
- Download concepts, names and types from the MySQL database
- Remove irrelevant concepts based on term type in source vocabulary
- Remove irrelevant concepts based on UMLS type
- Remove problematic names for Dutch language
- Add many concepts and synonyms from Dutch SNOMED
- Add custom names
- Save the concept table in a format compatible with MedCAT
Note: This requires a Dutch SNOMED concept table, which can be created in dutch-snomed_to_concept-table.ipynb.
To setup the Python environment, install the required Python Packages with PIP:
pip install -r requirements.txt
Target file should follow the specifications defined in the MedCAT repository.
In short:
Column | Description | Example values |
---|---|---|
cui | concept id | C0000001 |
name | term name | Longkanker |
name_status | Term type in source | PN (Primary name), SY (Synonym), for others see https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/abbreviations.html#TTY |
type_ids | Semantic type identifier | T047 (Based on UMLS) |
ontologies | Source Ontology | SNOMEDCT_US, MSHDUT, MDRDUT, ICPC2EDUT, ICPCDUT, or something custom |
Select the relevant columns using your preferred way of interacting with SQL databases and save the results to a CSV-file. Also include the header. A quick way to do this, is:
SELECT distinct MRCONSO.cui, str as name, sab as ontologies, tty as name_status, tui as type_ids
FROM MRCONSO
LEFT JOIN MRSTY ON MRSTY.cui = MRCONSO.cui
ORDER BY MRCONSO.cui ASC;
The output should look like this:
% head -5 umls-dutch.csv
cui,name,ontologies,name_status,type_ids
C0000001,A-zenuwvezels,ONTOLOGY1,PN,T001
C0000002,Abattoir,ONTOLOGY1,PN,T002
C0000002,Abattoirs,ONTOLOGY1,SY,T002
C0000003,Abbreviated Injury Scale,ONTOLOGY1,PN,T003
Method in: dutch-snomed_to_concept-table.ipynb
License and source files can be acquired from SNOMED MLDS.
Method in: dutch-hpo_to_concept-table.ipynb
Most HPO names were translated to Dutch by Radboudumc's Clinical Genetics team via CrowdIn. This is the primary source in creating a concept table of Dutch HPO concepts. In an effort to make a more complete set of synonyms, names from UMLS and SNOMED, and a few manually created names, are added.
To generate MedCAT models from the concept table, see the instructions in the MedCAT repository. For Dutch language, use these parameters in the configuration:
# Set Dutch spaCy model. The small or medium models should also be fine.
cat.general.spacy_model = 'nl_core_news_lg'
# This handles diacritics, which are common in Dutch language but not in English language.
cat.general.diacritics = True
# This distinguishes "ALS" (the disease) from "als" (voegwoord en voorzetsel).
cat.ner.check_upper_case_names = True
MedCAT contains functionality for unsupervised training to solve ambiguity of medical concepts. A relatively easy way to do this is by providing Dutch medical wikipedia articles. Although this is not representable for clinical language used in EHRs, the resulting model performs surprisingly well. See dutch-medical-wikipedia for instructions on how to generate this dataset.