Multimodal Clinical Pseudo-notes for Emergency Department Prediction Tasks using Multiple Embedding Model for EHR (MEME)
In this work, we introduce the Multiple Embedding Model for EHR (MEME), an approach that treats Electronic Health Records (EHR) as multimodal data. MEME represents tabular concepts such as diagnoses and medications as structured natural language text using our "pseudo-notes" method, which allows us to employ Large Language Models (LLMs) for individual EHR representation across a variety of text-classification tasks. We demonstrate the effectiveness of MEME on diverse Emergency Department tasks across multiple hospital systems, where it surpasses both single-modality/embedding methods and traditional machine learning approaches. Additionally, our generalizability tests reveal that training solely on the MIMIC-IV database does not guarantee effective transfer to other hospital institutions.
Electronic Health Records are heterogeneous, containing a mixture of numerical, categorical, and free-text data. However, traditional machine learning models struggle to learn representations of categorical data because conventional one-hot encoding schemes produce sparse matrices. In this work we therefore present pseudo-notes, a data transformation from tabular EHR data to text that lets us leverage recent Large Language Models, which have a stronger understanding of English and context. These in turn generate better representations of EHR data, as demonstrated by this research, which benchmarks text versus tabular input as well as multimodal versus single-modality input.
TL;DR: The Multiple Embedding Model for EHR (MEME) outperforms all benchmark models on a variety of prediction tasks recorded in the Emergency Department. A full description is available in the paper.
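To make the pseudo-note idea concrete, the sketch below renders a single tabular ED record as structured text. The field names and sentence template here are hypothetical illustrations, not the exact templates used in MEME.

# Illustrative sketch only: field names and template are hypothetical,
# not the exact pseudo-note templates used in the paper.
def make_pseudo_note(record: dict) -> str:
    """Render one tabular ED record as structured natural language."""
    return (
        f"Patient is a {record['age']} year old {record['gender']}. "
        f"Chief complaint: {record['chiefcomplaint']}. "
        f"Triage vitals: heart rate {record['heartrate']} bpm, "
        f"blood pressure {record['sbp']}/{record['dbp']} mmHg. "
        f"Medications administered: {', '.join(record['medications'])}."
    )

example = {
    "age": 64, "gender": "female", "chiefcomplaint": "chest pain",
    "heartrate": 102, "sbp": 141, "dbp": 85,
    "medications": ["aspirin", "nitroglycerin"],
}
print(make_pseudo_note(example))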
The MIMIC-IV-ED dataset, part of the larger MIMIC-IV collection, focuses on emergency department records from a major hospital. It provides de-identified patient demographics, triage assessments, vitals, test results, medications, and outcomes, supporting research in emergency care and hospital operations. Access is governed by strict privacy regulations.
Authorized UCLA Mednet users are granted institutional data access; however, due to privacy constraints, the UCLA EHR data cannot be distributed outside these parameters.
To replicate our results, install the MedBERT model from Hugging Face. We recommend PyTorch 1.13. Use the specified seed with our provided split function to obtain consistent train, validation, and test splits.
from transformers import AutoTokenizer, AutoModel
# Load MedBERT from Hugging Face
tokenizer = AutoTokenizer.from_pretrained("Charangan/MedBERT")
model = AutoModel.from_pretrained("Charangan/MedBERT")
...
# Split your dataset
train, validate, test = train_validate_test_split(data, seed=7)
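The split function itself ships with our code; purely for orientation, a seeded split over a pandas DataFrame could look like the minimal sketch below. The 70/15/15 proportions are an illustrative assumption, not necessarily the ones used in the paper.

import pandas as pd

def train_validate_test_split(df: pd.DataFrame, seed: int = 7,
                              train_frac: float = 0.7, val_frac: float = 0.15):
    """Shuffle with a fixed seed, then slice into train/validation/test."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    train_end = int(train_frac * len(shuffled))
    val_end = train_end + int(val_frac * len(shuffled))
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]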
More details on MedBERT can be found on its Hugging Face page.
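As a quick sanity check that the model loads correctly, you can embed a single pseudo-note with the tokenizer and model from the snippet above. Taking the [CLS] token as the note representation is a common convention and an assumption here, not necessarily the pooling used in every MEME configuration.

import torch

# Tokenize one pseudo-note and use the [CLS] embedding as its representation.
note = "Patient is a 64 year old female. Chief complaint: chest pain."
inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # shape: (1, hidden_size)
print(cls_embedding.shape)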
For training, ensure you have GPU access. Adjust the batch size based on your GPU memory; we used a batch size of 64. Note that changes in batch size might slightly affect the results. Set up your training environment by creating a .json configuration file and a logs/ directory. Here's a sample structure for your configuration file:
{
    "log_dir": "logs",
    "experiment_name": "Your_Experiment_Name",
    "batch": "Your_Batch_Size",
    "num_workers": 8,               // Enhances tokenization speed
    "data": "Path_to_Your_Data_File",
    "model": "./Charangan/MedBERT", // Refer to the Installation section for model setup
    "mode": "multimodal",
    "gpu": "Your_GPU_ID",
    "task": "Your_Task",            // 'eddispo' or 'multitask'
    "epoch": "Your_Num_Epochs",     // Early stopping is implemented
    "dataset": "MIMIC"
}
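Note that standard JSON parsers reject the // annotations above, so remove them from your actual configuration file. For orientation, a config like this could be consumed via the -c flag as in the minimal sketch below; this is an illustration, not the actual parsing logic in trainer.py.

import argparse
import json

# Minimal config-loading sketch; the real trainer.py may differ in details.
parser = argparse.ArgumentParser()
parser.add_argument("-c", "--config", required=True, help="Path to JSON config file")
args = parser.parse_args()

with open(args.config) as f:
    config = json.load(f)

print(f"Running {config['experiment_name']} ({config['task']}) on GPU {config['gpu']}")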
To start the training, initiate a screen session and execute the following command:
python3 trainer.py -c train_config.json
Ensure you replace placeholder values with actual data specific to your setup.
For running inference, you will need to set up a .json configuration file similar to the training setup but with additional fields specific to inference. Here's a sample structure:
{
    "log_dir": "logs",
    "experiment_name": "Your_Experiment_Name",
    "batch": "Your_Batch_Size",
    "num_workers": 8,
    "data": "Path_to_Your_Data_File",
    "model": "./Charangan/MedBERT",
    "weights": "Path_to_Your_Model_Weights",
    "mode": "multimodal",
    "gpu": "Your_GPU_ID",
    "task": "Your_Task",
    "dataset": "MIMIC",
    "validation": "Validation_Mode" // 'within' or 'across'
}
To initiate the inference process, start a screen session and run the following command:
python3 inference.py -c inference_config.json
Model weights for the MIMIC-trained models can be found on the Hugging Face website: here
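If you want to inspect downloaded weights outside of inference.py, the sketch below shows one way to restore them into the MedBERT model loaded earlier. It assumes the checkpoint is a plain PyTorch state_dict, which is our assumption and may not match the released files exactly.

import torch

# Assumption: the checkpoint is a plain state_dict for the MedBERT encoder
# (possibly with extra classification-head keys); strict=False tolerates those.
state_dict = torch.load("Path_to_Your_Model_Weights", map_location="cpu")
model.load_state_dict(state_dict, strict=False)
model.eval()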
@misc{lee2024multimodal,
    title={Multimodal Clinical Pseudo-notes for Emergency Department Prediction Tasks using Multiple Embedding Model for EHR (MEME)},
    author={Lee, Simon A and Jain, Sujay and Chen, Alex and Biswas, Arabdha and Fang, Jennifer and Rudas, Akos and Chiang, Jeffrey N},
    journal={arXiv preprint arXiv:2402.00160},
    year={2024}
}