Skip to content

Latest commit

 

History

History
289 lines (177 loc) · 8.6 KB

File metadata and controls

289 lines (177 loc) · 8.6 KB

Benchmark

To standardize the clinical trial outcome prediction, we create a benchmark dataset for Trial Outcome Prediction named TOP, which incorporate rich data components about clinical trials, including drug, disease and protocol (eligibility criteria). Benchmark can be mainly divided into two parts:

  • Raw Data describes all the data sources.
  • Data Curation Process describes data curation process.
    • Collect all the records
    • diseases to icd10
    • drug to SMILES
    • ICD-10 code hierarchy
    • Sentence Embedding for trial protocol
    • Selection criteria of clinical trial
    • Data split
  • Tutorial

Raw Data

ClinicalTrial.gov

  • description

    • We download all the clinical trials records from ClinicalTrial.gov. The processed data are based on ClinicalTrials.gov database on Feb 20, 2021. It contains 348,891 clinical trial records. The data size grows with time because more clinical trial records are added. It describes many important information about clinical trials, including NCT ID (i.e., identifiers to each clinical study), disease names, drugs, brief title and summary, phase, criteria, and statistical analysis results.
    • Outcome labels are provided by IQVIA.
  • output

    • ./raw_data: store all the xml files for all the trials (identified by NCT ID).
mkdir -p raw_data
cd raw_data
wget https://clinicaltrials.gov/AllPublicXML.zip

Then we unzip the ZIP file. The unzipped file occupies over 8.6 G. Please make sure you have enough space.

unzip AllPublicXML.zip
cd ../

DrugBank

  • description

    • We use DrugBank to get the molecule structures (SMILES, simplified molecular-input line-entry system) of the drug.
  • input

    • None
  • output

    • data/drugbank_drugs_info.csv

ClinicalTable

ClinicalTable is a public API to convert disease name (natural language) into ICD-10 code.

MoleculeNet

  • description

    • MoleculeNet include five datasets across the main categories of drug pharmaco-kinetics (PK). For absorption, we use the bioavailability dataset. For distribution, we use the blood-brain-barrier experimental results provided. For metabolism, we use the CYP2C19 experiment paper, which is hosted in the PubChem biassay portal under AID 1851. For excretion, we use the clearance dataset from the eDrug3D database. For toxicity, we use the ToxCast dataset, provided by MoleculeNet. We consider drugs that are not toxic across all toxicology assays as not toxic and otherwise toxic.
  • input

    • None
  • output

    • data/ADMET

Data Curation Process

Collect all the records

  • description

    • download all the records from clinicaltrial.gov. The current version has 370K trial IDs.
  • input

    • raw_data/: raw data, store all the xml files for all the trials (identified by NCT ID).
  • output

    • data/all_xml: store NCT IDs for all the xml files for all the trials.
find raw_data/ -name NCT*.xml | sort > data/all_xml

Disease to ICD-10 code

  • description

    • The diseases in ClinicalTrialGov are described in natural language.

    • On the other hand, ICD-10 is the 10th revision of the International Statistical Classification of Diseases and Related Health Problems (ICD), a medical classification list by the World Health Organization (WHO). It leverages the hierarchical information inherent to medical ontologies.

    • We use ClinicalTable, a public API to convert disease name (natural language) into ICD-10 code.

  • input

    • raw_data/
    • data/all_xml
  • output

    • data/diseases.csv

It takes around 2 hours.

python benchmark/collect_disease_from_raw.py

drug to SMILES

  • description

    • SMILES is simplified molecular-input line-entry system of the molecule.

    • The drugs in ClinicalTrialGov are described in natural language.

    • DrugBank contains rich information about drugs.

    • We use DrugBank to get the molecule structures in terms of SMILES.

  • input

    • data/drugbank_drugs_info.csv
  • output

    • data/drug2smiles.pkl
python benchmark/drug2smiles.py 

Selection criteria of clinical trial

We design the following inclusion/exclusion criteria to select eligible clinical trials for learning.

  • inclusion criteria

    • study-type is interventional
    • intervention-type is small molecules drug
    • it has outcome label
    • disease codes are available
    • drug molecules are available
  • exclusion criteria

    • study-type is observational
    • intervention-type is surgery, biological, device
    • outcome label is not available
    • disease codes are not available
    • drug molecules are not available

The csv file contains following features:

  • nctid: NCT ID, e.g., NCT00000378, NCT04439305.
  • status: completed, terminated, active, not recruiting, withdrawn, unknown status, suspended, recruiting.
  • label: 0 (failure) or 1 (success).
  • phase: I, II, III or IV.
  • diseases: list of diseases.
  • icdcodes: list of icd-10 codes.
  • drugs: list of drug names
  • smiless: list of SMILES
  • criteria: egibility criteria
  • input

    • data/diseases.csv
    • data/drug2smiles.pkl
    • data/all_xml
  • output

    • data/raw_data.csv
python benchmark/collect_raw_data.py | tee data_process.log 
python benchmark/nctid2date.py 
  • input

    • 'data/raw_data.csv'
    • './raw_data'
  • output

    • 'data/nctid_date.txt'

Data Split

  • description (Split criteria)

    • phase I: phase I trials
    • phase II: phase II trials
    • phase III: phase III trials
  • input

    • data/raw_data.csv
  • output:

    • data/phase_I_{train/valid/test}.csv
    • data/phase_II_{train/valid/test}.csv
    • data/phase_III_{train/valid/test}.csv
python benchmark/data_split.py 

ICD-10 code hierarchy

  • description

    • get all the ancestor code for the current icd-10 code.
  • input

    • data/raw_data.csv
  • output:

    • data/icdcode2ancestor_dict.pkl
python benchmark/icdcode_encode.py 

Sentence embedding

  • description

    • BERT embedding to get sentence embedding for sentence in clinical protocol.
  • input

    • data/raw_data.csv
  • output:

    • data/sentence2embedding.pkl
python benchmark/protocol_encode.py 

Tutorial

We provide a jupyter notebook tutorial in tutorial_benchmark.ipynb (in the main folder), which describes some key components of the data curation process.

Contact

Please contact futianfan@gmail.com for help or submit an issue. This is a joint work with Kexin Huang, Cao(Danica) Xiao, Lucas M. Glass and Jimeng Sun.

Benchmark Usage Agreement

The benchmark dataset and code (including data collection and preprocessing, model construction, learning process, evaluation), referred as the Works, are publicly available for Non-Commercial Use only at https://github.com/futianfan/clinical-trial-outcome-prediction. Non-Commercial Use is defined as for academic research or other non-profit educational use which is: (1) not-for-profit; (2) not conducted or funded (unless such funding confers no commercial rights to the funding entity) by an entity engaged in the commercial use, application or exploitation of works similar to the Works; and (3) not intended to produce works for commercial use.