Data processing and models related to "Quantifying the performance of machine learning models in materials discovery"

CitrineInformatics-ERD-public/sl_discovery

Quantifying the performance of machine learning models in materials discovery [code repository]

This repository contains the data and processing scripts needed to reproduce the work in the article: Quantifying machine learning model performance in materials discovery, Borg et al., arXiv.2210.13587 [cond-mat.mtrl-sci] (2022). DOI: 10.48550/arXiv.2210.13587.

Simulated Sequential Learning (SL) Quickstart

  1. Install the required packages (the example below uses Anaconda):
conda create -n [ENV_NAME] pip numpy
conda activate [ENV_NAME]
pip install -r requirements.txt
  2. Set up the configuration files
  • To perform a simulated SL run, create an SL configuration file (e.g. test.yaml) and a dataset configuration file (e.g. matbench_expt_gap_test.yaml). These files define the parameters for parsing a dataset and configuring the SL run.
  • This repo is currently set up to create datasets from Matbench and Starrydata2 for design challenges that connect chemical compositions (i.e. chemical formulas) to a real-valued physical property.
    • Matbench: The latest Matbench dataset will be queried and returned.
    • Starrydata2: Uses data queried in August 2021. Processing is defined in Starrydata processing.
  3. Run 1-execute_sl_workflow.ipynb. The path to the configuration file(s) can be set in cell 2.

  4. Run 2-quickplot.ipynb. Quickplot takes a single SL run as input (i.e. for one target range) and generates a figure with six subplots:

  • (a) Discovery yield as a function of iteration
  • (b) Model error as a function of iteration
  • (c) Discovery probability as a function of iteration
  • (d-f) Discovery acceleration factor for n = 1, 3, and 5 target materials.
  5. Scripts to generate the figures shown in the manuscript are stored in simulated_SL, with a separate script for each figure.
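To make the idea of a simulated SL run concrete, here is a minimal, self-contained sketch. It is not the code from this repository: the 1-nearest-neighbor surrogate, the greedy "highest predicted value" selection, and the definition of discovery yield as the fraction of target candidates found so far are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random


def nearest_neighbor_prediction(x, training):
    """Predict y for feature x as the y of the closest training point (toy surrogate)."""
    return min(training, key=lambda point: abs(point[0] - x))[1]


def simulate_sl(data, target_range, n_training, batch, iterations, seed=0):
    """Run one toy SL trial and return the discovery yield after each iteration.

    data: list of (x, y) pairs; target_range: (low, high) bounds on y.
    Assumes n_training >= 1 so the surrogate always has data to predict from.
    """
    low, high = target_range
    rng = random.Random(seed)
    pool = list(data)
    rng.shuffle(pool)
    training, candidates = pool[:n_training], pool[n_training:]
    # Assumed definition: only targets in the candidate pool count toward yield.
    n_targets = sum(low <= y <= high for _, y in candidates)
    found, yield_curve = 0, []
    for _ in range(iterations):
        if not candidates:
            break
        # Rank remaining candidates by predicted value and take the top `batch`.
        candidates.sort(
            key=lambda p: nearest_neighbor_prediction(p[0], training), reverse=True
        )
        selected, candidates = candidates[:batch], candidates[batch:]
        # "Measure" the selected candidates by revealing their true values.
        found += sum(low <= y <= high for _, y in selected)
        training.extend(selected)
        yield_curve.append(found / n_targets if n_targets else 0.0)
    return yield_curve


# Synthetic data: y increases with x, so the targets are the largest-x points.
data = [(x / 10, (x / 10) ** 2) for x in range(100)]
curve = simulate_sl(data, target_range=(81.0, 100.0), n_training=10, batch=5, iterations=10)
print(curve)
```

Repeating such a run over several seeds would correspond to the trials parameter, and swapping the ranking step for a random shuffle gives a baseline to compare against.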

   

Configuration file parameters:

  • Dataset parameters:
    • dataset (str): The name of the input dataset to be processed (processing steps defined in load_datasets.py)
    • output (str): output property (must be column in dataset)
    • categoricals (str, null): Categorical features

 

  • Starrydata specific parameters:
    • comp_class (str, null): Selects a subset of records based on composition (e.g. '111-type') using logic we have predefined here.
    • material_family (str, null): Starrydata-generated label for the material family.
    • filtered (True/False): Filters Starrydata datasets based on physically relevant property values (e.g. filters on records where ZT < 2).
    • sample_form (str, null): Filters Starrydata records based on sample form (e.g. 'bulk').
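As a rough illustration of how the Starrydata-specific parameters could act on a set of records, the sketch below applies them as simple filters. The record fields (formula, material_family, sample_form, ZT), the function name filter_records, and the reading of the ZT example as "keep records with ZT below 2" are assumptions made for this example; the actual processing in the repository may differ.

```python
def filter_records(records, material_family=None, sample_form=None,
                   filtered=False, zt_max=2.0):
    """Return the records that pass every filter that is switched on.

    A value of None (null in YAML) disables the corresponding filter.
    """
    kept = []
    for record in records:
        if material_family is not None and record.get("material_family") != material_family:
            continue
        if sample_form is not None and record.get("sample_form") != sample_form:
            continue
        zt = record.get("ZT")
        # Assumed rule: with filtered=True, drop records with missing or implausibly high ZT.
        if filtered and (zt is None or zt >= zt_max):
            continue
        kept.append(record)
    return kept


# Hypothetical records, for illustration only.
records = [
    {"formula": "A", "material_family": "family_1", "sample_form": "bulk", "ZT": 0.8},
    {"formula": "B", "material_family": "family_1", "sample_form": "thin film", "ZT": 1.1},
    {"formula": "C", "material_family": "family_2", "sample_form": "bulk", "ZT": 3.5},
]
bulk_plausible = filter_records(records, sample_form="bulk", filtered=True)
print([r["formula"] for r in bulk_plausible])
```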

 

  • SL parameters:
    • n_sample (int): Number of datapoints to sample / downselect from raw data. Set to 0 to use full dataset.
    • n_training (int): Number of training rows to start SL process.
    • iterations (int): Number of SL iterations to perform.
    • trials (int): Number of trials (i.e. independent SL processes) to perform.
    • batch (int): Number of candidates to select at each SL iteration.
    • discovery_break_number (int): Number of target candidates to find before halting the SL process. If set to 0, SL continues for the full number of iterations.
    • poi (str, null): Point of interest: index of a data point that is forced into the training set. Typically set to null.
    • holdout_fraction (float): Fraction of the dataset to hold out (test).
    • targets (list): Min and max of the target range, e.g. [90, 100] will target 10th-decile materials (the top 10% of output values).
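The parameters above can be pictured as two small mappings, one per configuration file. The sketch below writes them as Python dictionaries with placeholder values (the dataset name, output column, and all numbers are hypothetical), checks that the SL keys listed above are present, and shows one plausible way to turn a targets range into value bounds. It assumes that targets is expressed in percentiles of the output property, which is inferred from the [90, 100] example rather than confirmed by the source.

```python
# Keys taken from the SL parameter list above.
REQUIRED_SL_KEYS = {
    "n_sample", "n_training", "iterations", "trials", "batch",
    "discovery_break_number", "poi", "holdout_fraction", "targets",
}

# Placeholder values for illustration; None corresponds to null in YAML.
dataset_config = {"dataset": "my_dataset", "output": "my_property", "categoricals": None}
sl_config = {
    "n_sample": 0,              # 0 = use the full dataset
    "n_training": 50,
    "iterations": 100,
    "trials": 10,
    "batch": 1,
    "discovery_break_number": 0,
    "poi": None,
    "holdout_fraction": 0.2,
    "targets": [90, 100],
}

missing = REQUIRED_SL_KEYS - set(sl_config)


def percentile(sorted_values, q):
    """Linearly interpolated percentile q (0-100) of an ascending list."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def target_bounds(values, targets):
    """Convert a [min, max] percentile range into bounds on the property value."""
    ordered = sorted(values)
    return percentile(ordered, targets[0]), percentile(ordered, targets[1])


# Synthetic property values 0, 1, ..., 100 make the percentiles easy to read off.
bounds = target_bounds(range(101), sl_config["targets"])
print(missing, bounds)
```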
