Optical and transport data as elemental pseudo-inverse contributions #892
Conversation
…d the optical database, better use of files for restart
Added pseudo-inverse contributions of elements to properties as Element features. The properties already included are the optical data from refractiveindex.info and the transport properties from the Materials Project.
Merge optics into main
…he Materials Project.
…he composition as strings
Good work @gbrunin! Following up on this PR, @computron is there anyone we should contact to have it merged or discussed? Thanks, David |
Hello @janosh, would you need any additional information to follow up on this PR? Thanks, David |
@davidwaroquiers Sorry to say I'm prob not the right person to merge this. Maybe ping @computron and @ardunn again for green-lighting a big PR like this one. |
Hello @janosh, ok, thanks for the update! @computron and @ardunn, do you need any additional input on this topic? Best, |
… fixed after this.
Impute NaNs for elemental values in element properties
Some minor comments, hope we can get this in very soon @gbrunin!
I'll keep triggering the workflows if you make any changes but will also try to make the effort to test this myself very soon.
This is the only "big" file that I can see, and it is approximately twice the size of the next biggest descriptor data (Jarvis). If I compress it with gzip or xz I get it down to ~2.6 MB which might be preferable but will need the decompression logic added too.
I'm not sure it's worth using LFS for this (GH enforces bandwidth limits much more heavily on LFS than on normal transactions, and very annoyingly counts GH Actions traffic against this bandwidth in a very hard-to-cache way). An extra 2 MB for a file that will never change does not seem like a big deal to me compared to the additional effort.
The alternative would be hosting the file on figshare/Zenodo and downloading it on-demand. Any thoughts?
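If the compressed route is taken, the decompression logic can be minimal: pandas infers gzip/xz compression from the file extension. A small self-contained sketch (the file name is illustrative, not the actual descriptor file in this PR):

```python
import lzma

import pandas as pd

# Write a tiny CSV compressed with xz; in the real repo the .csv.xz file
# would simply be committed alongside the loader code.
with lzma.open("descriptor_data.csv.xz", "wt") as f:
    f.write("element,value\nSi,1.0\nGe,2.0\n")

# pandas infers the compression from the ".xz" extension, so no explicit
# decompression step is needed when loading.
df = pd.read_csv("descriptor_data.csv.xz")
print(df.shape)  # (2, 2)
```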
I compressed it; the total repo size now increases by 20%. I agree that the additional space taken by the repo does not justify the extra effort of using LFS or Zenodo. We should check with Anubhav whether that's fine with him.
Great, are you happy to contact him again? I can't see there being a problem... (you have my blessing)
3.9 failures are still simply for the Test PyPI upload which won't work from forks (I'll probably fix this at some point after this PR) -- see #933 |
This is looking good to me now, let's just wait to check about the data size before merging. Thanks @gbrunin!
I'm happy that everything works locally, and we're fine to merge the dataset in. I'll raise a couple of minor issues that have come up, but otherwise great work and thanks again @gbrunin! |
Summary
This is work done with @davidwaroquiers, @gpetretto and @gmrigna.
The idea is to use the data from refractiveindex.info and the transport properties from the Materials Project to featurize new systems based on their composition.
As an example, let's take the effective mass of electrons. From the MP, we have >45 000 systems with corresponding effective masses. We can write the relation
Composition matrix × Pseudo-inverse contributions ≈ Effective masses
where each of these matrices has >45 000 rows. The composition matrix has one column per chemical element present in the dataset, and the other two are column vectors. The pseudo-inverse contributions can be computed for a given dataset: they are the least-squares fit between the compositions and the effective masses, and can be read as the average contribution of each element to the effective mass when it is present in a system (which can be negative if the presence of an element generally decreases the effective mass).
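The least-squares fit above can be sketched with NumPy. The compositions, property values, and variable names below are toy examples, not the actual matminer implementation:

```python
import numpy as np

# Toy composition matrix: one row per system, one column per element
# (say Si, O, Ge); entries are atomic fractions.
X = np.array([
    [1.0,  0.0,  0.0 ],   # pure Si
    [0.5,  0.5,  0.0 ],   # SiO
    [0.0,  0.0,  1.0 ],   # pure Ge
    [0.25, 0.5,  0.25],   # mixed system
])
# Toy target property per system (e.g. electron effective mass).
y = np.array([0.9, 1.4, 0.6, 1.2])

# Pseudo-inverse contributions: least-squares solution of X @ p ≈ y.
p, *_ = np.linalg.lstsq(X, y, rcond=None)

# Featurizing a new composition then just means weighting the elemental
# contributions by the new system's atomic fractions.
new_system = np.array([0.5, 0.25, 0.25])
predicted = new_system @ p
```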
From our tests on industrial cases, including these pseudo-inverse contributions as composition features improves the ML models (it will depend on what is predicted though).
In this PR, we have done this for optical data (refractive index, extinction coefficient, reflectivity), taken from refractiveindex.info, and for transport properties (all those present in the MP). For optical data, the properties are spectra, and by default 10 wavelengths are sampled in the visible range. This range and sampling can be changed by the user if, say, the IR spectrum is more important for their application.
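The wavelength sampling can be sketched as follows. The exact defaults and parameter names in the PR may differ; this just illustrates swapping the visible range for an IR one:

```python
import numpy as np

# Default-like sampling: 10 wavelengths evenly spaced over the visible
# range (~380-780 nm). Each sampled wavelength yields one feature per
# optical property, e.g. n(lambda) for the refractive index.
visible = np.linspace(380, 780, 10)    # nm

# A user whose application is IR-dominated could instead sample there.
infrared = np.linspace(780, 2500, 10)  # nm
```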
The code can be used to generate new pseudo-inverse contributions from new data and add these as features as well.
TODO
Since the user can change the range and sampling of the optical spectra, the whole refractiveindex.info database has to be stored. We have added it in tar.xz format (< 2 MB). The code starts by extracting the archive into a ~/.matminer directory, which the user can change if this is not desirable. This avoids adding the extracted files to the source tree, which would more than double the current size of the repo.
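The extract-on-first-use idea can be sketched like this. The function and file names are illustrative (the PR targets ~/.matminer; a temp directory is used here so the example is self-contained):

```python
import tarfile
import tempfile
from pathlib import Path

def ensure_extracted(archive: Path, cache_dir: Path) -> Path:
    """Extract a .tar.xz archive into a user-level cache directory."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:xz") as tar:
        tar.extractall(cache_dir)
    return cache_dir

# Build a tiny demo archive, then extract it into a cache directory.
tmp = Path(tempfile.mkdtemp())
data_file = tmp / "nk_data.csv"
data_file.write_text("wavelength,n,k\n500,1.5,0.01\n")
archive = tmp / "optics.tar.xz"
with tarfile.open(archive, "w:xz") as tar:
    tar.add(data_file, arcname="nk_data.csv")

cache = ensure_extracted(archive, tmp / "cache")
print((cache / "nk_data.csv").exists())  # True
```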
This is of course open for discussion, depending on what you would prefer.
We are open to having a chat about all this if you think it is necessary; maybe I did not explain everything clearly and some points need to be clarified.