Optical and transport data as elemental pseudo-inverse contributions #892
Conversation
…d the optical database, better use of files for restart
Added pseudo-inverse contributions of elements to properties as Element features. The properties already included are the optical data from refractiveindex.info and the transport properties from the Materials Project.
Merge optics into main
…he Materials Project.
…he composition as strings
Good work @gbrunin! Following up on this PR, @computron is there anyone we should contact to have it merged or discussed? Thanks, David |
Hello @janosh, would you need any additional information to follow up on this PR? Thanks, David |
@davidwaroquiers Sorry to say I'm prob not the right person to merge this. Maybe ping @computron and @ardunn again for green-lighting a big PR like this one. |
Hello @janosh, ok, thanks for the update! @computron and @ardunn, do you need any additional input on this topic? Best, |
… fixed after this.
Impute NaNs for elemental values in element properties
Some minor comments, hope we can get this in very soon @gbrunin!
I'll keep triggering the workflows if you make any changes but will also try to make the effort to test this myself very soon.
This is the only "big" file that I can see, and it is approximately twice the size of the next biggest descriptor data (Jarvis). If I compress it with gzip or xz I get it down to ~2.6 MB which might be preferable but will need the decompression logic added too.
I'm not sure it's worth using LFS for this (GH enforces bandwidth limits much more heavily on LFS than on normal transactions, and very annoyingly counts GH Actions traffic against this bandwidth in a very hard-to-cache way). An extra 2 MB for a file that will never change does not seem like a big deal to me compared to the additional effort.
The alternative would be hosting the file on figshare/Zenodo and downloading it on-demand. Any thoughts?
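If the compressed route is taken, the decompression logic can be minimal: pandas infers gzip/xz compression from the file extension. A small self-contained sketch (the file name is illustrative, not the actual descriptor file in this PR):

```python
import lzma

import pandas as pd

# Write a tiny CSV compressed with xz; in the real repo the .csv.xz file
# would simply be committed alongside the loader code.
with lzma.open("descriptor_data.csv.xz", "wt") as f:
    f.write("element,value\nSi,1.0\nGe,2.0\n")

# pandas infers the compression from the ".xz" extension, so no explicit
# decompression step is needed when loading.
df = pd.read_csv("descriptor_data.csv.xz")
print(df.shape)  # (2, 2)
```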
I compressed it; the total repo size now increases by 20%. I agree that the additional space taken by the repo does not justify the extra effort of using LFS or Zenodo. We should check with Anubhav whether that's fine with him.
Great, are you happy to contact him again? I can't see there being a problem... (you have my blessing)
3.9 failures are still simply for the Test PyPI upload which won't work from forks (I'll probably fix this at some point after this PR) -- see #933 |
This is looking good to me now, let's just wait to check about the data size before merging. Thanks @gbrunin!
I'm happy that everything works locally, and we're fine to merge the dataset in. I'll raise a couple of minor issues that have come up, but otherwise great work and thanks again @gbrunin! |
Summary
This is work done with @davidwaroquiers, @gpetretto and @gmrigna.
The idea is to use the data from refractiveindex.info and the transport properties from the Materials Project to featurize new systems based on their composition.
As an example, let's take the effective mass of electrons. From the MP, we have >45 000 systems with corresponding effective masses. We can write the relation
Composition matrix × Pseudo-inverse contributions ≈ Effective masses
where each of these matrices has >45 000 rows. The composition matrix has one column per chemical element present in the dataset, and the other two are column vectors. The pseudo-inverse contributions can be computed for a given dataset: they are the least-squares fit between the compositions and the effective masses, and can be read as the average contribution of each element to the effective mass when it is present in a system (which can be negative if the presence of an element generally decreases the effective mass).
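The least-squares fit above can be sketched with NumPy. The compositions, property values, and variable names below are toy examples, not the actual matminer implementation:

```python
import numpy as np

# Toy composition matrix: one row per system, one column per element
# (say Si, O, Ge); entries are atomic fractions.
X = np.array([
    [1.0,  0.0,  0.0 ],   # pure Si
    [0.5,  0.5,  0.0 ],   # SiO
    [0.0,  0.0,  1.0 ],   # pure Ge
    [0.25, 0.5,  0.25],   # mixed system
])
# Toy target property per system (e.g. electron effective mass).
y = np.array([0.9, 1.4, 0.6, 1.2])

# Pseudo-inverse contributions: least-squares solution of X @ p ≈ y.
p, *_ = np.linalg.lstsq(X, y, rcond=None)

# Featurizing a new composition then just means weighting the elemental
# contributions by the new system's atomic fractions.
new_system = np.array([0.5, 0.25, 0.25])
predicted = new_system @ p
```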
From our tests on industrial cases, including these pseudo-inverse contributions as composition features improves the ML models (it will depend on what is predicted though).
In this PR, we have done this for optical data (refractive index, extinction coefficient, reflectivity), taken from refractiveindex.info, and for transport properties (all those present in the MP). For optical data, the properties are spectra, and by default 10 wavelengths are sampled in the visible range. This range and sampling can be changed by the user if, say, the IR spectrum is more important for their application.
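The wavelength sampling can be sketched as follows. The exact defaults and parameter names in the PR may differ; this just illustrates swapping the visible range for an IR one:

```python
import numpy as np

# Default-like sampling: 10 wavelengths evenly spaced over the visible
# range (~380-780 nm). Each sampled wavelength yields one feature per
# optical property, e.g. n(lambda) for the refractive index.
visible = np.linspace(380, 780, 10)    # nm

# A user whose application is IR-dominated could instead sample there.
infrared = np.linspace(780, 2500, 10)  # nm
```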
The code can be used to generate new pseudo-inverse contributions from new data and add these as features as well.
TODO
Since the user can change the range and sampling of the optical spectra, the whole refractiveindex.info database has to be stored. We have added it in tar.xz format (< 2 MB). The code starts by extracting the archive into a ~/.matminer directory, which the user can change if this is not desirable. This avoids adding the extracted files to the source tree, which would more than double the current size of the repo.
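The extract-on-first-use idea can be sketched like this. The function and file names are illustrative (the PR targets ~/.matminer; a temp directory is used here so the example is self-contained):

```python
import tarfile
import tempfile
from pathlib import Path

def ensure_extracted(archive: Path, cache_dir: Path) -> Path:
    """Extract a .tar.xz archive into a user-level cache directory."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:xz") as tar:
        tar.extractall(cache_dir)
    return cache_dir

# Build a tiny demo archive, then extract it into a cache directory.
tmp = Path(tempfile.mkdtemp())
data_file = tmp / "nk_data.csv"
data_file.write_text("wavelength,n,k\n500,1.5,0.01\n")
archive = tmp / "optics.tar.xz"
with tarfile.open(archive, "w:xz") as tar:
    tar.add(data_file, arcname="nk_data.csv")

cache = ensure_extracted(archive, tmp / "cache")
print((cache / "nk_data.csv").exists())  # True
```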
This is of course open for discussion, depending on what you would prefer.
We are open to having a chat about all this if you think it is necessary; maybe I did not explain everything clearly and some points need to be clarified.