Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve CIF checking, support for isotopes, and correct handling of new VASP 6.4.2 POSCAR format incl. slashes in header #3542

Merged
merged 21 commits into from
Jan 17, 2024

Conversation

esoteric-ephemera
Copy link
Contributor

@esoteric-ephemera esoteric-ephemera commented Jan 5, 2024

This PR does a few things:

  • pymatgen.io.cif.CifParser now checks that the composition reported in a CIF matches that of the resulting pymatgen Structure and if not, warns the user what elements are missing (most often Hydrogens in experimental CIFs which are hard to locate without using neutron scattering), or when the parsing has gone awry and the stoichiometries differ
  • Adds support for isotopes, particularly Deuterium and Tritium.
    • pymatgen.core.periodic_table.Element now has a few more attributes, such as the atomic mass number (number of protons and neutrons) as the .A attr
    • A structure containing an isotope will appear to only have the base element in it (e.g., D2O will look like H2O), and when files are written, the isotopes are converted to the element. However, the isotope names, mass, etc. are preserved in the element attr
    • This closes issue #2151
  • As of VASP version 6.4.2, when using POTCARs with a SHA, all files like CONTCAR, CHG*, LOCPOT, AECCAR* will have the element names followed by a slash
    • This closes issue #3503 and atomate2 issue #662
    • Allow for correct parsing of these via changes to pymatgen.io.vasp.inputs.Poscar
  • Tests for these additions

@mkhorton
Copy link
Member

mkhorton commented Jan 5, 2024

Thanks for these nice improvements @esoteric-ephemera

Since you’re looking at this part of the code, do you have a view of how H2O or OH sites should be handled? It’s one remaining weakness of the CIF parser I think.

@mkhorton
Copy link
Member

mkhorton commented Jan 5, 2024

Also adding a note that the mp prefix has been registered with the IUCr, if we wanted to add support for writing out custom pymatgen fields (like site_properties). There’s definitely a way in which pymatgen could losslessly convert Structure to/from CIF. It’s been something on the back burner for me for a while.

@esoteric-ephemera
Copy link
Contributor Author

Since you’re looking at this part of the code, do you have a view of how H2O or OH sites should be handled? It’s one remaining weakness of the CIF parser I think.

Are you thinking of experimental structures where the H sites are undetermined/unlisted? For organics, @Andrew-S-Rosen had a good suggestion of trying to mimic how Olex (used by crystallographers) adds hydrogen atoms to structures where their positions are unknown

@kristinpersson had a suggestion that was based on a geometric construction - that would also be a cheap solution

I have a few other ideas that would use a DFT charge density, but these are more intensive options

Also adding a note that the mp prefix has been registered with the IUCr, if we wanted to add support for writing out custom pymatgen fields (like site_properties). There’s definitely a way in which pymatgen could losslessly convert Structure to/from CIF. It’s been something on the back burner for me for a while.

That's great to know! The precision of atomic sites in CIFs seems to be a bottleneck to lossless conversion, since this can result in composition mismatches. Since a lot of CIFs are written by hand (especially ICSD ones), this happens frequently, see e.g., issue #3373

@shyuep
Copy link
Member

shyuep commented Jan 5, 2024

NIce work! One minor thing: does the cif assessor really need to be a class? I would think a CifParser.assess or CifParser.validate method would suffice. I am heavy proponent for OOP but in this case, the CifAssessor doesn't seem to conceptually an object per se.

@mkhorton
Copy link
Member

mkhorton commented Jan 5, 2024

Are you thinking of experimental structures where the H sites are undetermined/unlisted?

Not quite, there are two problems: (1) “implicit hydrogens” whereby the hydrogens are not listed in the atoms at all (so we might not realise hydrogens are missing unless we do a composition check as suggested), (2) the CIF files which list a molecule on a single atomic site, eg “OH” or “H2O” etc. Here, the CIF file at least tells us that a H should be present, but we still do not parse appropriately (I believe for OH, we parse as just O, and provide a warning).

For (2), it’s a difficult situation, because all options are unideal in some way. Omitting the H completely is wrong. Putting the H in an arbitrary location (eg correct bond length, perhaps do a bond valence sum or similar to find a plausible position) is tricky. Defining the site as containing a molecule is probably most correct, but does not fit with the current pymatgen object model.

@esoteric-ephemera
Copy link
Contributor Author

NIce work! One minor thing: does the cif assessor really need to be a class? I would think a CifParser.assess or CifParser.validate method would suffice. I am heavy proponent for OOP but in this case, the CifAssessor doesn't seem to conceptually an object per se.

Thanks! Yes I'll try to refactor the assessor

@mkhorton
Copy link
Member

mkhorton commented Jan 5, 2024

Sorry, sent that message too soon. I’m not familiar with the Olex method. My point anyway is just that there are cases where we have to figure out the hydrogen is missing based on composition, but also cases where we’re explicitly told a hydrogen should be there.

In terms of lossless conversion, correct me if I’m wrong but there’s no limit on the precision with which we write out atomic positions? We can write out as many significant figures as we do with JSON. I know the issue with the handwritten CIFs which is why we put in the rounding logic inside the parser (eg converting 0.333 -> 1/3 etc), but that seems like a separate issue.

@shyuep
Copy link
Member

shyuep commented Jan 5, 2024

@mkhorton @esoteric-ephemera I would just say that there are now universal potentials. For an OH, you can take the coordinates as the position of the O and then initialize a H at 0.5A bond distance (maybe away from other atom as well). Then use a UP to relax the H positions. :-) For this very constrained use case, I think it would lead to very reasonable results (much more than pure geometry / empirical constructions).

@esoteric-ephemera
Copy link
Contributor Author

esoteric-ephemera commented Jan 5, 2024

(2) the CIF files which list a molecule on a single atomic site, eg “OH” or “H2O” etc. Here, the CIF file at least tells us that a H should be present, but we still do not parse appropriately (I believe for OH, we parse as just O, and provide a warning).

OK I see what you mean. The best solution there in keeping with pymatgen's model is probably putting O on the site position and adding in H

My point anyway is just that there are cases where we have to figure out the hydrogen is missing based on composition, but also cases where we’re explicitly told a hydrogen should be there.

This PR warns the user in at least the latter case: if H is reported in the CIF but missing in the PMG structure, the user is warned. It usually seems to be the case that H's are reported in the CIF composition even when no H site positions are given. But if they're not, we might want to include a charge balance check

I know the issue with the handwritten CIFs which is why we put in the rounding logic inside the parser (eg converting 0.333 -> 1/3 etc), but that seems like a separate issue.

We can specify arbitrary position, but a lot of the ICSD CIFs seem to have lower precision than the default rounding tolerance in CifParser (4 digits). ICSD-191196 is a good example where the composition is only reported to two digits of precision

I would just say that there are now universal potentials.

Might be good in this case to freeze the known positions and only relax the H atoms. My only concern here is that H is often underrepresented in ML-UIP training data. Maybe @janosh can comment more on that

@shyuep
Copy link
Member

shyuep commented Jan 5, 2024

Yes H is often underrepresented in UP training. But I think for a limited case like OH, it should be ok. Regardless, it is better than any of the rule of thumb alternatives. The only better option is DFT (with certain functionals).

@esoteric-ephemera esoteric-ephemera changed the title Improved CIF checking, support for isotopes, and correct handling of new VASP POSCAR fmt [WIP] Improved CIF checking, support for isotopes, and correct handling of new VASP POSCAR fmt Jan 6, 2024
@esoteric-ephemera
Copy link
Contributor Author

Just adding the WIP tag for now, since I realized that oxidation state handling for isotopes isn't working as intended (also pending refactor of CifAssessor)

@janosh janosh added enhancement A new feature or improvement to an existing one fix Bug fix PRs vasp Vienna Ab initio Simulation Package cif Crystallographic information file labels Jan 6, 2024
@janosh
Copy link
Member

janosh commented Jan 6, 2024

The hydrogen convex hull distance errors for MLIPs are definitely elevated based on Matbench Discovery testing.
See "Per-Element Model Error Heatmaps" at https://matbench-discovery.materialsproject.org/models with M3GNet and MACE at 260 meV/atom, CHGNet at 240.

I think this not so much stems from underrepresentation (MPtrj contains 142k relaxation frames with hydrogen) but hints that the models are confused due to inconsistent training data, perhaps due to issues like the one (partially) addressed in this PR.

But I agree with @shyuep, doing H-only relaxation with any of the above mentioned models could still get you closer to true H positions.

@esoteric-ephemera
Copy link
Contributor Author

This is almost ready for review, I think. One thing I'm not satisfied with: the user can manually specify an oxidation state for an isotope, however calling convenience functions like Composition(...).add_charges_from_oxi_state_guesses() or Structure(...).add_oxidation_state_by_guess() will convert any isotope to its parent bare Element

For example, The composition Composition({"D+": 2, "O2-": 1}) includes D+ in the composition. However calling add_charges_from_oxi_state_guesses() on Composition({"D": 2, "O": 1}) will return the equivalent of Composition({"H+": 2 "O2-": 1}).

Maybe this isn't such an issue since this represents an extreme edge case. Issuing a warning to the user when calling any of the add_* oxi state methods would be sufficient?

@janosh janosh changed the title [WIP] Improved CIF checking, support for isotopes, and correct handling of new VASP POSCAR fmt Improve CIF checking, support for isotopes, and correct handling of new VASP 6.4.2 POSCAR format incl. slashes in header Jan 13, 2024
tests/io/test_cif.py Outdated Show resolved Hide resolved
dev_scripts/periodic_table.yaml Outdated Show resolved Hide resolved
Copy link
Member

@janosh janosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @esoteric-ephemera! 👍 Super valuable guard rails for everyone parsing CIFs with pymatgen.

@janosh janosh merged commit f463ac1 into materialsproject:master Jan 17, 2024
22 checks passed
@esoteric-ephemera esoteric-ephemera deleted the cif-sanity-checks branch January 29, 2024 19:27
njzjz pushed a commit to deepmodeling/dpgen that referenced this pull request Feb 29, 2024
There are 2 changes in pymatgen which destroy the previous unittests:

1. materialsproject/pymatgen#3542
Previously we set the type of DummySpecies to "Type", which is forbidden
by pymatgen through this PR. Here we replace "Type" with "X".

2. materialsproject/pymatgen#3561
The functions from_string and get_string have been replaced with
from_str and get_str. We should keep up to date with these changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cif Crystallographic information file enhancement A new feature or improvement to an existing one fix Bug fix PRs vasp Vienna Ab initio Simulation Package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants