Improve CIF checking, support for isotopes, and correct handling of new VASP 6.4.2 POSCAR format incl. slashes in header #3542

esoteric-ephemera · 2024-01-05T20:25:09Z

This PR does a few things:

pymatgen.io.cif.CifParser now checks that the composition reported in a CIF matches that of the resulting pymatgen Structure and if not, warns the user what elements are missing (most often Hydrogens in experimental CIFs which are hard to locate without using neutron scattering), or when the parsing has gone awry and the stoichiometries differ
Adds support for isotopes, particularly Deuterium and Tritium.
- pymatgen.core.periodic_table.Element now has a few more attributes, such as the atomic mass number (number of protons and neutrons) as the .A attr
- A structure containing an isotope will appear to only have the base element in it (e.g., D2O will look like H2O), and when files are written, the isotopes are converted to the element. However, the isotope names, mass, etc. are preserved in the element attr
- This closes issue #2151
As of VASP version 6.4.2, when using POTCARs with a SHA, all files like CONTCAR, CHG*, LOCPOT, AECCAR* will have the element names followed by a slash
- This closes issue #3503 and atomate2 issue #662
- Allow for correct parsing of these via changes to pymatgen.io.vasp.inputs.Poscar
Tests for these additions

… (primarily Deuterium and Tritium), and correct handling of new VASP POSCAR formats

mkhorton · 2024-01-05T21:11:32Z

Thanks for these nice improvements @esoteric-ephemera

Since you’re looking at this part of the code, do you have a view of how H2O or OH sites should be handled? It’s one remaining weakness of the CIF parser I think.

mkhorton · 2024-01-05T21:13:16Z

Also adding a note that the mp prefix has been registered with the IUCr, if we wanted to add support for writing out custom pymatgen fields (like site_properties). There’s definitely a way in which pymatgen could losslessly convert Structure to/from CIF. It’s been something on the back burner for me for a while.

esoteric-ephemera · 2024-01-05T22:25:00Z

Since you’re looking at this part of the code, do you have a view of how H2O or OH sites should be handled? It’s one remaining weakness of the CIF parser I think.

Are you thinking of experimental structures where the H sites are undetermined/unlisted? For organics, @Andrew-S-Rosen had a good suggestion of trying to mimic how Olex (used by crystallographers) adds hydrogen atoms to structures where their positions are unknown

@kristinpersson had a suggestion that was based on a geometric construction - that would also be a cheap solution

I have a few other ideas that would use a DFT charge density, but these are more intensive options

Also adding a note that the mp prefix has been registered with the IUCr, if we wanted to add support for writing out custom pymatgen fields (like site_properties). There’s definitely a way in which pymatgen could losslessly convert Structure to/from CIF. It’s been something on the back burner for me for a while.

That's great to know! The precision of atomic sites in CIFs seems to be a bottleneck to lossless conversion, since this can result in composition mismatches. Since a lot of CIFs are written by hand (especially ICSD ones), this happens frequently, see e.g., issue #3373

shyuep · 2024-01-05T22:42:46Z

NIce work! One minor thing: does the cif assessor really need to be a class? I would think a CifParser.assess or CifParser.validate method would suffice. I am heavy proponent for OOP but in this case, the CifAssessor doesn't seem to conceptually an object per se.

mkhorton · 2024-01-05T22:57:36Z

Are you thinking of experimental structures where the H sites are undetermined/unlisted?

Not quite, there are two problems: (1) “implicit hydrogens” whereby the hydrogens are not listed in the atoms at all (so we might not realise hydrogens are missing unless we do a composition check as suggested), (2) the CIF files which list a molecule on a single atomic site, eg “OH” or “H2O” etc. Here, the CIF file at least tells us that a H should be present, but we still do not parse appropriately (I believe for OH, we parse as just O, and provide a warning).

For (2), it’s a difficult situation, because all options are unideal in some way. Omitting the H completely is wrong. Putting the H in an arbitrary location (eg correct bond length, perhaps do a bond valence sum or similar to find a plausible position) is tricky. Defining the site as containing a molecule is probably most correct, but does not fit with the current pymatgen object model.

esoteric-ephemera · 2024-01-05T23:00:21Z

NIce work! One minor thing: does the cif assessor really need to be a class? I would think a CifParser.assess or CifParser.validate method would suffice. I am heavy proponent for OOP but in this case, the CifAssessor doesn't seem to conceptually an object per se.

Thanks! Yes I'll try to refactor the assessor

mkhorton · 2024-01-05T23:01:39Z

Sorry, sent that message too soon. I’m not familiar with the Olex method. My point anyway is just that there are cases where we have to figure out the hydrogen is missing based on composition, but also cases where we’re explicitly told a hydrogen should be there.

In terms of lossless conversion, correct me if I’m wrong but there’s no limit on the precision with which we write out atomic positions? We can write out as many significant figures as we do with JSON. I know the issue with the handwritten CIFs which is why we put in the rounding logic inside the parser (eg converting 0.333 -> 1/3 etc), but that seems like a separate issue.

shyuep · 2024-01-05T23:11:45Z

@mkhorton @esoteric-ephemera I would just say that there are now universal potentials. For an OH, you can take the coordinates as the position of the O and then initialize a H at 0.5A bond distance (maybe away from other atom as well). Then use a UP to relax the H positions. :-) For this very constrained use case, I think it would lead to very reasonable results (much more than pure geometry / empirical constructions).

esoteric-ephemera · 2024-01-05T23:25:11Z

(2) the CIF files which list a molecule on a single atomic site, eg “OH” or “H2O” etc. Here, the CIF file at least tells us that a H should be present, but we still do not parse appropriately (I believe for OH, we parse as just O, and provide a warning).

OK I see what you mean. The best solution there in keeping with pymatgen's model is probably putting O on the site position and adding in H

My point anyway is just that there are cases where we have to figure out the hydrogen is missing based on composition, but also cases where we’re explicitly told a hydrogen should be there.

This PR warns the user in at least the latter case: if H is reported in the CIF but missing in the PMG structure, the user is warned. It usually seems to be the case that H's are reported in the CIF composition even when no H site positions are given. But if they're not, we might want to include a charge balance check

I know the issue with the handwritten CIFs which is why we put in the rounding logic inside the parser (eg converting 0.333 -> 1/3 etc), but that seems like a separate issue.

We can specify arbitrary position, but a lot of the ICSD CIFs seem to have lower precision than the default rounding tolerance in CifParser (4 digits). ICSD-191196 is a good example where the composition is only reported to two digits of precision

I would just say that there are now universal potentials.

Might be good in this case to freeze the known positions and only relax the H atoms. My only concern here is that H is often underrepresented in ML-UIP training data. Maybe @janosh can comment more on that

shyuep · 2024-01-05T23:32:18Z

Yes H is often underrepresented in UP training. But I think for a limited case like OH, it should be ok. Regardless, it is better than any of the rule of thumb alternatives. The only better option is DFT (with certain functionals).

esoteric-ephemera · 2024-01-06T00:28:33Z

Just adding the WIP tag for now, since I realized that oxidation state handling for isotopes isn't working as intended (also pending refactor of CifAssessor)

janosh · 2024-01-06T07:14:00Z

The hydrogen convex hull distance errors for MLIPs are definitely elevated based on Matbench Discovery testing.
See "Per-Element Model Error Heatmaps" at https://matbench-discovery.materialsproject.org/models with M3GNet and MACE at 260 meV/atom, CHGNet at 240.

I think this not so much stems from underrepresentation (MPtrj contains 142k relaxation frames with hydrogen) but hints that the models are confused due to inconsistent training data, perhaps due to issues like the one (partially) addressed in this PR.

But I agree with @shyuep, doing H-only relaxation with any of the above mentioned models could still get you closer to true H positions.

… tests

esoteric-ephemera · 2024-01-10T20:48:07Z

This is almost ready for review, I think. One thing I'm not satisfied with: the user can manually specify an oxidation state for an isotope, however calling convenience functions like Composition(...).add_charges_from_oxi_state_guesses() or Structure(...).add_oxidation_state_by_guess() will convert any isotope to its parent bare Element

For example, The composition Composition({"D+": 2, "O2-": 1}) includes D+ in the composition. However calling add_charges_from_oxi_state_guesses() on Composition({"D": 2, "O": 1}) will return the equivalent of Composition({"H+": 2 "O2-": 1}).

Maybe this isn't such an issue since this represents an extreme edge case. Issuing a warning to the user when calling any of the add_* oxi state methods would be sufficient?

…ashes for isotopes

…str format

tests/io/test_cif.py

dev_scripts/periodic_table.yaml

…ecies.A to return either int or None

janosh

Thanks a lot @esoteric-ephemera! 👍 Super valuable guard rails for everyone parsing CIFs with pymatgen.

There are 2 changes in pymatgen which destroy the previous unittests: 1. materialsproject/pymatgen#3542 Previously we set the type of DummySpecies to "Type", which is forbidden by pymatgen through this PR. Here we replace "Type" with "X". 2. materialsproject/pymatgen#3561 The functions from_string and get_string have been replaced with from_str and get_str. We should keep up to date with these changes.

esoteric-ephemera and others added 3 commits January 5, 2024 12:18

Add CIF checking for incorrect/missing elements, support for isotopes…

1918eea

… (primarily Deuterium and Tritium), and correct handling of new VASP POSCAR formats

Fix LAMMPS guess_element

81f1b8c

Merge branch 'materialsproject:master' into cif-sanity-checks

d4cc9ff

esoteric-ephemera changed the title ~~Improved CIF checking, support for isotopes, and correct handling of new VASP POSCAR fmt~~ [WIP] Improved CIF checking, support for isotopes, and correct handling of new VASP POSCAR fmt Jan 6, 2024

janosh added enhancement A new feature or improvement to an existing one fix Bug fix PRs vasp Vienna Ab initio Simulation Package cif Crystallographic information file labels Jan 6, 2024

esoteric-ephemera and others added 3 commits January 8, 2024 15:25

Refactor CifAssessor to be class method on CifParser + refactor assoc…

bac988b

… tests

Merge branch 'materialsproject:master' into cif-sanity-checks

6caa6ec

Fix failing test by adding DummySpecies.A attr

a7bb4cc

janosh force-pushed the master branch from d7c651a to 9751f96 Compare January 10, 2024 10:42

Fix oxidation state parsing for isotopes

f36346a

esoteric-ephemera and others added 4 commits January 10, 2024 12:52

Revert element hash to be Z if not an isotope, and add new distinct h…

2f502b4

…ashes for isotopes

Merge branch 'materialsproject:master' into cif-sanity-checks

5686b8f

fix types in doc strings and tweak var names

f95e730

fix DummySpecies.A attr, refactor __eq__

7ca7901

janosh added 5 commits January 13, 2024 17:01

CifParser document defaults for check_cif, cif_assessor_tol, fix doc …

1c70563

…str format

rename method CifParser.assess->check

f3badc0

check D+ oxi state is correctly parsed

e7c8a1d

tighten atomic mass test on H isotopes

c6167cc

del debug prints

d4b93ea

janosh changed the title ~~[WIP] Improved CIF checking, support for isotopes, and correct handling of new VASP POSCAR fmt~~ Improve CIF checking, support for isotopes, and correct handling of new VASP 6.4.2 POSCAR format incl. slashes in header Jan 13, 2024

janosh suggested changes Jan 13, 2024

View reviewed changes

tests/io/test_cif.py Outdated Show resolved Hide resolved

dev_scripts/periodic_table.yaml Outdated Show resolved Hide resolved

janosh and others added 5 commits January 14, 2024 16:17

merge TestCifParserAssess into TestCifIO

4afd273

Merge branch 'master' into pr/esoteric-ephemera/3542

8b5ed74

Update docstring for DummySpecies.Z and DummySpecies.A; allow DummySp…

678afa5

…ecies.A to return either int or None

Revert format of dev_scripts/periodic_table.yaml to minimize diff

3cbc579

Merge branch 'materialsproject:master' into cif-sanity-checks

5eb12a7

janosh approved these changes Jan 17, 2024

View reviewed changes

janosh merged commit f463ac1 into materialsproject:master Jan 17, 2024
22 checks passed

esoteric-ephemera mentioned this pull request Jan 22, 2024

Unable to parse LOCPOT with VASP 6.4.2 #3571

Closed

esoteric-ephemera deleted the cif-sanity-checks branch January 29, 2024 19:27

esoteric-ephemera mentioned this pull request Jan 30, 2024

Interpretation and Handling of Unique SHA256 Identifiers in the CONTCAR/CHGCAR/CHG Files Generated by VASP.6.4.2 Compiled with HDF5 Support when using Pymatgen. #3549

Closed

JaGeo mentioned this pull request Feb 6, 2024

Simple composition-based featurization fails due to an upgrade in pymatgen hackingmaterials/matminer#920

Closed

Yi-FanLi mentioned this pull request Feb 29, 2024

Fix UTs destroyed by pymatgen's update deepmodeling/dpgen#1471

Merged

esoteric-ephemera mentioned this pull request Mar 26, 2024

Correct units of Element.atomic_orbitals #3714

Merged

DanielYang59 mentioned this pull request Nov 23, 2024

Remove unnecesary isotope conversion in core.periodic_table.get_el_sp #4193

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve CIF checking, support for isotopes, and correct handling of new VASP 6.4.2 POSCAR format incl. slashes in header #3542

Improve CIF checking, support for isotopes, and correct handling of new VASP 6.4.2 POSCAR format incl. slashes in header #3542

esoteric-ephemera commented Jan 5, 2024 •

edited by janosh

Loading

mkhorton commented Jan 5, 2024

mkhorton commented Jan 5, 2024

esoteric-ephemera commented Jan 5, 2024

shyuep commented Jan 5, 2024 •

edited

Loading

mkhorton commented Jan 5, 2024

esoteric-ephemera commented Jan 5, 2024

mkhorton commented Jan 5, 2024

shyuep commented Jan 5, 2024 •

edited

Loading

esoteric-ephemera commented Jan 5, 2024 •

edited

Loading

shyuep commented Jan 5, 2024

esoteric-ephemera commented Jan 6, 2024

janosh commented Jan 6, 2024 •

edited

Loading

esoteric-ephemera commented Jan 10, 2024

janosh left a comment

Improve CIF checking, support for isotopes, and correct handling of new VASP 6.4.2 POSCAR format incl. slashes in header #3542

Improve CIF checking, support for isotopes, and correct handling of new VASP 6.4.2 POSCAR format incl. slashes in header #3542

Conversation

esoteric-ephemera commented Jan 5, 2024 • edited by janosh Loading

mkhorton commented Jan 5, 2024

mkhorton commented Jan 5, 2024

esoteric-ephemera commented Jan 5, 2024

shyuep commented Jan 5, 2024 • edited Loading

mkhorton commented Jan 5, 2024

esoteric-ephemera commented Jan 5, 2024

mkhorton commented Jan 5, 2024

shyuep commented Jan 5, 2024 • edited Loading

esoteric-ephemera commented Jan 5, 2024 • edited Loading

shyuep commented Jan 5, 2024

esoteric-ephemera commented Jan 6, 2024

janosh commented Jan 6, 2024 • edited Loading

esoteric-ephemera commented Jan 10, 2024

janosh left a comment

Choose a reason for hiding this comment

esoteric-ephemera commented Jan 5, 2024 •

edited by janosh

Loading

shyuep commented Jan 5, 2024 •

edited

Loading

shyuep commented Jan 5, 2024 •

edited

Loading

esoteric-ephemera commented Jan 5, 2024 •

edited

Loading

janosh commented Jan 6, 2024 •

edited

Loading