-
Notifications
You must be signed in to change notification settings - Fork 868
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve CIF checking, support for isotopes, and correct handling of new VASP 6.4.2 POSCAR format incl. slashes in header #3542
Improve CIF checking, support for isotopes, and correct handling of new VASP 6.4.2 POSCAR format incl. slashes in header #3542
Conversation
… (primarily Deuterium and Tritium), and correct handling of new VASP POSCAR formats
Thanks for these nice improvements @esoteric-ephemera Since you’re looking at this part of the code, do you have a view of how H2O or OH sites should be handled? It’s one remaining weakness of the CIF parser I think. |
Also adding a note that the |
Are you thinking of experimental structures where the H sites are undetermined/unlisted? For organics, @Andrew-S-Rosen had a good suggestion of trying to mimic how Olex (used by crystallographers) adds hydrogen atoms to structures where their positions are unknown @kristinpersson had a suggestion that was based on a geometric construction - that would also be a cheap solution I have a few other ideas that would use a DFT charge density, but these are more intensive options
That's great to know! The precision of atomic sites in CIFs seems to be a bottleneck to lossless conversion, since this can result in composition mismatches. Since a lot of CIFs are written by hand (especially ICSD ones), this happens frequently, see e.g., issue #3373 |
NIce work! One minor thing: does the cif assessor really need to be a class? I would think a CifParser.assess or CifParser.validate method would suffice. I am heavy proponent for OOP but in this case, the CifAssessor doesn't seem to conceptually an object per se. |
Not quite, there are two problems: (1) “implicit hydrogens” whereby the hydrogens are not listed in the atoms at all (so we might not realise hydrogens are missing unless we do a composition check as suggested), (2) the CIF files which list a molecule on a single atomic site, eg “OH” or “H2O” etc. Here, the CIF file at least tells us that a H should be present, but we still do not parse appropriately (I believe for OH, we parse as just O, and provide a warning). For (2), it’s a difficult situation, because all options are unideal in some way. Omitting the H completely is wrong. Putting the H in an arbitrary location (eg correct bond length, perhaps do a bond valence sum or similar to find a plausible position) is tricky. Defining the site as containing a molecule is probably most correct, but does not fit with the current pymatgen object model. |
Thanks! Yes I'll try to refactor the assessor |
Sorry, sent that message too soon. I’m not familiar with the Olex method. My point anyway is just that there are cases where we have to figure out the hydrogen is missing based on composition, but also cases where we’re explicitly told a hydrogen should be there. In terms of lossless conversion, correct me if I’m wrong but there’s no limit on the precision with which we write out atomic positions? We can write out as many significant figures as we do with JSON. I know the issue with the handwritten CIFs which is why we put in the rounding logic inside the parser (eg converting 0.333 -> 1/3 etc), but that seems like a separate issue. |
@mkhorton @esoteric-ephemera I would just say that there are now universal potentials. For an OH, you can take the coordinates as the position of the O and then initialize a H at 0.5A bond distance (maybe away from other atom as well). Then use a UP to relax the H positions. :-) For this very constrained use case, I think it would lead to very reasonable results (much more than pure geometry / empirical constructions). |
OK I see what you mean. The best solution there in keeping with pymatgen's model is probably putting O on the site position and adding in H
This PR warns the user in at least the latter case: if H is reported in the CIF but missing in the PMG structure, the user is warned. It usually seems to be the case that H's are reported in the CIF composition even when no H site positions are given. But if they're not, we might want to include a charge balance check
We can specify arbitrary position, but a lot of the ICSD CIFs seem to have lower precision than the default rounding tolerance in
Might be good in this case to freeze the known positions and only relax the H atoms. My only concern here is that H is often underrepresented in ML-UIP training data. Maybe @janosh can comment more on that |
Yes H is often underrepresented in UP training. But I think for a limited case like OH, it should be ok. Regardless, it is better than any of the rule of thumb alternatives. The only better option is DFT (with certain functionals). |
Just adding the WIP tag for now, since I realized that oxidation state handling for isotopes isn't working as intended (also pending refactor of CifAssessor) |
The hydrogen convex hull distance errors for MLIPs are definitely elevated based on Matbench Discovery testing. I think this not so much stems from underrepresentation (MPtrj contains 142k relaxation frames with hydrogen) but hints that the models are confused due to inconsistent training data, perhaps due to issues like the one (partially) addressed in this PR. But I agree with @shyuep, doing H-only relaxation with any of the above mentioned models could still get you closer to true H positions. |
This is almost ready for review, I think. One thing I'm not satisfied with: the user can manually specify an oxidation state for an isotope, however calling convenience functions like For example, The composition Maybe this isn't such an issue since this represents an extreme edge case. Issuing a warning to the user when calling any of the |
…ecies.A to return either int or None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @esoteric-ephemera! 👍 Super valuable guard rails for everyone parsing CIFs with pymatgen
.
There are 2 changes in pymatgen which destroy the previous unittests: 1. materialsproject/pymatgen#3542 Previously we set the type of DummySpecies to "Type", which is forbidden by pymatgen through this PR. Here we replace "Type" with "X". 2. materialsproject/pymatgen#3561 The functions from_string and get_string have been replaced with from_str and get_str. We should keep up to date with these changes.
This PR does a few things:
pymatgen.io.cif.CifParser
now checks that the composition reported in a CIF matches that of the resultingpymatgen
Structure
and if not, warns the user what elements are missing (most often Hydrogens in experimental CIFs which are hard to locate without using neutron scattering), or when the parsing has gone awry and the stoichiometries differpymatgen.core.periodic_table.Element
now has a few more attributes, such as the atomic mass number (number of protons and neutrons) as the.A
attrpymatgen.io.vasp.inputs.Poscar