-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
InChIKey property #466
InChIKey property #466
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @merkys, this would be a useful field to have in many cases.
Just to keep a record of the discussion, could you briefly explain the domain of applicability of this representation in this PR? If it is somehow narrow (e.g., restricted to organic molecules [I know InChI does more]), perhaps this could be a target for a new molecular/cheminformatics namespace (alongside smiles, smarts etc)?
Also just picking up on the discussion yesterday, if a subcomponent of my periodic structure can be represented by an InChIKey, should that be allowed here? e.g., a database of hybrid perovskites would probably like to expose a SMILES/InChI search on the "methylammonium" part of methylammonium lead iodide (CH3NH3PbI3) rather than by reduced formula (CH6NI3Pb -- I guess in this case the other option is to use chemical_formula_descriptive
)
Co-authored-by: Matthew Evans <7916000+ml-evs@users.noreply.github.com>
Co-authored-by: Matthew Evans <7916000+ml-evs@users.noreply.github.com>
optimade.rst
Outdated
inchikey | ||
~~~~~~~~ | ||
|
||
- **Description**: The standard InChIKey representation of the structure, as laid out by the `InChI Trust <https://www.inchi-trust.org>`_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should also specify which InChI/InChiKey version should be used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. I would also like to discuss how will we handle deprecation and switching to new InChiKey versions in the future, as I do not believe it will be possible to keep using the old version when new ones are released. I can think of the following options:
- When InChiKey version changes, put the newest version in the upcoming OPTIMADE specification vX and require all implementers of vX to recompute their InChiKey.
- Define the InChiKey version as metadata (after Adding per property meta_data field. #462 gets solved).
First solution is better for interoperability, but puts more strain on the implementers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that for now we should go with option 1 since this approach is already used in the suggested SMILES property (PR #392). Later on when (and if) the metadata approach gets merged, option 2 can be revisited.
However, I think that option 1 greatly simplifies queries for the client. Each time needing to calculate the InChIKey with several different versions of the algorithm to just to query several databases does not seem very convenient.
optimade.rst
Outdated
- **Description**: The standard InChIKey identifier of the structure, as laid out by the `InChI Trust <https://www.inchi-trust.org>`_ | ||
Standard InChIKey is non-unique, thus the same InChIKey can be assigned to several different structures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the workshop: must be clarified if the key MUST cover all parts of the structure or not. Consider if this should be a list so that components can be covered, or if that goes in a separate field. (May need careful definition of what a component/fragment is).
Workshop: after the above workshop comment has been considered/handled this should be merged. We have discussed if "complicated derived non-unique descriptors" really belongs inside the structures entry. However, before we have a decision/design on where they should go, we should accept fields that are useful and desired into the structures endpoint for now. A similar regard holds for the desire to allow application area/subconsortia prefixes. Until we have a model for that, we merge fields without them. |
@merkys @ml-evs What do you think - should we still add this to "core OPTIMADE" as decided in the web meet. Or, now that we are hopefully somewhat close to merging #473, should all of these 'chemically oriented fields" currently waiting in PRs by @merkys instead go in its own namespace? These PRs are affected: #466 #465 #436 #398 - and the question is if they should be labeled 'status/blocked' (blocking on #473) or 'status/waiting-for-update' (since they all currently are sitting with comments to address before merging) Edit: possibly also #400, #396, or would those go in a "bio" prefix? |
On purely technical/scientific terms I think this field would be perfect to seed a cheminformatics namespace (along with SMARTS/SMILES etc -- would have to figure out how to allow |
@ml-evs maybe we can use this example to try out the infrastructure and see where we hit snags? I'm not sure we absolutely need engagement at this point, we already have 4-6 PR:s for properties to place in such a namespace-provider standard, which we can do under a "v0.1" to mark that things are highly experimental. If people show up who want to manage this, we'll happily hand off the repo to them. I've created a couple of new repos for this:
|
Great! Thanks for this @rartino -- I definitely stalled in my attempts to do the same thing. I will try to migrate https://github.com/Materials-Consortia/optimade-stability-namespace in the same direction. |
I agree that these properties #466, #436, #398 could go to their own namespace. I will try to describe them in https://github.com/Materials-Consortia/namespace-cheminformatics. Not sure about #465 though, to me it seems generic enough to appear in the core specification, but again, I might be speaking from cheminformatics PoV.
Properties introduced in #400 make sense to macromolecules only, but hierarchical grouping introduced in #396 might have an even wider usage than in cheminformatics. Thus I would not rush them under |
InChIKey has been implemented in https://github.com/Materials-Consortia/namespace-cheminformatics in Materials-Consortia/namespace-cheminformatics#1. Now about the remaining cheminformatics PRs. #436 introduces a new SMILES OPTIMADE data type and #398 introduces a new URI query parameter. I wonder whether property definition format and namespaces are ready to accept such extensions? If not, is this something that should be allowed to be extended in namespaces? |
Once the current property defs etc. are merged, lets work on a similar design for user-defined datatypes. I'm thinking a similar declarative format as for units, properties for datatypes, where one with human language declare how every operator should work. However, I don't want to draft this until the property framework is merged. The need for user-defined filer languages should perhaps inform the design of #398. Can we instead allow some syntax for the usual filter= to provide a list of filters with some kind of specifier what kind of filter it is? Then we can allow user-defined filter languages without new query parameters. |
Makes sense.
For now queries in filters act on property values only. Query by SMARTS will not be bound to a specific property. A possible solution would be to allow queries on entries themselves by introducing property-less operators, viz. |
Good point. After some additional thinking, I've realized that we already (of course) have a facility for prefixed |
I suppose that by command line parameters you mean URL query parameters. Indeed, custom query parameters are already allowed. Thus #398 already can go to cheminformatics namespace. |
Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>
Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>
It has been decided to move cheminformatics properties to a repository of its own, which has been done in Materials-Consortia/namespace-cheminformatics#1 and Materials-Consortia/namespace-cheminformatics#2. Closing this PR. |
This PR adds InChIKey property for structure entries. InChIKey is a chemical structure descriptor alternative to SMILES (proposed in #368). InChIKey is said to avoid the the ambiguity the SMILES possesses, moreover, it does not have any internal structure (essentially being a "chemical checksum"), thus there should be no issues with its comparisons.
Pinging people who have expressed their interest for comments: @eimrek @utf @Austin243