Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

InChIKey property #466

Closed
wants to merge 7 commits into from

Conversation

merkys
Copy link
Member

@merkys merkys commented Jun 8, 2023

This PR adds InChIKey property for structure entries. InChIKey is a chemical structure descriptor alternative to SMILES (proposed in #368). InChIKey is said to avoid the the ambiguity the SMILES possesses, moreover, it does not have any internal structure (essentially being a "chemical checksum"), thus there should be no issues with its comparisons.

Pinging people who have expressed their interest for comments: @eimrek @utf @Austin243

Copy link
Member

@ml-evs ml-evs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @merkys, this would be a useful field to have in many cases.

Just to keep a record of the discussion, could you briefly explain the domain of applicability of this representation in this PR? If it is somehow narrow (e.g., restricted to organic molecules [I know InChI does more]), perhaps this could be a target for a new molecular/cheminformatics namespace (alongside smiles, smarts etc)?

Also just picking up on the discussion yesterday, if a subcomponent of my periodic structure can be represented by an InChIKey, should that be allowed here? e.g., a database of hybrid perovskites would probably like to expose a SMILES/InChI search on the "methylammonium" part of methylammonium lead iodide (CH3NH3PbI3) rather than by reduced formula (CH6NI3Pb -- I guess in this case the other option is to use chemical_formula_descriptive)

optimade.rst Outdated Show resolved Hide resolved
Co-authored-by: Matthew Evans <7916000+ml-evs@users.noreply.github.com>
optimade.rst Outdated Show resolved Hide resolved
Co-authored-by: Matthew Evans <7916000+ml-evs@users.noreply.github.com>
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated
inchikey
~~~~~~~~

- **Description**: The standard InChIKey representation of the structure, as laid out by the `InChI Trust <https://www.inchi-trust.org>`_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should also specify which InChI/InChiKey version should be used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. I would also like to discuss how will we handle deprecation and switching to new InChiKey versions in the future, as I do not believe it will be possible to keep using the old version when new ones are released. I can think of the following options:

  1. When InChiKey version changes, put the newest version in the upcoming OPTIMADE specification vX and require all implementers of vX to recompute their InChiKey.
  2. Define the InChiKey version as metadata (after Adding per property meta_data field. #462 gets solved).

First solution is better for interoperability, but puts more strain on the implementers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that for now we should go with option 1 since this approach is already used in the suggested SMILES property (PR #392). Later on when (and if) the metadata approach gets merged, option 2 can be revisited.

However, I think that option 1 greatly simplifies queries for the client. Each time needing to calculate the InChIKey with several different versions of the algorithm to just to query several databases does not seem very convenient.

optimade.rst Outdated Show resolved Hide resolved
@rartino rartino requested a review from sauliusg June 9, 2023 07:25
optimade.rst Outdated
Comment on lines 2506 to 2507
- **Description**: The standard InChIKey identifier of the structure, as laid out by the `InChI Trust <https://www.inchi-trust.org>`_
Standard InChIKey is non-unique, thus the same InChIKey can be assigned to several different structures.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the workshop: must be clarified if the key MUST cover all parts of the structure or not. Consider if this should be a list so that components can be covered, or if that goes in a separate field. (May need careful definition of what a component/fragment is).

@rartino
Copy link
Contributor

rartino commented Jun 9, 2023

Workshop: after the above workshop comment has been considered/handled this should be merged.

We have discussed if "complicated derived non-unique descriptors" really belongs inside the structures entry. However, before we have a decision/design on where they should go, we should accept fields that are useful and desired into the structures endpoint for now.

A similar regard holds for the desire to allow application area/subconsortia prefixes. Until we have a model for that, we merge fields without them.

@rartino
Copy link
Contributor

rartino commented Jan 10, 2024

@merkys @ml-evs What do you think - should we still add this to "core OPTIMADE" as decided in the web meet. Or, now that we are hopefully somewhat close to merging #473, should all of these 'chemically oriented fields" currently waiting in PRs by @merkys instead go in its own namespace?

These PRs are affected: #466 #465 #436 #398 - and the question is if they should be labeled 'status/blocked' (blocking on #473) or 'status/waiting-for-update' (since they all currently are sitting with comments to address before merging)

Edit: possibly also #400, #396, or would those go in a "bio" prefix?

@ml-evs
Copy link
Member

ml-evs commented Jan 10, 2024

@merkys @ml-evs What do you think - should we still add this to "core OPTIMADE" as decided in the web meet. Or, now that we are hopefully somewhat close to merging #473, should all of these 'chemically oriented fields" currently waiting in PRs by @merkys instead go in its own namespace?

These PRs are affected: #466 #465 #436 #398 - and the question is if they should be labeled 'status/blocked' (blocking on #473) or 'status/waiting-for-update' (since they all currently are sitting with comments to address before merging)

Edit: possibly also #400, #396, or would those go in a "bio" prefix?

On purely technical/scientific terms I think this field would be perfect to seed a cheminformatics namespace (along with SMARTS/SMILES etc -- would have to figure out how to allow filter_smarts as a custom URL param too...). However on a purely development practice, I worry that we don't have the level of engagement or scale to spread ourselves so thinly across these various namespaces, in which case just loading up the core OPTIMADE namespace is maybe preferable. Happy to discuss!

@rartino
Copy link
Contributor

rartino commented Jan 11, 2024

@ml-evs maybe we can use this example to try out the infrastructure and see where we hit snags? I'm not sure we absolutely need engagement at this point, we already have 4-6 PR:s for properties to place in such a namespace-provider standard, which we can do under a "v0.1" to mark that things are highly experimental. If people show up who want to manage this, we'll happily hand off the repo to them.

I've created a couple of new repos for this:

@ml-evs
Copy link
Member

ml-evs commented Jan 11, 2024

Great! Thanks for this @rartino -- I definitely stalled in my attempts to do the same thing. I will try to migrate https://github.com/Materials-Consortia/optimade-stability-namespace in the same direction.

@merkys
Copy link
Member Author

merkys commented Jan 15, 2024

@merkys @ml-evs What do you think - should we still add this to "core OPTIMADE" as decided in the web meet. Or, now that we are hopefully somewhat close to merging #473, should all of these 'chemically oriented fields" currently waiting in PRs by @merkys instead go in its own namespace?

These PRs are affected: #466 #465 #436 #398 - and the question is if they should be labeled 'status/blocked' (blocking on #473) or 'status/waiting-for-update' (since they all currently are sitting with comments to address before merging)

I agree that these properties #466, #436, #398 could go to their own namespace. I will try to describe them in https://github.com/Materials-Consortia/namespace-cheminformatics. Not sure about #465 though, to me it seems generic enough to appear in the core specification, but again, I might be speaking from cheminformatics PoV.

Edit: possibly also #400, #396, or would those go in a "bio" prefix?

Properties introduced in #400 make sense to macromolecules only, but hierarchical grouping introduced in #396 might have an even wider usage than in cheminformatics. Thus I would not rush them under namespace-cheminformatics umbrella for the time being.

@merkys
Copy link
Member Author

merkys commented Jan 16, 2024

InChIKey has been implemented in https://github.com/Materials-Consortia/namespace-cheminformatics in Materials-Consortia/namespace-cheminformatics#1.

Now about the remaining cheminformatics PRs. #436 introduces a new SMILES OPTIMADE data type and #398 introduces a new URI query parameter. I wonder whether property definition format and namespaces are ready to accept such extensions? If not, is this something that should be allowed to be extended in namespaces?

@rartino
Copy link
Contributor

rartino commented Jan 16, 2024

Now about the remaining cheminformatics PRs. #436 introduces a new SMILES OPTIMADE data type and #398 introduces a new URI query parameter. I wonder whether property definition format and namespaces are ready to accept such extensions? If not, is this something that should be allowed to be extended in namespaces?

Once the current property defs etc. are merged, lets work on a similar design for user-defined datatypes. I'm thinking a similar declarative format as for units, properties for datatypes, where one with human language declare how every operator should work. However, I don't want to draft this until the property framework is merged.

The need for user-defined filer languages should perhaps inform the design of #398. Can we instead allow some syntax for the usual filter= to provide a list of filters with some kind of specifier what kind of filter it is? Then we can allow user-defined filter languages without new query parameters.

@merkys
Copy link
Member Author

merkys commented Jan 17, 2024

Once the current property defs etc. are merged, lets work on a similar design for user-defined datatypes. I'm thinking a similar declarative format as for units, properties for datatypes, where one with human language declare how every operator should work. However, I don't want to draft this until the property framework is merged.

Makes sense.

The need for user-defined filer languages should perhaps inform the design of #398. Can we instead allow some syntax for the usual filter= to provide a list of filters with some kind of specifier what kind of filter it is? Then we can allow user-defined filter languages without new query parameters.

For now queries in filters act on property values only. Query by SMARTS will not be bound to a specific property. A possible solution would be to allow queries on entries themselves by introducing property-less operators, viz. /structures?filter=SMARTS "[CX4]".

@rartino
Copy link
Contributor

rartino commented Jan 17, 2024

The need for user-defined filter languages should perhaps inform the design of #398. Can we instead allow some syntax for the usual filter= to provide a list of filters with some kind of specifier what kind of filter it is? Then we can allow user-defined filter languages without new query parameters.

For now queries in filters act on property values only. Query by SMARTS will not be bound to a specific property. A possible solution would be to allow queries on entries themselves by introducing property-less operators, viz. /structures?filter=SMARTS "[CX4]".

Good point. After some additional thinking, I've realized that we already (of course) have a facility for prefixed command line query parameters, so #398 can just use that with the _cheminfo_ prefix. It can be included in the cheminformatics namespace provider specification right now by simply documenting its intended existence and operation in human-readable form, e.g., in the top README and/or the description field of the 'cheminformatics standard' source file. Moving forward, we need to tie all this machinery into OpenAPI schemas, thus making the declaration of this query parameter in an OpenAPI-compatible way.

@merkys
Copy link
Member Author

merkys commented Jan 17, 2024

I suppose that by command line parameters you mean URL query parameters. Indeed, custom query parameters are already allowed. Thus #398 already can go to cheminformatics namespace.

@ml-evs ml-evs added the status/has-concrete-suggestion This issue has one or more concrete suggestions spelled out that can be brought up for consensus. label Mar 22, 2024
Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>
optimade.rst Outdated Show resolved Hide resolved
Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>
merkys added a commit to Materials-Consortia/namespace-cheminformatics that referenced this pull request Jun 11, 2024
@merkys
Copy link
Member Author

merkys commented Jun 12, 2024

It has been decided to move cheminformatics properties to a repository of its own, which has been done in Materials-Consortia/namespace-cheminformatics#1 and Materials-Consortia/namespace-cheminformatics#2. Closing this PR.

@merkys merkys closed this Jun 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR/requires-discussion status/has-concrete-suggestion This issue has one or more concrete suggestions spelled out that can be brought up for consensus.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants