Add SMILES property #368
Comments
I would suggest so. |
@JPBergsma the topic of SMILES has come up a few times, and a standardization of SMILES use in OPTIMADE would likely be very useful. If you are familiar with SMILES usage, could you perhaps describe a few "search scenarios" for SMILES data? E.g., what would you be searching for? How do you envision such a search could be expressed, etc.? |
Sorry, I did not read the specification for chemical_formula_descriptive well enough the first time, and I overlooked that it is already defined by the IUPAC nomenclature. I had therefore already closed the issue, but unfortunately I did not have sufficient privileges to remove it. It would indeed be better to add a separate field for the SMILES string, although we could also think about other ways to add topological information, as SMILES strings cannot be compared directly. |
(I took the liberty of editing your issue title to match - feel free to adjust it) |
First of all, defining the topology of a molecule allows you to distinguish between molecules with the same elemental composition but a different structure. Perhaps the current IUPAC definition is also able to do so, but via the link in optimade.rst https://www.qmul.ac.uk/sbcs/iupac/bibliog/blue.html I only found information about how to name chemical compounds and not how to write the structural formula. (IUPAC did define the InChI format, which does contain the molecular structure, but that is different from the example fields in OPTIMADE.) Ideally, having the structural data of a molecule would also allow you to find molecules with a mostly similar structure but some small differences, for example a structure where a hydrogen atom has been replaced by a methyl group or a bromine atom has been replaced by a chlorine atom. While this would be quite useful, it may be difficult to implement such a search. I am not sure whether SMILES is the best option for this. It has the advantage that the strings are relatively human-readable, but multiple SMILES strings can encode the same molecule. So you either have to convert the strings to structures before you know whether they are identical, or you have to agree on which algorithm to use to generate SMILES strings. There are other ways to store the structure of a molecule, like InChI, and another option would be to use a connectivity matrix. |
During OMDI I talked with someone from the Ocelot database. |
I support standardizing a separate property for SMILES. However, there are some issues related both to its definition and usability.
InChI is an alternative representation; however, it does not solve the matching problem. Moreover, it has licensing issues impeding its convenient usage. |
There is also the question of how we handle this type of extension into string-like complex properties in the OPTIMADE filter language (and otherwise in our type system). Far back I wrote up my thoughts on this here: #157 (comment) But, in short, we probably need to have some way to tell a normal string and a SMILES string apart, since they will have different comparison semantics. |
1 The OpenSMILES standard is definitely an option. It seems practically the same as the SMILES definition on the Daylight website, so if necessary we could switch. Ideally, we would also use the SMARTS extension, which is specifically focused on querying structures, although it is not included in the OpenSMILES standard.
2 Either the server would have to canonicalize the input from the client or we would have to agree on a canonicalization algorithm that should be used by all clients and servers.
3 I think it will indeed be necessary to generate a molecular graph. Although a preselection could be made using fingerprinting, for example, by looking at the atom composition of the searched fragment, or by comparing which common structural elements are present.
4 At first I was thinking about limiting the requirement for SMILES structures to organic compounds, but after reading your article we could perhaps expand the SMILES definition to a broader range of compounds. In that case, we should formalize the method further than is currently described in the article. There may still be some arbitrariness in describing the atomistic structures though, as some arbitrary cut-off point has to be chosen for defining a bond.
5 It seems that the discussion about the InChI licensing issue you refer to is still ongoing, so perhaps it will be resolved. I do not think using InChI for our database would go against the intention of the InChI Trust. Standard InChI has the limitation that tautomers have the same InChI code. In a laboratory setting, it is usually not possible to separate the tautomers, so this would not be a problem. But in computational chemistry, the timescales are usually so short that no conversion takes place. There is an extension for this, so I think we should implement it if we want to use InChI. That way each InChI should belong to exactly one structure. A final option would be to use a molecular graph for searching. Unless we decide on a canonicalization algorithm, the SMILES field should indeed not have the string type, as a direct comparison of uncanonicalized SMILES strings is not possible. |
(For brevity, I am not citing and explicitly responding to @JPBergsma sentences with which I completely agree)
This can already be implemented by using the custom extension endpoint mechanism.
Yes, this makes sense.
Not necessarily. The server, for example, may just pass user input to Open Babel which either reconstructs molecular graphs or does fingerprint matching.
Preferably yes.
This would be nice, but again, all providers should use conventions as similar as possible.
Strictly speaking, this is true only if providers manage to use InChI library without modifying its code. |
I am not sure what you mean by the custom extension endpoint mechanism. There is a custom extension endpoint in the OPTIMADE standard, but I do not see why that would be relevant here. You would want to use SMARTS/SMILES to find particular structures, and creating a separate endpoint to do this seems cumbersome. You would also want to standardize the way this works across multiple databases, which would be difficult if each database created its own custom endpoint.
In that case the SMILES string would still be processed on the server (as in the physical computer that deals with the request). |
Sorry, I misparsed the term "extension". I believe the SMARTS were originally described by Daylight. I am not sure about the state of other parallel SMARTS specifications, though.
Yes, that is true. |
Looking back at my discussion checklist, I think we at least agree on using OpenSMILES. However, other issues still need more discussion. My suggestions to speed up the introduction of SMILES property would be the following:
This would make the SMILES property a descriptive one. Thus, the client will be able to retrieve SMILES values alongside other structural data, but would not be able to query on them. For dealing with inorganics I could propose adhering to Quirós et al. 2018 (disclaimer: I am one of the authors), but this would not be convenient for providers using their own conventions, or producing SMILES by Open Babel or some other software. |
I agree on point 1, that databases are allowed to use their own canonicalization method. Part of the reason to implement this, though, is to make it easier to search for organic molecules, as these can have the same chemical formula. For that to work, it should be possible to search for SMILES strings. Quirós et al. 2018 could indeed be useful for describing metal complexes and such, insofar as they are not covered by the OpenSMILES standard. |
Aren't we landing in that we should just standardize a SMILES field to be a normal OPTIMADE String which is specified to contain an OpenSMILES representation of the implementer's choice (much like (I realize it was said above that it cannot be a String because uncanonicalized SMILES "cannot be compared", but, the same issue technically holds for
I'm not sure why you think such conversions would be needed (?), but if so, then this query support can only be on MAY level since it goes far beyond what can be handled efficiently by a typical query layer. |
I agree that we can define SMILES as a regular OPTIMADE String with all string handling operations. Thus for the time being So it seems we have consensus on most of the SMILES-related issues. |
If we define the SMILES field as a normal OPTIMADE string we should define the canonicalization method that should be used with OPTIMADE. Otherwise, it does not make sense to put the requirement on MUST or SHOULD level for the (partial) string matching filter operators, as one molecule can have multiple different SMILES strings. One of the main reasons to implement the SMILES notation is to enable searching on molecular structures.
The conversion would be needed if we do not agree on a canonicalization method. If you start generating the SMILES string from different atoms within a molecule, you get a valid SMILES string for each starting atom, but they are all different. There are already Python packages that can convert SMILES strings into structures and back. RDKit can do this, and it also guarantees that the created SMILES string is canonicalized, i.e. you will always get the same string regardless of which SMILES string you originally used. A simple way to make your structures with SMILES strings searchable is to convert your SMILES into structures and then back into SMILES strings with RDKit. This way, you can be sure all strings have the same canonicalization method. One issue that we have not yet discussed is how we are going to handle structures with multiple molecules. |
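To make the "different starting atom, different string" point concrete, here is a minimal pure-Python sketch. It is a toy, not a real canonicalizer: it assumes a tiny SMILES subset (single-letter atoms, branches in parentheses, no rings, no bond symbols, no brackets), so every molecule is a tree. The helper names are made up for illustration. In practice one would rely on a cheminformatics toolkit such as RDKit or Open Babel, as discussed above.

```python
def parse_tree_smiles(smiles):
    """Parse a toy SMILES subset into (atom labels, adjacency lists)."""
    atoms, adjacency = [], []
    branch_stack, previous = [], None
    for char in smiles:
        if char == "(":
            branch_stack.append(previous)      # remember the branch point
        elif char == ")":
            previous = branch_stack.pop()      # return to the branch point
        elif char in "BCNOPSFI":
            atoms.append(char)
            adjacency.append([])
            index = len(atoms) - 1
            if previous is not None:           # implicit single bond
                adjacency[previous].append(index)
                adjacency[index].append(previous)
            previous = index
        else:
            raise ValueError(f"unsupported character in toy subset: {char!r}")
    return atoms, adjacency


def encode(atoms, adjacency, node, parent):
    """Encode the subtree below `node`, with child encodings sorted."""
    children = sorted(
        encode(atoms, adjacency, neighbour, node)
        for neighbour in adjacency[node]
        if neighbour != parent
    )
    return atoms[node] + "".join(f"({child})" for child in children)


def canonical_key(smiles):
    """Smallest encoding over all possible root atoms (toy canonical form)."""
    atoms, adjacency = parse_tree_smiles(smiles)
    return min(encode(atoms, adjacency, root, None) for root in range(len(atoms)))


# Three spellings of ethanol share one key; dimethyl ether gets another.
ethanol_keys = {canonical_key(s) for s in ("CCO", "OCC", "C(O)C")}
ether_key = canonical_key("COC")
```

The strings differ character by character, yet the key is the same for all spellings of the same tree, which is what a shared canonicalization method would give for exact-match queries.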
I agree that to implement reliable querying of exact structures we have to define a canonicalization method. This will most likely boil down to choosing a common software package to produce canonical SMILES for OPTIMADE output, be it RDKit, Open Babel or something else. In addition, if we want to support inorganics, all providers will have to select a common set of rules to describe them. As for reliable partial molecular matching, IMO we will never get by with simple substring matching. Imagine, for example, patterns to match rings. Here I would like to draw attention to the distinction between database querying and screening. The first one expects the database to perform entry selection, whereas the second one downloads the whole database and performs entry selection locally. I do not believe it is feasible to push all the providers to implement exact querying mechanisms. Thus IMO it is better to provide descriptive data in some common format and let the users perform the screening. With OPTIMADE provisions to include only specific fields in the response, downloads should not be too large. Thus I would very much like to avoid forcing all the providers to use the same canonicalization method. I am afraid that instead of being a useful descriptive property, SMILES would be supported by only a few providers. |
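A rough sketch of the screening workflow described above, under stated assumptions: the base URL and the property name `_exmpl_smiles` are hypothetical placeholders, and the entries are a hand-written stand-in for the `data` array of a downloaded response page. Only `response_fields` and `page_limit` are taken from the OPTIMADE specification.

```python
from urllib.parse import urlencode


def screening_url(base_url, fields, page_limit=100):
    """Build a structures request that asks only for the listed fields."""
    query = urlencode({"response_fields": ",".join(fields), "page_limit": page_limit})
    return f"{base_url}/structures?{query}"


def screen(entries, field, predicate):
    """Return IDs of downloaded entries whose `field` value satisfies `predicate`."""
    hits = []
    for entry in entries:
        value = entry.get("attributes", {}).get(field)
        if value is not None and predicate(value):
            hits.append(entry["id"])
    return hits


# Hypothetical provider URL and property name, used only for illustration.
url = screening_url("https://example.org/optimade/v1", ["_exmpl_smiles"])

# Stand-in for one downloaded page of results.
page = [
    {"id": "1", "attributes": {"_exmpl_smiles": "CC#N"}},
    {"id": "2", "attributes": {"_exmpl_smiles": "CCO"}},
    {"id": "3", "attributes": {"_exmpl_smiles": None}},
]

# The selection logic runs on the client, so it can be anything the user likes.
with_triple_bond = screen(page, "_exmpl_smiles", lambda smiles: "#" in smiles)
```

The server only needs to serve the field; how the client interprets the SMILES strings is left entirely to the client.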
Right. I would prefer sticking to string, not list because of how I imagine SMILES property to be used (screening instead of querying). In addition to that, the only list member comparison operator for string is equality (i.e., |
Indeed, matching substructures is much more complicated and beyond the scope of PR#392.
Screening would be less efficient for both the client and the server:
There are not many useful substring queries you can do on SMILES strings. You could check whether triple and quadruple bonds or charges are present, but that's about it. So we would not lose that much by converting the field to a list. |
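For illustration, the few substring-style checks mentioned above could look like the sketch below. It assumes OpenSMILES-style strings in which `#` and `$` occur only as bond symbols and charges are written inside bracket atoms such as `[O-]`; the function names are made up for this example.

```python
import re

# A '+' or '-' inside a bracket atom can only be a charge;
# outside brackets, '-' is an explicit single bond.
_CHARGED_BRACKET_ATOM = re.compile(r"\[[^\]]*[+-][^\]]*\]")


def has_triple_bond(smiles):
    return "#" in smiles


def has_quadruple_bond(smiles):
    return "$" in smiles


def has_charge(smiles):
    return _CHARGED_BRACKET_ATOM.search(smiles) is not None


acetonitrile = "CC#N"
acetate = "CC(=O)[O-]"
ethane_explicit_bond = "C-C"
```

Anything beyond this, such as finding rings or functional groups, needs the molecular graph rather than the string.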
Agree, this does not look elegant.
Completely agree.
Agree with every word here! So it seems we are arriving at these properties (all OPTIONAL):
Plus How about this? Still we have some homework to do regarding |
In today's Web meeting @JPBergsma advocated for specific handling of string comparisons on I would be happy to include |
I think it would be best to create a separate issue/PR for the I would prefer it if SMARTS queries could be a part of the filter, just as any other condition. This would give the user the maximum freedom to create queries. And it would also be more efficient for the server. In the proposal of @rartino some structures would be returned twice because OR could only be executed by doing two separate queries. A structure that fulfils both conditions would thus be returned twice. |
Agree. It makes sense to have separate PRs. I will open a separate PR for
These are very valid arguments. However, @rartino's post and Friday's Web meeting convinced me otherwise.
I do not see this as a problem. All entries have IDs and they can be used to pick only the unique structures. |
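A minimal sketch of that client-side merge, assuming each result set is a list of entries carrying an `id` key as in OPTIMADE responses; the function name and the sample data are made up for illustration.

```python
def merge_unique(*result_sets):
    """Merge several result lists, keeping the first entry seen for each ID."""
    merged = {}
    for results in result_sets:
        for entry in results:
            merged.setdefault(entry["id"], entry)
    return list(merged.values())


# Two separate queries whose union emulates an OR; entry "b" matches both.
first_query = [{"id": "a"}, {"id": "b"}]
second_query = [{"id": "b"}, {"id": "c"}]
union = merge_unique(first_query, second_query)
```

The duplicates still cost bandwidth, which is the efficiency concern raised earlier, but they do not affect correctness.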
I agree that allowing intermixed queries gives more flexibility in what queries can be expressed. There are two different possible solutions on the backend for backends that can handle SMILES:
Now, the question is - will (1) or (2) be the more common one? My somewhat unfounded suspicion is that there is no backend today that can efficiently do (1).
|
Didn't you suggest before that given
I have the same feeling about (1). By the way, I have opened PR #398 introducing |
Right, sorry - this was just a mistype - replace every "OR" in that reply with "AND" Edit: Eh - I see that my confusion runs deeper. @JPBergsma is right in that OR queries are less efficient in that you'd need to run two queries and will get duplicates; but @merkys is right that they can be matched by ID. Even so, I think this is a less important point than choosing the construct that the majority of backends can support without having to parse and unwrap the query string. |
Just saw this blogpost on Twitter and thought it would nicely complement the SMARTS discussion for those who don't already know what it is: Easy way to visualize SMARTS |
Citing myself:
Saubern et al., 2011 present some evidence that SMARTS are understood differently by different cheminformatics packages, a fact I was almost sure of. Nevertheless, we will have to live with that - I hope the differences are minimal. |
I'll weigh in here. Canonicalization. This is a much misunderstood term. "Canonicalization" is a local database strategy that can be used to do a rapid string search for a specific compound. Databases should not/do not require that a user use any particular canonicalization. Maybe they use OpenSmiles v. 2.0.5; maybe they use something else. It doesn't matter in OPTIMADE context, because nobody cares what canonicalization the implementer used. What the database does is to convert the SMILES query to its specifically implemented canonicalization so that it can do a direct string match. That's all. Think of "canonicalization" as similar to "software name and version." It's just a given algorithm written at a specific time. Point: Don't worry about canonicalization. SMARTS. This is the real power of the SMILES business. The goal is to find substructures within a database - all the compounds that have six-membered aromatic rings with adjacent OH groups, for example a1aa(O[H])a(O[H])aa1. Some databases can do this; others cannot. Again, no relevance of canonicalization. This is a model search, not a string search. But not every database can do this sort of thing. I agree completely that any SMILES needs to be its own property (perhaps as chemical_SMILES). |
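One way to picture the "local strategy" described here is the sketch below: the server stores entries under whatever canonical form it uses internally and pushes each incoming query through the same function before doing a plain string lookup. The class is hypothetical, and the demo canonicalizer is a deliberately trivial stand-in (a lookup table for ethanol), not a real algorithm.

```python
class SmilesIndex:
    """Exact-match index keyed by the server's own canonical SMILES."""

    def __init__(self, canonicalize):
        self._canonicalize = canonicalize   # any callable: str -> str
        self._ids_by_key = {}

    def add(self, entry_id, smiles):
        key = self._canonicalize(smiles)
        self._ids_by_key.setdefault(key, []).append(entry_id)

    def find_exact(self, query_smiles):
        # The query is canonicalized the same way, then matched as a string.
        key = self._canonicalize(query_smiles)
        return list(self._ids_by_key.get(key, []))


# Stand-in canonicalizer for the demo only: known spellings of ethanol.
ETHANOL_SPELLINGS = {"CCO", "OCC", "C(O)C"}


def toy_canonicalize(smiles):
    return "CCO" if smiles in ETHANOL_SPELLINGS else smiles


index = SmilesIndex(toy_canonicalize)
index.add("entry-1", "OCC")
index.add("entry-2", "COC")
matches = index.find_exact("C(O)C")
```

The client never sees which canonical form the server uses; it only matters that stored values and queries go through the same function.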
@BobHanson Thanks for your opinion here! Could you please also check out the related PR #392 and maybe approve if you agree? |
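Bit of a tangent, but this same reasoning also worried me a bit about the way we standardized chemical formulae to be alphabetical in elements: should we really return zero results for ?filter=chemical_formula_reduced="SiO2", or should we ask databases to handle this themselves (provided the returned formulae are in the canonical order)? Does adding this feature suggest we need to rethink how we handle our string fields (chemical_formula_reduced being the only really important one, I would argue)? |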
There is an ongoing discussion in #416 regarding symmetry properties which I believe may be related as well. I think that canonicalization may be delegated to providers, but if so, it has to be well-specified. Otherwise databases will differ in the way they do it, and we risk returning to pre-OPTIMADE state. Also, query canonicalization will put a strain on providers, not sure if negligible. |
I really would not worry about canonicalization of SMILES. As long as the
SMILES is valid (here specifying OpenSmiles is sufficient) everyone
understands valid as generally acceptable. Note that one of the key aspects
that distinguishes OpenSmiles is its treatment of aromaticity (which is not
the organic chemist's typical view). [1] A key point there is that there is
ample use of MAY and PREFERRED rather than MUST. So, for example, aromatic
atoms MAY be represented with lower case letters but need not be.
I think the real question is on query. MUST a repository be able to process
a SMILES query in a meaningful noncanonical sense, or MAY it treat it as
an exact string?
Apologies if this has already been decided and I am repeating myself.
Probably have missed a few clicks of this discussion.
[1] http://opensmiles.org/opensmiles.html
|
Hi @BobHanson, I think this has been decided for SMILES, my comment is about whether we should adopt the same approach for simpler fields like chemical formula too. |
Ah. That makes more sense. Shouldn't that be a different thread and PR? Not
#368?
|
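My take on this as an implementer is that I really want fields to have clear data types with strict comparison operator semantics. So, if chemical_formula is a string, then I want = to always mean normal string comparison - no: "but for this field equality also holds if the string has the same elements in a different order". Early drafts of OPTIMADE headed in this direction with each field describing its own operator rules, and IMO that leads to madness (and highly non-interoperable implementations). Nevertheless, chemical formulas are obviously a major thing for us. So, if unordered element-wise comparison is useful, I see no issue with redefining chemical_formula_reduced to be a new chemical formula data type with its own clear comparison semantics, i.e., with = meaning unordered comparison over elements, but are < and > allowed? What do they mean? Etc. Furthermore, if used also for chemical_formula_descriptive we need to figure out how = works for constructs with parentheses, brackets, etc. |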
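To illustrate the difference between the two semantics, here is a small sketch of what "unordered element-wise" equality might mean for simple formulas. It is an assumption-laden toy: it handles only element symbols followed by optional integers (no parentheses or brackets), and the function names are invented for this example. It says nothing about what < or > should mean.

```python
import re
from collections import Counter

_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def element_counts(formula):
    """Map each element symbol to its count, for simple formulas only."""
    tokens = _TOKEN.findall(formula)
    if not tokens or "".join(symbol + number for symbol, number in tokens) != formula:
        raise ValueError(f"not a simple element-count formula: {formula!r}")
    counts = Counter()
    for symbol, number in tokens:
        counts[symbol] += int(number) if number else 1
    return counts


def formulas_equal(left, right):
    """Equality over elements and counts, ignoring the order they are written in."""
    return element_counts(left) == element_counts(right)


string_equal = "SiO2" == "O2Si"                  # plain string semantics
elementwise_equal = formulas_equal("SiO2", "O2Si")  # formula data type semantics
```

With string semantics the two spellings differ; with the element-wise comparison they match, which is exactly why the two cases would need distinct data types.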
I agree with @rartino here, but I would really prefer keeping things simple. My main concern is that both defining the new semantics and implementing them (properly) would require much effort. |
What an interesting quandary. It seems to me that "reduced" here is a fine
qualifier that can specify "O2Si" and not "SiO2". No one will know what
"reduced" means unless they read the information anyway, and that
information can explicitly say, "for example, 'O2Si', not 'SiO2' " to make
it absolutely clear what is required. Totally with the idea that a string
is a string. (Except in the case of SMILES, which I would argue is a
special case.) Machines will not care.
Bob
--
Robert M. Hanson
Professor of Chemistry
St. Olaf College
Northfield, MN
http://www.stolaf.edu/people/hansonr
If nature does not answer first what we want,
it is better to take what answer we get.
-- Josiah Willard Gibbs, Lecture XXX, Monday, February 5, 1900
*We stand on the homelands of the Wahpekute Band of the Dakota Nation. We
honor with gratitude the people who have stewarded the land throughout the
generations and their ongoing contributions to this region. We acknowledge
the ongoing injustices that we have committed against the Dakota Nation,
and we wish to interrupt this legacy, beginning with acts of healing and
honest storytelling about this place.*
|
I think we should return an error message in this case, stating that the value for the chemical elements should be in alphabetical order. |
I suggest if a server is expected to return an error, then it could just as
easily normalize any order and continue. The algorithm would be about the
same.
|
The code for checking whether each value is smaller than the next is much simpler than a sorting algorithm, although higher-level programming languages provide their own sorting routines, so in terms of programming work it may not make much difference. I think it would be good if a server returns an error when a query is malformed. It is easy to make a typo, and this way we can at least in some cases inform the user about it. P.S. If the user queries for SC, did they mean to search for CS or Sc? |
I completely agree with @JPBergsma on reporting malformed queries as errors and on the possibility to relax the specification in the future. I would not hurry with the latter, though. |
Maybe I misunderstand you, but as far as I know the word "reduced" in chemical formula is rather meant to refer to the following requirement (quoted from the specification): "For structures with no partial occupation, the chemical proportion numbers are the smallest integers for which the chemical proportion is exactly correct." I think this is a fairly standard use of "reduced"? There is no word in the field name meant to state the need to order elements. That is "just" a part of the specification ("elements MUST be placed in alphabetical order, followed by their integer chemical proportion number.") I think one ends up with rather different viewpoints here depending on whether one views OPTIMADE as "the user interface" for materials data queries or as "just" an underlying standardized communication protocol. I see no problem with, e.g., Jmol sorting the elements of a chemical formula for a user before sending the query to an OPTIMADE database.
We probably need to pick up the discussion again in the SMILES thread on what semantics people who want to filter on SMILES want. I would argue that if they are different from strings, there should be a SMILES datatype. |
Well put. I view OPTIMADE as "just" an underlying standardized communication protocol, hence my animosity towards some of the provider-intensive extensions.
Maybe this is the right thread to do so? Or probably an even better way would be to put together an alternative to PR #392 defining SMILES as a datatype with its own query semantics. Admittedly, I am not a fan (the COD will not be able to handle such queries; I cannot see a way to elegantly introduce a SMILES datatype at grammar level, although we have |
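The two options being compared, rejecting an unordered formula versus silently reordering it, can be sketched as follows. This is an illustration only, assuming simple formulas made of element symbols and optional integers; the function names are hypothetical and not taken from any OPTIMADE implementation.

```python
import re

_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def split_formula(formula):
    """Split 'O2Si' into [('O', '2'), ('Si', '')]; reject anything else."""
    tokens = _TOKEN.findall(formula)
    if not tokens or "".join(symbol + number for symbol, number in tokens) != formula:
        raise ValueError(f"malformed formula: {formula!r}")
    return tokens


def is_alphabetical(formula):
    """Option 1: only check that each symbol is smaller than the next one."""
    symbols = [symbol for symbol, _ in split_formula(formula)]
    return all(left < right for left, right in zip(symbols, symbols[1:]))


def normalize(formula):
    """Option 2: reorder the elements alphabetically and continue."""
    return "".join(symbol + number for symbol, number in sorted(split_formula(formula)))


check_result = is_alphabetical("SiO2")
normalized = normalize("SiO2")
```

Note that both functions need the same tokenization step, so the extra work for sorting is small. The check, however, lets the server tell the user that "SC" is not in alphabetical order instead of guessing whether "CS" or "Sc" was meant.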
As promised, I have created PR #436 introducing SMILES data type. |
Do we want to allow the use of SMILES strings in the field chemical_formula_descriptive?
The SMILES notation for molecular structures uses '#' and '$' to indicate triple and quadruple bonds,
the characters '/' and '\' to indicate whether the bonds are in the cis or trans orientation, and '@' and '@@' to differentiate enantiomers. Finally, ring numbers with more than one digit have to be preceded by a '%' sign.
It therefore seems reasonable to me to add these to the allowed characters for the chemical_formula_descriptive field.
Or do you think we should add a separate SMILES field instead?