Some Pubchem CID numbers in MassBank were deactivated in Pubchem Compound #186
Comments
Thanks - this is something we are aware of, but it is not necessarily trivial to overwrite old CIDs in some MassBank records due to licensing issues. @meier-rene this is potentially something we could take care of at validation and/or run occasionally? We have various functions that can help distinguish current live CIDs from deprecated CIDs. @meier-rene is it possible to get an overview of how many records are affected? Would cross-linking help (so the CID directs to the current CID), or would updating the CIDs on our side be better?
On 2021-11-08 09:53, Emma Schymanski wrote:
Thanks - this is something we are aware of but it is not necessarily
trivial to overwrite old CIDs in some MassBank records due to
licensing issues.
@meier-rene this is potentially something we could take care of at
validation and/or run occasionally?
It's not clear (yet) how many records are affected (whether 10s or
100s), nor whether this affects records that we can't necessarily
edit. Some CIDs, e.g. guanylurea, actually migrate back and forth
between CIDs occasionally ...
We have various functions that can help distinguish current live CID
from deprecated CIDs.
https://github.com/schymane/RChemMass/blob/master/R/ChemicalCuration.R#L1516
(maybe webchem does this better by now).
@meier-rene is it possible to get an overview of how many records
are affected? Would cross-linking help (so the CID directs to the
current CID) or would updating the CIDs our side be better?
Hi Emma,
Thank you for checking into this. I did not realize the issue was a
known problem.
I had noticed a few non-live CIDs in the past, but it looks like a
significant number of these changes were made in Jan 2019. Fortunately,
the hyperlinks from MassBank to Pubchem still work and bring up the
non-live CIDs with references to the new ones. (I usually type the
numbers in!)
To check a large number of CIDs to see if they are still active, you can
use the Pubchem Identifier Exchange Service and then choose to convert
CIDs into CIDs with output into two columns. Non-live CIDs will return a
blank in the second column. I asked the Pubchem folks last week for a
cross-index of non-live CIDs to active CIDs and apparently there is
none.
Best,
Dan
Thanks Dan - an alternative way of tackling it would be to map CIDs using the InChIKey in the records (which will return the best current CID) and see where they differ. The InChIKeys themselves should not have changed. It depends on what you are trying to do. Yes, in 2019 PubChem switched the software behind the scenes, which resulted in this shift of CIDs, so it will potentially affect a fraction of the records contributed before then (which is most of our records), and maybe some after. But because of how MassBank is constructed, and since some of the CIDs still shift periodically, the fix is not trivial ... but we can discuss among ourselves what to do. We're also in contact with PubChem as necessary. @meier-rene we could also map back current CIDs via our deposition files ... but the API is likely the easier option. Thanks,
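For anyone who wants to try the InChIKey route, here is a minimal sketch of the comparison. It assumes the PUG REST InChIKey lookup (`compound/inchikey/<key>/cids/JSON`) answers with an `IdentifierList`/`CID` JSON object, and that a lookup can return more than one CID, so it only reports whether the CID stored in a record is among the current ones. The function names are made up for illustration and are not part of MassBank or RChemMass.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def inchikey_url(inchikey: str) -> str:
    """Build the PUG REST URL that lists the CIDs for an InChIKey."""
    return f"{PUG_REST}/compound/inchikey/{quote(inchikey)}/cids/JSON"


def cids_from_payload(payload: dict) -> list[int]:
    """Extract CIDs from a decoded response (assumed IdentifierList/CID layout)."""
    return [int(c) for c in payload.get("IdentifierList", {}).get("CID", [])]


def fetch_cids(inchikey: str) -> list[int]:
    """Query PubChem; an unknown key is expected to raise urllib.error.HTTPError."""
    with urlopen(inchikey_url(inchikey), timeout=30) as resp:
        return cids_from_payload(json.load(resp))


def cid_differs(record_cid: int, current_cids: list[int]) -> bool:
    """True when the CID stored in a record is not among the current CIDs."""
    return record_cid not in current_cids
```

Usage would be along the lines of `cid_differs(record_cid, fetch_cids(record_inchikey))` for each record, with a pause between requests to stay within PubChem's usage limits.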
Thank you Dan for reporting. Emma already gave some information about this issue.
I could now write a lengthy paragraph explaining all the complications we have with external database identifiers (which I actually already did below), but I would rather explain how I see external database identifiers in MassBank. I consider them a nice-to-have way to link out to external resources. They are not always stable and not always correctly deposited. I see no chance of getting them always correct. The main identifiers for compounds in MassBank are the InChI and the SMILES.
Lengthy paragraph
Second, we have to fix our existing records. The code to query PUG REST and set the CID in the record files exists, but I'm hesitant to apply it blindly to all records. We have several identifiers for the chemical structure in the record files, and there are sometimes mistakes or mismatches. It helps to have the whole unchanged information to find out what is correct and what is a mistake. Conclusion: flagging errors is easy, but fixing remains a partly manual thing. That has low priority on my task list.
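To illustrate the "flagging is easy" part, here is a rough sketch that only reports records and leaves the files untouched. It assumes record files carry the CID on a line of the form `CH$LINK: PUBCHEM CID:<number>` and that a dictionary of non-live to preferred CIDs is already available from some source. The names `record_cids` and `flag_non_live` and the accession `REC0001` are hypothetical, and this is not the existing MassBank code.

```python
import re

# Assumed record line format: "CH$LINK: PUBCHEM CID:7682"
PUBCHEM_LINK = re.compile(r"^CH\$LINK:\s+PUBCHEM\s+CID:(\d+)", re.MULTILINE)


def record_cids(record_text: str) -> list[int]:
    """Return every PubChem CID linked in one record's text."""
    return [int(m) for m in PUBCHEM_LINK.findall(record_text)]


def flag_non_live(records: dict[str, str], old_to_new: dict[int, int]):
    """List (accession, old CID, preferred CID) for records with a non-live CID.

    Nothing is rewritten; the output is meant for manual review.
    """
    flagged = []
    for accession, text in records.items():
        for cid in record_cids(text):
            if cid in old_to_new:
                flagged.append((accession, cid, old_to_new[cid]))
    return flagged


# Hypothetical record content, using the CID pair mentioned in this issue.
example = {"REC0001": "ACCESSION: REC0001\nCH$LINK: PUBCHEM CID:7682\n"}
print(flag_non_live(example, {7682: 676454}))
```

Keeping the step read-only fits the concern above: the original identifiers stay in place until someone has checked them against the InChI and SMILES.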
Some additional tips from Paul from PubChem:
We have a file on the FTP site that has a mapping from old to new CID:
https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-Preferred.gz
And that includes this example:
You can also get this with PUG REST:
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/7682/cids/JSON?cids_type=preferred
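A small sketch of how both tips could be used from Python. It assumes that each line of `CID-Preferred.gz` holds two whitespace-separated integers (non-live CID, then preferred CID) and that the PUG REST call returns an `IdentifierList`/`CID` JSON object. Both assumptions should be checked against the actual file and response before relying on this, and the function names are illustrative only.

```python
import gzip
import json
from urllib.request import urlopen


def parse_cid_preferred(lines) -> dict[int, int]:
    """Build {non-live CID: preferred CID} from lines of the mapping file."""
    mapping = {}
    for line in lines:
        parts = line.split()
        # Skip anything that is not exactly two integers (assumed layout).
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            mapping[int(parts[0])] = int(parts[1])
    return mapping


def load_cid_preferred(path: str) -> dict[int, int]:
    """Read a locally downloaded copy of CID-Preferred.gz."""
    with gzip.open(path, "rt", encoding="ascii") as fh:
        return parse_cid_preferred(fh)


def preferred_cid_url(cid: int) -> str:
    """PUG REST URL from the tip above, for a single CID."""
    return (
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"
        f"{cid}/cids/JSON?cids_type=preferred"
    )


def fetch_preferred_cid(cid: int):
    """Return the first preferred CID reported by PubChem, or None."""
    with urlopen(preferred_cid_url(cid), timeout=30) as resp:
        cids = json.load(resp).get("IdentifierList", {}).get("CID", [])
    return int(cids[0]) if cids else None
```

The file is probably the better choice for checking all records at once, while the REST call suits spot checks of individual CIDs.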
Thanks Emma, that's great. With this information I can easily update old CIDs to new CIDs.
I'm working on this atm. Here are some numbers which I extracted, thanks to the great list which Emma pointed to... There are 67 CIDs referenced in MassBank which are "non-live".
Some Pubchem records in Pubchem Compound were modified and made "Non-live". For example, phenylthiourea (CID 7682) was replaced with a different tautomer of phenylthiourea (CID 676454). These changes happened fairly recently, and there is no cross-referencing index of old to new CIDs that I am aware of. As a result, some of the Pubchem CID entries in MassBank are no longer "live" even though the CIDs were entered correctly.
Normally you can search CID numbers by typing the CID as an integer (e.g. "7682") into the Pubchem search box. If no CID record is returned, typing "CID 7682" will bring up the Non-live record with a hyperlink to the CID of the current "Preferred Compound".