Some Pubchem CID numbers in MassBank were deactivated in Pubchem Compound #186

Open
dlswee opened this issue Nov 8, 2021 · 7 comments

@dlswee

dlswee commented Nov 8, 2021

Some Pubchem records in Pubchem Compound were modified and made "Non-live". For example, phenylthiourea (CID 7682) was replaced with a different tautomer of phenylthiourea (CID 676454). These changes happened fairly recently, and there is no cross-referencing index of old to new CIDs that I am aware of. As a result, some of the Pubchem CID entries in MassBank are no longer "live" even though the CIDs were entered correctly.

Normally you can search CID numbers by typing the CID as an integer (e.g. "7682") into the Pubchem search box. If no CID record is returned, typing "CID 7682" will bring up the Non-live record with a hyperlink to the CID of the current "Preferred Compound".

@schymane
Member

schymane commented Nov 8, 2021

Thanks - this is something we are aware of but it is not necessarily trivial to overwrite old CIDs in some MassBank records due to licensing issues.

@meier-rene this is potentially something we could take care of at validation and/or run occasionally?
It's not clear (yet) how many records are affected (whether 10s or 100s), nor whether this affects records that we can't necessarily edit. Some CIDs, e.g. guanylurea, actually migrate back and forth between CIDs occasionally ...

We have various functions that can help distinguish current live CIDs from deprecated CIDs.
https://github.com/schymane/RChemMass/blob/master/R/ChemicalCuration.R#L1516
(maybe webchem does this better by now).

@meier-rene is it possible to get an overview of how many records are affected? Would cross-linking help (so the CID directs to the current CID) or would updating the CIDs our side be better?

@dlswee
Author

dlswee commented Nov 8, 2021 via email

@schymane
Member

schymane commented Nov 8, 2021

Thanks Dan - an alternative way of tackling it would be to map CIDs using the InChIKey in the records (which will return the best current CID)... and see where they differ. The InChIKeys themselves should not have changed. It depends what you are trying to do.
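A minimal sketch of that InChIKey comparison, assuming the PUG REST `compound/inchikey/<key>/cids/JSON` lookup and the `IdentifierList` response shape shown later in this thread. The helper names are made up for illustration, and the endpoint may return more than one CID, so the sketch only checks whether the record's CID is among them:

```python
import json
import urllib.parse
import urllib.request

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def inchikey_cids_url(inchikey: str) -> str:
    """Build the PUG REST URL that lists the CIDs for one InChIKey."""
    key = urllib.parse.quote(inchikey)
    return f"{PUG_REST}/compound/inchikey/{key}/cids/JSON"


def parse_cids(payload: str) -> list:
    """Pull the CID list out of an IdentifierList JSON response."""
    return json.loads(payload).get("IdentifierList", {}).get("CID", [])


def fetch_cids(inchikey: str) -> list:
    """Network call (raises urllib.error.HTTPError if PubChem has no match)."""
    with urllib.request.urlopen(inchikey_cids_url(inchikey), timeout=30) as resp:
        return parse_cids(resp.read().decode("utf-8"))


def cid_differs(record_cid: int, current_cids: list) -> bool:
    """True when the CID stored in a record is not among the looked-up CIDs."""
    return record_cid not in current_cids
```

Records where `cid_differs` is true would only be candidates for a closer look; a difference does not say which identifier is the wrong one.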

Yes, in 2019 PubChem switched the software behind the scenes which resulted in this shift of CIDs - so it will potentially affect a fraction of the records contributed before then (which is most of our records); and maybe some after. But due to the construct of MassBank, and since some of them shift periodically still, the fix is not trivial ... but we can discuss amongst us what to do. We're also in contact with PubChem as necessary.

@meier-rene we could also map back current CIDs via our deposition files ... but the API is likely the easier option.

Thanks,
Emma

@meier-rene
Collaborator

Thank you Dan for reporting. Emma already gave some information about this issue.
I know that PubChem CIDs in MassBank are not always correct, for several reasons.

  • First, they are not determined by an algorithm at runtime (that would slow down the website and create an unnecessary dependency on an external service), but rather supplied at deposition time by our contributors. There might be errors in the original data (a mismatch between CID and SID, or just plain errors).
  • Second, PubChem sometimes makes CIDs "non-live" because they change the representation (drawing) of a compound. This deprecates the CIDs in MassBank without our knowledge.

I could now write a lengthy paragraph explaining all the complications we have with external database identifiers (which I actually already did below), but I would rather explain how I see external database identifiers in MassBank. I consider them a nice-to-have way to link out to external resources. They are not always stable and not always correctly deposited. I see no chance of getting them always correct. The main identifiers for compounds in MassBank are the InChI and the SMILES.

Lengthy paragraph
What could be a solution?
First, we need to prevent the acceptance of new contributions with errors. That's not trivial to integrate into our standard validation procedure because of runtime problems: communication with PUG REST is too slow to run as a routine procedure. So I need to set up a mechanism which distinguishes between existing and new records. That's certainly possible but not in place. I think I should give more priority to this one.

Second, we have to fix our existing records. The code to query PUG REST and set the CID in the record files exists, but I'm hesitant to apply it blindly to all records. We have several identifiers for the chemical structure in the record files, and there are sometimes mistakes or mismatches. It helps to have the whole unchanged information to find out what is correct and what's a mistake. Conclusion: flagging errors is easy, but fixing remains a partly manual thing. That has low priority on my task list.
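To illustrate the "flag first, fix by hand" idea, here is one possible triage sketch. It is not the existing MassBank code. The `RecordIds` fields and the three labels are hypothetical, and the rule (only suggest an update when two independent lookups agree) is just one assumption about what a cautious policy could look like:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecordIds:
    accession: str                 # record identifier (hypothetical field name)
    cid: int                       # CID as deposited in the record
    preferred_cid: Optional[int]   # PubChem's preferred CID for `cid`, if known
    inchikey_cid: Optional[int]    # CID looked up via the record's InChIKey, if known


def triage(rec: RecordIds) -> str:
    """Return 'ok', 'update-candidate', or 'manual'. Never edits the record."""
    if rec.preferred_cid is None or rec.inchikey_cid is None:
        return "manual"            # a lookup failed, so a person should check
    if rec.cid == rec.preferred_cid == rec.inchikey_cid:
        return "ok"                # all three identifiers agree
    if rec.preferred_cid == rec.inchikey_cid:
        return "update-candidate"  # stale CID, both lookups agree on the new one
    return "manual"                # identifiers disagree; leave the record untouched
```

Because the function only labels records, the unchanged record information stays available for whoever resolves the "manual" cases.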

@schymane
Member

schymane commented Nov 9, 2021

Some additional tips from Paul from PubChem:

We have a file on the FTP site that has a mapping from old to new CID:

https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-Preferred.gz

And that includes this example:
7682 676454
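A rough sketch of loading that mapping file, assuming each line holds two whitespace-separated integers (old CID, then preferred CID) as in the example line above. The function names are illustrative, and treating a CID that is absent from the file as unchanged is an assumption, not something stated here:

```python
import gzip
from typing import Dict, Iterable


def parse_preferred_lines(lines: Iterable[str]) -> Dict[int, int]:
    """Turn 'old new' lines into a dict mapping old CID to preferred CID."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        old, new = parts
        mapping[int(old)] = int(new)
    return mapping


def load_preferred_map(path: str) -> Dict[int, int]:
    """Read a local copy of CID-Preferred.gz (downloaded separately)."""
    with gzip.open(path, "rt", encoding="ascii") as handle:
        return parse_preferred_lines(handle)


def preferred_cid(cid: int, mapping: Dict[int, int]) -> int:
    """Return the preferred CID, or the CID itself if it is not in the mapping."""
    return mapping.get(cid, cid)
```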

You can also get this with PUG REST:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/7682/cids/JSON?cids_type=preferred

{
  "IdentifierList": {
    "CID": [
      676454
    ]
  }
}
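For scripted use, the same request could be wrapped roughly as follows. This is a sketch that relies only on the URL and response shape quoted above; the function names are made up, and returning `None` for an empty list is an assumption about how a caller might want to handle that case:

```python
import json
import urllib.request
from typing import Optional


def preferred_cid_url(cid: int) -> str:
    """Build the PUG REST URL that asks for the preferred CID of `cid`."""
    return (
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"
        f"{cid}/cids/JSON?cids_type=preferred"
    )


def parse_preferred(payload: str) -> Optional[int]:
    """Return the first CID in an IdentifierList response, or None if empty."""
    cids = json.loads(payload).get("IdentifierList", {}).get("CID", [])
    return cids[0] if cids else None


def fetch_preferred_cid(cid: int) -> Optional[int]:
    """Network call: one request per CID, so slow for large batches."""
    with urllib.request.urlopen(preferred_cid_url(cid), timeout=30) as resp:
        return parse_preferred(resp.read().decode("utf-8"))
```

For many CIDs at once, the FTP mapping file is probably the better fit, since it avoids one HTTP round trip per CID.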

@meier-rene
Collaborator

Thanks Emma, that's great. With this information I can easily update old CIDs to new CIDs.

@meier-rene
Collaborator

I'm working on this atm. Here are some numbers which I extracted, thanks to the great list which Emma pointed to...

There are 67 CIDs referenced in MassBank which are "non-live".
There are 672 records affected by this problem.
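For anyone wanting to reproduce counts like these, a sketch of the counting step might look like the following. It assumes records carry the CID on a line such as `CH$LINK: PUBCHEM CID:7682` (an assumption about the record format), that the set of non-live CIDs has already been built (for example from the keys of the CID-Preferred mapping), and the function names are illustrative only:

```python
import re
from typing import Dict, List, Set, Tuple

# Assumed record line format: "CH$LINK: PUBCHEM CID:7682"
CID_LINE = re.compile(r"^CH\$LINK:\s*PUBCHEM\b.*?CID:(\d+)", re.MULTILINE)


def record_cids(record_text: str) -> List[int]:
    """Extract every PubChem CID referenced in one record's text."""
    return [int(cid) for cid in CID_LINE.findall(record_text)]


def summarize(records: Dict[str, str], non_live: Set[int]) -> Tuple[int, int]:
    """Return (distinct non-live CIDs referenced, records referencing one)."""
    hit_cids = set()
    affected = 0
    for text in records.values():
        stale = non_live.intersection(record_cids(text))
        if stale:
            affected += 1
            hit_cids |= stale
    return len(hit_cids), affected
```

Here `records` maps an accession to the full text of its record file; the two numbers returned correspond to the two figures reported above.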
