Implement gorule-0000022 QC check for annotations to retracted publications #676

Closed
pgaudet opened this issue Jun 6, 2018 · 43 comments

Comments

@pgaudet
Contributor

pgaudet commented Jun 6, 2018

From @vanaukenk on December 19, 2016 14:32

Hi,
Following on from a help desk ticket:
http://jira.geneontology.org/browse/GO-1431

Can we explore adding a QC check for annotations to retracted publications?

A possible approach might be:

PubMed indexes retracted publications in the PublicationTypeList tag. Here's an example:

<PublicationTypeList>
  <PublicationType UI="D016428">Journal Article</PublicationType>
  <PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
  <PublicationType UI="D013486">Research Support, U.S. Gov't, Non-P.H.S.</PublicationType>
  <PublicationType UI="D016441">Retracted Publication</PublicationType>
</PublicationTypeList>

Perhaps implementing a periodic query to PubMed for articles with Type "Retracted Publication" and then checking those PMIDs against the PMIDs in the GO database would work.
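
A minimal sketch of the per-record test this approach relies on, assuming the PublicationTypeList arrives as well-formed XML. The function name `is_retracted` and the sample string are illustrative only; the UI value is the one listed in the example above.

```python
import xml.etree.ElementTree as ET

# UI listed for "Retracted Publication" in the example above.
RETRACTED_UI = "D016441"


def is_retracted(publication_type_list_xml: str) -> bool:
    """Return True if any PublicationType element carries the retraction UI."""
    root = ET.fromstring(publication_type_list_xml)
    return any(pt.get("UI") == RETRACTED_UI for pt in root.iter("PublicationType"))


# Hypothetical sample, shortened from the example above.
SAMPLE = """<PublicationTypeList>
  <PublicationType UI="D016428">Journal Article</PublicationType>
  <PublicationType UI="D016441">Retracted Publication</PublicationType>
</PublicationTypeList>"""

print(is_retracted(SAMPLE))
```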

Thx.

Copied from original issue: geneontology/go-annotation#1479

This corresponds to GAF column 6 / GPAD 1.1/2.0 column 5.
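
For illustration, a sketch of pulling the references out of GAF column 6, assuming a tab-separated GAF line whose reference column may hold several pipe-separated IDs. The helper name `gaf_references` is hypothetical, and the sample line is assembled from the values in the report quoted at the end of this thread.

```python
from typing import List


def gaf_references(line: str) -> List[str]:
    """Return the reference IDs found in GAF column 6 (index 5)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 6:
        return []  # not a full annotation line
    return [ref for ref in cols[5].split("|") if ref]


# Sample line built for illustration; column 8 (With/From) is left empty.
SAMPLE_LINE = "\t".join([
    "UniProtKB", "P55211", "CASP9", "enables", "GO:0008233",
    "PMID:20663920", "IDA", "", "F", "Caspase-9", "CASP9|MCH6",
    "protein", "taxon:9606", "20120309", "MGI",
])

print(gaf_references(SAMPLE_LINE))
```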

@kltm
Member

kltm commented Jun 6, 2018

This should possibly be related to the blacklist system.
A "hot" query of this type would be part of the pipeline and done at the earliest stages, as part of the metadata get.

@pgaudet pgaudet self-assigned this Jun 7, 2018
@pgaudet pgaudet changed the title QC check for annotations to retracted publications Implement gorule-0000022 QC check for annotations to retracted publications Jun 19, 2019
pgaudet added a commit that referenced this issue Nov 8, 2023
@kltm
Member

kltm commented Dec 1, 2023

@pgaudet I think we need to touch base on this as "low-hanging" fruit--there is a bit more here as we need to draw in external APIs, etc.

@cmungall
Member

@pgaudet "there is a file of all retracted PMIDs that is available for download"

This would be easier than using the API over all PMIDs

@mugitty
Contributor

mugitty commented Dec 11, 2023

@cmungall , What is the link to the file with the retracted PMIDs that is available for download? I did not know one existed.

When I looked into the NCBI API that would return the list of retracted PMIDs, it would only return a maximum of 10K records. There are ~20k retracted publications. Retrieving all of the retracted publications requires downloading an application.

@kltm
Member

kltm commented Dec 11, 2023

Not to derail this conversation, and it's fine to work out where potential resources are, but I'd still like to touch base with @pgaudet on this before proceeding.

@cmungall
Member

@mugitty - just paginate til you get them all?

What do you mean downloading an application?

@mugitty
Contributor

mugitty commented Dec 11, 2023

@cmungall, the URL to retrieve the retracted publications is:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Retracted+Publication[pt]&RetStart=0

Note, the response indicates that there are 19582 retracted publications.

The URL can be modified to return more than the default 20 records by specifying the RetMax parameter:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Retracted+Publication[pt]&RetStart=0&RetMax=9999

When the URL is updated to start at record 10000 or beyond:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Retracted+Publication[pt]&RetStart=10000

The system responds with error message:
...
Search Backend failed: Exception: 'retstart' cannot be larger than 9998. For PubMed, ESearch can only retrieve the first 9,999 records matching the query. To obtain more than 9,999 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved. For details see https://www.ncbi.nlm.nih.gov/books/NBK25499/
...
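
To make the limit in that error message concrete, here is a small arithmetic sketch. It assumes only what the message states (the first 9,999 matching records are reachable) and makes no network calls; `ESEARCH_CAP` and `plan_pages` are hypothetical names.

```python
from typing import List, Tuple

# From the error message: only the first 9,999 records can be retrieved.
ESEARCH_CAP = 9999


def plan_pages(total: int, page_size: int, cap: int = ESEARCH_CAP) -> Tuple[List[Tuple[int, int]], int]:
    """Return the (retstart, retmax) pairs that stay under the cap,
    plus the number of records that plain paging cannot reach."""
    reachable = min(total, cap)
    pages = []
    start = 0
    while start < reachable:
        size = min(page_size, reachable - start)
        pages.append((start, size))
        start += size
    return pages, total - reachable


pages, unreachable = plan_pages(total=19582, page_size=5000)
print(pages, unreachable)
```

With the 19582 records reported above, roughly half of the result set stays out of reach no matter how the pages are sized.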

This is not just NCBI. In the "absence of large data stores and web servers", this model of supporting "HIGH VOLUME" pagination with "LARGE DATASETS" via API is unsustainable.

There is a way to retrieve all the retracted publications, but it is not via the API. At a minimum, this list of retracted publications has to be updated with every "GO release". This will add to the release process overhead (tagging @kltm). The list of retracted publications can be stored as YAML, JSON, etc. The GO rule validator would have to parse it and keep it in memory as a hash set (negligible impact) to cross-check against the references.
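
The hash-set cross-check could look roughly like this. It is a sketch under assumptions: the list is stored one identifier per line (the storage format is left open above), and `load_retracted` and `retracted_references` are hypothetical names, not functions of the actual validator.

```python
from typing import Iterable, List, Set


def load_retracted(lines: Iterable[str]) -> Set[str]:
    """Read one identifier per line into a set, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def retracted_references(references: Iterable[str], retracted: Set[str]) -> List[str]:
    """Return the references of an annotation that appear in the retracted set."""
    return [ref for ref in references if ref in retracted]


# In practice the lines would come from an open file handle.
retracted = load_retracted(["PMID:20663920\n", "PMID:1\n", "\n"])
print(retracted_references(["GO_REF:0000033", "PMID:20663920"], retracted))
```

Membership tests on a set are constant time on average, so the check adds little cost per annotation even for a list of tens of thousands of IDs.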

@deustp01

deustp01 commented Dec 11, 2023

Note, the response indicates that there are 19582 retracted publications.

... and only 10000 can be downloaded at a time.

Here is a truly ugly hack. Search for "Retracted publication"[pt] to get the list of 19,582 items. Then choose the "save" option from the bar just under the header, and choose format:PMID (all you need if your goal is to get a list to be checked against your list of references that you have relied on). That will get you a list of the first 10,000 starting from the oldest.
Then on the same PubMed results page click on the little up-arrow in the small box next to the sort by: publication date box near the top of the page. That will get the first 10,000 starting from the newest. The two lists, concatenated and uniquified should be what you need. Truly, truly ugly, but it has the effect of getting a local copy of all retracted PMIDs that local code can check against a local list of PMIDs used for annotation, without hammering anyone's API.
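
A sketch of the "concatenate and uniquify" step, assuming each saved export is a list of bare PMIDs, one per line. `merge_exports` is a hypothetical helper; it also reports how many records are still missing relative to the total shown on the results page, which is how one would notice once two exports of 10,000 no longer cover everything.

```python
from typing import Iterable, List, Tuple


def merge_exports(oldest_first: Iterable[str],
                  newest_first: Iterable[str],
                  expected_total: int) -> Tuple[List[str], int]:
    """Union two PMID exports and report how many records are still missing."""
    merged = {p.strip() for p in oldest_first if p.strip()}
    merged |= {p.strip() for p in newest_first if p.strip()}
    missing = max(expected_total - len(merged), 0)
    return sorted(merged, key=int), missing


# Toy data standing in for the two 10,000-line exports.
pmids, missing = merge_exports(["1", "2", "3"], ["5", "4", "3"], expected_total=5)
print(pmids, missing)
```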


A disturbing side note: the list goes back to 1951. The 20 most recent retractions appear to have happened since mid-August 2023, about 4 months. The oldest 20 took 27 years to accumulate, from 1951 to 1978. This also suggests that, at the current rate of retraction we are soon going to need a way of getting more than 2 x 10,000 items so an improvement on this hack will be needed.

@kltm
Member

kltm commented Dec 11, 2023

Talking to @pgaudet this morning, and following up with conversation from last week with @mugitty, while this is certainly something we want to do, it's no longer in the immediate TODO list for this specific project. We'll keep it open in this project, as we want to make sure we have eyes on it for how to proceed, but we want to make sure we plan this out for stability and consistency given how our pipeline currently is working.

@kltm
Member

kltm commented Dec 11, 2023

Note, the response indicates that there are 19582 retracted publications.

... and only 10000 can be downloaded at a time.

Here is a truly ugly hack [...] but it has the effect of getting a local copy of all retracted PMIDs that local code can check against a local list of PMIDs used for annotation, without hammering anyone's API.

We also have to keep in mind ToS of eutils, etc. Ideally, we would be able to grab an upstream file and simply filter against it. Second best is making that upstream file ourselves and maintaining it as best we can.

@balhoff
Member

balhoff commented Dec 11, 2023

Check out SemOpenAlex: https://semopenalex.org/

You can query for retracted PMIDs: https://api.triplydb.com/s/RpkYEr-qN

@mugitty
Contributor

mugitty commented Mar 22, 2024

@pgaudet, https://semopenalex.org/ was back up today. I ran the following query for retracted PMIDs. This site only gives 7021 results, whereas we got over 20000 results on 20240321.

The command I used to retrieve the results is:
curl https://semopenalex.org/sparql --data query=PREFIX%20rdf%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E%0APREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0APREFIX%20fabio%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fspar%2Ffabio%2F%3E%0APREFIX%20soa%3A%20%3Chttps%3A%2F%2Fsemopenalex.org%2Fontology%2F%3E%0ASELECT%20%2A%20WHERE%20%7B%0A%20%20%3Fpub%20fabio%3AhasPubMedId%20%3Fpmid%20.%0A%20%20%3Fpub%20soa%3AisRetracted%20true%20.%0A%7D -X POST > retracted.xml

or open browser to https://yasgui.triply.cc/#, enter the following query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX fabio: <http://purl.org/spar/fabio/>
PREFIX soa: <https://semopenalex.org/ontology/>
SELECT * WHERE {
  ?pub fabio:hasPubMedId ?pmid .
  ?pub soa:isRetracted true .
}

and select page size 'All'.
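
For readers who want to see how the percent-encoded curl body relates to the readable query, here is a sketch that builds the form body from the query text with the standard library. It only constructs the string and sends nothing; `sparql_form_body` is a hypothetical helper name.

```python
from urllib.parse import quote, urlencode

QUERY = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX fabio: <http://purl.org/spar/fabio/>
PREFIX soa: <https://semopenalex.org/ontology/>
SELECT * WHERE {
  ?pub fabio:hasPubMedId ?pmid .
  ?pub soa:isRetracted true .
}"""


def sparql_form_body(query: str) -> str:
    """Encode the query as a form body, using %20 for spaces as in the curl command."""
    return urlencode({"query": query}, quote_via=quote)


body = sparql_form_body(QUERY)
print(body)
```

The resulting string is what would be passed to curl's `--data` option.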

I have attached the output from the query.
retracted.csv

@kltm
Member

kltm commented Mar 22, 2024

@pgaudet @mugitty I want to check in on this as we are now pulling in external files from external resources and needing to thread them into the system. If this is "low-hanging fruit", I want to make sure we're doing this in a robust and flexible way (e.g. file storage, update frequency).

@mugitty
Contributor

mugitty commented Mar 22, 2024

@kltm, the plan is for @pgaudet to create a file with all the retracted publications, which @pgaudet will update at 'certain intervals'. The annotation parsers can use the file to check for retracted publications. This will free the pipeline from being dependent on undependable external resources.

@kltm
Member

kltm commented Mar 22, 2024

I think I'm wanting to hammer out exact availability and frequency here. Naturally, once we have the file worked out, it will be made available statically in the pipeline (where ontobio will run). Generally speaking, when we start working with external resources on internal systems, we want to make sure that expectations and use are hammered out. (Typically, these kinds of things would be hammered out in the "architecture" portion of project planning, but since this has become a bit of a "rolling" project, we haven't had a chance to do that this time. I just want to follow through on this part.)

@kltm
Member

kltm commented Mar 27, 2024

@mugitty just a note: for format and "line per pub", the formal format we'll be targeting will be CURIEs, not internal IDs.
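
As an illustration of the "line per pub" CURIE format, a sketch that normalizes bare numeric PubMed IDs to `PMID:`-prefixed CURIEs. The helper `to_curie` and its leniency about case and whitespace are assumptions, not an agreed specification.

```python
import re


def to_curie(raw: str, prefix: str = "PMID") -> str:
    """Turn '20663920' or 'pmid:20663920' into 'PMID:20663920'."""
    value = raw.strip()
    if re.fullmatch(r"\d+", value):
        return f"{prefix}:{value}"
    match = re.fullmatch(rf"{re.escape(prefix)}:\s*(\d+)", value, flags=re.IGNORECASE)
    if match:
        return f"{prefix}:{match.group(1)}"
    raise ValueError(f"not a recognizable {prefix} identifier: {raw!r}")


# One CURIE per line, as it would appear in the file.
print("\n".join(to_curie(x) for x in ["20663920", " pmid:12345 "]))
```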

@mugitty
Contributor

mugitty commented Mar 27, 2024

@mugitty just a note: for format and "line per pub", the formal format we'll be targeting will be CURIEs, not internal IDs.

Yes

@kltm
Member

kltm commented Mar 29, 2024

Okay, quick play here, I think I have something with: retracted-publications.txt.

I pulled this out with:

esearch -db pubmed -query "Retracted Publication [pt]" | efetch -format pubmed > /tmp/pubmed-retracted.xml
cat /tmp/pubmed-retracted.xml | grep -oh ">[0-9]*<\/PMID>" | sort | uniq | cut -d '>' -f 2 | cut -d '<' -f 1 | sed 's/^/PMID:/'

If I were to do this again, I might try a different command, which would make the retractions more clear (I'm not 100% sure above, which is why I haven't committed it to the repo yet).

esearch -db pubmed -query "Retracted Publication [pt]" | efetch -format pubmed -mode asn.1 > /tmp/pubmed-retracted.txt

However, I seem to have hit some kind of query limit; best to try again later.
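
One caveat with the grep pipeline above is that it matches every `</PMID>` in the file, including PMIDs nested in cross-reference elements such as CommentsCorrections, so IDs of related records may be included. A hedged alternative, assuming the efetch output is standard PubMed XML with MedlineCitation elements, is to keep only the PMID that is a direct child of each citation. `citation_pmids` and the sample document are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import List


def citation_pmids(pubmed_xml: str) -> List[str]:
    """Return CURIEs for the record-level PMID of each MedlineCitation only."""
    root = ET.fromstring(pubmed_xml)
    pmids = set()
    for citation in root.iter("MedlineCitation"):
        pmid = citation.find("PMID")  # direct child, not nested cross-references
        if pmid is not None and pmid.text:
            pmids.add("PMID:" + pmid.text.strip())
    return sorted(pmids)


# Hypothetical, heavily trimmed record: 111 is the article, 222 a cross-reference.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">111</PMID>
      <CommentsCorrectionsList>
        <CommentsCorrections RefType="RetractionIn">
          <PMID Version="1">222</PMID>
        </CommentsCorrections>
      </CommentsCorrectionsList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

print(citation_pmids(SAMPLE_XML))
```

For a file covering all retracted records, a streaming parser such as `ET.iterparse` would likely be a better fit than loading the whole document at once.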

@kltm
Member

kltm commented Mar 29, 2024

Okay, I'm not liking my file here. I think reprocessing Pascale's above is a good choice for now.

cat europepmc_id.txt | cut -d ',' -f 1 | sort | uniq > retracted-publications-2.txt

@mugitty
Contributor

mugitty commented Mar 29, 2024

@kltm, do you want to add it to metadata somewhere for now, so I can work off of it?

Thanks

@kltm
Member

kltm commented Apr 2, 2024

From a conversation with @mugitty , I wanted to clarify the current state.

pgaudet added a commit that referenced this issue May 6, 2024
Added file for gorule-0000022 
for #676
@pgaudet
Contributor Author

pgaudet commented May 6, 2024

@mugitty The europe-pmc-retracted.txt file is here:

https://github.com/geneontology/go-site/blob/master/docs/europe-pmc-retracted.txt

Moved to a better location: https://github.com/geneontology/go-site/blob/master/metadata/retracted-publications.txt

Updated today. I made a note in my calendar to update it monthly.

Thanks, Pascale

pgaudet added a commit that referenced this issue May 6, 2024
Added test for gorule-0000022 ##676
@kltm
Member

kltm commented May 6, 2024

@pgaudet Maybe next week we can work out my third point here? #676 (comment)

@pgaudet
Contributor Author

pgaudet commented May 7, 2024

@kltm sure

But for now -

I did not add the metadata to the repo as I'm a little concerned that the europepmc file and the data I extracted with eutils are quite different.

@mugitty and I figured that if the europepmc file has most of the content then this is better than having no check at all.

If you have the complete file from eutils I can compare them; we suspect there are some synchronization issues?
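
A sketch of how the two lists could be compared, assuming both have already been reduced to one identifier per line in the same format. `compare_id_sets` is a hypothetical helper and the toy IDs are placeholders, not real data from either source.

```python
from typing import Dict, Iterable


def compare_id_sets(first: Iterable[str], second: Iterable[str]) -> Dict[str, object]:
    """Summarize the overlap and the differences between two ID lists."""
    a = {x.strip() for x in first if x.strip()}
    b = {x.strip() for x in second if x.strip()}
    return {
        "common": len(a & b),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
    }


# Placeholder IDs standing in for the europepmc and eutils lists.
report = compare_id_sets(["PMID:1", "PMID:2", "PMID:3"], ["PMID:2", "PMID:3", "PMID:4"])
print(report)
```

Looking at the IDs unique to each side (for example, whether they cluster by date) may help tell a synchronization lag apart from a difference in how each source flags retractions.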

@mugitty
Contributor

mugitty commented May 7, 2024

@kltm, @pgaudet, currently the retracted publications file is in docs. Do you want to move into metadata? I understand the contents and format will change.

@kltm
Member

kltm commented May 7, 2024

@pgaudet The file is at #676 (comment)

If you're putting this in, the filename should be generic, like "retracted-publications.txt" or the like. Please also add a note to the README.md in that directory.

mugitty added a commit that referenced this issue May 8, 2024
mugitty added a commit that referenced this issue May 9, 2024
@pgaudet
Contributor Author

pgaudet commented May 29, 2024

@mugitty will check status of this one.

@pgaudet
Contributor Author

pgaudet commented Jul 10, 2024

  • Still not appearing in reports (including @mugitty's local tests)
  • Need SOP to update the file of retracted publications

pgaudet added a commit that referenced this issue Jul 18, 2024
Changed status to status: implemented gorule-0000022
for #676
pgaudet added a commit that referenced this issue Jul 18, 2024
@pgaudet
Contributor Author

pgaudet commented Jul 18, 2024

  • Update retracted-publication file
  • Need to confirm with test from the test-gaf

@pgaudet
Contributor Author

pgaudet commented Jul 18, 2024

Working on snapshot:

### gorule-0000022

Check for, and filter, annotations made to retracted publications

  • total: 3

Messages

  • ERROR - Violates GO Rule:GORULE:0000022: Check for, and filter, annotations made to retracted publications--UniProtKB P55211 CASP9 enables GO:0008233 PMID:20663920 IDA F Caspase-9 CASP9|MCH6 protein taxon:9606 20120309 MGI UniProtKB:P55211

@pgaudet pgaudet closed this as completed Jul 18, 2024

7 participants