Implement gorule-0000022 QC check for annotations to retracted publications #676
Comments
This should possibly be related to the blacklist system. |
@pgaudet I think we need to touch base on this as "low-hanging" fruit--there is a bit more here as we need to draw in external APIs, etc. |
@pgaudet "there is a file of all retracted PMIDs that is available for download" This would be easier than using the API over all PMIDs |
@cmungall, what is the link to the file with the retracted PMIDs that is available for download? I did not know one existed. When I looked into the NCBI API that would return the list of retracted PMIDs, it would only return a maximum of 10K records. There are ~20k retracted publications. Retrieving all of the retracted publications requires downloading an application. |
Not to derail this conversation, and it's fine to work out where potential resources are, but I'd still like to touch base with @pgaudet on this before proceeding. |
@mugitty - just paginate til you get them all? What do you mean downloading an application? |
@cmungall, the URL to retrieve the retracted publications is: Note, the response indicates that there are 19582 retracted publications. The URL can be modified to return more than 20 records by specifying the max parameter: When the URL is updated to retrieve the 10000th record and greater: The system responds with error message: This is not just NCBI. In the "absence of large data stores and web servers", this model of supporting "HIGH VOLUME" pagination with "LARGE DATASETS" via API is unsustainable. There is a way to retrieve all the retracted publications, but it is not via the API. At a minimum, this list of retracted publications has to be updated with every "GO release". This will add to the release process overhead (tagging @kltm). The list of retracted publications can be stored as YAML, JSON, etc. The GO rule validator would have to parse it and keep it in memory as a hash set (negligible impact) to cross-check against the references. |
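A minimal sketch of the in-memory cross-check described above, assuming the list is stored as plain text with one CURIE per line. The names `load_retracted` and `retracted_refs` are hypothetical and are not part of ontobio; the PMIDs are made up for illustration.

```python
import io
from typing import Iterable, List, Set, TextIO


def load_retracted(handle: TextIO) -> Set[str]:
    """Read one CURIE per line into a set, skipping blank lines and '#' comments."""
    retracted = set()
    for line in handle:
        curie = line.strip()
        if curie and not curie.startswith("#"):
            retracted.add(curie)
    return retracted


def retracted_refs(references: Iterable[str], retracted: Set[str]) -> List[str]:
    """Return the references found in the retracted set, in input order."""
    return [ref for ref in references if ref in retracted]


# Illustrative data only (assumption): these PMIDs are invented.
sample_file = io.StringIO("PMID:111\nPMID:222\n\n# comment\n")
retracted = load_retracted(sample_file)
hits = retracted_refs(["PMID:222", "GO_REF:0000033", "PMID:999"], retracted)
```

Set membership is constant time on average, so even ~20k entries should add little to a validation run.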
Talking to @pgaudet this morning, and following up with conversation from last week with @mugitty, while this is certainly something we want to do, it's no longer in the immediate TODO list for this specific project. We'll keep it open in this project, as we want to make sure we have eyes on it for how to proceed, but we want to make sure we plan this out for stability and consistency given how our pipeline currently is working. |
We also have to keep in mind ToS of eutils, etc. Ideally, we would be able to grab an upstream file and simply filter against it. Second best is making that upstream file ourselves and maintaining it as best we can. |
Check out SemOpenAlex: https://semopenalex.org/ You can query for retracted PMIDs: https://api.triplydb.com/s/RpkYEr-qN |
@pgaudet, https://semopenalex.org/ was back up today. I ran the following query for retracted PMIDs. This site only gives 7021 results, whereas we got over 20000 results on 20240321. The command I used to retrieve the results is: or open a browser to https://yasgui.triply.cc/#, enter the following query: and select page size 'All'. I have attached the output from the query. |
I think I want to hammer out exact availability and frequency here. Naturally, once we have the file worked out, it will be made available statically in the pipeline (where ontobio will run). Generally speaking, when we start working with external resources on internal systems, we want to make sure that expectations and use are hammered out. (Typically, these kinds of things would be hammered out in the "architecture" portion of project planning, but since this has become a bit of a "rolling" project, we haven't had a chance to do that this time. I just want to follow through on this part.) |
@mugitty just a note: for format and "line per pub", the formal format we'll be targeting will be CURIEs, not internal IDs. |
Yes |
Okay, quick play here, I think I have something with: retracted-publications.txt. I pulled this out with:
esearch -db pubmed -query "Retracted Publication [pt]" | efetch -format pubmed > /tmp/pubmed-retracted.xml
cat /tmp/pubmed-retracted.xml | grep -oh ">[0-9]*<\/PMID>" | sort | uniq | cut -d '>' -f 2 | cut -d '<' -f 1 | sed 's/^/PMID:/'
If I were to do this again, I might try a different command, which would make the retractions more clear (I'm not 100% sure above, which is why I haven't committed it to the repo yet).
esearch -db pubmed -query "Retracted Publication [pt]" | efetch -format pubmed -mode asn.1 > /tmp/pubmed-retracted.txt
However, I seem to have hit some kind of query limit; best to try again later. |
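For comparison, the grep/sort/uniq/cut/sed step above can be sketched in Python. This is an approximate equivalent rather than the method used to build the file; `pmid_curies` is a hypothetical name and the sample text is invented.

```python
import re
from typing import List

# Matches the tail of a <PMID ...>12345</PMID> element, like the grep pattern above.
PMID_TAIL = re.compile(r">([0-9]+)</PMID>")


def pmid_curies(xml_text: str) -> List[str]:
    """Return sorted, de-duplicated PMID CURIEs found anywhere in the text."""
    ids = {match.group(1) for match in PMID_TAIL.finditer(xml_text)}
    return sorted("PMID:" + pmid for pmid in ids)


# Illustrative input only (assumption): a fragment with a repeated ID.
sample = (
    '<PMID Version="1">222</PMID>'
    '<PMID Version="1">111</PMID>'
    '<PMID Version="1">222</PMID>'
)
curies = pmid_curies(sample)
```

Like the shell pipeline, this collects every PMID element it sees. If a record nests PMID elements for other, related articles, those would be picked up too, which may be one reason to be cautious about the resulting file.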
Okay, I'm not liking my file here. I think reprocessing Pascale's above is a good choice for now.
|
@kltm, do you want to add it to metadata somewhere for now, so I can work off of it? Thanks |
From a conversation with @mugitty , I wanted to clarify the current state.
|
Added file for gorule-0000022 for #676
@mugitty The europe-pmc-retracted.txt file is here:
Moved to a better location: https://github.com/geneontology/go-site/blob/master/metadata/retracted-publications.txt Updated today. I made a note in my calendar to update it monthly. Thanks, Pascale |
@pgaudet Maybe next week we can work out my third point here? #676 (comment) |
@kltm sure But for now -
@mugitty and I figured that if the europepmc file has most of the content then this is better than having no check at all. If you have the complete file from eutils I can compare them; we suspect there are some synchronization issues? |
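A rough sketch of how the two lists could be compared, assuming both are plain text with one CURIE per line. The function names and the sample IDs are hypothetical.

```python
from typing import Dict, Iterable, Set


def read_ids(lines: Iterable[str]) -> Set[str]:
    """Collect non-empty, whitespace-trimmed lines into a set."""
    return {line.strip() for line in lines if line.strip()}


def compare_lists(europepmc_lines: Iterable[str], eutils_lines: Iterable[str]) -> Dict:
    """Summarize which IDs appear in only one of the two sources."""
    europepmc = read_ids(europepmc_lines)
    eutils = read_ids(eutils_lines)
    return {
        "only_europepmc": sorted(europepmc - eutils),
        "only_eutils": sorted(eutils - europepmc),
        "shared_count": len(europepmc & eutils),
    }


# Illustrative data only (assumption): invented PMIDs.
report = compare_lists(
    ["PMID:111\n", "PMID:222\n"],
    ["PMID:222\n", "PMID:333\n", "\n"],
)
```

In practice the arguments could be open file handles, since iterating a text file yields its lines.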
@pgaudet The file is at #676 (comment) If you're putting this in, the filename should be generic, like "retracted-publications.txt" or the like. As well, adding a note to the README.md in that directory. |
@mugitty will check status of this one. |
|
Changed status to status: implemented gorule-0000022 for #676
|
Working on snapshot: ###gorule-0000022 Check for, and filter, annotations made to retracted publications
Messages
|
From @vanaukenk on December 19, 2016 14:32
Hi,
Following on from a help desk ticket:
http://jira.geneontology.org/browse/GO-1431
Can we explore adding a QC check for annotations to retracted publications?
A possible approach might be:
PubMed indexes retracted publications in the PublicationTypeList tag. Here's an example:
<PublicationTypeList>
  <PublicationType UI="D016428">Journal Article</PublicationType>
  <PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
  <PublicationType UI="D013486">Research Support, U.S. Gov't, Non-P.H.S.</PublicationType>
  <PublicationType UI="D016441">Retracted Publication</PublicationType>
</PublicationTypeList>
Perhaps implementing a periodic query to PubMed for articles with Type "Retracted Publication" and then checking those PMIDs against the PMIDs in the GO database would work.
Thx.
Copied from original issue: geneontology/go-annotation#1479
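The approach suggested in this issue, picking out records whose PublicationTypeList includes "Retracted Publication" (UI D016441), could be sketched as below. This is an illustration under assumptions: the records are taken to be already downloaded as PubMed XML, the sample document is invented, and `retracted_pmids` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import List

RETRACTED_UI = "D016441"  # "Retracted Publication", as quoted in the issue

# Illustrative input only (assumption): two invented records.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID Version="1">111</PMID>
    <Article><PublicationTypeList>
      <PublicationType UI="D016428">Journal Article</PublicationType>
      <PublicationType UI="D016441">Retracted Publication</PublicationType>
    </PublicationTypeList></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID Version="1">222</PMID>
    <Article><PublicationTypeList>
      <PublicationType UI="D016428">Journal Article</PublicationType>
    </PublicationTypeList></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""


def retracted_pmids(xml_text: str) -> List[str]:
    """Return CURIEs for records tagged with the retracted publication type."""
    root = ET.fromstring(xml_text)
    found = []
    for article in root.iter("PubmedArticle"):
        types = {pt.get("UI") for pt in article.iter("PublicationType")}
        # Only the record's own PMID, directly under MedlineCitation, is used.
        pmid = article.findtext("MedlineCitation/PMID")
        if RETRACTED_UI in types and pmid:
            found.append("PMID:" + pmid.strip())
    return found


result = retracted_pmids(SAMPLE)
```

Reading the PMID from a fixed path, rather than matching every PMID element, avoids collecting IDs of other articles that a record might mention.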
This corresponds to GAF column 6 / GPAD 1.1/2.0 column 5
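As a sketch of what checking GAF column 6 might look like, assuming tab-separated GAF lines with pipe-separated references in that column. The helper names and the sample annotation are hypothetical, and this is not the ontobio implementation.

```python
from typing import List, Set


def gaf_references(gaf_line: str) -> List[str]:
    """Return the pipe-separated references from GAF column 6 (index 5)."""
    if gaf_line.startswith("!"):  # header/comment lines carry no annotation
        return []
    cols = gaf_line.rstrip("\n").split("\t")
    if len(cols) <= 5 or not cols[5]:
        return []
    return cols[5].split("|")


def has_retracted_reference(gaf_line: str, retracted: Set[str]) -> bool:
    """True if any reference on the line is in the retracted set."""
    return any(ref in retracted for ref in gaf_references(gaf_line))


# Illustrative data only (assumption): an invented, truncated annotation line.
line = "\t".join(
    ["UniProtKB", "P12345", "GENE1", "", "GO:0005515",
     "PMID:111|GO_REF:0000033", "IDA", "", "F"]
)
flagged = has_retracted_reference(line, {"PMID:111"})
```

Whether an annotation with a mix of retracted and non-retracted references should be filtered is a policy question this thread does not settle; the sketch simply flags any match.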