This repository has been archived by the owner on Sep 9, 2022. It is now read-only.

Get full text links for closed articles via pubmed? #36

Closed
dwinter opened this issue Mar 16, 2015 · 13 comments
@dwinter
Contributor

dwinter commented Mar 16, 2015

Hi Scott,

My drive to complete rentrez turned up something that might be helpful for fulltext. It turns out the elink endpoint from NCBI can find links to outside providers for a given paper via its PMID.

This example requires the work I've been doing in the elink feature branch (progress being tracked on ropensci/rentrez#39):

rec <- entrez_link(db="pubmed", dbfrom="pubmed", cmd="llinks", id=19822631)
rec
elink object with contents
 $linkouts links to external websites
rec$linkouts
$ID_19822631
$ID_19822631[[1]]
Linkout from HighWire 
 $Url: http://ctj.sagepub.com/cgi ...

$ID_19822631[[2]]
Linkout from EBSCO 
 $Url: http://openurl.ebscohost.c ...

$ID_19822631[[3]]
Linkout from ProQuest 
 $Url: http://gateway.proquest.co ...

$ID_19822631[[4]]
Linkout from COS Scholar Universe 
 $Url: http://www.scholaruniverse ...

$ID_19822631[[5]]
Linkout from Genetic Alliance 
 $Url: http://www.diseaseinfosear ...

$ID_19822631[[6]]
Linkout from MedlinePlus Health Information 
 $Url: http://www.nlm.nih.gov/med ...

The links can include things like ResearchGate.

I don't know if just finding full text links is within the scope of fulltext, but I thought I'd give you a heads up about this now in case it's helpful 😄

@sckott
Contributor

sckott commented Mar 16, 2015

hey @dwinter -

I think getting full text links is a good use case for sure. The rcrossref pkg has a function that does this, or at least attempts to - some publishers don't share the appropriate metadata, so the links that come back can be wrong, have the wrong content types, etc.

anyway, yeah, I think another interface we could have in addition to search and get_full_text is get_link_to_full_text (these aren't actual function names) - perhaps someone would want to get links, then take them elsewhere in R or another language.

@dwinter
Contributor Author

dwinter commented Mar 16, 2015

Hi @sckott,

Cool, the elink branch is almost ready to merge back into master, so it will definitely all be in place for the stable release. Let me know if I can help integrate it into fulltext as and when you get to it.

@sckott
Contributor

sckott commented Mar 16, 2015

Great, sounds good.

Gotta think about this a little bit. That is, should this feature/use case just be a subset within the search interface here, or a separate set of functions? Maybe it makes most sense as a downstream step after searching, so:

  1. user searches, e.g., res <- ft_search(query='ecology', from='plos')
  2. user wants full text links, so e.g., (using a new function) ft_links(res) takes DOIs from the output of the call to ft_search(), or any user-defined subset, and either gives back full text links if they are already in the result metadata, or goes out and tries to get them
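A rough sketch of what step 2 might look like for PubMed, assuming a hypothetical ft_links() wraps the elink lookup shown earlier in this thread (ft_links_sketch is an illustrative name, not a real function):

```r
# Hypothetical sketch -- not the actual fulltext implementation.
# Given DOIs, look up PubMed IDs, then ask the NCBI elink endpoint
# (cmd = "llinks") for links to external full text providers.
library(rentrez)

ft_links_sketch <- function(dois) {
  term <- paste(paste0(dois, "[doi]"), collapse = " OR ")
  pmids <- entrez_search(db = "pubmed", term = term)$ids
  entrez_link(db = "pubmed", dbfrom = "pubmed", cmd = "llinks", id = pmids)
}
```

A real ft_links() would presumably also check whether links are already present in the ft_search() result metadata before making any web requests.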

@sckott sckott self-assigned this Sep 28, 2015
@sckott sckott added this to the v0.1.2 milestone Sep 28, 2015
@sckott
Contributor

sckott commented Sep 29, 2015

@dwinter working on this now. What's the best way to get data for many DOIs? Seems like we can do 10.1371/journal.pone.0086169[doi] OR 10.1016/j.ympev.2010.07.013[doi] and so on for X DOIs, but is that best practice?

@dwinter
Contributor Author

dwinter commented Sep 29, 2015

Yeah -- if you start from DOIs you first have to search to get the PubMed IDs, and this is the best way.

FWIW, when I've been trying to think of ways to automatically generate the search syntax, the best I've come up with is

dois <- c("10.1371/journal.pone.0086169", "10.1016/j.ympev.2010.07.013")
paste(paste0(dois, "[doi]"), collapse=" OR ")
[1] "10.1371/journal.pone.0086169[doi] OR 10.1016/j.ympev.2010.07.013[doi]"

At some point, with very many DOIs the REST URL will get too long. I've never found any documentation for how long is too long, but rentrez should at least pass on a useful error message.

Happy to help on any of this.

@sckott
Contributor

sckott commented Sep 29, 2015

Right, that's the same way I combine the DOIs

Right, the URI too long code, 414 maybe

I could see with text mining use case how one may pass in far too many DOIs for the URI length restriction. Would be great to have a way around this :)

@dwinter
Contributor Author

dwinter commented Sep 29, 2015

So, a bit of trial and error suggests "too long" is around 7000 characters. With DOIs like these, that's about 80 of them.

termify <- function(dois) paste(paste0(dois, "[doi]"), collapse=" OR ") 

entrez_search(db="pubmed", term=termify(rep(dois, 90)))
Error in entrez_check(response) : 
  HTTP failure 414, the request is too large. For large requests, try using web history as described in the tutorial
entrez_search(db="pubmed", term=termify(rep(dois, 80)))
 Search term (as translated):  10.1371/journal.pone.0086169[doi] OR 10.1016/j.ymp ...

So maybe the solution is to check that there aren't more than ~70 or so DOIs, and "chunk" the requests if there are more.

There might also be a problem with searching for multiple IDs -- there is no guarantee that the returned object is going to have the records in the same order as they appear in the search term. The summary records have both the DOI and the PMID, so it is possible to reconstruct that relationship. But there might also be an easier way to bulk-convert DOIs to PMIDs?
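A minimal sketch of the chunking idea, assuming a conservative batch size of 70 DOIs per request (search_chunked is a made-up name):

```r
# Sketch only: split a long DOI vector into batches of at most 70,
# search each batch separately, and pool the returned PubMed IDs,
# keeping each request URL well under the ~7000-character limit.
library(rentrez)

termify <- function(dois) paste(paste0(dois, "[doi]"), collapse = " OR ")

search_chunked <- function(dois, chunk_size = 70) {
  chunks <- split(dois, ceiling(seq_along(dois) / chunk_size))
  unlist(lapply(chunks, function(batch) {
    entrez_search(db = "pubmed", term = termify(batch))$ids
  }), use.names = FALSE)
}
```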

@sckott
Contributor

sckott commented Sep 29, 2015

Thanks for looking into this!

Chunking sounds like the way forward.

There might also be a problem with searching for multiple IDs -- there is no guarantee that the returned object is going to have the records in the same order as they appear in the search term. The summary records have both the DOI and the PMID, so it is possible to reconstruct that relationship

I did notice that - that the returned data isn't the same length as the input, so I can't reliably attach the input DOIs to the output - but I didn't know the DOI was also in there; I'll use that to reconstruct.
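A sketch of that reconstruction, assuming the esummary records expose their identifiers in an articleids table with idtype/value columns (worth verifying against what entrez_summary() actually returns):

```r
# Sketch: fetch summaries for the returned PMIDs and pull the DOI out of
# each record's articleids, so the input DOIs can be matched to PMIDs
# regardless of the order the search returned them in.
library(rentrez)

pmid_to_doi <- function(pmids) {
  summs <- entrez_summary(db = "pubmed", id = pmids,
                          always_return_list = TRUE)
  vapply(summs, function(s) {
    ids <- s$articleids
    doi <- ids$value[ids$idtype == "doi"]
    if (length(doi)) doi[1] else NA_character_
  }, character(1))
}
```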

sckott added a commit that referenced this issue Sep 30, 2015
#36
a few data sources still have no plugin ready yet, so are not available
@sckott
Contributor

sckott commented Sep 30, 2015

@dwinter initial attempt committed, haven't made any changes per our discussion above yet

sckott added a commit that referenced this issue Sep 30, 2015
@sckott
Contributor

sckott commented Oct 1, 2015

Now getting other plugins to work for a suite of publishers

sckott added a commit that referenced this issue Oct 1, 2015
in addition, got more publisher plugins for ft_links working
arxiv and biorxiv still not working yet
fix to some code in plugins_search
@sckott
Contributor

sckott commented Oct 1, 2015

essentially implemented, will open new issues as needed for this fxn

@sckott sckott closed this as completed Oct 1, 2015
@dwinter
Contributor Author

dwinter commented Oct 2, 2015

Nice work! I will check out the details and play around with it this weekend if you want a tester :)

@sckott
Contributor

sckott commented Oct 2, 2015

yeah, plz do
