Extract DOI from the current web page URL #1799
Shouldn't we return |
Do we want to change this behavior for DOIs that are scraped from the document too? Why was it then set to return
No. In the page we don't know if it describes the main item for the page. In the URL we do. |
If a random string that looks like a DOI is extracted from a URL, it would prevent any further DOIs from being extracted from the document. This means that if we want to return a single item, we first have to resolve the DOI metadata to make sure it's valid, and if it isn't, extract and resolve items from the document. The current commit just puts all DOIs into one list, where all items are resolved and the user just needs to select the one they think is correct. To resolve the DOI extracted from the URL separately from those in the document, we would probably need serious modifications to the translator.
DOI.js
var dois = [], m;

// Extract DOIs from the current URL
var rx = /10.[0-9]{4,}?\/[^\s&"'?#,]*[^\s&"'?#\/.,]/g;
We usually use `re` for this, not `rx`.
Also, seems like we don't need to exclude quotes in this context.
And maybe we do need to URL decode? I'm not sure whether a URL that passes through here is necessarily decoded.
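On the decoding question, here is a minimal sketch of decoding defensively before running the regex. `safeDecode` is a hypothetical helper, not something in the translator, and whether the URL arrives already decoded is still the open question; the sample URL is one of the Wiley examples from this thread.

```javascript
// Hypothetical helper: decode percent-escapes, falling back to the raw
// string when the input contains a malformed escape sequence.
function safeDecode(url) {
	try {
		return decodeURIComponent(url);
	}
	catch (e) {
		// decodeURIComponent throws a URIError on malformed input such as "100%"
		return url;
	}
}

var raw = "http://onlinelibrary.wiley.com/journal/10.1111/%28ISSN%291470-9856/issues";
var decoded = safeDecode(raw);
// decoded now contains "10.1111/(ISSN)1470-9856/issues"
```

If the URL is already decoded when it reaches the translator, a second decode is normally harmless, but it would alter a DOI that legitimately contains a literal percent sequence, so this would need checking.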
b1a50c0
to
92f8713
Compare
So as I said previously, it's not that rare to encounter web page URLs that contain a DOI, but we can't extract it reliably. For example:
URL: http://iopscience.iop.org/article/10.1088/0004-637X/768/1/87
URL: http://journal.frontiersin.org/article/10.3389/fmicb.2014.00402/full
And we can't do anything about this. A DOI can be extracted incorrectly not only from the web page URL, but also from URLs found in the body.
For DOI(s) extracted from a web page URL there are a few possible outcomes:
So the translator should work like this, depending on where and how many DOIs were found:
Is this what you meant? These are the same.
We could try stripping some likely suffixes, like |
given the most common academic CMS's, (edit: removed the ones already covered by the regex) |
Yeah, I just wanted to demonstrate the difference between the two URLs.
This idea is worth investigating, but there can be many variants. More examples:
I think you could also test here whether the DOI in the body is the beginning of the one in the URL. This would IMO be more stable and would do (at least for all the examples here) the same as deleting
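A rough sketch of that prefix test, assuming the URL candidate may carry trailing path segments such as `/full`. `bodyDoiMatchesUrl` is a hypothetical name, and `10.1000/xyz` is a made-up DOI used only as a non-matching example.

```javascript
// Hypothetical helper: true when the DOI found in the body is a prefix of
// the (possibly over-captured) DOI candidate taken from the URL.
// The comparison is case-insensitive.
function bodyDoiMatchesUrl(bodyDOI, urlCandidate) {
	return urlCandidate.toLowerCase().startsWith(bodyDOI.toLowerCase());
}

var frontiers = bodyDoiMatchesUrl("10.3389/fmicb.2014.00402", "10.3389/fmicb.2014.00402/full"); // true
var olh = bodyDoiMatchesUrl("10.16995/olh.46", "10.16995/olh.46"); // true
var unrelated = bodyDoiMatchesUrl("10.1000/xyz", "10.3389/fmicb.2014.00402/full"); // false
```

A plain prefix test also accepts any shorter DOI that happens to begin the longer one, so it is only a starting point.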
How about this case: there are multiple DOIs in the body and one in the URL, which matches the first one in the body. Should we then just go for that one, i.e. handle it as a single item? E.g. https://olh.openlibhums.org/article/10.16995/olh.46/ (there is another DOI in the reference section at the bottom). The same would be true for the Frontiers paper. The drawback is that it would then no longer be possible to save all references with DOIs, instead of the article, from pages like https://www.frontiersin.org/articles/10.3389/fmicb.2014.00402/full#h12 . But is this really a "feature" we need? (Maybe a hidden preference to toggle it on/off would be enough.) CC @adam3smith
{
	"type": "web",
	"url": "https://zotero.org/?d=10.7208/chicago/9780226924632.001.0001",
	"items": "multiple"
}
Can you also add https://onlinelibrary.wiley.com/doi/abs/10.1002/(SICI)1099-1360(199711)6:6%3C320::AID-MCDA164%3E3.0.CO;2-2 as a test case?
Yes.
In #1092 (comment) I suggested a runtime flag:
So it would be something like |
Yeah, the partial DOI matching sounds like a good idea. We just have to keep in mind that sometimes the journal itself can have a DOI which is part of the journal article's DOI, but if there is only one DOI in the body that would be very unlikely.
DOI position in the body is not a very reliable metric. Some more advanced websites list metadata for additional articles:
But generally, I agree that if a DOI from the URL can be reliably matched with a DOI in the body, whether there are one or more of them, we should return a single item. |
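One way this rule could look, as a sketch only: return a single DOI when exactly one distinct body DOI is confirmed by a URL candidate, and otherwise fall back to letting the user choose. `pickMainDOI` and the `10.1000/...` DOIs are hypothetical and not part of the translator.

```javascript
// Hypothetical decision helper.
// urlCandidates: strings matched in the page URL (possibly over-captured).
// bodyDOIs: DOIs found in the document.
// Returns the single body DOI confirmed by the URL, or null when the match
// is missing or ambiguous and the user should pick from the full list.
function pickMainDOI(urlCandidates, bodyDOIs) {
	var matches = bodyDOIs.filter(function (doi) {
		return urlCandidates.some(function (candidate) {
			return candidate.toLowerCase().startsWith(doi.toLowerCase());
		});
	});
	// Deduplicate, since the same DOI may appear several times in the body
	var unique = matches.filter(function (doi, i) {
		return matches.indexOf(doi) === i;
	});
	return unique.length === 1 ? unique[0] : null;
}

// One body DOI is confirmed by the URL, the other is a reference
var single = pickMainDOI(
	["10.3389/fmicb.2014.00402/full"],
	["10.3389/fmicb.2014.00402", "10.1000/example.1"]
);

// A journal-level DOI and an article DOI both match: ambiguous
var ambiguous = pickMainDOI(
	["10.1000/abc.123"],
	["10.1000/abc", "10.1000/abc.123"]
);
```

Returning `null` in the ambiguous case keeps the current multiple-item behavior for pages where a journal DOI is part of the article DOI.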
From some URLs the DOI can't be extracted correctly, e.g.
http://onlinelibrary.wiley.com/journal/10.1111/%28ISSN%291470-9856/issues
would result in 10.1111/%28ISSN%291470-9856/issues instead of 10.1111/%28ISSN%291470-9856.
Some URLs can also contain multiple DOIs, e.g.
http://api.crossref.org/works/?filter=doi:10.1117/3.1002595.ch10,doi:10.3403/00522251u,doi:10.3403/00522251,doi:10.3403/30217493,doi:10.3403/30289582,doi:10.1117/12.939903,doi:10.3403/02454346u,doi:10.1364/ofc.1979.thf1,doi:10.5772/7558,doi:10.3758/BF03202760,doi:10.3758/bf03195760,doi:10.1006/jmla.1997.2532,doi:10.1037/h0082866
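Both cases can be reproduced with a small sketch. The pattern below is a variation of the one quoted in this review, with the dot after `10` escaped; `extractDOICandidates` is a hypothetical helper, and the Crossref URL is shortened to its first three DOIs for readability.

```javascript
// Variation of the pattern under review, with the dot after "10" escaped
var doiRe = /10\.[0-9]{4,}?\/[^\s&"'?#,]*[^\s&"'?#\/.,]/g;

// Hypothetical helper: return every DOI-like string found in a URL
function extractDOICandidates(url) {
	// match() with a global regex returns all matches, or null if none
	return url.match(doiRe) || [];
}

// Over-capture: the trailing "/issues" path segment ends up in the match
var wiley = extractDOICandidates(
	"http://onlinelibrary.wiley.com/journal/10.1111/%28ISSN%291470-9856/issues"
);

// Multiple DOIs: the comma exclusion in the pattern splits the filter list
var crossref = extractDOICandidates(
	"http://api.crossref.org/works/?filter=doi:10.1117/3.1002595.ch10,doi:10.3403/00522251u,doi:10.3403/00522251"
);
```

Here `wiley` holds the single over-captured candidate described above, and `crossref` holds three separate DOIs.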