Adding accuracy analysis results #18
… run (as running all of the code at once was getting caught up in the first readline() call).
I started going through the DOI sample this morning, and quickly realized that there are a couple questions that we should agree on before I keep going through them. Specifically:
@dhimmel, do you have additional thoughts on these two questions? If not, I'll keep at this throughout the day.
Interesting. What's the DOI? Does the publisher's page link to ScienceDirect? My first inclination is to require the access to have resulted from DOI resolution and then following any necessary links. However, I can see how compilations that Penn subscribes to, such as ScienceDirect and JSTOR, could cause problems here. Can we see how many DOIs have these situations and then make a more systematic decision?
Agreed!
The DOI is
This is a progress note so that I won't forget: the DOI I went back to our actual XML data from the PennText backend, using this query:
    SELECT * FROM dois_table
    JOIN library_holdings_data ON
        dois_table.database_id = library_holdings_data.doi_foreign_key
    WHERE dois_table.doi = '10.1017/s1357729800051109'
And the data for that DOI does (correctly) indicate that there is no full-text available. This is all to say that I've only gotten through two dozen DOIs, and I'm already confused on occasion about what I as a user have access to.
(I made the note above to record that CUP login isn't working, which is either a problem with CUP's site or one that we in the library should look into. It also records, anecdotally, that in several cases already I spent several minutes trying to figure out whether I have access to an article.)
Another progress note: I think I can confirm that our automated access checker is looking for just electronic access. The XML response for doi
@Publicus yes! it's not always straightforward. My inclination is to count any login-walled articles as inaccessible. If you're on Penn's network and it wants you to do some sort of laborious login or account setup, then it should be no access... agree?
Great!
Indeed! Goodness.
Following my comment yesterday, I am still of the somewhat-opposite opinion, qualified by being within reason. Since we're trying to verify the validity of PennText saying that we do/don't have legal access, I'm fine with authenticating (e.g., going through a Shibboleth login page if required). What I've found so far, though (I'm 36 DOIs in), is that every time Penn's network hasn't automatically triggered access, trying to log in manually through Penn hasn't granted access either. So, I think that the initial access page is a good rule-of-thumb indicator, which I think is in line with your comment, yes? But I am still following the idea of making a "good faith" effort to authenticate if it's apparent how to. Does that all sound reasonable to you, as well? As in my note yesterday, though, this all applies to the on-campus / in-network series of checks. For the off-campus run, I fully agree with you, and won't authenticate at all.
I guess if it asks you to authenticate with your PennKey (linked from the DOI landing page), that could still count as access. It doesn't really make sense, since you're inside of Penn's network. However, if it's something like creating a personal account and then verifying that the account uses a Penn email, I think that's out of scope. Or if it requires some sort of login that requires a librarian, that's inaccessible. Do we concur?
Agreed, yes. : )
Happy new year! Ok, I've completed and pushed the manual checks for all 200 DOIs on-campus. I've also set aside several hours tomorrow to do an off-campus check. For clarity: The on-campus check was done using my laptop, which was hard-wired (i.e., via ethernet) into the network in the Van Pelt library.
As you'll see, there are quite a few false negatives (PennText reports no access, but going to the DOI link allows access). So, I'm guessing that our openURL resolver is not always tracking open-access articles (vs. overall subscription date ranges, e.g.). This is something I'll bring up with our dev. team in the Libraries, as it's an area to improve our services. On that note, I'm thinking now that internally, I do want to see what trends this shows across the open access "colors" (even if color assignment was imperfect). Given these data, once I do that for my own use, would you be interested to see it here, as well? I'm just checking, since we left it that we wouldn't look by color for this project, but there are also more false negatives than I expected. There are also false positives, but they're (just eyeballing it) a much lower percentage of cases. My guess is that if this is happening with the openURL resolver that we use at UPenn, it's likely happening at other institutions, as well; either because of the software stack that we're using, or because this points to what is potentially a major difficulty in tracking what one actually has access to.
There were two DOIs I marked as "invalid." One (https://doi.org/10.17816/jowd6265-11) didn't resolve, and the other (https://doi.org/10.3892/or.2012.2190) redirected to a publisher home page (https://www.spandidos-publications.com/). |
Things will be simpler if we can avoid "invalid", although there are some decisions we'd have to make to do so. My thoughts inline.
10.1002/phbl.19510070201 0 2018-01-03 1
10.1515/ijnsns-2011-0005 0 2018-01-03 0
10.1515/jnet.1983.8.4.255 0 2018-01-03 0
10.3892/or.2012.2190 0 2018-01-03 invalid
The DOI link https://doi.org/10.3892/or.2012.2190 currently redirects to https://www.spandidos-publications.com/ rather than the article page at https://www.spandidos-publications.com/10.3892/or.2012.2190. I wonder if in this instance, we should just use access status at the publisher URL (even though the DOI redirect is faulty).
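If we want to flag this kind of faulty redirect systematically rather than by eye, one option is to check whether the final URL after DOI resolution is just a site root. This is a minimal Python sketch, assuming the final URL has already been recorded (from a browser or an HTTP client); the function name `landed_on_home_page` is hypothetical and not part of this repo.

```python
from urllib.parse import urlsplit


def landed_on_home_page(final_url: str) -> bool:
    """Return True if a resolved DOI ended at a bare site root (no article path)."""
    parts = urlsplit(final_url)
    # A home page has an empty or "/" path and no query string.
    return parts.path in ("", "/") and not parts.query


# The two URLs mentioned above for DOI 10.3892/or.2012.2190:
print(landed_on_home_page("https://www.spandidos-publications.com/"))
print(landed_on_home_page("https://www.spandidos-publications.com/10.3892/or.2012.2190"))
```

The first call reports a home-page landing and the second does not. A check like this only flags candidates for a manual look; it says nothing about whether the article itself is accessible.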
10.1002/chin.197531174 0 2018-01-03 1
10.1002/zaac.19402430401 0 2018-01-03 1
10.1029/2011jd016541 0 2018-01-03 1
10.17816/jowd6265-11 0 2018-01-03 invalid
https://doi.org/10.17816/jowd6265-11 redirects to http://journals.eco-vector.com/index.php/jowd/article/viewFile/2617/2229. The issue here appears to be that journals.eco-vector.com is currently down altogether. Perhaps in a day or two the website will be back up and we can recheck this.
I agree. I think these articles will mostly be available off-campus as well (i.e. open access). I'm guessing the PennText tool is most focused on tracking access to toll-access content, so it's not a huge surprise it's unaware that it, by default, has access to much OA literature.
I'm interested in off-campus access as the control here. Let's focus the work in this repo on that definition of open access (rather than oaDOI colors). If you do look by oaDOI color, feel free to share a link here. I'd be interested to take a look, but would like to keep it separate for efficiency.
Agreed, exactly.
Sure! And to confirm, I'll work tomorrow to finish the off-campus checks before diving into any of the color-based analyses I mentioned.
For transparency: In order to achieve off-campus access, rather than actually leaving campus, I'm using SSHuttle to create a transparent SOCKS proxy back to my house (which is not on campus). I have confirmed that this works using DOI 10.1111/j.1550-7408.1962.tb02648.x: When the proxy is not engaged (and I am thus on the campus network), I have access to the article. When the proxy is engaged, I no longer have access to the article. Further, I've confirmed that my IP address is changing to my home IP when the proxy is engaged, by visiting http://ipv4.icanhazip.com/ in my browser.
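The IP check described above could also be scripted so it can be repeated before each batch of off-campus checks. The following is a rough Python sketch, not what was actually run for this PR: `fetch_public_ip` and `proxy_is_engaged` are hypothetical names, the IP addresses are placeholders from the documentation ranges, and it assumes the campus network presents a single known public IP.

```python
from typing import Callable
from urllib.request import urlopen


def fetch_public_ip(url: str = "http://ipv4.icanhazip.com/") -> str:
    """Ask an IP-echo service which address our requests appear to come from."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("ascii").strip()


def proxy_is_engaged(campus_ip: str, get_ip: Callable[[], str] = fetch_public_ip) -> bool:
    """Return True when the current public IP differs from the known campus IP."""
    return get_ip() != campus_ip


# Placeholder addresses (documentation ranges), with the lookup stubbed out
# so the example runs without network access.
campus_ip = "192.0.2.10"
print(proxy_is_engaged(campus_ip, get_ip=lambda: "203.0.113.7"))
print(proxy_is_engaged(campus_ip, get_ip=lambda: "192.0.2.10"))
```

Calling `proxy_is_engaged(campus_ip)` without the stub would query the same service visited in the browser. Checking a known toll-access DOI, as done above, remains the more direct confirmation.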
…k (using SSHuttle-based SOCKS proxy to off-campus, confirmed by checking IP address and with DOI 10.1111/j.1550-7408.1962.tb02648.x, which is available without the proxy (i.e., on-campus), and not available with the proxy (i.e., off-campus)).
… case subscriptions changed with the calendar year, and because I think that my more recent checks were more accurate, having seen the (often confusing) publisher web pages more.
Ok, all DOIs are checked both on- and off-campus! In the off-campus search, there was one different DOI that I marked as "invalid" --
So, concretely, what do we now want from these data? Does this look accurate to you?
It's also worth noting, I think, that some of the publisher pages are very confusing; to the point that I felt it necessary just now to redo the on-campus checks I'd completed before the New Year, because I've learned a lot in the last two days from looking through the publisher pages more. Wiley, for example, often shows a loading page after a user clicks the "Download PDF" button, before then redirecting to a paywall page.
It'd be best if we can avoid "invalid" altogether. See specific comments for what I think we should do for each invalid DOI.
}

# Save the changes to the tsv:
write.table(
Assuming this command is what caused all the quotes in the TSV, which I tend to dislike. readr::write_tsv defaults to not quote like this.
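To illustrate the quoting difference in question, here is a small sketch using Python's standard `csv` module as an analogue (this is not the R code in the PR, and the helper name `to_tsv` is hypothetical). `csv.QUOTE_NONNUMERIC` quotes every non-numeric field, while `csv.QUOTE_MINIMAL` quotes only fields that need it.

```python
import csv
import io


def to_tsv(rows, quoting):
    """Serialize rows as tab-separated text using the given csv quoting mode."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", quoting=quoting, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# One illustrative row: a DOI string, a numeric flag, and a string result.
rows = [["10.3233/jnd-160146", 0, "1"]]
print(to_tsv(rows, csv.QUOTE_NONNUMERIC), end="")  # string fields wrapped in quotes
print(to_tsv(rows, csv.QUOTE_MINIMAL), end="")     # no quotes, nothing needs escaping
```

The second form is the one that diffs cleanly when the file is later edited by hand or re-saved by another tool.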
Oh, no, I think it was LibreOffice Calc. I opened the TSV to check it, and may have pressed Ctrl+S out of habit before closing it again. I'll reset it...
Quotes removed in 3ebe18d.
It may have been both LibreOffice and write.table, though. I'm fine with changing the command, to be consistent.
(Am changing the command now...)
Also, I just saw that you posted this comment yesterday. I didn't see it then; thank you for re-posting it now.
"10.3233/jnd-160146" 0 2018-01-03 "1" 2018-01-04 "1"
"10.7748/eldc.6.5.41.s39" 0 2018-01-03 "0" 2018-01-04 "0"
"10.2306/scienceasia1513-1874.2013.39.204" 0 2018-01-03 "1" 2018-01-04 "1"
"10.3934/dcds.2016103" 0 2018-01-03 "0" 2018-01-04 "invalid"
Can you retry the off-campus search for https://doi.org/10.3934/dcds.2016103? The "server error" is probably a temporary issue.
Done in d0967d7.
"10.1002/phbl.19510070201" 0 2018-01-03 "1" 2018-01-04 "1"
"10.1515/ijnsns-2011-0005" 0 2018-01-03 "0" 2018-01-04 "0"
"10.1515/jnet.1983.8.4.255" 0 2018-01-03 "0" 2018-01-04 "0"
"10.3892/or.2012.2190" 0 2018-01-03 "invalid" 2018-01-04 "invalid"
Copying from #18 (comment):
The DOI link https://doi.org/10.3892/or.2012.2190 currently redirects to https://www.spandidos-publications.com/ rather than the article page at https://www.spandidos-publications.com/10.3892/or.2012.2190. I wonder if in this instance, we should just use access status at the publisher URL (even though the DOI redirect is faulty).
I think in this instance, we should use the access status we would observe if the DOI resolution URL were fixed.
"10.1002/chin.197531174" 0 2018-01-03 "1" 2018-01-04 "0"
"10.1002/zaac.19402430401" 0 2018-01-03 "1" 2018-01-04 "0"
"10.1029/2011jd016541" 0 2018-01-03 "1" 2018-01-04 "1"
"10.17816/jowd6265-11" 0 2018-01-03 "invalid" 2018-01-04 "1"
See #18 (comment). The website was temporarily down. Let's redo the on-campus check now that the site is back up.
Sure. Done in 78d1654.
Let's merge this PR first and calculate stats in a future PR. The counts of the following categories should be sufficient (from #17 (comment)):
This will be a sort of confusion matrix, but let's avoid that confusing terminology as well as TP/FP if we can 😄
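As a starting point for that future PR, the counts could be tallied per combination of the three access indicators, without attaching TP/FP labels. This is only a sketch in Python: the field names `penntext`, `on_campus`, and `off_campus` are assumptions about what the TSV columns mean (not taken from the file header), and the five rows are copied from the diff excerpts in this thread purely for illustration.

```python
from collections import Counter


def tally_access(rows):
    """Count DOIs for each (PennText, on-campus, off-campus) combination."""
    return Counter((r["penntext"], r["on_campus"], r["off_campus"]) for r in rows)


# Illustrative rows only; field names are assumed, values copied from the diffs above.
rows = [
    {"doi": "10.3233/jnd-160146", "penntext": "0", "on_campus": "1", "off_campus": "1"},
    {"doi": "10.7748/eldc.6.5.41.s39", "penntext": "0", "on_campus": "0", "off_campus": "0"},
    {"doi": "10.1002/chin.197531174", "penntext": "0", "on_campus": "1", "off_campus": "0"},
    {"doi": "10.1002/zaac.19402430401", "penntext": "0", "on_campus": "1", "off_campus": "0"},
    {"doi": "10.1029/2011jd016541", "penntext": "0", "on_campus": "1", "off_campus": "1"},
]

counts = tally_access(rows)
for (penntext, on_campus, off_campus), n in sorted(counts.items()):
    print(f"PennText={penntext} on-campus={on_campus} off-campus={off_campus}: {n}")
```

Keeping the raw combinations means any remaining "invalid" values show up as their own rows instead of being silently folded into a category.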
At this point, I think I've answered all outstanding comments for this PR. Does that seem correct to you, as well?
No.
On- and off-campus checks for DOI
So, concretely, what do we now want from these data? Does this look accurate to you?
I realize my previous answer was incomplete. Let me think of the best metrics.
Will merge this now!
This is an in-progress PR -- a branch to push the results of the manual DOI checks as I add them.