Adding accuracy analysis results #18

jglev · 2017-12-20T17:02:56Z

This is an in-progress PR -- a branch to push the results of the manual DOI checks as I add them.

jglev · 2017-12-20T17:04:45Z

I started going through the DOI sample this morning, and quickly realized that there are a couple questions that we should agree on before I keep going through them. Specifically:

1. How "user-experience-focused" should I be when doing this?
   I've seen one DOI so far where, when I went to the page that the DOI resolved to (the publisher page), I was told I didn't have access. But when I Googled the article title, I found a ScienceDirect page for the article, which did give me access. (This was one where PennText told me that I did have access.)
   In that case, it makes sense to me to say that we have legal access. But, that may be going a step further than some users would, since the DOI page itself seemed to say that we don't have access.
   Since the underlying question here is "Do we have legal access to the article from on-/off-campus," my inclination is to do whatever it takes to legally find the article -- if the DOI page doesn't have it, then Google it, or sign in if necessary through the publisher page (if my Penn affiliation isn't automatically noted) -- at least for the "on-campus" round of checks (for the "off-campus" round, I wouldn't do any authentication). I had assumed that being on-campus would always automatically grant access, but one DOI so far has made me question that -- it may not be the case that all publishers automatically see where the request is coming from, I'm thinking now.
2. Just so we're on the same page, I'll only mark that we have access to a record if I can actually get a PDF or HTML full-text document. There was one DOI so far that had a "Get PDF" button that looked like it indicated that full-text was available, but then took me to a paywall 😕 Do you have any objections to that metric?

@dhimmel, do you have additional thoughts on these two questions? If not, I'll keep at this throughout the day.

dhimmel · 2017-12-20T20:11:13Z

> I've seen one DOI so far where, when I went to the page that the DOI resolved to (the publisher page), I was told I didn't have access. But when I Googled the article title, I found a ScienceDirect page for the article, which did give me access. (This was one where PennText told me that I did have access.)

Interesting. What's the DOI? Does the publisher's page link to ScienceDirect? My first inclination is to require the access to have resulted from DOI resolution and then following any necessary links. However, I can see how compilations that Penn subscribes to, such as ScienceDirect and JSTOR, could cause problems here. Can we see how many DOIs have these situations and then make a more systematic decision?

> I'll only mark that we have access to a record if I can actually get a PDF or HTML full-text document.

Agreed!

jglev · 2017-12-21T16:13:43Z

The DOI is 10.1017/s1357729800051109. However, looking more closely now, I realize that the ScienceDirect article that came up is by the same authors, with the same first phrase of the title, but is actually a different article! I was skimming yesterday, and didn't even notice it today until the third reading. So, never mind about that. : P

jglev · 2017-12-21T16:58:06Z

This is a progress note so that I won't forget: the DOI 10.1017/s1357729800051109, which resolves to Cambridge University Press, does not give full-text access -- when I try to log in via Shibboleth to UPenn, the CUP site says that the login has failed. The manual version of PennText (linked from the CUP page under "Get access" -> "Check library catalog") seems to indicate that there is full-text access to an earlier DOI in the sample from CUP, but I couldn't figure out how to access it after 5+ minutes of trying, so I think it's reasonable that I marked that full-text is not accessible and moved on.

I went back to our actual XML data from the PennText backend, using this query:

```sql
SELECT *
FROM dois_table
JOIN library_holdings_data
  ON dois_table.database_id = library_holdings_data.doi_foreign_key
WHERE dois_table.doi = '10.1017/s1357729800051109';
```

And the data for that DOI does (correctly) indicate that there is no full-text available.
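
For reference, the same lookup can be run with a bound parameter rather than an inline string, which makes it easy to reuse for other DOIs. This is only a minimal sketch in Python against an in-memory SQLite database: the table and key names come from the query above, but the `full_text_indicator` column, the toy rows, and the use of SQLite are assumptions and do not describe the actual PennText backend.

```python
import sqlite3


def lookup_doi(conn, doi):
    """Return the joined holdings rows for one DOI, using a bound parameter."""
    query = """
        SELECT dois_table.doi, library_holdings_data.full_text_indicator
        FROM dois_table
        JOIN library_holdings_data
          ON dois_table.database_id = library_holdings_data.doi_foreign_key
        WHERE dois_table.doi = ?
    """
    return conn.execute(query, (doi,)).fetchall()


def build_toy_db():
    """Create a tiny in-memory database (assumed schema, toy data only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dois_table (database_id INTEGER PRIMARY KEY, doi TEXT);
        CREATE TABLE library_holdings_data (
            doi_foreign_key INTEGER,
            full_text_indicator INTEGER
        );
        INSERT INTO dois_table VALUES (1, '10.1017/s1357729800051109');
        INSERT INTO library_holdings_data VALUES (1, 0);
    """)
    return conn
```

Calling `lookup_doi(build_toy_db(), "10.1017/s1357729800051109")` returns the single toy row; a DOI that is not in the table returns an empty list.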

This is all to say that I've only gotten through two dozen DOIs, and I'm already confused on occasion about what I as a user have access to.

jglev · 2017-12-21T16:59:13Z

(I made the above note to note that CUP login isn't working, which is either a problem with CUP's site, or one that we in the library should look into. And to note, anecdotally, that in several cases already, I spent several minutes trying to figure out whether I have access to an article.)

jglev · 2017-12-21T17:05:35Z

Another progress note: our automated access checker is looking for just electronic access, I think I can confirm: The XML response for DOI https://doi.org/10.1016/0306-2619(90)90086-s (using the query in the comment above, but with this new DOI) indicates that there is not full-text access, while the manual version of PennText (which uses the same system as our automated checker) indicates that there is access through LIBRA, which is the Library's off-site physical storage center.

dhimmel · 2017-12-21T17:07:40Z

@Publicus yes! it's not always straightforward.

My inclination is to count any login-walled articles as inaccessible. If you're on Penn's network and it wants you to do some sort of laborious login or account setup, then it should be no access... agree?

> our automated access checker is looking for just electronic access

Great!

jglev · 2017-12-21T18:42:58Z

> it's not always straightforward.

Indeed! Goodness.

> My inclination is to count any login-walled articles as inaccessible. If you're on Penn's network and it wants you to do some sort of laborious login or account setup, then it should be no access... agree?

Following my comment yesterday, I am still of the somewhat-opposite opinion, qualified by being within reason. Since we're trying to verify the validity of PennText saying that we do/don't have legal access, I'm fine with authenticating (e.g., to go through a Shibboleth login page if required). What I've found so far, though (I'm 36 DOIs in) is that every time Penn's network hasn't automatically triggered access, trying to login manually through Penn hasn't granted access. So, I think that the initial access page is a good rule-of-thumb indicator, which I think is in line with your comment, yes? But I am still following the idea of making a "good faith" effort to authenticate if it's apparent how to. Does that all sound reasonable to you, as well?

As in my note yesterday, though, this all applies to the on-campus / in-network series of checks. For the off-campus run, I fully agree with you, and won't authenticate at all.

dhimmel · 2017-12-21T20:35:48Z

> But I am still following the idea of making a "good faith" effort to authenticate if it's apparent how to.

I guess if it requests for you to authorize your PennKey (linked from the DOI landing page), that could still be access. It doesn't really make sense since you're inside of Penn's network. However, if it's like create a personal account... then verify that your account uses a Penn email, I think that's out of scope. Or if it requires some sort of login that requires a librarian... that's inaccessible. We concur?

jglev · 2017-12-21T20:44:42Z

Agreed, yes. : )

jglev · 2018-01-03T20:39:54Z

Happy new year!

Ok, I've completed and pushed the manual checks for all 200 DOIs on-campus. I've also set aside several hours tomorrow to do an off-campus check.

For clarity: The on-campus check was done using my laptop, which was hard-wired (i.e., via ethernet) into the network in the Van Pelt library.

jglev · 2018-01-03T20:45:55Z

As you'll see, there are quite a few false negatives (PennText reports no access, but going to the DOI link allows access). So, I'm guessing that our openURL resolver is not always tracking open-access articles (vs. overall subscription date ranges, e.g.). This is something I'll bring up with our dev. team in the Libraries, as it's an area to improve our services.

On that note, I'm thinking now that internally, I do want to see what trends this shows across the open access "colors" (even if color assignment was imperfect). Given these data, once I do that for my own use, would you be interested to see it here, as well? I'm just checking, since we left it that we wouldn't look by color for this project, but this is also more false negatives than I expected.

There are also false positives, but they're (just eyeballing it) a much lower percentage of cases.

My guess is that if this is happening with the openURL resolver that we use at UPenn, it's likely happening at other institutions, as well; either because of the software stack that we're using, or because this points to what is potentially a major difficulty in tracking what one actually has access to.

jglev · 2018-01-03T20:48:23Z

There were two DOIs I marked as "invalid." One (https://doi.org/10.17816/jowd6265-11) didn't resolve, and the other (https://doi.org/10.3892/or.2012.2190) redirected to a publisher home page (https://www.spandidos-publications.com/).
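
The second kind of case (a DOI that resolves, but only to the publisher's home page) could be flagged mechanically by looking at the path of the final URL. A minimal sketch in Python, assuming the final URL has already been obtained by following the DOI redirect; the function name is hypothetical:

```python
from urllib.parse import urlsplit


def landed_on_homepage(final_url):
    """True if a resolved URL points at a site root rather than an article page."""
    parts = urlsplit(final_url)
    # A bare host, or a host followed only by "/", carries no article path.
    return parts.path in ("", "/") and not parts.query
```

With the two URLs mentioned here, `https://www.spandidos-publications.com/` is flagged and `https://www.spandidos-publications.com/10.3892/or.2012.2190` is not.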

dhimmel

Things will be simpler if we can avoid invalid, although there are some decisions we'd have to make to do so. My thoughts inline.

dhimmel · 2018-01-03T20:47:18Z

evaluate_library_access_from_output_tsv/manual-doi-checks.tsv

+10.1002/phbl.19510070201	0	2018-01-03	1		
+10.1515/ijnsns-2011-0005	0	2018-01-03	0		
+10.1515/jnet.1983.8.4.255	0	2018-01-03	0		
+10.3892/or.2012.2190	0	2018-01-03	invalid		


The DOI link https://doi.org/10.3892/or.2012.2190 currently redirects to https://www.spandidos-publications.com/ rather than the article page at https://www.spandidos-publications.com/10.3892/or.2012.2190. I wonder if in this instance, we should just use access status at the publisher URL (even though the DOI redirect is faulty).

dhimmel · 2018-01-03T20:51:13Z

evaluate_library_access_from_output_tsv/manual-doi-checks.tsv

+10.1002/chin.197531174	0	2018-01-03	1		
+10.1002/zaac.19402430401	0	2018-01-03	1		
+10.1029/2011jd016541	0	2018-01-03	1		
+10.17816/jowd6265-11	0	2018-01-03	invalid		


https://doi.org/10.17816/jowd6265-11 redirects to http://journals.eco-vector.com/index.php/jowd/article/viewFile/2617/2229. The issue here appears to be that journals.eco-vector.com is currently down altogether. Perhaps in a day or two the website will be back up and we can recheck this.

dhimmel · 2018-01-03T21:06:40Z

> As you'll see, there are quite a few false negatives (PennText reports no access, but going to the DOI link allows access). So, I'm guessing that our openURL resolver is not always tracking open-access articles (vs. overall subscription date ranges, e.g.).

I agree. I think these articles will mostly be available off-campus as well (i.e. open access).

I'm guessing the PennText tool is most focused on tracking access to toll-access content, so it's not a huge surprise it's unaware that it, by default, has access to much OA literature.

> I'm just checking, since we left it that we wouldn't look by color for this project, but this is also more false negatives than I expected.

I'm interested in off-campus access as the control here. Let's focus the work in this repo on that definition of open access (rather than oaDOI colors). If you do look by oaDOI color, feel free to share a link here. I'd be interested to take a look, but would like to keep it separate for efficiency.

jglev · 2018-01-03T21:12:44Z

> I'm guessing the PennText tool is most focused on tracking access to toll-access content

Agreed, exactly.

> If you do look by oaDOI color, feel free to share a link here. I'd be interested to take a look, but would like to keep it separate for efficiency.

Sure! And to confirm, I'll work tomorrow to finish the off-campus checks before diving into any of the color-based analyses I mentioned.

jglev · 2018-01-04T15:00:12Z

For transparency: In order to achieve off-campus access, rather than actually leaving campus, I'm using SSHuttle to create a transparent SOCKS proxy back to my house (which is not on campus). I have confirmed that this works using DOI 10.1111/j.1550-7408.1962.tb02648.x: When the proxy is not engaged (and I am thus on the campus network), I have access to the article. When the proxy is engaged, I no longer have access to the article. Further, I've confirmed that my IP address is changing to my home IP when the proxy is engaged, by visiting http://ipv4.icanhazip.com/ in my browser.
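
The IP-address part of that confirmation can also be scripted. A minimal sketch in Python, assuming the known on-campus address is supplied by the caller; the function names are hypothetical, and the `fetch` argument exists only so the check can be exercised without network access:

```python
import urllib.request


def current_public_ip(fetch=None, url="http://ipv4.icanhazip.com/"):
    """Return the public IPv4 address reported by the echo service."""
    if fetch is None:
        def fetch(target):
            with urllib.request.urlopen(target, timeout=10) as resp:
                return resp.read().decode("ascii")
    # The service answers with the address followed by a newline.
    return fetch(url).strip()


def proxy_is_engaged(campus_ip, fetch=None):
    """True if the current public IP differs from the known on-campus IP."""
    return current_public_ip(fetch) != campus_ip
```

This only shows that traffic is leaving from a different address; the article-level check with a known subscription-only DOI is still needed to confirm that access actually changes.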

jglev · 2018-01-04T19:11:43Z

Ok, all DOIs are checked both on- and off-campus!

In the off-campus search, there was one different DOI that I marked as "invalid" -- 10.17816/jowd6265-11 worked this time (it didn't in the on-campus check), but 10.3934/dcds.2016103 gave a server error.

jglev · 2018-01-04T19:15:01Z

So, concretely, what do we now want from these data? Does this look accurate to you?

- Using the on-campus check, percentage of false-negatives from PennText
- Using the on-campus check, percentage of false-positives from PennText
- Comparing the on- and off-campus checks, percentage where on-campus had access but off-campus didn't
- Comparing the on- and off-campus checks, percentage where on-campus did not have access but off-campus did
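
The four percentages above could be computed roughly as follows. This is a sketch in Python under stated assumptions: the keys `penntext`, `on_campus`, and `off_campus` are hypothetical names for the PennText call and the two manual checks (each 0 or 1), the denominator is taken to be all checked DOIs, and any rows still marked "invalid" are assumed to have been dropped beforehand.

```python
def percent(rows, predicate):
    """Percentage of rows for which predicate(row) is true (0.0 if no rows)."""
    if not rows:
        return 0.0
    return 100.0 * sum(1 for row in rows if predicate(row)) / len(rows)


def summarize(rows):
    """Compute the four percentages over all checked DOIs.

    Each row is a dict with hypothetical keys 'penntext', 'on_campus',
    and 'off_campus', each holding 0 (no access) or 1 (access).
    """
    return {
        # PennText said "no access", but the on-campus check found access.
        "penntext_false_negative": percent(
            rows, lambda r: r["penntext"] == 0 and r["on_campus"] == 1),
        # PennText said "access", but the on-campus check found none.
        "penntext_false_positive": percent(
            rows, lambda r: r["penntext"] == 1 and r["on_campus"] == 0),
        "on_campus_only": percent(
            rows, lambda r: r["on_campus"] == 1 and r["off_campus"] == 0),
        "off_campus_only": percent(
            rows, lambda r: r["on_campus"] == 0 and r["off_campus"] == 1),
    }
```

Whether the false-negative and false-positive rates should instead be taken relative to only the DOIs with a given PennText call is a separate decision that this sketch does not settle.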

jglev · 2018-01-04T19:18:36Z

It's also worth noting, I think, that some of the publisher pages are very confusing; to the point that I felt it necessary to redo the on-campus checks I'd completed before the New Year just now, because I've learned a lot in the last two days from looking through the publisher pages more. Wiley, for example, often shows a loading page after a user clicks the "Download PDF" button, before then redirecting to a paywall page.

dhimmel

It'd be best if we can avoid "invalid" altogether. See specific comments for what I think we should do for each invalid DOI.

dhimmel · 2018-01-04T20:06:29Z

evaluate_library_access_from_output_tsv/facilitate_going_through_dois_manually.R

+    }
+
+    # Save the changes to the tsv:
+    write.table(


Assuming this command is what caused all the quotes in the TSV, which I tend to dislike. readr::write_tsv defaults to not quote like this.

Oh, no, I think it was LibreOffice Calc. I opened the TSV to check it, and may have pressed Ctrl+S out of habit before closing it again. I'll reset it...

Quotes removed in 3ebe18d.

It may have been both LibreOffice and write.table, though. I'm fine with changing the command, to be consistent.

(Am changing the command now...)
Also, I just saw that you posted this comment yesterday. I didn't see it then; thank you for re-posting it now.

Command switched in 58c348f, and I checked the TSV after writing using it (d0967d7 and d0967d7).
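
For comparison only (the fix in this PR was made in R), the same "no unnecessary quotes" behaviour can be had from Python's standard `csv` module, where `QUOTE_MINIMAL` quotes a field only if it contains the delimiter, the quote character, or a line break. A minimal sketch; the function name simply mirrors `readr::write_tsv` and is not part of this repository:

```python
import csv
import io


def write_tsv(rows, handle):
    """Write rows as tab-separated values, quoting only when strictly needed."""
    writer = csv.writer(
        handle,
        delimiter="\t",
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writerows(rows)


# Example: a DOI containing "/" and "-" is written without surrounding quotes.
buffer = io.StringIO()
write_tsv([["10.3233/jnd-160146", 0, "2018-01-03", 1, "2018-01-04", 1]], buffer)
```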

dhimmel · 2018-01-04T20:10:17Z

evaluate_library_access_from_output_tsv/manual-doi-checks.tsv

+"10.3233/jnd-160146"	0	2018-01-03	"1"	2018-01-04	"1"
+"10.7748/eldc.6.5.41.s39"	0	2018-01-03	"0"	2018-01-04	"0"
+"10.2306/scienceasia1513-1874.2013.39.204"	0	2018-01-03	"1"	2018-01-04	"1"
+"10.3934/dcds.2016103"	0	2018-01-03	"0"	2018-01-04	"invalid"


Can you retry the off-campus check for https://doi.org/10.3934/dcds.2016103? The "server error" is probably a temporary issue.

Done in d0967d7.

dhimmel · 2018-01-04T20:11:32Z

evaluate_library_access_from_output_tsv/manual-doi-checks.tsv

+"10.1002/phbl.19510070201"	0	2018-01-03	"1"	2018-01-04	"1"
+"10.1515/ijnsns-2011-0005"	0	2018-01-03	"0"	2018-01-04	"0"
+"10.1515/jnet.1983.8.4.255"	0	2018-01-03	"0"	2018-01-04	"0"
+"10.3892/or.2012.2190"	0	2018-01-03	"invalid"	2018-01-04	"invalid"


Copying from #18 (comment):

> The DOI link https://doi.org/10.3892/or.2012.2190 currently redirects to https://www.spandidos-publications.com/ rather than the article page at https://www.spandidos-publications.com/10.3892/or.2012.2190. I wonder if in this instance, we should just use access status at the publisher URL (even though the DOI redirect is faulty).

I think in this instance, we should record the access status as it would be if the DOI resolution URL were fixed.

dhimmel · 2018-01-04T20:14:09Z

evaluate_library_access_from_output_tsv/manual-doi-checks.tsv

+"10.1002/chin.197531174"	0	2018-01-03	"1"	2018-01-04	"0"
+"10.1002/zaac.19402430401"	0	2018-01-03	"1"	2018-01-04	"0"
+"10.1029/2011jd016541"	0	2018-01-03	"1"	2018-01-04	"1"
+"10.17816/jowd6265-11"	0	2018-01-03	"invalid"	2018-01-04	"1"


See #18 (comment). The website was temporarily down. Let's redo the on-campus check now that the site is back up.

Sure. Done in 78d1654.

dhimmel · 2018-01-04T20:24:03Z

> So, concretely, what do we now want from these data? Does this look accurate to you?

Let's merge this PR first and calculate stats in a future PR.

The counts of the following categories should be sufficient (from #17 (comment)):

- Available on campus, not available off campus
- Available on campus, available off campus
- Not available on campus, not available off campus
- Not available on campus, available off campus

This will be a sort of confusion matrix, but let's avoid that confusing terminology as well as TP/FP if we can 😄
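
A sketch of how those four counts might be tallied, in Python. It assumes each DOI has been reduced to an `(on_campus, off_campus)` pair of 0/1 values; the function name and the input shape are hypothetical, not taken from the repository.

```python
from collections import Counter

# Plain-language labels for each (on_campus, off_campus) combination.
LABELS = {
    (True, False): "available on campus, not available off campus",
    (True, True): "available on campus, available off campus",
    (False, False): "not available on campus, not available off campus",
    (False, True): "not available on campus, available off campus",
}


def access_category_counts(pairs):
    """Count DOIs in each of the four on-/off-campus categories."""
    counts = Counter(LABELS[(bool(on), bool(off))] for on, off in pairs)
    # Report every category, including those with zero DOIs.
    return {label: counts.get(label, 0) for label in LABELS.values()}
```

Keeping the labels as plain descriptions avoids the TP/FP vocabulary in the output.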

jglev · 2018-01-04T20:40:12Z

At this point, I think I've answered all outstanding comments for this PR. Does that seem correct to you, as well?

dhimmel · 2018-01-04T21:04:41Z

> At this point, I think I've answered all outstanding comments for this PR. Does that seem correct to you

No. 10.3892/or.2012.2190 is still marked as invalid. See #18 (comment)... let's assess access for this article using the URL https://www.spandidos-publications.com/10.3892/or.2012.2190

jglev · 2018-01-04T21:23:24Z

On- and off-campus checks for DOI 10.3892/or.2012.2190 are now updated, as of 8f5871e.

dhimmel

> So, concretely, what do we now want from these data? Does this look accurate to you?

I realize my previous answer was incomplete. Let me think of the best metrics.

Will merge this now!

Wrapped the facilitation script into a function, to make it easier to run (as running all of the code at once was getting caught up in the first readline() call).

14e2a75

Added data for on-campus manual check of all 200 sampled DOIs.

b7365aa

dhimmel reviewed Jan 3, 2018

Jacob Levernier added 2 commits January 4, 2018 13:45

Re-did the on-campus DOI checks I'd done before the new year, both in case subscriptions changed with the calendar year, and because I think that my more recent checks were more accurate, having seen the (often confusing) publisher web pages more.

950dc60

dhimmel reviewed Jan 4, 2018

Removed quotes around cells, which I think was caused by LibreOffice Calc.

3ebe18d

Jacob Levernier added 3 commits January 4, 2018 15:36

Switched to write_tsv from write.table.

58c348f

Re-did off-campus check for DOI 10.3934/dcds.2016103

d0967d7

Re-did on-campus check for DOI 10.17816/jowd6265-11

78d1654

Updated on- and off-campus checks of DOI 10.3892/or.2012.2190 via https://www.spandidos-publications.com/10.3892/or.2012.2190.

8f5871e

dhimmel approved these changes Jan 4, 2018

dhimmel merged commit 46a529c into greenelab:master Jan 4, 2018

dhimmel mentioned this pull request Jan 5, 2018

Accuracy analysis of full_text_indicator calls #15

Closed
