Adding accuracy analysis results #18

@jglev jglev commented Dec 20, 2017

This is an in-progress PR -- a branch to push the results of the manual DOI checks as I add them.

… run (as running all of the code at once was getting caught up in the first readline() call).
Author

jglev commented Dec 20, 2017

I started going through the DOI sample this morning, and quickly realized that there are a couple of questions that we should agree on before I keep going through them. Specifically:

  1. How "user-experience-focused" should I be when doing this?
    I've seen one DOI so far where, when I went to the page that the DOI resolved to (the publisher page), I was told I didn't have access. But when I Googled the article title, I found a ScienceDirect page for the article, which did give me access. (This was one where PennText told me that I did have access.)
    In that case, it makes sense to me to say that we have legal access. But, that may be going a step further than some users would, since the DOI page itself seemed to say that we don't have access.
    Since the underlying question here is "Do we have legal access to the article from on-/off-campus," my inclination is to do whatever it takes to legally find the article -- if the DOI page doesn't have it, then Google it, or sign in if necessary through the publisher page (if my Penn Affiliation isn't automatically noted) -- at least for the "on-campus" round of checks (for the "off-campus" round, I wouldn't do any authentication). I had assumed that being on-campus would always automatically grant access, but one DOI so far has made me question that -- it may not be the case that all publishers automatically see where the request is coming from, I'm thinking now.
  2. Just so we're on the same page, I'll only mark that we have access to a record if I can actually get a PDF or HTML full-text document. There was one DOI so far that had a "Get PDF" button that looked like it indicated that full-text was available, but then took me to a paywall 😕 Do you have any objections to that metric?

@dhimmel, do you have additional thoughts on these two questions? If not, I'll keep at this throughout the day.

Contributor

dhimmel commented Dec 20, 2017

I've seen one DOI so far where, when I went to the page that the DOI resolved to (the publisher page), I was told I didn't have access. But when I Googled the article title, I found a ScienceDirect page for the article, which did give me access. (This was one where PennText told me that I did have access.)

Interesting. What's the DOI? Does the publisher's page link to ScienceDirect? My first inclination is to require the access to have resulted from DOI resolution and then following any necessary links. However, I can see how compilations that Penn subscribes to, such as ScienceDirect and JSTOR, could cause problems here. Can we see how many DOIs have these situations and then make a more systematic decision?

I'll only mark that we have access to a record if I can actually get a PDF or HTML full-text document.

Agreed!


jglev commented Dec 21, 2017

The DOI is 10.1017/s1357729800051109. However, looking more closely now, I realize that the ScienceDirect article that came up is by the same authors, with the same first phrase of the title, but is actually a different article! I was skimming yesterday, and didn't even notice it today until the third reading. So, never mind about that. : P


jglev commented Dec 21, 2017

This is a progress note so that I won't forget: the DOI 10.1017/s1357729800051109, which resolves to Cambridge University Press, does not give full-text access -- when I try to log in via Shibboleth to UPenn, the CUP site says that the login has failed. The manual version of PennText (linked from the CUP page under "Get access" -> "Check library catalog") seems to indicate that there is full-text access to an earlier DOI in the sample from CUP, but I couldn't figure out how to access it after 5+ minutes of trying, so I think it's reasonable that I marked full text as not accessible and moved on.

I went back to our actual XML data from the PennText backend, using this query:

SELECT * FROM dois_table
JOIN library_holdings_data ON
dois_table.database_id = library_holdings_data.doi_foreign_key
WHERE dois_table.doi = '10.1017/s1357729800051109';

And the data for that DOI does (correctly) indicate that there is no full-text available.
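
For anyone repeating this lookup across many DOIs, here is a minimal sketch of the same join with the DOI passed as a parameter rather than pasted into the query text. It uses Python's sqlite3 with an in-memory database purely for illustration; the `full_text_available` column, the sample row, and the function name are hypothetical stand-ins, since the real PennText backend schema isn't shown in this thread.

```python
import sqlite3


def holdings_for_doi(conn, doi):
    """Return joined DOI/holdings rows for a single DOI (parameterized lookup)."""
    query = """
        SELECT dois_table.doi, library_holdings_data.full_text_available
        FROM dois_table
        JOIN library_holdings_data
          ON dois_table.database_id = library_holdings_data.doi_foreign_key
        WHERE dois_table.doi = ?
    """
    return conn.execute(query, (doi,)).fetchall()


# Hypothetical in-memory stand-in for the real backend tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dois_table (database_id INTEGER PRIMARY KEY, doi TEXT);
    CREATE TABLE library_holdings_data (
        doi_foreign_key INTEGER,
        full_text_available INTEGER
    );
    INSERT INTO dois_table VALUES (1, '10.1017/s1357729800051109');
    INSERT INTO library_holdings_data VALUES (1, 0);
""")

rows = holdings_for_doi(conn, "10.1017/s1357729800051109")
print(rows)
```

Passing the DOI as a bound parameter sidesteps quoting problems with DOIs that contain parentheses or other punctuation.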

This is all to say that I've only gotten through two dozen DOIs, and I'm already confused on occasion about what I as a user have access to.


jglev commented Dec 21, 2017

(I made the above note to note that CUP login isn't working, which is either a problem with CUP's site, or one that we in the library should look into. And to note, anecdotally, that in several cases already, I spent several minutes trying to figure out whether I have access to an article.)


jglev commented Dec 21, 2017

Another progress note: I think I can confirm that our automated access checker is looking for just electronic access. The XML response for DOI https://doi.org/10.1016/0306-2619(90)90086-s (using the query in the comment above, but with this new DOI) indicates that there is no full-text access, while the manual version of PennText (which uses the same system as our automated checker) indicates that there is access through LIBRA, which is the Library's off-site physical storage center.


dhimmel commented Dec 21, 2017

@Publicus yes! it's not always straightforward.

My inclination is to count any login-walled articles as inaccessible. If you're on Penn's network and it wants you to do some sort of laborious login or account setup, then it should be no access... agree?

our automated access checker is looking for just electronic access

Great!


jglev commented Dec 21, 2017

it's not always straightforward.

Indeed! Goodness.

My inclination is to count any login-walled articles as inaccessible. If you're on Penn's network and it wants you to do some sort of laborious login or account setup, then it should be no access... agree?

Following my comment yesterday, I am still of the somewhat-opposite opinion, qualified by being within reason. Since we're trying to verify the validity of PennText saying that we do/don't have legal access, I'm fine with authenticating (e.g., to go through a Shibboleth login page if required). What I've found so far, though (I'm 36 DOIs in), is that every time Penn's network hasn't automatically triggered access, trying to log in manually through Penn hasn't granted access. So, I think that the initial access page is a good rule-of-thumb indicator, which I think is in line with your comment, yes? But I am still following the idea of making a "good faith" effort to authenticate if it's apparent how to. Does that all sound reasonable to you, as well?

As in my note yesterday, though, this all applies to the on-campus / in-network series of checks. For the off-campus run, I fully agree with you, and won't authenticate at all.


dhimmel commented Dec 21, 2017

But I am still following the idea of making a "good faith" effort to authenticate if it's apparent how to.

I guess if it requests for you to authorize your PennKey (linked from the DOI landing page), that could still be access. It doesn't really make sense since you're inside of Penn's network. However, if it's like create a personal account... then verify that your account uses a Penn email, I think that's out of scope. Or if it requires some sort of login that requires a librarian... that's inaccessible. We concur?


jglev commented Dec 21, 2017

Agreed, yes. : )


jglev commented Jan 3, 2018

Happy new year!

Ok, I've completed and pushed the manual checks for all 200 DOIs on-campus. I've also set aside several hours tomorrow to do an off-campus check.

For clarity: The on-campus check was done using my laptop, which was hard-wired (i.e., via ethernet) into the network in the Van Pelt library.


jglev commented Jan 3, 2018

As you'll see, there are quite a few false negatives (PennText reports no access, but going to the DOI link allows access). So, I'm guessing that our openURL resolver is not always tracking open-access articles (vs. overall subscription date ranges, e.g.). This is something I'll bring up with our dev. team in the Libraries, as it's an area to improve our services.

On that note, I'm thinking now that internally, I do want to see what trends this shows across the open access "colors" (even if color assignment was imperfect). Given these data, once I do that for my own use, would you be interested to see it here, as well? I'm just checking, since we left it that we wouldn't look by color for this project, but this is also more false negatives than I expected.

There are also false positives, but they're (just eyeballing it) a much lower percentage of cases.

My guess is that if this is happening with the openURL resolver that we use at UPenn, it's likely happening at other institutions, as well; either because of the software stack that we're using, or because this points to what is potentially a major difficulty in tracking what one actually has access to.


jglev commented Jan 3, 2018

There were two DOIs I marked as "invalid." One (https://doi.org/10.17816/jowd6265-11) didn't resolve, and the other (https://doi.org/10.3892/or.2012.2190) redirected to a publisher home page (https://www.spandidos-publications.com/).

@dhimmel dhimmel left a comment

Things will be simpler if we can avoid invalid, although there are some decisions we'd have to make to do so. My thoughts inline.

10.1002/phbl.19510070201 0 2018-01-03 1
10.1515/ijnsns-2011-0005 0 2018-01-03 0
10.1515/jnet.1983.8.4.255 0 2018-01-03 0
10.3892/or.2012.2190 0 2018-01-03 invalid

@dhimmel dhimmel Jan 3, 2018

The DOI link https://doi.org/10.3892/or.2012.2190 currently redirects to https://www.spandidos-publications.com/ rather than the article page at https://www.spandidos-publications.com/10.3892/or.2012.2190. I wonder if in this instance, we should just use access status at the publisher URL (even though the DOI redirect is faulty).
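
A faulty redirect like this one could be flagged automatically rather than spotted by eye. The sketch below is one possible heuristic, not something used in this PR: given the URL that a DOI finally resolved to, it reports whether that URL is just a site root. The function name is hypothetical, and obtaining the final URL (following redirects) is left out.

```python
from urllib.parse import urlparse


def landed_on_home_page(final_url):
    """Return True if the resolved URL has no path or query beyond the site root."""
    parsed = urlparse(final_url)
    return parsed.path in ("", "/") and not parsed.query


home = landed_on_home_page("https://www.spandidos-publications.com/")
article = landed_on_home_page(
    "https://www.spandidos-publications.com/10.3892/or.2012.2190"
)
print(home, article)
```

A DOI flagged this way would still need a manual look, since some publishers may legitimately serve content from a root URL.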

10.1002/chin.197531174 0 2018-01-03 1
10.1002/zaac.19402430401 0 2018-01-03 1
10.1029/2011jd016541 0 2018-01-03 1
10.17816/jowd6265-11 0 2018-01-03 invalid

Contributor

https://doi.org/10.17816/jowd6265-11 redirects to http://journals.eco-vector.com/index.php/jowd/article/viewFile/2617/2229. The issue here appears to be that journals.eco-vector.com is currently down altogether. Perhaps in a day or two the website will be back up and we can recheck this.


dhimmel commented Jan 3, 2018

As you'll see, there are quite a few false negatives (PennText reports no access, but going to the DOI link allows access). So, I'm guessing that our openURL resolver is not always tracking open-access articles (vs. overall subscription date ranges, e.g.).

I agree. I think these articles will mostly be available off-campus as well (i.e. open access).

I'm guessing the PennText tool is most focused on tracking access to toll-access content, so it's not a huge surprise it's unaware that it, by default, has access to much OA literature.

I'm just checking, since we left it that we wouldn't look by color for this project, but this is also more false negatives than I expected.

I'm interested in off-campus access as the control here. Let's focus the work in this repo on that definition of open access (rather than oaDOI colors). If you do look by oaDOI color, feel free to share a link here. I'd be interested to take a look, but would like to keep it separate for efficiency.


jglev commented Jan 3, 2018

I'm guessing the PennText tool is most focused on tracking access to toll-access content

Agreed, exactly.

If you do look by oaDOI color, feel free to share a link here. I'd be interested to take a look, but would like to keep it separate for efficiency.

Sure! And to confirm, I'll work tomorrow to finish the off-campus checks before diving into any of the color-based analyses I mentioned.


jglev commented Jan 4, 2018

For transparency: In order to achieve off-campus access, rather than actually leaving campus, I'm using SSHuttle to create a transparent SOCKS proxy back to my house (which is not on campus). I have confirmed that this works using DOI 10.1111/j.1550-7408.1962.tb02648.x: When the proxy is not engaged (and I am thus on the campus network), I have access to the article. When the proxy is engaged, I no longer have access to the article. Further, I've confirmed that my IP address is changing to my home IP when the proxy is engaged, by visiting http://ipv4.icanhazip.com/ in my browser.

Jacob Levernier added 2 commits January 4, 2018 13:45
…k (using SSHuttle-based SOCKS proxy to off-campus, confirmed by checking IP address and with DOI 10.1111/j.1550-7408.1962.tb02648.x, which is available without the proxy (i.e., on-campus), and not available with the proxy (i.e., off-campus).
… case subscriptions changed with the calendar year, and because I think that my more recent checks were more accurate, having seen the (often confusing) publisher web pages more.

jglev commented Jan 4, 2018

Ok, all DOIs are checked both on- and off-campus!

In the off-campus search, there was one different DOI that I marked as "invalid" -- 10.17816/jowd6265-11 worked this time (it didn't in the on-campus check), but 10.3934/dcds.2016103 gave a server error.


jglev commented Jan 4, 2018

So, concretely, what do we now want from these data? Does this look accurate to you?

  • Using the on-campus check, percentage of false-negatives from PennText
  • Using the on-campus check, percentage of false-positives from PennText
  • Comparing the on- and off-campus checks, percentage where on-campus had access but off-campus didn't
  • Comparing the on- and off-campus checks, percentage where on-campus did not have access but off-campus did
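
As a rough sketch of the first two bullets, the percentages could be computed from pairs of (PennText result, manual on-campus result). This assumes both are coded 0/1 and that each percentage is taken as a share of all checked DOIs, which is only one of several possible denominators; the function name and the toy pairs are hypothetical, not results from this sample.

```python
def penntext_error_rates(records):
    """Summarize disagreement between PennText and manual checks.

    records: iterable of (penntext, manual) pairs, each coded 0 or 1.
    Returns (false_negative_pct, false_positive_pct), each as a
    percentage of all records.
    """
    records = list(records)
    total = len(records)
    if total == 0:
        raise ValueError("no records to summarize")
    # False negative: PennText says no access, manual check found access.
    false_neg = sum(1 for p, m in records if p == 0 and m == 1)
    # False positive: PennText says access, manual check found none.
    false_pos = sum(1 for p, m in records if p == 1 and m == 0)
    return 100 * false_neg / total, 100 * false_pos / total


# Toy pairs for illustration only.
toy_pairs = [(0, 1), (0, 0), (1, 1), (1, 0)]
fn_pct, fp_pct = penntext_error_rates(toy_pairs)
print(fn_pct, fp_pct)
```

Rows marked "invalid" would need to be dropped or resolved before being passed in.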


jglev commented Jan 4, 2018

It's also worth noting, I think, that some of the publisher pages are very confusing, to the point that I felt it necessary just now to redo the on-campus checks I'd completed before the New Year, because I've learned a lot in the last two days from looking through the publisher pages more. Wiley, for example, often shows a loading page after a user clicks the "Download PDF" button, before then redirecting to a paywall page.

@dhimmel dhimmel left a comment

It'd be best if we can avoid "invalid" altogether. See specific comments for what I think we should do for each invalid DOI.

}

# Save the changes to the tsv:
write.table(

Contributor

I'm assuming this command is what caused all the quotes in the TSV, which I tend to dislike. readr::write_tsv defaults to not quoting like this.

Author

Oh, no, I think it was LibreOffice Calc. I opened the TSV to check it, and may have pressed Ctrl+S out of habit before closing it again. I'll reset it...

Author

Quotes removed in 3ebe18d.

Author

It may have been both LibreOffice and write.table, though. I'm fine with changing the command, to be consistent.

Author

(Am changing the command now...)
Also, I just saw that you posted this comment yesterday. I didn't see it then; thank you for re-posting it now.

Author

Command switched in 58c348f, and I checked the TSV after writing using it (d0967d7 and d0967d7).
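
The quoting difference discussed in this thread is not specific to R. As a language-neutral illustration (the actual fix here was switching the R command), Python's csv module exposes the same choice through its quoting modes. The helper name is hypothetical, and the row is taken from the diff excerpt in this review.

```python
import csv
import io


def to_tsv(rows, quoting):
    """Serialize rows as tab-separated text using the given csv quoting mode."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer, delimiter="\t", quoting=quoting, lineterminator="\n"
    )
    writer.writerows(rows)
    return buffer.getvalue()


row = ["10.3233/jnd-160146", 0, "2018-01-03", "1", "2018-01-04", "1"]

# QUOTE_NONNUMERIC wraps every non-numeric field in quotes. Unlike the
# TSV shown in this thread, that includes the date strings.
quoted = to_tsv([row], csv.QUOTE_NONNUMERIC)

# QUOTE_MINIMAL adds quotes only when a field contains the delimiter,
# a quote character, or a line break.
plain = to_tsv([row], csv.QUOTE_MINIMAL)

print(quoted, end="")
print(plain, end="")
```

Minimal quoting keeps the file easy to read in diffs while still protecting any field that happens to contain a tab.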

"10.3233/jnd-160146" 0 2018-01-03 "1" 2018-01-04 "1"
"10.7748/eldc.6.5.41.s39" 0 2018-01-03 "0" 2018-01-04 "0"
"10.2306/scienceasia1513-1874.2013.39.204" 0 2018-01-03 "1" 2018-01-04 "1"
"10.3934/dcds.2016103" 0 2018-01-03 "0" 2018-01-04 "invalid"

Contributor

Can you retry the off-campus search for https://doi.org/10.3934/dcds.2016103? The "server error" is probably a temporary issue.

Author

Done in d0967d7.

"10.1002/phbl.19510070201" 0 2018-01-03 "1" 2018-01-04 "1"
"10.1515/ijnsns-2011-0005" 0 2018-01-03 "0" 2018-01-04 "0"
"10.1515/jnet.1983.8.4.255" 0 2018-01-03 "0" 2018-01-04 "0"
"10.3892/or.2012.2190" 0 2018-01-03 "invalid" 2018-01-04 "invalid"

Contributor

Copying from #18 (comment):

The DOI link https://doi.org/10.3892/or.2012.2190 currently redirects to https://www.spandidos-publications.com/ rather than the article page at https://www.spandidos-publications.com/10.3892/or.2012.2190. I wonder if in this instance, we should just use access status at the publisher URL (even though the DOI redirect is faulty).

I think in this instance, we should use the access status that would apply were the DOI resolution URL fixed.

"10.1002/chin.197531174" 0 2018-01-03 "1" 2018-01-04 "0"
"10.1002/zaac.19402430401" 0 2018-01-03 "1" 2018-01-04 "0"
"10.1029/2011jd016541" 0 2018-01-03 "1" 2018-01-04 "1"
"10.17816/jowd6265-11" 0 2018-01-03 "invalid" 2018-01-04 "1"

Contributor

See #18 (comment). The website was temporarily down. Let's redo the on-campus check now that the site is back up.

Author

Sure. Done in 78d1654.


dhimmel commented Jan 4, 2018

So, concretely, what do we now want from these data? Does this look accurate to you?

Let's merge this PR first and calculate stats in a future PR.

The counts of the following categories should be sufficient (from #17 (comment)):

  1. Available on campus, not available off campus
  2. Available on campus, available off campus
  3. Not available on campus, not available off campus
  4. Not available on campus, available off campus

This will be a sort of confusion matrix, but let's avoid that confusing terminology as well as TP/FP if we can 😄
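
The four counts listed above could be tallied along these lines. This is only a sketch for the future stats PR, assuming each DOI has been reduced to an (on-campus, off-campus) pair coded 0/1; the function name is hypothetical. The example pairs are the nine non-"invalid" rows visible in the diff excerpts in this review, not the full 200-DOI sample.

```python
from collections import Counter

# Plain-language labels for each (on_campus, off_campus) combination.
LABELS = {
    (1, 0): "available on campus, not available off campus",
    (1, 1): "available on campus, available off campus",
    (0, 0): "not available on campus, not available off campus",
    (0, 1): "not available on campus, available off campus",
}


def count_access_categories(pairs):
    """Count (on_campus, off_campus) pairs, reporting all four categories."""
    counts = Counter(pairs)
    # Missing combinations are reported as 0 so the table is always complete.
    return {label: counts.get(key, 0) for key, label in LABELS.items()}


# Excerpt rows from this review's diffs (rows marked "invalid" omitted).
excerpt_pairs = [
    (1, 1), (0, 0), (1, 1), (1, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 1),
]
summary = count_access_categories(excerpt_pairs)
for label, n in summary.items():
    print(f"{n:3d}  {label}")
```

Reporting all four labels, even when a count is zero, avoids having to explain true/false positive terminology to readers.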


jglev commented Jan 4, 2018

At this point, I think I've answered all outstanding comments for this PR. Does that seem correct to you, as well?


dhimmel commented Jan 4, 2018

At this point, I think I've answered all outstanding comments for this PR. Does that seem correct to you

No. 10.3892/or.2012.2190 is still marked as invalid. See #18 (comment)... let's assess access for this article using the URL https://www.spandidos-publications.com/10.3892/or.2012.2190


jglev commented Jan 4, 2018

On- and off-campus checks for DOI 10.3892/or.2012.2190 are now updated, as of 8f5871e.

@dhimmel dhimmel left a comment

So, concretely, what do we now want from these data? Does this look accurate to you?

I realize my previous answer was incomplete. Let me think of the best metrics.

Will merge this now!

@dhimmel dhimmel merged commit 46a529c into greenelab:master Jan 4, 2018