Accuracy analysis of full_text_indicator calls #15
And perhaps view the DOIs when outside of Penn's network to see whether the article is paywalled at all.
As in our conversation in PR 13, I am in agreement with this. This way we can assess the rate of false negatives (as well as false positives, though I don't anticipate those being prevalent) in the API we used; a sketch of that tally follows this comment. (This is also something that we in the Library want to know about, because it points to what may be an issue somewhere in one of our systems, as noted in the comment linked above, whether at the publisher metadata level or at our OpenURL resolver.) Steps that I see for doing this:
@dhimmel, what are your thoughts about all that? Are there any other coauthors who would like to look over this analysis plan? Also, I know that you were hoping for an answer this week. I need to work to finish a project with a deadline most of the rest of this week; if I devoted Monday the 11th to this, would that work for you?
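A minimal sketch, not from the thread, of how those false-negative and false-positive rates could be tallied once manual calls exist; the file name and the columns full_text_indicator_automated and full_text_indicator_manual_inside_campus are assumptions:

library(readr)

checks <- read_tsv("manual-doi-checks.tsv")  # assumed file of manual calls

auto   <- as.logical(checks$full_text_indicator_automated)             # PennText call
manual <- as.logical(checks$full_text_indicator_manual_inside_campus)  # manual on-campus call

# False negative rate: among DOIs with access per the manual check,
# how often does PennText report no full text?
fn_rate <- sum(!auto & manual) / sum(manual)

# False positive rate: among DOIs without access per the manual check,
# how often does PennText report full text?
fp_rate <- sum(auto & !manual) / sum(!manual)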
No. I think it should be stratified by Penn access status (100 articles where full_text_indicator=True and 100 where full_text_indicator=False). Don't place too much emphasis on the oaDOI colors for this analysis. As we've discussed in greenelab/scihub-manuscript#36, they are themselves an automated metric with imperfect quality.
Yes. I think we will eventually want both: access from within Penn's network, but also access outside as a control. If all articles are freely available off campus, then Penn's subscriptions aren't providing crucial access. We should limit the off-campus search to the publisher's site... i.e., following the DOI, is the full text available? Can we start with a PR to select the 200 DOIs? Make sure it's deterministic. It should be only a few lines in either R or Python.
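Not something from the thread, but a rough sketch of the kind of deterministic stratified selection being proposed, assuming the PennText results sit in a TSV with doi and full_text_indicator columns (the file name and column names are assumptions):

library(readr)
library(dplyr)

# Fixed seed so the same 200 DOIs come out on every run
set.seed(0)

pentext <- read_tsv("library-access.tsv")  # assumed path to the PennText query results

sampled <- pentext %>%
  group_by(full_text_indicator) %>%  # one stratum per indicator value
  sample_n(100) %>%                  # 100 DOIs from each stratum
  ungroup() %>%
  arrange(doi)

write_tsv(sampled, "manual-doi-checks.tsv")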
If the stratification is within class (full_text_indicator status), I'm okay with it because it won't have the mentioned problem. Although I think it will probably end up being unnecessary.
The latter (stratification within full_text_indicator class). And yes; I'll get a PR up for the deterministic sampling on Monday.
See #18 / 46a529c for the manual classifications by @Publicus with on- and off-campus access for 200 DOIs stratified by PennText status. I looked into the results in this notebook, since that was the easiest way for me to think through how we want to look at this data. I'll hold off on PRing the notebook in case @Publicus wants to expand it or take it in another direction. Anyway, here's what I think we can deduce:
@Publicus is this your interpretation as well? What do you think of these results?
Thank you for getting these percentages, @dhimmel! Just a quick note to say that I'll be meeting with my supervisor, who's the Library's metadata architect, this afternoon, to talk through this. I'll then come back and write up my thoughts here.
I spoke with my supervisor in a 1-1 meeting yesterday (just before campus shut down because of the weather), brought this up in our Library Technology Services team meeting today, and have formed some thoughts. First, @dhimmel, thank you again for running those numbers. Your Jupyter notebook looks good and correct to me. It's also useful to note that, overall, PennText was wrong 26.5% of the time. In R, I got this with the following:

library(readr)

dataset <- read_delim(
  "./evaluate_library_access_from_output_tsv/manual-doi-checks.tsv",
  "\t",
  escape_double = FALSE,
  trim_ws = TRUE
)

# Proportion of DOIs where PennText disagrees with the manual inside-campus call
# (i.e., PennText is wrong in either direction)
length(which(dataset$full_text_indicator_automated != dataset$full_text_indicator_manual_inside_campus)) / nrow(dataset)
# [1] 0.265

So, this obviously is a problem, on several levels:
It could also be the case that, as we have speculated, the OpenURL resolver on which PennText is built is just looking at journal subscription dates, and not the OA status of any particular article. It may also be that there are configuration issues with the OpenURL resolver. Or the errors could be coming from some combination of the above. I intend to make a more in-depth study of why there are errors in PennText's OpenURL resolver and where they're coming from. That seems to me to be out of scope for this manuscript, so I'm noting it here just to say that this finding is unexpected and needs follow-up within the Libraries. For this project, I see us having two main options:
Of those two options, I think that the latter has more promise, though it's unfortunate that, if we go that route, all that API querying wouldn't be used! 🙁 I also recognize that, as project lead, you've got a timeline you'd like to stick to. So regardless of which way we go forward, since either option above requires some additional work, I think it would be useful for us to talk through your timeline expectations, so that I can feel confident that we're on the same page.
For this measure, I think you should average the accuracy on PennText true and false, weighted by the overall prevalence of PennText true and false; this would undo the effect of stratification (a sketch of the reweighting follows below). But I don't expect the accuracy will change much. I agree that the level of inaccuracy makes the PennText calls a poor estimator of Penn's access. I think the best option at this point is to proceed with additional manual checks, perhaps a total of 500 DOIs. These should be randomly selected from all State of OA DOIs (i.e. a random subset of the DOIs we queried PennText for). We could transfer somewhere between 100 and 200 of the existing manual calls. Time is of the essence; we'd like to resubmit the manuscript in the next couple of weeks. So let's focus exclusively at this point on getting the 500 manual calls. @Publicus, how about I open a PR to select the 500 DOIs, and then you take it from there?
I agree. Although we can still report the PennText findings and data, perhaps in the methods or a supplemental figure. The difficulty of ascertaining library access using the library systems is an interesting finding in and of itself.
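A rough sketch, not from the thread, of the reweighting mentioned above, assuming the indicator columns read in as logical (TRUE/FALSE); the prevalence value is a placeholder and the column names follow manual-doi-checks.tsv:

library(readr)
library(dplyr)

checks <- read_tsv("manual-doi-checks.tsv")

# Accuracy within each PennText stratum of the manually checked DOIs
strata <- checks %>%
  group_by(full_text_indicator_automated) %>%
  summarize(accuracy = mean(full_text_indicator_automated == full_text_indicator_manual_inside_campus))

# Placeholder: overall proportion of PennText-true calls across all queried DOIs
prevalence_true <- 0.85

acc_true  <- strata$accuracy[strata$full_text_indicator_automated == TRUE]
acc_false <- strata$accuracy[strata$full_text_indicator_automated == FALSE]

# The prevalence-weighted average undoes the 100/100 stratification
weighted_accuracy <- prevalence_true * acc_true + (1 - prevalence_true) * acc_false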
I'm on board with this. What are your thoughts re: counting the existing manual calls? Since the sample was random, I'm fine with using all 200. Or, perhaps, we could take a random sample that's not stratified by PennText status, then draw from the 200 existing calls in proportion to the non-stratified sample's PennText statuses. Or, if that sounds undesirable, I'd be fine just taking a new random sample of 500. If this is still going toward Figure 8b, it seems to me that it could make sense to stratify the sample by origin dataset (Web of Science, Unpaywall, Crossref) -- what are your thoughts on that? (If you see issues similar to the ones you raised with stratifying the initial sample, I'm fine not stratifying.) Finally, so that we're on the same page going in, what are your thoughts on the Confidence Interval / Credible Interval approach (a sketch of a simple credible interval follows this comment)? If we are going toward Figure 8b, it seems to me that we'd either need to run the interval estimation for each data source (i.e., 3 runs of the analysis), or else fit a random-intercepts model where DOIs are nested within data source. The latter is probably cleaner, but the former is easier, and probably less of a philosophical problem (because confidence intervals carry the NHST issues of Type-1 error) if we use the Bayesian approach.
Noted. Please do open the PR. By "open," do you mean that you'll take the sample of 500? (I'm asking to confirm; I'm unsure from your comment.)
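For reference, a minimal sketch, not from the thread, of the flat-prior Beta-Binomial credible interval mentioned above, which could be run once overall or once per data source; the counts are placeholders:

# With a flat Beta(1, 1) prior, the posterior for the access proportion
# is Beta(1 + accessible, 1 + inaccessible)
n_checked    <- 500  # placeholder: number of manually checked DOIs
n_accessible <- 400  # placeholder: DOIs with full-text access in the manual check

credible_interval_95 <- qbeta(
  c(0.025, 0.975),
  shape1 = 1 + n_accessible,
  shape2 = 1 + (n_checked - n_accessible)
)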
Yep, see #19. I was able to migrate 178 of the existing 200 calls.
We cannot use all 200 because, in a sample of 500 that matches the PennText proportion across all DOIs, there are only 78 PennText-false DOIs. But luckily we can reuse most of them.
I shifted towards combining all the DOIs from Web of Science, Unpaywall, and Crossref in the main text. I don't think splitting it added enough to warrant the additional complexity. We can still have the split figures as supplements. I was planning on replacing Figure 8A with cell 6 here and Figure 8B with cell 22 here. We'd update the Penn portion of these plots to use the results from these 500 manual calls.
It could be nice to report intervals. This is less of a priority to me than getting the calls, since we can always add it at a later analysis step. I'd like to learn more about the credible intervals from you at some point.
Just a note to say that I'm in and out of meetings throughout the day, but that I am continuing to work through the DOIs between those meetings.
When off-campus, https://doi.org/10.2307/3795274 gives me an option to "Read this item online for free by registering for a MyJSTOR account." So I'm counting it as having access. This is a note so that I won't forget, for transparency.
My inclination would be to consider login-walled content not open. Creating an account is somewhat prohibitive.
This is the only DOI so far that's said that, so it's not a problem to go back and change what I recorded for it. I don't have strong feelings about it -- but if the original question is "can a user access the full text of the paper?" then it seems consistent to me to say that this satisfies that criterion, even if it sits toward the tail end of prohibitiveness within the distribution of things that count as "access." To me, this is similar to how an article released openly but under a CC BY-NC-ND license is also restrictive in what it allows users to do with it, yet still provides the full text. (In that latter case, I would feel more strongly about counting it as access, and I'm reasoning from that analogy here.) Do you feel strongly about counting it as inaccessible?
A quick note that the following DOIs are also of this type:
https://doi.org/10.1001/jama.1965.03080200059025 possibly is, as well, though it's hidden behind an extra click. In that case, it says "Create a free personal account to download free article PDFs, sign up for alerts, and more," which is not necessarily about this particular PDF (it implies but doesn't state that as directly as the examples above), so, at least for now, I've recorded it as not having access.
I think we should. There are some types of login walls that should count as no access: for example, if the sign-up process is laborious, requires users to identify themselves, agree to legal conditions, sign up for spam, etcetera. Because of the difficulty of drawing a line, I think it'll be most straightforward to not sign up for any accounts in order to get access. Did you actually sign up for access? I can imagine the systems sometimes have flaws, so even with an account you still couldn't access those articles.
I hear this. Based on it, I'm fine with changing the values in question manually. I'm glad that we had the discussion about it, as it does seem to me to be a decision we needed to make actively. I didn't sign up for access, so your note re: system flaws may also be correct.
I agree with categorizing JSTOR as non-open. It clearly doesn't meet the original definition:
All you get if you register is streaming access, so that's disqualifying already.
I also registered for AAAS's delayed access, but I have never had success. The article is always mysteriously one of the ones that's not available.
@tamunro, I appreciated reading your reasoning; thank you!
No worries, @Publicus. I'd be happy to help with the count too, if I don't need a Penn login.
We should manually review calls for DOIs to see the accuracy of the full_text_indicator calls. I'd suggest randomly selecting 100 DOIs where full_text_indicator=False and 100 DOIs where full_text_indicator=True. Then we can navigate to the DOI URL while on Penn's network and see if full text access is available. @Publicus what do you think?