Adding full text true rate estimator #8
Conversation
…onment.yml' (and then manually removing the prefix and name, which are specific to my system). Got sqlalchemy data insertion proof of concept working. Got API download, parse, and commit to SQLite database working with a given DOI. Next step is to create the loop over DOIs.
…f fulltext at different timepoints.
… object-oriented syntax.
…andon some of this work in next commit.
…Spyder's IPython console. Also created a function to check whether a DOI is already in the database, and the loop over DOIs.
…copy() for copying a dictionary.
…a bug in that function whereby lacking a DOI in the XML would cause an error.
…generated data. The next step is to hook this into our actual dataset.
…s (currently split across the original tsv file and an sqlite database).
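The duplicate-check function and DOI loop described in the commits above might look roughly like the following minimal sketch. Table and column names here are hypothetical, and plain `sqlite3` stands in for the SQLAlchemy code the script actually uses:

```python
import sqlite3

def doi_in_database(conn, doi):
    # Return True if this DOI already has a row in the (hypothetical) articles table.
    row = conn.execute(
        "SELECT 1 FROM articles WHERE doi = ? LIMIT 1", (doi,)
    ).fetchone()
    return row is not None

def insert_access_record(conn, doi, full_text_indicator):
    # Commit one binary full-text-access observation for a DOI.
    conn.execute(
        "INSERT INTO articles (doi, full_text_indicator) VALUES (?, ?)",
        (doi, full_text_indicator),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (doi TEXT PRIMARY KEY, full_text_indicator INTEGER)"
)

# Loop over DOIs, skipping any already committed to the database.
for doi, indicator in [("10.1000/x1", 1), ("10.1000/x2", 0), ("10.1000/x1", 1)]:
    if not doi_in_database(conn, doi):
        insert_access_record(conn, doi, indicator)

print(conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0])  # 2
```

The skip-if-present check makes the loop safe to re-run after an interrupted API session, which matters when downloads span multiple timepoints.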
To clarify, since this PR inherits all of the commits from #7, the one new file in this new PR is …
One more explanatory note: the script currently draws data from both of the locations used in PR #7: it gets full-text access information from the SQLite database and joins that to DOI open-access "colors" from the original …

Beyond just looking at only the more-closed DOIs (e.g., 'closed' and 'bronze'), one useful thing we could do with this is add OA color as a predictor in the model. I'm not sure whether that's a useful approach in the larger context of this project and its manuscript, so I'm just putting it out here for discussion before implementing it.
Finally, the Lab onboarding documentation notes that "We write code for our analyses in Python or R, which allows everyone in the lab to know two languages and understand analytical code." Unlike Python (with …
Yes, SQL is fine! Although you may want to check out …
It looks like this package (github) makes …

I was imagining we'd want to extract a TSV from the database that omits the api_response column. This way it's really easy to read in the access data. My thinking is we should have a PR that converts the DB to a TSV, and that should come before further analysis of the data. Then the data analyses can read the TSV, which should be more development-friendly.

I'm going to hold off on reviewing this PR until we merge #7. It's not clear to me that we want to be doing these analyses in this repo as opposed to …
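The DB-to-TSV conversion proposed here could be sketched as below. The `articles` table name is an assumption; only the `api_response` column name comes from the discussion:

```python
import csv
import sqlite3

def export_access_tsv(conn, tsv_path):
    # Write every column of the (assumed) articles table except the bulky
    # api_response payload to a tab-separated file.
    cols = [row[1] for row in conn.execute("PRAGMA table_info(articles)")]
    keep = [c for c in cols if c != "api_response"]
    with open(tsv_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(keep)  # header row
        for row in conn.execute(f"SELECT {', '.join(keep)} FROM articles"):
            writer.writerow(row)
```

Dropping `api_response` at export time keeps the raw API payloads in the database for reproducibility while giving downstream analyses a small, diff-friendly TSV.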
I apologize for my delay in responding -- I was at a conference this week and returned yesterday evening. As always, thanks for your comments. I hadn't heard about …

To confirm that I've understood, does this look correct to you as what you'd want to see for a PR that you'd accept: split this PR into two PRs: …

Responding to your final sentence, I added the estimation step because of @tamunro's original suggestion to sample, which alluded to getting a confidence interval; I agree with both steps. (As noted in a comment in the code, a confidence interval should come out in this case to be the same as a Bayesian credible interval; I think it's beneficial for the scientific literature to prefer the latter, so I used it here.) As I understand it, since the SciHub coverage analysis was basically on a census rather than a sample, getting an estimate wouldn't be necessary for the SciHub aspect of the paper. Does that understanding jibe with yours?
Yep, I think we're on the same page. Regarding the second PR, I was thinking we'd add UPenn's coverage as bars in Figure 8B. Since these measure coverage on samples of DOIs, intervals would be appropriate. Also, Figure 2A could use the same interval. Let's save design discussion for these intervals for later. It'll probably make most sense for me to make the updates to these figures. However, I'm happy for advice on calculating the intervals. I haven't used BCIs before and would be open to using them if there is a conceptual advantage and the implementation is straightforward.
Ah, thanks for pointing me to Figure 8B in the manuscript. I misunderstood, so to update what I said before: I now think (and this seems to agree with what you wrote earlier) that the credible-interval or confidence-interval analysis is not necessary for the library records, given that they'll be incorporated into that figure (or one like it). If, however, the idea is to extrapolate from this and make inferences about DOI access beyond this dataset, I agree that we'd want to do the analysis on both the library data and the SciHub data that will go into that figure. And in that case, the R script I have here (which I'll eventually transfer to a PR on the …
This is a work-in-progress PR which depends on (i.e., is branched from) PR #7 (which adds an API query script and some example data). It adds an analysis script, written in R, for performing a Bayesian estimation of the "true" rate of full-text access to DOIs like those in the dataset.
I was planning to wait to add this until #7 is merged, but since I've written the code and was playing with it this morning, I figured I'd put it up here now.
The Bayesian analysis follows one of the most basic examples in the literature -- it uses a Bernoulli likelihood (data-generating) function, with a flat beta-distributed prior. Our binary full-text access indicator data is essentially of the same type as coin-flip data, which makes it easy to apply this example from the literature.
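Because the flat Beta(1, 1) prior is conjugate to the Bernoulli likelihood, the posterior is Beta(successes + 1, failures + 1) in closed form, and a credible interval falls out of its quantiles. The PR's script does this in R via Stan; the sketch below is an illustration only, using Monte Carlo draws from the stdlib so it needs no external packages:

```python
import random

def credible_interval(successes, trials, level=0.95, draws=100_000, seed=0):
    # Flat Beta(1, 1) prior + Bernoulli likelihood gives a
    # Beta(successes + 1, failures + 1) posterior over the true access rate.
    rng = random.Random(seed)
    a = successes + 1
    b = (trials - successes) + 1
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo = samples[int((1 - level) / 2 * draws)]       # e.g. 2.5th percentile
    hi = samples[int((1 + level) / 2 * draws) - 1]   # e.g. 97.5th percentile
    return lo, hi

# Hypothetical data: full text retrieved for 79 of 100 sampled DOIs.
lo, hi = credible_interval(79, 100)
```

With a flat prior and this much data, the interval closely matches a frequentist confidence interval, which is the equivalence the PR comment alludes to.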
Portions of this code (the model for Stan, the Bayesian estimation software) are from a BSD-3-Clause-licensed example (the author states the "new BSD license"), as cited in the copyright section of the script and above where the code itself is used. I've written this to comply with the Google R Style Guide cited by the Greene Lab onboarding documentation.
Todo:
- environment.yml
- LICENSE file?