
Adding full text true rate estimator #8

Closed

Conversation

jglev commented Oct 19, 2017

This is a work-in-progress PR that depends on (i.e., is branched from) PR #7, which adds an API query script and some example data. It adds an R analysis script for performing Bayesian estimation of the "true" rate of full-text access for DOIs like those in the dataset.

I was planning to wait until #7 was merged before adding this, but since I've written the code and was playing with it this morning, I figured I'd put it up now.

The Bayesian analysis follows one of the most basic examples in the literature -- it uses a Bernoulli likelihood (data-generating function) with a flat, beta-distributed prior. Our binary full-text-access indicator is essentially the same kind of data as coin flips, which makes this example straightforward to apply.
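For illustration, because the flat Beta(1, 1) prior is conjugate to the Bernoulli likelihood, the posterior for this simple model has a closed form. Here is a minimal sketch with made-up toy data (the actual script fits the equivalent model by MCMC in Stan; the variable names below are placeholders):

```r
# A minimal conjugate sketch: flat Beta(1, 1) prior with a Bernoulli likelihood.
# The vector below is toy data, not the real full-text indicator.
full.text.indicator <- c(1, 0, 1, 1, 0, 1, 1, 1, 0, 1)

n.successes <- sum(full.text.indicator)
n.trials <- length(full.text.indicator)

# With a Beta(1, 1) prior, the posterior is Beta(1 + successes, 1 + failures).
posterior.alpha <- 1 + n.successes
posterior.beta <- 1 + (n.trials - n.successes)

posterior.mean <- posterior.alpha / (posterior.alpha + posterior.beta)
credible.interval.95 <- qbeta(c(0.025, 0.975), posterior.alpha, posterior.beta)

posterior.mean
credible.interval.95  # 95% equal-tailed credible interval for the "true" rate
```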

Portions of this code (the model for Stan, the Bayesian estimation software) come from a BSD-3-Clause-licensed example (the author calls it the "new BSD license"), as cited in the copyright section of the script and above the code where it is used.

I've written this to comply with the Google R Style Guide cited by the Greene Lab Onboarding documentation.

Todo

  • Add new dependencies to environment.yml
  • Possibly add to the LICENSE file?
  • Add author, copyright, and a general description to the top of the file (I've added placeholder text for now).

Jacob Levernier added 30 commits October 16, 2017 13:37
…onment.yml' (and then manually removing the prefix and name, which are specific to my system).

Got sqlalchemy data insertion proof of concept working.
Got API download, parse, and commit to SQLite database working with a given DOI. Next step is to create the loop over DOIs.
…Spyder's iPython console. Also created function re: whether doi is already in database, and the loop over DOIs.
…a bug in that function whereby lacking a DOI in the XML would cause an error.
…generated data. The next step is to hook this into our actual dataset.
…s (currently split across the original tsv file and an sqlite database).

jglev commented Oct 19, 2017

To clarify, since this PR inherits all of the commits from #7, the one new file in this new PR is estimate_true_rate_of_fulltext_access_bayesian_approach.R.


jglev commented Oct 19, 2017

One more explanatory note: the script currently draws data from both of the locations used in PR #7. It gets full-text access information from the SQLite database and joins that to DOI open-access "colors" from the original TSV dataset. This will allow subsetting the data by color, if we eventually want to.

Beyond, e.g., looking only at more-closed DOIs (like 'closed' and 'bronze'), one useful thing we could do with this is add OA color as a predictor in the model. I'm not sure whether that's a useful approach in the larger context of this project and its manuscript, so I'm putting it out here for discussion before implementing it.
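To make the idea concrete, here is an entirely hypothetical sketch -- toy data, guessed column names, and a logistic-regression formulation that the current script does not use:

```r
# Entirely hypothetical sketch: toy data standing in for the joined dataset.
joined.data <- data.frame(
  full_text_indicator = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1),
  oa_color = c("closed", "closed", "closed",
               "bronze", "bronze", "bronze",
               "green", "green", "green",
               "hybrid", "hybrid", "hybrid")
)

# One option: logistic regression with OA color as a categorical predictor.
color.model <- glm(
  full_text_indicator ~ factor(oa_color),
  family = binomial(link = "logit"),
  data = joined.data
)
summary(color.model)

# Alternatively, keep the current Bernoulli/beta model and just subset by color:
closed.subset <- subset(joined.data, oa_color %in% c("closed", "bronze"))
```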


jglev commented Oct 19, 2017

Finally, the Lab onboarding documentation notes that "We write code for our analyses in Python or R, which allows everyone in the lab to know two languages and understand analytical code." Unlike Python (with sqlalchemy), I don't think R has an object-based interface for SQL, so the script includes a basic but raw SQL query. Hopefully this is in keeping with the spirit of the onboarding documentation (I'm not sure whether you consider SQL a separate language), as I don't know of another way in R to access the data in the SQLite database.
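For concreteness, the raw-SQL approach from R goes through the DBI and RSQLite packages, roughly as in this sketch (the database file, table, and column names here are placeholders rather than the exact ones in the script):

```r
# Sketch of the raw-SQL approach; the database file, table, and column names
# below are placeholders, not necessarily those used in the actual script.
library(DBI)

db.connection <- dbConnect(RSQLite::SQLite(), "library_access.sqlite")

access.data <- dbGetQuery(
  db.connection,
  "SELECT doi, full_text_indicator FROM dois"
)

dbDisconnect(db.connection)

head(access.data)
```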


dhimmel commented Oct 20, 2017

Yes. SQL is fine! Although you may want to check out dbplyr:

The goal of dbplyr is to automatically generate SQL for you so that you’re not forced to use it. However, SQL is a very large language and dbplyr doesn’t do everything.

It looks like this package (github) makes dplyr work with database backends.

I was imagining we'd want to extract a TSV from the databases that omits the api_response column. This way it's really easy to read in the access data. My thinking is we should have a PR that converts the DB to a TSV and that should come before further analysis of the data. Then the data analyses can read the TSV, which should be more development-friendly.
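Roughly what I have in mind is something like the sketch below (a sketch only; the table, column, and file names are guesses, and using dbplyr here is optional):

```r
# Sketch only: query the access table via dbplyr, drop api_response, write a TSV.
# The table, column, and file names here are guesses.
library(DBI)
library(dplyr)

db.connection <- dbConnect(RSQLite::SQLite(), "library_access.sqlite")

access.table <- tbl(db.connection, "dois") %>%
  select(-api_response) %>%  # omit the bulky raw API responses
  collect()                  # pull the remaining columns into an R data frame

dbDisconnect(db.connection)

write.table(
  access.table,
  "library_access_coverage.tsv",
  sep = "\t",
  row.names = FALSE,
  quote = FALSE
)
```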

I'm going to hold off on reviewing this PR until we merge #7. It's not clear to me that we want to be doing these analyses in this repo as opposed to greenelab/scihub... if we're computing intervals for the library access coverage, we should probably do that for the Sci-Hub coverage as well.


jglev commented Oct 26, 2017

I apologize for my delay in responding -- I was at a conference this week and returned yesterday evening.

As always, thanks for your comments. I hadn't heard about dbplyr; it looks like a good tool for the toolkit. For now, I'll leave the SQL, since you wrote it's ok by Lab guidelines, but am open to incorporating dbplyr in the future.

To confirm that I've understood, is this what you'd want to see in a PR before accepting it:

Split this PR into two PRs:
1. One PR on this repo, converting the data from the database into a TSV with the columns from the original TSV plus a full_text_indicator column
2. One PR on greenelab/scihub, doing the actual estimation of full-text access

Responding to your final sentence: I added the estimation step because of @tamunro's original suggestion to sample, which alluded to getting a Confidence Interval -- both of which I agree with. (As noted in a comment in the code, a Confidence Interval should come out essentially the same as a Bayesian Credible Interval in this case; I think it's beneficial for the scientific literature to prefer the latter, so I used it here.) As I understand it, since the Sci-Hub coverage analysis was basically a census rather than a sample, getting an estimate wouldn't be necessary for the Sci-Hub aspect of the paper. Does that understanding jibe with yours?
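As a quick illustration of why the two intervals essentially coincide in this simple case (the counts below are invented, not from the dataset):

```r
# Made-up counts, purely to illustrate how close the two intervals are here.
n.successes <- 180
n.trials <- 200

# 95% equal-tailed Bayesian credible interval with a flat Beta(1, 1) prior.
credible.interval <- qbeta(
  c(0.025, 0.975),
  1 + n.successes,
  1 + (n.trials - n.successes)
)

# 95% exact (Clopper-Pearson) frequentist confidence interval.
confidence.interval <- binom.test(n.successes, n.trials)$conf.int

credible.interval
confidence.interval
```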


dhimmel commented Oct 30, 2017

Split this PR into two PRs

Yep, I think we're on the same page.

Regarding the second PR, I was thinking we'd add UPenn's coverage as bars in Figure 8B. Since these measure coverage on samples of DOIs, intervals would be appropriate. Also Figure 2A could use the same interval.

Let's save the design discussion for these intervals for later. It'll probably make the most sense for me to make the updates to these figures. However, I'm happy for advice on calculating the intervals. I haven't used BCIs before and would be open to using them if there is a conceptual advantage and the implementation is straightforward.


jglev commented Oct 30, 2017

Ah, thanks for pointing me to Figure 8B in the manuscript. I misunderstood, so to update what I said before: I now think (and this seems to agree with what you wrote earlier) that the Credible Interval or Confidence Interval analysis is not necessary for the library records, given that they'll be incorporated into that figure (or one like it).

If, however, the idea is to extrapolate from this and make inferences about DOI access beyond this dataset, I agree that we'd want to do the analysis on both the library data and the Sci-Hub data that will go into that figure. In that case, the R script I have here (which I'll eventually transfer to a PR on the greenelab/scihub repo) is straightforward to use. It'll just need a line added to the settings section to subset by level of access (Closed, Hybrid, Green, etc.).
