
Adding full text true rate estimator #8

Closed

Conversation

jglev commented Oct 19, 2017

This is a work-in-progress PR that depends on (i.e., is branched from) PR #7, which adds an API query script and some example data. It adds an R analysis script for performing Bayesian estimation of the "true" rate of full-text access for DOIs like those in the dataset.

I was planning to wait until #7 was merged before adding this, but since I've written the code and was playing with it this morning, I figured I'd put it up now.

The Bayesian analysis follows one of the most basic examples in the literature -- it uses a Bernoulli likelihood (data-generating function) with a flat, beta-distributed prior. Our binary full-text-access indicator is essentially the same kind of data as coin flips, which makes this example straightforward to apply.
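For illustration, because the flat Beta(1, 1) prior is conjugate to the Bernoulli likelihood, the posterior for this simple model has a closed form. Here is a minimal sketch with made-up toy data (the actual script fits the equivalent model by MCMC in Stan; the variable names below are placeholders):

```r
# A minimal conjugate sketch: flat Beta(1, 1) prior with a Bernoulli likelihood.
# The vector below is toy data, not the real full-text indicator.
full.text.indicator <- c(1, 0, 1, 1, 0, 1, 1, 1, 0, 1)

n.successes <- sum(full.text.indicator)
n.trials <- length(full.text.indicator)

# With a Beta(1, 1) prior, the posterior is Beta(1 + successes, 1 + failures).
posterior.alpha <- 1 + n.successes
posterior.beta <- 1 + (n.trials - n.successes)

posterior.mean <- posterior.alpha / (posterior.alpha + posterior.beta)
credible.interval.95 <- qbeta(c(0.025, 0.975), posterior.alpha, posterior.beta)

posterior.mean
credible.interval.95  # 95% equal-tailed credible interval for the "true" rate
```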

Portions of this code (the model for Stan, the Bayesian estimation software) come from a BSD-3-Clause-licensed example (the author calls it the "new BSD license"), as cited in the copyright section of the script and above the code where it is used.

I've written this to comply with the Google R Style Guide cited by the Greene Lab Onboarding documentation.

Todo

  • Add new dependencies to environment.yml
  • Possibly add to the LICENSE file?
  • Add author, copyright, and a general description to the top of the file (I've added placeholder text for now).

Jacob Levernier added 30 commits October 16, 2017 13:37
…onment.yml' (and then manually removing the prefix and name, which are specific to my system).

Got sqlalchemy data insertion proof of concept working.
Got API download, parse, and commit to SQLite database working with a given DOI. Next step is to create the loop over DOIs.
…Spyder's iPython console. Also created function re: whether doi is already in database, and the loop over DOIs.
…a bug in that function whereby lacking a DOI in the XML would cause an error.
…generated data. The next step is to hook this into our actual dataset.
…s (currently split across the original tsv file and an sqlite database).

jglev commented Oct 19, 2017

To clarify, since this PR inherits all of the commits from #7, the one new file in this new PR is estimate_true_rate_of_fulltext_access_bayesian_approach.R.


jglev commented Oct 19, 2017

One more explanatory note: the script currently draws data from both of the locations used in PR #7. It gets full-text access information from the SQLite database and joins that to DOI open-access "colors" from the original TSV dataset. This will allow subsetting the data by color, if we eventually want to.

Beyond, e.g., looking only at more-closed DOIs (like 'closed' and 'bronze'), one useful thing we could do with this is add OA color as a predictor in the model. I'm not sure whether that's a useful approach in the larger context of this project and its manuscript, so I'm putting it out here for discussion before implementing it.
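To make the idea concrete, here is an entirely hypothetical sketch -- toy data, guessed column names, and a logistic-regression formulation that the current script does not use:

```r
# Entirely hypothetical sketch: toy data standing in for the joined dataset.
joined.data <- data.frame(
  full_text_indicator = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1),
  oa_color = c("closed", "closed", "closed",
               "bronze", "bronze", "bronze",
               "green", "green", "green",
               "hybrid", "hybrid", "hybrid")
)

# One option: logistic regression with OA color as a categorical predictor.
color.model <- glm(
  full_text_indicator ~ factor(oa_color),
  family = binomial(link = "logit"),
  data = joined.data
)
summary(color.model)

# Alternatively, keep the current Bernoulli/beta model and just subset by color:
closed.subset <- subset(joined.data, oa_color %in% c("closed", "bronze"))
```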


jglev commented Oct 19, 2017

Finally, the Lab onboarding documentation notes that "We write code for our analyses in Python or R, which allows everyone in the lab to know two languages and understand analytical code." Unlike Python (with sqlalchemy), I don't think R has an object-based interface for SQL, so the script includes a basic but raw SQL query. Hopefully this is in keeping with the spirit of the onboarding documentation (I'm not sure whether you consider SQL a separate language), as I don't know of another way in R to access the data in the SQLite database.
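For concreteness, the raw-SQL approach from R goes through the DBI and RSQLite packages, roughly as in this sketch (the database file, table, and column names here are placeholders rather than the exact ones in the script):

```r
# Sketch of the raw-SQL approach; the database file, table, and column names
# below are placeholders, not necessarily those used in the actual script.
library(DBI)

db.connection <- dbConnect(RSQLite::SQLite(), "library_access.sqlite")

access.data <- dbGetQuery(
  db.connection,
  "SELECT doi, full_text_indicator FROM dois"
)

dbDisconnect(db.connection)

head(access.data)
```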


dhimmel commented Oct 20, 2017

Yes. SQL is fine! Although you may want to check out dbplyr:

The goal of dbplyr is to automatically generate SQL for you so that you’re not forced to use it. However, SQL is a very large language and dbplyr doesn’t do everything.

It looks like this package (github) makes dplyr work with database backends.

I was imagining we'd want to extract a TSV from the databases that omits the api_response column. This way it's really easy to read in the access data. My thinking is we should have a PR that converts the DB to a TSV and that should come before further analysis of the data. Then the data analyses can read the TSV, which should be more development-friendly.
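Roughly what I have in mind is something like the sketch below (a sketch only; the table, column, and file names are guesses, and using dbplyr here is optional):

```r
# Sketch only: query the access table via dbplyr, drop api_response, write a TSV.
# The table, column, and file names here are guesses.
library(DBI)
library(dplyr)

db.connection <- dbConnect(RSQLite::SQLite(), "library_access.sqlite")

access.table <- tbl(db.connection, "dois") %>%
  select(-api_response) %>%  # omit the bulky raw API responses
  collect()                  # pull the remaining columns into an R data frame

dbDisconnect(db.connection)

write.table(
  access.table,
  "library_access_coverage.tsv",
  sep = "\t",
  row.names = FALSE,
  quote = FALSE
)
```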

I'm going to hold off on reviewing this PR until we merge #7. It's not clear to me that we want to be doing these analyses in this repo as opposed to greenelab/scihub... if we're computing intervals for the library access coverage, we should probably do that for the Sci-Hub coverage as well.


jglev commented Oct 26, 2017

I apologize for my delay in responding -- I was at a conference this week and returned yesterday evening.

As always, thanks for your comments. I hadn't heard about dbplyr; it looks like a good tool for the toolkit. For now, I'll leave the SQL, since you wrote it's ok by Lab guidelines, but am open to incorporating dbplyr in the future.

To confirm that I've understood, is this what you'd want to see in a PR before accepting it:

Split this PR into two PRs:
1. One PR on this repo, converting the data from the database into a TSV with the columns from the original TSV plus a full_text_indicator column
2. One PR on greenelab/scihub, doing the actual estimation of full-text access

Responding to your final sentence: I added the estimation step because of @tamunro's original suggestion to sample, which alluded to getting a Confidence Interval -- both of which I agree with. (As noted in a comment in the code, a Confidence Interval should come out essentially the same as a Bayesian Credible Interval in this case; I think it's beneficial for the scientific literature to prefer the latter, so I used it here.) As I understand it, since the Sci-Hub coverage analysis was basically a census rather than a sample, getting an estimate wouldn't be necessary for the Sci-Hub aspect of the paper. Does that understanding jibe with yours?
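As a quick illustration of why the two intervals essentially coincide in this simple case (the counts below are invented, not from the dataset):

```r
# Made-up counts, purely to illustrate how close the two intervals are here.
n.successes <- 180
n.trials <- 200

# 95% equal-tailed Bayesian credible interval with a flat Beta(1, 1) prior.
credible.interval <- qbeta(
  c(0.025, 0.975),
  1 + n.successes,
  1 + (n.trials - n.successes)
)

# 95% exact (Clopper-Pearson) frequentist confidence interval.
confidence.interval <- binom.test(n.successes, n.trials)$conf.int

credible.interval
confidence.interval
```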


dhimmel commented Oct 30, 2017

Split this PR into two PRs

Yep, I think we're on the same page.

Regarding the second PR, I was thinking we'd add UPenn's coverage as bars in Figure 8B. Since these measure coverage on samples of DOIs, intervals would be appropriate. Also Figure 2A could use the same interval.

Let's save the design discussion for these intervals for later. It'll probably make the most sense for me to make the updates to these figures. However, I'm happy for advice on calculating the intervals. I haven't used BCIs before and would be open to using them if there is a conceptual advantage and the implementation is straightforward.


jglev commented Oct 30, 2017

Ah, thanks for pointing me to Figure 8B in the manuscript. I misunderstood, so to update what I said before: I now think (and this seems to agree with what you wrote earlier) that the Credible Interval or Confidence Interval analysis is not necessary for the library records, given that they'll be incorporated into that figure (or one like it).

If, however, the idea is to extrapolate from this and make inferences about DOI access beyond this dataset, I agree that we'd want to do the analysis on both the library data and the Sci-Hub data that will go into that figure. In that case, the R script I have here (which I'll eventually transfer to a PR on the greenelab/scihub repo) is straightforward to use. It'll just need a line added to the settings section to subset by level of access (Closed, Hybrid, Green, etc.).
