
Getting and manually checking sample of doi results #17

Merged

Conversation

jglev

@jglev jglev commented Dec 12, 2017

This is a work-in-progress PR. I'll update this documentation with a comment tomorrow, but wanted to post the code here in the meantime.

It adds the following:

  • Code for creating a 200-DOI sample, plus the resultant TSV. This is stratified by both full_text_indicator status, and by oadoi_color (20 per cell)
    Thus, this sample isn't proportional by color size. But I think that that's fine here. I can go into this more if there's disagreement about this approach -- @dhimmel, to confirm, does this sound ok to you?
  • A script that can be run to facilitate the manual process of going through each DOI: it will take care of opening the DOI URL, prompting the user for a y/n response (or invalid, if we find any DOIs that don't resolve), and updating the TSV.
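A minimal sketch of the core of that loop (the function name and prompt wording are illustrative; the actual script also writes the response and check date back into the TSV):

check_doi <- function(doi) {
  # Open the DOI in the default browser, then ask the curator for a verdict.
  url_to_visit <- paste0('https://doi.org/', doi)
  message('Opening URL "', url_to_visit, '"...')
  utils::browseURL(url_to_visit)
  readline(prompt = 'Full text available? (y / n / invalid): ')
}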

Jacob Levernier added 7 commits December 12, 2017 11:58
…e easily re-use it. Also started script for getting a sample of DOIs. Also noted in the README that the working directory is always assumed to be the top-level directory for this repository -- this also holds for the Python scripts, to my memory.
…e it deterministic). Confirmed that the sampling is deterministic by running the full script thrice, and manually getting the sha256sum of evaluate_library_access_from_output_tsv/stratefied_sample_of_dois_and_manual_full_text_check_results.tsv after each run. It was the same each time (5a1dcce4b4...a2c8).
@jglev
Author

jglev commented Dec 12, 2017

To confirm, manual tests indicate that this sample is deterministically generated -- set.seed() is being honored.

@jglev
Author

jglev commented Dec 12, 2017

Also, this comes from our discussion in #15.

Contributor

@dhimmel dhimmel left a comment

a 200-DOI sample. This is stratified by both full_text_indicator status, and by oadoi_color (20 per cell)
Thus, this sample isn't proportional by color size. But I think that that's fine here. I can go into this more if there's disagreement about this approach -- @dhimmel, to confirm, does this sound ok to you?

No. This has the problem we discussed in #15 (comment):

Since full_text_indicator and oaDOI color are not independent, stratifying by oaDOI color could inject bias. As an extreme example, imagine all gold articles had full_text_indicator equal to true. Therefore, the stratification would require picking a certain number of gold articles where full_text_indicator was false, although none actually exist.

The purpose of this analysis is to verify the accuracy of the Penn library calls... I don't really think oaDOI is relevant here. Remember oaDOI calls are automated and imperfect. We don't want to contaminate our accuracy analysis with these oaDOI categories.

Can you just select 100 DOIs from each without using oaDOI calls?


set.seed(randomizer_seed_to_set)
stratefied_sample <- merged_datasets %>%
group_by(full_text_indicator, oadoi_color) %>%
Contributor

Can we do dplyr::group_by so it's clear to viewers where functions come from?
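For instance, the snippet above with the namespaces spelled out (the sample_n() call is assumed here, following the later diff):

set.seed(randomizer_seed_to_set)
stratefied_sample <- merged_datasets %>%
  dplyr::group_by(full_text_indicator, oadoi_color) %>%
  dplyr::sample_n(sample_size_per_cell)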

Author

Sure; this is now implemented in 04a2976, following my comment a few moments ago above.

@@ -0,0 +1,72 @@
#### Load libraries ####

library('dplyr')
Contributor

To ensure that all namespaces are declared, while still using pipes, you can do:

# Load magrittr pipe
`%>%` = dplyr::`%>%`

In other words, I try never to call library in R and instead just always use ::

Author

That's fine with me. I've implemented this in 04a2976.
Similarly, I added several R dependencies to environment.yml in bfda19c.

@@ -0,0 +1,45 @@
#### Read the datasets ####

read_and_merge_library_access_datasets <- function(
Contributor

Let's use readr::read_tsv, which should infer the compression properly
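For example (the path comes from the repository's data directory; readr infers the xz compression from the file extension):

library_access_data <- readr::read_tsv(
  file.path('data', 'library_coverage_xml_and_fulltext_indicators.tsv.xz')
)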

Author

This is now implemented in d2b4681.


#### Write the output to a TSV ####

write.table(
Contributor

Use readr::write_tsv, continuing to specify na = ''
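Something along these lines (output path illustrative):

readr::write_tsv(
  stratefied_sample,
  output_tsv_path,
  na = ''
)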

Author

This is now implemented in 354d13a.

@@ -0,0 +1,201 @@
"doi" "oadoi_color" "full_text_indicator_automated" "date_of_manual_full_text_check" "full_text_indicator_manual"
Contributor

Quotes not necessary.

In general, your file names are way too long and verbose. Clarity is good, but long names become difficult to work with... throwing out some ideas: accuracy-analysis.tsv, manual-dois.tsv...

As an example, the directory name here is evaluate_library_access_from_output_tsv. Is _from_output_tsv necessary documentation to include at the directory naming level?

Author

Quotes not necessary.

To confirm, following your comment to use readr::write_tsv, quotes will no longer be written by default.

throwing out some ideas: accuracy-analysis.tsv, manual-dois.tsv

As of e22a83c, I have renamed the TSV file to manual-doi-checks.tsv, following your suggestion. Does that meet the spirit of your suggestion? If not, would you prefer that I name the file manual-dois.tsv exactly?

Is _from_output_tsv necessary documentation to include at the directory naming level?

[Shrug] This seems like an aesthetic choice to me -- I do find more verbose filenames useful in my daily work. If you're worried that the verbosity will be distracting to others viewing this code in the future, or if it's distracting to you now, I don't feel invested in keeping it the way it is. As the project lead, is there a name you would prefer I change this directory to, as well?

Contributor

manual-doi-checks.tsv is great.

This seems like an aesthetic choice to me

Indeed. The issue is that it takes longer to read longer names and that URLs / file paths become unwieldy. Of course, it's a balance and overly long names are better than overly short names. Something like manual-doi-checks could be a good directory name... the idea is that everything related to the manual DOI checks will be in this directory. It captures the scope of the contents well?

Author

(For example, evaluate_library_access seems ambiguous to me, since we're specifically evaluating the automated query's report on access. But evaluate_automated_access_output, e.g., is not much shorter than evaluate_library_access_from_output_tsv.

I'm writing this since you asked whether the extra characters were necessary documentation at the naming level, though I suspect you meant it as a rhetorical question. I appreciate your note, and take your point; if you did mean it as a rhetorical question, then, since you're the project lead, I'm happy to follow whatever naming conventions or requirements you'd like, and will change the directory name to whatever you prefer.)


message('Opening URL "', url_to_visit, '"...')

utils::browseURL(url_to_visit)
Contributor

This curator application is a cool concept. I'd be worried that it'll be difficult to jump around between DOIs... but if it helps you, then use this app.

I'm not going to review it extensively because the actual output dataset is the important one, it's up to you as the curator to fill it in however you find best. So feel free to do this if it helps.

@@ -0,0 +1,201 @@
"doi" "oadoi_color" "full_text_indicator_automated" "date_of_manual_full_text_check" "full_text_indicator_manual"
Contributor

Also do you plan to do an in-network / out-of-network access check? If so, should column name (full_text_indicator_manual) reflect network status?

Author

Fleshing out my comment from #15, I wasn't planning to do the separate out-of-network check, but then I re-read your comment, and see your point. For clarity (and mine/others' review), you noted:

If all articles are freely available off campus, then Penn's subscriptions aren't providing crucial access.

However, this seems like a separate -- though still useful -- research question from our original one, which is re: the coverage of Penn's subscriptions (and, now, as a subset of that, the accuracy of our estimation of that coverage). If I'm reading this accurately, taking that underlying question to its full extent, it would be more about the coverage available to someone going anywhere except Penn and SciHub (like with bronze-ish articles that are illegally/quasi-legally posted on their authors' home pages).

All this is to say that evaluating whether the Library's subscriptions are worthwhile is a full-time job, and there's someone doing exactly that here in our Libraries. Thus, if we do a separate out-of-network access check:

  • If it's for checking the accuracy of our automated means, I don't think it will yield additional information over the on-campus manual check -- though I recognize that I may not be conceptualizing it as you are, and would like to be on the same page.
  • If it's for this additional question, of whether Penn's subscriptions are "providing crucial access," I think that we should think more carefully about that, as it's a big question that would require separate analyses from what we're dealing with in this PR (and in the scope of what we've been discussing in this whole branch of the manuscript project)[1], and which would likely benefit from additional input.

Does that make sense, as I'm explaining it here? If so, would you be open to tabling the idea of the out-of-network check, and sticking with the in-network check?

[1] For example, we might want a larger sample of DOIs if we're answering that question, and would need to separately estimate the rate at which the Library's access is superfluous.

Contributor

Having both an in-network and out-of-network availability call will result in the following categories:

  1. Available in network, Not available out of network
  2. Available in network, available out of network
  3. Not available in network, not available out of network
  4. Not available in network, available out of network. Let's assume this will never happen

1 is where Penn's subscription is critical for access. If we encounter full_text_indicator=false for any category 1 articles then that is a concerning error.

2 will occur when the article is freely available from the publisher. I'm assuming that sometimes we'll see full_text_indicator=false for these category 2 DOIs, due to reasons like:

  • the journal is OA but not part of the database that enables calling full_text_indicator=true
  • the journal is subscription, Penn does not subscribe, but the article is not paywalled

3 represents paywalled articles where Penn doesn't have access. If we encounter full_text_indicator=true here that is a concerning error.

So I think we'll be able to assess the meaning and accuracy of full_text_indicator much better if we can differentiate categories 1-3 above. Make sense?
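A sketch (column names hypothetical) of deriving those categories from two logical manual-check columns:

manual_checks <- dplyr::mutate(
  manual_checks,
  access_category = dplyr::case_when(
    full_text_in_network & !full_text_out_of_network  ~ 1,  # subscription-critical access
    full_text_in_network & full_text_out_of_network   ~ 2,  # freely available regardless
    !full_text_in_network & !full_text_out_of_network ~ 3,  # paywalled, no Penn access
    TRUE                                               ~ 4   # assumed never to occur
  )
)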

Author

Yes, it does; thank you. Implemented in 8ec9792.

@jglev
Author

jglev commented Dec 13, 2017

Can you just select 100 DOIs from each without using oaDOI calls?

Yes -- I've implemented this in 6af5b03. My understanding following your later comment had been that you would be ok with the sub-stratification within full_text_indicator.


# Combine the datasets so that we have doi, full_text_indicator,
# and oadoi_color
merged_datasets <- merge(
Contributor

Let's use dplyr::inner_join or dplyr::left_join (whichever join is appropriate).
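For example (assuming both tables share a doi column):

merged_datasets <- dplyr::inner_join(
  library_access_data,
  original_dataset_with_oa_color_column,
  by = 'doi'
)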

Author

Implemented in f96571d.


# Create a temporary filepath for downloading the original dataset.
# Then download and read it.
tmp_filpath_for_original_dataset <- tempfile()
Contributor

This is all unnecessary I believe... readr should be able to read from the URL

Author

Progress update: I'm switching this to

original_dataset_with_oa_color_column <<- readr::read_tsv(
    original_dataset_with_oa_color_column_location
  )

but am now getting an error Error in make.names(x) : invalid multibyte string at '<fd>7zXZ', which I'm now working to address. I've not worked with readr and compressed files, though, and so thought to ask first, if you see this soon: Have you seen that error before in this context? If so, do you remember how you addressed it?

Contributor

Googling the error brought up this page in Chinese. The translated portion relevant to the error is:

It suddenly became an error. read_csv() assumes UTF-8 as the default file encoding. Apparently this file was not saved as UTF-8. You can now specify the file encoding in read_csv() via the locale argument, so let's deal with that.

original_dataset_with_oa_color_column_location seems to refer to the file here, which should be UTF-8 encoded, so not sure. Try a different locale maybe?
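For example (encoding value illustrative):

original_dataset_with_oa_color_column <- readr::read_tsv(
  original_dataset_with_oa_color_column_location,
  locale = readr::locale(encoding = 'latin1')
)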

Author

Yeah, that was my thought, too. No luck yet, but I'll keep at it (playing with different locales)!

@@ -0,0 +1,43 @@
#### Read the datasets ####
Contributor

I don't think this should be a separate file. Rather, integrate it into the script. First, with the tidyverse the code will become much shorter. Second, it's not a generalized operation (reading this file & merging it in a specific way). If you move this functionality into the script that uses it and keep it in a function, then that is probably okay (but likely unnecessary). It's the splitting of this functionality into a new file and the namespace tactics that are most problematic:

'<<-' at several places below will add the variable into our global scope, beyond this function.

This makes the code hard to follow and can lead to certain bugs.

With the tidyverse, it's often preferable to pipe lots of commands together. For example, this entire function could probably become one chain.
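As a rough sketch (file paths hypothetical), the whole read-and-merge could become a single chain that returns its result instead of assigning with <<-:

`%>%` <- dplyr::`%>%`

merged_datasets <- readr::read_tsv(library_access_tsv_path) %>%
  dplyr::inner_join(readr::read_tsv(oadoi_color_tsv_path), by = 'doi') %>%
  dplyr::select(doi, full_text_indicator, oadoi_color)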

Author

This same code is used in both evaluate_library_access_from_output_tsv.Rmd and create_stratefied_sample_of_dois.R, and so seemed justified to me to split into a function. I agree with your point about using <<-, and normally wouldn't do it, but that also seems justified to me in this case (in the context of wrapping this in a function), as, in line with your comment, this is not a generalized operation (even if you were noting that to argue against wrapping it in a function).

I assumed that you would globally argue against repeating code across files, and so took this function approach; so that we're on the same page, before I make this change, are you ok with repeating the read-and-merge across files in this specific case?

Contributor

are you ok with repeating the read-and-merge across files in this specific case?

Yes that's okay to have some repetition of parts of data processing code. The tidyverse APIs are already very high level. If you find yourself constantly repeating the same read-and-merge operations, then maybe you should save those results as a TSV (not saying that's the case here).

Author

Implemented in aac5bf5.

Contributor

@dhimmel dhimmel left a comment

My understanding following your later comment had been that you would be ok with the sub-stratification within full_text_indicator

+stratefied_sample <- merged_datasets %>%
+  dplyr::group_by(full_text_indicator) %>%
+  dplyr::sample_n(sample_size_per_cell)

Looks good now!


#### Add columns to fill in manually to the stratefied sample dataframe ####

colnames(stratefied_sample)[3] <- 'full_text_indicator_automated'
Contributor

dplyr::mutate to create new columns
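For example, adding the blank columns that get filled in manually (column names as in the TSV header above):

stratefied_sample <- dplyr::mutate(
  stratefied_sample,
  date_of_manual_full_text_check = NA_character_,
  full_text_indicator_manual = NA
)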

Author

@jglev jglev Dec 15, 2017

Implemented in 0d9724a.

Contributor

Avoid referring to columns by index rather than name:

colnames(stratefied_sample)[3] <- 'full_text_indicator_automated'

Are you intending to do a dplyr::rename here?

Then everything can be chained together so you're only assigning stratefied_sample a single time.
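Roughly (assuming the third column is full_text_indicator, per the TSV header above):

stratefied_sample <- merged_datasets %>%
  dplyr::group_by(full_text_indicator) %>%
  dplyr::sample_n(sample_size_per_cell) %>%
  dplyr::ungroup() %>%
  dplyr::rename(full_text_indicator_automated = full_text_indicator) %>%
  dplyr::mutate(
    date_of_manual_full_text_check = NA_character_,
    full_text_indicator_manual = NA
  )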

Contributor

@dhimmel dhimmel Dec 15, 2017

Ah shit, readr probably can only read gz URLs:

Remote gz files can also be automatically downloaded and decompressed.

tidyverse/readr#163
tidyverse/readr@22d6035

That's really annoying. Do you need to read this file anymore here (now that it's not being used for stratification)?

Author

Re: dplyr::rename, implemented in 8e6e9f3.

Contributor

Opened tidyverse/readr#764. But don't wait for this to be addressed (likely will take months to years before a release with the feature comes out).

Author

I don't need to read the second file per se, since we're no longer stratifying within oadoi_color. Is your preference that I remove it? The current (non-readr) code does work as-is, if we want to retain it.

Contributor

Is your preference that I remove it?

Yes! Less is more!

Author

Implemented at approx. aac5bf5.

@jglev
Author

jglev commented Dec 19, 2017

@dhimmel, I think that the changes I've pushed just now answer all of the comments you've brought up. Do they look complete to you, too?

```{r settings, include = FALSE}
lzma_compressed_library_access_tsv_location <- file.path(
'data', 'library_coverage_xml_and_fulltext_indicators.tsv.xz'
)

original_dataset_with_oa_color_column_location <- paste0(
Contributor

This can go?

Author

Ah, yes, I forgot to change that.

I can add (and forgot just now to add) a second table to the Rmd that doesn't stratify by oadoi_color, but I actually do want to keep the existing sub-stratified table here; it's something that my supervisor specifically requested, to help diagnose whether there are issues we need to debug in our catalog website.

So, I do need that for my own work; but I can take it out of this repo., if you prefer. Would you accept me adding a second table that doesn't stratify, and keeping this one in?

Contributor

Would you accept me adding a second table that doesn't stratify, and keeping this one in?

Sure this Rmd file can be used for these exploratory analyses.

Author

The Rmd file will also need to be updated to reflect the on- vs. off-campus columns I just added, so maybe we could hammer out now exactly what you'd like to incorporate into the manuscript.

When we last talked about it, you mentioned that you'd like to incorporate the rates we get "as bars in Figure 8B." Is that still the case? If so, would you want a table along these lines, or something different?

Sample            On-Campus                   Off-Campus
Web of Science    X out of Y articles (Z%)
Unpaywall
Crossref

Author

Or, I could remove the Rmd file from this PR for now, and we could work on that later.

Author

Sorry, also, GitHub didn't auto-refresh, so I didn't see your comment before posting follow-ups. Thanks!

Contributor

Or, I could remove the Rmd file from this PR for now, and we could work on that later.

That makes sense!

would you want a table along these lines

Let me think a little more about how to represent the accuracy analysis results. They'll probably go in the methods section. Will tag you in the relevant issue in another repo when the time comes.

Author

Sure, that all sounds good to me!

@jglev
Author

jglev commented Dec 19, 2017

Also, so that we're on the same page in advance, the Penn Library is shut down all next week (the 25th through the 29th); thus, I'll be pausing my work here between this Friday and the 2nd. I know that you previously mentioned that you're hoping to get this finished asap, so I'm noting it here, rather than dropping out of contact for several days.

# Read the dataset -------------------------------------------------------------

library_access_data <- readr::read_tsv(
gzfile(lzma_compressed_library_access_data_location),
Contributor

gzfile not needed here. readr will detect that path ends in .xz. I'm actually surprised gzfile works, given that wouldn't it be xzfile?
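In other words, this should work on its own (readr recognizes the .xz suffix on local paths):

library_access_data <- readr::read_tsv(
  lzma_compressed_library_access_data_location
)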

Author

Removed in ccf4a14.

Re: gzfile, I didn't actually consider xzfile (I didn't know about it, until you mentioned it just now), as gzfile was the first thing I found, and it worked. From its manual,

For gzfile the description is the path to a file compressed by gzip: it can also open for reading uncompressed files and those compressed by bzip2, xz or lzma.

So, the function's name is possibly confusingly narrow. I wonder whether the gzfile R function was developed earlier than xzfile?

@dhimmel
Contributor

dhimmel commented Dec 19, 2017

Can you upload the manual-doi-checks.tsv skeleton in this PR and remove the Rmd (saving for later)?

@jglev
Author

jglev commented Dec 19, 2017

I've removed the Rmd for now, and added the TSV of sampled DOIs (with blank columns to fill in manually) in 9e7a83a.

@dhimmel dhimmel merged commit a0cea58 into greenelab:master Dec 19, 2017