
Getting and manually checking sample of doi results #17

Merged

Conversation

jglev

@jglev jglev commented Dec 12, 2017

This is a work-in-progress PR. I'll update this documentation with a comment tomorrow, but wanted to post the code here in the meantime.

It adds the following:

  • Code for creating a 200-DOI sample, plus the resultant TSV. This is stratified by both full_text_indicator status, and by oadoi_color (20 per cell)
    Thus, this sample isn't proportional by color size. But I think that that's fine here. I can go into this more if there's disagreement about this approach -- @dhimmel, to confirm, does this sound ok to you?
  • A script that can be run to facilitate the manual process of going through each DOI: it will take care of opening the DOI URL, prompting the user for a y/n response (or invalid, if we find any DOIs that don't resolve), and updating the TSV.
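A minimal sketch of the core of that loop (the function name and prompt wording are illustrative; the actual script also writes the response and check date back into the TSV):

check_doi <- function(doi) {
  # Open the DOI in the default browser, then ask the curator for a verdict.
  url_to_visit <- paste0('https://doi.org/', doi)
  message('Opening URL "', url_to_visit, '"...')
  utils::browseURL(url_to_visit)
  readline(prompt = 'Full text available? (y / n / invalid): ')
}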

Jacob Levernier added 7 commits December 12, 2017 11:58
…e easily re-use it. Also started script for getting a sample of DOIs. Also noted in the README that the working directory is always assumed to be the top-level directory for this repository -- this also holds for the Python scripts, to my memory.
…e it deterministic). Confirmed that the sampling is deterministic by running the full script thrice, and manually getting the sha256sum of evaluate_library_access_from_output_tsv/stratefied_sample_of_dois_and_manual_full_text_check_results.tsv after each run. It was the same each time (5a1dcce4b4...a2c8).
@jglev
Author

jglev commented Dec 12, 2017

To confirm, manual tests indicate that this sample is deterministically generated -- set.seed() is being honored.

@jglev
Author

jglev commented Dec 12, 2017

Also, this comes from our discussion in #15.

Contributor

@dhimmel dhimmel left a comment

a 200-DOI sample. This is stratified by both full_text_indicator status, and by oadoi_color (20 per cell)
Thus, this sample isn't proportional by color size. But I think that that's fine here. I can go into this more if there's disagreement about this approach -- @dhimmel, to confirm, does this sound ok to you?

No. This has the problem we discussed in #15 (comment):

Since full_text_indicator and oaDOI color are not independent, stratifying by oaDOI color could inject bias. As an extreme example, imagine all gold articles had full_text_indicator equal to true. Therefore, the stratification would require picking a certain number of gold articles where full_text_indicator was false, although none actually exist.

The purpose of this analysis is to verify the accuracy of the Penn library calls... I don't really think oaDOI is relevant here. Remember oaDOI calls are automated and imperfect. We don't want to contaminate our accuracy analysis with these oaDOI categories.

Can you just select 100 DOIs from each without using oaDOI calls?


set.seed(randomizer_seed_to_set)
stratefied_sample <- merged_datasets %>%
group_by(full_text_indicator, oadoi_color) %>%
Contributor

Can we do dplyr::group_by so it's clear to viewers where functions come from?
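For instance, the snippet above with the namespaces spelled out (the sample_n() call is assumed here, following the later diff):

set.seed(randomizer_seed_to_set)
stratefied_sample <- merged_datasets %>%
  dplyr::group_by(full_text_indicator, oadoi_color) %>%
  dplyr::sample_n(sample_size_per_cell)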

Author

Sure; this is now implemented in 04a2976, following my comment a few moments ago above.

@@ -0,0 +1,72 @@
#### Load libraries ####

library('dplyr')
Contributor

To ensure that all namespaces are declared, while still using pipes, you can do:

# Load magrittr pipe
`%>%` = dplyr::`%>%`

In other words, I try never to call library in R and instead just always use ::

Author

That's fine with me. I've implemented this in 04a2976.
Similarly, I added several R dependencies to environment.yml in bfda19c.

@@ -0,0 +1,45 @@
#### Read the datasets ####

read_and_merge_library_access_datasets <- function(
Contributor

Let's use readr::read_tsv, which should infer the compression properly
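For example (the path comes from the repository's data directory; readr infers the xz compression from the file extension):

library_access_data <- readr::read_tsv(
  file.path('data', 'library_coverage_xml_and_fulltext_indicators.tsv.xz')
)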

Author

This is now implemented in d2b4681.


#### Write the output to a TSV ####

write.table(
Contributor

Use readr::write_tsv, continuing to specify na = ''
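Something along these lines (output path illustrative):

readr::write_tsv(
  stratefied_sample,
  output_tsv_path,
  na = ''
)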

Author

This is now implemented in 354d13a.

@@ -0,0 +1,201 @@
"doi" "oadoi_color" "full_text_indicator_automated" "date_of_manual_full_text_check" "full_text_indicator_manual"
Contributor

Quotes not necessary.

In general, your file names are way too long and verbose. Clarity is good, but long names become difficult to work with... throwing out some ideas: accuracy-analysis.tsv, manual-dois.tsv...

As an example, the directory name here is evaluate_library_access_from_output_tsv. Is _from_output_tsv necessary documentation to include at the directory naming level?

Author

Quotes not necessary.

To confirm, following your comment to use readr::write_tsv, quotes will no longer be written by default.

throwing out some ideas: accuracy-analysis.tsv, manual-dois.tsv

As of e22a83c, I have renamed the TSV file to manual-doi-checks.tsv, following your suggestion. Does that meet the spirit of your suggestion? If not, would you prefer that I name the file manual-dois.tsv exactly?

Is _from_output_tsv necessary documentation to include at the directory naming level?

[Shrug] This seems like an aesthetic choice to me -- I do find more verbose filenames useful in my daily work. If you're worried that the verbosity will be distracting to others viewing this code in the future, or if it's distracting to you now, I don't feel invested in keeping it the way it is. As the project lead, is there a name you would prefer I change this directory to, as well?

Contributor

manual-doi-checks.tsv is great.

This seems like an aesthetic choice to me

Indeed. The issue is that it takes longer to read longer names and that URLs / file paths become unwieldy. Of course, it's a balance and overly long names are better than overly short names. Something like manual-doi-checks could be a good directory name... the idea is that everything related to the manual DOI checks will be in this directory. It captures the scope of the contents well?

Author

(For example, evaluate_library_access seems ambiguous to me, since we're specifically evaluating the automated query's report on access. But evaluate_automated_access_output, e.g., is not much shorter than evaluate_library_access_from_output_tsv.

I'm writing this since you asked whether the extra characters were necessary documentation at the naming level, though I suspect you meant it as a rhetorical question. I appreciate your note, and take your point; if you did mean it as a rhetorical question, then, since you're the project lead, I'm happy to follow whatever naming conventions or requirements you'd like, and will change the directory name to whatever you prefer.)


message('Opening URL "', url_to_visit, '"...')

utils::browseURL(url_to_visit)
Contributor

This curator application is a cool concept. I'd be worried that it'll be difficult to jump around between DOIs... but if it helps you, then use this app.

I'm not going to review it extensively because the actual output dataset is the important one, it's up to you as the curator to fill it in however you find best. So feel free to do this if it helps.

@@ -0,0 +1,201 @@
"doi" "oadoi_color" "full_text_indicator_automated" "date_of_manual_full_text_check" "full_text_indicator_manual"
Contributor

Also do you plan to do an in-network / out-of-network access check? If so, should column name (full_text_indicator_manual) reflect network status?

Author

Fleshing out my comment from #15, I wasn't planning to do the separate out-of-network check, but then I re-read your comment, and see your point. For clarity (and mine/others' review), you noted:

If all articles are freely available off campus, then Penn's subscriptions aren't providing crucial access.

However, this seems like a separate -- though still useful -- research question from our original one, which is re: the coverage of Penn's subscriptions (and, now, as a subset of that, the accuracy of our estimation of that coverage). If I'm reading this accurately, taking that underlying question to its full extent, it would be more about the coverage available to someone going anywhere except Penn and SciHub (like with bronze-ish articles that are illegally/quasi-legally posted on their authors' home pages).

All this is to say that evaluating whether the Library's subscriptions are worthwhile is a full-time job, and there's someone doing exactly that here in our Libraries. Thus, if we do a separate out-of-network access check:

  • If it's for checking the accuracy of our automated means, I don't think it will yield additional information over the on-campus manual check -- though I recognize that I may not be conceptualizing it as you are, and would like to be on the same page.
  • If it's for this additional question, of whether Penn's subscriptions are "providing crucial access," I think that we should think more carefully about that, as it's a big question that would require separate analyses from what we're dealing with in this PR (and in the scope of what we've been discussing in this whole branch of the manuscript project)[1], and which would likely benefit from additional input.

Does that make sense, as I'm explaining it here? If so, would you be open to tabling the idea of the out-of-network check, and sticking with the in-network check?

[1] For example, we might want a larger sample of DOIs if we're answering that question, and would need to separately estimate the rate at which the Library's access is superfluous.

Contributor

Having both an in-network and out-of-network availability call will result in the following categories:

  1. Available in network, Not available out of network
  2. Available in network, available out of network
  3. Not available in network, not available out of network
  4. Not available in network, available out of network. Let's assume this will never happen

1 is where Penn's subscription is critical for access. If we encounter full_text_indicator=false for any category 1 articles then that is a concerning error.

2 will occur when the article is freely available from the publisher. I'm assuming that sometimes we'll see full_text_indicator=false for these category 2 DOIs, due to reasons like:

  • the journal is OA but not part of the database that enables calling full_text_indicator=true
  • the journal is subscription, Penn does not subscribe, but the article is not paywalled

3 represents paywalled articles where Penn doesn't have access. If we encounter full_text_indicator=true here that is a concerning error.

So I think we'll be able to assess the meaning and accuracy of full_text_indicator much better if we can differentiate categories 1-3 above. Make sense?
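A sketch (column names hypothetical) of deriving those categories from two logical manual-check columns:

manual_checks <- dplyr::mutate(
  manual_checks,
  access_category = dplyr::case_when(
    full_text_in_network & !full_text_out_of_network  ~ 1,  # subscription-critical access
    full_text_in_network & full_text_out_of_network   ~ 2,  # freely available regardless
    !full_text_in_network & !full_text_out_of_network ~ 3,  # paywalled, no Penn access
    TRUE                                               ~ 4   # assumed never to occur
  )
)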

Author

Yes, it does; thank you. Implemented in 8ec9792.

@jglev
Author

jglev commented Dec 13, 2017

Can you just select 100 DOIs from each without using oaDOI calls?

Yes -- I've implemented this in 6af5b03. My understanding following your later comment had been that you would be ok with the sub-stratification within full_text_indicator.


# Combine the datasets so that we have doi, full_text_indicator,
# and oadoi_color
merged_datasets <- merge(
Contributor

Let's use dplyr::inner_join or dplyr::left_join (whichever join is appropriate).
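For example (assuming both tables share a doi column):

merged_datasets <- dplyr::inner_join(
  library_access_data,
  original_dataset_with_oa_color_column,
  by = 'doi'
)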

Author

Implemented in f96571d.


# Create a temporary filepath for downloading the original dataset.
# Then download and read it.
tmp_filpath_for_original_dataset <- tempfile()
Contributor

This is all unnecessary I believe... readr should be able to read from the URL

Author

Progress update: I'm switching this to

original_dataset_with_oa_color_column <<- readr::read_tsv(
    original_dataset_with_oa_color_column_location
  )

but am now getting an error Error in make.names(x) : invalid multibyte string at '<fd>7zXZ', which I'm now working to address. I've not worked with readr and compressed files, though, and so thought to ask first, if you see this soon: Have you seen that error before in this context? If so, do you remember how you addressed it?

Contributor

Googling the error brought up this page in Chinese. The translated portion relevant to the error is:

It suddenly became an error. read_csv() assumes UTF-8 as the default file encoding. Apparently this file was not saved as UTF-8. You can now specify the file encoding in read_csv() via the locale argument, so let's deal with that.

original_dataset_with_oa_color_column_location seems to refer to the file here, which should be UTF-8 encoded, so not sure. Try a different locale maybe?
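For example (encoding value illustrative):

original_dataset_with_oa_color_column <- readr::read_tsv(
  original_dataset_with_oa_color_column_location,
  locale = readr::locale(encoding = 'latin1')
)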

Author

Yeah, that was my thought, too. No luck yet, but I'll keep at it (playing with different locales)!

@@ -0,0 +1,43 @@
#### Read the datasets ####
Contributor

I don't think this should be a separate file. Rather, integrate it into the script. First, with the tidyverse the code will become much shorter. Second, it's not a generalized operation (reading this file & merging it in a specific way). If you move this functionality into the script that uses it and keep it in a function, then that is probably okay (but likely unnecessary). It's the splitting of this functionality into a new file and the namespace tactics that are most problematic:

'<<-' at several places below will add the variable into our global scope, beyond this function.

This makes the code hard to follow and can lead to certain bugs.

With the tidyverse, it's often preferable to pipe lots of commands together. For example, this entire function could probably become one chain.
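As a rough sketch (file paths hypothetical), the whole read-and-merge could become a single chain that returns its result instead of assigning with <<-:

`%>%` <- dplyr::`%>%`

merged_datasets <- readr::read_tsv(library_access_tsv_path) %>%
  dplyr::inner_join(readr::read_tsv(oadoi_color_tsv_path), by = 'doi') %>%
  dplyr::select(doi, full_text_indicator, oadoi_color)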

Author

This same code is used in both evaluate_library_access_from_output_tsv.Rmd and create_stratefied_sample_of_dois.R, and so seemed justified to me to split into a function. I agree with your point about using <<-, and normally wouldn't do it, but that also seems justified to me in this case (in the context of wrapping this in a function), as, in line with your comment, this is not a generalized operation (even if you were noting that to argue against wrapping it in a function).

I assumed that you would globally argue against repeating code across files, and so took this function approach; so that we're on the same page, before I make this change, are you ok with repeating the read-and-merge across files in this specific case?

Contributor

are you ok with repeating the read-and-merge across files in this specific case?

Yes that's okay to have some repetition of parts of data processing code. The tidyverse APIs are already very high level. If you find yourself constantly repeating the same read-and-merge operations, then maybe you should save those results as a TSV (not saying that's the case here).

Author

Implemented in aac5bf5.

Contributor

@dhimmel dhimmel left a comment

My understanding following your later comment had been that you would be ok with the sub-stratification within full_text_indicator

+stratefied_sample <- merged_datasets %>%
+  dplyr::group_by(full_text_indicator) %>%
+  dplyr::sample_n(sample_size_per_cell)

Looks good now!


#### Add columns to fill in manually to the stratefied sample dataframe ####

colnames(stratefied_sample)[3] <- 'full_text_indicator_automated'
Contributor

dplyr::mutate to create new columns
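For example, adding the blank columns that get filled in manually (column names as in the TSV header above):

stratefied_sample <- dplyr::mutate(
  stratefied_sample,
  date_of_manual_full_text_check = NA_character_,
  full_text_indicator_manual = NA
)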

Author

@jglev jglev Dec 15, 2017

Implemented in 0d9724a.

Contributor

Avoid referring to columns by index rather than name:

colnames(stratefied_sample)[3] <- 'full_text_indicator_automated'

Are you intending to do a dplyr::rename here?

Then everything can be chained together so you're only assigning stratefied_sample a single time.
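Roughly (assuming the third column is full_text_indicator, per the TSV header above):

stratefied_sample <- merged_datasets %>%
  dplyr::group_by(full_text_indicator) %>%
  dplyr::sample_n(sample_size_per_cell) %>%
  dplyr::ungroup() %>%
  dplyr::rename(full_text_indicator_automated = full_text_indicator) %>%
  dplyr::mutate(
    date_of_manual_full_text_check = NA_character_,
    full_text_indicator_manual = NA
  )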

Contributor

@dhimmel dhimmel Dec 15, 2017

Ah shit, readr probably can only read gz URLs:

Remote gz files can also be automatically downloaded and decompressed.

tidyverse/readr#163
tidyverse/readr@22d6035

That's really annoying. Do you need to read this file anymore here (now that it's not being used for stratification)?

Author

Re: dplyr::rename, implemented in 8e6e9f3.

Contributor

Opened tidyverse/readr#764. But don't wait for this to be addressed (likely will take months to years before a release with the feature comes out).

Author

I don't need to read the second file per se, since we're no longer stratifying within oadoi_color. Is your preference that I remove it? The current (non-readr) code does work as-is, if we want to retain it.

Contributor

Is your preference that I remove it?

Yes! Less is more!

Author

Implemented at approx. aac5bf5.

@jglev
Author

jglev commented Dec 19, 2017

@dhimmel, I think that the changes I've pushed just now answer all of the comments you've brought up. Do they look complete to you, too?

```{r settings, include = FALSE}
lzma_compressed_library_access_tsv_location <- file.path(
'data', 'library_coverage_xml_and_fulltext_indicators.tsv.xz'
)

original_dataset_with_oa_color_column_location <- paste0(
Contributor

This can go?

Author

Ah, yes, I forgot to change that.

I can add (and forgot just now to add) a second table to the Rmd that doesn't stratify by oadoi_color, but I actually do want to keep the existing sub-stratified table here; it's something that my supervisor specifically requested, to help diagnose whether there are issues we need to debug in our catalog website.

So, I do need that for my own work; but I can take it out of this repo., if you prefer. Would you accept me adding a second table that doesn't stratify, and keeping this one in?

Contributor

Would you accept me adding a second table that doesn't stratify, and keeping this one in?

Sure this Rmd file can be used for these exploratory analyses.

Author

The Rmd file will also need to be updated to reflect the on- vs. off-campus columns I just added, so maybe we could hammer out now exactly what you'd like to incorporate into the manuscript.

When we last talked about it, you mentioned that you'd like to incorporate the rates we get "as bars in Figure 8B." Is that still the case? If so, would you want a table along these lines, or something different?

Sample            On-Campus                   Off-Campus
Web of Science    X out of Y articles (Z%)
Unpaywall
Crossref

Author

Or, I could remove the Rmd file from this PR for now, and we could work on that later.

Author

Sorry, also, GitHub didn't auto-refresh, so I didn't see your comment before posting follow-ups. Thanks!

Contributor

Or, I could remove the Rmd file from this PR for now, and we could work on that later.

That makes sense!

would you want a table along these lines

Let me think a little more about how to represent the accuracy analysis results. They'll probably go in the methods section. Will tag you in the relevant issue in another repo when the time comes.

Author

Sure, that all sounds good to me!

@jglev
Author

jglev commented Dec 19, 2017

Also, so that we're on the same page in advance, the Penn Library is shut down all next week (the 25th through the 29th); thus, I'll be pausing my work here between this Friday and the 2nd. I know that you previously mentioned that you're hoping to get this finished asap, so I'm noting it here, rather than dropping out of contact for several days.

# Read the dataset -------------------------------------------------------------

library_access_data <- readr::read_tsv(
gzfile(lzma_compressed_library_access_data_location),
Contributor

gzfile not needed here. readr will detect that path ends in .xz. I'm actually surprised gzfile works, given that wouldn't it be xzfile?
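In other words, this should work on its own (readr recognizes the .xz suffix on local paths):

library_access_data <- readr::read_tsv(
  lzma_compressed_library_access_data_location
)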

Author

Removed in ccf4a14.

Re: gzfile, I didn't actually consider xzfile (I didn't know about it, until you mentioned it just now), as gzfile was the first thing I found, and it worked. From its manual,

For gzfile the description is the path to a file compressed by gzip: it can also open for reading uncompressed files and those compressed by bzip2, xz or lzma.

So, the function's name is possibly confusingly narrow. I wonder whether the gzfile R function was developed earlier than xzfile?

@dhimmel
Contributor

dhimmel commented Dec 19, 2017

Can you upload the manual-doi-checks.tsv skeleton in this PR and remove the Rmd (saving for later)?

@jglev
Author

jglev commented Dec 19, 2017

I've removed the Rmd for now, and added the TSV of sampled DOIs (with blank columns to fill in manually) in 9e7a83a.

@dhimmel dhimmel merged commit a0cea58 into greenelab:master Dec 19, 2017