Getting and manually checking sample of doi results #17

Merged
22 commits
3c352c8
Wrapped code for getting and merging datasets into a function, to mor…
Dec 12, 2017
be74b2f
Moved files, and got filepaths working again for Rmd file.
Dec 12, 2017
633750d
Added minor documentation.
Dec 12, 2017
1b51a12
Used dplyr to get stratefied random sample that respects seed (to mak…
Dec 12, 2017
c085af0
Added script to facilitate going through DOIs manually.
Dec 12, 2017
edbf6ad
Increased total sample to 200 DOIs.
Dec 12, 2017
c49ddaa
Applied 200-sample code to create TSV.
Dec 12, 2017
bfda19c
Added R packages to environment.yml.
Dec 13, 2017
04a2976
Explicitly declared all R namespaces.
Dec 13, 2017
354d13a
Moved to readr::write_tsv instead of base R write.table.
Dec 13, 2017
d2b4681
Moved to readr::read_tsv from base R read.table.
Dec 13, 2017
e22a83c
Changed name of doi check TSV.
Dec 13, 2017
6af5b03
Switched to a sample that does not stratify by oadoi_color.
Dec 13, 2017
0d9724a
Used dplyr mutate() to create new columns.
Dec 15, 2017
1d5fc14
Used dplyr rename to rename columns.
Dec 15, 2017
8e6e9f3
Switched to one chain for tibble creation, and updated heading style.
Dec 15, 2017
6a91eee
Updated formatting of headings and 80-character line max.
Dec 19, 2017
aac5bf5
Moved away from separate merge function, and took oadoi color out of …
Dec 19, 2017
f96571d
Moved to read_tsv from read.table, and to inner_join from merge.
Dec 19, 2017
8ec9792
Added on- vs. off-campus columns, and got facilitation script working…
Dec 19, 2017
ccf4a14
Removed vestigial gzfile call.
Dec 19, 2017
9e7a83a
Removed Rmd file, and added doi sample tsv.
Dec 19, 2017
4 changes: 4 additions & 0 deletions README.md
@@ -18,6 +18,10 @@ conda env create --file=environment.yml
Then use `source activate library-access` and `source deactivate` to activate or deactivate the environment.
On windows, use `activate library-access` and `deactivate` instead.

## Using the Code

The code files in this repository assume that your working directory is set to the top-level directory of this repository.

## License

The files in this repository are released under the CC0 1.0 public domain dedication ([`LICENSE-CC0.md`](LICENSE-CC0.md)), excepting those that match the glob patterns listed below.
6 changes: 6 additions & 0 deletions environment.yml
@@ -9,6 +9,12 @@ dependencies:
- anaconda::pytest=3.2.1
- anaconda::python=3.6.1
- anaconda::r-base=3.4.1
- anaconda::r-dplyr=0.7.0
- anaconda::r-ggplot2=2.2.1
- anaconda::r-knitr=1.16
- anaconda::r-markdown=0.8
- anaconda::r-readr=1.1.1
- anaconda::r-rmarkdown=1.5
- anaconda::requests=2.14.2
- anaconda::spyder=3.1.4
- anaconda::sqlalchemy=1.1.9
@@ -0,0 +1,56 @@
# Load dependencies ------------------------------------------------------------

# Load magrittr pipe
`%>%` = dplyr::`%>%`

# Settings ---------------------------------------------------------------------

lzma_compressed_library_access_data_location <- file.path(
'data', 'library_coverage_xml_and_fulltext_indicators.tsv.xz'
)

sample_size_per_cell <- 100 # Rows sampled per full_text_indicator status;
# with 2 statuses, the total sample is 200.

output_tsv_location <- file.path(
'evaluate_library_access_from_output_tsv',
'manual-doi-checks.tsv'
)

randomizer_seed_to_set <- 3 # Ensure that random sampling will always return
# the same result.

# Read the dataset -------------------------------------------------------------

library_access_data <- readr::read_tsv(
gzfile(lzma_compressed_library_access_data_location),
Contributor

`gzfile` is not needed here: readr will detect that the path ends in `.xz`. I'm actually surprised `gzfile` works. Shouldn't it be `xzfile`?

Author

Removed in ccf4a14.

Re: gzfile, I didn't actually consider xzfile (I didn't know about it, until you mentioned it just now), as gzfile was the first thing I found, and it worked. From its manual,

> For gzfile the description is the path to a file compressed by gzip: it can also open for reading uncompressed files and those compressed by bzip2, xz or lzma.

So, the function's name is possibly confusingly narrow. I wonder whether the gzfile R function was developed earlier than xzfile?
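
A minimal sketch of the behaviour discussed in this thread, assuming readr is installed. The temporary file and its contents are made up for illustration. It writes a small xz-compressed TSV with base R's `xzfile()`, reads it by path with `readr::read_tsv`, and then reads the same file through `gzfile()`:

```r
# Write a tiny TSV through base R's xzfile() connection (toy data).
example_path <- tempfile(fileext = '.tsv.xz')
connection <- xzfile(example_path, open = 'w')
writeLines(c('doi\tfull_text_indicator', '10.1000/example\t1'), connection)
close(connection)

# readr decompresses based on the file extension, so no connection is needed.
from_path <- readr::read_tsv(example_path)

# Base R's gzfile() can also read xz-compressed files, as its manual states.
connection <- gzfile(example_path, open = 'r')
from_gzfile <- readLines(connection)
close(connection)
```

Passing the path directly keeps the call shorter and avoids implying that the file is gzip-compressed.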

)
# View(library_access_data) # Check the dataset

# Convert variable to factor:
library_access_data <- library_access_data %>% dplyr::mutate(
full_text_indicator = as.factor(full_text_indicator)
)

# Create stratified sample, and clean up the tibble ----------------------------

set.seed(randomizer_seed_to_set)
stratefied_sample <- library_access_data %>%
dplyr::group_by(full_text_indicator) %>%
dplyr::sample_n(sample_size_per_cell) %>%
# Add columns to fill in manually to the stratified sample dataframe:
dplyr::rename('full_text_indicator_automated' = 'full_text_indicator') %>%
dplyr::mutate(
date_of_manual_full_text_check_inside_campus = NA,
full_text_indicator_manual_inside_campus = NA,
date_of_manual_full_text_check_outside_campus = NA,
full_text_indicator_manual_outside_campus = NA
)

# Write the output to a TSV ----------------------------------------------------

readr::write_tsv(
stratefied_sample,
output_tsv_location,
na = ''
)
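
The seeded, grouped sampling used in the script above can be checked on toy data. This is a sketch that assumes dplyr is installed; `toy_data` and `draw_sample` are hypothetical names introduced only for illustration. It draws the same per-group sample twice with the same seed:

```r
# Load magrittr pipe, as in the script above.
`%>%` <- dplyr::`%>%`

# Toy stand-in for the library access data: 10 rows per indicator status.
toy_data <- data.frame(
  id = 1:20,
  full_text_indicator = rep(c(0, 1), each = 10)
)

# Set the seed, then sample a fixed number of rows within each group.
draw_sample <- function(seed) {
  set.seed(seed)
  toy_data %>%
    dplyr::group_by(full_text_indicator) %>%
    dplyr::sample_n(3) %>%
    dplyr::ungroup()
}

first_draw <- draw_sample(3)
second_draw <- draw_sample(3)

# Both draws should contain the same rows, 3 per indicator status.
identical(first_draw$id, second_draw$id)
```

Calling `set.seed()` immediately before the sampling step matters, since any earlier call that consumes random numbers would change which rows are drawn.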
@@ -5,24 +5,30 @@ date: "2017"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(include = FALSE)
knitr::opts_chunk$set(results = "asis")
knitr::opts_chunk$set(cache = TRUE)
```

```{r settings}
lzma_compressed_library_access_tsv_location <- "data/library_coverage_xml_and_fulltext_indicators.tsv.xz"
```{r settings, include = FALSE}
lzma_compressed_library_access_tsv_location <- file.path(
'data', 'library_coverage_xml_and_fulltext_indicators.tsv.xz'
)

original_dataset_with_oa_color_column_location <- paste0(
Contributor

This can go?

Author

Ah, yes, I forgot to change that.

I forgot just now to add a second table to the Rmd that doesn't stratify by oadoi_color, and I can still do so. However, I do want to keep the existing sub-stratified table here; it's something that my supervisor specifically requested to help diagnose whether there are issues we need to debug in our catalog website.

So, I do need that for my own work, but I can take it out of this repo if you prefer. Would you accept me adding a second table that doesn't stratify, and keeping this one in?

Contributor

> Would you accept me adding a second table that doesn't stratify, and keeping this one in?

Sure this Rmd file can be used for these exploratory analyses.

Author

The Rmd file will also need to be updated to reflect the on- vs. off-campus columns I just added, too, actually. So maybe we could hammer out exactly what you'd like for incorporating into the manuscript now.

When we last talked about it (I think that was the last time we discussed it), you mentioned that you'd like to incorporate the rates we get "as bars in Figure 8B." Is that still the case? If so, would you want a table along these lines, or something different?

| Sample | On-Campus | Off-Campus |
| --- | --- | --- |
| Web of Science | X out of Y articles (Z%) | |
| Unpaywall | | |
| Crossref | | |
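
A cell in the "X out of Y articles (Z%)" format proposed above could be produced along these lines. This is only a sketch: `format_access_rate` is a hypothetical helper, the input values are made up, and excluding unchecked and `invalid` DOIs from the denominator is an assumption, not a decision made in this thread.

```r
# Format an access rate from a manual-check column, where 1 = access,
# 0 = no access, 'invalid' = invalid DOI, and NA = not yet checked.
# Assumption: NA and 'invalid' entries are dropped before counting.
# An input with no checked entries is not handled here.
format_access_rate <- function(indicator) {
  checked <- indicator[!is.na(indicator) & indicator != 'invalid']
  with_access <- sum(checked == 1)
  sprintf(
    '%d out of %d articles (%.1f%%)',
    with_access,
    length(checked),
    100 * with_access / length(checked)
  )
}

# Toy column in the style of full_text_indicator_manual_inside_campus:
example_cell <- format_access_rate(c(1, 0, 1, NA, 'invalid', 1))
# example_cell is '3 out of 4 articles (75.0%)'
```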

Author

Or, I could remove the Rmd file from this PR for now, and we could work on that later.

Author

Sorry, also, GitHub didn't auto-refresh, so I didn't see your comment before posting follow-ups. Thanks!

Contributor

> Or, I could remove the Rmd file from this PR for now, and we could work on that later.

That makes sense!

> would you want a table along these lines

Let me think a little more about how to represent the accuracy analysis results. They'll probably go in the methods section. Will tag you in the relevant issue in another repo when the time comes.

Author

Sure, that all sounds good to me!

'https://github.com/greenelab/scihub/raw/',
'4172526ac7433357b31790578ad6f59948b6db26/data/',
'state-of-oa-dois.tsv.xz')
'state-of-oa-dois.tsv.xz'
)

repository_root_directory <- '..' # This sets the Working Directory that knitr
# uses when knitting this document back to the top directory of this repository.
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(include = FALSE)
knitr::opts_chunk$set(results = "asis")
knitr::opts_chunk$set(cache = TRUE)
knitr::opts_knit$set(root.dir = repository_root_directory)
```

```{r read datasets}
```{r read and merge datasets}
lzma_compressed_library_access_tsv <- read.table(
gzfile(lzma_compressed_library_access_tsv_location),
sep = '\t',
@@ -46,11 +52,12 @@ original_dataset_with_oa_color_column <- read.table(
header = TRUE
)
# View(original_dataset_with_oa_color_column) # Check the dataset
```

```{r merge the datasets}
# Combine the datasets so that we have doi, full_text_indicator, and oadoi_color
merged_datasets <- merge(
# Merge the datasets ---------------------------------------------------------

# Combine the datasets so that we have doi, full_text_indicator,
# and oadoi_color
merged_datasets <- dplyr::inner_join(
original_dataset_with_oa_color_column,
lzma_compressed_library_access_tsv,
by = "doi"
@@ -81,13 +88,15 @@ frequency_and_proportion_table <- data.frame(
"no_access_percent" = proportion_table_by_oa_color[,1],
"yes_access_percent" = proportion_table_by_oa_color[,2],
"yes_access_rate" = frequency_table_by_oa_color[, 2],
"oa_color_total" = frequency_table_by_oa_color[, 1] + frequency_table_by_oa_color[, 2]
"oa_color_total" = frequency_table_by_oa_color[, 1] +
frequency_table_by_oa_color[, 2]
)
rownames(frequency_and_proportion_table) <- NULL
# View(frequency_and_proportion_table)
```

We queried `r nrow(merged_datasets)` DOIs of the `r nrow(original_dataset_with_oa_color_column)` listed in the original State of OA dataset. Queried DOIs included the following OA "colors:" `r paste(unique(merged_datasets$oadoi_color), collapse = ", ")`.
We queried `r nrow(merged_datasets)` DOIs of the `r nrow(original_dataset_with_oa_color_column)` listed in the original State of OA dataset.
Queried DOIs included the following OA "colors:" `r paste(unique(merged_datasets$oadoi_color), collapse = ", ")`.

The proportions of access, alongside the rate of access, are presented below:

@@ -0,0 +1,94 @@
# Settings ---------------------------------------------------------------------

manual_tsv_location <- file.path(
'evaluate_library_access_from_output_tsv',
'manual-doi-checks.tsv'
)

# Open the tsv -----------------------------------------------------------------

dataset_to_go_through <- readr::read_tsv(
manual_tsv_location,
na = ''
)
# View(dataset_to_go_through)

# Facilitate going through the rows that haven't been filled in ----------------

while (TRUE) {
user_location_input <- readline(paste0(
'Are you on the university campus network ',
'(y for on-campus, n for off-campus)? [y/n] '
))

if (
tolower(user_location_input) == 'y' ||
tolower(user_location_input) == 'n'
) {
if (tolower(user_location_input) == 'y') {
column_for_data_entry <- 'full_text_indicator_manual_inside_campus'
column_for_date <- 'date_of_manual_full_text_check_inside_campus'
} else {
column_for_data_entry <- 'full_text_indicator_manual_outside_campus'
column_for_date <- 'date_of_manual_full_text_check_outside_campus'
}

break # Break out of the loop, and move on.
} else {
message('Please enter y or n. Asking again...')
}
}

for (row_number in which(
is.na(dataset_to_go_through[, column_for_data_entry])
)) {
doi_for_row <- dataset_to_go_through[row_number, 'doi']

url_to_visit <- paste0(
'https://doi.org/',
doi_for_row
)

message('Opening URL "', url_to_visit, '"...')

utils::browseURL(url_to_visit)
Contributor

This curator application is a cool concept. I'd be worried that it'll be difficult to jump around between DOIs... but if it helps you, then use this app.

I'm not going to review it extensively because the actual output dataset is the important one; it's up to you as the curator to fill it in however you find best. So feel free to do this if it helps.


while (TRUE) {
user_full_text_input <- readline(
'Do we have full-text access to this DOI? [y/n/invalid]
("invalid" = invalid DOI)'
)

if (
tolower(user_full_text_input) == 'y' ||
tolower(user_full_text_input) == 'n' ||
tolower(user_full_text_input) == 'invalid'
) {
dataset_to_go_through[
row_number,
column_for_date
] <- as.character(Sys.Date())

if (tolower(user_full_text_input) == 'y') {
dataset_to_go_through[row_number, column_for_data_entry] <- 1
} else if (tolower(user_full_text_input) == 'n') {
dataset_to_go_through[row_number, column_for_data_entry] <- 0
} else {
dataset_to_go_through[row_number, column_for_data_entry] <- 'invalid'
}

break # Break out of the loop, and move on.
} else {
message('Please enter y, n, or invalid. Asking again...')
}
}

# Save the changes to the tsv:
write.table(
dataset_to_go_through,
file = manual_tsv_location,
sep = '\t',
na = '',
row.names = FALSE
)
}
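
The two prompt loops in the script above share one pattern: ask, validate, and ask again on bad input. As a sketch only, that pattern could be pulled into a single helper. `ask_until_valid` and its `read_answer` parameter are hypothetical names that are not part of this PR. Unlike the script, this version also trims surrounding whitespace before comparing.

```r
# Keep prompting until the (lower-cased, trimmed) answer is one of the
# allowed values. `read_answer` defaults to readline(), and can be swapped
# for a scripted function when running non-interactively.
ask_until_valid <- function(prompt, valid_answers, read_answer = readline) {
  repeat {
    answer <- tolower(trimws(read_answer(prompt)))
    if (answer %in% valid_answers) {
      return(answer)
    }
    message(
      'Please enter one of: ',
      paste(valid_answers, collapse = ', '),
      '. Asking again...'
    )
  }
}

# Usage with a scripted reader that first gives an invalid answer:
scripted_answers <- c('maybe', ' Y ')
answer_index <- 0
scripted_reader <- function(prompt) {
  answer_index <<- answer_index + 1
  scripted_answers[answer_index]
}

location_answer <- ask_until_valid(
  'Are you on the university campus network? [y/n] ',
  c('y', 'n'),
  read_answer = scripted_reader
)
```

In an interactive session the default `readline` would be used, so the call reduces to `ask_until_valid(prompt, c('y', 'n'))`.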