Getting and manually checking sample of doi results #17
Changes from 20 commits
3c352c8
be74b2f
633750d
1b51a12
c085af0
edbf6ad
c49ddaa
bfda19c
04a2976
354d13a
d2b4681
e22a83c
6af5b03
0d9724a
1d5fc14
8e6e9f3
6a91eee
aac5bf5
f96571d
8ec9792
ccf4a14
9e7a83a
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
# Load dependencies ------------------------------------------------------------ | ||
|
||
# Load magrittr pipe | ||
`%>%` = dplyr::`%>%` | ||
|
||
# Settings --------------------------------------------------------------------- | ||
|
||
lzma_compressed_library_access_data_location <- file.path( | ||
'data', 'library_coverage_xml_and_fulltext_indicators.tsv.xz' | ||
) | ||
|
||
sample_size_per_cell <- 100 # This will be for each cell, multiplied by | ||
# 2 full_text_indicator status | ||
|
||
output_tsv_location <- file.path( | ||
'evaluate_library_access_from_output_tsv', | ||
'manual-doi-checks.tsv' | ||
) | ||
|
||
randomizer_seed_to_set <- 3 # Ensure that random sampling will always return | ||
# the same result. | ||
|
||
# Read the dataset ------------------------------------------------------------- | ||
|
||
library_access_data <- readr::read_tsv( | ||
gzfile(lzma_compressed_library_access_data_location), | ||
) | ||
# View(library_access_data) # Check the dataset | ||
|
||
# Convert variable to factor: | ||
library_access_data <- library_access_data %>% dplyr::mutate( | ||
full_text_indicator = as.factor(full_text_indicator) | ||
) | ||
|
||
# Create stratified sample, and clean up the tibble ---------------------------- | ||
|
||
set.seed(randomizer_seed_to_set) | ||
stratefied_sample <- library_access_data %>% | ||
dplyr::group_by(full_text_indicator) %>% | ||
dplyr::sample_n(sample_size_per_cell) %>% | ||
# Add columns to fill in manually to the stratified sample dataframe: | ||
dplyr::rename('full_text_indicator_automated' = 'full_text_indicator') %>% | ||
dplyr::mutate( | ||
date_of_manual_full_text_check_inside_campus = NA, | ||
full_text_indicator_manual_inside_campus = NA, | ||
date_of_manual_full_text_check_outside_campus = NA, | ||
full_text_indicator_manual_outside_campus = NA | ||
) | ||
|
||
# Write the output to a TSV ---------------------------------------------------- | ||
|
||
readr::write_tsv( | ||
stratefied_sample, | ||
output_tsv_location, | ||
na = '' | ||
) |
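The sampling step in the script above can be tried without the data file. The following is a minimal base-R sketch of the same idea: a fixed seed, then an equal number of rows drawn from each `full_text_indicator` group. It deliberately swaps `dplyr::sample_n` for `split()` and `sample.int()`, and the function name `draw_stratified_sample`, the toy DOIs, and the group sizes are all made up for illustration.

```r
# Hypothetical helper (not part of the PR): draw the same number of rows
# from every level of a stratum column, reproducibly.
draw_stratified_sample <- function(data, stratum, size_per_stratum, seed) {
  set.seed(seed)
  # Row indices grouped by the value of the stratum column
  rows_by_stratum <- split(seq_len(nrow(data)), data[[stratum]])
  sampled_rows <- unlist(lapply(rows_by_stratum, function(rows) {
    # Index into `rows` rather than calling sample(rows), which would
    # misbehave for a group containing a single row number
    rows[sample.int(length(rows), size_per_stratum)]
  }), use.names = FALSE)
  data[sampled_rows, , drop = FALSE]
}

# Toy data: 12 DOIs without access (0) and 8 with access (1)
toy_data <- data.frame(
  doi = paste0('10.0000/example.', 1:20),
  full_text_indicator = factor(rep(c(0, 1), times = c(12, 8))),
  stringsAsFactors = FALSE
)

toy_sample <- draw_stratified_sample(
  toy_data,
  stratum = 'full_text_indicator',
  size_per_stratum = 5,
  seed = 3
)
table(toy_sample$full_text_indicator)
```

Calling the helper twice with the same seed should return the same rows, which is the property the `randomizer_seed_to_set` setting is meant to guarantee.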
Original file line number | Diff line number | Diff line change | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -5,24 +5,30 @@ date: "2017" | |||||||||||||
output: pdf_document | ||||||||||||||
--- | ||||||||||||||
|
||||||||||||||
```{r setup, include=FALSE} | ||||||||||||||
knitr::opts_chunk$set(echo = TRUE) | ||||||||||||||
knitr::opts_chunk$set(include = FALSE) | ||||||||||||||
knitr::opts_chunk$set(results = "asis") | ||||||||||||||
knitr::opts_chunk$set(cache = TRUE) | ||||||||||||||
``` | ||||||||||||||
|
||||||||||||||
```{r settings} | ||||||||||||||
lzma_compressed_library_access_tsv_location <- "data/library_coverage_xml_and_fulltext_indicators.tsv.xz" | ||||||||||||||
```{r settings, include = FALSE} | ||||||||||||||
lzma_compressed_library_access_tsv_location <- file.path( | ||||||||||||||
'data', 'library_coverage_xml_and_fulltext_indicators.tsv.xz' | ||||||||||||||
) | ||||||||||||||
|
||||||||||||||
original_dataset_with_oa_color_column_location <- paste0( | ||||||||||||||
This can go?

Ah, yes, I forgot to change that. I can / forgot just now to add on a second table to the Rmd that doesn't stratify by So, I do need that for my own work; but I can take it out of this repo, if you prefer. Would you accept me adding a second table that doesn't stratify, and keeping this one in?

Sure, this Rmd file can be used for these exploratory analyses.

The When we last talked about it (I think that was the last time we discussed it), you mentioned that you'd like to incorporate the rates we get into "as bars in Figure 8B." Is that still the case? If so, would you want a table along these lines, or something different?

Or, I could remove the Rmd file from this PR for now, and we could work on that later.

Sorry, also, GitHub didn't auto-refresh, so I didn't see your comment before posting follow-ups. Thanks!

Let me think a little more about how to represent the accuracy analysis results. They'll probably go in the methods section. Will tag you in the relevant issue in another repo when the time comes.

Sure, that all sounds good to me! |
||||||||||||||
'https://github.com/greenelab/scihub/raw/', | ||||||||||||||
'4172526ac7433357b31790578ad6f59948b6db26/data/', | ||||||||||||||
'state-of-oa-dois.tsv.xz') | ||||||||||||||
'state-of-oa-dois.tsv.xz' | ||||||||||||||
) | ||||||||||||||
|
||||||||||||||
repository_root_directory <- '..' # This sets the Working Directory that knitr | ||||||||||||||
# uses when knitting this document back to the top directory of this repository. | ||||||||||||||
``` | ||||||||||||||
|
||||||||||||||
```{r setup, include=FALSE} | ||||||||||||||
knitr::opts_chunk$set(echo = FALSE) | ||||||||||||||
knitr::opts_chunk$set(include = FALSE) | ||||||||||||||
knitr::opts_chunk$set(results = "asis") | ||||||||||||||
knitr::opts_chunk$set(cache = TRUE) | ||||||||||||||
knitr::opts_knit$set(root.dir = repository_root_directory) | ||||||||||||||
``` | ||||||||||||||
|
||||||||||||||
```{r read datasets} | ||||||||||||||
```{r read and merge datasets} | ||||||||||||||
lzma_compressed_library_access_tsv <- read.table( | ||||||||||||||
gzfile(lzma_compressed_library_access_tsv_location), | ||||||||||||||
sep = '\t', | ||||||||||||||
|
@@ -46,11 +52,12 @@ original_dataset_with_oa_color_column <- read.table( | |||||||||||||
header = TRUE | ||||||||||||||
) | ||||||||||||||
# View(original_dataset_with_oa_color_column) # Check the dataset | ||||||||||||||
``` | ||||||||||||||
|
||||||||||||||
```{r merge the datasets} | ||||||||||||||
# Combine the datasets so that we have doi, full_text_indicator, and oadoi_color | ||||||||||||||
merged_datasets <- merge( | ||||||||||||||
# Merge the datasets --------------------------------------------------------- | ||||||||||||||
|
||||||||||||||
# Combine the datasets so that we have doi, full_text_indicator, | ||||||||||||||
# and oadoi_color | ||||||||||||||
merged_datasets <- dplyr::inner_join( | ||||||||||||||
original_dataset_with_oa_color_column, | ||||||||||||||
lzma_compressed_library_access_tsv, | ||||||||||||||
by = "doi" | ||||||||||||||
|
@@ -81,13 +88,15 @@ frequency_and_proportion_table <- data.frame( | |||||||||||||
"no_access_percent" = proportion_table_by_oa_color[,1], | ||||||||||||||
"yes_access_percent" = proportion_table_by_oa_color[,2], | ||||||||||||||
"yes_access_rate" = frequency_table_by_oa_color[, 2], | ||||||||||||||
"oa_color_total" = frequency_table_by_oa_color[, 1] + frequency_table_by_oa_color[, 2] | ||||||||||||||
"oa_color_total" = frequency_table_by_oa_color[, 1] + | ||||||||||||||
frequency_table_by_oa_color[, 2] | ||||||||||||||
) | ||||||||||||||
rownames(frequency_and_proportion_table) <- NULL | ||||||||||||||
# View(frequency_and_proportion_table) | ||||||||||||||
``` | ||||||||||||||
|
||||||||||||||
We queried `r nrow(merged_datasets)` DOIs of the `r nrow(original_dataset_with_oa_color_column)` listed in the original State of OA dataset. Queried DOIs included the following OA "colors:" `r paste(unique(merged_datasets$oadoi_color), collapse = ", ")`. | ||||||||||||||
We queried `r nrow(merged_datasets)` DOIs of the `r nrow(original_dataset_with_oa_color_column)` listed in the original State of OA dataset. | ||||||||||||||
Queried DOIs included the following OA "colors:" `r paste(unique(merged_datasets$oadoi_color), collapse = ", ")`. | ||||||||||||||
|
||||||||||||||
The proportions of access, alongside the rate of access, are presented below: | ||||||||||||||
|
||||||||||||||
|
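The hunk above shows only the tail end of how `frequency_and_proportion_table` is assembled; the lines that build `frequency_table_by_oa_color` and `proportion_table_by_oa_color` are not in the diff. As a rough illustration of the general pattern, and assuming those objects come from `table()` and `prop.table()` (which the diff does not confirm), a toy version might look like this. The data values and the column names in `toy_summary` are invented.

```r
# Toy stand-in for merged_datasets: one row per DOI
toy_merged <- data.frame(
  oadoi_color = c('closed', 'closed', 'closed', 'bronze', 'bronze', 'gold'),
  full_text_indicator = c(0, 1, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Rows are OA colors; columns are the full_text_indicator values 0 and 1
frequency_table <- table(
  toy_merged$oadoi_color,
  toy_merged$full_text_indicator
)

# margin = 1 normalises within each row, so every OA color sums to 1
proportion_table <- prop.table(frequency_table, margin = 1)

toy_summary <- data.frame(
  oadoi_color = rownames(frequency_table),
  no_access_proportion = proportion_table[, 1],
  yes_access_proportion = proportion_table[, 2],
  yes_access_count = frequency_table[, 2],
  oa_color_total = frequency_table[, 1] + frequency_table[, 2],
  stringsAsFactors = FALSE
)
rownames(toy_summary) <- NULL
toy_summary
```

Indexing columns by position, as the PR does, relies on both indicator values being present in the data; if one were missing, the table would have a single column and `[, 2]` would fail.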
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,94 @@ | ||
# Settings --------------------------------------------------------------------- | ||
|
||
manual_tsv_location <- file.path( | ||
'evaluate_library_access_from_output_tsv', | ||
'manual-doi-checks.tsv' | ||
) | ||
|
||
# Open the tsv ----------------------------------------------------------------- | ||
|
||
dataset_to_go_through <- readr::read_tsv( | ||
manual_tsv_location, | ||
na = '' | ||
) | ||
# View(dataset_to_go_through) | ||
|
||
# Facilitate going through the rows that haven't been filled in ---------------- | ||
|
||
while (TRUE) { | ||
user_location_input <- readline(paste0( | ||
'Are you on the university campus network ', | ||
'(y for on-campus, n for off-campus)? [y/n] ' | ||
)) | ||
|
||
if ( | ||
tolower(user_location_input) == 'y' || | ||
tolower(user_location_input) == 'n' | ||
) { | ||
if (tolower(user_location_input) == 'y') { | ||
column_for_data_entry <- 'full_text_indicator_manual_inside_campus' | ||
column_for_date <- 'date_of_manual_full_text_check_inside_campus' | ||
} else { | ||
column_for_data_entry <- 'full_text_indicator_manual_outside_campus' | ||
column_for_date <- 'date_of_manual_full_text_check_outside_campus' | ||
} | ||
|
||
break # Break out of the loop, and move on. | ||
} else { | ||
message('Please enter y or n. Asking again...') | ||
} | ||
} | ||
|
||
for (row_number in which( | ||
is.na(dataset_to_go_through[, column_for_data_entry]) | ||
)) { | ||
doi_for_row <- dataset_to_go_through[row_number, 'doi'] | ||
|
||
url_to_visit <- paste0( | ||
'https://doi.org/', | ||
doi_for_row | ||
) | ||
|
||
message('Opening URL "', url_to_visit, '"...') | ||
|
||
utils::browseURL(url_to_visit) | ||
This curator application is a cool concept. I'd be worried that it'll be difficult to jump around between DOIs... but if it helps you, then use this app. I'm not going to review it extensively because the actual output dataset is the important one; it's up to you as the curator to fill it in however you find best. So feel free to do this if it helps. |
||
|
||
while (TRUE) { | ||
user_full_text_input <- readline( | ||
'Do we have full-text access to this DOI? [y/n/invalid] | ||
("invalid" = invalid DOI)' | ||
) | ||
|
||
if ( | ||
tolower(user_full_text_input) == 'y' || | ||
tolower(user_full_text_input) == 'n' || | ||
tolower(user_full_text_input) == 'invalid' | ||
) { | ||
dataset_to_go_through[ | ||
row_number, | ||
column_for_date | ||
] <- as.character(Sys.Date()) | ||
|
||
if (tolower(user_full_text_input) == 'y') { | ||
dataset_to_go_through[row_number, column_for_data_entry] <- 1 | ||
} else if (tolower(user_full_text_input) == 'n') { | ||
dataset_to_go_through[row_number, column_for_data_entry] <- 0 | ||
} else { | ||
dataset_to_go_through[row_number, column_for_data_entry] <- 'invalid' | ||
} | ||
|
||
break # Break out of the loop, and move on. | ||
} else { | ||
message('Please enter y, n, or invalid. Asking again...') | ||
} | ||
} | ||
|
||
# Save the changes to the tsv: | ||
write.table( | ||
dataset_to_go_through, | ||
file = manual_tsv_location, | ||
sep = '\t', | ||
na = '', | ||
row.names = FALSE | ||
) | ||
} |
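The answer handling in the loop above could also be pulled out into a small pure function, which makes it checkable without an interactive session. This is only a sketch: `normalize_full_text_answer` and `answer_codes` are hypothetical names, trimming surrounding whitespace is an addition the PR does not make, and storing every value as a character string (rather than mixing `1`, `0`, and `'invalid'`) is an assumption chosen to keep the column a single type.

```r
# Hypothetical lookup (not part of the PR): accepted answers and the value
# each one would store in the TSV
answer_codes <- c(y = '1', n = '0', invalid = 'invalid')

# Returns the stored value for a recognised answer, or NA_character_ so the
# caller knows to ask again
normalize_full_text_answer <- function(raw_answer) {
  answer <- tolower(trimws(raw_answer))
  if (answer %in% names(answer_codes)) {
    answer_codes[[answer]]
  } else {
    NA_character_
  }
}

normalize_full_text_answer('Y')
normalize_full_text_answer('maybe')
```

The `while (TRUE)` loop would then reduce to calling `readline()`, passing the result through the helper, and breaking once the return value is not `NA`.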
`gzfile` not needed here. `readr` will detect that path ends in `.xz`. I'm actually surprised `gzfile` works, given that wouldn't it be `xzfile`?

Removed in ccf4a14.
Re: `gzfile`, I didn't actually consider `xzfile` (I didn't know about it, until you mentioned it just now), as `gzfile` was the first thing I found, and it worked. From its manual,
So, the function's name is possibly confusingly narrow. I wonder whether the `gzfile` R function was developed earlier than `xzfile`?
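The behaviour discussed here is easy to check on a throwaway file. The sketch below writes a tiny xz-compressed TSV and reads it back through both `gzfile()` and `xzfile()`. It relies on the understanding, taken from R's connections help page, that `gzfile()` opened for reading also accepts files compressed by bzip2, xz, or lzma. The toy table and the temporary path are made up.

```r
toy_table <- data.frame(
  doi = c('10.0000/example.1', '10.0000/example.2'),
  full_text_indicator = c(1L, 0L),
  stringsAsFactors = FALSE
)
xz_path <- tempfile(fileext = '.tsv.xz')

# write.table() opens the unopened connection for writing and closes it
# again on exit, so the file is complete when the call returns
write.table(
  toy_table,
  file = xzfile(xz_path),
  sep = '\t',
  row.names = FALSE,
  quote = FALSE
)

# Read the same xz file through both connection types
read_via_gzfile <- read.delim(gzfile(xz_path), stringsAsFactors = FALSE)
read_via_xzfile <- read.delim(xzfile(xz_path), stringsAsFactors = FALSE)

identical(read_via_gzfile, read_via_xzfile)
```

With `readr`, passing the path itself should be enough, since its documentation says files ending in `.gz`, `.bz2`, `.xz`, or `.zip` are decompressed automatically; that is the simplification the reviewer suggests.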