medrxivr: Accessing and searching medRxiv preprint data in R #380
Comments
Editor checks:
Editor comments
@mcguinlu, thanks again for your submission. The editor checks flagged a few issues that need your attention; see them below. Let's discuss the first two items (ml1 and ml2) before I search for reviewers; these two items refer to a potential overlap with existing packages.
The remaining items are important but not as urgent as the first two.
> spelling::spell_check_package()
WORD            FOUND IN
api             mx_api_content.Rd:39,42
                mx_api_doi.Rd:28,31
                description:4
AppVeyor        README.md:14
                README.Rmd:24
ation           building-complex-search-strategies.Rmd:110
biorxiv         description:4
capitalisation  building-complex-search-strategies.Rmd:125
... more lines
> goodpractice::gp()
... more lines
── GP medrxivr ─────────────────────────────────────────────────────────────────
It is good practice to
✖ write unit tests for all functions, and all package code in
general. 77% of code lines are covered by test cases.
R/mx_crosscheck.R:50:NA
R/mx_download.R:25:NA
R/mx_download.R:27:NA
R/mx_download.R:28:NA
R/mx_download.R:30:NA
... and 51 more lines
> covr::package_coverage()
medrxivr Coverage: 77.60%
R/mx_download.R: 1.92%
R/mx_crosscheck.R: 96.15%
R/mx_search.R: 96.33%
R/mx_api.R: 100.00%
R/mx_info.R: 100.00%
> styler::style_pkg()
Styling 12 files:
R/medrxivr.R ✓
R/mx_api.R ℹ
R/mx_crosscheck.R ℹ
R/mx_download.R ℹ
R/mx_info.R ℹ
R/mx_search.R ℹ
tests/testthat.R ✓
tests/testthat/test-api.R ℹ
tests/testthat/test-crosscheck.R ✓
tests/testthat/test-download.R ℹ
tests/testthat/test-info.R ✓
tests/testthat/test-search.R ℹ
────────────────────────────────────────
Status Count Legend
✓ 4 File unchanged.
ℹ 8 File changed.
x 0 Styling threw an error.
────────────────────────────────────────
Please review the changes carefully! |
Hi @maurolepore, thanks for your initial review of our package. I've gone through it and tried to address each point below:
ml1: Overlap with fulltext
ml2: Overlap with biorxivr
ml3: Spelling
ml4/ml5: Test coverage
ml6: styler
Hopefully this addresses your initial concerns, but please do let me know if anything is unclear, if my responses are insufficient, or if you need further details! |
Thanks @mcguinlu! I think {medrxivr} merits moving to the next stage. I'll now start searching for reviewers.
ml1 and ml2
Here is my conclusion. I base it on your answers above, and on this quote from rOpenSci's
I considered the packages {medrxivr}, {fulltext} and {biorxivr}. I see an Compared to the other packages, {medrxivr} searches locally. This ensures the
ml3 to ml6
@mcguinlu, please let me know or check the boxes as you address these issues.
ml7
@mcguinlu, I see the positive aspects of the "local" approach to searching that {medrxivr} implements; but I understand that {medrxivr} downloads the entire database. I worry this may not scale up. Here are some questions I have; you may discuss them directly with the reviewers:
Maybe you can avoid downloading the database and still provide flexible queries. For example, see how |
@mcguinlu, please do this (from these guidelines):
|
Package Review
Hi @maurolepore and @mcguinlu - here is my review. Thanks for this opportunity, and all the best for the package!
Documentation
The package includes all the following forms of documentation:
Functionality
Final approval (post-review)
Estimated hours spent reviewing: 10
Review Comments
medRxiv has been accepting preprints for a year now. Their API does not offer any search capabilities, so clearly Although the target group and the goal of the package are clearly defined, it took me some time to understand the core functionality. I suppose the main reason for this is the varying terminology of data sources used in vignettes and help pages. The way I understand the logic looks like this:
In short, for a search target there are two options: the dataset I download myself from medRxiv, or the dataset provided by the GitHub repo. The former can be either all items or just a subset limited by date. The latter is all items. Technically speaking, my download uses the medRxiv API, but the dataset in the repo is built by scraping the medRxiv web site on a daily basis. My understanding is that the main reasons for the scraped dataset are to provide a reliable data source for those occasions when the API does not serve well or not at all, and to lighten the load on the API.
How long does it take to download all metadata from the API? I tested it from two physical locations with a differing bandwidth:
So far this is not bad, especially if you run the function once a day. One minor thing: is there any way to gracefully stop the process if started by accident? When RStudio's red Stop button is hit, the following error is thrown
How rapidly can we expect medRxiv to grow? Looking back, the number of submissions accelerated when the still very much prevailing COVID-19 pandemic began.
Search is a key component of this package, and vignettes help in building search queries. The medrxivr one shows how to use the
The NOT argument does not match to Mild cognitive impairment which is found in one abstract, so perhaps better to use the form of In When I ran As of writing this, how long does it take to query the repo?
To me this is acceptable, but people of today tend to be impatient. Still, when the same search against my local copy of the medRxiv database takes only 0.5 secs, you begin to wonder which one to use. I noticed that the question of how to efficiently host and serve a dataset is something you and the editor have already discussed. Unfortunately, I cannot give any advice, but I am very much interested in learning about this topic too. I hope you will find a good solution. Downloading PDFs works smoothly and as promised. Note: the The Shiny application that comes with the package is a beautiful piece of work, and the idea of delivering reproducible code is a nice one indeed. However, there are some issues with the code. Both the basic and advanced search codes throw an error when run in R. Basic:
Advanced:
@maurolepore pointed out to me that the package also includes a short manuscript to be submitted to the Journal of Open Source Software. I found the manuscript in the |
Hi @tts, Just a short note to say thanks so much for your review. I've given it a quick skim, and it seems that everything you propose will be straightforward to implement. I'll go through your comments systematically soon, and post a response/list of changes. (@maurolepore, a process question - is it better for me to wait until the second reviewer has filed their review before beginning to make changes?) Thanks in particular for spotting the discrepancies across the package (old function names in the examples, missing definitions for arguments, problems with the code from the app). You are correct that there is some hangover from an earlier version of the package/early versions of the package functions - I thought I had caught them all, but obviously not! When I started developing One specific thing I wanted to follow up on was that the "Automated testing" item in the reviewer checklist is not marked as complete - did you have any specific issues with/recommendations for this area of the package? |
@tts, thanks for your wonderful review! @mcguinlu, RE
Both reviewers should work on the exact same package. You may change the package in a separate branch, but please only merge it after both reviewers submitted their review. |
@mcguinlu Sorry, my bad. Both |
@njahn82, I hope you are well. Could you please update us about your review? |
Sorry, I didn't meet my review deadline. Will submit it by Wednesday.
Thanks for your patience!
|
Package Review
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
Documentation
The package includes all the following forms of documentation:
Functionality
Final approval (post-review)
Estimated hours spent reviewing: 5 hours
Review Comments
This is a very timely package that reflects not just the increasing popularity of open access preprints in Health Sciences, but also the issues around finding and searching them. Although a growing suite of scholarly search engines makes medRxiv preprints available, there seems to be no standard way to retrieve data from medRxiv thoroughly and systematically. Finding full-texts is also challenging, because medRxiv preprints are not made available via PubMed Central. Similarly, metadata from Crossref, medRxiv's DOI registration agency, lacks links to PDF full-texts.
Before I share my code review, I want to disclose that I neither have an academic background in Health Sciences nor have I been involved in systematic reviews as a librarian. I will therefore focus on more formal aspects of the package and its design.
Overall Design
The package contains functions to retrieve metadata from medRxiv, apply complex search strategies on a metadata snapshot, and download PDF full-texts. However, the source code repository contains a considerable amount of other functionality as well, which is outside of the
There's also a link to (daily updated) data in an external GitHub repo, https://github.com/mcguinlu/medrxivr-data/, which is used in an exported R function. My main concern with this approach is that dependencies, which are not part of the package, are loaded, and in one case installed. The code outside of the In the following, I will focus on the functionality, which is part of the package build.
README
Documentation
Vignette
Functionality
Here's the check using {polite}:
polite::bow("https://www.medrxiv.org/archive", force = TRUE)
#> <polite session> https://www.medrxiv.org/archive
#> User-agent: polite R package - https://github.com/dmi3kno/polite
#> robots.txt: 68 rules are defined for 1 bots
#> Crawl delay: 7 sec
#> The path is scrapable for this user-agent
Created on 2020-07-08 by the reprex package (v0.3.0)
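The crawl delay reported above is the part of the robots.txt result that any scraping loop would have to honour. As a minimal sketch only (the helper `fetch_with_delay()` and the stand-in `fake_fetch()` are hypothetical names, not part of {medrxivr} or {polite}), a pause between consecutive requests could look like this in base R:

```r
# Hypothetical helper: call `fetch` on each URL, pausing `delay` seconds
# between consecutive requests (no pause before the first one).
fetch_with_delay <- function(urls, fetch, delay = 7) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(delay)
    results[[i]] <- fetch(urls[[i]])
  }
  names(results) <- urls
  results
}

# Stand-in fetcher so the sketch runs offline; a real one would issue an
# HTTP request. The "page-1"/"page-2" strings are placeholders, not real URLs.
fake_fetch <- function(url) paste("fetched", url)

# delay = 0 keeps the demonstration instant; the output above suggests 7.
pages <- fetch_with_delay(c("page-1", "page-2"), fetch = fake_fetch, delay = 0)
```

Passing the fetcher in as an argument keeps the waiting logic separate from the request logic, so the delay can be checked without touching the network.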
Here's a reprex using the vignette example, which took less than 2 seconds.
library(tidyverse)
library(europepmc)
ep_q <-
c('PUBLISHER:"medRxiv" AND (mendelian* AND (randomisation OR randomization))')
epmc_l <- europepmc::epmc_search(ep_q, "raw", limit = 10000)
#> 91 records found, returning 91
my_df <-
purrr::map_dfr(epmc_l, `[`, c("doi", "title", "abstractText"))
my_df %>%
filter_at(vars(abstractText, title), any_vars(
grepl(
"[Mm]endelian(\\s)([[:graph:]]+\\s){0,4}randomi([[:alpha:]])ation",
.
)))
#> # A tibble: 81 x 3
#> doi title abstractText
#> <chr> <chr> <chr>
#> 1 10.1101/2020… Cardiometabolic traits, seps… Objectives: To investigate wheth…
#> 2 10.1101/2020… The relationship between gly… Aims: To investigate the relatio…
#> 3 10.1101/2020… Modifiable lifestyle factors… Aims: Assessing whether modifiab…
#> 4 10.1101/2020… Influence of blood pressure … Objectives: To determine whether…
#> 5 10.1101/2020… Increased adiposity is prote… Background Breast and prostate c…
#> 6 10.1101/2020… Examining the association be… Background: We examined associat…
#> 7 10.1101/2020… Investigating the potential … Aim: Use Mendelian randomisation…
#> 8 10.1101/2020… Unhealthy Behaviours and Par… Objective: Tobacco smoking, alco…
#> 9 10.1101/2020… Exploring the causal effect … BACKGROUND: Hearing loss has bee…
#> 10 10.1101/2020… Genetically informed precisi… Impaired lung function is associ…
#> # … with 71 more rows
Created on 2020-07-08 by the reprex package (v0.3.0)
(Disclaimer: I maintain the {europepmc} package and I am curious to learn more about potential shortcomings of using Europe PMC instead of a primary literature source. Because I also find it sometimes not very helpful when reviewers point to their own work, I do not expect you to consider this :-))
Testing
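The regular expression used in the Europe PMC reprex above is easier to follow on a few toy strings. The sketch below applies the same pattern with base R's `grepl()`; the example abstracts are invented for illustration and are not medRxiv records.

```r
# Same pattern as in the reprex: "Mendelian"/"mendelian", then at most four
# intervening words, then "randomisation" or "randomization".
pattern <- "[Mm]endelian(\\s)([[:graph:]]+\\s){0,4}randomi([[:alpha:]])ation"

# Invented example abstracts (assumptions for illustration only).
abstracts <- c(
  "We used Mendelian randomisation to assess causality.",
  "A mendelian and multivariable randomization analysis.",
  "Mendelian inheritance was studied in families without any randomization.",
  "MENDELIAN RANDOMIZATION in capitals."
)

matches <- grepl(pattern, abstracts)
matches
#> [1]  TRUE  TRUE FALSE FALSE
```

The third string fails because seven words separate the two terms, and the fourth fails because only the first letter is matched case-insensitively, which is the kind of capitalisation pitfall a search strategy needs to account for.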
I think that's it from me! Thank you for making Health Science preprints more accessible and better discoverable! Happy to help further with the process! |
Hi @mcguinlu. Sorry, while still playing with your app, I just realised that I was wrong and nothing is installed from the Shiny app. Please ignore this bit from the review. |
@njahn82 Thanks a million for your detailed review! At a quick skim, everything you flag/recommend is fixable/implementable, and will definitely help to improve the functionality. I'm also looking forward to examining @maurolepore Just confirming that I have seen this, and so am aiming to address the comments by 23rd July (at the latest). |
Hi all (esp @maurolepore) A brief message to let you know that I have made most of the requested changes, but due to external circumstances, I haven't yet finished off the small number of outstanding items. I'm now aiming to have it ready for re-review by Thursday week (6th August) at the very latest. Very sorry for the delay, and hope this is okay! |
That's okay. Thanks for letting me know. |
Thanks for the further feedback both (and Happy Friday)! Please find my responses to your comments below: Editor (@maurolepore)
Reviewer 1 (@tts)
Glad to hear things are a bit clearer now! The reason Finally, just wanted to confirm that your details in the DESCRIPTION are correct? |
@mcguinlu Yes, my details in DESCRIPTION are correct. |
Great - thanks for letting me know! |
Great job @mcguinlu, and thank you for the careful and thorough consideration of my review. I feel it is clearer now what the package does and how it relates to the Shiny app and the backup/dump mechanism. Thank you also for cross-checking with Europe PMC and demonstrating the added value of the medrxivr package. Although all my suggestions have been addressed, I have some final suggestions
|
Thanks @njahn82. Just to note as well that I recently moved the snapshot functionality from relying on my local Task Scheduler to working from GitHub Actions, so it should now be a lot more robust (in the past, if my local PC experienced network issues, the snapshot would not be taken). In response to your comments:
I wonder if the returned data frames from the mx_api_*() family could be also represented as tibbles?
The package does a good job in parsing and cleaning preprint metadata. Unfortunately, I cannot find documentation or an example showcasing what is actually returned. Can you provide one reproducible example in the README and/or extend the documentation in the function docs?
In the function docs of mx_export(), it says Dataframe returned by mx_search(), but I realised that also data obtained from the mx_api_ family can be exported as bib file using mx_export().
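The point that exporting depends on the shape of the data frame rather than on which function produced it can be illustrated with a small sketch. The helper `to_bibtex()` and the column names `doi`, `title`, `authors` and `date` are assumptions made for this example; this is not how `mx_export()` is implemented.

```r
# Hypothetical helper: turn any data frame that has the required columns
# into BibTeX entries, one character string per row.
to_bibtex <- function(df) {
  required <- c("doi", "title", "authors", "date")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  sprintf(
    "@article{%s,\n  title = {%s},\n  author = {%s},\n  year = {%s},\n  doi = {%s}\n}",
    gsub("[^[:alnum:]]", "_", df$doi),   # citation key derived from the DOI
    df$title,
    df$authors,
    substr(as.character(df$date), 1, 4), # year taken from a YYYY-MM-DD date
    df$doi
  )
}

# Invented record for illustration; the DOI is a placeholder.
toy <- data.frame(
  doi = "10.1101/2020.01.01.00000000",
  title = "An example preprint",
  authors = "Doe, J. and Roe, R.",
  date = "2020-01-01",
  stringsAsFactors = FALSE
)

bib <- to_bibtex(toy)
cat(bib, sep = "\n")
```

Because the helper only checks for column names, it accepts output from any upstream function that returns those columns, which is the behaviour the documentation comment above describes.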
@maurolepore, I have checked that these changes don't throw any new errors and that Hoping we are nearly there! |
Thank you again @mcguinlu for your careful consideration of my review! All my suggestions have been addressed. |
Approved! Thanks @mcguinlu for submitting and @tts and @njahn82 for your reviews! 😄 To-dos:
From #380 (comment) I see you wish to automatically submit to the Journal of Open Source Software? If so:
Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them
Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @stefaniebutland in your reply. She will get in touch about timing and can answer any questions.
We've put together an online book with our best practice and tips; this chapter starts the third section, which covers guidance for after onboarding. Please tell us what could be improved; the corresponding repo is here. |
Hello @mcguinlu! I've just invited you to the @ropensci/medrxivr team! You should now be allowed to transfer the repo. Once you do, just ping me here and I'll transfer full admin rights back to you 🙂👍 |
Hi @annakrystalli have transferred across now. @maurolepore thanks for the checklist - I will work through it over the coming day. And finally, just flagging to @stefaniebutland that I would be interested in producing a blog post for this package! Thanks again to @tts and @njahn82 for reviewing, and @maurolepore for herding us all through the process! |
Thanks @mcguinlu ! Full admin rights now returned 👍 |
Has this review been completed? (I'm asking as the editor of the corresponding JOSS submission) |
Okay, I've completed all the steps now @maurolepore! Re: the JOSS review, please see @danielskatz's comment above. The one thing I wasn't clear on was how to replace the old |
@danielskatz, thanks for checking. Yes, as the guest editor of this submission, I confirm this review has been completed. |
RE:
I'm sorry this isn't clear for you or me. But as you say, the working website seems correct. I see no reason to worry. Here are a few more comments from section 8.1.4 of https://devguide.ropensci.org/:
Please check these boxes to confirm you've done the following last steps:
-- Ping me when this is done and I'll then close this issue. Thanks! |
@mcguinlu, I see you already mentioned Stefanie Butland above. To comply with https://devguide.ropensci.org/editorguide.html#after-review, I also mention @ropensci/blog-editors for follow-up about your willingness to write a blog post or tech note. Finally, please see https://devguide.ropensci.org/editorguide.html#package-promotion |
So in response to the last few bits:
CodeMeta file added (see here)
All CI badges updated to point to the ropensci endpoints (e.g see here)
Done, and have triggered a build under the new set-up to ensure everything works, which was successful.
Not applicable to me.
Done! Thanks also for the additional materials re: CRAN submission (I do intend to submit to CRAN in the near future) and promotion, and for looping in @ropensci/blog-editors. And I think that's us! |
@mcguinlu , thanks and congratulations! To the best of my knowledge, this completes the review process so I'll close now. -- You may already know this. To prepare packages for CRAN, Are you in rOpenSci's Slack workspace? If not, I recommend you find someone who can add you. I have found friendly advice there that I wouldn't find anywhere else. |
Hello @mcguinlu. We'd love to have a post about medrxivr. Our Blog Guide has most of the information you should need, with both content and technical advice. For readers, it would be helpful to highlight how this package relates to similar ones and the specific niche that medrxivr fills. Once that's clear early in the post, your readers will give their attention. Let me know when you'd like to submit a draft and I can suggest a publication date. |
@mcguinlu Also let me know if you'd like a new invitation to rOpenSci Slack. We could move this discussion there for example. |
@stefaniebutland a new invite would be great! I thought I had activated the first one correctly but apparently not (I am still getting to grips with Slack) 🤦♂️ and happy to continue chatting about this there. |
Submitting Author: Luke McGuinness (@mcguinlu)
Repository: https://github.com/mcguinlu/medrxivr
Version submitted: 0.0.2
Editor: @maurolepore
Reviewer 1: @tts
Reviewer 2: @njahn82
Archive: TBD
Version accepted: TBD
Scope
Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
medrxivr allows users to programmatically access data from medRxiv, a preprint repository for papers in medical, clinical, and related health sciences. The package also allows users to readily perform and document reproducible literature searches of the medRxiv database.
Who is the target audience and what are scientific applications of this package?
The primary target of this package is systematic reviewers (i.e. me!), who frequently wish both to use more complicated queries (e.g. regular expressions/Boolean combinations) when searching medRxiv than the official site currently allows for, and who also wish to be easily able to download the full text PDFs of records matching their search. medrxivr helps with both of these challenges. However, anyone who wishes to find and retrieve relevant medRxiv records in R, for example to explore the distribution of preprints by subject area, will find the package useful.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
As far as I am aware, no other package allows users to access medRxiv data in R.
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
Issue: Presubmission inquiry: medrxivr #369
Editor: @annakrystalli
Technical checks
Confirm each of the following by checking the box.
This package:
Publication options
JOSS Options
paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
MEE Options
Code of conduct
Tagging my co-author @L-ENA for reference.