README.Rmd

---
output:
  md_document:
    variant: markdown_github
---


<!-- README.md is generated from README.Rmd. Please edit that file -->

# Discovering good data packages


[![DOI: 10.5281/zenodo.47223](https://zenodo.org/badge/doi/10.5281/zenodo.1095831.svg)](https://doi.org/10.5281/zenodo.1095831)

### Project participants:

- Andy Teucher [\@ateucher](https://github.com/ateucher)
- Richie Cotton [\@richierocks](https://github.com/richierocks)
- Claudia Vitolo [\@cvitolo](https://github.com/cvitolo)
- Jakub Nowosad [\@Nowosad](https://github.com/Nowosad)
- Joe Stachelek [\@jsta](https://github.com/jsta)

Most of us are involved in teaching R in some way, and it is always a struggle to find suitable datasets with which to teach, especially across domain expertise. There are many packages that have data, but finding them and knowing what is in them is a struggle due to inadequate documentation.
 
## Goals:

1. Make it easy to discover suitable data
2. Write some guidance on documenting data in packages
 
## Deliverables:

1. Google Doc which describes best practices for documentation. 

  Checklist of things to document.
  
  Make sure your documentation answers as many of these questions as possible.
   
  - What does the data represent?
  - What format is the data in?
  - How big is the dataset?
  - Where does the come from?
  - How has the data been processed?
  - What does the data look like?
  - How do you analyze the data?
  - Where is this data used?
  - Is there a paper, or other external resource discussing this dataset?

2. A patch for `usethis::use_readme_rmd()` to display datasets in package README files.

3. A [flexdashboard](https://ropenscilabs.github.io/data-packages) with a searchable table that shows metadata on datasets from many CRAN packages. It has information for over 4000 datasets.

![](inst/screenshots/pkg.png)

![](inst/screenshots/datasets.png)

![](inst/screenshots/rdfiles.png)

## The state of data on CRAN

 * [List CRAN packages with data](R/get-pkgs-with-data-dir.R)

 * [Parsing DESCRIPTION files](R/parse_description.R)
 
 * [Parsing Rd files](R/get_metacran.R)
 
 * *Installing and loading packages*

## What makes a good data package?

https://docs.google.com/document/d/1xhJmt0v4p49jpwINNak9N7AMMb5yohTwwNOXH8WzqqQ/edit?usp=sharing

## Twitter Bot

https://twitter.com/rstatsdata

![](inst/screenshots/tweetbot.png)

## Graphs

```{r, echo=FALSE, message=FALSE, warning=FALSE, fig.width=8}
library(forcats)
library(tidyverse)
theme_set(theme_bw())

data_cran <- read_rds("data/description_metadata.rds") %>% 
  mutate(is_data_package = factor(is_data_package, levels=c("FALSE", "TRUE"), labels=c("Non-data packages", "Data packages")))

ggplot(data_cran, aes(is_data_package)) + 
  geom_bar() +
  labs(x=NULL)

ggplot(data_cran, aes(LazyData)) +
  geom_bar() +
  facet_wrap(~is_data_package)  + 
  theme(axis.text.x=element_text(angle = 90, hjust = 1))

ggplot(data_cran, aes(BugReports_is_not_na)) +
  geom_bar() +
  facet_wrap(~is_data_package) +
  labs(x = "Is there an BugReports in the package description?")

ggplot(data_cran, aes(URL_is_not_na)) +
  geom_bar() +
  facet_wrap(~is_data_package) +
  labs(x = "Is there an URL in the package description?")
```

```{r, echo=FALSE, fig.height=8}
# ggplot(data_cran, aes(License)) +
#   geom_bar() +
#   coord_flip() +
#   facet_wrap(~is_data_package)
```

```{r, echo=FALSE, fig.height=14}
df <- data_cran %>% 
  select(is_data_package, License) %>% 
  group_by(is_data_package, License) %>% 
  summarise(n=n())

df_wide <- df %>% 
  spread(is_data_package, n) %>% 
  mutate(`Non-data packages` = ifelse(is.na(`Non-data packages`), 0, `Non-data packages`),
         `Data packages` = ifelse(is.na(`Data packages`), 0, `Data packages`)) %>% 
  mutate(diff=`Data packages`-`Non-data packages`)

ggplot(df_wide, aes(fct_reorder(License, diff), diff)) +
  geom_col() + 
  coord_flip() +
  labs(x="License", y="<--- Non-data packages --------- Data packages --->") +
  theme(axis.text.y = element_text(size=8))
```

```{r fig.height=4, fig.width=8, echo=FALSE, fig.align='center'}
df_ts <- data_cran %>% 
  mutate(year = as.integer(strftime(data_cran$Published, format = "%Y"))) %>% 
  group_by(year, is_data_package) %>% 
  summarise(number = n())

# devtools::install_github("karthik/wesanderson")
library(wesanderson)
ggplot(df_ts, aes(year, number, color = is_data_package)) +
  geom_point() + 
  scale_y_log10(breaks = c(10, 100, 1000)) +
  scale_x_continuous(breaks = seq(2005, 2017, by=3)) +
  scale_color_manual(name=NULL, values = wes_palette("Royal1"))
```

## Potential Future Work

### Additional Data Sources 

 * Crawl Biocondunctor
 
 * Examine `inst/extdata` folders

### Additional Package Stats

 * Use Github URLs to pull geo-location of package maintainers

### Quality Assessment

 * Scoring the quality of data in a package
 
 * Creating badges to advertise data quality
 
 * Contact package authors with data quality deficiencies