Skip to content

UCLATALL/CourseKataData

Repository files navigation

CourseKataData

NOTE: This package and wiki are being phased out. Please see research.coursekata.org for more info, and the mo for a more efficient data cleaner.

R build status, Codecov test coverage, CRAN status

This is the source code and an introduction to the CourseKataData package. If you are trying to learn more about the contents of the actual data, or how other researchers are using this package, head to the Wiki.

The goal of CourseKataData is to help researchers working with CourseKata courses on data science, statistics, and modeling. The data downloaded from CourseKata can be useful in the more-or-less “raw” form that it comes in, but you will usually want to process the data before working with it. This package includes useful functions for cleaning and tidying the raw course data from completed courses.

Installation

Download the package directly from this repository using devtools:

library(devtools)
install_github("UCLATALL/CourseKataData")

Usage

This section details the usage of the functions in the CourseKataData package. If you would like to read more about the actual structure of the data downloaded from CourseKata, check out the associated Data Structure page in the Wiki.

One and Done

The easiest way to get started working with your data is to just download it and run process_data with the path to the zip file (no need to unzip the file first).

library(CourseKataData)

process_data("path/to/downloaded/data.zip")

If you would like to unzip the file first (which can sometimes be faster), you can also pass the name of the directory to process_data(). If your data was in path/to/downloaded/data.zip and you extracted it to path/to/downloaded/data, then the call should be to the latter path:

process_data("path/to/downloaded/data")

When you run process_data(), it will load all of the data and create six data frames (actually tibbles, but they work pretty much the same). These are the names of the tables that are created:

  • classes
  • responses
  • items
  • tags
  • media_views
  • page_views

If there is already an R object with one of these names, you will be given the opportunity to abort the processing or overwrite the existing object. Advanced users can check the R documentation (?process_data) to learn how to create these variables in a different environment.

Automatically Merge Multiple Downloads

If you have downloaded multiple data download bundles from CourseKata, don’t worry, this package takes care of merging the files for you. You can specify a vector of zip files or directories to load:

paths <- c("path/to/first/zip", "path/to/second/zip", "path/to/a/directory")
process_data(paths)

Or, you can put all of the data into a common directory and specify that directory:

# assuming you put all of the downloaded bundles in the same directory
process_data("path/to/directory/containing/bundles")

Note: this process relies on the fact that the data bundles you download have a consistent format. It may get confused if you reorganize things too much within the downloaded bundles. (See more about the structure of the data download and what is in each file in the Data Structure page in the Wiki.)

Time Zones

By default the data is parsed using the “UTC” time zone. If you would like to convert the date time data in each table to a different time zone you can specify it. Here is an example with Los Angeles, and another showing how to let your computer decide what time zone you are in:

# specifically Los Angeles
process_data("path/to/downloaded/data", time_zone = "America/Los_Angeles")

# let the system figure it out for you
process_data("path/to/downloaded/data", time_zone = Sys.timezone())

If you are having trouble with this, see ?timezones for more information.

Splitting Responses

By default, all of the responses in the course are included in the responses table. These responses can be semantically split into three parts: the surveys at the beginning and end of the course, the practice quizzes at the end of each chapter (these are now called review questions, but this function still calls them quizzes because they used to be called that), and the rest of the items in the text. If you would like to split the responses like this, there is a handy split_responses() function that will return a list of three tables:

parts <- split_responses(responses)
parts$in_text
parts$quizzes
parts$surveys

Piece by Piece

If you would instead like to process only part of the data, you can use the helpful sub-process functions that exist for each of the file types in the data download:

  • process_classes()
  • process_responses()
  • process_items()
  • process_tags()
  • process_media_views()
  • process_page_views()

As with process_data(), each of these functions will accept the path to a CourseKata data bundle zip file or the directory of an extracted zip file, a directory full of bundles, or you can simply supply the name of the file that you want to process with the function.

# zip file
process_classes("path/to/downloaded/data.zip")

# directory
process_classes("path/to/downloaded/data)

The classes data file contains information about each class in the data download. However, the other files are specific to each class included in the download. If you want to extract a specific class, you can specify the class_id, which should correspond to the name of the class’s folder in the classes folder of the download. (See more about the structure of the data download and what is in each file in the Data Structure page in the Wiki.)

# a single class
process_page_views(
  "path/to/downloaded/data.zip", 
  class_id = "dad3c954-3cb5-11ea-b81d-1d385b778009"
)

# two classes
process_page_views(
  "path/to/downloaded/data.zip", 
  class_id = c(
    "dad3c954-3cb5-11ea-b81d-1d385b778009",
    "1a99bb15-4d19-4e3f-bb60-6332141573ed"
  )
)

Export Data

If you are not going to use R to perform your data analysis, you will likely want to export the processed data back to CSV or some other format. Note that many of the base functions in R will have trouble writing the processed data because many of the processed tables have list-columns. A list-column is a column of a data frame where each row / element is itself a list. Here is an example:

tbl_with_lists <- tibble::tibble(x = list(
  # first element/row of column x
  list(
    list(1, 2, 3),
    list(4, 5, 6)
  ),
  # second element/row of column x
  list(
    list(7, 8, 9)
  )
))

These columns need to be converted back to a format that can be written as a string if you want to export the data to CSV or a similar format that does not have clear rules for exporting lists. For convenience, the convert_lists() function is provided to help conversion, and we load a custom CSV writer to handle this for you:

# if you just want to convert lists
convert_lists(tbl_with_lists)
#> # A tibble: 2 × 1
#>   x                            
#>   <chr>                        
#> 1 [[[1],[2],[3]],[[4],[5],[6]]]
#> 2 [[[7],[8],[9]]]

If you specifically want to export to CSV, we have included a function that can handle this for you. It automatically masks the standard write.csv() with one that will handle these list columns:

# make sure you have CourseKataData loaded
library(CourseKataData)
responses <- process_responses("path/to/some/data")
write.csv(responses, "responses.csv")

Contributing

If you see an issue, problem, or improvement that you think we should know about, or you think would fit with this package, please let us know on our issues page. Alternatively, if you are up for a little coding of your own, submit a pull request:

  1. Fork it!
  2. Create your feature branch: git checkout -b my-new-feature
  3. Commit your changes: git commit -am 'Add some feature'
  4. Push to the branch: git push origin my-new-feature
  5. Submit a pull request :D