Skip to content

hathitrust/datasets

Repository files navigation

Tests Coverage Status Ruby Style Guide

HathiTrust Datasets Builder

This project maintains the full text datasets provided to researchers and the HathiTrust Research Center.

Development

git clone https://github.com/hathitrust/datasets
cd datasets
docker-compose build
docker-compose run test bundle install
docker-compose run test

Datasets Design

Volumes

The datasets are the volume fulltexts with rights that permit inclusion in and distribution by the hathitrust research datasets. Each volume is uniquely identified by a prefix and number. e.g. pre.01234567891011 or exp.11223344556677

Superset (ht_text)

The entire corpus of the fulltexts available for research is called ht_text. This is the superset of available volumes. This set is only used directly by the HathiTrust Research Center.

Subsets

There are subsets of volumes that correspond to specific rights attributed to the volumes. For example, volumes with the rights, public domain world, are in the subset ht_text_pd_world. The list of subsets is:

  • ht_text_pd
  • ht_text_pd_open_access
  • ht_text_pd_world
  • ht_text_pd_world_open_access

Content and Symlinks

The The zip files which contain the data reside in ht_text (the superset). The subsets are mirrors of sections of the ht_text pairtree with the final directory being a symlink to the correpsonding directory in ht_text.

ls -l /datasets/ht_text_pd_world/obj/exp/pairtree_root/11/22/33/44/55/66/77
11223344556677 -> /datasets/ht_text/ht_text_pd_world/obj/exp/pairtree_root/11/22/33/44/55/66/77

Operation

  1. Queue Check

Prior to beginning a new run, the queue of jobs must be empty. In order to be empty, each job must have completed successfully. Failed jobs are re-queued. This is done to prevent race conditions with multiple changes to the same volume.

  1. Get Changes

There are two kinds of changes to the HathiTrust volumes that the research datasets need to incorporate: - Rights: Updates to the copyright determination or access rights. Queried from the aptly named, rights table. - Content: Updates to the OCR text. Queried from the re-ingest feed table.

  1. Filter Changes

The list of changes is filtered into queues. There is a queue for each subset and a queue for the content changes.

  1. Schedule Jobs

For each volume in a queue, a job is scheduled to apply the changes to the filesystem.

Use

Deployed via private ArgoCD control repository

This creates a set of workers for handling data set jobs, as well as a set of cron jobs to generate the dataset full inventory, fetch metadata, queue jobs for updating the data set, and compiling and processing the logs generated by the workers.

Assumptions & Dependencies

Atomic filesystem moves. This remains to be tested.