Node scripts to build an Arkisto-ready language data collection from the "From Farms to Freeways" history project data.

Language-Research-Technology/corpus-tools-farms-to-freeways

Farms to Freeways Arkisto corpus ingest tools

This repository documents how to build a language corpus from the Farms to Freeways history project data.

The data are archived at Western Sydney University and are available in an Omeka repository.

Peter Sefton exported the data into an RO-Crate, using this process.

These tools work on the resulting RO-Crate.

Install

Install the Node dependencies:

npm install

Making CSV files from PDF transcripts

This work has already been done and is not automated. The notes below record how it was done.

The transcripts in the Omeka repository are in PDF format and speaker turns are only indicated using bold-face text.

There are some plain text versions available but they don't have speaker turns indicated.
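To illustrate why the bold-face information matters, here is a minimal sketch of how speaker turns might be recovered from it. This is not the code in svg2csv.js. It assumes the text has already been extracted as a list of runs with a `bold` flag, that bold runs belong to one speaker and regular runs to the other, and the names `groupTurns`, `interviewer` and `interviewee` are hypothetical. The sample dialogue is made up.

```javascript
// Sketch only: groups consecutive text runs into speaker turns.
// Assumption: bold runs are one speaker, non-bold runs are the other.
// The speaker labels are hypothetical defaults, not taken from the data.
function groupTurns(runs, boldSpeaker = 'interviewer', plainSpeaker = 'interviewee') {
  const turns = [];
  for (const run of runs) {
    const text = run.text.trim();
    if (text === '') continue; // ignore whitespace-only runs

    const speaker = run.bold ? boldSpeaker : plainSpeaker;
    const last = turns[turns.length - 1];
    if (last && last.speaker === speaker) {
      last.text += ' ' + text; // same speaker keeps talking
    } else {
      turns.push({ speaker, text }); // weight changed, so a new turn starts
    }
  }
  return turns;
}

// Made-up sample input, in reading order.
const exampleTurns = groupTurns([
  { text: 'Where did you grow up?', bold: true },
  { text: 'On a farm', bold: false },
  { text: 'near the river.', bold: false },
]);
console.log(exampleTurns);
```

The plain text versions lose exactly the `bold` flag this sketch relies on, which is why the PDF files are used as the starting point.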

To extract text from the PDF files in the repo, first convert them to SVG with LibreOffice.

On a Mac, this command converts each PDF to an SVG file in the working directory:

find farms-to-freeways/ -name "*.pdf" -exec /Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to svg {} \;

Move these into an svgfiles directory (create it first if it does not exist):

mv *.svg svgfiles/

Run svg2csv to create CSV files in csvfiles/:

node svg2csv.js
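For readers unfamiliar with the output format, the sketch below shows one way a list of speaker turns could be serialised as CSV with standard quoting. It is an illustration, not the contents of svg2csv.js: the column names `speaker` and `text` and the helper names `csvField` and `toCsv` are assumptions, and the real files may use a different layout.

```javascript
// Sketch only: serialise rows as CSV with RFC 4180 style quoting.
// Column names are hypothetical and may not match the real output.
function csvField(value) {
  const s = String(value);
  // Quote fields containing quotes, commas or line breaks; double any quotes.
  return /[",\r\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

function toCsv(rows, columns) {
  const lines = [columns.map(csvField).join(',')]; // header row
  for (const row of rows) {
    lines.push(columns.map((c) => csvField(row[c] ?? '')).join(','));
  }
  return lines.join('\r\n') + '\r\n';
}

// Made-up sample row containing both a comma and quotation marks.
const csvText = toCsv(
  [{ speaker: 'A', text: 'He said "hi", then left' }],
  ['speaker', 'text']
);
console.log(csvText);
```

Quoting matters for transcripts in particular, because spoken text routinely contains commas and quotation marks.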

Copy the CSV files to CloudStor:

rsync csvfiles/*  ~/cloudstor/atap-repo-misc/farms_to_freeways_csv_files/ -ruvi

Convert the metadata file from a plain-old crate to being a corpus

This step assumes there is a copy of the Farms to Freeways data, as exported from Omeka, in CloudStor.

  • Run the script with make, replacing the paths with your own directories:
make BASE_DATA_DIR=/farms-to-freeways/data REPO_OUT_DIR=/your/ocfl-repo BASE_TMP_DIR=/your/temp

How to run your own oni

See oni/README.md for instructions
