This repository documents how to build a language corpus from the Farms to Freeways history project data.
The data are archived at Western Sydney University and are available in an Omeka repository.
Peter Sefton exported the data into an RO-Crate, using this process.
These tools work on the resulting RO-Crate.
Then install the dependencies:
npm install
This work has already been done and is not automated, but here are notes on how it was done.
The transcripts in the Omeka repository are in PDF format and speaker turns are only indicated using bold-face text.
There are some plain text versions available but they don't have speaker turns indicated.
To extract text from the PDF files in the repo, first use LibreOffice.
On a Mac, this command will create SVG files in the working directory:
find farms-to-freeways/ -name "*.pdf" -exec /Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to svg {} \;
Move these into an svgfiles directory:
mkdir -p svgfiles
mv *.svg svgfiles/
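
Before moving on, it can be useful to check that every PDF produced an SVG. The following is a minimal sketch, not part of this repository. It assumes that LibreOffice names each output file after the input file's base name, and the helper names `findFiles` and `missingSvgs` are hypothetical.

```javascript
const fs = require("fs");
const path = require("path");

// Recursively collect files under `dir` whose names end with `ext`.
function findFiles(dir, ext) {
  const found = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) found.push(...findFiles(full, ext));
    else if (entry.name.endsWith(ext)) found.push(full);
  }
  return found;
}

// Return the PDFs that have no SVG with the same base name.
// Assumption: the converter keeps the base name and only swaps the extension.
function missingSvgs(pdfPaths, svgPaths) {
  const base = (p) => path.basename(p, path.extname(p));
  const have = new Set(svgPaths.map(base));
  return pdfPaths.filter((p) => !have.has(base(p)));
}

// Usage, with the directory names used in the commands above:
// console.log(missingSvgs(findFiles("farms-to-freeways", ".pdf"),
//                         findFiles("svgfiles", ".svg")));
```

An empty list suggests that every transcript was converted; any paths printed are PDFs worth converting again by hand.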
Run svg2csv to create CSV files in csvfiles/:
node svg2csv.js
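
To illustrate the idea behind this step, here is a rough sketch of how bold-face runs in an SVG could be turned into speaker turns and written out as CSV. It is not the code in svg2csv.js, which may work quite differently. It assumes that each text run is a `<tspan>` carrying its own `font-weight` attribute, that a weight of `700` or `bold` marks bold text, and that a change between bold and regular text marks a change of speaker. The function names and the `style,text` column layout are hypothetical.

```javascript
// Pull text runs out of an SVG string, noting which ones are bold.
// Assumption: the weight sits on the same <tspan> that holds the text.
function extractRuns(svg) {
  const runs = [];
  const re = /<tspan([^>]*)>([^<]*)<\/tspan>/g;
  let m;
  while ((m = re.exec(svg)) !== null) {
    const text = m[2].trim();
    if (!text) continue; // skip empty runs
    const bold = /font-weight="(700|bold)"/.test(m[1]);
    runs.push({ text, bold });
  }
  return runs;
}

// Merge consecutive runs of the same weight into one turn.
// Assumption: a switch between bold and regular is a speaker change.
function groupTurns(runs) {
  const turns = [];
  for (const run of runs) {
    const last = turns[turns.length - 1];
    if (last && last.bold === run.bold) {
      last.text += " " + run.text;
    } else {
      turns.push({ bold: run.bold, text: run.text });
    }
  }
  return turns;
}

// Serialise turns as CSV, doubling any embedded quotes.
function toCsv(turns) {
  const quote = (s) => `"${String(s).replace(/"/g, '""')}"`;
  const rows = turns.map((t) =>
    [t.bold ? "bold" : "regular", t.text].map(quote).join(",")
  );
  return ["style,text", ...rows].join("\n");
}
```

Calling `toCsv(groupTurns(extractRuns(svgString)))` on the contents of one SVG file would give one CSV row per turn. Real LibreOffice output nests `<tspan>` elements, so a working version would likely need a proper XML parser rather than a regular expression.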
Copy the CSV files to CloudStor:
rsync csvfiles/* ~/cloudstor/atap-repo-misc/farms_to_freeways_csv_files/ -ruvi
Assuming there is a copy of the Farms to Freeways data, as exported from Omeka, in CloudStor, run the script:
make BASE_DATA_DIR=/farms-to-freeways/data REPO_OUT_DIR=/your/ocfl-repo BASE_TMP_DIR=/your/temp
See oni/README.md for instructions