Farms to Freeways Arkisto corpus ingest tools

This repository documents how to build a language corpus from the Farms to Freeways history project data.

The data are archived at Western Sydney University and are available in an Omeka repository.

Peter Sefton exported the data into an RO-Crate, using this process.

These tools work on the resulting RO-Crate.

Install

Install the dependencies with npm:

npm install

Making CSV files from PDF transcripts

This work has already been done and is not automated; these notes record how it was done.

The transcripts in the Omeka repository are in PDF format and speaker turns are only indicated using bold-face text.

There are some plain text versions available but they don't have speaker turns indicated.

To extract text from the PDF files in the repo, first convert them to SVG with LibreOffice.

On a Mac, this command creates the SVG files in the working directory:

find farms-to-freeways/ -name "*.pdf" -exec /Applications/LibreOffice.app/Contents/MacOS/soffice --headless --convert-to svg {} \;

Move these into an svgfiles directory:

mv *.svg svgfiles/
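
For anyone not on a Mac, or who would rather script these two steps, here is a minimal Node.js sketch of the same find, convert, and move sequence. The function names (`findPdfs`, `sofficeArgs`, `moveSvgs`, `convertAll`) are made up for this illustration and are not part of this repo. The `soffice` path and flags are the ones from the command above; the path will differ on other platforms.

```javascript
const fs = require('node:fs');
const path = require('node:path');
const { execFileSync } = require('node:child_process');

// LibreOffice binary on macOS, copied from the find command above.
// Assumption: adjust this for Linux or Windows installs.
const SOFFICE = '/Applications/LibreOffice.app/Contents/MacOS/soffice';

// Recursively collect every *.pdf path under `root` (like `find -name "*.pdf"`).
function findPdfs(root) {
  const found = [];
  for (const entry of fs.readdirSync(root, { withFileTypes: true })) {
    const full = path.join(root, entry.name);
    if (entry.isDirectory()) {
      found.push(...findPdfs(full));
    } else if (entry.isFile() && entry.name.endsWith('.pdf')) {
      found.push(full);
    }
  }
  return found.sort();
}

// Build the soffice argument list for one PDF; same flags as the command above.
function sofficeArgs(pdfPath) {
  return ['--headless', '--convert-to', 'svg', pdfPath];
}

// Convert every PDF under `root`, one soffice call per file.
function convertAll(root) {
  for (const pdf of findPdfs(root)) {
    execFileSync(SOFFICE, sofficeArgs(pdf), { stdio: 'inherit' });
  }
}

// Move every *.svg in `fromDir` into `toDir`, creating `toDir` if needed.
function moveSvgs(fromDir, toDir) {
  fs.mkdirSync(toDir, { recursive: true });
  const moved = [];
  for (const name of fs.readdirSync(fromDir)) {
    if (name.endsWith('.svg')) {
      fs.renameSync(path.join(fromDir, name), path.join(toDir, name));
      moved.push(name);
    }
  }
  return moved.sort();
}
```

Calling `convertAll('farms-to-freeways/')` and then `moveSvgs('.', 'svgfiles')` should be roughly equivalent to the two commands above, assuming LibreOffice writes its output to the current working directory as it does for the `find` command.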

Run svg2csv to create CSV files in csvfiles/:

node svg2csv.js
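
svg2csv.js is the reference for how the conversion was actually done. As a rough illustration of the underlying idea only, the sketch below shows one way to go from text runs to speaker turns, assuming the runs have already been pulled out of the SVG with a `bold` flag and that each switch between bold and regular text starts a new turn. The function names, the `turn,bold,text` columns, and the sample text are all invented for this example and may not match what svg2csv.js reads or writes.

```javascript
// Group consecutive runs with the same weight into one turn.
// Assumption: a change between bold and regular text marks a new speaker turn.
function runsToTurns(runs) {
  const turns = [];
  for (const run of runs) {
    const text = run.text.trim();
    if (text === '') continue; // skip whitespace-only runs
    const last = turns[turns.length - 1];
    if (last && last.bold === run.bold) {
      last.text += ' ' + text; // same speaker keeps talking
    } else {
      turns.push({ bold: run.bold, text });
    }
  }
  return turns;
}

// Quote a CSV field when it contains a comma, quote, or line break,
// doubling any embedded quotes (RFC 4180 style).
function csvField(value) {
  const s = String(value);
  return /[",\r\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

// Serialise turns as CSV with a hypothetical turn,bold,text header.
function turnsToCsv(turns) {
  const lines = ['turn,bold,text'];
  turns.forEach((turn, i) => {
    lines.push([i + 1, turn.bold, turn.text].map(csvField).join(','));
  });
  return lines.join('\n') + '\n';
}

// Invented sample runs, not taken from the transcripts.
const sampleRuns = [
  { text: 'Where did you grow up?', bold: true },
  { text: 'On a farm,', bold: false },
  { text: 'near the river.', bold: false },
];

console.log(turnsToCsv(runsToTurns(sampleRuns)));
```

With the sample runs this prints a header row followed by two turns: the bold question, then the two regular runs merged into one quoted field because the text contains a comma.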

Copy the CSV files to CloudStor:

rsync -ruvi csvfiles/* ~/cloudstor/atap-repo-misc/farms_to_freeways_csv_files/

Convert the metadata file from a plain-old crate to being a corpus

This step assumes there is a copy of the Farms to Freeways data, as exported from Omeka, in CloudStor.

Run the script with make, pointing the variables at your own directories:

make BASE_DATA_DIR=/farms-to-freeways/data REPO_OUT_DIR=/your/ocfl-repo BASE_TMP_DIR=/your/temp
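
If you drive the build from a Node script rather than typing the command, a small helper can catch a forgotten variable before make starts. This is a sketch only: `makeArgs` and `REQUIRED_VARS` are hypothetical names, and treating all three variables as required is an assumption based on the command above, not on the Makefile itself.

```javascript
// The three variables passed on the make command line above.
// Assumption: all three must be set; check the Makefile for defaults.
const REQUIRED_VARS = ['BASE_DATA_DIR', 'REPO_OUT_DIR', 'BASE_TMP_DIR'];

// Turn an object of settings into NAME=value arguments for make,
// failing early with a readable message if any variable is missing.
function makeArgs(vars) {
  const missing = REQUIRED_VARS.filter((name) => !vars[name]);
  if (missing.length > 0) {
    throw new Error('Missing make variables: ' + missing.join(', '));
  }
  return REQUIRED_VARS.map((name) => `${name}=${vars[name]}`);
}
```

The result can be handed to `child_process.execFileSync('make', makeArgs({...}), { stdio: 'inherit' })`, which avoids shell quoting problems when a path contains spaces.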

How to run your own oni

See oni/README.md for instructions.

Language-Research-Technology/corpus-tools-farms-to-freeways
