OpenLaw NZ Data Pipeline

This is the OpenLaw NZ data ingest and parser pipeline. Given a common JSON file format it:

Downloads PDF files to AWS s3 (ingester, filefetcher)
Obtains and parses the text content, sending files to Azure cognitive services for OCR if necessary (pdfconverter)
Saves results to separate s3 buckets in JSON format (pdfconverter)
Inserts the results to a postgres database (putInDB)
Triggers a cloudwatch rule to monitor whether an ingest is complete (ingesterWatcher)
On completion, stops the watcher and start step functions to carry out additional parsing which must be done sequentially (parseCaseCitations, parseCaseToCase)

Structure

Each subdirectory is designed to be set up as a serverless function on AWS Lambda.

The flow between Lambda functions must be linked with s3 events notifications and sqs queues to ensure batching, rate limiting, and retryability.

This code is straight from Cloud9 IDE on AWS and is deliberately missing the template.yaml files to make it work. If you're an AWS Cloudformation guru please get in touch.

How to run

The pipeline is started by running /ingester. The ingester downloads case law and other legal information to s3.

The ingester must be set up as a lambda function with environment variables. See the ingester README for further details.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
createJSONFile		createJSONFile
docs		docs
downloadExternalFile/downloadExternalFile		downloadExternalFile/downloadExternalFile
filefetcher/filefetcher		filefetcher/filefetcher
ingester/ingester		ingester/ingester
ingesterBatcher/ingesterBatcher		ingesterBatcher/ingesterBatcher
ingesterWatcher/ingesterWatcher		ingesterWatcher/ingesterWatcher
invalidateCloudfrontCache		invalidateCloudfrontCache
parseCaseCitations/parseCaseCitations		parseCaseCitations/parseCaseCitations
parseCaseToCase/parseCaseToCase		parseCaseToCase/parseCaseToCase
parserStart/parserStart		parserStart/parserStart
pdfconverter		pdfconverter
putInDB/putInDB		putInDB/putInDB
rebuildAzureSearchIndex		rebuildAzureSearchIndex
s3ToSQS/s3ToSQS		s3ToSQS/s3ToSQS
tearDown		tearDown
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OpenLaw NZ Data Pipeline

Structure

How to run

About

Contributors 2

Languages

License

openlawnz/openlawnz-pipeline

Folders and files

Latest commit

History

Repository files navigation

OpenLaw NZ Data Pipeline

Structure

How to run

About

Topics

Resources

License

Stars

Watchers

Forks

Contributors 2

Languages