OS-Climate Data Extraction Scripts

osc-data-extraction-scripts

Scripts to build/deploy the OS-Climate data extraction tooling

Files

Name	Description
Dockerfile	Defines the steps required to build the data extraction tooling Docker container
bootstrap.sh	Used to install Docker on a server instance [Debian/Ubuntu/Fedora/Amazon Linux]
build.sh	Thin shell script/wrapper to build the Docker container
metadata	Contains variable definitions/parameters describing the Docker container
publish.sh	Thin shell script/wrapper to publish the Docker container to a registry
run.sh	Thin shell script/wrapper to run the Docker container
script.sh	Script copied into the Docker container; uses GNU parallel to run data extraction
tag.sh	Thin shell script/wrapper to tag the container with metadata

script.sh

This is the primary script that runs the data extraction toolset. It is designed to run on a Linux instance or inside a Docker container. It determines the number of processor cores available and invokes the data extraction Python tooling through GNU parallel, so that as many files as possible are processed concurrently.

The script obtains a list of files selected for processing by using a wildcard pattern match against a directory containing PDF files. Every file returned by the pattern match is then passed to the Python tooling via GNU parallel.

SELECTION="e15*.pdf"
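The pattern match described above could be sketched as follows. This is a minimal illustration, not the code in script.sh: the helper name `list_selected` is hypothetical, and the defaults simply mirror the `SOURCE` and `SELECTION` values shown in this README.

```shell
#!/usr/bin/env bash
# Sketch only: list_selected is a hypothetical helper, not part of script.sh.
SOURCE="${SOURCE:-inputs}"
SELECTION="${SELECTION:-e15*.pdf}"

# Print one line per file in directory $1 that matches wildcard pattern $2.
list_selected() {
    # $2 is deliberately unquoted so the shell expands the wildcard.
    for file in "$1"/$2; do
        # An unmatched pattern stays literal, so skip names that do not exist.
        [ -e "$file" ] && printf '%s\n' "$file"
    done
    return 0
}

list_selected "$SOURCE" "$SELECTION"
```

If nothing matches, the helper prints nothing, so a downstream job runner receives an empty list rather than the literal pattern.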

Because a Docker container can be restricted to a reduced number of processor cores, Docker provides a simple way to test how the tool performs when given a varying number of cores on which to run jobs.
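As an illustration of such a test, the sketch below prints one `docker run` command per core count. It uses the `--cpuset-cpus` option, which pins a container to specific cores. The image name and the helpers `cpuset_for` and `docker_cmd` are assumptions made for this example; run.sh may invoke Docker differently.

```shell
#!/usr/bin/env bash
# Sketch only: the image name below is hypothetical.
IMAGE="${IMAGE:-osc-data-extraction:latest}"

# Turn a core count into a cpuset range, e.g. 4 -> 0-3.
cpuset_for() {
    if [ "$1" -le 1 ]; then
        echo "0"
    else
        echo "0-$(( $1 - 1 ))"
    fi
}

# Print (rather than execute) the docker command for a given core count.
docker_cmd() {
    printf 'docker run --rm --cpuset-cpus=%s %s\n' "$(cpuset_for "$1")" "$IMAGE"
}

for cores in 1 2 4; do
    docker_cmd "$cores"
done
```

The commands are only printed, so they can be reviewed before being piped to a shell. Note that `--cpus` limits CPU time through a scheduler quota and generally does not change the core count reported by tools such as `nproc`, whereas `--cpuset-cpus` does.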

The function that performs the processing is defined as:

_process_files() {
 echo "Processing: $1"
 sleep 3
}
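One way a function like this could be handed to GNU parallel is sketched below. It assumes bash (for `export -f`) and an installed GNU parallel; `count_cores` and `run_parallel` are hypothetical names introduced for the example and may not match script.sh.

```shell
#!/usr/bin/env bash
# Placeholder processing function, as shown in this README.
_process_files() {
    echo "Processing: $1"
    sleep 3
}

# Hypothetical helper: report the number of available cores, falling back to 1.
count_cores() {
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# Hypothetical helper: run _process_files once per argument, one job per core.
run_parallel() {
    # GNU parallel can call a bash function only if it has been exported.
    export -f _process_files
    parallel --jobs "$(count_cores)" _process_files ::: "$@"
}
```

With these definitions, `run_parallel inputs/e15*.pdf` would process every matching file, running as many jobs at once as there are cores.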

The source directory containing PDF files for ingestion is by default:

SOURCE="inputs"

To invoke the data extraction tooling, replace the sleep statement with the code required to ingest and process files. The shell script measures the elapsed time (in seconds) of the batch job, which makes it straightforward to compare performance across different core counts, code versions, or other variables.
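The timing approach can be illustrated with a small sketch. It runs the files one after another for simplicity and is not the code in script.sh: `run_batch` and the `DELAY` variable are hypothetical, and the `sleep` line stands in for the real extraction command.

```shell
#!/usr/bin/env bash
# Placeholder processing function; DELAY is a hypothetical knob for this sketch.
_process_files() {
    echo "Processing: $1"
    sleep "${DELAY:-3}"   # replace with the real ingestion/extraction command
}

# Hypothetical helper: process every argument and report the elapsed seconds.
run_batch() {
    start=$(date +%s)
    for file in "$@"; do
        _process_files "$file"
    done
    end=$(date +%s)
    echo "Elapsed: $((end - start)) seconds"
}
```

Calling `run_batch inputs/e15*.pdf` under different core limits or code versions gives elapsed times that can be compared directly.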

License

All repository code and contents are licensed under the Apache-2.0 license.
