This project is an ETL system for cultural heritage metadata. The system has three primary components:
- OAI-PMH
- ResourceSync TBD
- API TBD
- File TBD
- Primo TBD
For additional information, please read the announcement blog post or the complete documentation on our wiki.
Clone the repo and build the JAR
git clone https://github.com/dpla/ingestion3.git
cd ingestion3
sbt package
OAI Harvester options
Option | Obligation | Usage |
---|---|---|
endpoint | Required | The base URL for the OAI repository. |
verb | Required | "ListSets" to harvest only sets; "ListRecords" to harvest records and any sets to which the records may belong. Case-sensitive. |
outputDir | Required | Location to save output. This should be a local path. Amazon S3 may be supported at some point. |
metadataPrefix | Required when verb="ListRecords"; prohibited when verb="ListSets". | The metadata format to request in OAI-PMH requests issued to the repository. |
provider | Required | The name of the source of the records. |
harvestAllSets | Optional when verb="ListRecords"; cannot be used in conjunction with either setlist or blacklist. | "True" to harvest records from all sets. Default is "false". Case-insensitive. Results will include all sets and all their records. This will only return records that belong to at least one set; records that do not belong to any set will not be included in the results. |
setlist | Optional when verb="ListRecords"; cannot be used in conjunction with either harvestAllSets or blacklist. | Comma-separated list of sets to include in the harvest. Use the OAI setSpec to identify a set. Results will include all sets in the setlist and all their records. |
blacklist | Optional when verb="ListRecords"; cannot be used in conjunction with either harvestAllSets or setlist. | Comma-separated list of sets to exclude from the harvest. Use the OAI setSpec to identify a set. Results will include all sets not in the blacklist and all their records. Records that do not belong to any set will not be included in the results. |
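
The obligations in the table above can be expressed as a small validation routine. The following is a minimal sketch in Python, not the harvester's own code: the function name `validate_oai_options` and the error messages are hypothetical, and the options are assumed to have already been read into a plain dictionary.

```python
def validate_oai_options(opts):
    """Return a list of problems found in a dict of harvester options.

    Hypothetical helper: it only mirrors the rules stated in the
    options table and is not part of ingestion3.
    """
    errors = []

    # Options the table marks as unconditionally required.
    for name in ("endpoint", "verb", "outputDir", "provider"):
        if not opts.get(name):
            errors.append(f"{name} is required")

    verb = opts.get("verb")
    # verb is case-sensitive, so it is compared exactly.
    if verb is not None and verb not in ("ListSets", "ListRecords"):
        errors.append('verb must be "ListSets" or "ListRecords"')

    has_prefix = "metadataPrefix" in opts
    if verb == "ListRecords" and not has_prefix:
        errors.append("metadataPrefix is required when verb is ListRecords")
    if verb == "ListSets" and has_prefix:
        errors.append("metadataPrefix is prohibited when verb is ListSets")

    # harvestAllSets is case-insensitive and defaults to "false".
    all_sets = str(opts.get("harvestAllSets", "false")).lower() == "true"
    selectors = [
        name
        for name, used in (
            ("harvestAllSets", all_sets),
            ("setlist", "setlist" in opts),
            ("blacklist", "blacklist" in opts),
        )
        if used
    ]
    # At most one of the three set selectors may be used at a time.
    if len(selectors) > 1:
        errors.append("mutually exclusive: " + ", ".join(selectors))

    return errors
```

An empty list means the options satisfy every rule in the table; otherwise each entry names one violated rule.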
Sample OAI harvester config file
# oai-sample.conf
verb = "ListRecords"
endpoint = "http://fedora.sample.org/oaiprovider/"
metadataPrefix = "mods"
outputDir = "/path/to/somewhere"
provider = "DPLA partner A"
blacklist = "ignore,ignore2"
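To illustrate what the sample config above asks for, the sketch below filters a repository's sets with the blacklist and builds a standard OAI-PMH `ListRecords` request URL for one remaining set. It is written in Python for brevity and is not the harvester's implementation. The helper names `select_sets` and `list_records_url` and the example setSpecs are hypothetical, and issuing one request per set is an assumption about how a harvester might proceed.

```python
from urllib.parse import urlencode


def select_sets(available, setlist=None, blacklist=None):
    """Pick the setSpecs to harvest from those the repository advertises."""
    if setlist is not None and blacklist is not None:
        raise ValueError("setlist and blacklist cannot be combined")
    if setlist is not None:
        wanted = {s.strip() for s in setlist.split(",")}
        return [s for s in available if s in wanted]
    if blacklist is not None:
        unwanted = {s.strip() for s in blacklist.split(",")}
        return [s for s in available if s not in unwanted]
    # Neither option given: keep every advertised set (as with harvestAllSets).
    return list(available)


def list_records_url(endpoint, metadata_prefix, set_spec):
    """Build an OAI-PMH ListRecords request URL restricted to one set."""
    query = urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix, "set": set_spec}
    )
    return f"{endpoint}?{query}"


# Hypothetical setSpecs, standing in for a repository's ListSets response.
available = ["maps", "ignore", "photos", "ignore2"]
to_harvest = select_sets(available, blacklist="ignore,ignore2")
url = list_records_url("http://fedora.sample.org/oaiprovider/", "mods", to_harvest[0])
```

With these inputs, `to_harvest` holds `maps` and `photos`, and `url` targets the `maps` set with the `mods` metadata prefix.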
To run the harvester with SBT, you need to specify the path to the config file you just created. Depending on where the config file is located, the path takes one of the forms shown below and is passed as a VM parameter.
Specifying the configuration file
# Example when the config file is stored locally. This should address 95% of all use cases
/local/path/to/config.conf
# Example when the config file (profile) is stored remotely
https://s3.amazonaws.com/dpla-i3-oai-profiles/sample_provider.conf
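The two forms above differ only in whether the location is a filesystem path or an HTTP(S) URL. A minimal Python sketch of telling them apart and reading the file follows. It is an illustration rather than the harvester's own logic: the helper names `config_location_kind` and `read_config_text` are hypothetical, and only the two forms shown above are handled.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def config_location_kind(location):
    """Classify a config location as "remote" (http/https URL) or "local"."""
    scheme = urlparse(location).scheme.lower()
    return "remote" if scheme in ("http", "https") else "local"


def read_config_text(location):
    """Return the config file's contents from a local path or a remote URL."""
    if config_location_kind(location) == "remote":
        with urlopen(location) as response:
            return response.read().decode("utf-8")
    with open(location, encoding="utf-8") as handle:
        return handle.read()
```

Here `config_location_kind("/local/path/to/config.conf")` yields `"local"`, while the S3 URL above yields `"remote"`.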
Example invocation with local config file
sbt "run /local/path/to/oai.conf /path/to/ingestion3.jar"
Specify the config file parameter as a VM Option argument.
https://s3.amazonaws.com/dpla-i3-oai-profiles/sample_provider.conf
$ sbt test # Runs unit tests
$ GEOCODER_HOST="my-geocoder" sbt it:testOnly # Runs integration tests