This tool aims to bring digital preservation transparency to the process of copying digital collections files from local storage to S3. Amazon's awscli provides high-level commands such as sync and low-level commands such as put-object, which claim to validate fixity on uploaded objects behind the scenes, but that validation is not transparent to the user.
In our digital collections processes, we generate fixity digests with md5deep as soon as possible after file capture or creation, and we use those digests to verify files on each move. This tool does the following:
- Verify that all files in the fixity manifest exist in the filesystem
- Verify that all files in the filesystem are listed in the manifest
- Ignore some files like Thumbs.db
- Verify the MD5 fixity for each file matches the fixity recorded in the manifest
- Replicate files from local storage to AWS S3
- Request that AWS validate the MD5 to ensure the file in S3 matches the local copy
- Record the MD5 of the file in the metadata of the S3 object
- Log all of these actions to a log file for review
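The local checks in the list above (manifest against filesystem, ignore list, MD5 comparison) can be sketched as follows. This is a minimal illustration and not the tool's actual source: the function names, the `IGNORED_NAMES` set, and the assumed manifest layout (`<md5>  <relative path>` per line, as md5deep writes with relative paths) are all assumptions.

```python
import hashlib
from pathlib import Path

# Assumption: the tool's real ignore list is not shown in this README;
# Thumbs.db is the only ignored name it mentions.
IGNORED_NAMES = {"Thumbs.db"}


def md5_hex(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_manifest(manifest_path):
    """Parse "<md5>  <relative path>" lines into {relative path: md5}."""
    records = {}
    for line in Path(manifest_path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        digest, name = line.split(maxsplit=1)
        relative = Path(name)
        if relative.name in IGNORED_NAMES:
            continue  # ignored entries are dropped from the comparison
        records[relative.as_posix()] = digest.lower()
    return records


def scan_directory(root, manifest_name):
    """Hash every file under root except ignored names and the manifest."""
    root = Path(root)
    found = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if path.name in IGNORED_NAMES or path.name == manifest_name:
            continue
        found[path.relative_to(root).as_posix()] = md5_hex(path)
    return found


def compare(manifest, found):
    """Report the three kinds of discrepancy between manifest and files."""
    return {
        "missing_from_filesystem": sorted(set(manifest) - set(found)),
        "missing_from_manifest": sorted(set(found) - set(manifest)),
        "md5_mismatch": sorted(
            name
            for name in set(manifest) & set(found)
            if manifest[name] != found[name]
        ),
    }
```

In this sketch, replication would proceed only when all three lists returned by `compare` are empty, which corresponds to the "Filesystem and manifest match" condition the tool logs.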
To get a local copy up and running, follow these steps.
- Clone the repo
git clone https://github.com/VTUL/digital-collections-cloud-replicate.git
- Install libraries
pip install -r requirements.txt
See options using -h
$ ./s3-replicate.py -h
usage: s3-replicate.py [-h] [-c CONFIG] -d DIRECTORY [-f] [-l LOG] [-m MANIFEST] [-p PROFILE] -u URI [-v]
optional arguments:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
path to aws credentials file. E.g., /home/user/.aws/credentials. Default is ~/.aws/credentials
-d DIRECTORY, --directory DIRECTORY
path to digital collections directory. E.g., /some/path
-f, --fixity perform fixity validation against manifest
-l LOG, --log LOG directory to save logfile. E.g., /some/path. Default is POSIX temp directory
-m MANIFEST, --manifest MANIFEST
name of manifest file if not "checksums-md5.txt"
-p PROFILE, --profile PROFILE
aws profile name. E.g., default. Default is default.
-u URI, --uri URI S3 URI. E.g. s3://vt-testbucket/SpecScans/IAWA3/JDW/
-v, --verbose print verbose output to console
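For readers who want to script around the tool, the interface printed above could be declared with argparse roughly as follows. This is a reconstruction from the help text only, not the tool's actual source; `build_parser` and `split_s3_uri` are hypothetical names, and the defaults are inferred from the option descriptions.

```python
import argparse
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def build_parser():
    """Declare the options listed in the help output (reconstructed)."""
    parser = argparse.ArgumentParser(prog="s3-replicate.py")
    parser.add_argument("-c", "--config",
                        default=str(Path.home() / ".aws" / "credentials"),
                        help="path to aws credentials file")
    parser.add_argument("-d", "--directory", required=True,
                        help="path to digital collections directory")
    parser.add_argument("-f", "--fixity", action="store_true",
                        help="perform fixity validation against manifest")
    parser.add_argument("-l", "--log", default=tempfile.gettempdir(),
                        help="directory to save logfile")
    parser.add_argument("-m", "--manifest", default="checksums-md5.txt",
                        help="name of manifest file")
    parser.add_argument("-p", "--profile", default="default",
                        help="aws profile name")
    parser.add_argument("-u", "--uri", required=True, help="S3 URI")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print verbose output to console")
    return parser


def split_s3_uri(uri):
    """Split s3://bucket/prefix/ into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")
```

Only -d and -u are required, matching the usage line, where they are the only options not shown in square brackets.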
Run with options
$ ./s3-replicate.py -u s3://imagestore/SpecScans/IAWA/JDW/ -d /home/jjt/Downloads/ingest_test/in_jdw/ -m checksums-md5-jdw.txt -f -v
Review logfile
$ cat /tmp/s3-replicate_in_jdw_JDW_2021-09-20-12-44-04.log
2021-09-20 14:43:43,092 - INFO - Replicating files from /home/jjt/Downloads/ingest_test/in_jdw to s3://imagestore/SpecScans/IAWA/JDW/
2021-09-20 14:43:43,102 - INFO - Found credentials in shared credentials file: ~/.aws/credentials
2021-09-20 14:43:43,380 - INFO - User has write access to S3 bucket
2021-09-20 14:43:43,384 - INFO - Ignoring manifest entry matching ignore list: ./jdwst001001/Thumbs.db
2021-09-20 14:43:43,385 - INFO - Found 5 records in manifest file
2021-09-20 14:43:43,385 - INFO - Using 4 manifest records after matching ignored files
2021-09-20 14:43:43,385 - INFO - Scanning files at /home/jjt/Downloads/ingest_test/in_jdw. Generating fixity will take time
2021-09-20 14:43:43,386 - INFO - Ignoring file checksums-md5-jdw.txt
2021-09-20 14:43:43,675 - INFO - Found 4 files in /home/jjt/Downloads/ingest_test/in_jdw after ignoring 1 files
2021-09-20 14:43:43,675 - INFO - Filesystem and manifest match.
2021-09-20 14:43:43,675 - INFO - Initiating file replication to s3://imagestore/SpecScans/IAWA/JDW/
2021-09-20 14:43:46,191 - INFO - {'ResponseMetadata': {'RequestId': '3M752071KR8DS1YW', 'HostId': 'ueyoxW3Wkdff6SJan2S1zv6Mkm1wbMQb/lfy9hq97m4AlGRQFFe4DMDFUuqSdrqR+6dvl03QgNk=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'ueyoxW3Wkdfe6SJan2S1zv6Mkm1wbMQb/lfy9hq97m4AlGRQFFe4DMDFUuqSdjqR+6xvl03QgNk=', 'x-amz-request-id': '3M752971KK8DS1YW', 'date': 'Mon, 20 Sep 2021 18:43:44 GMT', 'etag': '"7034b2e690d2e04bc50a6ce8a8be392e"', 'server': 'AmazonS3', 'content-length': '0'}, 'RetryAttempts': 0}, 'ETag': '"7034b2e690d2e04bc50a6ce8a8be392e"'}
...
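The response logged above resembles what boto3's `put_object` returns, including an `ETag`. The sketch below shows one way an upload could ask S3 to validate the MD5 and store it as object metadata. It is an illustration under assumptions, not the tool's actual code: `upload_with_md5` and the `md5` metadata key are hypothetical, and the ETag comparison holds only for single-part uploads that are not encrypted with SSE-KMS or SSE-C, where S3 documents the ETag as the MD5 of the object data.

```python
import base64
import hashlib


def upload_with_md5(s3_client, bucket, key, path):
    """Upload one file, asking S3 to verify its MD5 on receipt.

    s3_client is assumed to be a boto3 S3 client, e.g. the result of
    boto3.Session(profile_name="default").client("s3").
    """
    # Reading the whole file keeps the sketch short; large files would need
    # streaming or multipart handling, which is out of scope here.
    with open(path, "rb") as handle:
        body = handle.read()

    digest = hashlib.md5(body).digest()
    hex_md5 = digest.hex()
    # The Content-MD5 header carries the base64 of the raw 16-byte digest,
    # not the hex string. S3 rejects the upload if the body does not match.
    content_md5 = base64.b64encode(digest).decode("ascii")

    response = s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentMD5=content_md5,
        Metadata={"md5": hex_md5},  # hypothetical metadata key name
    )

    # The ETag is returned wrapped in double quotes.
    etag = response["ETag"].strip('"')
    return hex_md5, etag == hex_md5
```

When reviewing the logfile, the same comparison can be made by hand: the `etag` value in each logged response should equal the MD5 recorded for that file in the manifest, subject to the single-part and encryption caveats above.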
See the open issues for a list of proposed features (and known issues).
Distributed under the Apache 2.0 License. See LICENSE for more information.
Project Link: https://github.com/VTUL/digital-collections-cloud-replicate