BagIt Support - Add automatic checksum validation on upload #8677

Merged
merged 1 commit into from
May 26, 2022

Contributor

@abujeda abujeda commented May 5, 2022

What this PR does / why we need it:
It adds a new file handler to manage BagIt packages that are uploaded as a Zip file.
The first requirement is to detect that the upload is a BagIt package, extract the files as they are, and perform the checksum validation.

Which issue(s) this PR closes:

Special notes for your reviewer:
BagIt package detection: When uploading a zip file, the system will look for a zip entry called bagit.txt. Then, within the folder that contains that file, it will look for a manifest file with a supported hash algorithm, like manifest-sha256.txt. If both are found, the zip file is deemed a BagIt package.
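
The detection rule described above can be sketched in a few lines. This is a minimal Python illustration, not the handler code from this PR; `find_bag_root`, `make_zip`, and the `SUPPORTED_ALGORITHMS` tuple are hypothetical names, and the algorithm list is an illustrative subset.

```python
import io
import posixpath
import zipfile

# Illustrative subset (assumption); the real list is defined by the handler.
SUPPORTED_ALGORITHMS = ("sha256", "sha512")


def find_bag_root(zf):
    """Return the folder holding bagit.txt plus a supported manifest, else None.

    A bag at the top level of the zip yields "" (an empty string), so callers
    should compare the result against None rather than test its truthiness.
    """
    names = set(zf.namelist())
    for name in sorted(names):
        if posixpath.basename(name) != "bagit.txt":
            continue
        root = posixpath.dirname(name)
        # The manifest must sit in the same folder as bagit.txt.
        for alg in SUPPORTED_ALGORITHMS:
            if posixpath.join(root, f"manifest-{alg}.txt") in names:
                return root
    return None


def make_zip(entries):
    """Build an in-memory zip from a {entry name: text} mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in entries.items():
            zf.writestr(name, text)
    buf.seek(0)
    return zipfile.ZipFile(buf)


if __name__ == "__main__":
    bag = make_zip({
        "bag/bagit.txt": "BagIt-Version: 1.0\n",
        "bag/manifest-sha256.txt": "",
        "bag/data/a.txt": "a",
    })
    print(find_bag_root(bag))
```

A zip that has bagit.txt but no supported manifest next to it is treated as an ordinary zip in this sketch.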

Suggestions on how to test this:
Enable the feature: curl -X PUT -d 'true' http://localhost:8080/api/admin/settings/:BagItHandlerEnabled

Upload a BagIt package as a Zip file. It should extract all files and perform the checksum validation.
Upload a BagIt package with invalid checksums. The upload should not be allowed and up to 5 errors should be highlighted in the UI.
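
To make the two test cases concrete, here is a rough Python sketch of manifest validation. It assumes the usual BagIt manifest layout (one `checksum path` pair per line) and uses hypothetical names (`validate_manifest`, `errors_for_display`, `MAX_ERRORS_TO_DISPLAY`); it is not the validator added by this PR.

```python
import hashlib

# Mirrors the "up to 5 errors" shown in the UI (value taken from the text above).
MAX_ERRORS_TO_DISPLAY = 5


def validate_manifest(manifest_text, files, algorithm="sha256"):
    """Check every manifest entry against the payload and return error strings.

    files maps a path inside the bag to its bytes.
    """
    errors = []
    for line in manifest_text.splitlines():
        if not line.strip():
            continue
        parts = line.split(None, 1)  # checksum, whitespace, path
        if len(parts) != 2:
            errors.append(f"malformed manifest line: {line!r}")
            continue
        expected, path = parts[0].lower(), parts[1].strip()
        data = files.get(path)
        if data is None:
            errors.append(f"{path}: listed in manifest but not found")
            continue
        if hashlib.new(algorithm, data).hexdigest() != expected:
            errors.append(f"{path}: checksum mismatch")
    return errors


def errors_for_display(errors):
    """Truncate the error list to what the upload screen would show."""
    return errors[:MAX_ERRORS_TO_DISPLAY]


if __name__ == "__main__":
    payload = {"data/a.txt": b"hello"}
    manifest = f"{'0' * 64}  data/a.txt\n"
    print(errors_for_display(validate_manifest(manifest, payload)))
```

The sketch reports two kinds of failure, a missing file and a checksum mismatch, which is a useful distinction when preparing test bags.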

Does this PR introduce a user interface change? If mockups are available, please link/include them here:
It adds a validation error message to the upload screen. A sample screenshot is in the issue: #8608

Is there a release notes update needed for this change?:
It will be included in the changes

Additional documentation:
BagIt documentation will be added to the Dataverse guide.
This is part of the Harvard Data Commons project.

@abujeda abujeda force-pushed the 8608-bagit-upload-support-checksums branch 7 times, most recently from 9414e42 to c23f086 Compare May 10, 2022 12:07
@landreev landreev self-requested a review May 12, 2022 20:02
@landreev landreev self-assigned this May 12, 2022
@abujeda abujeda force-pushed the 8608-bagit-upload-support-checksums branch 2 times, most recently from 61b073a to 9044fd4 Compare May 18, 2022 10:08
Contributor

@landreev landreev left a comment

Moving along.
The PR changes a large number of class files, but the changes are pretty straightforward.
(I may add a couple of words under "how to test").

@landreev landreev removed their assignment May 23, 2022
@kcondon kcondon self-assigned this May 24, 2022
Contributor

kcondon commented May 24, 2022

@adaybujeda Would you refresh this branch from develop? We recently renamed a flyway script and this PR has the old name, so it cannot be deployed. Thanks!

@abujeda abujeda force-pushed the 8608-bagit-upload-support-checksums branch from 9044fd4 to a88da4e Compare May 25, 2022 08:12
Contributor Author

abujeda commented May 25, 2022

Rebase from develop completed @kcondon

Thanks!

@coveralls

Coverage Status

Coverage increased (+0.4%) to 19.674% when pulling a88da4e on adaybujeda:8608-bagit-upload-support-checksums into f578a5e on IQSS:develop.

Contributor

kcondon commented May 25, 2022

@adaybujeda Apologies, but do you have a test bag I can use? I tried one from Jim but it fails when I upload in UI.

Contributor Author

abujeda commented May 25, 2022

Hi @kcondon, we created a couple of BagIt packages to do internal testing.

bagit-no-errors.zip
bagit-1-error.zip
bagit-10-errors.zip

Contributor

kcondon commented May 25, 2022

@adaybujeda Thanks for the sample files, they worked fine. I did have a question on the validation. It seems the failure examples complain about files that are listed in the manifest but do not exist, rather than about bad checksums? I ask because when I edit the working bag's manifest-sha512.txt file (sha512 is what my Dataverse installation is using) and alter the checksums, it doesn't fail. What is it checking then? There is also a tagmanifest-sha512.txt.

Contributor

kcondon commented May 25, 2022

Issues found/questions:

  1. Altering a checksum in manifest-sha512.txt was not detected until the manifest-sha256.txt file was removed. Our installation used manifest-sha512.txt. Is there an automatic checksum selection fallback?
  2. Setting :BagValidatorMaxErrors to 2 still processes 10 errors in the log and displays 5 errors in the error message in the UI. The UI error count appears to be controlled by :CreateDataFilesMaxErrorsToDisplay instead.

Contributor Author

abujeda commented May 26, 2022

Thanks @kcondon.

1

For file validation, the backend searches the manifests provided in the zip and uses the first one that the code can process.

The supported checksums are controlled by BagChecksumType; the code loops through the supported checksum types until it finds one that it can process. It does not take into account the algorithm used by the installation, as the two might not be compatible.
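
As a rough illustration of that selection loop, the Python sketch below picks the first manifest whose algorithm appears in an ordered list. The `CHECKSUM_TYPES` tuple and `select_manifest` are hypothetical; the only ordering taken from this thread is that sha256 is tried before sha512, and the real order is whatever BagChecksumType defines.

```python
# Assumed preference order: only "sha256 before sha512" is implied by the
# thread; the actual list and order come from BagChecksumType.
CHECKSUM_TYPES = ("sha256", "sha512")


def select_manifest(entry_names):
    """Return (algorithm, manifest name) for the first supported manifest.

    The installation's configured checksum algorithm is deliberately not
    consulted, matching the behavior described above. tagmanifest-*.txt
    files are ignored because only payload manifests are matched.
    """
    names = set(entry_names)
    for alg in CHECKSUM_TYPES:
        candidate = f"manifest-{alg}.txt"
        if candidate in names:
            return alg, candidate
    return None


if __name__ == "__main__":
    entries = [
        "bagit.txt",
        "manifest-sha256.txt",
        "manifest-sha512.txt",
        "tagmanifest-sha512.txt",
    ]
    print(select_manifest(entries))
```

Under this ordering, a bag that ships both manifests is validated against manifest-sha256.txt, so edits to manifest-sha512.txt go unnoticed until the sha256 manifest is removed.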

2

The :BagValidatorMaxErrors setting is a best-effort limit on file validation. Checksums are processed using a thread pool to improve performance for large files. While waiting for completion, the code checks every 10 seconds whether :BagValidatorMaxErrors has been reached. If it has, it stops processing and returns.

For small enough files, processing completes within the first 10 seconds, so all files are processed. For the front end, we use :CreateDataFilesMaxErrorsToDisplay to control how many of these processing errors to show.
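
The best-effort behavior can be reproduced with a small Python sketch. This is an assumption-laden illustration, not the code from this PR: `validate_in_pool`, its parameters, and the use of SHA-256 are all hypothetical, and `cancel_futures` requires Python 3.9 or later.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor, wait


def validate_in_pool(jobs, max_errors, poll_seconds=10.0, workers=4):
    """Validate (path, data, expected_sha256) jobs in a thread pool.

    The error limit is only checked each time the wait wakes up, so a batch
    that finishes before the first poll reports every error it found.
    """
    errors = []
    lock = threading.Lock()

    def check(path, data, expected):
        if hashlib.sha256(data).hexdigest() != expected:
            with lock:
                errors.append(f"{path}: checksum mismatch")

    executor = ThreadPoolExecutor(max_workers=workers)
    pending = {executor.submit(check, *job) for job in jobs}
    try:
        while pending:
            # Returns after poll_seconds, or sooner if all jobs have finished.
            _, pending = wait(pending, timeout=poll_seconds)
            with lock:
                limit_reached = len(errors) >= max_errors
            if limit_reached:
                break
    finally:
        # Drop queued jobs; jobs that are already running finish on their own.
        executor.shutdown(wait=False, cancel_futures=True)
    with lock:
        return list(errors)


if __name__ == "__main__":
    bad_jobs = [(f"data/f{i}.txt", b"x", "0" * 64) for i in range(10)]
    print(len(validate_in_pool(bad_jobs, max_errors=2)))
```

With ten tiny files and a limit of 2, every job completes before the first check, which matches the observation that 10 errors still appear in the log.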

@kcondon kcondon merged commit 1df6b63 into IQSS:develop May 26, 2022
@pdurbin pdurbin added this to the 5.11 milestone Jun 2, 2022
@scolapasta scolapasta added HDC: 2 Harvard Data Commons Obj. 2 HDC Harvard Data Commons labels Aug 1, 2022
Successfully merging this pull request may close these issues.

Feature Request/Idea: BagIt Support - Add automatic checksum validation on upload
6 participants