Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TDL/7493 Improve BagGenerator failure handling #8609

Conversation

qqmyers
Copy link
Member

@qqmyers qqmyers commented Apr 13, 2022

What this PR does / why we need it: The prior code did not correctly close connections when a non-200 response was received. Due to this, repeated failures could temporarily exhaust the pool of threads. While this was first seen with 'bad' datasets, e.g. ones where the physical files have been deleted, it would happen if there were temporary network outages as well. The PR correctly closes the stream in those cases.

It also adds a couple minor fixes/enhancements:

  • Calls to the Dataverse API when using an AWS load balancer can generate 'Invalid Cookie Header' warnings. While it is possible that most of the ones seen at TDL come from the AWS library in the S3 Store, the recommended fix is to accept Standard cookies (which the aws library apparently doesn't do), so that is added here for the Apache Client calls.
  • The Bag spec requires a manifest file which we weren't generating for datasets with no files. To be slightly more compliant with that, this PR creates an empty manifest and updates the log message to make it clearer that this is not necessarily an error (i.e. this happens normally if a dataset has no files.).
  • If a 404 error is received, the code will no longer attempt to retry. If a >=500 error, such as gateway timeouts are received, the code will still attempt up to 5 retries.

Which issue(s) this PR closes:

Closes #8603

Special notes for your reviewer:

Suggestions on how to test this:

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: planning an aggregate Bag/archiving change notes rather than per PR.

Additional documentation:

@qqmyers qqmyers added the TDL of interest to the Texas Digital Library label Apr 13, 2022
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.004%) to 18.998% when pulling a9f3aa5 on TexasDigitalLibrary:TDL/7493-improve_BagGnerator_failure_handling into 4e991dd on IQSS:develop.

@qqmyers qqmyers added HDC Harvard Data Commons HDC: 3a Harvard Data Commons Obj. 3A labels May 25, 2022
@kcondon kcondon assigned kcondon and unassigned landreev May 25, 2022
@kcondon kcondon merged commit 4947992 into IQSS:develop May 27, 2022
@pdurbin pdurbin added this to the 5.11 milestone Jun 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
HDC Harvard Data Commons HDC: 3a Harvard Data Commons Obj. 3A TDL of interest to the Texas Digital Library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BagGenerator doesn't handle missing files/404 responses or no-file datasets well.
6 participants