Re-ingestion attempts are happening too quickly flooding the logs #297

Open
lagoan opened this issue Jan 30, 2023 · 3 comments

Comments

@lagoan
Contributor

lagoan commented Jan 30, 2023

The new functionality that re-adds items that failed to be ingested floods the logs. This happens because the retries occur right after the failed attempt, with only a 10-second delay.

We need a way to delay PMPY's attempts to re-ingest the entities. We talked about doubling the wait time between retries, with a maximum wait time of a week.
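
A minimal sketch of the proposed backoff (the constant and method names below are illustrative, not PMPY's actual code):

```ruby
# Proposed behaviour: double the delay after each failed attempt,
# capped at one week. All names here are illustrative.
BASE_DELAY = 10               # seconds, the current fixed retry delay
MAX_DELAY  = 7 * 24 * 60 * 60 # one week in seconds

def retry_delay(failed_attempts)
  [BASE_DELAY * (2**failed_attempts), MAX_DELAY].min
end

# retry_delay(0) #=> 10, retry_delay(3) #=> 80, retry_delay(20) #=> 604800
```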

lagoan added a commit that referenced this issue Jan 30, 2023
When an entity failed to be deposited, PMPY would only wait the minimum age before retrying to ingest it. This creates a problem: we don't leave time for the underlying issue to be fixed, and we flood the logs with information that is no longer useful.

This change doubles the wait time after each attempt to give the underlying problems space to be fixed. There are 3 configuration parameters added:

- ingestion_prefix : The prefix used for Redis keys storing the failed attempt count - Default value of `prod:pmpy_ingest_attempt:`
- ingestion_attempts : The max number of re-ingestion attempts - Default value of `15`
- first_failed_wait : The initial wait time, which is doubled on every retry - Default value of `10`

With the default values, PMPY will keep retrying for around a week and will drop the entity from ingestion after that.
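
A minimal sketch of how these settings could combine, assuming the redis-rb gem and a hypothetical `next_wait` helper (this is not the actual PMPY implementation):

```ruby
require 'redis'

# Defaults from the description above; constant and method names are illustrative.
INGESTION_PREFIX   = 'prod:pmpy_ingest_attempt:' # ingestion_prefix
INGESTION_ATTEMPTS = 15                          # ingestion_attempts
FIRST_FAILED_WAIT  = 10                          # first_failed_wait (seconds)

# Returns the number of seconds to wait before the next ingestion attempt,
# or nil when the entity has exhausted its attempts and should be dropped.
def next_wait(redis, entity_id)
  attempts = redis.incr("#{INGESTION_PREFIX}#{entity_id}")
  return nil if attempts > INGESTION_ATTEMPTS
  FIRST_FAILED_WAIT * (2**(attempts - 1)) # 10s, 20s, 40s, ...
end

# next_wait(Redis.new, 'noid:abc123') #=> 10 on the first failure
```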

This is related to issue [#297](#297).
@jefferya

I'm not sure where to add this comment. I think the approach is good. In case it helps, I'll share what was done with the CWRC preservation workflow. Adding delays helped with CWRC preservation; however, failures still happened. A choice was made to assume failures would happen, and development went toward both an audit tool to detect failures and report item preservation status, plus a mechanism to preserve individual failed items.

@lagoan
Contributor Author

lagoan commented Feb 13, 2023

Thanks for sharing this @jefferya. Yes, we need to assume failures will continue to happen and we do need some sort of audit tool. Would you be able to provide design documentation for the approach used in CWRC?

@jefferya

jefferya commented Mar 1, 2023

> Would you be able to provide design documentation for the approach used in CWRC?

The design of the current approach is at the following link: https://github.com/ualbertalib/cwrc_preservation#reporting--auditing-cwrc_audit_reportrb. The approach is simplistic. The audit report lists all IDs in CWRC and Swift with an indication of whether preservation was successful or absent (either due to a failure or due to the CWRC object being updated after the last preservation run), plus a quick check of whether the Swift object size makes sense.

Here is a summary:

  • a custom metadata field is added to the Swift object at ingest time (not into the package but added to the Swift object metadata) to indicate the CWRC resource version
  • Audit process:
    • API call to get a list of CWRC resources and their associated version
    • API call to get a list of Swift objects with the resource version from the Swift object metadata
    • compare the two lists, aligning by ID and version (not very efficient because it was done quickly; see the sketch after this list)
    • output a CSV report indicating
      • if matching ID and version then preservation was successful and up-to-date
      • if matching ID but outdated Swift version then either
        • a preservation error
        • or a newly added/updated CWRC object since last preservation run
      • if only in CWRC then either
        • a preservation error
        • or a newly added CWRC object since the last preservation run
      • if only in Swift then this is CWRC deleted object (policy is to retain deleted within preservation, at present)
      • quick check of size (i.e., if the size is below a minimum then the preservation package is likely an HTTP error as opposed to real content)
      • the CSV: "#{cwrc_pid},#{cwrc_version},#{swift_id},#{swift_cwrc_version},#{swift_bytes},#{status}"
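
A rough sketch of the compare-and-report step (not the cwrc_preservation code; it assumes the two ID/version lists have already been fetched into hashes, and MIN_BYTES is an assumed threshold):

```ruby
# Assumed inputs:
#   cwrc  = { pid => version }                      from the CWRC API
#   swift = { pid => { version: ..., bytes: ... } } from Swift object metadata
MIN_BYTES = 1024 # assumed minimum; below it the package is likely an HTTP error

def audit_status(cwrc_version, swift_obj)
  return 'only_in_swift (deleted in CWRC)'    if cwrc_version.nil?
  return 'only_in_cwrc (error or new object)' if swift_obj.nil?
  return 'suspect_size'                       if swift_obj[:bytes] < MIN_BYTES
  swift_obj[:version] == cwrc_version ? 'ok' : 'outdated (error or updated object)'
end

def audit_report(cwrc, swift)
  (cwrc.keys | swift.keys).sort.each do |pid|
    obj    = swift[pid]
    status = audit_status(cwrc[pid], obj)
    puts "#{pid},#{cwrc[pid]},#{obj ? pid : nil},#{obj && obj[:version]},#{obj && obj[:bytes]},#{status}"
  end
end
```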

There are assumptions and trade-offs in this approach that might be solvable in the new CWRC version (e.g., a DB row created with checksums of the files within the archival package), but I'm not sure what an audit report might look like in the context of OLRC.

Note: the details of this approach are

  • likely not compatible with the upcoming version of CWRC
  • possibly not compatible with OLRC
