Re-ingestion attempts are happening too quickly, flooding the logs #297
Comments
When an entity failed to be deposited, PMPY would wait only the minimum age before retrying the ingestion. This leaves no time for the underlying problem to be fixed and floods the logs with information that is no longer useful. This change doubles the wait time after each failed attempt to give space for problems to be fixed.

Three configuration parameters are added:

- `ingestion_prefix`: The prefix used for Redis keys storing the failed attempt count
  - Default value of `prod:pmpy_ingest_attempt:`
- `ingestion_attempts`: The maximum number of re-ingestion attempts
  - Default value of `15`
- `first_failed_wait`: The initial wait time, which is doubled on every retry
  - Default value of `10`

With the default values, PMPY keeps retrying for around a week and drops the entity from ingestion after that.

This is related to issue [#297](#297).
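The doubling rule described above can be sketched as a small function. This is only an illustration, not PMPY's actual code: the function name `retry_delay`, the 1-based counting of failures, and the assumption that `first_failed_wait` is expressed in seconds are all hypothetical.

```python
from typing import Optional

# Defaults taken from the description above.
FIRST_FAILED_WAIT = 10    # unit assumed to be seconds (not stated in the description)
INGESTION_ATTEMPTS = 15   # maximum number of re-ingestion attempts


def retry_delay(failed_attempts: int,
                first_failed_wait: int = FIRST_FAILED_WAIT,
                ingestion_attempts: int = INGESTION_ATTEMPTS) -> Optional[int]:
    """Return the wait before the next retry, or None once the entity is dropped.

    failed_attempts is the number of failures recorded so far
    (1 after the first failure). This counting scheme is an assumption.
    """
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts > ingestion_attempts:
        return None  # retry budget exhausted: stop re-queuing the entity
    # The wait doubles with every additional failure.
    return first_failed_wait * 2 ** (failed_attempts - 1)
```

Under these assumptions the first failure waits `10`, the second `20`, and so on, and a failure count above `ingestion_attempts` returns `None` to signal that the entity should no longer be retried.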
I'm not sure where to add this comment. I think the approach is good. In case it helps, I'll share what was done with the CWRC preservation workflow. Adding delays helped with the CWRC preservation; however, failures still happened. A choice was made to assume failures would happen, and development went toward both an audit tool to detect failures and report item preservation status, and a mechanism to preserve individual failed items.
Thanks for sharing this @jefferya. Yes, we need to assume failures will continue to happen and we do need some sort of audit tool. Would you be able to provide design documentation for the approach used in CWRC?
The design of the current approach is at the following link: https://github.com/ualbertalib/cwrc_preservation#reporting--auditing-cwrc_audit_reportrb. The approach is simplistic. The audit report lists all IDs in CWRC and Swift with an indication of whether preservation was successful or absent (either due to a failure or due to the CWRC object being updated after the last preservation run), plus a quick check of whether the Swift object size makes sense. Here is a summary:
There are assumptions and trade-offs in this approach that might be solvable in the new CWRC version (e.g., a db row created with checksums of the files within the archival package), but I'm not sure what an audit report might look like in the context of OLRC. Note: the details of the approach are
The new functionality that re-adds items that failed to be ingested floods the logs. This happens because each retry runs right after the failed attempt, with only a 10-second delay.
We need a way to delay PMPY's attempts to re-ingest the entities. We talked about doubling the time between retries, with a maximum wait time of a week.
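As a rough sketch of the idea discussed here (doubling the wait between retries, capped at a week), the delay could be computed as follows. The function name `capped_delay`, the 0-based retry counter, and the 10-second starting wait are assumptions made for illustration only.

```python
ONE_WEEK_SECONDS = 7 * 24 * 60 * 60  # 604800 seconds


def capped_delay(retry_number: int,
                 first_wait: int = 10,
                 max_wait: int = ONE_WEEK_SECONDS) -> int:
    """Wait in seconds before retry `retry_number` (0 for the first retry).

    The wait doubles on every retry but never exceeds max_wait.
    """
    if retry_number < 0:
        raise ValueError("retry_number must not be negative")
    return min(first_wait * 2 ** retry_number, max_wait)
```

With these values the uncapped wait would first exceed one week at `retry_number = 16`, so from that point on every retry waits exactly one week.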