Re-ingestion attempts are happening too quickly flooding the logs #297

lagoan · 2023-01-30T17:34:42Z

The new functionality that re-adds items that failed to be ingested floods the logs. This happens because the retries happen right after the failed attempt with only a 10-second delay.

We need to add a way to delay PMPY's attempts to re-ingest the entities. We talked about doubling the time between retries, with a maximum wait time of a week.

When an entity failed to be deposited, PMPY would wait only the minimum age before retrying the ingestion. This leaves no time for the underlying problem to be fixed, and it floods the logs with information that is no longer useful. This change doubles the wait time after each attempt to give space for the problems to be fixed.

Three configuration parameters are added:

- ingestion_prefix: the prefix used for the Redis keys storing the failed attempt count. Default value of `prod:pmpy_ingest_attempt:`
- ingestion_attempts: the maximum number of re-ingestion attempts. Default value of `15`
- first_failed_wait: the initial wait time, which is doubled on every retry. Default value of `10`

With the default values, PMPY will keep retrying for around a week and will drop the entity from ingestion after that. This is related to issue [#297](#297).
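For illustration, the doubling schedule described above could be sketched roughly as follows. This is a minimal sketch and not the code from the PR: the method names (`attempt_key`, `retry_wait`, `next_wait_or_drop`) are hypothetical, `first_failed_wait` is assumed to be in seconds, a plain Hash stands in for Redis, and the exact attempt at which the entity is dropped is an assumption.

```ruby
# Hypothetical helpers; only the three parameter names and their defaults
# come from the description above.

# Key under which the failed-attempt count for an entity would be stored.
def attempt_key(entity_id, ingestion_prefix: 'prod:pmpy_ingest_attempt:')
  "#{ingestion_prefix}#{entity_id}"
end

# Wait before the next retry, given how many attempts have failed so far.
# The first failure waits first_failed_wait; each later failure doubles it.
def retry_wait(failed_attempts, first_failed_wait: 10)
  raise ArgumentError, 'failed_attempts must be >= 1' if failed_attempts < 1

  first_failed_wait * (2**(failed_attempts - 1))
end

# Returns the wait (assumed seconds), or nil when the entity should be dropped.
def next_wait_or_drop(failed_attempts, ingestion_attempts: 15, first_failed_wait: 10)
  return nil if failed_attempts >= ingestion_attempts

  retry_wait(failed_attempts, first_failed_wait: first_failed_wait)
end

# A Hash stands in for Redis here; a real setup would increment the key in Redis.
attempts = Hash.new(0)
key = attempt_key('entity-123')
3.times { attempts[key] += 1 }
puts "#{key}: #{attempts[key]} failures, next wait #{next_wait_or_drop(attempts[key])}s"
```

Under this indexing the waits are 10, 20, 40, ... seconds. How the total lines up with "around a week" depends on how the real implementation counts attempts and whether the minimum age is added on top.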

jefferya · 2023-01-31T17:44:57Z

I'm not sure where to add this comment. I think the approach is good. In case it helps, I'll share what was done with the CWRC preservation workflow. Adding delays helped with the CWRC preservation; however, failures still happened. A choice was made to assume failures would happen, and development went toward both an audit tool to detect failures and report item preservation status, plus a mechanism to preserve individual failed items.

lagoan · 2023-02-13T17:39:38Z

Thanks for sharing this @jefferya. Yes, we need to assume failures will continue to happen and we do need some sort of audit tool. Would you be able to provide design documentation for the approach used in CWRC?

jefferya · 2023-03-01T00:01:18Z

> Would you be able to provide design documentation for the approach used in CWRC?

The design of the current approach is at the following link: https://github.com/ualbertalib/cwrc_preservation#reporting--auditing-cwrc_audit_reportrb. The approach is simplistic. The audit report lists all IDs in CWRC and Swift with an indication whether preservation was successful or absent (either due to a failure or due to the CWRC object being updated after the last preservation run) plus a quick check if the Swift object size makes sense.

Here is a summary:

A custom metadata field is added to the Swift object at ingest time (not into the package but added to the Swift object metadata) to indicate the CWRC resource version.
Audit process:
- API call to get a list of CWRC resources and their associated version
- API call to get a list of Swift objects with the resource version from the Swift object metadata
- compare the two lists, aligning by ID and version (not very efficient, because it was implemented quickly)
- output a CSV report indicating
  - if matching ID and version then preservation was successful and up-to-date
  - if matching ID but outdated Swift version then either
    - a preservation error
    - or a newly added/updated CWRC object since last preservation run
  - if only in CWRC then either
    - a preservation error
    - or a newly added CWRC object since the last preservation run
  - if only in Swift then this is a deleted CWRC object (the policy, at present, is to retain deleted objects within preservation)
  - quick check of size (i.e., if the size is below a minimum then the preservation package is likely an HTTP error as opposed to real content)
  - the CSV: "#{cwrc_pid},#{cwrc_version},#{swift_id},#{swift_cwrc_version},#{swift_bytes},#{status}"
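The comparison step above could be sketched roughly like this. It is a hypothetical illustration, not the code in `cwrc_audit_report.rb`: the method names, the status labels, the input shapes, and the 1024-byte minimum are all assumptions, and versions are compared only for equality.

```ruby
# Hypothetical inputs:
#   cwrc:  { id => version }
#   swift: { id => { version: ..., bytes: ... } }

# Classify one ID following the cases listed above.
def audit_status(cwrc_version, swift_obj, min_bytes: 1024)
  return 'only_in_swift' if cwrc_version.nil?             # deleted CWRC object, retained
  return 'only_in_cwrc'  if swift_obj.nil?                # error, or added since last run
  return 'suspect_size'  if swift_obj[:bytes] < min_bytes # possibly an HTTP error body
  return 'outdated'      if swift_obj[:version] != cwrc_version

  'ok'
end

# Build CSV rows in the column order given above; missing values stay empty.
def audit_rows(cwrc, swift, min_bytes: 1024)
  (cwrc.keys | swift.keys).sort.map do |id|
    swift_obj = swift[id]
    status = audit_status(cwrc[id], swift_obj, min_bytes: min_bytes)
    [
      cwrc.key?(id) ? id : nil,  # cwrc_pid
      cwrc[id],                  # cwrc_version
      swift_obj ? id : nil,      # swift_id
      swift_obj&.dig(:version),  # swift_cwrc_version
      swift_obj&.dig(:bytes),    # swift_bytes
      status
    ].join(',')
  end
end

cwrc = { 'cwrc:1' => '2023-01-01', 'cwrc:2' => '2023-02-01', 'cwrc:3' => '2023-02-10' }
swift = {
  'cwrc:1' => { version: '2023-01-01', bytes: 52_000 },
  'cwrc:2' => { version: '2023-01-15', bytes: 48_000 },
  'cwrc:4' => { version: '2022-12-01', bytes: 300 }
}
puts audit_rows(cwrc, swift)
```

Note that a row flagged `outdated` or `only_in_cwrc` cannot, by itself, tell a preservation error apart from an object that changed since the last run, which is the ambiguity described in the list.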

There are assumptions and trade-offs in this approach that might be solvable in the new CWRC version (e.g., a db row created with checksums of the files within the archival package), but I'm not sure what an audit report might look like in the context of OLRC.

Note: the details of this approach are

- likely not compatible with the upcoming version of CWRC
- possibly not compatible with OLRC

This was referenced Jan 30, 2023:

- Add delay on entity re-ingestion on failed attempt #298 (Merged)
- Add ingestion attempt communication with PMPY ualbertalib/jupiter#3049 (Merged)

lagoan mentioned this issue Mar 16, 2023:

- Solve communication problems with jupiter #308 (Merged)
