
Fix termination deadlock when the output is retrying whole failed bulk requests #1117

Conversation

@pfcoperez commented:

Closes: #1116
Detailed explanation of the problem: #1116 (comment)

This breaks the termination deadlock that occurs when whole bulk requests keep failing in a retry loop during pipeline termination.

@yaauie (Contributor) left a comment:


If we return without raising (and crashing the plugin/worker), the subsequent normal completion of a batch that was sourced from the PQ will cause it to be acknowledged and marked eligible for immediate deletion, so a restarted pipeline will NOT include the batch. I consider this to be a data-loss regression, even though it is less bad for the memory-queue case, or for cases where some amount of data loss is preferred if it is the cost of keeping things moving.

The current behaviour is an unfortunate deadlock, true, but a force-shutdown followed by a restart currently will re-emit the unacknowledged batch for reprocessing.

Should we instead raise an exception that crashes the worker and prevents the batch from being acknowledged?
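
A minimal sketch of the trade-off being described here (a toy model, not the plugin's actual code): returning lets the batch complete and be acknowledged, while raising leaves it unacknowledged so a persistent-queue batch can be replayed after a restart.

```ruby
# Toy model of the review concern above; nothing here is the plugin's real code.
def submit_with_retries(shutting_down:, on_shutdown:)
  attempts = 0
  begin
    attempts += 1
    raise "whole bulk request failed"        # in this model, every attempt fails
  rescue => e
    if shutting_down
      # :return lets the worker finish the batch "normally", so a PQ-sourced
      # batch is acknowledged and deleted even though nothing was indexed.
      return :batch_acknowledged if on_shutdown == :return
      # :raise crashes the worker; the batch stays unacknowledged and a
      # restarted pipeline re-reads it from the persistent queue.
      raise e
    end
    retry if attempts < 3                    # bounded only to keep the model finite
    raise e
  end
end

submit_with_retries(shutting_down: true, on_shutdown: :return)  # => :batch_acknowledged
# submit_with_retries(shutting_down: true, on_shutdown: :raise) # raises RuntimeError
```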

@pfcoperez (Author) replied:

Thanks a ton for chiming in, @yaauie!

> Should we instead raise an exception that crashes the worker and prevents the batch from being acknowledged?

Do you mean that we could replace the changed retry unless shutting_down? lines with an if that would raise an exception when shutting_down? is true?
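
If so, a rough sketch of that substitution could look like the following (the method shape, error handling, and helper names are assumptions for illustration, not the actual plugin source):

```ruby
# Illustrative only: assumed shape of the retry loop, not the real plugin code.
def safe_bulk(actions)
  submit(actions)
rescue => e
  # Previously proposed in this PR:
  #   retry unless shutting_down?
  # Alternative raised in review: crash the worker instead of returning, so the
  # in-flight batch is never acknowledged and can be replayed after a restart.
  raise e if shutting_down?
  sleep(retry_delay)   # back off before the next attempt (illustrative helper)
  retry
end
```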

Excerpt from the proposed change:

```
@@ -154,12 +154,17 @@ def successful_connection?
    !!maximum_seen_major_version && alive_urls_count > 0
  end

  def shutting_down?
    @stopping.true? || (!execution_context.nil? && !execution_context.pipeline.nil? && execution_context.pipeline.shutdown_requested? && !execution_context.pipeline.worker_threads_draining?)
```
Contributor review comment on the last line of the excerpt above:

To avoid the cascading nil checks we could use the &. safe navigation operator, which would simplify this to:

Suggested change:

```diff
-    @stopping.true? || (!execution_context.nil? && !execution_context.pipeline.nil? && execution_context.pipeline.shutdown_requested? && !execution_context.pipeline.worker_threads_draining?)
+    @stopping.true? || (execution_context&.pipeline&.shutdown_requested? && !execution_context&.pipeline&.worker_threads_draining?)
```
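
For reference, a minimal illustration of how the safe navigation operator behaves when part of the chain is nil (plain Ruby semantics, independent of the plugin):

```ruby
# Each &. call returns nil as soon as its receiver is nil, so the chain never
# raises NoMethodError when execution_context (or its pipeline) is missing.
execution_context = nil
execution_context&.pipeline&.shutdown_requested?   # => nil, which is falsy in the surrounding && / || expression
```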

@pfcoperez (Author) commented:

Superseded by #1119

Successfully merging this pull request may close issue #1116: Some bulk requests are retried for ever even when we are trying to stop a pipeline.