Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-12867: Fix ConsumeBenchWorker exit behavior for maxMessages config #10797

Conversation

kowshik
Copy link
Contributor

@kowshik kowshik commented May 31, 2021

Bug:
The trogdor ConsumeBenchWorker has a bug. It allows several consumption tasks to be run in parallel, the number is configurable using the threadsPerWorker config. If one of the consumption tasks completes executing successfully due to maxMessages being consumed, then, the consumption task prematurely notifies the doneFuture causing the entire ConsumeBenchWorker to halt. This becomes a problem when more than 1 consumption task is running in parallel, because the successful completion of 1 of the tasks shuts down the entire worker while the other tasks are still running. When the worker is shut down, it kills all the active consumption tasks, though they have not consumed maxMessages yet. This is not the desired behavior.

How to reproduce?:
The bug is easily reproducible by running a ConsumeBenchSpec task configured with maxMessages value and threadsPerWorker > 1. When a trogdor workload is started with such a spec, and when one of the threads (i.e. consumption task) has consumed maxMessages, you can notice that it prematurely shuts down the worker although the other threads have not yet consumed at least maxMessages.

Fix:
The fix is to defer the notification of the doneFuture to the CloseStatusUpdater thread. This thread is already responsible for tracking the status of the tasks and updating their status when all of the tasks complete, so this seems like the right place to inject the doneFuture notification.

@kowshik
Copy link
Contributor Author

kowshik commented May 31, 2021

cc @junrao @apovzner @rajinisivaram for review.

@apovzner It appears this behavior has been around since ConsumeBenchWorker was first implemented: #4775. Please let me know your thoughts.

Copy link
Contributor

@rajinisivaram rajinisivaram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kowshik Thanks for the PR, LGTM. Test failures not related, merging to trunk.

@rajinisivaram rajinisivaram merged commit 13ba9c9 into apache:trunk Jun 2, 2021
@kowshik kowshik deleted the KAFKA-12867_fix_trogdor_ConsumeBenchWorker branch June 3, 2021 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants