-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add proper exception handler to around eviction_stm's call to replicate #12959
Add proper exception handler to around eviction_stm's call to replicate #12959
Conversation
- Replicate may throw in the case the gate belonging to the consensus class has closed. The eviction_stm never had proper exception handling to cover this case, this change adds a catch all exception clause around the call to _c->replicate() - Fixes: redpanda-data#12950
- This log is printed when replicate fails, therefore it should be logged at warn (or error) and not info
_log.warn, | ||
"Replicating prefix_truncate failed with exception: {}", | ||
std::current_exception()); | ||
result = errc::replication_error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a quick question, is this going to be propagated to Kafka handler ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it eventually will be translated into a kafka::error_code
here https://github.com/redpanda-data/redpanda/blob/dev/src/v/kafka/server/replicated_partition.cc#L399 then that is propagated back to the client in the delete_records handler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the linked snippet, wouldn't this return a USE? Curious why that instead of explicitly handling the is_shutdown_exception
case with raft::errc::shutting_down
which results in a (I think) retriable timeout error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consensus::replicate
will return errc::shutting_down
as an error code not as an exception so that path is already covered. What was happening was gate_closed_exception
was being thrown and not caught, you can see that from the bad log lines error in the PR cover.
Now that I think about it , seems like there's maybe a missing gate closed exception handler clause somewhere within the replicate logic.
/backport v23.2.x |
The issue is observed as a bad log line error within a delete-records test. A higher level exception handler in the delete records request handling logic caught an exception and logged it at error level making the test fail.
Looking into the cause of the issue, it was observed that within
log_eviction_stm::truncate
there's a call toconsensus::replicate
that does not catch exceptions, which may occur. Note that this is not a severe issue because the code eventually catches the exception and returns the proper error code to the client in its current state.Fixes #12950
Backports Required
Release Notes