Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add proper exception handler to around eviction_stm's call to replicate #12959

Merged
merged 2 commits into from
Aug 24, 2023

Conversation

graphcareful
Copy link
Contributor

@graphcareful graphcareful commented Aug 23, 2023

The issue is observed as a bad log line error within a delete-records test. A higher level exception handler in the delete records request handling logic caught an exception and logged it at error level making the test fail.

Looking into the cause of the issue, it was observed that within log_eviction_stm::truncate there's a call to consensus::replicate that does not catch exceptions, which may occur. Note that this is not a severe issue because the code eventually catches the exception and returns the proper error code to the client in its current state.

Fixes #12950

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

  • none

Rob Blafford added 2 commits August 23, 2023 10:19
- Replicate may throw in the case the gate belonging to the consensus
class has closed. The eviction_stm never had proper exception handling
to cover this case, this change adds a catch all exception clause around
the call to _c->replicate()

- Fixes: redpanda-data#12950
- This log is printed when replicate fails, therefore it should be
logged at warn (or error) and not info
_log.warn,
"Replicating prefix_truncate failed with exception: {}",
std::current_exception());
result = errc::replication_error;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a quick question, is this going to be propagated to Kafka handler ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it eventually will be translated into a kafka::error_code here https://github.com/redpanda-data/redpanda/blob/dev/src/v/kafka/server/replicated_partition.cc#L399 then that is propagated back to the client in the delete_records handler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the linked snippet, wouldn't this return a USE? Curious why that instead of explicitly handling the is_shutdown_exception case with raft::errc::shutting_down which results in a (I think) retriable timeout error?

Copy link
Contributor Author

@graphcareful graphcareful Aug 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consensus::replicate will return errc::shutting_down as an error code not as an exception so that path is already covered. What was happening was gate_closed_exception was being thrown and not caught, you can see that from the bad log lines error in the PR cover.

Now that I think about it , seems like there's maybe a missing gate closed exception handler clause somewhere within the replicate logic.

@graphcareful graphcareful merged commit a21e8e7 into redpanda-data:dev Aug 24, 2023
@BenPope
Copy link
Member

BenPope commented Nov 10, 2023

@BenPope
Copy link
Member

BenPope commented Nov 10, 2023

/backport v23.2.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI Failure (gate closed) in DeleteRecordsTest.test_delete_records_concurrent_truncations
5 participants