error: use explicit matches for errors when the decision is crucial for correctness/performance #1083
// In all other cases propagate the error to the user
_ => RetryDecision::DontRetry,
QueryError::BadQuery(_)
| QueryError::MetadataError(_)
| QueryError::CqlRequestSerialization(_)
| QueryError::BodyExtensionsParseError(_)
| QueryError::EmptyPlan
| QueryError::CqlResultParseError(_)
| QueryError::CqlErrorParseError(_)
| QueryError::ProtocolError(_)
| QueryError::TimeoutError
Should we really propagate `TimeoutError` instead of trying another node? Well, it's the weird error about `use_keyspace`...
Funny thing is that this error variant can't even appear there. AFAIU, `use keyspace` requests go through a different path than regular requests. They are not subject to retries, since we send `use keyspace` requests to all nodes simultaneously.
In general, there are some errors that can't possibly appear in some places (same for `speculative_execution::can_be_ignored`). This is because `QueryError` is a really broad error type, whose specific variants are constructed in multiple driver layers. IMO, the only way to fix this is to narrow the internal return error types as much as possible. I'm not sure how much work it would require, though.
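To illustrate what narrowing the internal error types could look like, here is a minimal sketch. All names in it (`UseKeyspaceSketchError`, `BroadSketchError`, `check_keyspace_name`, `use_keyspace`) are hypothetical and are not the driver's actual types or functions. The internal layer returns a narrow error listing only the failures it can really produce, and the widening to the broad user-facing type happens once, at the boundary, through `From`.

```rust
/// Hypothetical narrow error: only the failures this internal layer can produce.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum UseKeyspaceSketchError {
    BadKeyspaceName(String),
    RequestTimeout,
}

/// Hypothetical broad, user-facing error (a stand-in for a type like `QueryError`).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum BroadSketchError {
    BadQuery(String),
    TimeoutError,
}

impl From<UseKeyspaceSketchError> for BroadSketchError {
    fn from(err: UseKeyspaceSketchError) -> Self {
        // Explicit match: a new narrow variant forces a decision here.
        match err {
            UseKeyspaceSketchError::BadKeyspaceName(name) => BroadSketchError::BadQuery(name),
            UseKeyspaceSketchError::RequestTimeout => BroadSketchError::TimeoutError,
        }
    }
}

/// Internal layer: returns the narrow type, so callers inside the crate
/// only have to consider two variants.
pub fn check_keyspace_name(name: &str) -> Result<(), UseKeyspaceSketchError> {
    let valid = !name.is_empty()
        && name.chars().all(|c| c.is_ascii_alphanumeric() || c == '_');
    if valid {
        Ok(())
    } else {
        Err(UseKeyspaceSketchError::BadKeyspaceName(name.to_owned()))
    }
}

/// Public boundary: `?` widens the narrow error into the broad one via `From`.
pub fn use_keyspace(name: &str) -> Result<(), BroadSketchError> {
    check_keyspace_name(name)?;
    Ok(())
}
```

With this shape, code that decides how to react to an internal failure matches on the narrow enum, so variants that cannot occur in that place never show up in the match.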
Idea: places where explicit match should be used could use a comment about it, and perhaps this lint: https://rust-lang.github.io/rust-clippy/master/index.html#/wildcard_enum_match_arm
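A minimal sketch of that idea follows. The enum, its variants, and the categorization are hypothetical stand-ins, not the driver's real `QueryError` or retry logic. The comment marks the function as one where explicit matching is required, and the attribute asks clippy to reject any `_` arm over an enum inside it.

```rust
/// Hypothetical error type, standing in for a broad enum such as `QueryError`.
#[derive(Debug)]
pub enum SketchError {
    BadQuery(String),
    IoError(String),
    UnableToAllocStreamId,
    TimeoutError,
}

/// Hypothetical retry decision.
#[derive(Debug, PartialEq, Eq)]
pub enum SketchDecision {
    RetryNextNode,
    DontRetry,
}

// NOTE: the decision below matters for correctness, so every variant must be
// listed explicitly. With this attribute, `cargo clippy` rejects a `_` arm in
// enum matches inside the function; plain `rustc` accepts the attribute and
// does not check it.
#[deny(clippy::wildcard_enum_match_arm)]
pub fn decide_should_retry(error: &SketchError) -> SketchDecision {
    match error {
        // Illustrative rule: connection-level problems may not affect other nodes.
        SketchError::IoError(_) | SketchError::UnableToAllocStreamId => {
            SketchDecision::RetryNextNode
        }
        // In all other cases propagate the error to the user.
        SketchError::BadQuery(_) | SketchError::TimeoutError => SketchDecision::DontRetry,
    }
}
```

Even without clippy, adding a fifth variant to `SketchError` makes this `match` fail to compile until the new variant is categorized; the lint only guards against someone reintroducing a `_` arm later.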
The most important thing is to establish a rule for each of the following places, so that contributors know how to categorize possible future variants - and perhaps fix bugs in the categorization that we currently do.
The rule in the comments is imo reasonable, but a bit imprecise. Another issue: what does it mean for the query to be "slow" here? Note: "DbError" in the slow branch should also not use wildcard.
The rule in the comment of that function looks reasonable to me, and the current filtering matches this rule.
Those are complicated - I don't think there can exist a general rule. I'll look at them later.
The rule here seems to be that we can ignore errors if the presence of the error doesn't mean other speculative tries will fail too.
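One way to read this rule, as a sketch only: the error type and the per-variant decisions below are hypothetical and do not reproduce the driver's `can_be_ignored`. An error is ignorable when it says something about one node or connection, and not ignorable when it says something about the request itself. The helper models speculative tries as a sequential list, which is a simplification of the concurrent behaviour.

```rust
/// Hypothetical errors a single speculative try might return.
#[derive(Debug, PartialEq, Eq)]
pub enum SpecSketchError {
    /// The statement itself is malformed; every node would reject it.
    BadQuery(String),
    /// Serialization failed on the driver side, before any node was contacted.
    SerializationFailed(String),
    /// One connection broke; another node may be healthy.
    BrokenConnection(String),
    /// One node reported itself overloaded.
    Overloaded,
}

/// Returns true when the error does not imply that other tries will fail too.
pub fn can_be_ignored_sketch(error: &SpecSketchError) -> bool {
    match error {
        // Node-specific problems: a try against another node may succeed.
        SpecSketchError::BrokenConnection(_) | SpecSketchError::Overloaded => true,
        // Request-specific problems: every try would fail the same way.
        SpecSketchError::BadQuery(_) | SpecSketchError::SerializationFailed(_) => false,
    }
}

/// Simplified, sequential model of combining the results of several tries:
/// the first success or the first non-ignorable error wins; if every try
/// failed with an ignorable error, the last such error is returned.
pub fn first_useful_result<T>(
    results: Vec<Result<T, SpecSketchError>>,
) -> Option<Result<T, SpecSketchError>> {
    let mut last_ignored = None;
    for result in results {
        match result {
            Ok(value) => return Some(Ok(value)),
            Err(err) if can_be_ignored_sketch(&err) => last_ignored = Some(Err(err)),
            Err(err) => return Some(Err(err)),
        }
    }
    last_ignored
}
```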
I'm fairly sure that
That's right. We always read the whole frame, and only then we start to deserialize the data. The only exception is frame header deserialization (see
When I first looked at this code, my first thought was that "fast" errors are the errors that appear on driver's side, before sending the request to the server. I'm not sure why the
An example of this error that appears in cpp-rust-driver integration tests:
I believe this should not be ignored - it will fail on other nodes as well.
This is a protocol error that was recognized by the server. The
I removed this variant during protocol error refactor.
My first thought was that "fast" errors are errors on the driver side, plus errors on the server side that happen before it tries to contact other nodes (so it can reply quickly, without actually performing the query).
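Under that interpretation, the classification could be sketched as below. `FailureStage`, `reliable_latency_measure_sketch`, and `LatencySamples` are hypothetical names introduced for illustration; they are not how `LatencyAwareness::reliable_latency_measure` is actually implemented. The assumption is that a "fast" failure says nothing about how long the node takes to execute a query, so its elapsed time should not be recorded as a latency sample.

```rust
/// Hypothetical description of where a request failed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureStage {
    /// Failed in the driver, before the request was sent.
    DriverBeforeSend,
    /// Rejected by the contacted node before it reached out to other nodes.
    ServerBeforeCoordination,
    /// Failed after the contacted node started performing the query.
    ServerDuringExecution,
}

/// Returns true when the elapsed time of a failed request is still a
/// meaningful latency sample for the node.
pub fn reliable_latency_measure_sketch(stage: FailureStage) -> bool {
    match stage {
        // "Fast" errors: the reply came back without the query being performed.
        FailureStage::DriverBeforeSend | FailureStage::ServerBeforeCoordination => false,
        // "Slow" errors: the node did real work before failing.
        FailureStage::ServerDuringExecution => true,
    }
}

/// Collects latency samples, skipping those that come from "fast" failures.
#[derive(Debug, Default)]
pub struct LatencySamples {
    pub samples_ms: Vec<u64>,
}

impl LatencySamples {
    /// `failure` is `None` for a successful request.
    pub fn record(&mut self, elapsed_ms: u64, failure: Option<FailureStage>) {
        let reliable = match failure {
            None => true,
            Some(stage) => reliable_latency_measure_sketch(stage),
        };
        if reliable {
            self.samples_ms.push(elapsed_ms);
        }
    }
}
```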
I agree
e09334f to 5bd35ff
v2:
Since last time, during error refactor I introduced a silent bug to the code (scylladb#1075), I'd like to prevent that from happening in the future. This is why we replace a `_` match with explicit error variants in `reliable_latency_measure` function. We also enable the `wildcard_enum_match_arm` clippy lint here.
Since last time, during error refactor I introduced a silent bug to the code (scylladb#1075), I'd like to prevent that from happening in the future. This is why we replace a `_` match with explicit error variants in retry policy modules. We also enable `wildcard_enum_match_arm` clippy lint in this place for QueryError and DbError matches.
Since last time, during error refactor I introduced a silent bug to the code (scylladb#1075), I'd like to prevent that from happening in the future. This is why we replace a `_` match with explicit error variants when deciding if error received after `USE KEYSPACE` should be ignored. We also enable the `wildcard_enum_match_arm` clippy lint to disallow using `_` matches.
Since last time, during error refactor I introduced a silent bug to the code (scylladb#1075), I'd like to prevent that from happening in the future. This is why we replace a `_` match with explicit error variants when deciding if error received from speculative execution should be ignored. We also enabled the `wildcard_enum_match_arm` clippy lint.
Previously, `can_be_ignored` function would return `true` for some weird error variants. I adjusted the implementation, and justified the decision for each error variant in the comments.
5bd35ff to c400b5f
ref: #519
Since last time, during the error refactor, I introduced a silent bug to the code (#1075), and I'd like to prevent that from happening in the future. This is why I replace `_` matches with explicit error variants. This way, if some developer introduces a new variant to `QueryError`, the compiler will complain about it in multiple places, forcing the developer to adjust `match` expressions there.

Notice that `LatencyAwareness::reliable_latency_measure` matched errors this way even before this PR.

Discussion:

I think it's a great place to discuss what decision should be made based on the received error in the following modules/places:
- `LatencyAwareness::reliable_latency_measure`
- `USE KEYSPACE` result from a single connection/node (see `use_keyspace_result` from cluster.rs)
- `DefaultRetrySession::decide_should_retry` (same for `DowngradingConsistencyRetrySession`)

I did not change the logic at all yet, simply transformed the code to use explicit match expressions.

My first suspect is `QueryError::TimeoutError` in `can_be_ignored` in the `speculative_execution` module. I believe we should return true for this error if and only if we return true for `QueryError::RequestTimeout`.

- [ ] I added relevant tests for new features and bug fixes.
- [ ] I have provided docstrings for the public items that I want to introduce.
- [ ] I have adjusted the documentation in `./docs/source/`.
- [ ] `Fixes:` annotations to PR description.