Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

error: use explicit matches for errors when the decision is crucial for correctness/performance #1083

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

muzarski
Copy link
Contributor

@muzarski muzarski commented Oct 4, 2024

ref: #519

Since last time, during error refactor I introduced a silent bug to the code (#1075), I'd like to prevent that from happening in the future. This is why I replace _ matches with explicit error variants.
This way, if some developer introduces a new variant to QueryError, the compiler will complain about it in multiple places, forcing the developer to adjust match expressions there.

Notice that LatencyAwareness::reliable_latency_measure matched errors this way even before this PR.

Discussion:

I think it's a great place to discuss what decision should be made based on the received error in following modules/places:

  • LatencyAwareness::reliable_latency_measure
  • receiving USE KEYSPACE result from a single connection/node (see use_keyspace_result from cluster.rs)
  • DefaultRetrySession::decide_should_retry (same for DowngradingConsistencyRetrySession)
  • deciding whether the error received after speculative execution should be ignored

I did not change the logic at all yet, simply transformed the code to use explicit match expressions.
My first suspect is QueryError::TimeoutError in can_be_ignored in speculative_execution module. I believe we should
return true for this error only iff we return true for QueryError::RequestTimeout.

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • [ ] I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • [ ] I have provided docstrings for the public items that I want to introduce.
  • [ ] I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@muzarski muzarski self-assigned this Oct 4, 2024
Copy link

github-actions bot commented Oct 4, 2024

cargo semver-checks found no API-breaking changes in this PR! 🎉🥳
Checked commit: c400b5f

@muzarski muzarski changed the title Internal error matches no wildcard error: use explicit matches for errors when the decision is crucial for correctness/performance Oct 4, 2024
Comment on lines 173 to +214
// In all other cases propagate the error to the user
_ => RetryDecision::DontRetry,
QueryError::BadQuery(_)
| QueryError::MetadataError(_)
| QueryError::CqlRequestSerialization(_)
| QueryError::BodyExtensionsParseError(_)
| QueryError::EmptyPlan
| QueryError::CqlResultParseError(_)
| QueryError::CqlErrorParseError(_)
| QueryError::ProtocolError(_)
| QueryError::TimeoutError
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really propagate TimeoutError instead of trying another node? Well, it's the weird error about use_keyspace...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny thing is that this error variant can't even appear there. AFAIU, use keyspace requests go through different path than regular requests. They are not a subject to retries, since we send use keyspace requests to all nodes simultaneously.

Copy link
Contributor Author

@muzarski muzarski Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, there are some errors that can't possibly appear in some places (same for speculative_execution::can_be_ignored). This is because QueryError is a really broad error type, whose specific variants are constructed in multiple driver layers. IMO, the only way to fix this, is to narrow the internal return error types as much as possible. I'm not sure how much work it would require, though.

scylla/src/transport/retry_policy.rs Outdated Show resolved Hide resolved
scylla/src/transport/errors.rs Outdated Show resolved Hide resolved
@Lorak-mmk
Copy link
Collaborator

ref: #519

Since last time, during error refactor I introduced a silent bug to the code (#1075), I'd like to prevent that from happening in the future. This is why I replace _ matches with explicit error variants. This way, if some developer introduces a new variant to QueryError, the compiler will complain about it in multiple places, forcing the developer to adjust match expressions there.

Idea: places where explicit match should be used could use a comment about it, and perhaps this lint: https://rust-lang.github.io/rust-clippy/master/index.html#/wildcard_enum_match_arm
That way a future contributor won't simplify the code by using wildcard.

Notice that LatencyAwareness::reliable_latency_measure matched errors this way even before this PR.

Discussion:

I think it's a great place to discuss what decision should be made based on the received error in following modules/places:

The most important thing is to estabilish a rule for each of the following places, so that contributors know how to categorize possible future variants - and perhaps fix bugs in categorization that we do.
General question: are we sure that if connection is broken then we always receive ConnectionBroken variant? It's not possible to receive some deserialization error because of stream ending too early, right?
I will assume that we always receive ConnectionBroken in that case - and also in the case that we can't write to socket because it was closed.

* `LatencyAwareness::reliable_latency_measure`

The rule in the comments is imo reasonable, but a bit imprecise.
Some errors are not always fast or slow - and the comments don't specify where they should go.
I suspect that "slow" errors should be only those that are always "slow" - otherwise they can't be a reliable latency measure.

Another issue: what does it mean for the query to be "slow" here?
I don't get why "DbError::IsBootstrapping", "DbError::Unprepared" or "DbError::Unavailable" are fast, but "DbError::SyntaxError" or "DbError::Invalid" (and some others) are "slow".

Note: "DbError" in the slow branch should also not use wildcard.

* receiving `USE KEYSPACE` result from a single connection/node (see `use_keyspace_result` from cluster.rs)

The rule in the comment of that function looks reasonable to me, and the current filtering matches this rule.
The invariant is that after the function returns all connections are either broken or will have new keyspace set.
If this invariant was kept in the upper layers then the behavior would be good: after Session.use_keyspace returns all new queries will be issued on connections with the new keyspace.
However it looks like this invariant is not kept in the upper layers - PoolRefiller.use_keyspace doesn't close connections that returned an error.

* `DefaultRetrySession::decide_should_retry` (same for `DowngradingConsistencyRetrySession`)

Those are complicated - I don't think there can exist a general rule. I'll look at them later.

* deciding whether the error received after speculative execution should be ignored

The rule here seems to be the we can ignore errors if the presence of the error doesn't mean other speculative tries will fail too.
As I mention below I think it is not possible to receive QueryError::TimeoutError or QueryError::RequestTimeout here - so we should not ignore them so they are propagated to the user if the assumption turns out to be wrong.

QueryError::DbError:
- SyntaxError - can't ignore
- Invalid - I'm not sure. When can it happen?
- AlreadyExists - can't ignore
- FunctionFailure - not sure if UDF failure is transient. @wmitros ?
- AuthenticationError - Can this happen? I think not. We can't ignore it anyway
- Unauthorized - Can't ignore
- ConfigError - not sure. When can this happen?
- Unavailable - may be a split brain situation etc. I think we can ignore.
- Overloaded - We can ignore
- IsBootstrapping - can ignore
- TruncateError - not sure. When can this happen?
- ReadTimeout - can ignore
- WriteTimeout - can ignore
- ReadFailure - I'm not sure when can this happen. Probably can ignore
- WriteFailure - as above
- Unprepared - hmm... this can happen if we tried to reprepare, but it is still unprepared. I think we can ignore - it may be preparable on another node
- ServerError - I think we can ignore, but I'm not sure
- ProtocolError - why is there protocol error here too? The one in QueryError is not enough? Anyway, we probably can't ignore
- RateLimitReached - Can ignore
- Other - this shouldn't happen, right? I think we should not ignore
QueryError::BadQuery - we can't ignore because query is bad always or never
QueryError::CqlRequestSerialization - same as above
QueryError::BodyExtensionsParseError - if we get parse / deserialization error then something went terribly wrong, no need to retry on another node
QueryError::CqlResultParseError - same as above
QueryError::CqlErrorParseError - same as above
QueryError::ConnectionPoolError - we can ignore, we just don't have connections to given node
QueryError::ProtocolError - we can't ignore, same as with serialization / deserialization error
QueryError::InvalidMessage - I don't know when it happens. Assuming it's similar to ProtocolError we can't ignore it.
QueryError::TimeoutError - shouldn't happen, so we can't ignore
QueryError::BrokenConnection - we can ignore
QueryError::UnableToAllocStreamId - we can ignore
QueryError::RequestTimeout - shouldn't happen, so we can't ignore

I did not change the logic at all yet, simply transformed the code to use explicit match expressions. My first suspect is QueryError::TimeoutError in can_be_ignored in speculative_execution module. I believe we should return true for this error only iff we return true for QueryError::RequestTimeout.

I'm fairly sure that RequestTimeout is applied to the whole execution, not individual speculative fibers, so it is not possible to receive it there.

@muzarski
Copy link
Contributor Author

muzarski commented Oct 8, 2024

It's not possible to receive some deserialization error because of stream ending too early, right?

That's right. We always read the whole frame, and only then we start to deserialize the data. The only exception is frame header deserialization (see FrameHeaderParseError), but if such error appears, we close the connection anyway (i.e. it's included in BrokenConnectionError).

The rule in the comments is imo reasonable, but a bit imprecise. Some errors are not always fast or slow - and the comments don't specify where they should go. I suspect that "slow" errors should be only those that are always "slow" - otherwise they can't be a reliable latency measure.

Another issue: what does it mean for the query to be "slow" here? I don't get why "DbError::IsBootstrapping", "DbError::Unprepared" or "DbError::Unavailable" are fast, but "DbError::SyntaxError" or "DbError::Invalid" (and some others) are "slow".

When I first looked at this code, my first was thought that "fast" errors are the errors that appear on driver's side, before sending the request to the server. I'm not sure why the DbErrors you mentioned are "fast" as well.

  • Invalid - I'm not sure. When can it happen?

An example of this error that appears in cpp-rust-driver integration tests:

Invalid query: Database returned an error: 
The query is syntactically correct but invalid, 
Error message: No keyspace has been specified. 
USE a keyspace, or explicitly specify 
keyspace.tablename

I believe this should not be ignored - it will fail on other nodes as well.

  • ProtocolError - why is there protocol error here too? The one in QueryError is not enough? Anyway, we probably can't ignore

This is a protocol error that was recognized by the server. The QueryError::ProtocolError are the protocol errors recognized by the driver. Anyway, I agree that we should not ignore it.

  • QueryError::InvalidMessage - I don't know when it happens. Assuming it's similar to ProtocolError we can't ignore it.

I removed this variant during protocol error refactor.

@Lorak-mmk
Copy link
Collaborator

When I first looked at this code, my first was thought that "fast" errors are the errors that appear on driver's side, before sending the request to the server. I'm not sure why the DbErrors you mentioned are "fast" as well.

My first thought was that "fast" errors are errors on the driver side and error on server side that happen before it tries to contact other nodes (so can reply quickly, without actually performing the query).
I'm not sure this is correct.

  • Invalid - I'm not sure. When can it happen?

An example of this error that appears in cpp-rust-driver integration tests:

Invalid query: Database returned an error: 
The query is syntactically correct but invalid, 
Error message: No keyspace has been specified. 
USE a keyspace, or explicitly specify 
keyspace.tablename

I believe this should not be ignored - it will fail on other nodes as well.

I agree

@muzarski muzarski force-pushed the internal-error-matches-no-wildcard branch from e09334f to 5bd35ff Compare October 8, 2024 14:30
@muzarski
Copy link
Contributor Author

muzarski commented Oct 8, 2024

v2:

  • rebased on main (I apologize for that - should be done in a separate push, but I forgot that I already pulled some changes locally)
  • enabled wildcard_enum_match_arm lint in some places (including latency awareness)
  • using explicit matches for DbErrors (including latency awareness)
  • fixed can_be_ignored function in speculative_execution module

@muzarski muzarski requested a review from wprzytula October 8, 2024 15:18
Since last time, during error refactor I introduced a silent
bug to the code (scylladb#1075),
I'd like to prevent that from happening in the future. This is why
we replace a `_` match with explicit error variants in `reliable_latency_measure` function.

We also enable the `wildcard_enum_match_arm` clippy lint here.
Since last time, during error refactor I introduced a silent
bug to the code (scylladb#1075),
I'd like to prevent that from happening in the future. This is why
we replace a `_` match with explicit error variants in retry policy
modules.

We also enable `wildcard_enum_match_arm` clippy lint in this place
for QueryError and DbError matches.
Since last time, during error refactor I introduced a silent
bug to the code (scylladb#1075),
I'd like to prevent that from happening in the future. This is why
we replace a `_` match with explicit error variants when deciding
if error received after `USE KEYSPACE` should be ignored.

We also enable the `wildcard_enum_match_arm` clippy lint to disallow
using `_` matches.
Since last time, during error refactor I introduced a silent
bug to the code (scylladb#1075),
I'd like to prevent that from happening in the future. This is why
we replace a `_` match with explicit error variants when deciding
if error received from speculative execution should be ignored.

We also enabled the `wildcard_enum_match_arm` clippy lint.
Previously, `can_be_ignored` function would return `true` for some
weird error variants.

I adjusted the implementation, and justified the decision for each
error variant in the comments.
@muzarski muzarski force-pushed the internal-error-matches-no-wildcard branch from 5bd35ff to c400b5f Compare October 14, 2024 13:51
@muzarski muzarski added this to the 0.15.0 milestone Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants