test_coalescing: improve error handling #834

cvybhu · 2023-10-13T20:21:55Z

test_coalescing spawns a lot of tokio tasks, each of which runs a query on the database.
If a query fails, the task will panic because of the unwrap().

The problem is that all of the tasks are joined using join_all, which doesn't check for errors in the joined tasks.
For example, this piece of code prints "Ok!" and the program finishes successfuly:

let mut tasks = Vec::new();
for _i in 0..4 {
    tasks.push(tokio::spawn(async { panic!("Panic!") }));
}

futures::future::join_all(tasks).await;

println!("Ok!");

There are some error messages in the output, but the important thig is that they don't stop the control flow.

To fix it, let's use try_join_all instead of join_all. try_join_all will check whether the task panicked or not, and return an error if there was a panic.

I manually tested that it works by inserting a panic!() after conn.query().

Let's also make sure that the result of join_all doesn't contain any Result<>, as it could hide the errors. The result of join_all should be just a Vec<> of unit values.

The PR also adds a check that Connection::query didn't return QueryResponse::Error. It is required to check it in order to ensure that the query didn't fail.

Refs: #828

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
I added relevant tests for new features and bug fixes.
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
I have provided docstrings for the public items that I want to introduce.
I have adjusted the documentation in ./docs/source/.
I added appropriate Fixes: annotations to PR description.

piodul · 2023-10-17T07:28:33Z

scylla/src/transport/connection.rs

                    let futs = (base..base + batch_size).map(|j| {
                        let q = Query::new("INSERT INTO t (p, v) VALUES (?, ?)");
                        let conn = conn.clone();
-                        async move {
-                            conn.query(&q, (j, vec![j as u8; j as usize]), None)
-                                .await
-                                .unwrap()
-                        }
+                        async move { conn.query(&q, (j, vec![j as u8; j as usize]), None).await }
                    });
-                    futures::future::join_all(futs).await;
+                    futures::future::try_join_all(futs).await


I think this part is OK, actually. If you have a compound future (i.e. a future that polls other futures, like the one returned by join_all), panics from the inner futures will be propagated properly. I'm not sure that changes are needed here.

Yeah, I guess we can just panic there.
I was scared of panics because I don't fully understand how tokio handles panics, but it looks like it would be ok.

Although I found one more issue here. Connection::query returns a QueryResponse, which can be QueryResponse::Error, so we have to also check that. Fixed in the new version.

piodul · 2023-10-17T07:34:32Z

scylla/src/transport/connection.rs

                }));

                tokio::task::yield_now().await;
            }

-            futures::future::join_all(futs).await;
+            futures::future::try_join_all(futs).await.unwrap();


On the other hand, I'm not sure this line is correct now after your changes. If you look at the definition of tokio::task::spawn, you will see that it returns a handle that is a future which returns its own Result - if the future polled by the tokio task panics, the handle will contain Err. So, because you changed the futures to return Result<>s, now try_join_all returns a Result<Result<_>>.

Assuming that you won't change the futs.push part of the code, you will need to do two unwraps here. On the other hand, if you revert changes you did to the futs.push part, you should keep the line directly above as it is and that's it.

To sum up, the whole change should be an one-liner (unless I messed something in my reasoning).

Ahh you're right. It's a shame that rust didn't complain about unused Results here.
I uploaded new version which also checks that the result of join_all is always a Vec<()>

`test_coalescing` spawns a lot of tokio tasks, each of which runs a query on the database. If a query fails, the task will panic because of the `unwrap()`. The problem is that all of the tasks are joined using `join_all`, which doesn't check for errors in the joined tasks. For example, this piece of code prints "Ok!" and the program finishes successfuly: ```rust let mut tasks = Vec::new(); for _i in 0..4 { tasks.push(tokio::spawn(async { panic!("Panic!") })); } futures::future::join_all(tasks).await; println!("Ok!"); ``` There are some error messages in the output, but the important thig is that they don't stop the control flow. To fix it, let's use `try_join_all` instead of `join_all`. `try_join_all` will check whether the task panicked or not, and return an error if there was a panic. I manually tested that it works by inserting a panic!() after `conn.query()`. Let's also make sure that the result of join_all doesn't contain any `Result<>`, as it could hide the errors. The result of `join_all` should be just a `Vec<>` of unit values. Refs: scylladb#828 Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

`test_coalescing` runs a bunch of insert queries using `Connection::query`. It unwraps the result of `Connection::query`, but this function returns a `QueryResponse`, which might be `QueryResponse::Error`. To make sure that the query was successful we have to check that the response doesn't contain any errors. This can by done using `into_non_error_query_response()`. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

cvybhu · 2023-10-19T14:26:46Z

v2:

Went back to doing panic! if the query fails, unwrap() is easier to use than Result. I manually tested that the test fails when I add a panic!() next to conn.query.
Ensure that join_all doesn't return a Vec of results, it should always be a vec of unit values
I noticed that conn.query returns a QueryResponse which could be QueryResponse::Error. To make sure that a query was successful we also have to check that the response isn't an error. Fixed that in a new commit.

cvybhu mentioned this pull request Oct 13, 2023

test_coalescing might be flaky #828

Closed

cvybhu requested a review from piodul October 13, 2023 20:30

Lorak-mmk approved these changes Oct 16, 2023

View reviewed changes

piodul requested changes Oct 17, 2023

View reviewed changes

cvybhu added 2 commits October 19, 2023 16:16

cvybhu force-pushed the colaerror branch from b177595 to d532dd9 Compare October 19, 2023 14:21

piodul approved these changes Oct 23, 2023

View reviewed changes

piodul merged commit 535c724 into scylladb:main Oct 23, 2023
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test_coalescing: improve error handling #834

test_coalescing: improve error handling #834

cvybhu commented Oct 13, 2023 •

edited

Loading

piodul Oct 17, 2023

cvybhu Oct 19, 2023

piodul Oct 17, 2023 •

edited

Loading

cvybhu Oct 19, 2023

cvybhu commented Oct 19, 2023 •

edited

Loading

test_coalescing: improve error handling #834

test_coalescing: improve error handling #834

Conversation

cvybhu commented Oct 13, 2023 • edited Loading

Pre-review checklist

piodul Oct 17, 2023

Choose a reason for hiding this comment

cvybhu Oct 19, 2023

Choose a reason for hiding this comment

piodul Oct 17, 2023 • edited Loading

Choose a reason for hiding this comment

cvybhu Oct 19, 2023

Choose a reason for hiding this comment

cvybhu commented Oct 19, 2023 • edited Loading

cvybhu commented Oct 13, 2023 •

edited

Loading

piodul Oct 17, 2023 •

edited

Loading

cvybhu commented Oct 19, 2023 •

edited

Loading