Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagate panics #6449

Closed
wants to merge 2 commits into from
Closed

Conversation

nvartolomei
Copy link
Contributor

Another try for fixing #3104.

RepartitionExec might need a similar fix.

Initially I thought about propagating the JoinError, but (re)panicking seems nicer and helps debugging better.

What changes are included in this PR?

Use tokio::task::JoinSet to handle spawning and tearing down of tasks.

Are these changes tested?

Yes

Are there any user-facing changes?

No

Another try for fixing apache#3104.

RepartitionExec might need a similar fix.
@github-actions github-actions bot added the core Core DataFusion crate label May 25, 2023
nvartolomei added a commit to nvartolomei/tokio that referenced this pull request May 25, 2023
My use-case can be seen here apache/datafusion#6449.

I want to avoid needless Box::pin calls when I can just poll the underlying task.
// If the input stream is done, wait for all tasks to finish and return
// the failure if any.
if let Poll::Ready(None) = poll {
match Box::pin(self.tasks.join_next()).poll_unpin(cx) {
Copy link
Contributor Author

@nvartolomei nvartolomei May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would help here a bit tokio-rs/tokio#5721

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree that it makes sense to add a poll_join_next method, you don't have to box here. You can use the pin! macro instead of boxing.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @nvartolomei 🙏

Initially I thought about propagating the JoinError, but (re)panicking seems nicer and helps debugging better.

I agree it is nicer, but I think propagating the JoinError is more consistent with the rest of the DataFusion codebase. For example RepartitionExec

In IOx, I think we solved a similar problem using a WatchedTask

https://github.com/influxdata/influxdb_iox/blob/0b9b775fbd5a47a8e8c918fab5d704af3f576b5e/datafusion_util/src/watch.rs#L15-L94

Maybe we could apply a similar pattern in DataFusion

cc @crepererum or @tustvold in case you have some thoughts in this area

}

Ok(Box::pin(MergeStream {
input: receiver,
schema: self.schema(),
baseline_metrics,
drop_helper: AbortOnDropMany(join_handles),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 it does look like https://docs.rs/tokio/latest/tokio/task/struct.JoinSet.html may be a nicer alternative to AbortOnDrop -- what do you think @crepererum ?

// If the input stream is done, wait for all tasks to finish and return
// the failure if any.
if let Poll::Ready(None) = poll {
let fut = self.tasks.join_next();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this only checks the first future that finishes. Would it work if there were multiple input tasks and the first one that finished was not the one that panicd?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I have been working on writing a test for this as part of #6507

nvartolomei added a commit to nvartolomei/tokio that referenced this pull request May 28, 2023
…h_id` public

My use-case can be seen here apache/datafusion#6449.

I want to avoid needless Box::pin calls when I can just poll the underlying task.
nvartolomei added a commit to nvartolomei/tokio that referenced this pull request May 28, 2023
My use-case can be seen here apache/datafusion#6449.

I want to avoid needless Box::pin calls when I can just poll the underlying task.
nvartolomei added a commit to nvartolomei/tokio that referenced this pull request May 29, 2023
My use-case can be seen here apache/datafusion#6449.

I want to avoid needless Box::pin calls when I can just poll the underlying task.
nvartolomei added a commit to nvartolomei/tokio that referenced this pull request May 29, 2023
My use-case can be seen here apache/datafusion#6449.

I want to avoid needless Box::pin calls when I can just poll the underlying task.
@alamb alamb marked this pull request as draft May 29, 2023 11:18
@alamb alamb marked this pull request as ready for review May 29, 2023 11:19
@crepererum
Copy link
Contributor

Good fix and test, thank you. Could you test+fix datafusion/core/src/physical_plan/repartition/mod.rs as well and remove AbortOnDropMany from the code base so that nobody introduces the same bug again? If you don't have the time/resources, just write a ticket and someone else will (hopefully) pick it up.

Short side note on JoinSet: IIRC this is a rather recent (1y or so) addition to tokio and wasn't available when the abortion bug was fixed.

@@ -98,12 +98,13 @@ fn build_file_list_recurse(

/// Spawns a task to the tokio threadpool and writes its outputs to the provided mpsc sender
pub(crate) fn spawn_execution(
join_set: &mut JoinSet<()>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor

alamb commented Jun 1, 2023

If you don't have the time/resources, just write a ticket and someone else will (hopefully) pick it up.

I filed #6513

@alamb alamb closed this in #6507 Jun 6, 2023
@nvartolomei nvartolomei deleted the nv/propagate-panics branch June 6, 2023 20:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants