Let the PVF host kill the worker on timeout #6381
Changes from 2 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -78,6 +78,9 @@ enum Selected { | |
|
||
/// Given the idle token of a worker and parameters of work, communicates with the worker and | ||
/// returns the outcome. | ||
+ /// | ||
+ /// NOTE: Returning the `TimedOut` or `DidNotMakeIt` errors will trigger the child process being | ||
+ /// killed. | ||
pub async fn start_work( | ||
worker: IdleWorker, | ||
code: Arc<Vec<u8>>, | ||
|
@@ -149,6 +152,7 @@ pub async fn start_work( | |
}, | ||
}; | ||
|
||
+ // NOTE: A `TimedOut` or `DidNotMakeIt` error triggers the child process being killed.
Review comment: I added some comments about the error handling flow. Having too many comments is These comments would have helped me understand the error flow earlier, but if it |
||
match selected { | ||
// Timed out on the child. This should already be logged by the child. | ||
Selected::Done(Err(PrepareError::TimedOut)) => Outcome::TimedOut, | ||
|
@@ -162,6 +166,9 @@ pub async fn start_work( | |
} | ||
|
||
/// Handles the case where we successfully received response bytes on the host from the child. | ||
+ /// | ||
+ /// NOTE: Here we know the artifact exists, but is still located in a temporary file which will be | ||
+ /// cleared by `with_tmp_file`. | ||
async fn handle_response_bytes( | ||
response_bytes: Vec<u8>, | ||
pid: u32, | ||
|
@@ -201,9 +208,6 @@ async fn handle_response_bytes( | |
); | ||
|
||
// Return a timeout error. | ||
- // | ||
- // NOTE: The artifact exists, but is located in a temporary file which | ||
- // will be cleared by `with_tmp_file`. | ||
return Selected::Deadline | ||
} | ||
|
||
|
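The error flow that the new comments describe can be summarized in a small sketch. This is not the code from the PR: the `Outcome` enum here is a simplified stand-in (only the `TimedOut` and `DidNotMakeIt` names come from the diff, `Concluded` is a hypothetical placeholder for the success path), and `must_kill_worker` is a hypothetical helper introduced purely for illustration.

```rust
/// Simplified stand-in for the host-side outcome of a preparation job.
/// Only `TimedOut` and `DidNotMakeIt` are names taken from the diff;
/// `Concluded` is a hypothetical placeholder for the success path.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Concluded,
    TimedOut,
    DidNotMakeIt,
}

/// Hypothetical helper: returns `true` when the host should kill the
/// child process after seeing this outcome.
pub fn must_kill_worker(outcome: Outcome) -> bool {
    // Per the NOTE in the diff, both error outcomes lead to the child
    // process being killed; any other outcome leaves the worker alive.
    matches!(outcome, Outcome::TimedOut | Outcome::DidNotMakeIt)
}
```

Under these assumptions, `must_kill_worker(Outcome::TimedOut)` is `true`, while `must_kill_worker(Outcome::Concluded)` is `false`.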
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -233,7 +233,11 @@ pub async fn cpu_time_monitor_loop( | |
timeout.as_millis(), | ||
); | ||
|
||
- // Send back a TimedOut error on timeout. | ||
+ // Send back a `TimedOut` error. | ||
+ // | ||
+ // NOTE: This will cause the worker, whether preparation or execution, to be killed by | ||
+ // the host. We do not kill the process here because it would interfere with the proper | ||
+ // handling of this error. | ||
let encoded_result = match job_kind { | ||
JobKind::Prepare => { | ||
let result: Result<(), PrepareError> = Err(PrepareError::TimedOut); | ||
|
@@ -244,8 +248,8 @@ pub async fn cpu_time_monitor_loop( | |
result.encode() | ||
}, | ||
}; | ||
- // If we error there is nothing else we can do here, and we are killing the process, | ||
- // anyway. The receiving side will just have to time out. | ||
+ // If we error here there is nothing we can do apart from log it. The receiving side | ||
+ // will just have to time out. | ||
if let Err(err) = framed_send(&mut stream, encoded_result.as_slice()).await { | ||
gum::warn!( | ||
target: LOG_TARGET, | ||
|
@@ -254,9 +258,6 @@ pub async fn cpu_time_monitor_loop( | |
err | ||
); | ||
} | ||
|
||
- // Kill the process. | ||
- std::process::exit(1); | ||
Review comment: What will happen in case of the timeout is that you will go onto the next iteration of |
Reply: Good catch! |
||
} | ||
|
||
// Sleep for the remaining CPU time, plus a bit to account for overhead. Note that the sleep | ||
|
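The review point about the monitor loop (once `std::process::exit(1)` is removed, the timeout branch must leave the loop itself or it will run another iteration) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `monitor_loop`, `WorkerMessage`, the injected `cpu_time_elapsed` closure, and the `std::sync::mpsc` channel standing in for the socket stream are hypothetical and do not mirror the real `cpu_time_monitor_loop` signature.

```rust
use std::sync::mpsc::Sender;
use std::time::Duration;

/// Hypothetical message type standing in for the encoded `TimedOut` result.
#[derive(Debug, PartialEq, Eq)]
pub enum WorkerMessage {
    TimedOut,
}

/// Polls `cpu_time_elapsed` until it exceeds `timeout`, reports the timeout
/// to the host, and returns the number of polls performed.
pub fn monitor_loop(
    mut cpu_time_elapsed: impl FnMut() -> Duration,
    timeout: Duration,
    to_host: &Sender<WorkerMessage>,
) -> u32 {
    let mut polls = 0;
    loop {
        polls += 1;
        if cpu_time_elapsed() > timeout {
            // If sending fails there is nothing more to do on this side;
            // the receiving side will just have to time out.
            let _ = to_host.send(WorkerMessage::TimedOut);
            // Return explicitly. The process is no longer exited here, so
            // without this `return` the loop would start another iteration.
            return polls;
        }
        // A real loop would sleep for the remaining CPU time here.
    }
}
```

In this sketch the worker only reports the timeout; killing the process is left to the host, which matches the intent stated in the NOTE in the diff.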
Review comment: I think this line should be a doc comment on `FromPool::Concluded`.
Reply: Oh, the existing doc on `Concluded` already addresses this, but it says the worker should already be killed:
This happens in `attempt_retire`, which I missed. I had a misunderstanding based on this line:
Not sure if we need any comments here, just a brain fart on my end.