-
-
Notifications
You must be signed in to change notification settings - Fork 627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add remote workunits for Zipkin trace #7897
Add remote workunits for Zipkin trace #7897
Conversation
c44a0ce
to
5ef5b75
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good :) Some minor things, and we'll need some tests that you're working on :)
@@ -211,7 +212,7 @@ impl super::CommandRunner for CommandRunner { | |||
/// | |||
/// Runs a command on this machine in the passed working directory. | |||
/// | |||
fn run(&self, req: ExecuteProcessRequest) -> BoxFuture<FallibleExecuteProcessResult, String> { | |||
fn run(&self, req: ExecuteProcessRequest, _workunit_store: Arc<WorkUnitStore>) -> BoxFuture<FallibleExecuteProcessResult, String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a TODO here to populate workunits for local process executions. We could reasonably have workunits for each of:
materialize input files
execution
ingest output files
(and I have a not-so-secret plan that we'll start also populating workunits for each store action (at least the remote ones) for things like "download file from remote" :))
fn get_workunits(&self) -> &Mutex<Vec<WorkUnit>>; | ||
fn add_workunit(&self, workunit: WorkUnit); | ||
} | ||
|
||
pub struct SafeWorkUnitStore { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's safe about this, to influence it's name?
If Context
itself isn't going to implement the WorkUnitStore
trait, then we could get rid of the trait, and just have a struct named WorkUnitStore
that we use everywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided to call it safe because it is a Mutex of a value(Vec). So we wanted to say that it is safe to work with it from several threads.
What could be a better name in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just WorkUnitStore
- Rust guarantees that you'll always be able to work with things properly across threads (and specifically the Send
and Sync
marker traits are used to denote these kinds of safety) :)
fn get_workunits(&self) -> &Mutex<Vec<WorkUnit>>; | ||
fn add_workunit(&self, workunit: WorkUnit); | ||
} | ||
|
||
pub struct SafeWorkUnitStore { | ||
pub workunits: Mutex<Vec<WorkUnit>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this needs to be pub
?
fn get_workunits(&self) -> &Mutex<Vec<WorkUnit>>; | ||
fn add_workunit(&self, workunit: WorkUnit); | ||
} | ||
|
||
pub struct SafeWorkUnitStore { | ||
pub workunits: Mutex<Vec<WorkUnit>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be worth having this field be an Arc<Mutex<Vec<WorkUnit>>>
so that all of the callers don't need to wrap it in an Arc
. Or it may not :)
@@ -736,6 +753,30 @@ impl CommandRunner { | |||
} | |||
} | |||
|
|||
fn maybe_add_workunit(result_cached: &bool, name: &str, start_time: &Timespec, end_time: &Timespec, parent_id: Option<String>, workunit_store: &Arc<WorkUnitStore>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's kind of weird to take a &bool
- clippy will probably suggest that you just take a bool
.
The reason for this is that a bool
is just one small amount of memory to copy, and is probably actually cheaper to just copy than using references, and any time you use a reference you make the compiler's job slightly harder.
} | ||
|
||
fn timespec_as_float_secs(timespec: &Timespec) -> f64 { | ||
// Returning value is formed by representing duration as a hole number of seconds (u64) plus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably remove this comment - it's just describing what the code says and not adding much value on its own :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably keep the last line though, because we are losing precision and I can imagine myself forgetting about that very easily, and not getting that from the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work!
} | ||
|
||
fn timespec_as_float_secs(timespec: &Timespec) -> f64 { | ||
// Returning value is formed by representing duration as a hole number of seconds (u64) plus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably keep the last line though, because we are losing precision and I can imagine myself forgetting about that very easily, and not getting that from the code.
start_timestamp, | ||
end_timestamp, | ||
span_id, | ||
parent_id: None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a TODO (ideally with an issue number) to have this not be None
soon (excited!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
5ef5b75
to
29097d4
Compare
29097d4
to
1cb27b3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Apart from one question about how we're asserting equality, all really minor comments :)
} | ||
|
||
fn timespec_as_float_secs(timespec: &Timespec) -> f64 { | ||
// Reverting time from duration to f64 decrease precision. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably worth linking to rust-lang/rust#54361 saying that there's an unstable standard library feature we'd like to use, but we're copying their implementation until it's stabilised.
@@ -2489,6 +2529,70 @@ mod tests { | |||
) | |||
} | |||
|
|||
#[test] | |||
fn check_that_remote_workunits_are_in_workunit_store() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We tend not to use check_that
or similar prefixes in our test names, because that what all tests do, so it's typically redundant. Maybe call this stores_workunits
or remote_workunits_are_stored
?
StderrType::Raw(testdata_empty.string()), | ||
0, | ||
) | ||
.op |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This .op.unwrap().unwrap()
points to a slightly wrong abstraction here. make_successful_operation_with_metadata
returns a MockOperation
, which maybe wraps an Operation
. It looks like you want a function that just makes an Operation
.
I think this should be pretty easy to do; just make make_successful_operation_with_maybe_metadata
return an Operation
, and do the MockOperation
wrapping in make_successful_operation
instead.
let mut runtime = tokio::runtime::Runtime::new().unwrap(); | ||
|
||
let workunit_store_2 = workunit_store.clone(); | ||
runtime.block_on(futures::future::ok(()).and_then(move |()| command_runner.extract_execute_response( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly more clear than the futures::future::ok(()).and_then(move |()|
could be: futures::future::lazy(move ||
let workunits_arc = workunit_store.get_workunits(); | ||
let workunits = workunits_arc.lock(); | ||
let scheduling_workunit = &workunits[0]; | ||
assert_workunit_params(scheduling_workunit, "scheduling", 0.0, 1.0, None); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the order of WorkUnits in the store actually matter? If not, we may not want to assert on it...
And in general, it's nice to be able to assert equality on two WorkUnits rather than have to go through field by field... We may want to (or it may be confusing and overkill...) do something like:
let got_workunits: HashSet<SpanIdIgnoringWorkUnits> = workunit_store.get_workunits().lock().iter().cloned().map(WorkUnitIgnoringSpanId::from_workunit).collect()
let want_workunits = hashset! {
WorkUnitIgnoringSpanId { ... },
WorkUnitIgnoringSpanId { ... },
WorkUnitIgnoringSpanId { ... },
WorkUnitIgnoringSpanId { ... },
};
assert_eq!(want_workunits, got_workunits)
let mut metadata = ExecutedActionMetadata::new(); | ||
metadata.set_queued_timestamp(timestamp_only_secs(0)); | ||
metadata.set_worker_start_timestamp(timestamp_only_secs(1)); | ||
metadata.set_worker_completed_timestamp(timestamp_only_secs(8)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this line to the end, so that the order of code reflects the order of timestamps
src/rust/engine/src/nodes.rs
Outdated
NodeKey::Snapshot(n) => n.run(context).map(NodeResult::from).to_boxed(), | ||
NodeKey::Task(n) => n.run(context).map(NodeResult::from).to_boxed(), | ||
} | ||
futures::future::ok(()).and_then(|()| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, maybe futures::future::lazy(||
(but not big deal either way)
self.workunits.lock().push(workunit); | ||
} | ||
|
||
pub fn len(&self) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
len
seems like a kind of weird operation to have on a WorkUnitStore
, and it looks like it's only used in a test. Maybe in the test instead do store.get_workunits().lock().len()
?
// only if '--reporting-zipkin-trace-v2' is set | ||
if !result_cached { | ||
let workunit = WorkUnit { | ||
name: String::from(name), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to use slightly more verbose names, something like "remote execution worker input fetching" and similar
@@ -126,7 +128,7 @@ impl super::CommandRunner for CommandRunner { | |||
/// | |||
/// TODO: Request jdk_home be created if set. | |||
/// | |||
fn run(&self, req: ExecuteProcessRequest) -> BoxFuture<FallibleExecuteProcessResult, String> { | |||
fn run(&self, req: ExecuteProcessRequest, workunit_store: WorkUnitStore) -> BoxFuture<FallibleExecuteProcessResult, String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@illicitonion : Should this be using thread/task locals like logging does? AFAIK, most tracing implementations do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm personally a big fan of explicit over implicit, but it could if we wanted to...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, tracing is a lot like logging. Doing the equivalent thing there and propagating our logger explicitly throughout our callstack to all the places we might want to log would probably be too much.
Maybe this is not similar to a logger, and so that doesn't make sense here... unknown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I use task locals in not many places in this PR I would prefer to leave it as it is.
But this is a useful suggestion and it would be great to reconsider it when adding right parent_id to all other v2 Nodes.
1cb27b3
to
edb8c80
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! A couple of tiny comments, otherwise about ready to merge :)
Can you look at the first two shards on travis? There are a couple of trivial clippy things, and a reformatting needed. All the other shards look like unrelated flakes
format!("{:16.x}", random_u64) | ||
} | ||
|
||
pub fn got_workunits(workunit_store: WorkUnitStore) -> HashSet<WorkUnit> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rename this to something like: workunits_without_span_id
or workunits_with_constant_span_id
Also, this only needs a reference to a WorkUnitStore
, so let's take a &WorkUnitStore
instead of a WorkUnitStore
.
There's also a cute little construction where you could if you wanted (I don't think it's necessarily better here, but it's pretty cute and good to know about), write:
workunit_store.get_workunits().lock().iter()
.map(|workunit| WorkUnit { span_id: String::from("ignore"), **workunit })
.collect()
which means "Make me a new WorkUnit
with all the fields set the same as workunit
except span_id
"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I changed the code to
workunit_store.get_workunits().lock().iter().map(|workunit| WorkUnit { span_id: String::from("ignore"), ..workunit }).collect()
I got the next error:
error[E0308]: mismatched types
--> workunit_store/src/lib.rs:76:109
|
76 | workunit_store.get_workunits().lock().iter().map(|workunit| WorkUnit { span_id: String::from("ignore"), ..workunit }).collect()
| ^^^^^^^^ expected struct `WorkUnit`, found &WorkUnit
|
I cannot also do:
workunit_store.get_workunits().lock().iter().map(|workunit| WorkUnit { span_id: String::from("ignore"), ..*workunit }).collect()
Because cannot move out of borrowed content
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WorkUnit { span_id: String::from("ignore"), ..workunit.clone() }
works
So anyway need to clone the workunit
start_timestamp, | ||
end_timestamp, | ||
span_id, | ||
parent_id: None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
edb8c80
to
b54225f
Compare
### Problem As the purpose of the remoting is to make pants runs faster (for compile) it is very important to understand its performance. One of the nice ways is Zipkin tracing. ### Solution The result of the remote execution contains timings of different stages of the remote process that are used to create workunits to Zipkin trace. The last part of this PR is to add a test(unitest and/or integration test).
b54225f
to
ac1f978
Compare
I just pushed a rebase because I introduced a bunch of conflicts in two other PRs; will merge when green! :) |
Problem
As the purpose of the remoting is to make pants runs faster (for compile) it is very important to understand its performance. One of the nice ways is Zipkin tracing.
Solution
The result of the remote execution contains timings of different stages of the remote process that are used to create workunits to Zipkin trace.
The last part of this PR is to add a test(unitest and/or integration test).