Add remote workunits for Zipkin trace #7897

cattibrie · 2019-06-19T13:51:06Z

Problem

Since the purpose of remoting is to make Pants runs faster (for compile), it is very important to understand its performance. Zipkin tracing is one nice way to do that.

Solution

The result of a remote execution contains timings for the different stages of the remote process; these timings are used to create workunits for the Zipkin trace.

The last part of this PR is to add a test (unit test and/or integration test).

illicitonion

Looks good :) Some minor things, and we'll need some tests that you're working on :)

illicitonion · 2019-06-19T15:48:30Z

src/rust/engine/process_execution/src/local.rs

@@ -211,7 +212,7 @@ impl super::CommandRunner for CommandRunner {
  ///
  /// Runs a command on this machine in the passed working directory.
  ///
-  fn run(&self, req: ExecuteProcessRequest) -> BoxFuture<FallibleExecuteProcessResult, String> {
+  fn run(&self, req: ExecuteProcessRequest, _workunit_store: Arc<WorkUnitStore>) -> BoxFuture<FallibleExecuteProcessResult, String> {


Maybe add a TODO here to populate workunits for local process executions. We could reasonably have workunits for each of:
- materialize input files
- execution
- ingest output files

(and I have a not-so-secret plan that we'll start also populating workunits for each store action (at least the remote ones) for things like "download file from remote" :))
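
A rough sketch of what per-stage workunits for a local process execution might look like. The `WorkUnit` fields, the `record_stage` helper and `run_locally` are hypothetical stand-ins rather than the engine's actual types, and the three closures only mark where the real work would go.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical, simplified stand-in for the engine's WorkUnit.
#[derive(Debug)]
pub struct WorkUnit {
  pub name: String,
  pub start_timestamp: f64,
  pub end_timestamp: f64,
}

// Wall-clock time as fractional seconds since the Unix epoch.
fn now_secs() -> f64 {
  SystemTime::now()
    .duration_since(UNIX_EPOCH)
    .expect("system clock is before the Unix epoch")
    .as_secs_f64()
}

// Runs one stage, records a workunit covering it, and returns the stage's result.
pub fn record_stage<T>(
  workunits: &mut Vec<WorkUnit>,
  name: &str,
  stage: impl FnOnce() -> T,
) -> T {
  let start_timestamp = now_secs();
  let result = stage();
  let end_timestamp = now_secs();
  workunits.push(WorkUnit {
    name: name.to_owned(),
    start_timestamp,
    end_timestamp,
  });
  result
}

// The three stages suggested in the comment above; the closure bodies are placeholders.
pub fn run_locally(workunits: &mut Vec<WorkUnit>) -> i32 {
  record_stage(workunits, "materialize input files", || ());
  let exit_code = record_stage(workunits, "execution", || 0);
  record_stage(workunits, "ingest output files", || ());
  exit_code
}
```

Wrapping each stage in a closure keeps the timing logic in one place, so adding a further stage (for example a store download) would be a one-line change.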

illicitonion · 2019-06-19T15:50:26Z

src/rust/engine/workunit_store/src/lib.rs

  fn get_workunits(&self) -> &Mutex<Vec<WorkUnit>>;
  fn add_workunit(&self, workunit: WorkUnit);
 }
+
+pub struct SafeWorkUnitStore {


What's safe about this, to influence its name?

If Context itself isn't going to implement the WorkUnitStore trait, then we could get rid of the trait, and just have a struct named WorkUnitStore that we use everywhere?

We decided to call it safe because it is a Mutex around a value (Vec), so we wanted to say that it is safe to work with it from several threads.
What would be a better name in this case?

Just WorkUnitStore - Rust guarantees that you'll always be able to work with things properly across threads (and specifically the Send and Sync marker traits are used to denote these kinds of safety) :)

illicitonion · 2019-06-19T15:51:18Z

src/rust/engine/workunit_store/src/lib.rs

  fn get_workunits(&self) -> &Mutex<Vec<WorkUnit>>;
  fn add_workunit(&self, workunit: WorkUnit);
 }
+
+pub struct SafeWorkUnitStore {
+  pub workunits: Mutex<Vec<WorkUnit>>,


I don't think this needs to be pub?

illicitonion · 2019-06-19T15:51:57Z

src/rust/engine/workunit_store/src/lib.rs

  fn get_workunits(&self) -> &Mutex<Vec<WorkUnit>>;
  fn add_workunit(&self, workunit: WorkUnit);
 }
+
+pub struct SafeWorkUnitStore {
+  pub workunits: Mutex<Vec<WorkUnit>>,


It may be worth having this field be an Arc<Mutex<Vec<WorkUnit>>> so that all of the callers don't need to wrap it in an Arc. Or it may not :)
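
One way the suggested shape could look, as a minimal sketch. It uses `std::sync::Mutex` to stay dependency-free (the `.lock()` calls quoted elsewhere in this thread suggest the PR uses a different mutex type), the `WorkUnit` fields are trimmed down, `get_workunits` returns a snapshot instead of the lock, and `add_from_threads` is a hypothetical helper added only to exercise the store.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical, trimmed-down WorkUnit.
#[derive(Clone, Debug, PartialEq)]
pub struct WorkUnit {
  pub name: String,
  pub span_id: String,
}

// The Arc lives inside the store, so callers clone the store itself
// instead of wrapping it in an Arc. All clones share one list.
#[derive(Clone, Default)]
pub struct WorkUnitStore {
  workunits: Arc<Mutex<Vec<WorkUnit>>>,
}

impl WorkUnitStore {
  pub fn new() -> WorkUnitStore {
    WorkUnitStore::default()
  }

  pub fn add_workunit(&self, workunit: WorkUnit) {
    self
      .workunits
      .lock()
      .expect("workunit lock poisoned")
      .push(workunit);
  }

  // Returns a copy of the current contents rather than exposing the lock.
  pub fn get_workunits(&self) -> Vec<WorkUnit> {
    self.workunits.lock().expect("workunit lock poisoned").clone()
  }
}

// Adds one workunit from each of `count` threads, then waits for them all.
pub fn add_from_threads(store: &WorkUnitStore, count: usize) {
  let handles: Vec<_> = (0..count)
    .map(|i| {
      let store = store.clone();
      std::thread::spawn(move || {
        store.add_workunit(WorkUnit {
          name: format!("stage {}", i),
          span_id: format!("{:016x}", i),
        })
      })
    })
    .collect();
  for handle in handles {
    handle.join().expect("worker thread panicked");
  }
}
```

The compiler only accepts the `thread::spawn` call because the store is `Send`, which is the guarantee the naming discussion above refers to.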

illicitonion · 2019-06-19T15:54:12Z

src/rust/engine/process_execution/src/remote.rs

@@ -736,6 +753,30 @@ impl CommandRunner {
  }
 }

+fn maybe_add_workunit(result_cached: &bool, name: &str, start_time: &Timespec, end_time: &Timespec, parent_id: Option<String>, workunit_store: &Arc<WorkUnitStore>) {


It's kind of weird to take a &bool - clippy will probably suggest that you just take a bool.

The reason for this is that a bool is just one small amount of memory to copy, and is probably actually cheaper to just copy than using references, and any time you use a reference you make the compiler's job slightly harder.
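
A small sketch of the by-value signature. The parameter list is simplified (plain `f64` timestamps and a `Vec` in place of the store), so treat everything except the `result_cached: bool` parameter as an assumption made for illustration.

```rust
// Hypothetical, simplified stand-in for the engine's WorkUnit.
#[derive(Debug, PartialEq)]
pub struct WorkUnit {
  pub name: String,
  pub start_timestamp: f64,
  pub end_timestamp: f64,
}

// `result_cached` is taken by value: bool is Copy, so nothing is gained by borrowing it.
pub fn maybe_add_workunit(
  result_cached: bool,
  name: &str,
  start_timestamp: f64,
  end_timestamp: f64,
  workunits: &mut Vec<WorkUnit>,
) {
  // Cached results did no remote work, so there is nothing to record for them.
  if !result_cached {
    workunits.push(WorkUnit {
      name: name.to_owned(),
      start_timestamp,
      end_timestamp,
    });
  }
}
```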

illicitonion · 2019-06-19T16:10:38Z

src/rust/engine/process_execution/src/remote.rs

+}
+
+fn timespec_as_float_secs(timespec: &Timespec) -> f64 {
+  //  Returning value is formed by representing duration as a hole number of seconds (u64) plus


Probably remove this comment - it's just describing what the code says and not adding much value on its own :)

I'd probably keep the last line though, because we are losing precision and I can imagine myself forgetting about that very easily, and not getting that from the code.

blorente

Awesome work!

blorente · 2019-06-20T09:35:36Z

src/rust/engine/process_execution/src/remote.rs

+}
+
+fn timespec_as_float_secs(timespec: &Timespec) -> f64 {
+  //  Returning value is formed by representing duration as a hole number of seconds (u64) plus


I'd probably keep the last line though, because we are losing precision and I can imagine myself forgetting about that very easily, and not getting that from the code.

blorente · 2019-06-20T09:36:29Z

src/rust/engine/src/nodes.rs

+          start_timestamp,
+          end_timestamp,
+          span_id,
+          parent_id: None,


Maybe add a TODO (ideally with an issue number) to have this not be None soon (excited!)

illicitonion

Looks great! Apart from one question about how we're asserting equality, all really minor comments :)

illicitonion · 2019-07-01T10:37:53Z

src/rust/engine/process_execution/src/remote.rs

+}
+
+fn timespec_as_float_secs(timespec: &Timespec) -> f64 {
+  //  Reverting time from duration to f64 decrease precision.


Probably worth linking to rust-lang/rust#54361 saying that there's an unstable standard library feature we'd like to use, but we're copying their implementation until it's stabilised.
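
For reference, the conversion under discussion can be sketched with `std::time::Duration` in place of the `Timespec` type from the diff. The function name `duration_as_float_secs` is made up for this example. `Duration::as_secs_f64` has since become available on stable Rust, so a hand-written copy like this is mainly useful on older toolchains.

```rust
use std::time::Duration;

const NANOS_PER_SEC: f64 = 1_000_000_000.0;

// Whole seconds plus the fractional part. An f64 cannot represent every
// nanosecond-precision value exactly, so this conversion can lose precision.
pub fn duration_as_float_secs(duration: Duration) -> f64 {
  duration.as_secs() as f64 + f64::from(duration.subsec_nanos()) / NANOS_PER_SEC
}
```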

illicitonion · 2019-07-01T10:39:08Z

src/rust/engine/process_execution/src/remote.rs

@@ -2489,6 +2529,70 @@ mod tests {
    )
  }

+  #[test]
+  fn check_that_remote_workunits_are_in_workunit_store() {


We tend not to use check_that or similar prefixes in our test names, because that's what all tests do, so it's typically redundant. Maybe call this stores_workunits or remote_workunits_are_stored?

illicitonion · 2019-07-01T10:48:11Z

src/rust/engine/process_execution/src/remote.rs

+      StderrType::Raw(testdata_empty.string()),
+      0,
+    )
+        .op


This .op.unwrap().unwrap() points to a slightly wrong abstraction here. make_successful_operation_with_metadata returns a MockOperation, which maybe wraps an Operation. It looks like you want a function that just makes an Operation.

I think this should be pretty easy to do; just make make_successful_operation_with_maybe_metadata return an Operation, and do the MockOperation wrapping in make_successful_operation instead.
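
A sketch of the suggested split, with made-up stand-in types: the real `Operation` and `MockOperation` have different fields, the shape of `op` is only guessed from the `.op.unwrap().unwrap()` chain, and the helper signatures here are simplified.

```rust
// Hypothetical stand-ins for the types used by the test helpers.
#[derive(Clone, Debug, PartialEq)]
pub struct Operation {
  pub name: String,
  pub done: bool,
  pub metadata: Option<String>,
}

#[derive(Clone, Debug, PartialEq)]
pub struct MockOperation {
  pub op: Result<Option<Operation>, String>,
}

// Builds the bare Operation. Tests that need an Operation call this directly.
pub fn make_successful_operation_with_maybe_metadata(
  name: &str,
  metadata: Option<String>,
) -> Operation {
  Operation {
    name: name.to_owned(),
    done: true,
    metadata,
  }
}

// The MockOperation wrapping happens only here.
pub fn make_successful_operation(name: &str) -> MockOperation {
  MockOperation {
    op: Ok(Some(make_successful_operation_with_maybe_metadata(name, None))),
  }
}
```

With this split, the test in the diff could call the first helper and drop the `.op.unwrap().unwrap()` chain.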

illicitonion · 2019-07-01T10:50:13Z

src/rust/engine/process_execution/src/remote.rs

+    let mut runtime = tokio::runtime::Runtime::new().unwrap();
+
+    let workunit_store_2 = workunit_store.clone();
+    runtime.block_on(futures::future::ok(()).and_then(move |()| command_runner.extract_execute_response(


Slightly more clear than the futures::future::ok(()).and_then(move |()| could be: futures::future::lazy(move ||

illicitonion · 2019-07-01T10:52:14Z

src/rust/engine/process_execution/src/remote.rs

+    let workunits_arc = workunit_store.get_workunits();
+    let workunits = workunits_arc.lock();
+    let scheduling_workunit = &workunits[0];
+    assert_workunit_params(scheduling_workunit, "scheduling", 0.0, 1.0, None);


Does the order of WorkUnits in the store actually matter? If not, we may not want to assert on it...

And in general, it's nice to be able to assert equality on two WorkUnits rather than have to go through field by field... We may want to (or it may be confusing and overkill...) do something like:

    let got_workunits: HashSet<WorkUnitIgnoringSpanId> = workunit_store
      .get_workunits()
      .lock()
      .iter()
      .cloned()
      .map(WorkUnitIgnoringSpanId::from_workunit)
      .collect();
    let want_workunits = hashset! {
      WorkUnitIgnoringSpanId { ... },
      WorkUnitIgnoringSpanId { ... },
      WorkUnitIgnoringSpanId { ... },
      WorkUnitIgnoringSpanId { ... },
    };
    assert_eq!(want_workunits, got_workunits);
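
A self-contained sketch of that idea, using only the standard library. The field set is an assumption, timestamps are whole seconds (`u64`) so that `Eq` and `Hash` can be derived, and `comparable` is a hypothetical helper name.

```rust
use std::collections::HashSet;

// Hypothetical, simplified WorkUnit with integer timestamps.
#[derive(Clone, Debug)]
pub struct WorkUnit {
  pub name: String,
  pub start_secs: u64,
  pub end_secs: u64,
  pub span_id: String,
  pub parent_id: Option<String>,
}

// Everything except the randomly generated span_id.
#[derive(Debug, PartialEq, Eq, Hash)]
pub struct WorkUnitIgnoringSpanId {
  pub name: String,
  pub start_secs: u64,
  pub end_secs: u64,
  pub parent_id: Option<String>,
}

impl WorkUnitIgnoringSpanId {
  pub fn from_workunit(workunit: WorkUnit) -> WorkUnitIgnoringSpanId {
    WorkUnitIgnoringSpanId {
      name: workunit.name,
      start_secs: workunit.start_secs,
      end_secs: workunit.end_secs,
      parent_id: workunit.parent_id,
    }
  }
}

// Collecting into a HashSet makes the comparison independent of insertion order.
pub fn comparable(workunits: &[WorkUnit]) -> HashSet<WorkUnitIgnoringSpanId> {
  workunits
    .iter()
    .cloned()
    .map(WorkUnitIgnoringSpanId::from_workunit)
    .collect()
}
```

One caveat of a set: two workunits that differ only in span_id collapse into a single entry, so a test that cares about counts would need a separate length assertion.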

illicitonion · 2019-07-01T10:52:50Z

src/rust/engine/process_execution/src/remote.rs

+    let mut metadata = ExecutedActionMetadata::new();
+    metadata.set_queued_timestamp(timestamp_only_secs(0));
+    metadata.set_worker_start_timestamp(timestamp_only_secs(1));
+    metadata.set_worker_completed_timestamp(timestamp_only_secs(8));


Let's move this line to the end, so that the order of code reflects the order of timestamps

illicitonion · 2019-07-01T10:56:53Z

src/rust/engine/src/nodes.rs

-      NodeKey::Snapshot(n) => n.run(context).map(NodeResult::from).to_boxed(),
-      NodeKey::Task(n) => n.run(context).map(NodeResult::from).to_boxed(),
-    }
+    futures::future::ok(()).and_then(|()| {


Again, maybe futures::future::lazy(|| (but not big deal either way)

illicitonion · 2019-07-01T11:00:45Z

src/rust/engine/workunit_store/src/lib.rs

+    self.workunits.lock().push(workunit);
+  }
+
+  pub fn len(&self) -> usize {


len seems like a kind of weird operation to have on a WorkUnitStore, and it looks like it's only used in a test. Maybe in the test instead do store.get_workunits().lock().len()?

illicitonion · 2019-07-01T11:05:04Z

src/rust/engine/process_execution/src/remote.rs

+//   only if '--reporting-zipkin-trace-v2' is set
+  if !result_cached {
+    let workunit = WorkUnit {
+      name: String::from(name),


We probably want to use slightly more verbose names, something like "remote execution worker input fetching" and similar
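
To illustrate, here is a sketch that maps stage timings to descriptively named workunits. The struct is a simplified stand-in (plain `f64` seconds, with field names loosely following the remote execution API's `ExecutedActionMetadata` timestamps), and apart from "remote execution worker input fetching" the names are invented for this example.

```rust
// Hypothetical, simplified stand-in for the metadata returned by the remote side.
pub struct ExecutedActionMetadata {
  pub queued: f64,
  pub worker_start: f64,
  pub input_fetch_start: f64,
  pub input_fetch_completed: f64,
  pub execution_start: f64,
  pub execution_completed: f64,
  pub output_upload_start: f64,
  pub output_upload_completed: f64,
}

#[derive(Debug, PartialEq)]
pub struct WorkUnit {
  pub name: String,
  pub start_timestamp: f64,
  pub end_timestamp: f64,
}

// One workunit per stage, each delimited by a (start, end) pair of timestamps.
pub fn workunits_from_metadata(m: &ExecutedActionMetadata) -> Vec<WorkUnit> {
  let stages = [
    ("remote execution action scheduling", m.queued, m.worker_start),
    ("remote execution worker input fetching", m.input_fetch_start, m.input_fetch_completed),
    ("remote execution worker command executing", m.execution_start, m.execution_completed),
    ("remote execution worker output uploading", m.output_upload_start, m.output_upload_completed),
  ];
  stages
    .iter()
    .map(|&(name, start_timestamp, end_timestamp)| WorkUnit {
      name: name.to_owned(),
      start_timestamp,
      end_timestamp,
    })
    .collect()
}
```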

stuhood · 2019-07-02T01:52:52Z

src/rust/engine/process_execution/src/remote.rs

@@ -126,7 +128,7 @@ impl super::CommandRunner for CommandRunner {
  ///
  /// TODO: Request jdk_home be created if set.
  ///
-  fn run(&self, req: ExecuteProcessRequest) -> BoxFuture<FallibleExecuteProcessResult, String> {
+  fn run(&self, req: ExecuteProcessRequest, workunit_store: WorkUnitStore) -> BoxFuture<FallibleExecuteProcessResult, String> {


@illicitonion : Should this be using thread/task locals like logging does? AFAIK, most tracing implementations do.

I'm personally a big fan of explicit over implicit, but it could if we wanted to...

Well, tracing is a lot like logging. Doing the equivalent thing there and propagating our logger explicitly throughout our callstack to all the places we might want to log would probably be too much.

Maybe this is not similar to a logger, and so that doesn't make sense here... unknown.

As I use task locals in only a few places in this PR, I would prefer to leave it as it is.
But this is a useful suggestion, and it would be great to reconsider it when adding the right parent_id to all other v2 Nodes.
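
To make the trade-off concrete, here is a minimal sketch contrasting the two styles using only the standard library. All names are hypothetical, and workunits are reduced to their names. Note that a plain `thread_local!` follows a thread rather than a task, so work that moves between threads on a futures executor would need a task-local mechanism instead.

```rust
use std::cell::RefCell;

thread_local! {
  // Implicit, per-thread store: callers never pass it around.
  static THREAD_WORKUNITS: RefCell<Vec<String>> = RefCell::new(Vec::new());
}

// Implicit style: the signature does not reveal that a workunit is recorded.
pub fn run_implicit(name: &str) {
  THREAD_WORKUNITS.with(|workunits| workunits.borrow_mut().push(name.to_owned()));
}

// Drains and returns whatever the current thread has recorded so far.
pub fn take_thread_workunits() -> Vec<String> {
  THREAD_WORKUNITS.with(|workunits| std::mem::take(&mut *workunits.borrow_mut()))
}

// Explicit style: the store is a visible parameter, as in this PR.
pub fn run_explicit(name: &str, workunits: &mut Vec<String>) {
  workunits.push(name.to_owned());
}
```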

illicitonion

Looks great! A couple of tiny comments, otherwise about ready to merge :)

Can you look at the first two shards on Travis? There are a couple of trivial clippy things, and a reformatting is needed. All the other shards look like unrelated flakes.

illicitonion · 2019-07-08T14:14:07Z

src/rust/engine/workunit_store/src/lib.rs

+    format!("{:16.x}", random_u64)
+}
+
+pub fn got_workunits(workunit_store: WorkUnitStore) -> HashSet<WorkUnit> {


Let's rename this to something like: workunits_without_span_id or workunits_with_constant_span_id

Also, this only needs a reference to a WorkUnitStore, so let's take a &WorkUnitStore instead of a WorkUnitStore.

There's also a cute little construction where you could if you wanted (I don't think it's necessarily better here, but it's pretty cute and good to know about), write:

    workunit_store.get_workunits().lock().iter()
      .map(|workunit| WorkUnit { span_id: String::from("ignore"), ..workunit })
      .collect()

which means "Make me a new WorkUnit with all the fields set the same as workunit except span_id"

When I changed the code to

workunit_store.get_workunits().lock().iter().map(|workunit| WorkUnit { span_id: String::from("ignore"), ..workunit }).collect()

I got the following error:

error[E0308]: mismatched types
  --> workunit_store/src/lib.rs:76:109
   |
76 | workunit_store.get_workunits().lock().iter().map(|workunit| WorkUnit { span_id: String::from("ignore"), ..workunit }).collect()
   |                                                                                                           ^^^^^^^^ expected struct `WorkUnit`, found &WorkUnit
   |

I also cannot do:

workunit_store.get_workunits().lock().iter().map(|workunit| WorkUnit { span_id: String::from("ignore"), ..*workunit }).collect()

because that fails with "cannot move out of borrowed content".

WorkUnit { span_id: String::from("ignore"), ..workunit.clone() } works, so either way the workunit needs to be cloned.
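
A compilable sketch of the variant that works, with a simplified `WorkUnit` (the field set is an assumption) and a slice in place of the store. Struct update syntax needs an owned value after the `..`, which is why the borrowed `workunit` has to be cloned first.

```rust
// Hypothetical, simplified WorkUnit.
#[derive(Clone, Debug, PartialEq)]
pub struct WorkUnit {
  pub name: String,
  pub start_timestamp: f64,
  pub end_timestamp: f64,
  pub span_id: String,
  pub parent_id: Option<String>,
}

// Returns copies of the workunits with span_id replaced by a constant.
pub fn workunits_with_constant_span_id(workunits: &[WorkUnit]) -> Vec<WorkUnit> {
  workunits
    .iter()
    .map(|workunit| WorkUnit {
      span_id: String::from("ignore"),
      // `workunit` is a &WorkUnit, so clone it to get an owned value to take the remaining fields from.
      ..workunit.clone()
    })
    .collect()
}
```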

illicitonion · 2019-07-08T14:14:24Z

src/rust/engine/src/nodes.rs

+          start_timestamp,
+          end_timestamp,
+          span_id,
+          parent_id: None,


illicitonion · 2019-07-10T12:36:55Z

I just pushed a rebase because I introduced a bunch of conflicts in two other PRs; will merge when green! :)

cattibrie force-pushed the etyurina/zipkin_add_remote_spans branch 3 times, most recently from c44a0ce to 5ef5b75 Compare June 19, 2019 14:14

illicitonion requested review from illicitonion, patliu85, stuhood and blorente and removed request for illicitonion and patliu85 June 19, 2019 14:21

illicitonion reviewed Jun 19, 2019

blorente approved these changes Jun 20, 2019

cattibrie force-pushed the etyurina/zipkin_add_remote_spans branch from 5ef5b75 to 29097d4 Compare June 27, 2019 16:14

cattibrie changed the title ~~WIP: Add remote workunits for Zipkin trace~~ Add remote workunits for Zipkin trace Jun 27, 2019

cattibrie force-pushed the etyurina/zipkin_add_remote_spans branch from 29097d4 to 1cb27b3 Compare June 28, 2019 14:25

illicitonion reviewed Jul 1, 2019

stuhood reviewed Jul 2, 2019

cattibrie force-pushed the etyurina/zipkin_add_remote_spans branch from 1cb27b3 to edb8c80 Compare July 5, 2019 16:37

illicitonion approved these changes Jul 8, 2019

cattibrie force-pushed the etyurina/zipkin_add_remote_spans branch from edb8c80 to b54225f Compare July 9, 2019 23:24

illicitonion force-pushed the etyurina/zipkin_add_remote_spans branch from b54225f to ac1f978 Compare July 10, 2019 12:36

illicitonion merged commit 05188ed into pantsbuild:master Jul 10, 2019
