Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add remote workunits for Zipkin trace #7897

Merged

Conversation

cattibrie
Copy link
Contributor

@cattibrie cattibrie commented Jun 19, 2019

Problem

As the purpose of the remoting is to make pants runs faster (for compile) it is very important to understand its performance. One of the nice ways is Zipkin tracing.

Solution

The result of the remote execution contains timings of different stages of the remote process that are used to create workunits to Zipkin trace.

The last part of this PR is to add a test(unitest and/or integration test).

@cattibrie cattibrie force-pushed the etyurina/zipkin_add_remote_spans branch 3 times, most recently from c44a0ce to 5ef5b75 Compare June 19, 2019 14:14
@illicitonion illicitonion requested review from illicitonion, patliu85, stuhood and blorente and removed request for illicitonion and patliu85 June 19, 2019 14:21
Copy link
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good :) Some minor things, and we'll need some tests that you're working on :)

@@ -211,7 +212,7 @@ impl super::CommandRunner for CommandRunner {
///
/// Runs a command on this machine in the passed working directory.
///
fn run(&self, req: ExecuteProcessRequest) -> BoxFuture<FallibleExecuteProcessResult, String> {
fn run(&self, req: ExecuteProcessRequest, _workunit_store: Arc<WorkUnitStore>) -> BoxFuture<FallibleExecuteProcessResult, String> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a TODO here to populate workunits for local process executions. We could reasonably have workunits for each of:
materialize input files
execution
ingest output files

(and I have a not-so-secret plan that we'll start also populating workunits for each store action (at least the remote ones) for things like "download file from remote" :))

fn get_workunits(&self) -> &Mutex<Vec<WorkUnit>>;
fn add_workunit(&self, workunit: WorkUnit);
}

pub struct SafeWorkUnitStore {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's safe about this, to influence it's name?

If Context itself isn't going to implement the WorkUnitStore trait, then we could get rid of the trait, and just have a struct named WorkUnitStore that we use everywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided to call it safe because it is a Mutex of a value(Vec). So we wanted to say that it is safe to work with it from several threads.
What could be a better name in this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just WorkUnitStore - Rust guarantees that you'll always be able to work with things properly across threads (and specifically the Send and Sync marker traits are used to denote these kinds of safety) :)

fn get_workunits(&self) -> &Mutex<Vec<WorkUnit>>;
fn add_workunit(&self, workunit: WorkUnit);
}

pub struct SafeWorkUnitStore {
pub workunits: Mutex<Vec<WorkUnit>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this needs to be pub?

fn get_workunits(&self) -> &Mutex<Vec<WorkUnit>>;
fn add_workunit(&self, workunit: WorkUnit);
}

pub struct SafeWorkUnitStore {
pub workunits: Mutex<Vec<WorkUnit>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be worth having this field be an Arc<Mutex<Vec<WorkUnit>>> so that all of the callers don't need to wrap it in an Arc. Or it may not :)

@@ -736,6 +753,30 @@ impl CommandRunner {
}
}

fn maybe_add_workunit(result_cached: &bool, name: &str, start_time: &Timespec, end_time: &Timespec, parent_id: Option<String>, workunit_store: &Arc<WorkUnitStore>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of weird to take a &bool - clippy will probably suggest that you just take a bool.

The reason for this is that a bool is just one small amount of memory to copy, and is probably actually cheaper to just copy than using references, and any time you use a reference you make the compiler's job slightly harder.

}

fn timespec_as_float_secs(timespec: &Timespec) -> f64 {
// Returning value is formed by representing duration as a hole number of seconds (u64) plus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably remove this comment - it's just describing what the code says and not adding much value on its own :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably keep the last line though, because we are losing precision and I can imagine myself forgetting about that very easily, and not getting that from the code.

Copy link
Contributor

@blorente blorente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work!

}

fn timespec_as_float_secs(timespec: &Timespec) -> f64 {
// Returning value is formed by representing duration as a hole number of seconds (u64) plus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably keep the last line though, because we are losing precision and I can imagine myself forgetting about that very easily, and not getting that from the code.

start_timestamp,
end_timestamp,
span_id,
parent_id: None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a TODO (ideally with an issue number) to have this not be None soon (excited!)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@cattibrie cattibrie force-pushed the etyurina/zipkin_add_remote_spans branch from 5ef5b75 to 29097d4 Compare June 27, 2019 16:14
@cattibrie cattibrie changed the title WIP: Add remote workunits for Zipkin trace Add remote workunits for Zipkin trace Jun 27, 2019
@cattibrie cattibrie force-pushed the etyurina/zipkin_add_remote_spans branch from 29097d4 to 1cb27b3 Compare June 28, 2019 14:25
Copy link
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Apart from one question about how we're asserting equality, all really minor comments :)

}

fn timespec_as_float_secs(timespec: &Timespec) -> f64 {
// Reverting time from duration to f64 decrease precision.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth linking to rust-lang/rust#54361 saying that there's an unstable standard library feature we'd like to use, but we're copying their implementation until it's stabilised.

@@ -2489,6 +2529,70 @@ mod tests {
)
}

#[test]
fn check_that_remote_workunits_are_in_workunit_store() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tend not to use check_that or similar prefixes in our test names, because that what all tests do, so it's typically redundant. Maybe call this stores_workunits or remote_workunits_are_stored?

StderrType::Raw(testdata_empty.string()),
0,
)
.op
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This .op.unwrap().unwrap() points to a slightly wrong abstraction here. make_successful_operation_with_metadata returns a MockOperation, which maybe wraps an Operation. It looks like you want a function that just makes an Operation.

I think this should be pretty easy to do; just make make_successful_operation_with_maybe_metadata return an Operation, and do the MockOperation wrapping in make_successful_operation instead.

let mut runtime = tokio::runtime::Runtime::new().unwrap();

let workunit_store_2 = workunit_store.clone();
runtime.block_on(futures::future::ok(()).and_then(move |()| command_runner.extract_execute_response(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly more clear than the futures::future::ok(()).and_then(move |()| could be: futures::future::lazy(move ||

let workunits_arc = workunit_store.get_workunits();
let workunits = workunits_arc.lock();
let scheduling_workunit = &workunits[0];
assert_workunit_params(scheduling_workunit, "scheduling", 0.0, 1.0, None);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the order of WorkUnits in the store actually matter? If not, we may not want to assert on it...

And in general, it's nice to be able to assert equality on two WorkUnits rather than have to go through field by field... We may want to (or it may be confusing and overkill...) do something like:

let got_workunits: HashSet<SpanIdIgnoringWorkUnits> = workunit_store.get_workunits().lock().iter().cloned().map(WorkUnitIgnoringSpanId::from_workunit).collect()

let want_workunits = hashset! {
  WorkUnitIgnoringSpanId { ... },
  WorkUnitIgnoringSpanId { ... },
  WorkUnitIgnoringSpanId { ... },
  WorkUnitIgnoringSpanId { ... },
};

assert_eq!(want_workunits, got_workunits)

let mut metadata = ExecutedActionMetadata::new();
metadata.set_queued_timestamp(timestamp_only_secs(0));
metadata.set_worker_start_timestamp(timestamp_only_secs(1));
metadata.set_worker_completed_timestamp(timestamp_only_secs(8));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this line to the end, so that the order of code reflects the order of timestamps

NodeKey::Snapshot(n) => n.run(context).map(NodeResult::from).to_boxed(),
NodeKey::Task(n) => n.run(context).map(NodeResult::from).to_boxed(),
}
futures::future::ok(()).and_then(|()| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, maybe futures::future::lazy(|| (but not big deal either way)

self.workunits.lock().push(workunit);
}

pub fn len(&self) -> usize {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len seems like a kind of weird operation to have on a WorkUnitStore, and it looks like it's only used in a test. Maybe in the test instead do store.get_workunits().lock().len()?

// only if '--reporting-zipkin-trace-v2' is set
if !result_cached {
let workunit = WorkUnit {
name: String::from(name),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to use slightly more verbose names, something like "remote execution worker input fetching" and similar

@@ -126,7 +128,7 @@ impl super::CommandRunner for CommandRunner {
///
/// TODO: Request jdk_home be created if set.
///
fn run(&self, req: ExecuteProcessRequest) -> BoxFuture<FallibleExecuteProcessResult, String> {
fn run(&self, req: ExecuteProcessRequest, workunit_store: WorkUnitStore) -> BoxFuture<FallibleExecuteProcessResult, String> {
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@illicitonion : Should this be using thread/task locals like logging does? AFAIK, most tracing implementations do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm personally a big fan of explicit over implicit, but it could if we wanted to...

Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, tracing is a lot like logging. Doing the equivalent thing there and propagating our logger explicitly throughout our callstack to all the places we might want to log would probably be too much.

Maybe this is not similar to a logger, and so that doesn't make sense here... unknown.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I use task locals in not many places in this PR I would prefer to leave it as it is.
But this is a useful suggestion and it would be great to reconsider it when adding right parent_id to all other v2 Nodes.

@cattibrie cattibrie force-pushed the etyurina/zipkin_add_remote_spans branch from 1cb27b3 to edb8c80 Compare July 5, 2019 16:37
Copy link
Contributor

@illicitonion illicitonion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! A couple of tiny comments, otherwise about ready to merge :)

Can you look at the first two shards on travis? There are a couple of trivial clippy things, and a reformatting needed. All the other shards look like unrelated flakes

format!("{:16.x}", random_u64)
}

pub fn got_workunits(workunit_store: WorkUnitStore) -> HashSet<WorkUnit> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename this to something like: workunits_without_span_id or workunits_with_constant_span_id

Also, this only needs a reference to a WorkUnitStore, so let's take a &WorkUnitStore instead of a WorkUnitStore.

There's also a cute little construction where you could if you wanted (I don't think it's necessarily better here, but it's pretty cute and good to know about), write:

workunit_store.get_workunits().lock().iter()
  .map(|workunit| WorkUnit { span_id: String::from("ignore"), **workunit })
  .collect()

which means "Make me a new WorkUnit with all the fields set the same as workunit except span_id"

Copy link
Contributor Author

@cattibrie cattibrie Jul 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I changed the code to

 workunit_store.get_workunits().lock().iter().map(|workunit| WorkUnit { span_id: String::from("ignore"), ..workunit }).collect()

I got the next error:

error[E0308]: mismatched types
  --> workunit_store/src/lib.rs:76:109
   |
76 |   workunit_store.get_workunits().lock().iter().map(|workunit| WorkUnit { span_id: String::from("ignore"), ..workunit }).collect()
   |                                                                                                             ^^^^^^^^ expected struct `WorkUnit`, found &WorkUnit
   |

I cannot also do:

 workunit_store.get_workunits().lock().iter().map(|workunit| WorkUnit { span_id: String::from("ignore"), ..*workunit }).collect()

Because cannot move out of borrowed content

Copy link
Contributor Author

@cattibrie cattibrie Jul 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WorkUnit { span_id: String::from("ignore"), ..workunit.clone() } works
So anyway need to clone the workunit

start_timestamp,
end_timestamp,
span_id,
parent_id: None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@cattibrie cattibrie force-pushed the etyurina/zipkin_add_remote_spans branch from edb8c80 to b54225f Compare July 9, 2019 23:24
### Problem
As the purpose of the remoting is to make pants runs faster (for compile) it is very important to understand its performance. One of the nice ways is Zipkin tracing.

### Solution
The result of the remote execution contains timings of different stages of the remote process that are used to create workunits to Zipkin trace.

The last part of this PR is to add a test(unitest and/or integration test).
@illicitonion illicitonion force-pushed the etyurina/zipkin_add_remote_spans branch from b54225f to ac1f978 Compare July 10, 2019 12:36
@illicitonion
Copy link
Contributor

I just pushed a rebase because I introduced a bunch of conflicts in two other PRs; will merge when green! :)

@illicitonion illicitonion merged commit 05188ed into pantsbuild:master Jul 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants