Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add InterruptedIterations metric #1769

Closed
wants to merge 10 commits into from

Conversation

imiric
Copy link
Contributor

@imiric imiric commented Dec 14, 2020

These are partial changes for #877, but I'm creating a draft PR to clear up a few things and get some feedback. It's probably easier to view this commit by commit.

Currently only scenarios 1 and 2 from this comment are working (cause: error and cause: fail). cause: interrupted is commented out as it needs a different approach, and I'm not sure how to handle the last ^C scenario (cause: signal would make sense for that).

I settled on using a Rate metric for interrupted_iterations, but opted to hack around actually emitting the 0 value in order to avoid an explosion of metric data, as this would be a duplicate of iterations, and given #1321 is still open we shouldn't exacerbate this problem. See e2b88ab.

@imiric imiric requested review from mstoykov and na-- December 14, 2020 15:59
@imiric
Copy link
Contributor Author

imiric commented Dec 14, 2020

Some UI questions:

fail('test') and throw 'test' currently look like this:

ERRO[0001] fail: test at github.com/loadimpact/k6/js/common.Bind.func1 (native)  executor=constant-vus scenario=default
ERRO[0001] test at file:///tmp/test-877-fail.js:5:11(8)  executor=constant-vus scenario=default

and the metric in the summary like this:

interrupted_iterations...: 57.14% ✓ 40  ✗ 30
  1. I would like to make the fail case also show the correct file and line number. I haven't dug into bridge.go yet and not sure if it's possible with goja, but is this something that we want?
  2. The way Rate metrics are currently rendered in the summary makes the and numbers kind of ambiguous. Does refer to the "successful"/"complete" iterations or how many were interrupted? And same for , it could be interpreted both ways.
    Currently is the Sink.Trues value, so these are the interrupted iterations, but not sure how we could make this clearer, besides using a Counter.

// TODO: Use enum?
// TODO: How to distinguish interruptions from signal (^C)?
// tags["cause"] = "duration"
// state.Samples <- stats.Sample{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where we would handle the 3rd scenario of executors interrupting the iteration, and while it worked in manual tests and TestMetricsEmission, it would cause other tests to hang (presumably because they were setting a nil channel), and I'm not sure if it wouldn't be racy, so I commented it out for now.

Is this the right place/approach, or would we have to receive two contexts here or something more complicated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Digging through the context chains we're using, this shouldn't be racy since the Samples channel is closed only when the globalCtx is done, which happens after all metrics processing is finished, so it should be usable even if this VU "run context" is done. Yet some of our tests use the same context for both, e.g.:

https://github.com/loadimpact/k6/blob/420dd4161c280bdf0398af5e58f73c939057695a/api/v1/status_routes_test.go#L108

This explains why it works in real-world tests but not in the test suite. I'll see if I can fix the tests then.

Ivan Mirić added 7 commits December 15, 2020 13:11
This avoids outputting "GoError" as mentioned in
#877 (comment)
This is definitely racy and causes hanging in tests, so it needs a
different approach.
This isn't racy in real world tests since the Samples channel isn't
closed until the global context is done.
@imiric imiric force-pushed the feat/877-interr-iters-metric branch from 9697bab to c12f449 Compare December 15, 2020 12:12
@codecov-io
Copy link

codecov-io commented Dec 15, 2020

Codecov Report

Merging #1769 (d52efc4) into master (420dd41) will increase coverage by 0.01%.
The diff coverage is 96.87%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1769      +/-   ##
==========================================
+ Coverage   71.47%   71.48%   +0.01%     
==========================================
  Files         178      178              
  Lines       13777    13829      +52     
==========================================
+ Hits         9847     9886      +39     
- Misses       3317     3329      +12     
- Partials      613      614       +1     
Flag Coverage Δ
ubuntu 71.44% <96.87%> (+0.02%) ⬆️
windows 70.06% <96.87%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
js/runner.go 80.62% <86.66%> (-0.49%) ⬇️
core/engine.go 91.08% <100.00%> (-1.88%) ⬇️
js/common/util.go 100.00% <100.00%> (ø)
js/modules/k6/k6.go 100.00% <100.00%> (ø)
lib/errors.go 91.30% <100.00%> (ø)
lib/executor/helpers.go 100.00% <100.00%> (+3.61%) ⬆️
lib/testutils/minirunner/minirunner.go 86.95% <100.00%> (+0.91%) ⬆️
loader/loader.go 79.28% <0.00%> (-3.58%) ⬇️
lib/execution.go 89.32% <0.00%> (-2.92%) ⬇️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 420dd41...d52efc4. Read the comment docs.

@@ -288,6 +288,14 @@ func (e *Engine) processMetrics(globalCtx context.Context, processMetricsAfterRu
if !e.NoThresholds {
e.processThresholds()
}
if iim, ok := e.Metrics[metrics.InterruptedIterations.Name]; ok {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This map lookup is racy according to the current TestSentReceivedMetrics failure. So I guess we'll need to rethink or abandon this hack...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty sure this breaks any output as you are only emitting the 1 to them ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it depends on how this metric is agreggated. If their system allows it they could factor it in with the iterations metric and get a percentage that way, which is kind of what we're doing here for the summary.

It makes no sense to me to emit the 0 value of this metric considering it's equivalent to iterations, and it would add a considerable amount of output data and affect the performance. A 5s/5VU test with an empty default and no sleep does 2115298 iterations on my machine and generates a 1.5 GB JSON file when emitting the 0 value. The same test without it manages to do 3203002 iterations and generate a 1.9 GB JSON file, but this is because it's doing ~50% more iterations.

Considering we already have complaints about outputting too much data and interest in #1321, I think we should rather work on #1321 before we consider adding another default metric.

Though we should probably rethink duplicating iterations...

How about instead of a new metric, we simply reused iterations and added an interrupted: "<cause>" tag to it? If the tag is omitted it's a complete iteration and if it's set then it was interrupted with the specific cause. WYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with prioritizing #1321, but that will actually probably require quite the refactoring, IMO even for an MVP.
The reusage of the iterations was discussed (I think) and we decided against it based on the fact the primary usage for this will be to check that a percentage of the iterations aren't actually interrupted, which with the current thresholds isn't possible if the metric isn't rate... so :(.

I think that maybe we should just make another PR with the changes that are in this PR (the test fixes and the lib.Exception) and merge that to both have them and make this PR easier to rebase and then work on #1321 so we can merge this then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reusage of the iterations was discussed (I think) and we decided against it based on the fact the primary usage for this will be to check that a percentage of the iterations aren't actually interrupted, which with the current thresholds isn't possible if the metric isn't rate

Then maybe that should be a feature? I'll see what I can do.

@@ -348,7 +348,7 @@ func TestBind(t *testing.T) {
}},
{"Error", bridgeTestErrorType{}, func(t *testing.T, obj interface{}, rt *goja.Runtime) {
_, err := RunString(rt, `obj.error()`)
assert.Contains(t, err.Error(), "GoError: error")
assert.Contains(t, err.Error(), "error")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be moved to another PR that we can merge today and make this ... more focused ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #1775.

tags["cause"] = er.Cause()
}
logger.Error(err.Error())
case fmt.Stringer:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This literally was worded as fmt.Stringer because I didn't want to add one more place where lib and lib/executor specifically depend on goja. So I think it is better if we don't actually start using it and also you need to fix the fact that now exceptions don't have "source=stacktrace" ;)

I do think now is a good time to actually add and error type (under lib), probably just an interface "Exception", that is implemented in the js package and wraps goja.Exception so that we don't use this directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hhmm sure, I'll give that a try, thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I managed to remove the goja dependency in d52efc4, but I'm not happy with how it turned out...

I wanted to treat all errors and exceptions consistently, avoid logging in runFn() and delegate that to getIterationRunner(), but this took an embarassing amount of trial/error and I'm still not sure all errors will be handled correctly. :-/

Some notes:

  • I didn't think making lib.Exception an interface was needed, as it's generic enough to have a single implementation.
  • This needs a lot more tests to ensure errors are rendered properly.
  • The source=stacktrace logger field doesn't make sense to me. If anything the source should be the same as the "cause" value, as "stacktrace" isn't really a "source".
  • Should all errors output the full stack trace? I was trying to keep it consistent with master, but maybe errors from fail() should only output a short single-line trace with the last stack frame.

This was motivated by wanting to remove the goja dependency from the
lib/executor package, see #1769 (comment)

The idea is for runFn to always return a lib.Exception error that can
handle all JS errors and Go panics consistently. In practice I'm not
sure if this hack is worth it as it might mess up handling of some
errors...
@na--
Copy link
Member

na-- commented Jun 15, 2022

We didn't merge this as it was because we realized we needed #1321 (or something like it) for this PR to not introduce performance problems. At this point it has so many merge conflicts that it'd be easier to start from scratch, so I'll close it to reduce noise.

@na-- na-- closed this Jun 15, 2022
@na-- na-- deleted the feat/877-interr-iters-metric branch June 15, 2022 16:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants