Handle k6 exit codes #75

Open
b0nete opened this issue Sep 27, 2021 · 13 comments

b0nete commented Sep 27, 2021

Hi, I'm running load tests in my Kubernetes cluster, but I have a problem when the tests fail.

I need the tests to be executed only once: whether they succeed or fail, they should not be executed again.
Currently, if the tests pass they are not executed again, but if a test threshold fails, a starter container is automatically created and another pod is launched to run the test again.

I'm leaving my config files here. I tried setting abortOnFail in the thresholds and using the abortTest() function, but the problem persists.
I think it is k6-operator behaviour; maybe you can help me.

This is my test file.

apiVersion: v1
kind: ConfigMap
metadata:
  name: k6-test
  namespace: k6-operator-system
data:
  test.js: |
    import http from 'k6/http';
    import { Rate } from 'k6/metrics';
    import { check, sleep, abortTest } from 'k6';

    const failRate = new Rate('failed_requests');

    export let options = {
      stages: [
        { target: 1, duration: '1s' },
        { target: 0, duration: '1s' },
      ],
      thresholds: {
        failed_requests: [{threshold: 'rate<=0', abortOnFail: true}],
        http_req_duration: [{threshold: 'p(95)<1', abortOnFail: true}],
      },
    };

    export default function () {
      const result = http.get('http://test/login/');
      check(result, {
        'http response status code is 200': result.status === 500,
      });
      failRate.add(result.status !== 200);
      sleep(1);
      abortTest();
    }

And this is my k6 definition.

apiVersion: k6.io/v1alpha1
kind: K6
metadata:
  name: k6-sample
  namespace: k6-operator-system
spec:
  parallelism: 1
  script:
    configMap:
      name: k6-test
      file: test.js
  arguments: --out influxdb=http://influxdb.influxdb:8086/test
  scuttle:
    enabled: "false"

I hope you can help me, thanks!

@knechtionscoding
Contributor

So, I think this is because k6 exits with a non-zero exit code, so the k6 operator keeps retrying until it succeeds.

We could probably add an option to the CRD, such as restart never, and have k6-operator interpret it.

@yorugac
Collaborator

yorugac commented Dec 20, 2021

@b0nete thanks for opening the issue!

I agree with @knechtionscoding that this happens because of the non-zero exit of k6 run. It seems that the number of completions for the k8s job is 1 by default, so the operator expects at least one successful exit. Another curious thing is that I don't actually observe multiple test runs when I try this scenario: the 1st runner fails with a non-zero exit, then the 2nd runner is created and gets stuck in the "paused" state. This likely happens because the 1st starter finished successfully and the operator doesn't have any additional logic for this case: no 2nd starter is created and the 2nd runner waits indefinitely to be started.

IMO, this shouldn't be the default behavior: if thresholds fail, it is a reason for someone to look into the SUT and the script and figure out what to do with that. So k6-operator shouldn't be restarting any pods on failing thresholds 🤔

yorugac added the bug and evaluation needed labels on Dec 20, 2021
@yorugac
Collaborator

yorugac commented Dec 21, 2021

Looking at https://github.com/grafana/k6/blob/master/errext/exitcodes/codes.go:

| k6 error | exit code | meaning in k6-operator context | restart the runner? | is startup-only error? |
|---|---|---|---|---|
| CloudTestRunFailed | 97 | this error should never happen in k6-operator | no | - |
| CloudFailedToGetProgress | 98 | this error should never happen in k6-operator | no | - |
| ThresholdsHaveFailed | 99 | regular error, action is to be determined by user | no | - |
| SetupTimeout | 100 | regular error, likely the script or configuration needs to be reviewed | no | - |
| TeardownTimeout | 101 | regular error, likely the script or configuration needs to be reviewed | no | - |
| GenericTimeout | 102 | regular error, likely the script or configuration needs to be reviewed | no | - |
| GenericEngine | 103 | something going wrong in k6 setup and must be investigated | no | |
| InvalidConfig | 104 | regular error, test config should be reviewed | no | - |
| ExternalAbort | 105 | os.Interrupt, SIGINT or SIGTERM are regular errors but everything else should never happen in k6-operator | yes* | no |
| CannotStartRESTAPI | 106 | runner cannot be started without working REST | yes | yes |
| ScriptException | 107 | regular error, script must be reviewed | no | - |
| ScriptAborted | 108 | regular error, script must be reviewed | no | - |

* Unless there is a point in restarting on SIGINT and SIGTERM specifically? Other cases of ExternalAbort happen in k6 cloud execution, which is not used in the operator. During k6 run, ExternalAbort implies interrupts, SIGINTs and SIGTERMs.
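
To make the table above concrete, here is a minimal sketch of how it could be turned into a lookup. It is plain JavaScript and not k6-operator code: the helper name `shouldRestartRunner` and the choice to never restart on unknown codes are assumptions for illustration. The codes and the restart flags are copied from the table.

```javascript
// Exit codes and "restart the runner?" values as listed in the table above.
// shouldRestartRunner is a hypothetical helper, not part of k6 or k6-operator.
const EXIT_CODES = {
  97: { name: 'CloudTestRunFailed', restart: false },
  98: { name: 'CloudFailedToGetProgress', restart: false },
  99: { name: 'ThresholdsHaveFailed', restart: false },
  100: { name: 'SetupTimeout', restart: false },
  101: { name: 'TeardownTimeout', restart: false },
  102: { name: 'GenericTimeout', restart: false },
  103: { name: 'GenericEngine', restart: false },
  104: { name: 'InvalidConfig', restart: false },
  105: { name: 'ExternalAbort', restart: true }, // "yes*" in the table
  106: { name: 'CannotStartRESTAPI', restart: true },
  107: { name: 'ScriptException', restart: false },
  108: { name: 'ScriptAborted', restart: false },
};

function shouldRestartRunner(exitCode) {
  if (exitCode === 0) {
    return false; // the test run succeeded, nothing to restart
  }
  const entry = EXIT_CODES[exitCode];
  // Assumption: codes missing from the table are treated as "do not restart".
  return entry ? entry.restart : false;
}

console.log(shouldRestartRunner(99));  // failed thresholds
console.log(shouldRestartRunner(106)); // REST API could not start
```

With this mapping, a failed threshold (99) would leave the runner alone, while CannotStartRESTAPI (106) would be a candidate for a restart.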

EDIT 17 Feb: updated the table with Simme's input and additional info.

@simskij
Contributor

simskij commented Dec 26, 2021

  • CannotStartRESTAPI should probably lead to a reschedule, as this likely is caused by networking issues on the cluster node.
  • ExternalAbort is also (most) likely to happen due to timing/scheduling issues because of pod eviction policies being triggered, and there is a pretty high chance that rescheduling the job would resolve that.

Do note that I use the term reschedule rather than restart, though. Restarting the exact same pod would likely lead to another failure, but allowing k8s to destroy the pod and reschedule it (preferably even to another node) might not.

@yorugac
Collaborator

yorugac commented Jan 5, 2022

  • CannotStartRESTAPI should probably lead to a reschedule, as this likely is caused by networking issues on the cluster node.

Good point! There should be a limit to the number of such restarts, though.
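
The idea of rescheduling with a cap can be sketched roughly as follows. This is plain JavaScript for illustration only: `nextAction` and `maxReschedules` are hypothetical names, and no such option exists in k6-operator as far as this thread shows. The two reschedulable codes are the ones suggested in the comments above.

```javascript
// Hypothetical decision helper: reschedule only for the exit codes discussed
// above, and only while the (assumed) maxReschedules budget is not used up.
const RESCHEDULABLE = new Set([
  105, // ExternalAbort
  106, // CannotStartRESTAPI
]);

function nextAction(exitCode, reschedulesSoFar, maxReschedules) {
  if (exitCode === 0) {
    return 'done';
  }
  if (RESCHEDULABLE.has(exitCode) && reschedulesSoFar < maxReschedules) {
    return 'reschedule';
  }
  return 'fail';
}

console.log(nextAction(106, 0, 2)); // first failure, budget left
console.log(nextAction(106, 2, 2)); // budget exhausted
console.log(nextAction(99, 0, 2));  // failed thresholds are never rescheduled
```

Keeping the counter outside the pod (for example on the custom resource status) would be one way to make the limit survive pod deletion, but that is a design assumption, not something decided in this issue.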

@yorugac
Collaborator

yorugac commented Feb 18, 2022

In PR #86, the backoff limit for runner jobs was set to 0, which disables all restarts regardless of exit code. It's a partial solution to this issue. Cases where there should be a restart (as noted in the comments above) should be solved separately.
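
To illustrate why a backoff limit of 0 stops the repeated runs, here is a simplified model in plain JavaScript. It assumes a Job with one completion whose pods are never restarted in place, and it assumes the Job is marked failed once the number of failed pods exceeds the backoff limit. `countPodRuns` is a hypothetical helper, not Kubernetes or k6-operator code, and real Jobs also apply back-off delays that are ignored here.

```javascript
// Simplified model: each entry in exitCodes is the exit code a new runner pod
// would produce. Pods are created until one exits with 0 or the number of
// failures exceeds backoffLimit.
function countPodRuns(exitCodes, backoffLimit) {
  let failures = 0;
  let runs = 0;
  for (const code of exitCodes) {
    runs += 1;
    if (code === 0) {
      return { runs, succeeded: true };
    }
    failures += 1;
    if (failures > backoffLimit) {
      return { runs, succeeded: false };
    }
  }
  return { runs, succeeded: false };
}

// A test whose thresholds always fail exits with 99 on every attempt.
const alwaysFailing = [99, 99, 99, 99, 99, 99, 99, 99];

console.log(countPodRuns(alwaysFailing, 0).runs); // 1
console.log(countPodRuns(alwaysFailing, 6).runs); // 7
```

Under this model, a backoff limit of 0 yields exactly one runner pod per test, whereas 6 (the Kubernetes default for Jobs) would allow up to seven attempts for a test that can never pass.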

@jsravn

jsravn commented Mar 30, 2022

Any progress on this? It blocks usage of the operator for me unfortunately. I'm thinking as a workaround, I could patch the job after the operator creates it.

@yorugac
Collaborator

yorugac commented Mar 31, 2022

Hi @jsravn, as described in the last comment before yours, this was partially fixed in 0cdcc9d as part of PR #86. I expected that PR to be merged in by now but it's being delayed due to other issues 😞

I'll pull out this specific commit with backoff tomorrow so that it can be merged into main branch independently from #86. Please watch for the updates 🙂

@mhaddon

mhaddon commented Apr 28, 2022

Was this merged up? @yorugac

@yorugac
Collaborator

yorugac commented Apr 29, 2022

@mhaddon yes, the fix is in main (2780355), so the latest image from the main branch contains it.

@mhaddon

mhaddon commented Apr 29, 2022

What image is that? I tried v0.0.7rc4 (https://github.com/grafana/k6-operator/tree/v0.0.7rc4/config/default) and it doesn't have the fix.

ghcr.io/grafana/operator:latest

Or do I build it myself?

@yorugac
Collaborator

yorugac commented Apr 29, 2022

No, you don't need to build it; the image is published with the commit hash as its tag:
ghcr.io/grafana/operator:278035580ffaa523b1a62f02e801fe7e35c7c5ab
You can find all the images built for operator on this page:
https://github.com/grafana/k6-operator/pkgs/container/operator

@yorugac
Collaborator

yorugac commented Mar 17, 2023

Connected issue in k6: grafana/k6#2804

yorugac changed the title from "Avoid test be executed again when it fails." to "Handle k6 exit codes" on Apr 25, 2023
yorugac added the PLZ label on Apr 27, 2023
yorugac removed the bug label on Aug 22, 2024

6 participants