
Make the operator idempotent: resilient to state changes #138

Closed
yorugac opened this issue Aug 9, 2022 · 6 comments

@yorugac
Collaborator

yorugac commented Aug 9, 2022

This is an umbrella issue for the WIP investigation being done to resolve the problems described below.

There are two major issues in the k6 Operator that came up recently in different contexts. Firstly, there are hard-to-reproduce bugs like #128, which likely indicate that the execution flow itself is buggy. One possible culprit is repetition of the Reconcile call: such repetition is expected behaviour, but the current setup does not seem to handle it properly. Essentially, the operator is not idempotent.
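
For context, the pattern an idempotent controller is expected to follow is roughly the one below; this is a generic controller-runtime sketch, not k6-operator's actual code, and the Reconciler and ensureJob names are made up for illustration:

```go
package example // illustrative sketch only

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Reconciler is a placeholder embedding the controller-runtime client.
type Reconciler struct {
	client.Client
}

// ensureJob observes the current state and only creates what is missing,
// so a repeated Reconcile call converges to the same result instead of
// failing or creating duplicate jobs.
func (r *Reconciler) ensureJob(ctx context.Context, desired *batchv1.Job) error {
	var existing batchv1.Job
	err := r.Get(ctx, client.ObjectKeyFromObject(desired), &existing)
	if err == nil {
		return nil // already created by a previous Reconcile call: nothing to do
	}
	if !apierrors.IsNotFound(err) {
		return err
	}
	return r.Create(ctx, desired)
}
```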

The second issue is with blocked simultaneous test run executions, which came up in #126. It must be said that this was never an officially declared feature, but the operator was apparently used this way, and it has now become impossible because of the need to store additional state vars here:

testRunId string
token string
inspectOutput cloud.InspectOutput

More precisely, the duration of the test run has been handled in a blocking way since #112, so this change must be reverted or, better yet, the duration must be stored as part of the status to avoid the need to block in the first place.
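
As a minimal sketch of the status-based alternative (the K6Status shape and Duration field here are assumptions, not the operator's actual CRD):

```go
package example // illustrative sketch only

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// K6Status is a hypothetical status shape: the parsed test duration is
// persisted once, so later reconciliations read it back instead of the
// controller blocking on it in-process.
type K6Status struct {
	Stage    string          `json:"stage,omitempty"`
	Duration metav1.Duration `json:"duration,omitempty"` // assumed field
}

// runShouldBeFinished lets a later reconciliation decide whether the run is
// over without ever having blocked for testDuration inside Reconcile.
func runShouldBeFinished(started metav1.Time, status K6Status) bool {
	return time.Since(started.Time) >= status.Duration.Duration
}
```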

Both of these issues are tied to how the operator handles the state of a test run. Currently the operator relies on the string field stage of the CRD's status to direct the execution flow. Given the above, that no longer seems like the correct solution. The Kubernetes API conventions essentially promise eventual consistency between the spec and status of a resource, so the execution flow of the operator must be updated to handle the declarative approach of the Kubernetes API correctly.

Additional related info: the conditions enhancement can be referenced to add the informational part of the CRD's status.
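
For illustration, the standard conditions machinery could look roughly like this; the status shape and condition names below are assumptions rather than the operator's actual API:

```go
package example // illustrative sketch only

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// K6Status is a hypothetical status shape with standard Conditions alongside
// the current stage string.
type K6Status struct {
	Stage      string             `json:"stage,omitempty"`
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

// markStarted records that the runners received the start signal. Setting the
// same condition again on a repeated Reconcile call is a no-op, which is
// exactly the behaviour an idempotent operator needs.
func markStarted(status *K6Status, generation int64) {
	meta.SetStatusCondition(&status.Conditions, metav1.Condition{
		Type:               "TestRunStarted", // assumed condition type
		Status:             metav1.ConditionTrue,
		Reason:             "RunnersStarted",
		Message:            "all runner jobs received the start signal",
		ObservedGeneration: generation,
	})
}
```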

Relevant comments would be welcome.

@yorugac yorugac added Priority: High FSM Connected to FSM labels Aug 9, 2022
@yorugac yorugac self-assigned this Aug 9, 2022
@mugioka
Contributor

mugioka commented Aug 11, 2022

In k6-operator, wait.PollImmediate is used to implement the timeout in initialize/start/finish, but I think this approach is not so good because it blocks the reconciliation loop, e.g.:

err := wait.PollImmediate(time.Second*30, testDuration+time.Minute*2, func() (done bool, err error) {

I think the timeout can be implemented without blocking the reconciliation loop by annotating the k6 CR with a timeout value, and having each action return an error once that value is exceeded.
And I think it is possible to requeue until the timeout by returning the following result in the places that currently use wait.PollImmediate:

if err := r.List(ctx, jl, opts); err != nil {
	return ctrl.Result{
		Requeue:      true,
		RequeueAfter: 5 * time.Second,
	}, fmt.Errorf("could not list jobs: %w", err)
}
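
To make that concrete, here is a rough sketch of the annotation-based, non-blocking timeout idea; the annotation key and helper name are illustrative assumptions, not part of k6-operator:

```go
package example // illustrative sketch only

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

// deadlineAnnotation is a made-up annotation key holding an RFC3339 deadline.
const deadlineAnnotation = "k6.io/reconcile-deadline"

// checkDeadline is meant to be called at the top of an action instead of
// polling in place: it errors out once the deadline has passed and otherwise
// asks to be requeued, so the reconciliation loop is never blocked.
func checkDeadline(obj metav1.Object) (ctrl.Result, error) {
	raw, ok := obj.GetAnnotations()[deadlineAnnotation]
	if !ok {
		return ctrl.Result{}, nil // no deadline set, nothing to enforce
	}
	deadline, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return ctrl.Result{}, fmt.Errorf("invalid %s annotation: %w", deadlineAnnotation, err)
	}
	if time.Now().After(deadline) {
		return ctrl.Result{}, fmt.Errorf("timed out waiting for the test run")
	}
	return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
}
```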

@prashanth-volvocars

I am running the k6 operator in AWS EKS with Fargate and was facing issue #128, mainly because Fargate takes more than a minute for a pod to get assigned to a node. I increased the timeout to 3 minutes in initialise and start, and that seems to have solved the problem. I understand it's not a perfect fix, because in case of an error in the initialising phase it would still not be able to recover. Also, I can run tests simultaneously, so I am not sure what #126 means.

@yorugac
Collaborator Author

yorugac commented Mar 9, 2023

Lately this issue has been prioritized internally, in part thanks to PR #169 🎉

There has been some confusion around this issue, as to what exactly it involves, as well as a lack of clarity around the term "idempotency". So I'll try to clarify a bit 🙂

AFAIK, "idempotency" is a familiar term in the Kubernetes community, but it may not always carry exactly the same meaning as the mathematical term. Regardless, it is definitely meant in a broader sense here in k6-operator: there are a number of inter-connected problems that are all related to how the operator treats state. Some of them have already been linked here. I nicknamed them "idempotency issues" simply for lack of another term (and lack of time to search for one).

To be clearer, both about the problem and about current expectations, I've made a list of test cases that I'm looking into to pinpoint the exact problems we need to solve. If there's another, significantly different test case that should be added, please feel free to mention it in the comments.

A list of test cases, together with some notes on current behaviour:

  1. General flow with one test run
  • normal test run, with zero exit
  • restart of a runner
    • if it happens before test execution is completed on this runner: the new runner does not receive a starting signal and remains paused while the operator waits indefinitely.
    • if it happens after test execution is completed on this runner: rather hard to simulate, which raises the question of how likely it is in reality
  • user abortion of the test run
    • from inside the script with abort(): seems OK, only reaches Finished instead of Error (see Revisit error status definition #196)
    • deletion of the jobs before test run completion: controller waits indefinitely
    • deletion of CR before test run completion: fine in OSS, but with cloud output there is no finalization of the test run
  • restart of the test run (i.e. delete and apply in quick succession): OSS test run seems to continue execution
  • apply of YAML with the same name but different options in the middle of the test run (note: this should arguably be a truly idempotent operation by definition; see the sketch after this list): we'll get a message k6.k6.io/k6-sample configured and then the loop can break. If the number of nodes increases, the controller gets stuck waiting. If it decreases, it might be fine, but again, not in the cloud output case.
  • should resources be deleted when they reach an erroneous state? (Perhaps this part should be moved to Revisit error status definition #196)
    • from initializer (this can happen in case of syntax errors in script or bad arguments line, etc.)
    • from starter (when can this happen?)
    • from runner (when k6 exits with error during test execution)
  • modification of the CR during the test run: message k6.k6.io/k6-sample edited, but otherwise the same as the re-apply case above
  • restart of the operator
  • operator starts after k6 CR is applied - operator should still pick it up
  2. Simultaneous test runs
  • normal test run of 2+ CRs
    • tests are started one by one, with some small period of time in between
    • tests are started simultaneously
  • simultaneous tests combined with the above cases from point 1
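
Regarding the "same name, different options" re-apply case from point 1, a common way to detect a mid-run spec change is to compare the generation the API server assigns to the spec with one recorded in status; a minimal sketch, assuming a hypothetical ObservedGeneration status field:

```go
package example // illustrative sketch only

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// K6Status is a hypothetical status shape; ObservedGeneration records the
// spec generation the operator last acted on.
type K6Status struct {
	ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}

// specChangedMidRun reports whether the resource was re-applied with a
// different spec while a test run was in progress: the API server bumps
// metadata.generation on every spec change, so a mismatch means the running
// test no longer matches the declared spec.
func specChangedMidRun(obj metav1.Object, status K6Status) bool {
	return obj.GetGeneration() != status.ObservedGeneration
}
```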

These cases are considered in two possible configurations: OSS test run and k6 Cloud (cloud output) test run. At the moment of writing, most of these cases are supported for OSS test runs, with a few notable exceptions.

Current conclusion: the main problem is fixing the major pain points with k6 Cloud test runs. Another problem: the operator does not have any timeout mechanism at the moment, so if a test run enters an infinite-execution scenario, the operator will wait for completion without ever timing out. Fixing and clarifying these two problems are considered the main priorities at the moment.

Just in case: I don't think the goal should be to support all of the above test cases, but rather as many as possible and, hopefully, to be clear about what is not going to work. We also need to weigh the probability of a test case occurring in a real situation against the effort required to resolve it.

@pablochacin
Collaborator

pablochacin commented Mar 9, 2023

I nicknamed them "idempotency issues" simply due to lack of another term (and lack of time to search for another term).

After reading the issues described above, I think "idempotency" is not the proper term and may be confusing. I suggest using a more generic and possibly more neutral term such as "state handling".

@yorugac
Collaborator Author

yorugac commented Mar 13, 2023

Thanks for the feedback @pablochacin! I kind of agree with you 😄 That's how the label "FSM" appeared, after all. Probably the more correct terminology would be ~ "resilient to state changes", which includes idempotency.

But the term "idempotency" is rather distinctive, even if not exhaustive in this case, and it'd be equally confusing to remove it at this point, IMO.
Update: adjusted the name of the issue.

@yorugac yorugac changed the title Make the operator idempotent Make the operator idempotent: resilient to state changes Mar 13, 2023
@yorugac
Collaborator Author

yorugac commented Oct 10, 2023

The main goals of this issue have been achieved: the operator acts as an "idempotent operator" in most cases. If there are remaining edge cases, it makes sense to open additional issues with specific instructions on how to reproduce them.

As a follow-up to the list above, the following issues exist:
