
Make the operator idempotent: resilient to state changes #138

Closed
yorugac opened this issue Aug 9, 2022 · 6 comments

@yorugac
Collaborator

yorugac commented Aug 9, 2022

This is an umbrella issue for the WIP investigation being done to resolve the problems described below.

There are two major issues in the k6 Operator that came up recently in different contexts. Firstly, there are hard-to-reproduce bugs like #128, which likely indicate that the execution flow itself is buggy. One possible culprit is repetition of the Reconcile call: such repetition is expected behaviour, but the current setup does not seem to handle it properly. Essentially, the operator is not idempotent.
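
For context, the pattern an idempotent controller is expected to follow is roughly the one below; this is a generic controller-runtime sketch, not k6-operator's actual code, and the Reconciler and ensureJob names are made up for illustration:

```go
package example // illustrative sketch only

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Reconciler is a placeholder embedding the controller-runtime client.
type Reconciler struct {
	client.Client
}

// ensureJob observes the current state and only creates what is missing,
// so a repeated Reconcile call converges to the same result instead of
// failing or creating duplicate jobs.
func (r *Reconciler) ensureJob(ctx context.Context, desired *batchv1.Job) error {
	var existing batchv1.Job
	err := r.Get(ctx, client.ObjectKeyFromObject(desired), &existing)
	if err == nil {
		return nil // already created by a previous Reconcile call: nothing to do
	}
	if !apierrors.IsNotFound(err) {
		return err
	}
	return r.Create(ctx, desired)
}
```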

The second issue is with blocked simultaneous test run executions, which came up in #126. It must be said that this was never an officially declared feature, but the operator was apparently used this way, and it has now become impossible because of the need to store additional state vars here:

testRunId string
token string
inspectOutput cloud.InspectOutput

More precisely, the duration of the test run has been handled in a blocking way since #112, so this change must be reverted or, better yet, the duration must be stored as part of the status to avoid the need to block in the first place.
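
As a minimal sketch of the status-based alternative (the K6Status shape and Duration field here are assumptions, not the operator's actual CRD):

```go
package example // illustrative sketch only

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// K6Status is a hypothetical status shape: the parsed test duration is
// persisted once, so later reconciliations read it back instead of the
// controller blocking on it in-process.
type K6Status struct {
	Stage    string          `json:"stage,omitempty"`
	Duration metav1.Duration `json:"duration,omitempty"` // assumed field
}

// runShouldBeFinished lets a later reconciliation decide whether the run is
// over without ever having blocked for testDuration inside Reconcile.
func runShouldBeFinished(started metav1.Time, status K6Status) bool {
	return time.Since(started.Time) >= status.Duration.Duration
}
```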

Both of these issues are tied to how the operator handles the state of a test run. Currently the operator relies on the string field stage of the CRD's status to direct the execution flow. Given the above, that no longer seems like the correct solution. The Kubernetes API conventions essentially promise eventual consistency between the spec and status of a resource, so the execution flow of the operator must be updated to handle the declarative approach of the Kubernetes API correctly.

Additional related info: the conditions enhancement can be referenced to add the informational part of the CRD's status.
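
For illustration, the standard conditions machinery could look roughly like this; the status shape and condition names below are assumptions rather than the operator's actual API:

```go
package example // illustrative sketch only

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// K6Status is a hypothetical status shape with standard Conditions alongside
// the current stage string.
type K6Status struct {
	Stage      string             `json:"stage,omitempty"`
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

// markStarted records that the runners received the start signal. Setting the
// same condition again on a repeated Reconcile call is a no-op, which is
// exactly the behaviour an idempotent operator needs.
func markStarted(status *K6Status, generation int64) {
	meta.SetStatusCondition(&status.Conditions, metav1.Condition{
		Type:               "TestRunStarted", // assumed condition type
		Status:             metav1.ConditionTrue,
		Reason:             "RunnersStarted",
		Message:            "all runner jobs received the start signal",
		ObservedGeneration: generation,
	})
}
```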

Relevant comments would be welcome.

@yorugac yorugac added Priority: High FSM Connected to FSM labels Aug 9, 2022
@yorugac yorugac self-assigned this Aug 9, 2022
@mugioka
Contributor

mugioka commented Aug 11, 2022

In k6-operator, wait.PollImmediate is used to implement the timeout in initialize/start/finish, but I think this approach is not so good because it blocks the reconciliation loop, e.g.:

err := wait.PollImmediate(time.Second*30, testDuration+time.Minute*2, func() (done bool, err error) {

I think the timeout can be implemented without blocking the reconciliation loop by annotating the k6 CR with a timeout value, and having each action return an error once that value is exceeded.
And I think it is possible to requeue until the timeout by returning the following result in the places that currently use wait.PollImmediate:

if err := r.List(ctx, jl, opts); err != nil {
	return ctrl.Result{
		Requeue:      true,
		RequeueAfter: 5 * time.Second,
	}, fmt.Errorf("could not list jobs: %w", err)
}
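
To make that concrete, here is a rough sketch of the annotation-based, non-blocking timeout idea; the annotation key and helper name are illustrative assumptions, not part of k6-operator:

```go
package example // illustrative sketch only

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
)

// deadlineAnnotation is a made-up annotation key holding an RFC3339 deadline.
const deadlineAnnotation = "k6.io/reconcile-deadline"

// checkDeadline is meant to be called at the top of an action instead of
// polling in place: it errors out once the deadline has passed and otherwise
// asks to be requeued, so the reconciliation loop is never blocked.
func checkDeadline(obj metav1.Object) (ctrl.Result, error) {
	raw, ok := obj.GetAnnotations()[deadlineAnnotation]
	if !ok {
		return ctrl.Result{}, nil // no deadline set, nothing to enforce
	}
	deadline, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return ctrl.Result{}, fmt.Errorf("invalid %s annotation: %w", deadlineAnnotation, err)
	}
	if time.Now().After(deadline) {
		return ctrl.Result{}, fmt.Errorf("timed out waiting for the test run")
	}
	return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
}
```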

@prashanth-volvocars

I am running the k6 operator in AWS EKS with Fargate and was facing issue #128, mainly because Fargate takes more than a minute for a pod to get assigned to a node. I increased the timeout to 3 minutes in initialise and start, and that seems to have solved the problem. I understand it's not a perfect fix, because in case of an error in the initialising phase it would still not be able to recover. Also, I can run tests simultaneously, so I am not sure what #126 means.

@yorugac
Collaborator Author

yorugac commented Mar 9, 2023

Lately this issue has been prioritized internally, in part thanks to PR #169 🎉

There has been some confusion around this issue, as to what exactly it involves, as well as a lack of clarity around the term "idempotency". So I'll try to clarify a bit 🙂

AFAIK, "idempotency" is a familiar term in the Kubernetes community, but it may not always carry exactly the same meaning as the mathematical term. Regardless, it is definitely meant in a broader sense here in k6-operator: there are a number of inter-connected problems that are all related to how the operator treats state. Some of them have already been linked here. I nicknamed them "idempotency issues" simply for lack of another term (and lack of time to search for one).

To be clearer, both about the problem and about current expectations, I've made a list of test cases that I'm looking into to pinpoint the exact problems we need to solve. If there's another, significantly different test case that should be added, please feel free to mention it in the comments.

A list of test cases, together with some notes on current behaviour:

  1. General flow with one test run
  • normal test run, with zero exit
  • restart of a runner
    • if it happens before test execution is completed on this runner: the new runner does not receive a starting signal and remains paused while the operator waits indefinitely.
    • if it happens after test execution is completed on this runner: rather hard to simulate, which raises the question of how likely it is in reality
  • user abortion of the test run
    • from inside the script with abort(): seems OK, only reaches Finished instead of Error (see Revisit error status definition #196)
    • deletion of the jobs before test run completion: controller waits indefinitely
    • deletion of CR before test run completion: fine in OSS, but with cloud output there is no finalization of the test run
  • restart of the test run (i.e. delete and apply in quick succession): OSS test run seems to continue execution
  • apply of YAML with the same name but different options in the middle of the test run (note: this should arguably be a truly idempotent operation by definition; see the sketch after this list): we'll get a message k6.k6.io/k6-sample configured and then the loop can break. If the number of nodes increases, the controller gets stuck waiting. If it decreases, it might be fine, but again, not in the cloud output case.
  • should resources be deleted when they reach an erroneous state? (Perhaps this part should be moved to Revisit error status definition #196)
    • from initializer (this can happen in case of syntax errors in script or bad arguments line, etc.)
    • from starter (when can this happen?)
    • from runner (when k6 exits with error during test execution)
  • modification of the CR during the test run: message k6.k6.io/k6-sample edited, but otherwise the same as the re-apply case above
  • restart of the operator
  • operator starts after k6 CR is applied - operator should still pick it up
  2. Simultaneous test runs
  • normal test run of 2+ CRs
    • tests are started one by one, with some small period of time in between
    • tests are started simultaneously
  • simultaneous tests combined with the above cases from point 1
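
Regarding the "same name, different options" re-apply case from point 1, a common way to detect a mid-run spec change is to compare the generation the API server assigns to the spec with one recorded in status; a minimal sketch, assuming a hypothetical ObservedGeneration status field:

```go
package example // illustrative sketch only

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// K6Status is a hypothetical status shape; ObservedGeneration records the
// spec generation the operator last acted on.
type K6Status struct {
	ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}

// specChangedMidRun reports whether the resource was re-applied with a
// different spec while a test run was in progress: the API server bumps
// metadata.generation on every spec change, so a mismatch means the running
// test no longer matches the declared spec.
func specChangedMidRun(obj metav1.Object, status K6Status) bool {
	return obj.GetGeneration() != status.ObservedGeneration
}
```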

These cases are considered in two possible configurations: OSS test run and k6 Cloud (cloud output) test run. At the moment of writing, most of these cases are supported for OSS test runs, with a few notable exceptions.

Current conclusion: the main problem is fixing the major pain points with k6 Cloud test runs. Another problem: the operator does not have any timeout mechanism at the moment, so if a test run enters an infinite-execution scenario, the operator will wait for completion without ever timing out. Fixing and clarifying these two problems are considered the main priorities at the moment.

Just in case: I don't think the goal should be to support all of the above test cases, but rather as many as possible and, hopefully, to be clear about what is not going to work. We also need to weigh the probability of a test case occurring in a real situation against the effort required to resolve it.

@pablochacin
Collaborator

pablochacin commented Mar 9, 2023

I nicknamed them "idempotency issues" simply due to lack of another term (and lack of time to search for another term).

After reading the issues described above, I think "idempotency" is not the proper term and may be confusing. I suggest using a more generic and possibly more neutral term such as "state handling".

@yorugac
Collaborator Author

yorugac commented Mar 13, 2023

Thanks for the feedback @pablochacin! I kind of agree with you 😄 That's how the label "FSM" appeared, after all. Probably the more correct terminology would be ~ "resilient to state changes", which includes idempotency.

But the term "idempotency" is rather distinctive, even if not exhaustive in this case, and it'd be equally confusing to remove it at this point, IMO.
Update: adjusted the name of the issue.

@yorugac yorugac changed the title Make the operator idempotent Make the operator idempotent: resilient to state changes Mar 13, 2023
@yorugac
Collaborator Author

yorugac commented Oct 10, 2023

The main goals of this issue have been achieved: the operator acts as an "idempotent operator" in most cases. If there are remaining edge cases, it makes sense to open additional issues with specific instructions on how to reproduce them.

As a follow-up to the list above, the following issues exist:
