Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a design doc for distributed execution and test suites #3217

Merged
merged 2 commits into from
Feb 27, 2024

Conversation

na--
Copy link
Member

@na-- na-- commented Jul 20, 2023

What?

This is a design doc for my proposal on how native distributed execution (#140) can be added to k6 🎉 It briefly touches on test suites (#1342) as well, because my distributed execution offers a way to implement test suites (both for local and distributed tests!), but doesn't explore that topic to the same depth.

It is essentially a brain dump of most of the thoughts that exist in my head on the topic, after first making a distributed execution PoC in #2438, and then polishing it and making a PoC of test suites in #2816.

Why?

I will be taking a long vacation/sabbatical for the next few months and then gradually moving to another team. Because of that (and somewhat prompted by @ragnarlonn's comments in #140 😅), I thought it made sense to try and write down my thoughts on the topic. It's very unlikely that I will be able to completely finish the distributed execution work myself, so I've tried to make what exists as code and ideas in my head as easy to adopt and built upon as it's practical to do before I go away for a while.

That's also why I refactored #2816 to make big parts of it completely backwards compatible. It hopefully means that they are merge-able even when they are somewhat incomplete, untested and unpolished... 🤞 At this point, I've (lightly) polished the original distributed execution PoC and split it into a few PRs (#3204, #3205, #3213) that should be safe to merge as they are, even if the whole feature needs more work to be considered stable and not experimental.

That's also why I have written this document, so hopefully someone else can pick up the task and implement the missing pieces to make what exists stable and even more useful. Still, this design doc and the connected PRs are just my suggestion for how things can be done. If someone comes up with a better idea, feel free to throw all of this away. Even if some of the linked PRs were already merged, we need to be very clear that they are super-experimental and liable to change (or go away) at any point in time before they are officially declared stable by the team.

Related PR(s)/Issue(s)

3. Similarly, in the "Instance Synchronization API" section I showed how the `SignalAndWait()` pseudo-API can be used to ensure that all instances have finished initializing their VUs and executed `setup()` before they start the actual test execution simultaneously
4. Most failures will be handled by the coordinator and propaged to waiting instances. For the best UX, instances might subscribe to a special error channel so they can be interupted even during the test execution.
5. Metric collection was described in the "Metrics Handling" section, and the coordinator will be able to stop the test if a threshold is crossed.
6. We can probably handle `test.abort()` calls from an instance with a `GetOrCreateData()` call that uses a special `dataID` watched by the coordinator, but it might be simpler to just add a custom API for that... :thinking:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of reusing API calls using "special" parameters to make requests requiring specialized handling, such as a test.abort(). I would prefer a method in the Controller API for signaling this kind of events. For example, the Signal(event, error) API call described above with an abort event.


https://github.com/grafana/k6/issues/1342 is far behind https://github.com/grafana/k6/issues/140 when it comes to upvotes, but it is still one of our most requested features after it.

## Limitations of `scenarios`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading this list of limitations followed by the proposal of test suits, it sounds like we are not addressing some of the limitations of scenarios but suggesting using suits. Is this the intention?

I think that even when some complex requirements such as "CI-like" capabilities make more sense in the context of suits, scenarios are still useful and will benefit from features such as "start after" chaining (to avoid the error-prone startTime) and also the per scenario setup() and teardown() functions. This last one is particularly important from the perspective of the disruptor to facilitate the setup of the disruptor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it sounds like we are not addressing some of the limitations of scenarios but suggesting using suits

What limitations are not going to be addressed by test suites? I just re-read the list and all of the limitations are either directly solved by having test suites, or could be solved much more easily by building something on top of test suites compared to building it on top of the current scenarios (or anything else that currently exists in k6 that I can think of).

Copy link

@pablochacin pablochacin Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@na-- That is not the problem. It is clear that test suites address all the limitations of scenarios. And it is also clear that some limitations such as the dependencies between scenarios are easier to implement in test suits.

The problem I'm mentioning is that scenarios are still valuable in some simple cases and it would be interesting to address some of their limitations instead of somehow "forcing" to use test suites in all but the simplest cases.

In particular the setup/tear-down functions per scenario. Having those would make using the disruptor a lot simpler than now.

Copy link
Member Author

@na-- na-- Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see! And yeah, that makes sense, I am not against having setup()/teardown() functions (or something like them) per scenario too, though that is a bit out of scope for this document. The nice thing about setup()/teardown() functions per test in a test suite is that... they already exist 😅 The current logic for these life-cycle functions just works as you'd expect it in the test suite PoC and you essentially get that whole feature for free.

Even so, the execution.Controller architecture should also help with future efforts to implement per-scenario setup() and teardown() too! 🎉

The big problem before was the requirement that such functions need to be called only on a single instance, but their results need to be distributed across all instances. Similarly, this was also the big problem with implementing a SharedArray (or other similar things) that could be created during the test execution phase (i.e. outside of the script init context). Both of these problems are almost the same, see #1638 (comment). And both of these problems should now be trivial to solve with the GetOrCreateData() API/helper that will be used for the test setup() 🎉

That is, because setup() will no longer be all that special, and because the control flow of this architecture is not top-down (the k6 coordinator controlling the agents), but bottoms-up (the main actors are the agents, they are just using the coordinator for a central hub to synchronize with each other and exchange data), it should be trivial to build such features on top of it and have them work in both local and distributed tests! 🤞

@mstoykov mstoykov added this to the v0.47.0 milestone Jul 26, 2023
@codebien codebien modified the milestones: v0.47.0, TBD Sep 25, 2023
Copy link
Contributor

@yorugac yorugac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello 👋 I've finally gotten around to actually start grokking this document properly so a couple of additional comments.

I'm still thinking through the implications of such k6 distributed solution in Kubernetes but I'd like to upvote this issue again. It'd be great if it were possible to move it forward.

7. Detecting when the test has finished is trivial with `SignalAndWait()`.
8. `teardown()` can also be executed the same way we do `setup()`. Though, because it doesn't return any data, it is one of the reasons I think it makes sense to split `GetOrCreateData(id, callback)` into separate `Once(callback)`, `SetData(id)` amd `GetData(id)` calls...
9. The `k6 coordinator` node can run `handleSummary()` after the test is done, no problems there :tada:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing that would be good to include in this list is:

  1. termination of k6 processes by coordinator

Currently, k6-cloud relies on --linger and sending SIGTERM to k6 processes, in order to have consistent logic around teardown and finish of the test. It is another piece of low-level logic which has to be done "externally".

As an illustration. This is actually something that became the main stumbling block in my initial attempt to implement setup and teardown run once in k6-operator. Once --linger is set, there is no interface to stop k6 process - it must be killed. There is a stop test call in REST API but it doesn't actually terminate the process in case of --linger: it only returns the status. So one has to remove the k6 process "manually". That leads to almost identical logic for aborting k6 process as response to user abort VS regular stop of the test run, and so these different cases must be additionally tracked as separate thus complicating the logic even more.

Additionally, Kubernetes space in general "expects" the main processes in pods to exit on their own, and these manual interventions seem almost an interference in a healthy Kubernetes flow. From both design and reliability perspective, it'd be beneficial for k6-operator (and perhaps k6-cloud) to rely on a straight-forward interface of a "k6 coordinator" for these things, e.g.: coordinator executes teardown and then sends termination signal to all k6 runners. After that, coordinator exits itself, and its exit is an indicator of test run being de facto finished.

On a related note, probably coordinator should also take care of abort terminations as well? I.e. if one wants to abort the test run, it should be enough to "ping" coordinator. So there should be a user interface to do that as well.

Copy link
Contributor

@codebien codebien Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yorugac I'm not sure I'm getting the point regarding --linger on the Cloud. At the moment, it should be used as a workaround for getting the current architecture working in a distributed mode. But with the new architecture proposed here, --linger is not required anymore to distribute the test as it is already covered by the Sync API. --linger remains for the specific case where the user wants to force k6 to wait even if the test is finished.

Kubernetes space in general "expects" the main processes in pods to exit on their own,

Regarding this specific point, I think is a valid case to support on k6 HTTP API. It is something mentioned at point 5 for --linger here #2804 (comment). Feel free to open an issue for it.

then sends termination signal to all k6 runners

We are reversing the flow, so the coordinator shouldn't be able to contact directly the agents. The unique way should be to emit a specific event through the Sync API to terminate/abort but it means there isn't a guarantee to kill the agent. Does it work for the Operator/k8s life-cycle?

On a related note, probably coordinator should also take care of abort terminations as well? I.e. if one wants to abort the test run, it should be enough to "ping" coordinator. So there should be a user interface to do that as well.

Yep, it sounds like a good idea for a single entrypoint for forcing the abort. It can be done as described via the Sync API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If --linger won't be required with native distribution, all the better. I was trying to illustrate precisely that point: that existing workarounds with --linger in k6 Cloud are neither easy to implement nor natural in a setting like Kubernetes. IOW, if native distribution will work and terminate without --linger and manual interventions, that's basically ideal.

I think is a valid case to support on k6 HTTP API. It is something mentioned at point 5 for --linger here #2804 (comment). Feel free to open an issue for it.

Done, here.

From test distribution viewpoint, HTTP call is still an intervention (something must make that call) in comparison to native distribution. But having that call would be better than current state for sure 👍

The unique way should be to emit a specific event through the Sync API to terminate/abort but it means there isn't a guarantee to kill the agent. Does it work for the Operator/k8s life-cycle?

🤔 What do you think is the worst-case behaviour on the agent's side? I.e. if "terminate event" is emitted and something gets broken, will agent's process be stuck or will it exit with non-zero error code after some timeout? Additionally, if the agent can get stuck, when can that happen?


While there were some details in the original proposal that are no longer applicable because of the big architectural changes in k6 in the meantime, the overall proposal is probably still valid. I am pretty sure that the architecture I proposed above can be built on top of these technologies and in a way where instances don't need to hold a persistent connection to a central `k6 coordinator` node!

However, I think that the benefits of such an approach would be outweighed by the drawbacks in terms of imlpementation and deployment complexity. In short, it can probably work, but IMO the UX will be worse. Even if we ignore the complications of deploying multiple software stacks (vs. just the k6 binary), it'd be difficult to implement great error handling and timely handling of things like `abortOnFail` thresholds and `test.abort()` API calls. It can certainly be done, but it'd be a ton of effort to not get a flaky experience...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
Additionally, external dependencies will make it much harder to incorporate this solution into k6-operator or k6-cloud, if possible at all. So it'd be nice if the first, "native" implementation of the interface was as "light" and transparent from user perspective as possible.

@codebien codebien self-assigned this Jan 29, 2024
@yorugac
Copy link
Contributor

yorugac commented Feb 2, 2024

Continuing the Kubernetes-related musings 🙂

In context of this design doc, I'd like to suggest a three-level approach to load distribution which should allow a more flexible setup for distributed test runs.

Currently it is implied that there will be a two level system:

k6 coordinator (handling 100% of test load) 
      <- k6 agents

Another way to approach this would be with three levels:

k6 coordinator (100% of test load) 
      <- k6 sub-coordinators (with execution segment, like 0:1/3) 
      <- k6 agents

IOW, allow coordinator to act as a sub-coordinator:

  • coordinator becomes a sub-coordinator when it has a configured endpoint of coordinator gRPC server (similar to an agent configuration)
  • it can accept execution segment CLI options
  • it can "relay" signals back and forth to coordinator and agents

Main benefits from this approach:

  • this allows to segment the network and its configuration (e.g. access policies) by "zones" (using "zones" as a more general term here)
  • this might unlock a cleaner design of the test distribution across all types of distributed test runs (better design => easier to maintain + potential to support any kind of test distribution, even mixed PLZ + regular LZ)

Context: this came up while pondering questions accumulated from multiple sources, which include inquiries around PLZ and k6-operator in general (e.g. here in OSS).

It seems to me that this kind of approach will be relatively straigtht-forward to add, assuming two-level coordination is implemented or planned for implementation.

@ragnarlonn
Copy link

@yorugac Sounds like it will scale better also, if a large number of agents can be served by a reasonable number of sub-coordinators that then also aggregate results and control data up to the "root" coordinator. If it does scale better without making UX worse due to increased complexity, I think we may have an USP versus practically all load testing tools that support distributed exectution. AFAIK most (maybe all) other tools that can do distributed execution have a bottleneck in a single master controller/coordinator process, which has to aggregate data from potentially many agents.

@na--
Copy link
Member Author

na-- commented Feb 4, 2024

I think this should be fairly easy to do on top of the architecture I proposed in this PR, though I'd encourage someone to try to make a PoC on top of #3205 or even #3213 to validate that gut feeling before fully embracing the idea 😅

I can see the benefits when it comes to multiple zones and mixed deployments, though a single coordinator will probably be sufficient in the vast majority of cases 🤔 Given the simplicity of the control interface, I'd expect it to comfortably support hundreds of instances, though I haven't actually measured that.

@na-- na-- requested a review from a team as a code owner February 19, 2024 11:58
@codecov-commenter
Copy link

codecov-commenter commented Feb 19, 2024

Codecov Report

Attention: 291 lines in your changes are missing coverage. Please review.

Comparison is base (703970e) 72.68% compared to head (5fa0add) 73.57%.
Report is 288 commits behind head on master.

❗ Current head 5fa0add differs from pull request most recent head 015ab96. Consider uploading reports for the commit 015ab96 to get more accurate results

Files Patch % Lines
js/modules/k6/grpc/grpc.go 60.71% 14 Missing and 8 partials ⚠️
cmd/builtin_output_gen.go 16.66% 19 Missing and 1 partial ⚠️
js/modules/k6/experimental/fs/errors_gen.go 9.09% 19 Missing and 1 partial ⚠️
js/modules/k6/experimental/fs/module.go 84.25% 11 Missing and 9 partials ⚠️
js/modules/k6/grpc/params.go 85.71% 11 Missing and 7 partials ⚠️
cmd/report.go 63.41% 15 Missing ⚠️
lib/consts/consts.go 31.81% 14 Missing and 1 partial ⚠️
execution/scheduler.go 57.57% 7 Missing and 7 partials ⚠️
js/modules/k6/grpc/stream.go 53.33% 11 Missing and 3 partials ⚠️
cmd/run.go 73.91% 9 Missing and 3 partials ⚠️
... and 34 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3217      +/-   ##
==========================================
+ Coverage   72.68%   73.57%   +0.88%     
==========================================
  Files         259      277      +18     
  Lines       19864    20228     +364     
==========================================
+ Hits        14439    14882     +443     
+ Misses       4521     4398     -123     
- Partials      904      948      +44     
Flag Coverage Δ
macos 73.50% <75.95%> (?)
ubuntu 73.52% <75.95%> (+0.89%) ⬆️
windows 73.41% <75.86%> (+0.88%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@mstoykov mstoykov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I don't have any feedback that isn't implementaion detail that likely will be wrong when we implement it with all the error handling , timeouting and configuration.

Let's merge this and try to start making it work, the design is good and simple enough IMO to be bendable to fix any problem we hit. Especially any problem that I can envision

@pablochacin
Copy link

@yorugac @ragnarlonn Regarding the sub-coordinators and in particular this comment

AFAIK most (maybe all) other tools that can do distributed execution have a bottleneck in a single master controller/coordinator process, which has to aggregate data from potentially many agents

I don't see how sub-coordinators will alleviate the data aggregation in the coordinator if we want to have execution-wide thresholds.

Metric aggregation: While the test is executing, we somehow need to collect all of the metrics from all instances and crunch them, because we need the aggregated metrics to produce the end-of-test summary and to calculate whether the script thresholds were crossed. For these calculations to be correct, we can't do them separately on every k6 instance in a distributed test, we need to first aggregate the results centrally and use that for the calculations. This is particularly important because some thresholds might be configured with abortOnFail, which means that we need to be continuously crunching the data, since they can require us to stop the test run mid-test!

@ragnarlonn
Copy link

@pablochacin @yorugac My spontaneous thought was that aggregation could happen lower down in the tree, but I might be wrong. I'm not very much up to date on how the k6 internals work these days. In my mind, it just means that thresholds have to be communicated to sub-agents and agents, so they can aggregate things and send to up the tree. I'm probably missing something though :)

@yorugac
Copy link
Contributor

yorugac commented Feb 22, 2024

@pablochacin @ragnarlonn I've got an impression that Ned indirectly already answered to this question over here 👀

@na--
Copy link
Member Author

na-- commented Feb 22, 2024

Yep, though keep in mind that in the PoC version of distributed support for metrics and thresholds (#3213), the Trend metrics are still inefficiently transmitted between k6 agents and the coordinator (see this code). So even sub-coordinators won't help that much, since they can't aggregate these without loosing the ability to merge them back together correctly again in the central coordinator node.

For that, as I wrote in this design doc, you need HDR histograms. The PoC for that is in the next PR in the chain, #2816, specifically this commit: 1ad356b

As you can see, it's a relatively tiny change (most of the diff size is due to the used HDR library dependency). This commit actually used to be at the start of the PoC PR chain, but I moved it at the very end, so all of the rest of them could be merged without breaking changes. After all, k6 agent and k6 coordinator are new commands, so the idea was that they could be merged and behind a feature flag (or just hidden in the CLI usage k6 --help) for a while, even if they are currently inefficient. They can stay hidden until they are production ready, and HDR histograms are basically one of the few prerequisites for that (besides things like error handling and more tests). Still, as I wrote in #763 (comment), HDR histograms necessitate someone to spend a bit of time reading, investigating and measuring, but it shouldn't be too tricky to get them in (e.g. they themselves can also be behind a feature flag, to get feedback).

After having HDR histograms, sub-coordinator would be able to safely merge the HDR histograms from their agents and then pass the aggregated value to the central coordinator without a loss in precision. That said, it shouldn't make that much of a performance improvement, merging the HDR histograms is pretty efficient and I expect the central coordinator to be able to handle dozens if not hundreds of instances without much of a problem (though, admittedly, I haven't tested that assumption 😅 ).

@pablochacin
Copy link

Thanks for the detailed explanation @na--. For me is still unclear if we should consider partial aggregation using HDR and sub-coordinators as part of the "base" of distributed execution, or we should consider them in a second phase once we have an stable version of the coordinator/agent model.

@codebien
Copy link
Contributor

codebien commented Feb 22, 2024

or we should consider them in a second phase once we have a stable version of the coordinator/agent model

@pablochacin The first experimental version should validate the main assumptions:

  • Reverse the pattern
    • Sync API
    • ManyToOne architecture between agents and coordinator
  • Test suite enablement
  • Nice UX/DX for error handling

As @na-- mentioned in the previous comment sub-coordinators should not be intended as a primitive. We can build it as an additional feature on top of a more simplistic version.

In addition, considering the complexity it introduces, we should have an idea of the real demand for this specific feature. I don't think we will be able to predict it with good accuracy before the first version will be available to the community.

@codebien codebien merged commit 3a7bf2c into master Feb 27, 2024
22 of 24 checks passed
@codebien codebien deleted the distributed-execution-design-doc branch February 27, 2024 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants