Integration/stress testing orchestration #2358

qdm12 · 2022-03-08T18:03:44Z

How do we orchestrate multiple nodes for our integration tests/stress tests?

For now we mostly run other nodes in goroutines with time-based synchronization. The problem is that this does not adapt to the machine speed and becomes unreliable if the wait time is too short (and how long is long enough?).

At a minimum, an eligible solution must:

Be runnable from a single Go test, with nothing to set up before running the test
Be reliable (repeatable, predictable)
Be cross-platform

The following criteria are considered, with a grade of -1, 0 or 1:

How much new code?
How much production code refactoring?
Extra tooling needed?
Separation of concerns?
Flexibility/maintenance cost?

Run each node in a goroutine and parse their logs

Score: 3

How much new code? 1 we just need to write logs to a bytes buffer and parse it depending on the test.
How much production code refactoring? 0 we might have to dependency inject loggers
Extra tooling needed? 1 no
Separation of concerns? 0 each goroutine runs independently, but some code may be shared (:eyes: global variables)
Flexibility/maintenance cost? 1 it's all Go code, so it's flexible and fast to maintain.

Votes: Quentin
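
As a rough illustration of this option, here is a minimal sketch, assuming each node can be handed an io.Writer for its logs. The names syncBuffer, waitForLog and fakeNode are hypothetical, and the log line format is invented for the example; it is not Gossamer's actual output.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"regexp"
	"sync"
	"time"
)

// syncBuffer is a bytes.Buffer guarded by a mutex, so a node goroutine can
// write logs while the test goroutine reads them.
type syncBuffer struct {
	mu  sync.Mutex
	buf bytes.Buffer
}

func (b *syncBuffer) Write(p []byte) (int, error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.Write(p)
}

func (b *syncBuffer) String() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.buf.String()
}

// waitForLog re-checks the captured logs until pattern matches or ctx is done.
// The test waits for what the node reports instead of sleeping a fixed time.
func waitForLog(ctx context.Context, logs *syncBuffer, pattern *regexp.Regexp) (string, error) {
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		if match := pattern.FindString(logs.String()); match != "" {
			return match, nil
		}
		select {
		case <-ctx.Done():
			return "", fmt.Errorf("waiting for %q: %w", pattern, ctx.Err())
		case <-ticker.C:
		}
	}
}

// fakeNode stands in for a real node; the log lines are made up.
func fakeNode(logs io.Writer) {
	fmt.Fprintln(logs, "starting node")
	time.Sleep(20 * time.Millisecond)
	fmt.Fprintln(logs, "built block number=1")
}

// runExample starts the fake node and waits for its first block log line.
func runExample() (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	logs := &syncBuffer{}
	go fakeNode(logs)
	return waitForLog(ctx, logs, regexp.MustCompile(`built block number=\d+`))
}

func main() {
	match, err := runExample()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("matched:", match)
}
```

The timeout only bounds the worst case; on a fast machine the wait ends as soon as the line appears.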

Run each node in a goroutine and signal through channels

Score: -2

How much new code? -1 set up channels and careful async logic (to avoid deadlocks etc.) in tests
How much production code refactoring? -1 whole services refactor to be able to communicate status through channels
Extra tooling needed? 1 no
Separation of concerns? -1 goroutines communicate through channels, so there is a tight/risky coupling
Flexibility/maintenance cost? 0 it's all Go code, so it's flexible BUT the asynchronous logic with channels will be tedious to get right without running into deadlocks or race conditions.

Votes:
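
To make the trade-off concrete, here is a minimal sketch of what channel-based signalling could look like, assuming each service were refactored to expose a readiness channel. The node type and its ready field are hypothetical and do not exist in Gossamer.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// node is a hypothetical stand-in for a service that reports readiness
// through a channel.
type node struct {
	ready chan struct{}
}

func newNode() *node { return &node{ready: make(chan struct{})} }

// run simulates startup work, closes ready exactly once, then blocks until
// ctx is cancelled.
func (n *node) run(ctx context.Context, startup time.Duration) {
	select {
	case <-time.After(startup):
		close(n.ready)
	case <-ctx.Done():
		return
	}
	<-ctx.Done()
}

// waitReady blocks until every node is ready or ctx is done. Selecting on
// ctx.Done() keeps the test from deadlocking if a node never starts.
func waitReady(ctx context.Context, nodes []*node) error {
	for i, n := range nodes {
		select {
		case <-n.ready:
		case <-ctx.Done():
			return fmt.Errorf("node %d not ready: %w", i, ctx.Err())
		}
	}
	return nil
}

// runExample starts three nodes, waits for all of them, then tears them down.
func runExample() (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	nodes := make([]*node, 3)
	// Buffered so node goroutines can always exit, even on an early return.
	done := make(chan struct{}, len(nodes))
	for i := range nodes {
		nodes[i] = newNode()
		startup := time.Duration(i+1) * 10 * time.Millisecond
		go func(n *node) {
			n.run(ctx, startup)
			done <- struct{}{}
		}(nodes[i])
	}

	if err := waitReady(ctx, nodes); err != nil {
		return 0, err
	}
	cancel() // tear the nodes down
	for range nodes {
		<-done // wait for every goroutine to exit
	}
	return len(nodes), nil
}

func main() {
	n, err := runExample()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("ready nodes:", n)
}
```

Even in this toy version, the buffered done channel and the double select in run are needed to avoid leaks and deadlocks, which hints at the maintenance cost graded above.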

Run each node in a container and orchestrate with Docker Go API

We can use the Docker client Go API to launch node containers and stream their logs. This also has the advantage of testing against the final Docker image and not just the Go code.

Score: 2

How much new code? 0 set up log parsing and launching/stopping of Gossamer containers with the Docker API
How much production code refactoring? 1 none
Extra tooling needed? 0 Docker running, although most devs have it I guess
Separation of concerns? 1 we run different containers so they're quite isolated, this also tests the full Docker image
Flexibility/maintenance cost? 0 it's Go code so it's quite flexible, and we have all Docker API at our fingertips. However, since we run a container, we are limited to environment variables/flags to pass to Gossamer + Docker image rebuild.

Votes: Quentin

Run nodes with Ansible

Use Ansible to get nodes started with the desired configuration. We can use go-ansible to launch it from our Go test code. This can be nice, although it might require some work to get the right configuration, and it might lead to a lot of code duplication. We should also use Go 'scripts' instead of bash scripts for it so that it's fully cross-platform.

Score: 1

How much new code? -1 set up Ansible files (Go scripts, yml files) + calls from Go code using the Go API
How much production code refactoring? 1 none
Extra tooling needed? 0 Ansible might be needed (to check)
Separation of concerns? 1 each program can be run either as simple binaries or as containers, and both should be decently isolated.
Flexibility/maintenance cost? 0 It should be nicely flexible once set up properly, but it might lead to a lot of code duplication for each test, which may make the whole solution not viable.

danforbes · 2022-03-09T20:12:45Z

It seems like there is an assumption that we need to start the nodes from within the tests - am I interpreting that correctly? If so, what is the basis for that requirement?

4 replies

qdm12 Mar 11, 2022
Author

For now, we do a bit of both.

We launch N nodes in a TestMain before tests execute
We launch N nodes within a single test, and shut them down at the end of that test

I would say there is an advantage to launching nodes within a Go test, since we can then configure them according to the test. There is rarely a single configuration that fits every test.

To do that we should focus on fast starts, fast teardowns.
Ideally we should also aim to decouple tests, for example by removing pre-test setup.
Since Go compilation is fast, we can even build and run nodes with go run within a test, which would make tests quite decoupled.

danforbes Mar 11, 2022

Can you elaborate on some of the test-specific configurations that exist and the range of values they may take?

qdm12 Mar 11, 2022
Author

Not yet really, I'm still working through the code, but I'll comment back.

Anyway, on top of that we should clear and restart nodes to clear their state from memory and disk.
For example, with a database container, I usually set up the schema at the start of a test and drop all the tables at the end of each test, effectively clearing its state. I don't think we can do (or should do) this with Gossamer.

Again, I'm moving plumbing around in the tests/ directory, so I should gain a better understanding soon. Even our current time based approach is rather flaky and hard to change, so I'm working on this right now, it might be sufficient to just make it more configurable.

edwardmack Mar 11, 2022

If we want to start nodes from a specific state can we use a snapshot of the badger db's state for that? And swap out db files for different test/states, or use init db to clear state. Does docker or ansible provide any advantages regarding this? or different ways of handling state?
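
One way the snapshot idea could look from Go test code is sketched below, assuming a node's database is a plain directory that can be copied while no node is using it. The copyDir helper and the state.db file name are hypothetical; a real Badger directory has its own file layout, and inside a test t.TempDir() would replace os.MkdirTemp.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// copyDir copies the directory tree at src into dst. Files are read fully
// into memory, which is fine for a sketch but not for large databases.
func copyDir(src, dst string) error {
	return filepath.WalkDir(src, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		target := filepath.Join(dst, rel)
		if d.IsDir() {
			return os.MkdirAll(target, 0o755)
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		return os.WriteFile(target, data, 0o644)
	})
}

// runExample gives two nodes their own copy of one snapshot, lets node A
// change its copy, and returns what node B still sees.
func runExample() (string, error) {
	root, err := os.MkdirTemp("", "snapshot-example")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)

	// A fake snapshot with a single made-up state file.
	snapshot := filepath.Join(root, "snapshot")
	if err := os.MkdirAll(snapshot, 0o755); err != nil {
		return "", err
	}
	stateFile := filepath.Join(snapshot, "state.db")
	if err := os.WriteFile(stateFile, []byte("block=100"), 0o644); err != nil {
		return "", err
	}

	nodeA := filepath.Join(root, "node-a")
	nodeB := filepath.Join(root, "node-b")
	for _, dir := range []string{nodeA, nodeB} {
		if err := copyDir(snapshot, dir); err != nil {
			return "", err
		}
	}

	// Node A mutates its own copy; node B and the snapshot are untouched.
	if err := os.WriteFile(filepath.Join(nodeA, "state.db"), []byte("block=101"), 0o644); err != nil {
		return "", err
	}
	data, err := os.ReadFile(filepath.Join(nodeB, "state.db"))
	return string(data), err
}

func main() {
	state, err := runExample()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("node B state:", state)
}
```

Because each test works on its own copy, clearing state is just a matter of deleting the temporary directory.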

jimjbrettj (Collaborator) · 2022-03-11T19:01:01Z

@qdm12 or @danforbes do you know what kinda overhead we might expect from setting up ansible? From reading this over it seems to me that using docker would be the way to go, but maybe there is more associated with ongoing maintenance (even tho its still go) than there would be if we used ansible? Or would the maintenance be comparable for the two of them?

I guess the question im asking myself is "is the overhead of using ansible worth it to make our lives easier in the future?"

Would be curious to hear thoughts

1 reply

qdm12 Mar 14, 2022
Author

"do you know what kinda overhead we might expect from setting up ansible?"

No overhead really, it just parses a yml file and executes things.

"maybe there is more associated with ongoing maintenance (even tho its still go) than there would be if we used ansible"

Yes and no. The advantage is Ansible is standardized, but on the other hand, I think we'll have to have some custom code with it as well, so we might as well just orchestrate ourselves (like we do now) using the Docker API (or just go run or goroutines). Happy to hear other opinions on that topic though.

To be clear, for now we run nodes as sub-processes in each test. The idea behind running them in docker containers is to test the entire docker image, not just the program. There is also the option to run nodes as goroutines, but that tests less stuff than running the whole program. And all this might be out of scope here as well.

qdm12 · 2022-03-14T15:44:41Z

qdm12
Mar 14, 2022
Author

I would also like to point out that we could keep it time based as long as we have proper RPC endpoints and retry mechanisms. That's what I'm working on right now. Obviously event-driven would be better, but it would also require more work.
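
A minimal sketch of such a retry mechanism follows, assuming the condition is checked by a function that returns nil once it holds. retryUntil is a hypothetical helper, and the fake check below stands in for a real RPC call such as asking a node for its block height.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// retryUntil calls check every interval until it returns nil or ctx is done.
// Errors from check are treated as "not yet" and retried, so a node that is
// still starting up does not fail the test immediately.
func retryUntil(ctx context.Context, interval time.Duration, check func(context.Context) error) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		err := check(ctx)
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("%w (last error: %v)", ctx.Err(), err)
		case <-ticker.C:
		}
	}
}

// runExample polls a fake chain until it reaches height 3 and reports how
// many attempts that took.
func runExample() (attempts int, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	height := 0
	check := func(context.Context) error {
		attempts++
		height++ // the fake chain produces one block per poll
		if height < 3 {
			return fmt.Errorf("height %d, want at least 3", height)
		}
		return nil
	}

	err = retryUntil(ctx, 5*time.Millisecond, check)
	return attempts, err
}

func main() {
	attempts, err := runExample()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("attempts:", attempts)
}
```

The overall timeout comes from the context, so a slow machine gets more polls rather than a failed fixed sleep, at the cost of reacting up to one interval late.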

0 replies

timwu20 · 2022-03-14T16:55:28Z

timwu20
Mar 14, 2022
Maintainer

Is the first option running the node via shell? or instantiating the node directly in a goroutine?

1 reply

qdm12 Mar 15, 2022
Author

What we do now is we run it via exec.Cmd so pretty much the 'go' shell I guess. Sorry it wasn't with goroutines, my bad.
That may also add some extra memory usage (less memory sharing), but it's simple to get all the logs (no need to refactor our whole code to inject loggers) and also tests the whole binary.

timwu20 · 2022-03-14T16:58:04Z

timwu20
Mar 14, 2022
Maintainer

My concern is that the GitHub runner we use is unreliable in terms of the amount of dedicated resources we get. I would assume this is the reason why the run times of these stress tests vary and are inherently flaky. We could optimise all we want in terms of timeouts, but block times can vary due to resource constraints. I think we need to run our own GitHub runners to really achieve the quality and throughput of testing we're looking for.

1 reply

qdm12 Mar 15, 2022
Author

"My concern is that the GitHub runner we use is unreliable in terms of the amount of dedicated resources we get"

Unless it's memory and OOM errors, our test code should scale to the runner performance. And I don't think we actually run out of memory really.

"why the run times of these stress tests vary and are inherently flaky"

As far as my /tests/ refactoring is going, we just have weird timing and retry logic. I'm refactoring it and I'm mostly confident it will fix it. I don't even think more RPC methods need to be added on top, so it's really the path of least resistance.

"but block times can vary due to resource constraints"

A periodic RPC call with retry can make the test wait correctly. It won't be as fast as possible (i.e. like parsing logs/event-driven), but it will be fast enough and reliable.

"I think we need to run our own github runners to really achieve the quality and throughput of testing we're looking for."

I disagree, again, unless we're memory bound. There are some tests where we run 9 nodes, maybe that can be problematic, but not as far as I know. Also we should rather split the load on multiple runners instead, ideally, and that's most likely feasible since we already communicate over RPC/network.

dapplion · 2022-04-04T18:26:41Z

@qdm12 asked me to comment about Lodestar QA. So far:

We run a minimal testnet with 4 nodes and 64 validators on GitHub runners for ~6 minutes and achieve finality in a minimal preset configuration. This is useful to catch network issues and consensus bugs
Then we manually deploy every master merge to a fleet of testnet nodes in Prater (big live production network). Everything is managed by Ansible, with Prometheus metrics
Before each release we deploy a beta to another fleet with the same Ansible playbooks, see https://github.com/ChainSafe/lodestar/blob/master/RELEASE.md

Let me know if more context would be useful

0 replies
