
Enable statistically meaningful constant RPS load generation distributions #1281

Closed · wants to merge 9 commits

Conversation

@timran1 commented Mar 10, 2020

Locust is very useful for stress testing web services at a constant RPS. Prior work added support for constant_pacing based wait_time, which ensures that the RPS is independent of how long the web service takes to respond. However, Locust currently does not consider at all how the individual requests are spread out within a single second of, say, a 10 RPS load. This can put undue bursty request pressure on the system under test. The situation worsens if a slave is added or removed during load generation.

To measure and understand the existing behaviour, I developed a small web service which simply records and plots the request arrival times, the inter-arrival time between consecutive requests, and the frequency distribution of inter-arrival times. Refer to the screenshots below for particular examples.

A command like the following is used to invoke Locust with the different wait_time schemes implemented in this PR:

locust -f <task_file> --host=<host_address> --csv=locust --no-web -c 20 -r 100 -t 1m

Similar commands are used for distributed master/slave configurations.
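
For reference, a task file using one of these wait_time schemes might look roughly like the sketch below (1.0-style API; the import path and exact signature of constant_uniform are assumptions based on this PR's naming):

```python
from locust import HttpUser, task
from locust.wait_time import constant_uniform  # name from this PR; import path assumed

class MyUser(HttpUser):
    # aim for one request every 2 seconds per user, spread evenly in time
    wait_time = constant_uniform(2)

    @task
    def index(self):
        self.client.get("/")
```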

Existing Support

Using constant_pacing on a single slave. Notice the bursts followed by periods of no requests. The exact pattern depends on when the locusts/users are hatched:
[screenshot: locust-constant_pacing-single]

Using constant_pacing on multiple slaves. Notice how the inter-arrival time is affected when a new slave joins after around 100 requests:
[screenshot: locust-constant_pacing-multiple]

Added Support

An open-loop load generation behavior has been added which uses the clock time, user ID, and slave ID to decide when to trigger tasks, so that a target statistical distribution of task/request arrival times is achieved.
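
Conceptually, the scheme can be sketched as follows (a simplified illustration of the open-loop idea, not the exact code in this PR; parameter names are illustrative). Each user owns a fixed offset inside the pacing interval and always fires at the next clock point matching that offset:

```python
import time

def time_until_next_slot(interval, locust_id, locusts_per_slave,
                         slave_timeslot, slave_count):
    # Each user gets a fixed offset inside the interval, derived from its
    # slave's timeslot and its own ID, so requests from all users interleave evenly.
    total_users = locusts_per_slave * slave_count
    slot = slave_timeslot * locusts_per_slave + locust_id
    offset = interval * slot / total_users
    now = time.time()  # shared wall clock, so all slaves agree on slot boundaries
    next_fire = (now // interval) * interval + offset
    if next_fire <= now:
        next_fire += interval
    return next_fire - now
```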

Constant Inter-Arrival Time

Using constant_uniform on a single slave:
[screenshot: locust-constant_uniform-single]

Using constant_uniform on multiple slaves. Notice how the system adjusts itself when a slave is added at around 100 requests.
[screenshot: locust-constant_uniform-multiple]

Poisson Distribution Inter-Arrival Time

Using poisson on a single slave:
[screenshot: locust-poisson-single]

Using poisson on multiple slaves:
[screenshot: locust-poisson-multiple]
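
For the Poisson case, inter-arrival times are exponentially distributed, which is what makes the arrivals a Poisson process. A minimal sketch of such a wait_time factory (the signature and rate parameter are assumptions, not necessarily what this PR implements):

```python
import random

def poisson(rate_per_user):
    """Return a wait_time callable whose sleeps are exponentially distributed,
    giving a Poisson arrival process at roughly `rate_per_user` requests
    per second per user."""
    def wait_time(user):
        return random.expovariate(rate_per_user)
    return wait_time
```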

Implementation Details

Most of the edits in core Locust files are related to maintaining an ID for each locust such that IDs are consecutive at all times (even after random locust kills). A similar scheme is used to track and communicate the slave/client ID, and hence the timeslots each client is supposed to issue requests in.
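
As an illustration of the ID bookkeeping (a simplified sketch, not this PR's actual code; the locust_id attribute name is assumed), one way to keep IDs consecutive after a kill is to move the highest-numbered user into the freed slot:

```python
def on_user_killed(users_by_id, killed_id):
    """users_by_id maps consecutive IDs 0..n-1 to user instances."""
    last_id = max(users_by_id)
    users_by_id.pop(killed_id)            # drop the killed user
    if killed_id != last_id:
        moved = users_by_id.pop(last_id)  # move the last user into the gap
        moved.locust_id = killed_id       # attribute name assumed for illustration
        users_by_id[killed_id] = moved
```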

Currently the implementation does not account for differences in the wall-clock times of different nodes. However, if this support gets through, I can work on using a clock synchronization protocol to account for this effect as well.

Comments and suggestions are welcome.

@max-rocket-internet (Contributor) commented:
> I can work on using a clock synchronization protocol to account for this effect as well.

Isn't that a bit out of the scope of Locust? Running and monitoring ntpd or chronyd should be the responsibility of the system, not Locust.

@heyman (Member) commented Mar 10, 2020

Hi! Thanks for a well-described PR.

What happens when some users are killed or spawned during a test (e.g. if a new slave node connects during a test)? Are Locust user instances assigned new IDs?

There is an existing issue where new Locust users are spawned in bursts when running Locust with a large number of slave nodes (#896). One proposed solution is to introduce a different delay for each slave node when spawning locusts (to spread out the spawning). Would these changes be compatible with that behaviour?

@cyberw (Collaborator) commented Mar 10, 2020

> I can work on using a clock synchronization protocol to account for this effect as well.
>
> Isn't that a bit out of the scope of Locust? Running and monitoring ntpd or chronyd should be the responsibility of the system, not Locust.

If this feature causes unexpected/weird behaviour when the clocks are out of sync, then I think we need to at least fail the run or log a warning. I would not want to be the one debugging that :)

(unless of course the behaviour would only impact the in-second distribution of requests, in which case it doesn't really matter that much)

@cyberw (Collaborator) commented Mar 10, 2020

Oh, and speaking of time syncing, I think maybe we should use time.monotonic() instead of time.time() for this timer and all others as well.
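
For illustration, the difference only matters for how timestamps are taken; something like this avoids jumps from NTP/system clock adjustments when measuring intervals:

```python
import time

start = time.monotonic()            # monotonic: immune to system clock adjustments
time.sleep(0.1)
elapsed = time.monotonic() - start  # safe for interval/timer math

wall = time.time()  # wall clock: only needed where slaves must agree on absolute time
```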

@timran1 (Author) commented Mar 10, 2020

> I can work on using a clock synchronization protocol to account for this effect as well.
>
> Isn't that a bit out of the scope of Locust? Running and monitoring ntpd or chronyd should be the responsibility of the system, not Locust.

I agree.

> Hi! Thanks for a well-described PR.
>
> What happens when some users are killed or spawned during a test (e.g. if a new slave node connects during a test)? Are Locust user instances assigned new IDs?

Each slave maintains the IDs for its own locusts (starting from zero on every slave). Moreover, each slave gets a "timeslot" from the master (which is updated every time a slave connects/quits). This slave timeslot makes sure that requests from different slaves are interleaved nicely to get the required distribution. I have included a screenshot of what happens if a slave quits (a slave connects at around 150 requests and quits at around 700 requests):
[screenshot: locust-constant_uniform-multiple-quits]

> There is an existing issue where new Locust users are spawned in bursts when running Locust with a large number of slave nodes (#896). One proposed solution is to introduce a different delay for each slave node when spawning locusts (to spread out the spawning). Would these changes be compatible with that behaviour?

This PR works to decouple the locust hatch time from the request generation pattern by using locust IDs instead. As long as locusts have consecutive IDs at all times, we should get expected results.

On the other hand, I think the timeslot information passed from master to clients in this PR may be used for spacing out hatch times as well.

> I can work on using a clock synchronization protocol to account for this effect as well.
>
> Isn't that a bit out of the scope of Locust? Running and monitoring ntpd or chronyd should be the responsibility of the system, not Locust.
>
> If this feature causes unexpected/weird behaviour when the clocks are out of sync, then I think we need to at least fail the run or log a warning. I would not want to be the one debugging that :)
>
> (unless of course the behaviour would only impact the in-second distribution of requests, in which case it doesn't really matter that much)

Out-of-sync clocks between slaves should affect the in-second distribution only. A task will still be executed at the same interval; just its position in time will be skewed.

> Oh, and speaking of time syncing, I think maybe we should use time.monotonic() instead of time.time() for this timer and all others as well.

Makes sense. I will update the new wait_time functions to use time.monotonic().

One comment I want to make is that these distributions can be disturbed if a task takes longer than the specified wait_time. I generally set a request timeout that is shorter than the wait_time to avoid this.
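
For example, with a 1 s pacing one might cap the request time below the pacing interval (a sketch using the 1.0-style API; the timeout value is illustrative and is passed through to the underlying requests session):

```python
from locust import HttpUser, task
from locust.wait_time import constant_pacing

class MyUser(HttpUser):
    wait_time = constant_pacing(1)

    @task
    def index(self):
        # keep the request well below the 1 s pacing so a slow response
        # cannot push the next iteration past its slot
        self.client.get("/", timeout=0.8)
```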

@timran1 (Author) commented Mar 10, 2020

Additionally, I just pulled the latest changes from master and it looks like #1266 removes the global runners.locust_runner.

The information required in wait_time functions for this PR is:

  1. Slave timeslot
  2. Locust ID

What is the recommended way to bring in this information now? If I add arguments to the wait_time functions it will break existing wait_time functions.

@heyman (Member) commented Apr 1, 2020

> Additionally, I just pulled the latest changes from master and it looks like #1266 removes the global runners.locust_runner.
>
> What is the recommended way to bring in this information now? If I add arguments to the wait_time functions it will break existing wait_time functions.

The runner instance is now accessible through locust_instance.environment.runner (locust_instance is given as an argument to the wait_time functions).
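
For example, a custom wait_time can reach the runner through the user instance it receives (a sketch; what gets read from the runner is up to this PR's implementation):

```python
from locust import HttpUser, task

def my_wait_time(user):
    runner = user.environment.runner  # runner instance, via the new 1.0 API
    # ...derive the sleep time from runner state here (e.g. the slave timeslot)
    return 1.0

class MyUser(HttpUser):
    wait_time = my_wait_time

    @task
    def index(self):
        self.client.get("/")
```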

@cyberw (Collaborator) commented Apr 5, 2020

@timran1 can you have a look at the conflicts? Having this in 1.0 would be nice!

@heyman (Member) commented Apr 5, 2020

> If this feature causes unexpected/weird behaviour when the clocks are out of sync, then I think we need to at least fail the run or log a warning. I would not want to be the one debugging that :)

That's already the case for Locust though, due to how response time stats aggregation works.

There is some commented-out code from back in 2012 that is seemingly supposed to check for this, but those lines seem to have almost been committed by mistake, judging from the commit message from yours truly...

locust/locust/runners.py

Lines 461 to 463 in 2ac0a84

## emit a warning if the worker's clock seem to be out of sync with our clock
#if abs(time() - msg.data["time"]) > 5.0:
#    warnings.warn("The worker node's clock seem to be out of sync. For the statistics to be correct the different locust servers need to have synchronized clocks.")

@codecov bot commented Apr 5, 2020

Codecov Report

Merging #1281 into master will decrease coverage by 0.96%.
The diff coverage is 42.18%.


@@            Coverage Diff             @@
##           master    #1281      +/-   ##
==========================================
- Coverage   80.21%   79.25%   -0.97%     
==========================================
  Files          23       23              
  Lines        2138     2198      +60     
  Branches      322      332      +10     
==========================================
+ Hits         1715     1742      +27     
- Misses        344      373      +29     
- Partials       79       83       +4     
Impacted Files Coverage Δ
locust/wait_time.py 42.00% <21.62%> (-58.00%) ⬇️
locust/runners.py 76.66% <69.23%> (+0.23%) ⬆️
locust/core.py 99.14% <100.00%> (+0.43%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@timran1 (Author) commented Apr 6, 2020

I have fixed the conflicts and updated test cases to account for additional messages sent by broadcast_timeslot.

@cyberw (Collaborator) commented Apr 6, 2020

LGTM. Ok to merge @heyman ?

@heyman (Member) commented Apr 6, 2020

Is it possible to write tests for the wait_time functions as well?

@timran1 (Author) commented Apr 9, 2020

> Is it possible to write tests for the wait_time functions as well?

We may need to run the tests for a while, as the wait functions actually wait, but this should be doable. I will work on these tests in the next couple of days.

@heyman (Member) commented Apr 9, 2020

> We may need to run the tests for a while, as the wait functions actually wait, but this should be doable. I will work on these tests in the next couple of days.

That would be great! I can totally see that it might be far from trivial to write good tests for this, but I think it would be really good to have them since it might be hard to catch regressions in the future otherwise.

@jheld commented Apr 11, 2020

Would freezegun (with tick) or mocking time/monotonic be helpful for running the tests that require long runs? Or am I mistaken, and would the code actually have to run for a certain amount of time for the test to be effective?
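
Patching the clock would indeed avoid real sleeps; a minimal sketch with unittest.mock (assuming the wait_time code reads time via time.monotonic; a module that imports monotonic directly would need to be patched under its own name):

```python
import time
from unittest import mock

def test_time_can_be_faked():
    fake_clock = {"now": 1000.0}

    with mock.patch("time.monotonic", lambda: fake_clock["now"]):
        # code that calls time.monotonic() now sees the fake clock,
        # so a test can "advance" time instantly instead of sleeping
        assert time.monotonic() == 1000.0
        fake_clock["now"] += 2.5
        assert time.monotonic() == 1002.5
```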

@domik82 commented Apr 27, 2020

@timran1 @heyman @cyberw - do you have plans to finish this PR?

@cyberw (Collaborator) commented May 22, 2020

@timran1 Will you have time to look at adding the tests & resolving the conflicts? Sorry for the slow response. Personally I'm not very invested in this change (so I haven't really put in the time to determine whether it is good or not). If there are no further updates I will decline this PR in a week or so (we can always open a new one or reopen it).

@cyberw (Collaborator) commented Jun 4, 2020

Closing due to inactivity. Feel free to reopen if someone (@timran1 ?) has the time to fix the conflicts & add tests.

@cyberw closed this Jun 4, 2020