
Enable statistically meaningful constant RPS load generation distributions #1281

Closed · wants to merge 9 commits

Conversation

@timran1 commented Mar 10, 2020

Locust is very useful for stress testing web services at a constant RPS. Prior work added support for constant_pacing based wait_time, which ensures that the RPS is independent of how long the web service takes to respond. However, Locust currently does not consider at all how the individual requests are spread out within a single second of, say, a 10 RPS load. This can put undue bursty request pressure on the system under test. The situation worsens if a slave is added or removed during load generation.

To measure and understand the existing behaviour, I developed a small web service which simply records and plots the request arrival times, the inter-arrival time between consecutive requests, and the frequency distribution of inter-arrival times. Refer to the screenshots below for particular examples.

A command like the following is used to invoke Locust with the different wait_time schemes implemented in this PR:

locust -f <task_file> --host=<host_address> --csv=locust --no-web -c 20 -r 100 -t 1m

Similar commands are used for distributed master/slave configurations.
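
For reference, a task file using one of these wait_time schemes might look roughly like the sketch below (1.0-style API; the import path and exact signature of constant_uniform are assumptions based on this PR's naming):

```python
from locust import HttpUser, task
from locust.wait_time import constant_uniform  # name from this PR; import path assumed

class MyUser(HttpUser):
    # aim for one request every 2 seconds per user, spread evenly in time
    wait_time = constant_uniform(2)

    @task
    def index(self):
        self.client.get("/")
```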

Existing Support

Using constant_pacing on a single slave. Notice the bursts followed by periods of no requests. The exact pattern depends on when the locusts/users are hatched:
[screenshot: locust-constant_pacing-single]

Using constant_pacing on multiple slaves. Notice how the inter-arrival time is affected when a new slave joins after around 100 requests:
[screenshot: locust-constant_pacing-multiple]

Added Support

An open-loop load generation behavior has been added which uses the clock time, user ID, and slave ID to decide when to trigger tasks, so that a target statistical distribution of task/request arrival times is achieved.
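
Conceptually, the scheme can be sketched as follows (a simplified illustration of the open-loop idea, not the exact code in this PR; parameter names are illustrative). Each user owns a fixed offset inside the pacing interval and always fires at the next clock point matching that offset:

```python
import time

def time_until_next_slot(interval, locust_id, locusts_per_slave,
                         slave_timeslot, slave_count):
    # Each user gets a fixed offset inside the interval, derived from its
    # slave's timeslot and its own ID, so requests from all users interleave evenly.
    total_users = locusts_per_slave * slave_count
    slot = slave_timeslot * locusts_per_slave + locust_id
    offset = interval * slot / total_users
    now = time.time()  # shared wall clock, so all slaves agree on slot boundaries
    next_fire = (now // interval) * interval + offset
    if next_fire <= now:
        next_fire += interval
    return next_fire - now
```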

Constant Inter-Arrival Time

Using constant_uniform on a single slave:
[screenshot: locust-constant_uniform-single]

Using constant_uniform on multiple slaves. Notice how the system adjusts itself when a slave is added at around 100 requests.
[screenshot: locust-constant_uniform-multiple]

Poisson Distribution Inter-Arrival Time

Using poisson on a single slave:
[screenshot: locust-poisson-single]

Using poisson on multiple slaves:
[screenshot: locust-poisson-multiple]
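
For the Poisson case, inter-arrival times are exponentially distributed, which is what makes the arrivals a Poisson process. A minimal sketch of such a wait_time factory (the signature and rate parameter are assumptions, not necessarily what this PR implements):

```python
import random

def poisson(rate_per_user):
    """Return a wait_time callable whose sleeps are exponentially distributed,
    giving a Poisson arrival process at roughly `rate_per_user` requests
    per second per user."""
    def wait_time(user):
        return random.expovariate(rate_per_user)
    return wait_time
```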

Implementation Details

Most of the edits in core Locust files are related to maintaining an ID for each locust such that IDs are consecutive at all times (even after random locust kills). A similar scheme is used to track and communicate the slave/client ID, and hence the timeslots each client is supposed to issue requests in.
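
As an illustration of the ID bookkeeping (a simplified sketch, not this PR's actual code; the locust_id attribute name is assumed), one way to keep IDs consecutive after a kill is to move the highest-numbered user into the freed slot:

```python
def on_user_killed(users_by_id, killed_id):
    """users_by_id maps consecutive IDs 0..n-1 to user instances."""
    last_id = max(users_by_id)
    users_by_id.pop(killed_id)            # drop the killed user
    if killed_id != last_id:
        moved = users_by_id.pop(last_id)  # move the last user into the gap
        moved.locust_id = killed_id       # attribute name assumed for illustration
        users_by_id[killed_id] = moved
```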

Currently the implementation does not account for differences in the wall-clock times of different nodes. However, if this support gets through, I can work on using a clock synchronization protocol to account for this effect as well.

Comments and suggestions are welcome.

@max-rocket-internet (Contributor) commented:
> I can work on using a clock synchronization protocol to account for this effect as well.

Isn't that a bit out of the scope of Locust? Running and monitoring ntpd or chronyd should be the responsibility of the system, not Locust.

@heyman (Member) commented Mar 10, 2020

Hi! Thanks for a well-described PR.

What happens when some users are killed or spawned during a test (e.g. if a new slave node connects during a test)? Are Locust user instances assigned new IDs?

There is an existing issue where new Locust users are spawned in bursts when running Locust with a large number of slave nodes (#896). One proposed solution is to introduce a different delay for each slave node when spawning locusts (to spread out the spawning). Would these changes be compatible with that behaviour?

@cyberw (Collaborator) commented Mar 10, 2020

> I can work on using a clock synchronization protocol to account for this effect as well.
>
> Isn't that a bit out of the scope of Locust? Running and monitoring ntpd or chronyd should be the responsibility of the system, not Locust.

If this feature causes unexpected/weird behaviour when the clocks are out of sync, then I think we need to at least fail the run or log a warning. I would not want to be the one debugging that :)

(unless of course the behaviour would only impact the in-second distribution of requests, in which case it doesn't really matter that much)

@cyberw (Collaborator) commented Mar 10, 2020

Oh, and speaking of time syncing, I think maybe we should use time.monotonic() instead of time.time() for this timer and all others as well.
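
For illustration, the difference only matters for how timestamps are taken; something like this avoids jumps from NTP/system clock adjustments when measuring intervals:

```python
import time

start = time.monotonic()            # monotonic: immune to system clock adjustments
time.sleep(0.1)
elapsed = time.monotonic() - start  # safe for interval/timer math

wall = time.time()  # wall clock: only needed where slaves must agree on absolute time
```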

@timran1 (Author) commented Mar 10, 2020

> I can work on using a clock synchronization protocol to account for this effect as well.
>
> Isn't that a bit out of the scope of Locust? Running and monitoring ntpd or chronyd should be the responsibility of the system, not Locust.

I agree.

> Hi! Thanks for a well-described PR.
>
> What happens when some users are killed or spawned during a test (e.g. if a new slave node connects during a test)? Are Locust user instances assigned new IDs?

Each slave maintains the IDs for its own locusts (starting from zero on every slave). Moreover, each slave gets a "timeslot" from the master (which is updated every time a slave connects/quits). This slave timeslot makes sure that requests from different slaves are interleaved nicely to get the required distribution. I have included a screenshot of what happens if a slave quits (a slave connects at around 150 requests and quits at around 700 requests):
[screenshot: locust-constant_uniform-multiple-quits]

> There is an existing issue where new Locust users are spawned in bursts when running Locust with a large number of slave nodes (#896). One proposed solution is to introduce a different delay for each slave node when spawning locusts (to spread out the spawning). Would these changes be compatible with that behaviour?

This PR works to decouple the locust hatch time from the request generation pattern by using locust IDs instead. As long as locusts have consecutive IDs at all times, we should get expected results.

On the other hand, I think the timeslot information passed from master to clients in this PR may be used for spacing out hatch times as well.

> I can work on using a clock synchronization protocol to account for this effect as well.
>
> Isn't that a bit out of the scope of Locust? Running and monitoring ntpd or chronyd should be the responsibility of the system, not Locust.
>
> If this feature causes unexpected/weird behaviour when the clocks are out of sync, then I think we need to at least fail the run or log a warning. I would not want to be the one debugging that :)
>
> (unless of course the behaviour would only impact the in-second distribution of requests, in which case it doesn't really matter that much)

Out-of-sync clocks between slaves should affect the in-second distribution only. A task will still be executed at the same interval; just its position in time will be skewed.

> Oh, and speaking of time syncing, I think maybe we should use time.monotonic() instead of time.time() for this timer and all others as well.

Makes sense. I will update the new wait_time functions to use time.monotonic().

One comment I want to make is that these distributions can be disturbed if a task takes longer than the specified wait_time. I generally set a request timeout that is shorter than the wait_time to avoid this.
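
For example, with a 1 s pacing one might cap the request time below the pacing interval (a sketch using the 1.0-style API; the timeout value is illustrative and is passed through to the underlying requests session):

```python
from locust import HttpUser, task
from locust.wait_time import constant_pacing

class MyUser(HttpUser):
    wait_time = constant_pacing(1)

    @task
    def index(self):
        # keep the request well below the 1 s pacing so a slow response
        # cannot push the next iteration past its slot
        self.client.get("/", timeout=0.8)
```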

@timran1 (Author) commented Mar 10, 2020

Additionally, I just pulled the latest changes from master and it looks like #1266 removes the global runners.locust_runner.

The information required in wait_time functions for this PR is:

  1. Slave timeslot
  2. Locust ID

What is the recommended way to bring in this information now? If I add arguments to the wait_time functions it will break existing wait_time functions.

@heyman (Member) commented Apr 1, 2020

> Additionally, I just pulled the latest changes from master and it looks like #1266 removes the global runners.locust_runner.
>
> What is the recommended way to bring in this information now? If I add arguments to the wait_time functions it will break existing wait_time functions.

The runner instance is now accessible through locust_instance.environment.runner (locust_instance is given as an argument to the wait_time functions).
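
For example, a custom wait_time can reach the runner through the user instance it receives (a sketch; what gets read from the runner is up to this PR's implementation):

```python
from locust import HttpUser, task

def my_wait_time(user):
    runner = user.environment.runner  # runner instance, via the new 1.0 API
    # ...derive the sleep time from runner state here (e.g. the slave timeslot)
    return 1.0

class MyUser(HttpUser):
    wait_time = my_wait_time

    @task
    def index(self):
        self.client.get("/")
```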

@cyberw (Collaborator) commented Apr 5, 2020

@timran1 can you have a look at the conflicts? Having this in 1.0 would be nice!

@heyman (Member) commented Apr 5, 2020

> If this feature causes unexpected/weird behaviour when the clocks are out of sync, then I think we need to at least fail the run or log a warning. I would not want to be the one debugging that :)

That's already the case for Locust though, due to how response time stats aggregation works.

There is some commented-out code from back in 2012 that is seemingly supposed to check for this, but those lines seem to have almost been committed by mistake, judging from the commit message from yours truly...

locust/locust/runners.py

Lines 461 to 463 in 2ac0a84

## emit a warning if the worker's clock seem to be out of sync with our clock
#if abs(time() - msg.data["time"]) > 5.0:
#    warnings.warn("The worker node's clock seem to be out of sync. For the statistics to be correct the different locust servers need to have synchronized clocks.")

@codecov bot commented Apr 5, 2020

Codecov Report

Merging #1281 into master will decrease coverage by 0.96%.
The diff coverage is 42.18%.


@@            Coverage Diff             @@
##           master    #1281      +/-   ##
==========================================
- Coverage   80.21%   79.25%   -0.97%     
==========================================
  Files          23       23              
  Lines        2138     2198      +60     
  Branches      322      332      +10     
==========================================
+ Hits         1715     1742      +27     
- Misses        344      373      +29     
- Partials       79       83       +4     
Impacted Files Coverage Δ
locust/wait_time.py 42.00% <21.62%> (-58.00%) ⬇️
locust/runners.py 76.66% <69.23%> (+0.23%) ⬆️
locust/core.py 99.14% <100.00%> (+0.43%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@timran1 (Author) commented Apr 6, 2020

I have fixed the conflicts and updated test cases to account for additional messages sent by broadcast_timeslot.

@cyberw (Collaborator) commented Apr 6, 2020

LGTM. Ok to merge @heyman ?

@heyman (Member) commented Apr 6, 2020

Is it possible to write tests for the wait_time functions as well?

@timran1 (Author) commented Apr 9, 2020

> Is it possible to write tests for the wait_time functions as well?

We may need to run the tests for a while, as the wait functions actually wait, but this should be doable. I will work on these tests in the next couple of days.

@heyman (Member) commented Apr 9, 2020

> We may need to run the tests for a while, as the wait functions actually wait, but this should be doable. I will work on these tests in the next couple of days.

That would be great! I can totally see that it might be far from trivial to write good tests for this, but I think it would be really good to have them since it might be hard to catch regressions in the future otherwise.

@jheld commented Apr 11, 2020

Would freezegun (with tick) or mocking time/monotonic be helpful for running the tests that require long runs? Or am I mistaken, and would the code actually have to run for a certain amount of time for the test to be effective?
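
Patching the clock would indeed avoid real sleeps; a minimal sketch with unittest.mock (assuming the wait_time code reads time via time.monotonic; a module that imports monotonic directly would need to be patched under its own name):

```python
import time
from unittest import mock

def test_time_can_be_faked():
    fake_clock = {"now": 1000.0}

    with mock.patch("time.monotonic", lambda: fake_clock["now"]):
        # code that calls time.monotonic() now sees the fake clock,
        # so a test can "advance" time instantly instead of sleeping
        assert time.monotonic() == 1000.0
        fake_clock["now"] += 2.5
        assert time.monotonic() == 1002.5
```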

@domik82 commented Apr 27, 2020

@timran1 @heyman @cyberw - do you have plans to finish this PR?

@cyberw (Collaborator) commented May 22, 2020

@timran1 Will you have time to look at adding the tests & resolving the conflicts? Sorry for the slow response. Personally I'm not very invested in this change (so I haven't really put in the time to determine whether it is good or not). If there are no further updates I will decline this PR in a week or so (we can always open a new one or reopen it).

@cyberw (Collaborator) commented Jun 4, 2020

Closing due to inactivity. Feel free to reopen if someone (@timran1 ?) has the time to fix the conflicts & add tests.

@cyberw closed this Jun 4, 2020