Dispatcher not killed on runner.stop() #2938

Open

PatrikMosko opened this issue Oct 11, 2024 · 7 comments
PatrikMosko commented Oct 11, 2024

Description

Hello there.

I am building a custom wrapper on top of Locust. Most of the upstream Locust functionality is preserved and some parts are monkey-patched. I mention this even though it should not affect the issue described below.

I am trying to implement a custom failure-rate monitor that stops the test run prematurely, before --run-time has been expended, once the failure rate exceeds a configured threshold.

With a standalone runner there are no issues. A distributed run is also fine if all users are already spawned before the threshold is reached.

However, the real issue appears when I set, say, --users=300 and my custom event stops the runner while only a subset of the users has been spawned. The worker is stopped and immediately sends a message that it is ready again. What the master then does is run its start() method in a greenlet, which spawns additional greenlets in a separate gevent group that is not stored anywhere on the runner, so I cannot see or kill it. On top of that, MasterRunner does NOT actually kill the user dispatcher in its stop() method; it merely sets it to None:
self._users_dispatcher = None
which means the gevent group hooks back into the worker and keeps spawning users.
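
To illustrate the behaviour I would expect, something along these lines (a sketch only; the spawning_greenlet attribute and the exact stop() signature are my assumptions about the internals and may differ between Locust versions):

from locust.runners import MasterRunner


class KillingMasterRunner(MasterRunner):
    # Illustrative subclass only, not a proposed patch.

    def stop(self, *args, **kwargs):
        # Kill any in-flight spawning work first, so a worker that reports
        # "ready" right after stopping is not handed more users to spawn.
        spawning = getattr(self, "spawning_greenlet", None)  # attribute name assumed
        if spawning is not None and not spawning.dead:
            spawning.kill(block=True)
        # Upstream stop() then discards the dispatcher (sets it to None).
        super().stop(*args, **kwargs)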

What's strange is that it looks like it spawns exactly 60 users before runner.quit() is called and both the master and worker runners quit...

What I would like to do is simply stop both the master AND the worker runners, reset the user dispatcher and user variables in both, and be able to restart the test from the web UI.

Being forced to quit the worker outright is a workaround I would rather avoid.

All help would be appreciated.

Command line

mylocust --host 127.0.0.1 --root . -L DEBUG -f locustfile.py --autostart --autoquit 1 --users 200 --spawn-rate 1 -t 40 --master; mylocust --host 127.0.0.1 --root . -L DEBUG -f locustfile.py --autostart --autoquit 1 --users 200 --spawn-rate 1 -t 40 --worker

Locustfile contents

import warnings

from mylocust.events.failure_rate_monitor_fema import *
from locust import HttpUser, constant_pacing, events, tag, task

warnings.filterwarnings("ignore")


@tag("labs")
@task
def get_labs(self):
    self.client.get("/api/v0/labs")


@tag("labs")
@task
def delete_lab(self):
    self.client.delete("/api/v0/labs/9e232a38-ad8b-48a0-9f2e-cba37d661453")


class MyUser(HttpUser):
    wait_time = constant_pacing(1)
    weight = 30
    tasks = [delete_lab, get_labs]

    def on_start(self):
        self.client.verify = False
        response = self.client.post(
            "/api/v0/authenticate", json={"username": "user1", "password": "password123"}
        )
        token = response.text.strip('"')
        self.client.headers = {"Authorization": f"Bearer {token}"}

# part of the code from custom locust "init" event
def init_failure_rate_monitor(
...
        # end test run once we reach high-enough failure rate
        if fema.ema >= max_failure_rate:
            logger.info(
                "Failure rate passed the selected threshold of "
                f"'{max_failure_rate * 100}%'."
            )
            gevent.spawn(
                environment.my_runner.stop_and_optionally_quit, force=True
            ).link_exception(environment.my_runner._greenlet_exception_handler)
            break 
...

@events.init.add_listener
def event_init_failure_rate_monitor(environment, max_failure_rate: float = 0.05, **__):
    # Master/Standalone only
    if not isinstance(environment.runner, WorkerRunner):
        gevent.spawn(
            init_failure_rate_monitor,
            environment,
            max_failure_rate=max_failure_rate
        )

Python version

3.12

Locust version

2.31.8

Operating system

Ubuntu 22.04.5 LTS

PatrikMosko changed the title from "Dispatcher not killed" to "Dispatcher not killed on runner.stop()" on Oct 14, 2024

cyberw (Collaborator) commented Oct 14, 2024

Hi Patrik!

Right now "stopping" workers is a bit weird - workers send a message to master saying they have stopped and then send a "client_ready" message which is, in the master's eyes, pretty much the same as a worker connecting. I think maybe if we started by cleaning that up a little, it would be easier to make this work.

[2024-10-14 15:44:04,656] lars-mbp/INFO/locust.runners: lars-mbp.local_495d30b64433446dbcd700c1ff6c6af1 (index 0) reported that it has stopped
[2024-10-14 15:44:04,656] lars-mbp/INFO/locust.runners: lars-mbp.local_495d30b64433446dbcd700c1ff6c6af1 (index 0) reported as ready. 1 workers connected.

PatrikMosko (Author) commented:

In the meantime, I have realized that a "STOPPED" worker might actually mean different things, and I probably assumed one of them automatically; after scrolling through the code over and over again, I am not sure there is an actual consensus on what STOPPED should mean.

In the case of this "bug", it effectively acts as if it meant "PAUSED", with the aforementioned auto-unpausing happening right after.

The biggest issue I have is that when I hit STOP, I expect everything to truly stop, including the dispatcher, and all data, such as the HTTP users from the current run, to be purged. If we want to persist the current state and just temporarily halt it, maybe adding a separate PAUSED state would be an option.


AbyPaul01 commented:

Are we planning this cleanup of the stop/client_ready flow? If so, please assign the task to me; I would like to contribute.


cyberw (Collaborator) commented Nov 20, 2024

Nice! Assigned it to you!


AbyPaul01 commented:

I want to check whether my thought process is aligned.

Currently, once the master sends a stop message, each worker performs the following steps in sequence:

  1. stop() - kills the greenlets on that worker
  2. The worker sends a client_stopped message to the master. Based on this, the master removes the client from its client list and from the dispatcher
  3. Sleep for a short interval
  4. The worker sends a client_ready message to the master. Based on this, the master adds the client back to its client list and adds the worker node back to the dispatcher
  5. Set the worker state to ready

Change: we could remove steps 2, 3 and 4, so that once the master sends the stop message the worker simply kills its greenlets and sets its state to ready. The master would then retain its full client list and the worker nodes in the dispatcher.
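
A rough sketch of the sequence above and the proposed change (schematic only; send() stands in for the worker's rpc client, and none of this is locust's actual code):

import gevent

SHORT_INTERVAL = 1.0  # placeholder for the short sleep in step 3


def handle_master_stop(worker, proposed=False):
    # `worker` stands in for a WorkerRunner; this is schematic, not locust code.
    worker.stop()  # step 1: kill this worker's user greenlets

    if not proposed:
        # Steps 2-4 today: report "client_stopped" (master drops the node from
        # its client list and from the dispatcher), sleep briefly, then
        # re-register with "client_ready" (master adds the node straight back).
        worker.send("client_stopped")
        gevent.sleep(SHORT_INTERVAL)
        worker.send("client_ready")

    # Step 5 / the proposal: just mark the worker ready again; the master keeps
    # the node in its client list and in the dispatcher throughout.
    worker.worker_state = "ready"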


cyberw (Collaborator) commented Nov 20, 2024

I don't think we can remove all of that, because the master still needs to know when the workers have completed their stopping procedure (shutting down all Users, for example).

This issue was originally about the dispatcher not being stopped; perhaps that could be fixed by ensuring it is properly stopped (somehow) before throwing away the instance (setting it to None). What I was talking about is more about changing the communication so that it is clearer to the master (sending something like "stopping_finished" instead of "client_ready", if you see what I mean).
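
Something along these lines, as a very rough sketch (the send() helper, the clients mapping access and the finish_stop() hook are illustrations of the idea, not actual locust code):

# Worker side: after shutting down its Users, report a dedicated message
# instead of re-sending "client_ready".
def report_stopping_finished(worker):
    worker.send("stopping_finished")  # send() stands in for the rpc client


# Master side: treat the message as completion of the stop rather than as a
# new connection, so the node stays in the client list and the dispatcher can
# be shut down cleanly instead of just being set to None.
def on_stopping_finished(master, node_id):
    master.clients[node_id].state = "stopped"
    if all(c.state == "stopped" for c in master.clients.values()):
        master.finish_stop()  # hypothetical hook: stop dispatcher, mark run stopped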

Maybe this isn't a huge priority...


AbyPaul01 commented:

Understood, thanks.
