
Load being distributed to missing workers too Version 2.7.4.dev14 #2008

Closed
radhakrishnaakamat opened this issue Feb 11, 2022 · 10 comments · Fixed by #2010
@radhakrishnaakamat

When running Locust in distributed mode in Kubernetes, if a pod restarts, the worker's status changes to missing. I can see that spawning distributes the users equally to all workers, but in this case it distributes to missing workers as well, which leads to a reduced spawning rate.
Kindly help!
I have made some changes to add Kubernetes support natively, and have shared the details in the Slack channel. Kindly suggest how we can take this further.

@cyberw
Collaborator

cyberw commented Feb 11, 2022

Hi! @mboutet is the person with the most insight into user distribution, but I think he'll need more to go on.

Perhaps you can write a failing unit test that shows your issue? Look here for inspiration, and run tox or pytest to execute it. https://github.com/locustio/locust/blob/master/locust/test/test_dispatch.py
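In the meantime, here is a minimal plain-Python sketch of the reported symptom (this is not Locust's actual dispatcher API; `dispatch_users` is a hypothetical helper used only for illustration): distributing users evenly over a pool that still contains a missing worker shrinks every live worker's share.

```python
# Simplified illustration (not Locust's actual dispatcher): if "missing"
# workers are left in the pool, each remaining worker gets fewer users than
# intended, which matches the reduced-spawn-rate symptom described above.

def dispatch_users(workers, total_users):
    """Spread total_users as evenly as possible across the given worker ids."""
    base, extra = divmod(total_users, len(workers))
    return {w: base + (1 if i < extra else 0) for i, w in enumerate(workers)}

all_workers = ["worker-0", "worker-1", "worker-2", "worker-3"]
missing = {"worker-3"}  # e.g. a restarted pod whose old id never reconnects

buggy = dispatch_users(all_workers, 100)  # missing worker still included
fixed = dispatch_users([w for w in all_workers if w not in missing], 100)

print(buggy["worker-3"])  # 25 -- users assigned to a worker that will never spawn them
print(fixed["worker-0"])  # 34 -- live workers pick up the slack
```

A failing unit test in the style of test_dispatch.py would assert the `fixed` distribution and observe the `buggy` one instead.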

(my opinion has always been that it is ok for locust to fail miserably if workers go missing in the middle of a test, because IMO that test is already invalid :)

@mboutet
Contributor

mboutet commented Feb 13, 2022

@radhakrishnaakamat, I also used Locust in k8s at the previous company I worked for. One of the things I did to prevent restarted pods from going missing forever and reappearing as another worker was to run the workers as a stateful set instead of a deployment. Doing so made Locust less flaky because the worker would reconnect with the same ID.
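As a rough sketch of that setup (names, replica count, and args here are illustrative, not taken from this thread), a StatefulSet gives each pod a stable ordinal name (locust-worker-0, locust-worker-1, ...), so a restarted pod rejoins the master under the same identity:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: locust-worker
spec:
  serviceName: locust-worker
  replicas: 3
  selector:
    matchLabels:
      app: locust-worker
  template:
    metadata:
      labels:
        app: locust-worker
    spec:
      containers:
        - name: worker
          image: locustio/locust
          args: ["--worker", "--master-host", "locust-master"]
```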

Also, there's already logic in the master runner that removes missing workers from the dispatcher as you can see here:

locust/locust/runners.py

Lines 849 to 856 in 440c612

if client.heartbeat < 0 and client.state != STATE_MISSING:
    logger.info("Worker %s failed to send heartbeat, setting state to missing." % str(client.id))
    client.state = STATE_MISSING
    client.user_classes_count = {}
    if self._users_dispatcher is not None:
        self._users_dispatcher.remove_worker(client)
        if self.rebalancing_enabled() and self.state == STATE_RUNNING and self.spawning_completed:
            self.start(self.target_user_count, self.spawn_rate)

Unless there's a bug in this logic (which is completely possible), the logic proposed in #2010 might be redundant. Am I missing something?

@radhakrishnaakamat
Author

Closing this issue as it is linked with #2010.

@mboutet
Contributor

mboutet commented Feb 13, 2022

Did you want to link the issue with the PR? If so, then you should simply add "Fixes #<the_issue_number>" to the PR's description.

@radhakrishnaakamat
Author

radhakrishnaakamat commented Feb 13, 2022

> @radhakrishnaakamat, I also used Locust in k8s at the previous company I worked for. One of the things I did to prevent restarted pods from going missing forever and reappearing as another worker was to run the workers as a stateful set instead of a deployment. […]
>
> Also, there's already logic in the master runner that removes missing workers from the dispatcher […]
>
> Unless there's a bug in this logic (which is completely possible), the logic proposed in #2010 might be redundant. Am I missing something?

Though the worker is removed from the dispatcher, it is still being used in the distribution. So I did some debugging with print statements (any resource on how to debug a distributed setup locally would help a lot, as I am unable to attach a debugger due to gevent) and found that the redistribution didn't happen; only after adding this logic did it start working. I am trying to optimize my changes to remove the worker from the runner environment too, so that it is also consistent in the UI.

Also, I feel that running as a deployment saves resources (not significantly in this case) compared to a stateful set.

@radhakrishnaakamat
Author

> Did you want to link the issue with the PR? If so, then you should simply add "Fixes #<the_issue_number>" to the PR's description.

Thank you, I was trying the same.

@mboutet
Contributor

mboutet commented Feb 13, 2022

If you're using PyCharm, you need to enable gevent compatibility in the settings.

For debugging the distributed scenario, your best bet is to write integration tests such as the ones in:

class TestMasterWorkerRunners(LocustTestCase):

I don't think you're fixing the issue in the right place. The UsersDispatcher should be alright. The issue is likely due to one or more bugs in the runners.py module.

Try to design integration tests such as the ones in

class TestRemoveWorker(unittest.TestCase):

that will reproduce the issue.


That being said, I experienced various issues like yours when running in k8s at my last job. I made a few "improvements" on my fork at master...mboutet:mboutet-master (I did not update the tests, so they are failing at the moment).

The main changes are:

  • Add the LOCUST_WORKER_ID_OVERRIDE environment variable on the worker so that a worker running in a pod can keep the same worker id across pod restarts. This needs to be combined with workers running in a stateful set.

  • Implement a periodic rebalance background job (enabled using the --enable-rebalancing flag). This will rebalance the users every 10s.

  • Implement a rebalance_work_queue (using gevent.queue.Queue()) that queues all the requested rebalances from the various places in the master. This queue prevents a bunch of rebalances from conflicting with each other. Instead, they are processed on a FIFO basis.
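The third point above could be sketched roughly like this (the fork uses gevent.queue.Queue; the stdlib queue shown here has the same FIFO semantics, and all names are illustrative, not the fork's actual code): producers anywhere in the master enqueue rebalance requests, and a single consumer applies them one at a time so that concurrent triggers (heartbeat loss, auto-scaling, the periodic job) cannot conflict.

```python
import queue
import threading

rebalance_work_queue = queue.Queue()
STOP = object()  # sentinel to shut the consumer down cleanly

def request_rebalance(reason):
    """Called from anywhere in the master that wants a rebalance."""
    rebalance_work_queue.put(reason)

def rebalance_consumer(apply_rebalance):
    """Single consumer: drain requests one at a time, in FIFO order."""
    while True:
        reason = rebalance_work_queue.get()
        if reason is STOP:
            break
        apply_rebalance(reason)

applied = []
consumer = threading.Thread(target=rebalance_consumer, args=(applied.append,))
consumer.start()
for reason in ["worker_missing", "periodic", "worker_added"]:
    request_rebalance(reason)
rebalance_work_queue.put(STOP)
consumer.join()
print(applied)  # ['worker_missing', 'periodic', 'worker_added']
```

Because only one greenlet (or thread, in this sketch) ever applies a rebalance, there is no need for locking around the dispatcher itself.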

Obviously, these changes are highly experimental, but I found that they made Locust a lot more stable in a k8s environment with a lot of dynamic events such as restarting pods and auto-scaling workers.

Perhaps you could try my fork in your environment?

@radhakrishnaakamat
Author

> If you're using PyCharm, you need to enable gevent compatibility in the settings.
>
> For debugging the distributed scenario, your best bet is to write integration tests […]
>
> Obviously, these changes are highly experimental, but I found that they made Locust a lot more stable in a k8s environment with a lot of dynamic events such as restarting pods and auto-scaling workers.
>
> Perhaps you could try my fork in your environment?

Thank you, this is really helpful.

@b-heimann-senacor

> The main changes are:
>
> • Add the LOCUST_WORKER_ID_OVERRIDE environment variable on the worker so that a worker running in a pod can keep the same worker id across pod restarts. This needs to be combined with workers running in a stateful set.
I couldn't find any information on this environment variable; is it still supported?

Currently, I am facing an issue where occasionally a Locust worker loses its connection to the master and continues executing requests forever as an orphan.

@cyberw
Collaborator

cyberw commented Nov 13, 2023

Disconnected workers should probably stop themselves after a while (maybe 60s?) but that is a different issue.
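A minimal sketch of that self-stop idea (this is not an existing Locust feature or flag; the class and names are hypothetical): the worker records when it last heard from the master, and a watchdog decides to quit once the silence exceeds a timeout.

```python
import time

MASTER_SILENCE_TIMEOUT = 60.0  # seconds; illustrative value from the comment above

class OrphanWatchdog:
    def __init__(self, timeout=MASTER_SILENCE_TIMEOUT, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock          # injectable clock makes this testable
        self.last_heard = clock()

    def heartbeat_received(self):
        """Call whenever any message arrives from the master."""
        self.last_heard = self.clock()

    def should_stop(self):
        """True once the master has been silent for longer than the timeout."""
        return self.clock() - self.last_heard > self.timeout

# Demo with a fake clock instead of real time:
fake_now = [0.0]
dog = OrphanWatchdog(clock=lambda: fake_now[0])
fake_now[0] = 30.0
print(dog.should_stop())  # False -- still within the window
fake_now[0] = 61.0
print(dog.should_stop())  # True -- the orphaned worker should shut itself down
```

The worker's main loop would poll `should_stop()` periodically and exit instead of continuing to execute requests as an orphan.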
