
Slave count stuck at 3 instead of decreasing to 1 due to "missing" #1158

Closed
max-rocket-internet opened this issue Nov 18, 2019 · 9 comments

@max-rocket-internet
Contributor

Master logs after scaling the slaves down:

[2019-11-18 16:57:02,829] locust-opa-master-787476b668-74txj/INFO/locust.main: Starting web monitor at *:8089
[2019-11-18 16:57:02,830] locust-opa-master-787476b668-74txj/INFO/locust.main: Starting Locust 0.13.1
[2019-11-18 16:57:24,100] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-cjkwb_594d3fd33f864c34aaabbb987cd79e65' reported as ready. Currently 1 clients ready to swarm.
[2019-11-18 16:57:24,210] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-pljft_240da2aa2ada4a0993fad606fb5976e0' reported as ready. Currently 2 clients ready to swarm.
[2019-11-18 16:57:24,221] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-zckdc_b42adb57f2d34ef885b0bcc228068102' reported as ready. Currently 3 clients ready to swarm.
[2019-11-18 16:57:24,253] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-z2g29_7772633197e6471b9344951aba939c39' reported as ready. Currently 4 clients ready to swarm.
[2019-11-18 16:57:24,356] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-d95g2_a3a40cf055d749ea8b6c0e3246f22faa' reported as ready. Currently 5 clients ready to swarm.
[2019-11-18 16:57:24,381] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-mm7dt_ee79eead69c84802a096a12f6e6ef1e5' reported as ready. Currently 6 clients ready to swarm.
[2019-11-18 16:57:24,388] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-b7c2k_a961e9574df048739d1057d70dff35bd' reported as ready. Currently 7 clients ready to swarm.
[2019-11-18 16:57:24,612] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-kfjh2_c080829438e64f76a02090adaad1a0dc' reported as ready. Currently 8 clients ready to swarm.
[2019-11-18 16:57:24,626] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-d2kg5_d5ee1c3c3dcb4762b87872edcdcb0e28' reported as ready. Currently 9 clients ready to swarm.
[2019-11-18 16:57:24,636] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-hwh4v_77d2b07171a549fd9af73a8a86222c81' reported as ready. Currently 10 clients ready to swarm.
[2019-11-18 16:57:24,690] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-vpfnf_5217f03d668f4bf2b3ae7580e8e701eb' reported as ready. Currently 11 clients ready to swarm.
[2019-11-18 16:57:24,764] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-sl24v_0da998e92bcd46b1ba7161d260bc3edf' reported as ready. Currently 12 clients ready to swarm.
[2019-11-18 16:57:24,786] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-nw5ln_97ac9424368546b1b947441ce2d11f3a' reported as ready. Currently 13 clients ready to swarm.
[2019-11-18 16:57:24,804] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-w6lhx_fa3a32665a814af8af8fca9657814d1f' reported as ready. Currently 14 clients ready to swarm.
[2019-11-18 16:57:24,932] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-qr4tf_8ff013c045e7445494154d52f3a5dda7' reported as ready. Currently 15 clients ready to swarm.
[2019-11-18 16:57:25,074] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-jb2rr_8e423e848d9b442e9cfc02d6c1cd6aca' reported as ready. Currently 16 clients ready to swarm.
[2019-11-18 16:57:34,843] locust-opa-master-787476b668-74txj/INFO/locust.runners: Slave locust-opa-slave-86cf6df9fb-zckdc_b42adb57f2d34ef885b0bcc228068102 failed to send heartbeat, setting state to missing.
[2019-11-18 16:57:34,843] locust-opa-master-787476b668-74txj/INFO/locust.runners: Slave locust-opa-slave-86cf6df9fb-z2g29_7772633197e6471b9344951aba939c39 failed to send heartbeat, setting state to missing.
[2019-11-18 17:02:30,550] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-b7c2k_a961e9574df048739d1057d70dff35bd' quit. Currently 13 clients connected.
[2019-11-18 17:02:30,557] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-vpfnf_5217f03d668f4bf2b3ae7580e8e701eb' quit. Currently 12 clients connected.
[2019-11-18 17:02:30,559] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-sl24v_0da998e92bcd46b1ba7161d260bc3edf' quit. Currently 11 clients connected.
[2019-11-18 17:02:30,560] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-jb2rr_8e423e848d9b442e9cfc02d6c1cd6aca' quit. Currently 10 clients connected.
[2019-11-18 17:02:30,563] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-kfjh2_c080829438e64f76a02090adaad1a0dc' quit. Currently 9 clients connected.
[2019-11-18 17:02:30,569] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-hwh4v_77d2b07171a549fd9af73a8a86222c81' quit. Currently 8 clients connected.
[2019-11-18 17:02:30,571] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-w6lhx_fa3a32665a814af8af8fca9657814d1f' quit. Currently 7 clients connected.
[2019-11-18 17:02:30,580] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-nw5ln_97ac9424368546b1b947441ce2d11f3a' quit. Currently 6 clients connected.
[2019-11-18 17:02:30,581] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-pljft_240da2aa2ada4a0993fad606fb5976e0' quit. Currently 5 clients connected.
[2019-11-18 17:02:30,588] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-mm7dt_ee79eead69c84802a096a12f6e6ef1e5' quit. Currently 4 clients connected.
[2019-11-18 17:02:30,599] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-cjkwb_594d3fd33f864c34aaabbb987cd79e65' quit. Currently 3 clients connected.
[2019-11-18 17:02:30,599] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-d2kg5_d5ee1c3c3dcb4762b87872edcdcb0e28' quit. Currently 2 clients connected.
[2019-11-18 17:02:30,604] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-d95g2_a3a40cf055d749ea8b6c0e3246f22faa' quit. Currently 1 clients connected.

The response from the /stats/requests endpoint looks like this for a very long time:

{
  "current_response_time_percentile_50": null, 
  "current_response_time_percentile_95": null, 
  "errors": [], 
  "fail_ratio": 0.0, 
  "slaves": [
    {
      "id": "locust-opa-slave-86cf6df9fb-zckdc_b42adb57f2d34ef885b0bcc228068102", 
      "state": "missing", 
      "user_count": 0
    }, 
    {
      "id": "locust-opa-slave-86cf6df9fb-z2g29_7772633197e6471b9344951aba939c39", 
      "state": "missing", 
      "user_count": 0
    }, 
    {
      "id": "locust-opa-slave-86cf6df9fb-qr4tf_8ff013c045e7445494154d52f3a5dda7", 
      "state": "ready", 
      "user_count": 0
    }
  ], 
  "state": "ready", 
...

I think the master is not removing the missing slaves?
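
A minimal way to keep an eye on this is to poll that endpoint and count the slave states over time (sketch only; the master address below is an assumption, adjust it to your setup):

# Sketch: poll the master's /stats/requests endpoint and summarize slave states.
# The master URL is an assumption; point it at your master's web UI.
import time
from collections import Counter

import requests

MASTER_URL = "http://localhost:8089"

while True:
    stats = requests.get(MASTER_URL + "/stats/requests").json()
    states = Counter(slave["state"] for slave in stats.get("slaves", []))
    print("slave states:", dict(states), "| master state:", stats.get("state"))
    time.sleep(10)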

@heyman
Member

heyman commented Nov 19, 2019

Slaves are not supposed to go missing, and if you get "missing" slaves I think that's a sign something is broken. It could happen because a slave process gets hard-killed. However, it could also happen if a slave node hits the CPU usage roof (and the heartbeat message gets delayed), in which case the state could go back from missing to running again (once the heartbeat comes in and the CPU usage goes down). In that case I think it would be bad if we just removed the missing slave node and assumed everything was fine, since you could have "rogue" slave nodes still sending load towards the target system.
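
Roughly sketched (this is just an illustration of the idea above, not Locust's actual runner code), the master-side bookkeeping works like this: each heartbeat interval without a message decrements a per-slave counter, the slave is flagged "missing" when the counter runs out, and a late heartbeat restores it:

# Illustrative sketch of the heartbeat bookkeeping described above;
# not Locust's actual implementation.
HEARTBEAT_LIVENESS = 3  # missed intervals tolerated before flagging "missing"

class SlaveNode:
    def __init__(self, node_id):
        self.id = node_id
        self.state = "ready"
        self.heartbeat = HEARTBEAT_LIVENESS

def on_heartbeat_interval(slaves):
    # Runs on the master once per heartbeat interval.
    for slave in slaves.values():
        slave.heartbeat -= 1
        if slave.heartbeat <= 0 and slave.state != "missing":
            # The node stays in the dict; only its state changes,
            # which is why it still shows up under "slaves" in /stats/requests.
            slave.state = "missing"

def on_heartbeat_message(slaves, node_id, reported_state):
    # A heartbeat arrived: reset the counter and restore the state if needed
    # (e.g. a node that was only stalled by CPU saturation comes back).
    slave = slaves[node_id]
    slave.heartbeat = HEARTBEAT_LIVENESS
    if slave.state == "missing":
        slave.state = reported_state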

@max-rocket-internet
Contributor Author

OK cool.

If locust receives SIGTERM it should shut down and deregister gracefully, right?

Perhaps related to #1159 ?

@heyman
Member

heyman commented Nov 20, 2019

If locust receives SIGTERM it should shut down and deregister gracefully, right?

Correct!
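
In rough terms (an illustrative sketch, not the actual runner code; the client object and message format are placeholders), the slave installs a SIGTERM handler that tells the master it is quitting before exiting, which is what produces the "Client ... quit. Currently N clients connected." lines on the master:

# Illustrative sketch of graceful shutdown on SIGTERM; not Locust's actual code.
# `client` and the message format are placeholders for the slave's channel to the master.
import signal
import sys

def install_sigterm_handler(client, node_id):
    def handle_sigterm(signum, frame):
        print("Got SIGTERM signal")
        client.send({"type": "quit", "node_id": node_id})  # deregister from the master
        print("Shutting down (exit code 0), bye.")
        sys.exit(0)

    signal.signal(signal.SIGTERM, handle_sigterm)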

Perhaps related to #1159 ?

Ah, yes, that sounds very plausible.

@cyberw
Collaborator

cyberw commented Nov 29, 2019

@max-rocket-internet Is this still an issue for you?

@max-rocket-internet
Contributor Author

Yes, with version 0.13.5 I still see this in the logs:

[2020-01-02 15:19:29,656] locust-large-master-67fbf758f6-ft95m/INFO/locust.runners: Slave locust-large-slave-5766897bc6-n9vsj_67e802aa476b4b68966535c94ee80634 failed to send heartbeat, setting state to missing.
[2020-01-02 15:19:29,656] locust-large-master-67fbf758f6-ft95m/INFO/locust.runners: Slave locust-large-slave-5766897bc6-sxp94_d0672a30b57e487687e76e573d7d8c0b failed to send heartbeat, setting state to missing.
[2020-01-02 15:19:29,656] locust-large-master-67fbf758f6-ft95m/INFO/locust.runners: Slave locust-large-slave-5766897bc6-lzgvc_873f544d8dd4428c94132ae6784ddb2a failed to send heartbeat, setting state to missing.

@cyberw
Collaborator

cyberw commented Jan 2, 2020

Have you checked the logs on the slaves?

Could this be some kind of networking issue?

Or is there anything else you can think of that makes your tests "special"?

@max-rocket-internet
Contributor Author

Hmmm, I see it now and again but can't reproduce it reliably. I tried scaling up and down during, before and after a load test but only see it sometimes.

Could this be some kind of networking issue?

On AWS EC2 I don't think so.

Have you checked the logs on the slaves?

Yes. So for whatever reason, the slave that goes missing is gracefully stopped by k8s but doesn't log:

Got SIGTERM signal
Shutting down (exit code 0), bye.
Cleaning up runner...
Running teardowns...

It just stops.
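
To check whether SIGTERM even reaches the main process in the pod, a trivial stand-in like this can be run as the container command for a test (diagnostic sketch only, nothing Locust-specific):

# Diagnostic sketch (not Locust-specific): run as the container's main process
# to verify that SIGTERM is actually delivered when k8s scales the deployment down.
import signal
import sys
import time

def log_signal(signum, frame):
    print("received signal", signum, flush=True)
    sys.exit(0)

signal.signal(signal.SIGTERM, log_signal)
signal.signal(signal.SIGINT, log_signal)

while True:
    time.sleep(1)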

@max-rocket-internet max-rocket-internet changed the title Slave count stuck at 3 instead of decreasing to 1 Slave count stuck at 3 instead of decreasing to 1 due to "missing" Jan 3, 2020
@max-rocket-internet
Contributor Author

I'll reopen when I can reproduce it.

@lbostic

lbostic commented Aug 11, 2021

Hi,
I increased my pod count to compensate (only a workaround...). I haven't had time yet to see why workers are starting to go "missing"; as soon as I figure something out I'll let you know.
