
Slave count stuck at 3 instead of decreasing to 1 due to "missing" #1158

Closed
max-rocket-internet opened this issue Nov 18, 2019 · 9 comments

@max-rocket-internet
Contributor

Master logs after scaling the slaves down:

[2019-11-18 16:57:02,829] locust-opa-master-787476b668-74txj/INFO/locust.main: Starting web monitor at *:8089
[2019-11-18 16:57:02,830] locust-opa-master-787476b668-74txj/INFO/locust.main: Starting Locust 0.13.1
[2019-11-18 16:57:24,100] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-cjkwb_594d3fd33f864c34aaabbb987cd79e65' reported as ready. Currently 1 clients ready to swarm.
[2019-11-18 16:57:24,210] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-pljft_240da2aa2ada4a0993fad606fb5976e0' reported as ready. Currently 2 clients ready to swarm.
[2019-11-18 16:57:24,221] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-zckdc_b42adb57f2d34ef885b0bcc228068102' reported as ready. Currently 3 clients ready to swarm.
[2019-11-18 16:57:24,253] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-z2g29_7772633197e6471b9344951aba939c39' reported as ready. Currently 4 clients ready to swarm.
[2019-11-18 16:57:24,356] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-d95g2_a3a40cf055d749ea8b6c0e3246f22faa' reported as ready. Currently 5 clients ready to swarm.
[2019-11-18 16:57:24,381] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-mm7dt_ee79eead69c84802a096a12f6e6ef1e5' reported as ready. Currently 6 clients ready to swarm.
[2019-11-18 16:57:24,388] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-b7c2k_a961e9574df048739d1057d70dff35bd' reported as ready. Currently 7 clients ready to swarm.
[2019-11-18 16:57:24,612] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-kfjh2_c080829438e64f76a02090adaad1a0dc' reported as ready. Currently 8 clients ready to swarm.
[2019-11-18 16:57:24,626] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-d2kg5_d5ee1c3c3dcb4762b87872edcdcb0e28' reported as ready. Currently 9 clients ready to swarm.
[2019-11-18 16:57:24,636] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-hwh4v_77d2b07171a549fd9af73a8a86222c81' reported as ready. Currently 10 clients ready to swarm.
[2019-11-18 16:57:24,690] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-vpfnf_5217f03d668f4bf2b3ae7580e8e701eb' reported as ready. Currently 11 clients ready to swarm.
[2019-11-18 16:57:24,764] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-sl24v_0da998e92bcd46b1ba7161d260bc3edf' reported as ready. Currently 12 clients ready to swarm.
[2019-11-18 16:57:24,786] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-nw5ln_97ac9424368546b1b947441ce2d11f3a' reported as ready. Currently 13 clients ready to swarm.
[2019-11-18 16:57:24,804] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-w6lhx_fa3a32665a814af8af8fca9657814d1f' reported as ready. Currently 14 clients ready to swarm.
[2019-11-18 16:57:24,932] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-qr4tf_8ff013c045e7445494154d52f3a5dda7' reported as ready. Currently 15 clients ready to swarm.
[2019-11-18 16:57:25,074] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-jb2rr_8e423e848d9b442e9cfc02d6c1cd6aca' reported as ready. Currently 16 clients ready to swarm.
[2019-11-18 16:57:34,843] locust-opa-master-787476b668-74txj/INFO/locust.runners: Slave locust-opa-slave-86cf6df9fb-zckdc_b42adb57f2d34ef885b0bcc228068102 failed to send heartbeat, setting state to missing.
[2019-11-18 16:57:34,843] locust-opa-master-787476b668-74txj/INFO/locust.runners: Slave locust-opa-slave-86cf6df9fb-z2g29_7772633197e6471b9344951aba939c39 failed to send heartbeat, setting state to missing.
[2019-11-18 17:02:30,550] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-b7c2k_a961e9574df048739d1057d70dff35bd' quit. Currently 13 clients connected.
[2019-11-18 17:02:30,557] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-vpfnf_5217f03d668f4bf2b3ae7580e8e701eb' quit. Currently 12 clients connected.
[2019-11-18 17:02:30,559] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-sl24v_0da998e92bcd46b1ba7161d260bc3edf' quit. Currently 11 clients connected.
[2019-11-18 17:02:30,560] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-jb2rr_8e423e848d9b442e9cfc02d6c1cd6aca' quit. Currently 10 clients connected.
[2019-11-18 17:02:30,563] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-kfjh2_c080829438e64f76a02090adaad1a0dc' quit. Currently 9 clients connected.
[2019-11-18 17:02:30,569] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-hwh4v_77d2b07171a549fd9af73a8a86222c81' quit. Currently 8 clients connected.
[2019-11-18 17:02:30,571] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-w6lhx_fa3a32665a814af8af8fca9657814d1f' quit. Currently 7 clients connected.
[2019-11-18 17:02:30,580] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-nw5ln_97ac9424368546b1b947441ce2d11f3a' quit. Currently 6 clients connected.
[2019-11-18 17:02:30,581] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-pljft_240da2aa2ada4a0993fad606fb5976e0' quit. Currently 5 clients connected.
[2019-11-18 17:02:30,588] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-mm7dt_ee79eead69c84802a096a12f6e6ef1e5' quit. Currently 4 clients connected.
[2019-11-18 17:02:30,599] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-cjkwb_594d3fd33f864c34aaabbb987cd79e65' quit. Currently 3 clients connected.
[2019-11-18 17:02:30,599] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-d2kg5_d5ee1c3c3dcb4762b87872edcdcb0e28' quit. Currently 2 clients connected.
[2019-11-18 17:02:30,604] locust-opa-master-787476b668-74txj/INFO/locust.runners: Client 'locust-opa-slave-86cf6df9fb-d95g2_a3a40cf055d749ea8b6c0e3246f22faa' quit. Currently 1 clients connected.

The response from the /stats/requests endpoint looks like this for a very long time:

{
  "current_response_time_percentile_50": null, 
  "current_response_time_percentile_95": null, 
  "errors": [], 
  "fail_ratio": 0.0, 
  "slaves": [
    {
      "id": "locust-opa-slave-86cf6df9fb-zckdc_b42adb57f2d34ef885b0bcc228068102", 
      "state": "missing", 
      "user_count": 0
    }, 
    {
      "id": "locust-opa-slave-86cf6df9fb-z2g29_7772633197e6471b9344951aba939c39", 
      "state": "missing", 
      "user_count": 0
    }, 
    {
      "id": "locust-opa-slave-86cf6df9fb-qr4tf_8ff013c045e7445494154d52f3a5dda7", 
      "state": "ready", 
      "user_count": 0
    }
  ], 
  "state": "ready", 
...

I think the master is not removing the missing slaves?
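
A minimal way to keep an eye on this is to poll that endpoint and count the slave states over time (sketch only; the master address below is an assumption, adjust it to your setup):

# Sketch: poll the master's /stats/requests endpoint and summarize slave states.
# The master URL is an assumption; point it at your master's web UI.
import time
from collections import Counter

import requests

MASTER_URL = "http://localhost:8089"

while True:
    stats = requests.get(MASTER_URL + "/stats/requests").json()
    states = Counter(slave["state"] for slave in stats.get("slaves", []))
    print("slave states:", dict(states), "| master state:", stats.get("state"))
    time.sleep(10)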

@heyman
Member

heyman commented Nov 19, 2019

Slaves are not supposed to go missing, and if you get "missing" slaves I think that's a sign something is broken. It could happen because a slave process gets hard-killed. However, it could also happen if a slave node hits the CPU usage roof (and the heartbeat message gets delayed), in which case the state could go back from missing to running again (once the heartbeat comes in and the CPU usage goes down). In that case I think it would be bad if we just removed the missing slave node and assumed everything was fine, since you could have "rogue" slave nodes still sending load towards the target system.
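
Roughly sketched (this is just an illustration of the idea above, not Locust's actual runner code), the master-side bookkeeping works like this: each heartbeat interval without a message decrements a per-slave counter, the slave is flagged "missing" when the counter runs out, and a late heartbeat restores it:

# Illustrative sketch of the heartbeat bookkeeping described above;
# not Locust's actual implementation.
HEARTBEAT_LIVENESS = 3  # missed intervals tolerated before flagging "missing"

class SlaveNode:
    def __init__(self, node_id):
        self.id = node_id
        self.state = "ready"
        self.heartbeat = HEARTBEAT_LIVENESS

def on_heartbeat_interval(slaves):
    # Runs on the master once per heartbeat interval.
    for slave in slaves.values():
        slave.heartbeat -= 1
        if slave.heartbeat <= 0 and slave.state != "missing":
            # The node stays in the dict; only its state changes,
            # which is why it still shows up under "slaves" in /stats/requests.
            slave.state = "missing"

def on_heartbeat_message(slaves, node_id, reported_state):
    # A heartbeat arrived: reset the counter and restore the state if needed
    # (e.g. a node that was only stalled by CPU saturation comes back).
    slave = slaves[node_id]
    slave.heartbeat = HEARTBEAT_LIVENESS
    if slave.state == "missing":
        slave.state = reported_state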

@max-rocket-internet
Contributor Author

OK cool.

If locust receives SIGTERM it should shut down and deregister gracefully, right?

Perhaps related to #1159 ?

@heyman
Member

heyman commented Nov 20, 2019

If locust receives SIGTERM it should shut down and deregister gracefully, right?

Correct!
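
In rough terms (an illustrative sketch, not the actual runner code; the client object and message format are placeholders), the slave installs a SIGTERM handler that tells the master it is quitting before exiting, which is what produces the "Client ... quit. Currently N clients connected." lines on the master:

# Illustrative sketch of graceful shutdown on SIGTERM; not Locust's actual code.
# `client` and the message format are placeholders for the slave's channel to the master.
import signal
import sys

def install_sigterm_handler(client, node_id):
    def handle_sigterm(signum, frame):
        print("Got SIGTERM signal")
        client.send({"type": "quit", "node_id": node_id})  # deregister from the master
        print("Shutting down (exit code 0), bye.")
        sys.exit(0)

    signal.signal(signal.SIGTERM, handle_sigterm)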

Perhaps related to #1159 ?

Ah, yes, that sounds very plausible.

@cyberw
Collaborator

cyberw commented Nov 29, 2019

@max-rocket-internet Is this still an issue for you?

@max-rocket-internet
Contributor Author

Yes, with version 0.13.5 I still see this in the logs:

[2020-01-02 15:19:29,656] locust-large-master-67fbf758f6-ft95m/INFO/locust.runners: Slave locust-large-slave-5766897bc6-n9vsj_67e802aa476b4b68966535c94ee80634 failed to send heartbeat, setting state to missing.
[2020-01-02 15:19:29,656] locust-large-master-67fbf758f6-ft95m/INFO/locust.runners: Slave locust-large-slave-5766897bc6-sxp94_d0672a30b57e487687e76e573d7d8c0b failed to send heartbeat, setting state to missing.
[2020-01-02 15:19:29,656] locust-large-master-67fbf758f6-ft95m/INFO/locust.runners: Slave locust-large-slave-5766897bc6-lzgvc_873f544d8dd4428c94132ae6784ddb2a failed to send heartbeat, setting state to missing.

@cyberw
Collaborator

cyberw commented Jan 2, 2020

Have you checked the logs on the slaves?

Could this be some kind of networking issue?

Or is there anything else you can think of that makes your tests "special"?

@max-rocket-internet
Contributor Author

Hmmm, I see it now and again but can't reproduce it reliably. I tried scaling up and down during, before and after a load test but only see it sometimes.

Could this be some kind of networking issue?

On AWS EC2 I don't think so.

Have you checked the logs on the slaves?

Yes. So for whatever reason, the slave that goes missing is gracefully stopped by k8s but doesn't log:

Got SIGTERM signal
Shutting down (exit code 0), bye.
Cleaning up runner...
Running teardowns...

It just stops.
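
To check whether SIGTERM even reaches the main process in the pod, a trivial stand-in like this can be run as the container command for a test (diagnostic sketch only, nothing Locust-specific):

# Diagnostic sketch (not Locust-specific): run as the container's main process
# to verify that SIGTERM is actually delivered when k8s scales the deployment down.
import signal
import sys
import time

def log_signal(signum, frame):
    print("received signal", signum, flush=True)
    sys.exit(0)

signal.signal(signal.SIGTERM, log_signal)
signal.signal(signal.SIGINT, log_signal)

while True:
    time.sleep(1)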

@max-rocket-internet max-rocket-internet changed the title Slave count stuck at 3 instead of decreasing to 1 Slave count stuck at 3 instead of decreasing to 1 due to "missing" Jan 3, 2020
@max-rocket-internet
Contributor Author

I'll reopen when I can reproduce it.

@lbostic

lbostic commented Aug 11, 2021

Hi,
I increased my pod count to compensate (only a workaround...). I haven't had time yet to see why workers are starting to go "missing"; as soon as I figure something out I'll let you know.
