
OOM error with master/slaves setup (zeromq, windows) #1372

Closed
mparent opened this issue May 13, 2020 · 30 comments

mparent commented May 13, 2020

Hi !

Describe the bug

An out-of-memory error occurs when ZeroMQ tries to allocate an absurd amount of memory in decoder_allocators, sometimes up to several petabytes. This might very well be a ZeroMQ bug:
OUT OF MEMORY (bundled\zeromq\src\decoder_allocators.cpp:89)

I added some logging and recompiled pyzmq to check what's going on. Upon further investigation, _max_counters takes an absurdly large value at some point. See zmq_logs.txt.
As you can see, allocator instance 0x0000016A9270F700 is constructed with _max_counters=249, but before the crash its value has changed to 1557249601288, which causes a malloc of several terabytes.

Steps to reproduce

Sorry, I couldn't find a surefire way to reproduce this one. It seems kind of random: it sometimes happens before the test has even started, sometimes when the test is stopped, and sometimes it doesn't happen at all. It does seem to happen more often when stopping a test in the web UI. Simply run the attached ps1 script and do some things in the web UI.

Environment

  • OS: Windows 10.0.18362.778
  • Python version: 3.6
  • Locust version: 0.14.6
  • Locust files: test_locust.zip

I managed to reproduce the bug on two computers: my work computer and my personal computer. Both run Windows 10 and the Python 3.6 that comes with VS2017, but my personal computer has a pristine Python environment; I just ran pip install locustio.

Am I doing something I'm not supposed to?

@mparent mparent added the bug label May 13, 2020
heyman (Member) commented May 13, 2020

Interesting. You're not doing anything wrong AFAICT. I suspect this is a Windows-related issue. Is it possible for you to test whether you can reproduce this on a Linux or Mac machine?

mparent (Author) commented May 13, 2020

Sure thing! I'll try on WSL, if that works for you. Worst case, I can set up a Linux VM.

mparent (Author) commented May 13, 2020

Ok, I tried several times on Ubuntu 18.04 LTS on WSL, running it the exact same way with PowerShell Core. I couldn't reproduce the issue.
I can't say for sure that it won't happen on Linux, considering how inconsistent this bug is, but at the very least I can safely say it is much less likely to happen. Considering it a Windows-related issue does seem plausible.

heyman (Member) commented May 13, 2020

Ok, good to know! It might be a while before I get the chance to try to reproduce this on a Windows machine. Please keep us updated if you test anything else (e.g. another version of ZeroMQ or Python).

mparent (Author) commented May 13, 2020

Will do!

anuj-ssharma (Contributor) commented

Tested this on my Windows machine and I can reproduce it (on Locust v1.0.0 and Python 3.8). I used a different locustfile from the one attached above.

However, I couldn't really figure out a pattern to the failures. A few observations:

  1. It always failed for me after all the clients were connected and ready.
  2. Chances of failures increased with the increase in the number of workers.
  3. Memory consumption of the locust workers seemed to be normal.

Machine Specs:
Windows 10 Version 1903
Intel(R) Core i7-7700HQ CPU @ 2.80 GHz
16GB RAM

cyberw (Collaborator) commented May 30, 2020

Maybe we should bump the minimum required pyzmq version? Other than that, I don't think we can do much without a clear repro case. @anuj-ssharma @mparent can you check your pyzmq versions?
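For anyone checking: on Python 3.8+ the standard library can report the installed pyzmq version without even importing zmq. A minimal sketch (the helper name is illustrative, not locust code):

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package: str = "pyzmq") -> str:
    """Return the installed version of a package, or a marker if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not installed"

print(installed_version())
```

`pip show pyzmq` gives the same answer from the command line.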

mparent (Author) commented Jun 1, 2020

@cyberw 19.0.1 for me.

cyberw (Collaborator) commented Jun 1, 2020

Ok, that is the latest, so that shouldn't be the issue...

cyberw (Collaborator) commented Jun 1, 2020

I don't have any real ideas on how to solve this, and I hardly use Windows at all these days.

If any of you have the time to do some more digging and find a fix, it would be much appreciated (unfortunately it is unlikely anyone else will fix it for you :-/ )

mparent (Author) commented Jun 1, 2020

It's fine, our actual locusts are run on Linux anyway. I'll simply keep working on WSL locally and try to find a fix if I have some time.

bebeo92 commented Nov 13, 2020

@cyberw I also faced this issue when running on Windows.

cyberw (Collaborator) commented Nov 13, 2020

@bebeo92 Have you tried updating to the latest pyzmq? Can you find any pattern to when it works and when it doesn't?

Without more details there is nothing we can do, sorry... (and with so few of our users running on Windows, I don't think this issue will get a lot of attention)

Perhaps file an issue with pyzmq?

bebeo92 commented Nov 13, 2020

@cyberw I think it happens when I click the Stop button. Is that expected behaviour?
[image attachment]

cyberw (Collaborator) commented Nov 13, 2020

It should work. Sorry, I don't think I can help you...

bebeo92 commented Nov 13, 2020

@cyberw I think it is still a valid bug. Can you contact someone else to verify it?

cyberw (Collaborator) commented Nov 13, 2020

I agree, but there is really nobody to contact; this is a project maintained by volunteers.

cyberw (Collaborator) commented Nov 13, 2020

Like I said, you may have more luck talking to the maintainers of pyzmq itself.

@cyberw cyberw changed the title OOM error with master/slaves setup OOM error with master/slaves setup (zmq, windows) Nov 13, 2020
@cyberw cyberw changed the title OOM error with master/slaves setup (zmq, windows) OOM error with master/slaves setup (zeromq, windows) Nov 13, 2020
RichardLions commented

@cyberw this is happening on the project I am working on. When running locust on a Windows machine in headless mode with several workers (all on the same machine), there is a high chance the master will assert. The chance increases with the number of workers spawned. Note: the assert only starts triggering with 3 or more workers.

The assert always triggers after the master has sent a message to all workers. Either at the start when sending the spawn message or at the end telling them to quit.

The assert: warning : FATAL ERROR: OUT OF MEMORY (C:\projects\libzmq\src\decoder_allocators.cpp:85)

https://pyzmq.readthedocs.io/en/latest/morethanbindings.html#thread-safety

The pyzmq docs mention that C-level crashes can occur when calling into the same socket from multiple threads. Looking at the locust setup, it appears to use greenlets: the same socket could be called into multiple times, but on the same thread. I am not experienced with Python (I only started using it to set up locust), so I am unsure whether this could be causing the issue.

versions:
python 3.8.10
locust 1.6.0
pyzmq 22.1.0

Do you have any advice on tracking down what could be triggering this issue?
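One way to act on the thread-safety question above: wrap the socket in a guard that raises if any call arrives from an OS thread other than the one that created it. Greenlets all run on one OS thread, so pure-gevent use would pass this check. A stdlib-only sketch; `SingleThreadGuard` is a hypothetical debugging helper, not locust or pyzmq code:

```python
import threading

class SingleThreadGuard:
    """Wrap an object (e.g. a zmq socket) and assert that every attribute
    access happens on the thread that created the wrapper."""
    def __init__(self, wrapped):
        self._wrapped = wrapped
        self._owner = threading.get_ident()

    def __getattr__(self, name):
        # Called for any attribute not on the wrapper itself (send, recv, ...)
        if threading.get_ident() != self._owner:
            raise AssertionError(f"{name!r} accessed from a foreign thread")
        return getattr(self._wrapped, name)
```

If the guard never fires while the crash still reproduces, cross-thread socket access can be ruled out as the cause.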

cyberw (Collaborator) commented Jun 30, 2021

Hi! Sorry, I have nothing to add here. You probably already know a lot more than me :)

Matthew--Townsend commented

I also have this issue.

FATAL ERROR: OUT OF MEMORY (C:\projects\libzmq\src\decoder_allocators.cpp:85)

I get it the majority of the time when just creating my master and worker nodes; I'd say 3 out of 4 attempts fail. If that part passes, it sometimes fails with the same error after I start my load test.

versions:
python 3.9.5
locust 1.5.3
pyzmq 22.1.0
OS: Windows 10, Windows Server 2016
Free Memory at time of failure: 7.3 GB of 16 GB on my Windows 10 box

I will check out pyzmq, but I wanted to post here for the sake of visibility (i.e., it isn't just a few people getting this error). When I talked with the guy who recommended Locust, he said, "Oh yeah, it does that all the time. I just keep trying until it works." Personally, I'd rather fix it, so I'll see if the folks at pyzmq have this on their radar already.

Thanks.

cyberw (Collaborator) commented Jul 2, 2021

Is there a ticket on pyzmq? If so then maybe link it here.

Matthew--Townsend commented

It is failing in libzmq when trying to allocate the memory it needs. I have seen this before when there is available system memory but it is fragmented (thus not enough available in one contiguous block for the requested size).

There is an issue open on pyzmq currently (zeromq/pyzmq#1555), but it was also opened by @RichardLions and has no replies from anyone else who may have seen this.

On my end I'll need to investigate with a memory profiler to see what is filling up (or fragmenting) the available memory. I'll see how much time my project owner will let me spend debugging this and post back here if I find anything. It could be as simple as us having somehow created a small memory leak in our Python code. I usually write in C#, so I am not sure if that is a common occurrence in Python, but it seems like a possibility.
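For the Python-side profiling mentioned above, the standard library's tracemalloc module is often enough to see whether allocations grow or stay flat. A minimal sketch, with a stand-in workload in place of a real locust run:

```python
import tracemalloc

tracemalloc.start()

# Stand-in for the real workload (e.g. a locust master/worker run)
data = [bytes(1024) for _ in range(1000)]

# Report the top allocation sites seen so far, grouped by source line
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)

current, peak = tracemalloc.get_traced_memory()
print(f"current={current} bytes, peak={peak} bytes")
```

A steadily growing `current` between snapshots would point at a leak; a flat profile (as reported above) suggests the huge allocation originates inside libzmq rather than in Python objects.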

cyberw (Collaborator) commented Jul 4, 2021

Memory leaks are not really a common occurrence, no, and since other people have encountered this issue it is pretty likely there is a real bug here. Good luck, and let us know if you find the issue or a workaround!

cyberw (Collaborator) commented Jul 4, 2021

One possibility is of course that locust is (for some reason) attempting to send a very (very) big message, and that this exceeds some limit on Windows.
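If that hypothesis were worth testing, a size guard in front of the send call would catch it. A sketch with an arbitrary 10 MiB limit; `check_size` and the limit are illustrative, not locust code:

```python
MAX_MSG_BYTES = 10 * 1024 * 1024  # arbitrary sanity limit for this experiment

def check_size(payload: bytes, limit: int = MAX_MSG_BYTES) -> bytes:
    """Raise before handing an unexpectedly huge message to the transport,
    so the failure points at the sender instead of at a libzmq allocation."""
    if len(payload) > limit:
        raise ValueError(f"refusing to send {len(payload)}-byte message")
    return payload
```

One could wrap each outgoing payload with this (e.g. `socket.send(check_size(msg))`); if the guard ever fires, the oversized message, not libzmq, is the culprit.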

Matthew--Townsend commented

@RichardLions Good job updating the other bug (zeromq/pyzmq#1555) and finding a possible cause. Silly me, I thought the error (out of memory) could be something to do with running out of memory. :) I did try running some Python memory profilers, but all I saw was a very flat memory allocation over time and nothing alarming.

Matthew--Townsend commented Jul 15, 2021

@RichardLions has a pull request that fixes this in the pyzmq project. I implemented it manually and tested it, and it works. See zeromq/pyzmq#1555

Pull Request: zeromq/pyzmq#1560

cyberw (Collaborator) commented Jul 15, 2021

Let's close this when there is a new release of pyzmq including the fix and we have bumped the dependency in locust.

RichardLions commented

pyzmq 22.2.1 has been released containing the fix for this issue.

https://github.com/zeromq/pyzmq/releases/tag/v22.2.1
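A small check one could drop into a deployment script to confirm the installed pyzmq is at least the release carrying the fix. A sketch using a simple numeric comparison of the first three version components (pre-release suffixes would need extra handling; `has_zmq_fix` is a hypothetical helper):

```python
def has_zmq_fix(ver: str, minimum=(22, 2, 1)) -> bool:
    """True if a pyzmq version string is at least the release with the fix."""
    parts = tuple(int(p) for p in ver.split(".")[:3])
    return parts >= minimum
```

For example, `has_zmq_fix(zmq.__version__)` would gate a master/worker run on the fixed release being present.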

@cyberw cyberw closed this as completed in d8e2f5d Aug 5, 2021
cyberw added a commit that referenced this issue Aug 5, 2021
…x-windows-OOM-issue

Bump dependency on pyzmq to fix #1372 (OOM on windows)
cyberw (Collaborator) commented Aug 5, 2021

Thanks @RichardLions!
