[Bug]: OpenAI server errors out with "ZMQError Too many open files" under heavy load #7920
Comments
Right, so it's a different issue then, bruv. "Too many open files", eh? That's a classic. Sounds like your system's getting overwhelmed with all those requests. First things first, check your system's open file limit. It's probably set too low. You can bump it up with `ulimit -n`. Give it a generous number, like 65536 or even higher. Next, have a look at how you're handling those asyncio tasks. Are you creating too many at once? Try limiting the number of concurrent requests with `asyncio.Semaphore`. If that doesn't do the trick, you might need to get a bit more creative. Consider using a connection pool to manage your requests. That way, you can reuse connections instead of opening new ones all the time. And if you're still hitting that limit, it might be time to look at your system architecture. Are you running vllm on a machine with enough resources? Maybe it's time for an upgrade, mate. Don't let this get you down, bruv. We'll get those files under control.
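For anyone wanting to try the two suggestions above from the client side, here is a minimal sketch, not taken from the reporter's script. `raise_nofile_soft_limit`, `run_bounded`, and `fake_request` are hypothetical names; `fake_request` only stands in for a real HTTP call. The `resource` call is the in-process counterpart of `ulimit -n` and affects only the process that calls it (and its children), not an already running server.

```python
import asyncio
import resource


def raise_nofile_soft_limit(target: int = 65536) -> int:
    """Try to raise this process's soft RLIMIT_NOFILE toward `target`.

    Returns the soft limit in effect afterwards. The soft limit can never
    exceed the hard limit, so `target` is capped accordingly.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough, nothing to do
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    except (ValueError, OSError):
        return soft  # the OS refused; keep the existing limit
    return target


async def run_bounded(coro_factories, max_concurrency: int = 256):
    """Run all coroutines, with at most `max_concurrency` in flight at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(factory):
        async with sem:
            return await factory()

    # gather preserves the order of the inputs in its result list
    return await asyncio.gather(*(guarded(f) for f in coro_factories))


async def fake_request(i: int, state: dict) -> int:
    """Placeholder for a real request; records peak concurrency in `state`."""
    state["in_flight"] += 1
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0.001)
    state["in_flight"] -= 1
    return i


if __name__ == "__main__":
    print("soft nofile limit:", raise_nofile_soft_limit())
    state = {"in_flight": 0, "peak": 0}
    factories = [lambda i=i: fake_request(i, state) for i in range(1000)]
    asyncio.run(run_bounded(factories, max_concurrency=64))
    print("peak concurrent requests:", state["peak"])
```

Note that this only throttles the client. If the descriptors are being exhausted inside the server process, the limit that matters is the one the server was started with.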
Can you share the value of
It's the Ubuntu default 1024.
Does this mean zmq still opens one file (fortunately it's not a socket) for every connection?
@youkaichao I need to dig in more. There is only 1 ipc socket created. The rest are inproc, so I don't get what's going on. I'm going to get something up on an Ubuntu server to try to repro. In my testing, I manually verified that socket usage was not growing by monitoring system stats, so I'm not sure what is going on here.
zmq.error.ZMQError: Too many open files
+1
We have a new design that should resolve this issue:
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
@robertgshaw2-neuralmagic, @njhill
I am running vllm @ 6653040 which includes #7394.
Reproducer:
Error message:
This arguably is not normal online serving traffic. With that said, if `--disable-frontend-multiprocessing` is on, the server can handle `N=8192` with no issue.

`strace` shows lots of `eventfd`, which might be related to https://www.mail-archive.com/zeromq-dev@lists.zeromq.org/msg31244.html
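One way to cross-check the `strace` observation is to look at what a process's open descriptors actually point to. The sketch below is an assumption-laden helper, not part of vllm: it is Linux-only, reads `/proc/<pid>/fd`, and `classify_fd_target` and `count_open_fds` are hypothetical names. If the `eventfd` count climbs with load while `socket` stays flat, that would be consistent with what is described above.

```python
import os
from collections import Counter


def classify_fd_target(target: str) -> str:
    """Map a /proc/<pid>/fd symlink target to a coarse category."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        # e.g. "anon_inode:[eventfd]" or "anon_inode:[eventpoll]"
        return target[len("anon_inode:"):].strip("[]")
    return "file"


def count_open_fds(pid="self") -> Counter:
    """Count open descriptors of `pid` by category (Linux /proc only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were iterating
        counts[classify_fd_target(target)] += 1
    return counts


if __name__ == "__main__":
    # Pass the server's PID instead of "self" to inspect a running server;
    # that requires permission to read its /proc entry.
    for kind, n in count_open_fds().most_common():
        print(f"{kind:12s} {n}")
```

Running it a few times while the load test ramps up gives a rough picture of which descriptor type is growing toward the `ulimit -n` ceiling.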