New connection socket descriptor (1024) is not less than FD_SETSIZE #1441
ok, that's interesting!

No, unfortunately not; the log is full of the message when it exits. 3 Orion-LD pods working in parallel.

ok ... difficult then ...

Kubernetes v1.24.17

A typical entity, I think, is a Scooter like this:

ok, thanks. That I can use.
2 cores and 16 GB per broker should give you about 5000 updates/second, per broker. We've seen, in load tests we've performed in the FIWARE Foundation, that mongo needs about 3-5 times the resources that the broker has. But that's for full speed, meaning around 15,000 updates/second (or 30k queries/sec) with your 3 brokers.
The MongoDB is a replica set with 2 replicas, without limits. We have about 160 connections open on average, rising to 300 connections for a short time when a surge of data arrives.

This sometimes happens within minutes, but can also go well for days.

Is this bug already fixed in 1.5.0?
Sorry, no time to look into this yet. It might be that, for some reason, the broker-to-mongo connection is slow and the requests pile up. So, try ulimit for more FDs; perhaps this solves the problem.
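(For illustration: "more FDs via ulimit" can also be requested from inside a process. A minimal sketch, assuming Linux; this is the programmatic equivalent of `ulimit -n`, not Orion-LD's actual startup code:)

```c
/* A minimal sketch of raising the file-descriptor soft limit to the hard
 * limit at startup -- the programmatic equivalent of `ulimit -n`.
 * Note: this does NOT change FD_SETSIZE (see later in the thread). */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
  struct rlimit rl;

  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("getrlimit"); return 1; }

  rl.rlim_cur = rl.rlim_max;  /* raise the soft limit up to the hard limit */

  if (setrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("setrlimit"); return 1; }

  printf("RLIMIT_NOFILE soft limit now: %llu\n", (unsigned long long) rl.rlim_cur);
  return 0;
}
```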
I think there is a kind of leakage. What we can see: if only roughly 900 FDs are in use (the limit is not reached), Orion does not crash, as expected, but it never frees up the FDs. We can monitor it (one way is sketched below).
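(One way to watch this — an assumption, not necessarily the tool used here — is to list `/proc/<pid>/fd` on Linux; a readlink on each entry also reveals the FD type, e.g. `socket:[12345]`. A minimal C sketch:)

```c
/* A minimal sketch: count a process's open FDs on Linux by listing
 * /proc/<pid>/fd, and print what each descriptor points to (readlink
 * shows e.g. "socket:[12345]"). Pass a PID, or omit it for self. */
#include <stdio.h>
#include <dirent.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
  char dirPath[64];
  snprintf(dirPath, sizeof(dirPath), "/proc/%s/fd", (argc > 1)? argv[1] : "self");

  DIR* dirP = opendir(dirPath);
  if (dirP == NULL) { perror("opendir"); return 1; }

  int            fdCount = 0;
  struct dirent* entryP;

  while ((entryP = readdir(dirP)) != NULL)
  {
    if (entryP->d_name[0] == '.')  /* skip "." and ".." */
      continue;

    char    linkPath[128];
    char    target[256];
    ssize_t len;

    snprintf(linkPath, sizeof(linkPath), "%s/%s", dirPath, entryP->d_name);
    len = readlink(linkPath, target, sizeof(target) - 1);

    if (len >= 0)
    {
      target[len] = 0;
      printf("fd %-4s -> %s\n", entryP->d_name, target);
    }

    ++fdCount;
  }

  closedir(dirP);
  printf("open FDs: %d\n", fdCount);
  return 0;
}
```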
During this crash period our MongoDB replica set has no real trouble with CPU or RAM; there are some slow queries. We can (easily) reproduce it if we burst some data to Orion.

ulimit:

orion setup:
ok, interesting. Any idea what the "type" of the leaked file descriptors is?

ok, a socket ... doesn't really help.
Ok, we need more debugging options. We stripped down Orion to only one subscription. We send nearly no data to Orion. At startup, Orion consumes 20 FDs for a period of time. Suddenly it consumes about 170 FDs and never frees them up.

Yeah, I was thinking a little about all this ... I was going to propose a new test without any subscription at all. So, please remove that last subscription and see if the problem disappears.

ok, this bug has nothing to do with the replica set or authentication of mongodb.

That's good to know. A test doing the exact same thing, but without any notifications (no matching subscriptions), would give important input. Just to rule one thing out.

hmm ... the problem persists without any subscription

ok, valuable info.
So, I found something interesting in the mongoc release notes for 1.24.0 (Orion-LD currently uses 1.22.0 of libmongoc):

So, one thing to test is to bump the mongoc version up to 1.24.0 (I had problems compiling 1.25.x, so that comes later). Another thing we could try is to disable Prometheus metrics, which is on by default. This is all a bit of blindly trying things, as I have no clue who leaves those file descriptors open. Two PRs coming:
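(As an aside, since suspicion falls on the driver: with libmongoc, every pooled client taken with `mongoc_client_pool_pop()` must be returned with `mongoc_client_pool_push()`; a missing push leaks the connection — and its socket FD — on every request. A minimal usage sketch, purely illustrative and not Orion-LD's actual code; the URI is an assumption:)

```c
/* Minimal libmongoc client-pool sketch. Every mongoc_client_pool_pop()
 * must be paired with a mongoc_client_pool_push(); a missing push leaks
 * the connection -- and its socket FD -- on every request. */
#include <stdio.h>
#include <mongoc/mongoc.h>

int main(void)
{
  mongoc_init();

  mongoc_uri_t*         uri  = mongoc_uri_new("mongodb://localhost:27017");  /* assumed URI */
  mongoc_client_pool_t* pool = mongoc_client_pool_new(uri);
  mongoc_client_t*      client = mongoc_client_pool_pop(pool);

  bson_t*      ping = BCON_NEW("ping", BCON_INT32(1));
  bson_t       reply;
  bson_error_t error;

  if (!mongoc_client_command_simple(client, "admin", ping, NULL, &reply, &error))
    fprintf(stderr, "ping failed: %s\n", error.message);

  bson_destroy(&reply);
  bson_destroy(ping);

  mongoc_client_pool_push(pool, client);  /* without this, the FD leaks */

  mongoc_client_pool_destroy(pool);
  mongoc_uri_destroy(uri);
  mongoc_cleanup();
  return 0;
}
```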
We just tested version 1577. The -noprom feature did not fix the problem.

ok, good to know.

The FD_SETSIZE error also persists with 1581.

ok, I'd add a bit more, 10x for example. Now, an update: once we have Orion-LD linked with mongoc 1.24.2 (which supports MongoDB 7.0), I'll let you know and we'll try again.

ok, this error still persists with 1.6.0-PRE-1587, with mongo 6 and mongo 7.

I did a quick search on the error (should have done that ages ago ...).

Ok, I checked some things on mongo.

ok, I guess that if the mongo server is OK, the mongo driver must be as well.

👽 Back to the roots.

Yeah ... we'll get there, I'm sure.
Small update on this: Orion-LD (1.6.0-PRE-1608) rarely crashes without QuantumLeap. If we enable QuantumLeap, we get tons of crashes. It could have something to do with subscriptions.

What else we notice: for example, one unique entity arriving several times with different (old) data:

By chance, a new find. Right before the FD_SETSIZE error we saw: Might that help?

It just might.
So, I created a functest with 1000 notifications to a "bad notification client" - a notification client that accepts the connection and reads the incoming notification BUT doesn't respond. It leaves the poor broker waiting for that response until it finally times out. Unfortunately, I believe I asked you guys to start the broker without any subscriptions at all and you still had your problem, so this fix (coming in a few hours) probably doesn't change anything for you. I'm still not convinced it's not just "normal execution": file descriptors are needed, and the more load the broker receives, the more simultaneously open FDs there will be. Anyway, a new version is on its way, even though I doubt it will fix your problem.
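(For illustration, a minimal sketch of such a misbehaving client — not the actual functest: it accepts connections and reads the incoming notification, but never answers, leaving the broker waiting until its timeout fires. The port is an arbitrary choice, and it handles one connection at a time for brevity:)

```c
/* A minimal "bad notification client" sketch: accept, read, never respond.
 * The peer (the broker) is left waiting until its own timeout closes the
 * connection. Single-threaded for brevity. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
  int listenFd = socket(AF_INET, SOCK_STREAM, 0);
  if (listenFd < 0) { perror("socket"); return 1; }

  int on = 1;
  setsockopt(listenFd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

  struct sockaddr_in addr;
  memset(&addr, 0, sizeof(addr));
  addr.sin_family      = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port        = htons(9997);  /* arbitrary port */

  if (bind(listenFd, (struct sockaddr*) &addr, sizeof(addr)) < 0) { perror("bind");   return 1; }
  if (listen(listenFd, 128) < 0)                                  { perror("listen"); return 1; }

  for (;;)
  {
    int connFd = accept(listenFd, NULL, NULL);
    if (connFd < 0)
      continue;

    /* Read (and discard) the notification, but never send a response;
     * read() returns 0 once the broker gives up and closes. */
    char buf[4096];
    while (read(connFd, buf, sizeof(buf)) > 0)
      ;

    close(connFd);
  }
}
```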
The PR has been merged, in case you want to test. Dockerfiles should be ready shortly.
Thanks, we will test it. About the fd-max: as far as I can see, we have no limit of 1024 file descriptors. Or how do you increase FD_SETSIZE outside the source?
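(For reference: FD_SETSIZE is a compile-time constant of the C library, while `ulimit -n` (RLIMIT_NOFILE) is a runtime kernel limit; raising the ulimit lets the process open more FDs, but select() still cannot watch descriptors >= FD_SETSIZE. A small diagnostic sketch, assuming Linux/glibc:)

```c
/* A small diagnostic sketch: FD_SETSIZE vs. RLIMIT_NOFILE. With glibc,
 * FD_SETSIZE is fixed at 1024 at compile time and redefining it before
 * the includes has no effect; escaping the limit generally means moving
 * the event loop from select() to poll()/epoll. */
#include <stdio.h>
#include <sys/select.h>
#include <sys/resource.h>

int main(void)
{
  struct rlimit rl;

  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) { perror("getrlimit"); return 1; }

  printf("FD_SETSIZE (compile-time):    %d\n",   FD_SETSIZE);
  printf("RLIMIT_NOFILE soft (runtime): %llu\n", (unsigned long long) rl.rlim_cur);
  printf("RLIMIT_NOFILE hard (runtime): %llu\n", (unsigned long long) rl.rlim_max);

  return 0;
}
```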
What we did over the last few weeks to reduce the number of "crashes":

After all these actions we were able to reduce our 8 instances of Orion-LD to only 2, possibly one. But we will investigate further. idPattern could be a big problem.
Sometimes Orion-LD stops working with the following message:

New connection socket descriptor (1024) is not less than FD_SETSIZE (1024).

Since the K8s pod doesn't crash and Orion-LD just stops working, we only notice this when we look at the logs or a customer complains about missing data.

Are the connections perhaps not closed fast enough?

Tested in v1.2.1, 1.3.0 and 1.4.0.
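(For context on where such a message typically originates: a select()-based server cannot track a descriptor >= FD_SETSIZE — FD_SET() on it is undefined behavior — so its only option is to refuse the connection and log an error. A hedged sketch of that guard, illustrative and not Orion-LD's actual code:)

```c
/* Illustrative only: the kind of guard a select()-based accept loop needs.
 * Any accepted socket whose descriptor number is >= FD_SETSIZE cannot be
 * added to an fd_set, so it must be rejected -- which is why leaked FDs
 * eventually make the broker refuse all new connections. */
#include <stdio.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/socket.h>

int acceptConnection(int listenFd, fd_set* activeFds)
{
  int connFd = accept(listenFd, NULL, NULL);

  if (connFd < 0)
    return -1;

  if (connFd >= FD_SETSIZE)
  {
    /* Cannot be tracked by select() - reject, as the broker's log shows */
    fprintf(stderr, "New connection socket descriptor (%d) is not less than FD_SETSIZE (%d)\n",
            connFd, FD_SETSIZE);
    close(connFd);
    return -1;
  }

  FD_SET(connFd, activeFds);
  return connFd;
}
```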