
New connection socket descriptor (1024) is not less than FD_SETSIZE #1441

Open · DasNordlicht opened this issue Sep 11, 2023 · 40 comments
Labels: bug (Something isn't working)

@DasNordlicht

Sometimes Orion-LD stops working with the following message:

New connection socket descriptor (1024) is not less than FD_SETSIZE (1024).

Since the K8s pod doesn't crash but Orion-LD simply stops working, we only notice this when we look at the logs or a customer complains about missing data.

Are the connections perhaps not closed fast enough?

Tested with v1.2.1, v1.3.0 and v1.4.0.
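Background on the error: the message appears to come from libmicrohttpd, the HTTP library embedded in Orion-LD, when it serves connections via select(). An fd_set only covers descriptors 0 .. FD_SETSIZE-1 (FD_SETSIZE is 1024 on glibc), so once a newly accepted socket is assigned descriptor 1024 or higher, the library has to refuse it rather than corrupt the fd_set. A minimal sketch of that kind of check, not Orion-LD's or libmicrohttpd's actual code:

/* Illustration only: why a descriptor numbered >= FD_SETSIZE cannot be
 * handled by a select()-based server. */
#include <stdio.h>
#include <unistd.h>
#include <sys/select.h>

static int addConnection(int fd, fd_set* readSet)
{
  if (fd >= FD_SETSIZE)   /* FD_SET() beyond this point is undefined behaviour */
  {
    fprintf(stderr,
            "New connection socket descriptor (%d) is not less than FD_SETSIZE (%d)\n",
            fd, FD_SETSIZE);
    close(fd);            /* the only safe option: refuse the connection */
    return -1;
  }

  FD_SET(fd, readSet);
  return 0;
}

int main(void)
{
  fd_set readSet;
  FD_ZERO(&readSet);

  int fd = dup(0);        /* any valid descriptor will do for the demo */

  if (addConnection(fd, &readSet) == 0)
    close(fd);

  return 0;
}

Since the kernel always hands out the lowest free descriptor number, seeing descriptor 1024 implies that roughly 1024 descriptors are open at the same time, whatever their origin.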

@kzangeli kzangeli self-assigned this Sep 11, 2023
@kzangeli kzangeli added the bug Something isn't working label Sep 11, 2023
@kzangeli
Collaborator

ok, that's interesting!
Too many open file descriptors. Seems like a missing "close" somewhere. That would be an important bug!
Can you give me some more info on the error message about "1024"?

@DasNordlicht
Author

No, unfortunately not; the log is full of this message when it exits:
New connection socket descriptor (1024) is not less than FD_SETSIZE (1024)
And with that, Orion-LD is dead.
K8s Rev: v1.24.17
CPU 12%
MEM 23%

3 Orion-LD pods working in parallel

Containers:                                                                                                                                                                                                                                   
  orion-ld:                                                                                                                                                                                                                                   
    Container ID:  containerd://7a272e0491bf88454a39396c9d2563479dbec7ff7a7b9dc283d96cc84174f4a8                                                                                                                                              
    Image:         fiware/orion-ld:1.4.0                                                                                                                                                                                                      
    Image ID:      docker.io/fiware/orion-ld@sha256:b8f9618e9b089dcec3438c45a9e219daa36f7a428be892998703c6a6f8d736be                                                                                                                          
    Port:          1026/TCP                                                                                                                                                                                                                   
    Host Port:     0/TCP                                                                                                                                                                                                                      
    Args:                                                                                                                                                                                                                                     
      -dbhost                                                                                                                                                                                                                                 
      mongodb-headless.fiware.svc.cluster.local:27017/                                                                                                                                                                                        
      -logLevel                                                                                                                                                                                                                               
      WARN                                                                                                                                                                                                                                    
      -ctxTimeout                                                                                                                                                                                                                             
      10000                                                                                                                                                                                                                                   
      -mongocOnly                                                                                                                                                                                                                             
      -lmtmp                                                                                                                                                                                                                                  
      -coreContext                                                                                                                                                                                                                            
      v1.0                                                                                                                                                                                                                                    
    State:          Running                                                                                                                                                                                                                   
      Started:      Mon, 11 Sep 2023 12:20:24 +0200                                                                                                                                                                                           
    Ready:          True                                                                                                                                                                                                                      
    Restart Count:  0                                                                                                                                                                                                                         
    Limits:                                                                                                                                                                                                                                   
      cpu:     2                                                                                                                                                                                                                              
      memory:  16Gi                                                                                                                                                                                                                           
    Requests:                                                                                                                                                                                                                                 
      cpu:     50m                                                                                                                                                                                                                            
      memory:  64Mi                                                                                                                                                                                                                           
    Environment Variables from:                                                                                                                                                                                                               
      orion-ld-mongodb  Secret  Optional: false                                                                                                                                                                                               
    Environment:                                                                                                                                                                                                                              
      ORIONLD_MONGO_REPLICA_SET:  rs0                                                                                                                                                                                                         
      ORIONLD_MONGO_USER:         root                                                                                                                                                                                                        
      ORIONLD_MONGO_TIMEOUT:      4000                                                                                                                                                                                                        
      ORIONLD_MONGO_POOL_SIZE:    15                                                                                                                                                                                                          
      ORIONLD_MONGO_ID_INDEX:     TRUE                                                                                                                                                                                                        
      ORIONLD_STAT_COUNTERS:      TRUE                                                                                                                                                                                                        
      ORIONLD_STAT_SEM_WAIT:      TRUE                                                                                                                                                                                                        
      ORIONLD_STAT_TIMING:        TRUE                                                                                                                                                                                                        
      ORIONLD_SUBCACHE_IVAL:      60                                                                                                                                                                                                          

@kzangeli
Collaborator

ok ... difficult then ...
Describe your setup then, and I'll try to "guess" what's happening.
With "describe" I mean:

  • Tell me what features you're using, like geo-properties, tenants, and the like.
  • Also, how many entities you have, number of attributes, etc. (all approximate, naturally).
  • How many updates/second ... things like that
  • An example of a "typical" Entity
    So I have "something" to go on.

@DasNordlicht
Author

Kubernetes v1.24.17
3xOrion-LD v1.4.0
Each Orion-LD pod has the following limits:
cpu: 2
memory: 16Gi
All entities have geo-properties, and yes, we use NGSI-LD tenants.
I think we have over 30,000 entities on the platform.
I estimate that we have about 400-600 requests per second.

@DasNordlicht
Author

A typical entity, I think, is a scooter, like this:

    {
        "id": "urn:ngsi-ld:Vehicle:BOLT:6f0ad756-4c73-4110-84ee-e8c5f1e16d47",
        "type": "Vehicle",
        "dateModified": {
            "type": "Property",
            "value": {
                "@type": "DateTime",
                "@value": "2023-09-11T11:13:24.000Z"
            }
        },
        "category": {
            "type": "Property",
            "value": "private"
        },
        "location": {
            "type": "GeoProperty",
            "value": {
                "coordinates": [
                    10.189408,
                    54.334972
                ],
                "type": "Point"
            }
        },
        "name": {
            "type": "Property",
            "value": "BOLT:eScooter:6f0ad756-4c73-4110-84ee-e8c5f1e16d47"
        },
        "refVehicleModel": {
            "type": "Property",
            "value": "urn:ngsi-ld:VehicleModel:BOLT:eScooter"
        },
        "serviceStatus": {
            "type": "Property",
            "value": "parked"
        },
        "vehiclePlateIdentifier": {
            "type": "Property",
            "value": "nicht bekannt"
        },
        "vehicleType": {
            "type": "Property",
            "value": "eScooter"
        },
        "annotations": {
            "type": "Property",
            "value": [
                "android:https%3A%2F%2Fbolt.onelink.me%2Falge%2Fffkz3db2%3Fdeep_link_value%3Dbolt%25253A%25252F%25252Faction%25252FrentalsSelectVehicleByRotatedUuid%25253Frotated_uuid%25253D6f0ad756-4c73-4110-84ee-e8c5f1e16d47",
                "ios:https%3A%2F%2Fbolt.onelink.me%2Falge%2Fffkz3db2%3Fdeep_link_value%3Dbolt%25253A%25252F%25252Faction%25252FrentalsSelectVehicleByRotatedUuid%25253Frotated_uuid%25253D6f0ad756-4c73-4110-84ee-e8c5f1e16d47",
                "pricing_plan_id:5959b310-f7ed-55f9-bd65-01e2ced234c2",
                "current_range_meters:8640"
            ]
        },
        "speed": {
            "type": "Property",
            "value": 0
        },
        "bearing": {
            "type": "Property",
            "value": 0
        },
        "owner": {
            "type": "Property",
            "value": "urn:ngsi-ld:Owner:BOLT"
        }
    }

@kzangeli
Collaborator

ok, thanks. That I can use.
Approximately how long does it take before the broker starts to complain about file descriptors >= 1024?

@kzangeli
Collaborator

2 cores and 16 GB per broker should give you about 5000 updates/second, per broker.
BUT, it depends very much on the size of your mongo instance.
I'm missing that information.
Perhaps the brokers are waiting for mongo, requests queue up, and in the end there are too many connections ...

We've seen, in load tests we've performed at the FIWARE Foundation, that mongo needs about 3-5 times the resources the broker has. But that's for full speed, meaning around 15,000 updates/second (or 30k queries/sec) with your 3 brokers.

@DasNordlicht
Author

DasNordlicht commented Sep 11, 2023

The MongoDB is a replica set with 2 replicas, without limits.
The two MongoDB workers use on average 0.5 CPUs per pod and approx. 2 GB RAM.

We have about 160 connections open on average, rising briefly to 300 connections when a surge of data arrives.

@DasNordlicht
Author

ok, thanks. That I can use. Approximately how long does it take before the broker starts to complain about file descriptors >= 1024?

This sometimes happens within minutes, but it can also go well for days.
We have not yet been able to find a pattern.

@kzangeli kzangeli mentioned this issue Oct 16, 2023
@DasNordlicht
Author

Is this bug already fixed in 1.5.0?

@kzangeli
Collaborator

Sorry, no time to look into this yet.
Have you tried to increase the maximum number of file descriptors (ulimit)?

It might be that, for some reason, the broker/mongo connection is slow and requests pile up.
I really don't have any other explanation for this error right now.
There is no file descriptor leakage. Had there been, nothing would work and everybody would have trouble.

So, try ulimit for more FDs and perhaps that solves our problem.

@ibordach

ibordach commented Jan 9, 2024

I think there is a kind of leakage. What we can see:
Orion works fine for a period of time. Then suddenly it consumes hundreds of FDs in a very short period of time. If it reaches Orion's FD limit, it crashes.

New connection socket descriptor (1024) is not less than FD_SETSIZE (1024).
New connection socket descriptor (1024) is not less than FD_SETSIZE (1024).
INFO@10:33:50  orionld.cpp[521]: Signal Handler (caught signal 15)
INFO@10:33:50  orionld.cpp[528]: Orion context broker exiting due to receiving a signal 

If only around 900 FDs are in use (limit not reached), Orion does not crash, as expected, but it never frees up the FDs. We can monitor it with

watch 'ls -l /proc/1/fd|wc -l'

During this crash period, our MongoDB replica set has no real trouble with CPU or RAM; there are some slow queries.

We can (easily) reproduce it by bursting some data to Orion.

  • orion 1.4.0
  • mongodb 7.0.5 (with and without password)
  • kubernetes 1.27

ulimit:

[root@orion-ld-deployment-7b5d5f99c7-dc65w /]# ulimit -aS
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 160807
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

orion setup:

  containers:
  - args:
    - -dbhost
    - mongodbnoauth-headless.fiware-noauth.svc.cluster.local:27017/
    - -logLevel
    - DEBUG
    - -ctxTimeout
    - "10000"
    - -mongocOnly
    - -lmtmp
    - -coreContext
    - v1.0
    - -logForHumans
    env:
    - name: ORIONLD_MONGO_REPLICA_SET
      value: rs0
    - name: ORIONLD_MONGO_TIMEOUT
      value: "6000"
    - name: ORIONLD_MONGO_POOL_SIZE
      value: "15"
    - name: ORIONLD_MONGO_ID_INDEX
      value: "TRUE"
    - name: ORIONLD_STAT_COUNTERS
      value: "TRUE"
    - name: ORIONLD_STAT_SEM_WAIT
      value: "TRUE"
    - name: ORIONLD_STAT_TIMING
      value: "TRUE"
    - name: ORIONLD_STAT_NOTIF_QUEUE
      value: "TRUE"
    - name: ORIONLD_SUBCACHE_IVAL
      value: "60"
    - name: ORIONLD_DISTRIBUTED
      value: "TRUE"
    - name: ORIONLD_CONTEXT_DOWNLOAD_TIMEOUT
      value: "10000"
    - name: ORIONLD_NOTIF_MODE
      value: threadpool:120:8
    - name: ORIONLD_DEBUG_CURL
      value: "TRUE"

@kzangeli
Collaborator

kzangeli commented Jan 9, 2024

ok, interesting. Any idea what the "type" of the leaked file descriptors is?

@ibordach

ibordach commented Jan 9, 2024

[root@orion-ld-deployment-7b5d5f99c7-dc65w /]# ls -l /proc/1/fd
total 0
lrwx------ 1 root root 64 Jan  9 10:44 0 -> /dev/null
l-wx------ 1 root root 64 Jan  9 10:44 1 -> 'pipe:[72034538]'
lrwx------ 1 root root 64 Jan  9 10:55 10 -> 'socket:[72033413]'
lrwx------ 1 root root 64 Jan  9 10:55 100 -> 'socket:[72048027]'
lrwx------ 1 root root 64 Jan  9 10:55 105 -> 'socket:[72049073]'
lrwx------ 1 root root 64 Jan  9 10:55 106 -> 'socket:[72048031]'
lrwx------ 1 root root 64 Jan  9 10:55 107 -> 'socket:[72087002]'
lrwx------ 1 root root 64 Jan  9 10:55 108 -> 'socket:[72049079]'
lrwx------ 1 root root 64 Jan  9 10:55 109 -> 'socket:[72049081]'
lrwx------ 1 root root 64 Jan  9 10:55 11 -> 'socket:[72034587]'
lrwx------ 1 root root 64 Jan  9 10:55 110 -> 'socket:[72049085]'
lrwx------ 1 root root 64 Jan  9 10:55 111 -> 'socket:[72049087]'
lrwx------ 1 root root 64 Jan  9 10:55 113 -> 'socket:[72049856]'
lrwx------ 1 root root 64 Jan  9 10:55 115 -> 'socket:[72048033]'
lrwx------ 1 root root 64 Jan  9 10:55 118 -> 'socket:[72084451]'
lrwx------ 1 root root 64 Jan  9 10:55 119 -> 'socket:[72049167]'
lrwx------ 1 root root 64 Jan  9 10:55 12 -> 'socket:[72033415]'
lrwx------ 1 root root 64 Jan  9 10:55 121 -> 'socket:[72049161]'
lrwx------ 1 root root 64 Jan  9 10:55 122 -> 'socket:[72051305]'
lrwx------ 1 root root 64 Jan  9 10:55 123 -> 'socket:[72049990]'
lrwx------ 1 root root 64 Jan  9 10:55 124 -> 'socket:[72050456]'
lrwx------ 1 root root 64 Jan  9 10:55 125 -> 'socket:[72094222]'
lrwx------ 1 root root 64 Jan  9 10:55 127 -> 'socket:[72049165]'
lrwx------ 1 root root 64 Jan  9 10:55 128 -> 'socket:[72092367]'
lrwx------ 1 root root 64 Jan  9 10:55 13 -> 'socket:[72033418]'
lrwx------ 1 root root 64 Jan  9 10:55 130 -> 'socket:[72086227]'
lrwx------ 1 root root 64 Jan  9 10:55 133 -> 'socket:[72050036]'
lrwx------ 1 root root 64 Jan  9 10:55 137 -> 'socket:[72050040]'
lrwx------ 1 root root 64 Jan  9 10:55 138 -> 'socket:[72099212]'
lrwx------ 1 root root 64 Jan  9 10:55 14 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan  9 10:55 140 -> 'socket:[72087112]'
lrwx------ 1 root root 64 Jan  9 10:55 142 -> 'socket:[72050100]'
...

@kzangeli
Collaborator

kzangeli commented Jan 9, 2024

ok, a socket ... doesn't really help.
I'll try to add traces. There's not very much I can do, though, as it's normally 3rd-party libraries doing this kind of thing.

@ibordach

ibordach commented Jan 9, 2024

OK, we need more debugging options. We stripped Orion down to only one subscription. We send nearly no data to Orion. At startup, Orion consumes 20 FDs for a period of time. Then suddenly it consumes about 170 FDs and never frees them up.

@kzangeli
Collaborator

kzangeli commented Jan 9, 2024

Yeah, I was thinking a little about all this ...

I was going to propose a new test without any subscription at all,
because a "misbehaving" receiver of notifications could cause this problem, I believe.
A connection that is not properly closed can linger for 15 minutes in the worst case ...

So, please remove that last subscription and see if the problem disappears.
That would be very important input for this issue.

@ibordach

ok, this bug has nothing to do with the replica set or MongoDB authentication.

@kzangeli
Collaborator

ok, this bug has nothing to do with the replica set or MongoDB authentication.

That's good to know.
I think it may have to do with notification recipients that don't close their part of the socket connection.
At least, that's a thing I'd like to have investigated.

A test, doing the exact same thing, but without any notifications (no matching subscriptions) would give important input.
And, if that works, then add subscriptions but with a well-known notification receiver, one that we are sure closes its FDs.

Just to rule one thing out.
This is not an easy issue ...
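A minimal example of such a well-known receiver, i.e. one that always answers and always closes its end of the connection, is sketched below. It assumes plain HTTP on an arbitrary port (9998) and is not part of any FIWARE tooling:

/* Minimal well-behaved notification receiver: accept, read the notification,
 * answer 200 OK, close. If the broker still leaks FDs against this receiver,
 * the receiver side can be ruled out. Port 9998 is arbitrary. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
  int                listenFd = socket(AF_INET, SOCK_STREAM, 0);
  int                on       = 1;
  struct sockaddr_in addr;

  setsockopt(listenFd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
  memset(&addr, 0, sizeof(addr));
  addr.sin_family      = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port        = htons(9998);

  if (bind(listenFd, (struct sockaddr*) &addr, sizeof(addr)) != 0 || listen(listenFd, 128) != 0)
  {
    perror("bind/listen");
    return 1;
  }

  const char* response = "HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n";

  while (1)
  {
    int  fd = accept(listenFd, NULL, NULL);
    char buf[4096];

    if (fd < 0)
      continue;

    read(fd, buf, sizeof(buf));                /* read (enough of) the notification */
    write(fd, response, strlen(response));     /* always respond ...                */
    close(fd);                                 /* ... and always close our end      */
  }
}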

@ibordach

Hmm ... the problem persists without any subscriptions.

@kzangeli
Collaborator

ok, valuable info.
It's something else then ...
Pity!

@kzangeli
Collaborator

So, I found something interesting in the mongoc release notes for 1.24.0 (Orion-LD currently uses 1.22.0 of libmongoc):

New Features:
  Support MongoDB server version 7.0.

So, one thing to test is to bump the mongoc version up to 1.24.0 (had problems compiling 1.25.x, so, that later)

AND, another thing we could try is to disable the Prometheus metrics, which are on by default.
libprom is one of 3-4 libs that open file descriptors ...

This is all a bit of blindly trying things, as I have no clue what leaves those file descriptors open.
But, perhaps we get lucky! :)

Two PRs coming:

  1. Bump the libmongoc version
  2. Add a CLI option (-no-metrics) to turn off libprom (it will stay "on" by default, that won't change - you'll have to start the broker with this new option)

@ibordach

ibordach commented Feb 12, 2024

We just tested version 1577. The -noprom feature did not fix the problem.

@kzangeli
Collaborator

ok, good to know.
We'll keep trying.
The unmerged PR is giving me problems and I have a DevOps expert on it.
I'm hoping it will be merged later today but I can't promise anything. Out of my hands.

@ibordach

The FD_SETSIZE error also persists with 1581.

@kzangeli
Collaborator

ok, I'd add a bit more. 10x for example.

Now, an update.
MongoDB 7.0 isn't supported by the mongoc driver (v1.22.0) that is currently in use.
That's one possible culprit.
So, I've been working on updating the mongoc version over the last few days.
There's a serious problem in the GitHub Actions scripts of Orion-LD that has made such a simple update quite difficult.
But we're on it. I found help and we're almost there.
Hopefully tomorrow.

Once we have Orion-LD linked with mongoc 1.24.2 (which supports MongoDB 7.0), I'll let you know and we'll try again.

@ibordach

ok, this error still persists with 1.6.0-PRE-1587, with mongo 6 and mongo 7.
If the FD_SETSIZE error comes up, Orion no longer responds to HTTP requests.

@kzangeli
Collaborator

I did a quick search on the error (should have done that ages ago ...).
The "article" is old, I know, but it might still be interesting.
Have a look at it and tell me what you think:
https://www.mongodb.com/community/forums/t/mongodb-4-2-19-extremely-high-file-descriptor-counts-70k/170223

@ibordach

Ok, I checked some things on mongo.
We have no problems with mongo: no crashes, no high FD usage for open files, no replication problems. Mongo is peaceful :-)

@kzangeli
Collaborator

kzangeli commented Mar 1, 2024

ok, I guess that if the mongo server is OK, the mongo driver must be as well.
What on earth can it be then???

@ibordach

ibordach commented Mar 1, 2024

👽Back to the roots.
Hopefully next week I will try to set up a fresh Orion-LD with a fresh mongo in a separate namespace. After that, we'll start some experiments, step by step. I can't believe we won't find the beast.

@kzangeli
Collaborator

kzangeli commented Mar 1, 2024

Yeah ... we'll get there I'm sure.
I'm away next week, FIWARE Foundation All-Hands Mon-Wed.
Then the week after that I'm preparing to move.
So, won't be too available. I'll do my best.

@ibordach

ibordach commented Apr 25, 2024

Small update on this: Orion-LD (1.6.0-PRE-1608) rarely crashes without QuantumLeap. If we enable QuantumLeap, we get tons of crashes.

It could have something to do with subscriptions.

@ibordach

Something else we notice:
We get notifications with multiple entities, but they are all the same entity with different or old data. Should that happen? We expect to get the latest entity data.

Example, with one unique entity appearing several times with different (old) data:

{
    "_msgid": "12345",
    "payload": {
        "id": "urn:ngsi-ld:Notification:12345",
        "type": "Notification",
        "subscriptionId": "urn:ngsi-ld:subscription:12345",
        "notifiedAt": "2024-05-30T07:35:24.725Z",
        "data": [
            {
                "id": "urn:ngsi-ld:Vehicle:AIS:211891460",
                "type": "Vehicle",
                "vehicleConfiguration": "Other",
                "name": "WAVELAB",
                "vehicleIdentificationNumber": 4814550,
                "vehiclePlateIdentifier": "DD8087",
                "vehicleType": "vessel",
                "category": "tracked",
                "refVehicleModel": "urn:ngsi-ld:VehicleModel:vessel:211891460",
                "observationDateTime": {
                    "@type": "DateTime",
                    "@value": "2024-05-29T15:23:52.011Z"
                },
                "location": {
                    "coordinates": [
                        10.16625,
                        54.3388
                    ],
                    "type": "Point"
                },
                "speed": 6.3,
                "source": "IN:HDT",
                "dateObserved": {
                    "@type": "DateTime",
                    "@value": "2024-05-30T07:34:49.778Z"
                },
                "bearing": 5.82
            },
            {
                "id": "urn:ngsi-ld:Vehicle:AIS:211891460",
                "type": "Vehicle",
                "vehicleConfiguration": "Other",
                "name": "WAVELAB",
                "vehicleIdentificationNumber": 4814550,
                "vehiclePlateIdentifier": "DD8087",
                "vehicleType": "vessel",
                "category": "tracked",
                "refVehicleModel": "urn:ngsi-ld:VehicleModel:vessel:211891460",
                "observationDateTime": {
                    "@type": "DateTime",
                    "@value": "2024-05-29T15:23:52.011Z"
                },
                "location": {
                    "coordinates": [
                        10.16625,
                        54.3388
                    ],
                    "type": "Point"
                },
                "speed": 6.3,
                "dateObserved": {
                    "@type": "DateTime",
                    "@value": "2024-05-30T07:34:49.922Z"
                },
                "bearing": 5.8,
                "source": "IN:HDT"
            },
            {
                "id": "urn:ngsi-ld:Vehicle:AIS:211891460",
                "type": "Vehicle",
                "vehicleConfiguration": "Other",
                "name": "WAVELAB",
                "vehicleIdentificationNumber": 4814550,
                "vehiclePlateIdentifier": "DD8087",
                "vehicleType": "vessel",
                "category": "tracked",
                "refVehicleModel": "urn:ngsi-ld:VehicleModel:vessel:211891460",
                "observationDateTime": {
                    "@type": "DateTime",
                    "@value": "2024-05-29T15:23:52.011Z"
                },
                "location": {
                    "coordinates": [
                        10.16625,
                        54.3388
                    ],
                    "type": "Point"
                },
                "speed": 6.3,
                "dateObserved": {
                    "@type": "DateTime",
                    "@value": "2024-05-30T07:34:50.151Z"
                },
                "bearing": 5.8,
                "source": "IN:HDT"
            },
            {
                "id": "urn:ngsi-ld:Vehicle:AIS:211891460",
[...]

@ibordach

By chance, a new find. Right before the FD_SETSIZE error we saw:
WARN@12:49:35 orionldAlterationsTreat.cpp[329]: Still not enough bytes read for the notification response body. I give up

Might that help?

@kzangeli
Collaborator

It just might.
If the broker is still waiting to read bytes ...
I'll check what it does after saying "I give up". If it does not close that connection ... let's hope not!!! :)
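For illustration, a sketch of the pattern under discussion (assumed names and structure, not Orion-LD's actual notification code): when reading the notification response times out and the code "gives up", the socket still has to be closed on that path, otherwise every timed-out notification leaks one descriptor:

/* Illustration of the suspected bug class: every exit path of the
 * response-reading code must close the socket, including the "give up" path.
 * The function name and message are for the sketch only. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/select.h>

static int readNotificationResponse(int fd, char* buf, size_t bufLen, int timeoutSecs)
{
  fd_set         readSet;
  struct timeval tv = { timeoutSecs, 0 };

  FD_ZERO(&readSet);
  FD_SET(fd, &readSet);

  if (select(fd + 1, &readSet, NULL, NULL, &tv) <= 0)
  {
    fprintf(stderr, "Still not enough bytes read for the notification response body. I give up\n");
    close(fd);        /* without this, the descriptor lingers until the peer closes */
    return -1;
  }

  ssize_t n = read(fd, buf, bufLen - 1);
  if (n > 0)
    buf[n] = 0;

  close(fd);          /* the normal path closes as well (no keep-alive in this sketch) */
  return (n > 0) ? 0 : -1;
}

int main(void)
{
  int  p[2];
  char buf[512];

  pipe(p);            /* nothing is ever written: forces the timeout ("give up") path */
  readNotificationResponse(p[0], buf, sizeof(buf), 1);
  close(p[1]);
  return 0;
}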

@kzangeli
Collaborator

So, I created a functest with 1000 notifications to a "bad notification client" - a notification client that accepts the connection and reads the incoming notification BUT it doesn't respond. It leaves the poor broker waiting for that response, and finally it times out.
It didn't work too well on the broker side; I had to fix the mechanism a little.

Unfortunately, I believe I asked you guys to start the broker without any subscriptions at all and you still had your problem, so, this fix (coming in a few hours) doesn't change anything for you.

I'm still not convinced it's not just "normal execution". File descriptors are needed and the more load the broker receives, the more simultaneously open fds there will be.
I'd set the fd-max to a high value (hundreds of thousands) and test. As far as I know you haven't done that test yet and I really don't understand why.

Anyway, new version on its way, even though I doubt it will fix your problem.
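For anyone wanting to reproduce that kind of test locally, a "bad notification client" can be approximated with a few lines of C: a server that accepts the connection, reads the notification and then simply never responds, leaving the broker waiting until its timeout. Port 9997 and the details below are assumptions for the sketch, not the actual functest code:

/* A deliberately misbehaving notification receiver: accepts, reads the
 * incoming notification, never sends an HTTP response and never closes.
 * Point a subscription's notification endpoint at this port to exercise
 * the broker's timeout/close handling. Port 9997 is arbitrary. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
  int                listenFd = socket(AF_INET, SOCK_STREAM, 0);
  int                on       = 1;
  struct sockaddr_in addr;

  setsockopt(listenFd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
  memset(&addr, 0, sizeof(addr));
  addr.sin_family      = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port        = htons(9997);

  if (bind(listenFd, (struct sockaddr*) &addr, sizeof(addr)) != 0 || listen(listenFd, 128) != 0)
  {
    perror("bind/listen");
    return 1;
  }

  while (1)
  {
    int  fd = accept(listenFd, NULL, NULL);
    char buf[4096];

    if (fd < 0)
      continue;

    read(fd, buf, sizeof(buf));   /* read (part of) the notification ...              */
    /* ... and then do nothing: no response, no close(fd). The broker must eventually
       time out and close its side, or it leaks the descriptor. */
  }
}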

@kzangeli
Collaborator

The PR has been merged, in case you want to test. Dockerfiles should be ready shortly.

@ibordach

Thanks, we will test it.

About the fd-max: as far as I can see, we have no limit of 1024 file descriptors.

[root@orion-ld-deployment-84fcb9c689-86jf2 /]# ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 160805
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Or how do you increase FD_SETSIZE without changing the source?
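On that question: ulimit -n (RLIMIT_NOFILE) and FD_SETSIZE are two independent ceilings. The first is a per-process kernel limit that can be raised at runtime, and the output above shows it is already 1048576. The second is a compile-time constant of the C library (1024 on glibc) that bounds what select() can handle, so it cannot be raised from outside the binary; the usual fix is for the code to use poll()/epoll instead (libmicrohttpd, for instance, offers an epoll mode). A small sketch that prints both values:

/* Prints the two different "fd limits" at play: the per-process limit that
 * ulimit -n controls, and the compile-time FD_SETSIZE that select() is bound to.
 * Raising the former (as in the ulimit output above) does not touch the latter. */
#include <stdio.h>
#include <sys/select.h>
#include <sys/resource.h>

int main(void)
{
  struct rlimit rl;

  if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
    printf("RLIMIT_NOFILE (ulimit -n): soft=%llu hard=%llu\n",
           (unsigned long long) rl.rlim_cur,
           (unsigned long long) rl.rlim_max);

  printf("FD_SETSIZE (select() ceiling): %d\n", FD_SETSIZE);
  return 0;
}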

@ibordach

What we did in the last few weeks to reduce the number of "crashes":

  1. We reduced the number of upserts per second massively
  2. We reduced the number of subscriptions
  3. We reduced the use of idPattern in subscriptions (that seems to be successful)

After all these actions, we were able to reduce our 8 instances of Orion-LD to only 2, possibly one.

But we will investigate further. idPattern could be a big problem.
