
How to manually set the number of router threads #1042

Closed · pbromb opened this issue Dec 30, 2023 · 13 comments · Fixed by #1385

pbromb commented Dec 30, 2023

Hello,
I am running an application using Nginx Unit in a Kubernetes environment.
The unit: router process automatically starts with a number of threads equal to the number of CPU cores on the physical server.

src/nxt_router.c

if (rtcf->threads == 0) {
    rtcf->threads = nxt_ncpu;
}

In the code I also found:

src/nxt_router.c

static nxt_conf_map_t  nxt_router_conf[] = {
    {
        nxt_string("listeners_threads"),
        NXT_CONF_MAP_INT32,
        offsetof(nxt_router_conf_t, threads),
    },
};

but I was not able to set this from the configuration file.

Is it possible to set this value?

ac000 (Member) commented Jan 3, 2024 via email

juliantaylor commented Aug 9, 2024

There needs to be a way to configure the number of threads, in particular because Unit's spinlocks spin with sched_yield() and never sleep, which performs extremely badly when the CPUs are overcommitted.

Running Unit on a host with 64 cores (including hyperthreads) but a CPU limit of 2 will, under load, cause the system to spend 90% or more of its time rescheduling the threads waiting for the lock, while the process actually holding the lock has only a very small chance of running and releasing it.

A sample profile of an application under this load looks like this:

-   92.81%     0.00%  unitd    [unknown]    [k] 0x0000000000000004
   - 0x4
      - 91.42% 0x7f844e03f527
         - 55.42% entry_SYSCALL_64_after_hwframe
            - do_syscall_64
               - 39.63% __x64_sys_sched_yield
                  - 34.50% schedule
                     - 34.11% __schedule
                        - 23.19% pick_next_task_fair

Pretty much all of the CPU usage ends up in the userspace part of pthread_yield() and in the kernel scheduler.

In general, using spinlocks from userspace is very dangerous: userspace cannot portably determine whether the lock holder is currently running (and go to sleep if it is not), so a preempted lock holder can be blocked from releasing the lock by the very threads spinning on it, and sched_yield() gives no way to hand scheduling priority back to the holder.

The discrepancy between the CPU available to the process group and the number of threads that can contend for the locks, which is common on Kubernetes/cfs_quota-using systems, amplifies these problems massively, and having no option to reduce the thread count is a large problem.

An environment variable that allows configuring the thread count would be a good solution.
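For illustration, such an override might look something like the sketch below; the NXT_ROUTER_THREADS variable name and the helper are hypothetical, not part of Unit.

/* Hypothetical sketch -- the NXT_ROUTER_THREADS variable and this helper
 * are illustrative only, not Unit's actual implementation. */
#include <stdlib.h>

static int
router_thread_count(int detected_ncpu)
{
    long         n;
    const char  *env = getenv("NXT_ROUTER_THREADS");

    if (env != NULL) {
        n = strtol(env, NULL, 10);

        if (n > 0 && n < 1024) {
            return (int) n;          /* explicit override wins */
        }
    }

    return detected_ncpu;            /* otherwise keep the detected count */
}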

ac000 (Member) commented Aug 9, 2024

I hear you regarding the use of spinlocks.

We had an internal discussion a while back about the use of sched_yield(2), which, it seems, should at the very least only be used in conjunction with real-time scheduling.

It would be interesting to see what effect this patch has, whether it's good, bad or indifferent (yes, it may likely fubar things)...

diff --git ./src/nxt_spinlock.c ./src/nxt_spinlock.c
index 940be724..78636fc2 100644
--- ./src/nxt_spinlock.c
+++ ./src/nxt_spinlock.c
@@ -82,8 +82,6 @@ nxt_thread_spin_lock(nxt_thread_spinlock_t *lock)
                 goto again;
             }
         }
-
-        nxt_thread_yield();
     }
 }
 

Then there is this code

/* It should be adjusted with the "spinlock_count" directive. */
static nxt_uint_t  nxt_spinlock_count = 1000;


void
nxt_thread_spin_init(nxt_uint_t ncpu, nxt_uint_t count)
{
    switch (ncpu) {

    case 0:
        /* Explicit spinlock count. */
        nxt_spinlock_count = count;
        break;

    case 1:
        /* Spinning is useless on UP. */
        nxt_spinlock_count = 0;
        break;

    default:
        /*
         * SMP.
         *
         * TODO: The count should be 10 on a virtualized system
         * since virtualized CPUs may share the same physical CPU.
         */
        nxt_spinlock_count = 1000;
        break;
    }
}

in src/nxt_spinlock.c

(I'm not sure in what cases ncpu would be 0)

This function is called once from Unit with the arguments of (nr_cpus, 0)

It may be worth trying to lower (I guess) nxt_spinlock_count (in the default case of the switch statement)...
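For reference, given the function above, the two call patterns look like this (nr_cpus standing in for whatever CPU count Unit detected; the explicit call is hypothetical):

/* What Unit currently does, per the above: on SMP this leaves
 * nxt_spinlock_count at its default of 1000. */
nxt_thread_spin_init(nr_cpus, 0);

/* Hypothetical explicit setting: passing ncpu == 0 takes the
 * "explicit spinlock count" branch, so the second argument is used. */
nxt_thread_spin_init(0, 10);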

@juliantaylor

Decreasing nxt_spinlock_count would likely have a detrimental effect, as it would cause sched_yield to be called more often, which triggers more kernel rescheduling; that is very costly, especially on larger systems.

The yielding approach is somewhat outdated; the idea is that the spinning thread gives up its timeslice so that, hopefully, the process holding the lock runs and releases it.
This works out when the machine is not very loaded and yielding this one waiter has a high chance of scheduling the holder. But it is terrible when there are multiple spinning waiters and one holder, as is the case here, where there can be dozens or hundreds of waiters and one holder; unless there are also dozens of free CPU cores (or allowed quota, as in the case of e.g. Kubernetes pods), the chance that the holder actually gets scheduled is very low (1/number of spinning waiters).

What spinlocks need to do is go to sleep when the holder is not currently running, so that the spinning waiters don't prevent the holder from running and freeing the lock, but that is not actually trivial to do from userspace.

A solution to this using restartable sequences was proposed for Linux not very long ago, and I think it was even merged (see https://lwn.net/Articles/944895/), but it would require a very recent kernel and is of course not portable.
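For context, a common Linux-only compromise is a lock that spins briefly and then sleeps in the kernel via futex(2) instead of calling sched_yield(); a minimal sketch follows (illustrative only, this is not Unit's nxt_thread_spinlock_t).

/* Minimal spin-then-sleep lock sketch (Linux-only, illustrative).
 * States: 0 = free, 1 = locked, 2 = locked with waiters. */
#define _GNU_SOURCE
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

typedef _Atomic int  futex_lock_t;

static void
futex_wait(futex_lock_t *l, int expected)
{
    /* Sleep only while *l still equals 'expected'. */
    syscall(SYS_futex, l, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

static void
futex_wake_one(futex_lock_t *l)
{
    syscall(SYS_futex, l, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}

static void
futex_lock(futex_lock_t *l)
{
    int  c = 0, spin;

    /* Brief bounded spin: try to grab a free lock (0 -> 1). */
    for (spin = 0; spin < 100; spin++) {
        c = 0;
        if (atomic_compare_exchange_strong(l, &c, 1)) {
            return;
        }
    }

    /* Contended: mark the lock as having waiters (state 2) and sleep in
     * the kernel until woken, instead of burning CPU on sched_yield(). */
    if (c != 2) {
        c = atomic_exchange(l, 2);
    }

    while (c != 0) {
        futex_wait(l, 2);
        c = atomic_exchange(l, 2);
    }
}

static void
futex_unlock(futex_lock_t *l)
{
    /* If there were waiters, wake one of them up. */
    if (atomic_exchange(l, 0) == 2) {
        futex_wake_one(l);
    }
}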

ac000 (Member) commented Aug 10, 2024

Testing our spinlock implementation in an (albeit simple) test program where a number of threads try to increment a counter:

By decreasing nxt_spinlock_count to 10 I see a reduction in runtime of ~50-75%, so it may be worth a shot...
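For reference, a rough stand-in for that kind of test is sketched below, using plain pthreads and a naive test-and-set spinlock rather than Unit's nxt_thread_spin_lock(); the numbers above come from the real implementation, not from this sketch.

/* N threads bump a shared counter under a naive test-and-set spinlock. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERATIONS  1000000

static atomic_flag  lock = ATOMIC_FLAG_INIT;
static long         counter;

static void *
worker(void *arg)
{
    (void) arg;

    for (int i = 0; i < ITERATIONS; i++) {
        while (atomic_flag_test_and_set_explicit(&lock,
                                                 memory_order_acquire))
        {
            /* busy-wait: this is where a yield/back-off policy would go */
        }

        counter++;

        atomic_flag_clear_explicit(&lock, memory_order_release);
    }

    return NULL;
}

int
main(int argc, char **argv)
{
    int        nthreads = (argc > 1) ? atoi(argv[1]) : 4;
    pthread_t  tids[nthreads];

    for (int i = 0; i < nthreads; i++) {
        pthread_create(&tids[i], NULL, worker, NULL);
    }

    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
    }

    printf("counter = %ld (expected %ld)\n",
           counter, (long) nthreads * ITERATIONS);

    return 0;
}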

ac000 (Member) commented Aug 10, 2024

Just to clarify

Running unit on a host with 64 cores (including hyperthreads) with a cpu limit of 2 under load will cause the system to

What is meant by 'cpu limit of 2'?

How many router threads were there?

Is this bare metal or a VM?

Was this with an application or just static file serving?

Is the above profile for a particular lock or an amalgamation of all the spin-locks?

@juliantaylor

A containerized workload; the host has 64-256 cores (including hyperthreads) and the application has access to two of them (using cgroups CFS quotas).
There are as many router threads as the host has cores, as it is not configurable.

The profile is just a snapshot of what was currently running; it could be any of the locks.

ac000 (Member) commented Aug 10, 2024

Ah, OK, so the system may have 64 'cpus', but you're limiting Unit to only two of those, and _SC_NPROCESSORS_ONLN is not affected by things like cpusets and sched_setaffinity(2).

On Linux at least (need to check other systems) we could use sched_getaffinity(2) instead to get the number of 'cpus' we are allowed to run on.
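Something along these lines (a Linux-only sketch; the helper name is just for illustration):

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/* Count the CPUs this process is actually allowed to run on, falling
 * back to the online-CPU count if the affinity call fails. */
static long
available_cpus(void)
{
    cpu_set_t  set;

    CPU_ZERO(&set);

    if (sched_getaffinity(0, sizeof(set), &set) == 0) {
        return CPU_COUNT(&set);
    }

    return sysconf(_SC_NPROCESSORS_ONLN);
}

(The fixed-size cpu_set_t only covers CPUs 0-1023; the commit further down allocates the set dynamically to cope with bigger machines.)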

juliantaylor commented Aug 11, 2024

Affinity is only one of the options; more commonly, CPU bandwidth is limited using cgroups instead of pinning to cores.

There are two versions. cgroup v2 uses /sys/fs/cgroup/cpu.max, which contains the timeslice the process may run and the period the timeslice applies to; the available CPU is the timeslice divided by the period (the result can be fractional).
See cpu.max in https://docs.kernel.org/admin-guide/cgroup-v2.html

The older variant, cgroup v1, uses two pseudo-files, cfs_quota_us and cfs_period_us, in the cgroup filesystem.
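For illustration, deriving an effective CPU count from cgroup v2's cpu.max could look roughly like the sketch below; the fixed path assumes the process's cgroup is mounted at /sys/fs/cgroup (typical inside a container), and the round-up policy is an assumption, not Unit code.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse /sys/fs/cgroup/cpu.max, which contains "<quota> <period>" or
 * "max <period>"; e.g. "200000 100000" means 2 CPUs worth of bandwidth. */
static long
cgroup_v2_cpu_limit(long fallback)
{
    long    period;
    char    quota[32];
    FILE   *fp;

    fp = fopen("/sys/fs/cgroup/cpu.max", "r");
    if (fp == NULL) {
        return fallback;                 /* no cgroup v2 cpu controller */
    }

    if (fscanf(fp, "%31s %ld", quota, &period) != 2) {
        fclose(fp);
        return fallback;
    }

    fclose(fp);

    if (quota[0] == 'm') {               /* "max" means no limit */
        return fallback;
    }

    /* Round fractional limits up, e.g. "150000 100000" -> 2. */
    return (long) ceil(atof(quota) / (double) period);
}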

ac000 (Member) commented Aug 11, 2024

Yes, but trying to determine the number of threads to run based on things like CPU quota is a slightly different problem from just getting the number of threads to match the available number of CPUs.

Maybe at some point it makes sense to make it configurable, but at the very least we should try to get it basically right by default...

juliantaylor commented Aug 11, 2024

It's the same problem; in particular, with your choice of lock and implementation you must never start more threads than can actually run, and the method by which your application is limited does not matter.
Bandwidth limiting via cgroups is extremely common; supporting it will give you the most coverage on Linux systems.

Making it configurable is preferable, though, as it allows users to override the bad defaults when you don't properly detect the environment; it's also about a three-line patch if you use an environment variable.

ac000 (Member) commented Aug 11, 2024

its the same problem, in particular with your choice of lock and implementation you must never start more threads than

Yes, this is a whole bigger discussion.

can actually run, the method on how your application is limited does not matter. bandwidth limiting via cgroups is extremely common, supporting this will give you the most coverage on linux systems

I have a patch that makes use of sched_getaffinity(2) on Linux. This will help if Unit is being run under cpusets (cpuset(7)) or started via something like taskset(1).

It will also help with things like Docker where the --cpuset-cpus= option has been used.

That's a simple, easy win.

Determining the number of threads to create based on cpu quotas is a slightly different problem that will take longer to develop and test etc (if it's something we even wish to do...).

making it configurable is preferable though as it allows users to override the bad defaults when you don't properly detect the environment, its also like a 3 line patch if you use an environment variable.

Hooking it up to the config system is relatively simple and we may do that, but we should first try to get things right by default.

I'll bring the question of making it configurable up with the wider team...

ac000 added a commit to ac000/unit that referenced this issue Aug 12, 2024
At startup, the unit router process creates a number of threads; it
tries to create the same number of threads (not incl the main thread)
as there are 'cpus' in the system.

On Linux the number of available cpus is determined via a call to

  sysconf(_SC_NPROCESSORS_ONLN);

In a lot of cases this produces the right result, i.e. on a four cpu
system this will return 4.

However this can break down if unit has been restricted in the cpus it's
allowed to run on via something like cpuset()'s and/or
sched_setaffinity(2).

For example, on a four 'cpu' system, starting unit will create an extra
4 router threads

  $ /opt/unit/sbin/unitd
  $ ps -efL | grep router
  andrew   234102 234099 234102  0    5 17:00 pts/10   00:00:00 unit: router
  andrew   234102 234099 234103  0    5 17:00 pts/10   00:00:00 unit: router
  andrew   234102 234099 234104  0    5 17:00 pts/10   00:00:00 unit: router
  andrew   234102 234099 234105  0    5 17:00 pts/10   00:00:00 unit: router
  andrew   234102 234099 234106  0    5 17:00 pts/10   00:00:00 unit: router

Say we want to limit unit to two cpus, i.e.

  $ taskset -a -c 2-3 /opt/unit/sbin/unitd
  $ ps -efL | grep router
  andrew   235772 235769 235772  0    5 17:08 pts/10   00:00:00 unit: router
  andrew   235772 235769 235773  0    5 17:08 pts/10   00:00:00 unit: router
  andrew   235772 235769 235774  0    5 17:08 pts/10   00:00:00 unit: router
  andrew   235772 235769 235775  0    5 17:08 pts/10   00:00:00 unit: router
  andrew   235772 235769 235776  0    5 17:08 pts/10   00:00:00 unit: router

So despite limiting unit to two cpus

  $ grep Cpus_allowed_list /proc/235772/status
  Cpus_allowed_list:	2-3

It still created 4 threads.  That's probably not such an issue in this
case, but if we had a 64 'cpu' system and wanted to limit unit to two
cpus, then we'd have 64 threads vying to run on two cpus, and with our
spinlock implementation this can cause a lot of thread scheduling and
congestion overhead.

Besides, our intention is currently to create nr router threads == nr
cpus.

To resolve this, on Linux at least, this patch makes use of
sched_getaffinity(2) to determine what cpus unit is actually allowed to
run on.

We still use the result of

  sysconf(_SC_NPROCESSORS_ONLN);

as a fallback; we also use its result to allocate the required cpuset
size (where sched_getaffinity() will store its result), as the standard
cpu_set_t only has space for 1024 cpus (numbered 0-1023).

So with this patch if we try to limit unit to two cpus we now get

  $ taskset -a -c 2-3 /opt/unit/sbin/unitd
  $ ps -efL | grep router
  andrew   236887 236884 236887  0    3 17:20 pts/10   00:00:00 unit: router
  andrew   236887 236884 236888  0    3 17:20 pts/10   00:00:00 unit: router
  andrew   236887 236884 236889  0    3 17:20 pts/10   00:00:00 unit: router

This also applies to the likes of docker: if you run docker with the
--cpuset-cpus="" option, unit will now create a number of router
threads that matches the cpu count specified.

Perhaps useful if you are running a number of unit docker instances on a
high cpu count machine.

Link: <nginx#1042>
Signed-off-by: Andrew Clayton <a.clayton@nginx.com>
ac000 (Member) commented Aug 12, 2024

Here's a patch to enable configuring the number of router threads...

diff --git ./src/nxt_conf_validation.c ./src/nxt_conf_validation.c
index 04091745..86199f84 100644
--- ./src/nxt_conf_validation.c
+++ ./src/nxt_conf_validation.c
@@ -240,6 +240,7 @@ static nxt_int_t nxt_conf_vldt_js_module_element(nxt_conf_validation_t *vldt,
 
 static nxt_conf_vldt_object_t  nxt_conf_vldt_setting_members[];
 static nxt_conf_vldt_object_t  nxt_conf_vldt_http_members[];
+static nxt_conf_vldt_object_t  nxt_conf_vldt_router_members[];
 static nxt_conf_vldt_object_t  nxt_conf_vldt_websocket_members[];
 static nxt_conf_vldt_object_t  nxt_conf_vldt_static_members[];
 static nxt_conf_vldt_object_t  nxt_conf_vldt_forwarded_members[];
@@ -309,6 +310,11 @@ static nxt_conf_vldt_object_t  nxt_conf_vldt_setting_members[] = {
         .type       = NXT_CONF_VLDT_OBJECT,
         .validator  = nxt_conf_vldt_object,
         .u.members  = nxt_conf_vldt_http_members,
+    }, {
+        .name       = nxt_string("router"),
+        .type       = NXT_CONF_VLDT_OBJECT,
+        .validator  = nxt_conf_vldt_object,
+        .u.members  = nxt_conf_vldt_router_members,
 #if (NXT_HAVE_NJS)
     }, {
         .name       = nxt_string("js_module"),
@@ -377,6 +383,16 @@ static nxt_conf_vldt_object_t  nxt_conf_vldt_http_members[] = {
 };
 
 
+static nxt_conf_vldt_object_t  nxt_conf_vldt_router_members[] = {
+    {
+        .name       = nxt_string("threads"),
+        .type       = NXT_CONF_VLDT_INTEGER,
+    },
+
+    NXT_CONF_VLDT_END
+};
+
+
 static nxt_conf_vldt_object_t  nxt_conf_vldt_websocket_members[] = {
     {
         .name       = nxt_string("read_timeout"),
diff --git ./src/nxt_router.c ./src/nxt_router.c
index 43209451..2676402b 100644
--- ./src/nxt_router.c
+++ ./src/nxt_router.c
@@ -1412,7 +1412,7 @@ nxt_router_conf_send(nxt_task_t *task, nxt_router_temp_conf_t *tmcf,
 
 static nxt_conf_map_t  nxt_router_conf[] = {
     {
-        nxt_string("listeners_threads"),
+        nxt_string("threads"),
         NXT_CONF_MAP_INT32,
         offsetof(nxt_router_conf_t, threads),
     },
@@ -1630,7 +1630,7 @@ nxt_router_conf_create(nxt_task_t *task, nxt_router_temp_conf_t *tmcf,
     nxt_conf_value_t            *js_module;
 #endif
     nxt_conf_value_t            *root, *conf, *http, *value, *websocket;
-    nxt_conf_value_t            *applications, *application;
+    nxt_conf_value_t            *applications, *application, *router_conf;
     nxt_conf_value_t            *listeners, *listener;
     nxt_socket_conf_t           *skcf;
     nxt_router_conf_t           *rtcf;
@@ -1640,6 +1640,7 @@ nxt_router_conf_create(nxt_task_t *task, nxt_router_temp_conf_t *tmcf,
     nxt_router_app_conf_t       apcf;
     nxt_router_listener_conf_t  lscf;
 
+    static const nxt_str_t  router_path = nxt_string("/settings/router");
     static const nxt_str_t  http_path = nxt_string("/settings/http");
     static const nxt_str_t  applications_path = nxt_string("/applications");
     static const nxt_str_t  listeners_path = nxt_string("/listeners");
@@ -1673,11 +1674,14 @@ nxt_router_conf_create(nxt_task_t *task, nxt_router_temp_conf_t *tmcf,
     rtcf = tmcf->router_conf;
     mp = rtcf->mem_pool;
 
-    ret = nxt_conf_map_object(mp, root, nxt_router_conf,
-                              nxt_nitems(nxt_router_conf), rtcf);
-    if (ret != NXT_OK) {
-        nxt_alert(task, "root map error");
-        return NXT_ERROR;
+    router_conf = nxt_conf_get_path(root, &router_path);
+    if (router_conf != NULL) {
+        ret = nxt_conf_map_object(mp, router_conf, nxt_router_conf,
+                                  nxt_nitems(nxt_router_conf), rtcf);
+        if (ret != NXT_OK) {
+            nxt_alert(task, "rooter_conf map error");
+            return NXT_ERROR;
+        }
     }
 
     if (rtcf->threads == 0) {

Configure it like this:

{
    "listeners": {

    },

    "settings": {
        "router": {
            "threads": 2
        }
    }
}

Hopefully something like this can get in the next release of Unit...
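For what it's worth, applying the option in the form proposed above would presumably be the usual config PUT against the control socket (the socket path varies by installation, and a later revision further down this thread renames the option, so treat this as illustrative):

# control socket path varies by installation; adjust as needed
curl -X PUT --data-binary '{"threads": 2}' \
     --unix-socket /var/run/control.unit.sock \
     http://localhost/config/settings/router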

ac000 added a commit to ac000/unit that referenced this issue Aug 13, 2024
Unit generally creates a number of extra router threads (to handle
client connections, not incl the main thread) to match the number of
available CPUs.

There are cases when this can go wrong, e.g. on a high CPU count
machine where Unit is effectively limited to a few CPUs via the cgroups
cpu controller. In that case Unit may create a large number of router
threads that will only ever run on a couple of CPUs.

There may be other cases where you would like to tweak the number of
router threads, depending on your workload.

As it turns out it looks like it was intended to be made configurable
but was just never hooked up to the config system.

This adds a new '/settings/router/threads' config option which can be
set like

  {
      "listen": {
          ...
      },

      "settings": {
          "router": {
              "threads": 2
          }
      },

      ...
  }

Before this patch (on a four cpu system)

  $  ps -efL | grep router
  andrew   419832 419829 419832  0    5 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419833  0    5 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419834  0    5 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 445145  0    5 03:31 pts/10   00:00:00 unit: router
  andrew   419832 419829 445146  0    5 03:31 pts/10   00:00:00 unit: router

After, with a threads setting of 2.

  $ ps -efL | grep router
  andrew   419832 419829 419832  0    3 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419833  0    3 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419834  0    3 Aug12 pts/10   00:00:00 unit: router

Closes: nginx#1042
Signed-off-by: Andrew Clayton <a.clayton@nginx.com>
ac000 linked a pull request Aug 13, 2024 that will close this issue
ac000 added a commit to ac000/unit that referenced this issue Aug 15, 2024
Unit generally creates a number of extra router threads (to handle
client connections, not incl the main thread) to match the number of
available CPUs.

There are cases when this can go wrong, e.g. on a high CPU count
machine where Unit is effectively limited to a few CPUs via the cgroups
cpu controller. In that case Unit may create a large number of router
threads that will only ever run on a couple of CPUs.

There may be other cases where you would like to tweak the number of
router threads, depending on your workload.

As it turns out it looks like it was intended to be made configurable
but was just never hooked up to the config system.

This adds a new '/settings/listen_threads' config option which can be
set like

  {
      "listen": {
          ...
      },

      "settings": {
          "listen_threads": 2,

          ...
      },

      ...
  }

Before this patch (on a four cpu system)

  $  ps -efL | grep router
  andrew   419832 419829 419832  0    5 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419833  0    5 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419834  0    5 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 445145  0    5 03:31 pts/10   00:00:00 unit: router
  andrew   419832 419829 445146  0    5 03:31 pts/10   00:00:00 unit: router

After, with a threads setting of 2

  $ ps -efL | grep router
  andrew   419832 419829 419832  0    3 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419833  0    3 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419834  0    3 Aug12 pts/10   00:00:00 unit: router

Closes: nginx#1042
Signed-off-by: Andrew Clayton <a.clayton@nginx.com>
ac000 closed this as completed in 57c88fd Aug 19, 2024