
How to manually set the number of router threads #1042

Closed · pbromb opened this issue Dec 30, 2023 · 13 comments · Fixed by #1385

pbromb commented Dec 30, 2023

Hello,
I am running an application using Nginx Unit in a Kubernetes environment.
The unit: router process automatically starts with a number of threads equal to the number of CPU cores on the physical server.

src/nxt_router.c

if (rtcf->threads == 0) {
    rtcf->threads = nxt_ncpu;
}

In the code I also found:

src/nxt_router.c

static nxt_conf_map_t  nxt_router_conf[] = {
    {
        nxt_string("listeners_threads"),
        NXT_CONF_MAP_INT32,
        offsetof(nxt_router_conf_t, threads),
    },
};

but I was not able to set this from the configuration file.

Is it possible to set this value?

ac000 (Member) commented Jan 3, 2024 via email

juliantaylor commented Aug 9, 2024

There needs to be a way to configure the number of threads, in particular because Unit's spinlocks spin with sched_yield() and never sleep, which performs extremely badly when the CPUs are overcommitted.

Running Unit on a host with 64 cores (including hyperthreads) but a CPU limit of 2 will, under load, cause the system to spend 90% or more of its time rescheduling the threads waiting for the lock, while the process actually holding the lock has only a very small chance of running and releasing it.

A sample profile of an application under this load looks like this:

-   92.81%     0.00%  unitd    [unknown]    [k] 0x0000000000000004
   - 0x4
      - 91.42% 0x7f844e03f527
         - 55.42% entry_SYSCALL_64_after_hwframe
            - do_syscall_64
               - 39.63% __x64_sys_sched_yield
                  - 34.50% schedule
                     - 34.11% __schedule
                        - 23.19% pick_next_task_fair

Pretty much all of the CPU usage ends up in the userspace part of pthread_yield() and in the kernel scheduler.

In general, using spinlocks from userspace is very dangerous: userspace cannot portably determine whether the lock holder is currently running (and go to sleep if it is not), so a preempted lock holder can be blocked from releasing the lock by the very threads spinning on it, and sched_yield() gives no way to hand scheduling priority back to the holder.

The discrepancy between the CPU available to the process group and the number of threads that can contend for the locks, which is common on Kubernetes/cfs_quota-using systems, amplifies these problems massively, and having no option to reduce the thread count is a large problem.

An environment variable that allows configuring the thread count would be a good solution.
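For illustration, such an override might look something like the sketch below; the NXT_ROUTER_THREADS variable name and the helper are hypothetical, not part of Unit.

/* Hypothetical sketch -- the NXT_ROUTER_THREADS variable and this helper
 * are illustrative only, not Unit's actual implementation. */
#include <stdlib.h>

static int
router_thread_count(int detected_ncpu)
{
    long         n;
    const char  *env = getenv("NXT_ROUTER_THREADS");

    if (env != NULL) {
        n = strtol(env, NULL, 10);

        if (n > 0 && n < 1024) {
            return (int) n;          /* explicit override wins */
        }
    }

    return detected_ncpu;            /* otherwise keep the detected count */
}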

ac000 (Member) commented Aug 9, 2024

I hear you regarding the use of spinlocks.

We had an internal discussion a while back about the use of sched_yield(2), which, it seems, should at the very least only be used in conjunction with real-time scheduling.

It would be interesting to see what effect this patch has, whether it's good, bad or indifferent (yes, it may likely fubar things)...

diff --git ./src/nxt_spinlock.c ./src/nxt_spinlock.c
index 940be724..78636fc2 100644
--- ./src/nxt_spinlock.c
+++ ./src/nxt_spinlock.c
@@ -82,8 +82,6 @@ nxt_thread_spin_lock(nxt_thread_spinlock_t *lock)
                 goto again;
             }
         }
-
-        nxt_thread_yield();
     }
 }
 

Then there is this code

/* It should be adjusted with the "spinlock_count" directive. */
static nxt_uint_t  nxt_spinlock_count = 1000;


void
nxt_thread_spin_init(nxt_uint_t ncpu, nxt_uint_t count)
{
    switch (ncpu) {

    case 0:
        /* Explicit spinlock count. */
        nxt_spinlock_count = count;
        break;

    case 1:
        /* Spinning is useless on UP. */
        nxt_spinlock_count = 0;
        break;

    default:
        /*
         * SMP.
         *
         * TODO: The count should be 10 on a virtualized system
         * since virtualized CPUs may share the same physical CPU.
         */
        nxt_spinlock_count = 1000;
        break;
    }
}

in src/nxt_spinlock.c

(I'm not sure in what cases ncpu would be 0)

This function is called once from Unit with the arguments of (nr_cpus, 0)

It may be worth trying to lower (I guess) nxt_spinlock_count (in the default case of the switch statement)...
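For reference, given the function above, the two call patterns look like this (nr_cpus standing in for whatever CPU count Unit detected; the explicit call is hypothetical):

/* What Unit currently does, per the above: on SMP this leaves
 * nxt_spinlock_count at its default of 1000. */
nxt_thread_spin_init(nr_cpus, 0);

/* Hypothetical explicit setting: passing ncpu == 0 takes the
 * "explicit spinlock count" branch, so the second argument is used. */
nxt_thread_spin_init(0, 10);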

@juliantaylor

Decreasing nxt_spinlock_count would likely have a detrimental effect, as it would cause sched_yield to be called more often, which triggers more kernel rescheduling; that is very costly, especially on larger systems.

The yielding approach is somewhat outdated; the idea is that the spinning thread gives up its timeslice so that, hopefully, the process holding the lock runs and releases it.
This works out when the machine is not very loaded and yielding this one waiter has a high chance of scheduling the holder. But it is terrible when there are multiple spinning waiters and one holder, as is the case here, where there can be dozens or hundreds of waiters and one holder; unless there are also dozens of free CPU cores (or allowed quota, as in the case of e.g. Kubernetes pods), the chance that the holder actually gets scheduled is very low (1/number of spinning waiters).

What spinlocks need to do is go to sleep when the holder is not currently running, so that the spinning waiters don't prevent the holder from running and freeing the lock, but that is not actually trivial to do from userspace.

A solution to this using restartable sequences was proposed for Linux not very long ago, and I think it was even merged (see https://lwn.net/Articles/944895/), but it would require a very recent kernel and is of course not portable.
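For context, a common Linux-only compromise is a lock that spins briefly and then sleeps in the kernel via futex(2) instead of calling sched_yield(); a minimal sketch follows (illustrative only, this is not Unit's nxt_thread_spinlock_t).

/* Minimal spin-then-sleep lock sketch (Linux-only, illustrative).
 * States: 0 = free, 1 = locked, 2 = locked with waiters. */
#define _GNU_SOURCE
#include <stdatomic.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/futex.h>

typedef _Atomic int  futex_lock_t;

static void
futex_wait(futex_lock_t *l, int expected)
{
    /* Sleep only while *l still equals 'expected'. */
    syscall(SYS_futex, l, FUTEX_WAIT_PRIVATE, expected, NULL, NULL, 0);
}

static void
futex_wake_one(futex_lock_t *l)
{
    syscall(SYS_futex, l, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}

static void
futex_lock(futex_lock_t *l)
{
    int  c = 0, spin;

    /* Brief bounded spin: try to grab a free lock (0 -> 1). */
    for (spin = 0; spin < 100; spin++) {
        c = 0;
        if (atomic_compare_exchange_strong(l, &c, 1)) {
            return;
        }
    }

    /* Contended: mark the lock as having waiters (state 2) and sleep in
     * the kernel until woken, instead of burning CPU on sched_yield(). */
    if (c != 2) {
        c = atomic_exchange(l, 2);
    }

    while (c != 0) {
        futex_wait(l, 2);
        c = atomic_exchange(l, 2);
    }
}

static void
futex_unlock(futex_lock_t *l)
{
    /* If there were waiters, wake one of them up. */
    if (atomic_exchange(l, 0) == 2) {
        futex_wake_one(l);
    }
}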

ac000 (Member) commented Aug 10, 2024

Testing our spinlock implementation in an (albeit simple) test program where a number of threads try to increment a counter:

By decreasing nxt_spinlock_count to 10 I see a reduction in runtime of ~50-75%, so it may be worth a shot...
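For reference, a rough stand-in for that kind of test is sketched below, using plain pthreads and a naive test-and-set spinlock rather than Unit's nxt_thread_spin_lock(); the numbers above come from the real implementation, not from this sketch.

/* N threads bump a shared counter under a naive test-and-set spinlock. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERATIONS  1000000

static atomic_flag  lock = ATOMIC_FLAG_INIT;
static long         counter;

static void *
worker(void *arg)
{
    (void) arg;

    for (int i = 0; i < ITERATIONS; i++) {
        while (atomic_flag_test_and_set_explicit(&lock,
                                                 memory_order_acquire))
        {
            /* busy-wait: this is where a yield/back-off policy would go */
        }

        counter++;

        atomic_flag_clear_explicit(&lock, memory_order_release);
    }

    return NULL;
}

int
main(int argc, char **argv)
{
    int        nthreads = (argc > 1) ? atoi(argv[1]) : 4;
    pthread_t  tids[nthreads];

    for (int i = 0; i < nthreads; i++) {
        pthread_create(&tids[i], NULL, worker, NULL);
    }

    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
    }

    printf("counter = %ld (expected %ld)\n",
           counter, (long) nthreads * ITERATIONS);

    return 0;
}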

ac000 (Member) commented Aug 10, 2024

Just to clarify

Running unit on a host with 64 cores (including hyperthreads) with a cpu limit of 2 under load will cause the system to

What is meant by 'cpu limit of 2'?

How many router threads were there?

Is this bare metal or a VM?

Was this with an application or just static file serving?

Is the above profile for a particular lock or an amalgamation of all the spin-locks?

@juliantaylor

A containerized workload; the host has 64-256 cores (including hyperthreads) and the application has access to two of them (using cgroups CFS quotas).
There are as many router threads as the host has cores, as it is not configurable.

The profile is just a snapshot of what was currently running; it could be any of the locks.

ac000 (Member) commented Aug 10, 2024

Ah, OK, so the system may have 64 'cpus', but you're limiting Unit to only two of those, and _SC_NPROCESSORS_ONLN is not affected by things like cpusets and sched_setaffinity(2).

On Linux at least (need to check other systems) we could use sched_getaffinity(2) instead to get the number of 'cpus' we are allowed to run on.
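Something along these lines (a Linux-only sketch; the helper name is just for illustration):

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/* Count the CPUs this process is actually allowed to run on, falling
 * back to the online-CPU count if the affinity call fails. */
static long
available_cpus(void)
{
    cpu_set_t  set;

    CPU_ZERO(&set);

    if (sched_getaffinity(0, sizeof(set), &set) == 0) {
        return CPU_COUNT(&set);
    }

    return sysconf(_SC_NPROCESSORS_ONLN);
}

(The fixed-size cpu_set_t only covers CPUs 0-1023; the commit further down allocates the set dynamically to cope with bigger machines.)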

juliantaylor commented Aug 11, 2024

Affinity is only one of the options; more commonly, CPU bandwidth is limited using cgroups instead of pinning to cores.

There are two versions. cgroup v2 uses /sys/fs/cgroup/cpu.max, which contains the timeslice the process may run and the period the timeslice applies to; the available CPU is the timeslice divided by the period (the result can be fractional).
See cpu.max in https://docs.kernel.org/admin-guide/cgroup-v2.html

The older variant, cgroup v1, uses two pseudo-files, cfs_quota_us and cfs_period_us, in the cgroup filesystem.
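For illustration, deriving an effective CPU count from cgroup v2's cpu.max could look roughly like the sketch below; the fixed path assumes the process's cgroup is mounted at /sys/fs/cgroup (typical inside a container), and the round-up policy is an assumption, not Unit code.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse /sys/fs/cgroup/cpu.max, which contains "<quota> <period>" or
 * "max <period>"; e.g. "200000 100000" means 2 CPUs worth of bandwidth. */
static long
cgroup_v2_cpu_limit(long fallback)
{
    long    period;
    char    quota[32];
    FILE   *fp;

    fp = fopen("/sys/fs/cgroup/cpu.max", "r");
    if (fp == NULL) {
        return fallback;                 /* no cgroup v2 cpu controller */
    }

    if (fscanf(fp, "%31s %ld", quota, &period) != 2) {
        fclose(fp);
        return fallback;
    }

    fclose(fp);

    if (quota[0] == 'm') {               /* "max" means no limit */
        return fallback;
    }

    /* Round fractional limits up, e.g. "150000 100000" -> 2. */
    return (long) ceil(atof(quota) / (double) period);
}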

ac000 (Member) commented Aug 11, 2024

Yes, but trying to determine the number of threads to run based on things like CPU quota is a slightly different problem from just getting the number of threads to match the available number of CPUs.

Maybe at some point it makes sense to make it configurable, but at the very least we should try to get it basically right by default...

juliantaylor commented Aug 11, 2024

It's the same problem; in particular, with your choice of lock and implementation you must never start more threads than can actually run, and the method by which your application is limited does not matter.
Bandwidth limiting via cgroups is extremely common; supporting it will give you the most coverage on Linux systems.

Making it configurable is preferable, though, as it allows users to override the bad defaults when you don't properly detect the environment; it's also about a three-line patch if you use an environment variable.

ac000 (Member) commented Aug 11, 2024

its the same problem, in particular with your choice of lock and implementation you must never start more threads than

Yes, this is a whole bigger discussion.

can actually run, the method on how your application is limited does not matter. bandwidth limiting via cgroups is extremely common, supporting this will give you the most coverage on linux systems

I have a patch that makes use of sched_getaffinity(2) on Linux. This will help if Unit is being run under cpusets (cpuset(7)) or started via something like taskset(1).

It will also help with things like Docker where the --cpuset-cpus= option has been used.

That's a simple, easy win.

Determining the number of threads to create based on cpu quotas is a slightly different problem that will take longer to develop and test etc (if it's something we even wish to do...).

making it configurable is preferable though as it allows users to override the bad defaults when you don't properly detect the environment, its also like a 3 line patch if you use an environment variable.

Hooking it up to the config system is relatively simple and we may do that, but we should first try to get things right by default.

I'll bring the question of making it configurable up with the wider team...

ac000 added a commit to ac000/unit that referenced this issue Aug 12, 2024
At startup, the unit router process creates a number of threads; it
tries to create the same number of threads (not incl the main thread)
as there are 'cpus' in the system.

On Linux the number of available cpus is determined via a call to

  sysconf(_SC_NPROCESSORS_ONLN);

In a lot of cases this produces the right result, i.e. on a four cpu
system this will return 4.

However this can break down if unit has been restricted in the cpus it's
allowed to run on via something like cpuset()'s and/or
sched_setaffinity(2).

For example, on a four 'cpu' system, starting unit will create an extra
4 router threads

  $ /opt/unit/sbin/unitd
  $ ps -efL | grep router
  andrew   234102 234099 234102  0    5 17:00 pts/10   00:00:00 unit: router
  andrew   234102 234099 234103  0    5 17:00 pts/10   00:00:00 unit: router
  andrew   234102 234099 234104  0    5 17:00 pts/10   00:00:00 unit: router
  andrew   234102 234099 234105  0    5 17:00 pts/10   00:00:00 unit: router
  andrew   234102 234099 234106  0    5 17:00 pts/10   00:00:00 unit: router

Say we want to limit unit to two cpus, i.e.

  $ taskset -a -c 2-3 /opt/unit/sbin/unitd
  $ ps -efL | grep router
  andrew   235772 235769 235772  0    5 17:08 pts/10   00:00:00 unit: router
  andrew   235772 235769 235773  0    5 17:08 pts/10   00:00:00 unit: router
  andrew   235772 235769 235774  0    5 17:08 pts/10   00:00:00 unit: router
  andrew   235772 235769 235775  0    5 17:08 pts/10   00:00:00 unit: router
  andrew   235772 235769 235776  0    5 17:08 pts/10   00:00:00 unit: router

So despite limiting unit to two cpus

  $ grep Cpus_allowed_list /proc/235772/status
  Cpus_allowed_list:	2-3

It still created 4 threads.  That's probably not such an issue in this
case, but if we had a 64 'cpu' system and wanted to limit unit to two
cpus, then we'd have 64 threads vying to run on two cpus, and with our
spinlock implementation this can cause a lot of thread scheduling and
congestion overhead.

Besides, our intention is currently to create nr router threads == nr
cpus.

To resolve this, on Linux at least, this patch makes use of
sched_getaffinity(2) to determine what cpus unit is actually allowed to
run on.

We still use the result of

  sysconf(_SC_NPROCESSORS_ONLN);

as a fallback; we also use its result to allocate the required cpuset
size (where sched_getaffinity() will store its result), as the standard
cpu_set_t only has space for 1024 cpus (numbered 0-1023).

So with this patch if we try to limit unit to two cpus we now get

  $ taskset -a -c 2-3 /opt/unit/sbin/unitd
  $ ps -efL | grep router
  andrew   236887 236884 236887  0    3 17:20 pts/10   00:00:00 unit: router
  andrew   236887 236884 236888  0    3 17:20 pts/10   00:00:00 unit: router
  andrew   236887 236884 236889  0    3 17:20 pts/10   00:00:00 unit: router

This also applies to the likes of docker: if you run docker with the
--cpuset-cpus="" option, unit will now create a number of router
threads that matches the cpu count specified.

Perhaps useful if you are running a number of unit docker instances on a
high cpu count machine.

Link: <nginx#1042>
Signed-off-by: Andrew Clayton <a.clayton@nginx.com>
ac000 (Member) commented Aug 12, 2024

Here's a patch to enable configuring the number of router threads...

diff --git ./src/nxt_conf_validation.c ./src/nxt_conf_validation.c
index 04091745..86199f84 100644
--- ./src/nxt_conf_validation.c
+++ ./src/nxt_conf_validation.c
@@ -240,6 +240,7 @@ static nxt_int_t nxt_conf_vldt_js_module_element(nxt_conf_validation_t *vldt,
 
 static nxt_conf_vldt_object_t  nxt_conf_vldt_setting_members[];
 static nxt_conf_vldt_object_t  nxt_conf_vldt_http_members[];
+static nxt_conf_vldt_object_t  nxt_conf_vldt_router_members[];
 static nxt_conf_vldt_object_t  nxt_conf_vldt_websocket_members[];
 static nxt_conf_vldt_object_t  nxt_conf_vldt_static_members[];
 static nxt_conf_vldt_object_t  nxt_conf_vldt_forwarded_members[];
@@ -309,6 +310,11 @@ static nxt_conf_vldt_object_t  nxt_conf_vldt_setting_members[] = {
         .type       = NXT_CONF_VLDT_OBJECT,
         .validator  = nxt_conf_vldt_object,
         .u.members  = nxt_conf_vldt_http_members,
+    }, {
+        .name       = nxt_string("router"),
+        .type       = NXT_CONF_VLDT_OBJECT,
+        .validator  = nxt_conf_vldt_object,
+        .u.members  = nxt_conf_vldt_router_members,
 #if (NXT_HAVE_NJS)
     }, {
         .name       = nxt_string("js_module"),
@@ -377,6 +383,16 @@ static nxt_conf_vldt_object_t  nxt_conf_vldt_http_members[] = {
 };
 
 
+static nxt_conf_vldt_object_t  nxt_conf_vldt_router_members[] = {
+    {
+        .name       = nxt_string("threads"),
+        .type       = NXT_CONF_VLDT_INTEGER,
+    },
+
+    NXT_CONF_VLDT_END
+};
+
+
 static nxt_conf_vldt_object_t  nxt_conf_vldt_websocket_members[] = {
     {
         .name       = nxt_string("read_timeout"),
diff --git ./src/nxt_router.c ./src/nxt_router.c
index 43209451..2676402b 100644
--- ./src/nxt_router.c
+++ ./src/nxt_router.c
@@ -1412,7 +1412,7 @@ nxt_router_conf_send(nxt_task_t *task, nxt_router_temp_conf_t *tmcf,
 
 static nxt_conf_map_t  nxt_router_conf[] = {
     {
-        nxt_string("listeners_threads"),
+        nxt_string("threads"),
         NXT_CONF_MAP_INT32,
         offsetof(nxt_router_conf_t, threads),
     },
@@ -1630,7 +1630,7 @@ nxt_router_conf_create(nxt_task_t *task, nxt_router_temp_conf_t *tmcf,
     nxt_conf_value_t            *js_module;
 #endif
     nxt_conf_value_t            *root, *conf, *http, *value, *websocket;
-    nxt_conf_value_t            *applications, *application;
+    nxt_conf_value_t            *applications, *application, *router_conf;
     nxt_conf_value_t            *listeners, *listener;
     nxt_socket_conf_t           *skcf;
     nxt_router_conf_t           *rtcf;
@@ -1640,6 +1640,7 @@ nxt_router_conf_create(nxt_task_t *task, nxt_router_temp_conf_t *tmcf,
     nxt_router_app_conf_t       apcf;
     nxt_router_listener_conf_t  lscf;
 
+    static const nxt_str_t  router_path = nxt_string("/settings/router");
     static const nxt_str_t  http_path = nxt_string("/settings/http");
     static const nxt_str_t  applications_path = nxt_string("/applications");
     static const nxt_str_t  listeners_path = nxt_string("/listeners");
@@ -1673,11 +1674,14 @@ nxt_router_conf_create(nxt_task_t *task, nxt_router_temp_conf_t *tmcf,
     rtcf = tmcf->router_conf;
     mp = rtcf->mem_pool;
 
-    ret = nxt_conf_map_object(mp, root, nxt_router_conf,
-                              nxt_nitems(nxt_router_conf), rtcf);
-    if (ret != NXT_OK) {
-        nxt_alert(task, "root map error");
-        return NXT_ERROR;
+    router_conf = nxt_conf_get_path(root, &router_path);
+    if (router_conf != NULL) {
+        ret = nxt_conf_map_object(mp, router_conf, nxt_router_conf,
+                                  nxt_nitems(nxt_router_conf), rtcf);
+        if (ret != NXT_OK) {
+            nxt_alert(task, "rooter_conf map error");
+            return NXT_ERROR;
+        }
     }
 
     if (rtcf->threads == 0) {

Configure it like this:

{
    "listeners": {

    },

    "settings": {
        "router": {
            "threads": 2
        }
    }
}

Hopefully something like this can get in the next release of Unit...
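For what it's worth, applying the option in the form proposed above would presumably be the usual config PUT against the control socket (the socket path varies by installation, and a later revision further down this thread renames the option, so treat this as illustrative):

# control socket path varies by installation; adjust as needed
curl -X PUT --data-binary '{"threads": 2}' \
     --unix-socket /var/run/control.unit.sock \
     http://localhost/config/settings/router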

ac000 added a commit to ac000/unit that referenced this issue Aug 13, 2024
Unit generally creates a number of extra router threads (to handle
client connections, not incl the main thread) to match the number of
available CPUs.

There are cases when this can go wrong, e.g. on a high CPU count
machine where Unit is effectively limited to a few CPUs via the cgroups
cpu controller. In that case Unit may create a large number of router
threads that will only ever run on a couple of CPUs.

There may be other cases where you would like to tweak the number of
router threads, depending on your workload.

As it turns out it looks like it was intended to be made configurable
but was just never hooked up to the config system.

This adds a new '/settings/router/threads' config option which can be
set like

  {
      "listen": {
          ...
      },

      "settings": {
          "router": {
              "threads": 2
          }
      },

      ...
  }

Before this patch (on a four cpu system)

  $  ps -efL | grep router
  andrew   419832 419829 419832  0    5 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419833  0    5 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419834  0    5 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 445145  0    5 03:31 pts/10   00:00:00 unit: router
  andrew   419832 419829 445146  0    5 03:31 pts/10   00:00:00 unit: router

After, with a threads setting of 2.

  $ ps -efL | grep router
  andrew   419832 419829 419832  0    3 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419833  0    3 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419834  0    3 Aug12 pts/10   00:00:00 unit: router

Closes: nginx#1042
Signed-off-by: Andrew Clayton <a.clayton@nginx.com>
ac000 linked a pull request Aug 13, 2024 that will close this issue
ac000 added a commit to ac000/unit that referenced this issue Aug 15, 2024
Unit generally creates a number of extra router threads (to handle
client connections, not incl the main thread) to match the number of
available CPUs.

There are cases when this can go wrong, e.g. on a high CPU count
machine where Unit is effectively limited to a few CPUs via the cgroups
cpu controller. In that case Unit may create a large number of router
threads that will only ever run on a couple of CPUs.

There may be other cases where you would like to tweak the number of
router threads, depending on your workload.

As it turns out it looks like it was intended to be made configurable
but was just never hooked up to the config system.

This adds a new '/settings/listen_threads' config option which can be
set like

  {
      "listen": {
          ...
      },

      "settings": {
          "listen_threads": 2,

          ...
      },

      ...
  }

Before this patch (on a four cpu system)

  $  ps -efL | grep router
  andrew   419832 419829 419832  0    5 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419833  0    5 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419834  0    5 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 445145  0    5 03:31 pts/10   00:00:00 unit: router
  andrew   419832 419829 445146  0    5 03:31 pts/10   00:00:00 unit: router

After, with a threads setting of 2

  $ ps -efL | grep router
  andrew   419832 419829 419832  0    3 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419833  0    3 Aug12 pts/10   00:00:00 unit: router
  andrew   419832 419829 419834  0    3 Aug12 pts/10   00:00:00 unit: router

Closes: nginx#1042
Signed-off-by: Andrew Clayton <a.clayton@nginx.com>
ac000 closed this as completed in 57c88fd Aug 19, 2024