Leader process apparently running on wrong NUMA node #1133
Conversation
Backing up a bit... the point I had in my mind about binding the leader process to a NUMA node is so that the shared-memory objects it creates for the workers would be local to the NUMA node that the worker is running on. However, there's also the question of the shared objects made by the worker (e.g. counters), which should remain NUMA-local to the worker. Reading them from a remote NUMA node would force them out of the Modified MESI state and into Shared; see https://en.wikipedia.org/wiki/MESI_protocol. So I agree that we should try to get the manager running on a CPU of the right NUMA node. I think we should probably just change lib.numa's "bind_to_numa_node" to also (try to) bind the CPU. WDYT?
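To make the idea concrete, here is a minimal sketch of what "also (try to) bind CPU" could look like, assuming glibc on Linux/x86-64 and using the LuaJIT FFI; parse_cpulist and bind_to_node_cpus are illustrative names, not Snabb's actual lib.numa API:

-- A minimal sketch, assuming glibc on Linux/x86-64; parse_cpulist and
-- bind_to_node_cpus are illustrative names, not Snabb's lib.numa API.
local ffi = require("ffi")
local bit = require("bit")

ffi.cdef[[
/* glibc's cpu_set_t is 1024 bits; 32-bit words keep the bit ops simple. */
typedef struct { uint32_t bits[32]; } cpu_set_t;
int sched_setaffinity(int pid, size_t cpusetsize, const cpu_set_t *mask);
]]

-- Parse a Linux cpulist such as "0-5,12,14-17" into an array of CPU numbers.
local function parse_cpulist (s)
   local cpus = {}
   for range in s:gmatch("[^,%s]+") do
      local lo, hi = range:match("^(%d+)-(%d+)$")
      if not lo then lo = range:match("^(%d+)$") end
      for cpu = tonumber(lo), tonumber(hi or lo) do table.insert(cpus, cpu) end
   end
   return cpus
end

-- Restrict the calling process to every CPU of the given NUMA node.
local function bind_to_node_cpus (node)
   local path = "/sys/devices/system/node/node"..node.."/cpulist"
   local f = assert(io.open(path))
   local cpus = parse_cpulist(f:read("*l")); f:close()
   local mask = ffi.new("cpu_set_t")
   for _, cpu in ipairs(cpus) do
      local word, pos = math.floor(cpu / 32), cpu % 32
      mask.bits[word] = bit.bor(mask.bits[word], bit.lshift(1, pos))
   end
   -- pid 0 means "the calling process".
   assert(ffi.C.sched_setaffinity(0, ffi.sizeof(mask), mask) == 0,
          "sched_setaffinity failed")
end

bind_to_node_cpus(1)   -- e.g. keep the manager on NUMA node 1's CPUs

Binding to the whole node's CPU set rather than to one core leaves the scheduler room to balance the non-performance-critical manager among NUMA-local CPUs.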
d2de90d to 8ac7f0b
OK, I simplified the code. All the logic for figuring out which CPUs the OS makes available remains, as it's necessary to avoid scheduling the manager process on an isolated CPU. But now:

Example on correct NUMA node:

$ sudo ./snabb lwaftr run --cpu 11 --conf lwaftr.conf --on-a-stick 83:00.0
lwaftr.conf: loading compiled configuration from lwaftr.o
lwaftr.conf: compiled configuration is up to date.
Migrating instance '0000:83:00.0' to '83:00.0'
Binding data-plane PID 15760 to CPU 11.
Bound main process to NUMA node: 1 (CPU 6)

Example on wrong NUMA node:

$ sudo ./snabb lwaftr run --cpu 2 --conf lwaftr.conf --on-a-stick 83:00.0
lwaftr.conf: loading compiled configuration from lwaftr.o
lwaftr.conf: compiled configuration is up to date.
Migrating instance '0000:83:00.0' to '83:00.0'
Warning: No CPU available on local NUMA node 1
Warning: Assigning CPU 2 from remote node 0
Binding data-plane PID 15753 to CPU 2.
Bound main process to NUMA node: 0 (CPU 0)
src/lib/cpuset.lua
Outdated
   return parse_cpulist_from_file(node_path..'/cpulist')
end
local function isolated_cpus ()
   return parse_cpulist_from_file('/sys/devices/system/cpu/isolated')
Clever! I had no idea this file was a thing.
Two requests:
(1) Can we make a method to subtract cpulists? That way avail = subtract(cpus_in_node, isolated_cpus). (A rough sketch follows below.)
(2) Can we set affinity in bind_to_numa_node to the intersection of the current CPU affinity with all non-isolated CPUs on a node, not just the first one? That will give Linux a bit more freedom to optimize these non-performance-critical processes, while also preserving the user's ability to do taskset -c N.
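In case it's useful, a rough sketch of request (1), assuming the cpulists have already been parsed into arrays of CPU numbers; the names set and subtract here are illustrative, not necessarily the helpers that ended up in src/lib/cpuset.lua:

-- Sketch of request (1); names are illustrative.
local function set (list)
   -- Turn an array of CPU numbers into a set keyed by CPU number.
   local s = {}
   for _, cpu in ipairs(list) do s[cpu] = true end
   return s
end

local function subtract (s, t)
   -- CPUs present in set s but not in set t, as a sorted array.
   local ret = {}
   for cpu in pairs(s) do
      if not t[cpu] then table.insert(ret, cpu) end
   end
   table.sort(ret)
   return ret
end

-- Usage, roughly: avail = subtract(set(cpus_in_node(node)), set(isolated_cpus()))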
thanks for the patch and thanks for bearing with my feedback; lgtm!
src/lib/cpuset.lua
Outdated
local function isolated_cpus ()
   return set(parse_cpulist_from_file('/sys/devices/system/cpu/isolated'))
end
local function substract (s, t)
nit: "subtract"
src/lib/cpuset.lua
Outdated
   table.sort(ret)
   return ret
end
return substract(cpus_in_node(node), isolated_cpus())
As a FIXME I think this should be intersected with the current CPU affinity (as returned by sched_getaffinity), to allow users to run the manager with "taskset". However, not a blocker!
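For reference, a sketch (not part of this patch) of one way to read the process's current allowed-CPU list without going through the FFI: the kernel exposes it in cpulist format in /proc/self/status, and it reflects any earlier taskset restriction.

-- Sketch: read the current process's allowed-CPU list (cpulist format).
local function allowed_cpus ()
   local f = assert(io.open("/proc/self/status"))
   local cpulist
   for line in f:lines() do
      cpulist = cpulist or line:match("^Cpus_allowed_list:%s*(.+)$")
   end
   f:close()
   return cpulist   -- e.g. "0-5,12"
end

-- The manager's candidate CPUs could then be derived from this list,
-- the node's cpulist, and the isolated-CPU list.
print(allowed_cpus())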
Fixes #731.
I triaged #731 and sometimes the manager process was bound to CPU 0 (NUMA 0) and other times to CPU 6 (NUMA 1). I think the reason the manager process didn't get bound to a core on the same NUMA node as the lwAFTR process is that it was not actually bound to any CPU core at all. The manager simply called numa.bind_to_numa_node, but that doesn't bind a process to a core. Thus, the OS scheduled the process onto any of the cores available on either NUMA node (sometimes 0, sometimes 6).
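As an aside, a tiny hypothetical check (not part of the patch) that makes this symptom easy to observe: glibc's sched_getcpu(3), called through the LuaJIT FFI, reports which CPU the calling thread is currently on, and without an explicit affinity it can differ from run to run.

-- Sketch: report which CPU the scheduler currently has us on.
local ffi = require("ffi")
ffi.cdef[[ int sched_getcpu(void); ]]

print(("running on CPU %d"):format(ffi.C.sched_getcpu()))
-- Without sched_setaffinity(), repeated runs can print different CPUs
-- (e.g. 0 one time, 6 another), matching the behaviour described above.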