
Leader process apparently running on wrong NUMA node #1133

Merged: 5 commits merged into Igalia:lwaftr from fix-issue-731, Sep 5, 2018

Conversation

@dpino (Member) commented Jul 2, 2018

Fixes #731.

I triaged #731: sometimes the manager process was bound to CPU 0 (NUMA node 0) and other times to CPU 6 (NUMA node 1). I think the reason the manager process didn't end up on a core in the same NUMA node as the lwAFTR process is that it was never actually bound to any CPU core. The manager simply called numa.bind_to_numa_node, but that doesn't pin a process to a core, so the OS was free to schedule the process on any available core in either NUMA node (sometimes 0, sometimes 6).
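For reference, a quick way to check which CPU the manager actually landed on is sched_getcpu(3); a minimal LuaJIT FFI sketch, not part of the patch:

-- Sketch: ask the kernel which CPU the calling thread is executing on.
local ffi = require("ffi")
ffi.cdef[[
int sched_getcpu(void);
]]
print(("currently executing on CPU %d"):format(ffi.C.sched_getcpu()))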

@wingo commented Sep 3, 2018

Backing up a bit... the point I had in mind about binding the leader process to a NUMA node is that the shared-memory objects it creates for the workers should be local to the NUMA node that each worker is running on.

However, there's also the question of the shared objects made by the worker (e.g. counters): they should remain NUMA-local to the worker. Reading them from a remote NUMA node would force them out of the Modified MESI state and into Shared; see https://en.wikipedia.org/wiki/MESI_protocol. So I agree that we should try to get the manager running on a CPU of the same NUMA node.

I think we should probably just change lib.numa's "bind_to_numa_node" to also (try to) bind the CPU. WDYT?
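Something along these lines, maybe (a hedged sketch of the idea, not actual Snabb code; it assumes x86_64 Linux, a little-endian bit layout, and the kernel's 1024-bit affinity mask):

-- Sketch: besides setting the NUMA memory policy, restrict scheduling to
-- a set of CPUs via sched_setaffinity(2).
local ffi = require("ffi")
local bit = require("bit")

ffi.cdef[[
typedef struct { uint8_t bits[128]; } cpu_set_t; /* 1024-bit CPU mask */
int sched_setaffinity(int pid, size_t size, const cpu_set_t *mask);
]]

local function bind_to_cpus (cpus)
   local mask = ffi.new("cpu_set_t")
   for _, cpu in ipairs(cpus) do
      local byte = math.floor(cpu / 8)
      mask.bits[byte] = bit.bor(mask.bits[byte], bit.lshift(1, cpu % 8))
   end
   assert(ffi.C.sched_setaffinity(0, ffi.sizeof(mask), mask) == 0,
          "sched_setaffinity failed")
end

-- e.g. bind_to_cpus({6, 7, 8, 9, 10, 11}) to allow any CPU of NUMA node 1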

@dpino force-pushed the fix-issue-731 branch 2 times, most recently from d2de90d to 8ac7f0b on September 3, 2018
@dpino (Member, Author) commented Sep 3, 2018

OK, I simplified the code. All the logic for figuring out which CPUs the OS has available remains, as it's needed to avoid scheduling the manager process on an isolated CPU. But now CPUSet:bind_to_numa_node binds the process to the first available CPU. Before, the manager called CPUSet:bind_to_first_available_cpu, which ended up calling Numa:bind_to_numa_node.
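In rough outline, the new behaviour looks like this (a hedged sketch, not the patch itself; subtract, cpus_in_node and isolated_cpus follow the patch, while numa.bind_to_cpu is an assumed helper from lib.numa):

-- Sketch: pin the manager to the first non-isolated CPU of the node;
-- pinning to a CPU also keeps its memory allocations NUMA-local.
local numa = require("lib.numa")

local function bind_manager_to_node (node)
   local avail = subtract(cpus_in_node(node), isolated_cpus())
   if #avail > 0 then
      numa.bind_to_cpu(avail[1])     -- e.g. CPU 6 on NUMA node 1
   else
      numa.bind_to_numa_node(node)   -- memory policy only, no CPU pinning
   end
end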

Example on correct NUMA node:

$ sudo ./snabb lwaftr run --cpu 11 --conf lwaftr.conf --on-a-stick 83:00.0
lwaftr.conf: loading compiled configuration from lwaftr.o
lwaftr.conf: compiled configuration is up to date.
Migrating instance '0000:83:00.0' to '83:00.0'
Binding data-plane PID 15760 to CPU 11.
Bound main process to NUMA node: 1 (CPU 6)

Example on wrong NUMA node:

$ sudo ./snabb lwaftr run --cpu 2 --conf lwaftr.conf --on-a-stick 83:00.0
lwaftr.conf: loading compiled configuration from lwaftr.o
lwaftr.conf: compiled configuration is up to date.
Migrating instance '0000:83:00.0' to '83:00.0'
Warning: No CPU available on local NUMA node 1
Warning: Assigning CPU 2 from remote node 0
Binding data-plane PID 15753 to CPU 2.
Bound main process to NUMA node: 0 (CPU 0)

   return parse_cpulist_from_file(node_path..'/cpulist')
end
local function isolated_cpus ()
   return parse_cpulist_from_file('/sys/devices/system/cpu/isolated')
@wingo commented:
Clever! I had no idea this file was a thing.

Two requests:

(1) Can we make a method to subtract cpulists? That way avail = subtract(cpus_in_node, isolated_cpus). (See the sketch after this list.)

(2) Can we set the affinity in bind_to_numa_node to the intersection of the current CPU affinity with all non-isolated CPUs on a node, not just the first one? That will give Linux a bit more freedom to optimize these non-performance-critical processes, while also preserving the user's ability to do taskset -c N.
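For (1), maybe something like this (a sketch only; it assumes cpulists are plain arrays of CPU numbers):

-- Sketch: return the CPUs of list a that are not in list b.
local function subtract (a, b)
   local exclude = {}
   for _, cpu in ipairs(b) do exclude[cpu] = true end
   local ret = {}
   for _, cpu in ipairs(a) do
      if not exclude[cpu] then table.insert(ret, cpu) end
   end
   table.sort(ret)
   return ret
end

-- avail = subtract(cpus_in_node(node), isolated_cpus())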

@wingo left a comment:

thanks for the patch and thanks for bearing with my feedback; lgtm!

local function isolated_cpus ()
   return set(parse_cpulist_from_file('/sys/devices/system/cpu/isolated'))
end
local function substract (s, t)
@wingo commented:
nit: "subtract"

   table.sort(ret)
   return ret
end
return substract(cpus_in_node(node), isolated_cpus())
@wingo commented:
As a FIXME I think this should be intersected with the current CPU affinity (as returned by sched_getaffinity), to allow users to run the manager with "taskset". However, not a blocker!
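A sketch of what that FIXME could build on (names assumed, not the library's API; the mask layout matches the sched_setaffinity sketch above):

-- Sketch: read the process's current affinity mask so it can be
-- intersected with the node's non-isolated CPUs, preserving restrictions
-- applied with taskset(1).
local ffi = require("ffi")
local bit = require("bit")

ffi.cdef[[
typedef struct { uint8_t bits[128]; } cpu_set_t; /* 1024-bit CPU mask */
int sched_getaffinity(int pid, size_t size, cpu_set_t *mask);
]]

local function current_affinity ()
   local mask = ffi.new("cpu_set_t")
   assert(ffi.C.sched_getaffinity(0, ffi.sizeof(mask), mask) == 0,
          "sched_getaffinity failed")
   local ret = {}
   for cpu = 0, 1023 do
      if bit.band(mask.bits[math.floor(cpu / 8)],
                  bit.lshift(1, cpu % 8)) ~= 0 then
         table.insert(ret, cpu)
      end
   end
   return ret
end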

@dpino merged commit 66dd91d into Igalia:lwaftr on Sep 5, 2018
@dpino deleted the fix-issue-731 branch on September 5, 2018 09:10