Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libct/intelrdt: skip reading /proc/cpuinfo #3485

Merged
merged 5 commits into from
Jul 28, 2022

Conversation

corhere
Copy link
Contributor

@corhere corhere commented May 25, 2022

Reading /proc/cpuinfo is a surprisingly expensive operation. Since kernel version 4.12, opening /proc/cpuinfo on an x86 system can block for around 20 milliseconds while the kernel samples the current CPU frequency. There is a very recent patch which gets rid of the delay, but has yet to make it into the mainline kernel. Regardless, kernels for which opening /proc/cpuinfo takes 20ms will continue to be run in production for years to come. libcontainer only opens /proc/cpuinfo to read the processor feature flags so all the delays to get an accurate snapshot of the CPU frequency are just wasted time.

If we wanted to, we could interrogate the CPU features directly from userspace using the CPUID instruction. However, Intel and AMD CPUs have flags in different positions for their analogous sub-features and there are CPU quirks which would need to be accounted for. Some Haswell server CPUs support RDT/CAT but are missing the CPUID flags advertising their support; the kernel checks for support on that processor family by probing the the hardware using privileged RDMSR/WRMSR instructions. This sort of probing could not be implemented in userspace so it would not be possible to check for RDT feature support in userspace without false negatives on some hardware configurations.

It looks like libcontainer reads the CPU feature flags as a kind of optimization so that it can skip checking whether the kernel supports an RDT sub-feature if the hardware support is missing. As the kernel only exposes subtrees in the resctrl filesystem for RDT sub-features with hardware and kernel support, checking the CPU feature flags is redundant from a correctness point of view. Remove the /proc/cpuinfo check as it is an optimization which actually hurts performance.

Benchmarks: runc run is 17 ms faster when "intelRdt" is enabled in config.json
  • Hardware: AWS EC2 c5n.metal
  • OS: Ubuntu 20.04 LTS
  • /sys/fs/resctrl mounted
  • OCI config:
    • "args": ["/bin/true"]
    • "linux": {"intelRdt": {"memBwSchema": "MB:0=20;1=70"}}
3f0daac This PR
Run 1
real	0m0.098s
user	0m0.029s
sys	0m0.043s
real	0m0.092s
user	0m0.023s
sys	0m0.048s
Run 2
real	0m0.110s
user	0m0.019s
sys	0m0.052s
real	0m0.074s
user	0m0.022s
sys	0m0.047s
Run 3
real	0m0.087s
user	0m0.019s
sys	0m0.051s
real	0m0.077s
user	0m0.025s
sys	0m0.046s

@kolyshkin
Copy link
Contributor

Since commit edeb3b3 (#3306) we only read cpuinfo if intelrdt is available in hardware, yet you're right, it does not make sense to read it at all, since we can get the needed info with a few stat(2) calls.

Can we take one more step and ditch mountinfo parsing as well? This is only used to check for mbaScEnabled, which is apparently not used anywhere except to avoid stat of $root/info/MB file. Would be nice to remove this.

@xiaochenshen @marquiz PTAL

@xiaochenshen
Copy link
Contributor

@corhere Thank you very much for working on it.

Reading /proc/cpuinfo is a surprisingly expensive operation. Since kernel version 4.12, opening /proc/cpuinfo on an x86 system can block for around 20 milliseconds while the kernel samples the current CPU frequency.

I am really surprised for the performance hurt from reading /proc/cpuinfo. This issue should be fixed. Thank you.

There is a very recent patch which gets rid of the delay, but has yet to make it into the mainline kernel. Regardless, kernels for which opening /proc/cpuinfo takes 20ms will continue to be run in production for years to come. libcontainer only opens /proc/cpuinfo to read the processor feature flags so all the delays to get an accurate snapshot of the CPU frequency are just wasted time.

Yes, in most of the cases the production kernels or older upstream kernels are without this fix. Reading /proc/cpuinfo is too expensive.

It looks like libcontainer reads the CPU feature flags as a kind of optimization so that it can skip checking whether the kernel supports an RDT sub-feature if the hardware support is missing. As the kernel only exposes subtrees in the resctrl filesystem for RDT sub-features with hardware and kernel support, checking the CPU feature flags is redundant from a correctness point of view. Remove the /proc/cpuinfo check as it is an optimization which actually hurts performance.
Benchmarks: runc run is 17 ms faster when "intelRdt" is enabled in config.json

I agree with you. We could remove the redundant check on /proc/cpuinfo.

xiaochenshen
xiaochenshen previously approved these changes May 26, 2022
Copy link
Contributor

@xiaochenshen xiaochenshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you!

@xiaochenshen
Copy link
Contributor

xiaochenshen commented May 26, 2022

@kolyshkin

Can we take one more step and ditch mountinfo parsing as well? This is only used to check for mbaScEnabled, which is apparently not used anywhere except to avoid stat of $root/info/MB file. Would be nice to remove this.

I don't think the mountinfo check could be removed so far. MBA Software Controller is a "pure" software feature based on MBA by kernel, the only way of exposing the capability to user is through mount option "mba_MBps" in current kernel.

An alternative hacking way is to check "MB:" line in default schemata file: without mount option "mba_MBps", the data is a percentage in range [0, 100%], but with mount option "mba_MBps", the data is a u32 value in range [0, ((u32)~0U)].

$ cat /sys/fs/resctrl/schemata | grep MB
(1) w/o "mba_MBps" mount option: "MB:0= 100;1= 100"
(2) w/  "mba_MBps" mount option: "MB:0=4294967295;1=4294967295"

@corhere
Copy link
Contributor Author

corhere commented May 26, 2022

Given that Intel's own libraries also parse mountinfo to check whether the mba_MBps option is enabled, I don't think there are any alternatives to parsing mountinfo.
https://github.com/intel/goresctrl/blob/1b61a39df7b692eb89e41a75a27b7f786461012d/pkg/rdt/info.go#L243-L251
https://github.com/intel/intel-cmt-cat/blob/c76f7a34195fba05c96ea3d6450a24a790041ef1/lib/os_cap.c#L745-L746

I suspect that @xiaochenshen's hacky check might result in false negatives. The /sys/fs/resctrl/schemata file describes the configuration for the default resource group so a system with the resctrl filesystem mounted with the mba_MBps option could also have the default group's memory bandwidth set to a value <= 100 MBps.

That all being said, it looks like the only other place the mbaScEnabled variable is referenced is in the function intelrdt.IsMBAScEnabled(), which is unused in libcontainer. So we could elide parsing mountinfo in rootOnce without breaking compatibility with external users by lazily parsing mountinfo the first time IsMBAScEnabled is called. But mountinfo is not only parsed to check if mba_sc is enabled; it is also used to find the mount-point of the resctrl fs. Otherwise libcontainer would need to assume that the resctrl fs is always mounted to /sys/fs/resctrl, which is a pretty reasonable assumption IMO and aligns with Intel's C library (but not their Go library).

@xiaochenshen
Copy link
Contributor

@corhere

I suspect that @xiaochenshen's hacky check might result in false negatives. The /sys/fs/resctrl/schemata file describes the configuration for the default resource group so a system with the resctrl filesystem mounted with the mba_MBps option could also have the default group's memory bandwidth set to a value <= 100 MBps.

Correct. This hacking way is not reliable.

That all being said, it looks like the only other place the mbaScEnabled variable is referenced is in the function intelrdt.IsMBAScEnabled(), which is unused in libcontainer. So we could elide parsing mountinfo in rootOnce without breaking compatibility with external users by lazily parsing mountinfo the first time IsMBAScEnabled is called.

Sounds good to me.

But mountinfo is not only parsed to check if mba_sc is enabled; it is also used to find the mount-point of the resctrl fs. Otherwise libcontainer would need to assume that the resctrl fs is always mounted to /sys/fs/resctrl, which is a pretty reasonable assumption IMO and aligns with Intel's C library (but not their Go library).

Yes. Parsing mountinfo is necessary to find the mount point of resctrl fs. We could not assume resctrl fs is always mounted to the recommended default mount point - /sys/fs/resctrl.
Actually, resctrl fs could be mounted at any proper place instead of /sys/fs/resctrl.

@kolyshkin
Copy link
Contributor

intelrdt.IsMBAScEnabled(), which is unused in libcontainer

... and elsewhere (I tried to search through some well known users of runc/libcontainer and haven't found any).

But mountinfo is not only parsed to check if mba_sc is enabled; it is also used to find the mount-point of the resctrl fs. Otherwise libcontainer would need to assume that the resctrl fs is always mounted to /sys/fs/resctrl, which is a pretty reasonable assumption IMO and aligns with Intel's C library (but not their Go library).

Yes. Parsing mountinfo is necessary to find the mount point of resctrl fs. We could not assume resctrl fs is always mounted to the recommended default mount point - /sys/fs/resctrl. Actually, resctrl fs could be mounted at any proper place instead of /sys/fs/resctrl.

Well, this is easy -- first we do statfs(2) on default path (/sys/fs/resctrl), checking that it exists and comparing the returned fstype (it should be RDTGROUP_SUPER_MAGIC as defined in kernel headers as well as x/sys/unix). This is very fast and will work in 99.9% cases.

If not, we fallback to parsing /proc/mountinfo.

Currently we can't do the fast path because unfortunately there's no other way to read mount options (and I've actually added the comment explaining that sad situation in commit 5e201e7).

If we can get rid of unconditional parsing of mountinfo, that would be awesome.

@corhere @xiaochenshen thanks for looking into it!

@corhere
Copy link
Contributor Author

corhere commented May 27, 2022

@kolyshkin @xiaochenshen I have updated the PR with the discussed fast path, plus some more improvements such that the slow path is only executed if RDT is configured in the container's OCI spec. PTAL.

xiaochenshen
xiaochenshen previously approved these changes May 31, 2022
Copy link
Contributor

@xiaochenshen xiaochenshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the fast path updates. Thank you.

@marquiz
Copy link
Contributor

marquiz commented May 31, 2022

I like the idea in this PR very much 😄 I haven't closely reviewed the code yet, though

if _, err := os.Stat(filepath.Join(root, "info", "L3")); err == nil {
catEnabled = true
}
if _, err := os.Stat(filepath.Join(root, "info", "L3")); err == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, there's one consideration that I don't know if we should care about here. If resctrlfs is mounted with -o cdp we will have L3CODE/ and L3DATA/ (instead of L3/) under the info/ dir.

I don't remember if CDP is officially supported in the runtime spec but I guess specifying something like L3DATA:0=ff\nL3CODE:0=ff in intelRdt.l3CacheSchema would work in practice.

Any thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That certainly sounds like we could get false negatives on IsCATEnabled() if resctrlfs is mounted with -o cdp and is worth investigating further. However, as the code in main would have the exact same deficiency, any improvements to feature detection or how a container's L3 allocation config is applied when RDT/CDP is enabled in the kernel probably should be made in a follow-up PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it seems this can be addressed in a follow-up. @marquiz feel free to create it :)

Copy link
Contributor

@marquiz marquiz Jun 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

K, sounds good. Especially as the old code has the same limitation as you pointed out

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #3537 so that we don't forget 👍

Copy link
Contributor

@marquiz marquiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had just one comment/question. Looks very good to me 👍

marquiz
marquiz previously approved these changes Jun 2, 2022
Copy link
Contributor

@marquiz marquiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kolyshkin
Copy link
Contributor

Left a couple of nits wrt two last commits, but overall LGTM

kolyshkin
kolyshkin previously approved these changes Jun 8, 2022
Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kolyshkin kolyshkin added this to the 1.2.0 milestone Jun 22, 2022
@kolyshkin
Copy link
Contributor

@AkihiroSuda @thaJeztah @cyphar PTAL

1 similar comment
@kolyshkin
Copy link
Contributor

@AkihiroSuda @thaJeztah @cyphar PTAL

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a go at reviewing this, and found some unused code that (as a result) could simplify this patch.

libcontainer/intelrdt/intelrdt.go Outdated Show resolved Hide resolved
libcontainer/intelrdt/intelrdt.go Outdated Show resolved Hide resolved
libcontainer/intelrdt/intelrdt.go Outdated Show resolved Hide resolved
@corhere corhere dismissed stale reviews from kolyshkin, marquiz, and xiaochenshen via 916f62a July 26, 2022 19:38
@corhere corhere force-pushed the 3181-rdt-init-without-cpuinfo branch from 29e8a89 to 916f62a Compare July 26, 2022 19:38
@corhere corhere requested review from thaJeztah and kolyshkin July 26, 2022 19:45
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Comment on lines +271 to +277
// Has the resctrl fs been mounted to the default mount point?
if statfs.Type == unix.RDTGROUP_SUPER_MAGIC {
intelRdtRoot = defaultResctrlMountpoint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh (very!) random comment 😅; I recall some "lunch time" discussions I had with @kolyshkin about "where" comments should go for conditional statements.

My take was that putting the comment before the condition is an easy pitfal to describe the code, not the intent (and you may end up with describing the "if" statement, only in words instead of code), whereas putting it inside the branch "gently" forces one to write the branch itself (why we're doing it).

// check if foo is enabled
if foo {
    do stuff
}
if foo {
  // foo is enabled; start the flux capacitor to go back to the future.
  do stuff
}

(Of course, it highly depends on the situation; in some cases you'd want to describe a more complex conditional block as a whole to describe the whole intent.)

@thaJeztah
Copy link
Member

@kolyshkin PTAL

corhere added 5 commits July 28, 2022 12:06
This function is unused, and removing it simplifies the changes which
follow this commit.

Signed-off-by: Cory Snider <csnider@mirantis.com>
Reading /proc/cpuinfo is a surprisingly expensive operation. Since
kernel version 4.12 [1], opening /proc/cpuinfo on an x86 system can
block for around 20 milliseconds while the kernel samples the current
CPU frequency. There is a very recent patch [2] which gets rid of the
delay, but has yet to make it into the mainline kenel. Regardless,
kernels for which opening /proc/cpuinfo takes 20ms will continue to be
run in production for years to come. libcontainer only opens
/proc/cpuinfo to read the processor feature flags so all the delays to
get an accurate snapshot of the CPU frequency are just wasted time.

If we wanted to, we could interrogate the CPU features directly from
userspace using the `CPUID` instruction. However, Intel and AMD CPUs
have flags in different positions for their analogous sub-features and
there are CPU quirks [3] which would need to be accounted for. Some
Haswell server CPUs support RDT/CAT but are missing the `CPUID` flags
advertising their support; the kernel checks for support on that
processor family by probing the the hardware using privileged
RDMSR/WRMSR instructions [4]. This sort of probing could not be
implemented in userspace so it would not be possible to check for RDT
feature support in userspace without false negatives on some hardware
configurations.

It looks like libcontainer reads the CPU feature flags as a kind of
optimization so that it can skip checking whether the kernel supports an
RDT sub-feature if the hardware support is missing. As the kernel only
exposes subtrees in the `resctrl` filesystem for RDT sub-features with
hardware and kernel support, checking the CPU feature flags is redundant
from a correctness point of view. Remove the /proc/cpuinfo check as it
is an optimization which actually hurts performance.

[1]: https://unix.stackexchange.com/a/526679
[2]: https://lore.kernel.org/all/20220415161206.875029458@linutronix.de/
[3]: https://github.com/torvalds/linux/blob/7cf6a8a17f5b134b7e783c2d45c53298faef82a7/arch/x86/kernel/cpu/resctrl/core.c#L834-L851
[4]: https://github.com/torvalds/linux/blob/a6b450573b912316ad36262bfc70e7c3870c56d1/arch/x86/kernel/cpu/resctrl/core.c#L111-L153

Signed-off-by: Cory Snider <csnider@mirantis.com>
The intelrdt package only needs to parse mountinfo to find the mount
point of the resctrl filesystem. Users are generally going to mount the
resctrl filesystem to the pre-created /sys/fs/resctrl directory, so
there is a common case where mountinfo parsing is not required. Optimize
for the common case with a fast path which checks both for the existence
of the /sys/fs/resctrl directory and whether the resctrl filesystem was
mounted to that path using a single statfs syscall.

Signed-off-by: Cory Snider <csnider@mirantis.com>
The OCI runtime spec mandates "[i]f intelRdt is not set, the runtime
MUST NOT manipulate any resctrl pseudo-filesystems." Attempting to
delete files counts as manipulating, so stop doing that when the
container's RDT configuration is nil.

Signed-off-by: Cory Snider <csnider@mirantis.com>
Unless the container's runtime config has intelRdt configuration set,
any checks for whether Intel RDT is supported or the resctrl filesystem
is mounted are a waste of time as, per the OCI Runtime Spec, "the
runtime MUST NOT manipulate any resctrl pseudo-filesystems." And in the
likely case where Intel RDT is supported by both the hardware and
kernel but the resctrl filesystem is not mounted, these checks can get
expensive as the intelrdt package needs to parse mountinfo to check
whether the filesystem has been mounted to a non-standard path.
Optimize for the common case of containers with no intelRdt
configuration by only performing the checks when the container has opted
in.

Signed-off-by: Cory Snider <csnider@mirantis.com>
@kolyshkin kolyshkin force-pushed the 3181-rdt-init-without-cpuinfo branch from 916f62a to ea0bd78 Compare July 28, 2022 19:06
Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@kolyshkin kolyshkin merged commit 3a5294f into opencontainers:main Jul 28, 2022
@corhere corhere deleted the 3181-rdt-init-without-cpuinfo branch July 28, 2022 21:30
@kolyshkin kolyshkin mentioned this pull request Aug 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants