Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed topology with offline CPUs on x86 #2579

Merged
merged 4 commits into from
Jun 11, 2020

Conversation

iwankgb
Copy link
Collaborator

@iwankgb iwankgb commented Jun 9, 2020

It hopefully and ultimately fixes #2566. Works on x86_64. More checks pending.

CC: @BenTheElder

@iwankgb iwankgb marked this pull request as draft June 9, 2020 20:40
@iwankgb
Copy link
Collaborator Author

iwankgb commented Jun 9, 2020

There are still issues on ARM:

[root@dfw2-c2 cadvisor]# echo 0 > /sys/devices/system/cpu/cpu5/online
[root@dfw2-c2 cadvisor]# echo 0 > /sys/devices/system/cpu/cpu2/online
[root@dfw2-c2 cadvisor]# make build && ./cadvisor
>> building assets
>> building binaries
>> building cadvisor
W0609 21:03:48.736150   13899 nvidia.go:61] NVIDIA GPU metrics will not be available: no NVIDIA devices found
E0609 21:03:48.860350   13899 machine.go:252] Cannot open /sys/bus/cpu/devices/cpu2/topology/core_id, number of unique core_id  set to 0
E0609 21:03:48.860386   13899 machine.go:71] Cannot read number of physical cores correctly, number of cores set to 0
E0609 21:03:48.862627   13899 machine.go:252] Cannot open /sys/bus/cpu/devices/cpu2/topology/physical_package_id, number of unique physical_package_id  set to 0
E0609 21:03:48.862656   13899 machine.go:85] Cannot read number of sockets correctly, number of sockets set to 0
^CI0609 21:03:52.569477   13899 manager.go:1157] Exiting thread watching subcontainers
I0609 21:03:52.569522   13899 manager.go:393] Exiting global housekeeping thread
I0609 21:03:52.569552   13899 cadvisor.go:243] Exiting given signal: interrupt
[root@dfw2-c2 cadvisor]# uname -a
Linux dfw2-c2.large.arm-01 4.14.0-115.2.2.el7a.aarch64 #1 SMP Wed Nov 28 22:05:51 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux

(this is Ampere eMAG 8180, on Raspberry Pi 4 it seems to be even worse).

@iwankgb
Copy link
Collaborator Author

iwankgb commented Jun 9, 2020

@BenTheElder, x86_64 part should work and can be tested.

@BenTheElder
Copy link

BenTheElder commented Jun 9, 2020

testing shortly 👀
edit: or not so shortly ...

@BenTheElder
Copy link

I see: "num_cores":5,"num_physical_cores":6,"num_sockets":1,
versus:

CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ    MINMHZ
  0    0      0    0 0:0:0:0          yes 4000.0000 1200.0000
  1    0      0    1 1:1:1:0          yes 4000.0000 1200.0000
  2    0      0    2 2:2:2:0          yes 4000.0000 1200.0000
  3    0      0    3 3:3:3:0          yes 4000.0000 1200.0000
  4    0      0    4 4:4:4:0          yes 4000.0000 1200.0000
  5    0      0    5 5:5:5:0          yes 4000.0000 1200.0000
  6    -      -    - :::               no 4000.0000 1200.0000
  7    -      -    - :::               no 4000.0000 1200.0000
  8    -      -    - :::               no 4000.0000 1200.0000
  9    -      -    - :::               no 4000.0000 1200.0000
 10    -      -    - :::               no 4000.0000 1200.0000
 11    -      -    - :::               no 4000.0000 1200.0000

@BenTheElder
Copy link

W0609 19:03:27.643460 3221212 sysfs.go:340] unable to read /sys/devices/system/node/node0/cpu0/online: open /sys/devices/system/node/node0/cpu0/online: no such file or directory

@iwankgb
Copy link
Collaborator Author

iwankgb commented Jun 10, 2020

Sounds legit, making 0 CPU offline might not be a very good idea.

@BenTheElder
Copy link

It sounds like on some systems CONFIG_BOOTPARAM_HOTPLUG_CPU0 may allow it to be offline, but probably on most it can't be? 🤔

also looks like it may be amd64 only https://www.kernel.org/doc/html/latest/core-api/cpu_hotplug.html

Signed-off-by: Maciej "Iwan" Iwanowski <maciej.iwanowski@critical.today>
Signed-off-by: Maciej "Iwan" Iwanowski <maciej.iwanowski@critical.today>
… test

Signed-off-by: Maciej "Iwan" Iwanowski <maciej.iwanowski@critical.today>
Signed-off-by: Maciej "Iwan" Iwanowski <maciej.iwanowski@critical.today>
@iwankgb
Copy link
Collaborator Author

iwankgb commented Jun 10, 2020

@BenTheElder check it on your side, please!
@dashpole pull-cadvisor-e2e is not triggered for some reason.

@BenTheElder
Copy link

I see "num_cores":6,"num_physical_cores":6,"num_sockets":1, and no warnings now

@dashpole
Copy link
Collaborator

dashpole commented Jun 10, 2020

seems like it is marked as a draft? edit: I don't think thats it. Let me retrigger

@dashpole
Copy link
Collaborator

/retest

@iwankgb iwankgb marked this pull request as ready for review June 10, 2020 21:22
@iwankgb
Copy link
Collaborator Author

iwankgb commented Jun 10, 2020

Anyway - draft status removed, it's ready for review now.

@dghubble
Copy link

I see expected counts on my (SMT disabled) system again: "num_cores": 1, "num_physical_cores": 1, "num_sockets": 1 👍

@dims
Copy link
Collaborator

dims commented Jun 11, 2020

/lgtm

Copy link
Collaborator

@dashpole dashpole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@dashpole dashpole merged commit 4059641 into google:master Jun 11, 2020
@BenTheElder
Copy link

thank you! working on an update PR for kubernetes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong CPU count on x86_64
5 participants