Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips #16672

Merged
merged 2 commits into from
Mar 28, 2023

Conversation

shoenig
Copy link
Member

@shoenig shoenig commented Mar 27, 2023

This PR adds detection of asymetric core types (Power (P) & Efficiency (E))
when running on M1/M2 Apple Silicon CPUs. This functionality is provided
by shoenig/go-m1cpu which makes use of the Apple IOKit framework to read
undocumented registers containing CPU performance data. Currently working
on getting that functionality merged upstream into gopsutil, but gopsutil
would still not support detecting P vs E cores like this PR does.

Also refactors the CPUFingerprinter code to handle the mixed core
types, now setting power vs efficiency cpu attributes.

For now the scheduler is still unaware of mixed core types - on Apple
platforms tasks cannot reserve cores anyway so it doesn't matter, but
at least now the total CPU shares available will be correct.

Future work should include adding support for detecting P/E cores on
the latest and upcoming Intel chips, where computation of total cpu shares
is currently incorrect. For that, we should also include updating the
scheduler to be core-type aware, so that tasks making use of resources.cores
on Linux platforms can be assigned the correct number of CPU shares for the core
type(s) they have been assigned.

node attributes before

cpu.arch                  = arm64
cpu.modelname             = Apple M2 Pro
cpu.numcores              = 12
cpu.reservablecores       = 0
cpu.totalcompute          = 1000

node attributes after

cpu.arch                  = arm64
cpu.frequency.efficiency  = 2424
cpu.frequency.power       = 3504
cpu.modelname             = Apple M2 Pro
cpu.numcores.efficiency   = 4
cpu.numcores.power        = 8
cpu.reservablecores       = 0
cpu.totalcompute          = 37728

Fixes #12029
Fixes #11046
Fixes #9408

…chips

This PR adds detection of asymetric core types (Power & Efficiency) (P/E)
when running on M1/M2 Apple Silicon CPUs. This functionality is provided
by shoenig/go-m1cpu which makes use of the Apple IOKit framework to read
undocumented registers containing CPU performance data. Currently working
on getting that functionality merged upstream into gopsutil, but gopsutil
would still not support detecting P vs E cores like this PR does.

Also refactors the CPUFingerprinter code to handle the mixed core
types, now setting power vs efficiency cpu attributes.

For now the scheduler is still unaware of mixed core types - on Apple
platforms tasks cannot reserve cores anyway so it doesn't matter, but
at least now the total CPU shares available will be correct.

Future work should include adding support for detecting P/E cores on
the latest and upcoming Intel chips, where computation of total cpu shares
is currently incorrect. For that, we should also include updating the
scheduler to be core-type aware, so that tasks of resources.cores on Linux
platforms can be assigned the correct number of CPU shares for the core
type(s) they have been assigned.

node attributes before

cpu.arch                  = arm64
cpu.modelname             = Apple M2 Pro
cpu.numcores              = 12
cpu.reservablecores       = 0
cpu.totalcompute          = 1000

node attributes after

cpu.arch                  = arm64
cpu.frequency.efficiency  = 2424
cpu.frequency.power       = 3504
cpu.modelname             = Apple M2 Pro
cpu.numcores.efficiency   = 4
cpu.numcores.power        = 8
cpu.reservablecores       = 0
cpu.totalcompute          = 37728
Copy link
Contributor

@lgfa29 lgfa29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly questions, but LGTM. Just missing a CHANGELOG entry 🙂

Smoke test on my machine:

Attributes
cpu.arch                  = arm64
cpu.frequency.efficiency  = 2064
cpu.frequency.power       = 3228
cpu.modelname             = Apple M1 Max
cpu.numcores.efficiency   = 2
cpu.numcores.power        = 8
cpu.reservablecores       = 0
cpu.totalcompute          = 29952

}
var err error
if cpuPowerCoreCount, err = cpu.Counts(true); err != nil {
initErr = errors.Join(initErr, fmt.Errorf("failed to detect number of CPU cores: %v", err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
initErr = errors.Join(initErr, fmt.Errorf("failed to detect number of CPU cores: %v", err))
initErr = errors.Join(initErr, fmt.Errorf("failed to detect number of CPU cores: %w", err))

I think this let us unwrap the inner error? I'm still getting used to the new errors stuff 😅

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// detecting little cores on non-apple platforms
cpuTotalTicks = uint64(cpuPowerCoreCount) * cpuPowerCoreMHz

initErr = err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this overwriting the accumulated errors with just the last one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this bit isn't in a loop - it's just setting the global initErr inside a sync.Do so that the cpu detection only happens once

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we write to initErr twice inside the default clause right?

if cpuInfoStats, err = cpu.InfoWithContext(ctx); err != nil {
initErr = errors.Join(initErr, fmt.Errorf("Unable to obtain CPU information: %w", err))
}

if cpuPowerCoreCount, err = cpu.Counts(true); err != nil {
initErr = errors.Join(initErr, fmt.Errorf("failed to detect number of CPU cores: %w", err))
}

If both fail I expect initErr to have two erros, but this initErr = err is making it so only the last error is returned.


var cpuInfo []cpu.InfoStat
if cpuInfo, err = cpu.InfoWithContext(ctx); err != nil {
errors.Join(err, fmt.Errorf("failed to detect CPU information: %v", err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the joined error being ignore? Or does it mutate err?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Join returns an error that wraps the given errors.

So, I think we need to capture the returned error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah @jrasell is right - this line actually does nothing because the result is returned from Join ... however this line is unnecessary anyway since the assignment to initErr already contains a contextual message

logger hclog.Logger

// accumulates result in these resource structs
resources *structs.Resources
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we keep the COMPAT note in case do we clean this up in the future?

// COMPAT(0.10): Remove in 0.10
resp.Resources = &structs.Resources{

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@jrasell jrasell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, only minor comments and as Luiz mentioned, missing changelog.

@@ -3,6 +3,6 @@

package fingerprint

func (f *CPUFingerprint) deriveReservableCores(req *FingerprintRequest) ([]uint16, error) {
return nil, nil
func (f *CPUFingerprint) deriveReservableCores(string) []uint16 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ignore the unused parameter

Suggested change
func (f *CPUFingerprint) deriveReservableCores(string) []uint16 {
func (f *CPUFingerprint) deriveReservableCores(_ string) []uint16 {


var cpuInfo []cpu.InfoStat
if cpuInfo, err = cpu.InfoWithContext(ctx); err != nil {
errors.Join(err, fmt.Errorf("failed to detect CPU information: %v", err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Join returns an error that wraps the given errors.

So, I think we need to capture the returned error.

// node to fall into a unique computed node class
cpuMhzPerCore = math.Floor(cpuMhzPerCore)
cpuTotalTicks = math.Floor(float64(cpuNumCores) * cpuMhzPerCore)
for _, cpu := range cpuInfo {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename the cpu variable, as this clashes with an import?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.5.x backport to 1.5.x release line
Projects
None yet
3 participants