Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uncore perf #519

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

spandruvada
Copy link

Hello maintainers,

This pull request for uncore power optimization on Intel servers. Please provide feedback.

tebrandt and others added 2 commits March 22, 2023 07:30
Add a property "uncore_max_delta_mhz" to cpu_plugin, so that uncore
maximum frequency can be set to a lower value from the default maximum.
Lowering uncore maximum frequency can have significant impact on the
total power consummations of SoC.

The uncore can consume significant amount of power in Intel's Xeon
servers based on the workload characteristics. To optimize the total
power and improve overall performance, SoCs have internal algorithms
for scaling uncore frequency.

Refer to the following link to get details:
https://docs.kernel.org/admin-guide/pm/intel_uncore_frequency_scaling.html

Based on experiments done on Intel Sapphire Rapids server by running a
wide variety of server workloads, power saving is significant with some
performance loss. For example a 500 MHz reduction causes 5% performance
loss and reduces total active power by 13%.

This property knob helps users to optimize power based on their tolerance
for reduced performance. The unit for this property knob is in MHz.

Signed-off-by: Todd Brandt <todd.e.brandt@intel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Based on experiments done on Intel Sapphire Rapids server by running a
wide variety of server workloads from phoronix, power saving is 8% with
performance loss of 1.5%.

Introduce a profile which inherits "throughput-performance" and reduce
the uncore maximum frequency by 200MHz.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
@joemario
Copy link

I just tested this patch on a single socket Sapphire Rapids system, running a 6.3.0-0.rc4.35.eln126.x86_64 kernel.

I did see power savings at idle, (12% when C1 was not pinned, and 8% when C1 was pinned).
Linpack's performance did drop 1.6%. I did no other testing.

While I agree this would be good to get into TuneD, I'm not sure doing it with a new TuneD profile is the way to go.
The profile which is part of this patchset is throughput-performance. That means if someone wanted to use this feature with another profile, they wouldn't be able to.

Question back to Jaroslav and the TuneD maintainers:
Wouldn't it be better to create configuration files so that power saving tuning knobs like this one could be turned on/off in any TuneD profile?

@joemario
Copy link

To add a clarification to my earlier comment, there was no power savings when the cpus were busy. Those 8% and 12% power savings were only for idle systems.

data = self._cmd.read_file(fpath, err_ret=None, no_error=True)
try:
value = int(data.strip())
except:

Check notice

Code scanning / CodeQL

Except block handles 'BaseException'

Except block directly handles BaseException.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix in the next version, once issue with the general direction regarding profile is clear.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I assume from this that the "int" call has its own baseexception that I'm overriding. Not sure what the problem is with that but I can replace this with a simple "if re.match('[0-9]+', data.strip()):". That try/catch is just there in case the sysfs file returns garbage (e.g. anything other than a decimal integer). If it happens I ignore its package and move on like it doesn't exist.

@spandruvada
Copy link
Author

To add a clarification to my earlier comment, there was no power savings when the cpus were busy. Those 8% and 12% power savings were only for idle systems.

When CPUs are 100% busy and accessing lots of memory all the time, then saving will not be seen. But workloads has idle times and also don't access uncore constantly to keep uncore high. So overall this is good knob.
It is fine it is deployed using some other method than suggested here.

@jmencak
Copy link
Contributor

jmencak commented Apr 6, 2023

Wouldn't it be better to create configuration files so that power saving tuning knobs like this one could be turned on/off in any TuneD profile?

Absolutely. I'd personally like to avoid adding yet another profile. The "knobs" that could make use of the functionality in any TuneD profile would be nice though.

@spandruvada
Copy link
Author

spandruvada commented Apr 7, 2023

Wouldn't it be better to create configuration files so that power saving tuning knobs like this one could be turned on/off in any TuneD profile?

Absolutely. I'd personally like to avoid adding yet another profile. The "knobs" that could make use of the functionality in any TuneD profile would be nice though.

Do you mean that add a knob to:
tuned-main.conf
?

@jmencak
Copy link
Contributor

jmencak commented Apr 8, 2023

Absolutely. I'd personally like to avoid adding yet another profile. The "knobs" that could make use of the functionality in any TuneD profile would be nice though.

Do you mean that add a knob to: tuned-main.conf ?

Personally, that's not what I had in mind. tuned-main.conf is a global configuration file and these "knobs" probably should not go there. I did not thought this through completely yet, but I was thinking to have the "knobs" in a new configuration file under /etc/tuned/something.conf. Please see /etc/tuned/cpu-partitioning-variables.conf for example. @yarda, do you think that would make more sense or do you have a better suggestion?

First, we probably need to agree what "knobs"/variables/tunables we need alongside with their names (just uncore_max_delta_mhz ?) and take it from there.

@joemario
Copy link

joemario commented Apr 8, 2023

I did not thought this through completely yet, but I was thinking to have the "knobs" in a new configuration file under /etc/tuned/something.conf. Please see /etc/tuned/cpu-partitioning-variables.conf for example. @yarda, do you think that would make more sense or do you have a better suggestion?

Hi Jiri, Jaroslav, and Srinivas:
Jiri's comment is similar to what I was thinking as well.

For example, we could:

  1. Have a /etc/tuned/power-saving-variables.conf file.
  2. That file would contain a list of all powersaving knobs that TuneD supports, initially all commented out.
  3. Users could uncomment and set them as desired.
  4. This power-saving.conf file could be included in all the RHEL profiles. It would only cause power-savings if a user uncommented and set any of the knobs in that file.

I would defer to Jaroslav on his preferred approach for doing something like this.

@spandruvada
Copy link
Author

I did not thought this through completely yet, but I was thinking to have the "knobs" in a new configuration file under /etc/tuned/something.conf. Please see /etc/tuned/cpu-partitioning-variables.conf for example. @yarda, do you think that would make more sense or do you have a better suggestion?

Hi Jiri, Jaroslav, and Srinivas: Jiri's comment is similar to what I was thinking as well.

For example, we could:

  1. Have a /etc/tuned/power-saving-variables.conf file.
  2. That file would contain a list of all powersaving knobs that TuneD supports, initially all commented out.
  3. Users could uncomment and set them as desired.
  4. This power-saving.conf file could be included in all the RHEL profiles. It would only cause power-savings if a user uncommented and set any of the knobs in that file.

I would defer to Jaroslav on his preferred approach for doing something like this.

Hi Jiri, Jaroslav, Joe

We can resubmit as suggested by Joe. Please let us know.

@spandruvada
Copy link
Author

spandruvada commented Apr 19, 2023

I did not thought this through completely yet, but I was thinking to have the "knobs" in a new configuration file under /etc/tuned/something.conf. Please see /etc/tuned/cpu-partitioning-variables.conf for example. @yarda, do you think that would make more sense or do you have a better suggestion?

Hi Jiri, Jaroslav, and Srinivas: Jiri's comment is similar to what I was thinking as well.
For example, we could:

  1. Have a /etc/tuned/power-saving-variables.conf file.
  2. That file would contain a list of all powersaving knobs that TuneD supports, initially all commented out.
  3. Users could uncomment and set them as desired.
  4. This power-saving.conf file could be included in all the RHEL profiles. It would only cause power-savings if a user uncommented and set any of the knobs in that file.

I would defer to Jaroslav on his preferred approach for doing something like this.

Hi Jiri, Jaroslav, Joe

We can resubmit as suggested by Joe. Please let us know.

Hi Joe,

I created /etc/tuned/power-saving-variables.conf with contents.
I added some example values for test

# governor=performance
energy_perf_bias=performance
# min_perf_pct=100
# max_perf_pct=100
# energy_performance_preference=balance_performance
# force_latency=cstate.name:C6|cstate.id:4|10
# pm_qos_resume_latency_us

/////////////
Suppose you include this in profiles/balanced/tuned.conf

[main]
summary=General non-specialized tuned profile

[variables]
include=/etc/tuned/power-saving-variables.conf

[modules]
cpufreq_conservative=+r

[cpu]
priority=10
governor=conservative|powersave

# energy_perf_bias=normal "This is what balanced defines"
# but user has defined energy_perf_bias=performance
# /etc/tuned/power-saving-variables.conf

energy_perf_bias=${energy_perf_bias}

There is no way to say that if ${energy_perf_bias} is defined use the value
else use energy_perf_bias=normal

So what you are suggesting, may not be possible with current tuned?

@novacain1
Copy link

Does this rely on the intel_uncore kmod?

Will this allow similar functionality in tuned like we've been able to use in the past via the msr-tools package, specifically the binaries rdmsr and wrmsr user space applications? In some cases the uncore frequency is set in server platform BIOS and has no knobs to control / modify without modifying registers, which isn't user friendly.

@yarda
Copy link
Contributor

yarda commented Jun 22, 2023

For the configurations we expect users usually to customize, the variables are way to go. If users are not expected to usually customize the configuration then new profile is probably a better solution.

In this specific case I would also prefer variables, i.e. the .conf file.

There is no way to say that if ${energy_perf_bias} is defined use the value
else use energy_perf_bias=normal

IMHO this should be possible to implement with the ${f:regex_search_ternary} built in function, I could prepare some proof of concept. Something semantically similar is already used in the realtime and cpu-partitioning profiles.

Regarding the current implementation in this PR:

  • from the kernel documentation it seems this tuning knob is per package die, thus I expect there could be multiple such knobs on the system accessible through subdirs, but If I am not mistaken in the current PR all are controlled in the first plugin instance. I think better would be to add support per device which would allow setting all the knobs individually, e.g.:
[cpus_group1]
devices=cpu1,cpu2,cpu3,...
type=cpu
uncore_max_delta_mhz= ${VALUE1}

[cpus_group2]
devices=cpu16,cpu17,cpu18,...
type=cpu
uncore_max_delta_mhz= ${VALUE2}
  • Why is it called uncore_max_delta_mhz? For no confusion wouldn't be better to use the same name and units as are used in the kernel sysfs?
  • I think verification should be also supported in the code for the tuned-adm verify to work.

@sgruszka
Copy link
Contributor

Hi, I would like to move this PR forward.

IIUC we have two separate issues here, first is adding uncore frequency knob to cpu plugin, and second incorporate the knob into existing profiles (via variable.conf). For now I would like to concentrate on the first issue - a, once that done move to the second one.

  • from the kernel documentation it seems this tuning knob is per package die, thus I expect there could be multiple such knobs on the system accessible through subdirs, but If I am not mistaken in the current PR all are controlled in the first plugin instance. I think better would be to add support per device which would allow setting all the knobs individually, e.g.:
[cpus_group1]
devices=cpu1,cpu2,cpu3,...
type=cpu
uncore_max_delta_mhz= ${VALUE1}

[cpus_group2]
devices=cpu16,cpu17,cpu18,...
type=cpu
uncore_max_delta_mhz= ${VALUE2}

It's of cource reasonable to configure this per cpu. The one problem I see is that the all cpu's in the config might not necessary located in the same die. Basically will be required that devices='cpu list' is configured to die. That's ok I think , we can log error and make verify fail , if the configuration is not correct.

  • Why is it called uncore_max_delta_mhz? For no confusion wouldn't be better to use the same name and units as are used in the kernel sysfs?

Yes, it would be better to have the same units obviously. Regarding name it's delta, because we don't know apriori what is the maximum frequency. We read that value from sysfs intial_max_freq_khz and subtract the delta and write to max_freq_khz.

@sgruszka
Copy link
Contributor

Does this rely on the intel_uncore kmod?

Yes.

Will this allow similar functionality in tuned like we've been able to use in the past via the msr-tools package, specifically the binaries rdmsr and wrmsr user space applications? In some cases the uncore frequency is set in server platform BIOS and has no knobs to control / modify without modifying registers, which isn't user friendly.

Yes, I think using sysfs intel_uncore_frequency would be preferred to configure uncore frequency over rdmsr/wrmsr.

@bartwensley
Copy link

  • from the kernel documentation it seems this tuning knob is per package die, thus I expect there could be multiple such knobs on the system accessible through subdirs, but If I am not mistaken in the current PR all are controlled in the first plugin instance. I think better would be to add support per device which would allow setting all the knobs individually, e.g.:
[cpus_group1]
devices=cpu1,cpu2,cpu3,...
type=cpu
uncore_max_delta_mhz= ${VALUE1}

[cpus_group2]
devices=cpu16,cpu17,cpu18,...
type=cpu
uncore_max_delta_mhz= ${VALUE2}

It's of cource reasonable to configure this per cpu. The one problem I see is that the all cpu's in the config might not necessary located in the same die. Basically will be required that devices='cpu list' is configured to die. That's ok I think , we can log error and make verify fail , if the configuration is not correct.

According to the kernel documentation (https://docs.kernel.org/admin-guide/pm/intel_uncore_frequency_scaling.html) this is configured per package and die combination - not per CPU. So I don't see how or why we would try to configure this for specific CPUs in tuned.

  • Why is it called uncore_max_delta_mhz? For no confusion wouldn't be better to use the same name and units as are used in the kernel sysfs?

Yes, it would be better to have the same units obviously. Regarding name it's delta, because we don't know apriori what is the maximum frequency. We read that value from sysfs intial_max_freq_khz and subtract the delta and write to max_freq_khz.

I believe that in at least some cases, either Intel or the HW vendor is going to require a specific uncore frequency to be set. I think tuned needs to allow the uncore max_freq_khz and min_freq_khz to be set specifically - not using deltas. Or perhaps we should allow either method to be used?

@sgruszka
Copy link
Contributor

According to the kernel documentation (https://docs.kernel.org/admin-guide/pm/intel_uncore_frequency_scaling.html) this is configured per package and die combination - not per CPU. So I don't see how or why we would try to configure this for specific CPUs in tuned.

We know about topology, the die_id can be read from sysfs i.e for cpu6 is:
/sys/devices/system/cpu/cpu6/topology/die_id

However there are uncore's that might not contain cpu's (there are uncore* entries additionally to package_die entries in . /sys/devices/system/cpu/intel_uncore_frequency). So this is not that simple.
One global variable that control all uncore's as provided this PR is the simplest solution.

I believe that in at least some cases, either Intel or the HW vendor is going to require a specific uncore frequency to be set. I think tuned needs to allow the uncore max_freq_khz and min_freq_khz to be set specifically - not using deltas. Or perhaps we should allow either method to be used?

I'm open to suggestions here. It can be direct value or percentage of max for example.
However direct value would not be portable from system to system, hence I think both configuration methods indeed should be provided (direct and percentage/delta)

@sgruszka
Copy link
Contributor

sgruszka commented Dec 4, 2023

I've opened another PR . Please check it out to see if it goes to the right direction. It allows to define uncore freq per cpu (the config is checked against topology) . Hope it's something reasonable. Otherwise what I think could be done is one global option (as in this PR) or separate intel_uncore plugin where device is defined as entry in
/sys/devices/system/cpu/intel_uncore_frequency/* i.e. package_NN_die_MM or uncoreNN .

@yarda
Copy link
Contributor

yarda commented Jul 25, 2024

Uncore support has been merged, if you want the setting in the throughput-performance profile, please update this PR or create new one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants