Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix an issue with the last level cache values on Linux running on certain AMD Processors #108492

Merged
merged 12 commits into from
Oct 11, 2024

Conversation

mrsharm
Copy link
Member

@mrsharm mrsharm commented Oct 2, 2024

Fixes: #76290

Problem Details

We recently discovered an issue that affects Unix based VMs where we are taking the host’s (as opposed to the VM’s) for certain AMD processor's last level cache size to be used in the GetLogicalProcessorCacheSizeFromOS call to discern the gen0 budget for both WKS and SVR. This is because sysconf, the method we first try in GetLogicalProcessorCacheSizeFromOS, gets us the last level cache of the host machine as opposed to the fallback code path that reads the value of /sys/devices/system/cpu/cpu0/cache/index{LastLevelCache}/size. Consequently, the gen0 budgets are significantly different between Unix VMs and Windows using certain AMD processors on the same machine and the further implication of this is that we are probably setting much larger value than expected for the Gen0 budget for the GC running on Unix based VMs.

The details from my v16 CPU based DevBox with an AMD EPYC 7763 64 Core Processor running Ubuntu 22.04.3 via WSL are as follows:

  • sysconf returns the last cache size (L3) for the host machine (AMD EPYC™ 7763 – specs are here) as 256 MB.
    • Can be repro’d on the command line using:
      getconf -a | grep “LEVEL3_CACHE_SIZE” => LEVEL3_CACHE_SIZE 268435456
  • Reading /sys/devices/system/cpu/cpu0/cache/index3/size returns 32 MiB, the same as the result from GetLogicalProcessorCacheSizeFromOS from Windows that calls GetLogicalProcessorInformation function (sysinfoapi.h) - Win32 apps | Microsoft Learn.
    • Can be repro’d on the command line using lscpu => L3: 32 MiB (1 instance)

How To Check for the Issue

  1. Get sysconf output: getconf -a | grep "LEVEL"
  2. The full output of lscpu
  3. Check if the L3 (or if available, L4) cache size is the same from sysconf and that from lscpu.
  4. If the values are different, the issue exists.

Solution

  • By default, with no configuration changes, first try to read in the cache information from sysfs and if that fails, fall back to the heuristic we use to compute the value for the ARM* cases.
  • A new configuration DOTNET_GCCacheSizeFromSysConf can be set to 1 to revert to the current behavior i.e., using sysconf to obtain the last level cache.

Performance Testing

Ran with the following GCPerfSim configurations for SVR: -tc 28 -tagb 100 -tlgb 0 -lohar 0-pohar 0 -sohsr 100-4000 -lohsr 102400-204800 -pohsr 100-204800 -sohsi 0 -lohsi 0 -pohsi 0 -sohpi 0 -lohpi 0 -sohfi 0 -lohfi 0 -pohfi 0 -allocType reference -testKind time

image

Metric Not With SysConf With SysConf
Peak Heap Size (MB) 339.458 2656.194
% Pause Time in GC 39.0 7.7

@mrsharm mrsharm marked this pull request as ready for review October 4, 2024 19:19
@mrsharm mrsharm changed the title [Work in Progress] Fix an issue with the last level cache values on Linux running on certain AMD Processors Fix an issue with the last level cache values on Linux running on certain AMD Processors Oct 4, 2024
@Maoni0
Copy link
Member

Maoni0 commented Oct 5, 2024

the rest looks okay to me.. would be great if @janvorli could take a look.

@janvorli
Copy link
Member

janvorli commented Oct 8, 2024

There is one thing I keep thinking about. Would it be a problem in case the /sys/devices/system/cpu/cpu0/cache is not present to still read the size from sysconf? In other words, to let the new config knob control just the order in which we try to use the /sys/devices/system/cpu/cpu0/cache and sysconf? I wonder if in the case the /sys/devices/system/cpu/cpu0/cache is missing, the heuristic fallback would give us a reasonable value.

@Maoni0
Copy link
Member

Maoni0 commented Oct 9, 2024

the way I look at this is from the user's POV the values from sysconf is simply incorrect. so it'd be better to just take the heuristic values. another option is to treat sysconf to always give us the full cache sizes and check to see how many cores this process is actually allowed to use and get a ratio (so if sysconf reports 128mb and 64 cores, our process can only use 8 cores, we take 1/8 of 128mb).

@janvorli
Copy link
Member

janvorli commented Oct 9, 2024

I would rather avoid introducing this new kind of heuristic, because it depends on the internal topology of the processor. One of the cases we were seeing this issue occurred on some AMD processors, because they have 4 separate 3rd level caches where each one is shared by 1/4 of the cores. In this case, the customer was using the full CPU and still the sysconf was returning a sum of the 3rd level cache sizes, it means 4 times higher value.
So let's keep this change as is.

@mrsharm
Copy link
Member Author

mrsharm commented Nov 5, 2024

How To Check for the Issue

  1. Get sysconf output for the last level cache size: getconf -a | grep "LEVEL"
  2. The full output of lscpu
  3. Check if the last level cache size is the same from sysconf and that from lscpu.
  4. If the values are different, the issue exists.

@mrsharm
Copy link
Member Author

mrsharm commented Nov 12, 2024

/backport to release/9.0-staging

Copy link
Contributor

Started backporting to release/9.0-staging: https://github.com/dotnet/runtime/actions/runs/11803383363

Copy link
Contributor

@mrsharm backporting to release/9.0-staging failed, the patch most likely resulted in conflicts:

$ git am --3way --empty=keep --ignore-whitespace --keep-non-patch changes.patch

Applying: Started work on the Last Level Cache optimization
.git/rebase-apply/patch:112: trailing whitespace.
#endif 
.git/rebase-apply/patch:152: trailing whitespace.
// It seems ok to set the same default sizes when the cache info isn’t available on the /sys/ path like we did with arm. 
.git/rebase-apply/patch:166: trailing whitespace.
    
warning: 3 lines add whitespace errors.
Using index info to reconstruct a base tree...
M	src/coreclr/gc/gcconfig.h
Falling back to patching base and 3-way merge...
Auto-merging src/coreclr/gc/gcconfig.h
CONFLICT (content): Merge conflict in src/coreclr/gc/gcconfig.h
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 Started work on the Last Level Cache optimization
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

Copy link
Contributor

@mrsharm an error occurred while backporting to release/9.0-staging, please check the run log for details!

Error: git am failed, most likely due to a merge conflict.

mikelle-rogers pushed a commit to mikelle-rogers/runtime that referenced this pull request Dec 10, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Dec 13, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GC picks wrong L3 cache size on Linux
3 participants