Add benchmarks to README? #12
@jeromefroe Working on a blog post covering exactly this (incl. benchmarks). Hope to have it done by January. Will ping you again then.
@embano1 Did you finish the blog post? :)
Hi @codesuki Sorry for my delayed answer (PTO and other stuff). I did some benchmarking (mostly CPU-bound workloads like calculating prime numbers or lock contention) but did not have the time to write the blog post, which is still planned. Anyway, here are some data points from runs on a 16-core cloud box. The tests show different CPU cgroup settings (i.e. CFS quota) and their effect on benchmark run time.

The first diagram summarises a prime benchmark I wrote for the tests (https://github.com/embano1/gotutorials/tree/master/concprime). It spawns many active goroutines to find prime numbers. The benchmark execution time is compared across runs with CPU CFS quota off/1/2/4 vs. different GOMAXPROCS settings (1-16) on a 16-core box. As an example, look at the orange bars comparing the case for GOMAXPROCS=16. The first orange bar shows the run without a CPU CFS quota, i.e. the fastest of all runs (as expected). The second orange bar is where the container is constrained to 1 CPU (CFS quota 100ms, period 100ms). It's the worst result, meaning you should tune GOMAXPROCS to the CFS quota accordingly, especially on large boxes (8+ CPUs).

The second diagram is a mutex lock contention benchmark comparing Go's sync.Map vs. a map with a mutex (https://medium.com/@deckarep/the-new-kid-in-town-gos-sync-map-de24a6bf7c2c). Here you can see that it becomes really critical for performance when there are many mutexes in the game. Compare the orange line (map w/ r/w mutex) and the yellow line (map w/ r/w mutex and CFS quota == 1 CPU). GOMAXPROCS is shown on the horizontal axis. Everything is fine for GOMAXPROCS=1, but it gets much worse with the CFS quota applied and GOMAXPROCS=16 (the default on that machine). The chart cuts off; you can see the values for both cases in the table below the chart. For sync.Map w/ r/w mutex and CFS quota == 1 CPU we got two orders of magnitude slower performance when GOMAXPROCS is not tuned (151 vs 15000 ns/op)!
I acknowledge that these are synthetic benchmarks, but they prove the point: if there's misalignment between the CFS quota and the language runtime tuning (in this case GOMAXPROCS), and the workload is mostly CPU-bound (e.g. spawning a lot of active goroutines, calculations, etc.), this can cause performance degradation. I think it's not that hard to write custom benchmarks to validate the impact of a misaligned GOMAXPROCS vs. CFS quota for your specific application. In fact, I recommend making benchmarking/stress-testing part of CI to establish a baseline and compare against production. I discussed this intensively in a talk at KubeCon (https://www.youtube.com/watch?v=8-apJyr2gi0). Hope that helps.
Thanks for following up and the write-up! Very informative. I'll play with the benchmark a bit. One thing: in the second graph it seems that no matter what the quota is, the best setting is GOMAXPROCS=1. Great talk BTW!
Thank you (also on the talk!) and "gern geschehen" (you're welcome) :)
The problem is lock/CPU cache contention when there is more than one OS thread (simply speaking, the number of …). Now, should we advise always setting …? I would say Dave Cheney can be considered an authoritative Go source :) and thus I'm linking to his great material on performance tuning: https://github.com/davecheney/high-performance-go-workshop

I'm also pleased to hear that there are changes coming to the Go runtime's memory management with regard to memory limits (https://blog.golang.org/ismmkeynote), but that's not related to our discussion here :)
Thanks again!
Just FYI: I ran the benchmark https://github.com/embano1/gotutorials/tree/master/concprime with:

- native
- docker --cpus 4
- docker --cpus 2
- docker --cpus 1
- docker --cpuset-cpus 0,1,2,3
- kubernetes resources.limits=4
- kubernetes resources.limits=2
- kubernetes resources.limits=1

The raw results were plotted with the following R script:
```r
library(ggplot2)
library(purrr)

maxprocs <- read.delim("maxprocs.txt", header = FALSE, sep = "\t", dec = ".")
maxprocs$V2 <- factor(maxprocs$V2, c("1", "2", "4"))

p <- ggplot(maxprocs, aes(x = V1, y = V3, fill = V2)) +
  geom_col(position = "dodge2", width = 0.7) +
  coord_flip() +
  theme(legend.position = "top") +
  guides(fill = guide_legend(title = "GOMAXPROCS", title.position = "left")) +
  labs(y = "Duration (ms) (lower is better)", x = "")
p
```
I'll add some more data, measured from our internal load balancer at Uber. We ran the load balancer with a 200% CPU quota (i.e., 2 cores) and used yab to benchmark.

When GOMAXPROCS is increased above the CPU quota, we see p50 latency decrease slightly but significant increases in p99. The total RPS handled also decreases, and we saw significant CPU throttling. Once GOMAXPROCS was reduced to match the CPU quota, we saw no CPU throttling.
This diff adds a performance note from #12 (comment) to the README.
Fixed by #52. Thanks to @SaveTheRbtz for the PR.
Hi! I was wondering if it would be possible to include some benchmarks in the README? We run some open-source software with CPU quotas, and being able to link to some benchmarks in the README might go a long way toward convincing other people to incorporate automaxprocs into their projects as well.