Add benchmarks to README? #12
@jeromefroe Working on a blog post covering exactly this (incl. benchmarks). Hope to have it done by January. Will ping you again then.
@embano1 Did you finish the blog post? :)
Hi @codesuki Sorry for my delayed answer (PTO and other stuff). I did some benchmarking (mostly CPU-bound workloads like calculating prime numbers or lock contention) but did not have the time to write the blog post, which is still planned. Anyway, here are some data points from runs on a 16-core cloud box. The tests show different CPU cgroup settings (i.e. CFS quota) and their effect on benchmark run time.

The first diagram summarises a prime benchmark I wrote for the tests (https://github.com/embano1/gotutorials/tree/master/concprime). It spawns many active goroutines to find prime numbers. The benchmark execution time is compared across runs with CPU CFS quota off/1/2/4 vs. different GOMAXPROCS settings (1-16) on a 16-core box. As an example, look at the orange bars comparing the case for GOMAXPROCS=16. The first orange bar shows the run without a CPU CFS quota, i.e. the fastest of all runs (as expected). The second orange bar is where the container is constrained to 1 CPU (CFS quota 100ms, period 100ms). It's the worst result, meaning you should tune GOMAXPROCS to the CFS quota accordingly, especially on large boxes (8+ CPUs).

The second diagram is a mutex lock contention benchmark comparing Go's sync.Map vs. a map with a mutex (https://medium.com/@deckarep/the-new-kid-in-town-gos-sync-map-de24a6bf7c2c). Here you can see that it becomes really critical for performance when there are many mutexes in the game. Compare the orange line (map w/ r/w mutex) and the yellow line (map w/ r/w mutex and CFS quota == 1 CPU). GOMAXPROCS is shown on the horizontal axis. Everything is fine for GOMAXPROCS=1, but it gets much worse with the CFS quota applied and GOMAXPROCS=16 (the default on that machine). The chart cuts off; you can see the values for both cases in the table below the chart. For sync.Map w/ r/w mutex and CFS quota == 1 CPU we got two orders of magnitude slower performance when GOMAXPROCS is not tuned (151 vs 15000 ns/op)!
I acknowledge that these are synthetic benchmarks, but they prove the point: if there's misalignment between the CFS quota and the language runtime tuning (in this case GOMAXPROCS), and the workload is mostly CPU-bound (e.g. spawning a lot of active goroutines, calculations, etc.), this can cause performance degradation. I think it's not that hard to write custom benchmarks to validate the impact of a misaligned GOMAXPROCS vs. CFS quota for your specific application. In fact, I recommend making benchmarking/stress-testing part of CI to establish a baseline and compare against production. I discussed this intensively in a talk at KubeCon (https://www.youtube.com/watch?v=8-apJyr2gi0). Hope that helps.
Thanks for following up and the write-up! Very informative. I'll play with the benchmark a bit. One thing: in the second graph it seems that no matter what the quota is, the best setting is GOMAXPROCS=1. Great talk BTW!
Thank you (also on the talk!) and "gern geschehen" (you're welcome) :)
The problem is lock/CPU cache contention when there is more than one OS thread (simply speaking, the number of …). Now, should we advise always setting …? I would say Dave Cheney can be considered an authoritative Go source :) and thus I'm linking to his great material on performance tuning: https://github.com/davecheney/high-performance-go-workshop

I'm also pleased to hear that there are changes coming to the Go runtime's memory management with regard to memory limits (https://blog.golang.org/ismmkeynote), but that's not related to our discussion here :)
Thanks again!
Just FYI: I ran the benchmark https://github.com/embano1/gotutorials/tree/master/concprime with:

- native
- docker --cpus 4
- docker --cpus 2
- docker --cpus 1
- docker --cpuset-cpus 0,1,2,3
- kubernetes resources.limits=4
- kubernetes resources.limits=2
- kubernetes resources.limits=1

The raw results were plotted with the following R script:
```r
library(ggplot2)
library(purrr)

maxprocs <- read.delim("maxprocs.txt", header = FALSE, sep = "\t", dec = ".")
maxprocs$V2 <- factor(maxprocs$V2, c("1", "2", "4"))

p <- ggplot(maxprocs, aes(x = V1, y = V3, fill = V2)) +
  geom_col(position = "dodge2", width = 0.7) +
  coord_flip() +
  theme(legend.position = "top") +
  guides(fill = guide_legend(title = "GOMAXPROCS", title.position = "left")) +
  labs(y = "Duration (ms) (lower is better)", x = "")
p
```
I'll add some more data, measured from our internal load balancer at Uber. We ran the load balancer with a 200% CPU quota (i.e., 2 cores) and used yab to benchmark.

When GOMAXPROCS is increased above the CPU quota, we see p50 latency decrease slightly but significant increases in p99. The total RPS handled also decreases, and we saw significant CPU throttling. Once GOMAXPROCS was reduced to match the CPU quota, we saw no CPU throttling.
This diff adds a performance note from #12 (comment) to the README.
Fixed by #52. Thanks to @SaveTheRbtz for the PR.
Hi! I was wondering if it would be possible to include some benchmarks in the README? We run some open-source software with CPU quotas, and being able to link to some benchmarks in the README might go a long way toward convincing other people to incorporate automaxprocs into their projects as well.