perf: cut allocations #691
Thanks for the PR, @das7pad. Hopefully I can take a look at it this coming weekend.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #691      +/-   ##
==========================================
+ Coverage   78.04%   79.06%   +1.01%
==========================================
  Files           5        5
  Lines         902      917      +15
==========================================
+ Hits          704      725      +21
+ Misses        142      136       -6
  Partials       56       56
@das7pad looks like this project is active again, would you be able to rebase this PR?
LGTM. I agree with @luisdavim that it would be great to rebase this PR.
367ad61 to 1465de4
Sure, done :) Below are the latest benchmark results.
mux project benchmarks
You can reproduce these benchmarks using docker, pinned to CPU 1:
Modern Xeon E, 3.4 GHz
Older Xeon Gold, 2.3 GHz
(Actual CPU identifier is rather
Older Xeon E, 2.4 GHz
Popular go-http-routing-benchmark
I pushed three branches for comparison to my fork:
You can reproduce these benchmarks using docker, pinned to CPU 1:
docker run --rm --pull always -v /logs:/logs --cpuset-cpus 1 -d golang:1.18 bash -exc 'git clone https://github.com/das7pad/go-http-routing-benchmark.git && cd go-http-routing-benchmark && for branch in before after after-omit-route; do git checkout "$branch" && go test -benchmem -bench Gorilla -count 100 -timeout 1h > "/logs/$branch.txt"; done; go install golang.org/x/perf/cmd/benchstat@latest; benchstat /logs/before.txt /logs/after.txt > /logs/compare-before-vs-after.txt; benchstat /logs/before.txt /logs/after-omit-route.txt > /logs/compare-before-vs-after-omit-route.txt'
Modern Xeon E, 3.4 GHz
Before vs After with omit Route flag enabled
Before vs After
Older Xeon Gold, 2.3 GHz
Before vs After with omit Route flag enabled
(Actual CPU identifier is rather
Before vs After
(Actual CPU identifier is rather
Older Xeon E, 2.4 GHz
Before vs After with omit Route flag enabled
Before vs After
FWIW: The commits of this PR were cherry-picked into MinIO's fork of mux. The fork powered the MinIO server for the past 7 months, via minio/minio#16456. This exposure gave the PR a very decent "manual test in production".
@coreydaley, any chance you could have a look and maybe get this merged? Thanks.
@AlexVulaj AFAIK this looks fine, would you mind also taking a look at it?
Left a few comments. It also looks like something (unrelated to this PR) is triggering a flag in the security scan for Go 1.21. @coreydaley we can likely ignore that for this PR, but we should take a deeper look into it.
Head branch was pushed to by a user without write access
Signed-off-by: Jakob Ackermann <das7pad@outlook.com>
Save 3 allocations worth 448B per served request with no vars.
Save 2 allocations worth 400B per served request with vars.
Populating the request ctx takes roughly 1400 ns before vs. roughly 200 ns after.

```
$ go test -benchmem -benchtime 1000000x -bench BenchmarkVars
goos: linux
goarch: amd64
pkg: github.com/gorilla/mux
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkVarsOld-8      1000000    1430 ns/op     896 B/op    6 allocs/op
BenchmarkVarsEmpty-8    1000000    184.3 ns/op    448 B/op    3 allocs/op
BenchmarkVarsSet-8      1000000    221.7 ns/op    496 B/op    4 allocs/op
PASS
ok      github.com/gorilla/mux  1.863s

$ go test -benchmem -benchtime 1000000x -bench BenchmarkVars
goos: linux
goarch: amd64
pkg: github.com/gorilla/mux
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkVarsOld-8      1000000    1435 ns/op     896 B/op    6 allocs/op
BenchmarkVarsEmpty-8    1000000    184.3 ns/op    448 B/op    3 allocs/op
BenchmarkVarsSet-8      1000000    228.2 ns/op    496 B/op    4 allocs/op
PASS
ok      github.com/gorilla/mux  1.876s

$ go test -benchmem -benchtime 1000000x -bench BenchmarkVars
goos: linux
goarch: amd64
pkg: github.com/gorilla/mux
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkVarsOld-8      1000000    1390 ns/op     896 B/op    6 allocs/op
BenchmarkVarsEmpty-8    1000000    188.8 ns/op    448 B/op    3 allocs/op
BenchmarkVarsSet-8      1000000    225.8 ns/op    496 B/op    4 allocs/op
PASS
ok      github.com/gorilla/mux  1.832s
```

```
$ go test -benchmem -benchtime 5000000x -bench BenchmarkPopulateContext
goos: linux
goarch: amd64
pkg: github.com/gorilla/mux
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkPopulateContext/no_populated_vars-8    5000000    570.6 ns/op    560 B/op    6 allocs/op
BenchmarkPopulateContext/empty_var-8            5000000    872.3 ns/op    928 B/op    9 allocs/op
BenchmarkPopulateContext/populated_vars-8       5000000    861.6 ns/op    912 B/op    8 allocs/op
PASS
ok      github.com/gorilla/mux  11.547s
```

```
// Old helpers: each one wraps the context and copies the request separately.
func requestWithVarsOld(r *http.Request, vars map[string]string) *http.Request {
	ctx := context.WithValue(r.Context(), varsKey, vars)
	return r.WithContext(ctx)
}

func requestWithRouteOld(r *http.Request, route *Route) *http.Request {
	ctx := context.WithValue(r.Context(), routeKey, route)
	return r.WithContext(ctx)
}

// BenchmarkVarsOld measures the previous two-step context population.
func BenchmarkVarsOld(b *testing.B) {
	req := newRequest(http.MethodGet, "http://localhost/")
	r := new(Route)
	var vars map[string]string
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		req = requestWithVarsOld(req, vars)
		req = requestWithRouteOld(req, r)
	}
}

// BenchmarkVarsEmpty measures the merged helper with a nil vars map.
func BenchmarkVarsEmpty(b *testing.B) {
	req := newRequest(http.MethodGet, "http://localhost/")
	r := new(Route)
	var vars map[string]string
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		requestWithRouteAndVars(req, r, vars)
	}
}

// BenchmarkVarsSet measures the merged helper with one populated var.
func BenchmarkVarsSet(b *testing.B) {
	req := newRequest(http.MethodGet, "http://localhost/")
	r := new(Route)
	vars := map[string]string{"foo": "bar"}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		requestWithRouteAndVars(req, r, vars)
	}
}
```

Signed-off-by: Jakob Ackermann <das7pad@outlook.com>
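To illustrate why merging the context population helps, here is a minimal stdlib-only sketch. It is not the PR's actual `requestWithRouteAndVars`; the `route` type, the `populateTwice`/`populateOnce` names, and `demoRequest` are hypothetical stand-ins. The idea shown is simply that layering both values onto the context first means the request is copied once instead of twice.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

// ctxKey, varsKey, and routeKey mimic the naming in the benchmark code above.
type ctxKey int

const (
	varsKey ctxKey = iota
	routeKey
)

// route is a hypothetical stand-in for mux's Route type.
type route struct{ name string }

// populateTwice follows the "old" helpers: each value gets its own
// WithContext call, so the request is shallow-copied twice.
func populateTwice(r *http.Request, rt *route, vars map[string]string) *http.Request {
	r = r.WithContext(context.WithValue(r.Context(), varsKey, vars))
	return r.WithContext(context.WithValue(r.Context(), routeKey, rt))
}

// populateOnce layers both values onto the context first and copies the
// request only once.
func populateOnce(r *http.Request, rt *route, vars map[string]string) *http.Request {
	ctx := context.WithValue(r.Context(), varsKey, vars)
	ctx = context.WithValue(ctx, routeKey, rt)
	return r.WithContext(ctx)
}

// demoRequest builds a throwaway request for the example.
func demoRequest() *http.Request {
	req, err := http.NewRequest(http.MethodGet, "http://localhost/", nil)
	if err != nil {
		panic(err)
	}
	return req
}

func main() {
	req := populateOnce(demoRequest(), &route{name: "status"}, map[string]string{"foo": "bar"})
	vars, _ := req.Context().Value(varsKey).(map[string]string)
	rt, _ := req.Context().Value(routeKey).(*route)
	fmt.Println(vars["foo"], rt.name)
}
```

Both variants expose the same values to handlers; only the number of intermediate request copies differs.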
Save one allocation worth 48B per request on route w/o vars.

Before:
```
$ go test -benchmem -benchtime 5000000x -bench BenchmarkPopulateContext
goos: linux
goarch: amd64
pkg: github.com/gorilla/mux
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkPopulateContext/no_populated_vars-8    5000000    570.6 ns/op    560 B/op    6 allocs/op
BenchmarkPopulateContext/empty_var-8            5000000    872.3 ns/op    928 B/op    9 allocs/op
BenchmarkPopulateContext/populated_vars-8       5000000    861.6 ns/op    912 B/op    8 allocs/op
PASS
ok      github.com/gorilla/mux  11.547s
```

After:
```
$ go test -benchmem -benchtime 5000000x -bench BenchmarkPopulateContext
goos: linux
goarch: amd64
pkg: github.com/gorilla/mux
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkPopulateContext/no_populated_vars-8    5000000    530.7 ns/op    512 B/op    5 allocs/op
BenchmarkPopulateContext/empty_var-8            5000000    969.2 ns/op    928 B/op    9 allocs/op
BenchmarkPopulateContext/populated_vars-8       5000000    944.7 ns/op    912 B/op    8 allocs/op
PASS
ok      github.com/gorilla/mux  12.246s
```

Signed-off-by: Jakob Ackermann <das7pad@outlook.com>
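One plausible way to get this kind of saving is to defer allocating the vars map until a variable is actually recorded. The sketch below is an assumption about the general technique, not a copy of the PR's change; the `match` type and its methods are hypothetical.

```go
package main

import "fmt"

// match is a hypothetical stand-in for a per-request match result.
type match struct {
	vars map[string]string // stays nil until the first variable is recorded
}

// setVar allocates the map lazily, on the first write only.
func (m *match) setVar(name, value string) {
	if m.vars == nil {
		m.vars = make(map[string]string)
	}
	m.vars[name] = value
}

// lookup reads a variable; indexing a nil map is safe in Go and yields "".
func (m *match) lookup(name string) string {
	return m.vars[name]
}

func main() {
	static := &match{}
	fmt.Println(static.vars == nil, static.lookup("id") == "")

	dynamic := &match{}
	dynamic.setVar("id", "42")
	fmt.Println(dynamic.lookup("id"))
}
```

Routes without vars never pay for the map, while routes with vars behave as before.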
Save one allocation worth 16B per route matcher w/o named regexes/vars.
Also save one extra regex pass per route matcher w/o named regexes/vars.

Before:
```
$ go test -benchmem -benchtime 5000000x -bench BenchmarkMuxSimple
goos: linux
goarch: amd64
pkg: github.com/gorilla/mux
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkMuxSimple-8    5000000    477.8 ns/op    512 B/op    5 allocs/op
PASS
ok      github.com/gorilla/mux  2.410s
```

After:
```
$ go test -benchmem -benchtime 5000000x -bench BenchmarkMuxSimple
goos: linux
goarch: amd64
pkg: github.com/gorilla/mux
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkMuxSimple-8    5000000    379.3 ns/op    496 B/op    4 allocs/op
PASS
ok      github.com/gorilla/mux  1.917s
```

Signed-off-by: Jakob Ackermann <das7pad@outlook.com>
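As a rough illustration of skipping the extra regex pass, the sketch below decides once, at construction time, whether a pattern has capture groups, and only extracts submatches when it does. The `pathMatcher` type and its functions are hypothetical and much simpler than mux's route regexps; only `regexp.MustCompile`, `NumSubexp`, `MatchString`, and `FindStringSubmatch` are real standard library calls.

```go
package main

import (
	"fmt"
	"regexp"
)

// pathMatcher is a hypothetical, simplified route matcher.
type pathMatcher struct {
	re      *regexp.Regexp
	hasVars bool // computed once, so the hot path does not re-inspect the pattern
}

func newPathMatcher(pattern string) *pathMatcher {
	re := regexp.MustCompile(pattern)
	return &pathMatcher{re: re, hasVars: re.NumSubexp() > 0}
}

// match reports whether path matches and returns the captured values, if any.
func (m *pathMatcher) match(path string) (bool, []string) {
	if !m.hasVars {
		// No capture groups: a boolean match is enough, so the submatch
		// extraction pass and its result slice are skipped.
		return m.re.MatchString(path), nil
	}
	sub := m.re.FindStringSubmatch(path)
	if sub == nil {
		return false, nil
	}
	return true, sub[1:]
}

func main() {
	static := newPathMatcher(`^/status$`)
	ok, vars := static.match("/status")
	fmt.Println(ok, len(vars))

	dynamic := newPathMatcher(`^/users/([0-9]+)$`)
	ok, vars = dynamic.match("/users/42")
	fmt.Println(ok, vars)
}
```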
Save 4 allocations worth 200B per request cycle with redirect.
The rewrite operation takes roughly 600 ns before vs. roughly 200 ns after.

```
$ go test -benchmem -benchtime 5000000x -bench BenchmarkStrictSlash
goos: linux
goarch: amd64
pkg: github.com/gorilla/mux
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkStrictSlashClone-8    5000000    183.9 ns/op     64 B/op    4 allocs/op
BenchmarkStrictSlashParse-8    5000000    559.8 ns/op    264 B/op    8 allocs/op
PASS
ok      github.com/gorilla/mux  3.740s

$ go test -benchmem -benchtime 5000000x -bench BenchmarkStrictSlash
goos: linux
goarch: amd64
pkg: github.com/gorilla/mux
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkStrictSlashClone-8    5000000    180.4 ns/op     64 B/op    4 allocs/op
BenchmarkStrictSlashParse-8    5000000    573.5 ns/op    264 B/op    8 allocs/op
PASS
ok      github.com/gorilla/mux  3.788s

$ go test -benchmem -benchtime 5000000x -bench BenchmarkStrictSlash
goos: linux
goarch: amd64
pkg: github.com/gorilla/mux
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkStrictSlashClone-8    5000000    175.8 ns/op     64 B/op    4 allocs/op
BenchmarkStrictSlashParse-8    5000000    569.4 ns/op    264 B/op    8 allocs/op
PASS
ok      github.com/gorilla/mux  3.744s
```

```
// BenchmarkStrictSlashClone measures the new approach: clone the URL and swap the path.
func BenchmarkStrictSlashClone(b *testing.B) {
	req := newRequest(http.MethodGet, "http://localhost/x")
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = replaceURLPath(req.URL, req.URL.Path+"/")
	}
}

// BenchmarkStrictSlashParse measures the old approach: serialize, re-parse, then modify.
func BenchmarkStrictSlashParse(b *testing.B) {
	req := newRequest(http.MethodGet, "http://localhost/x")
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		u, _ := url.Parse(req.URL.String())
		u.Path += "/"
		_ = u.String()
	}
}
```

Signed-off-by: Jakob Ackermann <das7pad@outlook.com>
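The two strategies compared by these benchmarks can be sketched with the standard library alone. `replacePathByClone` below is only a guess at what a helper like `replaceURLPath` might do (the PR's implementation is not shown here); `replacePathByParse` and `demoURL` are likewise hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
)

// replacePathByClone copies the URL struct by value and swaps the path in the
// copy, so the caller's URL is never modified and nothing is re-parsed.
func replacePathByClone(u *url.URL, path string) string {
	clone := *u
	clone.Path = path
	return clone.String()
}

// replacePathByParse takes the slower route: serialize, parse again, then
// modify the freshly parsed URL.
func replacePathByParse(u *url.URL, path string) (string, error) {
	parsed, err := url.Parse(u.String())
	if err != nil {
		return "", err
	}
	parsed.Path = path
	return parsed.String(), nil
}

// demoURL builds a sample URL for the example.
func demoURL() *url.URL {
	u, err := url.Parse("http://localhost/x?q=1")
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	u := demoURL()
	fmt.Println(replacePathByClone(u, u.Path+"/"))
	viaParse, err := replacePathByParse(u, u.Path+"/")
	fmt.Println(viaParse, err)
	fmt.Println(u.Path) // the original URL keeps its path
}
```

Copying the struct by value is enough here because only the `Path` field of the copy is reassigned.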
Optionally save 3 allocations worth 448B per request with no vars.

```
$ go test -benchmem -benchtime 5000000x -bench BenchmarkMuxSimple
goos: linux
goarch: amd64
pkg: github.com/gorilla/mux
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkMuxSimple/default-8                5000000    349.3 ns/op    496 B/op    4 allocs/op
BenchmarkMuxSimple/omit_route_from_ctx-8    5000000    157.8 ns/op     48 B/op    1 allocs/op
PASS
ok      github.com/gorilla/mux  2.556s

$ go test -benchmem -benchtime 5000000x -bench BenchmarkMuxSimple
goos: linux
goarch: amd64
pkg: github.com/gorilla/mux
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkMuxSimple/default-8                5000000    354.7 ns/op    496 B/op    4 allocs/op
BenchmarkMuxSimple/omit_route_from_ctx-8    5000000    160.8 ns/op     48 B/op    1 allocs/op
PASS
ok      github.com/gorilla/mux  2.602s

$ go test -benchmem -benchtime 5000000x -bench BenchmarkMuxSimple
goos: linux
goarch: amd64
pkg: github.com/gorilla/mux
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkMuxSimple/default-8                5000000    376.4 ns/op    496 B/op    4 allocs/op
BenchmarkMuxSimple/omit_route_from_ctx-8    5000000    168.1 ns/op     48 B/op    1 allocs/op
PASS
ok      github.com/gorilla/mux  2.745s
```

```
$ go test -benchmem -benchtime 5000000x -bench BenchmarkPopulateContext
goos: linux
goarch: amd64
pkg: github.com/gorilla/mux
cpu: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
BenchmarkPopulateContext/no_populated_vars-8      5000000    381.6 ns/op    496 B/op    4 allocs/op
BenchmarkPopulateContext/empty_var-8              5000000    913.6 ns/op    928 B/op    9 allocs/op
BenchmarkPopulateContext/populated_vars-8         5000000    914.0 ns/op    912 B/op    8 allocs/op
BenchmarkPopulateContext/omit_route_/static-8     5000000    168.6 ns/op     48 B/op    1 allocs/op
BenchmarkPopulateContext/omit_route_/dynamic-8    5000000    827.4 ns/op    880 B/op    8 allocs/op
PASS
ok      github.com/gorilla/mux  16.049s
```

Signed-off-by: Jakob Ackermann <das7pad@outlook.com>
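The effect of the opt-in flag on a static route can be pictured with a toy router. Everything in this sketch (`miniRouter`, `prepare`, `currentRouteName`, `demoRequest`) is hypothetical and far simpler than mux; it only shows that when nothing is stored in the context, the incoming request can be passed through as-is, without a context wrapper or a request copy.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

type ctxKey int

const routeKey ctxKey = iota

// miniRouter is a hypothetical, single-route router used for illustration.
type miniRouter struct {
	omitRoute bool   // opt-in: do not store the route in the request context
	routeName string // stand-in for a matched route
}

// prepare returns the request that would be handed to the handler.
func (m *miniRouter) prepare(r *http.Request) *http.Request {
	if m.omitRoute {
		// Nothing to store, so the request is reused unchanged.
		return r
	}
	return r.WithContext(context.WithValue(r.Context(), routeKey, m.routeName))
}

// currentRouteName reads the route back, reporting false when it was omitted.
func currentRouteName(r *http.Request) (string, bool) {
	name, ok := r.Context().Value(routeKey).(string)
	return name, ok
}

// demoRequest builds a throwaway request for the example.
func demoRequest() *http.Request {
	req, err := http.NewRequest(http.MethodGet, "http://localhost/status", nil)
	if err != nil {
		panic(err)
	}
	return req
}

func main() {
	req := demoRequest()

	withRoute := (&miniRouter{routeName: "status"}).prepare(req)
	fmt.Println(currentRouteName(withRoute))

	without := (&miniRouter{routeName: "status", omitRoute: true}).prepare(req)
	fmt.Println(currentRouteName(without))
	fmt.Println(without == req)
}
```

The trade-off is visible in the second lookup: handlers behind a router with the flag enabled can no longer read the route from the request.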
Co-authored-by: Alex Vulaj <avulaj@redhat.com> Signed-off-by: Jakob Ackermann <das7pad@outlook.com>
83efd14 to e136241
Summary of Changes
Hello!
I found a few "low hanging" allocations that can be deferred until needed or
even skipped entirely.
With all the optimizations combined, we can see the best improvement for
simple, static routes (like a `/status` endpoint) that do not read the `Route`
from the request context via `CurrentRoute` and do not populate any vars.
On these routes, we can process requests with a single allocation for the
`RouteMatch` object. Previously there were 9 extra allocations.
For said routes the processing overhead (ns/op) in mux dropped by 75%, which is
a speedup of 4x.
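For readers who want to convert between "percent reduction" and "speedup factor" when looking at their own results, here is a small helper sketch. The function names are made up for this example; the second calculation reuses the i7 `BenchmarkMuxSimple` numbers quoted in the commit messages (349.3 ns/op default vs. 157.8 ns/op with the route omitted).

```go
package main

import "fmt"

// reduction returns the fraction of time saved, e.g. 0.75 for "dropped by 75%".
func reduction(before, after float64) float64 {
	return (before - after) / before
}

// speedup converts a fractional reduction into a speedup factor.
func speedup(reduction float64) float64 {
	return 1 / (1 - reduction)
}

func main() {
	fmt.Println(speedup(0.75)) // prints 4

	// ns/op values taken from one i7 BenchmarkMuxSimple run in the commit message.
	r := reduction(349.3, 157.8)
	fmt.Printf("%.0f%% less time, %.1fx faster\n", r*100, speedup(r))
}
```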
Other routes can expect to see a double-digit percentage reduction in both
processing overhead (ns/op) and allocations as well.
These are driven by merging the context population into a single operation,
eliminating two of ten allocations.
(Eliminating that last allocation for the `RouteMatch` in the best case
requires significant refactoring to maintain full backwards compatibility.
Something for another day.)
Each commit message contains benchmark results for showcasing particular
(micro) optimizations in reduced allocations and in a few cases notable direct
CPU time savings.
I also ran longer benchmarks with 100 repetitions in multiple settings on
different generations of (server) CPUs.
First, there is the full set of benchmarks in this repository and second, the
popular benchmarks https://github.com/julienschmidt/go-http-routing-benchmark.
All but the last change are entirely "free", as in they do not cut features for
gains in performance. The last change for omitting the `Route` from the
context is behind an optional flag that users can opt into when they do not read
the `Route` from the request context.
Said flag is stored locally in a `Router`, so users can enable/disable the flag
on Subrouters individually.
Benchmark results
I added all the new benchmarks onto a `baseline` branch for comparing the
performance of the changes, tip is 0eba4f5.
I'm running these tests on "shared" compute instances (and my laptop), so
expect some noise (and frequency scaling on the i7).
mux project benchmarks
You can reproduce these benchmarks using docker, pinned to CPU 1:
docker run --rm --pull always -v /logs:/logs --cpuset-cpus 1 -d golang:1.18 bash -exc 'git clone https://github.com/das7pad/mux.git && cd mux && for branch in baseline perf-cut-allocations; do git checkout "$branch" && go test -benchmem -bench . -count 100 -timeout 1h > "/logs/$branch-all.txt"; done; go install golang.org/x/perf/cmd/benchstat@latest; benchstat /logs/baseline-all.txt /logs/perf-cut-allocations-all.txt > /logs/compare-all.txt'
Modern Xeon E, 3.4 GHz
Older Xeon Gold, 2.3 GHz
(Actual CPU identifier is rather `Intel(R) Xeon(R) Gold 5122 CPU @ 2.30GHz`)
Older Xeon E, 2.4 GHz
Popular go-http-routing-benchmark
I pushed three branches for comparison to my fork:
https://github.com/das7pad/go-http-routing-benchmark
- `before`, this is the baseline branch mentioned above
- `after`, this is the PR revision
- `after-omit-route`, like `after` with the `OmitRouteFromContext` flag enabled

You can reproduce these benchmarks using docker, pinned to CPU 1:
docker run --rm --pull always -v /logs:/logs --cpuset-cpus 1 -d golang:1.18 bash -exc 'git clone https://github.com/das7pad/go-http-routing-benchmark.git && cd go-http-routing-benchmark && for branch in before after after-omit-route; do git checkout "$branch" && go test -benchmem -bench Gorilla -count 100 -timeout 1h > "/logs/$branch.txt"; done; go install golang.org/x/perf/cmd/benchstat@latest; benchstat /logs/before.txt /logs/after.txt > /logs/compare-before-vs-after.txt; benchstat /logs/before.txt /logs/after-omit-route.txt > /logs/compare-before-vs-after-omit-route.txt'
Modern Xeon E, 3.4 GHz
Before vs After with omit Route flag enabled
Before vs After
Older Xeon Gold, 2.3 GHz
Before vs After with omit Route flag enabled
(Actual CPU identifier is rather `Intel(R) Xeon(R) Gold 5122 CPU @ 2.30GHz`)
Before vs After
(Actual CPU identifier is rather `Intel(R) Xeon(R) Gold 5122 CPU @ 2.30GHz`)
Older Xeon E, 2.4 GHz
Before vs After with omit Route flag enabled
Before vs After
Older i7, frequency scaling around 3.4 GHz, n=10
Sorry, only 10 iterations each. I do not want to hear the fan for too long :)
Before vs After with omit Route flag enabled
Before vs After
If you read this far, please consider running benchmarks for your own use cases
of `mux` and report back any changes. Thanks!
go.mod override
You can use the following override in your `go.mod` file:
Optionally, you can enable the flag for not storing the `Route` in the request context: