Remove/improve the per-VU buffer pool? #794
This needs more investigation, but as part of my current fixes for #1044 and #1041, I decided to benchmark a few things; specifically, how the removal of the current BPools, or their replacement with `sync.Pool`, would affect performance. Here's the k6 script:

```js
import http from "k6/http";

export let options = {
    vus: 50,
    iterations: 10000,
    batchPerHost: 25,
};

var testServer = "http://127.0.0.1:8080";

export default function () {
    http.get(`${testServer}/`);
    http.post(`${testServer}/post`);
    http.batch([
        ["GET", `${testServer}/gzip`],
        ["GET", `${testServer}/gzip`],
        ["GET", `${testServer}/deflate`],
        ["GET", `${testServer}/deflate`],
        ["GET", `${testServer}/redirect/5`],
        ["GET", `${testServer}/get`],
        ["GET", `${testServer}/html`],
        ["GET", `${testServer}/bytes/100000`],
        ["GET", `${testServer}/image/png`],
        ["GET", `${testServer}/image/jpeg`],
        ["GET", `${testServer}/image/jpeg`],
        ["GET", `${testServer}/image/webp`],
        ["GET", `${testServer}/image/svg`],
        ["GET", `${testServer}/forms/post`],
        ["GET", `${testServer}/bytes/100000`],
        ["GET", `${testServer}/stream-bytes/100000`],
    ]);
    http.get(`${testServer}/get`);
    http.get(`${testServer}/get`);
    http.get(`${testServer}/1mbdata`);
}
```

Most of the endpoints are the ones supplied by go-httpbin, plus a custom `/1mbdata` handler. The results are as follows:

k6 master, no changes (i.e. with the current buffer pools):
k6 master, with a global sync.Pool buffer pool:
k6 master, with a per-VU sync.Pool buffer pool:
k6 master, with no buffer pools (i.e. a fresh buffer allocated for every response):
So, this needs more investigation and probably more realistic benchmarking, but for now it seems like we shouldn't replace the current buffer pools with `sync.Pool`. (Edit: for posterity, the …)
@na--, as you describe in this issue, memory usage could be optimized by using a shared `sync.Pool`. We have validated k6 using `sync.Pool` in our own tests.
I see you opened a PR, so I'll try to take a look at it later today and share specific concerns there. In general, we'd need some benchmarks before we can make this change. I remember …
Thinking about this some more, there are a few further optimizations we might be able to make here 🤔 Instead of using a distinct … I see a few potential problems that can be addressed by this:
One potential danger with 1. is that we shouldn't keep references into the middle of the internal byte slice, but that should be relatively easy to ensure with proper docs and a more limited API than `bytes.Buffer`'s. And maybe these two things don't matter all that much when we have a "hot" …
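The "more limited API" idea could look something like the sketch below: a deliberately narrow wrapper that, unlike `bytes.Buffer`, never hands out a view of its internal slice, so no caller can accidentally hold a reference into pooled memory. The type and method names here are hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// pooledBuffer exposes only Write, Reset, and a copying Bytes method,
// so references into its internal slice can't escape to callers.
type pooledBuffer struct {
	data []byte
}

func (b *pooledBuffer) Write(p []byte) (int, error) {
	b.data = append(b.data, p...)
	return len(p), nil
}

func (b *pooledBuffer) Reset() { b.data = b.data[:0] }

// Bytes always returns a copy, never the internal slice itself.
func (b *pooledBuffer) Bytes() []byte {
	out := make([]byte, len(b.data))
	copy(out, b.data)
	return out
}

var pool = sync.Pool{New: func() interface{} { return &pooledBuffer{} }}

func main() {
	buf := pool.Get().(*pooledBuffer)
	buf.Reset()
	buf.Write([]byte("response body"))
	snapshot := buf.Bytes() // safe to keep even after the buffer is recycled
	pool.Put(buf)
	fmt.Println(string(snapshot))
}
```

The copy in `Bytes` is the price of safety; whether that overhead is acceptable is exactly what the benchmarks discussed above would need to show.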
Regarding how to measure this optimization: it's true that it is difficult to validate with Go benchmarks; we tried, but we didn't get a clear result. Of course, we have validated it with our test scripts, and the improvement is considerable. Attached are some metrics from two executions, one with the current version and the other using `sync.Pool`.
I've provisionally added this to the v0.44.0 milestone. It's likely not going to make it in time for v0.43.0, since the merge window for that closes at the end of the week, and we already have quite a lot of work-in-progress things to finish up before then. We'd need some time to benchmark this ourselves, and maybe also investigate some of the potential improvements from #794 (comment).
Thanks @na--. We are also thinking about how to benchmark this modification (and the improvements you mentioned) with Go benchmarks; if we find a way to automate it and show the improvements, we will add it to the PR.
FWIW, we now have quite reliable ~end-to-end tests in https://github.com/grafana/k6/tree/master/cmd/tests. This one, for example, spins up a server and runs some checks against it: Lines 488 to 538 in 582ec4d
Potentially, with a few …
I didn't know about this kind of test in k6. I added a new one in the same package that allows us to validate the performance of … I ran this test with 500 VUs for 2 minutes and got the same difference between both versions.
You can validate different configurations for VUs/duration and then run:
After that just run pprof as usual:
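The actual commands were lost from this capture, but a typical workflow with standard Go tooling would look roughly like this (the test name and output file name are assumptions):

```shell
# Run the end-to-end test with a memory profile
# (test name and package path are illustrative).
go test -run TestHTTPRequestMemoryUsage -memprofile mem.pprof ./cmd/tests

# Inspect the profile as usual:
go tool pprof -top mem.pprof
```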
Currently each VU has its own buffer pool that's used for storing HTTP response bodies. The current approach has a few limitations (for example, around `http.batch()`), so why not simply have a global one? Like I mention in this TODO, using something like `sync.Pool` may be a better idea, pending some benchmarks of course. `sync.Pool` resizes dynamically and allows objects to be automatically garbage collected when needed, which seems useful in our case. More details here and here.