read performance #42

Merged: 5 commits into main from perf, Aug 12, 2024

Conversation

@Yiling-J (Owner) commented Aug 8, 2024

before:

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkGetParallel/theine-12   	        39227174	        31.04 ns/op	       0 B/op	       0 allocs/op
BenchmarkGetParallel/ristretto-12         	47108266	        24.54 ns/op	      17 B/op	       1 allocs/op

BenchmarkGetSingleParallel/theine-12      	28667926	        40.72 ns/op	       0 B/op	       0 allocs/op
BenchmarkGetSingleParallel/ristretto-12   	23612684	        47.27 ns/op	      16 B/op	       1 allocs/op

after:

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkGetParallel/theine-12            52328856	        22.56 ns/op	       0 B/op	       0 allocs/op
BenchmarkGetParallel/ristretto-12         50361355	        25.36 ns/op	      17 B/op	       1 allocs/op
BenchmarkGetParallel/otter-12             181070922	         7.306 ns/op	       0 B/op	       0 allocs/op

BenchmarkGetSingleParallel/theine-12      	53619693	        21.98 ns/op	       0 B/op	       0 allocs/op
BenchmarkGetSingleParallel/ristretto-12   	21261020	        52.45 ns/op	      16 B/op	       1 allocs/op
BenchmarkGetSingleParallel/otter-12       	207251763	         5.774 ns/op	       0 B/op	       0 allocs/op


codecov bot commented Aug 8, 2024

Codecov Report

Attention: Patch coverage is 87.40157% with 16 lines in your changes missing coverage. Please review.

Project coverage is 88.20%. Comparing base (e8341cf) to head (a632fb8).
Report is 1 commit behind head on main.

Files                           Patch %   Lines
internal/buffer.go              86.66%    10 Missing ⚠️
internal/utils.go               75.00%    2 Missing and 1 partial ⚠️
internal/xruntime/xruntime.go   57.14%    2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #42      +/-   ##
==========================================
- Coverage   88.72%   88.20%   -0.53%     
==========================================
  Files          24       25       +1     
  Lines        2564     2637      +73     
==========================================
+ Hits         2275     2326      +51     
- Misses        199      217      +18     
- Partials       90       94       +4     


@Yiling-J Yiling-J marked this pull request as ready for review August 9, 2024 07:34
@maypok86 commented Aug 9, 2024

In general, it looks good, but I couldn't reproduce the distribution of your benchmark results.

@maypok86 commented Aug 9, 2024

I used benchmarks based on go-cache-benchmark-plus with a few edits.

  1. The theine version is taken from the perf branch.
  2. Added the IgnoreInternalCost = true flag for ristretto (see the config sketch right after this list).
  3. The otter version is the latest release.
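For reference, here is a hedged sketch of what edit 2 looks like on the ristretto side, written against the dgraph-io/ristretto Config API; newRistretto is an illustrative helper, not the actual go-cache-benchmark-plus client wrapper:

import "github.com/dgraph-io/ristretto"

// newRistretto is illustrative only. With IgnoreInternalCost set, ristretto stops
// charging each entry's internal bookkeeping against MaxCost, so its effective
// capacity becomes comparable to the other clients initialized with Init(size).
func newRistretto(size int64) (*ristretto.Cache, error) {
	return ristretto.NewCache(&ristretto.Config{
		NumCounters:        size * 10, // ristretto recommends ~10x the expected number of items
		MaxCost:            size,      // treat MaxCost as an item count (each Set uses cost 1)
		BufferItems:        64,        // recommended default
		IgnoreInternalCost: true,      // edit 2: count only the caller-supplied cost
	})
}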

I think the current version of go-cache-benchmark has too many bugs, so the benchmarks themselves were fixed as well.

func BenchmarkGetParallel(b *testing.B) {
	keys := []string{}
	for i := 0; i < 100000; i++ {
		keys = append(keys, fmt.Sprintf("%d", i))
	}
	for _, client := range benchClients {
		client.Init(100000)
		// Prefill the cache (this filling step was added to the original benchmark).
		for _, key := range keys {
			client.Set(key, key)
		}
		b.ResetTimer()
		b.Run(client.Name(), func(b *testing.B) {
			b.RunParallel(func(p *testing.PB) {
				// It seems better not to have every goroutine read the same single key,
				// although in practice this barely changes the results.
				counter := rand.Int() % 100000
				for p.Next() {
					client.Get(keys[counter%100000])
					counter++
				}
			})
		})
		client.Close()
	}
}

I got the following on an M1 Max.

goos: darwin
goarch: arm64
pkg: github.com/Yiling-J/go-cache-benchmark-plus
BenchmarkGetParallel
BenchmarkGetParallel/theine
BenchmarkGetParallel/theine-10         	16778953	        71.68 ns/op
BenchmarkGetParallel/ristretto
BenchmarkGetParallel/ristretto-10      	46205827	        28.45 ns/op
BenchmarkGetParallel/otter
BenchmarkGetParallel/otter-10          	194468833	         6.116 ns/op
PASS

Things look a bit different in the Set benchmarks.

func BenchmarkSetParallel(b *testing.B) {
	keys := []string{}
	for i := 0; i < 1000000; i++ {
		keys = append(keys, fmt.Sprintf("%d", i))
	}
	for _, client := range benchClients {
		client.Init(100000)
		b.ResetTimer()
		b.Run(client.Name(), func(b *testing.B) {
			b.RunParallel(func(p *testing.PB) {
				// One insert followed by endless updates would favor theine too much,
				// since it has a fast path that only helps in that exact case.
				// So start each goroutine at a random key instead.
				counter := rand.Int() % 1000000
				for p.Next() {
					client.Set(keys[counter%1000000], "bar")
					counter++
				}
			})
		})
		client.Close()
	}
}

And we get the following.

goos: darwin
goarch: arm64
pkg: github.com/Yiling-J/go-cache-benchmark-plus
BenchmarkSetParallel
BenchmarkSetParallel/theine
BenchmarkSetParallel/theine-10         	 2895408	       346.8 ns/op
BenchmarkSetParallel/ristretto
BenchmarkSetParallel/ristretto-10      	16717233	       123.5 ns/op
BenchmarkSetParallel/otter
BenchmarkSetParallel/otter-10          	 2526064	       486.4 ns/op
PASS

@Yiling-J (Owner, Author) commented Aug 10, 2024

@maypok86 I actually do prefill the cache in the Get benchmark, but the code hadn't been pushed yet; I've now updated the repo. I'm also considering filling the cache for Set and turning it into an update benchmark (a sketch follows the results below), but let me focus on read performance first. This is the updated result:

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkGetParallel/theine-10   	38047278	        27.11 ns/op
BenchmarkGetParallel/ristretto-10         	31198701	        38.89 ns/op
BenchmarkGetParallel/otter-10             	113487500	        10.34 ns/op
BenchmarkGetSingleParallel/theine-10      	45519601	        23.73 ns/op
BenchmarkGetSingleParallel/ristretto-10   	22509024	        57.80 ns/op
BenchmarkGetSingleParallel/otter-10       	169000716	         7.001 ns/op
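For what it's worth, the update-style Set benchmark mentioned above could look roughly like this. It's only a sketch that reuses the benchClients helpers from the snippets above, and BenchmarkUpdateParallel is an illustrative name rather than code from this PR; prefilling first means every Set is an update of an existing key.

func BenchmarkUpdateParallel(b *testing.B) {
	keys := []string{}
	for i := 0; i < 100000; i++ {
		keys = append(keys, fmt.Sprintf("%d", i))
	}
	for _, client := range benchClients {
		client.Init(100000)
		// Prefill so that every Set below updates an existing entry.
		for _, key := range keys {
			client.Set(key, key)
		}
		b.ResetTimer()
		b.Run(client.Name(), func(b *testing.B) {
			b.RunParallel(func(p *testing.PB) {
				counter := rand.Int() % 100000
				for p.Next() {
					client.Set(keys[counter%100000], "bar")
					counter++
				}
			})
		})
		client.Close()
	}
}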

I can't reproduce your Get result (71.68 ns/op), where theine is still much slower than ristretto. Is that the perf branch? What about the results from your own benchmark suite (maypok86/benchmarks)? This is my result:

BenchmarkCache/zipf_otter_reads=100%,writes=0%-8         	94667461	        12.76 ns/op	  78339696 ops/s
BenchmarkCache/zipf_theine_reads=100%,writes=0%-8        	31750724	        32.85 ns/op	  30441332 ops/s
BenchmarkCache/zipf_ristretto_reads=100%,writes=0%-8     	27664843	        44.10 ns/op	  22676619 ops/s

@maypok86

> Is that the perf branch?

Yeah, that was the first thing I checked. It also shows how much the results have changed since the latest release. For example, here are the results of my benchmarks.

With theine from the perf branch. (go get github.com/Yiling-J/theine-go@perf)

goos: darwin
goarch: arm64
pkg: github.com/maypok86/benchmarks/throughput
BenchmarkCache/zipf_otter_reads=100%,writes=0%-8                217757455                5.719 ns/op     174842022 ops/s
BenchmarkCache/zipf_theine_reads=100%,writes=0%-8               17539639                64.19 ns/op       15579447 ops/s
BenchmarkCache/zipf_ristretto_reads=100%,writes=0%-8            36242826                32.87 ns/op       30425438 ops/s
PASS
ok      github.com/maypok86/benchmarks/throughput       4.769s

With theine from the latest release. (go get github.com/Yiling-J/theine-go@latest)

goos: darwin
goarch: arm64
pkg: github.com/maypok86/benchmarks/throughput
BenchmarkCache/zipf_otter_reads=100%,writes=0%-8                183928129                6.199 ns/op     161314899 ops/s
BenchmarkCache/zipf_theine_reads=100%,writes=0%-8               10660195               102.0 ns/op         9803268 ops/s
BenchmarkCache/zipf_ristretto_reads=100%,writes=0%-8            40094166                30.01 ns/op       33326931 ops/s
PASS
ok      github.com/maypok86/benchmarks/throughput       4.644s

@Yiling-J (Owner, Author) commented Aug 11, 2024

@maypok86 I figured out the issue. The atomic add operation is significantly slower on ARM64 than on AMD64 (Intel). I've switched to the striped counter from xsync instead, which should improve performance on ARM64; I've already verified this on an Aliyun Yitian (ARM64) ECS instance.
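For illustration, here is a minimal sketch of the counting change, assuming the github.com/puzpuzpuz/xsync/v3 package (the actual integration inside theine may differ): a single atomic word is contended by every goroutine, while xsync's striped Counter spreads increments across multiple padded slots and only sums them when the value is read.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"

	"github.com/puzpuzpuz/xsync/v3"
)

func main() {
	var flat int64                // single atomic counter: every increment contends on one cache line
	striped := xsync.NewCounter() // striped counter: increments spread across padded slots

	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 100000; i++ {
				atomic.AddInt64(&flat, 1) // the contended atomic add pattern that is slow on ARM64
				striped.Inc()             // mostly uncontended increment
			}
		}()
	}
	wg.Wait()

	// Both report 800000; the difference is how the increments scale under contention.
	fmt.Println(atomic.LoadInt64(&flat), striped.Value())
}

Cache counters are written far more often than they are read, so trading a slightly more expensive Value() for cheap, mostly uncontended increments pays off, especially on many-core ARM64 machines.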

@maypok86

Yes, theine has become much faster.

BenchmarkCache/zipf_otter_reads=100%,writes=0%-8                219222543                5.654 ns/op     176863089 ops/s
BenchmarkCache/zipf_theine_reads=100%,writes=0%-8               32655468                38.71 ns/op       25832774 ops/s
BenchmarkCache/zipf_ristretto_reads=100%,writes=0%-8            43702114                31.86 ns/op       31389724 ops/s

@Yiling-J (Owner, Author)

> @maypok86 I figured out the issue. The atomic add operation is significantly slower on ARM64 than on AMD64 (Intel). I've switched to the striped counter from xsync instead, which should improve performance on ARM64; I've already verified this on an Aliyun Yitian (ARM64) ECS instance.

related golang/go#60905 (comment)

@Yiling-J Yiling-J merged commit b78ab76 into main Aug 12, 2024
102 checks passed
@Yiling-J (Owner, Author)

@vmg This PR improves read performance and scalability; you might consider incorporating some of the changes into the Vitess copy of theine.

@vmg commented Aug 21, 2024

Thank you for the heads up @Yiling-J! I'm back from my holiday and will try to backport this in the upcoming weeks. :)

@Yiling-J (Owner, Author)

@vmg BTW, there are a few minor follow-up fixes and improvements related to the code in this PR:
#43
#44
#45 (use uint64 counter)

@Yiling-J Yiling-J deleted the perf branch October 18, 2024 03:08