read performance #42

Merged: 5 commits into main from perf, Aug 12, 2024

Conversation

@Yiling-J (Owner) commented Aug 8, 2024

before:

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkGetParallel/theine-12   	        39227174	        31.04 ns/op	       0 B/op	       0 allocs/op
BenchmarkGetParallel/ristretto-12         	47108266	        24.54 ns/op	      17 B/op	       1 allocs/op

BenchmarkGetSingleParallel/theine-12      	28667926	        40.72 ns/op	       0 B/op	       0 allocs/op
BenchmarkGetSingleParallel/ristretto-12   	23612684	        47.27 ns/op	      16 B/op	       1 allocs/op

after:

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkGetParallel/theine-12            52328856	        22.56 ns/op	       0 B/op	       0 allocs/op
BenchmarkGetParallel/ristretto-12         50361355	        25.36 ns/op	      17 B/op	       1 allocs/op
BenchmarkGetParallel/otter-12             181070922	         7.306 ns/op	       0 B/op	       0 allocs/op

BenchmarkGetSingleParallel/theine-12      	53619693	        21.98 ns/op	       0 B/op	       0 allocs/op
BenchmarkGetSingleParallel/ristretto-12   	21261020	        52.45 ns/op	      16 B/op	       1 allocs/op
BenchmarkGetSingleParallel/otter-12       	207251763	         5.774 ns/op	       0 B/op	       0 allocs/op


codecov bot commented Aug 8, 2024

Codecov Report

Attention: Patch coverage is 87.40157% with 16 lines in your changes missing coverage. Please review.

Project coverage is 88.20%. Comparing base (e8341cf) to head (a632fb8).
Report is 1 commit behind head on main.

Files                           Patch %   Lines
internal/buffer.go              86.66%    10 Missing ⚠️
internal/utils.go               75.00%    2 Missing and 1 partial ⚠️
internal/xruntime/xruntime.go   57.14%    2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #42      +/-   ##
==========================================
- Coverage   88.72%   88.20%   -0.53%     
==========================================
  Files          24       25       +1     
  Lines        2564     2637      +73     
==========================================
+ Hits         2275     2326      +51     
- Misses        199      217      +18     
- Partials       90       94       +4     


@Yiling-J Yiling-J marked this pull request as ready for review August 9, 2024 07:34
@maypok86 commented Aug 9, 2024

In general, it looks good, but I couldn't reproduce the distribution of your benchmark results.

@maypok86 commented Aug 9, 2024

I used benchmarks based on go-cache-benchmark-plus with a few edits.

  1. The theine version is taken from the perf branch.
  2. Added the IgnoreInternalCost = true flag for ristretto (see the config sketch right after this list).
  3. The otter version is the latest release.
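For reference, here is a hedged sketch of what edit 2 looks like on the ristretto side, written against the dgraph-io/ristretto Config API; newRistretto is an illustrative helper, not the actual go-cache-benchmark-plus client wrapper:

import "github.com/dgraph-io/ristretto"

// newRistretto is illustrative only. With IgnoreInternalCost set, ristretto stops
// charging each entry's internal bookkeeping against MaxCost, so its effective
// capacity becomes comparable to the other clients initialized with Init(size).
func newRistretto(size int64) (*ristretto.Cache, error) {
	return ristretto.NewCache(&ristretto.Config{
		NumCounters:        size * 10, // ristretto recommends ~10x the expected number of items
		MaxCost:            size,      // treat MaxCost as an item count (each Set uses cost 1)
		BufferItems:        64,        // recommended default
		IgnoreInternalCost: true,      // edit 2: count only the caller-supplied cost
	})
}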

I think the current version of go-cache-benchmark has too many bugs, so the benchmarks themselves were fixed as well.

func BenchmarkGetParallel(b *testing.B) {
	keys := []string{}
	for i := 0; i < 100000; i++ {
		keys = append(keys, fmt.Sprintf("%d", i))
	}
	for _, client := range benchClients {
		client.Init(100000)
		// Prefill the cache (this filling step was added to the original benchmark).
		for _, key := range keys {
			client.Set(key, key)
		}
		b.ResetTimer()
		b.Run(client.Name(), func(b *testing.B) {
			b.RunParallel(func(p *testing.PB) {
				// It seems better not to have every goroutine read the same single key,
				// although in practice this barely changes the results.
				counter := rand.Int() % 100000
				for p.Next() {
					client.Get(keys[counter%100000])
					counter++
				}
			})
		})
		client.Close()
	}
}

I got the following on an M1 Max.

goos: darwin
goarch: arm64
pkg: github.com/Yiling-J/go-cache-benchmark-plus
BenchmarkGetParallel
BenchmarkGetParallel/theine
BenchmarkGetParallel/theine-10         	16778953	        71.68 ns/op
BenchmarkGetParallel/ristretto
BenchmarkGetParallel/ristretto-10      	46205827	        28.45 ns/op
BenchmarkGetParallel/otter
BenchmarkGetParallel/otter-10          	194468833	         6.116 ns/op
PASS

Things look a bit different in the Set benchmarks.

func BenchmarkSetParallel(b *testing.B) {
	keys := []string{}
	for i := 0; i < 1000000; i++ {
		keys = append(keys, fmt.Sprintf("%d", i))
	}
	for _, client := range benchClients {
		client.Init(100000)
		b.ResetTimer()
		b.Run(client.Name(), func(b *testing.B) {
			b.RunParallel(func(p *testing.PB) {
				// One insert followed by endless updates would favor theine too much,
				// since it has a fast path that only helps in that exact case.
				// So start each goroutine at a random key instead.
				counter := rand.Int() % 1000000
				for p.Next() {
					client.Set(keys[counter%1000000], "bar")
					counter++
				}
			})
		})
		client.Close()
	}
}

And we get the following.

goos: darwin
goarch: arm64
pkg: github.com/Yiling-J/go-cache-benchmark-plus
BenchmarkSetParallel
BenchmarkSetParallel/theine
BenchmarkSetParallel/theine-10         	 2895408	       346.8 ns/op
BenchmarkSetParallel/ristretto
BenchmarkSetParallel/ristretto-10      	16717233	       123.5 ns/op
BenchmarkSetParallel/otter
BenchmarkSetParallel/otter-10          	 2526064	       486.4 ns/op
PASS

@Yiling-J (Owner, Author) commented Aug 10, 2024

@maypok86 I actually do prefill the cache in the Get benchmark, but the code hadn't been pushed yet; I've now updated the repo. I'm also considering filling the cache for Set and turning it into an update benchmark (a sketch follows the results below), but let me focus on read performance first. This is the updated result:

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkGetParallel/theine-10   	38047278	        27.11 ns/op
BenchmarkGetParallel/ristretto-10         	31198701	        38.89 ns/op
BenchmarkGetParallel/otter-10             	113487500	        10.34 ns/op
BenchmarkGetSingleParallel/theine-10      	45519601	        23.73 ns/op
BenchmarkGetSingleParallel/ristretto-10   	22509024	        57.80 ns/op
BenchmarkGetSingleParallel/otter-10       	169000716	         7.001 ns/op
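For what it's worth, the update-style Set benchmark mentioned above could look roughly like this. It's only a sketch that reuses the benchClients helpers from the snippets above, and BenchmarkUpdateParallel is an illustrative name rather than code from this PR; prefilling first means every Set is an update of an existing key.

func BenchmarkUpdateParallel(b *testing.B) {
	keys := []string{}
	for i := 0; i < 100000; i++ {
		keys = append(keys, fmt.Sprintf("%d", i))
	}
	for _, client := range benchClients {
		client.Init(100000)
		// Prefill so that every Set below updates an existing entry.
		for _, key := range keys {
			client.Set(key, key)
		}
		b.ResetTimer()
		b.Run(client.Name(), func(b *testing.B) {
			b.RunParallel(func(p *testing.PB) {
				counter := rand.Int() % 100000
				for p.Next() {
					client.Set(keys[counter%100000], "bar")
					counter++
				}
			})
		})
		client.Close()
	}
}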

I can't reproduce your Get result (71.68 ns/op), where theine is still much slower than ristretto. Is that the perf branch? What about the results from your own benchmark suite (maypok86/benchmarks)? This is my result:

BenchmarkCache/zipf_otter_reads=100%,writes=0%-8         	94667461	        12.76 ns/op	  78339696 ops/s
BenchmarkCache/zipf_theine_reads=100%,writes=0%-8        	31750724	        32.85 ns/op	  30441332 ops/s
BenchmarkCache/zipf_ristretto_reads=100%,writes=0%-8     	27664843	        44.10 ns/op	  22676619 ops/s

@maypok86

> Is that the perf branch?

Yeah, that was the first thing I checked. It also shows how much the results have changed since the latest release. For example, here are the results of my benchmarks.

With theine from the perf branch. (go get github.com/Yiling-J/theine-go@perf)

goos: darwin
goarch: arm64
pkg: github.com/maypok86/benchmarks/throughput
BenchmarkCache/zipf_otter_reads=100%,writes=0%-8                217757455                5.719 ns/op     174842022 ops/s
BenchmarkCache/zipf_theine_reads=100%,writes=0%-8               17539639                64.19 ns/op       15579447 ops/s
BenchmarkCache/zipf_ristretto_reads=100%,writes=0%-8            36242826                32.87 ns/op       30425438 ops/s
PASS
ok      github.com/maypok86/benchmarks/throughput       4.769s

With theine from the latest release. (go get github.com/Yiling-J/theine-go@latest)

goos: darwin
goarch: arm64
pkg: github.com/maypok86/benchmarks/throughput
BenchmarkCache/zipf_otter_reads=100%,writes=0%-8                183928129                6.199 ns/op     161314899 ops/s
BenchmarkCache/zipf_theine_reads=100%,writes=0%-8               10660195               102.0 ns/op         9803268 ops/s
BenchmarkCache/zipf_ristretto_reads=100%,writes=0%-8            40094166                30.01 ns/op       33326931 ops/s
PASS
ok      github.com/maypok86/benchmarks/throughput       4.644s

@Yiling-J (Owner, Author) commented Aug 11, 2024

@maypok86 I figured out the issue. The atomic add operation is significantly slower on ARM64 than on AMD64 (Intel). I've switched to the striped counter from xsync instead, which should improve performance on ARM64; I've already verified this on an Aliyun Yitian (ARM64) ECS instance.
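For illustration, here is a minimal sketch of the counting change, assuming the github.com/puzpuzpuz/xsync/v3 package (the actual integration inside theine may differ): a single atomic word is contended by every goroutine, while xsync's striped Counter spreads increments across multiple padded slots and only sums them when the value is read.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"

	"github.com/puzpuzpuz/xsync/v3"
)

func main() {
	var flat int64                // single atomic counter: every increment contends on one cache line
	striped := xsync.NewCounter() // striped counter: increments spread across padded slots

	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 100000; i++ {
				atomic.AddInt64(&flat, 1) // the contended atomic add pattern that is slow on ARM64
				striped.Inc()             // mostly uncontended increment
			}
		}()
	}
	wg.Wait()

	// Both report 800000; the difference is how the increments scale under contention.
	fmt.Println(atomic.LoadInt64(&flat), striped.Value())
}

Cache counters are written far more often than they are read, so trading a slightly more expensive Value() for cheap, mostly uncontended increments pays off, especially on many-core ARM64 machines.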

@maypok86

Yes, theine has become much faster.

BenchmarkCache/zipf_otter_reads=100%,writes=0%-8                219222543                5.654 ns/op     176863089 ops/s
BenchmarkCache/zipf_theine_reads=100%,writes=0%-8               32655468                38.71 ns/op       25832774 ops/s
BenchmarkCache/zipf_ristretto_reads=100%,writes=0%-8            43702114                31.86 ns/op       31389724 ops/s

@Yiling-J (Owner, Author)

> @maypok86 I figured out the issue. The atomic add operation is significantly slower on ARM64 than on AMD64 (Intel). I've switched to the striped counter from xsync instead, which should improve performance on ARM64; I've already verified this on an Aliyun Yitian (ARM64) ECS instance.

related golang/go#60905 (comment)

@Yiling-J Yiling-J merged commit b78ab76 into main Aug 12, 2024
102 checks passed
@Yiling-J (Owner, Author)

@vmg This PR improves read performance and scalability; you might consider incorporating some of the changes into the Vitess copy of theine.

@vmg commented Aug 21, 2024

Thank you for the heads up @Yiling-J! I'm back from my holiday and will try to backport this in the upcoming weeks. :)

@Yiling-J (Owner, Author)

@vmg BTW, there are a few minor follow-up fixes and improvements related to the code in this PR:
#43
#44
#45 (use uint64 counter)

@Yiling-J Yiling-J deleted the perf branch October 18, 2024 03:08