storage: large value performance degradation since switching to pebble #49750
Do we remember what happened between April 22-28, 2019? I thought @ajkr landed a tuning fix in there, but I'm having trouble finding the PR. I wonder if Pebble is missing similar tuning, given that it seems to have dropped back down to a similar level.
Perhaps
Wrong date range though, hmm
#37172 - seems like it was probably something from this bag of fixes.
Ah, I think it was facebook/rocksdb#5183 / cockroachdb/rocksdb#29.
Pebble should have the same behavior as RocksDB here. Perhaps it is busted somehow. @jbowens is going to track down what is going on. Shouldn't be difficult given how large the delta is.
Bump LogWriter's pending queue size from 4 to 16. The impact is muted and not statistically significant with small records, but at larger record sizes the impact is appreciable. This may help with cockroachdb/cockroach#49750, but I don't think it's the primary issue.

```
name                       old time/op    new time/op    delta
RecordWrite/size=8-16        32.7ns ± 5%    32.2ns ± 5%      ~     (p=0.055 n=24+25)
RecordWrite/size=16-16       33.8ns ± 7%    33.6ns ± 5%      ~     (p=0.663 n=23+25)
RecordWrite/size=32-16       36.6ns ± 4%    36.6ns ± 9%      ~     (p=0.755 n=23+24)
RecordWrite/size=64-16       41.5ns ± 5%    41.5ns ±12%      ~     (p=0.890 n=24+24)
RecordWrite/size=256-16      68.2ns ± 5%    67.9ns ± 8%      ~     (p=0.679 n=24+24)
RecordWrite/size=1028-16      134ns ± 8%     125ns ± 7%    -6.44%  (p=0.000 n=23+23)
RecordWrite/size=4096-16      357ns ±15%     340ns ± 8%    -4.90%  (p=0.001 n=24+24)
RecordWrite/size=65536-16    5.76µs ±10%    5.17µs ± 7%   -10.32%  (p=0.000 n=25+25)

name                       old speed      new speed      delta
RecordWrite/size=8-16       245MB/s ± 5%   249MB/s ± 5%      ~     (p=0.055 n=24+25)
RecordWrite/size=16-16      472MB/s ± 7%   476MB/s ± 6%      ~     (p=0.532 n=24+25)
RecordWrite/size=32-16      875MB/s ± 4%   875MB/s ± 8%      ~     (p=0.792 n=23+24)
RecordWrite/size=64-16     1.54GB/s ± 5%  1.54GB/s ±11%      ~     (p=0.945 n=24+25)
RecordWrite/size=256-16    3.76GB/s ± 5%  3.77GB/s ± 7%      ~     (p=0.690 n=24+24)
RecordWrite/size=1028-16   7.69GB/s ± 7%  8.22GB/s ± 7%    +6.93%  (p=0.000 n=23+23)
RecordWrite/size=4096-16   11.4GB/s ±13%  12.1GB/s ± 7%    +5.58%  (p=0.001 n=25+24)
RecordWrite/size=65536-16  11.4GB/s ±11%  12.7GB/s ± 7%   +11.39%  (p=0.000 n=25+25)
```
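For intuition on why a larger pending queue helps mostly at large record sizes, here is a rough Go sketch of a bounded pending-block queue. This is an illustration of the general idea only, not Pebble's actual `LogWriter`: with more slots, a writer producing large records can keep handing filled blocks to the flush loop instead of stalling.

```go
// Illustration only, not Pebble's LogWriter. A bounded queue of filled
// blocks awaiting flush: enqueue blocks once `capacity` blocks are pending,
// so a larger capacity lets writers of large records stay ahead of the
// flusher instead of stalling.
type blockQueue struct {
	pending chan []byte
}

func newBlockQueue(capacity int) *blockQueue {
	return &blockQueue{pending: make(chan []byte, capacity)}
}

// enqueue hands a filled block to the flusher, blocking when the queue is full.
func (q *blockQueue) enqueue(block []byte) {
	q.pending <- block
}

// flushLoop drains pending blocks, e.g. writing them to the WAL file.
func (q *blockQueue) flushLoop(write func([]byte) error) {
	for b := range q.pending {
		_ = write(b) // error handling elided in this sketch
	}
}
```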
49957: vendor: bump Pebble to feb930 r=jbowens a=jbowens

```
feb930 db: close tableCache on open error
660b76 internal/record: bump LogWriter pending queue size
a9b799 db: remove table loading goroutine
d18729 db: add a per-tableCacheShard table closing goroutine
9687c6 internal/manifest: add Level type
```

Includes cockroachdb/pebble#722, which partially addresses #49750:

```
name                                     old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=32/size=64kb       641 ± 7%    1158 ± 3%  +80.80%  (p=0.016 n=4+5)

name                                     old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=32/size=64kb       177 ±34%      67 ±31%  -62.13%  (p=0.016 n=4+5)

name                                     old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=32/size=64kb       990 ±15%     584 ± 3%  -41.02%  (p=0.000 n=4+5)

name                                     old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=32/size=64kb     1.34k ±10%   0.79k ± 6%  -41.00%  (p=0.016 n=4+5)
```

Release note: None

49967: cmd/generate-binary: move some decimal encoding tests to auto-gen script r=otan a=arulajmani

`sql/pgwire/testdata/encodings.json` is autogenerated using `cmd/generate-binary/main.go`. Previously, a few tests were missing from this file and would have been lost the next time new tests were added and `encodings.json` was regenerated. This PR fixes that by moving the tests to the auto-gen script.

Release note (none)

49974: build: add build that simply compiles CRDB on supported platforms r=jlinder a=otan

Abstract away the process of building from `./pkg/cmd/publish-*-artifacts`, and use this in `./pkg/cmd/compile-builds`. This is intended to become a CI job for TeamCity.

Build: https://teamcity.cockroachdb.com/admin/editBuild.html?id=buildType:Cockroach_UnitTests_CompileBuilds

Release note: None

Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Co-authored-by: arulajmani <arulajmani@gmail.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
@jbowens' fix to the
Running on 9bc18e0 (current master) shows the following perf with RocksDB as old and Pebble as new:
So we've eliminated the perf difference for that workload. Interestingly, RocksDB's behavior isn't necessarily better. On the
Should we be incorporating this additional RocksDB compaction heuristic into Pebble instead? I'm not sure. I think the RocksDB behavior is unintentional and the performance benefit won't hold up over longer durations. These tests are only running for 10m. Here is a graph of throughput for a 1h run:

That looks bad. What is happening is that the Lbase->Lbase+1 compactions eventually become unblocked by the opening up of Lbase-1. Then there is a ton of compaction backlog to work through, causing performance to fall off a cliff. Here's what Pebble looks like on the same test:

I poked around at some of the other metrics on the Pebble run. It looks like the performance problem is being caused by one node. Here are some disk IO graphs:

Notice how the read IOPS is maxed out on that node. I have the MANIFESTs from all of the nodes and I'm continuing to poke around this cluster to see if there is anything else to see.
Here is the LSM visualization for
Ran the 1h test with L0-sublevels enabled. Performance is lower initially than without L0-sublevels, but more stable over time. Read-amplification never gets out of control:

Here is a MANIFEST from one of the nodes, though
Cc @sumeerbhola and @itsbilal regarding how L0 sublevels perform.
Interesting. The maximization of read IOPS looks very similar to the slow backups we saw in #49710. Maybe readahead needs more tuning, especially if RocksDB is able to get more bytes of throughput for the same number of IOPS as Pebble. Thanks for sending the MANIFEST, taking a look there as well.
I was able to visualize the manifest no problem, here's a zipped HTML file (too large to upload to GitHub): https://drive.google.com/file/d/1CtAjnuoRhHlCvUJF90wNqN_SJhMr7prI/view?usp=sharing
Huh, for me:
Hmm, looks like I just wasn't patient enough. Building the L0 sublevels for each of the 16889 version edits is slow.
On my MacBook:
This visualization is interesting.
I experimented with this without sublevels and it didn't affect the shape. I think it is actually the limited compaction concurrency (
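The parenthetical above is cut off, but the mechanism being referred to (a fixed cap on how many compactions run at once) can be sketched roughly as below. This is an illustration of the concept only, not Pebble's actual compaction scheduler: with a small cap, a burst of eligible compactions turns into a backlog that drains slowly.

```go
// Illustration only, not Pebble's compaction scheduler: a fixed concurrency
// cap means eligible compactions queue up behind the semaphore and are
// drained slowly, producing a compaction backlog.
type compactionScheduler struct {
	sem chan struct{} // capacity == maximum concurrent compactions
}

func newCompactionScheduler(maxConcurrent int) *compactionScheduler {
	return &compactionScheduler{sem: make(chan struct{}, maxConcurrent)}
}

// schedule runs a compaction in a goroutine once a concurrency slot frees up.
func (s *compactionScheduler) schedule(run func()) {
	go func() {
		s.sem <- struct{}{}        // blocks while the cap is reached
		defer func() { <-s.sem }() // release the slot when done
		run()
	}()
}
```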
Yes, FlushSplitBytes was set to 10MB. I think we're splitting on every Lbase file boundary.
@sumeerbhola Here is a MANIFEST.db-size.zip from a run where I tweaked Pebble to use
There is some funky behavior going on with the L0 sublevel compactions. Most of the compactions seem to be for sstables at the end of each level. And I'm mostly seeing L0->Lbase compactions. Where are the intra-L0 compactions? The L0->Lbase compactions reach so high up in the sublevels that they may be blocking intra-L0 compactions.
The problem is not what I thought it was. I added some extra instrumentation about compaction picking decisions. Here is an example early in the run where we are starving L5->L6 compactions:
The columns are "score", "level size", and "level max size". The Lx->Ly entries indicate in-progress compactions. The
I've been experimenting with adjusting the level scoring. An observation on the scores above is that we'll frequently see situations like
This looks a bit dramatic on the surface, though it does nicely prioritize L5->L6. In practice, this has the effect of smoothing the level scores. Here is an example from a run:
Note how the scores for each level are very similar. We've also avoided the inverted LSM shape. Unfortunately, L3 is too large. That is increasing the write-amplification of L0->L3 compactions which we can see in the metrics output:
Notice how much data is being read and written for compactions to L3. That seems suboptimal. Here is a MANIFEST of part of the run. I have some other ideas with which to experiment here.
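For concreteness, here is a rough sketch of the kind of score smoothing being described. It is an illustration only, not the exact heuristic being experimented with: the raw score of a level is its size relative to its target size, and the adjusted score divides by the next level's raw score when that level is itself over target, which deprioritizes compacting into an already-overfull level.

```go
// Illustration only: simplified "smoothed" level scoring. The raw score is
// size/target; the adjusted score divides by the next level's raw score when
// that level is over target, so compacting into an overfull level is
// deprioritized and scores across adjacent levels end up closer together.
func adjustedLevelScores(levelSize, targetSize []int64) []float64 {
	raw := make([]float64, len(levelSize))
	for i := range levelSize {
		raw[i] = float64(levelSize[i]) / float64(targetSize[i])
	}
	adjusted := make([]float64, len(raw))
	for i := range raw {
		adjusted[i] = raw[i]
		if i+1 < len(raw) && raw[i+1] > 1 {
			adjusted[i] /= raw[i+1]
		}
	}
	return adjusted
}
```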
Another tweak to the scoring heuristics reduced the L3 write-amplification:
Throughput was somewhat higher, but read-amplification was significantly higher and showed no signs of ever stopping. I'm not sure that the read-amplification growth is a huge problem, though. It is indicative of the system having trouble keeping up with writes, and the system is indeed having trouble. We could push harder to keep read amplification down, though doing so only further hurts write throughput. Interestingly, the latest heuristic gets rid of the
I think this should be adaptive based on the number of sub-levels. Something like
where minFlushSplitBytes and targetFileSize are configured constants (the 2MB value we currently use for the latter should be ok).
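The concrete formula isn't quoted here, so the following is only a hypothetical sketch of an adaptive scheme along these lines: scale the flush split target with the current number of L0 sub-levels, with minFlushSplitBytes as a floor. The function name and the exact scaling are assumptions for illustration.

```go
// Hypothetical sketch; the exact formula proposed above is not quoted in
// this thread. Scale the flush split target with the current number of L0
// sub-levels so the per-sublevel bytes between flush split keys stays near
// targetFileSize, and never drop below a configured minimum.
func adaptiveFlushSplitBytes(numSublevels int, minFlushSplitBytes, targetFileSize int64) int64 {
	if numSublevels < 1 {
		numSublevels = 1
	}
	v := targetFileSize * int64(numSublevels)
	if v < minFlushSplitBytes {
		v = minFlushSplitBytes
	}
	return v
}
```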
Based on looking at the visualization of this MANIFEST, I think this could be explained by a combination of factors:
I don't quite understand this -- allowing for concurrency in L0->Lbase to avoid wasteful intra-L0 compactions was the main motivation for sub-level compactions. And in that sense it is good that L0->Lbase compactions are reaching higher up in the sub-levels -- this will reduce write amplification since we've picked up all files in a vertical slice across sublevels in one compaction (and without making the compaction huge).
I have a couple of questions about the heuristics being introduced
I still think it is worth trying the ignored heuristics from the prototype https://github.com/sumeerbhola/pebble/blob/sublevel/compaction_picker.go, but I doubt I will have time until the end of this week.
Yeah, I think my comment was just wrong. We don't have intra-L0 compactions. That is perfectly fine and an indication that L0 sublevels is working as designed.
I've been a bit sloppy in my phraseology. The inverted LSM shape is not problematic in and of itself, but I have found that a more normal LSM shape leads to lower write amplification and lower write amplification leads to higher throughput.
The harm from having
I'm going to spend some quality time with these heuristics this week. I'll definitely try and run an experiment incorporating those heuristics as-is.
We could capture that concern directly by trying to avoid
I manually patched in @sumeerbhola's changes to compaction scoring heuristics from cockroachdb/pebble#563. See https://gist.github.com/petermattis/590b45e21774600275b0f6a61ab0d8f8. The LSM metrics at the end of a 1h
Contrast this with the LSM metrics with cockroachdb/pebble#760:
Note the different write-amplification on L3 and L4. Of course, there is some apples-to-oranges comparison here, as there are different compaction concurrency heuristics.
@sumeerbhola MANIFEST.sumeer-heuristics.zip is a MANIFEST from the run mentioned in the previous message.
The difference in the
Should this still be open?
Closing this since (a) tuning 2-level compaction heuristics is likely not a path to improvement (this also came up recently in a TaoBench import benchmark investigation), and (b) we have other issues open to investigate multi-level compactions, etc.
What is your situation?
Starting May 11th, after the merging of #48145, we've begun to observe performance regressions in our large-value KV workloads (4kb and 64kb values). See
https://roachperf.crdb.dev/?filter=&view=kv0%2Fenc%3Dfalse%2Fnodes%3D3%2Fsize%3D64kb&tab=aws
Jira issue: CRDB-4200