
storage: large value performance degradation since switching to pebble #49750

Closed
ajwerner opened this issue Jun 1, 2020 · 32 comments
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-storage Storage Team

Comments

@ajwerner
Contributor

ajwerner commented Jun 1, 2020

What is your situation?

Starting May 11th, after #48145 merged, we began to observe performance regressions in our large-value KV workloads (4kb and 64kb values). See

(roachperf graph of kv0/enc=false/nodes=3/size=64kb throughput on AWS)
https://roachperf.crdb.dev/?filter=&view=kv0%2Fenc%3Dfalse%2Fnodes%3D3%2Fsize%3D64kb&tab=aws

Jira issue: CRDB-4200

@ajwerner ajwerner added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. labels Jun 1, 2020
@nvanbenschoten
Member

Do we remember what happened between April 22-28, 2019? I thought @ajkr landed a tuning fix in there, but I'm having trouble finding the PR. I wonder if Pebble is missing similar tuning, given that it seems to have dropped back down to a similar level.

@ajwerner
Contributor Author

ajwerner commented Jun 1, 2020

Perhaps libroach: enable rocksdb WAL recycling #35591

@ajwerner
Contributor Author

ajwerner commented Jun 1, 2020

Wrong date range though, hmm

@ajwerner
Contributor Author

ajwerner commented Jun 1, 2020

#37172 - seems like it was probably something from this bag of fixes.

@nvanbenschoten
Member

Ah, I think it was facebook/rocksdb#5183 / cockroachdb/rocksdb#29.

@petermattis
Collaborator

Ah, I think it was facebook/rocksdb#5183 / cockroachdb/rocksdb#29.

Pebble should have the same behavior as RocksDB here. Perhaps it is busted somehow. @jbowens is going to track down what is going on. Shouldn't be difficult given how large the delta is.

jbowens added a commit to jbowens/pebble that referenced this issue Jun 2, 2020
Bump LogWriter's pending queue size from 4 to 16. The impact is muted
and not statistically significant with small records but at larger
record sizes the impact is appreciable.

This may help with cockroachdb/cockroach#49750, but I don't think it's
the primary issue.

```
name                       old time/op    new time/op    delta
RecordWrite/size=8-16        32.7ns ± 5%    32.2ns ± 5%     ~     (p=0.055 n=24+25)
RecordWrite/size=16-16       33.8ns ± 7%    33.6ns ± 5%     ~     (p=0.663 n=23+25)
RecordWrite/size=32-16       36.6ns ± 4%    36.6ns ± 9%     ~     (p=0.755 n=23+24)
RecordWrite/size=64-16       41.5ns ± 5%    41.5ns ±12%     ~     (p=0.890 n=24+24)
RecordWrite/size=256-16      68.2ns ± 5%    67.9ns ± 8%     ~     (p=0.679 n=24+24)
RecordWrite/size=1028-16      134ns ± 8%     125ns ± 7%   -6.44%  (p=0.000 n=23+23)
RecordWrite/size=4096-16      357ns ±15%     340ns ± 8%   -4.90%  (p=0.001 n=24+24)
RecordWrite/size=65536-16    5.76µs ±10%    5.17µs ± 7%  -10.32%  (p=0.000 n=25+25)

name                       old speed      new speed      delta
RecordWrite/size=8-16       245MB/s ± 5%   249MB/s ± 5%     ~     (p=0.055 n=24+25)
RecordWrite/size=16-16      472MB/s ± 7%   476MB/s ± 6%     ~     (p=0.532 n=24+25)
RecordWrite/size=32-16      875MB/s ± 4%   875MB/s ± 8%     ~     (p=0.792 n=23+24)
RecordWrite/size=64-16     1.54GB/s ± 5%  1.54GB/s ±11%     ~     (p=0.945 n=24+25)
RecordWrite/size=256-16    3.76GB/s ± 5%  3.77GB/s ± 7%     ~     (p=0.690 n=24+24)
RecordWrite/size=1028-16   7.69GB/s ± 7%  8.22GB/s ± 7%   +6.93%  (p=0.000 n=23+23)
RecordWrite/size=4096-16   11.4GB/s ±13%  12.1GB/s ± 7%   +5.58%  (p=0.001 n=25+24)
RecordWrite/size=65536-16  11.4GB/s ±11%  12.7GB/s ± 7%  +11.39%  (p=0.000 n=25+25)
```
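For context on why the queue depth matters, here is a minimal sketch of the mechanism the commit above tunes. This is not Pebble's actual record.LogWriter; the type names, block size, and structure are invented for illustration. The writer fills fixed-size blocks and hands them to a background flusher over a bounded queue; large records fill blocks on nearly every write, so a queue of 4 makes the writer stall on the flusher far more often than a queue of 16.

```go
package main

import "fmt"

const blockSize = 32 << 10 // illustrative block size, not necessarily Pebble's

// block is a filled WAL block waiting to be written (and eventually synced).
type block struct{ buf []byte }

// logWriter hands filled blocks to a background flusher over a bounded
// queue; the queue depth is the value being bumped from 4 to 16.
type logWriter struct {
	pending chan block
	done    chan struct{}
}

func newLogWriter(queueDepth int) *logWriter {
	w := &logWriter{pending: make(chan block, queueDepth), done: make(chan struct{})}
	go w.flushLoop()
	return w
}

// flushLoop stands in for the goroutine that writes blocks to the log file
// and periodically syncs it.
func (w *logWriter) flushLoop() {
	for b := range w.pending {
		_ = b // real code would do file.Write(b.buf) plus fsync
	}
	close(w.done)
}

// writeRecord chops a record into blocks and enqueues them. When the queue
// is full the send blocks, which is the stall that a deeper queue makes
// rarer for large records.
func (w *logWriter) writeRecord(rec []byte) {
	for len(rec) > 0 {
		n := blockSize
		if n > len(rec) {
			n = len(rec)
		}
		w.pending <- block{buf: rec[:n]}
		rec = rec[n:]
	}
}

func (w *logWriter) close() {
	close(w.pending)
	<-w.done
}

func main() {
	w := newLogWriter(16)               // the new queue depth
	w.writeRecord(make([]byte, 64<<10)) // a 64 KB record spans multiple blocks
	w.close()
	fmt.Println("flushed")
}
```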
jbowens added a commit to cockroachdb/pebble that referenced this issue Jun 2, 2020
craig bot pushed a commit that referenced this issue Jun 8, 2020
49957: vendor: bump Pebble to feb930 r=jbowens a=jbowens

```
feb930 db: close tableCache on open error
660b76 internal/record: bump LogWriter pending queue size
a9b799 db: remove table loading goroutine
d18729 db: add a per-tableCacheShard table closing goroutine
9687c6 internal/manifest: add Level type
```

Includes cockroachdb/pebble#722, which partially addresses #49750:
```
name                                    old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=32/size=64kb     641 ± 7%    1158 ± 3%  +80.80%  (p=0.016 n=4+5)

name                                    old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=32/size=64kb     177 ±34%      67 ±31%  -62.13%  (p=0.016 n=4+5)

name                                    old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=32/size=64kb     990 ±15%     584 ± 3%  -41.02%  (p=0.000 n=4+5)

name                                    old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=32/size=64kb   1.34k ±10%   0.79k ± 6%  -41.00%  (p=0.016 n=4+5)
```

Release note: None

49967: cmd/generate-binary: move some decimal encoding tests to auto-gen script r=otan a=arulajmani

`sql/pgwire/testdata/encodings.json` is autogenerated using
`cmd/generate-binary/main.go`. Previously, a few tests existed only in
`encodings.json` and not in the generator, so they would have been lost
the next time new tests were added and the file was regenerated. This PR
fixes that by moving those tests into the auto-gen script.

Release note (none)

49974: build: add build that simply compiles CRDB on supported platforms r=jlinder a=otan

Abstract away the process of building from
`./pkg/cmd/publish-*-artifacts`, and use this in
`./pkg/cmd/compile-builds`. This is intended to become a CI job for
TeamCity.

Build: https://teamcity.cockroachdb.com/admin/editBuild.html?id=buildType:Cockroach_UnitTests_CompileBuilds

Release note: None



Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Co-authored-by: arulajmani <arulajmani@gmail.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
@petermattis
Collaborator

@jbowens' fix to the LogWriter pending queue size and my change to the compaction concurrency heuristic have helped for some of the large-value workloads, but not for the kv95/.../size=64kb variants. I need to write up some more notes tomorrow, but the short summary is that RocksDB's compaction behavior accidentally reduces compaction concurrency which helps with this workload for short durations (i.e. 10m) but is problematic for longer durations (1h+). Pebble also suffers at longer durations. I've experimented with various tweaks to the compaction concurrency heuristics, but I'm unconvinced these adjustments are worthwhile (they will need to be validated on other workloads). The adjustments also feel fragile, as if I'm over-tuning for this specific workload.

@petermattis
Collaborator

Running on 9bc18e0 (current master) shows the following perf with RocksDB as old and Pebble as new:

name                              old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/size=4kb     2.90k ± 1%   2.97k ± 1%   +2.44%  (p=0.000 n=10+10)
kv0/enc=false/nodes=3/size=64kb      353 ± 1%     261 ± 1%  -26.04%  (p=0.000 n=10+10)
kv95/enc=false/nodes=3/size=4kb    36.4k ± 2%   36.3k ± 3%     ~     (p=0.739 n=10+10)
kv95/enc=false/nodes=3/size=64kb   7.03k ± 1%   5.21k ± 1%  -25.90%  (p=0.000 n=10+10)

So we've eliminated the perf difference for the size=4kb workload, and narrowed it for size=64kb. An analysis of the remaining difference on size=64kb shows that it is due to different compaction behavior. This is true even for the kv95 workload for which only 5% of operations are writes. The large values cause significant compaction pressure and the different behavior from RocksDB and Pebble accounts for this difference.

Interestingly, RocksDB's behavior isn't necessarily better. On the kv95 workload, it has a tendency to encounter the problem described in cockroachdb/pebble#203. Pebble seems to encounter this situation less frequently. I tracked the reason down to one aspect of the RocksDB compaction heuristics: the inflation of Lbase's size using the size of L0. Counterintuitively, the strange shape of the RocksDB LSM actually reduces write-amplification (at the expense of read amplification) and helps the workload in the short term. If I adjust the RocksDB compaction heuristics to look more like the Pebble heuristics, the delta on the size=64kb workloads shrinks:

name                              old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/size=4kb     2.90k ± 1%   2.97k ± 1%   +2.42%  (p=0.000 n=10+10)
kv0/enc=false/nodes=3/size=64kb      303 ± 2%     261 ± 1%  -14.00%  (p=0.000 n=10+10)
kv95/enc=false/nodes=3/size=4kb    37.0k ± 3%   36.3k ± 3%   -2.11%  (p=0.006 n=9+10)
kv95/enc=false/nodes=3/size=64kb   6.03k ± 1%   5.21k ± 1%  -13.63%  (p=0.000 n=9+10)
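As I read the comment above, the RocksDB behavior amounts to scoring Lbase as if it also contained L0's bytes. The sketch below is not RocksDB's or Pebble's actual code, just the mechanical difference with made-up sizes: the inflated score rises with L0's size, so Lbase->Lbase+1 compactions get picked sooner whenever L0 is large.

```go
package main

import "fmt"

// levelInfo holds the current size and the target size of one LSM level.
type levelInfo struct {
	size   int64 // bytes currently in the level
	target int64 // bytes allowed before the level wants a compaction
}

// plainScore is a level's size relative to its target size.
func plainScore(l levelInfo) float64 {
	return float64(l.size) / float64(l.target)
}

// inflatedScore scores Lbase as if L0's bytes were already part of it, which
// is my reading of the RocksDB heuristic described above.
func inflatedScore(lbase levelInfo, l0Bytes int64) float64 {
	return float64(lbase.size+l0Bytes) / float64(lbase.target)
}

func main() {
	lbase := levelInfo{size: 256 << 20, target: 64 << 20} // hypothetical Lbase
	l0Bytes := int64(512 << 20)                           // hypothetical L0 size
	fmt.Printf("plain Lbase score:    %.1f\n", plainScore(lbase))
	fmt.Printf("inflated Lbase score: %.1f\n", inflatedScore(lbase, l0Bytes))
}
```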

Should we be incorporating this additional RocksDB compaction heuristic into Pebble instead? I'm not sure. I think the RocksDB behavior is unintentional and the performance benefit won't hold up over longer durations. These tests are only running for 10m. Here is a graph of throughput for a 1h run of kv95/enc=false/nodes=3/size=64kb on RocksDB:

Screen Shot 2020-06-17 at 10 31 24 AM

That looks bad. What is happening is that the Lbase->Lbase+1 compactions eventually become unblocked by the opening up of Lbase-1. Then there is a ton of compaction backlog to work through, causing performance to fall off a cliff. Here's what Pebble looks like on the same test:

Screen Shot 2020-06-18 at 12 48 04 PM

I poked around at some of the other metrics on the Pebble run. It looks like the performance problem is being caused by one node. Here are some disk IO graphs:

Screen Shot 2020-06-18 at 12 49 41 PM
Screen Shot 2020-06-18 at 12 48 35 PM
Screen Shot 2020-06-18 at 12 49 21 PM

Notice how n2 is "pegged" on read ops and bandwidth. It looks like something on n2 got horked up with compactions, as read-amplification started to increase dramatically at the same time:

Screen Shot 2020-06-18 at 12 52 14 PM

I have the MANIFESTs from all of the nodes and I'm continuing to poke around this cluster to see if there is anything else to see.

@petermattis
Collaborator

Screen Shot 2020-06-18 at 1 06 55 PM

n2 dramatically decreased the number of compactions it was performing right around when the badness started. Spelunking the Pebble log, I sorted the compaction durations to find the compaction that took the longest: 1143.1s (19m). This compaction (JOB 18353) started at 16:20:02 and involved compacting 7.6 GB from L0 + 2.9 GB from L3. It is curious why that compaction took as long as it did. It averaged a measly 6.5 MB/sec. This is again the L0->Lbase compaction problem. It will be interesting to experiment with L0-sublevels and flush splitting, though I suspect that multi-level compactions may be necessary to reduce write-amplification as the read and write bandwidth seems to be nearing a limit.

Here is the LSM visualization for n2 for posterity: 2.MANIFEST.pebble.html.zip

@petermattis
Collaborator

Ran the 1h kv95/enc=false/nodes=3/size=64kb workload on top of #50371 which enables L0-sublevels.

Screen Shot 2020-06-18 at 3 09 24 PM

Performance is lower initially than without L0-sublevels, but more stable over time. Read-amplification never gets out of control:

Screen Shot 2020-06-18 at 3 10 16 PM

Screen Shot 2020-06-18 at 3 10 32 PM
Screen Shot 2020-06-18 at 3 10 39 PM
Screen Shot 2020-06-18 at 3 10 47 PM
Screen Shot 2020-06-18 at 3 10 53 PM
Screen Shot 2020-06-18 at 3 11 02 PM

Here is a MANIFEST from one of the nodes, though pebble lsm is refusing to visualize it.

@petermattis
Collaborator

Cc @sumeerbhola and @itsbilal regarding how L0 sublevels perform.

@itsbilal
Member

Interesting. The pegged read IOPS look very similar to the slow backups we saw in #49710. Maybe readahead needs more tuning, especially if RocksDB is able to get more byte throughput than Pebble for the same number of IOPS.

Thanks for sending the MANIFEST, taking a look there as well.

@itsbilal
Member

I was able to visualize the manifest without a problem; here's a zipped HTML file (too large to upload to GitHub): https://drive.google.com/file/d/1CtAjnuoRhHlCvUJF90wNqN_SJhMr7prI/view?usp=sharing

@petermattis
Collaborator

Huh, for me pebble lsm on the above MANIFEST just spins. kill -QUIT shows it sitting in a NewL0Sublevels call. How long did that take to generate for you?

@petermattis
Collaborator

Hmm, looks like I just wasn't patient enough. Building the L0 sublevels for each of the 16889 version edits is slow.

@itsbilal
Member

On my Macbook:
./pebble lsm ./1.MANIFEST.pebble > output3.html 84.04s user 2.78s system 106% cpu 1:21.21 total

@sumeerbhola
Collaborator

This visualization is interesting.

  • We have 14GB in the LSM (of which 6GB is in L6) before we make L3 the base level. We should definitely make that change that uses total bytes to compute target bytes for L6 and higher.
  • The L0 files are narrow (unlike the slack thread discussion from yesterday which was without L0 sub-levels), so flush splits are somewhat working. But most of the L0 sstables are quite tiny. Was FlushSplitBytes set to 10MB? Around the 14000 tick mark we have 3200 files for 1.6GB in L0 -- that is too many files.
  • And when one scrolls over the sublevels slowly starting from the left (again at the 14000 tick mark) one can see that the splits are quite poor -- one needs to get to almost the halfway point before the view scrolls forward for L3 and lower, and then it starts moving forward rapidly for those lower levels. I suspect we could get much better performance with better split points.

@petermattis
Collaborator

We have 14GB in the LSM (of which 6GB is in L6) before we make L3 the base level. We should definitely make that change that uses total bytes to compute target bytes for L6 and higher.

I experimented with this without sublevels and it didn't affect the shape. I think it is actually the limited compaction concurrency (MaxConcurrentCompactions = 3) that has the bigger effect. I'll definitely try it out, though.

The L0 files are narrow (unlike the slack thread discussion from yesterday which was without L0 sub-levels), so flush splits are somewhat working. But most of the L0 sstables are quite tiny. Was FlushSplitBytes set to 10MB? Around the 14000 tick mark we have 3200 files for 1.6GB in L0 -- that is too many files.

Yes, FlushSplitBytes was set to 10MB. I think we're splitting on every Lbase file boundary.

@petermattis
Collaborator

@sumeerbhola Here is a MANIFEST.db-size.zip from a run where I tweaked Pebble to use dbSize rather than bottomLevelSize, exactly as your L0-sublevel code originally did. You'll have to get a new version of pebble in order to get the L0-sublevel visualization (the visualization is ~400MB). You'll notice that L6 is even smaller in this run than when using bottomLevelSize. That matches my previous experience. I think we're starving L5->L6 compactions due to lack of compaction concurrency. Or perhaps we're allowing too many L0->Lbase compactions concurrently. Do we ever need more than one L0->Lbase compaction?

@petermattis
Collaborator

There is some funky behavior going on with the L0 sublevel compactions. Most of the compactions seem to be for sstables at the end of each level. And I'm mostly seeing L0->Lbase compactions. Where are the intra-L0 compactions? The L0->Lbase compactions reach so high up in the sublevels that they may be blocking intra-L0 compactions.

@petermattis
Collaborator

The problem is not what I thought it was. I added some extra instrumentation about compaction picking decisions. Here is an example early in the run where we are starving L5->L6 compactions:

  *L0:   5.0     551 M     8.0 E  [L0->L4]
   L4:  15.5     767 M      64 M  [L4->L5]
   L5:   4.5     1.9 G     437 M
   L6:   0.0     450 M     2.9 G

The columns are "score", "level size", and "level max size". The Lx->Ly annotations indicate in-progress compactions. The * marks the level we've chosen a new compaction for. While L5 is larger than L6, the scoring considers that less of a problem than the size of L0 and L4. So we end up only performing L0->L4 and L4->L5 compactions.
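To make the table above concrete, here is a toy version of the pick. Pebble's real compaction picker works with compensated sizes and per-file conflict checks rather than a whole-level "compacting" flag, and the L0 special case below is a simplification, so treat this as a sketch of the idea only: choose the highest-scoring eligible level, where levels already feeding an in-progress compaction are skipped, except L0 (which, with sublevels, may source several concurrent compactions).

```go
package main

import "fmt"

// levelState summarizes one level as printed in the table above.
type levelState struct {
	level      int
	score      float64 // level size relative to its target size
	compacting bool    // an Lx->Ly compaction is already running out of this level
}

// pickCompactionLevel returns the highest-scoring eligible level with a
// score of at least 1, or -1 if nothing qualifies.
func pickCompactionLevel(levels []levelState) int {
	best, bestScore := -1, 1.0
	for _, l := range levels {
		if l.compacting && l.level != 0 {
			continue // the level is already busy (L0 is exempt)
		}
		if l.score >= bestScore {
			best, bestScore = l.level, l.score
		}
	}
	return best
}

func main() {
	// Roughly the situation above: L0->L4 and L4->L5 are running, so L5 and
	// L6 compete with L0, and L0 wins again.
	levels := []levelState{
		{level: 0, score: 5.0, compacting: true},
		{level: 4, score: 15.5, compacting: true},
		{level: 5, score: 4.5},
		{level: 6, score: 0.0},
	}
	fmt.Println("next compaction starts from L", pickCompactionLevel(levels))
}
```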

@petermattis
Collaborator

I've been experimenting with adjusting the level scoring. An observation on the scores above is that we'll frequently see situations like L4->L5 which appear higher priority than L5->L6, but in fact only hurt our future desired state. To account for that, I experimented with adjusting each level's score (for L1-L6) by dividing by the next level's score. For the above data, we'd have something like:

        old-score  new-score  level-size  target-size
   L0:        5.0        5.5       551 M        8.0 E
   L4:       15.5        3.5       767 M         64 M
   L5:        4.5       30.0       1.9 G        437 M
   L6:        0.2        0.2       450 M        2.9 G
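A rough sketch of the adjustment, under the simplifying assumption that a level's raw score is a plain number like the "old-score" column (the real scores involve size compensation and smoothing, which is why the adjusted values in the table don't come out of a bare division):

```go
package main

import "fmt"

// adjustScores divides each level's raw score by the raw score of the next
// level down, so a compaction like L4->L5 only looks urgent if L5 itself is
// in reasonable shape. The 0.01 floor is a made-up guard against dividing by
// a near-zero score; the last (bottom) level is left unadjusted.
func adjustScores(raw []float64) []float64 {
	adjusted := make([]float64, len(raw))
	copy(adjusted, raw)
	for i := 0; i+1 < len(raw); i++ {
		next := raw[i+1]
		if next < 0.01 {
			next = 0.01
		}
		adjusted[i] = raw[i] / next
	}
	return adjusted
}

func main() {
	// Raw scores in the spirit of the table above, for L4, L5, L6.
	raw := []float64{15.5, 4.5, 0.2}
	fmt.Println(adjustScores(raw)) // L5 now outranks L4 by a wide margin
}
```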

The adjusted scores look a bit dramatic on the surface, though they do nicely prioritize L5->L6. In practice, this adjustment has the effect of smoothing the level scores. Here is an example from a run:

   L0:   6.0     558 M     8.0 E  [L0->L3]
   L3:   6.1     5.6 G      64 M
  *L4:   6.1     6.8 G     473 M  [L4->L5]
   L5:   6.0     8.3 G     3.4 G
   L6:   0.0      10 G      25 G

Note how the scores for each level are very similar. We've also avoided the inverted LSM shape. Unfortunately, L3 is too large. That is increasing the write-amplification of L0->L3 compactions, which we can see in the metrics output:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         4   218 M       -    64 G       -       -       -       -    64 G       -       -       -     1.0
      0      2449   697 M    6.50    64 G     0 B       0     0 B       0    62 G   110 K     0 B      21     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3      1423   5.5 G    6.00    49 G     0 B       0     0 B       0   206 G    53 K   229 G       1     4.3
      4      1216   7.0 G    6.10    21 G     0 B       0    11 G   2.8 K    57 G    11 K    61 G       1     2.7
      5       859   8.6 G    6.01    13 G     0 B       0   8.2 G   1.5 K    41 G   4.4 K    42 G       1     3.1
      6       460    11 G       -    12 G     0 B       0    85 M      24    40 G   2.3 K    41 G       1     3.4
  total      6407    32 G       -    64 G     0 B       0    19 G   4.4 K   470 G   181 K   374 G      25     7.4

Notice how much data is being read and written for compactions to L3. That seems suboptimal.

Screen Shot 2020-06-19 at 1 18 54 PM
Screen Shot 2020-06-19 at 1 19 06 PM

Here is a MANIFEST of part of the run. I have some other ideas with which to experiment here.

@petermattis
Collaborator

Another tweak to the scoring heuristics reduced the L3 write-amplification:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1    60 M       -    89 G       -       -       -       -    89 G       -       -       -     1.0
      0     20580   3.6 G    2.72    89 G     0 B       0     0 B       0    85 G   245 K   6.2 M     111     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3       272   1.1 G    2.80    69 G     0 B       0     0 B       0    59 G    15 K    90 G       1     0.9
      4       520   3.2 G    2.79    37 G     0 B       0    12 G   3.0 K   101 G    16 K   107 G       1     2.7
      5       841   9.4 G    2.79    30 G     0 B       0    11 G   1.8 K   109 G   9.9 K   111 G       1     3.7
      6      1105    28 G       -    29 G     0 B       0    92 M      27   116 G   5.5 K   117 G       1     4.0
  total     23318    45 G       -    89 G     0 B       0    23 G   4.9 K   560 G   292 K   425 G     115     6.3

Throughput was somewhat higher, but read-amplification was significantly higher and showed no signs of ever stopping.

Screen Shot 2020-06-19 at 2 22 21 PM
Screen Shot 2020-06-19 at 2 22 31 PM

I'm not sure that the read-amplification growth is a huge problem, though. It is indicative of the system having trouble keeping up with writes, but the system really is having trouble keeping up. We could push harder to keep read amplification down, though doing so only further hurts write throughput.

Interestingly, the latest heuristic gets rid of the L0CompactionThreshold tunable. That feels like progress.

@sumeerbhola
Collaborator

Yes, FlushSplitBytes was set to 10MB.

I think this should be adaptive based on the number of sub-levels. Something like

max(minFlushSplitBytes, targetFileSize * numSubLevels)

where minFlushSplitBytes and targetFileSize are configured constants (the 2MB value we currently use for the latter should be ok).
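A minimal sketch of that formula; the function name is mine, and the 10MB and 2MB constants are the values mentioned in this thread:

```go
package main

import "fmt"

const (
	minFlushSplitBytes = 10 << 20 // 10 MB, the value used in the runs above
	targetFileSize     = 2 << 20  // 2 MB, the current target file size
)

// adaptiveFlushSplitBytes grows the flush split threshold with the number of
// L0 sublevels, so a deep L0 is flushed into fewer, larger sstables.
func adaptiveFlushSplitBytes(numSublevels int) int64 {
	v := int64(targetFileSize) * int64(numSublevels)
	if v < minFlushSplitBytes {
		return minFlushSplitBytes
	}
	return v
}

func main() {
	for _, n := range []int{1, 5, 20, 50} {
		fmt.Printf("sublevels=%2d -> flush split bytes = %d MB\n",
			n, adaptiveFlushSplitBytes(n)>>20)
	}
}
```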

You'll notice that L6 is even smaller in this run than when using bottomLevelSize. That matches my previous experience. I think we're starving L5->L6 compactions due to lack of compaction concurrency. Or perhaps we're allowing too many L0->Lbase compactions concurrently.

Based on looking at the visualization of this MANIFEST, I think this could be explained by a combination of factors:

  • since we start using L3 earlier we have one more level that is potentially demanding compaction.
  • the target size for Lbase and Lbase+1 is primarily a function of the constant LBaseMaxBytes (ignoring the smoothing of the level multiplier), so their scores will be unaffected by this change. But L5's score is decreased when we no longer use bottomLevelSize, so it loses to L0, L3, and L4 more often.

Do we ever need more than one L0->Lbase compaction?
...
Where are the intra-L0 compactions? The L0->Lbase compactions reach so high up in the sublevels that they may be blocking intra-L0 compactions.

I don't quite understand this -- allowing for concurrency in L0->Lbase to avoid wasteful intra-L0 compactions was the main motivation for sub-level compactions. And in that sense it is good that L0->Lbase are reaching higher up in the sub-levels -- this will reduce write amplification since we've picked up all files in a vertical slice across sublevels in one compaction (and without making the compaction huge).

@sumeerbhola
Collaborator

I have a couple of questions about the heuristics being introduced

I still think it is worth trying the ignored heuristics from the prototype https://github.com/sumeerbhola/pebble/blob/sublevel/compaction_picker.go, but I doubt I will have time until the end of this week.

@petermattis
Collaborator

Where are the intra-L0 compactions? The L0->Lbase compactions reach so high up in the sublevels that they may be blocking intra-L0 compactions.

I don't quite understand this -- allowing for concurrency in L0->Lbase to avoid wasteful intra-L0 compactions was the main motivation for sub-level compactions. And in that sense it is good that L0->Lbase are reaching higher up in the sub-levels -- this will reduce write amplification since we've picked up all files in a vertical slice across sublevels in one compaction (and without making the compaction huge).

Yeah, I think my comment was just wrong. We don't have intra-L0 compactions. That is perfectly fine and an indication that L0 sublevels is working as designed.

what is bad about the shape in #49750 (comment) that motivates trying to improve it? The read amplification is low. Is the write amplification higher than what the revised heuristics achieve?

I've been a bit sloppy in my phraseology. The inverted LSM shape is not problematic in and of itself, but I have found that a more normal LSM shape leads to lower write amplification and lower write amplification leads to higher throughput.

Regarding the heuristic change in #49750 (comment), this seems roughly to be saying "don't let a level become much larger than the next lower level". Intuitively this seems harmless, since compacting from a very large level to a small level has small write amplification. This is different from the currentByteRatios heuristic in
https://github.com/sumeerbhola/pebble/blob/sublevel/compaction_picker.go#L499-L505, which was roughly "don't let a level become much larger than the next higher level", since that does have an effect on write amplification when compacting down from that higher level.

The harm from having size(Ln) >> size(Ln+1) is that compacting into Ln then becomes more expensive. I agree that the subsequent compaction into Ln+1 doesn't add much to write amplification. I also agree with "don't let a level become much larger than the next higher level". The change to the scoring heuristic in cockroachdb/pebble#760 has this effect. I need to wrap my head around the currentByteRatios heuristic as I don't fully understand the difference between it and what is done in cockroachdb/pebble#760.

I still think it is worth trying the ignored heuristics from the prototype https://github.com/sumeerbhola/pebble/blob/sublevel/compaction_picker.go, but I doubt I will have time until the end of this week.

I'm going to spend some quality time with these heuristics this week. I'll definitely try and run an experiment incorporating those heuristics as-is.

@sumeerbhola
Collaborator

The harm from having size(Ln) >> size(Ln+1) is that compacting into Ln then becomes more expensive. I agree that the subsequent compaction into Ln+1 doesn't add much to write amplification.

We could capture that concern directly by trying to avoid size(Ln-1) << size(Ln), which is what currentByteRatios tries to do.
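Not the prototype's actual currentByteRatios code (that lives in the linked compaction_picker.go), just a guess at the shape of such a check, with a made-up 10x threshold: boost a level's score when it has grown much larger than the level above it, so it is drained into the next level before the imbalance makes compactions into it expensive.

```go
package main

import "fmt"

// boostForByteRatio bumps level n's score when size(Ln-1) << size(Ln), i.e.
// the level is much larger than the level above it. The 10x threshold and
// the scaling are illustrative only.
func boostForByteRatio(score float64, sizeAbove, size int64) float64 {
	if sizeAbove == 0 {
		return score
	}
	ratio := float64(size) / float64(sizeAbove)
	if ratio > 10 {
		score *= ratio / 10
	}
	return score
}

func main() {
	// Hypothetical: L5 is ~40x larger than L4, so its score is boosted 4x.
	fmt.Println(boostForByteRatio(1.2, 256<<20, 10<<30))
}
```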

@petermattis
Collaborator

I'm going to spend some quality time with these heuristics this week. I'll definitely try and run an experiment incorporating those heuristics as-is.

I manually patched in @sumeerbhola's changes to the compaction scoring heuristics from cockroachdb/pebble#563. See https://gist.github.com/petermattis/590b45e21774600275b0f6a61ab0d8f8.

The LSM metrics at the end of a 1h kv95/enc=false/nodes=3/size=64kb run show:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1    47 M       -    77 G       -       -       -       -    77 G       -       -       -     1.0
      0      1930   650 M    8.50    77 G     0 B       0     0 B       0    74 G   124 K   276 M      21     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3       803   3.1 G   52.46    60 G     0 B       0     0 B       0   196 G    51 K   217 G       1     3.3
      4      2244    15 G   30.93    48 G     0 B       0   289 M      79   159 G    28 K   171 G       1     3.3
      5      1557    19 G    4.92    22 G     0 B       0    29 M       8    57 G   6.4 K    60 G       1     2.6
      6        76   549 M       -   557 M     0 B       0   105 M      29   1.5 G     248   1.6 G       1     2.8
  total      6610    39 G       -    77 G     0 B       0   423 M     116   565 G   210 K   449 G      25     7.3

Contrast this with the LSM metrics with cockroachdb/pebble#760:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         2    90 M       -    90 G       -       -       -       -    90 G       -       -       -     1.0
      0      9184   2.8 G    3.02    90 G     0 B       0     0 B       0    85 G   186 K     0 B      50     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3       360   1.4 G    3.22    72 G     0 B       0     0 B       0    85 G    22 K   112 G       1     1.2
      4       623   3.8 G    3.21    46 G     0 B       0   7.3 G   1.9 K   106 G    18 K   112 G       1     2.3
      5       873    10 G    3.22    41 G     0 B       0   3.6 G     940   104 G    10 K   108 G       1     2.6
      6      1096    28 G       -    30 G     0 B       0    43 M      22    93 G   4.5 K    95 G       1     3.1
  total     12136    46 G       -    90 G     0 B       0    11 G   2.8 K   563 G   241 K   428 G      54     6.3

Note the different write-amplification on L3 and L4. Of course, there is some apples-to-oranges comparison here, as there are different compaction concurrency heuristics.

@petermattis
Collaborator

@sumeerbhola MANIFEST.sumeer-heuristics.zip is a MANIFEST from the run mentioned in the previous message.

@sumeerbhola
Collaborator

The difference in the BytesMoved is significant. Maybe that is a downside of too many bytes in level i+1 -- even if level i also has a lot of bytes, the probability of being able to move from i to i+1 gets lowered.
The score computation is the same, so it's unclear to me why the first run has a significantly higher L0 score despite far fewer bytes in L0.

@ajwerner
Contributor Author

should this still be open?

@sumeerbhola
Collaborator

Closing this since (a) tuning 2-level compaction heuristics is likely not a path to improvement (this also came up recently in a TaoBench import benchmark investigation), and (b) we have other issues open to investigate multi-level compactions, etc.

@jbowens jbowens added this to Storage Jun 4, 2024
@jbowens jbowens moved this to Done in Storage Jun 4, 2024