
storage: large value performance degradation since switching to pebble #49750

Closed
ajwerner opened this issue Jun 1, 2020 · 32 comments
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-storage Storage Team

Comments

@ajwerner
Contributor

ajwerner commented Jun 1, 2020

What is your situation?

Starting May 11th, after #48145 merged, we began to observe performance regressions in our large-value KV workloads (4kb and 64kb values). See

(roachperf graph of kv0/enc=false/nodes=3/size=64kb throughput on AWS)
https://roachperf.crdb.dev/?filter=&view=kv0%2Fenc%3Dfalse%2Fnodes%3D3%2Fsize%3D64kb&tab=aws

Jira issue: CRDB-4200

@ajwerner ajwerner added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. labels Jun 1, 2020
@nvanbenschoten
Member

Do we remember what happened between April 22-28, 2019? I thought @ajkr landed a tuning fix in there, but I'm having trouble finding the PR. I wonder if Pebble is missing similar tuning, given that it seems to have dropped back down to a similar level.

@ajwerner
Contributor Author

ajwerner commented Jun 1, 2020

Perhaps libroach: enable rocksdb WAL recycling #35591

@ajwerner
Contributor Author

ajwerner commented Jun 1, 2020

Wrong date range though, hmm

@ajwerner
Contributor Author

ajwerner commented Jun 1, 2020

#37172 - seems like it was probably something from this bag of fixes.

@nvanbenschoten
Member

Ah, I think it was facebook/rocksdb#5183 / cockroachdb/rocksdb#29.

@petermattis
Collaborator

Ah, I think it was facebook/rocksdb#5183 / cockroachdb/rocksdb#29.

Pebble should have the same behavior as RocksDB here. Perhaps it is busted somehow. @jbowens is going to track down what is going on. Shouldn't be difficult given how large the delta is.

jbowens added a commit to jbowens/pebble that referenced this issue Jun 2, 2020
Bump LogWriter's pending queue size from 4 to 16. The impact is muted
and not statistically significant with small records but at larger
record sizes the impact is appreciable.

This may help with cockroachdb/cockroach#49750, but I don't think it's
the primary issue.

```
name                       old time/op    new time/op    delta
RecordWrite/size=8-16        32.7ns ± 5%    32.2ns ± 5%     ~     (p=0.055 n=24+25)
RecordWrite/size=16-16       33.8ns ± 7%    33.6ns ± 5%     ~     (p=0.663 n=23+25)
RecordWrite/size=32-16       36.6ns ± 4%    36.6ns ± 9%     ~     (p=0.755 n=23+24)
RecordWrite/size=64-16       41.5ns ± 5%    41.5ns ±12%     ~     (p=0.890 n=24+24)
RecordWrite/size=256-16      68.2ns ± 5%    67.9ns ± 8%     ~     (p=0.679 n=24+24)
RecordWrite/size=1028-16      134ns ± 8%     125ns ± 7%   -6.44%  (p=0.000 n=23+23)
RecordWrite/size=4096-16      357ns ±15%     340ns ± 8%   -4.90%  (p=0.001 n=24+24)
RecordWrite/size=65536-16    5.76µs ±10%    5.17µs ± 7%  -10.32%  (p=0.000 n=25+25)

name                       old speed      new speed      delta
RecordWrite/size=8-16       245MB/s ± 5%   249MB/s ± 5%     ~     (p=0.055 n=24+25)
RecordWrite/size=16-16      472MB/s ± 7%   476MB/s ± 6%     ~     (p=0.532 n=24+25)
RecordWrite/size=32-16      875MB/s ± 4%   875MB/s ± 8%     ~     (p=0.792 n=23+24)
RecordWrite/size=64-16     1.54GB/s ± 5%  1.54GB/s ±11%     ~     (p=0.945 n=24+25)
RecordWrite/size=256-16    3.76GB/s ± 5%  3.77GB/s ± 7%     ~     (p=0.690 n=24+24)
RecordWrite/size=1028-16   7.69GB/s ± 7%  8.22GB/s ± 7%   +6.93%  (p=0.000 n=23+23)
RecordWrite/size=4096-16   11.4GB/s ±13%  12.1GB/s ± 7%   +5.58%  (p=0.001 n=25+24)
RecordWrite/size=65536-16  11.4GB/s ±11%  12.7GB/s ± 7%  +11.39%  (p=0.000 n=25+25)
```
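For context on why the queue depth matters, here is a minimal sketch of the mechanism the commit above tunes. This is not Pebble's actual record.LogWriter; the type names, block size, and structure are invented for illustration. The writer fills fixed-size blocks and hands them to a background flusher over a bounded queue; large records fill blocks on nearly every write, so a queue of 4 makes the writer stall on the flusher far more often than a queue of 16.

```go
package main

import "fmt"

const blockSize = 32 << 10 // illustrative block size, not necessarily Pebble's

// block is a filled WAL block waiting to be written (and eventually synced).
type block struct{ buf []byte }

// logWriter hands filled blocks to a background flusher over a bounded
// queue; the queue depth is the value being bumped from 4 to 16.
type logWriter struct {
	pending chan block
	done    chan struct{}
}

func newLogWriter(queueDepth int) *logWriter {
	w := &logWriter{pending: make(chan block, queueDepth), done: make(chan struct{})}
	go w.flushLoop()
	return w
}

// flushLoop stands in for the goroutine that writes blocks to the log file
// and periodically syncs it.
func (w *logWriter) flushLoop() {
	for b := range w.pending {
		_ = b // real code would do file.Write(b.buf) plus fsync
	}
	close(w.done)
}

// writeRecord chops a record into blocks and enqueues them. When the queue
// is full the send blocks, which is the stall that a deeper queue makes
// rarer for large records.
func (w *logWriter) writeRecord(rec []byte) {
	for len(rec) > 0 {
		n := blockSize
		if n > len(rec) {
			n = len(rec)
		}
		w.pending <- block{buf: rec[:n]}
		rec = rec[n:]
	}
}

func (w *logWriter) close() {
	close(w.pending)
	<-w.done
}

func main() {
	w := newLogWriter(16)               // the new queue depth
	w.writeRecord(make([]byte, 64<<10)) // a 64 KB record spans multiple blocks
	w.close()
	fmt.Println("flushed")
}
```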
jbowens added a commit to cockroachdb/pebble that referenced this issue Jun 2, 2020
craig bot pushed a commit that referenced this issue Jun 8, 2020
49957: vendor: bump Pebble to feb930 r=jbowens a=jbowens

```
feb930 db: close tableCache on open error
660b76 internal/record: bump LogWriter pending queue size
a9b799 db: remove table loading goroutine
d18729 db: add a per-tableCacheShard table closing goroutine
9687c6 internal/manifest: add Level type
```

Includes cockroachdb/pebble#722, which partially addresses #49750:
```
name                                    old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=32/size=64kb     641 ± 7%    1158 ± 3%  +80.80%  (p=0.016 n=4+5)

name                                    old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=32/size=64kb     177 ±34%      67 ±31%  -62.13%  (p=0.016 n=4+5)

name                                    old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=32/size=64kb     990 ±15%     584 ± 3%  -41.02%  (p=0.000 n=4+5)

name                                    old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=32/size=64kb   1.34k ±10%   0.79k ± 6%  -41.00%  (p=0.016 n=4+5)
```

Release note: None

49967: cmd/generate-binary: move some decimal encoding tests to auto-gen script r=otan a=arulajmani

`sql/pgwire/testdata/encodings.json` is autogenerated using
`cmd/generate-binary/main.go`. Previously, a few tests existed only in
`encodings.json` and not in the generator, so they would have been lost
the next time new tests were added and the file was regenerated. This PR
fixes that by moving those tests into the auto-gen script.

Release note (none)

49974: build: add build that simply compiles CRDB on supported platforms r=jlinder a=otan

Abstract away the process of building from
`./pkg/cmd/publish-*-artifacts`, and use this in
`./pkg/cmd/compile-builds`. This is intended to become a CI job for
TeamCity.

Build: https://teamcity.cockroachdb.com/admin/editBuild.html?id=buildType:Cockroach_UnitTests_CompileBuilds

Release note: None



Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
Co-authored-by: arulajmani <arulajmani@gmail.com>
Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
@petermattis
Collaborator

@jbowens' fix to the LogWriter pending queue size and my change to the compaction concurrency heuristic have helped for some of the large-value workloads, but not for the kv95/.../size=64kb variants. I need to write up some more notes tomorrow, but the short summary is that RocksDB's compaction behavior accidentally reduces compaction concurrency which helps with this workload for short durations (i.e. 10m) but is problematic for longer durations (1h+). Pebble also suffers at longer durations. I've experimented with various tweaks to the compaction concurrency heuristics, but I'm unconvinced these adjustments are worthwhile (they will need to be validated on other workloads). The adjustments also feel fragile, as if I'm over-tuning for this specific workload.

@petermattis
Collaborator

Running on 9bc18e0 (current master) shows the following perf with RocksDB as old and Pebble as new:

name                              old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/size=4kb     2.90k ± 1%   2.97k ± 1%   +2.44%  (p=0.000 n=10+10)
kv0/enc=false/nodes=3/size=64kb      353 ± 1%     261 ± 1%  -26.04%  (p=0.000 n=10+10)
kv95/enc=false/nodes=3/size=4kb    36.4k ± 2%   36.3k ± 3%     ~     (p=0.739 n=10+10)
kv95/enc=false/nodes=3/size=64kb   7.03k ± 1%   5.21k ± 1%  -25.90%  (p=0.000 n=10+10)

So we've eliminated the perf difference for the size=4kb workload, and narrowed it for size=64kb. An analysis of the remaining difference on size=64kb shows that it is due to different compaction behavior. This is true even for the kv95 workload for which only 5% of operations are writes. The large values cause significant compaction pressure and the different behavior from RocksDB and Pebble accounts for this difference.

Interestingly, RocksDB's behavior isn't necessarily better. On the kv95 workload, it has a tendency to encounter the problem described in cockroachdb/pebble#203. Pebble seems to encounter this situation less frequently. I tracked the reason down to one aspect of the RocksDB compaction heuristics: the inflation of Lbase's size using the size of L0. Counterintuitively, the strange shape of the RocksDB LSM actually reduces write-amplification (at the expense of read amplification) and helps the workload in the short term. If I adjust the RocksDB compaction heuristics to look more like the Pebble heuristics, the delta on the size=64kb workloads shrinks:

name                              old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/size=4kb     2.90k ± 1%   2.97k ± 1%   +2.42%  (p=0.000 n=10+10)
kv0/enc=false/nodes=3/size=64kb      303 ± 2%     261 ± 1%  -14.00%  (p=0.000 n=10+10)
kv95/enc=false/nodes=3/size=4kb    37.0k ± 3%   36.3k ± 3%   -2.11%  (p=0.006 n=9+10)
kv95/enc=false/nodes=3/size=64kb   6.03k ± 1%   5.21k ± 1%  -13.63%  (p=0.000 n=9+10)
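As I read the comment above, the RocksDB behavior amounts to scoring Lbase as if it also contained L0's bytes. The sketch below is not RocksDB's or Pebble's actual code, just the mechanical difference with made-up sizes: the inflated score rises with L0's size, so Lbase->Lbase+1 compactions get picked sooner whenever L0 is large.

```go
package main

import "fmt"

// levelInfo holds the current size and the target size of one LSM level.
type levelInfo struct {
	size   int64 // bytes currently in the level
	target int64 // bytes allowed before the level wants a compaction
}

// plainScore is a level's size relative to its target size.
func plainScore(l levelInfo) float64 {
	return float64(l.size) / float64(l.target)
}

// inflatedScore scores Lbase as if L0's bytes were already part of it, which
// is my reading of the RocksDB heuristic described above.
func inflatedScore(lbase levelInfo, l0Bytes int64) float64 {
	return float64(lbase.size+l0Bytes) / float64(lbase.target)
}

func main() {
	lbase := levelInfo{size: 256 << 20, target: 64 << 20} // hypothetical Lbase
	l0Bytes := int64(512 << 20)                           // hypothetical L0 size
	fmt.Printf("plain Lbase score:    %.1f\n", plainScore(lbase))
	fmt.Printf("inflated Lbase score: %.1f\n", inflatedScore(lbase, l0Bytes))
}
```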

Should we be incorporating this additional RocksDB compaction heuristic into Pebble instead? I'm not sure. I think the RocksDB behavior is unintentional and the performance benefit won't hold up over longer durations. These tests are only running for 10m. Here is a graph of throughput for a 1h run of kv95/enc=false/nodes=3/size=64kb on RocksDB:

Screen Shot 2020-06-17 at 10 31 24 AM

That looks bad. What is happening is that the Lbase->Lbase+1 compactions eventually become unblocked by the opening up of Lbase-1. Then there is a ton of compaction backlog to work through, causing performance to fall off a cliff. Here's what Pebble looks like on the same test:

Screen Shot 2020-06-18 at 12 48 04 PM

I poked around at some of the other metrics on the Pebble run. It looks like the performance problem is being caused by one node. Here are some disk IO graphs:

Screen Shot 2020-06-18 at 12 49 41 PM
Screen Shot 2020-06-18 at 12 48 35 PM
Screen Shot 2020-06-18 at 12 49 21 PM

Notice how n2 is "pegged" on read ops and bandwidth. It looks like something on n2 got horked up with compactions, as read-amplification started to increase dramatically at the same time:

Screen Shot 2020-06-18 at 12 52 14 PM

I have the MANIFESTs from all of the nodes and I'm continuing to poke around this cluster to see if there is anything else to see.

@petermattis
Collaborator

Screen Shot 2020-06-18 at 1 06 55 PM

n2 dramatically decreased the number of compactions it was performing right around when the badness started. Spelunking the Pebble log, I sorted the compaction durations to find the compaction that took the longest: 1143.1s (19m). This compaction (JOB 18353) started at 16:20:02 and involved compacting 7.6 GB from L0 + 2.9 GB from L3. It is curious why that compaction took as long as it did. It averaged a measly 6.5 MB/sec. This is again the L0->Lbase compaction problem. It will be interesting to experiment with L0-sublevels and flush splitting, though I suspect that multi-level compactions may be necessary to reduce write-amplification as the read and write bandwidth seems to be nearing a limit.

Here is the LSM visualization for n2 for posterity: 2.MANIFEST.pebble.html.zip

@petermattis
Collaborator

Ran the 1h kv95/enc=false/nodes=3/size=64kb workload on top of #50371 which enables L0-sublevels.

Screen Shot 2020-06-18 at 3 09 24 PM

Performance is lower initially than without L0-sublevels, but more stable over time. Read-amplification never gets out of control:

Screen Shot 2020-06-18 at 3 10 16 PM

Screen Shot 2020-06-18 at 3 10 32 PM
Screen Shot 2020-06-18 at 3 10 39 PM
Screen Shot 2020-06-18 at 3 10 47 PM
Screen Shot 2020-06-18 at 3 10 53 PM
Screen Shot 2020-06-18 at 3 11 02 PM

Here is a MANIFEST from one of the nodes, though pebble lsm is refusing to visualize it.

@petermattis
Collaborator

Cc @sumeerbhola and @itsbilal regarding how L0 sublevels perform.

@itsbilal
Member

Interesting. The pegged read IOPS look very similar to the slow backups we saw in #49710. Maybe readahead needs more tuning, especially if RocksDB is able to get more byte throughput than Pebble for the same number of IOPS.

Thanks for sending the MANIFEST, taking a look there as well.

@itsbilal
Member

I was able to visualize the manifest without a problem; here's a zipped HTML file (too large to upload to GitHub): https://drive.google.com/file/d/1CtAjnuoRhHlCvUJF90wNqN_SJhMr7prI/view?usp=sharing

@petermattis
Collaborator

Huh, for me pebble lsm on the above MANIFEST just spins. kill -QUIT shows it sitting in a NewL0Sublevels call. How long did that take to generate for you?

@petermattis
Collaborator

Hmm, looks like I just wasn't patient enough. Building the L0 sublevels for each of the 16889 version edits is slow.

@itsbilal
Member

On my Macbook:
./pebble lsm ./1.MANIFEST.pebble > output3.html 84.04s user 2.78s system 106% cpu 1:21.21 total

@sumeerbhola
Collaborator

This visualization is interesting.

  • We have 14GB in the LSM (of which 6GB is in L6) before we make L3 the base level. We should definitely make that change that uses total bytes to compute target bytes for L6 and higher.
  • The L0 files are narrow (unlike the slack thread discussion from yesterday which was without L0 sub-levels), so flush splits are somewhat working. But most of the L0 sstables are quite tiny. Was FlushSplitBytes set to 10MB? Around the 14000 tick mark we have 3200 files for 1.6GB in L0 -- that is too many files.
  • And when one scrolls over the sublevels slowly starting from the left (again at the 14000 tick mark) one can see that the splits are quite poor -- one needs to get to almost the halfway point before the view scrolls forward for L3 and lower, and then it starts moving forward rapidly for those lower levels. I suspect we could get much better performance with better split points.

@petermattis
Collaborator

We have 14GB in the LSM (of which 6GB is in L6) before we make L3 the base level. We should definitely make that change that uses total bytes to compute target bytes for L6 and higher.

I experimented with this without sublevels and it didn't affect the shape. I think it is actually the limited compaction concurrency (MaxConcurrentCompactions = 3) that has the bigger effect. I'll definitely try it out, though.

The L0 files are narrow (unlike the slack thread discussion from yesterday which was without L0 sub-levels), so flush splits are somewhat working. But most of the L0 sstables are quite tiny. Was FlushSplitBytes set to 10MB? Around the 14000 tick mark we have 3200 files for 1.6GB in L0 -- that is too many files.

Yes, FlushSplitBytes was set to 10MB. I think we're splitting on every Lbase file boundary.

@petermattis
Collaborator

@sumeerbhola Here is a MANIFEST.db-size.zip from a run where I tweaked Pebble to use dbSize rather than bottomLevelSize, exactly as your L0-sublevel code originally did. You'll have to get a new version of pebble in order to get the L0-sublevel visualization (the visualization is ~400MB). You'll notice that L6 is even smaller in this run than when using bottomLevelSize. That matches my previous experience. I think we're starving L5->L6 compactions due to lack of compaction concurrency. Or perhaps we're allowing too many L0->Lbase compactions concurrently. Do we ever need more than one L0->Lbase compaction?

@petermattis
Collaborator

There is some funky behavior going on with the L0 sublevel compactions. Most of the compactions seem to be for sstables at the end of each level. And I'm mostly seeing L0->Lbase compactions. Where are the intra-L0 compactions? The L0->Lbase compactions reach so high up in the sublevels that they may be blocking intra-L0 compactions.

@petermattis
Collaborator

The problem is not what I thought it was. I added some extra instrumentation about compaction picking decisions. Here is an example early in the run where we are starving L5->L6 compactions:

  *L0:   5.0     551 M     8.0 E  [L0->L4]
   L4:  15.5     767 M      64 M  [L4->L5]
   L5:   4.5     1.9 G     437 M
   L6:   0.0     450 M     2.9 G

The columns are "score", "level size", and "level max size". The Lx->Ly annotations indicate in-progress compactions. The * marks the level we've chosen a new compaction for. While L5 is larger than L6, the scoring considers that less of a problem than the size of L0 and L4. So we end up only performing L0->L4 and L4->L5 compactions.
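To make the table above concrete, here is a toy version of the pick. Pebble's real compaction picker works with compensated sizes and per-file conflict checks rather than a whole-level "compacting" flag, and the L0 special case below is a simplification, so treat this as a sketch of the idea only: choose the highest-scoring eligible level, where levels already feeding an in-progress compaction are skipped, except L0 (which, with sublevels, may source several concurrent compactions).

```go
package main

import "fmt"

// levelState summarizes one level as printed in the table above.
type levelState struct {
	level      int
	score      float64 // level size relative to its target size
	compacting bool    // an Lx->Ly compaction is already running out of this level
}

// pickCompactionLevel returns the highest-scoring eligible level with a
// score of at least 1, or -1 if nothing qualifies.
func pickCompactionLevel(levels []levelState) int {
	best, bestScore := -1, 1.0
	for _, l := range levels {
		if l.compacting && l.level != 0 {
			continue // the level is already busy (L0 is exempt)
		}
		if l.score >= bestScore {
			best, bestScore = l.level, l.score
		}
	}
	return best
}

func main() {
	// Roughly the situation above: L0->L4 and L4->L5 are running, so L5 and
	// L6 compete with L0, and L0 wins again.
	levels := []levelState{
		{level: 0, score: 5.0, compacting: true},
		{level: 4, score: 15.5, compacting: true},
		{level: 5, score: 4.5},
		{level: 6, score: 0.0},
	}
	fmt.Println("next compaction starts from L", pickCompactionLevel(levels))
}
```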

@petermattis
Collaborator

I've been experimenting with adjusting the level scoring. An observation on the scores above is that we'll frequently see situations like L4->L5 which appear higher priority than L5->L6, but in fact only hurt our future desired state. To account for that, I experimented with adjusting each level's score (for L1-L6) by dividing by the next level's score. For the above data, we'd have something like:

        old-score  new-score  level-size  target-size
   L0:        5.0        5.5       551 M        8.0 E
   L4:       15.5        3.5       767 M         64 M
   L5:        4.5       30.0       1.9 G        437 M
   L6:        0.2        0.2       450 M        2.9 G
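A rough sketch of the adjustment, under the simplifying assumption that a level's raw score is a plain number like the "old-score" column (the real scores involve size compensation and smoothing, which is why the adjusted values in the table don't come out of a bare division):

```go
package main

import "fmt"

// adjustScores divides each level's raw score by the raw score of the next
// level down, so a compaction like L4->L5 only looks urgent if L5 itself is
// in reasonable shape. The 0.01 floor is a made-up guard against dividing by
// a near-zero score; the last (bottom) level is left unadjusted.
func adjustScores(raw []float64) []float64 {
	adjusted := make([]float64, len(raw))
	copy(adjusted, raw)
	for i := 0; i+1 < len(raw); i++ {
		next := raw[i+1]
		if next < 0.01 {
			next = 0.01
		}
		adjusted[i] = raw[i] / next
	}
	return adjusted
}

func main() {
	// Raw scores in the spirit of the table above, for L4, L5, L6.
	raw := []float64{15.5, 4.5, 0.2}
	fmt.Println(adjustScores(raw)) // L5 now outranks L4 by a wide margin
}
```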

The adjusted scores look a bit dramatic on the surface, though they do nicely prioritize L5->L6. In practice, this adjustment has the effect of smoothing the level scores. Here is an example from a run:

   L0:   6.0     558 M     8.0 E  [L0->L3]
   L3:   6.1     5.6 G      64 M
  *L4:   6.1     6.8 G     473 M  [L4->L5]
   L5:   6.0     8.3 G     3.4 G
   L6:   0.0      10 G      25 G

Note how the scores for each level are very similar. We've also avoided the inverted LSM shape. Unfortunately, L3 is too large. That is increasing the write-amplification of L0->L3 compactions, which we can see in the metrics output:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         4   218 M       -    64 G       -       -       -       -    64 G       -       -       -     1.0
      0      2449   697 M    6.50    64 G     0 B       0     0 B       0    62 G   110 K     0 B      21     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3      1423   5.5 G    6.00    49 G     0 B       0     0 B       0   206 G    53 K   229 G       1     4.3
      4      1216   7.0 G    6.10    21 G     0 B       0    11 G   2.8 K    57 G    11 K    61 G       1     2.7
      5       859   8.6 G    6.01    13 G     0 B       0   8.2 G   1.5 K    41 G   4.4 K    42 G       1     3.1
      6       460    11 G       -    12 G     0 B       0    85 M      24    40 G   2.3 K    41 G       1     3.4
  total      6407    32 G       -    64 G     0 B       0    19 G   4.4 K   470 G   181 K   374 G      25     7.4

Notice how much data is being read and written for compactions to L3. That seems suboptimal.

Screen Shot 2020-06-19 at 1 18 54 PM
Screen Shot 2020-06-19 at 1 19 06 PM

Here is a MANIFEST of part of the run. I have some other ideas with which to experiment here.

@petermattis
Collaborator

Another tweak to the scoring heuristics reduced the L3 write-amplification:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1    60 M       -    89 G       -       -       -       -    89 G       -       -       -     1.0
      0     20580   3.6 G    2.72    89 G     0 B       0     0 B       0    85 G   245 K   6.2 M     111     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3       272   1.1 G    2.80    69 G     0 B       0     0 B       0    59 G    15 K    90 G       1     0.9
      4       520   3.2 G    2.79    37 G     0 B       0    12 G   3.0 K   101 G    16 K   107 G       1     2.7
      5       841   9.4 G    2.79    30 G     0 B       0    11 G   1.8 K   109 G   9.9 K   111 G       1     3.7
      6      1105    28 G       -    29 G     0 B       0    92 M      27   116 G   5.5 K   117 G       1     4.0
  total     23318    45 G       -    89 G     0 B       0    23 G   4.9 K   560 G   292 K   425 G     115     6.3

Throughput was somewhat higher, but read-amplification was significantly higher and showed no signs of ever stopping.

Screen Shot 2020-06-19 at 2 22 21 PM
Screen Shot 2020-06-19 at 2 22 31 PM

I'm not sure that the read-amplification growth is a huge problem, though. It is indicative of the system having trouble keeping up with writes, but the system really is having trouble keeping up. We could push harder to keep read amplification down, though doing so only further hurts write throughput.

Interestingly, the latest heuristic gets rid of the L0CompactionThreshold tunable. That feels like progress.

@sumeerbhola
Collaborator

Yes, FlushSplitBytes was set to 10MB.

I think this should be adaptive based on the number of sub-levels. Something like

max(minFlushSplitBytes, targetFileSize * numSubLevels)

where minFlushSplitBytes and targetFileSize are configured constants (the 2MB value we currently use for the latter should be ok).
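A minimal sketch of that formula; the function name is mine, and the 10MB and 2MB constants are the values mentioned in this thread:

```go
package main

import "fmt"

const (
	minFlushSplitBytes = 10 << 20 // 10 MB, the value used in the runs above
	targetFileSize     = 2 << 20  // 2 MB, the current target file size
)

// adaptiveFlushSplitBytes grows the flush split threshold with the number of
// L0 sublevels, so a deep L0 is flushed into fewer, larger sstables.
func adaptiveFlushSplitBytes(numSublevels int) int64 {
	v := int64(targetFileSize) * int64(numSublevels)
	if v < minFlushSplitBytes {
		return minFlushSplitBytes
	}
	return v
}

func main() {
	for _, n := range []int{1, 5, 20, 50} {
		fmt.Printf("sublevels=%2d -> flush split bytes = %d MB\n",
			n, adaptiveFlushSplitBytes(n)>>20)
	}
}
```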

You'll notice that L6 is even smaller in this run than when using bottomLevelSize. That matches my previous experience. I think we're starving L5->L6 compactions due to lack of compaction concurrency. Or perhaps we're allowing too many L0->Lbase compactions concurrently.

Based on looking at the visualization of this MANIFEST, I think this could be explained by a combination of factors:

  • since we start using L3 earlier we have one more level that is potentially demanding compaction.
  • the target size for Lbase and Lbase+1 is primarily a function of the constant LBaseMaxBytes (ignoring the smoothing of the level multiplier), so their scores will be unaffected by this change. But L5's score is decreased when we no longer use bottomLevelSize, so it loses to L0, L3, and L4 more often.

Do we ever need more than one L0->Lbase compaction?
...
Where are the intra-L0 compactions? The L0->Lbase compactions reach so high up in the sublevels that they may be blocking intra-L0 compactions.

I don't quite understand this -- allowing for concurrency in L0->Lbase to avoid wasteful intra-L0 compactions was the main motivation for sub-level compactions. And in that sense it is good that L0->Lbase are reaching higher up in the sub-levels -- this will reduce write amplification since we've picked up all files in a vertical slice across sublevels in one compaction (and without making the compaction huge).

@sumeerbhola
Collaborator

I have a couple of questions about the heuristics being introduced

I still think it is worth trying the ignored heuristics from the prototype https://github.com/sumeerbhola/pebble/blob/sublevel/compaction_picker.go, but I doubt I will have time until the end of this week.

@petermattis
Collaborator

Where are the intra-L0 compactions? The L0->Lbase compactions reach so high up in the sublevels that they may be blocking intra-L0 compactions.

I don't quite understand this -- allowing for concurrency in L0->Lbase to avoid wasteful intra-L0 compactions was the main motivation for sub-level compactions. And in that sense it is good that L0->Lbase are reaching higher up in the sub-levels -- this will reduce write amplification since we've picked up all files in a vertical slice across sublevels in one compaction (and without making the compaction huge).

Yeah, I think my comment was just wrong. We don't have intra-L0 compactions. That is perfectly fine and an indication that L0 sublevels is working as designed.

what is bad about the shape in #49750 (comment) that motivates trying to improve it? The read amplification is low. Is the write amplification higher than what the revised heuristics achieve?

I've been a bit sloppy in my phraseology. The inverted LSM shape is not problematic in and of itself, but I have found that a more normal LSM shape leads to lower write amplification and lower write amplification leads to higher throughput.

Regarding the heuristic change in #49750 (comment), this seems roughly to be saying "don't let a level become much larger than the next lower level". Intuitively this seems harmless, since compacting from a very large level to a small level has small write amplification. This is different from the currentByteRatios heuristic in
https://github.com/sumeerbhola/pebble/blob/sublevel/compaction_picker.go#L499-L505, which was roughly "don't let a level become much larger than the next higher level", since that does have an effect on write amplification when compacting down from that higher level.

The harm from having size(Ln) >> size(Ln+1) is that compacting into Ln then becomes more expensive. I agree that the subsequent compaction into Ln+1 doesn't add much to write amplification. I also agree with "don't let a level become much larger than the next higher level". The change to the scoring heuristic in cockroachdb/pebble#760 has this effect. I need to wrap my head around the currentByteRatios heuristic as I don't fully understand the difference between it and what is done in cockroachdb/pebble#760.

I still think it is worth trying the ignored heuristics from the prototype https://github.com/sumeerbhola/pebble/blob/sublevel/compaction_picker.go, but I doubt I will have time until the end of this week.

I'm going to spend some quality time with these heuristics this week. I'll definitely try and run an experiment incorporating those heuristics as-is.

@sumeerbhola
Collaborator

The harm from having size(Ln) >> size(Ln+1) is that compacting into Ln then becomes more expensive. I agree that the subsequent compaction into Ln+1 doesn't add much to write amplification.

We could capture that concern directly by trying to avoid size(Ln-1) << size(Ln), which is what currentByteRatios tries to do.
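Not the prototype's actual currentByteRatios code (that lives in the linked compaction_picker.go), just a guess at the shape of such a check, with a made-up 10x threshold: boost a level's score when it has grown much larger than the level above it, so it is drained into the next level before the imbalance makes compactions into it expensive.

```go
package main

import "fmt"

// boostForByteRatio bumps level n's score when size(Ln-1) << size(Ln), i.e.
// the level is much larger than the level above it. The 10x threshold and
// the scaling are illustrative only.
func boostForByteRatio(score float64, sizeAbove, size int64) float64 {
	if sizeAbove == 0 {
		return score
	}
	ratio := float64(size) / float64(sizeAbove)
	if ratio > 10 {
		score *= ratio / 10
	}
	return score
}

func main() {
	// Hypothetical: L5 is ~40x larger than L4, so its score is boosted 4x.
	fmt.Println(boostForByteRatio(1.2, 256<<20, 10<<30))
}
```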

@petermattis
Collaborator

I'm going to spend some quality time with these heuristics this week. I'll definitely try and run an experiment incorporating those heuristics as-is.

I manually patched in @sumeerbhola's changes to the compaction scoring heuristics from cockroachdb/pebble#563. See https://gist.github.com/petermattis/590b45e21774600275b0f6a61ab0d8f8.

The LSM metrics at the end of a 1h kv95/enc=false/nodes=3/size=64kb run show:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         1    47 M       -    77 G       -       -       -       -    77 G       -       -       -     1.0
      0      1930   650 M    8.50    77 G     0 B       0     0 B       0    74 G   124 K   276 M      21     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3       803   3.1 G   52.46    60 G     0 B       0     0 B       0   196 G    51 K   217 G       1     3.3
      4      2244    15 G   30.93    48 G     0 B       0   289 M      79   159 G    28 K   171 G       1     3.3
      5      1557    19 G    4.92    22 G     0 B       0    29 M       8    57 G   6.4 K    60 G       1     2.6
      6        76   549 M       -   557 M     0 B       0   105 M      29   1.5 G     248   1.6 G       1     2.8
  total      6610    39 G       -    77 G     0 B       0   423 M     116   565 G   210 K   449 G      25     7.3

Contrast this with the LSM metrics with cockroachdb/pebble#760:

__level_____count____size___score______in__ingest(sz_cnt)____move(sz_cnt)___write(sz_cnt)____read___r-amp___w-amp
    WAL         2    90 M       -    90 G       -       -       -       -    90 G       -       -       -     1.0
      0      9184   2.8 G    3.02    90 G     0 B       0     0 B       0    85 G   186 K     0 B      50     1.0
      1         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      2         0     0 B    0.00     0 B     0 B       0     0 B       0     0 B       0     0 B       0     0.0
      3       360   1.4 G    3.22    72 G     0 B       0     0 B       0    85 G    22 K   112 G       1     1.2
      4       623   3.8 G    3.21    46 G     0 B       0   7.3 G   1.9 K   106 G    18 K   112 G       1     2.3
      5       873    10 G    3.22    41 G     0 B       0   3.6 G     940   104 G    10 K   108 G       1     2.6
      6      1096    28 G       -    30 G     0 B       0    43 M      22    93 G   4.5 K    95 G       1     3.1
  total     12136    46 G       -    90 G     0 B       0    11 G   2.8 K   563 G   241 K   428 G      54     6.3

Note the different write-amplification on L3 and L4. Of course, there is some apples-to-oranges comparison here, as there are different compaction concurrency heuristics.

@petermattis
Collaborator

@sumeerbhola MANIFEST.sumeer-heuristics.zip is a MANIFEST from the run mentioned in the previous message.

@sumeerbhola
Collaborator

The difference in the BytesMoved is significant. Maybe that is a downside of too many bytes in level i+1 -- even if level i also has a lot of bytes, the probability of being able to move from i to i+1 gets lowered.
The score computation is the same, so it's unclear to me why the first run has a significantly higher L0 score despite far fewer bytes in L0.

@ajwerner
Contributor Author

should this still be open?

@sumeerbhola
Collaborator

Closing this since (a) tuning 2-level compaction heuristics is likely not a path to improvement (this also came up recently in a TaoBench import benchmark investigation), and (b) we have other issues open to investigate multi-level compactions, etc.

@jbowens jbowens added this to Storage Jun 4, 2024
@jbowens jbowens moved this to Done in Storage Jun 4, 2024