
bulk: use pebble sstable builder to write SSTs #37528

Merged
merged 3 commits into cockroachdb:master from dt:pebble-write-ssts, Aug 8, 2019

Conversation

@dt (Member) commented May 15, 2019

bulk: add wrapper for pebble SSTable builder to make SSTs
Comparing to the Rocks-based builder in engine:

name             old time/op    new time/op    delta
WriteSSTable-12    4.66µs ± 6%    3.05µs ±13%   -34.54%  (p=0.000 n=9+10)

name             old speed      new speed      delta
WriteSSTable-12   258MB/s ± 6%   396MB/s ±12%   +53.62%  (p=0.000 n=9+10)

name             old alloc/op   new alloc/op   delta
WriteSSTable-12      300B ± 0%     1662B ± 0%  +453.83%  (p=0.000 n=8+10)

name             old allocs/op  new allocs/op  delta
WriteSSTable-12      0.00           0.00           ~     (all equal)

name               old time/op    new time/op    delta
ClusterRestore-12    2.09µs ±19%    1.80µs ±10%  -14.05%  (p=0.016 n=5+5)

name               old speed      new speed      delta
ClusterRestore-12  55.5MB/s ±17%  64.4MB/s ±10%  +16.02%  (p=0.016 n=5+5)

Release note (performance improvement): speed up file-writing during bulk-ingestion.
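
For orientation, a minimal sketch of the wrapper shape this PR introduces -- not the actual pkg/storage/bulk code; the sstWriter interface and the key-encoding stand-in below are hypothetical simplifications of pebble's sstable.Writer and engine.EncodeKeyToBuf:

```
package bulk

// sstWriter is a hypothetical stand-in for the subset of pebble's
// sstable.Writer API used here; the real signatures differ.
type sstWriter interface {
	Set(key, value []byte) error // keys must be added in sorted order
	Close() error
}

// SSTWriter wraps an sstable builder, encoding each MVCC key into a
// reused scratch buffer before handing it to the underlying writer.
type SSTWriter struct {
	fw      sstWriter
	scratch []byte
}

// Add appends a key/value pair to the in-progress SSTable.
func (w *SSTWriter) Add(key, value []byte) error {
	// Stand-in for engine.EncodeKeyToBuf, which MVCC-encodes the key
	// (user key plus version-timestamp suffix) into the scratch buffer.
	w.scratch = append(w.scratch[:0], key...)
	return w.fw.Set(w.scratch, value)
}
```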

@dt requested review from danhhz, petermattis and a team, May 15, 2019 14:54
@cockroach-teamcity (Member) commented: This change is Reviewable

@danhhz (Contributor) left a comment

Cool!

Are these also the sstables we're handing to AddSSTable? If so, how much due diligence has been done to make sure the metadata etc. matches the old ones? For example, TBI metadata, bloom filters, etc. Does pebble do all of it for us? There really isn't any info in this commit message.

Can you also include the BenchmarkImportWorkload and BenchmarkConvertToSSTable diffs, plz

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @petermattis)

@dt (Member, Author) commented May 15, 2019

I haven't done much digging beyond seeing that the tests passed and running one benchmark. I'm pretty sure the custom collectors are not there; not sure about the prefix bloom filters being the same.

@dt requested a review from ajkr, May 15, 2019 18:05
@petermattis (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @dt, and @petermattis)


pkg/storage/bulk/sst_writer.go, line 29 at r1 (raw file):

)

// SSTWriter writes SSTables.

These are tables used for backup, right? I think that should be called out here as you're using leveldb format tables.


pkg/storage/bulk/sst_writer.go, line 40 at r1 (raw file):

var pebbleOpts = func() *pebbledb.Options {
	c := *pebbledb.DefaultComparer
	c.Compare = engine.MVCCKeyCompare

I'd prefer to see all of the fields explicitly populated:

c := &pebbledb.Comparer{
  AbbreviatedKey: ...,
  Compare: engine.MVCCKeyCompare,
  Equal: bytes.Equal,
  Split: ...,
  Successor: ...,
  Name: "cockroach_comparator",
}

I'm not sure if DefaultComparer.Successor works properly for our MVCC keys.


pkg/storage/bulk/sst_writer.go, line 41 at r1 (raw file):

	c := *pebbledb.DefaultComparer
	c.Compare = engine.MVCCKeyCompare
	opts := &pebbledb.Options{TableFormat: pebbledb.TableFormatLevelDB, Comparer: &c}

Just want to confirm that you want leveldb format files. I'd only recommend that for "backup" sstables. For every other usage we should be using rocksdb format.


pkg/storage/bulk/sst_writer.go, line 48 at r1 (raw file):

// MakeSSTWriter creates a new SSTWriter.
func MakeSSTWriter() SSTWriter {
	f := newMemFile(nil)

I suppose this is evidence that adding a vfs.NewMemFile() would be useful. It's also irritating that sstable.NewWriter takes a vfs.File rather than an io.Writer. I think I can fix that for you.


pkg/storage/bulk/sst_writer.go, line 63 at r1 (raw file):

	fw.scratch = engine.EncodeKeyToBuf(fw.scratch[:0], kv.Key)
	k := pebbledb.MakeInternalKey(fw.scratch, 0, pebbledb.InternalKeyKindSet)
	return fw.fw.Add(k, kv.Value)

This can be written as fw.fw.Set(fw.scratch, kv.Value).

@danhhz (Contributor) left a comment

I haven't done much digging beyond seeing that the tests passed and running one benchmark. I'm pretty sure the custom collectors are not there; not sure about the prefix bloom filters being the same.

Those options could have pretty big impacts on users of imported data, so we'll need to dig into them before merging this. Plus probably run some whole-system benchmarking.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @dt, and @petermattis)


pkg/storage/bulk/sst_writer.go, line 41 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Just want to confirm that you want leveldb format files. I'd only recommend that for "backup" sstables. For every other usage we should be using rocksdb format.

IIRC everything uses the leveldb format right now. If that's so, I'd prefer we make the change to rocksdb format for non-backup uses in a separate PR

@petermattis (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @danhhz, and @dt)


pkg/storage/bulk/sst_writer.go, line 41 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

IIRC everything uses the leveldb format right now. If that's so, I'd prefer we make the change to rocksdb format for non-backup uses in a separate PR

What is "everything"? We certainly use RocksDB format for the sstables in the main database. I hope we use RocksDB format for ingested sstables, though I agree that if we're not currently doing so we should make that change in a separate PR.

@dt (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr and @petermattis)


pkg/storage/bulk/sst_writer.go, line 29 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

These are tables used for backup, right? I think that should be called out here as you're using leveldb format tables.

They are, though with another upcoming change they won't be -- #37496 pushes the iteration down to C++ and uses the rocks writer, so after that change this is used for everything except backup.


pkg/storage/bulk/sst_writer.go, line 40 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'd prefer to see all of the fields explicitly populated:

c := &pebbledb.Comparer{
  AbbreviatedKey: ...,
  Compare: engine.MVCCKeyCompare,
  Equal: bytes.Equal,
  Split: ...,
  Successor: ...,
  Name: "cockroach_comparator",
}

I'm not sure if DefaultComparer.Successor works properly for our MVCC keys.

Done (though it wasn't clear to me whether I should be doing anything fancier for Successor).


pkg/storage/bulk/sst_writer.go, line 41 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Just want to confirm that you want leveldb format files. I'd only recommend that for "backup" sstables. For every other usage we should be using rocksdb format.

Currently we always produce LevelDB format files -- even for IMPORT, RESTORE, or index creation -- since the format is set in DBSstFileWriterNew without any control by the caller. So currently we're always ingesting LevelDB format. That said, we could easily change that -- in fact, we could always ingest rocks format, since we don't actually ingest backup SSTs directly and instead iterate and rewrite them during RESTORE anyway to remap the keys.


pkg/storage/bulk/sst_writer.go, line 48 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I suppose this is evidence that add vfs.NewMemFile() would be useful. Also irritating that sstable.NewWriter takes a vfs.File rather than an io.Writer. I think I can fix that for you.

I wanted direct access to the []byte, which is why I didn't use the MemFS and make a File with it: to then read the content I'd have to make another copy (the same issue we've had on the reading side with LevelDB).


pkg/storage/bulk/sst_writer.go, line 63 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This can be written as fw.fw.Set(fw.scratch, kv.Value).

Done.

@dt (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr and @petermattis)


pkg/storage/bulk/sst_writer.go, line 41 at r1 (raw file):

Previously, dt (David Taylor) wrote…

Currently we always produce LevelDB format files -- even for IMPORT, RESTORE, or index creation -- since the format is set in DBSstFileWriterNew without any control by the caller. So currently we're always ingesting LevelDB format. That said, we could easily change that -- in fact, we could always ingest rocks format, since we don't actually ingest backup SSTs directly and instead iterate and rewrite them during RESTORE anyway to remap the keys.

But I also concur with @danhhz -- I'd rather swap the implementation first and then swap the format, rather than do both at once.

@petermattis (Collaborator) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajkr and @dt)


pkg/storage/bulk/sst_writer.go, line 48 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I wanted direct access to the []byte which is why I didn't use the MemFS and make a File with it, since to then read the content I'd have to make another copy (same issue we've had on the reading since with LevelDB).

Once https://github.com/petermattis/pebble/pull/141 merges, you can simplify the memFile stuff below. memFileInfo should no longer be necessary and you'll only need to implement Write, Close, and Sync.
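
For illustration, a sketch of what that simplified memFile could look like once only Write, Close, and Sync are needed -- the Data accessor is a hypothetical name for the direct []byte access discussed above:

```
package bulk

import "bytes"

// memFile is an in-memory file that buffers everything written to it, so
// the finished SSTable bytes can be handed back without another copy.
type memFile struct {
	buf bytes.Buffer
}

func (f *memFile) Write(p []byte) (int, error) { return f.buf.Write(p) }
func (f *memFile) Close() error                { return nil } // nothing to release
func (f *memFile) Sync() error                 { return nil } // no durability needed in memory

// Data returns the bytes written so far (hypothetical accessor).
func (f *memFile) Data() []byte { return f.buf.Bytes() }
```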


pkg/storage/bulk/sst_writer.go, line 65 at r2 (raw file):

			return len(k)
		}
		return keyPartEnd

Maybe add a comment that this is the same as enginepb.SplitMVCCKey.
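
For reference, a sketch of the split logic under discussion, mirroring enginepb.SplitMVCCKey as described (illustrative, written from the encoding's shape rather than copied from the actual source):

```
// split returns the length of the user-key prefix of an encoded MVCC key.
// The final byte of an encoded key records how many trailing bytes form
// the version suffix (sentinel byte plus timestamp), so the prefix ends
// that many bytes, plus the length byte itself, from the end.
func split(k []byte) int {
	if len(k) == 0 {
		return len(k)
	}
	suffixLen := int(k[len(k)-1])
	keyPartEnd := len(k) - 1 - suffixLen
	if keyPartEnd < 0 {
		return len(k) // malformed key: treat it all as prefix
	}
	return keyPartEnd
}
```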

@dt force-pushed the pebble-write-ssts branch 4 times, most recently from a2fb0b5 to 101d0bf, August 6, 2019 16:20
@dt (Member, Author) left a comment

Following up on the concerns @danhhz raised above, I've been thinking about whether we're using the same configuration as the RocksDB writer, since differences there could result in significant -- but potentially hard to debug -- performance differences when later operating on the produced SSTs.

Talking with @petermattis, we've generally agreed that we trust pebble's own tests to verify that pebble creates the SST you ask it to, so this concern is mostly about how we're configuring it, and whether that matches how we're configuring the Rocks writer.

"Fortunately", the Rocks-based writer is actually mostly configured from scratch right now and shares very little configuration with the runtime Rocks config that we've actually tuned. While I'm not sure this is a good thing in general, it at least makes it easier to inspect against this pebble writer config and determine whether switching to pebble could be a regression. The prop collectors are about the only thing shared between the rocks writer and the runtime rocksdb.

I chatted with @nvanbenschoten about this a bit, and we're unconvinced that we could really prove much to ourselves by running a workload on it -- even if the SSTs were different, the fact that they could be compacted away and replaced with rocks-authored SSTs at any time could make it pretty hard to tell. However, if the actual SSTs were exactly the same, that would rule out any behavior differences without doubt.

I talked to @petermattis about this and, long-term, we're not sure we want to constrain pebble to produce byte-for-byte identical SSTs to rocksdb, as that could preclude adding different or extra features we realize we want. However, in the short term they do happen to be identical, and for the purposes of vetting this change, I think we can test that. If down the road pebble SSTs diverge, we can decide to delete this test, but at that point I think the test will have done its job: proving that using the pebble writer didn't change behavior compared to the rocks writer. W.r.t. checking that either writer matches runtime-written SSTs, we don't currently have tests for that and this change doesn't really change that (indeed, we know they don't match: we explicitly configure them to use a different format).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajkr and @petermattis)


pkg/storage/bulk/sst_writer.go, line 48 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Once https://github.com/petermattis/pebble/pull/141 merges, you can simplify the memFile stuff below. memFileInfo should no longer be necessary and you'll only need to implement Write, Close, and Sync.

Done.


pkg/storage/bulk/sst_writer.go, line 65 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Maybe add a comment that this is the same as enginepb.SplitMVCCKey.

Done.

@petermattis (Collaborator) commented:

However, in the short term, they do happen to be identical, and for the purposes of vetting this change, I think we can test that.

I'm good with this in the short term. The ssts should be byte-for-byte identical. There was a bug last week that was causing that to not be true, but it has been fixed. The one exception to this rule is range-deletion tombstones. Those are stored slightly differently in Pebble, though in a format that is compatible with RocksDB. So if you do add a test around this, be aware of that limitation.

@danhhz (Contributor) left a comment

:lgtm: though I didn't look at pebble use closely. exciting! i've wanted this for a while

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajkr, @dt, and @petermattis)


pkg/storage/bulk/sst_batcher.go, line 330 at r5 (raw file):

				return nil, nil, err
			}
			w = MakeSSTWriter()

should SSTWriter have a Reset or something so you can reuse scratch and maybe the memFile? this also could easily be a followup


pkg/storage/bulk/sst_writer.go, line 100 at r5 (raw file):

		binary.BigEndian.PutUint64(minBuf, uint64(t.min.WallTime))
	}
	if t.min.Logical > 0 {

is this really the logic? it's contrived, but if there was a timestamp with 0 walltime and non-0 logical, this would encode only the logical time

update: i looked and i don't think this matches the behavior of the c code. it probably produces the same effect in practice, but it's definitely a distraction to go convince myself of that


pkg/storage/bulk/sst_writer.go, line 129 at r5 (raw file):

func (dummyDeleteRangeCollector) Add(key pebble.InternalKey, value []byte) error {
	return nil

panic or return a not implemented error or something? if it does end up being used somehow, it'd be nice if something failed loudly


pkg/storage/bulk/sst_writer_test.go, line 30 at r3 (raw file):

)

func makeIntTableKVs(t testing.TB, numKeys, valueSize, maxRevisions int) []engine.MVCCKeyValue {

just to keep this on your radar, i'd love to move these sorts of things to workload so we don't have to keep duplicating just enough of our encoding logic to do stuff like this. https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/workloadccl/format/sstable.go may be relevant. this is not a blocking comment

@dt (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @ajkr, @danhhz, and @petermattis)


pkg/storage/bulk/sst_batcher.go, line 330 at r5 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

should SSTWriter have a Reset or something so you can reuse scratch and maybe the memFile? this also could easily be a followup

Yeah, I think I'd prefer to punt optimizations to followups, since this has the vendor change in it to rebase.

That said, I did take a quick look at doing this, but we can't reuse memFile's backing slice, since it is the bytes handed back by Finish, and Finish is not unsafe. Making it unsafe wouldn't buy us much, since you can't hand unsafe bytes to DistSender to actually add the SSTable.

So I think best case we'd be avoiding a single kv-sized alloc per-SSTable if we added a reset that could reuse scratch.
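
If a Reset did get added in a followup, it might look something like this sketch against the simplified SSTWriter shown near the top of this thread (hypothetical; per the above, only scratch is safely recyclable):

```
// Reset readies the writer for a new SSTable. The scratch buffer is safe
// to reuse because Finish never returns it; the old file's bytes belong
// to the caller, so a fresh builder over a fresh file is required.
func (w *SSTWriter) Reset(newFW sstWriter) {
	w.scratch = w.scratch[:0]
	w.fw = newFW
}
```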


pkg/storage/bulk/sst_writer.go, line 100 at r5 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

is this really the logic? it's contrived, but if there was a timestamp with 0 walltime and non-0 logical, this would encode only the logical time

update: i looked and i don't think this matches the behavior of the c code. it probably produces the same effect in practice, but it's definitely a distraction to go convince myself of that

nice catch. The c++ prop collector just compares and tracks the min/max timestamp as raw bytes. I initially wasn't wild about that, but I've come around to it -- it makes this a) closely match the c++ implementation and b) far smaller, so I'm sold.

In either case I've extended the test to include non-zero logical to ensure it is covered.
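
To illustrate the raw-bytes approach, tracking min/max reduces to a bytes.Compare per encoded timestamp, since the big-endian encoding makes byte order agree with timestamp order; a sketch with hypothetical names:

```
import "bytes"

// tsBounds tracks the smallest and largest encoded timestamp suffixes
// seen, comparing them as raw bytes like the C++ prop collector does.
type tsBounds struct {
	min, max []byte
}

// add records one encoded timestamp (the suffix split off an MVCC key).
func (b *tsBounds) add(ts []byte) {
	if len(ts) == 0 {
		return // unversioned key: nothing to track
	}
	if b.min == nil || bytes.Compare(ts, b.min) < 0 {
		b.min = append(b.min[:0], ts...)
	}
	if b.max == nil || bytes.Compare(ts, b.max) > 0 {
		b.max = append(b.max[:0], ts...)
	}
}
```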


pkg/storage/bulk/sst_writer.go, line 129 at r5 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

panic or return a not implemented error or something? if it does end up being used somehow, it'd be nice if something failed loudly

I don't want to do so unconditionally, since it does get passed every key but it just doesn't do anything with them. Added an assertion that the keys are Sets and nothing else for now.


pkg/storage/bulk/sst_writer_test.go, line 30 at r3 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

just to keep this on you radar, i'd love to move these sorts of things to workload so we don't have to keep duplicating just enough of our encoding logic to do stuff like this. https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/workloadccl/format/sstable.go may be relevant. this is not a blocking comment

Yeah, I generally agree, though in this case it isn't quite drop-in, since I wanted per-key timestamps and multiple revisions per key. Might be room for something reusable that isn't quite the normal workload generator stuff, though.

@danhhz (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @ajkr, @danhhz, @dt, and @petermattis)


pkg/storage/bulk/sst_batcher.go, line 330 at r5 (raw file):

Previously, dt (David Taylor) wrote…

Yeah, I think prefer to punt optimizations to followups, since this has the vendor change in it to rebase.

That said, I did take a quick look at doing this, but we can't reuse memFile's backing slice, it is the bytes handed back by Finish and Finish is not unsafe. Making it unsafe wouldn't buy us much since you can't hand unsafe bytes to DistSender to actually add the SSTable.

So I think best case we'd be avoiding a single kv-sized alloc per-SSTable if we added a reset that could reuse scratch.

Ack. I was thinking you could do something with bufalloc.BytesAllocator for memFile, but maybe that's not even a good idea here. Either way, let's leave them for followups.


pkg/storage/bulk/sst_writer.go, line 100 at r5 (raw file):

Previously, dt (David Taylor) wrote…

nice catch. The c++ prop collector just compares and tracks the min/max timestamp as raw bytes. I initially wasn't wild about that, but I guess I've come around to it -- it means makes this a) closely match the c++ implementation and b) far smaller, so I'm sold.

In either case I've extended the test to include non-zero logical to ensure it is covered.

I was just imagining making the walltime non-zero check into an anything-non-zero check and putting the logical inside it. basically just porting the EncodeTimestamp in libroach

if t.min != (hlc.Timestamp{}) {
  minBuf = make([]byte, 8, 12)
  binary.BigEndian.PutUint64(minBuf, uint64(t.min.WallTime))
  if t.min.Logical != 0 {
    minBuf = minBuf[:12]
    binary.BigEndian.PutUint32(minBuf[8:], uint32(t.min.Logical))
  }
}

but this is also fine


pkg/storage/bulk/sst_writer.go, line 96 at r8 (raw file):

}

// Name implements

nit: incomplete


pkg/storage/bulk/sst_writer.go, line 182 at r8 (raw file):

	// we already called its Close() in Finish() and we no-op here. Thus the only
	// time we expect to be here is in a deferred Close(), in which case the caller
	// probably is already returning some other error, so returning on from this

nit: "returning one from"

dt added 3 commits August 7, 2019 18:51
This pulls some of the logic out of an existing test into helpers for:
A) making a big slice of random KV data
B) making an SST from a slice of KV data

These can then be reused in other SST-making tests to follow.

Release note: none.
Comparing to the Rocks-based builder in `engine`:

```
name             old time/op    new time/op    delta
WriteSSTable-12    4.66µs ± 6%    3.05µs ±13%   -34.54%  (p=0.000 n=9+10)

name             old speed      new speed      delta
WriteSSTable-12   258MB/s ± 6%   396MB/s ±12%   +53.62%  (p=0.000 n=9+10)

name             old alloc/op   new alloc/op   delta
WriteSSTable-12      300B ± 0%     1662B ± 0%  +453.83%  (p=0.000 n=8+10)

name             old allocs/op  new allocs/op  delta
WriteSSTable-12      0.00           0.00           ~     (all equal)
```

Release note: none.
```
name               old time/op    new time/op    delta
ClusterRestore-12    2.09µs ±19%    1.80µs ±10%  -14.05%  (p=0.016 n=5+5)

name               old speed      new speed      delta
ClusterRestore-12  55.5MB/s ±17%  64.4MB/s ±10%  +16.02%  (p=0.016 n=5+5)
```

Release note (performance improvement): speed up file-writing during bulk-ingestion.
@dt (Member, Author) commented Aug 8, 2019

TFTRs!

bors r+

craig bot pushed a commit that referenced this pull request Aug 8, 2019
37528: bulk: use pebble sstable builder to write SSTs r=dt a=dt

bulk: add wrapper for pebble SSTable builder to make SSTs
Comparing to the Rocks-based builder in `engine`:

```
name             old time/op    new time/op    delta
WriteSSTable-12    4.66µs ± 6%    3.05µs ±13%   -34.54%  (p=0.000 n=9+10)

name             old speed      new speed      delta
WriteSSTable-12   258MB/s ± 6%   396MB/s ±12%   +53.62%  (p=0.000 n=9+10)

name             old alloc/op   new alloc/op   delta
WriteSSTable-12      300B ± 0%     1662B ± 0%  +453.83%  (p=0.000 n=8+10)

name             old allocs/op  new allocs/op  delta
WriteSSTable-12      0.00           0.00           ~     (all equal)
```

```
name               old time/op    new time/op    delta
ClusterRestore-12    2.09µs ±19%    1.80µs ±10%  -14.05%  (p=0.016 n=5+5)

name               old speed      new speed      delta
ClusterRestore-12  55.5MB/s ±17%  64.4MB/s ±10%  +16.02%  (p=0.016 n=5+5)
```

Release note (performance improvement): speed up file-writing during bulk-ingestion.

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
@craig (bot) commented Aug 8, 2019

Build succeeded

@craig (bot) merged commit 1e129d5 into cockroachdb:master, Aug 8, 2019
@dt deleted the pebble-write-ssts branch, August 8, 2019 11:31