bulk: use pebble sstable builder to write SSTs #37528
Conversation
Cool!
Is this also the sstables we're handing to AddSSTable? If so, how much due diligence has been done to make sure the metadata etc matches the old ones? For example, TBI metadata, bloom filters, etc. Does pebble do all of it for us? There really isn't any info in this commit message
Can you also include the `BenchmarkImportWorkload` and `BenchmarkConvertToSSTable` diffs, plz
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @petermattis)
I haven't done much digging beyond just that the tests passed and running one benchmark. I'm pretty sure the custom collectors are not there; not sure about the prefix bloom filters being the same.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @dt, and @petermattis)
pkg/storage/bulk/sst_writer.go, line 29 at r1 (raw file):
```go
)

// SSTWriter writes SSTables.
```
These are tables used for backup, right? I think that should be called out here as you're using leveldb format tables.
pkg/storage/bulk/sst_writer.go, line 40 at r1 (raw file):
```go
var pebbleOpts = func() *pebbledb.Options {
	c := *pebbledb.DefaultComparer
	c.Compare = engine.MVCCKeyCompare
```
I'd prefer to see all of the fields explicitly populated:
```go
c := &pebbledb.Comparer{
	AbbreviatedKey: ...,
	Compare:        engine.MVCCKeyCompare,
	Equal:          bytes.Equal,
	Split:          ...,
	Successor:      ...,
	Name:           "cockroach_comparator",
}
```
I'm not sure if `DefaultComparer.Successor` works properly for our MVCC keys.
pkg/storage/bulk/sst_writer.go, line 41 at r1 (raw file):
```go
c := *pebbledb.DefaultComparer
c.Compare = engine.MVCCKeyCompare
opts := &pebbledb.Options{TableFormat: pebbledb.TableFormatLevelDB, Comparer: &c}
```
Just want to confirm that you want leveldb format files. I'd only recommend that for "backup" sstables. For every other usage we should be using rocksdb format.
pkg/storage/bulk/sst_writer.go, line 48 at r1 (raw file):
```go
// MakeSSTWriter creates a new SSTWriter.
func MakeSSTWriter() SSTWriter {
	f := newMemFile(nil)
```
I suppose this is evidence that adding `vfs.NewMemFile()` would be useful. Also irritating that `sstable.NewWriter` takes a `vfs.File` rather than an `io.Writer`. I think I can fix that for you.
pkg/storage/bulk/sst_writer.go, line 63 at r1 (raw file):
```go
fw.scratch = engine.EncodeKeyToBuf(fw.scratch[:0], kv.Key)
k := pebbledb.MakeInternalKey(fw.scratch, 0, pebbledb.InternalKeyKindSet)
return fw.fw.Add(k, kv.Value)
```
This can be written as `fw.fw.Set(fw.scratch, kv.Value)`.
I haven't done much digging beyond just that the tests passed and running one benchmark. I'm pretty sure the custom collectors are not there, not sure about the prefix bloom filters being the same.
Those options could have pretty big impacts on users of imported data, so we'll need to dig into them before merging this, plus probably run some whole-system benchmarking.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @dt, and @petermattis)
pkg/storage/bulk/sst_writer.go, line 41 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Just want to confirm that you want leveldb format files. I'd only recommend that for "backup" sstables. For every other usage we should be using rocksdb format.
IIRC everything uses the leveldb format right now. If that's so, I'd prefer we make the change to rocksdb format for non-backup uses in a separate PR
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @danhhz, and @dt)
pkg/storage/bulk/sst_writer.go, line 41 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
IIRC everything uses the leveldb format right now. If that's so, I'd prefer we make the change to rocksdb format for non-backup uses in a separate PR
What is "everything"? We certainly use RocksDB format for the sstables in the main database. I hope we use RocksDB format for ingested sstables, though I agree that if we're not currently doing so we should make that change in a separate PR.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajkr and @petermattis)
pkg/storage/bulk/sst_writer.go, line 29 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
These are tables used for backup, right? I think that should be called out here as you're using leveldb format tables.
They are, though actually with another upcoming change, they won't be -- #37496 pushes the iteration down to C++ and uses the rocks writer, so actually after that change, this is used for everything except backup.
pkg/storage/bulk/sst_writer.go, line 40 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I'd prefer to see all of the fields explicitly populated:
```go
c := &pebbledb.Comparer{
	AbbreviatedKey: ...,
	Compare:        engine.MVCCKeyCompare,
	Equal:          bytes.Equal,
	Split:          ...,
	Successor:      ...,
	Name:           "cockroach_comparator",
}
```
I'm not sure if `DefaultComparer.Successor` works properly for our MVCC keys.
Done (though it wasn't clear to me if I should be doing anything fancier for Successor)
pkg/storage/bulk/sst_writer.go, line 41 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Just want to confirm that you want leveldb format files. I'd only recommend that for "backup" sstables. For every other usage we should be using rocksdb format.
Currently we always produce LevelDB format files -- even for IMPORT, RESTORE, or index creation -- since the format is set in `DBSstFileWriterNew` without any control by the caller. So currently we're always ingesting LevelDB format. That said, we could easily change that -- in fact, we could always ingest rocksdb format since we don't actually ingest backup SSTs directly and instead iterate and rewrite them during RESTORE anyway to remap the keys.
pkg/storage/bulk/sst_writer.go, line 48 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I suppose this is evidence that adding `vfs.NewMemFile()` would be useful. Also irritating that `sstable.NewWriter` takes a `vfs.File` rather than an `io.Writer`. I think I can fix that for you.
I wanted direct access to the `[]byte`, which is why I didn't use the MemFS and make a File with it, since to then read the content I'd have to make another copy (same issue we've had on the reading side with LevelDB).
pkg/storage/bulk/sst_writer.go, line 63 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
This can be written as `fw.fw.Set(fw.scratch, kv.Value)`.
Done.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajkr and @petermattis)
pkg/storage/bulk/sst_writer.go, line 41 at r1 (raw file):
Previously, dt (David Taylor) wrote…
Currently we always produce LevelDB format files -- even for IMPORT, RESTORE, or index creation -- since the format is set in `DBSstFileWriterNew` without any control by the caller. So currently we're always ingesting LevelDB format. That said, we could easily change that -- in fact, we could always ingest rocksdb format since we don't actually ingest backup SSTs directly and instead iterate and rewrite them during RESTORE anyway to remap the keys.
But I also concur with @danhhz -- I'd rather swap the implementation, then swap the format, rather than do both at once.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajkr and @dt)
pkg/storage/bulk/sst_writer.go, line 48 at r1 (raw file):
Previously, dt (David Taylor) wrote…
I wanted direct access to the `[]byte`, which is why I didn't use the MemFS and make a File with it, since to then read the content I'd have to make another copy (same issue we've had on the reading side with LevelDB).
Once https://github.com/petermattis/pebble/pull/141 merges, you can simplify the `memFile` stuff below. `memFileInfo` should no longer be necessary and you'll only need to implement `Write`, `Close`, and `Sync`.
pkg/storage/bulk/sst_writer.go, line 65 at r2 (raw file):
```go
		return len(k)
	}
	return keyPartEnd
```
Maybe add a comment that this is the same as `enginepb.SplitMVCCKey`.
a2fb0b5 to 101d0bf
Following up on the concerns @danhhz raised above, I've been thinking about the question of whether we're using the same configuration as the RocksDB writer, and how differences there could result in significant -- but potentially hard to debug -- performance differences when later operating on the produced SSTs.
Talking with @petermattis, we've generally agreed that we trust pebble's own tests to verify pebble creates the SST you ask it to, so this concern is mostly about how we're configuring it, and if that matches how we're configuring the Rocks writer.
"Fortunately", the Rocks-based writer is actually mostly configured from scratch right now and shares very little configuration with the runtime Rocks config that we've actually tuned. While I'm not sure this is a good thing in general, it at least makes it easier to compare against this pebble writer config and determine whether switching to pebble could be a regression. The prop collectors are about the only thing shared between the rocks writer and the runtime rocksdb.
I chatted with @nvanbenschoten about this a bit, and we're unconvinced that we could really prove much to ourselves by running a workload on it -- even if SSTs were different, the fact they could be compacted away and replaced with rocks-authored SSTs at any time could make it pretty hard to tell. However, if the actual SSTs were exactly the same, that could rule out any behavior differences without doubt.
I talked to @petermattis about this and long-term, we're not sure that we want to constrain pebble to produce byte-for-byte identical SSTs to rocksdb, as that could exclude different or extra features that we realize we want. However, in the short term, they do happen to be identical, and for the purposes of vetting this change, I think we can test that. If down the road pebble SSTs diverge, we can decide to delete this test, but at that point, I think the test will have done its job: proving that using the pebble writer didn't change the behavior compared to the rocks writer. w.r.t. checking that either writer matches runtime-written SSTs, we don't currently have tests for that and this change doesn't really change that (indeed, we know they don't match: we explicitly configure them to use a different format).
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajkr and @petermattis)
pkg/storage/bulk/sst_writer.go, line 48 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Once https://github.com/petermattis/pebble/pull/141 merges, you can simplify the
memFile
stuff below.memFileInfo
should no longer be necessary and you'll only need to implementWrite
,Close
, andSync
.
Done.
pkg/storage/bulk/sst_writer.go, line 65 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Maybe add a comment that this is the same as
enginepb.SplitMVCCKey
.
Done.
I'm good with this in the short term. The ssts should be byte-for-byte identical. There was a bug last week that was causing that to not be true, but it has been fixed. The one exception to this rule is range-deletion tombstones. Those are stored slightly differently in Pebble, though in a format that is compatible with RocksDB. So if you do add a test around this, be aware of that limitation.
though I didn't look at pebble use closely. exciting! i've wanted this for a while
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajkr, @dt, and @petermattis)
pkg/storage/bulk/sst_batcher.go, line 330 at r5 (raw file):
```go
	return nil, nil, err
}
w = MakeSSTWriter()
```
should SSTWriter have a Reset or something so you can reuse `scratch` and maybe the `memFile`? this also could easily be a followup
pkg/storage/bulk/sst_writer.go, line 100 at r5 (raw file):
```go
	binary.BigEndian.PutUint64(minBuf, uint64(t.min.WallTime))
}
if t.min.Logical > 0 {
```
is this really the logic? it's contrived, but if there was a timestamp with 0 walltime and non-0 logical, this would encode only the logical time
update: i looked and i don't think this matches the behavior of the c code. it probably produces the same effect in practice, but it's definitely a distraction to go convince myself of that
pkg/storage/bulk/sst_writer.go, line 129 at r5 (raw file):
```go
func (dummyDeleteRangeCollector) Add(key pebble.InternalKey, value []byte) error {
	return nil
```
panic or return a not implemented error or something? if it does end up being used somehow, it'd be nice if something failed loudly
pkg/storage/bulk/sst_writer_test.go, line 30 at r3 (raw file):
```go
)

func makeIntTableKVs(t testing.TB, numKeys, valueSize, maxRevisions int) []engine.MVCCKeyValue {
```
just to keep this on your radar, i'd love to move these sorts of things to workload so we don't have to keep duplicating just enough of our encoding logic to do stuff like this. https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/workloadccl/format/sstable.go may be relevant. this is not a blocking comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @ajkr, @danhhz, and @petermattis)
pkg/storage/bulk/sst_batcher.go, line 330 at r5 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
should SSTWriter have a Reset or something so you can reuse `scratch` and maybe the `memFile`? this also could easily be a followup
Yeah, I think I'd prefer to punt optimizations to followups, since this has the vendor change in it to rebase.
That said, I did take a quick look at doing this, but we can't reuse `memFile`'s backing slice: it is the bytes handed back by `Finish`, and `Finish` is not unsafe. Making it unsafe wouldn't buy us much since you can't hand unsafe bytes to DistSender to actually add the SSTable.
So I think best case we'd be avoiding a single kv-sized alloc per-SSTable if we added a reset that could reuse `scratch`.
pkg/storage/bulk/sst_writer.go, line 100 at r5 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
is this really the logic? it's contrived, but if there was a timestamp with 0 walltime and non-0 logical, this would encode only the logical time
update: i looked and i don't think this matches the behavior of the c code. it probably produces the same effect in practice, but it's definitely a distraction to go convince myself of that
nice catch. The c++ prop collector just compares and tracks the min/max timestamp as raw bytes. I initially wasn't wild about that, but I guess I've come around to it -- it makes this a) closely match the c++ implementation and b) far smaller, so I'm sold.
In either case I've extended the test to include non-zero logical to ensure it is covered.
pkg/storage/bulk/sst_writer.go, line 129 at r5 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
panic or return a not implemented error or something? if it does end up being used somehow, it'd be nice if something failed loudly
I don't want to do so unconditionally, since it does get passed every key but just doesn't do anything with them. Added an assertion that the keys are `Set`s and nothing else for now.
pkg/storage/bulk/sst_writer_test.go, line 30 at r3 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
just to keep this on you radar, i'd love to move these sorts of things to workload so we don't have to keep duplicating just enough of our encoding logic to do stuff like this. https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/workloadccl/format/sstable.go may be relevant. this is not a blocking comment
Yeah, I generally agree, though in this case it isn't quite drop-in since I wanted per-key timestamps and multiple revisions per key. Might be room for something reusable that isn't quite the normal workload generator stuff though.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @ajkr, @danhhz, @dt, and @petermattis)
pkg/storage/bulk/sst_batcher.go, line 330 at r5 (raw file):
Previously, dt (David Taylor) wrote…
Yeah, I think I'd prefer to punt optimizations to followups, since this has the vendor change in it to rebase. That said, I did take a quick look at doing this, but we can't reuse `memFile`'s backing slice: it is the bytes handed back by `Finish`, and `Finish` is not unsafe. Making it unsafe wouldn't buy us much since you can't hand unsafe bytes to DistSender to actually add the SSTable. So I think best case we'd be avoiding a single kv-sized alloc per-SSTable if we added a reset that could reuse `scratch`.
Ack. I was thinking you could do something with bufalloc.ByteAllocator for memFile, but maybe that's not even a good fit here. Either way, let's leave them for followups.
pkg/storage/bulk/sst_writer.go, line 100 at r5 (raw file):
Previously, dt (David Taylor) wrote…
nice catch. The c++ prop collector just compares and tracks the min/max timestamp as raw bytes. I initially wasn't wild about that, but I guess I've come around to it -- it makes this a) closely match the c++ implementation and b) far smaller, so I'm sold. In either case I've extended the test to include non-zero logical to ensure it is covered.
I was just imagining making the walltime non-zero check into an anything-non-zero check and putting the logical inside it, basically just porting the EncodeTimestamp in libroach:

```go
if t.min != (hlc.Timestamp{}) {
	minBuf = make([]byte, 8, 12)
	binary.BigEndian.PutUint64(minBuf, uint64(t.min.WallTime))
	if t.min.Logical != 0 {
		minBuf = minBuf[:12]
		binary.BigEndian.PutUint32(minBuf[8:], uint32(t.min.Logical))
	}
}
```
but this is also fine
pkg/storage/bulk/sst_writer.go, line 96 at r8 (raw file):
```go
}

// Name implements
```
nit: incomplete
pkg/storage/bulk/sst_writer.go, line 182 at r8 (raw file):
```go
// we already called its Close() in Finish() and we no-op here. Thus the only
// time we expect to be here is in a deferred Close(), in which case the caller
// probably is already returning some other error, so returning on from this
```
nit: "returning one from"
This pulls some of the logic out of an existing test into helpers for:
A) making a big slice of random KV data
B) making an SST from a slice of KV data
These can then be reused in other SST-making tests to follow.
Release note: none.
Comparing to the Rocks-based builder in `engine`:

```
name             old time/op    new time/op    delta
WriteSSTable-12    4.66µs ± 6%    3.05µs ±13%  -34.54%  (p=0.000 n=9+10)

name             old speed      new speed      delta
WriteSSTable-12   258MB/s ± 6%   396MB/s ±12%  +53.62%  (p=0.000 n=9+10)

name             old alloc/op   new alloc/op   delta
WriteSSTable-12      300B ± 0%     1662B ± 0%  +453.83%  (p=0.000 n=8+10)

name             old allocs/op  new allocs/op  delta
WriteSSTable-12      0.00           0.00       ~  (all equal)
```

Release note: none.
```
name               old time/op    new time/op    delta
ClusterRestore-12    2.09µs ±19%    1.80µs ±10%  -14.05%  (p=0.016 n=5+5)

name               old speed      new speed      delta
ClusterRestore-12   55.5MB/s ±17%  64.4MB/s ±10%  +16.02%  (p=0.016 n=5+5)
```

Release note (performance improvement): speed up file-writing during bulk-ingestion.
TFTRs! bors r+
37528: bulk: use pebble sstable builder to write SSTs r=dt a=dt

bulk: add wrapper for pebble SSTable builder to make SSTs

Comparing to the Rocks-based builder in `engine`:

```
name             old time/op    new time/op    delta
WriteSSTable-12    4.66µs ± 6%    3.05µs ±13%  -34.54%  (p=0.000 n=9+10)

name             old speed      new speed      delta
WriteSSTable-12   258MB/s ± 6%   396MB/s ±12%  +53.62%  (p=0.000 n=9+10)

name             old alloc/op   new alloc/op   delta
WriteSSTable-12      300B ± 0%     1662B ± 0%  +453.83%  (p=0.000 n=8+10)

name             old allocs/op  new allocs/op  delta
WriteSSTable-12      0.00           0.00       ~  (all equal)
```

```
name               old time/op    new time/op    delta
ClusterRestore-12    2.09µs ±19%    1.80µs ±10%  -14.05%  (p=0.016 n=5+5)

name               old speed      new speed      delta
ClusterRestore-12   55.5MB/s ±17%  64.4MB/s ±10%  +16.02%  (p=0.016 n=5+5)
```

Release note (performance improvement): speed up file-writing during bulk-ingestion.

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Build succeeded