Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: Move SSTWriter to engine, use it instead of Rocks-based writer #42252

Merged
merged 4 commits into from
Nov 11, 2019

Conversation

itsbilal
Copy link
Member

@itsbilal itsbilal commented Nov 6, 2019

Currently, we use RocksDBSstFileWriter for writing SSTs when receiving cockroach snapshots. This incurs some extra performance overhead with cgo crossings that we can avoid by using the sstable.Writer in pebble.

This PR moves a wrapper for the Pebble-based writer that currently resides in bulk to engine, then adds methods to it so it could be used as a drop-in replacement (almost) for the RocksDB writer. In the second commit, all non-test uses of RocksDBSstFileWriter are moved to SSTWriter.

Parts taken from #39802 , and fixes #41657.

@itsbilal itsbilal self-assigned this Nov 6, 2019
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, though my comments below indicate that there will be work to do in a follow-on PR.

Reviewed 7 of 7 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @itsbilal, @petermattis, and @sumeerbhola)


pkg/ccl/importccl/load.go, line 365 at r2 (raw file):

	}

	if err := base.WriteFile(ctx, filename, bytes.NewReader(sstContents)); err != nil {

We build the sstable in memory, only to immediately write it to disk. More evidence that there is further API cleanup needed here (not for this PR).


pkg/storage/engine/sst_writer.go, line 27 at r1 (raw file):

	f  *memFile
	// DataSize tracks the total key and value bytes added so far.
	DataSize uint64

What is this used for? sstable.Writer.EstimatedSize() will give a more accurate estimation of the final sstable size.


pkg/storage/engine/sst_writer.go, line 42 at r1 (raw file):

		MergerName:              "nullptr",
		TablePropertyCollectors: PebbleTablePropertyCollectors,
	}

Better to use DefaultPebbleOptions().MakeWriterOptions(0) here and then override TableFormat to be LevelDB. That way, changes to the default options show up here.


pkg/storage/engine/sst_writer.go, line 85 at r1 (raw file):

// Truncate returns the in-memory buffer from the previous call to Truncate
// (or the start), until now.
func (fw *SSTWriter) Truncate() ([]byte, error) {

What is this about? Seems like an odd method to exist here.

Oh, it is used by multiSSTWriter to stream the data out to disk. That is just convoluted. We should instead have SSTWriter write directly to disk. Can you add a TODO that this method should be removed and SSTWriter should take an optional path to write to disk.


pkg/storage/engine/sst_writer.go, line 173 at r1 (raw file):

}

type memFile struct {

We should expose a writable version of vfs.NewMemFile, though I suppose you don't get access to the data.


pkg/storage/engine/sst_writer.go, line 195 at r1 (raw file):

}

func (f *memFile) ReadAt(p []byte, off int64) (int, error) {

Do you need Read and ReadAt? I'm pretty sure sstable.Writer doesn't need those methods.

@itsbilal itsbilal force-pushed the pebble-sstwriter-take-two branch 4 times, most recently from 53a1161 to 9c7b84c Compare November 7, 2019 21:49
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just went down the rabbit hole and actually updated engine.SSTWriter to take file handles, in the third commit. The stuff in store_snapshot.go became a lot simpler as a result. PTAL!

I could also do a fourth commit where I move over the last remaining RocksDBSstFileWriter uses to the pebble one (all in tests), except for the one test that compares the two.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajkr, @itsbilal, @petermattis, and @sumeerbhola)


pkg/ccl/importccl/load.go, line 365 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

We build the sstable in memory, only to immediately write it to disk. More evidence that there is further API cleanup needed here (not for this PR).

Done. TODO added. This is a particularly annoying instance, the interface that base belongs to doesn't let us get a file handle easily that we could just pass to SSTWriter.


pkg/storage/engine/sst_writer.go, line 27 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What is this used for? sstable.Writer.EstimatedSize() will give a more accurate estimation of the final sstable size.

It's used to determine whether we wrote anything of use to the SSTable, and if we should call SSSS.NewFile() There are times when EstimatedSize would return a non-zero number while this wouldn't, so it's probably worth keeping.


pkg/storage/engine/sst_writer.go, line 42 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Better to use DefaultPebbleOptions().MakeWriterOptions(0) here and then override TableFormat to be LevelDB. That way, changes to the default options show up here.

Done.


pkg/storage/engine/sst_writer.go, line 85 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What is this about? Seems like an odd method to exist here.

Oh, it is used by multiSSTWriter to stream the data out to disk. That is just convoluted. We should instead have SSTWriter write directly to disk. Can you add a TODO that this method should be removed and SSTWriter should take an optional path to write to disk.

Removed.


pkg/storage/engine/sst_writer.go, line 195 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Do you need Read and ReadAt? I'm pretty sure sstable.Writer doesn't need those methods.

Done. Removed.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Given the changes to bulk IO and KV code, @dt and @nvanbenschoten should give this a quick review to make sure we're not messing anything up.

Reviewed 12 of 12 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajkr, @itsbilal, @petermattis, and @sumeerbhola)


pkg/ccl/importccl/load.go, line 367 at r5 (raw file):

	// TODO(itsbilal): Pass a file handle into SSTWriter instead of writing to a
	// MemFile first.

Might be able to snipe @dt or someone else from Bulk IO into doing this. ExternalStorage needs to be able to provide a Writer interface for uploading data. Shouldn't be difficult.


pkg/storage/replica_raftstorage.go, line 882 at r5 (raw file):

	}
	if unreplicatedSST.DataSize > 0 {
		if err := inSnap.SSSS.WriteSST(ctx, unreplicatedSSTFile.Data()); err != nil {

This deserves a TODO as it maintains the same strange behavior as previously where we build the sstable in memory and then write it to a file. Better to just build the sstable directly on disk.


pkg/storage/engine/sst_writer.go, line 195 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Done. Removed.

Read() is still present. Is it needed for something?


pkg/storage/engine/sst_writer_test.go, line 171 at r5 (raw file):

		_ = makePebbleSST(b, kvs)
	}
	b.StopTimer()

How does the benchmark time for building sstables compare between Pebble and RocksDB?

Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajkr, @itsbilal, @petermattis, and @sumeerbhola)


pkg/storage/replica_raftstorage.go, line 882 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This deserves a TODO as it maintains the same strange behavior as previously where we build the sstable in memory and then write it to a file. Better to just build the sstable directly on disk.

Done.


pkg/storage/engine/sst_writer.go, line 195 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Read() is still present. Is it needed for something?

Nope, don't know why I thought it was being used as part of writeCloseSyncer. Removed.


pkg/storage/engine/sst_writer_test.go, line 171 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

How does the benchmark time for building sstables compare between Pebble and RocksDB?

Comparing just the SSTWriters themselves (which, it's important to note, is only part of the picture though a pretty important part), where old = RocksDBSstFileWriter:

name             old time/op   new time/op   delta
WriteSSTable-12   3.95µs ± 3%   2.48µs ± 4%  -37.37%  (p=0.008 n=5+5)

name             old speed     new speed     delta
WriteSSTable-12  301MB/s ± 3%  481MB/s ± 4%  +59.71%  (p=0.008 n=5+5)

Any benchmarks around streaming / applying snapshots would also be interesting.

This commit does two things: It moves the SSTWriter from the bulk
package to the engine package, and makes any related import changes.
The tests file for SSTWriter is also moved accordingly. In addition,
it adds any methods necessary to SSTWriter to make it adhere to
the engine.Writer interface - Put, ClearRange, etc. This is
useful in the next commit.

SSTWriter.Add() is removed as it is effectively replaced by Put().

Release note: None
Now that SSTFileWriter is in the engine package and implements
Writer, it can be used as a drop-in replacement for the
RocksDB variant.

Fixes cockroachdb#41657

Release note: None
@itsbilal
Copy link
Member Author

Resolved merge conflicts, and added a 4th commit that updates all tests using RocksDBSstFileWriter to use SSTWriter except for the ones that compare the two (or specifically test the RocksDB writer).

Copy link
Member

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a quick scan of the bulk-IO parts and LGTM.

Reviewed 1 of 12 files at r6, 1 of 11 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajkr, @dt, @itsbilal, @nvanbenschoten, @petermattis, and @sumeerbhola)


pkg/ccl/importccl/load.go, line 367 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Might be able to snipe @dt or someone else from Bulk IO into doing this. ExternalStorage needs to be able to provide a Writer interface for uploading data. Shouldn't be difficult.

Hm, the Reader argument to uploads mostly comes from the cloud APIs. Some of the current implementations also rely on being able to Seek in the passed content to retry files on failure (in some cases we added this because of flaky cloud providers, others include it internally). Even if we could stick a buffer in the middle to turn a Writer into a Reader, if we switched to a more streaming model where we upload as we write, retries will be tricky (i.e. we might need to build another layer of reties, over the entire request, if we can no longer retry just uploading the result).

Happy to have my name added to the TODO, but I don't know if it'll happen right away.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 12 files at r3, 12 of 12 files at r6, 6 of 6 files at r7, 11 of 11 files at r8, 5 of 5 files at r9.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajkr, @itsbilal, @petermattis, and @sumeerbhola)


pkg/storage/replica_sst_snapshot_storage.go, line 109 at r8 (raw file):

// the provided SST when it is finished using it. If the provided SST is empty,
// then no file will be created and nothing will be written.
func (ssss *SSTSnapshotStorageScratch) WriteSST(ctx context.Context, data []byte) error {

If we're taking a context in NewFile, why does this still also take a context?


pkg/storage/replica_sst_snapshot_storage.go, line 212 at r8 (raw file):

		return nil
	}
	if sssf.chunkSize > 0 && len(sssf.buffer) > 0 {

Remove the sssf.chunkSize > 0 condition. We don't care about that here.


pkg/storage/store_snapshot.go, line 122 at r8 (raw file):

	ssss        *SSTSnapshotStorageScratch
	currSST     engine.SSTWriter
	currSSTFile *SSTSnapshotStorageFile

Do you mind checking that we still need this field? It seems to be used once in multiSSTWriter.Close, but I'd expect engine.SSTWriter.Close to have already closed the file. Is that incorrect?

Instead of always having a MemFile internally, this changes
the SST Writer to take a fileCloseSyncer parameter that
could potentially be a non-memfile. Also update
multiSSTWriter to use a chunkSize-aware
SSTSnapshotStorageFile instead of handling truncation inside it.

Release note: None
Update any tests (other than those that explicitly test for
equivalence with RocksDB-built SSTs) to use the pebble-based
engine.SSTWriter instead of RocksDBSstFileWriter.

Release note: None
Copy link
Member Author

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @ajkr, @itsbilal, @nvanbenschoten, @petermattis, and @sumeerbhola)


pkg/storage/replica_sst_snapshot_storage.go, line 109 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If we're taking a context in NewFile, why does this still also take a context?

Because we call NewFile in this method. The context is stored on a per-file basis, which we could probably change if we want.


pkg/storage/replica_sst_snapshot_storage.go, line 212 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Remove the sssf.chunkSize > 0 condition. We don't care about that here.

Done.


pkg/storage/store_snapshot.go, line 122 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you mind checking that we still need this field? It seems to be used once in multiSSTWriter.Close, but I'd expect engine.SSTWriter.Close to have already closed the file. Is that incorrect?

That's correct, it's not necessary. Thanks for discovering! Removed.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 7 files at r10, 5 of 5 files at r11.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @ajkr, @itsbilal, @petermattis, and @sumeerbhola)


pkg/storage/replica_sst_snapshot_storage.go, line 109 at r8 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Because we call NewFile in this method. The context is stored on a per-file basis, which we could probably change if we want.

Oh I see, I thought this was a method on SSTSnapshotStorageFile, not on SSTSnapshotStorageScratch.

@itsbilal
Copy link
Member Author

Thanks for the review!

bors r+

craig bot pushed a commit that referenced this pull request Nov 11, 2019
42252: storage: Move SSTWriter to engine, use it instead of Rocks-based writer r=itsbilal a=itsbilal

Currently, we use `RocksDBSstFileWriter` for writing SSTs when receiving cockroach snapshots. This incurs some extra performance overhead with cgo crossings that we can avoid by using the `sstable.Writer` in pebble.

This PR moves a wrapper for the Pebble-based writer that currently resides in `bulk` to `engine`, then adds methods to it so it could be used as a drop-in replacement (almost) for the RocksDB writer. In the second commit, all non-test uses of `RocksDBSstFileWriter` are moved to `SSTWriter`.

Parts taken from #39802 , and fixes #41657.

Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
@craig
Copy link
Contributor

craig bot commented Nov 11, 2019

Build succeeded

@craig craig bot merged commit 47396f3 into cockroachdb:master Nov 11, 2019
Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal, @petermattis, and @sumeerbhola)


pkg/storage/replica_sst_snapshot_storage.go, line 174 at r11 (raw file):

// SSTSnapshotStorageScratch. Writing empty contents is okay and is treated as
// a noop. The file must have not been closed.
func (sssf *SSTSnapshotStorageFile) Write(contents []byte) (int, error) {

Where is the int return value used?

@itsbilal
Copy link
Member Author


pkg/storage/replica_sst_snapshot_storage.go, line 174 at r11 (raw file):

Previously, ajkr (Andrew Kryczka) wrote…

Where is the int return value used?

It's used inside sstable.Writer.writeRawBlock in Pebble. Also it's part of the io.Writer interface so we're forced to supply it as a result.

Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @itsbilal, @petermattis, and @sumeerbhola)


pkg/storage/replica_sst_snapshot_storage.go, line 174 at r11 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

It's used inside sstable.Writer.writeRawBlock in Pebble. Also it's part of the io.Writer interface so we're forced to supply it as a result.

Understood, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

storage: use pebble.sstable.Writer for writing to sstables instead of the SstFileWriter in RocksDB
6 participants