
workload: allow columnar data generation (alloc -73%, MB/s +60%) #35349

Merged
merged 1 commit into from
May 7, 2019

Conversation


@danhhz danhhz commented Mar 4, 2019

A workload.Generator previously could specify its initial table data
using one of two methods: one outputs a single row as []interface{}
and the other a batch of [][]interface{}. This commit switches the
latter to be a columnar batch, allowing selected performance-critical
workload generators (tpcc, bank) to generate initial table data with
dramatically reduced allocations.

The single row option still exists and will continue to be used by most
workload.Generators, as it's much simpler. An adaptor from the columnar
batch to the old row-oriented [][]interface{} batch is included for
ease of use in non-performance-sensitive consumers of initial table
data. IMPORT, on the other hand, is switched to using the columnar
batches directly.
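For illustration, a minimal sketch of what such a columnar-to-row adaptor looks like. ColBatch and its fields here are hypothetical stand-ins, not the actual coldata.Batch API:

```go
package main

import "fmt"

// ColBatch is a toy stand-in for a columnar batch: one typed slice per
// column instead of one []interface{} per row. (Hypothetical; the real
// coldata.Batch API differs.)
type ColBatch struct {
	Ints  []int64
	Names []string
}

// ToRows is the row-oriented adaptor: it materializes the columnar data
// into the old [][]interface{} format for non-performance-sensitive
// consumers, boxing every value along the way.
func (b ColBatch) ToRows() [][]interface{} {
	rows := make([][]interface{}, len(b.Ints))
	for i := range rows {
		rows[i] = []interface{}{b.Ints[i], b.Names[i]}
	}
	return rows
}

func main() {
	batch := ColBatch{Ints: []int64{1, 2}, Names: []string{"a", "b"}}
	fmt.Println(batch.ToRows()) // [[1 a] [2 b]]
}
```

Performance-critical consumers read the typed columns directly; only the simple consumers pay the boxing cost of the adaptor.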

Some of the savings here comes from reusing batches, but the majority
comes from []interface{} requiring that everything assigned to it live
on the heap, while something like []int doesn't. We also get some
speedup in initial table data consumers from fewer type switches.
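The boxing cost is easy to demonstrate in isolation with the standard library's testing.AllocsPerRun. The helper name below is hypothetical, just for measurement:

```go
package main

import (
	"fmt"
	"testing"
)

// allocsFilling compares allocations for filling a row-style
// []interface{} against a column-style []int of the same length.
func allocsFilling(n int) (rowAllocs, colAllocs float64) {
	rowAllocs = testing.AllocsPerRun(100, func() {
		row := make([]interface{}, n)
		for i := range row {
			row[i] = i // boxing: each value >= 256 escapes to the heap
		}
	})
	colAllocs = testing.AllocsPerRun(100, func() {
		col := make([]int, n)
		for i := range col {
			col[i] = i // stored inline, no boxing
		}
	})
	return rowAllocs, colAllocs
}

func main() {
	row, col := allocsFilling(1024)
	fmt.Printf("[]interface{}: %.0f allocs/op, []int: %.0f allocs/op\n", row, col)
}
```

Filling the []interface{} allocates per element (the Go runtime only interns boxed integers below 256), while the []int fill allocates at most once for the slice itself.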

This could also be accomplished without being columnar, but doing the
minimum non-columnar thing to get the same results would require
reinventing most of the work the columnar format already does. Plus, it
will almost certainly be convenient for columnar exec benchmarking to
have a common format.

Benchmark results:

name                               old time/op    new time/op    delta
InitialData/tpcc/warehouses=1-8       552ms ± 1%     454ms ± 0%  -17.81%  (p=0.008 n=5+5)
InitialData/bank/rows=1000-8          461µs ± 0%     332µs ± 1%  -28.10%  (p=0.008 n=5+5)
WriteCSVRows-8                       15.0µs ± 1%    16.8µs ± 1%  +12.44%  (p=0.008 n=5+5)
CSVRowsReader-8                      17.8µs ± 1%    19.3µs ± 1%   +8.41%  (p=0.008 n=5+5)

name                               old speed      new speed      delta
InitialData/tpcc/warehouses=1-8     128MB/s ± 1%   209MB/s ± 0%  +63.22%  (p=0.008 n=5+5)
InitialData/bank/rows=1000-8        223MB/s ± 0%   350MB/s ± 1%  +56.83%  (p=0.008 n=5+5)
WriteCSVRows-8                      118MB/s ± 2%   107MB/s ± 3%   -9.43%  (p=0.008 n=5+5)
CSVRowsReader-8                     101MB/s ± 4%    92MB/s ± 5%   -8.31%  (p=0.016 n=5+5)

name                               old alloc/op   new alloc/op   delta
InitialData/tpcc/warehouses=1-8       246MB ± 0%      75MB ± 0%  -69.61%  (p=0.008 n=5+5)
InitialData/bank/rows=1000-8          232kB ± 0%      19kB ± 0%  -91.75%  (p=0.008 n=5+5)
WriteCSVRows-8                       5.60kB ± 0%    5.70kB ± 0%   +1.71%  (p=0.016 n=4+5)
CSVRowsReader-8                      7.35kB ± 1%    7.39kB ± 0%     ~     (p=0.667 n=5+4)

name                               old allocs/op  new allocs/op  delta
InitialData/tpcc/warehouses=1-8       5.60M ± 0%     1.48M ± 0%  -73.56%  (p=0.008 n=5+5)
InitialData/bank/rows=1000-8          6.00k ± 0%     1.02k ± 0%  -83.01%  (p=0.008 n=5+5)
WriteCSVRows-8                         48.0 ± 0%      50.0 ± 0%   +4.17%  (p=0.008 n=5+5)
CSVRowsReader-8                        54.0 ± 0%      55.0 ± 0%   +1.85%  (p=0.008 n=5+5)

We could probably speed the CSV stuff back up with some followup work,
but (a) the overall fixtures import benchmark is still faster and (b)
we're about to switch to dt's import magic that skips the CSV roundtrip.

Current fixtures import

name                               old time/op    new time/op    delta
ImportFixture/tpcc/warehouses=1-8     3.35s ± 2%     3.29s ± 3%     ~     (p=0.310 n=5+5)

name                               old speed      new speed      delta
ImportFixture/tpcc/warehouses=1-8  26.4MB/s ± 2%  26.9MB/s ± 3%     ~     (p=0.310 n=5+5)

name                               old alloc/op   new alloc/op   delta
ImportFixture/tpcc/warehouses=1-8    3.09GB ± 0%    2.92GB ± 0%   -5.60%  (p=0.008 n=5+5)

name                               old allocs/op  new allocs/op  delta
ImportFixture/tpcc/warehouses=1-8     32.1M ± 0%     28.0M ± 0%  -12.86%  (p=0.008 n=5+5)

Magic fixtures import that skips CSV roundtrip:

name                               old time/op    new time/op    delta
ImportFixture/tpcc/warehouses=1-8     2.88s ± 3%     2.86s ± 4%     ~     (p=0.548 n=5+5)

name                               old speed      new speed      delta
ImportFixture/tpcc/warehouses=1-8  30.7MB/s ± 3%  30.9MB/s ± 4%     ~     (p=0.548 n=5+5)

name                               old alloc/op   new alloc/op   delta
ImportFixture/tpcc/warehouses=1-8    2.87GB ± 0%    2.73GB ± 0%   -4.97%  (p=0.008 n=5+5)

name                               old allocs/op  new allocs/op  delta
ImportFixture/tpcc/warehouses=1-8     26.1M ± 0%     21.5M ± 0%  -17.75%  (p=0.008 n=5+5)

Touches #34809

Release note: None

@danhhz danhhz added the do-not-merge bors won't merge a PR with this label. label Mar 4, 2019
@danhhz danhhz requested review from jordanlewis and a team March 4, 2019 16:32
@danhhz danhhz requested a review from a team as a code owner March 4, 2019 16:32
@danhhz danhhz requested review from a team March 4, 2019 16:32
@cockroach-teamcity

This change is Reviewable


danhhz commented Mar 25, 2019

Okay @jordanlewis this is as far as I got on Friday. There's a bit of a performance regression, so I haven't bothered updating the commit message or PR description yet. This is a bummer and there are definitely a bunch of places I can optimize, but I'm torn between how huge and scary this already is vs having a window between landing this and the followup perf work to really get all the benefits of this. Note that even though the CSV writing (which we still use for a little while longer) is slower, fixtures import tpcc already uses drastically fewer allocs and is overall a bit faster.

name                               old time/op    new time/op    delta
pkg:github.com/cockroachdb/cockroach/pkg/workload goos:darwin goarch:amd64
InitialData/tpcc/warehouses=1-8       628ms ± 2%     539ms ± 1%  -14.14%  (p=0.008 n=5+5)
InitialData/bank/rows=1000-8          572µs ±10%     752µs ± 2%  +31.40%  (p=0.008 n=5+5)
WriteCSVRows-8                       15.9µs ± 2%    18.9µs ± 2%  +18.93%  (p=0.008 n=5+5)
CSVRowsReader-8                      18.1µs ± 0%    20.7µs ± 1%  +14.05%  (p=0.016 n=4+5)
pkg:github.com/cockroachdb/cockroach/pkg/ccl/workloadccl goos:darwin goarch:amd64
ImportFixture/tpcc/warehouses=1-8     3.25s ± 1%     3.31s ± 0%   +1.83%  (p=0.016 n=5+4)

name                               old speed      new speed      delta
pkg:github.com/cockroachdb/cockroach/pkg/workload goos:darwin goarch:amd64
InitialData/tpcc/warehouses=1-8     113MB/s ± 2%   176MB/s ± 1%  +56.23%  (p=0.008 n=5+5)
InitialData/bank/rows=1000-8        180MB/s ± 9%   154MB/s ± 2%  -14.42%  (p=0.008 n=5+5)
WriteCSVRows-8                      114MB/s ± 3%    96MB/s ± 4%  -15.28%  (p=0.008 n=5+5)
CSVRowsReader-8                     101MB/s ± 4%    87MB/s ± 5%  -14.47%  (p=0.008 n=5+5)
pkg:github.com/cockroachdb/cockroach/pkg/ccl/workloadccl goos:darwin goarch:amd64
ImportFixture/tpcc/warehouses=1-8  27.2MB/s ± 1%  26.7MB/s ± 0%   -1.80%  (p=0.016 n=5+4)

name                               old alloc/op   new alloc/op   delta
pkg:github.com/cockroachdb/cockroach/pkg/workload goos:darwin goarch:amd64
InitialData/tpcc/warehouses=1-8       246MB ± 0%      75MB ± 0%  -69.51%  (p=0.008 n=5+5)
InitialData/bank/rows=1000-8          232kB ± 0%     347kB ± 0%  +49.41%  (p=0.016 n=5+4)
WriteCSVRows-8                       5.59kB ± 0%    7.74kB ± 0%  +38.44%  (p=0.016 n=5+4)
CSVRowsReader-8                      7.38kB ± 1%    7.38kB ± 0%     ~     (p=0.206 n=5+4)
pkg:github.com/cockroachdb/cockroach/pkg/ccl/workloadccl goos:darwin goarch:amd64
ImportFixture/tpcc/warehouses=1-8    2.56GB ± 0%    2.39GB ± 0%   -6.72%  (p=0.008 n=5+5)

name                               old allocs/op  new allocs/op  delta
pkg:github.com/cockroachdb/cockroach/pkg/workload goos:darwin goarch:amd64
InitialData/tpcc/warehouses=1-8       5.60M ± 0%     1.48M ± 0%  -73.56%  (p=0.008 n=5+5)
InitialData/bank/rows=1000-8          6.00k ± 0%     7.02k ± 0%  +16.97%  (p=0.008 n=5+5)
WriteCSVRows-8                         48.0 ± 0%      51.0 ± 0%   +6.25%  (p=0.008 n=5+5)
CSVRowsReader-8                        53.0 ± 0%      54.0 ± 0%   +1.89%  (p=0.008 n=5+5)
pkg:github.com/cockroachdb/cockroach/pkg/ccl/workloadccl goos:darwin goarch:amd64
ImportFixture/tpcc/warehouses=1-8     31.7M ± 0%     27.6M ± 0%  -13.02%  (p=0.016 n=5+4)

@jordanlewis

Any initial guesses why it might be slower? I'll take a look soon.


danhhz commented Mar 28, 2019

The fixtures import is almost certainly slower because CSV generation is slower. After 19.1.0 goes out the door, David is going to merge #36250, which switches fixtures import to bypass the CSV by default, and that path is definitely faster with this PR. So even if we can't get the CSV performance back, this whole problem goes away anyway.

As for why it's slower, I've tried digging into before/after profiles. My best guess so far is it's because most workload.Generators have a batch size of 1, which is the worst case for seeing columnar overhead.

@jordanlewis jordanlewis left a comment


This is awesome work! I have a bunch of comments, but in general I'm definitely eager to see this happen.

@@ -56,7 +66,10 @@ func NewMemBatch(types []types.T) Batch {
// NewMemBatchWithSize allocates a new in-memory Batch with the given column
// size. Use for operators that have a precisely-sized output batch.
func NewMemBatchWithSize(types []types.T, size int) Batch {
	b := &memBatch{}
	if max := math.MaxUint16; size > max {
		panic(fmt.Sprintf(`batches cannot have length larger than %d; requested %d`, max, size))

👍

@@ -29,6 +29,9 @@ type column interface{}
type Vec interface {
	Nulls

	// Type returns the type of datums stored in this Vec.

nit: we haven't used datums yet in this code - could you just say data, to avoid eventual terminology clash with tree.Datum?

	// Allocate all the []interface{} row slices in one go.
	datums := make([]interface{}, numRows*numCols)
	for colIdx, col := range cb.ColVecs() {
		switch col.Type() {

Like, here, you should really have the type information passed in, rather than having to check the ColVec, in my opinion. ColBatchToRows should also take a []types.T.

}
table := workload.Table{
	InitialRows: workload.BatchedTuples{
		Batch: func(rowIdx int) [][]interface{} { return rows[rowIdx] },
		FillBatch: func(batchIdx int, cb coldata.Batch, _ *bufalloc.ByteAllocator) {
			*cb.(*coldata.MemBatch) = *batches[batchIdx].(*coldata.MemBatch)

Is this the reason you had to export MemBatch? This seems kind of unpleasant - could you maybe change the FillBatch interface to return a coldata.Batch instead of modifying one in place? Or does that have other problems?

// FillBatch is a function to deterministically compute a columnar-batch of
// tuples given its index.
//
// To save allocations, the `Vec`s in the passed `Batch` are reused when

nit: is it just me or do we not normally enclose types in backticks in comments? obviously take it or leave it

for i, datum := range row {
	if datum == nil {
		// WIP what do we do here
		colTypes[i] = types.Bytes

Shouldn't we know the types up front somehow? I think the Tuples interface should return its column types as well.


That would also solve the allocations issue right?

}

// ColBatchToRows materializes the columnar data in a coldata.Batch into rows.
func ColBatchToRows(cb coldata.Batch) [][]interface{} {

Could you use the materializer operator instead of this function? It does the same thing.

@danhhz danhhz force-pushed the workload_colbatch branch 2 times, most recently from 654d73e to 3a1a58b Compare March 30, 2019 00:16
@danhhz danhhz changed the title [WIP] workload: allow columnar data generation (alloc -94%, MB/s +54%) workload: allow columnar data generation (alloc -73%, MB/s +60%) Mar 30, 2019

@danhhz danhhz left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @jordanlewis)


pkg/sql/exec/coldata/vec.go, line 32 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

nit: we haven't used datums yet in this code - could you just say data, to avoid eventual terminology clash with tree.Datum?

Done


pkg/workload/csv_test.go, line 93 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Is this the reason you had to export MemBatch? This seems kind of unpleasant - could you maybe change the FillBatch interface to return a coldata.Batch instead of modifying one in place? Or does that have other problems?

It has to take it in to get the memory reuse. Also returning it seems awkward to me, especially since (I think) we're both sort of convinced at this point that MemBatch is the new Batch
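To illustrate the reuse argument, here is a toy version of the fill-into-a-passed-batch pattern; batch and fillBatch are hypothetical stand-ins for the workload/coldata types, not the real API:

```go
package main

import "fmt"

// batch is a toy columnar batch whose backing slices can be refilled in
// place. (Hypothetical stand-in for coldata.Batch.)
type batch struct {
	ints []int64
}

// fillBatch writes batchIdx's tuples into b, reusing b's backing array
// when it's large enough. Taking the batch as an argument (rather than
// returning a fresh one) is what makes the reuse possible.
func fillBatch(batchIdx int, b *batch) {
	const batchSize = 4
	if cap(b.ints) < batchSize {
		b.ints = make([]int64, batchSize)
	}
	b.ints = b.ints[:batchSize]
	for i := range b.ints {
		b.ints[i] = int64(batchIdx*batchSize + i)
	}
}

func main() {
	var b batch
	fillBatch(0, &b)
	first := &b.ints[0]
	fillBatch(1, &b)
	// The second fill reused the same backing array: no new allocation.
	fmt.Println(b.ints, first == &b.ints[0]) // [4 5 6 7] true
}
```

If fillBatch instead returned a fresh batch, every call would allocate new column slices and the reuse savings would be lost.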


pkg/workload/workload.go, line 173 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

nit: is it just me or do we not normally enclose types in backticks in comments? obviously take it or leave it

I certainly do it, though admittedly inconsistently, so there are definitely areas of the codebase that have them. There aren't any other examples in this file, so happy to revert


pkg/workload/workload.go, line 197 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

That would also solve the allocations issue right?

FWIW, this Tuples helper is specifically for making it easier to write a Generator when you don't need to eke out every last bit of performance, so the allocation really isn't a big deal. Added a comment since that info wasn't in the code.

As for the types, I agree that we want to plumb some of this information. It'll be super helpful, for example in workloadReader. But I like how easy it is currently to write a workload.Generator if you don't need the super performant version. Basically, there are some design decisions here that I'd like to consider carefully before committing to something new. How do you feel about separating this question out and focusing a followup on how workload type info gets plumbed?


pkg/workload/workload.go, line 245 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Could you use the materializer operator instead of this function? It does the same thing.

Definitely don't want that dep, plus this way we get to control the behavior for what workload needs without worrying about breaking distsql if we change something. Also this is pretty simple, so I'm comfortable with the duplication.


pkg/workload/workload.go, line 250 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Like, here, you should really have the type information passed in, rather than having to check the ColVec, in my opinion. ColBatchToRows should also take a []types.T.

This actually seems cleaner to me as is than if it got some types passed in that may not even match. I don't understand the objection, to be honest, can you expand on it?

@danhhz danhhz removed the do-not-merge bors won't merge a PR with this label. label Mar 30, 2019

@danhhz danhhz left a comment


Okay, we're past the 19.1.0 push, so I'd like to get this merged. Pulled and resolved conflicts. Also resolved the one outstanding CR issue (assuming you're happy with it). Finally, added a test that runs fixtures import on all registered workloads, which caught a couple missing switch cases. PTAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @danhhz and @jordanlewis)


pkg/workload/workload.go, line 197 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

FWIW, this Tuples helper is specifically for making it easier to write a Generator when you don't need to eke out every last bit of performance, so the allocation really isn't a big deal. Added a comment since that info wasn't in the code.

As for the types, I agree that we want to plumb some of this information. It'll be super helpful, for example in workloadReader. But I like how easy it is currently to write a workload.Generator if you don't need the super performant version. Basically, there are some design decisions here that I'd like to consider carefully before committing to something new. How do you feel about separating this question out and focusing a followup on how workload type info gets plumbed?

I think I came up with a good solution here to both the nil and allocation problems. You can now declare the types if you're going to return a nil, but you don't have to (only two tables needed this, both in ledger). It was also pretty easy to use a sync.Once to do the col type sniffing, so now we don't pay the allocations.
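A sketch of the sync.Once approach described here; the type and method names are hypothetical, not the actual workload package API:

```go
package main

import (
	"fmt"
	"sync"
)

// colType is a toy column-type enum.
type colType int

const (
	typeBytes colType = iota // fallback for nils and unrecognized values
	typeInt
)

// tuples sniffs its column types from the first generated row, guarded
// by a sync.Once so the sniffing (and its allocations) happens at most
// once, no matter how many consumers ask.
type tuples struct {
	genRow   func(rowIdx int) []interface{}
	sniff    sync.Once
	colTypes []colType
}

func (t *tuples) ColTypes() []colType {
	t.sniff.Do(func() {
		row := t.genRow(0)
		t.colTypes = make([]colType, len(row))
		for i, datum := range row {
			switch datum.(type) {
			case int, int64:
				t.colTypes[i] = typeInt
			default:
				t.colTypes[i] = typeBytes
			}
		}
	})
	return t.colTypes
}

func main() {
	tt := &tuples{genRow: func(int) []interface{} { return []interface{}{int64(7), "x"} }}
	fmt.Println(tt.ColTypes()) // sniffed once; repeat calls return the cached slice
}
```

Declared types (for tables that can return nil) would simply pre-populate colTypes so the Once body never needs to guess.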

@danhhz danhhz force-pushed the workload_colbatch branch 2 times, most recently from 4dafabf to ed4d390 Compare May 7, 2019 19:39
@jordanlewis jordanlewis left a comment


:lgtm_strong:

Super pumped about this! You don't get a 90% allocation reduction every day. Thanks for pushing to get it in.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)


danhhz commented May 7, 2019

Thanks for all the reviews!

bors r=jordanlewis


craig bot commented May 7, 2019

Build failed


danhhz commented May 7, 2019

TestStoreRangeMergeRaftSnapshot was just modified by #37350 so I'm inclined to guess flake. Dunno what the story is with that TestEncodeBitArray panic but I doubt it's this PR.

bors r=jordanlewis

craig bot pushed a commit that referenced this pull request May 7, 2019
35349: workload: allow columnar data generation (alloc -73%, MB/s +60%) r=jordanlewis a=danhhz


Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>

craig bot commented May 7, 2019

Build succeeded

@craig craig bot merged commit b963c0d into cockroachdb:master May 7, 2019
@danhhz danhhz deleted the workload_colbatch branch May 7, 2019 23:56
@danhhz danhhz restored the workload_colbatch branch May 8, 2019 00:52
danhhz added a commit to danhhz/cockroach that referenced this pull request May 8, 2019
workload's Table schemas are SQL schemas, but cockroachdb#35349 switched the
initial data to be returned as a coldata.Batch, which has a more limited
set of types. (Or, in the case of simple workloads that return a
[]interface{}, it's roundtripped through coldata.Batch by the `Tuples`
helper.) Notably, this means a SQL STRING column is represented the same
as a BYTES column (ditto UUID, etc).

This caused a regression in splits, which received some []byte data for
a column and tried to hand it to SPLIT as a SQL BYTES datum. This didn't
work for the UUID column in tpcc's history table nor the VARCHAR in
ycsb's usertable. Happily, a STRING works for both of these. It also
seems to work for BYTES columns, so the ambiguity appears to be fine
in this case. When/if someone wants to add a workload that splits a
BYTES primary key column containing non-utf8 data, we may need to
revisit.

A more principled fix would be to get the fidelity back by parsing the
SQL schema, which in fact we do in `importccl.makeDatumFromColOffset`.
However, at the moment, this hack works and avoids the complexity and
the undesirable pkg/sql/parser dep.

Closes cockroachdb#37383
Closes cockroachdb#37382
Closes cockroachdb#37381
Closes cockroachdb#37380
Closes cockroachdb#37379
Closes cockroachdb#37378
Closes cockroachdb#37377
Closes cockroachdb#37393

Release note: None
craig bot pushed a commit that referenced this pull request May 8, 2019
37401: workload: fix --splits regression introduced in #35349 r=tbg a=danhhz


Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
@danhhz danhhz deleted the workload_colbatch branch April 3, 2020 17:05
rytaft added a commit to rytaft/cockroach that referenced this pull request Sep 23, 2020
…rkload

This commit adds back the injection of statistics for some tables at the
start of the TPC-C workload that had been removed by mistake in cockroachdb#35349.
The removal caused a regression in TPC-C since it caused some plans to change
due to the lack of statistics at the beginning of the benchmark. Adding the
stats back should fix the regression.

Fixes cockroachdb#54702

Release note: None
craig bot pushed a commit that referenced this pull request Sep 23, 2020
54713: workload: add back injection of stats for TPC-C tables at start of workload r=rytaft a=rytaft

This commit adds back the injection of statistics for some tables at the
start of the TPC-C workload that had been removed by mistake in #35349.
The removal caused a regression in TPC-C since it caused some plans to change
due to the lack of statistics at the beginning of the benchmark. Adding the
stats back should fix the regression.

Fixes #54702

Release note: None

Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>