
Allocate new Slice for VariableWidthBlockEncoding readBlock #11235

Merged · 2 commits · Mar 14, 2022

Conversation

@linzebing (Member) commented Mar 1, 2022

Description

Is this change a fix, improvement, new feature, refactoring, or other?

This PR changes VariableWidthBlock to be backed by newly allocated Slices instead of returning a view into a page:

  1. This will fix the calculation of page retained size.
  2. VariableWidthBlockEncoding is the only block encoding that creates direct references to the input slice instead of copying data into newly allocated memory. It's better to make things consistent.
  3. This is also blocking #11174 (Implement parallel read from S3 for exchange storage), where ExchangeStorageReader returns a view of a byte range of a byte array; because VariableWidthBlockEncoding retains references to the input, this caused memory leaks.

This, however, will have a small performance cost. I did another optimization to offset the cost: replacing InputStreamSliceInput with LittleEndianDataInputStream in HttpPageBufferClient, so that we avoid the memory copy through InputStreamSliceInput's internal buffer.

It turns out that these two changes combined resulted in a net win for efficiency, both in terms of CPU and latency (benchmark on hive_sf1000_parquet_part):
[benchmark results image]
(benchmark on hive_sf1000_parquet_unpart):
[benchmark results image]

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Core query engine.

How would you describe this change to a non-technical end user or system administrator?

N/A

Related issues, pull requests, and links

Blocks #11174

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Core Engine
* Allocate a new Slice instead of returning a view when decoding `VariableWidthBlock`s, fixing memory accounting of pages containing such blocks. Also eliminate an extra memory copy when consuming data in `HttpPageBufferClient`. Benchmarks show that these two changes combined yield a 0.7%-2.7% CPU efficiency improvement on TPC-H and TPC-DS datasets. ({issue}`11315`)

@findepi requested a review from sopel39 on March 1, 2022 10:19
@@ -68,7 +68,8 @@ public Block readBlock(BlockEncodingSerde blockEncodingSerde, SliceInput sliceIn
boolean[] valueIsNull = decodeNullBits(sliceInput, positionCount).orElse(null);

int blockSize = sliceInput.readInt();
Slice slice = sliceInput.readSlice(blockSize);
Slice slice = Slices.allocate(blockSize);
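For context, a minimal sketch of how the freshly allocated Slice is then filled; only the lines above are from the actual diff, and the readBytes call is assumed based on the existing SliceInput API:

sliceInput.readBytes(slice);   // copy blockSize bytes from the serialized page into the new Slice
// the VariableWidthBlock is then built on top of `slice`, so it no longer references the serialized page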
Member

This implies data copying. In the memory connector we just use page.compact(); (see )
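For reference, a minimal sketch of the compact-before-store pattern mentioned here; the addPage method and pages field are illustrative, not code from this PR:

private final List<Page> pages = new ArrayList<>();

void addPage(Page page)
{
    // replace any blocks that are views into larger buffers with compacted copies,
    // so the stored page does not retain the larger buffers
    page.compact();
    pages.add(page);
}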

@linzebing (Member Author) Mar 1, 2022

Are you suggesting that I compact the page before returning here?

https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/buffer/PagesSerdeUtil.java#L61

Compacting will also create a copy, and therefore achieve the same purpose, right?

Contributor

It's the only place in block decoding where a view is preferred to a copy. It feels like it would be more natural to create a copy here. @sopel39 Do you think there's a practical reason why it is better to compact vs creating a copy explicitly here?

Member

Are you suggesting that I compact the page before returning here?

I'm suggesting compacting if you mean to store the page. This is exactly what we do in OrderByOperator, PartitioningExchanger, PagesIndex, and others.

@sopel39 Do you think there's a practical reason why it is better to compact vs creating a copy explicitly here?

You don't need that extra copy if the data is going to be consumed immediately (e.g. aggregation)

Contributor

It looks like the uncompacted pages are somehow being retained for longer than expected. @linzebing what is the code path that retains those pages?

Member

The reason we don't compact or eagerly copy is that it can be really expensive. This is the same reason we don't compact dictionaries. Also, there are methods that will attempt to dedupe things, but we don't use those because they had a big performance impact.

All of that said, I'm not saying "don't do this". I'm saying "be careful" and check the performance.

Member

Also, there are methods that will attempt to dedupe things, but we don't use those because they had a big performance impact.

By that @dain means using something like io.trino.operator.project.PageProcessor.ProjectSelectedPositions#updateRetainedSize at the Page level.

I think this is because compact doesn't compact the dictionary field of a DictionaryBlock, as I mentioned above.

I think we should compact dictionaries in io.trino.spi.block.DictionaryBlock#compact

Member

I'll jump in with yet another potential solution here:
You could reframe this as "VariableWidthBlock is the only block encoding that doesn't copy memory into a new structure again". It would be a bigger change, but if the deserialization operated directly on an InputStreamSliceInput in the first place instead of eagerly copying the whole serialized page into a Slice, then this wouldn't be adding unnecessary overhead.

I've looked into trying to do something like that before, and the problems I saw were solvable but included:

  • Finding a clean way to still be able to compute the page checksum on the input stream (and having a way to get the current digest and reset for the next logical page at some middle point in the buffer)
  • Verifying that moving the deserialization point into the I/O layers (e.g. exchange client, unspilling thread) wouldn't degrade performance in unexpected ways.

Contributor

IMO, one way or another the data will have to be copied from an IO buffer (InputStream, IO buffer in ExchangeSource, etc.) before it can be processed by the engine. The page decoding step seems like a good place to do so. I don't really see why the page decoding should return uncompacted pages if the decoding process requires allocations and memory copying anyway.

Member

Agreed, I didn’t notice you had made the same suggestion in another comment.

@linzebing (Member Author)

@sopel39, it feels like the right way to go is to allocate a new Slice here (to be consistent with other encodings). Meanwhile we can eliminate the extra copy here. https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/buffer/PagesSerde.java#L224-L228

How about approving this PR to unblock and file an issue for the latter which I will follow up?

@sopel39 (Member) commented Mar 3, 2022

it feels like the right way to go is to allocate a new Slice here (to be consistent with other encodings).

We copy here for primitive blocks because it's not possible to map a Slice onto a native Java array. Previously, even primitive blocks were Slice-based and we didn't copy. Also, Varbinary columns might be long, so extra copies might make a difference.

Ideally, you compact pages in places where you want to store them. There are other reasons why a page might not be compact (e.g. getRegion), so even if you change this code, that does not guarantee that the page will be compact in your place of interest.
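(An illustrative snippet of the getRegion point; `block` here is any io.trino.spi.block.Block:)

Block region = block.getRegion(0, 10);   // may be a view sharing the source block's data, keeping all of it retained
Block copied = block.copyRegion(0, 10);  // always backed by newly allocated, compact data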

Meanwhile we can eliminate the extra copy here. https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/execution/buffer/PagesSerde.java#L224-L228

How can you eliminate the copy there?

@arhimondr (Contributor) commented Mar 3, 2022

We copy here for primitive blocks because it's not possible to map a Slice onto a native Java array. Previously, even primitive blocks were Slice-based and we didn't copy. Also, Varbinary columns might be long, so extra copies might make a difference.

This is largely true. However, we do create copies for flags and byte arrays in other places (e.g. for "null" flags).

How can you eliminate the copy there?

PageResponseHandler wraps the input stream in an InputStreamSliceInput, which does an unnecessary extra copy: it first copies bytes into an internal buffer, and then bytes are copied from that buffer into the destination (see InputStreamSliceInput#readSlice). With a little refactoring it should be possible to avoid this copy.

Assuming we eliminate the extra copy in PageResponseHandler, it should offset the copy we want to do during block decoding, so efficiency shouldn't be affected.

@linzebing (Member Author)

Currently this is not a hard blocker for us, so we will work around it for now. Created an issue to track this: #11315. Will follow up later.

}

private static class SerializedPageReader
extends AbstractIterator<Slice>
{
private final SliceInput input;
private final LittleEndianDataInputStream input;
Contributor

nit: LittleEndianDataInputStream is not efficient at decoding int / long. A more efficient way would be to allocate a single Slice of the size of the header, read the entire header from the InputStream at once into that Slice, and extract the page size as an int with Slice#getInt, which does it in a single machine instruction. Then the content of that Slice could be efficiently copied into the Slice returned from the computeNext method.

Member Author

Makes sense, addressed the comment.

@pettyjamesm (Member) left a comment

Added a few review notes about the current implementation. It looks like this still requires copying VariableWidthBlock contents twice, if I understand correctly? Is this saving an intermediate copy through InputStreamSliceInput's internal buffers?

if (b < 0) {
return endOfData();
}
byte b1 = (byte) b;
Member

Does this need to be: byte b1 = (byte) (b & 0xFF); ?

Member Author

Shouldn't be needed, since read() returns a single unsigned byte (0-255, or -1 at end of stream, which is handled above), so the cast alone already produces the right value.
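(For illustration: because read() returns -1 or a value in the range 0-255, masking before the narrowing cast makes no difference:)

int b = 0xAB;                        // what read() returns for the byte 0xAB
byte withMask = (byte) (b & 0xFF);   // -85
byte withoutMask = (byte) b;         // also -85: the narrowing cast keeps the low 8 bits
assert withMask == withoutMask;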

int positionCount = Ints.fromBytes(b4, b3, b2, b1);
byte marker = input.readByte();
int uncompressedSize = input.readInt();
int compressedSize = input.readInt();
Member

Agreed with @arhimondr: you could pre-allocate a buffer of the header size and reuse it between calls to computeNext(), since this iterator isn't expected to be thread-safe. If you did, you could even avoid using LittleEndianDataInputStream entirely and operate directly on any InputStream. It might look something like:

private final byte[] headerBuffer = new byte[SERIALIZED_PAGE_HEADER_SIZE];
private final Slice headerSlice = Slices.wrappedBuffer(headerBuffer);
...

@Override
protected Slice computeNext()
{
    try {
        int read = ByteStreams.read(input, headerBuffer, 0, headerBuffer.length);
        // ByteStreams.read never returns a negative value; zero bytes read means end of stream
        if (read == 0) {
            return endOfData();
        }
        if (read != headerBuffer.length) {
            throw new EOFException();
        }
        // contents of headerBuffer are visible through headerSlice
        int compressedSize = headerSlice.getInt(COMPRESSED_SIZE_OFFSET);
        byte[] output = new byte[headerBuffer.length + compressedSize];
        System.arraycopy(headerBuffer, 0, output, 0, headerBuffer.length);
        ByteStreams.readFully(input, output, headerBuffer.length, compressedSize);
        return Slices.wrappedBuffer(output);
        ...

Member Author

Makes sense. Addressed.

@linzebing (Member Author) commented Mar 9, 2022

@pettyjamesm: yes, VariableWidthBlockEncoding still does an extra copy.

The change in HttpPageBufferClient is an orthogonal optimization to offset the efficiency penalty, as you said, saving an intermediate copy in InputStreamSliceInput.

@sopel39 (Member) commented Mar 9, 2022

@linzebing What's the total number of copies now? This PR requires benchmarks to be run to check impact.

@linzebing (Member Author) commented Mar 10, 2022

@sopel39: for VariableWidthBlockEncoding, it allocates a new Slice. For HttpPageBufferClient, it removes the unnecessary memory copy through InputStreamSliceInput's internal buffer, which also reduces unnecessary copies for blocks of other encodings. Combined, it's a net efficiency win.

Benchmark results indicate that this is a net efficiency win.
(benchmark on hive_sf1000_parquet_part):
[benchmark results image]
(benchmark on hive_sf1000_parquet_unpart):
[benchmark results image]

@linzebing force-pushed the page-view-fix branch 2 times, most recently from 25e5d98 to 096fb7a on March 10, 2022 00:34
@linzebing force-pushed the page-view-fix branch 4 times, most recently from 021db85 to 4ea01b1 on March 10, 2022 04:02

int compressedSize = getIntUnchecked(headerSlice, COMPRESSED_SIZE_OFFSET);
byte[] outputBuffer = new byte[SERIALIZED_PAGE_HEADER_SIZE + compressedSize];
arraycopy(headerSlice.byteArray(), 0, outputBuffer, 0, SERIALIZED_PAGE_HEADER_SIZE);
Member

Minor: since we're using headerSlice.byteArray() here but the Slice is created externally, it would be safer to also use headerSlice.byteArrayOffset() as the source offset instead of assuming that the offset is 0.

@linzebing (Member Author) Mar 10, 2022

Actually I wonder if I should just use headerSlice.getBytes(0, outputBuffer, 0, SERIALIZED_PAGE_HEADER_SIZE);. Do you think this makes any difference in performance?

Member

It shouldn't matter much here, but I suspect System.arraycopy would have a slight edge over Slice#getBytes because of the way that Slice uses Unsafe.copyMemory

Member Author

I found https://groups.google.com/g/mechanical-sympathy/c/sug91A1ynF4, which says that Unsafe.copyMemory is slightly faster than System.arraycopy, so I'm using headerSlice.getBytes(0, outputBuffer, 0, SERIALIZED_PAGE_HEADER_SIZE); instead.

@linzebing force-pushed the page-view-fix branch 2 times, most recently from 030cfb5 to bd30233 on March 10, 2022 18:50
context.close(); // Release context buffers
return endOfData();
try {
int read = ByteStreams.read(inputStream, headerBuffer, 0, headerBuffer.length);
@sopel39 (Member) Mar 11, 2022

Could this potentially generate a lot of OS syscalls when pages are small?
We now have one read for the page header plus one read for the page data. If pages are tiny (e.g. because there are a lot of nodes), this increases the number of OS I/O calls significantly (and those are slow).

Member Author

Discussed offline. Netty input streams have internal buffering.

Member

Discussed offline. Netty input streams have internal buffering.

I don't think that's true for unspilled data, though. However, the page size should still be bigger than the SliceInput buffer.


@@ -168,7 +167,7 @@ private void writePages(Iterator<Page> pageIterator)

try {
InputStream input = closer.register(targetFile.newInputStream());
Iterator<Page> pages = PagesSerdeUtil.readPages(serde, new InputStreamSliceInput(input, BUFFER_SIZE));
Member

Here, reads still go to the OS.

Contributor

The number of reads shouldn't increase as long as the average page size is above 8 kB. Is this generally the case for spilling?
