Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Optimize UnsafeRow deserialize for scalar type #10797

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jinchengchenghh
Copy link
Contributor

@jinchengchenghh jinchengchenghh commented Aug 21, 2024

Deserialize rows by memoryAddress and offset, and add the tests and benchmark, the performance improves 3 times.
Current deserialization StructBatchIterator deserializes rows one by one, get the data of the columns of one row, only set one value each time, but current implementation can get the data of one column by memory address of the rows and offset of each column, so we can set all rows of one column at once.

============================================================================
[...]amicRowVectorDeserializeBenchmark.cpp     relative  time/iter   iters/s
============================================================================
deserialize(batch_10_100k_string_only)                    421.74ns     2.37M
deserializeFast(batch_10_100k_string_only)      353.42%   119.33ns     8.38M
deserialize(batch_100_100k_string_only)                   374.03ns     2.67M
deserializeFast(batch_100_100k_string_only)     257.98%   144.98ns     6.90M
deserialize(batch_10_100k_all_types)                      271.23ns     3.69M
deserialize(batch_100_100k_all_types)                     710.39ns     1.41M
deserialize(batch_10_100k_scalar_types)                   264.71ns     3.78M
deserializeFast(batch_10_100k_scalar_types)     340.84%    77.67ns    12.88M
deserialize(batch_100_100k_scalar_types)                  258.62ns     3.87M
deserializeFast(batch_100_100k_scalar_types)    429.72%    60.18ns    16.62M

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 21, 2024
Copy link

netlify bot commented Aug 21, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 833023a
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/672aca38cd7c760008cc2aff

@jinchengchenghh
Copy link
Contributor Author

Could you help review? Thanks! @rui-mo

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deserialize rows by memoryAddress and offset

Could you add more information on where the performance improvement mainly comes from? I notice previously the deserialize methods in UnsafeRowPrimitiveBatchDeserializer create intermediate values and set it to the vector, while the new way writes to the position directly.

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

velox/row/UnsafeRowDeserializers.h Outdated Show resolved Hide resolved
velox/row/UnsafeRowDeserializers.cpp Outdated Show resolved Hide resolved
/// @param offsets offset of the row deserialized data. It's size should be
/// equal to the deserialized rows. First offset is 0.
/// @param pool the memory pool to allocate Vectors.
static VectorPtr deserialize(
Copy link
Collaborator

@rui-mo rui-mo Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious. Is it necessary to keep two APIs for the deserialization instead of replacing the original one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This API only supports scalar types, it is hard to support complex data type, because it is hard to get the offsets for them. Another exists API supports all the data types.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the APIs be unified so that only complex types use the row-row deserialization?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, ok, I will do the refactor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous API read the rows from several std::string_view, but the new API requires all the row in one buffer started with memoryAddress, for this public override API, it supplies ByteInputStream which consists several buffers, so we cannot get all the rows in one buffer, so I keep the two APIs but the new API supports the complex data type by revoking the old API.
https://github.com/facebookincubator/velox/blob/main/velox/serializers/UnsafeRowSerializer.cpp#L135

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users could use any of them as they need.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here requires the old API, because the serialize format is
| row size | serialized row | row size | serialized row
https://github.com/facebookincubator/velox/blob/main/velox/serializers/UnsafeRowSerializer.cpp#L145

This optimization requires the serialized data is in contiguous memory.

So we should keep the 2 APIs.

@jinchengchenghh
Copy link
Contributor Author

Deserialize rows by memoryAddress and offset

Could you add more information on where the performance improvement mainly comes from? I notice previously the deserialize methods in UnsafeRowPrimitiveBatchDeserializer create intermediate values and set it to the vector, while the new way writes to the position directly.

Added to the PR description.

@jinchengchenghh
Copy link
Contributor Author

Anymore comments? @rui-mo

@rui-mo
Copy link
Collaborator

rui-mo commented Aug 29, 2024

@marin-ma Would you like to take a review? The idea appears to be similar as #10714.


int32_t getTotalStringSize(
int32_t columnIdx,
int32_t numRows,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems numRows is 'vector_size_t' type in the 'deserialize' method. Perhaps also fix this in the createFlatVectorFast functions.

for (auto pos = 0; pos < numRows; pos++) {
if (!isNullAt(memoryAddress + offsets[pos], columnIdx)) {
const uint8_t* srcptr = (memoryAddress + offsets[pos] + fieldOffset);
uint8_t* destptr = rawValues + (pos << shift);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use camel case for variable names. srcptr-> srcPtr, wordoffset -> wordOffset, destptr -> destPtr, ...

int32_t wordoffset = static_cast<int32_t>(offsetAndSize >> 32);
uint8_t bytesValue[length];
memcpy(bytesValue, memoryAddress + offsets[pos] + wordoffset, length);
uint8_t bytesValue2[16]{};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: any better naming for 'bytesValue2'?

/// @param offsets offset of the row deserialized data. It's size should be
/// equal to the deserialized rows. First offset is 0.
/// @param pool the memory pool to allocate Vectors.
static VectorPtr deserialize(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the APIs be unified so that only complex types use the row-row deserialization?

@jinchengchenghh
Copy link
Contributor Author

Could you help review again? Thanks! @rui-mo

Copy link
Collaborator

@rui-mo rui-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Added some minors.

/// Deserialize fast when all the column types are primitive type, otherwise,
/// deserialize rows one by one, as previously. Fast deserialization
/// computes the start memoryAddress of one data by the column index and row
/// number, and set all the data of one column at once.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment suggestion:

/// Deserializes rows that are stored in contiguous memory.
/// If all column types are primitive, fast deserialization is used.
/// Otherwise, rows are deserialized one by one as before.
/// Fast deserialization calculates the starting memory address of data
/// based on the column index and row number, setting all data for one column at once.

/// deserialize rows one by one, as previously. Fast deserialization
/// computes the start memoryAddress of one data by the column index and row
/// number, and set all the data of one column at once.
/// @param memoryAddress the start memory address of the serialized rows .
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove extra white-space after 'rows'.

/// @param type the element type.
/// @param offsets offset of each row serialized data. It's size should be
/// equal to the deserialized rows. First offset is 0.
/// @param pool the memory pool to allocate Vector.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory pool used to allocate vector.

memory::MemoryPool* pool) {
if (fastSupported(type)) {
return deserializeFast(memoryAddress, type, offsets, pool);
} else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no need for 'else' after return.

columnIdx, numRows, fieldOffset, offsets, memoryAddress);
char* rawBuffer = column->getRawStringBufferWithSpace(totalSize, true);
for (auto row = 0; row < numRows; row++) {
if (!isNullAt(memoryAddress + offsets[row], columnIdx)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 'memoryAddress + offsets[row]' is used several times, any chance to make it a variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not. It uses 3 variables and used at hot spot branch, make it to a function does not make code clear.

*(int64_t*)(memoryAddress + offsets[row] + fieldOffset);
const int32_t length = static_cast<int32_t>(offsetAndSize);
const int32_t wordOffset = static_cast<int32_t>(offsetAndSize >> 32);
auto valueSrcPtr = memoryAddress + offsets[row] + wordOffset;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use auto* to indicate a pointer.

* @param pool the memory pool to allocate Vectors from
*data to a array.
* @return a VectorPtr
*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not your change, but it would be better to not use the old style /* */ for comments.

const vector_size_t numRows = offsets.size();
for (auto i = 0; i < numRows; i++) {
const auto length =
(i == numRows - 1 ? offsets[i] : offsets[i + 1] - offsets[i]);
Copy link
Collaborator

@rui-mo rui-mo Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (i == numRows - 1), the length is 'offsets[i]'. Is that correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching it, the offsets size should be numRows + 1, then we can get the correct last row length.

namespace facebook::velox::row {
namespace {

inline int64_t getFieldOffset(int64_t nullBitsetWidthInBytes, int32_t index) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you add some comments for this method?

}

inline bool isNullAt(const uint8_t* memoryAddress, int32_t index) {
return bits::isBitSet(memoryAddress, index);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function just calls another function with the original parameters. Wondering if we need it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need it. It is more clear to indicate it is the null condition.

@jinchengchenghh jinchengchenghh changed the title Optimize UnsafeRow deserialize for scalar type feat: Optimize UnsafeRow deserialize for scalar type Nov 15, 2024
velox_row_fast
velox_vector_fuzzer
Folly::folly
${FOLLY_BENCHMARK})
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Folly::follybenchmark

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants