Fixed bug in RLE/bitpacking hybrid algorithm #640

Merged (2 commits) on Oct 24, 2023

norberttech (Member) commented Oct 24, 2023

Change Log

Added

Fixed

  • bug in RLE/bitpacking hybrid algorithm

Changed

Removed

Deprecated

Security


Description

Ref: #575

While progressing on the writer, I'm discovering some bugs. This PR started as an approach to split rows into multiple data pages when the size of a single page becomes too large according to Parquet recommendations.

$columnChunkContainers = [];
$previousChunkData = null;

foreach (\array_chunk($this->data, 1000) as $dataChunk) {
norberttech (Member, Author) commented:

This is hardcoded for now, but the idea is to take a chunk of data, build a data page, and check whether it is bigger than 8 KB (if not, drop it, merge in data from the next page, and try again).
The data page size and the chunk size will come from the configuration, with default values:

  • data page size = 8 KB
  • data page probe rows count = 1_000
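
The probing loop described above could look roughly like the sketch below. It is not the code from this PR: `buildPage()` is a hypothetical stand-in that uses `serialize()` only so a byte size can be measured, and `splitIntoPages()`, `DATA_PAGE_SIZE_BYTES`, and `PROBE_ROWS_COUNT` are illustrative names, with defaults taken from the two values listed in the comment.

```php
<?php

declare(strict_types=1);

// Assumed defaults from the review comment; the real values would come from configuration.
const DATA_PAGE_SIZE_BYTES = 8 * 1024;
const PROBE_ROWS_COUNT = 1_000;

/**
 * Hypothetical stand-in for a real data page encoder:
 * it only serializes the rows so that a byte size can be measured.
 *
 * @param array<int, mixed> $rows
 */
function buildPage(array $rows) : string
{
    return \serialize($rows);
}

/**
 * @param array<int, mixed> $data
 *
 * @return array<int, string> encoded pages
 */
function splitIntoPages(array $data, int $pageSize = DATA_PAGE_SIZE_BYTES, int $probeRows = PROBE_ROWS_COUNT) : array
{
    $pages = [];
    $pending = [];

    foreach (\array_chunk($data, $probeRows) as $chunk) {
        // Merge rows carried over from previous, too-small attempts.
        $pending = \array_merge($pending, $chunk);
        $page = buildPage($pending);

        if (\strlen($page) >= $pageSize) {
            $pages[] = $page;
            $pending = [];
        }
        // Otherwise the page is dropped, the rows are kept, and the next chunk is merged in.
    }

    if ($pending !== []) {
        // Flush whatever is left, even if it is below the target size.
        $pages[] = buildPage($pending);
    }

    return $pages;
}
```

In this sketch every page except possibly the last reaches the target size, at the cost of re-encoding the pending rows on each probe. A smaller probe row count keeps pages closer to the target but re-encodes more often.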

@@ -158,7 +158,7 @@ public function readFloats(int $total) : array
         $floats = [];

         foreach ($floatBytes as $bytes) {
-            $floats[] = \unpack($this->byteOrder === ByteOrder::LITTLE_ENDIAN ? 'g' : 'G', \pack('C*', ...$bytes))[1];
+            $floats[] = \round(\unpack($this->byteOrder === ByteOrder::LITTLE_ENDIAN ? 'g' : 'G', \pack('C*', ...$bytes))[1], 7);
norberttech (Member, Author) commented:

Yeah, PHP is a bit awkward with floats; I can't find a way to write/read floats without losing precision.
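
A small sketch of the behaviour behind this comment. PHP's `float` is a 64-bit double, while the `g`/`G` formats of `pack()`/`unpack()` work with 32-bit floats, so a value that goes through them comes back as the nearest 32-bit float widened to a double. `roundTripFloat32()` is a hypothetical helper written for this illustration and is not part of the library.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: writes a value as a 32-bit little-endian float
 * and reads it back, using the same 'g' format as readFloats().
 */
function roundTripFloat32(float $value) : float
{
    return \unpack('g', \pack('g', $value))[1];
}

$raw = roundTripFloat32(0.1);

\var_dump($raw === 0.1);            // bool(false): nearest 32-bit float, widened to a double
\var_dump(\round($raw, 7) === 0.1); // bool(true): rounding hides the widening error

$large = roundTripFloat32(123456.789);

\var_dump(\round($large, 7) === 123456.789); // bool(false)
```

The last line shows a limit of the workaround: `round($value, 7)` rounds to 7 decimal places, whereas a 32-bit float carries roughly 7 significant digits, so values with a larger integer part may still differ from what was written.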

@norberttech norberttech merged commit d1e85e6 into flow-php:1.x Oct 24, 2023
14 checks passed