FIX-#3527: Fix parquet partitioning issue causing negative row length partitions #4368
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4368 +/- ##
===========================================
+ Coverage 70.79% 90.01% +19.22%
===========================================
Files 219 219
Lines 17944 17946 +2
===========================================
+ Hits 12703 16154 +3451
+ Misses 5241 1792 -3449
Thank you for the fix!
Another minor comment.
modin/pandas/test/test_io.py (outdated)

    unique_filename = get_unique_filename(extension="parquet")
    from modin.config import NPartitions

    nrows = (NPartitions.get() - 1) * 32 - 1
Could you please:
- use `MinPartitionSize` instead of 32
- add an assertion that `nrows > MinPartitionSize.get()`

If at some point in the future `NPartitions` becomes 2, this will test the case where `nrows = 31 < min_chunk_size = 32` instead of the one you're trying to test.
I changed it to use `MinPartitionSize`. I also removed the dependency on `NPartitions`, since `MinPartitionSize + 1` is a more robust failure case, i.e. it is expected to fail (without these changes) regardless of the value of `MinPartitionSize` or `NPartitions`.
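As a rough illustration of why `MinPartitionSize + 1` rows is a good failure case, here is a reconstruction of the old partitioning formula from the PR description's "middle" case. The function name and signature are illustrative, not Modin's actual code:

```python
def old_row_lengths(index_len, num_partitions, min_chunk_size=32):
    """Reconstruction of the old (buggy) logic: every partition except
    the last gets min_chunk_size rows, and the last gets whatever
    remains -- which can be a negative number."""
    lengths = [min_chunk_size] * (num_partitions - 1)
    lengths.append(index_len - min_chunk_size * (num_partitions - 1))
    return lengths

# With min_chunk_size + 1 = 33 rows and 4 partitions, the last entry is
# 33 - 96 = -63, which is what later surfaced as
# "ValueError: lengths cannot be negative".
print(old_row_lengths(33, 4))  # [32, 32, 32, -63]
```

Note that the lengths still sum to `index_len`, which is why the bug only surfaced later, when a partition's `_length_cache` was materialized.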
Alright!
LGTM!
@jeffreykennethli Looks great! The only thing left is to add this to the release notes (and add your GitHub handle to the contributors section of that file).
@jeffreykennethli, please resolve conflicts. After that we will merge the changes.
Some changes have already been merged. Why are they shown in diff?
Signed-off-by: jeffreykennethli <jkli@ponder.io>
LGTM!
@YarShev any other comments?
LGTM!
What do these changes do?
In #3527, we found that partitions from `read_parquet` would have a negative `_length_cache`, giving a `ValueError: lengths cannot be negative` when trying to materialize results, such as when indexing into a DataFrame by column name. `_length_cache` is set when the DataFrame is initialized and partitioned; for parquet files, this happens in the column_store_dispatcher.

The previous logic partitioned the row lengths with a minimum chunk size (32 rows). For DataFrames with fewer than 32 total rows, it would partition the rows as `[32, 0, 0, ..., 0]` across the N partitions.

For DataFrames with at least `num_partitions * min_chunk_size` rows, i.e. `index_len >= min_chunk_size * num_partitions`, the rows were partitioned evenly: `[index_len / num_partitions, index_len / num_partitions, ...]`.

For DataFrames where `min_chunk_size < index_len < min_chunk_size * num_partitions`, it would partition them like `[32, 32, 32, ..., index_len - (index_chunksize * (num_partitions - 1))]`, where the `index_len - (index_chunksize * (num_partitions - 1))` expression is negative. This ensured that the row lengths of all partitions summed to `index_len`, but negative row lengths are wrong.

Instead, the rows should be partitioned like `[32, 32, 32, 0, 0, ...]`, where the first `m` partitions get `min_chunk_size` rows each until the cumulative row count reaches `index_len`, and every partition after that gets 0.

This PR introduces that change, adds an error check in the DataFrame constructor that no partition has a negative row length or column width, and adds a unit test that indexes into all columns of a parquet file with various row lengths. I think there is room to write a more thorough unit test, as this one minimally exercises the broken behavior.
- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- commits signed off with git commit -s
- docs/development/architecture.rst is up-to-date