-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#3981, FIX-#3801, FIX-#4149: Stop broadcasting scalars to set items. #4160
FIX-#3981, FIX-#3801, FIX-#4149: Stop broadcasting scalars to set items. #4160
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4160 +/- ##
==========================================
+ Coverage 86.80% 89.72% +2.91%
==========================================
Files 201 201
Lines 16795 16797 +2
==========================================
+ Hits 14579 15071 +492
+ Misses 2216 1726 -490
Continue to review full report at Codecov.
|
89329e1
to
e2cbe97
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a suggestion about using pandas no_default
value as a sentinel. Otherwise looks good.
…lars to set items. Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
Signed-off-by: mvashishtha <mahesh@ponder.io>
e2cbe97
to
b7bf0bd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix conflicts.
modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py
Outdated
Show resolved
Hide resolved
…ile. Signed-off-by: mvashishtha <mahesh@ponder.io>
@YarShev Done! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mvashishtha, LGTM, thanks!
As I see, this PR should resolve 3 issues. Please correct the title. |
@YarShev which 3 issues are you thinking of? I think it's only #3981 and #3801. The other issues I mentioned in the first comment are in pandas, or they are already fixed in Modin. |
Signed-off-by: mvashishtha mahesh@ponder.io
What do these changes do?
Normally, when assigning assigning a scalar to part of a dataframe, we broadcast the scalar to an array that is as big as the portion of the dataframe being modified. This causes a few problems:
None
ornp.NaN
to astring
series or dataframe: Value error when trying to access a series having dtype: string and some elements set to None #3981None
.MODIN_BENCHMARK_MODE=True
:I ran this with Dask on my computer, which is:
With broadcasting, the assignment took an average of 75.5 seconds on two runs, while without brodacsting, it took an average of 13.5 seconds on two runs.
I don't see any benefit to broadcasting scalars and then slicing the resulting array up for each partition, so we should stop doing that.
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date