-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FIX-#4060: Fix _to_pandas() dtypes for empty frames. #4108
FIX-#4060: Fix _to_pandas() dtypes for empty frames. #4108
Conversation
Signed-off-by: mvashishtha <mahesh@ponder.io>
c71b8f5
to
8312829
Compare
Signed-off-by: mvashishtha <mahesh@ponder.io>
Codecov Report
@@ Coverage Diff @@
## master #4108 +/- ##
==========================================
+ Coverage 85.61% 88.56% +2.94%
==========================================
Files 201 201
Lines 16852 16851 -1
==========================================
+ Hits 14428 14924 +496
+ Misses 2424 1927 -497
Continue to review full report at Codecov.
|
The unit tests revealed real bugs, so I'm trying to fix them. |
Signed-off-by: mvashishtha <mahesh@ponder.io>
dtypes=op( | ||
pandas.DataFrame(index=self.index, columns=self.columns), | ||
pandas.DataFrame(index=right_frame.index, columns=right_frame.columns), | ||
).dtypes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides the fact that we're executing UDF op
twice (which could be really expensive), this is still a very unreliable construction. The output type isn't based on the source frames types at all, it always gives the same type in the result here.
Example
# Frames holding integer columns
>>> df1_i["a"].dtypes, df2_i["a"].dtypes
(dtype('int64'), dtype('int64'))
>>> (df1_i + df2_i).dtypes # int + int = int
a int64
dtype: object
# Frames holding strings:
>>> df1_s["a"].dtypes, df2_s["a"].dtypes
(dtype('O'), dtype('O'))
>>> (df1_s + df2_s).dtypes # string + string = string
a object
dtype: object
# trying to infer dtypes with the present logic:
# int + int = object -> wrong
>>> (pandas.DataFrame(index=df1_i.index, columns=df1_i.columns) + pandas.DataFrame(index=df2_i.index, columns=df2_i.columns)).dtypes
a object
dtype: object
>>> (pandas.DataFrame(index=df1_s.index, columns=df1_s.columns) + pandas.DataFrame(index=df2_s.index, columns=df2_s.columns)).dtypes
a object
dtype: object
As far as I know, we don't have support for empty frames at all, that's why handling empty cases are very tricky. They should probably go as default to pandas or even be ignored if the mismatch isn't critical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchigarev you make a good point, however it is not so simple.
A common pattern we see is creating an empty dataframe and using append
to add to it. There are lots of other usecases for empty dataframes even though I do not like them. Ultimately, we need a better way to handle them (maybe a new class?) but we can't just hand everything off to pandas because we don't get the dataframe back.
#4060 is closed, @mvashishtha feel free to reopen if needed |
Signed-off-by: mvashishtha mahesh@ponder.io
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date