
BUG: assign doesn't cast SparseDataFrame to DataFrame #19178

Merged: 13 commits merged into pandas-dev:master on Feb 12, 2018

Conversation

@hexgnu (Contributor) commented Jan 11, 2018

The problem here is that assign called on a SparseDataFrame should cast to a DataFrame, mainly because SparseDataFrames are a special case.
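For context, a minimal repro sketch of the report (hedged: this assumes the pandas 0.22-era API, where DataFrame.to_sparse() and SparseDataFrame still existed; both were removed in later pandas):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0]})
sdf = df.to_sparse()

# Before this fix, assign ran the dense code path on the sparse frame
# and could return wrong values (GH19163); densifying first avoids it.
result = sdf.assign(B=lambda x: x.A * 2)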
-        data = self.copy()
+        # See GH19163
+        data = self.copy().to_dense()
@TomAugspurger (Contributor) commented Jan 11, 2018:

Could you define assign on SparseDataFrame and only densify if necessary?

Review comment (Contributor):

And actually you don't want to densify; rather, you want to do something like this (in SparseDataFrame):

def assign(self, **kwargs):
    # coerce to a DataFrame
    self = DataFrame(self)
    return self.assign(**kwargs)

This actually ends up copying twice, though. So the real solution is to move the guts of DataFrame.assign to _assign (and leave the copy part in .assign), then call ._assign in the sparse version.
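A hedged sketch of that refactor, written as free functions rather than methods (illustrative only, not the merged code; _assign, dense_assign, and sparse_assign are hypothetical names, and the kwargs loop simplifies what DataFrame.assign really does):

import pandas as pd

def _assign(df, **kwargs):
    # the former guts of DataFrame.assign, minus the copy;
    # simplified: the real assign evaluates callables against df
    for k, v in kwargs.items():
        df[k] = v(df) if callable(v) else v
    return df

def dense_assign(df, **kwargs):
    # DataFrame.assign keeps its single copy here
    return _assign(df.copy(), **kwargs)

def sparse_assign(sdf, **kwargs):
    # SparseDataFrame.assign: coerce once to dense, then reuse the
    # shared guts (DataFrame(sdf) is assumed to copy while densifying)
    return _assign(pd.DataFrame(sdf), **kwargs)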

@@ -55,6 +55,13 @@ def test_assign(self):
         result = df.assign(A=lambda x: x.A + x.B)
         assert_frame_equal(result, expected)

+        # SparseDataFrame
Review comment (Contributor):

make a separate test

@@ -448,6 +448,7 @@ Reshaping
 - Bug in :func:`cut` which fails when using readonly arrays (:issue:`18773`)
 - Bug in :func:`DataFrame.pivot_table` which fails when the ``aggfunc`` arg is of type string. The behavior is now consistent with other methods like ``agg`` and ``apply`` (:issue:`18713`)
 - Bug in :func:`DataFrame.merge` in which merging using ``Index`` objects as vectors raised an Exception (:issue:`19038`)
+- Bug in :func:`DataFrame.assign` which doesn't cast ``SparseDataFrame`` as ``DataFrame``. (:issue:`19163`)
Review comment (Contributor):

Use :class:`DataFrame` and so on here.

@jreback added labels on Jan 12, 2018: Dtype Conversions (Unexpected or buggy dtype conversions), Sparse (Sparse Data Type), Indexing (Related to indexing on series/frames, not to indexes themselves)

@hexgnu (Contributor, Author) commented Jan 18, 2018

So I updated the PR locally, but I feel the real problem is inside SparseArray.

If you init a SparseArray with False and pass in an index, it will assume the dtype is float64, which coerces False to 0.0.

Why does it assume the dtype is float64? Shouldn't it infer it based on infer_dtype_from or something similar?

Thanks!

@TomAugspurger (Contributor) replied:

> If you init a SparseArray with False and pass in an index it will assume the dtype is float64 which coerces False to 0.0.

Can you give an example?

@hexgnu (Contributor, Author) commented Jan 18, 2018

Sure @TomAugspurger

In [4]: pd.SparseArray(False, index=[1], fill_value=False)
Out[4]: 
[0.0]
Fill: False
IntIndex
Indices: array([], dtype=int32)

The line that coerces it is here:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/sparse/array.py#L198
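Paraphrased as a runnable snippet, the linked branch hard-codes float64 when a scalar plus an index is passed (a sketch of the 0.22-era logic, not the verbatim source; is_scalar comes from pandas.api.types):

import numpy as np
from pandas.api.types import is_scalar

# paraphrase of the scalar-plus-index branch in SparseArray.__new__
data, index = False, [1]
if not is_scalar(data):
    raise Exception("must only pass scalars with an index ")
values = np.empty(len(index), dtype='float64')  # dtype is hard-coded
values.fill(data)                               # False is coerced to 0.0
print(values)                                   # [0.]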

@hexgnu (Contributor, Author) commented Jan 18, 2018

Versus

In [5]: pd.SparseArray(False, fill_value=False)
Out[5]: 
[False]
Fill: False
IntIndex
Indices: array([], dtype=int32)

@pep8speaks commented Jan 19, 2018

Hello @hexgnu! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 12, 2018 at 11:40 Hours UTC

@codecov (bot) commented Jan 19, 2018

Codecov Report

Merging #19178 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #19178      +/-   ##
==========================================
+ Coverage   91.58%   91.59%   +<.01%     
==========================================
  Files         150      150              
  Lines       48807    48806       -1     
==========================================
+ Hits        44702    44703       +1     
+ Misses       4105     4103       -2
Flag        Coverage Δ
#multiple   89.96% <100%> (ø) ⬆️
#single     41.73% <0%> (ø) ⬆️

Impacted Files                Coverage Δ
pandas/core/sparse/array.py   91.38% <100%> (-0.03%) ⬇️
pandas/util/testing.py        83.85% <0%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a277108...a81796a. Read the comment docs.

@hexgnu (Contributor, Author) commented Jan 19, 2018

Alright, I think I have found the issue. I added some tests and it seems to chooch. Please review and give me any feedback.

Thanks @TomAugspurger and @jreback

@TomAugspurger (Contributor) left a review comment:

Thanks for tracking this down @hexgnu, just a couple of clarifying questions.

And could you add a simple test that hits this directly? Something that constructs arr = SparseArray(value, index=[0, 1]) and asserts that arr.dtype matches the correct dtype for various values.
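A hedged sketch of such a test (not the one that was merged; the parametrized values, expected dtypes, and test name are illustrative, and it assumes the 0.22-era SparseArray constructor that accepted index=):

import numpy as np
import pandas as pd
import pytest

@pytest.mark.parametrize("value, expected_dtype", [
    (False, np.bool_),
    (1, np.int64),
    (1.0, np.float64),
])
def test_sparse_array_scalar_with_index_dtype(value, expected_dtype):
    # with a scalar and an index, the inferred dtype should follow
    # the scalar instead of defaulting to float64
    arr = pd.SparseArray(value, index=[0, 1], fill_value=value)
    assert arr.dtype == expected_dtype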

@@ -491,7 +491,7 @@ Groupby/Resample/Rolling
 Sparse
 ^^^^^^

-
+- Bug in :class:`SparseArray` where if a scalar and index are passed in it will coerce to float64 regardless of scalar's dtype. (:issue:`19163`)
Review comment (Contributor):

Could you clarify what "passed in" refers to here? Is it specifically .assign? Or any method setting / updating the sparse array?

@@ -195,7 +195,7 @@ def __new__(cls, data, sparse_index=None, index=None, kind='integer',
             data = np.nan
         if not is_scalar(data):
             raise Exception("must only pass scalars with an index ")
-        values = np.empty(len(index), dtype='float64')
+        values = np.empty(len(index), dtype=infer_dtype_from(data)[0])
Review comment (Contributor):

Is infer_dtype_from_scalar more appropriate here, since we've validated that data is a scalar?

Reply (Contributor, Author):

Yeah, that's a good point, since infer_dtype_from just checks is_scalar again.
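For illustration, both helpers are internal pandas API in pandas.core.dtypes.cast, and each returns an inferred dtype together with the (possibly converted) value; a minimal sketch, assuming the 0.22-era behavior:

from pandas.core.dtypes.cast import (
    infer_dtype_from,
    infer_dtype_from_scalar,
)

# each returns the inferred dtype and the value
print(infer_dtype_from(False))         # e.g. (<class 'numpy.bool_'>, False)
print(infer_dtype_from_scalar(False))  # e.g. (<class 'numpy.bool_'>, False)

# infer_dtype_from dispatches on is_scalar internally, so when the
# caller has already validated is_scalar(data), calling
# infer_dtype_from_scalar directly skips the redundant check.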

@hexgnu (Contributor, Author) commented Jan 20, 2018

@TomAugspurger I did add a test in pandas/tests/sparse/test_array.py that tests this directly instead of indirectly.

@hexgnu (Contributor, Author) commented Feb 5, 2018

@jreback @TomAugspurger ping. This is now green after fixing a linting error.

Let me know if you would like me to squash any of these commits.

@TomAugspurger (Contributor) left a review comment:

Looks good. I'll merge later today, in case @jreback has any comments.

@@ -161,7 +161,8 @@ def __new__(cls, data, sparse_index=None, index=None, kind='integer',
             data = np.nan
         if not is_scalar(data):
             raise Exception("must only pass scalars with an index ")
-        values = np.empty(len(index), dtype='float64')
+        values = np.empty(len(index),
+                          dtype=infer_dtype_from_scalar(data)[0])
Review comment (Contributor):

We have a routine for this: construct_1d_from_scalar in pandas.core.dtypes.cast.

Follow-up (Contributor):

for the empty/fill step
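For reference, the routine exists in released pandas under the name construct_1d_arraylike_from_scalar in pandas.core.dtypes.cast (hedged: internal API, names and behavior as of the 0.23-era internals); a minimal sketch of using it to replace the empty-then-fill pair:

from pandas.core.dtypes.cast import (
    construct_1d_arraylike_from_scalar,
    infer_dtype_from_scalar,
)

data, index = False, [0, 1]
dtype, value = infer_dtype_from_scalar(data)

# one call covers both the np.empty(...) allocation and the fill(data)
values = construct_1d_arraylike_from_scalar(value, len(index), dtype)
print(values)  # array([False, False])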

@@ -555,6 +555,7 @@ Sparse

 - Bug in which creating a ``SparseDataFrame`` from a dense ``Series`` or an unsupported type raised an uncontrolled exception (:issue:`19374`)
 - Bug in :class:`SparseDataFrame.to_csv` causing exception (:issue:`19384`)
+- Bug in constructing a :class:`SparseArray`: if `data` is a scalar and `index` is defined it will coerce to float64 regardless of scalar's dtype. (:issue:`19163`)
Review comment (Contributor):

Whoops, just noticed we don't have SparseArray in our API docs, so this link won't work. Just

``SparseArray``

until we get that sorted out. Also, double ticks around ``data`` and ``index``.

@jreback added this to the 0.23.0 milestone on Feb 12, 2018
@jreback merged commit 569bc7a into pandas-dev:master on Feb 12, 2018
@jreback (Contributor) commented Feb 12, 2018

thanks @hexgnu

@jreback (Contributor) commented Feb 12, 2018

great job on sparse fixes! keep em coming!

harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018
Labels: Dtype Conversions (Unexpected or buggy dtype conversions), Indexing (Related to indexing on series/frames, not to indexes themselves), Sparse (Sparse Data Type)
Projects: None yet
Development: Successfully merging this pull request may close these issues: "to_sparse and assign is giving the wrong results."
4 participants