
BUG: assign doesn't cast SparseDataFrame to DataFrame #19178

Merged: 13 commits merged into pandas-dev:master on Feb 12, 2018

Conversation

@hexgnu (Contributor) commented Jan 11, 2018

The problem here is that assign called on a SparseDataFrame should cast to a DataFrame, mainly because SparseDataFrames are a special case.
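For context, a minimal repro sketch of the report (hedged: this assumes the pandas 0.22-era API, where DataFrame.to_sparse() and SparseDataFrame still existed; both were removed in later pandas):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0]})
sdf = df.to_sparse()

# Before this fix, assign ran the dense code path on the sparse frame
# and could return wrong values (GH19163); densifying first avoids it.
result = sdf.assign(B=lambda x: x.A * 2)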
-        data = self.copy()
+        # See GH19163
+        data = self.copy().to_dense()
@TomAugspurger (Contributor) commented Jan 11, 2018:

Could you define assign on SparseDataFrame and only densify if necessary?

Review comment (Contributor):

And actually you don't want to densify; rather, you want to do something like this (in SparseDataFrame):

def assign(self, **kwargs):
    # coerce to a DataFrame
    self = DataFrame(self)
    return self.assign(**kwargs)

This actually ends up copying twice, though. So the real solution is to move the guts of DataFrame.assign to _assign (and leave the copy part in .assign), then call ._assign in the sparse version.
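A hedged sketch of that refactor, written as free functions rather than methods (illustrative only, not the merged code; _assign, dense_assign, and sparse_assign are hypothetical names, and the kwargs loop simplifies what DataFrame.assign really does):

import pandas as pd

def _assign(df, **kwargs):
    # the former guts of DataFrame.assign, minus the copy;
    # simplified: the real assign evaluates callables against df
    for k, v in kwargs.items():
        df[k] = v(df) if callable(v) else v
    return df

def dense_assign(df, **kwargs):
    # DataFrame.assign keeps its single copy here
    return _assign(df.copy(), **kwargs)

def sparse_assign(sdf, **kwargs):
    # SparseDataFrame.assign: coerce once to dense, then reuse the
    # shared guts (DataFrame(sdf) is assumed to copy while densifying)
    return _assign(pd.DataFrame(sdf), **kwargs)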

@@ -55,6 +55,13 @@ def test_assign(self):
         result = df.assign(A=lambda x: x.A + x.B)
         assert_frame_equal(result, expected)

+        # SparseDataFrame
Review comment (Contributor):

make a separate test

@@ -448,6 +448,7 @@ Reshaping
 - Bug in :func:`cut` which fails when using readonly arrays (:issue:`18773`)
 - Bug in :func:`DataFrame.pivot_table` which fails when the ``aggfunc`` arg is of type string. The behavior is now consistent with other methods like ``agg`` and ``apply`` (:issue:`18713`)
 - Bug in :func:`DataFrame.merge` in which merging using ``Index`` objects as vectors raised an Exception (:issue:`19038`)
+- Bug in :func:`DataFrame.assign` which doesn't cast ``SparseDataFrame`` as ``DataFrame``. (:issue:`19163`)
Review comment (Contributor):

Use :class:`DataFrame` and so on here.

@jreback added labels on Jan 12, 2018: Dtype Conversions (Unexpected or buggy dtype conversions), Sparse (Sparse Data Type), Indexing (Related to indexing on series/frames, not to indexes themselves)

@hexgnu (Contributor, Author) commented Jan 18, 2018

So I updated the PR locally, but I feel the real problem is inside SparseArray.

If you init a SparseArray with False and pass in an index, it will assume the dtype is float64, which coerces False to 0.0.

Why does it assume the dtype is float64? Shouldn't it infer it based on infer_dtype_from or something similar?

Thanks!

@TomAugspurger (Contributor) replied:

> If you init a SparseArray with False and pass in an index it will assume the dtype is float64 which coerces False to 0.0.

Can you give an example?

@hexgnu (Contributor, Author) commented Jan 18, 2018

Sure @TomAugspurger

In [4]: pd.SparseArray(False, index=[1], fill_value=False)
Out[4]: 
[0.0]
Fill: False
IntIndex
Indices: array([], dtype=int32)

The line that coerces it is here:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/sparse/array.py#L198
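Paraphrased as a runnable snippet, the linked branch hard-codes float64 when a scalar plus an index is passed (a sketch of the 0.22-era logic, not the verbatim source; is_scalar comes from pandas.api.types):

import numpy as np
from pandas.api.types import is_scalar

# paraphrase of the scalar-plus-index branch in SparseArray.__new__
data, index = False, [1]
if not is_scalar(data):
    raise Exception("must only pass scalars with an index ")
values = np.empty(len(index), dtype='float64')  # dtype is hard-coded
values.fill(data)                               # False is coerced to 0.0
print(values)                                   # [0.]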

@hexgnu (Contributor, Author) commented Jan 18, 2018

Versus

In [5]: pd.SparseArray(False, fill_value=False)
Out[5]: 
[False]
Fill: False
IntIndex
Indices: array([], dtype=int32)

@pep8speaks commented Jan 19, 2018

Hello @hexgnu! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on February 12, 2018 at 11:40 Hours UTC

@codecov (bot) commented Jan 19, 2018

Codecov Report

Merging #19178 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #19178      +/-   ##
==========================================
+ Coverage   91.58%   91.59%   +<.01%     
==========================================
  Files         150      150              
  Lines       48807    48806       -1     
==========================================
+ Hits        44702    44703       +1     
+ Misses       4105     4103       -2
Flag        Coverage Δ
#multiple   89.96% <100%> (ø) ⬆️
#single     41.73% <0%> (ø) ⬆️

Impacted Files                Coverage Δ
pandas/core/sparse/array.py   91.38% <100%> (-0.03%) ⬇️
pandas/util/testing.py        83.85% <0%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a277108...a81796a. Read the comment docs.

@hexgnu (Contributor, Author) commented Jan 19, 2018

Alright, I think I have found the issue. I added some tests and it seems to chooch. Please review and give me any feedback.

Thanks @TomAugspurger and @jreback

@TomAugspurger (Contributor) left a review comment:

Thanks for tracking this down @hexgnu, just a couple of clarifying questions.

And could you add a simple test that hits this directly? Something that constructs arr = SparseArray(value, index=[0, 1]) and asserts that arr.dtype matches the correct dtype for various values.
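A hedged sketch of such a test (not the one that was merged; the parametrized values, expected dtypes, and test name are illustrative, and it assumes the 0.22-era SparseArray constructor that accepted index=):

import numpy as np
import pandas as pd
import pytest

@pytest.mark.parametrize("value, expected_dtype", [
    (False, np.bool_),
    (1, np.int64),
    (1.0, np.float64),
])
def test_sparse_array_scalar_with_index_dtype(value, expected_dtype):
    # with a scalar and an index, the inferred dtype should follow
    # the scalar instead of defaulting to float64
    arr = pd.SparseArray(value, index=[0, 1], fill_value=value)
    assert arr.dtype == expected_dtype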

@@ -491,7 +491,7 @@ Groupby/Resample/Rolling
 Sparse
 ^^^^^^

-
+- Bug in :class:`SparseArray` where if a scalar and index are passed in it will coerce to float64 regardless of scalar's dtype. (:issue:`19163`)
Review comment (Contributor):

Could you clarify what "passed in" refers to here? Is it specifically .assign? Or any method setting / updating the sparse array?

@@ -195,7 +195,7 @@ def __new__(cls, data, sparse_index=None, index=None, kind='integer',
             data = np.nan
         if not is_scalar(data):
             raise Exception("must only pass scalars with an index ")
-        values = np.empty(len(index), dtype='float64')
+        values = np.empty(len(index), dtype=infer_dtype_from(data)[0])
Review comment (Contributor):

Is infer_dtype_from_scalar more appropriate here, since we've validated that data is a scalar?

Reply (Contributor, Author):

Yeah, that's a good point, since infer_dtype_from just checks is_scalar again.
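For illustration, both helpers are internal pandas API in pandas.core.dtypes.cast, and each returns an inferred dtype together with the (possibly converted) value; a minimal sketch, assuming the 0.22-era behavior:

from pandas.core.dtypes.cast import (
    infer_dtype_from,
    infer_dtype_from_scalar,
)

# each returns the inferred dtype and the value
print(infer_dtype_from(False))         # e.g. (<class 'numpy.bool_'>, False)
print(infer_dtype_from_scalar(False))  # e.g. (<class 'numpy.bool_'>, False)

# infer_dtype_from dispatches on is_scalar internally, so when the
# caller has already validated is_scalar(data), calling
# infer_dtype_from_scalar directly skips the redundant check.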

@hexgnu (Contributor, Author) commented Jan 20, 2018

@TomAugspurger I did add a test in pandas/tests/sparse/test_array.py that tests this directly instead of indirectly.

@hexgnu (Contributor, Author) commented Feb 5, 2018

@jreback @TomAugspurger ping. This is now green after fixing a linting error.

Let me know if you would like me to squash any of these commits.

@TomAugspurger (Contributor) left a review comment:

Looks good. I'll merge later today, in case @jreback has any comments.

@@ -161,7 +161,8 @@ def __new__(cls, data, sparse_index=None, index=None, kind='integer',
             data = np.nan
         if not is_scalar(data):
             raise Exception("must only pass scalars with an index ")
-        values = np.empty(len(index), dtype='float64')
+        values = np.empty(len(index),
+                          dtype=infer_dtype_from_scalar(data)[0])
Review comment (Contributor):

We have a routine for this: construct_1d_from_scalar in pandas.core.dtypes.cast.

Follow-up (Contributor):

for the empty/fill step
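For reference, the routine exists in released pandas under the name construct_1d_arraylike_from_scalar in pandas.core.dtypes.cast (hedged: internal API, names and behavior as of the 0.23-era internals); a minimal sketch of using it to replace the empty-then-fill pair:

from pandas.core.dtypes.cast import (
    construct_1d_arraylike_from_scalar,
    infer_dtype_from_scalar,
)

data, index = False, [0, 1]
dtype, value = infer_dtype_from_scalar(data)

# one call covers both the np.empty(...) allocation and the fill(data)
values = construct_1d_arraylike_from_scalar(value, len(index), dtype)
print(values)  # array([False, False])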

@@ -555,6 +555,7 @@ Sparse

 - Bug in which creating a ``SparseDataFrame`` from a dense ``Series`` or an unsupported type raised an uncontrolled exception (:issue:`19374`)
 - Bug in :class:`SparseDataFrame.to_csv` causing exception (:issue:`19384`)
+- Bug in constructing a :class:`SparseArray`: if `data` is a scalar and `index` is defined it will coerce to float64 regardless of scalar's dtype. (:issue:`19163`)
Review comment (Contributor):

Whoops, just noticed we don't have SparseArray in our API docs, so this link won't work. Just

``SparseArray``

until we get that sorted out. Also, double ticks around ``data`` and ``index``.

@jreback added this to the 0.23.0 milestone on Feb 12, 2018
@jreback merged commit 569bc7a into pandas-dev:master on Feb 12, 2018
@jreback (Contributor) commented Feb 12, 2018

thanks @hexgnu

@jreback (Contributor) commented Feb 12, 2018

great job on sparse fixes! keep em coming!

harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018
Labels: Dtype Conversions (Unexpected or buggy dtype conversions), Indexing (Related to indexing on series/frames, not to indexes themselves), Sparse (Sparse Data Type)
Projects: None yet
Development: Successfully merging this pull request may close these issues: "to_sparse and assign is giving the wrong results."
4 participants