BUG: fix issue with concat creating SparseFrame if not all series are sparse. #18924

hexgnu · 2017-12-24T02:41:15Z

Ok so after trying a few things out. This seems to be the problem.

When concatting multiple data frames together if any of the containing series are sparse then the entire data frame becomes sparse (or SparseDataFrame).

This is in fact not what we want. We want a DataFrame that contains a SparseSeries and a Series.

closes #18914,
closes #18686
closes #16874
closes #18551

tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

jreback · 2017-12-24T03:04:17Z

this should not to be necessary. rather this is a bug in how reindexing is implemented

hexgnu · 2017-12-24T03:18:00Z

K I'll look at reindex then. I was a bit hesitant to push this fix because it felt a bit off.

hexgnu · 2017-12-24T04:23:21Z

I think it might still be get_dummies because if I create a SparseDataFrame from scratch it works fine

import pandas.util.testing as tm

df2 = pd.SparseDataFrame()
df2['GDP'] = pd.Series([1,2])
df2['Nation_AB'] = pd.SparseSeries([1,0])
df2['Nation_CD'] = pd.SparseSeries([0,1])

df = pd.DataFrame.from_items([('GDP', [1, 2]),('Nation', ['AB', 'CD'])])
df = pd.get_dummies(df, columns=['Nation'], sparse=True)  # SparseDataFrame

tm.assert_frame_equal(df, df2) # => Works

df2.reindex(columns=['GDP']) #=> doesn't fail
df.reindex(columns=['GDP']) #=> fails

df.GDP #=> fails since it's cast as a SparseSeries but has an IntBlock underneath
df[['GDP']] #=> works fine

df2.GDP.block #=> gets cast to SparseBlock somewhere in transmission

Somehow... GDP is being stored as a IntBlock but marked as SparseBlock. It doesn't feel like this is reindex's fault but I can definitely be wrong.

I might have to come back to this in the morning. Not sure where to take it.

codecov · 2017-12-24T10:44:48Z

Codecov Report

Merging #18924 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #18924      +/-   ##
==========================================
- Coverage   91.62%    91.6%   -0.03%     
==========================================
  Files         150      150              
  Lines       48724    48726       +2     
==========================================
- Hits        44642    44634       -8     
- Misses       4082     4092      +10

Flag	Coverage Δ
#multiple	`89.97% <100%> (-0.03%)`	⬇️
#single	`41.75% <100%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/dtypes/generic.py	`100% <100%> (ø)`	⬆️
pandas/core/dtypes/concat.py	`99.14% <100%> (ø)`	⬆️
pandas/plotting/_converter.py	`65.22% <0%> (-1.74%)`	⬇️
pandas/core/generic.py	`95.91% <0%> (ø)`	⬆️
pandas/util/testing.py	`83.85% <0%> (+0.2%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 238499a...0768990. Read the comment docs.

TomAugspurger · 2017-12-26T15:27:15Z

See #18686 as well.

Agreed with @jreback that the issue is in reindex, not get_dummies. I don't think that sparse=True implies that numeric, non-sparse blocks should become sparse. I think those numeric columns should go through untouched.

hexgnu · 2017-12-26T17:14:27Z

Alright since it's after the holiday I'll take a look today 😄.

hexgnu · 2017-12-26T19:05:07Z

I think I need some guidance on this one... It really doesn't appear to be reindex to me. Doing some debugging I couldn't re-create it without utilizing concat.

The problem is that this column "GDP" is being cast as a SparseSeries when the underlying block is an IntBlock. But reindex thinks that since it's a SparseSeries it should re-index using SparseArray only which is why it's throwing errors.

IF I create the sparse data frame from scratch I can't re-create the issue but if I use concat I can.

from pandas.core.reshape.reshape import _get_dummies_1d
from pandas.core.reshape.concat import concat

df = pd.DataFrame.from_items([('GDP', [1, 2]),('Nation', ['AB', 'CD'])])
dummy = _get_dummies_1d(df['Nation'], sparse=True, prefix='Nation')

isinstance(dummy, pd.SparseDataFrame)
isinstance(df, pd.DataFrame)

# I think this is the real problem.
concatted = concat([df.drop('Nation', axis=1), dummy], axis=1)
concatted.reindex(columns=['GDP']) # => Fails because reindex thinks it's reindexing a SparseArray but it's an IntBlock
concatted.GDP # => Fails since the underlying block is an IntBlock not a SparseBlock

I also looked at concat but have run out of my open source time for the day :(.

Let me know if I'm on the right track or barking up a tree I shouldn't be barking up. Thanks @jreback and @TomAugspurger.

hexgnu · 2017-12-27T23:47:40Z

Alright I looked at this again today...

Here is the problem as I see it:

SparseFrame is instantiating only SparseSeries even though 'GDP' isn't a SparseSeries.

On top of that SparseSeries when wrapping a SingleBlockManager doesn't coerce it into a Sparse data structure.

I think SingleBlockManager should be coerced into a sparse structure since everything else is on instantiation. So I will try that tomorrow.

jreback · 2017-12-28T14:22:26Z

yeah this should actually be returning a DataFrame which would contain SparseSeries (and a regular Series). SparseDataFrame is kind of a special DataFrame that has filling behavior.

pep8speaks · 2017-12-28T16:41:32Z

Hello @hexgnu! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 31, 2018 at 11:26 Hours UTC

hexgnu · 2017-12-28T16:55:36Z

@jreback omg thank you! found the bug. I was hitting my head trying to get SparseDataFrame to have both dense and sparse series in it but in fact needed to use DataFrame.

Found the issue and honestly it's a simple enough fix. This should close both #18914 and #18686

hexgnu · 2018-01-01T16:50:01Z

Alright so I just pushed some test fixes.

One thing that kind of confuses me is why is there both:

pandas.core.dtypes.concat._get_series_result_type
pandas.core.dtypes.concat._get_frame_result_type

They are almost identical minus some things. I didn't want to get stuck in a refactor loop but it seems like these functions should be one instead of two?

jreback · 2018-01-01T18:27:13Z

pandas/tests/reshape/test_reshape.py

+        df = get_dummies(df, columns=['Nation'], sparse=True)
+        df2 = df.reindex(columns=['GDP'])
+
+        tm.assert_index_equal(df.index, df2.index)


assert the frame itself

use the sparse arg here (so it handles both cases)

I added a few parameterized changes. This one as well as the test in test_combine_concat.py which honestly looks like it needs some cleanup someday.

jreback · 2018-01-01T18:28:16Z

doc/source/whatsnew/v0.23.0.txt

@@ -363,6 +363,7 @@ Reshaping
 - Bug in :func:`Series.rank` where ``Series`` containing ``NaT`` modifies the ``Series`` inplace (:issue:`18521`)
 - Bug in :func:`cut` which fails when using readonly arrays (:issue:`18773`)
 - Bug in :func:`Dataframe.pivot_table` which fails when the ``aggfunc`` arg is of type string.  The behavior is now consistent with other methods like ``agg`` and ``apply`` (:issue:`18713`)
+- Bug in :func:`pandas.core.dtypes.concat._get_series_result_type` which returns SparseDataFrame even if not all contained in Frame are sparse. (:issue:`18914` and :issue:`18686`)


the whatsnew note should be from a user perspective (e.g. the error that happens with get_dummies).

I changed it to be about pd.concat I think that's user facing right? While get_dummies is the symptom pd.concat is the real problem. Lemme know if you want me to change that thanks!

jreback · 2018-01-01T18:30:17Z

have a look at https://github.com/pandas-dev/pandas/issues?q=is%3Aopen+is%3Aissue+label%3ASparse+label%3AReshaping, see if this closes any of those (aside from the already marked ones)

hexgnu · 2018-01-01T19:16:21Z

Also I found one more issue that was directly tied to this. I opened up a new issue though asking about DataFrame.density and whether that should be defined or not.

jreback

minor comment. did you cover the tests in the issue you are marking as closing?

jreback · 2018-01-04T00:38:05Z

doc/source/whatsnew/v0.23.0.txt

@@ -363,6 +363,7 @@ Reshaping
 - Bug in :func:`Series.rank` where ``Series`` containing ``NaT`` modifies the ``Series`` inplace (:issue:`18521`)
 - Bug in :func:`cut` which fails when using readonly arrays (:issue:`18773`)
 - Bug in :func:`Dataframe.pivot_table` which fails when the ``aggfunc`` arg is of type string.  The behavior is now consistent with other methods like ``agg`` and ``apply`` (:issue:`18713`)
+- Bug in :func:`concat` when concatting sparse and dense series it returns only a SparseDataFrame. Should be a DataFrame. (:issue:`18914`, :issue:`18686`, and :issue:`16874`)


use double backticks around DataFrame, SparseDataFrame.

jreback · 2018-01-04T00:44:13Z

looks like we can add some tests from the issues. (make the new tests and add the gh issue on them).

hexgnu · 2018-01-04T00:49:44Z

Yea I wondered if you wanted me to do that. I’ll get on it soon thanks!

…

On Thu, Jan 4, 2018 at 8:44 AM Jeff Reback ***@***.***> wrote: looks like we can add some tests from the issues. (make the new tests and add the gh issue on them). — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#18924 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAIdDu0VpdVzT6fxj5n5M_0wg32MRz1mks5tHB7ugaJpZM4RL0Op> .

hexgnu · 2018-01-05T01:24:27Z

So on #18686 it will close that problem but to_coo won't be defined on DataFrame. Do you want me to tack onto that existing issue I made?

I added a test for the other one plus a fix for what you pointed out in whatsnew.

This was originally brought up in :issue:`18686` and :issue:`18914`. Basically the problem is when you use get_dummies with sparse=True it will return a SparseDataFrame with sparse and dense columns. This is in fact not what we want. What we want is a DataFrame with sparse and dense columns. Inside of pandas.core.dtypes.concat is a function that defines the factory class which needed to be changed.

hexgnu · 2018-01-18T04:02:31Z

Ok this should be good to go after tests run @jreback

…lumn

…dex_sparse_column

TomAugspurger · 2018-01-30T22:13:56Z

@hexgnu close now! Just some linting failures: https://travis-ci.org/pandas-dev/pandas/jobs/334545951#L3209 Can you fix those and repush?

…lumn

hexgnu · 2018-01-30T23:44:50Z

@TomAugspurger done! thanks :) been waiting for that fix in master to show up.

jreback · 2018-01-31T00:12:15Z

pandas/core/dtypes/concat.py

        return SparseDataFrame
    else:
-        return objs[0]
+        return next(obj for obj in objs if not type(obj) == SparseDataFrame)


use ABCSparseDataFrame

actually need to add this in pandas/core/dtypes/generic.py (and a test for it)

use isinstance(obj, ABCSparseDataFrame)

hmm you can just return a DataFrame (this migth be subclasses though)

I tried to keep this as similar to the original as possible. For instance someone might have their own subclass defined of DataFrame.

jreback · 2018-01-31T00:14:21Z

pandas/tests/sparse/test_combine_concat.py

        assert isinstance(res, pd.SparseDataFrame)
        tm.assert_frame_equal(res.to_dense(), exp)

        res = pd.concat([self.dense3, sparse], axis=1)
        exp = pd.concat([self.dense3, self.dense1], axis=1)
-        assert isinstance(res, pd.SparseDataFrame)
+        # See GH18914 and #18686 for why this should be


can you just construct the dataframe and compare?

I was trying to follow the style of this test. I can definitely do that but figured since the rest of the test was following that standard I'd keep it similar.

jreback · 2018-01-31T00:14:35Z

pandas/tests/sparse/test_combine_concat.py

+        assert res.iloc[0, 0] == self.dense3.iloc[0, 0]
+
+        for column in self.dense3.columns:
+            tm.assert_series_equal(res[column], exp[column])


I would be all for splitting this long test up a bit

Alright split it up a bit. Also added some new test cases since it wasn't covering all aspects like concating onself.

jreback

lgtm. just some test cleanup. ping on green.

jreback · 2018-01-31T11:06:11Z

pandas/tests/sparse/test_combine_concat.py

+
+            assert isinstance(res, pd.SparseDataFrame)
+            tm.assert_frame_equal(res.to_dense(), exp)
+


can you add a comment here (and in test below), the purposes of the reverses & the loop

jreback · 2018-01-31T11:07:31Z

pandas/tests/sparse/test_combine_concat.py

+            res = pd.concat(sparse_frame, axis=1)
+            exp = pd.concat(dense_frame, axis=1)
+
+            # See GH18914 and #18686 for why this should be


I would remove these asserts (357-364), the assert on the frame/series is comprehensive. you can not the issue numbers at the top of the test instead

jreback · 2018-01-31T11:08:23Z

pandas/core/dtypes/concat.py

-    if any(b.is_sparse for b in result.blocks):
-        from pandas.core.sparse.api import SparseDataFrame
+
+    from pandas.core.sparse.api import SparseDataFrame


you could put this right before you actually use SparseDataFrame

jreback · 2018-01-31T11:28:48Z

lgtm. ping on green.

hexgnu · 2018-01-31T14:54:02Z

alrighty all green thanks! @jreback

jreback · 2018-02-01T13:09:27Z

thanks @hexgnu nice patch! keep em coming!

… sparse. (pandas-dev#18924)

sinhrks added Bug Sparse Sparse Data Type labels Dec 25, 2017

hexgnu force-pushed the reindex_sparse_column branch from cb099dd to 2612dcc Compare December 28, 2017 16:41

hexgnu changed the title ~~BUG: fix issue with get_dummies not using sparse correct~~ BUG: fix issue with concat creating SparseFrame if not all series are sparse. Dec 28, 2017

hexgnu force-pushed the reindex_sparse_column branch from 7b34c07 to a37ea0c Compare December 28, 2017 16:53

hexgnu force-pushed the reindex_sparse_column branch 3 times, most recently from daac0c8 to 4061e67 Compare January 1, 2018 16:48

jreback requested changes Jan 1, 2018

View reviewed changes

hexgnu force-pushed the reindex_sparse_column branch from 4061e67 to 2ef4ed1 Compare January 1, 2018 19:13

hexgnu force-pushed the reindex_sparse_column branch from 2ef4ed1 to 41411d4 Compare January 2, 2018 16:26

jreback requested changes Jan 4, 2018

View reviewed changes

jreback added this to the 0.23.0 milestone Jan 4, 2018

jreback mentioned this pull request Jan 10, 2018

to_sparse and assign is giving the wrong results. #19163

Closed

hexgnu added 3 commits January 11, 2018 07:23

Add tests for 16874

05a0717

Add more specific tests

6fc6369

hexgnu force-pushed the reindex_sparse_column branch from bfa713d to 6fc6369 Compare January 11, 2018 00:24

hexgnu added 2 commits January 17, 2018 16:43

Merge branch 'master' into reindex_sparse_column

4116da7

Merge branch 'master' into reindex_sparse_column

2ee0391

hexgnu added 2 commits January 29, 2018 10:07

Merge remote-tracking branch 'upstream/master' into reindex_sparse_co…

44aa41a

…lumn

Merge remote-tracking branch 'origin/reindex_sparse_column' into rein…

1b9e976

…dex_sparse_column

hexgnu added 2 commits January 31, 2018 06:42

FIX: fix linting failure with testing

3c0a4da

Merge remote-tracking branch 'upstream/master' into reindex_sparse_co…

90a5003

…lumn

jreback requested changes Jan 31, 2018

View reviewed changes

jreback removed this from the 0.23.0 milestone Jan 31, 2018

hexgnu added 3 commits January 31, 2018 10:10

FIX: use ABCSparseDataFrame and cleanup test a bit

126db41

Add test for ABCSparseDataFrame

6d01387

Cleanup linting problems

81aba2e

jreback approved these changes Jan 31, 2018

View reviewed changes

jreback added this to the 0.23.0 milestone Jan 31, 2018

Cleanup on tests from PR

0768990

hexgnu mentioned this pull request Feb 1, 2018

BUG?: can't print the head of a sparse dataframe in 0.21.0 #18551

Closed

jreback merged commit d3851ac into pandas-dev:master Feb 1, 2018

hexgnu deleted the reindex_sparse_column branch February 1, 2018 13:20

harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018

BUG: fix issue with concat creating SparseFrame if not all series are…

e295435

… sparse. (pandas-dev#18924)


		assert isinstance(res, pd.SparseDataFrame)
		tm.assert_frame_equal(res.to_dense(), exp)

BUG: fix issue with concat creating SparseFrame if not all series are sparse. #18924

BUG: fix issue with concat creating SparseFrame if not all series are sparse. #18924

Conversation

hexgnu commented Dec 24, 2017 • edited Loading

jreback commented Dec 24, 2017

hexgnu commented Dec 24, 2017

hexgnu commented Dec 24, 2017

codecov bot commented Dec 24, 2017 • edited Loading

Codecov Report

TomAugspurger commented Dec 26, 2017

hexgnu commented Dec 26, 2017

hexgnu commented Dec 26, 2017

hexgnu commented Dec 27, 2017

jreback commented Dec 28, 2017

pep8speaks commented Dec 28, 2017 • edited Loading

Comment last updated on January 31, 2018 at 11:26 Hours UTC

hexgnu commented Dec 28, 2017

hexgnu commented Jan 1, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jan 1, 2018

hexgnu commented Jan 1, 2018

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jan 4, 2018

hexgnu commented Jan 4, 2018 via email

hexgnu commented Jan 5, 2018

hexgnu commented Jan 18, 2018

TomAugspurger commented Jan 30, 2018

hexgnu commented Jan 30, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jan 31, 2018

hexgnu commented Jan 31, 2018

jreback commented Feb 1, 2018

hexgnu commented Dec 24, 2017 •

edited

Loading

codecov bot commented Dec 24, 2017 •

edited

Loading

pep8speaks commented Dec 28, 2017 •

edited

Loading