Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix issue with concat creating SparseFrame if not all series are sparse. #18924

Merged
merged 13 commits into from
Feb 1, 2018

Conversation

hexgnu
Copy link
Contributor

@hexgnu hexgnu commented Dec 24, 2017

Ok so after trying a few things out. This seems to be the problem.

When concatting multiple data frames together if any of the containing series are sparse then the entire data frame becomes sparse (or SparseDataFrame).

This is in fact not what we want. We want a DataFrame that contains a SparseSeries and a Series.

closes #18914,
closes #18686
closes #16874
closes #18551

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@jreback
Copy link
Contributor

jreback commented Dec 24, 2017

this should not to be necessary. rather this is a bug in how reindexing is implemented

@hexgnu
Copy link
Contributor Author

hexgnu commented Dec 24, 2017

K I'll look at reindex then. I was a bit hesitant to push this fix because it felt a bit off.

@hexgnu
Copy link
Contributor Author

hexgnu commented Dec 24, 2017

I think it might still be get_dummies because if I create a SparseDataFrame from scratch it works fine

import pandas.util.testing as tm

df2 = pd.SparseDataFrame()
df2['GDP'] = pd.Series([1,2])
df2['Nation_AB'] = pd.SparseSeries([1,0])
df2['Nation_CD'] = pd.SparseSeries([0,1])

df = pd.DataFrame.from_items([('GDP', [1, 2]),('Nation', ['AB', 'CD'])])
df = pd.get_dummies(df, columns=['Nation'], sparse=True)  # SparseDataFrame

tm.assert_frame_equal(df, df2) # => Works

df2.reindex(columns=['GDP']) #=> doesn't fail
df.reindex(columns=['GDP']) #=> fails

df.GDP #=> fails since it's cast as a SparseSeries but has an IntBlock underneath
df[['GDP']] #=> works fine

df2.GDP.block #=> gets cast to SparseBlock somewhere in transmission

Somehow... GDP is being stored as a IntBlock but marked as SparseBlock. It doesn't feel like this is reindex's fault but I can definitely be wrong.

I might have to come back to this in the morning. Not sure where to take it.

@codecov
Copy link

codecov bot commented Dec 24, 2017

Codecov Report

Merging #18924 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18924      +/-   ##
==========================================
- Coverage   91.62%    91.6%   -0.03%     
==========================================
  Files         150      150              
  Lines       48724    48726       +2     
==========================================
- Hits        44642    44634       -8     
- Misses       4082     4092      +10
Flag Coverage Δ
#multiple 89.97% <100%> (-0.03%) ⬇️
#single 41.75% <100%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/dtypes/generic.py 100% <100%> (ø) ⬆️
pandas/core/dtypes/concat.py 99.14% <100%> (ø) ⬆️
pandas/plotting/_converter.py 65.22% <0%> (-1.74%) ⬇️
pandas/core/generic.py 95.91% <0%> (ø) ⬆️
pandas/util/testing.py 83.85% <0%> (+0.2%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 238499a...0768990. Read the comment docs.

@sinhrks sinhrks added Bug Sparse Sparse Data Type labels Dec 25, 2017
@TomAugspurger
Copy link
Contributor

See #18686 as well.

Agreed with @jreback that the issue is in reindex, not get_dummies. I don't think that sparse=True implies that numeric, non-sparse blocks should become sparse. I think those numeric columns should go through untouched.

@hexgnu
Copy link
Contributor Author

hexgnu commented Dec 26, 2017

Alright since it's after the holiday I'll take a look today 😄.

@hexgnu
Copy link
Contributor Author

hexgnu commented Dec 26, 2017

I think I need some guidance on this one... It really doesn't appear to be reindex to me. Doing some debugging I couldn't re-create it without utilizing concat.

The problem is that this column "GDP" is being cast as a SparseSeries when the underlying block is an IntBlock. But reindex thinks that since it's a SparseSeries it should re-index using SparseArray only which is why it's throwing errors.

IF I create the sparse data frame from scratch I can't re-create the issue but if I use concat I can.

from pandas.core.reshape.reshape import _get_dummies_1d
from pandas.core.reshape.concat import concat

df = pd.DataFrame.from_items([('GDP', [1, 2]),('Nation', ['AB', 'CD'])])
dummy = _get_dummies_1d(df['Nation'], sparse=True, prefix='Nation')

isinstance(dummy, pd.SparseDataFrame)
isinstance(df, pd.DataFrame)

# I think this is the real problem.
concatted = concat([df.drop('Nation', axis=1), dummy], axis=1)
concatted.reindex(columns=['GDP']) # => Fails because reindex thinks it's reindexing a SparseArray but it's an IntBlock
concatted.GDP # => Fails since the underlying block is an IntBlock not a SparseBlock

I also looked at concat but have run out of my open source time for the day :(.

Let me know if I'm on the right track or barking up a tree I shouldn't be barking up. Thanks @jreback and @TomAugspurger.

@hexgnu
Copy link
Contributor Author

hexgnu commented Dec 27, 2017

Alright I looked at this again today...

Here is the problem as I see it:

SparseFrame is instantiating only SparseSeries even though 'GDP' isn't a SparseSeries.

On top of that SparseSeries when wrapping a SingleBlockManager doesn't coerce it into a Sparse data structure.

I think SingleBlockManager should be coerced into a sparse structure since everything else is on instantiation. So I will try that tomorrow.

@jreback
Copy link
Contributor

jreback commented Dec 28, 2017

yeah this should actually be returning a DataFrame which would contain SparseSeries (and a regular Series). SparseDataFrame is kind of a special DataFrame that has filling behavior.

@pep8speaks
Copy link

pep8speaks commented Dec 28, 2017

Hello @hexgnu! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 31, 2018 at 11:26 Hours UTC

@hexgnu hexgnu changed the title BUG: fix issue with get_dummies not using sparse correct BUG: fix issue with concat creating SparseFrame if not all series are sparse. Dec 28, 2017
@hexgnu
Copy link
Contributor Author

hexgnu commented Dec 28, 2017

@jreback omg thank you! found the bug. I was hitting my head trying to get SparseDataFrame to have both dense and sparse series in it but in fact needed to use DataFrame.

Found the issue and honestly it's a simple enough fix. This should close both #18914 and #18686

@hexgnu hexgnu force-pushed the reindex_sparse_column branch 3 times, most recently from daac0c8 to 4061e67 Compare January 1, 2018 16:48
@hexgnu
Copy link
Contributor Author

hexgnu commented Jan 1, 2018

Alright so I just pushed some test fixes.

One thing that kind of confuses me is why is there both:

pandas.core.dtypes.concat._get_series_result_type
pandas.core.dtypes.concat._get_frame_result_type

They are almost identical minus some things. I didn't want to get stuck in a refactor loop but it seems like these functions should be one instead of two?

df = get_dummies(df, columns=['Nation'], sparse=True)
df2 = df.reindex(columns=['GDP'])

tm.assert_index_equal(df.index, df2.index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert the frame itself

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the sparse arg here (so it handles both cases)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few parameterized changes. This one as well as the test in test_combine_concat.py which honestly looks like it needs some cleanup someday.

@@ -363,6 +363,7 @@ Reshaping
- Bug in :func:`Series.rank` where ``Series`` containing ``NaT`` modifies the ``Series`` inplace (:issue:`18521`)
- Bug in :func:`cut` which fails when using readonly arrays (:issue:`18773`)
- Bug in :func:`Dataframe.pivot_table` which fails when the ``aggfunc`` arg is of type string. The behavior is now consistent with other methods like ``agg`` and ``apply`` (:issue:`18713`)
- Bug in :func:`pandas.core.dtypes.concat._get_series_result_type` which returns SparseDataFrame even if not all contained in Frame are sparse. (:issue:`18914` and :issue:`18686`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the whatsnew note should be from a user perspective (e.g. the error that happens with get_dummies).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to be about pd.concat I think that's user facing right? While get_dummies is the symptom pd.concat is the real problem. Lemme know if you want me to change that thanks!

@jreback
Copy link
Contributor

jreback commented Jan 1, 2018

have a look at https://github.com/pandas-dev/pandas/issues?q=is%3Aopen+is%3Aissue+label%3ASparse+label%3AReshaping, see if this closes any of those (aside from the already marked ones)

@hexgnu
Copy link
Contributor Author

hexgnu commented Jan 1, 2018

Also I found one more issue that was directly tied to this. I opened up a new issue though asking about DataFrame.density and whether that should be defined or not.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment. did you cover the tests in the issue you are marking as closing?

@@ -363,6 +363,7 @@ Reshaping
- Bug in :func:`Series.rank` where ``Series`` containing ``NaT`` modifies the ``Series`` inplace (:issue:`18521`)
- Bug in :func:`cut` which fails when using readonly arrays (:issue:`18773`)
- Bug in :func:`Dataframe.pivot_table` which fails when the ``aggfunc`` arg is of type string. The behavior is now consistent with other methods like ``agg`` and ``apply`` (:issue:`18713`)
- Bug in :func:`concat` when concatting sparse and dense series it returns only a SparseDataFrame. Should be a DataFrame. (:issue:`18914`, :issue:`18686`, and :issue:`16874`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double backticks around DataFrame, SparseDataFrame.

@jreback jreback added this to the 0.23.0 milestone Jan 4, 2018
@jreback
Copy link
Contributor

jreback commented Jan 4, 2018

looks like we can add some tests from the issues. (make the new tests and add the gh issue on them).

@hexgnu
Copy link
Contributor Author

hexgnu commented Jan 4, 2018 via email

@hexgnu
Copy link
Contributor Author

hexgnu commented Jan 5, 2018

So on #18686 it will close that problem but to_coo won't be defined on DataFrame. Do you want me to tack onto that existing issue I made?

I added a test for the other one plus a fix for what you pointed out in whatsnew.

This was originally brought up in :issue:`18686` and :issue:`18914`.
Basically the problem is when you use get_dummies with sparse=True it
will return a SparseDataFrame with sparse and dense columns. This is in
fact not what we want. What we want is a DataFrame with sparse and dense
columns.

Inside of pandas.core.dtypes.concat is a function that defines the
factory class which needed to be changed.
@hexgnu
Copy link
Contributor Author

hexgnu commented Jan 18, 2018

Ok this should be good to go after tests run @jreback

@TomAugspurger
Copy link
Contributor

@hexgnu close now! Just some linting failures: https://travis-ci.org/pandas-dev/pandas/jobs/334545951#L3209 Can you fix those and repush?

@hexgnu
Copy link
Contributor Author

hexgnu commented Jan 30, 2018

@TomAugspurger done! thanks :) been waiting for that fix in master to show up.

return SparseDataFrame
else:
return objs[0]
return next(obj for obj in objs if not type(obj) == SparseDataFrame)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use ABCSparseDataFrame

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually need to add this in pandas/core/dtypes/generic.py (and a test for it)

use isinstance(obj, ABCSparseDataFrame)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm you can just return a DataFrame (this migth be subclasses though)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to keep this as similar to the original as possible. For instance someone might have their own subclass defined of DataFrame.

assert isinstance(res, pd.SparseDataFrame)
tm.assert_frame_equal(res.to_dense(), exp)

res = pd.concat([self.dense3, sparse], axis=1)
exp = pd.concat([self.dense3, self.dense1], axis=1)
assert isinstance(res, pd.SparseDataFrame)
# See GH18914 and #18686 for why this should be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just construct the dataframe and compare?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to follow the style of this test. I can definitely do that but figured since the rest of the test was following that standard I'd keep it similar.

assert res.iloc[0, 0] == self.dense3.iloc[0, 0]

for column in self.dense3.columns:
tm.assert_series_equal(res[column], exp[column])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be all for splitting this long test up a bit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright split it up a bit. Also added some new test cases since it wasn't covering all aspects like concating onself.

@jreback jreback removed this from the 0.23.0 milestone Jan 31, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. just some test cleanup. ping on green.


assert isinstance(res, pd.SparseDataFrame)
tm.assert_frame_equal(res.to_dense(), exp)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here (and in test below), the purposes of the reverses & the loop

res = pd.concat(sparse_frame, axis=1)
exp = pd.concat(dense_frame, axis=1)

# See GH18914 and #18686 for why this should be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove these asserts (357-364), the assert on the frame/series is comprehensive. you can not the issue numbers at the top of the test instead

if any(b.is_sparse for b in result.blocks):
from pandas.core.sparse.api import SparseDataFrame

from pandas.core.sparse.api import SparseDataFrame
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could put this right before you actually use SparseDataFrame

@jreback jreback added this to the 0.23.0 milestone Jan 31, 2018
@jreback
Copy link
Contributor

jreback commented Jan 31, 2018

lgtm. ping on green.

@hexgnu
Copy link
Contributor Author

hexgnu commented Jan 31, 2018

alrighty all green thanks! @jreback

@jreback jreback merged commit d3851ac into pandas-dev:master Feb 1, 2018
@jreback
Copy link
Contributor

jreback commented Feb 1, 2018

thanks @hexgnu nice patch! keep em coming!

@hexgnu hexgnu deleted the reindex_sparse_column branch February 1, 2018 13:20
harisbal pushed a commit to harisbal/pandas that referenced this pull request Feb 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment