Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove SparseSeries and SparseDataFrame #28425

Merged
merged 36 commits into from
Sep 18, 2019

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Sep 13, 2019

As discussed on the dev call.

Closes #28407
Closes #28025
Closes #26777
Closes #26227
Not #26123
Closes #25694
Closes #25378
Not #25270
Not #24817
Closes #24270
Closes #22630
Closes #21993
Closes #21909
Not #21818
Closes #21231
Closes #20766
Closes #20172
Not #19278
Closes #18969
Not #18414
Closes #17993
Closes #15737
Closes #15634
Closes #14427
Closes #14310
Closes #13665
Not #11726
Closes #10627
Closes #9275
Not #9265
Closes #6427
Closes #5470
Closes #5078

@TomAugspurger TomAugspurger added this to the 1.0 milestone Sep 13, 2019
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do a search for @pytest.mark.filterwarnings("ignore:Sparse:FutureWarning"), I thnk there are a few left (or at least at the moment I was reviewing)

doc/source/whatsnew/v1.0.0.rst Outdated Show resolved Hide resolved
pandas/core/ops/__init__.py Show resolved Hide resolved
pandas/core/reshape/concat.py Outdated Show resolved Hide resolved
pandas/core/reshape/concat.py Show resolved Hide resolved
@@ -1,7 +1,7 @@
"""
Interaction with scipy.sparse matrices.

Currently only includes SparseSeries.to_coo helpers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move this file to arrays/sparse.py (or as separate file next to it), since it is only used there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was going to do moves as a followup. I'd also like to move the last bit of pandas/tests/sparse/test_combine_compat.py, which had a SparseArray test in it.

pandas/core/sparse/api.py Show resolved Hide resolved
pandas/io/pytables.py Show resolved Hide resolved
@@ -53,10 +53,7 @@
Period,
RangeIndex,
Series,
SparseDataFrame,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can leave this file intact for now?
(isn't it meant to be run with older version of pandas? In case we need to regenerate old pickle files, we might still want to generate sparse ones to check that reading them works / errors with a good message)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, we need to remove it so that the sparse structures are not included, re-run it with an older pandas, and commit the new binary file.

I will update to use SparseArray though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd can you comment here? I think we import this file when running the tests, so we need to be able to import it under both the old and new pandas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this file now be restored now you added the pickle compat? (I also think you now no longer update the old pickle files, right?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this file gets imported on master, so I need to remove this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right the file would be created with an old release but read and tested against master I think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we want to mimic what users can do and what Tom is trying to support (i.e. write a pickle file with SparseDataFrame in an older release and read this file with the new release), this file needs to be able to generate such pickle files with SparseDataFrame?

Now, I don't really care that much about SparseDataFrame, and there are currently already legacy pickle files that contain it, so the need to be able to recreate those files before we will actually remove that support is maybe not that high, and I am maybe splitting hairs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @jorisvandenbossche, I think I've misunderstood you this whole time (and I was confused why you kept pushing on this point 😄). You're correct that if we relied on these legacy files for testing reading files written with SparseSeries & SparseDataFrame, then we'd need to keep the imports here (at conditionally).

But a8b0d65 added tests dedicated for that (we need dedicated tests, since the existing ones silence all warnings).

So from a test coverage POV, we're OK with removing writing SparseSeries & SparseDataFrame here. It's just a matter of code style / prep for running this file when 1.0.0 becomes the "legacy" pickle version we support.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So from a test coverage POV, we're OK with removing writing SparseSeries & SparseDataFrame here.

Yep, if we have files now, it seems ok to remove it from this file. It seems rather unlikely that after the next bump in legacy support for pickle files, we still want to keep support for SparseDataFrame, I suppose, as we now warn for it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds right to me. Technically, we could chose to bump the next legacy to version 0.25.x, which has SparseDataFrame, but I think we're more likely to choose 1.0.x since it's a round number.

pandas/tests/io/json/test_pandas.py Show resolved Hide resolved
pandas/tests/series/test_combine_concat.py Show resolved Hide resolved
@TomAugspurger
Copy link
Contributor Author

@WillAyd it looks like you fixed up the pickle panel stuff in https://github.com/pandas-dev/pandas/pull/27082/files.

Would you recommend removing SparseSeries / SparseDataFrame from the pickle files as a precursor to this PR?

@TomAugspurger
Copy link
Contributor Author

3cc4765 has the changes to the pickle stuff. I can split that out into a precursor PR if people want.

I regenerated the pickle files with python 3.5.6 and pandas 0.20.3.

@jorisvandenbossche
Copy link
Member

Do we still have pickle files with sparse data in them?
What error do you get if you read such a file with this branch?

@TomAugspurger
Copy link
Contributor Author

Do we still have pickle files with sparse data in them?
What error do you get if you read such a file with this branch?

As of 3cc4765 we have a pickle file with Series[Sparse] and DataFrame[Sparse], but no SparseSeries and SparseDataFrame. That converted the use of SparseSeries/Frame to Series/Frame with sparse values.

With 3cc4765, the pickle tests on this branch pass. It would also pass on master, so I could split those changes out.

@jorisvandenbossche
Copy link
Member

I don't have a strong opinion on splitting or not, either way.

For older pickle files with SparseDataFrame in it, you currently get:

AttributeError: Can't get attribute 'SparseDataFrame' on <module 'pandas.core.sparse.frame' from '/home/joris/scipy/pandas/pandas/core/sparse/frame.py'>

With our pickle_compat, I suppose we could actually map this to a function that creates a DataFrame[sparse] of it. But not sure that's worth the effort.
But a better error message indicating they need to read it in 0.25, convert and save again to read it in 1.0 would be welcome I think.

@TomAugspurger
Copy link
Contributor Author

Huh, 766a2f2 has a bit of compat code to allow reading a file saved with SparseSeries / SparseFrame. I don't really know how it works, but it seems to. I've restored the old pickle file with SparseSeries / SparseFrame and it passes locally.

@TomAugspurger TomAugspurger added Deprecate Functionality to remove in pandas Sparse Sparse Data Type labels Sep 13, 2019
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me.

For the docs, we need to see if it is needed to add actual output to the code-blocks

For the issues you are closing with this, we actually need to see if some of them are not also valid for Series[sparse] / SparseArray

@@ -53,10 +53,7 @@
Period,
RangeIndex,
Series,
SparseDataFrame,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this file now be restored now you added the pickle compat? (I also think you now no longer update the old pickle files, right?)

@jorisvandenbossche
Copy link
Member

For the issues you are closing with this, we actually need to see if some of them are not also valid for Series[sparse] / SparseArray

I already went through the most recent ones, and crossed a few in the top post.

@@ -53,10 +53,7 @@
Period,
RangeIndex,
Series,
SparseDataFrame,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here are correct; you just need to re-run this in an environment with pandas 0.20.3 ideally using the minimum supported Python version, which I guess right now is 3.5.3 though by the time this goes live should be 3.6, so maybe stick to the latter (I see there is one for 3.5.6 already)

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally nice! This is quite the clean up. Just some comments to consider but I think this looks really good

doc/source/user_guide/io.rst Outdated Show resolved Hide resolved
@@ -91,8 +91,7 @@ Interaction with scipy.sparse

Added :meth:`SparseSeries.to_coo` and :meth:`SparseSeries.from_coo` methods (:issue:`8048`) for converting to and from ``scipy.sparse.coo_matrix`` instances (see :ref:`here <sparse.scipysparse>`). For example, given a SparseSeries with MultiIndex we can convert to a `scipy.sparse.coo_matrix` by specifying the row and column labels as index levels:

.. ipython:: python
:okwarning:
.. code-block:: python
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be done as a follow up and probably something easy for the community to contribute to, but ideally since we are converting these from ipython to python directives should add the original output (applicable in a few places below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly wouldn't object, but I think that's fairly low-value since, e.g. the 0.16 version of the docs at https://pandas.pydata.org/pandas-docs/version/0.16.0/whatsnew.html will have the correct rendering.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I don’t think you should do here as there’s already enough - just a nice to have a first time contributor could pick up.

I can open an issue for it later

pandas/compat/pickle_compat.py Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

added new legacy pickle files

Maybe rather add files for a different (later) version, since we already have them for 0.21?

@TomAugspurger
Copy link
Contributor Author

Just FYI, since the last round of reviews I've added

  1. A dummy SparseDataFrame and SparseSeries class, mirroring what we did for Panel. This prevents the immediate error with feather (I've removed the test xfails and reverted the doc changes working around it)
  2. Added a warning when loading a pickle file with a SparseDataFrame / SparseSeries saying they should re-save it

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last changes look good to me!

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Sep 17, 2019

Going through the rest of the issues to validate that they are specific to the sparse subclasses.

Edit: done. The list at the top should be accurate. All the remaining sparse issues still open should be relevant to SparseArray.

@jorisvandenbossche
Copy link
Member

Let's merge this?

@jreback
Copy link
Contributor

jreback commented Sep 18, 2019

lgtm. I am -0 on providing the pickle compat. but I guess ok.

@jreback
Copy link
Contributor

jreback commented Sep 18, 2019

merge when ready.

@TomAugspurger
Copy link
Contributor Author

Gave another glance through, and I think we're good.

Mostly agreed about the pickle compat. But since it turned out to be so easy (same internal representation) I think we're OK.

@TomAugspurger TomAugspurger merged commit 0a082d4 into pandas-dev:master Sep 18, 2019
@TomAugspurger TomAugspurger deleted the remove-sparse branch September 18, 2019 14:02
@jorisvandenbossche
Copy link
Member

Woohoo! :)

I am -0 on providing the pickle compat. but I guess ok.

I think a reason to have it (apart from that it was easy to add) is because this is a very recent deprecation, and that it also maps nicely to a supported data structure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment