Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLN GH23123 Move SparseArray to arrays #23147

Merged
merged 12 commits into from
Oct 16, 2018

Conversation

JustinZhengBC
Copy link
Contributor

Importing SparseArray from the Arrays folder was causing circular import paths with the __init__.py, so I had to empty it and fix every other file that imported from pandas.core.arrays

@jorisvandenbossche
Copy link
Member

This change in imports of other arrays everywhere should not be needed. Can you detail where the circular import problem was coming from?

@datapythonista datapythonista added Refactor Internal refactoring of code Sparse Sparse Data Type labels Oct 14, 2018
@datapythonista
Copy link
Member

@JustinZhengBC you've got some print() calls in your PR that I guess were for your own debugging. Can you remove them?

ExtensionScalarOpsMixin)
from .categorical import Categorical # noqa
from .datetimes import DatetimeArrayMixin # noqa
from .interval import IntervalArray # noqa
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a nice pattern, pls restore this

@JustinZhengBC
Copy link
Contributor Author

The main import loop problem is in pandas.core.types.dtypes.common.py, which imports SparseDtype from pandas.core.arrays.sparse.dtype for two uses in checking is_instance. Whenever a SparseArray is imported, it now executes pandas.core.arrays.__init__.py. Many of the files mentioned in __init__.py in turn import common.py, causing another import from SparseArray, causing another call to __init__.py in the process.

I will try putting the offending import statements from common.py into the functions where they are used so that they are not automatically called on import, and see if that gets rid of the loop.

@JustinZhengBC
Copy link
Contributor Author

Almost all of the tests seem to be failing in the same place. The trace is below, and the offending line is from pandas.core.arrays.sparse.array import _maybe_to_sparse, where the tests claim they can't find pandas.core.arrays.sparse.array. The weird thing is, the tests do not fail locally, and I can even execute the exact same line without any errors, so I'm a bit stuck here. Am I doing something wrong, or is it that moving a folder and thus changing the API causes tests to fail because of how they're set up?

Traceback (most recent call last):
  File "/home/travis/miniconda3/envs/pandas/lib/python2.7/site-packages/_pytest/config/__init__.py", line 419, in _importconftest
    mod = conftestpath.pyimport()
  File "/home/travis/miniconda3/envs/pandas/lib/python2.7/site-packages/py/_path/local.py", line 668, in pyimport
    __import__(modname)
  File "/home/travis/build/pandas-dev/pandas/pandas/__init__.py", line 42, in <module>
    from pandas.core.api import *
  File "/home/travis/build/pandas-dev/pandas/pandas/core/api.py", line 10, in <module>
    from pandas.core.groupby import Grouper
  File "/home/travis/build/pandas-dev/pandas/pandas/core/groupby/__init__.py", line 1, in <module>
    from pandas.core.groupby.groupby import GroupBy  # flake8: noqa
  File "/home/travis/build/pandas-dev/pandas/pandas/core/groupby/groupby.py", line 39, in <module>
    from pandas.core.generic import NDFrame
  File "/home/travis/build/pandas-dev/pandas/pandas/core/generic.py", line 43, in <module>
    from pandas.core.internals import BlockManager
  File "/home/travis/build/pandas-dev/pandas/pandas/core/internals/__init__.py", line 10, in <module>
    from .managers import (  # noqa:F401
  File "/home/travis/build/pandas-dev/pandas/pandas/core/internals/managers.py", line 32, in <module>
    from pandas.core.arrays.sparse.array import _maybe_to_sparse
ImportError: No module named sparse.array
ERROR: could not load /home/travis/build/pandas-dev/pandas/pandas/conftest.py

@datapythonista
Copy link
Member

Did you try to simply import pandas? I think there is a wrong import in an __init__.py, or something similar.

@JustinZhengBC
Copy link
Contributor Author

JustinZhengBC commented Oct 15, 2018

Yes, import pandas works locally. I can even run the exact same lines in the .sh file that Travis fails on.

EDIT: I've noticed all the tests that run on python 2.7 fail said line, but the tests that run on 3.x don't. I'll look into this further.

EDIT2: Seems like somehow the __init__.py got lost in the move. Hopefully the tests work now.

@codecov
Copy link

codecov bot commented Oct 15, 2018

Codecov Report

Merging #23147 into master will increase coverage by 0.01%.
The diff coverage is 92.68%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23147      +/-   ##
==========================================
+ Coverage   92.13%   92.14%   +0.01%     
==========================================
  Files         170      170              
  Lines       51073    51017      -56     
==========================================
- Hits        47056    47011      -45     
+ Misses       4017     4006      -11
Flag Coverage Δ
#multiple 90.57% <92.68%> (+0.01%) ⬆️
#single 42.3% <68.29%> (+0.02%) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/period.py 95.56% <ø> (+1.27%) ⬆️
pandas/compat/pickle_compat.py 75.6% <ø> (ø) ⬆️
pandas/core/util/hashing.py 98.4% <ø> (ø) ⬆️
pandas/core/arrays/sparse/scipy_sparse.py 97.05% <ø> (ø)
pandas/core/arrays/sparse/dtype.py 100% <ø> (ø)
pandas/core/arrays/sparse/frame.py 94.86% <100%> (ø)
pandas/io/packers.py 88.04% <100%> (ø) ⬆️
pandas/core/arrays/integer.py 94.9% <100%> (ø) ⬆️
pandas/core/reshape/reshape.py 99.55% <100%> (ø) ⬆️
pandas/core/arrays/sparse/array.py 91.55% <100%> (ø)
... and 24 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 85dc171...120bc5e. Read the comment docs.

@jorisvandenbossche
Copy link
Member

@JustinZhengBC It seems you have removed the sparse tests, without adding them back in the tests/arrays folder.

Something general, should we keep the SparseSeries / SparseDataFrame where they were in pandas.core.sparse ? Since they are not really arrays, and we might deprecate them anyhow. Then moving them is maybe not that worth it?

@JustinZhengBC
Copy link
Contributor Author

@jorisvandenbossche thanks for the catch. All the tests should be there now. The "files changed" tab still says I deleted pandas/core/sparse/api.py even though it also says I created another file with the same name in the right location. Other than that, it shouldn't say I deleted any other files now.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I would only more the actual sparse array itself here, pls leave the SparseSeries/DataFrame exactly where they were (and there tests).

@@ -19,7 +19,8 @@
"""

from numpy import ndarray
from pandas.util._validators import (validate_args, validate_kwargs,
from pandas.util._validators import (validate_args,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you are changing unrelated things? (even formatting)

@TomAugspurger
Copy link
Contributor

Test failure is unrelated to the changes here. Apparently, the SparseArray.unique implementation is flaky. Taking a look now.

@TomAugspurger
Copy link
Contributor

In [4]: pd.SparseArray([0, 0]).unique()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-4-c1befb06646c> in <module>
----> 1 pd.SparseArray([0, 0]).unique()

~/sandbox/pandas/pandas/core/sparse/array.py in unique(self)
    570     def unique(self):
    571         uniques = list(pd.unique(self.sp_values))
--> 572         fill_loc = self._first_fill_value_loc()
    573         if fill_loc >= 0:
    574             uniques.insert(fill_loc, self.fill_value)

~/sandbox/pandas/pandas/core/sparse/array.py in _first_fill_value_loc(self)
    562
    563         indices = self.sp_index.to_int_index().indices
--> 564         if indices[0] > 0:
    565             return 0
    566

IndexError: index 0 is out of bounds for axis 0 with size 0

@TomAugspurger
Copy link
Contributor

And a fix

diff --git a/pandas/core/sparse/array.py b/pandas/core/sparse/array.py
index 15b5118db..17e2bd188 100644
--- a/pandas/core/sparse/array.py
+++ b/pandas/core/sparse/array.py
@@ -561,7 +561,7 @@ class SparseArray(PandasObject, ExtensionArray, ExtensionOpsMixin):
             return -1
 
         indices = self.sp_index.to_int_index().indices
-        if indices[0] > 0:
+        if len(indices) == 0 or indices[0] > 0:
             return 0
 
         diff = indices[1:] - indices[:-1]
diff --git a/pandas/tests/sparse/test_array.py b/pandas/tests/sparse/test_array.py
index 0257d9962..5b1afdc7a 100644
--- a/pandas/tests/sparse/test_array.py
+++ b/pandas/tests/sparse/test_array.py
@@ -1065,6 +1065,13 @@ def test_unique_na_fill(arr, fill_value):
     tm.assert_numpy_array_equal(a, b)
 
 
+def test_unique_all_sparse():
+    arr = SparseArray([0, 0])
+    result = arr.unique()
+    expected = SparseArray([0])
+    tm.assert_sp_array_equal(result, expected)
+
+
 def test_map():
     arr = SparseArray([0, 1, 2])
     expected = SparseArray([10, 11, 12], fill_value=10)

@jreback @jorisvandenbossche any objections to including that fix here? It seems simple enough to not need another PR, which will create a merge conflict here. Or we can merge this and do that in a separate PR.

@TomAugspurger
Copy link
Contributor

Found another random failure. Going to split these two out to a new PR.

@TomAugspurger
Copy link
Contributor

@JustinZhengBC I'm going to push some changes to your branch.

@pep8speaks
Copy link

Hello @JustinZhengBC! Thanks for updating the PR.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 15, 2018

Updates:

  1. Moved pandas/core/arrays/sparse/array.py to pandas/core/arrays/sparse.py
  2. Moved SparseDtype implementation into pandas/core/arrays/sparse.py
  3. Moved sparse/{series,frame,api,scipy_sparse}.py back to pandas/core/sparse (so no change from master)

@@ -523,7 +523,7 @@ def _add_delta(self, delta):
The result's name is set outside of _add_delta by the calling
method (__add__ or __sub__)
"""
from pandas.core.arrays.timedeltas import TimedeltaArrayMixin
from pandas.core.arrays import TimedeltaArrayMixin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were these changes here before mine? I may revert them.

@TomAugspurger
Copy link
Contributor

I reverted some extraneous changes in fe53b50. @JustinZhengBC if you're interested, you can re-revert that commit later and make a separate PR with those sytlisitc changes. I notice now that it has a formatting change too in https://github.com/pandas-dev/pandas/pull/23147/files#diff-6cdf34ac065f3e1e350f94c70e76898bL6, sorry. You'll need to manually fix that if you re-revert it.

@@ -3,8 +3,8 @@
register_index_accessor,
register_series_accessor)
from pandas.core.algorithms import take # noqa
from pandas.core.arrays.base import (ExtensionArray, # noqa
ExtensionScalarOpsMixin)
from pandas.core.arrays import (ExtensionArray, # noqa
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, should have reverted this too :/ Oh well.

@TomAugspurger
Copy link
Contributor

@jreback @jorisvandenbossche @datapythonista could I get a quick sanity check on this? I'd like to merge on green since I'll have some followup PRs that will be blocked by this.

@jreback
Copy link
Contributor

jreback commented Oct 15, 2018

lgtm at a glance

@jorisvandenbossche
Copy link
Member

Looks good to me as well based on a quick view

@TomAugspurger
Copy link
Contributor

OK to ignore the numpydev failure, or will we be re-running PRs after #23178 is in?

@TomAugspurger
Copy link
Contributor

Merging so we can move forward. Will keep an eye on master though.

@TomAugspurger TomAugspurger merged commit 913f71f into pandas-dev:master Oct 16, 2018
@TomAugspurger
Copy link
Contributor

Thanks @JustinZhengBC!

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Refactor Internal refactoring of code Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move SparseArray implementation to pandas/core/arrays
6 participants