Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Sparse Return Types #12855

Closed
gfyoung opened this issue Apr 10, 2016 · 17 comments
Closed

API: Sparse Return Types #12855

gfyoung opened this issue Apr 10, 2016 · 17 comments
Labels
Docs Sparse Sparse Data Type
Milestone

Comments

@gfyoung
Copy link
Member

gfyoung commented Apr 10, 2016

The documentation of SparseArray.cumsum says it returns a Series, but that is not the case:

>>> from pandas import SparseArray, np
>>> data = np.arange(10)
>>> s = SparseArray(data, fill_value=np.nan)
>>> type(s.cumsum())
<class 'pandas.sparse.array.SparseArray'>
>>>
>>> s = SparseArray(data, fill_value=2)
>>> type(s.cumsum())
<class 'numpy.ndarray'>

Given that to_dense now returns self.values and ignores fill_value, this current behaviour in cumsum seems illogical and should probably just return SparseArray regardless. At the very least though, I think a documentation fix is in order.

EDIT: there is a similar issue with SparseSeries that should also be addressed

@jreback
Copy link
Contributor

jreback commented Apr 10, 2016

these were all JUST fixed by @sinhrks if you would like to make a master documentation issue, pls do. but leave it as this one.

@gfyoung
Copy link
Member Author

gfyoung commented Apr 10, 2016

@jreback : Yes, I am aware of what @sinhrks did, as this is against the master branch. He did not make any changes to cumsum in his PR.

@jreback
Copy link
Contributor

jreback commented Apr 10, 2016

ok, then if you find more things. just add them at the top, DONT't create more issues about this

@jreback jreback changed the title SparseArray Cumsum Documentation Mismatch DOC: Sparse Documentation Apr 10, 2016
@jreback
Copy link
Contributor

jreback commented Apr 10, 2016

though this is actually not a documentation issue at all. but a functional one.

@jreback jreback changed the title DOC: Sparse Documentation DOC/API: Sparse Return Types Apr 10, 2016
@jreback jreback added this to the 0.19.0 milestone Apr 10, 2016
@gfyoung
Copy link
Member Author

gfyoung commented Apr 10, 2016

@jreback : well at the very least, the documentation should be changed to match?

@kawochen kawochen mentioned this issue Apr 10, 2016
18 tasks
@jreback
Copy link
Contributor

jreback commented Apr 10, 2016

@gfyoung ABOSOLUTEY NOT. Its an issue, you don't just change the documentation you fix the problem!

@jreback jreback changed the title DOC/API: Sparse Return Types API: Sparse Return Types Apr 10, 2016
@gfyoung
Copy link
Member Author

gfyoung commented Apr 10, 2016

I'm confused. We're currently providing incorrect information in the documentation...I meant this should be a temporary measure.

@jreback
Copy link
Contributor

jreback commented Apr 10, 2016

@gfyoung how are you confused? its an api issue, changing the documentation (which is only inherited) is pointless. you can certainly propose a fix for the issue. fixing documentation is a band-aid which is not even worth doing.

@gfyoung
Copy link
Member Author

gfyoung commented Apr 10, 2016

@jreback : If you had just said, "it's not worthwhile to change the documentation as a stopgap measure even though it is incorrect," instead of your last two comments, that would have cleared up my confusion.

@sinhrks
Copy link
Member

sinhrks commented Apr 10, 2016

@gfyoung thanks for the catch. Can it be included in #12810? Lmk if any fix is needed on my side.

@gfyoung
Copy link
Member Author

gfyoung commented Apr 10, 2016

@sinhrks : #12810 (diff)

@gfyoung
Copy link
Member Author

gfyoung commented Nov 26, 2016

Looking at this issue again, I must say I find this SparseArray.cumsum API quite confusing. Isn't the point of this function to provide a cumulative sum as we slowly progress through the array using the sum function defined for SparseArray i.e.:

>>> s = SparseArray([1, 2, 2, np.nan, 2, 2, np.nan], fill_value=2)
>>> s.sum()
9.0
>>> s.cumsum()
array([1., 3., 5., nan, nan, nan, nan])

I would have expected the following:

array([1., 3., 5., nan, nan, 7.0, 9.0, nan])

EDIT: Same is true for SparseSeries

@jreback , @jorisvandenbossche , @sinhrks : Thoughts?

@sinhrks
Copy link
Member

sinhrks commented Nov 29, 2016

@gfyoung +1. We should make an agreement that SparseArray is compat with pandas internals rather than numpy.

pd.Series([1, 2, 2, np.nan, 2, 2, np.nan]).cumsum()
# 0    1.0
# 1    3.0
# 2    5.0
# 3    NaN
# 4    7.0
# 5    9.0
# 6    NaN
# dtype: float64

np.cumsum([1, 2, 2, np.nan, 2, 2, np.nan])
# array([  1.,   3.,   5.,  nan,  nan,  nan,  nan])

@gfyoung
Copy link
Member Author

gfyoung commented Nov 29, 2016

@sinhrks : I was about to implement a fix, but what is the meaning of fill_value exactly? Because I thought it should have an impact on aggregation behaviour:

>>> data = np.arange(10)
>>> s = SparseArray(data, fill_value=2)
>>> s.sum()
45
>>> s = SparseArray(data, fill_value=np.nan)
>>> s.sum()
45

IIUC, fill_value=2 implies that any data point with "2" as the value is considered a null / missing data point, meaning that the first sum should be 43?

@sinhrks
Copy link
Member

sinhrks commented Nov 29, 2016

fill_value is unrelated to normal missing values. It is to specify values omitted in sparse repr. This shouldn't affect to dense / aggregated results.

# this omits np.nan, holds [1, 1] as sparse values with its locations
pd.SparseArray([np.nan, np.nan, 1, 1], fill_value=np.nan)
# [nan, nan, 1.0, 1.0]
# Fill: nan
# IntIndex
# Indices: array([2, 3])

# this omits 1, holds [nan, nan] as sparse values with its locations
pd.SparseArray([np.nan, np.nan, 1, 1], fill_value=1)
# [nan, nan, 1, 1]
# Fill: 1
# IntIndex
# Indices: array([0, 1])

@gfyoung
Copy link
Member Author

gfyoung commented Nov 29, 2016

@sinhrks : Okay, so the sums should be the same?

@sinhrks
Copy link
Member

sinhrks commented Nov 30, 2016

Yes, should be the same.

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 30, 2016
Always return SparseArray and SparseSeries for
SparseArray.cumsum() and SparseSeries.cumsum()
respectively, regardless of fill_value.

Close pandas-devgh-12855.
gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 30, 2016
Always return SparseArray and SparseSeries for
SparseArray.cumsum() and SparseSeries.cumsum()
respectively, regardless of fill_value.

Close pandas-devgh-12855.
gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 1, 2016
Always return SparseArray and SparseSeries for
SparseArray.cumsum() and SparseSeries.cumsum()
respectively, regardless of fill_value.

Close pandas-devgh-12855.
gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 1, 2016
Always return SparseArray and SparseSeries for
SparseArray.cumsum() and SparseSeries.cumsum()
respectively, regardless of fill_value.

Close pandas-devgh-12855.
gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 2, 2016
Always return SparseArray and SparseSeries for
SparseArray.cumsum() and SparseSeries.cumsum()
respectively, regardless of fill_value.

Close pandas-devgh-12855.
gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 6, 2016
Always return SparseArray and SparseSeries for
SparseArray.cumsum() and SparseSeries.cumsum()
respectively, regardless of fill_value.

Close pandas-devgh-12855.
gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 6, 2016
Always return SparseArray and SparseSeries for
SparseArray.cumsum() and SparseSeries.cumsum()
respectively, regardless of fill_value.

Close pandas-devgh-12855.
gfyoung added a commit to forking-repos/pandas that referenced this issue Dec 11, 2016
Always return SparseArray and SparseSeries for
SparseArray.cumsum() and SparseSeries.cumsum()
respectively, regardless of fill_value.

Close pandas-devgh-12855.
ischurov pushed a commit to ischurov/pandas that referenced this issue Dec 19, 2016
Always return `SparseArray` and `SparseSeries` for
`SparseArray.cumsum()` and `SparseSeries.cumsum()` respectively,
regardless of `fill_value`.

Closes pandas-dev#12855.

Author: gfyoung <gfyoung17@gmail.com>

Closes pandas-dev#14771 from gfyoung/sparse-return-type and squashes the following commits:

83314fc [gfyoung] API: Return sparse objects always for cumsum
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Sparse Sparse Data Type
Projects
None yet
Development

No branches or pull requests

3 participants