Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix wrong SparseBlock initialization #17386

Closed

Conversation

Licht-T
Copy link
Contributor

@Licht-T Licht-T commented Aug 31, 2017

@codecov
Copy link

codecov bot commented Aug 31, 2017

Codecov Report

Merging #17386 into master will decrease coverage by 0.04%.
The diff coverage is 71.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17386      +/-   ##
==========================================
- Coverage   91.03%   90.99%   -0.05%     
==========================================
  Files         163      163              
  Lines       49580    49586       +6     
==========================================
- Hits        45137    45119      -18     
- Misses       4443     4467      +24
Flag Coverage Δ
#multiple 88.77% <71.42%> (-0.03%) ⬇️
#single 40.25% <42.85%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/sparse/frame.py 94.7% <100%> (ø) ⬆️
pandas/core/internals.py 94.2% <66.66%> (-0.1%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/plotting/_converter.py 63.23% <0%> (-1.82%) ⬇️
pandas/core/frame.py 97.72% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64c8a8d...7928aaa. Read the comment docs.

@codecov
Copy link

codecov bot commented Aug 31, 2017

Codecov Report

Merging #17386 into master will decrease coverage by 0.02%.
The diff coverage is 91.78%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17386      +/-   ##
==========================================
- Coverage   91.43%    91.4%   -0.03%     
==========================================
  Files         163      163              
  Lines       50091    50148      +57     
==========================================
+ Hits        45800    45838      +38     
- Misses       4291     4310      +19
Flag Coverage Δ
#multiple 89.21% <91.78%> (-0.01%) ⬇️
#single 40.34% <27.39%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/core/sparse/frame.py 94.82% <100%> (+0.04%) ⬆️
pandas/core/sparse/array.py 92.05% <100%> (-0.24%) ⬇️
pandas/core/internals.py 94.32% <90.9%> (-0.22%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.8% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96a5274...6b36d55. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Aug 31, 2017

needs tests

@gfyoung gfyoung added 2/3 Compat Sparse Sparse Data Type Bug and removed 2/3 Compat labels Aug 31, 2017
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

w/o tests I don't even know what you are trying to do. Further we don't use if/then like this at all, instead dispatch per-block. but pls add tests first.

@Licht-T Licht-T force-pushed the fix-wrong-sparseblock-initialization branch from ea077b3 to 785bbcf Compare September 2, 2017 04:30
@Licht-T
Copy link
Contributor Author

Licht-T commented Sep 2, 2017

@jreback Thanks for your comment. I've just added test codes.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first need to get the tests correct

@@ -527,7 +535,7 @@ def f(m, v, i):

return self.split_and_operate(None, f, False)

def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
def astype(self, dtype, copy=True, errors='raise', values=None, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls don't change things like this, which completely changes semantics

@@ -442,12 +443,19 @@ def make_a_block(nv, ref_loc):
nv = _block_shape(nv, ndim=self.ndim)
except (AttributeError, NotImplementedError):
pass
block = self.make_block(values=nv,
placement=ref_loc, fastpath=True)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this pattern at all, nv should be a sparse array at this point; the inference should just work. is that not the case?

@@ -1562,6 +1581,10 @@ def _nanpercentile(values, q, axis, **kw):
result = self._try_coerce_result(result)
if is_scalar(result):
return ax, self.make_block_scalar(result)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again this should be a sparse array, if its not then you need to implement this on the SparseBlock itself (and you can certainly call the super method). I don't like type checking inside a block.

@@ -2653,7 +2676,7 @@ def sp_index(self):
def kind(self):
return self.values.kind

def _astype(self, dtype, copy=False, raise_on_error=True, values=None,
def _astype(self, dtype, copy=True, raise_on_error=True, values=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not do this

@@ -321,8 +321,8 @@ def _apply_columns(self, func):
data=new_data, index=self.index, columns=self.columns,
default_fill_value=self.default_fill_value).__finalize__(self)

def astype(self, dtype):
return self._apply_columns(lambda x: x.astype(dtype))
def astype(self, dtype, copy=True, errors='raise'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this match the super signature?

@@ -1385,3 +1385,62 @@ def test_numpy_func_call(self):
'std', 'min', 'max']
for func in funcs:
getattr(np, func)(self.frame)

def test_where(self):
data = [[1, 2], [3, 4]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number as a comment

sparse_df = SparseDataFrame(data)
result = sparse_df.where(sparse_df >= 2)

dense_df = DataFrame(data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so these return dense frames????

q = 0.1

sparse_df = SparseDataFrame(data)
result = sparse_df.quantile(q)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are very odd tests, why are you returning dense frames?

result = sparse.where(sparse >= 2)

dense = Series(data)
expected = dense.where(dense >= 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same questions as above

@Licht-T
Copy link
Contributor Author

Licht-T commented Sep 15, 2017

@jreback Thanks for your review! I've found the initial solution is not enough through fixing tests.
Also I've modified the solution!

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are changing lots of things. can you do this step by step rather that all at once. you are touching some pretty gnarly code.

@@ -978,7 +984,7 @@ def f(m, v, i):

return [self.make_block(new_values, fastpath=True)]

def coerce_to_target_dtype(self, other):
def coerce_to_target_dtype(self, other, copy=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I dont think this actually helps, why did you add it?

Copy link
Contributor Author

@Licht-T Licht-T Sep 15, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SparseBlock astype(dtype, copy=False) makes reinterpret cast, so I override coerce_to_target_dtype and set copy=True in SparseBlock class.
203a8f9#diff-e705e723b2d6e7c0e2a0443f80916abfR2639

**kwargs)

dtype = self.values.sp_values.dtype

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty complex, pls simplify

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added these codes because coerce_to_target_dtype does not work well in SparseBlock. This is only checking the type information of SparseArray and these procedure is also implemented in IntBlock, etc. I cannot figure out how to simplify. Any suggestion for simplifying?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not against this theorectially, but the implementation is fragile here. there are functions for validation already in the sparse classes.

@Licht-T
Copy link
Contributor Author

Licht-T commented Sep 15, 2017

@jreback Okay. I'll split the commit.

@Licht-T Licht-T force-pushed the fix-wrong-sparseblock-initialization branch from 2d94fba to 1c0613b Compare September 16, 2017 14:06
@Licht-T
Copy link
Contributor Author

Licht-T commented Sep 16, 2017

@jreback The big commit is now split. If you have any questions, feel free to ask.

@Licht-T Licht-T force-pushed the fix-wrong-sparseblock-initialization branch from 1c0613b to d0bf226 Compare September 16, 2017 16:14
@Licht-T
Copy link
Contributor Author

Licht-T commented Sep 16, 2017

Rebased to change commit logs

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again you are doing too much in this PR. I require much simpler PR's that are built on top of each other. IOW pls break this apart. You can add tests for everything in a single PR (the first) and simply xfail tests that don't work. That will give you a basis to build on, and simplify understandbility of the reviews / PR / code. Adding paths via if/thens is not a good way forward here.

return isinstance(element, dtype)
else:
element_dtype = infer_dtype_from(element, pandas_dtype=True)[0]
return isinstance(element, dtype) or dtype == element_dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh?

@@ -995,7 +1001,7 @@ def coerce_to_target_dtype(self, other):

if self.is_bool or is_object_dtype(dtype) or is_bool_dtype(dtype):
# we don't upcast to bool
return self.astype(object)
return self.astype(object, copy=copy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is ok, though numpy ignores the copy= flag when dtype is object so no point in passing it if dtype is object


raise AssertionError("possible recursion in "
"coerce_to_target_dtype: {} {}".format(
self, other))

try:
return self.astype(dtype)
return self.astype(dtype, copy=copy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here i guess is ok

@@ -1382,6 +1388,11 @@ def where(self, other, cond, align=True, raise_on_error=True,
if hasattr(other, 'reindex_axis'):
other = other.values

if is_scalar(other) or is_list_like(other):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh?

@@ -1394,6 +1405,9 @@ def where(self, other, cond, align=True, raise_on_error=True,
if not hasattr(cond, 'shape'):
raise ValueError("where must have a condition that is ndarray "
"like")
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh? if you find yourself adding an if/then pretty much anywhere then you are doing it wrong. better to add the method to Sparse and call super; sometimes the super method may need a bit of refactor to make it more general.

@@ -1440,7 +1454,12 @@ def func(cond, values, other):
if try_cast:
result = self._try_cast_result(result)

return self.make_block(result)
if isinstance(result, np.ndarray):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again this is just completely confusing to do and makes the code way more complex. find a better way

@@ -1713,6 +1733,7 @@ class FloatBlock(FloatOrComplexBlock):
is_float = True
_downcast_dtype = 'int64'

@classmethod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is such a huge change, what is the purpose?

**kwargs)

dtype = self.values.sp_values.dtype

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not against this theorectially, but the implementation is fragile here. there are functions for validation already in the sparse classes.

self._can_hold_na = False

def _can_hold_element(self, element):
""" require the same dtype as ourselves """
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again this is so complex and adds so much techincal debt.

@@ -2769,9 +2858,15 @@ def sparse_reindex(self, new_index):
return self.make_block_same_class(values, sparse_index=new_index,
placement=self.mgr_locs)

def _try_coerce_result(self, result):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is the right idea, though you shoul dgenerally have a function that intercepts a ndarray and creates a SparseArray; it should be called for most sparse methods.

@jreback
Copy link
Contributor

jreback commented Nov 10, 2017

@Licht-T I believe you have all of the xfail tests in. pls rebase and update.

@Licht-T
Copy link
Contributor Author

Licht-T commented Nov 12, 2017

@jreback Yeah! Nothing stands in my way!

BUG: Fix wrong SparseBlock initialization in quantile method

BUG: Fix make_spase mask generation not to cast when dtype is object

BUG: Add SparseArray.all method

BUG: Add copy parameter to prevent reinterpret cast of sparse

Revert and fix astype parameters

BUG: Create SparseBlock.__init__ to set type information of SparseArray

BUG: Override SparseBlock._can_hold_element

Revert changes in Block.whare

BUG: Override SparseBlock.make_block with fill_value argument

BUG: Set fill_value and ndim parameter in make_block when generating SparseBlock from result

BUG: Override SparseBlock._try_coerce_result to make result flatten and sparse

BUG: Change form _can_hold_na to _can_hold_element for supporting non NA fill value

BUG: Fix 1D check statement
SparseDataFrame.where passes (1, n)-shape SparseBlock, but actual values is n-length SparseArray

BUG: Adjust cond shape to SparseBlock
SparseDataFrame.where passes (1, n)-shape SparseBlock and condition block to Block.where,
but it compares n-length SparseArray held by the SparseBlock and (1, n)-shape condition block.

BUG: Override SparseDataFrame.where method to set _default_fill_value
@Licht-T Licht-T force-pushed the fix-wrong-sparseblock-initialization branch from d0bf226 to 6b36d55 Compare November 12, 2017 09:34
@Licht-T
Copy link
Contributor Author

Licht-T commented Nov 12, 2017

Now rebased.

@jreback
Copy link
Contributor

jreback commented Nov 12, 2017

@Licht-T can you see if you can simplify this, there seem to be lots of if/then cases. I think you might be able to define / override some sparse methods in internals to avoid this.

@jreback
Copy link
Contributor

jreback commented Dec 28, 2017

@Licht-T can you update?

@jreback
Copy link
Contributor

jreback commented Jul 7, 2018

closing as stale, though we'd definitely take a fixed up / rebased version updated for comments

@jreback jreback closed this Jul 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AttributeError: 'IntBlock'/'FloatBlock'/etc. object has no attribute 'sp_index'
3 participants