Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pivot table drops column/index names=nan when dropna=false #14246

Closed
wants to merge 4 commits into from

Conversation

OXPHOS
Copy link
Contributor

@OXPHOS OXPHOS commented Sep 18, 2016

@OXPHOS
Copy link
Contributor Author

OXPHOS commented Sep 18, 2016

Input:

import pandas as pd
M = pd.DataFrame([
[1, 'a', 'A'], 
[1, 'b', 'B'], 
[1, 'c', None]], 
columns=['x', 'y', 'z'])
P = M.pivot_table(values='x', index='y', columns='z', aggfunc='sum',
                             fill_value=0, margins=True, dropna=False)
print P

Output was:

z A B All
y
a 1.0 0.0 1.0
b 0.0 1.0 1.0
All 1.0 1.0 3.0

Output after fix:

z NaN A B All
y
a 0.0 0.0 1.0 1.0
b 0.0 1.0 0.0 1.0
c 1.0 0.0 0.0 1.0
All 1.0 1.0 1.0 3.0

I think the rearrangement of columns is due to the hash function problem similar as the one in #12679 and needs extra work to solve.

@@ -230,7 +230,7 @@ class Categorical(PandasObject):
_typ = 'categorical'

def __init__(self, values, categories=None, ordered=False,
name=None, fastpath=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a good idea to do this

Copy link
Contributor Author

@OXPHOS OXPHOS Sep 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback So the key of the issue is to pass a dropna to core/algorithms.factorize(), and factorize() is called directly in categorical.__init__(). Without the dropna here, we lose the dropna value from m = MultiIndex.from_arrays(cartesian_product(table.columns.levels), names=table.columns.names) and here, and generate the wrong result:

z NaN A B All
y
a 0 1 1
b 0 0 1 1
c 0 0 0 1
All NaN 1 1 3

While the correct result should be:

z NaN A B All
y
a 0 1 1
b 0 0 1 1
c 1 0 0 1
All 1 1 1 3

@@ -4323,7 +4323,8 @@ def infer(x):
# ----------------------------------------------------------------------
# Merging / joining methods

def append(self, other, ignore_index=False, verify_integrity=False):
def append(self, other, ignore_index=False, verify_integrity=False,
dropna=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again not a good idea

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll see what I can do about this one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I was wrong about it. The removal of the dropna fails test pandas.tools.tests.test_pivot.test_margin_dropna. When margins are appended, the missing of dropna leads to the drop of NaN level. This could also be solved by specifying the dropna parameter in MultiIndex class constructor.

@@ -1392,7 +1392,7 @@ def __getitem__(self, key):
else:
return result

def append(self, other):
def append(self, other, dropna=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Contributor Author

@OXPHOS OXPHOS Sep 28, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that I need to call the derived append() in Multi.py and the dropna in Multi.append() cannot be omitted. The possible way to remove this dropna as well as the one here is: have an parameter dropna in MultiIndex.__init__(). I am not sure how you like this idea. I would be appreciated if you could tell me your thoughts on it. Thanks!


@classmethod
def from_tuples(cls, tuples, sortorder=None, names=None):
def from_tuples(cls, tuples, sortorder=None, names=None, dropna=True):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this is the same as here

@@ -1269,7 +1269,7 @@ def _get_join_keys(llab, rlab, shape, sort):

def concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False,
copy=True):
copy=True, dropna=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

import numpy as np


a = np.array(['foo', 'foo', 'foo', 'bar',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls put tests with current tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this should not be here.

@OXPHOS
Copy link
Contributor Author

OXPHOS commented Sep 18, 2016

@jreback I realized that this issue has a lot overlap with Issue #3729 and PR #12607.
The problem is that dropna is not passed at all from pivot_table and the algorithm.factorize just set the dropna=True.
I don't see other options than passing the param on and adding a param to classes seems to be the easiest solution.
I did the nosetests at local and it is failing on 2 out of 10000+ with weird type errors. I think these can be fixed.

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 19, 2016
@codecov-io
Copy link

codecov-io commented Sep 22, 2016

Current coverage is 84.77% (diff: 100%)

Merging #14246 into master will increase coverage by <.01%

@@             master     #14246   diff @@
==========================================
  Files           145        145          
  Lines         51129      51154    +25   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43343      43368    +25   
  Misses         7786       7786          
  Partials          0          0          

Powered by Codecov. Last update 74e20a0...c27d3d3

@OXPHOS
Copy link
Contributor Author

OXPHOS commented Sep 22, 2016

@jreback All tests pass now. I can move dropna out of __init__ if you think the modification is too aggressive, but it's gonna be quite verbose. [EDIT: I was thinking wrong. I need to think about how to restrain the use of dropna in groupby.py.]
Also the tests of pivot_table are a little bit messy..I will clean them up and add the test for dropna if you think we could continue working on this PR.

@OXPHOS
Copy link
Contributor Author

OXPHOS commented Sep 22, 2016

Sorry I was thinking wrong. Will figure out how to limit the use of dropna in groupby.py

@OXPHOS OXPHOS force-pushed the pivot_table_dropna branch 2 times, most recently from 73c42bc to 2e679ea Compare September 28, 2016 21:46
@jreback
Copy link
Contributor

jreback commented Nov 25, 2016

can you rebase

@OXPHOS
Copy link
Contributor Author

OXPHOS commented Dec 3, 2016

@jreback It seems that the appveyor exceeded time limit.

Also I did a pep8radius master --diff and it suggested changes to 13 files I haven't touched. Not sure whether I should update those as well.

@jreback
Copy link
Contributor

jreback commented Dec 30, 2016

can you rebase and we can see where this is

@OXPHOS
Copy link
Contributor Author

OXPHOS commented Jan 3, 2017

@jreback I made some major changes and now the changes are restrained to pivot.py. Also some tests with dropna=False were not accurate.

keys = index + columns

if not dropna:
key_data = np.array(data[keys], dtype='object')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what the heck is this?

Copy link
Contributor Author

@OXPHOS OXPHOS Jan 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback converting NaN values in keys to special strings to avoid the passing of dropna around.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is not acceptable, we use masks if needed. converting things like that will just lead to future reader confusion and bugs.

@OXPHOS
Copy link
Contributor Author

OXPHOS commented Jan 23, 2017

Update: There is a conflict between the merge recently (#14944) and the fix I am trying to do. I need the check_null here to be dropna. I am still trying to solve the conflict.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

@OXPHOS can you rebase / update

@OXPHOS OXPHOS closed this Apr 26, 2017
@OXPHOS
Copy link
Contributor Author

OXPHOS commented Apr 26, 2017

I started over and now I'm half way through. I'll open a new pr once I cleaned up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pivot_table margins bottom-left total does not correspond to other content when dropna=False
3 participants