Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Prevent 3D-ndarray for nested tuple labels (#24687) #24732

Closed
wants to merge 6 commits into from

Conversation

summonholmes
Copy link

Being a very rare issue encountered with nested tuples as column and index labels, here is the fix I've managed to come up with - for the time being. While clean_index_list() in pandas/_libs/lib.pyx is responsible for returning an invalid result (a 3D ndarray where the inner dimensions should be nested tuples), debugging Cython is very challenging for me. And yes, use of tuples in this way is extremely uncommon. So far, the code has run successfully on two distance matrices.
fix

@gfyoung gfyoung added the Visualization plotting label Jan 11, 2019
@gfyoung
Copy link
Member

gfyoung commented Jan 11, 2019

@summonholmes : Thanks for contribution! Given the limited use-case as you point, I wonder what the trade-off might be between degree of use vs. work to maintain.

Also, we're going to need at least one test and a whatsnew entry for this contribution (probably for 0.25.0 though, if it exists, otherwise, hold on this part).

@codecov
Copy link

codecov bot commented Jan 12, 2019

Codecov Report

Merging #24732 into master will decrease coverage by 49.31%.
The diff coverage is 51.11%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #24732       +/-   ##
===========================================
- Coverage   92.39%   43.07%   -49.32%     
===========================================
  Files         166      166               
  Lines       52358    52362        +4     
===========================================
- Hits        48374    22555    -25819     
- Misses       3984    29807    +25823
Flag Coverage Δ
#multiple ?
#single 43.07% <51.11%> (-0.03%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/base.py 56.47% <51.11%> (-39.83%) ⬇️
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/core/categorical.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.35%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.15%) ⬇️
... and 124 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fdc4db2...cde96f8. Read the comment docs.

@codecov
Copy link

codecov bot commented Jan 12, 2019

Codecov Report

Merging #24732 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24732      +/-   ##
==========================================
- Coverage   92.38%   92.38%   -0.01%     
==========================================
  Files         166      166              
  Lines       52358    52363       +5     
==========================================
+ Hits        48373    48377       +4     
- Misses       3985     3986       +1
Flag Coverage Δ
#multiple 90.81% <0%> (-0.01%) ⬇️
#single 42.91% <0%> (-0.17%) ⬇️
Impacted Files Coverage Δ
pandas/io/formats/console.py 72.46% <0%> (-1.78%) ⬇️
pandas/tseries/offsets.py 96.69% <0%> (ø) ⬆️
pandas/core/indexing.py 93.87% <0%> (ø) ⬆️
pandas/core/dtypes/dtypes.py 95.6% <0%> (+0.02%) ⬆️
pandas/util/testing.py 88.09% <0%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 33f91d8...cec3d57. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really sure what this PR is trying to do

@summonholmes
Copy link
Author

not really sure what this PR is trying to do

I admit that this specific use case for nested tuple index labels is very esoteric, but in the function ensure_index, Cython function clean_index_list very rarely returns an ndarray of shape (1, 2, 2), which will result in a dimensional error. This PR is just a start, to convert the 3D ndarray back into the correct 2D ndarray.

@summonholmes : Thanks for contribution! Given the limited use-case as you point, I wonder what the trade-off might be between degree of use vs. work to maintain.

Also, we're going to need at least one test and a whatsnew entry for this contribution (probably for 0.25.0 though, if it exists, otherwise, hold on this part).

Yes, there is trade off, and moving a check such as this somewhere else might be better. Eventually, these efforts will have to make their way back to the Cython file, lib.pyx.

@jreback
Copy link
Contributor

jreback commented Jan 12, 2019

@summonholmes your are missing my point
pls show a reproducible example of what is the problem; what you are doing is likely an error

@summonholmes
Copy link
Author

summonholmes commented Jan 12, 2019

@summonholmes your are missing my point
pls show a reproducible example of what is the problem; what you are doing is likely an error

It's in the bug report that I created, but I'll post a more concise demo here:

from pandas._libs import lib
from pandas import DataFrame
from seaborn import light_palette

# Using the nested tuple cluster
broken_cluster = {
    (("Turtle", "Chicken"), (("Man", "Monkey"), "Dog")): (0, 28.375, 31.875),
    "Tuna": (28.375, 0, 41),
    "Moth": (31.875, 41, 0)
}
broken_cluster = DataFrame(broken_cluster, index=broken_cluster.keys())
broken_cluster.style.background_gradient(
    cmap=light_palette("indigo", as_cmap=True))

# Without using the nested tuple cluster
working_cluster = {
    "S": (0, 28.375, 31.875),
    "Tuna": (28.375, 0, 41),
    "Moth": (31.875, 41, 0)
}
working_cluster = DataFrame(working_cluster, index=working_cluster.keys())
working_cluster.style.background_gradient(
    cmap=light_palette("indigo", as_cmap=True))
# Highlight mins

# The culprit:
lib.clean_index_list([(('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog'))])[0]

Output of lib.clean_index_list([(('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog'))])[0]:

array([[['Turtle', 'Chicken'],
        [('Man', 'Monkey'), 'Dog']]], dtype=object)

I welcome any feedback, and any indication of what I might be doing wrong. I used tuples in this program for the sake of Pythonic automation, nothing else. You might also wish to see #24688

@jreback
Copy link
Contributor

jreback commented Jan 13, 2019

In [4]: broken_cluster.index                                                                                                                                                                                                                                            
Out[4]: Index([(('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')), 'Tuna', 'Moth'], dtype='object')

this has NO support in pandas whatsoever. If you can raise an error in a performant way great would take it

@TomAugspurger
Copy link
Contributor

It's in the bug report that I created, but I'll post a more concise demo here:

What's the issue number?

It's not really clear to me what the changes here are doing. e.g. why check specifically against a shape of (1, 2, 2), not not something more general? This feels like you're trying to use pandas for something it isn't well-suited for.

@summonholmes
Copy link
Author

It's in the bug report that I created, but I'll post a more concise demo here:

What's the issue number?

It's not really clear to me what the changes here are doing. e.g. why check specifically against a shape of (1, 2, 2), not not something more general? This feels like you're trying to use pandas for something it isn't well-suited for.

The original issue I was addressing with this PR was #24687, and after further testing I've determined that nested tuples can generate even more anomalous shapes than (1, 2, 2). The fix on my end is simply converting to string and recording the actual tuples somewhere else. Yes, I would never use pandas like this on a normal day. Therefore, I believe this PRs intention will change to prevent this use of pandas in the first place.

In [4]: broken_cluster.index                                                                                                                                                                                                                                            
Out[4]: Index([(('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')), 'Tuna', 'Moth'], dtype='object')

this has NO support in pandas whatsoever. If you can raise an error in a performant way great would take it

Please correct me if I'm wrong, I'd be happy to work on a fix. You're saying that an error should be raised, as soon as a tuple is assigned to a column or index label? I'm assuming that the approach required is closely related to #24688 and #24702.

@TomAugspurger
Copy link
Contributor

Thanks. I don't think the resolution of #24688 and #24702 is clear yet.

@jreback
Copy link
Contributor

jreback commented Jan 16, 2019

looks like this was completly reverted, closing.

@jreback jreback closed this Jan 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants