BUG: Prevent 3D-ndarray for nested tuple labels (#24687) #24732

summonholmes · 2019-01-11T20:44:11Z

This is a very rare issue that arises when nested tuples are used as column and index labels; here is the fix I've come up with for the time being. clean_index_list() in pandas/_libs/lib.pyx is responsible for returning the invalid result (a 3D ndarray where the inner dimensions should be nested tuples), but debugging Cython is very challenging for me. And yes, using tuples in this way is extremely uncommon. So far, the code has run successfully on two distance matrices.

gfyoung · 2019-01-11T21:36:43Z

@summonholmes : Thanks for the contribution! Given the limited use case, as you point out, I wonder what the trade-off might be between degree of use and the work to maintain it.

Also, we're going to need at least one test and a whatsnew entry for this contribution (probably for 0.25.0, if it exists; otherwise, hold off on this part).

codecov · 2019-01-12T06:27:07Z

Codecov Report

Merging #24732 into master will decrease coverage by 49.31%.
The diff coverage is 51.11%.

@@             Coverage Diff             @@
##           master   #24732       +/-   ##
===========================================
- Coverage   92.39%   43.07%   -49.32%     
===========================================
  Files         166      166               
  Lines       52358    52362        +4     
===========================================
- Hits        48374    22555    -25819     
- Misses       3984    29807    +25823

Flag	Coverage Δ
#multiple	`?`
#single	`43.07% <51.11%> (-0.03%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/base.py	`56.47% <51.11%> (-39.83%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/core/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.35%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-95.46%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.17%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.15%)`	⬇️
... and 124 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fdc4db2...cde96f8. Read the comment docs.

codecov · 2019-01-12T06:27:08Z

Codecov Report

Merging #24732 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #24732      +/-   ##
==========================================
- Coverage   92.38%   92.38%   -0.01%     
==========================================
  Files         166      166              
  Lines       52358    52363       +5     
==========================================
+ Hits        48373    48377       +4     
- Misses       3985     3986       +1

Flag	Coverage Δ
#multiple	`90.81% <0%> (-0.01%)`	⬇️
#single	`42.91% <0%> (-0.17%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/formats/console.py	`72.46% <0%> (-1.78%)`	⬇️
pandas/tseries/offsets.py	`96.69% <0%> (ø)`	⬆️
pandas/core/indexing.py	`93.87% <0%> (ø)`	⬆️
pandas/core/dtypes/dtypes.py	`95.6% <0%> (+0.02%)`	⬆️
pandas/util/testing.py	`88.09% <0%> (+0.09%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 33f91d8...cec3d57. Read the comment docs.

jreback

not really sure what this PR is trying to do

summonholmes · 2019-01-12T16:22:17Z

> not really sure what this PR is trying to do

I admit that this specific use case for nested tuple index labels is very esoteric, but within ensure_index, the Cython function clean_index_list very occasionally returns an ndarray of shape (1, 2, 2), which then causes a dimension error. This PR is just a start: it converts the 3D ndarray back into the correct 2D ndarray.

> @summonholmes : Thanks for the contribution! Given the limited use case, as you point out, I wonder what the trade-off might be between degree of use and the work to maintain it.
>
> Also, we're going to need at least one test and a whatsnew entry for this contribution (probably for 0.25.0, if it exists; otherwise, hold off on this part).

Yes, there is a trade-off, and moving a check like this somewhere else might be better. Eventually, these efforts will have to make their way back to the Cython file, lib.pyx.
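
To make the shape (1, 2, 2) mentioned above concrete, here is a minimal pure-Python sketch of how a nested-sequence-to-array conversion can arrive at it. `inferred_shape` is a hypothetical helper written only for illustration. It is a simplified model of dimension discovery, not the code in clean_index_list or in NumPy, and it assumes that only lists and tuples count as sequences (strings are treated as scalars).

```python
def inferred_shape(obj):
    """Model the array shape a naive conversion would infer from nested sequences.

    Hypothetical helper: a simplified illustration, not pandas or NumPy code.
    """
    if not isinstance(obj, (list, tuple)):
        return ()  # scalars (including strings) contribute no dimension
    if not obj:
        return (0,)
    child_shapes = [inferred_shape(item) for item in obj]
    # Keep only the leading dimensions on which every child agrees.
    common = child_shapes[0]
    for shape in child_shapes[1:]:
        agreed = 0
        for a, b in zip(common, shape):
            if a != b:
                break
            agreed += 1
        common = common[:agreed]
    return (len(obj),) + common


label = (("Turtle", "Chicken"), (("Man", "Monkey"), "Dog"))

print(inferred_shape([label]))                  # (1, 2, 2)
print(inferred_shape([label, "Tuna", "Moth"]))  # (3,)
```

Under this model, a list holding only the nested tuple label is read as a 1 x 2 x 2 block, while the same label next to plain string labels stays one-dimensional. That may be why the problem shows up so rarely.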

jreback · 2019-01-12T16:57:12Z

@summonholmes you are missing my point
pls show a reproducible example of what the problem is; what you are doing is likely an error

summonholmes · 2019-01-12T17:08:42Z

> @summonholmes you are missing my point
> pls show a reproducible example of what the problem is; what you are doing is likely an error

It's in the bug report that I created, but I'll post a more concise demo here:

from pandas._libs import lib
from pandas import DataFrame
from seaborn import light_palette

# Using the nested tuple cluster
broken_cluster = {
    (("Turtle", "Chicken"), (("Man", "Monkey"), "Dog")): (0, 28.375, 31.875),
    "Tuna": (28.375, 0, 41),
    "Moth": (31.875, 41, 0)
}
broken_cluster = DataFrame(broken_cluster, index=broken_cluster.keys())
broken_cluster.style.background_gradient(
    cmap=light_palette("indigo", as_cmap=True))

# Without using the nested tuple cluster
working_cluster = {
    "S": (0, 28.375, 31.875),
    "Tuna": (28.375, 0, 41),
    "Moth": (31.875, 41, 0)
}
working_cluster = DataFrame(working_cluster, index=working_cluster.keys())
working_cluster.style.background_gradient(
    cmap=light_palette("indigo", as_cmap=True))
# Highlight mins

# The culprit:
lib.clean_index_list([(('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog'))])[0]

Output of lib.clean_index_list([(('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog'))])[0]:

array([[['Turtle', 'Chicken'],
        [('Man', 'Monkey'), 'Dog']]], dtype=object)

I welcome any feedback, and any indication of what I might be doing wrong. I used tuples in this program for the sake of Pythonic automation, nothing else. You might also wish to see #24688

jreback · 2019-01-13T22:08:01Z

In [4]: broken_cluster.index
Out[4]: Index([(('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')), 'Tuna', 'Moth'], dtype='object')

This has NO support in pandas whatsoever. If you can raise an error in a performant way, great, we would take it.
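
As a rough illustration of what raising an error for such labels could look like, here is a minimal sketch. `check_no_nested_tuples` is a hypothetical helper, not an existing pandas function, and the exception type and message are assumptions. Flat tuples are deliberately left alone, on the assumption that they remain valid labels.

```python
def check_no_nested_tuples(labels):
    """Raise ValueError if any label is a tuple that itself contains a tuple.

    Hypothetical helper for illustration only; not part of pandas.
    """
    for label in labels:
        if isinstance(label, tuple) and any(
            isinstance(part, tuple) for part in label
        ):
            raise ValueError(
                f"nested tuple labels are not supported: {label!r}"
            )


# Plain strings and flat tuples pass silently.
check_no_nested_tuples(["Tuna", "Moth", ("a", "b")])

# The label from the demo is rejected.
try:
    check_no_nested_tuples(
        [(("Turtle", "Chicken"), (("Man", "Monkey"), "Dog")), "Tuna", "Moth"]
    )
except ValueError as err:
    print(err)
```

The check is a single pass that looks one level into each tuple label. Whether that is cheap enough for index construction in pandas is not assessed here.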

TomAugspurger · 2019-01-14T13:48:53Z

> It's in the bug report that I created, but I'll post a more concise demo here:

What's the issue number?

It's not really clear to me what the changes here are doing, e.g. why check specifically against a shape of (1, 2, 2) and not something more general? This feels like you're trying to use pandas for something it isn't well-suited for.

summonholmes · 2019-01-14T14:55:11Z

> It's in the bug report that I created, but I'll post a more concise demo here:
>
> What's the issue number?
>
> It's not really clear to me what the changes here are doing, e.g. why check specifically against a shape of (1, 2, 2) and not something more general? This feels like you're trying to use pandas for something it isn't well-suited for.

The original issue I was addressing with this PR was #24687. After further testing, I've determined that nested tuples can generate even more anomalous shapes than (1, 2, 2). The fix on my end is simply to convert the labels to strings and record the actual tuples somewhere else. Yes, I would never use pandas like this on a normal day. Therefore, I believe this PR's intention will change to preventing this use of pandas in the first place.

> In [4]: broken_cluster.index
> Out[4]: Index([(('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog')), 'Tuna', 'Moth'], dtype='object')
>
> this has NO support in pandas whatsoever. If you can raise an error in a performant way great would take it

Please correct me if I'm wrong; I'd be happy to work on a fix. You're saying that an error should be raised as soon as a tuple is assigned to a column or index label? I'm assuming that the required approach is closely related to #24688 and #24702.
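
The workaround described above (convert the labels to strings and record the actual tuples somewhere else) might look like the following sketch. `stringify_labels` and the `lookup` dictionary are hypothetical names, and using `repr` for the conversion is an assumption; the comment does not say how the conversion was done.

```python
def stringify_labels(data):
    """Return (data re-keyed by string labels, mapping back to the originals).

    Hypothetical helper for illustration; repr() is an assumed conversion.
    """
    flat = {}
    lookup = {}
    for label, values in data.items():
        key = label if isinstance(label, str) else repr(label)
        if key in lookup:
            raise ValueError(f"duplicate label after conversion: {key}")
        lookup[key] = label  # remember the real (possibly nested) label
        flat[key] = values
    return flat, lookup


cluster = {
    (("Turtle", "Chicken"), (("Man", "Monkey"), "Dog")): (0, 28.375, 31.875),
    "Tuna": (28.375, 0, 41),
    "Moth": (31.875, 41, 0),
}
flat, lookup = stringify_labels(cluster)

print(list(flat))
# ["(('Turtle', 'Chicken'), (('Man', 'Monkey'), 'Dog'))", 'Tuna', 'Moth']
```

The string-keyed `flat` dictionary can then be handed to DataFrame in the same way as in the demo, with `lookup` used to recover the nested tuple when needed.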

TomAugspurger · 2019-01-14T14:57:39Z

Thanks. I don't think the resolution of #24688 and #24702 is clear yet.

jreback · 2019-01-16T02:13:01Z

looks like this was completely reverted, closing.

gfyoung added the Visualization plotting label Jan 11, 2019

gfyoung requested a review from TomAugspurger January 11, 2019 21:35

Prevent 3D-ndarray for nested tuple labels (pandas-dev#24687)

da274d9

summonholmes force-pushed the master branch from f8fa18d to da274d9 Compare January 12, 2019 03:23

summonholmes added 3 commits January 11, 2019 23:33

Isolate the exact array shape when issue occurs

cde96f8

Prevent lists from triggering error

9360eb7

Revert "Prevent lists from triggering error"

5143c18

This reverts commit 9360eb7.

summonholmes force-pushed the master branch from 9360eb7 to cde96f8 Compare January 12, 2019 06:27

edge case for ndarray only

ef91ba2

jreback requested changes Jan 12, 2019

summonholmes force-pushed the master branch from ee93060 to ef91ba2 Compare January 12, 2019 16:11

Revert fork to original state

cec3d57

jreback closed this Jan 16, 2019
