Bug: rename incapable of accepting tuples as new name #19497

Closed
charlie0389 opened this issue Feb 1, 2018 · 26 comments · Fixed by #21029
Comments

@charlie0389
Contributor

Pandas is incapable of renaming a label in a pandas.Index of tuples to a new tuple value. Providing a tuple as new_name in pandas.DataFrame.rename({old_name: new_name}, axis="index") converts the index to a pandas.MultiIndex, and wrapping the new name in a singleton tuple gives an equally undesirable result. See the code below (work-around at the bottom):

import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.arange(5), index=[(x, x) for x in range(5)], columns=["Value"])
print(df)  # Note that df.index is a pd.Index object of 2-length tuples

# Wish to rename axis label, but keep the same style
df2 = df.rename({(1, 1): (1, 5)}, axis="index")

print(df2)  # Woah! - df2.index is of MultiIndex type
print(df2.index)  # ... and here's proof

# Maybe I can get around this by passing it as a singleton tuple...
df3 = df.rename({(1, 1): ((1, 5),)}, axis="index")
print(df3)  # ... apparently not

Will produce the output:

        Value
(0, 0)      0
(1, 1)      1
(2, 2)      2
(3, 3)      3
(4, 4)      4

     Value
0 0      0
1 5      1
2 2      2
3 3      3
4 4      4
MultiIndex(levels=[[0, 1, 2, 3, 4], [0, 2, 3, 4, 5]],
           labels=[[0, 1, 2, 3, 4], [0, 4, 1, 2, 3]])

           Value
(0, 0)         0
((1, 5),)      1
(2, 2)         2
(3, 3)         3
(4, 4)         4

Desired/Expected output:

        Value
(0, 0)      0
(1, 5)      1
(2, 2)      2
(3, 3)      3
(4, 4)      4

Problem description

The current behaviour is a problem for two reasons:

  1. It is unintuitive - I can't see why a user would expect renaming an index label to change the index's type.
  2. There is no way to rename labels in an Index of tuples.

I have checked for similar issues by searching for the word rename; at the time of writing, pandas 0.22.0 is the latest released version.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-112-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.0.3
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.25.1
numpy: 1.11.2
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 1.5.3
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.8.0
bs4: 4.5.1
html5lib: 1.0b10
sqlalchemy: 1.1.3
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Workaround

The workaround below uses the set_value function, which the documentation tells the user to avoid (unless you really know what you're doing):

# Mutate the underlying values array of the index in place, then rebuild the index from it.
df.index.set_value(df.index.get_values(), (1, 1), (1, 5))
df.reset_index(inplace=True)
df.set_index("index", inplace=True)
df.index.name = None  # Arguably not necessary...
print(df)

Produces the output:

        Value
(0, 0)      0
(1, 5)      1
(2, 2)      2
(3, 3)      3
(4, 4)      4
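
An alternative sketch that avoids set_value entirely (not from the original comment; it assumes the pandas 0.22-era pd.Index(..., tupleize_cols=False) signature) is to rebuild the index from a plain mapping lookup:

import pandas as pd
import numpy as np

df = pd.DataFrame(data=np.arange(5), index=[(x, x) for x in range(5)], columns=["Value"])

# Map the old tuple label to the new one and rebuild the index explicitly;
# tupleize_cols=False keeps the result a flat Index of tuples instead of a MultiIndex.
mapping = {(1, 1): (1, 5)}
df.index = pd.Index([mapping.get(label, label) for label in df.index],
                    tupleize_cols=False)
print(df)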
@TomAugspurger
Contributor

You're fighting against pandas by using tuples as keys in your Index, instead of using a MultiIndex.

cc @toobaz is this worth attempting to support?

@toobaz
Member

toobaz commented Feb 2, 2018

I agree with @TomAugspurger that tuples as keys are weird. That said, I might have fixed this somewhere... maybe #18600 ... in any case I guess it will be supported eventually.

@charlie0389
Contributor Author

Consider this circumstance:

  • You have segments of data which is indexed by a natural hierarchical relationship (i.e. each segment of data is suitable for a multi-index).
  • However, different segments of the data do not have the same hierarchical relationship (i.e. not the same levels, labels, or dimensions), so concatenating is not an option (or is at least messy and/or difficult to generalise).
  • It is necessary to merge/concatenate the data.
  • It is necessary to select any and all rows by index.
  • The hierarchical relationship has to be preserved in some form.

In this circumstance, it would seem to me that a simple Index with tuples is the most obvious and easy solution, and may even be the only option.

@toobaz
Member

toobaz commented Feb 5, 2018

I might have fixed this somewhere... maybe #18600 .

Uhm, no, that PR is unrelated. And I was probably just confused.

I still think this is going to be fixed... sooner or later.

@jreback
Contributor

jreback commented Mar 30, 2018

as @TomAugspurger says above, this is simply not supported and you are fighting pandas like crazy here. The only way I could see doing this going forward would be to have an actual TupleIndex (subclassing EA) that is pretty explicitly created here.

Closing this as won't fix.

@jreback jreback closed this as completed Mar 30, 2018
@jreback jreback added this to the won't fix milestone Mar 30, 2018
@TomAugspurger
Contributor

FWIW, I think when #17246 is fixed, this will happen to be fixed as well.

@toobaz
Member

toobaz commented Mar 30, 2018

I don't see this as "supporting tuples", but as "supporting anything which we don't state is not supported" (and can be supported). The bug must lie in a tupleize_cols somewhere - that is, the code is "actively" doing something wrong, it's not just "missing a feature".

This said, I totally agree this is low priority.

@charlie0389
Contributor Author

I figure from your statement @toobaz that you have not surveyed the fix that has been provided - indeed, the crux of the problem is that Index by default returns a MultiIndex if provided tuples, as above. This can be prevented by supplying the tupleize_cols=False argument. It follows that I don't think the bug does lie in 'tupleize_cols' - it is currently the default behaviour of Index to return a MultiIndex if given tuples (because tupleize_cols, by default, is True). One could argue that the default should be False, but I assume this approach would be avoided because it would have a large impact on the API. This surprising change of type is discussed in #17246, and will hopefully be included in the fix.

@jreback argues that the fix is inappropriate, and that using tuples is unsupported. If that is the case - assuming #17246 is not going to be fixed soon, or, even if it is fixed, it doesn't fix this bug - then I think it should be clearly documented that tuples are not supported. Not supporting tuples would, I think, be a little disappointing, simply because I can't see a more obvious way to support the circumstances I have outlined above.
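
For reference, a minimal illustration of the default "tupleizing" behaviour under discussion (a sketch assuming the pandas 0.22-era tupleize_cols keyword):

import pandas as pd

tuples = [(0, 0), (1, 1), (2, 2)]

# Default: a list of tuples is promoted to a MultiIndex.
print(type(pd.Index(tuples)))                       # MultiIndex
# With tupleize_cols=False the tuples stay as opaque labels in a flat Index.
print(type(pd.Index(tuples, tupleize_cols=False)))  # flat Index (object dtype)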

@charlie0389
Contributor Author

I think this thread might benefit from an example of why supporting tuples is a good idea. Consider, for example, the country Australia and the states within it: NSW, QLD, VIC, TAS, WA, SA, NT, ACT. Also consider the region "Murray Darling Basin", which also has a natural hierarchical relationship to Australia, but specifies an area within NSW, VIC and SA (it does not completely include those states - it specifies the water catchment area). With reference to my earlier comments about the circumstances in which tuples in the index are useful:

You have segments of data which is indexed by a natural hierarchical relationship (i.e. each segment of data is suitable for a multi-index).

There exists a natural hierarchical relationship between 'Australia' and these states - i.e. each state lies within Australia. There is also a relationship between 'Murray Darling Basin' and 'Australia'.

However, different segments of the data do not have the same hierarchical relationship (i.e. not the same levels, labels, or dimensions), so concatenating is not an option (or is at least messy and/or difficult to generalise).

Consider that you wish to include in your dataframe, data series with names:
('Australia', 'NSW')
('Australia', 'Murray Darling Basin')
It would be inappropriate to call 'Murray Darling Basin' a state, and the data that it refers to will have no obvious mathematical connection to the data regarding the other states.

It is necessary to merge/concatenate the data.

Because I should be able to.

It is necessary to select any and all rows by index.

If it's a multiindex, and there are None or * fields, my recollection is that this doesn't play nice (hence a 3-level multi-index may not be a simple workaround).

The hierarchical relationship has to be preserved in some form.

Because I want to export to a file that is interpreted by another program that understands the hierarchical relationship.

I hope it is now clear why an Index of tuples becomes the most obvious, if not the best, option for solving some problems.

@toobaz
Member

toobaz commented Mar 31, 2018

I figure from your statement @toobaz that you have not surveyed the fix that has been provided - indeed, the crux of the problem is that Index by default returns a MultiIndex if provided tuples, as above. This can be prevented by supplying the tupleize_cols=False argument. It follows that I don't think the bug does lie in 'tupleize_cols' - it is currently the default behaviour of Index to return a MultiIndex if given tuples (because tupleize_cols, by default, is True). One could argue that the default should be False,

... or that one can pass tupleize_cols=False even when the default is tupleize_cols=True ;-)

Then probably this bug is fixed also if the default changes (as mentioned by @TomAugspurger ), but that was not my point. When I wrote that the code is "actively" doing something wrong, I just meant that tupleize_cols=True means "infer", while tupleize_cols=False means "avoid inferring", regardless of which is the default.

@TomAugspurger
Contributor

This is hard, since it isn't really clear from the name that .rename can change the index type.

w.r.t. your example @charlie0389, it's hard to say anything without actual code / data. I suspect that a MI is able to handle your problem.

@charlie0389
Contributor Author

Ok, for an example, please consider the following code:

print("Consider the given data:")
given_data = [0.8, 0.002, 1.7, 1.3, 1.0, 2.5, 0.06, 0.2, 1.0]
print(given_data)
print()
print("With the given identifiers:")
given_labels = [("Australia", "NSW"), ("Australia", "ACT"), 
                                      ("Australia", "QLD"), 
                                     ("Australia", "NT"), ("Australia", "SA"), 
                                     ("Australia", "WA"), 
                                     ("Australia", "TAS"), ("Australia", "VIC"), 
                                     ("Australia", "Murray-Darling Basin")]
print(given_labels)
df = pd.DataFrame(data=given_data, 
                  index=given_labels,
             columns=["Millions of Sq. kms"])
print()
print("Which can be stored appropriately in the dataframe:")
print(df)
print("""
Because data in the same column is interpreted to be of the same \
type, this form implies that all the labels \
are conceptual equals (which is True - they all identify land areas in Australia). \
Furthermore, this \
allows the user to keep the hierarchical relationship \
between the first and second fields of each tuple (and is therefore the desired form).
""")


df.index = pd.Index(df.index.tolist())
print(df)
print("""This structure implies that all items in the second index column are conceptual \
equals (which is False). (The Murray-Darling basin is not a state of Australia).
""")

# Note that restructuring doesn't really make sense either - for example:
df = pd.DataFrame(data=[0.8, 0.002, 1.7, 1.3, 1.0, 2.5, 0.06, 0.2, 1.0],
                  index=[("Australia", None, "NSW"), ("Australia", None, "ACT"),
                         ("Australia", None, "QLD"),
                         ("Australia", None, "NT"), ("Australia", None, "SA"),
                         ("Australia", None, "WA"),
                         ("Australia", None, "TAS"), ("Australia", None, "VIC"),
                         ("Australia", "Murray-Darling Basin", None)],
                  columns=["Millions of Sq. kms"])
df.index = pd.MultiIndex.from_tuples(df.index)
print(df)
print("""
I'd argue this structure is unacceptable because it requires knowledge/logic to mutate \
given_labels and to select any (or all) rows of the table. For example:
""")

print("Selecting all items:")
print(df.loc["Australia", :, :, :])
print()
print("Selecting a single item:")
print(df.loc["Australia", "Murray-Darling Basin", :, :])
print("""Both the selections above require knowledge that there are 3 fields which: 
(a) does not correspond with the given data, and
(b) the selection method is prone to breakage (i.e. what if data that has more than 3 fields is \
appended to the frame?)""")

Which has the following output:

Consider the given data:
[0.8, 0.002, 1.7, 1.3, 1.0, 2.5, 0.06, 0.2, 1.0]

With the given identifiers:
[('Australia', 'NSW'), ('Australia', 'ACT'), ('Australia', 'QLD'), 
('Australia', 'NT'), ('Australia', 'SA'), ('Australia', 'WA'), ('Australia', 'TAS'), 
('Australia', 'VIC'), ('Australia', 'Murray-Darling Basin')]

Which can be stored appropriately in the dataframe:
                                   Millions of Sq. kms
(Australia, NSW)                                 0.800
(Australia, ACT)                                 0.002
(Australia, QLD)                                 1.700
(Australia, NT)                                  1.300
(Australia, SA)                                  1.000
(Australia, WA)                                  2.500
(Australia, TAS)                                 0.060
(Australia, VIC)                                 0.200
(Australia, Murray-Darling Basin)                1.000

Because data in the same column is interpreted to be of the same type, 
this form implies that all the labels are conceptual equals (which is True - 
they all identify land areas in Australia). Furthermore, this allows the 
user to keep the hierarchical relationship between the first and second 
fields of each tuple (and is therefore the desired form).

                                Millions of Sq. kms
Australia NSW                                 0.800
          ACT                                 0.002
          QLD                                 1.700
          NT                                  1.300
          SA                                  1.000
          WA                                  2.500
          TAS                                 0.060
          VIC                                 0.200
          Murray-Darling Basin                1.000
This structure implies that all items in the second index column are 
conceptual equals (which is False). (The Murray-Darling basin is 
not a state of Australia).

                                    Millions of Sq. kms
Australia NaN                  NSW                0.800
                               ACT                0.002
                               QLD                1.700
                               NT                 1.300
                               SA                 1.000
                               WA                 2.500
                               TAS                0.060
                               VIC                0.200
          Murray-Darling Basin NaN                1.000

I'd argue this structure is unacceptable because it requires knowledge/logic 
to mutate given_labels and to select any (or all) rows of the table. For example:

Selecting all items:
                          Millions of Sq. kms
NaN                  NSW                0.800
                     ACT                0.002
                     QLD                1.700
                     NT                 1.300
                     SA                 1.000
                     WA                 2.500
                     TAS                0.060
                     VIC                0.200
Murray-Darling Basin NaN                1.000

Selecting a single item:
     Millions of Sq. kms
NaN                  1.0
Both the selections above require knowledge that there are 3 fields which: 
(a) does not correspond with the given data, and
(b) the selection method is prone to breakage (i.e. what if data 
that has more than 3 fields is appended to the frame?)

Apologies for the wordiness, but I think it illustrates the conceptual point I'm trying to make.

@TomAugspurger
Contributor

TomAugspurger commented Apr 4, 2018

this form implies that all the labels are conceptual equals

This structure implies that all items in the second index column are conceptual equals (which is False).

I don't think it's relevant, but what do those two sentences mean?

The reason I say it's not relevant is that the meaning you attach to a MultiIndex is up to you. Typically they're used to represent hierarchical data, but that's not necessary. It really is just a multi-part label, just like a tuple.

Attempting to interpret the "conceptual equals" bit, it seems like you're implicitly putting data in the index. You have some kind of is_state property in your head. That property is a piece of data, not a label.

I don't understand the 3-level example. Again, though, it looks like you're putting some data in the index when it should go in the columns. Assuming the new level is something like is_water.

midx = pd.MultiIndex.from_tuples(given_labels)
df = pd.DataFrame({
    "sq. kms": given_data,
    "is_water": [False] * 8 + [True]
}, index=midx)
df

results in

                                 sq. kms  is_water
Australia NSW                      0.800     False
          ACT                      0.002     False
          QLD                      1.700     False
          NT                       1.300     False
          SA                       1.000     False
          WA                       2.500     False
          TAS                      0.060     False
          VIC                      0.200     False
          Murray-Darling Basin     1.000      True

Which (IIUC) is a much better way to represent the data.

@toobaz
Member

toobaz commented Apr 4, 2018

As much as I love indexing things with a MultiIndex, Python is a flexible language, people are used to that flexibility, and I think this makes it hard to argue that tuples as keys don't make sense. MultiIndexes are great if there is some hierarchical structure (i.e., levels have a meaning, i.e., "all items in the second index column are conceptual equals"), but this is not necessarily the case.

Consider keys which represent paths:

In [3]: megabytes = pd.Series([103, 30, 5],
                              index=pd.Index([('usr', 'share'), ('usr', 'bin'), ('usr', 'local', 'bin')], tupleize_cols=False))

In [4]: megabytes
Out[4]: 
(usr, share)         103
(usr, bin)            30
(usr, local, bin)      5
dtype: int64

This is not an index which it makes sense to store as a MultiIndex - you don't even know ex ante the number of levels it would need. Sure, we could transform tuples into strings relatively easily... but you will apply this transformation only if you have to.

So: we can always say pandas does not support tuples because it's just too messy (in terms of API, not necessarily just implementation). I just don't think it is the case. But I might be wrong. In any case, I don't think that investigating the intentions of anybody who wants to use tuples as keys (also see #20597) is a viable long term solution :-)
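
For illustration only (not code from the thread), here is a minimal sketch of the tuples-to-strings transformation mentioned above; it works, but it discards the structured form of the keys:

import pandas as pd

megabytes = pd.Series([103, 30, 5],
                      index=pd.Index([('usr', 'share'), ('usr', 'bin'), ('usr', 'local', 'bin')],
                                     tupleize_cols=False))

# Flatten each tuple key into a '/'-joined path string.
megabytes.index = pd.Index(["/".join(t) for t in megabytes.index])
print(megabytes)
# usr/share         103
# usr/bin            30
# usr/local/bin       5
# dtype: int64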

@charlie0389
Contributor Author

Is it possible to leave this bug open please?

Going from the discussion so far, no one is disputing that this is a bug. The only question is whether it should be supported, and even then there seems to be little disagreement that it should be supported in one form or another - all the discussion appears to revolve around implementation.

@charlie0389
Contributor Author

charlie0389 commented Apr 17, 2018

For those that stumble upon this at a later date and are similarly frustrated by this bug, the following code fixes it:

    @staticmethod
    def _transform_index(index, func, level=None, tupleize_cols=False):
        """
        Apply function to all values found in index.

        This includes transforming multiindex entries separately.
        Only apply function to one level of the MultiIndex if level is specified.
        """
        # Copied from pandas.core.internals._transform_index() with minor modification 
        # in response to pandas bug #19497
        if isinstance(index, pd.MultiIndex):
            if level is not None:
                items = [tuple(func(y) if i == level else y
                               for i, y in enumerate(x)) for x in index]
            else:
                items = [tuple(func(y) for y in x) for x in index]
            return pd.MultiIndex.from_tuples(items, names=index.names)
        else:
            items = [func(x) for x in index]
            return pd.Index(items, name=index.name, tupleize_cols=tupleize_cols)

The only differences are the function signature, and the last return line.
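
A sketch of how the patched helper behaves on the original example (assuming it is exposed as a plain function, i.e. without the @staticmethod decorator):

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(5), index=[(x, x) for x in range(5)], columns=["Value"])
mapping = {(1, 1): (1, 5)}

# With tupleize_cols defaulting to False, the renamed index stays a flat
# Index of tuples instead of being promoted to a MultiIndex.
df.index = _transform_index(df.index, lambda x: mapping.get(x, x))
print(df.index)  # Index([(0, 0), (1, 5), (2, 2), (3, 3), (4, 4)], dtype='object')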

@toobaz
Member

toobaz commented Apr 18, 2018

@charlie0389 if you can open a PR where you

  • change the current _transform_index to pass tupleize_cols=False
  • add a test (e.g. your initial example)
  • verify that it passes all current tests (the part that most worries me)

... then I think it would be a good candidate for inclusion.

If in order to pass tests you do need to change the signature, I would suggest, rather than tupleize_cols, a more general parameter such as keep_type=True (@TomAugspurger @jreback better ideas?) which, when set to False, re-interprets the index content (so potentially changing not just an Index to a MultiIndex, but also the other way round if e.g. keys in a MultiIndex are replaced with non-tuples). You might then want to split the process of creating the items list from the actual creation of the index.
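
A purely hypothetical sketch of that keep_type idea (not actual pandas API; only the flat-index side of the re-interpretation is shown):

import pandas as pd

def _transform_index(index, func, level=None, keep_type=True):
    # Build the new labels first, then decide how to materialize the index.
    if isinstance(index, pd.MultiIndex):
        if level is not None:
            items = [tuple(func(y) if i == level else y
                           for i, y in enumerate(x)) for x in index]
        else:
            items = [tuple(func(y) for y in x) for x in index]
        return pd.MultiIndex.from_tuples(items, names=index.names)
    items = [func(x) for x in index]
    if keep_type:
        # Keep a flat index flat, even if the new labels happen to be tuples.
        return pd.Index(items, name=index.name, tupleize_cols=False)
    # keep_type=False: let pandas re-infer the index type from the new labels
    # (e.g. tuples would again be promoted to a MultiIndex).
    return pd.Index(items, name=index.name)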

Reopening this at least temporarily as a fix seems feasible and simple.

@toobaz toobaz reopened this Apr 18, 2018
@TomAugspurger
Contributor

TomAugspurger commented Apr 18, 2018 via email

Would we be ok with a rule that rename doesn't chase the type of the index between MI and other?

@toobaz
Member

toobaz commented Apr 18, 2018

Would we be ok with a rule that rename doesn’t chase the type of the index between MI and other?

Yes, that was my idea (with keep_type=True, that is, tupleize_cols=False), and that's how I would design things from scratch. I simply don't know whether it breaks code relying on "chasing the type".

@TomAugspurger
Contributor

@toobaz do you think we need a new parameter (keep_type=True) to .rename? I'm trying to think of situations where keep_type=False would be useful.

@toobaz
Member

toobaz commented Apr 18, 2018

do you think we need a new parameter (keep_type=True) to .rename?

I don't think we do in principle: my only concern was about backwards compatibility (and if we don't, then the fix to this is really a matter of passing tupleize_cols=False).

I'm trying to think of situations where keep_type=False would be useful.

For all examples I can think of, explicitly recasting is a better solution.

@TomAugspurger
Contributor

What's the backwards compatibility concern?

I misunderstood rename with a MI. I assumed the mapping got tuples; instead, it gets the scalar elements.

In [22]: s = pd.Series(1, index=pd.MultiIndex.from_product([["A", "B"], ['a', 'b']]))

In [23]: s
Out[23]:
A  a    1
   b    1
B  a    1
   b    1
dtype: int64

In [24]: s.rename({"A": 'a'})
Out[24]:
a  a    1
   b    1
B  a    1
   b    1
dtype: int64

In that case, I think that passing tupleize_cols=False internally is just fine.

@toobaz
Member

toobaz commented Apr 18, 2018

What's the backwards compatibility concern?

Just that somebody (I would already be happy if it doesn't happen in some tests) assumed the following is a reasonable way to create a MultiIndex:

In [2]: pd.Series(range(3)).rename({0 : (0,1), 1 : (1, 2), 2 : (2, 3)})
Out[2]: 
0  1    0
1  2    1
2  3    2
dtype: int64

@toobaz
Member

toobaz commented Apr 18, 2018

(but mine might be pure paranoia: if tests pass, I would proceed)

@toobaz
Member

toobaz commented Apr 18, 2018

For completeness: in principle code out there could also be relying on the fact that

In [2]: pd.Series(range(3), index=['1', '2', '3']).rename({'1' : 1, '2' : 2, '3' : 3.}).index
Out[2]: Float64Index([1.0, 2.0, 3.0], dtype='float64')

although implementation-wise this can be decoupled from the issue of multi vs. flat, documentation-wise we probably just want to say "the resulting index will have the same type", and disable this automatic conversion.

@TomAugspurger
Contributor

assumed the following is a reasonable way to create a MultiIndex

Understood. That is a valid concern...

although implementation wise this can be decoupled from the issue of multi vs. flat

I think converting between types (numeric vs. Index, etc.) is fine. It's the conversion between multi vs. flat that we (maybe) want to disallow via .rename.

@jreback jreback modified the milestones: won't fix, 0.23.0 May 14, 2018
PMeira added a commit to nilmtk/nilmtk that referenced this issue Jun 25, 2018
…as-dev/pandas#19497). Still compatible with 0.22 by using `pd.MultiIndex` directly.
BaluJr pushed a commit to BaluJr/energytk that referenced this issue Oct 18, 2018
…as-dev/pandas#19497). Still compatible with 0.22 by using `pd.MultiIndex` directly.