IndexError: tuple index out of range after upgrade to 0.25 #27775

Foadsf · 2019-08-06T10:56:09Z

Root cause (in both cases using df = pd.DataFrame({'a': [1, 2, 3]})):

In [71]: pd.__version__  
Out[71]: '0.25.0'

In [73]: df.index[:, None]
Out[73]: Int64Index([0, 1, 2], dtype='int64')

In [74]: df.index[:, None].shape
Out[74]: (3,)

vs

In [10]: pd.__version__  
Out[10]: '0.24.2'

In [13]: df.index[:, None] 
Out[13]: Int64Index([0, 1, 2], dtype='int64')

In [14]: df.index[:, None].shape
Out[14]: (3, 1)

So before, indexing with [:, None] (in numpy, a way to add a dimension and get a 2D array) actually resulted in an Index with ndim of 2 (which is of course an inconsistent state for an Index object).

Matplotlib relied on this fact when an Index is passed to plt.plot, as reported in matplotlib/matplotlib#14992
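
To illustrate where the error message comes from, here is a minimal sketch using plain numpy and a plain tuple. It does not reproduce the actual matplotlib code path; it only assumes that some consumer reads the second entry of a shape that it expects to be 2D.

```python
import numpy as np

# In numpy, [:, None] appends a new axis of length 1.
values = np.array([0, 1, 2])
column = values[:, None]
print(column.shape)  # (3, 1)

# If the object reports a 1D shape instead, reading the second
# dimension fails with the error from the issue title.
one_d_shape = (3,)
try:
    one_d_shape[1]
except IndexError as exc:
    message = str(exc)

print(message)  # tuple index out of range
```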

I have explained the issue here and here in detail. Basically, after upgrading to version 0.25 I got the error:

IndexError: tuple index out of range

while attempting to plot a CSV file.

jreback · 2019-08-06T10:59:51Z

pls update the top section with a reproducible example; links to additional material is fine but the source material and versions should be here

Foadsf · 2019-08-06T11:06:04Z

@jreback I have actually downgraded Pandas from 0.25 to 0.24 so I'm not sure if there are other dependencies which might have also been downgraded. Right now the result of pd.show_versions() is:

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.0
pytest: None
pip: 19.2.1
setuptools: 41.0.1
Cython: None
numpy: 1.17.0
scipy: 1.3.0
pyarrow: None
xarray: None
IPython: 7.7.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.2
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.9
feather: None
matplotlib: 3.1.1
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: 1.1.8
lxml.etree: 4.4.0
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

The reproducible example is actually very simple:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

headers = ['fx', 'fy', 'fz', 'tx', 'ty', 'tz', 'currentr',
           'time', 'theta', 'omegay', 'currenty', 'pr', 'Dc', 'Fr', 'Fl']
df = pd.read_csv('data.csv', names=headers)

fig3 = plt.figure()
plt.plot(df.index, df['time'])
plt.show()

Nothing particularly specific. More details, including the CSV file, are here.

Please let me know if this is satisfactory. Thanks in advance for your support.

jreback · 2019-08-06T11:13:40Z

pls try to reduce this to a copy pastable example w/o any external links
the likelihood of response will be higher

Foadsf · 2019-08-06T11:18:11Z

Dear @jreback ,

@anntzer has provided a small example showing the difference between 0.25 and 0.24 here, so I'm just gonna quote her/him:

import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
print(df.index.shape, df.index[:, None].shape)
This now prints (3,) (3,), but with pandas 0.24 used to print (3,) (3, 1) which we relied on to convert input to 2D.

jorisvandenbossche · 2019-08-06T19:40:54Z

@Foadsf I updated the top post with that example

jorisvandenbossche · 2019-08-06T19:48:15Z

So the root cause is that we don't handle well a 2D indexer on an Index class.
We basically simply ignore the fact that df.index[:, None] is a 2D indexer.

The source of Index.__getitem__ actually mentions that for such a case, a plain ndarray should be returned:

pandas/pandas/core/indexes/base.py, lines 4241 to 4242 in 640d9e1:

    If resulting ndim != 1, plain ndarray is returned instead of
    corresponding `Index` subclass.

but that clearly does not happen (anymore).

TomAugspurger · 2019-08-06T19:54:54Z

Though I don't think returning an ndarray is appropriate, right? I'd be surprised to have __getitem__ change the type to a different container class.

What's the best path forward? IMO raising is the most correct thing to do. But is it worth changing?
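
For discussion, the "raise" option could look roughly like the sketch below. StrictToyIndex is a hypothetical stand-in written with numpy only; it is not pandas code, and the error message is made up for illustration.

```python
import numpy as np


class StrictToyIndex:
    """Hypothetical 1D container that refuses multi-dimensional results."""

    def __init__(self, values):
        values = np.asarray(values)
        if values.ndim != 1:
            raise ValueError("values must be 1-dimensional")
        self.values = values

    def __getitem__(self, key):
        result = self.values[key]
        if np.ndim(result) > 1:
            # Reject indexers such as [:, None] instead of storing 2D data.
            raise ValueError("multi-dimensional indexing is not supported")
        if np.ndim(result) == 0:
            return result  # scalar lookup
        return type(self)(result)


idx = StrictToyIndex([1, 2, 3])
try:
    idx[:, None]
except ValueError as exc:
    error = str(exc)

print(error)
```

The container type never changes under this approach, but callers that relied on [:, None] would have to convert to a numpy array themselves first.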

jorisvandenbossche · 2019-08-06T19:55:42Z

This was "caused" by #27384, which optimized Index.shape to return (len(self),) instead of self.values.shape.

But of course bottom line is still that an Index with 2D values is an invalid index object:

In [13]: idx = pd.Index([1, 2, 3])[:, None]                                                                                                                   

In [14]: idx.values                                                                                                                                           
Out[14]: 
array([[1],
       [2],
       [3]])

In [15]: idx.shape                                                                                                                                            
Out[15]: (3,)
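
The mismatch can be mimicked without pandas. The following is a toy sketch, assuming a hypothetical ToyIndex class that passes the indexer straight to its underlying array; the two property names are invented to contrast the two ways of computing the shape.

```python
import numpy as np


class ToyIndex:
    """Hypothetical stand-in for an index-like wrapper; not pandas code."""

    def __init__(self, values):
        self.values = np.asarray(values)

    def __len__(self):
        return len(self.values)

    def __getitem__(self, key):
        # The indexer is forwarded unchanged, so [:, None] yields 2D values.
        return type(self)(self.values[key])

    @property
    def shape_from_values(self):
        # Shape taken from the wrapped array.
        return self.values.shape

    @property
    def shape_from_len(self):
        # Shape computed from the length only.
        return (len(self),)


idx = ToyIndex([1, 2, 3])[:, None]
print(idx.shape_from_values)  # (3, 1)
print(idx.shape_from_len)     # (3,)
```

Both answers describe the same object, which is why the object itself is the real inconsistency rather than either way of reporting the shape.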

jorisvandenbossche · 2019-08-06T19:59:43Z

I think short term, the easiest option is to revert the Index.shape change (but we could keep it for MultiIndex, to keep the performance improvement). That would at least solve the regression with matplotlib.

But longer term this is not really a good solution.
Raising an error certainly sounds like a valid option, but that will require changes in matplotlib.

I suppose the reason it returned a 2D array before might be that it was an ndarray subclass, and in general it might be useful to have the Index behave as an array-like in code that expects a numpy-like array.

BTW, Series actually does this:

In [16]: pd.Series([1, 2, 3])[:, None]                                                                                                                        
Out[16]: 
array([[1],
       [2],
       [3]])

jorisvandenbossche · 2019-08-06T20:33:30Z

The Series case only works for actual numpy dtypes. E.g. for categorical it returns a Series, but that goes wrong in all kinds of ways:

In [32]: s = pd.Series(pd.Categorical(['a', 'b']))[:, None]                                                                                                   

In [33]: type(s)                                                                                                                                              
Out[33]: pandas.core.series.Series

In [34]: s                                                                                                                                                    
Out[34]:
...
TypeError: unsupported format string passed to numpy.ndarray.__format__

In [35]: s._data                                                                                                                                              
Out[35]: 
SingleBlockManager
Items: Int64Index([[0], [1]], dtype='int64')
CategoricalBlock: 1 dtype: category

In [36]: s.index                                                                                                                                              
Out[36]: Int64Index([[0], [1]], dtype='int64')

In [37]: s.values                                                                                                                                             
Out[37]: 
...
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

In [38]: s.cat.codes                                                                                                                                          
...
ValueError: Length of passed values is 1, index implies 2

tacaswell · 2019-08-06T23:37:29Z

From Matplotlib's point of view, returning a numpy array is just fine (as we are trying to duck-type Series and Index as numpy arrays anyway). If we have gotten to the point where we are doing [:, None], we probably think it is close enough to a numpy array, so maybe we just need to cast to numpy a bit more vigorously?
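
A rough sketch of what "casting more vigorously" could mean on the consumer side: convert to an ndarray first, then add the axis. The helper name as_column is hypothetical, and this is not the actual matplotlib fix.

```python
import numpy as np


def as_column(x):
    """Hypothetical helper: coerce to ndarray, then make 1D input 2D."""
    arr = np.asarray(x)
    if arr.ndim == 1:
        # Safe here because arr is a real ndarray, not a duck-typed object.
        arr = arr[:, None]
    return arr


print(as_column([0, 1, 2]).shape)        # (3, 1)
print(as_column(np.ones((3, 2))).shape)  # (3, 2)
```

Since np.asarray accepts any array-like, the same helper should also work for objects that expose their data through the array protocol, without depending on how those objects implement __getitem__.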

jorisvandenbossche · 2019-08-08T09:38:04Z

This is also related to #27125 (the fact that we can create an Index with >1 dimensional array).

For a 0.25.1 bugfix release, I would propose to again start returning the 2D shape.

jorisvandenbossche · 2019-08-08T09:55:37Z

I opened a PR for what I proposed above: #27818

I think for pandas it is fine to output an "invalid" (2D) shape as long as we allow constructing "invalid" Index objects. We should fix that second issue though, for which there is #27125

Foadsf mentioned this issue Aug 6, 2019

IndexError: tuple index out of range with pandas 0.25. matplotlib/matplotlib#14992

Closed

TomAugspurger added the Needs Info Clarification about behavior needed to assess issue label Aug 6, 2019

jorisvandenbossche removed the Needs Info Clarification about behavior needed to assess issue label Aug 6, 2019

jorisvandenbossche added Compat pandas objects compatability with Numpy or Python functions Regression Functionality that used to work in a prior pandas version labels Aug 6, 2019

jorisvandenbossche added this to the 0.25.1 milestone Aug 6, 2019

jorisvandenbossche mentioned this issue Aug 8, 2019

COMPAT: restore shape for 'invalid' Index with nd array #27818

Merged

jorisvandenbossche mentioned this issue Aug 8, 2019

BUG: Index constructor should not allow an ndarray with ndim > 2 #27125

Closed

tacaswell mentioned this issue Aug 9, 2019

FIX: support pandas 0.25 matplotlib/matplotlib#15007

Merged

jorisvandenbossche closed this as completed in #27818 Aug 9, 2019

jorisvandenbossche mentioned this issue Aug 9, 2019

API: what should a 2D indexing operation into a 1D Index do? (eg idx[:, None]) #27837

Closed

jorisvandenbossche mentioned this issue Aug 22, 2019

Index not iterable in Pandas 0.25.0 #28086

Closed

timhoffm mentioned this issue Sep 27, 2019

pyplot plot raises IndexError when x or y are pandas Index objects matplotlib/matplotlib#15342

Closed

tacaswell mentioned this issue Jan 27, 2020

pyplot.plot using pandas series raises DeprecationWarning with pandas=1.0.0rc0 matplotlib/matplotlib#16295

Closed
