BUG/CLN: Allow the BlockManager to have a non-unique items (axis 0) #3509

jreback · 2013-05-02T01:14:01Z

Non-unique index support clarified (#3092, "ENH: create BlockManager positional indexer (for easier dupe cols support)")
- Fix assigning a new index to a duplicate index in a DataFrame, which would fail (#3468, "Pandas inconsistently handles identically named columns in csv export and merging")
- Fix construction of a DataFrame with a duplicate index
- ref_locs support to allow duplicate indices across dtypes; this allows iget
  to always find the index, even across dtypes (#2194, "error msg when trying to split duplicated columns across dtypes")
- applymap on a DataFrame with a non-unique index now works and the warning is removed
  (#2786, "df.applymap duplicates data with frame has dupe columns"; also fixes #3230, "Enable applymap for dataframes with duplicate columns")
- Fix to_csv to handle non-unique columns (#3495, "BUG: fix to_csv to work with dup column indices")
- Modification to cache_readonly to allow you to pass an argument (allow_setting) so that the value
  can be 'set'; this is useful to avoid a computation whose result you already know, e.g. is_unique = True
  for a default index
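
To illustrate the last bullet, here is a minimal sketch of a cached read-only property that can optionally be pre-set. It is not the pandas implementation; the descriptor body, the `_cache` attribute, and the `ToyIndex` class are assumptions made up for this example. Only the names cache_readonly, allow_setting, and is_unique come from the description above.

```python
class cache_readonly:
    """Hypothetical sketch: compute once, cache, and optionally allow pre-setting."""

    def __init__(self, func=None, allow_setting=False):
        self.func = func
        self.allow_setting = allow_setting
        self.name = func.__name__ if func is not None else None

    def __call__(self, func):
        # Supports the parameterized form: @cache_readonly(allow_setting=True)
        self.func = func
        self.name = func.__name__
        return self

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        cache = obj.__dict__.setdefault("_cache", {})
        if self.name not in cache:
            # First access: run the (possibly expensive) computation once.
            cache[self.name] = self.func(obj)
        return cache[self.name]

    def __set__(self, obj, value):
        if not self.allow_setting:
            raise AttributeError(f"can't set attribute {self.name!r}")
        # Store the known answer so the computation never has to run.
        obj.__dict__.setdefault("_cache", {})[self.name] = value


class ToyIndex:
    """Hypothetical stand-in for an index; counts how often uniqueness is computed."""

    def __init__(self, values):
        self.values = list(values)
        self.scans = 0

    @cache_readonly(allow_setting=True)
    def is_unique(self):
        self.scans += 1
        return len(set(self.values)) == len(self.values)


default_index = ToyIndex(range(1000))
default_index.is_unique = True  # known to be true for a default range index
```

After the assignment, reading `default_index.is_unique` returns the stored value and the uniqueness scan is skipped entirely.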

partially fixes #3468

This would previously raise (same-dtype assignment to a single-dtype frame with duplicate indices)

In [6]: df = DataFrame([[1,2]], columns=['a','a'])

In [7]: df.columns = ['a','a.1']

In [8]: df
Out[8]: 
   a  a.1
0  1    2

construction of a multi-dtype frame with a dup index (#2194) is fixed

In [18]: DataFrame([[1,2,1.,2.,3.,'foo','bar']], columns=list('aaaaaaa'))
Out[18]: 
   a  a  a  a  a    a    a
0  1  2  1  2  3  foo  bar

This would also previously raise

In [3]: df_float  = DataFrame(np.random.randn(10, 3),dtype='float64')

In [4]: df_int    = DataFrame(np.random.randn(10, 3),dtype='int64')

In [5]: df_bool   = DataFrame(True,index=df_float.index,columns=df_float.columns)

In [6]: df_object = DataFrame('foo',index=df_float.index,columns=df_float.columns)

In [7]: df_dt     = DataFrame(Timestamp('20010101'),index=df_float.index,columns=df_float.columns)

In [9]: df        = pan.concat([ df_float, df_int, df_bool, df_object, df_dt ], axis=1)

In [14]: cols = []

In [15]: for i in range(5):
   ....:     cols.extend([0,1,2])
   ....:     

In [16]: df.columns = cols

In [17]: df
Out[17]: 
          0         1         2  0  1  2     0     1     2    0    1    2                   0                   1                   2
0  0.586610  0.369944  1.341337  1  1  1  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
1 -1.944284 -0.813987  0.061306  0  0  1  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
2 -1.688694  1.644802  0.659083  0  0  0  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
3  1.422893  0.712382  0.749263 -1  0 -1  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
4 -0.453802  0.228886 -0.339753  2  0 -2  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
5 -0.189643  1.309407 -0.386121  0  0  0  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
6  0.455658  0.822050 -0.741014  0  0  0  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
7 -0.484678 -1.089146  0.774849  0  1  0  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
8  0.720365  1.696400 -0.604040 -1  0  0  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
9 -0.344480  0.886489  0.274428  1  0  0  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00

For those of you interested, here is the new ref_loc indexer for duplicate columns.
It is by necessity a block-oriented indexer: it returns the column map, from column number to a tuple of the block and the index within that block. It is only created when needed (e.g. when trying to get a column via iget and the index is non-unique), and the results are cached. This is #3092.

In [1]: df = pd.DataFrame(np.random.randn(8,4),columns=['a']*4)

In [2]: df._data.blocks
Out[2]: [FloatBlock: [a, a, a, a], 4 x 8, dtype float64]

In [3]: df._data.blocks[0]._ref_locs

In [4]: df._data._set_ref_locs()
Out[4]: 
array([(FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 0),
       (FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 1),
       (FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 2),
       (FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 3)], dtype=object)
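
The idea behind the map can be sketched in plain Python. This is only an illustration under stated assumptions, not the BlockManager code: `ToyBlock`, `build_ref_locs`, and the free function `iget` are hypothetical names, and each toy block is assumed to record which frame column positions it holds.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyBlock:
    """Hypothetical block: all columns of one dtype, stored together."""
    dtype: str
    positions: List[int]   # frame column positions held by this block, in block order
    values: List[list]     # one list of row values per column in the block


def build_ref_locs(blocks: List[ToyBlock], n_columns: int) -> List[Tuple[ToyBlock, int]]:
    """Map each frame column position to (block, index within that block)."""
    ref_locs = [None] * n_columns
    for block in blocks:
        for loc_in_block, position in enumerate(block.positions):
            ref_locs[position] = (block, loc_in_block)
    if any(entry is None for entry in ref_locs):
        raise ValueError("some column positions are not covered by any block")
    return ref_locs


def iget(ref_locs: List[Tuple[ToyBlock, int]], i: int) -> list:
    """Fetch column i by position; labels are never consulted, so duplicates are harmless."""
    block, loc_in_block = ref_locs[i]
    return block.values[loc_in_block]


# Three columns all labelled 'a': float, int, float.
float_block = ToyBlock("float64", positions=[0, 2], values=[[0.5, 1.5], [2.5, 3.5]])
int_block = ToyBlock("int64", positions=[1], values=[[7, 8]])
ref_locs = build_ref_locs([float_block, int_block], n_columns=3)
```

Because lookup goes through the position, `iget(ref_locs, 1)` finds the integer column even though its label is shared with the two float columns.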

Fixed the bug in #2786 and #3230 that caused applymap to not work (we had temporarily worked around it by raising a ValueError; that check is now removed)

In [3]: df = pd.DataFrame(np.random.random((3,4)))

In [4]: cols = pd.Index(['a','a','a','a'])

In [5]: df.columns = cols

In [6]: df.applymap(str)
Out[6]: 
                a                a               a               a
0  0.494204195164   0.534601503195  0.471870025143  0.880092879641
1  0.860369768954  0.0472931994392  0.775532754792  0.822046777859
2  0.478775855962   0.623584943227  0.932012693593  0.739502590395

Finally, to_csv writing has been fixed to use a single column mapper (which is derived from the ref_locs if the index is non-unique or the column numbering if it is unique)
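
A rough sketch of why a positional column mapper helps when writing. This is not CSVFormatter.save; `write_csv` and its arguments are hypothetical, and the mapper here is simply the column numbering, used whether or not the labels are unique.

```python
import csv
import io


def write_csv(labels, columns, index):
    """Write columns by position so duplicate labels never collapse into one.

    labels  : column labels, possibly with duplicates
    columns : columns[i] holds the row values for the i-th column position
    index   : row labels
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([""] + list(labels))     # header keeps every label, duplicates included
    col_positions = range(len(labels))       # the column mapper: positions, not labels
    for row_number, row_label in enumerate(index):
        writer.writerow([row_label] + [columns[c][row_number] for c in col_positions])
    return buf.getvalue()


text = write_csv(["a", "a"], [[1, 3], [2, 4]], index=[0, 1])
```

Selecting by label would return the same data for both 'a' columns; selecting by position writes each column exactly once.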

jreback · 2013-05-02T01:20:38Z

@wesm, @y-p this was a rabbit hole! I think this finally solves the non-unique indexing issues in construction, assignment, and selection. may have to review some of the other temp fixes that are in, e.g. #3458?

wesm · 2013-05-02T01:40:36Z

Ha, the rabbit hole, you went down it. Thanks for sparing me this one!

ghost · 2013-05-02T02:22:06Z

good one, jeff. I think that's the most issues addressed by a single PR ever. :)

jreback · 2013-05-02T02:27:51Z

I would say that's not a good thing, but they r all related :)

changhiskhan · 2013-05-02T02:28:28Z

@jreback you have an iron stomach :)

jreback added 4 commits May 1, 2013 12:37

BUG: GH3468 Fix assigning a new index to a duplicate index in a DataFrame would fail

432c672

ENH: support for having duplicative indices across blocks (dtypes)
BUG: fix construction of a DataFrame with duplicative indices

4c756e2

BUG: enabled applymap to work (and updated internals/convert to use iget) when using a non-unique index (GH2786 for the warning and GH3230 for applymap)
TST: test for GH2194 (which is fixed)

b4677c1

BUG: GH3495 change core/format/CSVFormatter.save to allow generic way of dealing with columns duplicate or not

b8382a3

jreback mentioned this pull request May 2, 2013

BUG: GH3468 Fix assigning a new index to a duplicate index in a DataFrame would fail #3483

Closed

PERF: allow a cache_readonly to be 'set' if allow_settings is passed on the decoration; useful when specifying an index that is **known** to be unique (e.g. in the case of a default range index)

8c08aca

jreback added a commit that referenced this pull request May 2, 2013

Merge pull request #3509 from jreback/dup_columns2

c03f0ca

BUG/CLN: Allow the BlockManager to have a non-unique items (axis 0)

jreback merged commit c03f0ca into pandas-dev:master May 2, 2013

This was referenced May 3, 2013

df.applymap duplicates data with frame has dupe columns #2786

Closed

fancy indexing with dupe columns yields unexpected ordering #3455

Closed
