BUG/CLN: Allow the BlockManager to have a non-unique items (axis 0) #3509

Merged
merged 5 commits into from
May 2, 2013

Conversation

@jreback
Contributor

jreback commented May 2, 2013

partially fixes #3468

This would previously raise (same-dtype assignment to a single-dtype frame with duplicate indices)

In [6]: df = DataFrame([[1,2]], columns=['a','a'])

In [7]: df.columns = ['a','a.1']

In [8]: df
Out[8]: 
   a  a.1
0  1    2

Construction of a multi-dtype frame with a duplicate index (#2194) is fixed

In [18]: DataFrame([[1,2,1.,2.,3.,'foo','bar']], columns=list('aaaaaaa'))
Out[18]: 
   a  a  a  a  a    a    a
0  1  2  1  2  3  foo  bar

This would also previously raise

In [3]: df_float  = DataFrame(np.random.randn(10, 3),dtype='float64')

In [4]: df_int    = DataFrame(np.random.randn(10, 3),dtype='int64')

In [5]: df_bool   = DataFrame(True,index=df_float.index,columns=df_float.columns)

In [6]: df_object = DataFrame('foo',index=df_float.index,columns=df_float.columns)

In [7]: df_dt     = DataFrame(Timestamp('20010101'),index=df_float.index,columns=df_float.columns)

In [9]: df        = pan.concat([ df_float, df_int, df_bool, df_object, df_dt ], axis=1)

In [14]: cols = []

In [15]: for i in range(5):
   ....:     cols.extend([0,1,2])
   ....:     

In [16]: df.columns = cols

In [17]: df
Out[17]: 
          0         1         2  0  1  2     0     1     2    0    1    2                   0                   1                   2
0  0.586610  0.369944  1.341337  1  1  1  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
1 -1.944284 -0.813987  0.061306  0  0  1  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
2 -1.688694  1.644802  0.659083  0  0  0  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
3  1.422893  0.712382  0.749263 -1  0 -1  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
4 -0.453802  0.228886 -0.339753  2  0 -2  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
5 -0.189643  1.309407 -0.386121  0  0  0  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
6  0.455658  0.822050 -0.741014  0  0  0  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
7 -0.484678 -1.089146  0.774849  0  1  0  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
8  0.720365  1.696400 -0.604040 -1  0  0  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
9 -0.344480  0.886489  0.274428  1  0  0  True  True  True  foo  foo  foo 2001-01-01 00:00:00 2001-01-01 00:00:00 2001-01-01 00:00:00
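A minimal sketch of the same scenario, followed by selection on a duplicated label. It assumes a recent pandas where `pd.concat` and label selection accept non-unique columns. It also uses `np.random.randint` for the integer frame, which is a substitution rather than what the transcript above ran.

```python
import numpy as np
import pandas as pd

# Five frames of different dtypes sharing the same index and column labels 0, 1, 2.
df_float = pd.DataFrame(np.random.randn(10, 3), dtype="float64")
# Assumption: randint replaces the randn-to-int64 cast used in the transcript.
df_int = pd.DataFrame(np.random.randint(0, 10, size=(10, 3)))
df_bool = pd.DataFrame(True, index=df_float.index, columns=df_float.columns)
df_object = pd.DataFrame("foo", index=df_float.index, columns=df_float.columns)
df_dt = pd.DataFrame(pd.Timestamp("20010101"), index=df_float.index,
                     columns=df_float.columns)

df = pd.concat([df_float, df_int, df_bool, df_object, df_dt], axis=1)

# Assign the duplicated labels explicitly, mirroring the session above.
cols = []
for i in range(5):
    cols.extend([0, 1, 2])
df.columns = cols

# Selecting a duplicated label returns every matching column, one per dtype here.
subset = df[0]
print(subset.dtypes)
```

Selecting by position with `df.iloc[:, 3]` still picks exactly one column, which is the usual way to disambiguate when labels repeat.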

For those of you interested, here is the new ref_loc indexer for duplicate columns.
It is by necessity a block-oriented indexer: it returns a map from each column number to a tuple of (block, index within the block). It is only created when needed (e.g. when trying to get a column via iget and the index is non-unique), and the results are cached. This is #3092.

In [1]: df = pd.DataFrame(np.random.randn(8,4),columns=['a']*4)

In [2]: df._data.blocks
Out[2]: [FloatBlock: [a, a, a, a], 4 x 8, dtype float64]

In [3]: df._data.blocks[0]._ref_locs

In [4]: df._data._set_ref_locs()
Out[4]: 
array([(FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 0),
       (FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 1),
       (FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 2),
       (FloatBlock: [a, a, a, a], 4 x 8, dtype float64, 3)], dtype=object)
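The idea behind the indexer can be sketched in plain Python. This is not the pandas implementation: `Block` and `build_ref_locs` are hypothetical names invented for illustration, and a block is reduced to a dtype name plus the frame column numbers it holds.

```python
from typing import List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    """Hypothetical stand-in for a dtype-homogeneous block of columns."""
    dtype: str
    positions: List[int]  # frame column numbers held by this block, in block order


def build_ref_locs(blocks: List[Block],
                   n_columns: int) -> List[Optional[Tuple[Block, int]]]:
    """Map each frame column number to (block, index within that block)."""
    ref_locs: List[Optional[Tuple[Block, int]]] = [None] * n_columns
    for block in blocks:
        for index_in_block, column_number in enumerate(block.positions):
            ref_locs[column_number] = (block, index_in_block)
    return ref_locs


# Four columns that all share the label 'a' but live in two different blocks.
float_block = Block("float64", [0, 2])
object_block = Block("object", [1, 3])
ref_locs = build_ref_locs([float_block, object_block], n_columns=4)

# Column number 2 is the second column stored in the float block.
block, index_in_block = ref_locs[2]
print(block.dtype, index_in_block)
```

Because the lookup is keyed by column number rather than by label, it stays unambiguous even when every label is identical.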

Fixed the bug from #2786 and #3230 that caused applymap to fail with a non-unique index (we had temporarily worked around it by raising a ValueError; that check is now removed)

In [3]: df = pd.DataFrame(np.random.random((3,4)))

In [4]: cols = pd.Index(['a','a','a','a'])

In [5]: df.columns = cols

In [6]: df.applymap(str)
Out[6]: 
                a                a               a               a
0  0.494204195164   0.534601503195  0.471870025143  0.880092879641
1  0.860369768954  0.0472931994392  0.775532754792  0.822046777859
2  0.478775855962   0.623584943227  0.932012693593  0.739502590395

Finally, to_csv writing has been fixed to use a single column mapper (which is derived from the ref_locs if the index is non-unique or the column numbering if it is unique)
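A small check of the `to_csv` behaviour described above, assuming a recent pandas. The frame contents and variable names are made up for illustration.

```python
import io

import pandas as pd

# A mixed-dtype frame whose three columns all carry the label 'a'.
df = pd.DataFrame([[1, 2.5, "x"], [3, 4.5, "y"]], columns=["a", "a", "a"])

# Write to an in-memory buffer so the example has no filesystem side effects.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)

# Reading the text back de-duplicates the header by appending numeric suffixes.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(list(round_trip.columns))
```

The columns are written in frame order even though they come from different dtype blocks. The suffixes added on read follow the same `a.1` naming used in the rename example at the top of this description.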

jreback added 4 commits May 1, 2013 12:37
BUG: fix construction of a DataFrame with duplicative indices
…get) when

     using a non-unique index (GH2786 for the warning and GH3230 for applymap)

TST: test for GH2194 (which is fixed)
@jreback
Contributor Author

jreback commented May 2, 2013

@wesm, @y-p this was a rabbit hole! I think this finally solves the non-unique indexing issues in construction, assignment, and selection. We may have to review some of the other temporary fixes that are in, e.g. #3458?

@wesm
Member

wesm commented May 2, 2013

Ha, the rabbit hole, you went down it. Thanks for sparing me this one!

@ghost

ghost commented May 2, 2013

good one, jeff. I think that's the most issues addressed by a single PR ever. :)

@jreback
Contributor Author

jreback commented May 2, 2013

I would say that's not a good thing, but they r all related :)

@changhiskhan
Contributor

@jreback you have an iron stomach :)

…on the decoration

      useful when specifying an index that is **known** to be unique (e.g. in the case
      of a default range index)