
ENH: allow to_csv to write multi-index columns, read_csv to read with header=list arg #3575

Merged (7 commits) on May 19, 2013

Conversation

@jreback (Contributor) commented May 11, 2013

In theory should close:
#3571, #1651, #3141

Works, but a couple of issues/caveats:

  • index_col needs to be specified as an integer list (can be fixed)
  • header is a list of the rows that make up the column multi-index; any row not in the list
    (e.g. row 2 when passing [0,1,3,5]) is simply skipped, like a comment
  • the writing format might be a bit odd: the column level names go in the first column,
    the other index columns are left blank (they are separated, just == '').
    The names of a multi-index on the index come after the column rows and before the data,
    as a full row (blank after the row names).
  • I am not sure if we should allow df.to_csv('path',index=False) when there are multi-index columns; we could just ban it I guess (mainly because it messes up the write format, and then where do you put the names?)
  • The cols argument needs testing and is probably broken when using a multi-index on the columns (it really should be specified as a tuple I think, but that is work, so maybe just ban it when using multi-index columns)
  • needs more testing
In [14]: df = mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=4)

In [15]: df.to_csv('test.csv')

In [16]: !cat 'test.csv'
C0,,C_l0_g0,C_l0_g1,C_l0_g2
C1,,C_l1_g0,C_l1_g1,C_l1_g2
C2,,C_l2_g0,C_l2_g1,C_l2_g2
C3,,C_l3_g0,C_l3_g1,C_l3_g2
R0,R1,,,
R_l0_g0,R_l1_g0,R0C0,R0C1,R0C2
R_l0_g1,R_l1_g1,R1C0,R1C1,R1C2
R_l0_g2,R_l1_g2,R2C0,R2C1,R2C2
R_l0_g3,R_l1_g3,R3C0,R3C1,R3C2
R_l0_g4,R_l1_g4,R4C0,R4C1,R4C2

In [17]: res = read_csv('test.csv',header=[0,1,2,3],index_col=[0,1])

In [18]: res.index
Out[18]: 
MultiIndex
[(u'R_l0_g0', u'R_l1_g0'), (u'R_l0_g1', u'R_l1_g1'), (u'R_l0_g2', u'R_l1_g2'), (u'R_l0_g3', u'R_l1_g3'), (u'R_l0_g4', u'R_l1_g4')]

In [19]: res.columns
Out[19]: 
MultiIndex
[(u'C_l0_g0', u'C_l1_g0', u'C_l2_g0', u'C_l3_g0'), (u'C_l0_g1', u'C_l1_g1', u'C_l2_g1', u'C_l3_g1'), (u'C_l0_g2', u'C_l1_g2', u'C_l2_g2', u'C_l3_g2')]

In [20]: res
Out[20]: 
C0              C_l0_g0 C_l0_g1 C_l0_g2
C1              C_l1_g0 C_l1_g1 C_l1_g2
C2              C_l2_g0 C_l2_g1 C_l2_g2
C3              C_l3_g0 C_l3_g1 C_l3_g2
R0      R1                             
R_l0_g0 R_l1_g0    R0C0    R0C1    R0C2
R_l0_g1 R_l1_g1    R1C0    R1C1    R1C2
R_l0_g2 R_l1_g2    R2C0    R2C1    R2C2
R_l0_g3 R_l1_g3    R3C0    R3C1    R3C2
R_l0_g4 R_l1_g4    R4C0    R4C1    R4C2

In [21]: res.index.names
Out[21]: ['R0', 'R1']

In [22]: res.columns.names
Out[22]: ['C0', 'C1', 'C2', 'C3']
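(mkdf above is pandas' makeCustomDataframe testing helper; a rough sketch of an equivalent frame built from the public API, assuming the same label scheme as the output above, would be:)

import pandas as pd

# 4-level column index and 2-level row index, labels matching the demo above
columns = pd.MultiIndex.from_arrays(
    [['C_l%d_g%d' % (lvl, g) for g in range(3)] for lvl in range(4)],
    names=['C0', 'C1', 'C2', 'C3'])
index = pd.MultiIndex.from_arrays(
    [['R_l%d_g%d' % (lvl, g) for g in range(5)] for lvl in range(2)],
    names=['R0', 'R1'])
data = [['R%dC%d' % (r, c) for c in range(3)] for r in range(5)]
df = pd.DataFrame(data, index=index, columns=columns)
df.to_csv('test.csv')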

@cpcloud (Member) commented May 11, 2013

Might it be useful to be able to pass the index/header names if you know them a priori?

@jreback (Contributor, Author) commented May 11, 2013

can u give an example?

@cpcloud (Member) commented May 11, 2013

Hm. Maybe it's not that useful since u can just read them in and then filter out what u don't want. I'll give an example anyway.

Say you had column index names stored somewhere e.g., a text file. You only want to read in the levels of the column index with those names, so you specify e.g., header=['c0', 'c1'] and only the levels from c0 and c1 will be in the resulting index.

@jreback (Contributor, Author) commented May 11, 2013

you kind of want a uselevels argument analogous to usecols; you still need header to indicate which rows to use for the column index in the first place, e.g. header=[0,1,2], which says you have 3 levels, and then uselevels=['c0','c1'] means only take the first 2 (and discard the 3rd)?
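(No uselevels argument exists; a minimal workaround sketch, assuming the test.csv written above, is to read every header row and then drop the unwanted column levels afterwards:)

import pandas as pd

# read all four header rows, then keep only the first two column levels
res = pd.read_csv('test.csv', header=[0, 1, 2, 3], index_col=[0, 1])
res.columns = res.columns.droplevel([2, 3])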

@cpcloud (Member) commented May 11, 2013

@jreback right, right. sorry, yes header is for rows to use as headers. yeah or only the columns named c0, c1, ..., cN.

@jreback (Contributor, Author) commented May 13, 2013

@y-p would be interested to hear your take on this :)

@ghost commented May 13, 2013

It's consistent with the way the row index is handled, and as long as it doesn't try to do
auto-detection and keeps the user in charge it should work fine. The issues are just
those of back-compat which you've already raised, so to reiterate:

  1. Is it necessary to change the default behavior rather than add a flag? This will break someone's code,
    as you mentioned. With the recent dupe colname mangling, we opted for an arg with a default
    value that keeps back-compat, and added a note that it will flip in 0.12, giving people a heads-up
    about the breaking change.
  2. If I like the current behavior, or have code that depends on it (skips rows, for example),
    how can I ensure that's what my output looks like?
  3. Same for input: if I have existing data with intentional tuples, how do I ensure I get
    the old behaviour?

Speaking of dupe cols, this is obviously a misuse since the save/load arguments don't match,
but note the dupe unnamed columns; I don't think we meant that to happen:

In [1]: df = mkdf(5,3,r_idx_nlevels=3,c_idx_nlevels=4)
In [2]: df.to_csv('test.csv')
In [7]: res = pd.read_csv('test.csv',header=[0,1,2,3],index_col=[0,1])
In [10]: res.columns
Out[10]: 
MultiIndex
[(u'Unnamed: 2', u'Unnamed: 2', u'Unnamed: 2', u'Unnamed: 2'), (u'C_l0_g0', u'C_l1_g0', u'C_l2_g0', u'C_l3_g0'), (u'C_l0_g1', u'C_l1_g1', u'C_l2_g1', u'C_l3_g1'), (u'C_l0_g2', u'C_l1_g2', u'C_l2_g2', u'C_l3_g2')]

C0              Unnamed: 2 C_l0_g0 C_l0_g1 C_l0_g2
C1              Unnamed: 2 C_l1_g0 C_l1_g1 C_l1_g2
C2              Unnamed: 2 C_l2_g0 C_l2_g1 C_l2_g2
C3              Unnamed: 2 C_l3_g0 C_l3_g1 C_l3_g2
R0      R1                                        
R_l0_g0 R_l1_g0    R_l2_g0    R0C0    R0C1    R0C2
R_l0_g1 R_l1_g1    R_l2_g1    R1C0    R1C1    R1C2
R_l0_g2 R_l1_g2    R_l2_g2    R2C0    R2C1    R2C2
R_l0_g3 R_l1_g3    R_l2_g3    R3C0    R3C1    R3C2
R_l0_g4 R_l1_g4    R_l2_g4    R4C0    R4C1    R4C2

@jreback (Contributor, Author) commented May 13, 2013

  1. The question is: are we 'forcing' a multi-index on columns to now be written in the new format, rather than as a list of tuples? The answer is yes. I suppose I could add an option not to do that, but I would argue that the default should be the new format.

the read code will work in either case (if you have a list of tuples, then header=0 will interpret it, while header=[0,1,2] will work in the regular case).

So it's only a back-compat issue if you are actually using the file and depending on the list of tuples (i.e. not reading it with pandas, but using it externally)

  2. I don't understand, you want to skip rows?

  3. looks like a bug....

so I guess we need 1 (or 2) more options

Can you have a sub-category? (e.g. displays.csv.multi_index?)

displays.write_csv_multi_index_columns=True
displays.read_csv_multi_index_as_tuples=True

would be the new behavior

@jreback (Contributor, Author) commented May 13, 2013

Ok, for your example of an underspecified index_col I now return this

(Pdb) result
C0              Unnamed: 2_level_0 C_l0_g0 C_l0_g1 C_l0_g2
C1              Unnamed: 2_level_1 C_l1_g0 C_l1_g1 C_l1_g2
C2              Unnamed: 2_level_2 C_l2_g0 C_l2_g1 C_l2_g2
C3              Unnamed: 2_level_3 C_l3_g0 C_l3_g1 C_l3_g2
R0      R1                                                
R_l0_g0 R_l1_g0            R_l2_g0    R0C0    R0C1    R0C2
R_l0_g1 R_l1_g1            R_l2_g1    R1C0    R1C1    R1C2
R_l0_g2 R_l1_g2            R_l2_g2    R2C0    R2C1    R2C2
R_l0_g3 R_l1_g3            R_l2_g3    R3C0    R3C1    R3C2
R_l0_g4 R_l1_g4            R_l2_g4    R4C0    R4C1    R4C2

@ghost commented May 13, 2013

Aside from what the default should be, is it still possible to specify that I want the old
behavior? That is, writing out tuples, and reading in tuples as-is.

For people who have existing code, the bare minimum is to allow them
to add a flag to keep the old behavior. Forcing everyone to rewrite their code
to a larger extent than that is a bad move IMO.

@jreback (Contributor, Author) commented May 13, 2013

Here's how to specify the old behavior (the option name is a bit long, but I wanted to be explicit);
the default is the new behavior (which will still read the old style)

In [4]: df = mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=4)

In [5]: df.to_csv('test.csv',multi_index_columns_compat=True)

In [6]: !cat 'test.csv'
R0,R1,"('C_l0_g0', 'C_l1_g0', 'C_l2_g0', 'C_l3_g0')","('C_l0_g1', 'C_l1_g1', 'C_l2_g1', 'C_l3_g1')","('C_l0_g2', 'C_l1_g2', 'C_l2_g2', 'C_l3_g2')"
R_l0_g0,R_l1_g0,R0C0,R0C1,R0C2
R_l0_g1,R_l1_g1,R1C0,R1C1,R1C2
R_l0_g2,R_l1_g2,R2C0,R2C1,R2C2
R_l0_g3,R_l1_g3,R3C0,R3C1,R3C2
R_l0_g4,R_l1_g4,R4C0,R4C1,R4C2

In [7]: result = read_csv('test.csv',header=0,index_col=[0,1],multi_index_columns_compat=True)

In [8]: result
Out[8]: 
                ('C_l0_g0', 'C_l1_g0', 'C_l2_g0', 'C_l3_g0') ('C_l0_g1', 'C_l1_g1', 'C_l2_g1', 'C_l3_g1') ('C_l0_g2', 'C_l1_g2', 'C_l2_g2', 'C_l3_g2')
R0      R1                                                                                                                                            
R_l0_g0 R_l1_g0                                         R0C0                                         R0C1                                         R0C2
R_l0_g1 R_l1_g1                                         R1C0                                         R1C1                                         R1C2
R_l0_g2 R_l1_g2                                         R2C0                                         R2C1                                         R2C2
R_l0_g3 R_l1_g3                                         R3C0                                         R3C1                                         R3C2
R_l0_g4 R_l1_g4                                         R4C0                                         R4C1                                         R4C2

@ghost commented May 14, 2013

How about tupleized_cols=bool or similar?

@jreback (Contributor, Author) commented May 14, 2013

tupleized_columns=bool

to be consistent with mangle_dup_columns

or maybe change that to mangle_dup_cols for consistency with index_col?

@jreback (Contributor, Author) commented May 14, 2013

I guess the default should be True for 0.11.1, then False (the new behavior) in 0.12,
so the API is available but not the default yet?

@ghost commented May 14, 2013

(+1) + (+1)

@jreback (Contributor, Author) commented May 14, 2013

@y-p my mistake on mangle_dup_columns; I had put that in the parsers docstring rather than the actual option name (which is correct), mangle_dup_cols.

@jreback (Contributor, Author) commented May 14, 2013

of course you can suggest, and I'll take it!

@ghost commented May 14, 2013

I suggest you open a note-to-selves issue for the 0.12 milestone, to remind us to flip the default
and update the docstrings.

@jreback (Contributor, Author) commented May 14, 2013

typos fixed

@jreback (Contributor, Author) commented May 14, 2013

@y-p I can allow index=False in to_csv and then allow index_col=None in read_csv, BUT then you lose any columns.names; or we could disallow writing without an index

what do you think?
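(A rough sketch of the trade-off being described, using the 4-level frame from the earlier example and assuming the behavior proposed here; 'no_index.csv' is just an illustrative path:)

# with index=False there is no index column for the names row to anchor against
df.to_csv('no_index.csv', index=False, tupleize_cols=False)
res = read_csv('no_index.csv', header=[0, 1, 2, 3], index_col=None, tupleize_cols=False)
# res.columns should come back as a MultiIndex, but res.columns.names is
# expected to be all None -- the level names have nowhere to live on disk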

@jreback (Contributor, Author) commented May 15, 2013

@y-p @cpcloud take a look at the csv as written on disk. There is an empty line where the row index names should go (since they are None here). It's very hard to parse this if that line is not there (because it's ambiguous whether there are row labels or whether they should be part of the data). I could solve this with another option, but I don't like that solution.

Any thoughts?

In [11]: 
df = DataFrame(np.random.randint(0,10,size=(3,3)), 
                                 columns=MultiIndex.from_tuples([('bah', 'foo'), 
                                                                 ('bah', 'bar'), 
                                                                 ('ban', 'baz')],
                                                                names=None))

In [13]: df.to_csv('test.csv',tupleize_cols=False)

In [14]: !cat 'test.csv'
,bah,bah,ban
,foo,bar,baz
,,,
0,2,0,6
1,4,0,5
2,1,4,2

In [16]: read_csv('test.csv',tupleize_cols=False,header=[0,1],index_col=0)
Out[16]: 
   bah       ban
   foo  bar  baz
0    2    0    6
1    4    0    5
2    1    4    2

@cpcloud (Member) commented May 15, 2013

Maybe add a comment to the right of the empty row so that new users aren't left wondering what it is. That forces a specific comment syntax though...

@jreback (Contributor, Author) commented May 15, 2013

what about:

Unnamed Index Label 0,,,

(which the parser can then ignore, similar to what we do with unnamed columns)
?

@cpcloud (Member) commented May 15, 2013

should this throw on header=list_that_is_too_long or header=list_that_contains_rows_not_in_header, e.g. (using your 'test.csv'),

if any(h not in parsed_rows_before_comma_line for h in header):
    raise ParserError("passed header rows not all in header when parsing MultiIndex")

when for example a user does something like:

read_csv('test.csv', tupleize_cols=False, header=range(10), index_col=0)

Slightly OT: allow xrange objects in header argument?

@jreback (Contributor, Author) commented May 15, 2013

So I think these are reasonable

In [3]: read_csv('test.csv', tupleize_cols=False, header=range(10), index_col=0)
CParserError: Passed header=0 but only 6 lines in file

In [4]: read_csv('test.csv', tupleize_cols=False, header=range(3), index_col=0)
Exception: Passed header=[0,1,2,3] are too many rows for this multi_index of columns

no real reason to support xrange; this is a really short sequence (and we don't support it in
any other fields)
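(A cheap pre-check in that spirit, done outside the parser; a sketch assuming the test.csv from above and a hypothetical header list:)

header = list(range(3))
with open('test.csv') as f:
    n_lines = sum(1 for _ in f)   # total physical lines in the file
if max(header) >= n_lines:
    raise ValueError('header rows %s exceed the %d lines in the file' % (header, n_lines))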

@jreback (Contributor, Author) commented May 16, 2013

@wesm @y-p what do you think?

@ghost commented May 18, 2013

my 2c:

  • No need for xrange support. More generally, no need to add features just "because we can";
    there's always an associated cost.
  • Fine to have a row of missing values to make parsing easier on roundtrip,
    when the user specifies tupleize_cols=True.
  • Disallowing index=False with tupleize_cols=True seems
    reasonable to me.
  • Note that csv is just an export format; it's pushing things when you try to turn it
    into a dataframe serialization format (the empty row-labels line, the index=False problem,
    the inevitable request for control over including index names).
  • Slight déjà vu of New Excel functionality #2478, Excelfancy #2370, really.

@jreback (Contributor, Author) commented May 18, 2013

@y-p

I didn't disable index=False with tupleize_cols=True; I just added a test for it. It's not useful, but allowed. (This is allowed in 0.11.0, in fact.)

So I think this is ready to go

@jreback (Contributor, Author) commented May 18, 2013

index=False and tupleize=True (in 0.11.1), same as in prior versions

In [2]: df =  DataFrame(np.random.randint(0,10,size=(3,3)),columns=MultiIndex.from_tuples([('bah', 'foo'),('bah', 'bar'),('ban', 'baz')],names=['first','second']))

In [3]: df
Out[3]: 
first   bah       ban
second  foo  bar  baz
0         1    6    2
1         0    3    8
2         0    4    3

In [4]: df.to_csv('test.csv',tupleize_cols=True,index=False)

In [5]: !cat 'test.csv'
"('bah', 'foo')","('bah', 'bar')","('ban', 'baz')"
1,6,2
0,3,8
0,4,3

In [6]: read_csv('test.csv',header=0,tupleize_cols=True,index_col=None)
Out[6]: 
   ('bah', 'foo')  ('bah', 'bar')  ('ban', 'baz')
0               1               6               2
1               0               3               8
2               0               4               3
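(If a file was written with tupleize_cols=True, a minimal sketch for rebuilding a MultiIndex from the stringified tuples after the fact, assuming the test.csv written above:)

import ast
import pandas as pd

res = pd.read_csv('test.csv', header=0)
# each column label is the string repr of a tuple, e.g. "('bah', 'foo')";
# literal_eval turns it back into a real tuple
res.columns = pd.MultiIndex.from_tuples([ast.literal_eval(c) for c in res.columns])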

@cpcloud (Member) commented May 18, 2013

@jreback @y-p sorry about the xrange noise. Absolutely no need for that there; want to avoid feature creep as much as possible.

@y-p I agree that this is bordering on dataframe serialization, but I think that if pandas supports MultiIndexes then it should support them as completely as possible.

@jreback I think this error message is a bit cryptic: CParserError: Passed header=0 but only 6 lines in file. Why not something like "Elements of header list contain N but there are only M lines in file", substituting N and M for the associated values? Because if there are 6 lines in a file, why is header=0 invalid?

@jreback (Contributor, Author) commented May 18, 2013

@cpcloud that was an existing error message; I will fix the text.
What was passed was header=range(10).... it must not be formatting it correctly

@jreback (Contributor, Author) commented May 19, 2013

A bit more intuitive now

(Pdb) read_csv(path,tupleize_cols=False,header=range(3),index_col=0)  
*** CParserError: Passed header=[0,1,2] are too many rows for this multi_index of columns
(Pdb) read_csv(path,tupleize_cols=False,header=range(7),index_col=0)  
*** CParserError: Passed header=[0,1,2,3,4,5,6], len of 7, but only 6 lines in file

jreback added 7 commits May 19, 2013 10:20
ENH: catching some invalid option combinations

BUG: fix as_recarray

DOC: io.rst updated
…vel_0, so they are not duplicated

ENH: add options ``multi_index_columns_compat`` both to to_csv and read_csv (default is False),

    to force (when True) the previous behavior of creating a list of tuples (when writing), and
    reading as a list of tuples (and NOT as a MultiIndex)

DOC: add compat flags to io.rst
CLN: changed formatting option: multi_index_columns_compat -> tupleize_cols

BUG: incorrectly writing sparse levels for the multi_index

DOC: slight docs changes

TST: added tests/fixes for disallowed options in to_csv (cols=not None,index=False)

TST: from_csv not accepting tupleize_cols

ENH: allow index=False in to_csv with a multi_index column

     allow reading of a multi_index column with index_col=None

DOC: updates to examples in io.rst and v0.11.1.rst

TST: disallow names, usecols, non-numeric in index_cols

BUG: raise on too many rows in the header if multi_index of columns
TST: better error messages on multi_index column read failure