
ENH: allow to_csv to write multi-index columns, read_csv to read with header=list arg #3575

Merged (7 commits) on May 19, 2013

Conversation

@jreback (Contributor) commented May 11, 2013

In theory should close:
#3571, #1651, #3141

Works, but a couple of issues/caveats:

  • index_col needs to be specified as an integer list (can be fixed)
  • header is a list of the rows that make up the column multi-index; any row not in the list
    (e.g. row 2 when passing [0,1,3,5]) is simply skipped, like a comment
  • the writing format might be a bit odd: the column level names go in the first column,
    the other index columns are left blank (they are separated, just == '').
    The names of a multi-index on the index come after the column rows and before the data,
    as a full row (blank after the row names).
  • I am not sure if we should allow df.to_csv('path',index=False) when there are multi-index columns; we could just ban it I guess (mainly because it messes up the write format, and then where do you put the names?)
  • The cols argument needs testing and is probably broken when using a multi-index on the columns (it really should be specified as a tuple I think, but that is work, so maybe just ban it when using multi-index columns)
  • needs more testing
In [14]: df = mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=4)

In [15]: df.to_csv('test.csv')

In [16]: !cat 'test.csv'
C0,,C_l0_g0,C_l0_g1,C_l0_g2
C1,,C_l1_g0,C_l1_g1,C_l1_g2
C2,,C_l2_g0,C_l2_g1,C_l2_g2
C3,,C_l3_g0,C_l3_g1,C_l3_g2
R0,R1,,,
R_l0_g0,R_l1_g0,R0C0,R0C1,R0C2
R_l0_g1,R_l1_g1,R1C0,R1C1,R1C2
R_l0_g2,R_l1_g2,R2C0,R2C1,R2C2
R_l0_g3,R_l1_g3,R3C0,R3C1,R3C2
R_l0_g4,R_l1_g4,R4C0,R4C1,R4C2

In [17]: res = read_csv('test.csv',header=[0,1,2,3],index_col=[0,1])

In [18]: res.index
Out[18]: 
MultiIndex
[(u'R_l0_g0', u'R_l1_g0'), (u'R_l0_g1', u'R_l1_g1'), (u'R_l0_g2', u'R_l1_g2'), (u'R_l0_g3', u'R_l1_g3'), (u'R_l0_g4', u'R_l1_g4')]

In [19]: res.columns
Out[19]: 
MultiIndex
[(u'C_l0_g0', u'C_l1_g0', u'C_l2_g0', u'C_l3_g0'), (u'C_l0_g1', u'C_l1_g1', u'C_l2_g1', u'C_l3_g1'), (u'C_l0_g2', u'C_l1_g2', u'C_l2_g2', u'C_l3_g2')]

In [20]: res
Out[20]: 
C0              C_l0_g0 C_l0_g1 C_l0_g2
C1              C_l1_g0 C_l1_g1 C_l1_g2
C2              C_l2_g0 C_l2_g1 C_l2_g2
C3              C_l3_g0 C_l3_g1 C_l3_g2
R0      R1                             
R_l0_g0 R_l1_g0    R0C0    R0C1    R0C2
R_l0_g1 R_l1_g1    R1C0    R1C1    R1C2
R_l0_g2 R_l1_g2    R2C0    R2C1    R2C2
R_l0_g3 R_l1_g3    R3C0    R3C1    R3C2
R_l0_g4 R_l1_g4    R4C0    R4C1    R4C2

In [21]: res.index.names
Out[21]: ['R0', 'R1']

In [22]: res.columns.names
Out[22]: ['C0', 'C1', 'C2', 'C3']
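(mkdf above is pandas' makeCustomDataframe testing helper; a rough sketch of an equivalent frame built from the public API, assuming the same label scheme as the output above, would be:)

import pandas as pd

# 4-level column index and 2-level row index, labels matching the demo above
columns = pd.MultiIndex.from_arrays(
    [['C_l%d_g%d' % (lvl, g) for g in range(3)] for lvl in range(4)],
    names=['C0', 'C1', 'C2', 'C3'])
index = pd.MultiIndex.from_arrays(
    [['R_l%d_g%d' % (lvl, g) for g in range(5)] for lvl in range(2)],
    names=['R0', 'R1'])
data = [['R%dC%d' % (r, c) for c in range(3)] for r in range(5)]
df = pd.DataFrame(data, index=index, columns=columns)
df.to_csv('test.csv')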

@cpcloud (Member) commented May 11, 2013

Might it be useful to be able to pass the index/header names if you know them a priori?

@jreback (Contributor, Author) commented May 11, 2013

can u give an example?

@cpcloud (Member) commented May 11, 2013

Hm. Maybe it's not that useful since u can just read them in and then filter out what u don't want. I'll give an example anyway.

Say you had column index names stored somewhere e.g., a text file. You only want to read in the levels of the column index with those names, so you specify e.g., header=['c0', 'c1'] and only the levels from c0 and c1 will be in the resulting index.

@jreback (Contributor, Author) commented May 11, 2013

you kind of want a uselevels argument analogous to usecols; you still need header to indicate which rows to use for the column index in the first place, e.g. header=[0,1,2], which says you have 3 levels, and then uselevels=['c0','c1'] means only take the first 2 (and discard the 3rd)?
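(No uselevels argument exists; a minimal workaround sketch, assuming the test.csv written above, is to read every header row and then drop the unwanted column levels afterwards:)

import pandas as pd

# read all four header rows, then keep only the first two column levels
res = pd.read_csv('test.csv', header=[0, 1, 2, 3], index_col=[0, 1])
res.columns = res.columns.droplevel([2, 3])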

@cpcloud (Member) commented May 11, 2013

@jreback right, right. sorry, yes header is for rows to use as headers. yeah or only the columns named c0, c1, ..., cN.

@jreback (Contributor, Author) commented May 13, 2013

@y-p would be interested to hear your take on this :)

@ghost commented May 13, 2013

It's consistent with the way the row index is handled, and as long as it doesn't try to do
auto-detection and keeps the user in charge it should work fine. The issues are just
those of back-compat which you've already raised, so to reiterate:

  1. Is it necessary to change the default behavior rather than add a flag? This will break someone's code,
    as you mentioned. With the recent dupe colname mangling, we opted for an arg with a default
    value that keeps back-compat, and added a note that it will flip in 0.12, giving people a heads-up
    about the breaking change.
  2. If I like the current behavior, or have code that depends on it (skips rows, for example),
    how can I ensure that's what my output looks like?
  3. Same for input: if I have existing data with intentional tuples, how do I ensure I get
    the old behaviour?

Speaking of dupe cols, this is obviously a misuse since the save/load arguments don't match,
but note the dupe unnamed columns; I don't think we meant that to happen:

In [1]: df = mkdf(5,3,r_idx_nlevels=3,c_idx_nlevels=4)
In [2]: df.to_csv('test.csv')
In [7]: res = pd.read_csv('test.csv',header=[0,1,2,3],index_col=[0,1])
In [10]: res.columns
Out[10]: 
MultiIndex
[(u'Unnamed: 2', u'Unnamed: 2', u'Unnamed: 2', u'Unnamed: 2'), (u'C_l0_g0', u'C_l1_g0', u'C_l2_g0', u'C_l3_g0'), (u'C_l0_g1', u'C_l1_g1', u'C_l2_g1', u'C_l3_g1'), (u'C_l0_g2', u'C_l1_g2', u'C_l2_g2', u'C_l3_g2')]

C0              Unnamed: 2 C_l0_g0 C_l0_g1 C_l0_g2
C1              Unnamed: 2 C_l1_g0 C_l1_g1 C_l1_g2
C2              Unnamed: 2 C_l2_g0 C_l2_g1 C_l2_g2
C3              Unnamed: 2 C_l3_g0 C_l3_g1 C_l3_g2
R0      R1                                        
R_l0_g0 R_l1_g0    R_l2_g0    R0C0    R0C1    R0C2
R_l0_g1 R_l1_g1    R_l2_g1    R1C0    R1C1    R1C2
R_l0_g2 R_l1_g2    R_l2_g2    R2C0    R2C1    R2C2
R_l0_g3 R_l1_g3    R_l2_g3    R3C0    R3C1    R3C2
R_l0_g4 R_l1_g4    R_l2_g4    R4C0    R4C1    R4C2

@jreback (Contributor, Author) commented May 13, 2013

  1. The question is: are we 'forcing' a multi-index on columns to now be written in the new format, rather than as a list of tuples? The answer is yes. I suppose I could add an option not to do that, but I would argue that the default should be the new format.

the read code will work in either case (if you have a list of tuples, then header=0 will interpret it, while header=[0,1,2] will work in the regular case).

So it's only a back-compat issue if you are actually using the file and depending on the list of tuples (i.e. not reading it with pandas, but using it externally)

  2. I don't understand, you want to skip rows?

  3. looks like a bug....

so I guess we need 1 (or 2) more options

Can you have a sub-category? (e.g. displays.csv.multi_index?)

displays.write_csv_multi_index_columns=True
displays.read_csv_multi_index_as_tuples=True

would be the new behavior

@jreback (Contributor, Author) commented May 13, 2013

Ok, for your example of an underspecified index_col I now return this

(Pdb) result
C0              Unnamed: 2_level_0 C_l0_g0 C_l0_g1 C_l0_g2
C1              Unnamed: 2_level_1 C_l1_g0 C_l1_g1 C_l1_g2
C2              Unnamed: 2_level_2 C_l2_g0 C_l2_g1 C_l2_g2
C3              Unnamed: 2_level_3 C_l3_g0 C_l3_g1 C_l3_g2
R0      R1                                                
R_l0_g0 R_l1_g0            R_l2_g0    R0C0    R0C1    R0C2
R_l0_g1 R_l1_g1            R_l2_g1    R1C0    R1C1    R1C2
R_l0_g2 R_l1_g2            R_l2_g2    R2C0    R2C1    R2C2
R_l0_g3 R_l1_g3            R_l2_g3    R3C0    R3C1    R3C2
R_l0_g4 R_l1_g4            R_l2_g4    R4C0    R4C1    R4C2

@ghost commented May 13, 2013

Aside from what the default should be, is it still possible to specify that I want the old
behavior? That is, writing out tuples, and reading in tuples as-is.

For people who have existing code, the bare minimum is to allow them
to add a flag to keep the old behavior. Forcing everyone to rewrite their code
to a larger extent than that is a bad move IMO.

@jreback (Contributor, Author) commented May 13, 2013

Here's how to specify the old behavior (the option name is a bit long, but I wanted to be explicit);
the default is the new behavior (which will still read the old style)

In [4]: df = mkdf(5,3,r_idx_nlevels=2,c_idx_nlevels=4)

In [5]: df.to_csv('test.csv',multi_index_columns_compat=True)

In [6]: !cat 'test.csv'
R0,R1,"('C_l0_g0', 'C_l1_g0', 'C_l2_g0', 'C_l3_g0')","('C_l0_g1', 'C_l1_g1', 'C_l2_g1', 'C_l3_g1')","('C_l0_g2', 'C_l1_g2', 'C_l2_g2', 'C_l3_g2')"
R_l0_g0,R_l1_g0,R0C0,R0C1,R0C2
R_l0_g1,R_l1_g1,R1C0,R1C1,R1C2
R_l0_g2,R_l1_g2,R2C0,R2C1,R2C2
R_l0_g3,R_l1_g3,R3C0,R3C1,R3C2
R_l0_g4,R_l1_g4,R4C0,R4C1,R4C2

In [7]: result = read_csv('test.csv',header=0,index_col=[0,1],multi_index_columns_compat=True)

In [8]: result
Out[8]: 
                ('C_l0_g0', 'C_l1_g0', 'C_l2_g0', 'C_l3_g0') ('C_l0_g1', 'C_l1_g1', 'C_l2_g1', 'C_l3_g1') ('C_l0_g2', 'C_l1_g2', 'C_l2_g2', 'C_l3_g2')
R0      R1                                                                                                                                            
R_l0_g0 R_l1_g0                                         R0C0                                         R0C1                                         R0C2
R_l0_g1 R_l1_g1                                         R1C0                                         R1C1                                         R1C2
R_l0_g2 R_l1_g2                                         R2C0                                         R2C1                                         R2C2
R_l0_g3 R_l1_g3                                         R3C0                                         R3C1                                         R3C2
R_l0_g4 R_l1_g4                                         R4C0                                         R4C1                                         R4C2

@ghost commented May 14, 2013

How about tupleized_cols=bool or similar?

@jreback (Contributor, Author) commented May 14, 2013

tupleized_columns=bool

to be consistent with mangle_dup_columns

or maybe change that to mangle_dup_cols for consistency with index_col?

@jreback (Contributor, Author) commented May 14, 2013

I guess the default should be True for 0.11.1, then False (the new behavior) in 0.12,
so the API is available but not the default yet?

@ghost commented May 14, 2013

(+1) + (+1)

@jreback (Contributor, Author) commented May 14, 2013

@y-p my mistake on mangle_dup_columns; I had put that in the parsers docstring rather than the actual option name (which is correct), mangle_dup_cols.

@jreback (Contributor, Author) commented May 14, 2013

of course you can suggest, and I'll take it!

@ghost commented May 14, 2013

I suggest you open a note-to-selves issue for the 0.12 milestone, to remind us to flip the default
and update the docstrings.

@jreback (Contributor, Author) commented May 14, 2013

typos fixed

@jreback (Contributor, Author) commented May 14, 2013

@y-p I can allow index=False in to_csv and then allow index_col=None in read_csv, BUT then you lose any columns.names; or we could disallow writing without an index

what do you think?
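(A rough sketch of the trade-off being described, using the 4-level frame from the earlier example and assuming the behavior proposed here; 'no_index.csv' is just an illustrative path:)

# with index=False there is no index column for the names row to anchor against
df.to_csv('no_index.csv', index=False, tupleize_cols=False)
res = read_csv('no_index.csv', header=[0, 1, 2, 3], index_col=None, tupleize_cols=False)
# res.columns should come back as a MultiIndex, but res.columns.names is
# expected to be all None -- the level names have nowhere to live on disk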

@jreback (Contributor, Author) commented May 15, 2013

@y-p @cpcloud take a look at the csv as written on disk. There is an empty line where the row index names should go (since they are None here). It's very hard to parse this if that line is not there (because it's ambiguous whether there are row labels or whether they should be part of the data). I could solve this with another option, but I don't like that solution.

Any thoughts?

In [11]: 
df = DataFrame(np.random.randint(0,10,size=(3,3)), 
                                 columns=MultiIndex.from_tuples([('bah', 'foo'), 
                                                                 ('bah', 'bar'), 
                                                                 ('ban', 'baz')],
                                                                names=None))

In [13]: df.to_csv('test.csv',tupleize_cols=False)

In [14]: !cat 'test.csv'
,bah,bah,ban
,foo,bar,baz
,,,
0,2,0,6
1,4,0,5
2,1,4,2

In [16]: read_csv('test.csv',tupleize_cols=False,header=[0,1],index_col=0)
Out[16]: 
   bah       ban
   foo  bar  baz
0    2    0    6
1    4    0    5
2    1    4    2

@cpcloud (Member) commented May 15, 2013

Maybe add a comment to the right of the empty row so that new users aren't left wondering what it is. That forces a specific comment syntax though...

@jreback (Contributor, Author) commented May 15, 2013

what about:

Unnamed Index Label 0,,,

(which the parser can then ignore, similar to what we do with unnamed columns)
?

@cpcloud (Member) commented May 15, 2013

should this throw on header=list_that_is_too_long or header=list_that_contains_rows_not_in_header, e.g. (using your 'test.csv'),

if any(h not in parsed_rows_before_comma_line for h in header):
    raise ParserError("passed header rows not all in header when parsing MultiIndex")

when for example a user does something like:

read_csv('test.csv', tupleize_cols=False, header=range(10), index_col=0)

Slightly OT: allow xrange objects in header argument?

@jreback (Contributor, Author) commented May 15, 2013

So I think these are reasonable

In [3]: read_csv('test.csv', tupleize_cols=False, header=range(10), index_col=0)
CParserError: Passed header=0 but only 6 lines in file

In [4]: read_csv('test.csv', tupleize_cols=False, header=range(3), index_col=0)
Exception: Passed header=[0,1,2,3] are too many rows for this multi_index of columns

no real reason to support xrange; this is a really short sequence (and we don't support it in
any other fields)
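(A cheap pre-check in that spirit, done outside the parser; a sketch assuming the test.csv from above and a hypothetical header list:)

header = list(range(3))
with open('test.csv') as f:
    n_lines = sum(1 for _ in f)   # total physical lines in the file
if max(header) >= n_lines:
    raise ValueError('header rows %s exceed the %d lines in the file' % (header, n_lines))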

@jreback (Contributor, Author) commented May 16, 2013

@wesm @y-p what do you think?

@ghost commented May 18, 2013

my 2c:

  • No need for xrange support. More generally, no need to add features just "because we can";
    there's always an associated cost.
  • Fine to have a row of missing values to make parsing easier on roundtrip,
    when the user specifies tupleize_cols=True.
  • Disallowing index=False with tupleize_cols=True seems
    reasonable to me.
  • Note that csv is just an export format; it's pushing things when you try to turn it
    into a dataframe serialization format (the empty row-labels line, the index=False problem,
    the inevitable request for control over including index names).
  • Slight déjà vu of New Excel functionality #2478, Excelfancy #2370, really.

@jreback (Contributor, Author) commented May 18, 2013

@y-p

I didn't disable index=False with tupleize_cols=True; I just added a test for it. It's not useful, but allowed. (This is allowed in 0.11.0, in fact.)

So I think this is ready to go

@jreback (Contributor, Author) commented May 18, 2013

index=False and tupleize=True (in 0.11.1), same as in prior versions

In [2]: df =  DataFrame(np.random.randint(0,10,size=(3,3)),columns=MultiIndex.from_tuples([('bah', 'foo'),('bah', 'bar'),('ban', 'baz')],names=['first','second']))

In [3]: df
Out[3]: 
first   bah       ban
second  foo  bar  baz
0         1    6    2
1         0    3    8
2         0    4    3

In [4]: df.to_csv('test.csv',tupleize_cols=True,index=False)

In [5]: !cat 'test.csv'
"('bah', 'foo')","('bah', 'bar')","('ban', 'baz')"
1,6,2
0,3,8
0,4,3

In [6]: read_csv('test.csv',header=0,tupleize_cols=True,index_col=None)
Out[6]: 
   ('bah', 'foo')  ('bah', 'bar')  ('ban', 'baz')
0               1               6               2
1               0               3               8
2               0               4               3
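(If a file was written with tupleize_cols=True, a minimal sketch for rebuilding a MultiIndex from the stringified tuples after the fact, assuming the test.csv written above:)

import ast
import pandas as pd

res = pd.read_csv('test.csv', header=0)
# each column label is the string repr of a tuple, e.g. "('bah', 'foo')";
# literal_eval turns it back into a real tuple
res.columns = pd.MultiIndex.from_tuples([ast.literal_eval(c) for c in res.columns])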

@cpcloud (Member) commented May 18, 2013

@jreback @y-p sorry about the xrange noise. Absolutely no need for that there; want to avoid feature creep as much as possible.

@y-p I agree that this is bordering on dataframe serialization, but I think that if pandas supports MultiIndexes then it should support them as completely as possible.

@jreback I think this error message is a bit cryptic: CParserError: Passed header=0 but only 6 lines in file. Why not something like "Elements of header list contain N but there are only M lines in file", substituting N and M for the associated values? Because if there are 6 lines in a file, why is header=0 invalid?

@jreback (Contributor, Author) commented May 18, 2013

@cpcloud that was an existing error message; I will fix the text.
What was passed was header=range(10).... it must not be formatting it correctly

@jreback (Contributor, Author) commented May 19, 2013

A bit more intuitive now

(Pdb) read_csv(path,tupleize_cols=False,header=range(3),index_col=0)  
*** CParserError: Passed header=[0,1,2] are too many rows for this multi_index of columns
(Pdb) read_csv(path,tupleize_cols=False,header=range(7),index_col=0)  
*** CParserError: Passed header=[0,1,2,3,4,5,6], len of 7, but only 6 lines in file

jreback added 7 commits May 19, 2013 10:20
ENH: catching some invalid option combinations

BUG: fix as_recarray

DOC: io.rst updated
…vel_0, so they are not duplicated

ENH: add options ``multi_index_columns_compat`` both to to_csv and read_csv (default is False),

    to force (when True) the previous behavior of creating a list of tuples (when writing), and
    reading as a list of tuples (and NOT as a MultiIndex)

DOC: add compat flags to io.rst
CLN: changed formatting option: multi_index_columns_compat -> tupleize_cols

BUG: incorrectly writing sparse levels for the multi_index

DOC: slight docs changes

TST: added tests/fixes for disallowed options in to_csv (cols=not None,index=False)

TST: from_csv not accepting tupleize_cols

ENH: allow index=False in to_csv with a multi_index column

     allow reading of a multi_index column with index_col=None

DOC: updates to examples in io.rst and v0.11.1.rst

TST: disallow names, usecols, non-numeric in index_cols

BUG: raise on too many rows in the header if multi_index of columns
TST: better error messages on multi_index column read failure