ENH: allow to_csv to write multi-index columns, read_csv to read with header=list arg #3575
Might it be useful to be able to pass the index/header names if you know them a priori? |
can u give an example? |
Hm. Maybe it's not that useful since u can just read them in and then filter out what u don't want. I'll give an example anyway. Say you had column index names stored somewhere e.g., a text file. You only want to read in the levels of the column index with those names, so you specify e.g., |
you kind of want a |
@jreback right, right. sorry, yes header is for rows to use as headers. yeah or only the columns named c0, c1, ..., cN. |
@y-p would be interested to hear your take on this :) |
It's consistent with the way the row index is handled, and as long as it doesn't try to do
Speaking of dupe cols, obviously a misuse since the save/load arguments don't match,
|
The read code will work in either case (if you have a list of tuples, then header=0 will interpret it, while header=[0,1,2] will work in the regular case). So it's only a back-compat issue if you are actually using the file and depending on the list of tuples (and not reading with pandas, but using it externally)
so I guess having 1 (or 2) more options Can you have a sub-category? (e.g.
would be the new behavior |
Ok for your example of an underspecified
|
Aside from what the default should be, is it still possible to specify that I want to have For people who have existing code, the bare minimum is to allow to them |
Here's how to specify the old behavior (the option name is a bit long, but I wanted to be explicit),
|
How about |
to be consistent with or maybe change that to |
I guess default should be True for 0.11.1, then False (new behavior) in 0.12 |
(+1) + (+1) |
@y-p my mistake on |
of course you can suggest, and I'll take it! |
suggest you open a |
typos fixed |
@y-p I can allow what do you think? |
@y-p @cpcloud take a look at the csv as written on disk. There is an empty line where the row labels should go (as there are none). It's very hard to parse this if this line is not there (because it's ambiguous whether there are row labels or they should be part of the data). I could solve this with another option, but I don't like that solution. Any thoughts?
|
Maybe adding a comment line to the right of the empty row so that new users aren't wondering what this is. That forces a specific comment syntax though... |
what about:
(which the parser can then ignore, similar to what we do with unnamed columns) |
should this throw on

    if header not in parsed_rows_before_comma_line:
        raise ParserError("passed header rows not all in header when parsing MultiIndex")

when for example a user does something like:

    read_csv('test.csv', tupleize_cols=False, header=range(10), index_col=0)

Slightly OT: allow |
So I think these are reasonable
no real reason to support |
my 2c:
|
I didn't disable So this I think is ready to go |
|
@jreback @y-p sorry about the @y-p i agree that this is bordering on df srlztn but i think that if pandas supports mi's then it should support them as completely as possible. @jreback i think this err msg is a bit cryptic |
@cpcloud that was an existing error message, I will fix that (the text) |
A bit more intuitive now
|
GH3571, GH1651, GH3141
ENH: catching some invalid option combinations
BUG: fix as_recarray
DOC: io.rst updated
…vel_0, so they are not duplicated
ENH: add options ``multi_index_columns_compat`` both to to_csv and read_csv (default is False), to force (when True) the previous behavior of creating a list of tuples (when writing), and reading as a list of tuples (and NOT as a MultiIndex)
DOC: add compat flags to io.rst
CLN: changed formatting option: multi_index_columns_compat -> tupleize_cols
BUG: incorrectly writing sparse levels for the multi_index
DOC: slight docs changes
TST: added tests/fixes for disallowed options in to_csv (cols=not None,index=False)
TST: from_csv not accepting tupleize_cols
ENH: allow index=False in to_csv with a multi_index column
     allow reading of a multi_index column with index_col=None
DOC: updates to examples in io.rst and v0.11.1.rst
TST: disallow names, usecols, non-numeric in index_cols
BUG: raise on too many rows in the header if multi_index of columns
TST: better error messages on multi_index column read failure
ENH: allow to_csv to write multi-index columns, read_csv to read with header=list arg
In theory should close:
#3571, #1651, #3141
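A minimal round-trip sketch of what the title describes. It assumes a current pandas release rather than the 0.11.1 branch under review, so the exact on-disk layout may differ from what this PR produced. The frame, level names and values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical example frame: level names and values are made up.
columns = pd.MultiIndex.from_tuples(
    [("a", "x"), ("a", "y"), ("b", "x")], names=["upper", "lower"]
)
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=pd.Index(["r0", "r1"], name="row"),
    columns=columns,
)

# Write to an in-memory string; each column level becomes one header row.
text = df.to_csv()
print(text)

# header=[0, 1] lists the rows that hold the column levels,
# index_col=0 picks the column that holds the row labels.
result = pd.read_csv(io.StringIO(text), header=[0, 1], index_col=0)
print(result.columns.names)
```

Printing `text` is the quickest way to check where the index name lands relative to the header rows and the data.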
Works, but a couple of issues/caveats:
- index_col needs to be specified as an integer list (can be fixed)
- a row that is skipped (e.g. [0,1,3,5]) will just be skipped, like a comment
- other index_cols are blanks (they are separated, just == '')
- The names of a multi-index on the index are after the columns and before the data, and are a full row (but blank after the row names).
- I am not sure if we should allow df.to_csv('path',index=False) when there are multi-index columns; could just ban it I guess (mainly as it screws up the write format, and then where do you put the names?)
- The cols argument needs testing and is probably broken when using a multi-index on the columns (it really should be specified as a tuple I think, but that is work, so maybe just ban it when using multi-index columns)
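On the index=False question, here is a small sketch of what that case looks like. It assumes a current pandas release, not the branch in this PR, and the frame and level names are invented. In the version assumed here, the column level names have no index column to sit in, so they are not written and do not come back on read.

```python
import io

import pandas as pd

# Hypothetical example frame: level names and values are made up.
columns = pd.MultiIndex.from_tuples(
    [("a", "x"), ("a", "y"), ("b", "x")], names=["upper", "lower"]
)
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=columns)

# No index column is written, so only the level values appear in the header rows.
text = df.to_csv(index=False)
print(text)

# Read both header rows back as a MultiIndex; there is no index column to name.
result = pd.read_csv(io.StringIO(text), header=[0, 1])
print(result.columns.names)
```

If the level names matter, writing with the index (the default) keeps them in the file.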