ENH: inconsistent naming convention for read_csv and read_excel column selection #4988

alefnula · 2013-09-25T22:12:05Z

read_csv uses the usecols parameter, and the read_excel function uses parse_cols parameter. Both parameter do basically the same thing:

usecols : array-like
    Return a subset of the columns.
    Results in much faster parsing time and lower memory usage.

parse_cols : int or list, default None
    * If None then parse all columns,
    * If int then indicates last column to be parsed
    * If list of ints then indicates list of column numbers to be parsed
    * If string then indicates comma separated list of column names and
      column ranges (e.g. "A:E" or "A,C,E:F")

Also maybe supporting the same thing that parse_cols does in usecol wouldn't be so bad?

The text was updated successfully, but these errors were encountered:

jtratner · 2013-09-25T22:24:27Z

That seems reasonable, however we'd probably still have to support the old parameter. My impression is that the only difference is that parse_cols accept a single int: clearly csv can't support the string option, given that it's excel specific and it already accepts a list of int (and I'm sure it accepts None as well).

alefnula · 2013-09-25T22:42:20Z

@jtratner Maybe adding use_cols (since all other parameters use the underscore convention) that does the combination you specified (+ for excel files the excel specific thing) and deprecate the usecols and parse_cols parameters in 0.13, 0.14... Exclude them from docs and drop them in some future version?

alefnula · 2013-09-25T22:49:09Z

I can take this if you decide this is reasonable to implement... The question is just what to do with the test cases? Test both or test just the use_cols since usecols andparse_cols` will be just aliases.

jreback · 2013-09-25T22:51:40Z

you are talking aboutusecols, not use_cols (sp?)

alefnula · 2013-09-25T23:02:50Z

@jreback I proposed a change from usecols to use_cols, since most parameters use the underscore naming convention: filepath_or_buffer, index_col, na_values, keep_default_na, true_values, false_values, parse_dates, keep_date_col, date_parser, na_filter, infer_types, mangle_dupe_cols, tupleize_cols... There are still ones that do not, like: dayfirst, skiprows, escapechar, skipinitialspace, lineterminator, quotechar, skipfooter, chunksize.

But if I noticed correctly the naming conventions are going toward the _ separated ones... Correct me if I'm wrong.

jtratner · 2013-09-26T02:22:00Z

@alefnula while I get what you're saying, I don't know if it really makes sense to change those keyword arguments now (and escapechar and quotechar are the csv module's kwargs, so thsoe probably shouldn't change). If you're going to do any change, you should definitely change it to parse_cols, to be consistent with excel

alefnula · 2013-09-26T20:24:06Z

@jtratner I think you misunderstood me, but it's my fault, I wasn't clear enough :) So I'll start fresh.

I went through the code of the read_ functions and here are the results:

Function	Parameter
read_clipboard	usecols
read_csv	usecols
read_excel	parse_cols
read_fwf	usecols
read_hdf	-
read_html	-
read_json	-
read_pickle	-
read_sql	-
read_stata	usecols
read_table	usecols
read_sql	-

The conclusion is: Most of the functions that have a column selection parameter use usecols, except the read_excel which uses parse_cols.

My suggestion is to:

Add column selection to all read_ functions. Implementing this for the functions that does not support it, at the moment, can just be something like this:
```
if usecols:
    return result[cols]
```
at the end of the function. I KNOW I oversimplified it :) This is just pseudecode. But doing a refined version of this wouldn't be so bad for the first iteration, and in the second iteration it can be supported directly by the parsers.
Since I first looked only at the read_csv and read_excel I suggested a unification of usecols and parse_cols under the name use_cols, since the underscore convention is used more often in pandas. But I take that back. Five from thirteen functions use the usecols parameter, and this would just cause trouble and wont gain anything. So just forget that renaming part :) The only renaming in this case would be from parse_cols to usecols in read_excel.
usecols should, for all functions, support the same thing it currently does + the integer slicing one [:n].
It will, for the read_ functions that support something fancy, have additional superpowers (like in the case of the read_excel, and maybe some other formats) but the basic functionality from the point 3. should be there for all functions. Additional "superpowers" would be noted in the docs for the specific formats.

This is basically the proposal I had in mind :) I just wasn't clear enough.

jreback · 2016-04-14T01:27:13Z

If someone is interested in this, we should deprecate parse_cols in favor of usecols

abarber4gh · 2017-05-23T18:30:47Z

taking a look at this, I plan to deprecate parse_cols for read_excel() in favor of usecols for consistency with other read functions. plan for usecols to support current functionality of read_excel()'s parse_cols and other usecols.
Basically, @alefnula's points 2, 3, and 4 above. point 1 (implement column selection functionality for all read_ functions) probably should be another ENH.

abarber4gh · 2017-05-24T05:52:26Z

in the current release version (0.20.1), a list of columns returns an empty DF. I.E. pandas.read_excel(test_inputexcel_file, parse_cols=['A','B','D']) results in

Empty DataFrame
Columns: []
Index: []

rather than the expected DataFrame with columns A, B, and D. I plan to address this as well to implement @alefnula's point 3.

jreback · 2017-09-23T21:04:20Z

@gfyoung anything worth changing for 0.21.0 for this? (prob in read_excel) that isn't already done / in an issue?

gfyoung · 2017-09-24T08:36:23Z

@jreback : I agree with you from a year ago that parse_cols should be deprecated in favor of usecols. I don't see why we shouldn't aim to do this in 0.21.0.

Will now use "usecols" just like in read_csv. xref pandas-devgh-4988.

closes #4988

closes pandas-dev#4988

alefnula closed this as completed Sep 25, 2013

alefnula reopened this Sep 25, 2013

jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015

jreback modified the milestones: 0.18.1, Next Major Release Apr 14, 2016

jreback added Difficulty Intermediate labels Apr 14, 2016

jreback modified the milestones: 0.19.0, 0.18.1 Apr 14, 2016

jreback added the IO CSV read_csv, to_csv label Jun 2, 2016

jreback mentioned this issue Jun 2, 2016

API: skipfooter or skip_footer? read_csv can't seem to decide #13349

Closed

jreback mentioned this issue Jul 10, 2016

API: Deprecate skip_footer in read_csv #13386

Closed

jreback modified the milestones: 0.19.0, 0.20.0 Jul 15, 2016

chris-b1 mentioned this issue Feb 6, 2017

read_excel gives different data for one and more than one elements in parse_cols setting #15316

Closed

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

abarber4gh mentioned this issue May 24, 2017

ENH: inconsistent naming convention for read_excel column selection (#4988) #16488

Closed

5 tasks

jreback modified the milestones: 0.21.0, Next Major Release May 24, 2017

abarber4gh mentioned this issue May 25, 2017

Implement usecols functionality for read_excel #16510

Closed

jreback modified the milestones: 0.21.0, 1.0 Oct 2, 2017

gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 4, 2017

DEPR: Deprecate parse_cols in read_excel

517b726

Will now use "usecols" just like in read_csv. xref pandas-devgh-4988.

gfyoung mentioned this issue Oct 4, 2017

DEPR: Deprecate parse_cols in read_excel #17774

Merged

jreback closed this as completed in #17774 Oct 4, 2017

jreback pushed a commit that referenced this issue Oct 4, 2017

DEPR: Deprecate parse_cols in read_excel (#17774)

6b29fc0

closes #4988

gfyoung modified the milestones: 1.0, 0.21.1, 0.21.0 Oct 4, 2017

ghost pushed a commit to reef-technologies/pandas that referenced this issue Oct 16, 2017

DEPR: Deprecate parse_cols in read_excel (pandas-dev#17774)

c1e68cf

closes pandas-dev#4988

alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017

DEPR: Deprecate parse_cols in read_excel (pandas-dev#17774)

1da2b76

closes pandas-dev#4988

No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017

DEPR: Deprecate parse_cols in read_excel (pandas-dev#17774)

f8c385b

closes pandas-dev#4988

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: inconsistent naming convention for read_csv and read_excel column selection #4988

ENH: inconsistent naming convention for read_csv and read_excel column selection #4988

alefnula commented Sep 25, 2013

jtratner commented Sep 25, 2013

alefnula commented Sep 25, 2013

alefnula commented Sep 25, 2013

jreback commented Sep 25, 2013

alefnula commented Sep 25, 2013

jtratner commented Sep 26, 2013

alefnula commented Sep 26, 2013

jreback commented Apr 14, 2016

abarber4gh commented May 23, 2017 •

edited

Loading

abarber4gh commented May 24, 2017

jreback commented Sep 23, 2017

gfyoung commented Sep 24, 2017

ENH: inconsistent naming convention for read_csv and read_excel column selection #4988

ENH: inconsistent naming convention for read_csv and read_excel column selection #4988

Comments

alefnula commented Sep 25, 2013

jtratner commented Sep 25, 2013

alefnula commented Sep 25, 2013

alefnula commented Sep 25, 2013

jreback commented Sep 25, 2013

alefnula commented Sep 25, 2013

jtratner commented Sep 26, 2013

alefnula commented Sep 26, 2013

jreback commented Apr 14, 2016

abarber4gh commented May 23, 2017 • edited Loading

abarber4gh commented May 24, 2017

jreback commented Sep 23, 2017

gfyoung commented Sep 24, 2017

abarber4gh commented May 23, 2017 •

edited

Loading