Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: inconsistent naming convention for read_csv and read_excel column selection #4988

Closed
alefnula opened this issue Sep 25, 2013 · 12 comments · Fixed by #17774
Closed

ENH: inconsistent naming convention for read_csv and read_excel column selection #4988

alefnula opened this issue Sep 25, 2013 · 12 comments · Fixed by #17774
Labels
API Design IO CSV read_csv, to_csv IO Excel read_excel, to_excel
Milestone

Comments

@alefnula
Copy link
Contributor

read_csv uses the usecols parameter, and the read_excel function uses parse_cols parameter. Both parameter do basically the same thing:

usecols : array-like
    Return a subset of the columns.
    Results in much faster parsing time and lower memory usage.
parse_cols : int or list, default None
    * If None then parse all columns,
    * If int then indicates last column to be parsed
    * If list of ints then indicates list of column numbers to be parsed
    * If string then indicates comma separated list of column names and
      column ranges (e.g. "A:E" or "A,C,E:F")

Also maybe supporting the same thing that parse_cols does in usecol wouldn't be so bad?

@jtratner
Copy link
Contributor

That seems reasonable, however we'd probably still have to support the old parameter. My impression is that the only difference is that parse_cols accept a single int: clearly csv can't support the string option, given that it's excel specific and it already accepts a list of int (and I'm sure it accepts None as well).

@alefnula
Copy link
Contributor Author

@jtratner Maybe adding use_cols (since all other parameters use the underscore convention) that does the combination you specified (+ for excel files the excel specific thing) and deprecate the usecols and parse_cols parameters in 0.13, 0.14... Exclude them from docs and drop them in some future version?

@alefnula alefnula reopened this Sep 25, 2013
@alefnula
Copy link
Contributor Author

I can take this if you decide this is reasonable to implement... The question is just what to do with the test cases? Test both or test just the use_cols since usecols andparse_cols` will be just aliases.

@jreback
Copy link
Contributor

jreback commented Sep 25, 2013

you are talking aboutusecols, not use_cols (sp?)

@alefnula
Copy link
Contributor Author

@jreback I proposed a change from usecols to use_cols, since most parameters use the underscore naming convention: filepath_or_buffer, index_col, na_values, keep_default_na, true_values, false_values, parse_dates, keep_date_col, date_parser, na_filter, infer_types, mangle_dupe_cols, tupleize_cols... There are still ones that do not, like: dayfirst, skiprows, escapechar, skipinitialspace, lineterminator, quotechar, skipfooter, chunksize.

But if I noticed correctly the naming conventions are going toward the _ separated ones... Correct me if I'm wrong.

@jtratner
Copy link
Contributor

@alefnula while I get what you're saying, I don't know if it really makes sense to change those keyword arguments now (and escapechar and quotechar are the csv module's kwargs, so thsoe probably shouldn't change). If you're going to do any change, you should definitely change it to parse_cols, to be consistent with excel

@alefnula
Copy link
Contributor Author

@jtratner I think you misunderstood me, but it's my fault, I wasn't clear enough :) So I'll start fresh.

I went through the code of the read_ functions and here are the results:

Function Parameter
read_clipboard usecols
read_csv usecols
read_excel parse_cols
read_fwf usecols
read_hdf -
read_html -
read_json -
read_pickle -
read_sql -
read_stata usecols
read_table usecols
read_sql -

The conclusion is: Most of the functions that have a column selection parameter use usecols, except the read_excel which uses parse_cols.

My suggestion is to:

  1. Add column selection to all read_ functions. Implementing this for the functions that does not support it, at the moment, can just be something like this:

    if usecols:
        return result[cols]

    at the end of the function. I KNOW I oversimplified it :) This is just pseudecode. But doing a refined version of this wouldn't be so bad for the first iteration, and in the second iteration it can be supported directly by the parsers.

  2. Since I first looked only at the read_csv and read_excel I suggested a unification of usecols and parse_cols under the name use_cols, since the underscore convention is used more often in pandas. But I take that back. Five from thirteen functions use the usecols parameter, and this would just cause trouble and wont gain anything. So just forget that renaming part :) The only renaming in this case would be from parse_cols to usecols in read_excel.

  3. usecols should, for all functions, support the same thing it currently does + the integer slicing one [:n].

  4. It will, for the read_ functions that support something fancy, have additional superpowers (like in the case of the read_excel, and maybe some other formats) but the basic functionality from the point 3. should be there for all functions. Additional "superpowers" would be noted in the docs for the specific formats.

This is basically the proposal I had in mind :) I just wasn't clear enough.

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@jreback jreback modified the milestones: 0.18.1, Next Major Release Apr 14, 2016
@jreback jreback modified the milestones: 0.19.0, 0.18.1 Apr 14, 2016
@jreback
Copy link
Contributor

jreback commented Apr 14, 2016

If someone is interested in this, we should deprecate parse_cols in favor of usecols

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@abarber4gh
Copy link
Contributor

abarber4gh commented May 23, 2017

taking a look at this, I plan to deprecate parse_cols for read_excel() in favor of usecols for consistency with other read functions. plan for usecols to support current functionality of read_excel()'s parse_cols and other usecols.
Basically, @alefnula's points 2, 3, and 4 above. point 1 (implement column selection functionality for all read_ functions) probably should be another ENH.

@abarber4gh
Copy link
Contributor

in the current release version (0.20.1), a list of columns returns an empty DF. I.E. pandas.read_excel(test_inputexcel_file, parse_cols=['A','B','D']) results in

Empty DataFrame
Columns: []
Index: []

rather than the expected DataFrame with columns A, B, and D. I plan to address this as well to implement @alefnula's point 3.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2017

@gfyoung anything worth changing for 0.21.0 for this? (prob in read_excel) that isn't already done / in an issue?

@gfyoung
Copy link
Member

gfyoung commented Sep 24, 2017

@jreback : I agree with you from a year ago that parse_cols should be deprecated in favor of usecols. I don't see why we shouldn't aim to do this in 0.21.0.

@jreback jreback modified the milestones: 0.21.0, 1.0 Oct 2, 2017
gfyoung added a commit to forking-repos/pandas that referenced this issue Oct 4, 2017
Will now use "usecols" just like in read_csv.

xref pandas-devgh-4988.
jreback pushed a commit that referenced this issue Oct 4, 2017
@gfyoung gfyoung modified the milestones: 1.0, 0.21.1, 0.21.0 Oct 4, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this issue Oct 16, 2017
alanbato pushed a commit to alanbato/pandas that referenced this issue Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this issue Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design IO CSV read_csv, to_csv IO Excel read_excel, to_excel
Projects
None yet
5 participants