Defaulting to_csv to infer compression #22004

Closed
dhimmel opened this issue Jul 20, 2018 · 2 comments
Labels
Enhancement IO CSV read_csv, to_csv
Milestone

Comments

@dhimmel
Contributor

dhimmel commented Jul 20, 2018

This issue follows up on #17900 by @Dobatymo and @gfyoung, with review from @jreback (thanks!). #17900 added an 'infer' option to compression in _get_handle. The main user-facing benefit is that df.to_csv will be able to infer compression just like pandas.read_csv. However, unlike read_csv, the default value for compression is None rather than 'infer'.

Unfortunately, much of the convenience of compression='infer' is lost if you have to specify it explicitly. In summary, I think it would be a major convenience for the following command to work and automatically perform gzip compression:

df.to_csv('path.csv.gz')
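
For illustration, here is a minimal sketch of the requested behavior. It assumes a pandas version in which to_csv accepts compression='infer', and it passes the argument explicitly so the result does not depend on the default. The sample frame and the temporary path are made up for the example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Hypothetical output location inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "path.csv.gz")

# Explicit 'infer': the '.gz' extension selects gzip compression.
df.to_csv(path, index=False, compression="infer")

# A gzip stream starts with the two magic bytes 0x1f 0x8b.
with open(path, "rb") as handle:
    magic = handle.read(2)

# read_csv infers compression from the extension by default.
round_trip = pd.read_csv(path)
```

If the default were 'infer', the compression argument could simply be dropped from the to_csv call.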

Compatibility assessment

Defaulting to infer would only affect users who are currently using paths with compression extensions but not actually compressing. That's pretty bad practice IMO. Hence, I'm in favor of breaking backwards compatibility and changing the default for compression to infer. It looks like this would go into the major release 0.24?
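
To gauge whether existing files fall into the affected group, one could check whether a file named '*.gz' actually contains gzip data. The sketch below uses only the standard library; the helper names looks_gzipped and misleading_gz_name are hypothetical and not part of pandas.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def misleading_gz_name(path):
    """Return True when the name ends in '.gz' but the content is not gzip."""
    return path.endswith(".gz") and not looks_gzipped(path)


workdir = tempfile.mkdtemp()

# Plain text written under a '.gz' name: the case the new default would change.
plain = os.path.join(workdir, "plain.csv.gz")
with open(plain, "w") as handle:
    handle.write("a,b\n1,2\n")

# Genuinely gzip-compressed file with the same extension.
real = os.path.join(workdir, "real.csv.gz")
with gzip.open(real, "wt") as handle:
    handle.write("a,b\n1,2\n")
```

Here misleading_gz_name(plain) is true and misleading_gz_name(real) is false.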

@WillAyd
Member

WillAyd commented Jul 20, 2018

I agree conceptually. Probably need to handle cases where this would potentially conflict with the compression argument. PRs welcome
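
One possible way to avoid such conflicts is to let any explicit compression value win and to consult the file extension only when the value is 'infer'. The following is a sketch under that assumption, not the actual pandas implementation; resolve_compression and the extension table are hypothetical names chosen for illustration.

```python
# Illustrative extension table; the real mapping lives inside pandas.
EXTENSION_TO_COMPRESSION = {
    ".gz": "gzip",
    ".bz2": "bz2",
    ".zip": "zip",
    ".xz": "xz",
}


def resolve_compression(path_or_buf, compression="infer"):
    """Decide which compression to use for a given target.

    An explicit value (including None) always takes precedence.
    Only 'infer' looks at the file extension.
    """
    if compression != "infer":
        return compression
    # Buffers and other non-path objects carry no extension to inspect.
    if not isinstance(path_or_buf, str):
        return None
    for extension, method in EXTENSION_TO_COMPRESSION.items():
        if path_or_buf.endswith(extension):
            return method
    return None
```

With this rule, resolve_compression("path.csv.gz") gives "gzip", while resolve_compression("path.csv.gz", compression=None) still gives None, so users who want an uncompressed file under a '.gz' name can keep that behavior by saying so explicitly.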

@WillAyd WillAyd added IO CSV read_csv, to_csv Enhancement labels Jul 20, 2018
@WillAyd WillAyd added this to the Contributions Welcome milestone Jul 21, 2018
@dhimmel
Contributor Author

dhimmel commented Jul 21, 2018

I am happy to open a PR. I think the solution will be as simple as changing the compression default to infer in:

pandas/pandas/core/frame.py

Lines 1714 to 1716 in 322dbf4

def to_csv(self, path_or_buf=None, sep=",", na_rep='', float_format=None,
           columns=None, header=True, index=True, index_label=None,
           mode='w', encoding=None, compression=None, quoting=None,

Looks like to_pickle already defaults to infer:

def to_pickle(obj, path, compression='infer', protocol=pkl.HIGHEST_PROTOCOL):

to_json should also probably be switched to default to infer:

def to_json(path_or_buf, obj, orient=None, date_format='epoch',
            double_precision=10, force_ascii=True, date_unit='ms',
            default_handler=None, lines=False, compression=None,
            index=True):

I don't think the other to_* methods have a compression argument but I should double check.
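
A quick way to double check is to inspect the signatures of the DataFrame.to_* methods. This is a rough sketch using the standard inspect module; writers_with_compression is a hypothetical helper, and the result depends on the installed pandas version.

```python
import inspect

import pandas as pd


def writers_with_compression(cls=pd.DataFrame):
    """Map each to_* method that has a `compression` parameter to its default."""
    found = {}
    for name in dir(cls):
        if not name.startswith("to_"):
            continue
        try:
            params = inspect.signature(getattr(cls, name)).parameters
        except (TypeError, ValueError):
            # Skip attributes whose signature cannot be introspected.
            continue
        if "compression" in params:
            found[name] = params["compression"].default
    return found


result = writers_with_compression()
for method_name, default in sorted(result.items()):
    print(method_name, repr(default))
```

Note that a compression parameter does not always mean the same thing: some writers may use it to select a codec inside the file format rather than to wrap the output file, so each hit still needs a manual look.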

@jreback jreback modified the milestones: Contributions Welcome, 0.24.0 Jul 31, 2018
@gfyoung gfyoung closed this as completed in 93f154c Aug 1, 2018
dberenbaum pushed a commit to dberenbaum/pandas that referenced this issue Aug 3, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this issue Oct 1, 2018
3 participants