DEPR: Warn about Series.to_csv signature alignment #21868

Closed
gfyoung wants to merge 1 commit into master from to-csv-unify

Conversation

gfyoung
Member

@gfyoung gfyoung commented Jul 12, 2018

Warns that the signature of Series.to_csv will be aligned with that of DataFrame.to_csv.

In anticipation, we have moved DataFrame.to_csv to generic.py so that we can later delete the
Series.to_csv implementation, and allow it to adopt DataFrame's to_csv due to inheritance.

Closes #19745.

cc @dahlbaek

@gfyoung gfyoung added the IO CSV (read_csv, to_csv) and Deprecate (Functionality to remove in pandas) labels Jul 12, 2018
@gfyoung gfyoung added this to the 0.24.0 milestone Jul 12, 2018
@gfyoung gfyoung added the Output-Formatting (__repr__ of pandas objects, to_string) and API Design labels Jul 12, 2018
@jreback
Contributor

jreback commented Jul 12, 2018

cc @toobaz @jorisvandenbossche

@gfyoung gfyoung force-pushed the to-csv-unify branch 2 times, most recently from 4dc9a6f to 8dda35d Compare July 12, 2018 05:19
@codecov

codecov bot commented Jul 12, 2018

Codecov Report

Merging #21868 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #21868      +/-   ##
==========================================
+ Coverage   91.91%   91.91%   +<.01%     
==========================================
  Files         164      164              
  Lines       49992    50008      +16     
==========================================
+ Hits        45951    45967      +16     
  Misses       4041     4041
Flag      | Coverage Δ
#multiple | 90.3% <100%> (ø) ⬆️
#single   | 42.15% <39.28%> (-0.01%) ⬇️

Impacted Files         | Coverage Δ
pandas/core/frame.py   | 97.18% <ø> (-0.02%) ⬇️
pandas/core/generic.py | 96.47% <100%> (+0.01%) ⬆️
pandas/core/series.py  | 94.18% <100%> (+0.07%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 365eac4...38aebd0. Read the comment docs.

@toobaz
Member

toobaz commented Jul 12, 2018

My understanding from #19745 was that the differences were limited (path, ordering, header default value), and hence that we could raise warnings, and specific errors, only for those.

Couldn't we change the first argument to path_or_buf, append the path argument at the end of the signature, and then emit the warning (and assign its value to path_or_buf) if path is not None?

Similarly, couldn't we set the default value of header to None, and translate it to False (if it is None) but emitting the FutureWarning?

I know we would still not catch the ordering change, but

  • probably the large majority of users will get a warning from header - and hence they will be informed about the signature change
  • I expect very very few users to pass index= (which is the first argument after buf) as a positional argument

Apart from the comment above, my main problem with this PR is that it seems to enforce (at least if one wants to suppress the warning) passing path as a named arg, which I think is unnatural and unneeded.
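
A minimal sketch of the header idea above, assuming a sentinel default of None. The helper name `resolve_header` and the warning text are hypothetical and are not taken from this PR; the sketch only shows how "unset" can be told apart from an explicit False.

```python
import warnings


def resolve_header(header=None):
    """Return the effective header flag for a hypothetical Series.to_csv."""
    if header is None:
        # The caller did not choose a value: keep the old behaviour (False)
        # but announce that the default is going to change.
        warnings.warn(
            "The default value of 'header' will change to True to match "
            "DataFrame.to_csv; pass header=False to keep the current output.",
            FutureWarning,
            stacklevel=2,
        )
        return False
    # An explicit True or False is respected silently.
    return header
```

With this shape, callers who already pass header= explicitly never see the warning, which is the property the comment above is after.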

encoding=encoding, compression=compression,
date_format=date_format, decimal=decimal)

if path is not None:
Member

@toobaz toobaz Jul 12, 2018

There is no valid call that fails this test, right?

Member Author

I'm not sure what you mean by this question.

Contributor

@toobaz As far as I can tell, the default value of path is None, so if path is not explicitly assigned another value, the branch is not executed.

Contributor

Thinking this over, wouldn't it be better to be even more conservative here? One of my gripes with the inconsistency between Series.to_csv and DataFrame.to_csv is that I am not always completely aware whether a given call will return a Series or a DataFrame. Thus, I may inadvertently think that I am working with a DataFrame and pass the path_or_buf keyword argument, when I am in fact working with a Series. In that case, the warning would not trigger, but I would get unexpected behaviour due to the difference in default value of header.

@gfyoung
Member Author

gfyoung commented Jul 12, 2018

@toobaz : The original PR IMO is only a band-aid to the larger issue that Series.to_csv is not in sync with DataFrame.to_csv. That's why I'm going all in and issuing a warning as is. A major release (0.24.0) is the best time to do this.

If your main issue with this PR is that people need to pass in a keyword arg for path, I'm okay with that "inconvenience", especially since DataFrame.to_csv is much more commonly used IMO than Series.to_csv. Also, it will certainly make people aware of the fact that we are going to switch signatures in the future.

A single keyword inconvenience is worth it to resolve all of the keyword-argument mismatches and awkwardness that come from the two functions not being synchronized, both for end users and for us as maintainers.


# Result is only a string if no path provided, otherwise None.
result = df.to_csv(**to_csv_kwargs)

if path is None:
Contributor

@dahlbaek dahlbaek Jul 12, 2018

If path is not None, None is implicitly returned. But in that case, the value of result is also None. So there is no reason to test the value of path; it suffices to return result in any case.

Member

Sorry, I must be missing something obvious, but can you provide a valid example of calling this method with path=None?

Contributor

@dahlbaek dahlbaek Jul 12, 2018

I should think

import pandas as pd
pd.DataFrame({"a": [1, 2]}).to_csv()

would be a valid call, and would print the contents to screen. Maybe I am wrong.

Member Author

Yes, that is indeed a perfectly valid way to use path=None.

Member

I was indeed missing something obvious ;-)
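
The point made in this thread (the inner call already returns None whenever a target is given, so the wrapper can return its result unconditionally) can be sketched without pandas. Both function names below are hypothetical stand-ins built on the standard csv module; they are not the pandas implementation.

```python
import csv
import io


def rows_to_csv(rows, path_or_buf=None):
    """Write rows as CSV; return the text only when no target is given."""
    if path_or_buf is None:
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        return buffer.getvalue()
    if hasattr(path_or_buf, "write"):
        # File-like object: write into it and return nothing.
        csv.writer(path_or_buf, lineterminator="\n").writerows(rows)
        return None
    with open(path_or_buf, "w", newline="") as handle:
        csv.writer(handle, lineterminator="\n").writerows(rows)
    return None


def series_like_to_csv(rows, path=None):
    # Hypothetical wrapper: forwarding the result unconditionally is enough,
    # because the inner call returns None whenever a target is given.
    return rows_to_csv(rows, path_or_buf=path)
```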

@dahlbaek
Contributor

dahlbaek commented Jul 12, 2018

Regarding silencing the warning, perhaps one could add to the warning message that it can be silenced by explicitly casting the Series instance ser to DataFrame using DataFrame(ser)? As far as I can tell, this would be a future-proof method (i.e. to_csv is implemented on Series by first casting to DataFrame and then calling to_csv on the resulting DataFrame). In that case, one could even go to the extreme and simply always show the warning message when Series.to_csv is called.

@toobaz
Member

toobaz commented Jul 12, 2018

Regarding silencing the warning, perhaps one could add to the warning message that it can be silenced by explicitly casting the Series instance ser to DataFrame using DataFrame(ser)?

We might as well state that we will completely remove Series.to_csv! I don't see any reason for trying to change the way users use a function which works just fine.

@toobaz
Member

toobaz commented Jul 12, 2018

especially since DataFrame.to_csv is much more commonly used IMO than Series.to_csv

Again, having a seldom used function is not a good argument, to me at least, to make it more annoying to use.

@dahlbaek
Contributor

dahlbaek commented Jul 12, 2018

We might as well state that we will completely remove Series.to_csv! I don't see any reason for trying to change the way users use a function which works just fine.

Removing Series.to_csv is not the same as changing the signature, so stating that Series.to_csv will be removed would be false information. No matter what you do, the user will be bothered by a signature change. This way, the user has three options:

  1. Live with the warnings and do not change the arguments. Result: Warnings, not future-proof.
  2. Change the arguments, and live with the warnings. Result: Warnings, future-proof.
  3. Avoid warnings by not using the "unstable" method (i.e. explicitly recast to DataFrame). Result: No warnings, future-proof.

Note that the third option is simply the user implementing the fix by hand, which is why it is future-proof and does not emit warnings. I think those all seem like reasonable options for a short deprecation period.

@gfyoung
Member Author

gfyoung commented Jul 12, 2018

So let me meet you guys half-way:

  • Don't warn on the path argument
  • Change header to default to None
  • Warn if header isn't explicitly set

How does that sound?

@toobaz
Member

toobaz commented Jul 12, 2018

So let me meet you guys half-way:

  • Don't warn on the path argument
  • Change header to default to None
  • Warn if header isn't explicitly set

How does that sound?

I think that's reasonable... I just think that moving path as last argument, and warning if path is not None, is superior (because those people who passed path= will get the warning): can you point at any downside?

@gfyoung
Member Author

gfyoung commented Jul 12, 2018

I just think that moving path as last argument, and warning if path is not None, is superior (because those people who passed path= will get the warning): can you point at any downside?

@toobaz : I'm not sure I fully understand you here. What do you mean by "moving path as last argument"? It sounds like you're just repeating what this PR currently does?

@toobaz
Member

toobaz commented Jul 12, 2018

I think those all seem like reasonable options for a short deprecation period.

My point is: if a given thing works now and will work in the future in the same way, no deprecation period should force users to do things otherwise.

In this case, the "thing" is "passing a filename as first positional argument", which (correct me if I'm wrong) works exactly as it should, and will.

Contrast this with the warning for header=, which will rightly force people to change their calls because the behavior of that argument will change.

@toobaz
Member

toobaz commented Jul 12, 2018

@toobaz : I'm not sure I fully understand you here. What do you mean by "moving path as last argument"?

My idea is that Series.to_csv could look like:

def to_csv(self, path_or_buf=None,
           [rest of signature],
           path=None):
    if path is not None:
        # emit warning about change of argument name
        assert path_or_buf is None, "You cannot pass both 'path' and 'path_or_buf'."
        path_or_buf = path
    [...]

where [rest of signature] includes all other arguments, in the order they appear in DataFrame.to_csv().

This will be complemented with the warning for header= if unset and (this just came to my mind now) with

if isinstance(sep, bool):
    # emit warning about the change of arguments ordering
    raise ValueError

This way, we inform users of the old Series.to_csv positional arguments that the signature ordering has changed (previously, index was the first positional argument after path, now it is sep).
The only downside (I can see) is that for those users, there will be immediate breakage, rather than a warning... but again, I expect it to be a really rare use case. This downside could be avoided by reordering all arguments (only inside this if branch)... it's maybe just not worth the effort.

@gfyoung
Member Author

gfyoung commented Jul 12, 2018

My idea is that Series.to_csv could look like:

Hmm...I see...IMO that's bending over backwards a little too much duplicating (and bloating the signature) like that. I would leave it at warning regarding the header argument.

@toobaz
Member

toobaz commented Jul 12, 2018

that's bending over backwards a little too much duplicating (and bloating the signature) like that

Just to be sure we are on the same page: it would obviously be a temporary measure, path= would be removed at the end of the deprecation period.

@gfyoung
Member Author

gfyoung commented Jul 12, 2018

Just to be sure we are on the same page: it would obviously be a temporary measure

@toobaz : Oh, I definitely understand. 🙂 My comment above was made with that consideration in mind.

@toobaz
Member

toobaz commented Jul 13, 2018

@toobaz : Oh, I definitely understand. 🙂 My comment above was made with that consideration in mind.

OK. Would your comment change if I proposed the following rather than the above?

def to_csv(self, path_or_buf=None,
           [rest of signature],
           **kwargs):
    if kwargs.get('path', None) is not None:
        # emit warning about change of argument name
    # continue as in your PR

Concerning the possibility that [rest of signature] follows the DataFrame.to_csv order: this is orthogonal to the discussion on whether to warn for path=. My reasoning is that if we do not change the signature order now, people might have to change their calls twice: now to avoid the warnings, and later to conform to the new signature. That is, unless they refrain from using positional arguments at all, which is probably good style, but which we probably don't want to impose.
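
The `**kwargs` variant proposed above can be sketched as a standalone function. Everything here is an assumption made for illustration: the name `to_csv_sketch`, the warning text, the choice of TypeError over an assert, and the returned tuple (which only makes the effect visible) are not taken from either PR.

```python
import warnings


def to_csv_sketch(path_or_buf=None, sep=",", **kwargs):
    """Accept the old 'path' keyword without listing it in the signature."""
    path = kwargs.pop("path", None)
    if kwargs:
        # Mirror Python's own behaviour for unknown keywords.
        raise TypeError(f"unexpected keyword argument(s): {sorted(kwargs)}")
    if path is not None:
        if path_or_buf is not None:
            raise TypeError("You cannot pass both 'path' and 'path_or_buf'.")
        warnings.warn(
            "'path' is deprecated; use 'path_or_buf' instead.",
            FutureWarning,
            stacklevel=2,
        )
        path_or_buf = path
    # Returned only so the resolved values can be inspected in this sketch.
    return path_or_buf, sep
```

A plain `to_csv_sketch("out.csv")` stays silent, which is the property argued for above: passing the filename as the first positional argument works now and later, so it should not warn.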

@gfyoung
Member Author

gfyoung commented Jul 13, 2018

OK. Would your comment change if I proposed the following rather than the above?

@toobaz : Ah...that I think we can do. Nice!

@toobaz
Member

toobaz commented Jul 13, 2018

I'm not a big fan of that one just because a simple command like Series.to_csv(path) will issue a warning (unnecessarily).

I meant the second except self.

@dahlbaek
Contributor

dahlbaek commented Jul 13, 2018

If you aren't triggering the warning, then your code will be forwards-compatible.

Again: we don't want to forbid behaviors which are legitimate now and in the future.

I'm not sure I understand. No behaviour which passes more than one positional argument ought to be legitimate both now and in the future (since index should have type bool while sep should have type str). However, consider that any one-character string is truthy and is therefore treated as the boolean True. Thus, the behaviour of

df.to_csv("test.csv", "y", header=False)

would not be the same before as after, and therefore should at least emit a warning.

if we want to warn for the changing order for positional arguments, we can just check the type of the second positional argument

I'm not sure I understand how you would easily go about emitting a warning by checking the type of the second positional argument. How do you know which arguments were passed as positional? But if there is an easy way, I am very interested in seeing it!

@toobaz
Member

toobaz commented Jul 13, 2018

No behaviour which passes more than one positional argument should be legitimate both now and in the future

Passing positional arguments itself is. So we will not forbid it (if we can). Or in other terms: if people want to use positional arguments now and in the future, making them change their calls once is acceptable; twice is not.

I'm not sure I understand how you would easily go about emitting a warning by checking the type of the second positional argument.

As I explained above, apart from the first two arguments, which work the same in the old and new signature (self and path_or_buf), we then have sep for the new signature and index for the old. The latter is always boolean, the former never is.

@dahlbaek
Contributor

dahlbaek commented Jul 13, 2018

As I explained above, apart from the first two arguments, which work the same in the old and new signature (self and path_or_buf), we then have sep for the new signature and index for the old. The latter is always boolean, the former never is.

I see, so you would simply check the type of sep, and if it is bool, throw a ValueError.

Since the decorator seems out of the question, I would support your solution. It has the drawback that it, instead of providing the user with a warning ahead of time, will directly break most old code relying on positional arguments. It also will not catch edge cases with poor code, like

df.to_csv("test.csv", "y", header=False)

@toobaz
Member

toobaz commented Jul 13, 2018

It has the drawback that it, instead of providing the user with a warning ahead of time, will directly break most old code relying on positional arguments

How to detect the old signature and what to do when we detect it are two different problems.

Once more: we can temporarily support the old signature and the new, it's just a matter of reordering (conditional on dtype of sep) rather than raising/just warning. It is arguably the optimal solution for the user; it just requires slightly more code (so @gfyoung will probably not like it).

@dahlbaek
Contributor

dahlbaek commented Jul 13, 2018

Once more: we can temporarily support the old signature and the new, it's just a matter of reordering (conditional on dtype of sep) rather than raising/just warning.

Again, edge cases:

df.to_csv("test.csv", "y", header=False)

Since Python is dynamically typed and any object can be evaluated for truthiness, you cannot reliably infer which signature the user had in mind by looking at the type of the second positional argument: one-character strings are legitimate values for both sep and index.

As far as I can tell, you have to choose between "more than one positional argument will cause a non-disableable warning during deprecation period" and "some edge cases may not trigger warnings".
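
The edge case can be made concrete with a small sketch. The helper `looks_like_old_call` is hypothetical and simply applies the proposed isinstance check to the argument that follows the path.

```python
def looks_like_old_call(second_positional):
    """Guess whether the argument after the path was meant as old-style 'index'."""
    return isinstance(second_positional, bool)


# A real boolean is recognised as the old ordering.
print(looks_like_old_call(False))  # True

# A truthy one-character string is classified as a separator, even though
# the old signature would have treated it as index=True.
print(looks_like_old_call("y"), bool("y"))  # False True
```

So a type check catches the documented usage (a bool for index) but not callers who relied on truthiness.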

@toobaz
Member

toobaz commented Jul 13, 2018

Since Python is dynamically typed and any object can be evaluated for truthiness, you cannot reliably infer which signature the user had in mind by looking at the type of the second positional argument: one-character strings are legitimate values for both sep and index.

In your reply, you are again mixing detection and consequences, let's try to keep the two separated ;-)

Anyway, about detection: I know we would not catch "y" used as a boolean, but assuming we really care about this (which is false - the docstring mentions that index= is a boolean, not that it is interpreted as a boolean), the solution is simple: we can check that the dtype is not a character (the only valid type for sep). str(True) and str(False) are not a character.

But again, I really don't think we care.

And again, this is a matter of how we catch the wrong ordering...

... however we do it, we can reorder if we want.

@dahlbaek
Contributor

dahlbaek commented Jul 13, 2018

In your reply, you are again mixing detection and consequences, let's try to keep the two separated ;-)

Not really, my point was precisely that one cannot reliably detect which signature the user has in mind, and therefore one should choose a consequence which takes this lack of reliability into account. That is, detection and consequences are related. For clarity, I even broke the reply up into two paragraphs, one for detection and one for consequences.

Anyway, about detection: I know we would not catch "y" used as a boolean, but assuming we really care about this (which is false - the docstring mentions that index= is a boolean, not that it is interpreted as a boolean), the solution is simple: we can check that the dtype is not a character (the only valid type for sep). str(True) and str(False) are not a character.

I don't follow — wouldn't those two tests produce the same exact outcome in this case? I.e. classify the call as following the new signature?

But again, I really don't think we care.

This is a fair point, and why I said I would be in favor of your solution.

And again, this is a matter of how we catch the wrong ordering...

... however we do it, we can reorder if we want.

No, you cannot reorder correctly if you cannot catch correctly. That's why I would prefer a catch-all warning or a decorator-based warning, letting the user fix the problem on their end.

@gfyoung
Member Author

gfyoung commented Jul 13, 2018

@toobaz @dahlbaek : It sounds like your conversation is moving back towards reorganizing arguments. That isn't backwards compatible though. This change can't be breaking.

@dahlbaek
Contributor

dahlbaek commented Jul 13, 2018

You're right. It seems to me like the only thing missing is a warning to those users who pass more than a single positional argument and pass a value to header. I.e., calls such as

df.to_csv("test.csv", False, header=False)

should emit a warning in my opinion, whereas calls such as

df.to_csv("test.csv", index=False, header=False)

should not. This is easy to do with a decorator, but I cannot come up with any other simple solution.
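
A minimal sketch of what such a decorator could look like. The decorator name, the `SeriesLike` stand-in class, its abbreviated signature, and the warning text are all hypothetical; unlike the narrower case described above, this version warns on any extra positional argument, whether or not header is passed.

```python
import functools
import warnings


def warn_on_extra_positionals(max_positional=1):
    """Warn when a method receives more than max_positional positional args."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            # args excludes self, so len(args) counts user-supplied positionals.
            if len(args) > max_positional:
                warnings.warn(
                    f"The argument order of {func.__name__} will change; "
                    "pass everything after the first argument by keyword.",
                    FutureWarning,
                    stacklevel=2,
                )
            return func(self, *args, **kwargs)
        return wrapper
    return decorator


class SeriesLike:
    # Hypothetical stand-in using the old ordering: index follows path.
    @warn_on_extra_positionals(max_positional=1)
    def to_csv(self, path=None, index=True, sep=",", header=False):
        return (path, index, sep, header)
```

Because the wrapper sees the raw `*args`, it knows which values were passed positionally, which the function body itself cannot tell.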

@toobaz
Member

toobaz commented Jul 13, 2018

I don't follow — wouldn't those two tests produce the same exact outcome in this case? I.e. classify the call as following the new signature?

What I was mentioning would be to change the signature to Series.to_csv(*args, **kwargs). If len(args) >= 3 and args[2] is not a string(-like) of length 1, we are sure it's an invalid value for sep, and we (warn and) reorder. The downside of this solution is just that it looks bad in the docs (although we could refer to the DataFrame.to_csv docs/signature).

Assuming we don't want to do this, then I think we agree on not considering "y" as a valid boolean. I think we also agree on changing the signature to the new ordering.
We do not agree on what to do then, because @dahlbaek you mention that we are not sure we catch old positional calls.
The thing is: when we catch them (that is, when sep is a bool), then we know the old signature is being used, and we know DataFrame.to_csv will raise a TypeError.
Do you think this is the best option, just because it guarantees consistency with users who pass values to index= that we document as unsupported?!
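
The `*args, **kwargs` idea mentioned above can be sketched as follows. The two orderings are abbreviated assumptions (the real signatures have more parameters), the function names are hypothetical, and because this is a free function without self, the path is args[0] and the ambiguous argument is args[1] rather than args[2].

```python
import warnings

# Abbreviated, assumed orderings; the real signatures have more parameters.
OLD_ORDER = ("path_or_buf", "index", "sep", "na_rep")
NEW_ORDER = ("path_or_buf", "sep", "na_rep", "index")


def _is_valid_sep(value):
    # The check described above: a string of length 1.
    return isinstance(value, str) and len(value) == 1


def normalize_call(*args, **kwargs):
    """Map positional arguments to keywords, accepting either ordering."""
    order = NEW_ORDER
    if len(args) >= 2 and not _is_valid_sep(args[1]):
        warnings.warn(
            "Positional arguments follow the old ordering; 'sep' will come "
            "right after the path in the future.",
            FutureWarning,
            stacklevel=2,
        )
        order = OLD_ORDER
    if len(args) > len(order):
        raise TypeError("too many positional arguments")
    resolved = dict(zip(order, args))
    duplicates = set(resolved) & set(kwargs)
    if duplicates:
        raise TypeError(f"got multiple values for {sorted(duplicates)}")
    resolved.update(kwargs)
    return resolved
```

The resolved dictionary could then be forwarded as keyword arguments to a single implementation, so old-style calls keep working during the deprecation period while still producing a warning.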

@toobaz
Member

toobaz commented Jul 13, 2018

This change can't be breaking.

... which is an argument for reordering (or for the implicit signature).

The change that really can't be breaking is the one after the deprecation period.

@gfyoung
Member Author

gfyoung commented Jul 13, 2018

The change that really can't be breaking is the one after the deprecation period.

@toobaz : I'm not sure what you mean. The one that can't be breaking is the deprecation. The one after the deprecation period is breaking. That's why we warn users about it first. That's why I'm not in favor of making breaking changes to the signature at this time.

As for Series.to_csv(*args, **kwargs), I'm not a fan of argument introspection, especially with all of these arguments. In fact, that whole conversation deadlocked before in #19745 (comment).

@toobaz
Member

toobaz commented Jul 13, 2018

@toobaz : I'm not sure what you mean.

I mean that the ideal deprecation cycle will let you change your code at any time during the deprecation period, with warnings urging you to do so.
The less ideal deprecation cycle will demand that you change your code immediately, by raising.
Asking to change your code once to avoid the warning, and once to be future-proof, is cruel.

As I said, we shouldn't feel obliged to support positional arguments at all, it's a choice. But supporting both styles is technically feasible, with absolute certainty, and is the only way to guarantee a smooth deprecation cycle to whoever uses positional arguments.

I'm not a fan of argument introspection

If your main argument is "I'm writing this PR, my tastes matter", then... OK, I'm out of arguments.

@gfyoung
Member Author

gfyoung commented Jul 13, 2018

If your main argument is "I'm writing this PR, my tastes matter", then... OK, I'm out of arguments.

@toobaz : Actually, I'm just expressing my personal opinion, but as I said earlier, this conversation deadlocked previously in #19745 (comment). If you think that doing argument introspection is feasible (and not overly messy), then go for it.

In fact, you can modify #19745 if you want to illustrate your point. Let's make something concrete out of this conversation before it drags out forever in theory world 🙂

toobaz added a commit to toobaz/pandas that referenced this pull request Jul 13, 2018
toobaz added a commit to toobaz/pandas that referenced this pull request Jul 13, 2018
@toobaz
Member

toobaz commented Jul 13, 2018

In fact, you can modify #19745 if you want to illustrate your point. Let's make something concrete out of this conversation before it drags out forever in theory world 🙂

See #21896

toobaz added a commit to toobaz/pandas that referenced this pull request Jul 13, 2018
toobaz added a commit to toobaz/pandas that referenced this pull request Jul 14, 2018
toobaz added a commit to toobaz/pandas that referenced this pull request Jul 14, 2018
@gfyoung
Member Author

gfyoung commented Jul 25, 2018

We're pushing to merge #21896 over this one.

@gfyoung gfyoung closed this Jul 25, 2018
@gfyoung gfyoung modified the milestones: 0.24.0, No action Jul 25, 2018
@gfyoung gfyoung deleted the to-csv-unify branch July 25, 2018 17:20
gfyoung added a commit to toobaz/pandas that referenced this pull request Jul 25, 2018
Labels
API Design Deprecate Functionality to remove in pandas IO CSV read_csv, to_csv Output-Formatting __repr__ of pandas objects, to_string
4 participants