DEPR: Warn about Series.to_csv signature alignment #21868

gfyoung · 2018-07-12T00:27:23Z

Warns about the upcoming alignment of Series.to_csv's signature with that of DataFrame.to_csv.

In anticipation, we have moved DataFrame.to_csv to generic.py so that we can later delete the
Series.to_csv implementation, and allow it to adopt DataFrame's to_csv due to inheritance.

Closes #19745.

cc @dahlbaek

jreback · 2018-07-12T00:32:50Z

cc @toobaz @jorisvandenbossche

codecov · 2018-07-12T05:19:57Z

Codecov Report

Merging #21868 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #21868      +/-   ##
==========================================
+ Coverage   91.91%   91.91%   +<.01%     
==========================================
  Files         164      164              
  Lines       49992    50008      +16     
==========================================
+ Hits        45951    45967      +16     
  Misses       4041     4041

Flag	Coverage Δ
#multiple	`90.3% <100%> (ø)`	⬆️
#single	`42.15% <39.28%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/frame.py	`97.18% <ø> (-0.02%)`	⬇️
pandas/core/generic.py	`96.47% <100%> (+0.01%)`	⬆️
pandas/core/series.py	`94.18% <100%> (+0.07%)`	⬆️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 365eac4...38aebd0. Read the comment docs.

toobaz · 2018-07-12T07:46:09Z

My understanding from #19745 was that the differences were limited (path, ordering, header default value) and that hence we could have raised warnings only for those, and specific errors for them.

Couldn't we change the first argument to path_or_buf, append the path argument at the end of the signature, and then emit the warning (and assign its value to path_or_buf) if path is not None?

Similarly, couldn't we set the default value of header to None, and translate it to False (if it is None) but emitting the FutureWarning?

I know we would still not catch the ordering change, but:

- probably the large majority of users will get a warning from header, and hence they will be informed about the signature change
- I expect very, very few users to pass index= (which is the first argument after buf) as a positional argument

Apart from the comment above, my main problem with this PR is that it seems to enforce (at least if one wants to suppress the warning) passing path as a named arg, which I think is unnatural and unneeded.
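
The header suggestion above (default to None, translate to False, and warn) can be sketched as follows. This is a minimal illustration, not the pandas implementation: `resolve_header` is a hypothetical helper, and the warning text is an assumption.

```python
import warnings


def resolve_header(header=None):
    """Return the effective header flag, warning when the caller left it unset.

    Hypothetical helper: None is a sentinel meaning "not passed explicitly".
    """
    if header is None:
        warnings.warn(
            "The default value of 'header' is going to change; "
            "pass header explicitly to silence this warning.",
            FutureWarning,
            stacklevel=2,
        )
        # Keep the current Series.to_csv behaviour during the deprecation period.
        return False
    return header
```

Callers who already pass header=True or header=False never see the warning, so only code that relies on the old default is asked to change.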

toobaz · 2018-07-12T07:46:54Z

pandas/core/series.py

-                           encoding=encoding, compression=compression,
-                           date_format=date_format, decimal=decimal)
+
+        if path is not None:


There is no valid call that fails this test, right?

I'm not sure what you mean by this question.

@toobaz As far as I can tell, the default value of path is None, so if path is not explicitly assigned another value, the branch is not executed.

Thinking this over, wouldn't it be better to be even more conservative here? One of my gripes with the inconsistency between Series.to_csv and DataFrame.to_csv is that I am not always completely aware whether a given call will return a Series or a DataFrame. Thus, I may inadvertently think that I am working with a DataFrame and pass the path_or_buf keyword argument, when I am in fact working with a Series. In that case, the warning would not trigger, but I would get unexpected behaviour due to the difference in default value of header.

gfyoung · 2018-07-12T08:01:44Z

@toobaz : The original PR IMO is only a band-aid to the larger issue that Series.to_csv is not in sync with DataFrame.to_csv. That's why I'm going all in and issuing a warning as is. A major release (0.24.0) is the best time to do this.

If your main issue with this PR is that people need to pass in a keyword arg for path, I'm okay with that "inconvenience", especially since DataFrame.to_csv is much more commonly used IMO than Series.to_csv. Also, it will certainly make people aware of the fact that we are going to switch signatures in the future.

A single keyword inconvenience is worth it given all of the keyword-argument inconsistencies and awkwardness that we will resolve by synchronizing the two functions, both for end users and for us as maintainers.

dahlbaek · 2018-07-12T12:56:33Z

pandas/core/series.py

+
+        # Result is only a string if no path provided, otherwise None.
+        result = df.to_csv(**to_csv_kwargs)
+
        if path is None:


If path is not None, None is implicitly returned. But in that case, the value of result is also None. So there is no reason to test the value of path, it suffices to return result in any case.

Sorry, I must be missing something obvious, but can you provide a valid example of calling this method with path=None?

I should think

import pandas as pd
pd.DataFrame({"a": [1, 2]}).to_csv()

would be a valid call, and would print the contents to screen. Maybe I am wrong.

Yes, that is indeed a perfectly valid way to use path=None.

I was indeed missing something obvious ;-)
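
The point about the return value can be illustrated without pandas. The sketch below uses only the standard library; `rows_to_csv` and `wrapper_to_csv` are hypothetical stand-ins for DataFrame.to_csv and the Series wrapper. It shows why the wrapper can return the inner result unconditionally: the inner call already returns None whenever a path is given.

```python
import csv
import io


def rows_to_csv(rows, path=None):
    """Stand-in for DataFrame.to_csv: return CSV text if no path, else write and return None."""
    if path is None:
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerows(rows)
        return buf.getvalue()
    with open(path, "w", newline="") as handle:
        csv.writer(handle, lineterminator="\n").writerows(rows)
    return None


def wrapper_to_csv(rows, path=None):
    """Stand-in for the Series wrapper: no need to branch on path before returning."""
    result = rows_to_csv(rows, path)
    return result
```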

dahlbaek · 2018-07-12T13:12:35Z

Regarding silencing the warning, perhaps one could add to the warning message that it can be silenced by explicitly casting the Series instance ser to DataFrame using DataFrame(ser)? As far as I can tell, this would be a future-proof method (i.e. to_csv is implemented on Series by first casting to DataFrame and then calling to_csv on the resulting DataFrame). In that case, one could even go to the extreme and simply always show the warning message when Series.to_csv is called.

toobaz · 2018-07-12T13:29:54Z

Regarding silencing the warning, perhaps one could add to the warning message that it can be silenced by explicitly casting the Series instance ser to DataFrame using DataFrame(ser)?

We might as well state that we will completely remove Series.to_csv! I don't see any reason for trying to change the way users use a function which works just fine.

toobaz · 2018-07-12T13:31:52Z

especially since DataFrame.to_csv is much more commonly used IMO than Series.to_csv

Again, having a seldom-used function is not a good argument, to me at least, to make it more annoying to use.

dahlbaek · 2018-07-12T14:02:16Z

We might as well state that we will completely remove Series.to_csv! I don't see any reason for trying to change the way users use a function which works just fine.

Removing Series.to_csv is not the same as changing the signature, so stating that Series.to_csv will be removed would be false information. No matter what you do, the user will be bothered by a signature change. This way, the user has three options:

1. Live with the warnings and do not change the arguments. Result: Warnings, not future-proof.
2. Change the arguments, and live with the warnings. Result: Warnings, future-proof.
3. Avoid warnings by not using the "unstable" method (i.e. explicitly recast to DataFrame). Result: No warnings, future-proof.

Note that the third option is simply the user implementing the fix by hand, which is why it is future-proof and does not emit warnings. I think those all seem like reasonable options for a short deprecation period.

gfyoung · 2018-07-12T16:06:04Z

So let me meet you guys half-way:

Don't warn on the path argument
Change header to default to None
Warn if header isn't explicitly set

How does that sound?

toobaz · 2018-07-12T21:31:43Z

So let me meet you guys half-way:
Don't warn on the path argument
Change header to default to None
Warn if header isn't explicitly set
How does that sound?

I think that's reasonable... I just think that moving path as last argument, and warning if path is not None, is superior (because those people who passed path= will get the warning): can you point at any downside?

gfyoung · 2018-07-12T21:34:30Z

I just think that moving path as last argument, and warning if path is not None, is superior (because those people who passed path= will get the warning): can you point at any downside?

@toobaz : I'm not sure I fully understand you here. What do you mean by "moving path as last argument" ? It sounds like you're just repeating what I did currently in this PR?

toobaz · 2018-07-12T21:35:51Z

I think those all seem like reasonable options for a short deprecation period.

My point is: if a given thing works now and will work in the future in the same way, no deprecation period should force users to do things otherwise.

In this case, the "thing" is "passing a filename as first positional argument", which (correct me if I'm wrong) works exactly as it should, and will.

Contrast this with the warning for header=, which will rightly force people to change their calls because the behavior of that argument will change.

toobaz · 2018-07-12T22:10:19Z

@toobaz : I'm not sure I fully understand you here. What do you mean by "moving path as last argument" ?

My idea is that Series.to_csv could look like:

def to_csv(self, path_or_buf=None,
           [rest of signature],
           path=None):
    if path is not None:
        # emit warning about change of argument name
        assert path_or_buf is None, "You cannot pass both 'path' and 'path_or_buf'."
        path_or_buf = path
    [...]

where [rest of signature] includes all other arguments, in the order they appear in DataFrame.to_csv().

This will be complemented with the warning for header= if unset and (this just came to my mind now) with

if isinstance(sep, bool):
    # emit warning about the change of arguments ordering
    raise ValueError

This way, we inform user of the old Series.to_csv positional arguments that the signature ordering has changed (previously, index was the first positional argument after path, now it is sep).
The only downside (I can see) is that for those users, there will be immediate breakage, rather than a warning... but again, I expect it to be a really rare use case. This downside could be avoided by reordering all arguments (only inside this if branch)... it's maybe just not worth the effort.
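
A runnable version of the proposal above might look like the following. It is only a sketch: `to_csv_compat` is a hypothetical free function that returns the resolved arguments instead of writing anything, the signature is abbreviated to the parameters under discussion, and a TypeError replaces the assert so that the check survives `python -O`.

```python
import warnings


def to_csv_compat(path_or_buf=None, sep=",", path=None):
    """Hypothetical stand-in: accept the deprecated 'path' keyword and reject a boolean 'sep'."""
    if path is not None:
        if path_or_buf is not None:
            raise TypeError("You cannot pass both 'path' and 'path_or_buf'.")
        warnings.warn(
            "'path' is deprecated; use 'path_or_buf' instead.",
            FutureWarning,
            stacklevel=2,
        )
        path_or_buf = path
    if isinstance(sep, bool):
        # A boolean here almost certainly means the caller used the old
        # positional order, where 'index' followed the path.
        raise ValueError(
            "'sep' must be a string; the positional argument order has changed."
        )
    return path_or_buf, sep
```

With this shape, `to_csv_compat("out.csv")` keeps working silently, `to_csv_compat(path="out.csv")` works with a warning, and `to_csv_compat("out.csv", False)` fails immediately, which is the immediate breakage mentioned above.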

gfyoung · 2018-07-12T22:17:28Z

My idea is that Series.to_csv could look like:

Hmm...I see...IMO that's bending over backwards a little too much duplicating (and bloating the signature) like that. I would leave it at warning regarding the header argument.

toobaz · 2018-07-12T23:10:41Z

that's bending over backwards a little too much duplicating (and bloating the signature) like that

Just to be sure we are on the same page: it would obviously be a temporary measure, path= would be removed at the end of the deprecation period.

gfyoung · 2018-07-12T23:13:23Z

Just to be sure we are on the same page: it would obviously be a temporary measure

@toobaz : Oh, I definitely understand. 🙂 My comment above was made with that consideration in mind.

toobaz · 2018-07-13T00:07:54Z

@toobaz : Oh, I definitely understand. 🙂 My comment above was made with that consideration in mind.

OK. Would your comment change if I proposed the following rather than the above?

def to_csv(self, path_or_buf=None,
           [rest of signature],
           **kwargs):
    if kwargs.get('path', None) is not None:
        # emit warning about change of argument name
    # continue as in your PR

Concerning the possibility that [rest of signature] follows the DataFrame.to_csv order: this is orthogonal with respect to the discussion on whether to warn for path=. And my reasoning is that if we do not change the signature order now, people might have to change their calls twice: now to avoid the warnings, and later to conform the signature. This unless they refrain from using positional arguments at all - which is probably good style, but which we probably don't want to impose.
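
One detail of the **kwargs variant is worth spelling out: once the signature accepts arbitrary keywords, misspelled arguments are silently swallowed unless they are rejected by hand. A minimal sketch, assuming a hypothetical `to_csv_kwargs` function with an abbreviated signature:

```python
import warnings


def to_csv_kwargs(path_or_buf=None, sep=",", **kwargs):
    """Hypothetical stand-in: keep 'path' out of the visible signature but still honour it."""
    path = kwargs.pop("path", None)
    if kwargs:
        # Without this check, a typo such as 'seperator=' would be ignored.
        unexpected = ", ".join(sorted(kwargs))
        raise TypeError("unexpected keyword argument(s): " + unexpected)
    if path is not None:
        if path_or_buf is not None:
            raise TypeError("You cannot pass both 'path' and 'path_or_buf'.")
        warnings.warn(
            "'path' is deprecated; use 'path_or_buf' instead.",
            FutureWarning,
            stacklevel=2,
        )
        path_or_buf = path
    return path_or_buf, sep
```

The documented signature then already matches the target one, and the deprecated keyword disappears at the end of the deprecation period by deleting the first few lines of the body.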

gfyoung · 2018-07-13T00:10:25Z

OK. Would your comment change if I proposed the following rather than the above?

@toobaz : Ah...that I think we can do. Nice!

toobaz · 2018-07-13T10:11:46Z

I'm not a big fan of that one just because a simple command like Series.to_csv(path) will issue a warning (unnecessarily).

I meant the second except self.

dahlbaek · 2018-07-13T10:35:36Z

If you aren't triggering the warning, then your code will be forwards-compatible.

Again: we don't want to forbid behaviors which are legitimate now and in the future.

I'm not sure I understand. No behaviour which passes more than one positional argument ought to be legitimate both now and in the future (since index should have type bool while sep should have type str). However, consider that any one-character string is recast into the boolean True. Thus, the behaviour of

df.to_csv("test.csv", "y", header=False)

would not be the same before as after, and therefore should at least emit a warning.
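
The edge case can be made concrete with two toy functions that only echo how they interpret their arguments. The parameter orders follow the description in this thread (index came right after the path in the old Series signature, sep comes right after it in the new one); the functions themselves are hypothetical and abbreviated.

```python
def old_style(path, index=True, sep=","):
    """Old Series.to_csv order as described in the thread: index follows the path."""
    return {"index": bool(index), "sep": sep}


def new_style(path_or_buf, sep=",", index=True):
    """DataFrame.to_csv order as described in the thread: sep follows the path."""
    return {"index": bool(index), "sep": sep}


# The same call means two different things under the two signatures.
old = old_style("test.csv", "y")  # "y" is truthy, so the index is written
new = new_style("test.csv", "y")  # "y" becomes the field separator
```

Here `old` is `{"index": True, "sep": ","}` while `new` is `{"index": True, "sep": "y"}`, so the output file changes without any error being raised.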

if we want to warn for the changing order for positional arguments, we can just check the type of the second positional argument

I'm not sure I understand how you would easily go about emitting a warning by checking the type of the second positional argument. How do you know which arguments were passed as positional? But if there is an easy way, I am very interested in seeing it!

toobaz · 2018-07-13T11:07:55Z

No behaviour which passes more than one positional argument should be legitimate both now and in the future

Passing positional arguments itself is. So we will not forbid it (if we can). Or in other terms: if people want to use positional arguments now and in the future, making them change their calls once is acceptable; twice is not.

I'm not sure I understand how you would easily go about emitting a warning by checking the type of the second positional argument.

As I explained above, apart from the first two arguments, which work the same in the old and new signature (self and path_or_buf), we then have sep for the new signature and index for the old. The latter is always boolean, the former never is.

dahlbaek · 2018-07-13T11:18:47Z

As I explained above, apart from the first two arguments, which work the same in the old and new signature (self and path_or_buf), we then have sep for the new signature and index for the old. The latter is always boolean, the former never is.

I see, so you would simply check the type of sep, and if it is bool, throw a ValueError.

Since the decorator seems out of the question, I would support your solution. It has the drawback that it, instead of providing the user with a warning ahead of time, will directly break most old code relying on positional arguments. It also will not catch edge cases with poor code, like

df.to_csv("test.csv", "y", header=False)

toobaz · 2018-07-13T11:55:23Z

It has the drawback that it, instead of providing the user with a warning ahead of time, will directly break most old code relying on positional arguments

How to detect the old signature and what to do when we detect it are two different problems.

Once more: we can temporarily support the old signature and the new, it's just a matter of reordering (conditional on dtype of sep) rather than raising/just warning. It is arguably the optimal solution for the user; it just requires slightly more code (so @gfyoung will probably not like it).

dahlbaek · 2018-07-13T13:44:27Z

Once more: we can temporarily support the old signature and the new, it's just a matter of reordering (conditional on dtype of sep) rather than raising/just warning.

Again, edge cases:

df.to_csv("test.csv", "y", header=False)

Since python is weakly and dynamically typed, you cannot reliably infer which signature the user had in mind by looking at the type of the second positional argument — one-character strings are legitimate values for both sep and index.

As far as I can tell, you have to choose between "more than one positional argument will cause a non-disableable warning during deprecation period" and "some edge cases may not trigger warnings".

toobaz · 2018-07-13T14:00:13Z

Since python is weakly and dynamically typed, you cannot reliably infer which signature the user had in mind by looking at the type of the second positional argument — one-character strings are legitimate values for both sep and index.

In your reply, you are again mixing detection and consequences, let's try to keep the two separated ;-)

Anyway, about detection: I know we would not catch "y" used as a boolean, but assuming we really care about this (which is false - the docstring mentions that index= is a boolean, not that it is interpreted as a boolean), the solution is simple: we can check that the dtype is not a character (the only valid type for sep). str(True) and str(False) are not a character.

But again, I really don't think we care.

And again, this is a matter of how we catch the wrong ordering...

... however we do it, we can reorder if we want.

dahlbaek · 2018-07-13T14:19:24Z

In your reply, you are again mixing detection and consequences, let's try to keep the two separated ;-)

Not really, my point was precisely that one cannot reliably detect which signature the user has in mind, and therefore one should choose a consequence which takes this lack of reliability into account. That is, detection and consequences are related. For clarity, I even broke the reply up into two paragraphs, one for detection and one for consequences.

Anyway, about detection: I know we would not catch "y" used as a boolean, but assuming we really care about this (which is false - the docstring mentions that index= is a boolean, not that it is interpreted as a boolean), the solution is simple: we can check that the dtype is not a character (the only valid type for sep). str(True) and str(False) are not a character.

I don't follow — wouldn't those two tests produce the same exact outcome in this case? I.e. classify the call as following the new signature?

But again, I really don't think we care.

This is a fair point, and why I said I would be in favor of your solution.

And again, this is a matter of how we catch the wrong ordering...

... however we do it, we can reorder if we want.

No, you cannot reorder correctly if you cannot catch correctly. That's why I would prefer a catch-all warning or a decorator-based warning, letting the user fix the problem on their end.

gfyoung · 2018-07-13T16:21:00Z

@toobaz @dahlbaek : It sounds like your conversation is moving back towards reorganizing arguments. That isn't backwards compatible though. This change can't be breaking.

dahlbaek · 2018-07-13T16:43:11Z

You're right. It seems to me like the only thing missing is a warning to those users who pass more than a single positional argument and pass a value to header. I.e., calls such as

df.to_csv("test.csv", False, header=False)

should emit a warning in my opinion, whereas calls such as

df.to_csv("test.csv", index=False, header=False)

should not. This is easy to do with a decorator, but I cannot come up with any other simple solution.
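
The decorator idea mentioned above could look roughly like this. It is a sketch under assumptions: `warn_on_extra_positionals` and `FakeSeries` are hypothetical names, and the method body merely echoes its arguments rather than writing a file.

```python
import functools
import warnings


def warn_on_extra_positionals(max_positional=1):
    """Warn when a method receives more positional arguments than allowed (self excluded)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            if len(args) > max_positional:
                warnings.warn(
                    "The order of positional arguments is going to change; "
                    "pass everything after the path as keyword arguments.",
                    FutureWarning,
                    stacklevel=2,
                )
            return func(self, *args, **kwargs)
        return wrapper
    return decorator


class FakeSeries:
    """Hypothetical stand-in for Series, using the old argument order from the thread."""

    @warn_on_extra_positionals(max_positional=1)
    def to_csv(self, path=None, index=True, sep=",", header=False):
        return (path, index, sep, header)
```

Because the wrapper sees the raw `*args`, it knows exactly how many arguments were passed positionally, which the function body cannot tell once they have been bound to parameter names.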

toobaz · 2018-07-13T16:53:03Z

I don't follow — wouldn't those two tests produce the same exact outcome in this case? I.e. classify the call as following the new signature?

What I was mentioning would be to change the signature to Series.to_csv(*args, **kwargs). If len(args) >= 3 and args[2] is not a string(-like) of length 1, we are sure it's an invalid value for sep, and we (warn and) reorder. The downside of this solution is just that it looks bad in the docs (although we could refer to the DataFrame.to_csv docs/signature).

Assuming we don't want to do this, then I think we agree on not considering "y" as a valid boolean. I think we also agree on changing the signature to the new ordering.
We do not agree on what to do then, because @dahlbaek you mention that we are not sure we catch old positional calls.
The thing is: when we catch them (that is, when sep is a bool), then we know the old signature is being used, and we know DataFrame.to_csv will raise a TypeError.
Do you think this is the best option, just because it guarantees consistency with users who pass values to index= that we document as unsupported?!
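
The *args-based detection and reordering described above can be sketched as follows. Everything here is illustrative: `normalize_args` is a hypothetical helper, the two orderings are abbreviated to three parameters, and self is excluded, so `args[1]` below corresponds to `args[2]` in the comment.

```python
import warnings

# Abbreviated, hypothetical parameter orders (self excluded).
OLD_ORDER = ("path_or_buf", "index", "sep")
NEW_ORDER = ("path_or_buf", "sep", "index")


def _is_valid_sep(value):
    """A separator must be a one-character string."""
    return isinstance(value, str) and len(value) == 1


def normalize_args(*args, **kwargs):
    """Bind positional arguments to names, falling back to the old order when sep is invalid."""
    if len(args) > len(NEW_ORDER):
        raise TypeError("too many positional arguments")
    names = NEW_ORDER
    if len(args) >= 2 and not _is_valid_sep(args[1]):
        warnings.warn(
            "The order of positional arguments has changed; "
            "interpreting this call with the old order.",
            FutureWarning,
            stacklevel=2,
        )
        names = OLD_ORDER
    resolved = dict(zip(names, args))
    duplicated = set(resolved) & set(kwargs)
    if duplicated:
        raise TypeError("got multiple values for: " + ", ".join(sorted(duplicated)))
    resolved.update(kwargs)
    return resolved
```

A call such as `normalize_args("test.csv", False)` is recognised as old-style and reordered, while `normalize_args("test.csv", "y")` is taken as new-style with "y" as the separator, which is exactly the undetectable edge case debated in this thread.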

toobaz · 2018-07-13T16:54:25Z

This change can't be breaking.

... which is an argument for reordering (or for the implicit signature).

The change that really can't be breaking is the one after the deprecation period.

gfyoung · 2018-07-13T17:17:06Z

The change that really can't be breaking is the one after the deprecation period.

@toobaz : I'm not sure what you mean. The one that can't be breaking is the deprecation. The one after the deprecation period is breaking. That's why we warn users about it first. That's why I'm not in favor of making breaking changes to the signature at this time.

As for Series.to_csv(*args, **kwargs), I'm not a fan of argument introspection, especially with all of these arguments. In fact, that whole conversation deadlocked before in #19745 (comment).

toobaz · 2018-07-13T17:31:27Z

@toobaz : I'm not sure what you mean.

I mean that the ideal deprecation cycle will let you change your code at any time during the deprecation period, with warnings urging you to do so.
The less ideal deprecation cycle will pretend that you change your code immediately, by raising.
Asking to change your code once to avoid the warning, and once to be future-proof, is cruel.

As I said, we shouldn't feel obliged to support positional arguments at all, it's a choice. But supporting both styles is technically feasible, with absolute certainty, and is the only way to guarantee a smooth deprecation cycle to whoever uses positional arguments.

I'm not a fan of argument introspection

If your main argument is "I'm writing this PR, my tastes matter", then... OK, I'm out of arguments.

gfyoung · 2018-07-13T17:37:38Z

If your main argument is "I'm writing this PR, my tastes matter", then... OK, I'm out of arguments.

@toobaz : Actually, I'm just expressing my personal opinion, but as I said earlier, this conversation deadlocked previously #19745 (comment). If you think that doing argument introspection is feasible (and not overly messy), then go for it.

In fact, you can modify #19745 if you want to illustrate your point. Let's make something concrete out of this conversation before it drags out forever in theory world 🙂

toobaz · 2018-07-13T20:23:13Z

In fact, you can modify #19745 if you want to illustrate your point. Let's make something concrete out of this conversation before it drags out forever in theory world 🙂

See #21896

closes pandas-dev#19715

gfyoung · 2018-07-25T16:48:39Z

We're pushing to merge #21896 over this one.

gfyoung added the IO CSV (read_csv, to_csv) and Deprecate (Functionality to remove in pandas) labels Jul 12, 2018

gfyoung added this to the 0.24.0 milestone Jul 12, 2018

gfyoung requested review from jreback and jorisvandenbossche July 12, 2018 00:27

gfyoung added the Output-Formatting (__repr__ of pandas objects, to_string) and API Design labels Jul 12, 2018

gfyoung force-pushed the to-csv-unify branch 2 times, most recently from 4dc9a6f to 8dda35d Compare July 12, 2018 05:19

toobaz reviewed Jul 12, 2018

dahlbaek reviewed Jul 12, 2018

gfyoung force-pushed the to-csv-unify branch from 8dda35d to f2f2aff Compare July 12, 2018 21:48

toobaz added a commit to toobaz/pandas that referenced this pull request Jul 13, 2018

Proof of concept for pandas-dev#19715 based on pandas-dev#21868

3ea765a

toobaz added a commit to toobaz/pandas that referenced this pull request Jul 13, 2018

Proof of concept for pandas-dev#19715 based on pandas-dev#21868

1fa5123

toobaz mentioned this pull request Jul 13, 2018

DEPR: Deprecate Series.to_csv signature #21896

Merged

4 tasks

dahlbaek mentioned this pull request Jul 13, 2018

Conform Series.to_csv to DataFrame.to_csv #19745

Closed

4 tasks

gfyoung mentioned this pull request Jul 13, 2018

Conform Series.to_csv to DataFrame.to_csv #19715

Closed

toobaz added a commit to toobaz/pandas that referenced this pull request Jul 13, 2018

API: Proof of concept for pandas-dev#19715 based on pandas-dev#21868

5175907

closes pandas-dev#19715

toobaz added a commit to toobaz/pandas that referenced this pull request Jul 14, 2018

API: Proof of concept for pandas-dev#19715 based on pandas-dev#21868

7d655b2

closes pandas-dev#19715

toobaz added a commit to toobaz/pandas that referenced this pull request Jul 14, 2018

API: Proof of concept for pandas-dev#19715 based on pandas-dev#21868

446cd2c

closes pandas-dev#19715

gfyoung closed this Jul 25, 2018

gfyoung modified the milestones: 0.24.0, No action Jul 25, 2018

gfyoung deleted the to-csv-unify branch July 25, 2018 17:20

gfyoung added a commit to toobaz/pandas that referenced this pull request Jul 25, 2018

Add missing changes from pandas-devgh-21868

a570f3c
