bpo-36582: Make collections.UserString.encode() return bytes, not str #13138

Merged
merged 6 commits into from
Aug 28, 2019

Conversation

asqui
Contributor

@asqui asqui commented May 6, 2019

In Python 2 a subclass of UserString would return an instance of that
subclass from encode(). This is invalid in Python 3, where the result of
encode() should always be bytes.

Also:

  • Collapse the 3 code paths in UserString.encode() into a single path by
    lifting the underlying str.encode() defaults into the method signature

https://bugs.python.org/issue36582
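
To make the intended behaviour concrete, here is a minimal sketch. `FixedUserString` is a hypothetical subclass written only for illustration (it is not code from this PR); it assumes the single-path form described above, where the wrapped `str` does the encoding and the result is returned unwrapped.

```python
from collections import UserString


class FixedUserString(UserString):
    """Hypothetical subclass, used only to illustrate the proposed behaviour."""

    def encode(self, encoding='utf-8', errors='strict'):
        # Single code path: delegate to the wrapped str and return the
        # bytes as-is instead of wrapping them in self.__class__.
        return self.data.encode(encoding, errors)


result = FixedUserString('caf\u00e9').encode()
print(type(result).__name__, result)
```

The point of the sketch is only the return type: the value coming back is a plain `bytes` object, not another `UserString`.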

return self.__class__(self.data.encode(encoding, errors))
return self.__class__(self.data.encode(encoding))
return self.__class__(self.data.encode())
def encode(self, encoding='utf-8', errors='strict'):
Member

This changes the default values in the function's signature. I would prefer keeping the old defaults and keeping the code backward compatible.

Contributor Author

@asqui asqui May 7, 2019

Thanks for the quick feedback!

So there are two points you are making, which I think can be addressed independently:

  1. Backward compatibility: Ensuring that code which passes in None for either encoding or errors continues to work the same.
  2. Maintaining the signature: Ensuring that the signature still shows None as the default value for encoding and errors.

Point 1 is definitely desired, and this change breaks that in its current form so that needs to be addressed. The way that I would address it depends on whether we also want to address Point 2. Either way, this needs an additional test case.

Point 2 I'm less convinced about. Is there any value to having the signature show defaults of None/None when they will effectively default to utf-8/strict internally? I think that surfacing these defaults in the method signature makes the signature more informative about the default behaviour.

The only downside I can think of is that there is a maintenance cost of duplicating these defaults here; although I think the chances of the defaults for either of these changing are low enough for that to not be a concern?

The way I would address Point 1 only is:

Suggested change
- def encode(self, encoding='utf-8', errors='strict'):
+ def encode(self, encoding='utf-8', errors='strict'):
+     encoding = encoding or 'utf-8'
+     errors = errors or 'strict'

This is my preferred approach, but I'm keen to hear your feedback.

The way I would address Point 1 and Point 2 together is:

Suggested change
- def encode(self, encoding='utf-8', errors='strict'):
+ def encode(self, encoding=None, errors=None):
+     encoding = encoding or 'utf-8'
+     errors = errors or 'strict'

This keeps a single codepath, but does still embed the defaults in the implementation of UserString -- I'm not sure if that was part of your concern?
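
The backward-compatibility point (Point 1) can be sketched with two standalone helper functions. Both names, `encode_passthrough` and `encode_with_fallback`, are hypothetical and operate on a plain `str` rather than on `UserString`, purely to keep the example short. The first hands its arguments straight to `str.encode()`; the second applies the `None` fallback from the suggestion above.

```python
def encode_passthrough(data, encoding='utf-8', errors='strict'):
    # Hypothetical helper: arguments go to str.encode() untouched.
    return data.encode(encoding, errors)


def encode_with_fallback(data, encoding=None, errors=None):
    # Hypothetical helper: None in the signature, defaults applied inside.
    encoding = encoding or 'utf-8'
    errors = errors or 'strict'
    return data.encode(encoding, errors)


try:
    encode_passthrough('abc', None, None)
    passthrough_ok = True
except TypeError:
    # str.encode() does not accept None for encoding or errors.
    passthrough_ok = False

fallback_result = encode_with_fallback('abc', None, None)
print(passthrough_ok, fallback_result)
```

Callers that explicitly pass `None` keep working only in the second form, which is why some fallback is needed whichever signature is chosen.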

Member

The last change, using None in the signature and applying the default values inside the method, sounds good to me. str.encode normally defaults to encoding='utf-8' and errors='strict', but I am wondering whether I am missing a case where data is something other than a plain string, so that assuming a default of 'utf-8' would be wrong. I would suggest waiting for @rhettinger on this.

Contributor Author

Ok, let's wait for confirmation.

In the meantime, I don't quite understand the aversion to including these default values in the signature -- is there a downside to changing the method signature that I'm missing?

Member

Apologies for the slightly contrived example, but it is the one I can think of. Suppose I subclass str with a custom encode method whose defaults are encoding='ascii' and errors='ignore', and then pass an instance to UserString. With the PR, encode defaults to encoding='utf-8' and errors='strict', although the wrapped object's encode should really be called with 'ascii' and 'ignore' when no arguments are given.

# -*- coding: utf-8 -*-
# ../backups/bpo36582.py

from collections import UserString

class AsciiString(str):

    def encode(self, encoding='ascii', errors='ignore'):
        print(encoding, errors)
        return super().encode(encoding, errors)

data = AsciiString('早上好')
print(repr(UserString(data).encode()))

Python 3.7 has the bug where self.__class__ is called, but ignoring that, UserString(data).encode() ends up calling the wrapped encode with encoding='ascii' and errors='ignore':

$ python3.7 ../backups/bpo36582.py
ascii ignore
"b''"

With the PR, the encode signature defaults to encoding='utf-8' and errors='strict'. So UserString(data).encode() still calls AsciiString.encode, but always with encoding='utf-8' and errors='strict' unless the user specifies other values.

$ ./python.exe ../backups/bpo36582.py
utf-8 strict
b'\xe6\x97\xa9\xe4\xb8\x8a\xe5\xa5\xbd'

If we keep the old code and just remove the use of self.__class__, the behavior matches Python 3.7 except that the return value is bytes. It also keeps compatibility, because self.data.encode is called with the subclass's expected defaults (encoding='ascii', errors='ignore') instead of (encoding='utf-8', errors='strict').

$ ./python.exe ../backups/bpo36582.py
ascii ignore
b''

Feel free to correct me if I am missing something on the above.
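
The "keep the old code, drop only the wrapper" option described above could look roughly like the sketch below. `MinimalFixUserString` is a hypothetical name used for illustration; the branching is taken from the three `return` lines shown in the diff, with `self.__class__(...)` removed.

```python
from collections import UserString


class AsciiString(str):
    def encode(self, encoding='ascii', errors='ignore'):
        return super().encode(encoding, errors)


class MinimalFixUserString(UserString):
    """Hypothetical: the three original code paths without self.__class__."""

    def encode(self, encoding=None, errors=None):
        if encoding:
            if errors:
                return self.data.encode(encoding, errors)
            return self.data.encode(encoding)
        # No arguments given: the wrapped object's own defaults apply.
        return self.data.encode()


wrapped = MinimalFixUserString(AsciiString('早上好'))
result = wrapped.encode()
print(repr(result))
```

Because nothing is passed down when no arguments are given, `AsciiString`'s own 'ascii'/'ignore' defaults are used and the non-ASCII characters are dropped.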

Contributor Author

@asqui asqui May 7, 2019

Ok, so I believe what you're saying is that:
The utf-8/strict defaults are valid for the str class, but may not be valid for subclasses of str (such as the AsciiString class in your example).

If we want to support this use case then I believe the only option is to return to the original form of calling encode() with zero, one, or two arguments as appropriate, as both my proposed changes (above) would suffer this problem of assuming the incorrect defaults.

However, note that this would still not correctly handle the case where only encode(errors=...) is specified -- the supplied value of errors is currently discarded (a bug introduced in 3.1, when keyword-argument support was added to str.encode(), making it possible to specify errors alone). This would require the addition of a fourth code path!
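
A small sketch of the discarded-`errors` case. `old_style_encode` is a hypothetical standalone function that mirrors the three-branch logic on a plain `str`; the lone surrogate is just a convenient character that UTF-8 refuses to encode under 'strict'.

```python
def old_style_encode(data, encoding=None, errors=None):
    # Hypothetical helper mirroring the three-branch logic.
    if encoding:
        if errors:
            return data.encode(encoding, errors)
        return data.encode(encoding)
    # errors is never consulted on this path.
    return data.encode()


lone_surrogate = '\udc80'  # not encodable as UTF-8 with errors='strict'

# Calling str.encode() directly honours errors='ignore'.
direct = lone_surrogate.encode(errors='ignore')

try:
    old_style_encode(lone_surrogate, errors='ignore')
    errors_respected = True
except UnicodeEncodeError:
    # The helper fell through to data.encode() and used 'strict'.
    errors_respected = False

print(direct, errors_respected)
```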

This aside, does the scenario of containing one custom str-derived subclass in another custom UserString-derived subclass really need to be supported? Reading the documentation for UserString led me to believe that AsciiString would not be preserved in data, which is documented as "The real str object used to store the contents of the UserString class." But I guess by the Liskov Substitution Principle, AsciiString is a real str class and should be treated as such...

The problem is that the scenario you outline precludes the use of any default values in UserString signatures if we want to support containment of a custom str-subclass that defines its own, different, defaults. i.e. I can adapt your AsciiString class into one that has different defaults for the split() method, or find()/startswith()/endswith(), and these will already not be respected. Perhaps these are even more esoteric than your example, but I think they are in the same class.

Suppose, for instance, I replace AsciiString in your example with an IgnoreFirstCharacterString class that defines a default value of start=1 for count(), find(), startswith(), endswith(). Wrapping this IgnoreFirstCharacterString in a UserString would not currently preserve the 'ignore first character' behaviour.

Would you consider this an existing bug?
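
The IgnoreFirstCharacterString scenario can be sketched as follows. The class body is an assumption made for illustration: only `find()` is overridden, with `start=1` as its default.

```python
import sys
from collections import UserString


class IgnoreFirstCharacterString(str):
    # Illustrative only: find() skips the first character by default.
    def find(self, sub, start=1, end=sys.maxsize):
        return super().find(sub, start, end)


text = IgnoreFirstCharacterString('abca')

direct = text.find('a')               # uses the subclass default start=1
wrapped = UserString(text).find('a')  # UserString passes its own start=0

print(direct, wrapped)
```

`UserString.find()` always forwards explicit `start` and `end` values, so the subclass default is never reached through the wrapper.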

Member

My point was mostly that subclasses of str with a custom encode method should behave the same way as before, and that the change should be minimal: just remove the self.__class__ wrappers so that bytes are returned. I will be happy to go with the core devs' decision on whether my concerns are outweighed, given that adding default values to the signature improves the interface as per your comment, and given your perspective on cases like IgnoreFirstCharacterString.

Thanks for the details 👍

@tirkarthi
Member

Please add a NEWS entry since this is a change in behavior. You can use blurb or blurb-it : https://devguide.python.org/committing/?highlight=news#what-s-new-and-news-entries

blurb-it bot and others added 2 commits May 7, 2019 17:42
* Test to ensure that utf-8/strict are used as defaults
* Use the self.check*() methods instead of assertEqual() in tests
  (This makes test_encode() more portable; it is eligible for promotion
   to MixinStrUnicodeUserStringTest, as it is a valid test case for
   `str` too.)
return self.__class__(self.data.encode(encoding))
return self.__class__(self.data.encode())
def encode(self, encoding='utf-8', errors='strict'):
encoding, errors = (encoding or 'utf-8'), (errors or 'strict')
Contributor

The case encoding='' is different from encoding=None. Expand this code to:

encoding = 'utf-8' if encoding is None else encoding
errors = 'strict' if errors is None else errors
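
The difference between the truthiness test and the explicit None test can be seen with two hypothetical standalone helpers, `encode_truthy` and `encode_none_check`, which apply each variant to a plain `str`.

```python
def encode_truthy(data, encoding=None, errors=None):
    # Any falsy value, including '', is replaced by the default.
    encoding, errors = (encoding or 'utf-8'), (errors or 'strict')
    return data.encode(encoding, errors)


def encode_none_check(data, encoding=None, errors=None):
    # Only None is replaced; '' is passed through to str.encode().
    encoding = 'utf-8' if encoding is None else encoding
    errors = 'strict' if errors is None else errors
    return data.encode(encoding, errors)


truthy_result = encode_truthy('abc', encoding='')

try:
    encode_none_check('abc', encoding='')
    empty_accepted = True
except LookupError:
    # '' is not a registered codec name.
    empty_accepted = False

print(truthy_result, empty_accepted)
```

With the truthiness test an empty encoding name is silently swapped for 'utf-8'; with the None test it reaches the codec lookup and fails there.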

@bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@rhettinger rhettinger added needs backport to 3.7 type-bug An unexpected behavior, bug, or error labels May 26, 2019
@asqui
Contributor Author

asqui commented May 26, 2019

Thanks for the astute observation @rhettinger -- I often point out the subtle bugs (or unintuitive error messages) that can result from relaxed use of "truthiness" so I completely agree with your suggested change in principle.

Out of curiosity, is your request based purely on this principle, or did you have some specific scenario(s) in mind?

From what I can see, the change you have suggested will, in practical terms:

  • Correctly support the future addition of an encoding called '' -- however this seems unlikely.
  • Correctly fail to look up an encoding called '' at present (note that this is a breaking change with respect to the current implementation on master, which does rely on the truthiness of encoding, and will therefore discard an encoding of '' without passing it through to the underlying str.encode() method.)
  • Correctly pass through an errors value of '' to the underlying str.encode() (where it seems to be defaulted to 'strict' anyway) thereby adding support for a future change in the behaviour of str.encode() to treat '' differently -- this also seems unlikely, though not as unlikely as an encoding called ''.

Out of these observations the most significant one is probably the backward compatibility breakage, which would actually be motivation to not make the change you're suggesting (at least for encoding).

Am I missing something more significant here?

I hasten to add that this is my first contribution to CPython so I'm more than happy to defer to you on what the right thing to do is, I am just curious to understand what you had in mind when suggesting this change.

@rhettinger
Contributor

Please make the requested change. It matches what other APIs do and follows PEP 8 guidance on how to test for None.

NB: Minor backward compatibility break for any existing code that
    specifies encoding='' (or any other 'Falsy' value)
@asqui
Contributor Author

asqui commented Aug 27, 2019

I have made the requested changes; please review again

@bedevere-bot

Thanks for making the requested changes!

@rhettinger: please review the changes made to this pull request.

@rhettinger rhettinger self-assigned this Aug 28, 2019
@rhettinger rhettinger merged commit 2a16eea into python:master Aug 28, 2019
@miss-islington
Contributor

Thanks @asqui for the PR, and @rhettinger for merging it 🌮🎉.. I'm working now to backport this PR to: 3.8.
🐍🍒⛏🤖

@bedevere-bot

GH-15557 is a backport of this pull request to the 3.8 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Aug 28, 2019
…pythonGH-13138)

(cherry picked from commit 2a16eea)

Co-authored-by: Daniel Fortunov <asqui@users.noreply.github.com>
rhettinger pushed a commit that referenced this pull request Aug 28, 2019
…GH-13138) (GH-15557)

(cherry picked from commit 2a16eea)

Co-authored-by: Daniel Fortunov <asqui@users.noreply.github.com>
@asqui asqui deleted the UserString-encode-fix-bpo-36582 branch June 6, 2020 22:56
6 participants