bpo-36582: Make collections.UserString.encode() return bytes, not str #13138

asqui · 2019-05-06T21:19:48Z

In Python 2 a subclass of UserString would return an instance of that
subclass from encode(). This is invalid in Python 3, where the result of
encode() should always be bytes.

Also:

Collapse the 3 code paths in UserString.encode() into a single path by
lifting the underlying str.encode() defaults into the method signature

https://bugs.python.org/issue36582
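
To make the described bug concrete, here is a minimal sketch. `Label` and `legacy_encode` are hypothetical names introduced only for illustration; `legacy_encode` imitates the old wrapping behaviour, and the last lines assume an interpreter that already includes this fix.

```python
from collections import UserString


class Label(UserString):
    """Hypothetical UserString subclass, for illustration only."""


def legacy_encode(user_string):
    # Imitates the pre-fix behaviour: the bytes result is wrapped in the
    # subclass again, which stringifies it via str(bytes).
    return user_string.__class__(user_string.data.encode())


label = Label('abc')
legacy = legacy_encode(label)
fixed = label.encode()  # assumes a Python version containing this fix

print(type(legacy).__name__, repr(legacy.data))
print(type(fixed).__name__, fixed)
```

The legacy path yields a `Label` whose data is the text of the bytes repr, which is why returning plain `bytes` is the intended result.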


tirkarthi · 2019-05-07T00:58:18Z

Lib/collections/__init__.py

-                return self.__class__(self.data.encode(encoding, errors))
-            return self.__class__(self.data.encode(encoding))
-        return self.__class__(self.data.encode())
+    def encode(self, encoding='utf-8', errors='strict'):


This changes the default argument values in the function's signature. I would prefer keeping the old defaults for backwards compatibility.

Thanks for the quick feedback!

So there are two points you are making, which I think can be addressed independently:

1. Backward compatibility: Ensuring that code which passes in None for either encoding or errors continues to work the same.

2. Maintaining the signature: Ensuring that the signature still shows None as the default value for encoding and errors.

Point 1 is definitely desired, and this change breaks that in its current form so that needs to be addressed. The way that I would address it depends on whether we also want to address Point 2. Either way, this needs an additional test case.

Point 2 I'm less convinced about. Is there any value to having the signature show defaults of None/None when they will effectively default to utf-8/strict internally? I think that surfacing these defaults in the method signature makes the signature more informative about the default behaviour.

The only downside I can think of is that there is a maintenance cost of duplicating these defaults here; although I think the chances of the defaults for either of these changing are low enough for that to not be a concern?
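
As a rough illustration of the "more informative signature" argument, the sketch below compares what introspection reports for the two styles. `encode_explicit` and `encode_opaque` are hypothetical stand-ins, not the actual `UserString` methods.

```python
import inspect


def encode_explicit(self, encoding='utf-8', errors='strict'):
    """Hypothetical stand-in: the defaults are visible in the signature."""


def encode_opaque(self, encoding=None, errors=None):
    """Hypothetical stand-in: the real defaults are hidden in the body."""


explicit_sig = str(inspect.signature(encode_explicit))
opaque_sig = str(inspect.signature(encode_opaque))

print(explicit_sig)
print(opaque_sig)
```

Tools such as `help()` and IDE tooltips show the same information, so only the first form tells the caller what happens when no arguments are passed.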

The way I would address Point 1 only is:

Suggested change

-    def encode(self, encoding='utf-8', errors='strict'):
+    def encode(self, encoding='utf-8', errors='strict'):
+        encoding = encoding or 'utf-8'
+        errors = errors or 'strict'

This is my preferred approach, but I'm keen to hear your feedback.

The way I would address Point 1 and Point 2 together is:

Suggested change

-    def encode(self, encoding='utf-8', errors='strict'):
+    def encode(self, encoding=None, errors=None):
+        encoding = encoding or 'utf-8'
+        errors = errors or 'strict'

This keeps a single codepath, but does still embed the defaults in the implementation of UserString -- I'm not sure if that was part of your concern?
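
A minimal runnable sketch of this single-path idea, assuming a hypothetical subclass `SinglePathUserString` (the name and the sample strings are illustrative, not part of the PR):

```python
from collections import UserString


class SinglePathUserString(UserString):
    """Hypothetical subclass illustrating the second suggestion."""

    def encode(self, encoding=None, errors=None):
        # Fall back to the str.encode() defaults when a falsy value is given.
        encoding = encoding or 'utf-8'
        errors = errors or 'strict'
        return self.data.encode(encoding, errors)


s = SinglePathUserString('caf\u00e9')
no_args = s.encode()
explicit_none = s.encode(None, None)
latin = s.encode('latin-1')
# A lone surrogate cannot be encoded as UTF-8, so errors='ignore' drops it.
errors_only = SinglePathUserString('a\ud800b').encode(errors='ignore')
```

Passing `None` explicitly behaves like passing nothing, and `errors` is honoured even when `encoding` is omitted.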

The last change, using None in the signature and falling back to a default value, sounds good to me. Normally, str.encode has the defaults encoding='utf-8', errors='strict', but I am wondering whether I am missing any case where data could be something other than a plain string, so that assuming a default of 'utf-8' would be wrong. I would suggest waiting for @rhettinger on this.

Ok, let's wait for confirmation.

In the meantime, I don't quite understand the aversion to including these default values in the signature -- is there a downside to changing the method signature that I'm missing?

Apologies for the slightly contrived example, but it is the one I can think of. Suppose I subclass str with a custom encode method that defaults to encoding='ascii' and errors='ignore', and then pass an instance to UserString. With the PR, encode defaults to encoding='utf-8' and errors='strict', though the wrapped encode should actually be called with 'ascii' and 'ignore' when no arguments are given.

# -*- coding: utf-8 -*-
# ../backups/bpo36582.py
from collections import UserString


class AsciiString(str):
    def encode(self, encoding='ascii', errors='ignore'):
        print(encoding, errors)
        return super().encode(encoding, errors)


data = AsciiString('早上好')
print(repr(UserString(data).encode()))

Python 3.7 has the bug where self.__class__ is called on the result, but ignoring that, UserString(data).encode() ends up calling AsciiString.encode() with encoding='ascii' and errors='ignore':

$ python3.7 ../backups/bpo36582.py
ascii ignore
"b''"

With the PR, the encode method fills in encoding='utf-8' and errors='strict' whenever the arguments are None. So UserString(data).encode() still calls AsciiString.encode, but always with encoding='utf-8' and errors='strict' unless the user specifies other values.

$ ./python.exe ../backups/bpo36582.py
utf-8 strict
b'\xe6\x97\xa9\xe4\xb8\x8a\xe5\xa5\xbd'

If we keep the old code and just remove the use of self.__class__, the behavior matches Python 3.7 except that the return value is bytes. It also keeps compatibility: self.data.encode() is called with no arguments, so the expected defaults (encoding='ascii', errors='ignore') apply instead of (encoding='utf-8', errors='strict').

$ ./python.exe ../backups/bpo36582.py
ascii ignore
b''

Feel free to correct me if I am missing something on the above.
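
The alternative described in this comment (keep the old branching, drop only the `self.__class__` wrapper) could be sketched as follows. `ThreePathUserString` is a hypothetical name, and the `print` call from the original example is omitted for brevity.

```python
from collections import UserString


class AsciiString(str):
    def encode(self, encoding='ascii', errors='ignore'):
        return super().encode(encoding, errors)


class ThreePathUserString(UserString):
    """Hypothetical: the old three-way branching without self.__class__."""

    def encode(self, encoding=None, errors=None):
        if encoding:
            if errors:
                return self.data.encode(encoding, errors)
            return self.data.encode(encoding)
        # No arguments are forwarded, so the wrapped object's own
        # defaults ('ascii', 'ignore' for AsciiString) take effect.
        return self.data.encode()


wrapped = ThreePathUserString(AsciiString('abc早上好'))
default_result = wrapped.encode()
utf8_result = wrapped.encode('utf-8')
```

Here the non-ASCII characters are silently dropped in the no-argument call, because the wrapped `AsciiString` decides the defaults.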

Ok, so I believe what you're saying is that:
The utf-8/strict defaults are valid for the str class, but may not be valid for subclasses of str (such as the AsciiString class in your example).

If we want to support this use case then I believe the only option is to return to the original form of calling encode() with zero, one, or two arguments as appropriate, as both my proposed changes (above) would suffer this problem of assuming the incorrect defaults.

However, note that this would still not correctly handle the case where only encode(errors=...) is specified -- the supplied value of errors is currently discarded (a bug introduced in 3.1 when support for keyword arguments to str.encode() was added, making it possible to specify only errors alone.) This would require the addition of a fourth code path!
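
A small sketch of the discarded-`errors` problem, written as a free function so it runs on any Python 3 version. `three_path_encode` is a hypothetical reduction of the old method body, and the lone-surrogate input is chosen only because it cannot be encoded as UTF-8 under `'strict'`.

```python
def three_path_encode(data, encoding=None, errors=None):
    """Hypothetical reduction of the old three-branch logic."""
    if encoding:
        if errors:
            return data.encode(encoding, errors)
        return data.encode(encoding)
    # errors is silently dropped when encoding is not given.
    return data.encode()


text = 'a\ud800b'  # contains a lone surrogate

# str.encode() itself honours errors= when it is passed alone.
direct = text.encode(errors='ignore')

try:
    three_path_encode(text, errors='ignore')
    errors_respected = True
except UnicodeEncodeError:
    errors_respected = False
```

The branching version raises because it falls through to `data.encode()` with no arguments, which is the gap a fourth code path would have to close.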

This aside, does the scenario of containing one custom str-derived subclass in another custom UserString-derived subclass really need to be supported? Reading the documentation for UserString led me to believe that AsciiString would not be preserved in data, which is documented as "The real str object used to store the contents of the UserString class." But I guess by the Liskov Substitution Principle, AsciiString is a real str class and should be treated as such...

The problem is that the scenario you outline precludes the use of any default values in UserString signatures if we want to support containment of a custom str-subclass that defines its own, different, defaults. That is, I can adapt your AsciiString class into one that has different defaults for the split() method, or for find()/startswith()/endswith(), and these will already not be respected. Perhaps these are even more esoteric than your example, but I think they are in the same class.

Suppose, for instance, I replace AsciiString in your example with an IgnoreFirstCharacterString class that defines a default value of start=1 for count(), find(), startswith(), endswith(). Wrapping this IgnoreFirstCharacterString in a UserString would not currently preserve the 'ignore first character' behaviour.

Would you consider this an existing bug?
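
The IgnoreFirstCharacterString scenario can be reproduced with a short sketch. Only `count()` is overridden here, and the class body is an assumption about what such a subclass might look like.

```python
import sys
from collections import UserString


class IgnoreFirstCharacterString(str):
    """Hypothetical str subclass whose count() skips the first character."""

    def count(self, sub, start=1, end=sys.maxsize):
        return super().count(sub, start, end)


raw = IgnoreFirstCharacterString('aba')
raw_count = raw.count('a')

# UserString.count() passes its own start=0 default through to the
# wrapped object, so the subclass default of start=1 is never used.
wrapped_count = UserString(raw).count('a')

print(raw_count, wrapped_count)
```

The two counts differ, which shows that explicit defaults in `UserString` method signatures already override those of a wrapped str subclass.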

My point was mostly that subclasses of str with a custom encode method should behave the same way as before, and that the change should be minimal: just remove the self.__class__ wrappers so that bytes is returned. I will be happy with the core devs' decision on whether my concerns are worth it, given that adding default values to the signature improves the interface as per your comment, and I would welcome their perspective on cases like IgnoreFirstCharacterString.

Thanks for the details 👍

tirkarthi · 2019-05-07T03:28:04Z

Please add a NEWS entry since this is a change in behavior. You can use blurb or blurb-it : https://devguide.python.org/committing/?highlight=news#what-s-new-and-news-entries

* Test to ensure that utf-8/strict are used as defaults
* Use the self.check*() methods instead of assertEqual() in tests
  (This makes test_encode() more portable; it is eligible for promotion to MixinStrUnicodeUserStringTest, as it is a valid test case for `str` too.)
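
A test for the utf-8/strict defaults might look roughly like the sketch below. It uses plain `unittest` assertions instead of the `self.check*()` helpers of the CPython test suite, the class and method names are hypothetical, and it assumes an interpreter that already contains this fix.

```python
import unittest
from collections import UserString


class EncodeDefaultsTest(unittest.TestCase):
    """Hypothetical test case; not the one added by this PR."""

    def test_returns_bytes_with_utf8_default(self):
        result = UserString('caf\u00e9').encode()
        self.assertIsInstance(result, bytes)
        self.assertEqual(result, b'caf\xc3\xa9')

    def test_none_means_default(self):
        self.assertEqual(UserString('abc').encode(None, None), b'abc')

    def test_strict_is_default_error_handler(self):
        # A lone surrogate is not encodable as UTF-8 under 'strict'.
        with self.assertRaises(UnicodeEncodeError):
            UserString('\ud800').encode()


suite = unittest.defaultTestLoader.loadTestsFromTestCase(EncodeDefaultsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```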

rhettinger · 2019-05-26T18:34:20Z

Lib/collections/__init__.py

-            return self.__class__(self.data.encode(encoding))
-        return self.__class__(self.data.encode())
+    def encode(self, encoding='utf-8', errors='strict'):
+        encoding, errors = (encoding or 'utf-8'), (errors or 'strict')


The case encoding='' is different from encoding=None. Expand this code to:

encoding = 'utf-8' if encoding is None else encoding
errors = 'strict' if errors is None else errors
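
The difference between the two ways of applying defaults can be seen in isolation with a short sketch. `resolve_truthy` and `resolve_none` are hypothetical helpers that return the resolved arguments instead of encoding anything.

```python
def resolve_truthy(encoding=None, errors=None):
    # Truthiness test: any falsy value, including '', becomes the default.
    return (encoding or 'utf-8'), (errors or 'strict')


def resolve_none(encoding=None, errors=None):
    # Identity test: only None becomes the default; '' is passed through.
    encoding = 'utf-8' if encoding is None else encoding
    errors = 'strict' if errors is None else errors
    return encoding, errors


truthy_empty = resolve_truthy('', '')
none_empty = resolve_none('', '')
none_default = resolve_none()

print(truthy_empty, none_empty, none_default)
```

With the `is None` form, an empty string reaches the underlying `str.encode()` unchanged, so that method decides how to treat it.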

bedevere-bot · 2019-05-26T18:34:47Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

asqui · 2019-05-26T19:43:11Z

Thanks for the astute observation @rhettinger -- I often point out the subtle bugs (or unintuitive error messages) that can result from relaxed use of "truthiness" so I completely agree with your suggested change in principle.

Out of curiosity, is your request based purely on this principle, or did you have some specific scenario(s) in mind?

From what I can see, the change you have suggested will, in practical terms:

* Correctly support the future addition of an encoding called '' -- however this seems unlikely.
* Correctly fail to look up an encoding called '' at present (note that this is a breaking change with respect to the current implementation on master, which does rely on the truthiness of encoding, and will therefore discard an encoding of '' without passing it through to the underlying str.encode() method.)
* Correctly pass through an errors value of '' to the underlying str.encode() (where it seems to be defaulted to 'strict' anyway) thereby adding support for a future change in the behaviour of str.encode() to treat '' differently -- this also seems unlikely, though not as unlikely as an encoding called ''.

Out of these observations the most significant one is probably the backward compatibility breakage, which would actually be motivation to not make the change you're suggesting (at least for encoding).

Am I missing something more significant here?

I hasten to add that this is my first contribution to CPython so I'm more than happy to defer to you on what the right thing to do is, I am just curious to understand what you had in mind when suggesting this change.

rhettinger · 2019-08-24T00:35:24Z

Please make the requested change. It matches what other APIs do and follows PEP 8 guidance on how to test for None.

NB: Minor backward compatibility break for any existing code that specifies encoding='' (or any other 'Falsy' value)

asqui · 2019-08-27T16:41:23Z

I have made the requested changes; please review again

bedevere-bot · 2019-08-27T16:41:25Z

Thanks for making the requested changes!

@rhettinger: please review the changes made to this pull request.

miss-islington · 2019-08-28T04:38:13Z

Thanks @asqui for the PR, and @rhettinger for merging it 🌮🎉.. I'm working now to backport this PR to: 3.8.
🐍🍒⛏🤖

bedevere-bot · 2019-08-28T04:38:25Z

GH-15557 is a backport of this pull request to the 3.8 branch.


asqui requested a review from rhettinger as a code owner May 6, 2019 21:19

the-knights-who-say-ni added the CLA signed label May 6, 2019

bedevere-bot added the awaiting review label May 6, 2019

tirkarthi reviewed May 7, 2019


blurb-it bot and others added 2 commits May 7, 2019 17:42

📜🤖 Added by blurb_it.

7f261e1

rhettinger requested changes May 26, 2019


bedevere-bot added awaiting changes and removed awaiting review labels May 26, 2019

rhettinger added needs backport to 3.7 type-bug An unexpected behavior, bug, or error labels May 26, 2019

Stricter None-checking, as requested by Raymond Hettinger

348812f

NB: Minor backward compatibility break for any existing code that specifies encoding='' (or any other 'Falsy' value)

bedevere-bot removed the awaiting changes label Aug 27, 2019

bedevere-bot added the awaiting change review label Aug 27, 2019

Merge branch 'master' into UserString-encode-fix-bpo-36582

b4a2695

rhettinger self-assigned this Aug 28, 2019

Add Daniel to acks

2c66fbe

rhettinger added needs backport to 3.8 and removed needs backport to 3.7 labels Aug 28, 2019

rhettinger approved these changes Aug 28, 2019


bedevere-bot added awaiting merge and removed awaiting change review labels Aug 28, 2019

rhettinger merged commit 2a16eea into python:master Aug 28, 2019

bedevere-bot removed the awaiting merge label Aug 28, 2019

bedevere-bot removed the needs backport to 3.8 label Aug 28, 2019

rhettinger pushed a commit that referenced this pull request Aug 28, 2019

bpo-36582: Make collections.UserString.encode() return bytes, not str (…

2cb82d2

…GH-13138) (GH-15557) (cherry picked from commit 2a16eea) Co-authored-by: Daniel Fortunov <asqui@users.noreply.github.com>

lisroach pushed a commit to lisroach/cpython that referenced this pull request Sep 10, 2019

bpo-36582: Make collections.UserString.encode() return bytes, not str (…

79479f9

…pythonGH-13138)

DinoV pushed a commit to DinoV/cpython that referenced this pull request Jan 14, 2020

bpo-36582: Make collections.UserString.encode() return bytes, not str (…

3ee43fa

…pythonGH-13138)

asqui deleted the UserString-encode-fix-bpo-36582 branch June 6, 2020 22:56

websurfer5 pushed a commit to websurfer5/cpython that referenced this pull request Jul 20, 2020

bpo-36582: Make collections.UserString.encode() return bytes, not str (…

ef7f403

…pythonGH-13138)


-    def encode(self, encoding='utf-8', errors='strict'):
+    def encode(self, encoding=None, errors=None):
+        encoding = encoding or 'utf-8'
+        errors = errors or 'strict'
