bpo-36582: Make collections.UserString.encode() return bytes, not str #13138
In Python 2 a subclass of UserString would return an instance of that subclass from encode(). This is invalid in Python 3, where the result of encode() should always be `bytes`.
Also:
* Collapse the 3 code paths in UserString.encode() into a single path by lifting the underlying str.encode() defaults into the method signature
- return self.__class__(self.data.encode(encoding, errors))
- return self.__class__(self.data.encode(encoding))
- return self.__class__(self.data.encode())
+ def encode(self, encoding='utf-8', errors='strict'):
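For illustration, here is a minimal sketch of the single-path form described in the PR summary. `SinglePathUserString` is a hypothetical subclass used only for this example (it is not the patched class itself); it shows `encode()` handing back the `bytes` produced by the wrapped `str` instead of re-wrapping the result in `self.__class__`.

```python
from collections import UserString


class SinglePathUserString(UserString):
    """Hypothetical subclass illustrating a single encode() code path."""

    def encode(self, encoding='utf-8', errors='strict'):
        # Delegate straight to the wrapped str. There is no self.__class__
        # wrapper, so the result stays a bytes object.
        return self.data.encode(encoding, errors)


result = SinglePathUserString('héllo').encode()
print(type(result).__name__, result)
```

Overriding in a subclass keeps the example independent of which Python version (and therefore which `UserString.encode()` implementation) is installed.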
This changes the default values in the function's signature. I would prefer keeping the old defaults and backwards compatibility with existing code.
Thanks for the quick feedback!
So there are two points you are making, which I think can be addressed independently:
- Backward compatibility: Ensuring that code which passes in `None` for either `encoding` or `errors` continues to work the same.
- Maintaining the signature: Ensuring that the signature still shows `None` as the default value for `encoding` and `errors`.
Point 1 is definitely desired, and this change breaks that in its current form so that needs to be addressed. The way that I would address it depends on whether we also want to address Point 2. Either way, this needs an additional test case.
Point 2 I'm less convinced about. Is there any value to having the signature show defaults of `None`/`None` when they will effectively default to `utf-8`/`strict` internally? I think that surfacing these defaults in the method signature makes the signature more informative about the default behaviour.
The only downside I can think of is that there is a maintenance cost of duplicating these defaults here; although I think the chances of the defaults for either of these changing are low enough for that to not be a concern?
The way I would address Point 1 only is:
- def encode(self, encoding='utf-8', errors='strict'):
+ def encode(self, encoding='utf-8', errors='strict'):
+     encoding = encoding or 'utf-8'
+     errors = errors or 'strict'
This is my preferred approach, but I'm keen to hear your feedback.
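To make Point 1 concrete, here is a small sketch of why the explicit defaults alone break callers that pass `None`, and how the `or` fallback restores that behaviour. Both helper functions are hypothetical stand-ins written for this example; they take the wrapped string as `data` rather than being methods on `UserString`.

```python
def encode_unguarded(data, encoding='utf-8', errors='strict'):
    # Hypothetical helper: forwards the arguments untouched, so an explicit
    # None reaches str.encode(), which rejects it with a TypeError.
    return data.encode(encoding, errors)


def encode_guarded(data, encoding='utf-8', errors='strict'):
    # Hypothetical helper: falsy arguments (including None) fall back to
    # the defaults before delegating.
    encoding = encoding or 'utf-8'
    errors = errors or 'strict'
    return data.encode(encoding, errors)


try:
    encode_unguarded('abc', None, None)
except TypeError as exc:
    print('unguarded:', type(exc).__name__)

print('guarded:', encode_guarded('abc', None, None))
```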
The way I would address Point 1 and Point 2 together is:
- def encode(self, encoding='utf-8', errors='strict'):
+ def encode(self, encoding=None, errors=None):
+     encoding = encoding or 'utf-8'
+     errors = errors or 'strict'
This keeps a single codepath, but does still embed the defaults in the implementation of `UserString` -- I'm not sure if that was part of your concern?
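As a rough illustration of what Point 2 means for anyone introspecting the method, the sketch below compares how the two signatures render through `inspect.signature`. The two functions are hypothetical stand-ins (plain functions taking `data`), not the actual `UserString` methods.

```python
import inspect


def encode_explicit(data, encoding='utf-8', errors='strict'):
    # Hypothetical variant: effective defaults are visible in the signature.
    return data.encode(encoding or 'utf-8', errors or 'strict')


def encode_none(data, encoding=None, errors=None):
    # Hypothetical variant: the signature shows None and the effective
    # defaults live only in the body.
    return data.encode(encoding or 'utf-8', errors or 'strict')


print(inspect.signature(encode_explicit))
print(inspect.signature(encode_none))
```

Both variants behave identically when called; only what `help()` and similar tools display differs.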
The last change, using None in the signature and applying the default values in the body, sounds good to me. Normally, str.encode has the defaults errors='strict' and encoding='utf-8', but I am just wondering if there is any other case I am missing where data could be something other than a plain string for which the default of 'utf-8' is assumed. I would suggest waiting for @rhettinger on this.
Ok, let's wait for confirmation.
In the meantime, I don't quite understand the aversion to including these default values in the signature -- is there a downside to changing the method signature that I'm missing?
Apologies for the slightly contrived example, but it is the one I can think of. Suppose I inherit from str and define a custom encode method with defaults of encoding='ascii' and errors='ignore', then pass an instance to UserString. With the PR, encode defaults to encoding='utf-8' and errors='strict', though encode should actually be called with 'ascii' and 'ignore' when no arguments are given.
# -*- coding: utf-8 -*-
# ../backups/bpo36582.py
from collections import UserString

class AsciiString(str):
    def encode(self, encoding='ascii', errors='ignore'):
        print(encoding, errors)
        return super().encode(encoding, errors)

data = AsciiString('早上好')
print(repr(UserString(data).encode()))
Python 3.7 has the bug where self.__class__ is called, but ignoring that, UserString(data).encode() ends up calling encode with encoding='ascii' and errors='ignore':
$ python3.7 ../backups/bpo36582.py
ascii ignore
"b''"
With the PR, the encode method's signature supplies encoding='utf-8' and errors='strict' when no values are given. So UserString(data).encode() calls AsciiString.encode, but always with encoding='utf-8' and errors='strict' unless the user specifies the values.
$ ./python.exe ../backups/bpo36582.py
utf-8 strict
b'\xe6\x97\xa9\xe4\xb8\x8a\xe5\xa5\xbd'
If we keep the old code and just remove the use of self.__class__, then the behavior matches Python 3.7 except that the return value is bytes, and we also keep compatibility: self.data.encode() is called with its own expected defaults (encoding='ascii', errors='ignore') instead of (encoding='utf-8', errors='strict').
$ ./python.exe ../backups/bpo36582.py
ascii ignore
b''
Feel free to correct me if I am missing something on the above.
Ok, so I believe what you're saying is that:
The `utf-8`/`strict` defaults are valid for the `str` class, but may not be valid for subclasses of `str` (such as the `AsciiString` class in your example).
If we want to support this use case then I believe the only option is to return to the original form of calling `encode()` with zero, one, or two arguments as appropriate, as both my proposed changes (above) would suffer this problem of assuming the incorrect defaults.
However, note that this would still not correctly handle the case where only `encode(errors=...)` is specified -- the supplied value of `errors` is currently discarded (a bug introduced in 3.1 when support for keyword arguments to `str.encode()` was added, making it possible to specify only `errors` alone.) This would require the addition of a fourth code path!
This aside, does the scenario of containing one custom `str`-derived subclass in another custom `UserString`-derived subclass really need to be supported? Reading the documentation for UserString led me to believe that `AsciiString` would not be preserved in `data`, which is documented as "The real `str` object used to store the contents of the `UserString` class." But I guess by the Liskov Substitution Principle, `AsciiString` is a real `str` class and should be treated as such...
The problem is that the scenario you outline precludes the use of any default values in `UserString` signatures if we want to support containment of a custom `str`-subclass that defines its own, different, defaults. i.e. I can adapt your `AsciiString` class into one that has different defaults for the `split()` method, or `find()`/`startswith()`/`endswith()`, and these will already not be respected. Perhaps these are even more esoteric than your example, but I think they are in the same class.
Suppose, for instance, I replace `AsciiString` in your example with an `IgnoreFirstCharacterString` class that defines a default value of `start=1` for `count()`, `find()`, `startswith()`, `endswith()`. Wrapping this `IgnoreFirstCharacterString` in a `UserString` would not currently preserve the 'ignore first character' behaviour.
Would you consider this an existing bug?
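A minimal sketch of that scenario, limited to `count()` for brevity. The class body is an assumption written for this example; only the name `IgnoreFirstCharacterString` and the `start=1` default come from the comment above.

```python
from collections import UserString


class IgnoreFirstCharacterString(str):
    """Illustrative str subclass whose count() skips the first character."""

    def count(self, sub, start=1, end=None):
        return super().count(sub, start, end)


data = IgnoreFirstCharacterString('aab')

# Called directly, the subclass default start=1 applies.
direct = data.count('a')

# UserString.count() supplies its own start/end defaults when delegating,
# so the subclass default is overridden.
wrapped = UserString(data).count('a')

print(direct, wrapped)
```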
My point was mostly about making subclasses of str with a custom encode method behave the same way as before, and about making the minimal change of just removing the self.__class__ wrappers so that bytes is returned. I will be happy with the core devs' decision on whether my concerns are not worth it, given that adding default values to the signature improves the interface as per your comment, and also given the perspective on cases like `IgnoreFirstCharacterString`.
Thanks for the details 👍
Please add a NEWS entry since this is a change in behavior. You can use blurb or blurb-it: https://devguide.python.org/committing/?highlight=news#what-s-new-and-news-entries
* Test to ensure that utf-8/strict are used as defaults
* Use the self.check*() methods instead of assertEqual() in tests (This makes test_encode() more portable; it is eligible for promotion to MixinStrUnicodeUserStringTest, as it is a valid test case for `str` too.)
Lib/collections/__init__.py
- return self.__class__(self.data.encode(encoding))
- return self.__class__(self.data.encode())
+ def encode(self, encoding='utf-8', errors='strict'):
+     encoding, errors = (encoding or 'utf-8'), (errors or 'strict')
The case `encoding=''` is different from `encoding=None`. Expand this code to:
encoding = 'utf-8' if encoding is None else encoding
errors = 'strict' if errors is None else errors
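The difference between the two tests can be sketched like this. The helper names are hypothetical and exist only for this example; the point is that the truthiness test silently turns an empty string into the default, while the `is None` test passes it through so that `str.encode()` can reject it.

```python
def normalize_truthy(encoding):
    # Hypothetical helper: any falsy value, including '', becomes 'utf-8'.
    return encoding or 'utf-8'


def normalize_is_none(encoding):
    # Hypothetical helper: only None becomes 'utf-8'; '' is left alone.
    return 'utf-8' if encoding is None else encoding


# With the truthiness test, encoding='' is quietly treated as utf-8.
print('abc'.encode(normalize_truthy('')))

# With the is-None test, '' reaches str.encode() and is reported as an
# unknown encoding.
try:
    'abc'.encode(normalize_is_none(''))
except LookupError as exc:
    print(type(exc).__name__)
```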
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase
Thanks for the astute observation @rhettinger -- I often point out the subtle bugs (or unintuitive error messages) that can result from relaxed use of "truthiness" so I completely agree with your suggested change in principle. Out of curiosity, is your request based purely on this principle, or did you have some specific scenario(s) in mind? From what I can see, the change you have suggested will, in practical terms:
Out of these observations the most significant one is probably the backward compatibility breakage, which would actually be motivation to not make the change you're suggesting (at least for Am I missing something more significant here? I hasten to add that this is my first contribution to CPython so I'm more than happy to defer to you on what the right thing to do is, I am just curious to understand what you had in mind when suggesting this change.
Please make the requested change. It matches what other APIs do and follows PEP 8 guidance on how to test for None.
NB: Minor backward compatibility break for any existing code that specifies encoding='' (or any other 'Falsy' value)
I have made the requested changes; please review again
Thanks for making the requested changes! @rhettinger: please review the changes made to this pull request.
Thanks @asqui for the PR, and @rhettinger for merging it 🌮🎉.. I'm working now to backport this PR to: 3.8.
GH-15557 is a backport of this pull request to the 3.8 branch.
…pythonGH-13138) (cherry picked from commit 2a16eea) Co-authored-by: Daniel Fortunov <asqui@users.noreply.github.com>
https://bugs.python.org/issue36582