Problem description
In Python 2.x, calling `pd.get_dummies` on a DataFrame whose column names are Unicode strings containing non-ASCII characters raises a UnicodeEncodeError. The problem first appeared in version 0.21.0 and is still present in 0.23.3, as well as on the master branch. It was introduced in this commit: 133a208#diff-fef81b7e498e469973b2da18d19ff6f3L1256.

The cause is that older pandas versions used the `%` formatting operator, which automatically promotes the result to a Unicode string if one or more arguments are themselves Unicode strings, while the new code uses the `.format` method and chooses unicode/str based solely on the type of the `level` variable.

`Series.str.get_dummies` is not affected, but it might be worth checking other `.format` calls for similar issues.

Code Sample, a copy-pastable example if possible
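A minimal sketch of the mechanism described above, written without pandas. The variable names and the template string are assumptions chosen for illustration, not the pandas source. Under Python 2 the `.format` call is expected to hit the `except` branch; under Python 3 all strings are Unicode, so both styles succeed.

```python
# Illustration only: names and template are assumptions, not pandas code.
prefix = u'\xe4'   # non-ASCII Unicode column name (the letter a-umlaut)
sep = '_'
level = 'a'        # a plain (byte) str under Python 2

# Old style: under Python 2, % promotes the result to unicode
# as soon as any argument is unicode.
old_name = '%s%s%s' % (prefix, sep, level)

# New style: under Python 2, a byte-string template has to encode unicode
# arguments with the default codec (ASCII), which fails for non-ASCII text.
try:
    new_name = '{prefix}{sep}{level}'.format(prefix=prefix, sep=sep, level=level)
except UnicodeEncodeError:
    new_name = None  # reached on Python 2, not on Python 3
```

The template type is the only thing that differs between a failing and a working `.format` call here: a `u'...'` template would accept the Unicode prefix under Python 2 as well.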
In pandas 0.23.3
Expected Output
As in pandas 0.19.2:
On a side note: setting the default system encoding to 'utf-8' with:
(which, of course, is a bad idea anyway, but people still do) makes the encoding problems even worse. `get_dummies` will encode the Unicode string into a byte string, and it will then be impossible to look up the column name with the expected Unicode string. This hides the error and makes it very hard to debug, since the exception is raised far away from the root cause:
Similar problem will appear with:
Output of `pd.show_versions()`
Since the problem is clear, creating a PR should be fairly simple. I will try to write one this weekend if this issue is accepted as a bug.
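One possible shape for such a fix, as a rough sketch only: pick a Unicode template whenever any of the parts is a Unicode string, which mirrors what the `%` operator did implicitly. The helper name `make_dummy_name` and its signature are hypothetical and do not come from pandas or from any actual patch.

```python
import sys

PY2 = sys.version_info[0] == 2
# `unicode` only exists on Python 2; the name is never evaluated on Python 3.
text_type = unicode if PY2 else str  # noqa: F821


def make_dummy_name(prefix, sep, level):
    """Hypothetical helper: build a dummy column name from its parts."""
    parts = (prefix, sep, level)
    if any(isinstance(part, text_type) for part in parts):
        # At least one Unicode part: use a Unicode template so that
        # nothing has to be encoded to ASCII on Python 2.
        template = u'{prefix}{sep}{level}'
    else:
        template = '{prefix}{sep}{level}'
    return template.format(prefix=prefix, sep=sep, level=level)


name = make_dummy_name(u'\xe4', '_', 'a')
```

Under Python 2 this sketch would still fail if a Unicode part is mixed with a non-ASCII byte string, because the byte string then has to be decoded implicitly; handling that case is left out here.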