pd.get_dummies incorrectly encodes unicode characters in dataframe column names #22084

Scorpil · 2018-07-27T15:25:57Z

Problem description

In Python 2.x, calling pd.get_dummies on a DataFrame whose column names contain non-ASCII Unicode characters raises a UnicodeEncodeError. The problem first appeared in version 0.21.0 and is still present in 0.23.3, as well as on the master branch. It was introduced in this commit: 133a208#diff-fef81b7e498e469973b2da18d19ff6f3L1256.

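The reason is that older pandas versions used the % formatting operator, which automatically promotes the result to a Unicode string if one or more arguments are themselves Unicode strings, while the new code uses the .format method and chooses between a unicode and a str template based solely on the type of the level variable. A Unicode prefix (the column name) combined with a str level therefore gets formatted into a str template, which triggers an implicit ASCII encode.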
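One possible way to avoid the mismatch is to pick the template type from all three arguments rather than from the level alone. The snippet below is only a minimal sketch of that idea: `make_dummy_name`, `PY2` and `text_type` are names made up for illustration, and this is not the pandas implementation or the eventual patch.

```python
# -*- coding: utf-8 -*-
import sys

PY2 = sys.version_info[0] == 2
# `unicode` only exists on Python 2; the conditional avoids a NameError on 3.
text_type = unicode if PY2 else str  # noqa: F821


def make_dummy_name(prefix, sep, level):
    """Hypothetical helper: build '<prefix><sep><level>' safely on Python 2."""
    template = '{prefix}{sep}{level}'
    # On Python 2, promote the template to unicode if ANY argument is unicode,
    # mirroring what the % operator used to do implicitly.
    if PY2 and any(isinstance(x, text_type) for x in (prefix, sep, level)):
        template = template.decode('ascii')
    return template.format(prefix=prefix, sep=sep, level=level)


# Unicode column name with a plain-str level, as in the report.
name = make_dummy_name(u'\xe4', '_', 'a')
print(repr(name))
```

On Python 3 the promotion branch is never taken, because every string is already Unicode text.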

Series.str.get_dummies is not affected, but it might be worth checking other .format calls for similar issues.

Code Sample, a copy-pastable example if possible

In pandas 0.23.3

pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-17-61dd26c6814f> in <module>()
----> 1 pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()

/usr/local/lib/python2.7/site-packages/pandas/core/reshape/reshape.pyc in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype)
    890             dummy = _get_dummies_1d(col[1], prefix=pre, prefix_sep=sep,
    891                                     dummy_na=dummy_na, sparse=sparse,
--> 892                                     drop_first=drop_first, dtype=dtype)
    893             with_dummies.append(dummy)
    894         result = concat(with_dummies, axis=1)

/usr/local/lib/python2.7/site-packages/pandas/core/reshape/reshape.pyc in _get_dummies_1d(data, prefix, prefix_sep, dummy_na, sparse, drop_first, dtype)
    942                       else '{prefix}{sep}{level}' for v in levels]
    943         dummy_cols = [dummy_str.format(prefix=prefix, sep=prefix_sep, level=v)
--> 944                       for dummy_str, v in zip(dummy_strs, levels)]
    945     else:
    946         dummy_cols = levels

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

Expected Output

As in pandas 0.19.2:

pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()

[u'\xe4_a']

On a side note: setting the default system encoding to 'utf-8' with:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

(which, of course, is a bad idea anyway, but people still do it) makes the encoding problems even worse. get_dummies will encode the Unicode string into a byte string, and it will then be impossible to look up the column by the expected Unicode name. This hides the error and makes it very hard to debug, since the exception is raised far away from the root cause:

# pandas v0.23.3
pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()
['\xc3\xa4_a']

pd.get_dummies(pd.DataFrame({u'ä': ['a']}))[u'ä_a']
... traceback ...
KeyError: u'\xe4_a'
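The lookup failure can be reproduced without pandas. The following is a rough illustration, using a plain dict as a stand-in for the column index (an assumption made for brevity): the stored key is the UTF-8 encoded byte string, so the Unicode name no longer matches it.

```python
# -*- coding: utf-8 -*-
expected = u'\xe4_a'               # the name the caller looks up
stored = expected.encode('utf-8')  # the byte string that ends up as the column name

# A dict standing in for the column index (illustration only).
columns = {stored: 0}

print(repr(stored))
print(expected in columns)                 # the Unicode key is not found
print(stored.decode('utf-8') == expected)  # decoding recovers the original name
```

The byte string and the Unicode string describe the same text, but they are different keys, which is why the KeyError surfaces only later at lookup time.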

Output of `pd.show_versions()`

commit: dfd58e8d1b32daddde18f40c289af1f77ad219b7
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.24.0.dev0+364.gdfd58e8d1.dirty
pytest: 3.6.3
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.28.4
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: 1.7.6
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Since the problem is clear, creating a PR should be rather simple. I will try to write one this weekend if this issue is approved as a bug.


gfyoung · 2018-07-30T13:49:37Z

Python 2.x compatibility...sigh...

Yes, a PR would be greatly appreciated!

gfyoung added the Unicode and 2/3 Compat labels Jul 30, 2018

Scorpil mentioned this issue Jul 30, 2018

BUG: Fix get dummies unicode error #22131

Merged

4 tasks

jreback added this to the 0.24.0 milestone Aug 2, 2018

jreback closed this as completed in #22131 Aug 2, 2018
