pd.get_dummies incorrectly encodes unicode characters in dataframe column names #22084

Closed
Scorpil opened this issue Jul 27, 2018 · 1 comment · Fixed by #22131
Labels
Unicode Unicode strings

@Scorpil
Contributor

Scorpil commented Jul 27, 2018

Problem description

In Python 2.x, calling pd.get_dummies on a DataFrame whose column names contain Unicode characters outside the ASCII range raises a UnicodeEncodeError. The problem first appeared in version 0.21.0 and is still present in 0.23.3 and on the master branch. It was introduced in this commit: 133a208#diff-fef81b7e498e469973b2da18d19ff6f3L1256.

The cause is that older pandas versions built the column name with the % formatting operator, which automatically promotes the result to a Unicode string if one or more arguments are themselves Unicode strings. The new code uses the .format method and chooses a unicode or str template based solely on the type of the level variable, so a Unicode prefix or separator is implicitly encoded as ASCII.
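
The difference can be reproduced without pandas. The sketch below runs on Python 3 and only emulates what Python 2 does implicitly: formatting a byte-string template with .format encodes each Unicode argument with the 'ascii' codec, whereas % promotes the whole result to Unicode. The helper names py2_style_format and py2_style_percent are hypothetical and exist only for illustration; they are not part of pandas.

```python
# Emulation (on Python 3) of the two Python 2 behaviours described above.
# Both helper names are hypothetical and are not part of pandas.


def py2_style_format(prefix, sep, level):
    """Mimic Python 2 '{prefix}{sep}{level}'.format(...) on a byte-string
    template: every Unicode argument is implicitly encoded as ASCII."""
    parts = [prefix, sep, level]
    return b"".join(part.encode("ascii") for part in parts)


def py2_style_percent(prefix, sep, level):
    """Mimic Python 2 '%s%s%s' % (...): if any argument is Unicode, the
    result is promoted to Unicode and nothing is encoded."""
    return u"%s%s%s" % (prefix, sep, level)


column_name = u"\xe4"  # u'ä', the column name from the report

try:
    py2_style_format(column_name, u"_", u"a")
    format_error = None
except UnicodeEncodeError as exc:
    format_error = exc

percent_result = py2_style_percent(column_name, u"_", u"a")

print(repr(format_error))
print(repr(percent_result))
```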

Series.str.get_dummies is not affected, but it may be worth checking other .format calls for similar issues.

Code Sample, a copy-pastable example if possible

In pandas 0.23.3

pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-17-61dd26c6814f> in <module>()
----> 1 pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()

/usr/local/lib/python2.7/site-packages/pandas/core/reshape/reshape.pyc in get_dummies(data, prefix, prefix_sep, dummy_na, columns, sparse, drop_first, dtype)
    890             dummy = _get_dummies_1d(col[1], prefix=pre, prefix_sep=sep,
    891                                     dummy_na=dummy_na, sparse=sparse,
--> 892                                     drop_first=drop_first, dtype=dtype)
    893             with_dummies.append(dummy)
    894         result = concat(with_dummies, axis=1)

/usr/local/lib/python2.7/site-packages/pandas/core/reshape/reshape.pyc in _get_dummies_1d(data, prefix, prefix_sep, dummy_na, sparse, drop_first, dtype)
    942                       else '{prefix}{sep}{level}' for v in levels]
    943         dummy_cols = [dummy_str.format(prefix=prefix, sep=prefix_sep, level=v)
--> 944                       for dummy_str, v in zip(dummy_strs, levels)]
    945     else:
    946         dummy_cols = levels

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)

Expected Output

As in pandas 0.19.2:

pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()
[u'\xe4_a']

As a side note, changing the default system encoding to 'utf-8' with:

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

(which is a bad idea anyway, though people still do it) makes the encoding problem even worse. get_dummies then encodes the Unicode string into a byte string, and the column can no longer be looked up with the correct Unicode name. This hides the error and makes it very hard to debug, because the exception is raised far from the root cause:

# pandas v0.23.3
pd.get_dummies(pd.DataFrame({u'ä': ['a']})).columns.tolist()
['\xc3\xa4_a']

pd.get_dummies(pd.DataFrame({u'ä': ['a']}))[u'ä_a']
... traceback ...
KeyError: u'\xe4_a'
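
The mismatch comes down to two different keys. The sketch below runs on Python 3 and only emulates the Python 2 behaviour: it encodes the expected name as UTF-8, which is what happens implicitly once the default encoding is changed, and shows that a lookup with the original Unicode name no longer finds it. A plain dict stands in for the column index; that substitution is an assumption made for illustration.

```python
# Emulation on Python 3; a plain dict stands in for the DataFrame's columns.
expected_name = u"\xe4_a"                    # u'ä_a', the name the caller expects
stored_name = expected_name.encode("utf-8")  # the byte string that ends up stored

columns = {stored_name: [1]}

try:
    columns[expected_name]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(repr(stored_name))
print(lookup_failed)
```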

A similar problem appears with:

pd.get_dummies(pd.DataFrame({'a': ['a']}), prefix=u'ä').columns.tolist()
pd.get_dummies(pd.DataFrame({'a': ['a']}), prefix_sep=u'ä').columns.tolist()
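
One possible way to cover all three inputs is to promote the template to Unicode whenever the prefix, the separator, or the level is a Unicode string. The following is a minimal sketch under that assumption, not necessarily the change that ends up in pandas; make_dummy_name and text_type are hypothetical names.

```python
import sys

PY2 = sys.version_info[0] == 2
# `unicode` is only evaluated on Python 2, where it exists.
text_type = unicode if PY2 else str  # noqa: F821


def make_dummy_name(prefix, prefix_sep, level):
    """Build '<prefix><prefix_sep><level>' without an implicit ASCII encode."""
    template = "{prefix}{prefix_sep}{level}"
    if PY2 and any(isinstance(part, text_type)
                   for part in (prefix, prefix_sep, level)):
        # Promote the byte-string template so .format() returns Unicode.
        template = template.decode("ascii")
    return template.format(prefix=prefix, prefix_sep=prefix_sep, level=level)


names = [
    make_dummy_name(u"\xe4", "_", "a"),  # Unicode column name
    make_dummy_name("a", u"\xe4", "a"),  # Unicode separator
    make_dummy_name("a", "_", u"\xe4"),  # Unicode level
]
print(names)
```

On Python 3 the promotion branch is never taken, because every str is already Unicode.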

Output of pd.show_versions()

commit: dfd58e8d1b32daddde18f40c289af1f77ad219b7
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.24.0.dev0+364.gdfd58e8d1.dirty
pytest: 3.6.3
pip: 10.0.1
setuptools: 39.2.0
Cython: 0.28.4
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: 1.7.6
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Since the cause is clear, creating a PR should be fairly simple. I will try to write one this weekend if this issue is confirmed as a bug.

gfyoung added the Unicode and 2/3 Compat labels on Jul 30, 2018
@gfyoung
Member

gfyoung commented Jul 30, 2018

Python 2.x compatibility...sigh...

Yes, a PR would be greatly appreciated!

jreback added this to the 0.24.0 milestone on Aug 2, 2018
3 participants