umlauts in dataframe cannot be displayed in ipython notebook #2458

Closed
jankatins opened this issue Dec 9, 2012 · 11 comments

@jankatins

When I put this into an IPython notebook cell, an exception is thrown.

import pandas
data2 = [u"test", u"ß", u"ä", u"á"]
df2 = pandas.DataFrame({"a":data2})
print(df2["a"][1])
df2
ß
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-5-0bb7d6f7d515> in <module>()
      2 df2 = pandas.DataFrame({"a":data2})
      3 print(df2["a"][1])
----> 4 df2

C:\portabel\Python27\lib\site-packages\IPython\core\displayhook.pyc in __call__(self, result)
    244             self.update_user_ns(result)
    245             self.log_output(format_dict)
--> 246             self.finish_displayhook()
    247 
    248     def flush(self):

C:\portabel\Python27\lib\site-packages\IPython\zmq\displayhook.pyc in finish_displayhook(self)
     59         sys.stdout.flush()
     60         sys.stderr.flush()
---> 61         self.session.send(self.pub_socket, self.msg, ident=self.topic)
     62         self.msg = None
     63 

C:\portabel\Python27\lib\site-packages\IPython\zmq\session.pyc in send(self, stream, msg_or_type, content, parent, ident, buffers, track, header, metadata)
    576 
    577         buffers = [] if buffers is None else buffers
--> 578         to_send = self.serialize(msg, ident)
    579         to_send.extend(buffers)
    580         longest = max([ len(s) for s in to_send ])

C:\portabel\Python27\lib\site-packages\IPython\zmq\session.pyc in serialize(self, msg, ident)
    484             content = self.none
    485         elif isinstance(content, dict):
--> 486             content = self.pack(content)
    487         elif isinstance(content, bytes):
    488             # content is already packed, as in a relayed message

C:\portabel\Python27\lib\site-packages\IPython\zmq\session.pyc in <lambda>(obj)
     78 _version_info_list = list(IPython.version_info)
     79 # ISO8601-ify datetime objects
---> 80 json_packer = lambda obj: jsonapi.dumps(obj, default=date_default)
     81 json_unpacker = lambda s: extract_dates(jsonapi.loads(s))
     82 

C:\portabel\Python27\lib\site-packages\zmq\utils\jsonapi.pyc in dumps(o, **kwargs)
     70         kwargs['separators'] = (',', ':')
     71 
---> 72     return _squash_unicode(jsonmod.dumps(o, **kwargs))
     73 
     74 def loads(s, **kwargs):

C:\portabel\Python27\lib\site-packages\simplejson\__init__.pyc in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, encoding, default, use_decimal, namedtuple_as_object, tuple_as_array, bigint_as_string, sort_keys, item_sort_key, **kw)
    332         sort_keys=sort_keys,
    333         item_sort_key=item_sort_key,
--> 334         **kw).encode(obj)
    335 
    336 

C:\portabel\Python27\lib\site-packages\simplejson\encoder.pyc in encode(self, o)
    235         # exceptions aren't as detailed.  The list call should be roughly
    236         # equivalent to the PySequence_Fast that ''.join() would do.
--> 237         chunks = self.iterencode(o, _one_shot=True)
    238         if not isinstance(chunks, (list, tuple)):
    239             chunks = list(chunks)

C:\portabel\Python27\lib\site-packages\simplejson\encoder.pyc in iterencode(self, o, _one_shot)
    309                 Decimal=Decimal)
    310         try:
--> 311             return _iterencode(o, 0)
    312         finally:
    313             key_memo.clear()

UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 22: invalid continuation byte

This also happens when I have umlauts in csv files and specify an encoding:

df3 = pandas.read_csv("file_with_unlauts.csv", encoding="iso-8859-15")
print(df3.head(10)["name"][8]) # prints the right chars
df3.head(10) # throws the above error
@jankatins

pandas version: '0.10.0.dev-07318fa'

@ghost

ghost commented Dec 9, 2012

Cannot reproduce in the notebook or qtconsole with IPython git master (0.14-dev).
Which IPython version are you using?

@jankatins

I also updated to IPython git master from today, but I can still reproduce it in both the qtconsole and the notebook, and with different browsers.

Win7 64bit

The notebook: https://gist.github.com/4244669 http://nbviewer.ipython.org/4244669/

@jankatins

This also looks wrong to me:

df2.to_html()

results in

u'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>a</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <td><strong>0</strong></td>\n      <td> test</td>\n    </tr>\n    <tr>\n      <td><strong>1</strong></td>\n      <td>    \xdf</td>\n    </tr>\n    <tr>\n      <td><strong>2</strong></td>\n      <td>    \xe4</td>\n    </tr>\n    <tr>\n      <td><strong>3</strong></td>\n      <td>    \xe1</td>\n    </tr>\n  </tbody>\n</table>'

The result is a unicode string, but the umlauts are nevertheless encoded as in iso-8859-15 (compare u"ß".encode("iso-8859-15")).
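A note on reading the output above: in the repr of a unicode string, `\xdf` is the escape for the code point U+00DF, and it happens to share its number with the single byte that iso-8859-15 uses for "ß". The sketch below illustrates the difference. It is written in Python 3 syntax (the thread itself uses Python 2.7) and the variable names are only illustrative.

```python
# Sketch: a code-point escape in a repr versus actual encoded bytes.
text = "ß"  # one character, code point U+00DF

escaped = ascii(text)                # repr with non-ASCII characters escaped
latin9 = text.encode("iso-8859-15")  # one byte: 0xDF
utf8 = text.encode("utf-8")          # two bytes: 0xC3 0x9F

print(escaped)
print(latin9)
print(utf8)
print(len(text), len(latin9), len(utf8))
```

The escaped repr and the iso-8859-15 bytes look alike because the code point and the byte value coincide for this character. The UTF-8 form differs, which is where a mismatch between encodings would become visible.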

@ghost

ghost commented Dec 9, 2012

What are the values of sys.stdout.encoding, sys.stdin.encoding, and pandas.get_option("print.encoding")?
Try with qtconsole for now.

@ghost

ghost commented Dec 9, 2012

Also, sys.getdefaultencoding().

@jankatins

import sys
sys.stdout.encoding, sys.stdin.encoding, pandas.get_option("print.encoding"), sys.getdefaultencoding()
('UTF-8', None, 'cp1252', 'ascii')
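
The values above show a mismatch: print.encoding is cp1252 while sys.stdout.encoding is UTF-8. The thread does not state the root cause, so the following is only one plausible reading: text encoded as cp1252 and later decoded as UTF-8 fails with the same kind of error as in the traceback. This is a minimal sketch in Python 3 syntax. The helper `try_decode` and the sample cell text are hypothetical and are not taken from pandas or IPython.

```python
def try_decode(raw, encoding):
    """Return (text, None) on success or (None, reason) on a decode error."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as err:
        reason = "byte 0x%02x at position %d: %s" % (
            raw[err.start], err.start, err.reason)
        return None, reason


cell = "    ß\n"             # hypothetical padded table cell
raw = cell.encode("cp1252")  # what a cp1252 print encoding would produce

ok_text, ok_err = try_decode(raw, "cp1252")  # matching codec: round-trips
bad_text, bad_err = try_decode(raw, "utf-8")  # mismatched codec: fails

print(ok_text == cell)
print(bad_err)
```

Decoding with the codec that produced the bytes round-trips cleanly. Decoding the same bytes as UTF-8 reports an invalid continuation byte at 0xdf, as in the original report.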

@ghost

ghost commented Dec 9, 2012

I think I see what's happening.
Try the updated git master and post the results again.

@ghost ghost self-assigned this Dec 9, 2012
@jankatins

Yay, that fixed it! Thanks!

@ghost

ghost commented Dec 9, 2012

you're welcome.

closed via 7cc9779

@ghost ghost closed this as completed Dec 9, 2012
@hmeine

hmeine commented Jul 20, 2014

I have the same problem now with 0.14.0 and 0.14.1, running latest IPython from git.

sys.stdin.encoding: UTF-8 (in the terminal; in the notebook I am getting None)
sys.stdout.encoding: UTF-8
sys.stderr.encoding: UTF-8
sys.getdefaultencoding(): ascii
sys.getfilesystemencoding(): utf-8
locale.getpreferredencoding(): UTF-8

(Did not notice this problem before, but I am also unsure how often I had umlauts in my data.)
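
For anyone who wants to report the same set of values, they can be collected in one go. This is a small sketch using only the standard library. The function name `encoding_report` is made up for illustration, and `getattr` is used because a stream may be replaced or missing in some front ends.

```python
import locale
import sys


def encoding_report():
    """Collect the interpreter's encoding settings into a dict."""
    return {
        "sys.stdin.encoding": getattr(sys.stdin, "encoding", None),
        "sys.stdout.encoding": getattr(sys.stdout, "encoding", None),
        "sys.stderr.encoding": getattr(sys.stderr, "encoding", None),
        "sys.getdefaultencoding()": sys.getdefaultencoding(),
        "sys.getfilesystemencoding()": sys.getfilesystemencoding(),
        "locale.getpreferredencoding()": locale.getpreferredencoding(),
    }


for name, value in encoding_report().items():
    print("%s: %s" % (name, value))
```

Running it once in the terminal and once in a notebook cell makes differences such as the stdin value mentioned above easy to compare.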

@jankatins jankatins unassigned ghost Jul 20, 2014
This issue was closed.