UnicodeDecodeError: 'gbk' codec can't decode byte when running parse_pdf_text.py #9

zlqs1985 · 2017-08-09T12:27:56Z

Hi, thank you for your wonderful book on data wrangling.
I ran into an issue when running parse_pdf_text.py from chapter 5 in Anaconda (Python 3.5).
The IDE shows the following error message:

Traceback (most recent call last):

  File "<ipython-input-10-957ab6bc6f5e>", line 39, in <module>
    for line in openfile:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x93 in position 46: illegal multibyte sequence

It looks like the code opened the file in text mode with the "gbk" encoding. Should it be opened in binary mode instead? I'm not sure. How can I fix this problem? Thank you.
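
For context, here is a minimal sketch of why the error shows up. On Python 3.5, `open()` in text mode without an `encoding` argument falls back to the locale's preferred encoding, which on a Chinese-locale Windows install is typically GBK. The sample bytes below are made up for illustration and are not taken from the book's data file.

```python
import locale

# On Python 3.5, open(path, 'r') with no encoding argument uses this value.
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

# Hypothetical sample: 0x93 is a lead byte in GBK, and the space that
# follows is not a valid trail byte, so decoding as GBK fails.
sample = b"rate \x93 5"

try:
    sample.decode("gbk")
    gbk_failed = False
except UnicodeDecodeError as err:
    gbk_failed = True
    print("gbk decode failed:", err.reason)
```

If the first line prints something like `cp936` or `gbk`, the default encoding is the likely cause rather than the script itself.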


kjam · 2017-08-14T18:01:34Z

Hi there,

Can you change this line near the top of the file:

openfile = open(pdf_txt, 'r')

to this:

openfile = open(pdf_txt, 'rb')

And let me know if that works better? Thanks!

-kjam
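
A note on the two options, as a rough sketch rather than the book's actual code: binary mode avoids decoding entirely, but every line then becomes a `bytes` object, so later string comparisons in the script may need adjusting. Passing an explicit `encoding` keeps the lines as `str`. The file's real encoding is not stated in this thread; `cp1252` below is only an assumption (byte 0x93 is a curly quote in that encoding), and the sample file contents are hypothetical.

```python
import os
import tempfile

# Hypothetical stand-in for the book's text file: cp1252 "smart quotes"
# (bytes 0x93 and 0x94) are a common source of byte 0x93 in exported text.
sample_bytes = b"rate was \x93 5 percent \x94\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample_bytes)
    pdf_txt = tmp.name

# Option 1: binary mode, as suggested above. Each line is a bytes object.
with open(pdf_txt, "rb") as openfile:
    byte_lines = [line.rstrip(b"\r\n") for line in openfile]

# Option 2: text mode with an explicit encoding (cp1252 is an assumption).
# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
with open(pdf_txt, "r", encoding="cp1252", errors="replace") as openfile:
    text_lines = [line.rstrip("\n") for line in openfile]

os.remove(pdf_txt)

print(byte_lines)
print(text_lines)
```

If the file turns out to be UTF-8, swapping in `encoding="utf-8"` is the only change needed.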
