UnicodeDecodeError: 'gbk' codec can't decode byte when running parse_pdf_text.py #9

Open
zlqs1985 opened this issue Aug 9, 2017 · 1 comment

zlqs1985 commented Aug 9, 2017

Hi, thank you for your wonderful book on data wrangling.
I ran into an issue when running parse_pdf_text.py from chapter 5 in Anaconda (Python 3.5).
The IDE shows the following error message:

Traceback (most recent call last):

  File "<ipython-input-10-957ab6bc6f5e>", line 39, in <module>
    for line in openfile:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x93 in position 46: illegal multibyte sequence

It looks like the code opened the file in text mode with a "gbk" encoding. Should it be opened in binary mode instead? I'm not sure. How can I fix this problem? Thank you.
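
For context, here is a minimal sketch of how this kind of error can arise. In Python 3, `open()` in text mode with no `encoding` argument falls back to the locale's preferred encoding, which may be GBK on a Chinese-locale Windows install. The sample bytes below are hypothetical (a UTF-8 en dash followed by a space); the actual contents and encoding of the book's file are not known from this report.

```python
import locale

# open() in text mode uses this encoding when no encoding= is passed.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)  # value depends on the machine's locale settings

# Hypothetical sample: a UTF-8 en dash followed by a space, b'\xe2\x80\x93 '.
sample = "\u2013 ".encode("utf-8")

# Decoding the same bytes as GBK fails, because they are not valid GBK.
failed = False
try:
    sample.decode("gbk")
except UnicodeDecodeError as exc:
    failed = True
    print(exc)

# Decoding with the encoding the bytes were written in works.
print(repr(sample.decode("utf-8")))
```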

kjam (Collaborator) commented Aug 14, 2017

Hi there,

Can you change this line near the top of the file:

openfile = open(pdf_txt, 'r')

to this:

openfile = open(pdf_txt, 'rb')

And let me know if that works better? Thanks!

-kjam
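
A rough sketch of what the `'rb'` change implies, alongside one alternative. In binary mode each line comes back as `bytes` rather than `str`, so any later string comparisons would need the line decoded first. The temporary file, its contents, and the UTF-8 encoding are assumptions made for illustration; the encoding of the book's actual text file is not stated in this thread.

```python
import os
import tempfile

# Hypothetical stand-in for the file behind pdf_txt; UTF-8 is an assumption.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("Country \u2013 Total\nChad \u2013 26\n".encode("utf-8"))
    pdf_txt = tmp.name

# Option 1: binary mode, as suggested above. Each line is a bytes object,
# so decode it before comparing it with str values.
with open(pdf_txt, "rb") as openfile:
    lines_from_bytes = [raw.decode("utf-8").rstrip("\r\n") for raw in openfile]

# Option 2: stay in text mode but name the encoding explicitly, so the
# locale default (GBK in the report above) is never used.
with open(pdf_txt, "r", encoding="utf-8") as openfile:
    lines_from_text = [line.rstrip("\r\n") for line in openfile]

os.remove(pdf_txt)
print(lines_from_bytes)
```

Both options read the same text here. If the real file turns out not to be UTF-8, the `decode` call or the `encoding=` argument would need the matching codec name instead.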
