Non UTF-8 character fonts cause UnicodeDecodeError #50

Open
zionsofer opened this issue Feb 1, 2022 · 1 comment
Labels
poppler-cpp Need to be fixed upstream

Comments

@zionsofer

I'm trying to parse a PDF that contains Chinese characters.
The text is extracted okay, but when I try to access fonts, I get the following error:

>>> box.get_font_name()  # box was extracted from some page and contains Chinese characters
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<PATH>/lib/python3.7/site-packages/poppler/utilities.py", line 90, in wrapped
    return fct(*args, **kwargs)
  File "<PATH>/lib/python3.7/site-packages/poppler/page.py", line 64, in get_font_name
    return self._text_box.get_font_name(i)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 7: invalid start byte

Trying to iterate fonts through the document itself results in the same error.

Environment:
Python 3.7.4
Poppler 21.12.0 (Compiled from source).
Happens on both Mac and Ubuntu.

I have seen other poppler bindings, such as this one, that handle those errors (by using the replace error handler when decoding the string). Unfortunately, it relies on deprecated internal APIs and cannot be used with a newer version of poppler (even when building from source).

If there were a way to supply the required encoding, or even to suppress or ignore those errors, it would be very beneficial.
I have seen a comment on another ticket saying that we can request that the encoding/decoding be exposed in the cpp backend.
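
To illustrate what the replace/ignore behaviour would look like, here is a minimal sketch in plain Python. It assumes the raw font-name bytes were available on the Python side, which the binding does not currently expose (the exception is raised while the C++ string is being converted). The byte string and the `decode_lenient` helper are hypothetical and only serve the illustration.

```python
# Hypothetical raw font name: a subset prefix followed by GBK-encoded bytes.
# These bytes are an assumption for illustration, not taken from the actual PDF.
raw_name = b"ABCDEF+\xba\xda\xcc\xe5"


def decode_lenient(raw: bytes, errors: str = "replace") -> str:
    """Decode as UTF-8, substituting or dropping bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors=errors)


# Strict decoding fails in the same way as the traceback above.
try:
    raw_name.decode("utf-8")
except UnicodeDecodeError as exc:
    strict_error = exc
else:
    strict_error = None

# "replace" substitutes U+FFFD for undecodable bytes; "ignore" drops them.
replaced = decode_lenient(raw_name)
ignored = decode_lenient(raw_name, errors="ignore")
```

Either handler avoids the exception but loses the non-UTF-8 part of the name, so it is a way to keep iterating over fonts rather than a way to recover the real font name.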

@cbrunet
Owner

cbrunet commented Apr 11, 2022

poppler-cpp gives the font name as std::string, not as ustring, so no encoding information comes with it. Therefore, I think the bug must be resolved upstream, unless we use some heuristics to guess the encoding, which would probably be fragile.
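
For context, an encoding-guessing heuristic of the kind mentioned above could look roughly like the following sketch. It is not part of python-poppler; the `guess_decode` function, the candidate list, and the sample bytes are all assumptions made for illustration, and it presumes access to the raw bytes, which the binding does not currently provide.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate list; the order is a guess and strongly affects the result.
DEFAULT_CANDIDATES = ("utf-8", "gb18030", "big5", "shift_jis")


def guess_decode(
    raw: bytes, candidates: Sequence[str] = DEFAULT_CANDIDATES
) -> Tuple[str, Optional[str]]:
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Falls back to UTF-8 with replacement characters, reporting None as the
    encoding, when no candidate succeeds.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace"), None


# Hypothetical font name: subset prefix plus GBK-encoded bytes.
text, encoding = guess_decode(b"ABCDEF+\xba\xda\xcc\xe5")
```

The fragility is visible in the design: the first encoding that happens to decode without error wins, and the same bytes may well be valid in more than one legacy encoding, so a different candidate order can silently produce a different (wrong) name.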

@cbrunet cbrunet added the poppler-cpp Need to be fixed upstream label Apr 11, 2022