extract_text() return garbled characters #2330
Comments
I have the same problem.
Thank you for the good error report. I can confirm:
I'll make some more checks after work.
This might be related to #2295 as well. In pypdf/pypdf/_text_extraction/__init__.py (line 231 at commit 38795f5), [b'\x05\xbb'] is decoded using utf-16-be, as this is the cmap[0] value:
('utf-16-be', {}, '/F1', {'/Subtype': '/Type0', '/DescendantFonts': [IndirectObject(7, 0, 140296872535904)], '/Name': '/F1', '/BaseFont': '/KSZZAC+SimSun', '/Encoding': '/Identity-H', '/Type': '/Font'})
Note that the character map in this tuple is empty ({}), so nothing translates the 2-byte codes of this font into Unicode. The Latin text seems to use actual charmaps instead:
For reference: The file from #2295 has
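To illustrate why this fallback produces garbage, here is a minimal sketch using only the Python standard library. It is not pypdf's actual code path. It assumes the two bytes quoted above are a character code of a font with /Identity-H encoding, and the to_unicode dictionary is a hypothetical stand-in for a ToUnicode table with an arbitrary placeholder character; the real glyph behind this code is not known from this thread.

```python
# The two bytes quoted in the comment above, as they appear in the content stream.
raw = b"\x05\xbb"

# With /Identity-H, the bytes form a 2-byte character code (a CID), not Unicode text.
cid = int.from_bytes(raw, "big")

# Decoding as UTF-16-BE reinterprets the same number as a Unicode code point.
# U+05BB lies in the Hebrew block (U+0590..U+05FF), unrelated to a SimSun glyph.
fallback = raw.decode("utf-16-be")

# Hypothetical ToUnicode-style table: maps character codes to real text.
# The value here is an arbitrary placeholder, NOT the actual character in the PDF.
to_unicode = {0x05BB: "\u4e2d"}

# With a mapping available the lookup wins; with an empty map ({}), as in the
# cmap[0] tuple above, only the misleading fallback remains.
mapped = to_unicode.get(cid, fallback)
unmapped = {}.get(cid, fallback)

print(hex(cid), repr(fallback), repr(mapped), repr(unmapped))
```

In this sketch, the empty-map case returns the UTF-16-BE reinterpretation unchanged, which matches the kind of garbled output reported in this issue.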
Can you please indicate which program you used? I did the test unsuccessfully with Acrobat.
I used the Google Chrome reader.
I get garbled characters when parsing a PDF file. The file I use is this. Could there be an encoding issue?
Environment
Code + PDF
This is a minimal, complete example that shows the issue:
The pdf file can be obtained from this url.
The output is: