extract_text() return garbled characters #2330

ChanghaoLau · 2023-12-07T08:07:54Z

I get garbled characters when parsing a PDF file. The file I use is this. Could this be an encoding issue?

Environment

$ python -m platform
Linux-4.18.0-147.5.1.6.h841.eulerosv2r9.x86_64-x86_64-with-glibc2.17

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.1, crypt_provider=('pycryptodome', '3.19.0'), PIL=10.0.1

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader

file_path = '20120812.pdf'
page_idx = 0

reader = PdfReader(file_path)
page = reader.pages[page_idx]
text = page.extract_text()
print(text)

The PDF file can be obtained from this URL.

The output is:

2012୍8ᄅ ACTA AUTOMATICA SINICA August, 2012
م
ᇛ ਟ1ࡹ1ྷ ೦2ᅦ ม1
ᅋေم, ྛऊো ,ۋ, ০Ⴈ
......
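
A quick way to check that this output is mis-decoded rather than just unexpected is to look at which Unicode ranges the extracted characters fall into. The following is only an illustrative sketch: the helper name `cjk_ratio` and the choice of the CJK Unified Ideographs range (U+4E00 to U+9FFF) are assumptions made for this example and are not part of pypdf. A page from a Chinese journal should yield mostly CJK ideographs, whereas the first output line above contains none.

```python
import unicodedata


def cjk_ratio(text: str) -> float:
    """Share of non-ASCII, non-space characters that are CJK ideographs.

    Hypothetical helper for illustration; only the basic CJK Unified
    Ideographs block (U+4E00..U+9FFF) is counted.
    """
    candidates = [ch for ch in text if ord(ch) > 0x7F and not ch.isspace()]
    if not candidates:
        return 0.0
    cjk = [ch for ch in candidates if 0x4E00 <= ord(ch) <= 0x9FFF]
    return len(cjk) / len(candidates)


# First line of the garbled output reported above.
garbled = "2012୍8ᄅ ACTA AUTOMATICA SINICA August, 2012"
print(cjk_ratio(garbled))

# List the non-ASCII characters with their code points and Unicode names.
for ch in garbled:
    if ord(ch) > 0x7F:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
```

A ratio near zero on a page that visibly contains Chinese text suggests the bytes were interpreted with the wrong mapping.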


unique-Li-yuankun · 2023-12-12T01:57:46Z

I have the same problem.

MartinThoma · 2023-12-12T10:46:44Z

Thank you for the good error report.

I can confirm:

The PDF contains text that can be copy-pasted (it's not an image)
The copy-pasted text looks fine (it's not intentionally garbled within the file / via the font to avoid copy-pasting)
pypdf was used

I'll make some more checks after work.

stefan6419846 · 2023-12-12T15:15:49Z

This might be related to #2295 as well. In

pypdf/pypdf/_text_extraction/__init__.py

Line 231 in 38795f5

t = tt.decode(cmap[0], "surrogatepass") # apply str encoding

we decode the operands [b'\x05\xbb'] using utf-16-be, as this is the cmap[0] value:

('utf-16-be', {}, '/F1', {'/Subtype': '/Type0', '/DescendantFonts': [IndirectObject(7, 0, 140296872535904)], '/Name': '/F1', '/BaseFont': '/KSZZAC+SimSun', '/Encoding': '/Identity-H', '/Type': '/Font'})
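
To make the effect concrete, here is a small standalone sketch of what that decode call does with the operand quoted above. It does not use pypdf. With `/Identity-H`, the two bytes are a 2-byte character code that maps one-to-one to a CID, so treating them as a UTF-16 code unit only gives the right character if the font happens to use Unicode-compatible codes. The dump above shows an empty map (`{}`) and no `/ToUnicode` key at the top level of the font dictionary; whether the descendant font provides any further mapping is not visible here, so that part is left open.

```python
import unicodedata

operand = b"\x05\xbb"  # the two-byte operand quoted above

# What the quoted line does when cmap[0] is "utf-16-be":
# the bytes are read as a single big-endian UTF-16 code unit.
decoded = operand.decode("utf-16-be", "surrogatepass")
print(f"U+{ord(decoded):04X}", unicodedata.name(decoded, "<unnamed>"))

# Under /Identity-H the same bytes are a character code equal to the CID,
# i.e. an index into the font, not a Unicode code point.
cid = int.from_bytes(operand, "big")
print(cid)  # 1467

# The decoded character is far outside the CJK Unified Ideographs block.
print(0x4E00 <= ord(decoded) <= 0x9FFF)  # False
```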

The Latin text seems to use actual charmaps instead:

('charmap', {'®': 'ﬀ', '¯': 'ﬁ', '±': 'ﬃ', 'Ä': '¨', '%': '%', '(': '(', ')': ')', ',': ',', '-': '-', '.': '.', '/': '/', '0': '0', '1': '1', '2': '2', '3': '3', '4': '4', '5': '5', '6': '6', '7': '7', '8': '8', '9': '9', ':': ':', ';': ';', '=': '=', '@': '@', 'A': 'A', 'C': 'C', 'D': 'D', 'E': 'E', 'F': 'F', 'G': 'G', 'H': 'H', 'I': 'I', 'J': 'J', 'K': 'K', 'L': 'L', 'M': 'M', 'N': 'N', 'O': 'O', 'P': 'P', 'R': 'R', 'S': 'S', 'T': 'T', 'U': 'U', 'V': 'V', 'X': 'X', 'Y': 'Y', 'Z': 'Z', '[': '[', ']': ']', 'a': 'a', 'b': 'b', 'c': 'c', 'd': 'd', 'e': 'e', 'f': 'f', 'g': 'g', 'h': 'h', 'i': 'i', 'j': 'j', 'k': 'k', 'l': 'l', 'm': 'm', 'n': 'n', 'o': 'o', 'p': 'p', 'q': 'q', 'r': 'r', 's': 's', 't': 't', 'u': 'u', 'v': 'v', 'w': 'w', 'x': 'x', 'y': 'y', 'z': 'z'}, '/F2', {'/Subtype': '/Type1', '/FontDescriptor': IndirectObject(14, 0, 140296872535904), '/LastChar': 196, '/Widths': [285, 514, 856, 514, 856, 799, 285, 400, 400, 514, 799, 285, 343, 285, 514, 514, 514, 514, 514, 514, 514, 514, 514, 514, 514, 285, 285, 285, 799, 485, 485, 799, 771, 728, 742, 785, 699, 671, 806, 771, 371, 528, 799, 642, 942, 771, 799, 699, 799, 756, 571, 742, 771, 771, 1056, 771, 771, 628, 285, 514, 285, 514, 285, 285, 514, 571, 457, 571, 457, 314, 514, 571, 285, 314, 542, 285, 856, 571, 514, 571, 542, 402, 405, 400, 571, 542, 742, 542, 542, 457, 514, 1028, 514, 514, 514, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 642, 856, 799, 714, 685, 771, 742, 799, 742, 799, 0, 0, 742, 600, 571, 571, 856, 856, 285, 314, 514, 514, 514, 514, 514, 771, 457, 514, 742, 799, 514, 928, 1042, 799, 285, 514], '/Name': '/F2', '/BaseFont': '/KSZZAC+CMR9', '/FirstChar': 33, '/Type': '/Font'})
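
As a rough illustration of how such a character map can be applied after decoding, the sketch below replaces every character that has an entry in the map and leaves the rest unchanged. The function name `apply_char_map` and the sample word are made up for this example, and the code is a simplification rather than pypdf's actual implementation; only the four non-identity entries from the dump above are used.

```python
def apply_char_map(decoded: str, char_map: dict[str, str]) -> str:
    """Replace each character that has an entry in char_map; keep the others."""
    return "".join(char_map.get(ch, ch) for ch in decoded)


# The non-identity entries from the /F2 (CMR9) map shown above.
cmr9_map = {"®": "ﬀ", "¯": "ﬁ", "±": "ﬃ", "Ä": "¨"}

# Hypothetical input: a word whose "ff" ligature is stored as code 0xAE ("®").
print(apply_char_map("di®erent", cmr9_map))  # diﬀerent
```

This is why the Latin text comes out readable: the map translates font-specific codes into the intended Unicode characters, which is the step the `/F1` font with its empty map is missing.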

For reference: the file from #2295 also has ('utf-16-be', {}, '/R11', {'/BaseFont': '/GSWDKI+MHeiHK-Bold', '/Type': '/Font', '/Encoding': '/Identity-H', '/DescendantFonts': [IndirectObject(12, 0, 139916737754976)], '/Subtype': '/Type0'}) for the wrong characters, while the cmap for its Arabic numerals looks fine (a dict encoding in this case):

({0: '\x00', 1: '\x01', 2: '\x02', 3: '\x03', 4: '\x04', 5: '\x05', 6: '\x06', 7: '\x07', 8: '\x08', 9: '\t', 10: '\n', 11: '\x0b', 12: '\x0c', 13: '\r', 14: '\x0e', 15: '\x0f', 16: '\x10', 17: '\x11', 18: '\x12', 19: '\x13', 20: '\x14', 21: '\x15', 22: '\x16', 23: '\x17', 24: '\x18', 25: '\x19', 26: '\x1a', 27: '\x1b', 28: '\x1c', 29: '\x1d', 30: '\x1e', 31: '\x1f', 32: ' ', 33: '!', 34: '"', 35: '#', 36: '$', 37: '%', 38: '&', 39: "'", 40: '(', 41: ')', 42: '*', 43: '+', 44: ',', 45: '-', 46: '.', 47: '/', 48: '0', 49: '1', 50: '2', 51: '3', 52: '4', 53: '5', 54: '6', 55: '7', 56: '8', 57: '9', 58: ':', 59: ';', 60: '<', 61: '=', 62: '>', 63: '?', 64: '@', 65: 'A', 66: 'B', 67: 'C', 68: 'D', 69: 'E', 70: 'F', 71: 'G', 72: 'H', 73: 'I', 74: 'J', 75: 'K', 76: 'L', 77: 'M', 78: 'N', 79: 'O', 80: 'P', 81: 'Q', 82: 'R', 83: 'S', 84: 'T', 85: 'U', 86: 'V', 87: 'W', 88: 'X', 89: 'Y', 90: 'Z', 91: '[', 92: '\\', 93: ']', 94: '^', 95: '_', 96: '`', 97: 'a', 98: 'b', 99: 'c', 100: 'd', 101: 'e', 102: 'f', 103: 'g', 104: 'h', 105: 'i', 106: 'j', 107: 'k', 108: 'l', 109: 'm', 110: 'n', 111: 'o', 112: 'p', 113: 'q', 114: 'r', 115: 's', 116: 't', 117: 'u', 118: 'v', 119: 'w', 120: 'x', 121: 'y', 122: 'z', 123: '{', 124: '|', 125: '}', 126: '~', 127: '\x7f', 128: '€', 129: '\x81', 130: '‚', 131: 'ƒ', 132: '„', 133: '…', 134: '†', 135: '‡', 136: 'ˆ', 137: '‰', 138: 'Š', 139: '‹', 140: 'Œ', 141: '\x8d', 142: 'Ž', 143: '\x8f', 144: '\x90', 145: '‘', 146: '’', 147: '“', 148: '”', 149: '•', 150: '–', 151: '—', 152: '˜', 153: '™', 154: 'š', 155: '›', 156: 'œ', 157: '\x9d', 158: 'ž', 159: 'Ÿ', 160: '\xa0', 161: '¡', 162: '¢', 163: '£', 164: '¤', 165: '¥', 166: '¦', 167: '§', 168: '¨', 169: '©', 170: 'ª', 171: '«', 172: '¬', 173: '\xad', 174: '®', 175: '¯', 176: '°', 177: '±', 178: '²', 179: '³', 180: '´', 181: 'µ', 182: '¶', 183: '·', 184: '¸', 185: '¹', 186: 'º', 187: '»', 188: '¼', 189: '½', 190: '¾', 191: '¿', 192: 'À', 193: 'Á', 194: 'Â', 195: 'Ã', 196: 'Ä', 197: 'Å', 198: 'Æ', 
199: 'Ç', 200: 'È', 201: 'É', 202: 'Ê', 203: 'Ë', 204: 'Ì', 205: 'Í', 206: 'Î', 207: 'Ï', 208: 'Ð', 209: 'Ñ', 210: 'Ò', 211: 'Ó', 212: 'Ô', 213: 'Õ', 214: 'Ö', 215: '×', 216: 'Ø', 217: 'Ù', 218: 'Ú', 219: 'Û', 220: 'Ü', 221: 'Ý', 222: 'Þ', 223: 'ß', 224: 'à', 225: 'á', 226: 'â', 227: 'ã', 228: 'ä', 229: 'å', 230: 'æ', 231: 'ç', 232: 'è', 233: 'é', 234: 'ê', 235: 'ë', 236: 'ì', 237: 'í', 238: 'î', 239: 'ï', 240: 'ð', 241: 'ñ', 242: 'ò', 243: 'ó', 244: 'ô', 245: 'õ', 246: 'ö', 247: '÷', 248: 'ø', 249: 'ù', 250: 'ú', 251: 'û', 252: 'ü', 253: 'ý', 254: 'þ', 255: 'ÿ'}, {}, '/R18', {'/BaseFont': '/ZHXRWX+TimesLTStd-Bold', '/FontDescriptor': IndirectObject(19, 0, 140377741535072), '/Type': '/Font', '/FirstChar': 44, '/LastChar': 57, '/Widths': [250, 0, 250, 0, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500], '/Encoding': '/WinAnsiEncoding', '/Subtype': '/Type1'})

pubpub-zz · 2023-12-12T18:24:40Z

@MartinThoma has written:
2. The copy-pasted text looks fine (it's not intentionally garbled within the file / via the font to avoid copy-pasting)

Can you please indicate which program you used? I tried the same test with Acrobat, without success.

stefan6419846 · 2023-12-13T07:50:09Z

pdftotext (Poppler), for example, seems to work fine: pdftotext -f 1 -l 1 20120812.pdf -
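
For a side-by-side comparison, the command above can be driven from Python next to the pypdf call from the original report. This is a minimal sketch that assumes `pdftotext` is on the `PATH`, that `20120812.pdf` is in the working directory, and that pdftotext writes UTF-8 (its default); the helper names are made up for this example.

```python
import os
import shutil
import subprocess


def pdftotext_cmd(pdf_path: str, page: int) -> list[str]:
    """Build the pdftotext call for one 1-based page, writing to stdout ("-")."""
    return ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"]


def extract_with_pdftotext(pdf_path: str, page: int) -> str:
    """Run pdftotext and return its output as text."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path, page), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_with_pypdf(pdf_path: str, page: int) -> str:
    """Extract the same 1-based page with pypdf (imported lazily)."""
    from pypdf import PdfReader

    return PdfReader(pdf_path).pages[page - 1].extract_text()


if __name__ == "__main__":
    pdf = "20120812.pdf"
    # Only run the comparison when both the tool and the file are available.
    if shutil.which("pdftotext") and os.path.exists(pdf):
        print("pdftotext:", extract_with_pdftotext(pdf, 1)[:200])
        print("pypdf:    ", extract_with_pypdf(pdf, 1)[:200])
```

Note that pdftotext pages are 1-based while `reader.pages` is 0-based, hence the `page - 1`.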

MartinThoma · 2023-12-23T13:20:59Z

I used the PDF viewer built into Google Chrome.

MartinThoma added the workflow-text-extraction and Has MCVE labels on Dec 12, 2023.
