Fix #934: Create correct cidcoding name #935

aoking · 2024-01-09T11:28:51Z

Pull request

In some PDF files, the CMap data could not be read correctly.
The name used to look up the CMap file contained unintended whitespace, so the lookup failed.
This fix strips that whitespace so the CMap can be read correctly for those files.

The PDF file where this occurs is bs104761.pdf in #934.

The PDF contained whitespace in cid_registry and cid_ordering.

cid_registry: Adobe��
cid_ordering: Japan1\n\n\n\n\n\n\n\n\n\n

Therefore, strip() was used to remove the whitespace characters.
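
As an illustration of the idea, here is a minimal sketch of building the coding name with and without strip(). It is not the actual diff. The helper name build_cidcoding, the latin-1 decoding, and the sample byte values (in particular the trailing spaces on the registry) are assumptions made for this example.

```python
def build_cidcoding(registry: bytes, ordering: bytes) -> str:
    """Join Registry and Ordering into a name such as 'Adobe-Japan1'.

    Hypothetical helper for illustration only.
    """
    # Decode as latin-1 so any byte value maps to a character without raising.
    cid_registry = registry.decode("latin1")
    cid_ordering = ordering.decode("latin1")
    # strip() drops leading and trailing whitespace (spaces, tabs, newlines).
    return f"{cid_registry.strip()}-{cid_ordering.strip()}"


# Sample values: the trailing spaces are illustrative, the ten newlines
# mirror the cid_ordering value shown above.
raw_registry = b"Adobe  "
raw_ordering = b"Japan1" + b"\n" * 10

# Without strip(), the whitespace ends up inside the name.
unstripped = f"{raw_registry.decode('latin1')}-{raw_ordering.decode('latin1')}"
print(repr(unstripped))

# With strip(), the name is clean.
print(repr(build_cidcoding(raw_registry, raw_ordering)))
```

Note that str.strip() with no arguments only removes whitespace characters, so any other unexpected trailing characters would need separate handling.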

How Has This Been Tested?

With the fix applied, text extraction works correctly on bs104761.pdf.

$ python tools/pdf2txt.py bs104761.pdf | head
WARNING:pdfminer.pdfpage:The PDF <_io.BufferedReader name='bs104761.pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case
商号　アルファテックス株式会社

貸 借 対 照 表

令 和 3 年 3 月 31 日 現 在

代表者 石川　春

科　　　　　目
産

Behavior before modification:

$ python tools/pdf2txt.py bs104761.pdf | head
WARNING:pdfminer.pdfpage:The PDF <_io.BufferedReader name='bs104761.pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case
(cid:2446)(cid:2040)(cid:633)(cid:926)(cid:999)(cid:977)(cid:925)(cid:962)(cid:959)(cid:939)(cid:949)(cid:1490)(cid:2268)(cid:1393)(cid:2302)

(cid:2879) (cid:2310) (cid:2864) (cid:2480) (cid:3503)

(cid:4009) (cid:4072) (cid:250) (cid:3301) (cid:250) (cid:1860) (cid:250)(cid:248) (cid:3284) (cid:1905) (cid:2127)

(cid:2885)(cid:3503)(cid:2304)(cid:231)(cid:2676)(cid:2706)(cid:633)(cid:2399)

(cid:1354)(cid:633)(cid:633)(cid:633)(cid:633)(cid:633)(cid:3816)
(cid:2184)

Checklist

I have read CONTRIBUTING.md.
I have added a concise human-readable description of the change to CHANGELOG.md.
I have tested that this fix is effective or that this feature works.
I have added docstrings to newly created methods and classes.
I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

pietermarsman

Looking great.

Thanks for this @aoking

aoking added 2 commits January 9, 2024 18:16

fix cidcoding name

1932b28

write CHANGELOG.md

dc1907e

aoking changed the title ~~fix cidcoding name~~ Fix #934: Create correct cidcoding name Jan 9, 2024

Update CHANGELOG.md

28bf42e

pietermarsman approved these changes Jan 12, 2024

pietermarsman added this pull request to the merge queue Jan 12, 2024

Merged via the queue into pdfminer:master with commit 48774a1 Jan 12, 2024
9 checks passed

pietermarsman mentioned this pull request Jan 16, 2024

Incorrect character extraction, CID string returned #934

Closed
