Fix #791 #806

pomy4 · 2022-09-04T10:30:07Z

Pull request

Fixes #791, where text extraction from a certain PDF is not quite correct.

The problematic font (ABCDEE+Arial) declares its subtype as a Type 2 CIDFont (CIDFontType2), which essentially means it is a TrueType font. According to the PDF spec, a Type 2 CIDFont that is embedded in a PDF must also have a CIDToGIDMap entry (I think). This font has none, which is why Adobe Reader DC fails when copying text from it.

pdfminer.six, however, does not try to read the CIDToGIDMap entry. It parses the cmaps in the font itself, which is why it identifies most of the characters correctly.

The few incorrectly identified characters come from pdfminer.six assuming that cmap codes always represent Unicode code points. This font also contains one cmap whose codes should be interpreted using the Mac OS Roman encoding.

(I inspected the cmaps in OTMaster Light, after extracting the font from the PDF using fontforge.)
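To illustrate why the interpretation matters, here is a small sketch that decodes the same one-byte code both ways, using Python's built-in `mac_roman` codec. The helper names and the sample codes are my own illustrative choices, not values taken from the font in the issue.

```python
def as_unicode_code_point(code: int) -> str:
    """Interpret the cmap code directly as a Unicode code point."""
    return chr(code)


def as_mac_roman(code: int) -> str:
    """Interpret the cmap code as a single Mac OS Roman byte."""
    return bytes([code]).decode("mac_roman")


# ASCII codes agree in both interpretations; codes above 0x7F do not.
for code in (0x41, 0x8E, 0xA5):
    print(hex(code), repr(as_unicode_code_point(code)), repr(as_mac_roman(code)))
```

Since the two interpretations agree across the ASCII range, most characters would still come out right, which fits the observation that only a few characters were misidentified.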

The simplest solution seemed to be to skip all such non-Unicode cmaps. I worked out how to identify them from the cmap chapter of the OpenType spec: https://docs.microsoft.com/en-us/typography/opentype/spec/cmap
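As a rough sketch of the idea (not necessarily the exact code in this PR), the cmap table header lists encoding records, each with a platform ID and an encoding ID. Per the OpenType spec, platform 0 is Unicode, and platform 3 (Windows) with encoding 1 or 10 is also Unicode. The function name and the sample table below are hypothetical.

```python
import struct
from typing import List, Tuple


def unicode_encoding_records(cmap_table: bytes) -> List[Tuple[int, int, int]]:
    """Return (platform_id, encoding_id, offset) for Unicode subtables only."""
    # cmap header: uint16 version, uint16 numTables
    _version, num_tables = struct.unpack_from(">HH", cmap_table, 0)
    records = []
    for i in range(num_tables):
        # Encoding record: uint16 platformID, uint16 encodingID, Offset32
        platform_id, encoding_id, offset = struct.unpack_from(
            ">HHL", cmap_table, 4 + 8 * i
        )
        is_unicode = platform_id == 0 or (
            platform_id == 3 and encoding_id in (1, 10)
        )
        if is_unicode:
            records.append((platform_id, encoding_id, offset))
    return records


# Made-up header with three records: Macintosh/Roman, Unicode, Windows/Unicode BMP.
table = (
    struct.pack(">HH", 0, 3)
    + struct.pack(">HHL", 1, 0, 28)
    + struct.pack(">HHL", 0, 3, 290)
    + struct.pack(">HHL", 3, 1, 290)
)
print(unicode_encoding_records(table))  # [(0, 3, 290), (3, 1, 290)]
```

The Macintosh record (platform 1, encoding 0) is the Mac OS Roman case and is dropped by the filter.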

This means that the loop can now finish without having processed any cmap, so I added a small check that raises an exception in that case. I wasn't sure whether assert False would be better, but I doubt there are many TrueType fonts that don't have a Unicode cmap.
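A minimal sketch of that check, under the assumption that each subtable has already been parsed into a code-to-glyph-ID dictionary. `NoUnicodeCMapError` and `build_char2gid` are hypothetical names for illustration, not pdfminer.six API.

```python
from typing import Dict, Iterable, Tuple


class NoUnicodeCMapError(Exception):
    """Hypothetical error: the font has no usable Unicode cmap."""


def build_char2gid(
    subtables: Iterable[Tuple[int, int, Dict[int, int]]]
) -> Dict[int, int]:
    """Merge the Unicode subtables; fail loudly if none contributed anything."""
    char2gid: Dict[int, int] = {}
    for platform_id, encoding_id, mapping in subtables:
        # Skip non-Unicode cmaps (for example Macintosh/Roman).
        if not (platform_id == 0 or (platform_id == 3 and encoding_id in (1, 10))):
            continue
        char2gid.update(mapping)
    if not char2gid:
        # The loop skipped everything, so there is nothing to map with.
        raise NoUnicodeCMapError("font has no usable Unicode cmap")
    return char2gid


print(build_char2gid([(1, 0, {0x41: 36}), (3, 1, {0x42: 37})]))  # {66: 37}
```

One argument for an explicit exception over `assert False` is that assertions are removed when Python runs with `-O`, so the check would silently disappear there.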

Another issue is that in this font, the space and the non-breaking space have the same CID. Because the non-breaking space comes later in the cmap, pdfminer.six returned every space as a non-breaking one, so I also added some very edge-case-specific code to handle that situation.
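The space handling can be sketched roughly as follows: when building the reverse map, a non-breaking space is not allowed to overwrite an ordinary space that is already registered for the same ID. `build_gid2unichr` and the sample pairs are hypothetical, and the real change may be placed elsewhere in the code.

```python
from typing import Dict, Iterable, Tuple

NBSP = "\u00a0"


def build_gid2unichr(pairs: Iterable[Tuple[int, int]]) -> Dict[int, str]:
    """Build an ID -> character map from (code point, ID) pairs in cmap order."""
    gid2unichr: Dict[int, str] = {}
    for code_point, gid in pairs:
        char = chr(code_point)
        # Prefer the normal space when it shares an ID with the non-breaking one.
        if char == NBSP and gid2unichr.get(gid) == " ":
            continue
        gid2unichr[gid] = char
    return gid2unichr


# Space (0x20) and non-breaking space (0xA0) both point at ID 3.
pairs = [(0x20, 3), (0x41, 36), (0xA0, 3)]
print(build_gid2unichr(pairs))  # {3: ' ', 36: 'A'}
```

This sketch only covers the ordering described above, where the normal space appears before the non-breaking one in the cmap.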

How Has This Been Tested?

I tested it on the PDF from the issue and also added it (minified) as a test.

Checklist

I have read CONTRIBUTING.md.
I have added a concise human-readable description of the change to CHANGELOG.md.
I have tested that this fix is effective or that this feature works.
I have added docstrings to newly created methods and classes.
I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

KunalGehlot

This is some excellent work.

pietermarsman · 2023-11-25T10:16:03Z

@NAZADOTH Thanks for figuring this out!

@KunalGehlot Thanks for checking this.

FYI, I've merged with master and added brackets to the platform check to make the operator order more explicit.

pomy4 added 4 commits September 4, 2022 11:55

Skip non-Unicode cmaps in TrueType fonts

8847831

Prefer normal space when it has the same cid as a non-breaking one

b7861d7

Add test

a0816b0

Update CHANGELOG.md

fc78362

pomy4 marked this pull request as ready for review September 4, 2022 10:33

KunalGehlot approved these changes Sep 10, 2022


pietermarsman added 2 commits November 25, 2023 11:09

Merge remote-tracking branch 'origin/master' into fix-791

144ac5c

Add brackets to make operator order explicit

e6f6012

pietermarsman approved these changes Nov 25, 2023


pietermarsman added this pull request to the merge queue Nov 25, 2023

Merged via the queue into pdfminer:master with commit 997424d Nov 25, 2023
9 checks passed
