Fix #791 #806

Merged
merged 6 commits into from
Nov 25, 2023

@pomy4 pomy4 commented Sep 4, 2022

Pull request

Fixes #791, where text extraction from a certain PDF is not quite correct.

The problematic font (ABCDEE+Arial) has its subtype specified as a Type 2 CIDFont (aka CIDFontType2), which basically just means it is a TrueType font. According to the PDF spec, if a Type 2 CIDFont is embedded in a PDF, it must also have a CIDToGIDMap entry (I think). This font does not have one, which is why Adobe Reader DC fails when copying text from it:

[Screenshot: adobe]

However, pdfminer.six doesn't try to read the CIDToGIDMap entry. Instead it just parses the cmaps in the font itself, which is why it still identifies most of the characters correctly.

The few incorrectly identified characters are caused by pdfminer assuming that the cmap codes always represent Unicode code points. This font also contains one cmap whose codes should be interpreted using the Mac OS Roman encoding:

[Screenshot: mac_roman]
(This is a screenshot from OTMaster Light; the font was extracted from the PDF using fontforge.)
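
To make the mismatch concrete, here is a small sketch (not code from this PR) that interprets the same one-byte cmap code both ways. The helper name `interpret_code` is made up for illustration; `mac_roman` is the standard Python codec name for Mac OS Roman.

```python
def interpret_code(code: int) -> dict:
    """Interpret a one-byte cmap code in two different ways.

    Hypothetical helper for illustration only; it is not part of pdfminer.six.
    """
    if not 0 <= code <= 0xFF:
        raise ValueError("this sketch only handles one-byte codes")
    return {
        # What you get when the code is assumed to be a Unicode code point.
        "as_unicode": chr(code),
        # What you get when the code is decoded as Mac OS Roman.
        "as_mac_roman": bytes([code]).decode("mac_roman"),
    }


# ASCII codes agree in both interpretations, codes above 0x7F generally do not.
for code in (0x41, 0x8E, 0xCA):
    result = interpret_code(code)
    print(hex(code), repr(result["as_unicode"]), repr(result["as_mac_roman"]))
```

For codes below 0x80 the two interpretations coincide, which would be consistent with only a few characters coming out wrong.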

The simplest solution seemed to be to just skip all such non-Unicode cmaps. I found out how to identify them using the info in https://docs.microsoft.com/en-us/typography/opentype/spec/cmap
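
As a rough sketch of that check (the function names here are hypothetical and the actual diff may look different): the cmap table starts with a header followed by encoding records, each carrying a platform ID and an encoding ID. Going by the OpenType spec, platform 0 is Unicode, and platform 3 (Windows) with encoding 1 or 10 is also Unicode, while platform 1 is Macintosh.

```python
import struct
from typing import List, Tuple


def is_unicode_cmap(platform_id: int, encoding_id: int) -> bool:
    """Return True if the encoding record describes a Unicode subtable.

    Platform 0 is Unicode; platform 3 (Windows) is Unicode only for
    encoding 1 (BMP) and encoding 10 (full repertoire).
    """
    return platform_id == 0 or (platform_id == 3 and encoding_id in (1, 10))


def read_encoding_records(cmap_table: bytes) -> List[Tuple[int, int, int]]:
    """Parse the cmap header into (platform_id, encoding_id, offset) tuples."""
    # Header: uint16 version, uint16 numTables (big-endian).
    _version, num_tables = struct.unpack(">HH", cmap_table[:4])
    records = []
    for i in range(num_tables):
        start = 4 + 8 * i
        # Each record: uint16 platformID, uint16 encodingID, uint32 offset.
        records.append(struct.unpack(">HHL", cmap_table[start:start + 8]))
    return records


def unicode_records(cmap_table: bytes) -> List[Tuple[int, int, int]]:
    """Keep only the encoding records that point at Unicode subtables."""
    return [
        record
        for record in read_encoding_records(cmap_table)
        if is_unicode_cmap(record[0], record[1])
    ]
```

With a table that lists a Unicode, a Macintosh, and a Windows Unicode BMP record, `unicode_records` would drop only the Macintosh one.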

This means that the loop can now end without doing anything, so I added a small check that raises an exception. I wasn't sure whether assert False would be better, and mainly I doubt that there are many TrueType fonts which don't have a Unicode cmap.
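
A minimal sketch of such a guard, assuming the subtables have already been parsed into plain dictionaries. `NoUnicodeCMapError` and `glyph_to_unicode` are hypothetical names used only for this illustration, not the names used in the PR.

```python
from typing import Dict, Iterable, Tuple

# (platform_id, encoding_id, {character code: glyph id})
Subtable = Tuple[int, int, Dict[int, int]]


class NoUnicodeCMapError(Exception):
    """Hypothetical exception for fonts without any Unicode cmap subtable."""


def glyph_to_unicode(subtables: Iterable[Subtable]) -> Dict[int, int]:
    """Build a glyph id -> Unicode code point map from Unicode subtables only."""
    gid_to_code: Dict[int, int] = {}
    found_unicode = False
    for platform_id, encoding_id, code_to_gid in subtables:
        is_unicode = platform_id == 0 or (
            platform_id == 3 and encoding_id in (1, 10)
        )
        if not is_unicode:
            # Skip Mac OS Roman and other non-Unicode subtables.
            continue
        found_unicode = True
        for code, gid in code_to_gid.items():
            gid_to_code[gid] = code
    if not found_unicode:
        # Every subtable was skipped, so fail loudly instead of
        # silently returning an empty mapping.
        raise NoUnicodeCMapError("font has no Unicode cmap subtable")
    return gid_to_code
```

Raising a dedicated exception rather than using `assert False` keeps the check active when Python runs with `-O`, which strips assertions.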

Another issue is that in this font, space and non-breaking space have the same cids. Since the non-breaking space comes later in the cmap, pdfminer.six returns every space as a non-breaking one, so I also added some very edge-casy code to handle that situation.
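
One way to handle this, sketched here under the assumption that the cmap is available as a code-to-glyph dictionary (the name `invert_cmap` is hypothetical and the PR's actual code may differ): when inverting the map, do not let U+00A0 overwrite a glyph that is already mapped to U+0020.

```python
from typing import Dict

SPACE = 0x20
NO_BREAK_SPACE = 0xA0


def invert_cmap(code_to_gid: Dict[int, int]) -> Dict[int, int]:
    """Invert a code -> glyph id map, preferring space over no-break space."""
    gid_to_code: Dict[int, int] = {}
    # Iterate in code order so U+0020 is always seen before U+00A0.
    for code, gid in sorted(code_to_gid.items()):
        if code == NO_BREAK_SPACE and gid_to_code.get(gid) == SPACE:
            # Same glyph as the regular space: keep the regular space.
            continue
        gid_to_code[gid] = code
    return gid_to_code
```

A font where the non-breaking space has its own glyph is unaffected, since the special case only triggers when both codes share a glyph.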

How Has This Been Tested?

I tested it on the PDF from the issue and also added it (minified) as a test.

Checklist

  • I have read CONTRIBUTING.md.
  • I have added a concise human-readable description of the change to CHANGELOG.md.
  • I have tested that this fix is effective or that this feature works.
  • I have added docstrings to newly created methods and classes.
  • I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

@pomy4 pomy4 marked this pull request as ready for review September 4, 2022 10:33

@KunalGehlot KunalGehlot left a comment

This is some excellent work.

@pietermarsman

@NAZADOTH Thanks for figuring this out!

@KunalGehlot Thanks for checking this.

FYI, I've merged with master and added brackets to the platform checking to make the order more explicit.

@pietermarsman pietermarsman added this pull request to the merge queue Nov 25, 2023
Merged via the queue into pdfminer:master with commit 997424d Nov 25, 2023
9 checks passed
Successfully merging this pull request may close these issues.

Extracted text isn't correct