Pull request
Fixes #791, where text extraction from a certain PDF is not quite correct.
The problematic font (ABCDEE+Arial) has its subtype specified as a Type 2 CIDFont (aka CIDFontType2), which basically just means it is a TrueType font. According to the PDF spec, a Type 2 CIDFont that is embedded in a PDF must also have a CIDToGIDMap entry (I think). This font does not have one, which is why Adobe Reader DC fails when copying text from it.
However, pdfminer.six doesn't try to read the CIDToGIDMap entry, but instead just parses the cmaps in the font itself, which causes it to correctly identify most of the characters.
The few incorrectly identified characters arise because pdfminer assumes that cmap codes always represent Unicode code points, but this font also contains one cmap whose codes should be interpreted using the Mac OS Roman encoding:
(This is a screenshot from OTMaster Light; the font was extracted from the PDF using fontforge.)
The simplest solution seemed to be to skip all such non-Unicode cmaps. I found out how to identify them using the info in https://docs.microsoft.com/en-us/typography/opentype/spec/cmap
This means that the loop can now end without doing anything, so I added a small check that raises an exception in that case. I wasn't sure whether assert False would be better, but I doubt that there are many TrueType fonts which don't have a Unicode cmap.
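To illustrate the idea, here is a minimal, self-contained sketch of the filtering, not the actual pdfminer.six code. It assumes the layout from the OpenType cmap spec linked above: a header (version, numTables) followed by encoding records of (platformID, encodingID, offset). Platform 0 is Unicode, and platform 3 (Windows) is Unicode only for encoding 1 (BMP) or 10 (full repertoire). The names `is_unicode_cmap`, `unicode_subtables` and `NoUnicodeCMapError` are hypothetical.

```python
import struct
from typing import List, Tuple


class NoUnicodeCMapError(Exception):
    """Hypothetical error for a font without any Unicode cmap subtable."""


def is_unicode_cmap(platform_id: int, encoding_id: int) -> bool:
    # Platform 0 is Unicode regardless of its encoding ID.
    if platform_id == 0:
        return True
    # Platform 3 is Windows: encoding 1 is Unicode BMP, 10 is full Unicode.
    return platform_id == 3 and encoding_id in (1, 10)


def unicode_subtables(cmap_table: bytes) -> List[Tuple[int, int, int]]:
    """Return the (platform_id, encoding_id, offset) records that are Unicode."""
    # cmap header: uint16 version, uint16 numTables (big-endian).
    _version, num_tables = struct.unpack_from(">HH", cmap_table, 0)
    # Each encoding record is 8 bytes: uint16, uint16, uint32.
    records = [
        struct.unpack_from(">HHL", cmap_table, 4 + 8 * i)
        for i in range(num_tables)
    ]
    selected = [r for r in records if is_unicode_cmap(r[0], r[1])]
    if not selected:
        # Every subtable was skipped, so there is nothing to build a map from.
        raise NoUnicodeCMapError("font has no Unicode cmap subtable")
    return selected


# A made-up table with one Mac OS Roman record and one Windows Unicode record.
header = struct.pack(">HH", 0, 2)
records = struct.pack(">HHL", 1, 0, 20) + struct.pack(">HHL", 3, 1, 282)
print(unicode_subtables(header + records))  # prints [(3, 1, 282)]
```

Raising a dedicated exception rather than using assert False keeps the check active even when Python runs with assertions disabled.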
Another issue is that in this font, space and non-breaking space have the same CID. Since the non-breaking space comes later in the cmap, pdfminer.six returns every space as a non-breaking one, so I also added some very edge-casy code to handle that situation.
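The collision handling can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: `build_gid_to_char` is a hypothetical helper, and the input is assumed to be a dict from character code to glyph ID in cmap order.

```python
from typing import Dict

NBSP = "\u00a0"


def build_gid_to_char(char_to_gid: Dict[int, int]) -> Dict[int, str]:
    """Invert a code -> glyph ID map without letting NBSP replace a space."""
    gid_to_char: Dict[int, str] = {}
    for code, gid in char_to_gid.items():
        char = chr(code)
        # If this glyph is already mapped to a plain space, keep the space.
        if char == NBSP and gid_to_char.get(gid) == " ":
            continue
        gid_to_char[gid] = char
    return gid_to_char


# Made-up glyph IDs: space (0x20) and NBSP (0xA0) share glyph 3.
mapping = build_gid_to_char({0x20: 3, 0x41: 36, 0xA0: 3})
print(repr(mapping[3]))  # prints ' '
```

The sketch only covers the case described above, where the plain space appears before the non-breaking space in the cmap.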
How Has This Been Tested?
I tested it on the PDF from the issue and also added it (minified) as a test.
Checklist