name2unicode() should follow the Adobe Glyph List Specification #263

pietermarsman · 2019-07-10T18:41:03Z

Fixes #177
Fixes #261
Fixes #204

pietermarsman · 2019-07-12T06:22:26Z

Just an update of what I've been doing:

I've been testing name2unicode() with samples/nonfree/f1040nr.pdf.
For this PDF it is called from EncodingDb.get_encoding() and from Type1FontHeaderParser.get_encoding(). The latter causes a problem, because the document has a font that maps CIDs to glyph names of the form \H[0-9]+.
This form is not a valid glyph name according to Adobe. The current name2unicode() extracts the number part and returns the corresponding Unicode character. This is wrong (for this document), since it returns Chinese characters for an American document.
I am trying to figure out how the \H[0-9]+ names should be mapped to valid characters.
My best guess for now is that there is some information in the encrypted part of the embedded font program (see PDF Reference 1.6, page 437).
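
To make the difference concrete, here is a minimal Python 3 sketch of the name-to-text mapping as the Adobe Glyph List Specification describes it. It is not the code in this PR: `glyph_name_to_text`, `SAMPLE_GLYPHS` and the two regexes are hypothetical names, the glyph table is a three-entry stand-in for the real list, and raising KeyError for an unmappable component is an assumption based on the commit messages here (the specification itself maps such components to an empty string). Lowercase hex digits are accepted, as the commits mention.

```python
import re

# Tiny stand-in for the Adobe Glyph List (assumption: the real table is far larger).
SAMPLE_GLYPHS = {"A": "\u0041", "space": "\u0020", "Lcommaaccent": "\u013b"}

# "uni" + groups of four hex digits, or "u" + four to six hex digits.
UNI_RE = re.compile(r"uni((?:[0-9A-Fa-f]{4})+)\Z")
U_RE = re.compile(r"u([0-9A-Fa-f]{4,6})\Z")


def _is_scalar(code):
    # The surrogate range D800-DFFF is excluded by the specification.
    return 0 <= code <= 0xD7FF or 0xE000 <= code <= 0x10FFFF


def glyph_name_to_text(name):
    """Hypothetical helper: map a glyph name to text, AGL-specification style."""
    # Step 1: drop everything from the first period onward.
    base = name.split(".", 1)[0]
    chars = []
    # Step 2: split into components at underscores; step 3: map each one.
    for component in base.split("_"):
        if component in SAMPLE_GLYPHS:
            chars.append(SAMPLE_GLYPHS[component])
            continue
        match = UNI_RE.match(component)
        if match:
            digits = match.group(1)
            codes = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
            if all(_is_scalar(code) for code in codes):
                chars.extend(chr(code) for code in codes)
                continue
        match = U_RE.match(component)
        if match:
            code = int(match.group(1), 16)
            if _is_scalar(code):
                chars.append(chr(code))
                continue
        # Assumption: signal "no mapping" with KeyError instead of returning ''.
        raise KeyError("glyph name %r does not map to unicode" % name)
    # Step 4: concatenate the mapped components.
    return "".join(chars)
```

With this sketch a name such as H123 raises KeyError instead of being turned into the character with code point 123, and a name with a very long digit run can no longer reach an integer-to-character conversion that overflows.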

pdfminer/encodingdb.py

tataganesh

Some small changes.

tataganesh · 2019-07-15T16:02:08Z

Also is this an extension of #204?

pietermarsman · 2019-07-15T20:48:05Z

Thanks for the feedback. No time left today to fix it, so I will look into this tomorrow.

pietermarsman · 2019-07-16T06:58:02Z

> Also is this an extension of #204?

Yes, this also implements the behaviour described by @jbarlow83, and more.

robinalexandre · 2019-07-26T16:27:53Z

Do you know when this PR and all the others about name2unicode and CMap will be merged?

I think a lot of us are waiting for it, and it seems the checks have passed 👍

Thanks to all

pietermarsman added 2 commits July 10, 2019 20:35

Add some (failing) unittests for name2unicode based on the examples i…

ec5218a

…n the Adobe Glyph List Specification

Added test for overflow error reported by @jtlz2: pdfminer#177 (comment)

5d7ac7e

pietermarsman mentioned this pull request Jul 10, 2019

OverflowError: Python int too large to convert to C long for certain pdfs #177

Closed

pietermarsman changed the title ~~Add some (failing) unittests for name2unicode based on the examples in the Adobe Glyph List Specification~~ name2unicode() should follow the Adobe Glyph List Specification Jul 10, 2019

pietermarsman added 6 commits July 14, 2019 15:16

Change implementation of name2unicode such that it follows the Adobe …

f0392f8

…Glyph specs (with allowing lowercase)

Add docstring to Type1FontHeaderParser.get_encoding() that describes …

33cc986

…that the custom CharStrings of the font are mapped to ''

Add lowercase adobe glyph name tests

fdb7e54

Use KeyError to signal that the name does not resemble any unicode, t…

c597e95

…his pattern is also used in the rest of pdfminer.six

Fix error, python2 cannot handle unicode in a .py file

1e24bfa

Fix error, python2 cannot handle unicode in a .py file

2bb850c

pietermarsman marked this pull request as ready for review July 14, 2019 13:55

tataganesh reviewed Jul 15, 2019

pdfminer/encodingdb.py (outdated)

tataganesh reviewed Jul 15, 2019

pdfminer/encodingdb.py (outdated)

tataganesh reviewed Jul 15, 2019

pdfminer/encodingdb.py

tataganesh requested changes Jul 15, 2019

pietermarsman added 2 commits July 16, 2019 08:49

Remove intermediate variable full_stop because it is just a dot

0fb8336

Raise a KeyError with a useful message if unicode2name() does not…

6f362f5

… match any glyph name. Use this message to log debug statements.

tataganesh merged commit 42e2c81 into pdfminer:develop Jul 26, 2019

pietermarsman deleted the 261-glyph-list-specification branch July 27, 2019 07:20

This was referenced Jul 27, 2019

name2unicode() does not conform to the Adobe Glyph List Specification #261

Closed

Teach name2unicode about unconvertible names and Unicode names #204

Closed

HiromuHota mentioned this pull request Sep 16, 2020

OverflowError: Python int too large to convert to C long in pdfminer.six HazyResearch/pdftotree#24

Closed
