Always try to get CMap, even if name is not recognized #438

pietermarsman · 2020-06-06T14:15:39Z

Pull request

Fixes #391

This PR always tries to get a CMap for the cmap_name. Previously this was only done for . It does not break any existing behavior.

How Has This Been Tested?

By testing it on the pdf in New versions of pdfminer.six cannot extract chinese characters from pdf #391
By adding a test that fails without this PR

Checklist

I have added tests that prove my fix is effective or that my feature works
I have added docstrings to newly created methods and classes
I have optimized the code at least one time after creating the initial version
I have updated the README.md or I have verified that this is not necessary
I have updated the readthedocs documentation or I verified that this is not necessary
I have added a concise human-readable description of the change to CHANGELOG.md

fakabbir

Seems Good,

Initially, any value outside IDENTITY_ENCODER was set to "unknown" by default; now it will be set to the cmap_name itself.

pietermarsman · 2020-06-09T09:09:59Z

@fakabbir Could you elaborate on that? I don't see the difference in the cmap_name between before and after.

fakabbir · 2020-06-11T19:13:35Z

So, at this place https://github.com/pdfminer/pdfminer.six/pull/438/files#diff-9d138ff43c58cd4903b1e16ce49c98fcR766 with cmap_name = IDENTITY_ENCODER.get(cmap_name, cmap_name) you have ensured that the cmap_name is always assigned, even if the name is not in IDENTITY_ENCODER. I see this making a difference.
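As a minimal sketch of the fallback being discussed: `dict.get(key, default)` returns the mapped value when the key exists and the default otherwise, so passing the name itself as the default leaves unrecognized names unchanged. The contents of `IDENTITY_ENCODER` and the helper `resolve_cmap_name` below are assumptions for illustration, not copied from pdfminer.six.

```python
# Illustrative alias table; the real IDENTITY_ENCODER in pdfminer.six may differ.
IDENTITY_ENCODER = {
    "DLIdent-H": "Identity-H",
    "DLIdent-V": "Identity-V",
}


def resolve_cmap_name(cmap_name: str) -> str:
    """Map a known alias to its target; otherwise keep the name unchanged."""
    return IDENTITY_ENCODER.get(cmap_name, cmap_name)


# A name that is in the table is translated.
aliased = resolve_cmap_name("DLIdent-H")
# A name that is not in the table falls through as-is.
passthrough = resolve_cmap_name("UniGB-UCS2-H")
print(aliased, passthrough)
```

With this pattern, entries that map a name to itself become redundant, since the default already covers them.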

pietermarsman · 2020-06-17T11:45:58Z

But in the old code the cmap_name is also always assigned, right? It is set to "unknown" if it cannot be inferred from the pdf. And that did not change.

Happy to improve things here, but I'm not (yet) seeing it.

fakabbir · 2020-06-18T10:01:53Z

> But in the old code the cmap_name is also always assigned, right? It is set to "unknown" if it cannot be inferred from the pdf. And that did not change.
>
> Happy to improve things here, but I'm not (yet) seeing it.

I felt it's getting assigned as UniGB-UCS2-H instead of unknown. But if that's not the case, I wonder why this solved the issue. Let me try to debug the test case.

fakabbir

Need Root Cause Figured Out

fakabbir

Removing Approved Status.

pietermarsman · 2020-06-29T18:54:37Z

@fakabbir I'm a bit confused on what to do with this PR now.

jstockwin · 2020-07-09T14:21:02Z

Hey @pietermarsman and @fakabbir. I've been sorting through the issues (as per conversation on gitter). I've marked the issue relating to this PR as "in progress". Is this moving forwards somehow?

pietermarsman · 2020-07-11T09:28:48Z

> I felt it's getting assigned as UniGB-UCS2-H instead of unknown. But if that's not the case, I wonder why this solved the issue. Let me try to debug the test case.

That's indeed the case. The differences:

Earlier only cmap names from IDENTITY_ENCODER were allowed. That behavior was introduced in fa40043, created by @fakabbir, reviewed by me.
Now it gets the cmap name from the font specification, uses IDENTITY_ENCODER to map some values, and then tries to get the appropriate CMap with CMap.get_cmap(). If that succeeds it returns that one, if it fails it returns a dummy.
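The new flow described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in pdfminer/pdffont.py: `AVAILABLE_CMAPS`, `CMapNotFound`, `lookup_cmap`, `DUMMY_CMAP` and the flat `spec["Encoding"]` lookup are all hypothetical stand-ins for the real CMap lookup and font specification handling.

```python
from typing import Dict


class CMapNotFound(Exception):
    """Hypothetical error raised when no CMap data exists for a name."""


# Hypothetical alias table and set of CMaps that can be loaded.
IDENTITY_ENCODER: Dict[str, str] = {"DLIdent-H": "Identity-H"}
AVAILABLE_CMAPS: Dict[str, str] = {
    "Identity-H": "<identity cmap>",
    "UniGB-UCS2-H": "<UniGB-UCS2-H cmap>",
}
DUMMY_CMAP = "<dummy cmap>"


def get_cmap_name(spec: dict) -> str:
    """Read the CMap name from the font spec, or 'unknown' if absent."""
    return spec.get("Encoding", "unknown")


def lookup_cmap(name: str) -> str:
    """Stand-in for the real CMap lookup; raises if the name is not available."""
    try:
        return AVAILABLE_CMAPS[name]
    except KeyError:
        raise CMapNotFound(name)


def get_cmap_from_spec(spec: dict) -> str:
    """Always attempt the lookup; fall back to a dummy CMap on failure."""
    name = get_cmap_name(spec)
    # Translate known aliases; unrecognized names are kept unchanged.
    name = IDENTITY_ENCODER.get(name, name)
    try:
        return lookup_cmap(name)
    except CMapNotFound:
        return DUMMY_CMAP


print(get_cmap_from_spec({"Encoding": "UniGB-UCS2-H"}))
print(get_cmap_from_spec({}))
```

The point of the sketch is the ordering: the lookup is attempted for every name, and the dummy is only returned when that lookup fails, rather than whenever the name is missing from the alias table.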

So I think this is an improvement. But it would be great if someone else can confirm this.

pietermarsman · 2020-07-11T09:31:51Z

It improves the output of the issue, so that's a good start :)

dwalton76 · 2020-07-16T14:39:07Z

Does the current patch resolve the issue with the Chinese characters in the PDF in #391 ?

pietermarsman · 2020-07-18T13:17:29Z

> Does the current patch resolve the issue with the Chinese characters in the PDF in #391 ?

Yes

pietermarsman · 2020-07-18T13:18:49Z

I need another review before merging this. Either from @fakabbir or someone else.

jstockwin · 2020-07-20T07:52:17Z

I can take a look if you want, but I think fakabbir probably understood this more than I will so you should probably wait for him. Looks like CI is failing at the moment anyway?

pietermarsman · 2020-07-20T19:25:44Z

I thought I fixed that earlier, but actually it never passed 🤔

Now it is! :) Actually, the test output improved because the CJK characters in simple3.pdf are now also recognized.

pietermarsman · 2020-07-20T19:27:26Z

Now it is...

jstockwin

LGTM, just one suggestion to add a comment (which you're welcome to ignore if you think it's obvious)

pdfminer/pdffont.py

pietermarsman added 5 commits June 6, 2020 16:10

Add trying to get cmap from pickle file. And cleaning up a bit.

e60bc80

Don't use keyword argument for dict.get

e04a72c

Add docs

ee4118b

Make _get_cmap_name static

63a2eaa

Add test

33714ea

pietermarsman mentioned this pull request Jun 6, 2020

New versions of pdfminer.six cannot extract chinese characters from pdf #391

Closed

Add CHANGELOG.md

a8c256f

fakabbir approved these changes Jun 6, 2020

fakabbir reviewed Jun 18, 2020

fakabbir suggested changes Jun 18, 2020

Remove identity mappings from IDENTITY_ENCODER because that's now the default if the key is not in there

7a42741

Merge branch 'develop' into 391-fix-cmap-from-pickle-file

54d02f7

pietermarsman added 2 commits July 20, 2020 21:23

Add CJK characters to expected output of simple3.pdf

26a2b7e

Merge remote-tracking branch 'origin/391-fix-cmap-from-pickle-file' into 391-fix-cmap-from-pickle-file

21e2aae

Fix line length

9d82029

jstockwin approved these changes Jul 21, 2020

pdfminer/pdffont.py

Add comment

bb77c16

pietermarsman merged commit 4f65242 into develop Jul 23, 2020

pietermarsman added a commit that referenced this pull request Jul 26, 2020

Move changelog line for #438 to current release

0b44f77

pietermarsman deleted the 391-fix-cmap-from-pickle-file branch February 2, 2022 21:33
