Always try to get CMap, even if name is not recognized #438

Merged
merged 12 commits into from
Jul 23, 2020

Member

@pietermarsman pietermarsman commented Jun 6, 2020

Pull request

Fixes #391

This PR makes pdfminer always try to get a CMap, even when the cmap_name is not recognized. Previously this was only done for . It does not break any existing behavior.

How Has This Been Tested?

Checklist

  • I have added tests that prove my fix is effective or that my feature
    works
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial
    version
  • I have updated the README.md or I have verified that this
    is not necessary
  • I have updated the readthedocs documentation or I
    verified that this is not necessary
  • I have added a concise human-readable description of the change to
    CHANGELOG.md

Contributor

@fakabbir fakabbir left a comment

Seems Good,

Initially, any value outside IDENTITY_ENCODER was set to "unknown" by default; now it will be set to the cmap_name itself.

@pietermarsman
Member Author

@fakabbir Could you elaborate on that? I don't see the difference in the cmap_name between before and after.

@fakabbir
Contributor

So, at this place https://github.com/pdfminer/pdfminer.six/pull/438/files#diff-9d138ff43c58cd4903b1e16ce49c98fcR766 with cmap_name = IDENTITY_ENCODER.get(cmap_name, cmap_name), you have ensured that the cmap_name is always assigned, even if the name is not in IDENTITY_ENCODER. I see this making a difference.
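For illustration, a minimal sketch of the lookup-with-fallback idiom on that line. The dictionary entries and the helper name normalize_cmap_name are hypothetical stand-ins, not the actual contents of pdfminer/pdffont.py:

```python
# Hypothetical stand-in for IDENTITY_ENCODER; the real mapping may differ.
IDENTITY_ENCODER = {
    "DLIdent-H": "Identity-H",
    "DLIdent-V": "Identity-V",
}


def normalize_cmap_name(cmap_name: str) -> str:
    """Map known aliases to their target name; pass any other name through."""
    # dict.get(key, default) returns the default when the key is missing,
    # so an unrecognized name is kept as-is instead of being discarded.
    return IDENTITY_ENCODER.get(cmap_name, cmap_name)


alias = normalize_cmap_name("DLIdent-H")
passthrough = normalize_cmap_name("UniGB-UCS2-H")
```

Under these assumptions, a name such as UniGB-UCS2-H survives the lookup unchanged, because the name itself is passed as the default.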

@pietermarsman
Member Author

But in the old code the cmap_name is also always assigned, right? It is set to "unknown" if it cannot be inferred from the pdf. And that did not change.

Happy to improve things here, but I'm not (yet) seeing it.

@fakabbir
Contributor

> But in the old code the cmap_name is also always assigned, right? It is set to "unknown" if it cannot be inferred from the pdf. And that did not change.
>
> Happy to improve things here, but I'm not (yet) seeing it.

I felt it's getting assigned as UniGB-UCS2-H instead of unknown. But if that's not the case, I wonder why this solved the issue. Let me try to debug the test case.

Contributor

@fakabbir fakabbir left a comment

Need Root Cause Figured Out

Contributor

@fakabbir fakabbir left a comment

Removing Approved Status.

@pietermarsman
Member Author

@fakabbir I'm a bit confused on what to do with this PR now.

@jstockwin
Member

Hey @pietermarsman and @fakabbir. I've been sorting through the issues (as per conversation on gitter). I've marked the issue relating to this PR as "in progress". Is this moving forwards somehow?

@pietermarsman
Member Author

> I felt it's getting assigned as UniGB-UCS2-H instead of unknown. But if that's not the case, I wonder why this solved the issue. Let me try to debug the test case.

That's indeed the case. The differences:

  • Earlier only cmap names from IDENTITY_ENCODER were allowed. That behavior was introduced in fa40043, created by @fakabbir, reviewed by me.
  • Now it gets the cmap name from the font specification, uses IDENTITY_ENCODER to map some values, and then tries to get the appropriate CMap with CMap.get_cmap(). If that succeeds it returns that one, if it fails it returns a dummy.
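The two behaviors described above could be sketched roughly as follows. This is a self-contained illustration under stated assumptions, not the code in this PR: the registry, the NamedCMap and DummyCMap classes, the spec layout, and the function names are all hypothetical stand-ins.

```python
class CMapNotFound(Exception):
    """Raised by the hypothetical registry when a name has no CMap."""


class NamedCMap:
    """Stand-in for a real CMap loaded by name."""

    def __init__(self, name):
        self.name = name


class DummyCMap:
    """Stand-in for the dummy returned when no CMap is available."""


# Hypothetical alias table and registry of loadable CMaps.
IDENTITY_ENCODER = {"DLIdent-H": "Identity-H", "DLIdent-V": "Identity-V"}
AVAILABLE_CMAPS = {"Identity-H", "Identity-V", "UniGB-UCS2-H"}


def lookup_cmap(name):
    """Return a CMap for `name` or raise CMapNotFound."""
    if name not in AVAILABLE_CMAPS:
        raise CMapNotFound(name)
    return NamedCMap(name)


def get_cmap_old(spec):
    """Assumed old behavior: only names in IDENTITY_ENCODER are looked up."""
    cmap_name = spec.get("Encoding", "unknown")
    if cmap_name not in IDENTITY_ENCODER:
        return DummyCMap()
    return lookup_cmap(IDENTITY_ENCODER[cmap_name])


def get_cmap_new(spec):
    """New behavior: map aliases, then always attempt the lookup."""
    cmap_name = spec.get("Encoding", "unknown")
    cmap_name = IDENTITY_ENCODER.get(cmap_name, cmap_name)
    try:
        return lookup_cmap(cmap_name)
    except CMapNotFound:
        return DummyCMap()


spec = {"Encoding": "UniGB-UCS2-H"}
old_result = get_cmap_old(spec)
new_result = get_cmap_new(spec)
```

In this sketch the old path falls back to the dummy for UniGB-UCS2-H, while the new path finds the CMap, which would be consistent with the improved output reported for #391.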

So I think this is an improvement. But it would be great if someone else can confirm this.

@pietermarsman
Member Author

It improves the output of the issue, so that's a good start :)

@dwalton76

Does the current patch resolve the issue with the Chinese characters in the PDF in #391 ?

@pietermarsman
Member Author

> Does the current patch resolve the issue with the Chinese characters in the PDF in #391 ?

Yes

@pietermarsman
Member Author

I need another review before merging this. Either from @fakabbir or someone else.

@jstockwin
Member

I can take a look if you want, but I think fakabbir probably understood this more than I will so you should probably wait for him. Looks like CI is failing at the moment anyway?

@pietermarsman
Member Author

I thought I fixed that earlier, but actually it never passed 🤔

Now it is! :) Actually, the test output improved because the CJK characters in simple3.pdf are now also recognized.

@pietermarsman
Member Author

Now it is...

Member

@jstockwin jstockwin left a comment

LGTM, just one suggestion to add a comment (which you're welcome to ignore if you think it's obvious)

pdfminer/pdffont.py
@pietermarsman pietermarsman merged commit 4f65242 into develop Jul 23, 2020
@pietermarsman pietermarsman deleted the 391-fix-cmap-from-pickle-file branch February 2, 2022 21:33
Development

Successfully merging this pull request may close these issues.

New versions of pdfminer.six cannot extract chinese characters from pdf
4 participants