Pdfstream as cmap #283

fakabbir · 2019-08-10T05:13:22Z

Resolves #210 .
The previous work in #264 maps "OneByteEncodingH/V" to the default CMap initialization. It also overlooks that "DLIdent-H/V" is a valid encoding, since it isn't mentioned in any PDF Reference the author could find.

In the output of pdf2txt.py, the OneByteEncoding characters were missing, which indicates that the default CMap cannot decode those glyphs.

I have added a decode function for OneByteEncoding and also added DLIdent, since it may appear in many PDFs.

Neither OneByteEncoding nor DLIdent is described in any of the PDF References I consulted.
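To illustrate why a one-byte encoding needs its own decode step, here is a minimal sketch. It is not the code from this PR: the function names are hypothetical, and it assumes that a two-byte identity CMap reads big-endian 16-bit codes while a one-byte variant treats every byte as its own character code.

```python
import struct


def decode_two_byte(code: bytes) -> tuple:
    """Read the buffer as big-endian 16-bit character codes."""
    n = len(code) // 2
    if n:
        # Ignore a trailing odd byte so that unpack never raises.
        return struct.unpack('>%dH' % n, code[:2 * n])
    return ()


def decode_one_byte(code: bytes) -> tuple:
    """Read every byte of the buffer as a separate character code."""
    n = len(code)
    if n:
        return struct.unpack('>%dB' % n, code)
    return ()
```

For the two bytes `b'AB'`, the one-byte reading yields the codes 65 and 66, whereas the two-byte reading yields the single code 16706, so text encoded one byte per glyph would be misread by a two-byte decoder.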

pietermarsman

This is a PR with batteries included:

new functionality
a bit of refactoring
added functional tests
an example pdf

Wow! Let's merge it!

pietermarsman · 2019-08-19T13:55:03Z

pdfminer/pdffont.py

@@ -140,7 +142,13 @@ def do_keyword(self, pos, token):


 NIBBLES = ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '.', 'e', 'e-', None, '-')
-
+IDENTITY_ENCODER = {'Identity-H':'Identity-H',


You mentioned that DLIdent-* is not in the PDF reference manual. Did you find other PDF documentation that mentions it? If so, you could add a comment that refers to it so that we do not remove DLIdent-* by accident.

fakabbir · 2019-08-19

I haven't seen any PDF that uses DLIdent-*, but I included it because supporting it is harmless. I will also add a code comment so that DLIdent-* is not removed by accident.
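A sketch of what such a guarded alias table could look like is shown here. Only the first entry of `IDENTITY_ENCODER` is visible in the diff; the other entries, the wording of the comment, and the helper `canonical_cmap_name` are assumptions made for illustration.

```python
# Maps encoding names found in a font dictionary to the identity CMap
# that should handle them.
IDENTITY_ENCODER = {
    'Identity-H': 'Identity-H',
    'Identity-V': 'Identity-V',
    # DLIdent-H/V are not described in the PDF Reference. They are kept
    # deliberately as aliases for the Identity CMaps; do not remove them.
    'DLIdent-H': 'Identity-H',
    'DLIdent-V': 'Identity-V',
}


def canonical_cmap_name(name):
    """Return the identity CMap name for `name`, or None if it is not one."""
    return IDENTITY_ENCODER.get(name)
```

With a table like this, `canonical_cmap_name('DLIdent-H')` resolves to `'Identity-H'`, and the comment next to the alias explains why the entry exists.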

pietermarsman · 2019-08-19T13:57:25Z

pdfminer/cmapdb.py

+            return struct.unpack('>%dB' % n, code)
+        else:
+            return ()
+
 ##  UnicodeMap


There are a lot of these unnecessary code comments in cmapdb.py. Since you are editing this file anyway, could you remove those? And likewise for pdffont.py?

fakabbir · 2019-08-19

Sure, can do that.

pietermarsman · 2019-08-20T11:36:16Z

pdfminer/pdffont.py

@@ -140,12 +130,18 @@ def do_keyword(self, pos, token):


 NIBBLES = ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '.', 'e', 'e-', None, '-')
+"""


I might be nitpicking here: according to PEP 8, block comments should start with a hash sign (#) instead of being written as a multi-line string.

The advantage of using #: an intelligent editor can recognize it as a comment.

fakabbir · 2019-08-22

Got it, done.
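The difference can be checked with the standard `ast` module. The sketch below uses shortened, made-up source snippets and a hypothetical helper `statement_types`; it shows that a bare multi-line string is parsed as a real expression statement, while a `#` comment never reaches the syntax tree.

```python
import ast

WITH_STRING = '''
NIBBLES = ('0', '1')
"""
Explanation written as a multi-line string.
"""
'''

WITH_COMMENT = '''
NIBBLES = ('0', '1')
# Explanation written as a block comment.
'''


def statement_types(source):
    """Return the node type names of the top-level statements in `source`."""
    return [type(node).__name__ for node in ast.parse(source).body]
```

Here `statement_types(WITH_STRING)` returns `['Assign', 'Expr']`, whereas `statement_types(WITH_COMMENT)` returns `['Assign']`: the string is code that tools have to treat as code, the comment is not.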

fakabbir · 2019-09-11T06:13:46Z

@pietermarsman Hi, I have removed the comments. Could you please review it once more and merge it if no other changes are required? :)

pietermarsman · 2019-09-11T16:59:22Z

Hi @fakabbir, sorry for the miscommunication, but I do not have commit rights right now. I do my bit by reviewing and creating PRs.

I hope to get commit rights soon though.

igormp

This looks like euske/pdfminer/pull/179 with some extra features and tests. It would be great to have this merged ASAP.

john-redshelf and others added 10 commits February 25, 2019 11:42

Handle PDFStream as character map name in PDFCIDFont

8ab2e28

Encapsulates character map name

c022358

Corrects Indentation

8e4a82a

Removes @Property, Adds docstring

cc40af3

Adds Test, Removes Unnecessary Assumptions

fa40043

Removes Code Comments

b4c261b

Adds Test Cases, Neater Code For CMap Assignment

f1a4dce

Adds decoder for OnebyteIdentityH/V instead of using default CMap

5a0d8db

Adds Test Case

5b21098

Merge branch 'develop' into pdfstream-as-cmap

fe38695

fakabbir mentioned this pull request Aug 10, 2019

text extraction while font Encoding is a PDFStream object #279

Closed

fakabbir added 2 commits August 10, 2019 11:03

Correct old test cases

3125d36

Merge branch 'pdfstream-as-cmap' of https://github.com/fakabbir/pdfminer.six into pdfstream-as-cmap

3f0f05d

pietermarsman approved these changes Aug 19, 2019

Removes code comments

3d549ea

pietermarsman approved these changes Aug 20, 2019

fakabbir added 2 commits August 20, 2019 17:13

Corrects Code Comment

abd685f

Corrects Comment

7c03d96

igormp approved these changes Oct 4, 2019

tataganesh merged commit f53fbd9 into pdfminer:develop Oct 12, 2019

pietermarsman mentioned this pull request Oct 15, 2019

AttributeError: 'PDFStream' object has no attribute 'replace' #210

Closed

igormp mentioned this pull request Oct 15, 2019

AttributeError from PDFMiner camelot-dev/camelot#23

Closed

fakabbir commented Aug 10, 2019

pietermarsman left a comment

pietermarsman Aug 19, 2019

fakabbir Aug 19, 2019

pietermarsman Aug 19, 2019

fakabbir Aug 19, 2019

pietermarsman Aug 20, 2019

fakabbir Aug 22, 2019

fakabbir commented Sep 11, 2019

pietermarsman commented Sep 11, 2019

igormp left a comment
