exporting JBIG2 images #46

Closed
wants to merge 11 commits into from

Conversation

@side2k side2k commented Jan 20, 2017

This PR is an adaptation of this one: euske/pdfminer#107

@vstoykov
Contributor

There is no cStringIO in Python 3.

@goulu
Member

goulu commented Apr 18, 2017

Please check why the tests fail and fix them before we can merge.
Thank you!

@vstoykov
Contributor

vstoykov commented Aug 1, 2017

@side2k you can replace input_stream = StringIO() with input_stream = BytesIO() (BytesIO is already imported) and remove the import of StringIO. You should probably also rebase.
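
A minimal sketch of why this swap matters, assuming the code runs on Python 3 and the stream holds raw image bytes. The payload and the names other than input_stream are illustrative and not taken from the PR:

```python
from io import BytesIO, StringIO

# Illustrative binary payload (assumption: any non-text bytes would do).
payload = b"\x97JB2\r\n\x1a\n"

# BytesIO accepts bytes and returns them unchanged.
input_stream = BytesIO()
input_stream.write(payload)
input_stream.seek(0)
roundtrip = input_stream.read()

# io.StringIO only accepts str, so writing bytes raises TypeError.
try:
    StringIO().write(payload)
    text_stream_accepts_bytes = True
except TypeError:
    text_stream_accepts_bytes = False
```

On Python 3, io.StringIO is a text stream, so binary image data has to go through io.BytesIO instead.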

@tataganesh tataganesh changed the base branch from master to develop November 8, 2018 18:54
eladkehat added a commit to eladkehat/yapdfminer that referenced this pull request May 4, 2019
@pietermarsman
Member

This PR needs only a little work. @side2k can you do that?

@pietermarsman
Member

This PR fixes #26

@side2k
Author

side2k commented Jul 14, 2019

@pietermarsman I will look into it within the next couple of hours.

@side2k
Author

side2k commented Jul 14, 2019

@pietermarsman I've rebased and added the fixes. All the tests are passing now.

@side2k
Author

side2k commented Jul 14, 2019

It would be really great to get a code review from someone who is familiar with the current state of things in pdfminer. I haven't touched it for a couple of years now.

@pietermarsman
Member

pietermarsman commented Jul 14, 2019

Nice, quick response! :) Do you have a pdf and script to test the changes with? That would make it a lot easier for me to review the code and understand what it does and what you have changed.

I don't have a lot of experience with pdfs, nor with pdfminer. But I want to learn that and I can give your code a thorough sanity check.

@pietermarsman
Member

(@side2k , not sure if you've missed my previous message, this is a friendly notification if you did)

Do you have a pdf and sample code to test this with? I can't understand and check this PR if I don't have a pdf that triggers this code.

@side2k
Author

side2k commented Jul 16, 2019

@pietermarsman I think I had one once, but right now I don't have an opportunity to search for it. Maybe later. I am on a business trip right now, sorry.

@pietermarsman
Member

@side2k let us know when you find it.

I think we can close this PR until we have some testing material. Once we have that we can reopen, test, review and merge.

@pietermarsman
Member

@ganeshtata, there is no pdf to test this PR with. Do you agree that we cannot merge this PR if there is nothing to test?

Member

@pietermarsman pietermarsman left a comment

I've tried testing this with this pdf: jbig2.pdf.

I needed to make a few adjustments to make this work. They are all related to byte strings vs. normal strings.

The output I got has the extension .jb2 but I could not determine if that was a proper jbig2 file since I have no viewer for that, and could not find one on the internet. At least, not one that shows the actual image. I've also attached the output image (Im1.jb2.zip).
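
One rough way to sanity-check the output without a viewer, assuming the exporter writes a standalone JBIG2 file that starts with the 8-byte ID from the diff under review (FILE_HEADER_ID). The helper name has_jbig2_header is hypothetical, and passing this check does not prove that the image data decodes:

```python
# 8-byte JBIG2 file header ID, written as a bytestring.
FILE_HEADER_ID = b"\x97\x4A\x42\x32\x0D\x0A\x1A\x0A"


def has_jbig2_header(data):
    """Return True if the given bytes start with the JBIG2 file header ID."""
    return data[:len(FILE_HEADER_ID)] == FILE_HEADER_ID


# Illustrative inputs (assumptions, not real exporter output).
good = FILE_HEADER_ID + b"\x01\x00\x00\x00\x01"
bad = b"%PDF-1.4\n"
```

With the attached output unzipped, the helper could be called on the first bytes of the .jb2 file, read in binary mode.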


# file literals

FILE_HEADER_ID = '\x97\x4A\x42\x32\x0D\x0A\x1A\x0A'
Member

Should be a bytestring

        return segments

    def is_eof(self):
        if self.stream.read(1) == '':
Member

Should compare to a bytestring, e.g. self.stream.read(1) == b''
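
A small sketch of the difference, assuming Python 3 and a stream opened in binary mode. The standalone is_eof helper here is a hypothetical variant that restores the read position; it is not the method from the PR:

```python
from io import BytesIO


def is_eof(stream):
    """Return True if a binary stream has no bytes left, without consuming any."""
    pos = stream.tell()
    at_end = stream.read(1) == b""  # a binary stream returns b"" at end of file
    stream.seek(pos)
    return at_end


stream = BytesIO(b"ab")
before = is_eof(stream)  # bytes remain to be read
stream.read()            # consume everything
after = is_eof(stream)   # nothing left

# On Python 3 an empty bytestring never equals an empty str,
# so comparing read(1) to '' would never detect the end of a binary stream.
bytes_equal_str = (b"" == "")
```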

        return data_len

    def encode_segment(self, segment):
        data = ''
Member

Should be a bytestring

'flags': {'deferred': False, 'type': SEG_TYPE_END_OF_FILE},
'number': seg_number,
'page_assoc': 0,
'raw_data': '',
Member

Should be a bytestring

@pietermarsman
Member

I've found a pdf with JBIG2 images on the pdfbox Jira.

ItDoesntWorkScan.pdf

@pietermarsman pietermarsman removed the request for review from goulu October 22, 2019 08:49
@pietermarsman pietermarsman mentioned this pull request Oct 22, 2019
4 tasks
pietermarsman added a commit that referenced this pull request Oct 22, 2019
And added test for pdf with JBIG2 image.

Fixes #26 
Closes #46