Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change Text extraction is not allowed error to warning #453

Merged

Conversation

madhurcodes
Copy link
Contributor

@madhurcodes madhurcodes commented Jul 3, 2020

I was using this library for parsing a large set of files and it gave an error halfway through because it encountered such a file. Users should not have to handle this manually in a library meant for parsing PDFs.

Changes the Text extraction is not allowed error thrown to a descriptive warning.

Fixes #350

How Has This Been Tested?

I tested this on an example PDF which threw the mentioned error with the original code and gave the correct text after the change as expected. I cannot share the PDF because I don't own copyright to it.

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial version
  • I have updated the README.md or I am verified that this is not necessary
  • I have updated the readthedocs documentation or I verified that this is not necessary
  • I have added a consice human-readable description of the change to CHANGELOG.md

Copy link
Member

@pietermarsman pietermarsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @madhurcodes,

Thanks for your work. Could you make sure that the changes are like the suggested changes in the issue:

  • Set the default value for check_extractable to False.
  • If check_extractable is True we throw an Error, if False we raise a warning.
  • Remove the explicit arguments for check_extractable from the high_level module.

pdfminer/pdfpage.py Show resolved Hide resolved
pdfminer/pdfpage.py Outdated Show resolved Hide resolved
Copy link
Member

@pietermarsman pietermarsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Last thing: can you add a test that checks if the warning is warned and, more importantly, that the code completes without errors.

@madhurcodes
Copy link
Contributor Author

madhurcodes commented Jul 7, 2020

Hey as mentioned in the PR the PDFs I have which trigger this warning are copyright protected so I can't upload them here for testing. I have tested that it raises the correct warning on my machine.

The code is really basic and it should be clear that there is no functional change so IMO it should not require a new test since the earlier error throwing code was also untested.

@pietermarsman
Copy link
Member

I tried changing the content of an sample encrypted document, but apparently its hard to change the password of an encrypted file using only vim 😖

Should have looked at the issue earlier. An example pdf is posted by @Recursing. You could use that one. I also notices that the encrypted samples that are in the repo already are never used for testing, these can also be used.

A lot of functionality of pdfminer.six is still untested. Having a test suite for all functionality makes it easier to add code, and to change/improve implementations in the future. That's why I'm asking. I do believe that this code works as intended 😄

If you have time to add the sample pdf from the issue, please do so. Otherwise I will find some time to do it myself and then merge this. Thanks for all the efforts so far!

@pietermarsman pietermarsman merged commit 6a9269b into pdfminer:develop Jul 11, 2020
@pietermarsman
Copy link
Member

Thanks @madhurcodes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add check_extractable argument to high_level.extract_text
2 participants