Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add simple wrapper to extract text from pdf #330

Merged
merged 12 commits into from
Nov 7, 2019

Conversation

igormp
Copy link
Contributor

@igormp igormp commented Nov 4, 2019

Description

Added a simple wrapper that allows one to easily extract text from a PDF file.

Fixes #327

How Has This Been Tested?

Erh, gotta work on some actual tests for that. Other than that, I've been using this function everywhere in my projects.

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the README.md and other documentation, or I am sure that this is not necessary
  • I have added a concise human-readable description of the change to CHANGELOG.md
  • I have added docstrings to newly created methods and classes
  • I have optimized the code at least one time after creating the initial version

Copy link
Member

@pietermarsman pietermarsman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said on gitter: this will be terribly useful!

I'm sure you are planning on adding it, but I noticed that the test is missing.

pdfminer/high_level.py Outdated Show resolved Hide resolved
pdfminer/high_level.py Outdated Show resolved Hide resolved
pdfminer/high_level.py Outdated Show resolved Hide resolved
pdfminer/high_level.py Outdated Show resolved Hide resolved
@pietermarsman
Copy link
Member

I've made some small changes.

@igormp
Copy link
Contributor Author

igormp commented Nov 4, 2019

For some reason I a word in that last commit message lol
I'll squash everything into a single commit before we merge anyway.

About the docstrings, shouldn't we mention the default values, or is the documentation tool we're using able to infer this from the function signature?

Apart from that, I think the tests are the only missing thing before this PR can be merged, right?

@pietermarsman
Copy link
Member

For some reason I a word in that last commit message lol
I'll squash everything into a single commit before we merge anyway.

I can also squash them with github during the merge if you want.

About the docstrings, shouldn't we mention the default values, or is the documentation tool we're using able to infer this from the function signature?

The documentation shows the function signature with default values. Example.

Apart from that, I think the tests are the only missing thing before this PR can be merged, right?

Yep.

@igormp
Copy link
Contributor Author

igormp commented Nov 5, 2019

I can also squash them with github during the merge if you want.

TIL about this feature, nice

I'll try to get a simple test working today.

@igormp igormp marked this pull request as ready for review November 6, 2019 13:55
@igormp
Copy link
Contributor Author

igormp commented Nov 6, 2019

Ok, finally added some simple tests.

Travis seems to have failed on python 2.7, and I'm not really sure why. Should I bother with it?

@pietermarsman
Copy link
Member

Hhm, I think we should fix that. I've already spend some time on it but I can't find what's going wrong.

@igormp
Copy link
Contributor Author

igormp commented Nov 7, 2019

I left that options=None parameter for possible test cases where we'd like to change something like the LAParams(), but in hindsight **kwargs would've been better.

I'll try to fix the test strings for the new default value for laparams and try to fix it for python 2.7.

@igormp
Copy link
Contributor Author

igormp commented Nov 7, 2019

Ok, I managed to fix it, the error was due to StringIO in python3 behaving like BytesIO in python2 or something like that.

@pietermarsman pietermarsman merged commit 40aa253 into pdfminer:develop Nov 7, 2019
@pietermarsman
Copy link
Member

I hate python 2 so much, I can't wait to see it dead

😂

@pietermarsman
Copy link
Member

Thanks for all the work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create function that returns the extracted text
2 participants