GitHub - the-deep/deepex

DeepExt

Installation

pip install git+https://github.com/the-deep/deepex

Usage

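import base64

from deep_parser import TextFromFile

# Replace with the path to your PDF document
PDF_DOCUMENT_PATH = "YOUR_DOCUMENT_PATH"

# Read the file and base64-encode its contents
with open(PDF_DOCUMENT_PATH, "rb") as f:
    binary = base64.b64encode(f.read())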

document = TextFromFile(stream=binary, ext="pdf")
text, other = document.extract_text()
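The example above passes the file contents as a base64-encoded byte string. A minimal sketch of that encoding step, using only the standard library, might look like the following. `encode_file` is a hypothetical helper name and not part of deep_parser; the only assumption is that `stream` expects the output of `base64.b64encode`, as shown above.

```python
import base64
from pathlib import Path


def encode_file(path):
    """Read a file and return its contents as base64-encoded bytes."""
    # Hypothetical helper, not part of deep_parser.
    return base64.b64encode(Path(path).read_bytes())
```

The result could then be passed along as `stream=encode_file(PDF_DOCUMENT_PATH)`.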

The TextFromFile class also accepts the URL of a PDF document hosted on a website:

from deep_parser import TextFromFile

# Replace with the URL of your PDF document
PDF_URL = "DOCUMENT_URL"

document = TextFromFile(stream=None, ext="pdf", from_web=True, url=PDF_URL)
text, other = document.extract_text()
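Both examples pass the file type explicitly through `ext`. When the URL is not known in advance, the extension could be derived from it, as in the rough sketch below. `guess_ext` is a hypothetical helper and not part of deep_parser, and whether `ext` accepts values other than "pdf" is not stated here, so treat the result as a hint only.

```python
import os
from urllib.parse import urlparse


def guess_ext(url, default="pdf"):
    """Return the lowercase file extension from the URL path, without the dot."""
    # Hypothetical helper, not part of deep_parser.
    path = urlparse(url).path        # ignores the query string and fragment
    ext = os.path.splitext(path)[1]  # e.g. ".PDF", or "" when there is none
    return ext.lstrip(".").lower() or default
```

For example, `guess_ext(PDF_URL)` falls back to the `default` value when the URL path carries no extension.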

The extract_text() method returns a tuple containing the extracted text and a Results class instance. The output format can be selected with the output_format parameter:

text, images = document.extract_text(output_format="list")

This call returns the text formatted as a list. The Results instance can be used to process or save the images found in the document, for example:

text, other = document.extract_text()
other.save_images(directory_path=DIRECTORY_PATH)
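It is not stated whether save_images() creates the target directory when it is missing. Assuming it may not, a small sketch that prepares the directory beforehand could look like this; `ensure_directory` is a hypothetical helper and not part of deep_parser.

```python
from pathlib import Path


def ensure_directory(directory_path):
    """Create the directory (and any missing parents) and return it as a Path."""
    # Hypothetical helper, not part of deep_parser.
    path = Path(directory_path)
    path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return path
```

Calling `ensure_directory(DIRECTORY_PATH)` before saving the images is harmless if the directory is already there.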

The extract_text() method can be replaced with extract_text_multi() to run the extraction with multiprocessing.

You can also extract text from web pages (HTML):

from deep_parser import TextFromWeb

URL = WEBSITE_URL
webpage = TextFromWeb(url=URL)
text = webpage.extract_text(output_format="list", url=URL)
