PDF to semantic HTML conversion

PDFtranscript contains Python programs that transcribe PDFs into semantic HTML.

pdftranscript - Get semantic HTML from PDFs converted by pdf2htmlEX.

pdfttf - Recover lost text from PDFs where TrueType font characters are nothing more than images of themselves.

pdf2html - Batch process a folder full of PDFs, ready for pdftranscript.

Read the docstrings for more information.

Example

PDF before and semantic HTML after

Installation

pip install pdftranscript

Install Python along with the latest pdf2htmlEX.

On OS X with Homebrew:

brew install python3 pdf2htmlEX

Or on Ubuntu/Debian:

sudo apt update && sudo apt install -y libfontconfig1 libcairo2 libjpeg-turbo8 ttfautohint
wget https://github.com/pdf2htmlEX/pdf2htmlEX/releases/download/v0.18.8.rc1/pdf2htmlEX-0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64.deb
sudo apt install ./pdf2htmlEX-0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64.deb

Check your install:

pdf2htmlEX -v
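The same check can be scripted, for example before starting a batch run. The following is a minimal sketch, not part of the package: the helper name pdf2htmlex_version is hypothetical, and it relies only on the -v flag shown above.

```python
import shutil
import subprocess


def pdf2htmlex_version(binary="pdf2htmlEX"):
    """Return the version banner, or None if the binary is not on PATH.

    Hypothetical helper for illustration; it is not part of pdftranscript.
    """
    path = shutil.which(binary)
    if path is None:
        return None
    # Capture both streams, since the banner may be written to either one.
    result = subprocess.run(
        [path, "-v"], capture_output=True, text=True, check=False
    )
    return (result.stdout + result.stderr).strip()
```

A return value of None suggests a missing native install, in which case the Docker route described next may be the easier option.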

A Docker install of pdf2htmlEX is also supported (the Homebrew one started failing recently). The following image is tested and used in the default config via DOCKER_IMG_TAG.

docker pull pdf2htmlex/pdf2htmlex:0.18.8.rc2-master-20200820-ubuntu-20.04-x86_64

Running pip install pdftranscript should also install the lxml and freetype-py dependencies.

Configure

Configure your project paths in your .env file and in config.py; the most important setting is DATA_DIR. This can be any folder, for example DATA_DIR=/path/to/pdf-transcript/tests. If you use a Docker install of pdf2htmlEX, you also need to set DOCKER_INSTALL=1, which mounts your data dir at a path inside the Docker container. DOCKER_IMG_TAG is configurable too. To get started, create your .env file and add DATA_DIR=...
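To illustrate what such a .env file carries, here is a minimal sketch of reading KEY=VALUE settings in Python. It is an assumption about the file format, not the loader that config.py actually uses; read_env and docker_enabled are hypothetical names, and treating only the value 1 as "enabled" is a guess based on the DOCKER_INSTALL=1 example above.

```python
from pathlib import Path


def read_env(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def docker_enabled(settings):
    """Assumption: DOCKER_INSTALL=1 switches the Docker route on."""
    return settings.get("DOCKER_INSTALL", "0") == "1"
```

With a file containing DATA_DIR=/path/to/data and DOCKER_INSTALL=1, read_env returns both keys as strings and docker_enabled returns True.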

With the default configuration, your DATA_DIR should end up containing three folders: PDF, HTML and HTM. Create a PDF folder inside it and drop your PDFs there.

PDF is a folder where your PDFs are.
HTML is where pdf2htmlEX output (non-semantic HTML) ends up after running ./pdf2html.py, which just runs pdf2htmlEX with suitable options.
HTM is the final destination where semantic HTML gets born after running ./transcript.py.
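The folder layout described above can be prepared up front. This is a minimal sketch assuming the default folder names PDF, HTML and HTM; ensure_layout is a hypothetical helper, and the scripts may well create the output folders themselves.

```python
from pathlib import Path

# Default folder names taken from the description above.
FOLDERS = ("PDF", "HTML", "HTM")


def ensure_layout(data_dir):
    """Create the three folders under data_dir and return their paths."""
    root = Path(data_dir)
    paths = {name: root / name for name in FOLDERS}
    for folder in paths.values():
        folder.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling ensure_layout with your DATA_DIR is safe to repeat, because existing folders are left untouched.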

Run

First run pdf2html, or ./pdftranscript/pdf2html.py in a cloned repo.

Then run pdftranscript, or ./pdftranscript/transcript.py in a cloned repo.

When you change configuration within transcript.py or tweak some code, you only need to rerun ./pdftranscript/transcript.py.
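Because the two steps are separate, the intermediate HTML folder shows which PDFs still need the first step. The sketch below lists them, assuming the default folder names and that pdf2htmlEX names its output after the PDF with an .html extension; pending_conversions is a hypothetical helper, and the project's own scripts may handle this differently.

```python
from pathlib import Path


def pending_conversions(data_dir):
    """Return PDFs in DATA_DIR/PDF that have no matching file in DATA_DIR/HTML."""
    root = Path(data_dir)
    html_dir = root / "HTML"
    return [
        pdf
        for pdf in sorted((root / "PDF").glob("*.pdf"))
        if not (html_dir / (pdf.stem + ".html")).exists()
    ]
```

An empty list means every PDF already has intermediate HTML, so rerunning the transcript step alone is enough.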

Development process

Set an expected (hand-adjusted) output to aim for, then improve the codebase to bring the transcript output closer to that ideal semantic output. Make sure your changes don't make the output worse for other tests. Lint with ruff check.
