pdftojson

Built on XPDF, pdftojson extracts text from PDF files as JSON, including a bounding box for each word.

Compile

./configure
make

On macOS, you might need to specify the libpng and FreeType locations, e.g.

./configure --with-libpng-library=/usr/local/Cellar/libpng/1.6.16/lib/  --with-libpng-includes=/usr/local/Cellar/libpng/1.6.16/include/ --with-freetype2-library=/usr/local/lib/ --with-freetype2-includes=/usr/local/include/freetype2/

After the build, you will find pdftojson inside the directory xpdf/pdftojson.

Usage

pdftojson <input.pdf> <output.json>
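
If you want to call the tool from a script, the command line above can be wrapped roughly as follows. This is only a sketch: the helper names `build_command` and `convert` are hypothetical, the binary is assumed to be on your PATH (otherwise pass its location via `binary`), and it assumes pdftojson returns a non-zero exit status on failure, which this README does not state.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, json_path, binary="pdftojson"):
    """Return the argument list for: pdftojson <input.pdf> <output.json>."""
    return [binary, str(Path(pdf_path)), str(Path(json_path))]


def convert(pdf_path, json_path, binary="pdftojson"):
    """Run the converter (hypothetical wrapper, not part of pdftojson).

    Assumption: a failed conversion exits with a non-zero status, in which
    case subprocess.run raises CalledProcessError because check=True.
    """
    subprocess.run(build_command(pdf_path, json_path, binary), check=True)
```

For example, `convert("input.pdf", "output.json")` runs the same command shown under Usage.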

File format

The JSON produced looks like:

[
  { "pages":14, "number":1, "width":612, "height":792, "text":[ [115,162,41,14,0,"What "], ... ] },
  { "pages":14, "number":2, "width":612, "height":792, "text":[ [115,162,41,14,0,"Here "], ... ] },
  ...
];

For each page, the text array contains one entry per word, in the form: [top,left,width,height,0,text]
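
As an illustration of the format described above, the following Python sketch unpacks each word entry into named fields. The sample data is made up to match the documented shape, `iter_words` is a hypothetical helper, and the meaning of the fifth value (shown as 0) is not documented here, so it is simply skipped.

```python
import json

# Illustrative sample shaped like the documented output (values are made up).
sample = """[
  {"pages": 1, "number": 1, "width": 612, "height": 792,
   "text": [[115, 162, 41, 14, 0, "What "], [115, 207, 20, 14, 0, "is "]]}
]"""


def iter_words(pages):
    """Yield one dict per word, using the field order stated in this README."""
    for page in pages:
        # Fifth field is undocumented here, so it is unpacked and ignored.
        for top, left, width, height, _unknown, text in page["text"]:
            yield {
                "page": page["number"],
                "top": top,
                "left": left,
                "width": width,
                "height": height,
                "text": text,
            }


words = list(iter_words(json.loads(sample)))
print("".join(w["text"] for w in words))
```

To process real output, replace `sample` with the contents of the file written by pdftojson.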
