PDFBox-Co-ordinates-of-text

This is a PDFBox wrapper for extracting text and text co-ordinates from a printed PDF document (no OCR is involved).

Dependencies

JRE 8 or above
Node.js 8 or above - only if you want to build from source
PDFBox v2 or above - only if you want to build from source

Note: Neither the binaries nor the source files will work without the Java Runtime Environment (JRE).

Usage (Windows)

After cloning the repository, copy main_java.exe and BoomPdf.jar to the desired folder. Make sure Node.js is installed, then run the following command in the Command Prompt:

I:\Path> node main_java.exe "Absolute Path of the PDF" FromPage ToPage

Here, I:\Path> is the prompt for the folder that contains the files (don't type it in; it appears by default once you navigate to the folder using cd), FromPage is the first page you want to convert, and ToPage is the last page.
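If you call the tool from your own script, it may help to check the page arguments before launching it. The following is only a sketch: parsePageRange is a hypothetical helper that is not part of this repository, and it assumes that pages are numbered from 1 and that ToPage is inclusive, neither of which is stated above.

```javascript
// Hypothetical helper, not part of this repository.
// Assumption: pages are numbered from 1 and ToPage is inclusive.
function parsePageRange(fromArg, toArg) {
  const from = Number(fromArg);
  const to = Number(toArg);
  if (!Number.isInteger(from) || !Number.isInteger(to)) {
    throw new Error("FromPage and ToPage must be whole numbers");
  }
  if (from < 1 || to < from) {
    throw new Error("Expected 1 <= FromPage <= ToPage");
  }
  return { from, to };
}

// Command-line arguments arrive as strings, so the helper accepts strings.
console.log(parsePageRange("2", "5"));
```

Adjust the lower bound if the tool turns out to count pages from 0.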

The origin (0,0) is at the top-left corner of the page.
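PDF's own default coordinate system puts the origin at the bottom-left corner, so coordinates from this tool may need to be flipped before they are passed to software that expects that convention. A minimal sketch, assuming both systems use the same units and that you know the page height; toBottomLeft and LETTER_HEIGHT_PT are hypothetical names, not part of this repository:

```javascript
// Hypothetical helper, not part of this repository.
// Converts a point measured from the top-left corner into one
// measured from the bottom-left corner of the same page.
function toBottomLeft(point, pageHeight) {
  return { x: point.x, y: pageHeight - point.y };
}

// Assumption: a US Letter page, which is 792 points tall.
const LETTER_HEIGHT_PT = 792;
console.log(toBottomLeft({ x: 72, y: 100 }, LETTER_HEIGHT_PT));
```

This converts a single point only. Whether a reported y value refers to the top of a glyph or to its baseline is not covered here.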

Build From Source (Linux and OS X)

Out of the box, BoomPdf.java returns each glyph (letters and special characters alike) with its coordinates.

If you're building the code from source, please download the PDFBox jar and use the BoomPdf.java file to customize your solution.

Further customization of the output can be done by altering the main_java.js file. This is the Node.js code that parses the text and returns words with their coordinates (the left-most character's position is taken as the reference).
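The grouping step described above can be sketched roughly as follows. This is not the code in main_java.js: groupIntoWords and the { char, x, y } record shape are assumptions made for illustration, and the sketch treats any whitespace glyph as a word boundary.

```javascript
// Hypothetical sketch, not the repository's main_java.js.
// Assumed input: one { char, x, y } record per glyph, in reading order.
function groupIntoWords(glyphs) {
  const words = [];
  let current = null;
  for (const g of glyphs) {
    if (/\s/.test(g.char)) {
      // Whitespace closes the word in progress, if there is one.
      if (current !== null) {
        words.push(current);
        current = null;
      }
      continue;
    }
    if (current === null) {
      current = { text: g.char, x: g.x, y: g.y };
    } else {
      current.text += g.char;
      // Keep the position of the left-most glyph seen so far.
      if (g.x < current.x) {
        current.x = g.x;
        current.y = g.y;
      }
    }
  }
  if (current !== null) {
    words.push(current);
  }
  return words;
}

const sample = [
  { char: "H", x: 10, y: 20 },
  { char: "i", x: 16, y: 20 },
  { char: " ", x: 20, y: 20 },
  { char: "y", x: 30, y: 20 },
  { char: "o", x: 36, y: 20 },
];
console.log(groupIntoWords(sample));
```

A real PDF may omit space glyphs entirely, in which case the gap between neighbouring x values would have to serve as the word boundary instead.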

License

The main code of this project is licensed under the Apache 2.0 License, found at http://www.apache.org/licenses/LICENSE-2.0.html. Any code released under a different license will state so in its header.

kanishk-mehta/PDFBox-get-Coordinates-of-text
