
Automatically extract tables and convert them to XLS using a bash script.

laigor edited this page Dec 16, 2018 · 8 revisions

Usage: pdftoxls.sh <PDF file>

The script works in three steps:

  1. Detect table areas using the tabula-java guess option.
  2. Recognize the tables in the detected areas using the tabula-java "lattice" method and write them to a CSV file.
  3. Convert the CSV file to Excel format using the unoconv utility.
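
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the original pdftoxls.sh: the `TABULA_JAR` variable, the function names, and the page-by-page loop driven by `pdfinfo` are assumptions, and the jq filter assumes that tabula-java's JSON output carries `top`, `left`, `bottom` and `right` fields for each table.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the three steps; not the original pdftoxls.sh.
TABULA_JAR="${TABULA_JAR:-tabula.jar}"   # assumed location of a tabula-java jar

# Read `pdfinfo` output on stdin and print the page count.
parse_page_count() {
  awk '/^Pages:/ { print $2 }'
}

# Read tabula-java JSON on stdin; print one "top,left,bottom,right" line per table.
# Assumes each table object has top/left/bottom/right fields.
areas_from_json() {
  jq -r '.[] | "\(.top),\(.left),\(.bottom),\(.right)"'
}

pdf_to_xls() {
  local pdf="$1" csv="${1%.pdf}.csv" pages page area
  pages=$(pdfinfo "$pdf" | parse_page_count)
  : > "$csv"                               # start with an empty CSV
  page=1
  while [ "$page" -le "${pages:-0}" ]; do
    # Step 1: let tabula-java guess the table areas on this page.
    java -jar "$TABULA_JAR" -g -p "$page" -f JSON "$pdf" | areas_from_json |
    while IFS= read -r area; do
      # Step 2: extract each area with the lattice method and append to the CSV.
      java -jar "$TABULA_JAR" -l -p "$page" -a "$area" -f CSV "$pdf" >> "$csv" < /dev/null
    done
    page=$((page + 1))
  done
  # Step 3: convert the CSV to XLS (unoconv writes the .xls next to the CSV).
  unoconv -f xls "$csv"
}
```

With these definitions loaded, calling `pdf_to_xls report.pdf` would be expected to leave `report.csv` and `report.xls` next to the input file. Extracting one area per tabula-java call keeps the sketch simple at the cost of starting the JVM once per table.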

In order for this script to work, the following utilities must be installed:

  1. jq
  2. poppler-utils
  3. unoconv
  4. LibreOffice Calc
  5. The latest version of tabula-java, built from the master branch.
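
Before running the script, it can help to check that the required commands are on the PATH. The helper below is a small sketch under assumptions: the function name `missing_tools` is made up for illustration, `pdfinfo` stands in for poppler-utils, and `soffice` is assumed to be the LibreOffice launcher name on your system (some distributions use `libreoffice` instead).

```bash
# Print each command from the argument list that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For example, `missing_tools jq pdfinfo unoconv soffice java` prints nothing when everything is installed, and otherwise lists the commands that still need to be installed. The tabula-java jar itself is not a command, so its presence has to be checked separately.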

pdftoxls.sh