Using the command line tabula extractor tool

Mike Tigas edited this page May 20, 2016 · 2 revisions

Say you have a 1,000 page PDF file (or 1,000 separate PDF files) in which every page is laid out identically, and you want the same table from the middle of each page. The full-blown web version of Tabula may not cope with a file that size, and with separate files you would have to upload and process each one individually.

You can use tabula-java — the engine that powers Tabula — as a standalone command-line tool to handle these situations.


Contents:

  1. Download tabula-java
  2. How to get the coordinates of the table you want
  3. Using tabula-extractor with coordinates

Download tabula-java

You’ll need Java installed.

See the README for the latest download link. Simply place the .jar file someplace you will be able to locate later.

You can test that Java and the .jar file are set up correctly by running the following command, which prints the command-line help text. (In this example, the tabula-java download has been placed in a folder called target.)

java -jar ./target/tabula-0.9.0-jar-with-dependencies.jar --help 
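
If the help text does not appear, it may help to check the two prerequisites separately. The following is a minimal sketch; check_setup is a hypothetical helper name and not part of tabula-java, and the path in the usage comment simply follows the example above.

```shell
#!/bin/bash
# check_setup: hypothetical helper, not part of tabula-java.
# Succeeds only if java is on the PATH and the given .jar file exists.
check_setup() {
	local jar="$1"
	if ! command -v java >/dev/null 2>&1; then
		echo "java not found on PATH" >&2
		return 1
	fi
	if [ ! -f "$jar" ]; then
		echo "jar file not found: $jar" >&2
		return 1
	fi
	return 0
}

# Usage, assuming the jar sits in ./target as in the example above:
#   check_setup ./target/tabula-0.9.0-jar-with-dependencies.jar
```

A non-zero exit status, together with the message on standard error, indicates which of the two pieces is missing.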

Grab coordinates of the table you want

The command-line Tabula extractor tool needs the coordinates (in point measurements, not pixels) of the table you want to extract. (Alternatively, it can auto-detect tables, but if you're dealing with thousands of pages with identical regions, it's better to be explicit.)

You can either use the "full" Tabula app to get these coordinates, or manually measure using the Preview app in Mac OS X.

Use the Tabula app to grab table coordinates

  1. Download Tabula from http://tabula.technology/ if you haven't already.
  2. Open Tabula and upload your PDF into the web page that appears.
  3. Select your table area(s) as usual and proceed to the "Preview & Export Extracted Data" step.
  4. Under Export Format, select "Script" instead of CSV, and then click "Export" to download the generated code. Save this file somewhere you can find it.
  5. Open the script you downloaded in a code editor.
  • The Using tabula-extractor with coordinates section below will describe how to use the command-line invocation of Tabula.
  • Note that the script export starts each line with tabula instead of java -jar /path/to/tabula.jar — make sure you edit the script to use this java invocation and the correct path of the downloaded .jar file.
  • The generated script contains measurements already filled in, based on what you selected in the Tabula app. You can use this as a starting point to process many of the same type of document, for example if you have a monthly report that is generated as separate PDFs for each month, and the table you want is located in the exact same place each time.
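
The tabula-to-java edit described above can be done with a one-line substitution. This is a sketch only: it assumes every command line in the exported script begins with the word tabula followed by a space, as noted above. fix_invocation is a hypothetical helper name, and JAR is a placeholder to replace with the real location of your download.

```shell
#!/bin/bash
# Placeholder path: point this at your own tabula-java .jar file.
# (The path must not contain "|" or "&", which are special to sed here.)
JAR="/path/to/tabula.jar"

# fix_invocation: hypothetical helper. Reads a script on stdin and
# rewrites each line that starts with "tabula " to call the jar via java.
fix_invocation() {
	sed "s|^tabula |java -jar $JAR |"
}

# Usage (both file names are made up for illustration):
#   fix_invocation < exported-script.sh > exported-script-java.sh
```

Lines that do not start with tabula, such as comments, pass through unchanged.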

Use Preview to grab table coordinates (OS X only)

  1. Open your PDF file in the Preview app
  2. Make sure Tools > Rectangular selection is checked.
  3. Open the inspector by going to Tools > Show inspector.
  4. Go to the "crop inspector" tab — second from the right, it looks like a ruler
  5. Change "Units" to Points
  6. Select the area you want on the page.

Note the left, top, height, and width parameters and calculate the following:

  • y1 = top
  • x1 = left
  • y2 = top + height
  • x2 = left + width

You'll need these four values in Part 3, "Using tabula-extractor with coordinates", below.
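
The arithmetic above can be sketched in the shell. Bash arithmetic is integer-only, so this example hands the addition to awk. preview_to_area is a hypothetical helper name, and the measurements are made up for illustration.

```shell
#!/bin/bash
# preview_to_area: hypothetical helper, not part of tabula-java.
# Takes Preview's left, top, width and height (in points) and prints
# y1,x1,y2,x2, i.e. top,left,top+height,left+width.
preview_to_area() {
	local left="$1" top="$2" width="$3" height="$4"
	awk -v l="$left" -v t="$top" -v w="$width" -v h="$height" \
		'BEGIN { printf "%s,%s,%s,%s\n", t, l, t + h, l + w }'
}

# Made-up values: left=50, top=100, width=200, height=300
preview_to_area 50 100 200 300   # prints 100,50,400,250
```

The printed string is in the same order as the y1, x1, y2, x2 list above.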

Using tabula-extractor with coordinates

Open up your terminal.

You can now pass these coordinates to tabula-java like this:

java -jar ./target/tabula-0.9.0-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename

where:

  • $y1, $x1, etc. are the numbers you got above
  • $csvfile is the name of a CSV file you'll write the tables out to
  • $filename is the name of the PDF file you're reading in.

You can safely ignore any SLF4J: warning messages.

Example:

$ java -jar ./target/tabula-0.9.0-jar-with-dependencies.jar -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o data_table.csv report.pdf
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

$ head -n10 data_table.csv
"Abdullah, Ghazanfar ","BRONX, NY "," HIV Primary Care, PC ","","$5,250 ","$5,250"
"Aberg, Judith ","NEW  YORK, NY ",Judith Aberg ,"$2,000 ","","$2,000"
"Abriola, Kenneth ","GLAST ONBURY, CT ",Kenneth P Abriola ,"","$5,250 ","$5,250"
"Ahern, Barbara ","JAMAICA, NY ",Barbara A Ahern ,"","$2,750 ","$2,750"
"Akil, Bisher ","NEW  YORK, NY ", Chelsea Village Medical PC ,"","$53,350 ","$53,350"
"Albrecht, Helmut ","COLUMBIA, SC ", Department of Internal Medicine ,"$6,000 ","","$6,000"
"Albrecht, Helmut ","COLUMBIA, SC ",Helmut Albrecht ,"$2,000 ","","$2,000"
"Alpert, Peter ","BRONX, NY ",Peter L Alpert ,"","$12,000 ","$12,000"
"Altidor, Sherly ","NEW  YORK, NY ",Sherly Altidor ,"","$7,500 ","$7,500"
"Alvarado, Eduardo ","EL MONT E, CA ",Eduardo Alvarado ,"","$10,750 ","$10,750"

You can write a script like this to iterate over many identical-format PDFs in a directory:

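#!/bin/bash
# Run the same extraction on every PDF in the directory.
# Quoting "$f" keeps file names that contain spaces intact.
for f in /path/to/dir/*.pdf; do
	java -jar /path/to/tabula/tabula-0.9.0-jar-with-dependencies.jar -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o "$f.csv" "$f"
done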
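
The loop above names each output by appending .csv to the full PDF name, so report.pdf becomes report.pdf.csv. If you would rather get report.csv, shell parameter expansion can strip the extension first. A small sketch follows; csv_name is a hypothetical helper name.

```shell
#!/bin/bash
# csv_name: hypothetical helper. Replaces a trailing ".pdf" with ".csv".
# "${pdf%.pdf}" removes the shortest trailing match of ".pdf".
csv_name() {
	local pdf="$1"
	printf '%s\n' "${pdf%.pdf}.csv"
}

csv_name "/path/to/dir/report.pdf"   # prints /path/to/dir/report.csv
```

Inside the loop, the same expansion could be used directly as the argument to -o, for example -o "${f%.pdf}.csv".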