I'm using Tabula (more specifically the command-line version, tabula-java) to extract data from PDFs. I have a bash script which calls tabula-java a total of four times per PDF. It's a slow process (10 sec per PDF). I have almost 200K PDFs to process, so I was hoping to see some speed-up by using drip.
Unfortunately, my script doesn't like drip. When I pipe tabula's output to tr (translate), the script hangs within tr. Here's one of those tabula calls which hangs in a piped-to tr:
export id_value=$(drip -cp tabula-0.8.0-jar-with-dependencies.jar technology.tabula.CommandLineApp -a 240.593,124.695,264.308,227.97 -p 1 $filename | tr -d '\r\n')
When I say this "hangs" I mean that it enters but never exits tr. Control-C will get me back to the prompt.
The script works just fine when I avoid drip and call tabula through java:
export id_value=$(java -cp tabula-0.8.0-jar-with-dependencies.jar technology.tabula.CommandLineApp -a 240.593,124.695,264.308,227.97 -p 1 $filename | tr -d '\r\n')
Details: OS X 10.8.5, tabula-java 0.8.0
Can you get thread dumps of the processes involved? That would let us see where the Java processes are stuck, at least. I would guess there's some stdio buffering going on that prevents this from working nicely.
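For anyone unsure how to collect those thread dumps, here is a minimal sketch using the standard JDK tools jps and jstack. The helper name dump_threads and the match pattern are hypothetical, and the sketch assumes the hung JVM shows up in the jps -lm listing with "tabula" somewhere in its main class or arguments; if it does not, run jps -lm by hand and pass whatever identifies the process.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a thread dump for every running JVM whose
# jps listing (main class plus arguments) contains the given pattern.
# Assumes the JDK tools jps and jstack are on PATH.
dump_threads() {
  local pattern="$1"
  # jps -lm prints one "<pid> <main class or jar> <args>" line per JVM.
  jps -lm | grep -- "$pattern" | while read -r pid name; do
    echo "=== $pid $name ==="
    # Redirect stdin so jstack cannot consume the rest of the jps listing.
    jstack "$pid" < /dev/null
  done
}
```

While the pipeline is hung, run something like `dump_threads tabula > tabula-threads.txt` from a second terminal and attach the file. Sending SIGQUIT to the JVM (`kill -QUIT <pid>`) is an alternative that makes the JVM print the dump to its own standard output.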