Parsr Installation Guide
This document will guide you through the installation process.
You can install Parsr either with Docker containers or directly on your machine, using the automatic install script or a manual installation. You only need to follow one of these methods.
Containers are already available on Docker Hub.
The documentation to build and run Docker containers is here.
You can install Parsr locally via a Node.js script:
- Download and install Node.js.
- In the root of the Parsr directory, open a terminal and run:
npm run install:pre
This command installs every required and optional dependency.
On Windows, this script requires TLS 1.2 or newer to be enabled.
Note: Currently, table detection requires Python 3.7 or earlier, because a vital dependency, camelot-py, has not yet been ported to Python 3.8.
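The version constraint in the note above can be checked before installing. The following is a minimal sketch, not part of Parsr itself: the function name python_supports_camelot is hypothetical, and it assumes that only Python 3 interpreters are of interest and that the interpreter is available as python3.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given Python version string
# (for example "3.7.9") is a Python 3 release no newer than 3.7.
python_supports_camelot() {
    psc_version="$1"
    psc_major="${psc_version%%.*}"   # text before the first dot
    psc_rest="${psc_version#*.}"     # text after the first dot
    psc_minor="${psc_rest%%.*}"      # text between the first and second dot
    [ "$psc_major" -eq 3 ] && [ "$psc_minor" -le 7 ]
}

# Report on the interpreter found on PATH, if any (assumed to be python3).
if command -v python3 >/dev/null 2>&1; then
    detected="$(python3 -c 'import platform; print(platform.python_version())')"
    if python_supports_camelot "$detected"; then
        echo "Python $detected: table detection should be usable"
    else
        echo "Python $detected: table detection needs Python 3.7 or earlier"
    fi
fi
```

If the check reports a newer interpreter, the rest of Parsr can still be installed. Only table detection is affected.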
If the automatic install script is not available for your platform, you can always do a manual installation by following these steps:
Under a Debian based distribution:
sudo add-apt-repository ppa:ubuntuhandbook1/apps
sudo apt-get update
sudo apt-get install nodejs npm qpdf imagemagick tesseract-ocr libtesseract-dev python3-tk ghostscript python3-pip
pip install camelot-py[cv] numpy pillow scikit-image PyPDF2 pdfminer.six scikit-learn
Under Arch Linux:
pacman -S nodejs npm qpdf imagemagick python-pdfminer tesseract python-pip
pip install camelot-py[cv] numpy pillow scikit-image PyPDF2 pdfminer.six scikit-learn
Note: if camelot-py[cv] generates an error in the console (some shells, such as zsh, treat the square brackets as a glob pattern), try replacing it with camelot-py\[cv\] or quoting it as 'camelot-py[cv]'.
The package manager we suggest using under macOS is Homebrew. To install it, run the following in a terminal:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"
Next, install the required dependencies:
brew install node python qpdf imagemagick tesseract tesseract-lang tcl-tk ghostscript
Next, upgrade python:
brew upgrade python
To install the Python 3 based dependencies (pdfminer and camelot), first install pip3:
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3 get-pip.py
and then the dependencies:
pip3 install pdfminer.six
pip3 install camelot-py[cv]
pip3 install numpy pillow scikit-image
python2.7 -m pip install PyPDF2
To install the Python 2 based dependency (PyPDF2), first install pip:
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py
and then the dependency:
python2.7 -m pip install PyPDF2
Under Windows, the installation procedure for Parsr requires the command where.exe to be in the PATH.
Try typing where in the command prompt. If the command cannot be found, execute the following to add its location to PATH:
setx PATH "$env:PATH;C:\Windows\System32" -m
Then:
- We recommend using Chocolatey as the package manager for installing dependencies under Windows. To install Chocolatey, follow these instructions.
- For the pdfminer extractor for PDFs, follow these steps.
- Install qpdf and imagemagick using PowerShell (Run as Administrator):
choco install qpdf imagemagick
- For table detection, install camelot.
- For tesseract, you can download and install it, or check out other available formats on the wiki. Then, you need to add tesseract.exe to your PATH. If you have installed it in C:\Program Files (x86)\Tesseract-OCR, you can either add it using the user interface or execute the following command in PowerShell (Run as Administrator):
setx PATH "$env:PATH;C:\Program Files (x86)\Tesseract-OCR" -m
- For PDF manipulation, install PyPDF2. Note: If camelot is installed, PyPDF2 will already be available.
The following dependencies are completely optional, and their exclusion does not hinder the proper functioning of the Parsr pipeline.
The function of each, as well as its installation process, is explained below:
In the Parsr platform, MuPDF is used to repair certain error-prone or corrupt PDF files on input.
To install MuPDF, follow the steps corresponding to your environment:
- Under a Debian based distribution:
sudo apt-get install mupdf mupdf-tools
- Under Arch Linux:
pacman -S mupdf-tools
- Under macOS:
brew install mupdf-tools
- Under Windows:
choco install mupdf
If MuPDF is not installed, a corrupt or unreadable input PDF file will be left untreated, and a message reporting this will be logged.
Pandoc is a document format conversion program, used under Parsr to generate PDF files from an intermediate Markdown output after the cleaning operation in the pipeline.
To install Pandoc, follow the steps corresponding to your environment:
- Under a Debian based distribution:
sudo apt-get install pandoc
- Under Arch Linux:
pacman -S pandoc
- Under macOS:
brew install pandoc
- Under Windows:
choco install pandoc
If Pandoc is not installed, the user will not be able to generate PDF files on output. Any configuration requiring a PDF file output will be ignored.
ABBYY FineReader is a proprietary high precision OCR solution for generating rich text from images. One can obtain the ABBYY FineReader Server from here.
ABBYY FineReader is an optional dependency, and its absence should in no way hinder the everyday usage of Parsr's default OCR solution, tesseract.
To install every Node dependency, just open a terminal at the root directory of Parsr and type:
npm install
To verify that you have everything correctly installed, you can follow these steps:
- Run the test suite:
npm run test
- Start the API:
npm run start:api
- Open the following URL in your browser:
If all tests pass and every required dependency is found, then you're good to go!
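As a quick complement to the steps above, a short script can report which command-line tools are missing from PATH. This is only a sketch. The helper missing_commands is hypothetical, and the executable names (node, npm, qpdf, tesseract, python3, mutool, pandoc) are assumptions based on the packages installed in this guide; they may differ on your platform.

```shell
#!/bin/sh
# Hypothetical helper: print each command from the argument list
# that cannot be found on PATH, one per line.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Executable names are assumptions; adjust them to your platform.
required="node npm qpdf tesseract python3"
optional="mutool pandoc"

for cmd in $(missing_commands $required); do
    echo "Missing required command: $cmd"
done
for cmd in $(missing_commands $optional); do
    echo "Missing optional command: $cmd"
done
```

The script prints nothing when every listed command is found. A missing optional command only disables the corresponding feature, as described in the optional dependencies section.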