Using OCR to grep dead trees the easy way

chrisz/paperwork

 
 

Paperwork

Description

Paperwork is a tool to make papers searchable.

The basic idea behind Paperwork is "scan & forget": you should be able to scan a new document and forget about it until the day you need it again. Let the machine do most of the work.

Screenshots

Main window

Search suggestions

Labels

Settings window

Details

Papers are organized into documents. Each document contains pages.

Paperwork relies mainly on three other pieces of software:

  • SANE: To scan the pages
  • Cuneiform or Tesseract: To extract the words from the pages (OCR)
  • GTK/Glade: For the user interface

Page orientation is automatically guessed using OCR.
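One way such a guess can work is to run OCR on each of the four possible rotations and keep the one that yields the most real words. The sketch below only illustrates that idea and is not Paperwork's actual code: `guess_orientation` and its `rotate`, `ocr` and `is_word` parameters are hypothetical stand-ins that you would back with an image library, an OCR engine and a dictionary.

```python
def guess_orientation(page, rotate, ocr, is_word):
    """Return the rotation (in degrees) whose OCR output has the most words.

    All four parameters are hypothetical stand-ins:
      page    -- the scanned page, in whatever form rotate/ocr accept
      rotate  -- callable (page, degrees) -> rotated page
      ocr     -- callable (page) -> recognized text
      is_word -- callable (token) -> True if the token is a real word
    """
    best_angle, best_score = 0, -1
    for angle in (0, 90, 180, 270):
        text = ocr(rotate(page, angle))
        # Score a rotation by how many recognized tokens are real words.
        score = sum(1 for token in text.split() if is_word(token))
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

On a tie, the first rotation tried wins, so a page that produces no recognizable words at all is left unrotated.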

Paperwork uses a custom indexing system to search documents and to provide keyword suggestions. Since OCR is not perfect, and since some documents don't contain useful keywords, Paperwork also lets you attach labels to each document.
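To make the idea concrete, here is a minimal sketch of a keyword index with suggestions and labels. It is an illustration only, not Paperwork's real index: the `TinyIndex` class and its methods are hypothetical, and the suggestions use `difflib` from the standard library as a stand-in for whatever matching Paperwork actually does.

```python
import difflib
import re
from collections import defaultdict


class TinyIndex(object):
    """Toy inverted index: keyword -> set of document ids (hypothetical)."""

    def __init__(self):
        self._docs_by_word = defaultdict(set)

    def add_document(self, doc_id, ocr_text, labels=()):
        # Labels are indexed like OCR words, so a document with poor OCR
        # output can still be found through its labels.
        words = re.findall(r"\w+", ocr_text.lower())
        for label in labels:
            words.extend(re.findall(r"\w+", label.lower()))
        for word in words:
            self._docs_by_word[word].add(doc_id)

    def search(self, query):
        """Return the ids of the documents containing every query word."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return set()
        results = set(self._docs_by_word.get(words[0], ()))
        for word in words[1:]:
            results &= self._docs_by_word.get(word, set())
        return results

    def suggest(self, word, limit=3):
        """Return indexed keywords close to a possibly misspelled word."""
        known = list(self._docs_by_word)
        return difflib.get_close_matches(word.lower(), known, n=limit)
```

A document whose OCR text is unusable can still be retrieved if it was added with a label, for example `index.add_document("doc2", "@@##", labels=["Bank"])` followed by `index.search("bank")`.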

Licence

GPLv3. See COPYING.

Dependencies

  • python v2.7
    Paperwork is written for Python 2.7. Depending on your Linux distribution, you may have to invoke "python2" instead of "python" (for instance on Arch Linux).
  • pygtk v2 (required)
    • Debian/Ubuntu package: python-gtk2
  • python-glade2 (required)
    • Debian/Ubuntu package: python-glade2
  • pycountry (required)
    • Debian/Ubuntu package: python-pycountry
  • python-imaging (required)
    • Debian/Ubuntu package: python-imaging
  • python-poppler (required)
    • Debian/Ubuntu package: python-poppler
  • python-enchant (required)
    • Debian/Ubuntu package: python-enchant
  • python-levenshtein (required)
    • Debian/Ubuntu package: python-levenshtein
  • pyinsane (required)
    • Debian/Ubuntu package: none at the moment
    • Manual installation:
      • git clone git://github.com/jflesch/pyinsane.git
      • cd pyinsane
      • sudo python ./setup.py install
  • OCR (optional for document searching; required for scanning)
    • Tesseract (>= v3) (recommended)
      • Debian package: none at the moment
      • Ubuntu package: tesseract-ocr tesseract-ocr-
    • Or Cuneiform (>= v1.1)
      • Debian/Ubuntu package: cuneiform
  • pyocr (required)
    • Debian/Ubuntu package: none at the moment
    • Manual installation:
      • git clone git://github.com/jflesch/pyocr.git
      • cd pyocr
      • sudo python ./setup.py install
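
Before installing, it can help to check which of the Python dependencies are already importable. The following is a small sketch for that. The module names in `REQUIRED_MODULES` are assumptions about how the packages above are imported (this README does not state them) and may differ on your distribution; `missing_modules` is a hypothetical helper.

```python
import importlib

# Assumed import names for the packages listed above; they are not stated
# in this README and may differ between distributions.
REQUIRED_MODULES = [
    "gtk", "gtk.glade", "pycountry", "PIL", "poppler",
    "enchant", "Levenshtein", "pyinsane", "pyocr",
]


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules: " + (", ".join(missing) or "none"))
```

This only checks the Python modules. The OCR engines themselves (Tesseract or Cuneiform) are separate programs and are not covered by it.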

Installation

$ git clone git://github.com/jflesch/paperwork.git
$ cd paperwork
$ sudo python ./setup.py install
$ paperwork

Enjoy :-)

Contact

Development

Rules

Try to stick to PEP-8 as much as possible. Mainly:

  1. Lines are at most 80 characters long
  2. Indentation is done using 4 spaces

Code organisation

The code is split into two parts:

  • backend: Everything related to document management. May depend on various things, but not on Gtk
  • frontend: The GUI. Entirely dependent on Gtk

Tips

If you want to make changes, here are a few things that can help you:

  1. You don't need to install paperwork to run it. Just run 'src/paperwork.py'.
  2. Paperwork looks for a 'paperwork.conf' in the current working directory before looking for a '.paperwork.conf' in your home directory. So if you want to use a different config and/or a different set of documents for development than for your real-life work, just copy your '~/.paperwork.conf' to './paperwork.conf'. Note that the settings dialog will then update './paperwork.conf' instead of '~/.paperwork.conf'.
  3. "pep8" is your friend
  4. "pylint" is your friend: $ cd src ; pylint --rcfile=../pylintrc *.py
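
The config lookup order from tip 2 can be sketched as follows. This is only an illustration of the described behaviour, not the code Paperwork uses: `find_config` and its `cwd` and `home` parameters are hypothetical and exist mainly to make the function easy to try out.

```python
import os


def find_config(cwd=None, home=None):
    """Return the path of the first config file found, or None.

    Lookup order: 'paperwork.conf' in the current working directory
    first, then '.paperwork.conf' in the home directory.
    """
    cwd = cwd or os.getcwd()
    home = home or os.path.expanduser("~")
    candidates = [
        os.path.join(cwd, "paperwork.conf"),
        os.path.join(home, ".paperwork.conf"),
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

Because the local file is checked first, a 'paperwork.conf' in your development directory shadows the one in your home directory without modifying it.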
