Hocrtex

Hocr is an HTML microformat for information produced by OCR packages. You can find more information about hocr in this document. It can be generated by tesseract (version 3.0 or later), ocropus, or cuneiform. Hocrtex makes this information available to LaTeX, so that a hocr file can be converted to PDF.

Hocrtex is based on xmltex, an XML processor written in TeX.

Install

Unpack the contents of the file hocr.tar.gz into your local texmf directory. You can find its location with the following command:

kpsewhich -var-value TEXMFHOME
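
The install step can be scripted. The following is only a sketch: `install_hocrtex` is a hypothetical helper name, and it assumes the archive is laid out relative to the texmf root, which you should check before unpacking.

```shell
#!/bin/sh
# Sketch: unpack hocr.tar.gz into the local texmf tree.
# Usage: install_hocrtex ARCHIVE [TARGET_DIR]
# install_hocrtex is a hypothetical helper, not part of hocrtex.
install_hocrtex() {
    archive=$1
    # When no target directory is given, ask kpsewhich for TEXMFHOME.
    target=${2:-$(kpsewhich -var-value TEXMFHOME)}
    if [ -z "$target" ]; then
        echo "could not determine TEXMFHOME" >&2
        return 1
    fi
    # Create the directory if it does not exist yet, then extract into it.
    mkdir -p "$target" && tar -xzf "$archive" -C "$target"
}
```

Calling `install_hocrtex hocr.tar.gz` unpacks into the directory reported by kpsewhich; pass a second argument to choose a different directory.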

Usage

First, you need a hocr file. To get one, process the images of your scanned book with one of the OCR packages listed above.

In tesseract, you can generate hocr output with this procedure:

Create a file named "hocr", put it somewhere, and copy this line into it:

tessedit_create_hocr 1

Then call tesseract:

tesseract imagename outputname -l lang_name +path_to_hocr/hocr

Now we have an HTML file with the hocr information.
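
The two tesseract steps can be wrapped in small shell functions. This is a sketch, not part of hocrtex: `write_hocr_config` and `ocr_page` are hypothetical names, and the tesseract call simply repeats the command line shown above.

```shell
#!/bin/sh
# write_hocr_config DIR
# Create the one-line tesseract config file named "hocr" inside DIR.
write_hocr_config() {
    mkdir -p "$1" && printf 'tessedit_create_hocr 1\n' > "$1/hocr"
}

# ocr_page IMAGE OUTPUT LANG CONFIG_DIR
# Write the config file, then run tesseract with the same argument
# order as the command above. Both function names are hypothetical.
ocr_page() {
    write_hocr_config "$4" || return 1
    tesseract "$1" "$2" -l "$3" "+$4/hocr"
}
```

For example, `ocr_page normal-0.png normal eng "$HOME/tessconfig"` would process one page; the image name, output name, language, and config directory here are placeholders.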

To process it with hocrtex, we need to generate a config file using the hocrconfig package.

Create a file sample.tex:

\documentclass{article}
\usepackage[
   FileName=example   % name of the hocr file, without the .html suffix
  ,ResizeRatio=5.5    % divisor that converts bbox coordinates to points
  ,ImageName=normal-  % In hocr, each page records the name of its
                      % source image, but if the source is a multipage TIFF,
                      % that name is the same on every page. It is best to
                      % convert the TIFF into a series of PNG images
                      % named normal-0.png, ..., normal-n.png.
                      % ImageName is the prefix before the image number.
  ,Driver=underimage  % the driver defines the actions taken for hocr classes
]{hocrconfig}
\begin{document}
\end{document}
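
A note on choosing ResizeRatio: the sketch below assumes that the bbox coordinates are pixel positions in the scanned image and that a point is 1/72 inch, so the divisor is the scan resolution divided by 72. Both assumptions, and the helper name `resize_ratio`, are illustrative and not taken from hocrtex itself; pass 72.27 as the second argument if TeX points are meant.

```shell
#!/bin/sh
# resize_ratio DPI [UNITS_PER_INCH]
# Print the number of pixels per point for a scan made at DPI.
# UNITS_PER_INCH defaults to 72 (an assumption; TeX points use 72.27).
resize_ratio() {
    awk -v dpi="$1" -v upi="${2:-72}" 'BEGIN { printf "%.2f\n", dpi / upi }'
}
```

Under these assumptions, `resize_ratio 396` prints 5.50, which matches the value used in sample.tex.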

After sample.tex is compiled with LaTeX, the file normal.cfg is created. Now you can call xmltex:

pdfxmltex normal.html

The file normal.pdf will be created.
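
The two compilation steps can be chained so that xmltex runs only if the config step succeeds. This is a minimal sketch: `build_hocr_pdf` is a hypothetical helper name, and it assumes that `latex` and `pdfxmltex` are on your PATH.

```shell
#!/bin/sh
# build_hocr_pdf CONFIG_TEX HOCR_HTML
# Compile the config document first, then run pdfxmltex on the hocr file.
# Stops early if the LaTeX run fails. The function name is hypothetical.
build_hocr_pdf() {
    latex "$1" || return 1
    pdfxmltex "$2"
}
```

With the file names used above, the call would be `build_hocr_pdf sample.tex normal.html`.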
