hOCR is an HTML microformat for the output of OCR packages. You can find more information about hOCR in this document. It can be generated by tesseract (version 3.0 or later), ocropus, or cuneiform. With hocrtex, you can use this information from LaTeX and convert the hOCR file to PDF.
Hocrtex is based on xmltex, an XML processor written in TeX.
Unpack the contents of the file hocr.tar.gz into your local texmf directory. You can find its location with the following command:
kpsewhich -var-value TEXMFHOME
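The installation step can be wrapped in a small shell function. This is only a sketch: the function name install_hocrtex is made up for illustration, and it assumes the archive is laid out so that it can be unpacked directly into the texmf directory.

```shell
# install_hocrtex ARCHIVE [TEXMF_DIR]   (hypothetical helper, not part of hocrtex)
# Unpack ARCHIVE into TEXMF_DIR. When TEXMF_DIR is omitted, ask kpsewhich
# for the per-user texmf tree (TEXMFHOME).
install_hocrtex() {
    archive="$1"
    dest="${2:-$(kpsewhich -var-value TEXMFHOME)}"
    if [ -z "$dest" ]; then
        echo "install_hocrtex: could not determine the texmf directory" >&2
        return 1
    fi
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Typical call, run from the directory that holds the archive:
#   install_hocrtex hocr.tar.gz
```

Passing the target directory as a second argument is useful when you want to try the package in a scratch tree first.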
First, you need an hOCR file. Process the images of your scanned book with one of the OCR packages listed above.
In tesseract, you can generate hOCR output with this procedure:
- Create a file named "hocr", put it somewhere, and copy this line into it:
tessedit_create_hocr 1
- Call tesseract, passing the path to that file:
tesseract imagename outputname -l lang_name +path_to_hocr/hocr
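The two steps of the tesseract procedure can be scripted for a whole book. The sketch below reuses the exact command form shown above; the function names write_hocr_config and ocr_images are hypothetical, and using the image name without its extension as the output name is an assumption made for convenience.

```shell
# write_hocr_config PATH   (hypothetical helper)
# Create the one-line tesseract config file that enables hOCR output.
write_hocr_config() {
    printf 'tessedit_create_hocr 1\n' > "$1"
}

# ocr_images CONFIG LANG IMAGE...   (hypothetical helper)
# Run tesseract on each image, using the image name without its
# extension as the output name.
ocr_images() {
    config="$1"; lang="$2"; shift 2
    for image in "$@"; do
        tesseract "$image" "${image%.*}" -l "$lang" "+$config" || return 1
    done
}

# Example (file names and language code are placeholders):
#   write_hocr_config ./hocr
#   ocr_images ./hocr eng page-0.png page-1.png
```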
Now we have an HTML file with the hOCR information.
To process it with hocrtex, we need to generate a config file using the hocrconfig package.
Create the file sample.tex:
\documentclass{article}
\usepackage[
FileName=example % name of the hOCR file without the .html suffix
,ResizeRatio=5.5 % divisor that converts bbox coordinates to points
,ImageName=normal- % In hOCR, each page records the name of its
% source image, but if the source image is a multipage TIFF,
% that name is the same on every page. It is best to
% convert the TIFF into a series of PNG images
% named normal-0.png, ..., normal-n.png.
% ImageName is the prefix before the image number.
,Driver=underimage % the driver defines the actions for hOCR classes
]{hocrconfig}
\begin{document}
\end{document}
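Two of the options above usually need some preparation: a value for ResizeRatio and a set of numbered PNG files for ImageName. The following sketch shows one way to get both. It rests on assumptions that the package does not state: that bbox coordinates are pixels of the scanned image, so the ratio is the scan resolution divided by the number of points per inch, and that ImageMagick's convert command is installed (newer ImageMagick versions call it magick). The function names are hypothetical.

```shell
# resize_ratio DPI [POINTS_PER_INCH]   (hypothetical helper)
# Print the number of scan pixels per point for a scan made at DPI.
# POINTS_PER_INCH defaults to 72.27 (TeX points); pass 72 for PostScript points.
# Which kind of point hocrtex expects is an assumption here.
resize_ratio() {
    awk -v dpi="$1" -v ppi="${2:-72.27}" 'BEGIN { printf "%.2f\n", dpi / ppi }'
}

# split_tiff TIFF PREFIX   (hypothetical helper)
# Write PREFIX0.png, PREFIX1.png, ... from a multipage TIFF with ImageMagick.
split_tiff() {
    convert "$1" "${2}%d.png"
}

# Example (scan.tiff is a placeholder name):
#   resize_ratio 400               # a 400 dpi scan gives a value close to 5.5
#   split_tiff scan.tiff normal-   # writes normal-0.png, normal-1.png, ...
```

If the text layer does not line up with the page image in the resulting PDF, the ratio is the first value to check against the real resolution of your scans.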
After compiling sample.tex with LaTeX, the file normal.cfg is created. Now you can call xmltex:
pdfxmltex normal.html
The file normal.pdf will be created.
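The last two steps can be chained so that pdfxmltex only runs when the LaTeX run succeeded. This is a minimal sketch; build_pdf is a hypothetical name, and it assumes that the latex and pdfxmltex commands are on your PATH.

```shell
# build_pdf TEX_FILE HOCR_HTML   (hypothetical helper)
# 1. Compile the LaTeX file that loads hocrconfig, which writes the config file.
# 2. Run pdfxmltex on the hOCR file to produce the PDF.
build_pdf() {
    latex "$1" && pdfxmltex "$2"
}

# With the file names used in this section:
#   build_pdf sample.tex normal.html
```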