pdftoroff.1

.TH pdftoroff 1 "September 12, 2017"
.
.
.
.SH NAME
pdftoroff - convert pdf to various text formats (roff, html, TeX, text)
.
.
.
.SH SYNOPSIS
.TP 10
\fBpdftoroff\fP
[\fI-r\fP|\fI-w\fP|\fI-p\fP|\fI-f\fP|\fI-t\fP|\fI-s fmt\fP]
[\fI-m method\fP [\fI-d distance\fP] [\fI-o order\fP]]
[\fI-i range\fP] [\fI-b box\fP] [\fI-n\fP] [\fI-v\fP]
\fIfile.pdf\fP
.
.
.
.SH DESCRIPTION

Extract text from a pdf file undoing page, column and paragraph formatting if
possible but retaining italic and bold faces. The output is in one of the
following formats: groff(1), html, plain TeX, text with font changes, simple
text or a user-given format.

The groff output can be used to reformat the text to a smaller page size and a
different font to make it more readable on a small tablet or e-ink ebook
reader, as shown in the REFORMAT section. The \fIpdftoebook\fP script does
this.
.
.
.
.SH OPTIONS
.TP
.B
-r
output in groff(1) format; it can be directly compiled by a pipe like
\fIpdftoroff -r file.pdf | groff -Dutf8 -Tutf8 -\fP or prepended by code for
page and character formatting, like in the REFORMAT section, below

.TP
.B
-w
output in html format; only the body of the html file is generated, not the
header

.TP
.B
-p
convert to plain TeX; see BUGS below

.TP
.B
-f
text format; font changes are marked \fI\\[fontname]\fP, and backslashes
escaped to \fI\\\\\fP

.TP
.B
-t
text only

.TP
\fB-s\fP \fIfmt\fP
output using the parameters in \fIfmt\fP;
see OUTPUT FORMAT, below

.TP
\fB-m\fP \fImethod\fP
conversion method:

.RS
.IP 0 4
detect columns on the fly
.IP 1
use the bounding box of the page
.IP 2
use the blocks of text on the page
.IP 3
use the blocks of text on the page, sorted
.IP 4
use rows of text
.RE

.IP
the default method is 1, which is fast and usually gives good results on
single-column documents; methods 2 is slower, but often produces better results
on multiple-column documents; method 3 is even slower, but the sorting of the
blocks may be necessary when the characters in the document are not in the
correct order; method 4 is for tables; see \fICONVERSION METHODS\fP, below

.TP
\fB-d\fP \fIdistance\fP
minimal distance between blocks of text in the page;
for conversion method 4 the default is 0, for all others is 15; a smaller value
like 10 may be appropriate when the document uses small fonts or has little
space between columns or between the header/footer and the text; this value
only affects methods 2, 3 and 4

.TP
\fB-o\fP \fIorder\fP
the method used for sorting the blocks of text in the page:

.RS
.IP 0 4
by their position, quick and approximate
.IP 1
by their position, exact
.IP 2
by the occurrence of their characters in the file
.RE

.TP
\fB-i\fP \fIrange\fP
pages to convert, in the format \fIfirst:last\fP;
negative or zero is from the last page backwards;
for example, \fI-2:0\fP is the range for converting the last three pages

.TP
\fB-b\fP \fI[x1,y1-x2,y2]\fP
convert only the characters that are positioned
within the coordinates \fIx1,y1\fP and \fIx2,y2\fP

.TP
.B -n
do not convert the recurring elements in the page, such as page numbers,
headers and footers; locating these elements takes time, making the conversion
not to start immediately; it may fail, resulting in loss of text or these
elements ending up in the output; see \fIpdfrecur(1)\fP for details

.TP
.B -v
print markers to facilitate checking that the output is correct; see
\fIMARKERS\fP, below

.SH REFORMAT

The following script re-formats a pdf file for a 200x250 page with 5pt margins
and Helvetica font, so that it reads better to a small tablet or e-ink reader.
It extracts the text from the pdf file, prepends it with some groff(7) page and
font code and then compiles back to pdf. This is the core of the
\fIpdftoebook\fP script.

.nf
.ft I
{
cat <<!
\[char46]device papersize=200p,250p
\[char46]po 5p
\[char46]ll 190p
\[char46]pl 240p
\[char46]fam H
!
pdftoroff -r file.pdf;
} | \\
groff -Dutf8 -Tpdf - > new.pdf
.ft P
.fi

.
.
.
.SH OUTPUT FORMAT

The text from the pdf file is scanned for font changes and paragraph breaks.
Short lines, indents and vertical spaces are taken as the start of a new
paragraph, otherwise the new line is considered the continuation of the
previous. Font names are matched agains "Italic" and "Bold", which indicate the
begin of an italic or bold face, and their lack as the end of the font face.

The various output formats are obtained by adding the appropriate strings at
paragraph breaks and font changes, and by substituting some characters (for
example, a plain \fI<\fP is replaced by \fI&lt;\fP for the html format).

The \fI-s fmt\fP option allows arbitrary output strings. For example, the html
format can be alternatively generated by the command:

.nf
\fI
pdftoroff -s '
<p>,</p>
,,,,,,<i>,</i>,<b>,</b>,true,\\,.,&lt;,&gt;,&amp;' file.pdf
\fP
.fi

The format string is a comma-separated list of the following fields. Some may
be empty and some may contain newlines.

.TP
.I
parstart
the string printed when a paragraph begins
.TP
.I
parend
the string printed when a paragraph ends
.TP
.I
fontname
the \fIprintf(3)\fP format for printing the font name;
for example, the \fI-f\fP option uses \fI\\\\[%s]\fP, so that when the text
begins using the font TimesNewRomanCM this is marked
\fI\\[TimesNewRomanCM]\fP in the output
.TP
.I
plain
printed when the font changes to non-italic and non-bold
.br
(example: \fI\\fR\fP in roff)
.TP
.I
italic
printed when the font changes to italic but not bold
.br
(example: \fI\\fI\fP in roff)
.TP
.I
bold
printed when the font changes to bold but not italic
.br
(example: \fI\\fB\fP in roff)
.TP
.I
bolditalic
printed when the font changes to both italic and bold
.br
(example: \fI\\f[BI]\fP in roff)
.TP
.I
italicbegin
printed when the text begins using an italic font
.br
(example: \fI<i>\fP in html)
.TP
.I
italicend
printed when the text ends using an italic font
.br
(example: \fI</i>\fP in html)
.TP
.I
boldbegin
printed when the text begins using a bold font
.br
(example: \fI<b>\fP in html)
.TP
.I
boldend
printed when the text ends using a bold font
.br
(example: \fI</b>\fP in html)
.TP
.I
reset
if this is \fItrue\fP,
turn off all active font faces when a paragraph ends and restore them when the
new one starts; for example, if the pdf starts using a bold font and then ends
it after two paragraphs, the html output is \fI<p><b>first paragraph</b></p>
<p><b>second</b></p>\fP
.TP
.I
backslash
replace every backslash with this string
.TP
.I
firstdot
replace a dot at the start of a line with this string
(this is only useful for roff output)
.TP
.I
less
replace the minus sign (\fI<\fP) with this
.TP
.I
greater
replace the greater sign (\fI>\fP) with this
.TP
.I
and
replace the ampersand (\fI&\fP) with this
.
.
.
.SH CONVERSION METHODS

All conversion methods scan the characters in the page in the same order as in
the pdf file. A new line is detected on:

.IP \(bu 4
a large vertical space from the previous character
.IP \(bu
a small vertical space from the previous character, if the previous character
is not at the right of the column (short previous line)
.IP \(bu
a small vertical space from the previous character, if the current character is
not at the left of the column (indented line)
.RE

The second and third conditions depend on the left and right border of the
current column. The conversion methods differ on how these are found:

.IP 0 4
The left border is the left corner of the leftmost character in the page.
Column changes are detected by large decreases in the y coordinate, and
cause a recalculation of the left border from the remaining charaters in the
page. The right border is a fixed position in the page.

.IP 1
The left and right border are given by the bounding box of the page. This works
on single-column pages. This is the default method.

.IP 2
The blocks of text in the page are determined before scanning the page. The
left and right borders for each character are those of the blocks of text it is
in.

.IP 3
This is the same as 2, but blocks are sorted before scanning the page. It is
slower than method 2 not because of the sorting but because the whole page
needs to be scanned in search of characters in the first block, again for the
second, the third, etc. This may be necessary if the characters in the file are
not in the order they shold be printed.

Three sorting algorithms can be used: the first two try to guess the order of
the blocks based on their position on the page; the third does it based on the
occurrence of their characters in the page. In particular, the algorithms based
on the position of the box sort boxes vertically if they overlap horizontally,
otherwise they order them horizontally. This usually gives reasonable results
on single-column and multiple-column documents. The difference between the two
is that the first is quick and approximate, the second is slower and exact. The
third method scans the characters as they occurr in the file; the block
containing the first is the first block; the block containing the first
character not in the first block is the second, and so on.

.IP 4
This method assumes that the document is a single table: a sequence or rows,
each made of a number of cells. The rows are first located in vertical order,
then each is converted to a line of text.

This method allows converting tables even if their cells are ordered by columns
instead of rows, which is often the case.

The usual rules for line breaking and joining are ignored, and every row is
output as a single line. The minimal text distance (option \fI-f\fP) is used as
the minimal distance between rows; if they are very close to each other, a
negative value may be used to separate them.

.
.
.SH MARKERS

Unformatting text requires introducing line breaks in some places and not in
others and removing the hyphens used to break a word between lines.
This cannot in general be done uniquely. Option \fI-v\fP is for printing
markers that show what have been done and why.
.TP
.I []
a newline was translated into a space because it was considered to
separate two lines of the same paragraph
.TP
.I [-]
an hyphen and the following newline were removed because they looked like a
word broken between two lines
.TP
.I [S]
the following line break is because the current line is short, like the only
or final line of a paragraph
.TP
.I [E]
same, but the line is also at the end of a block of text
.TP
.I [V]
the following line break is due to vertical space between lines
.TP
.I [I]
the following line break is because the next line is indented

.P

These markers are intended for debugging and checking the final result. For
example, a text may look converted correctly, but two dash-separated words have
been merged because the dash fell at the end of the line, and therefore looked
like the hyphen of a single hyphenated word broken between two lines. Marker
.I [-]
helps helps for checking this kind of errors. Spelling the two parts that have
been merged and their result may suggest whether merging was correct, but some
cases cannot be automatically solved this way. For example, if the dash in the
sentence "Price is not under 3, is much more -- over 10, I think." is placed at
the end of a line, it looks like the word "moreover" when hyphenated to split
it between two lines.

.
.
.
.SH BUGS

Replacements are limited to some fixed characters (\\, ., <, > and &). Instead,
the \fI-s\fP option should support replacing arbitrary characters (say,
\fI@\fP).

The plain TeX conversion is primitive: it does not convert accented characters
as it should; it does not support fonts that are both bold and italic; it does
not finish with \fI\\end\fP (but the latter is coherent with generating only
the body of the text in the other formats).

A command line option should allow specifying a number of boxes so that text is
extracted from them in order rather than from the whole page. This is because
the method used by pdftoroff to detect the start of a new column does not
always work, and even if it does, characters in the file are not necessarily in
the correct order. Such an option would also allow to discard headers and
footer. As an example, \fI-b box1,box2,box3;box4;box5;2*\fP would extract text
from \fIbox1,box2,box3\fP from the first page, from \fIbox4\fP from the second,
from \fIbox5\fP from the third, and the repeat with \fIbox4\fP and \fIbox5\fP
until the end of the document.

The html output is not always correct. If the document starts with an italic
font, then switches to italic and bold and then to bold only, the resulting
code is \fI<i>...<b>....</i>...</b>\fP, which is not nested correctly. The
right code would be \fI<i>...<b>....</b></i><b>...</b>\fP. Two solutions are
possible:

.IP "  * " 4
turn off all faces before starting a new one
.IP "  * "
remember which of italic and bold was started first

.P
The numeric parameters for detecting the start of a new paragraph or column are
fixed (the \fIstruct measure\fP in the code). They should be changeable by
command line options.

.SH SEE ALSO
pdftotext(1), pdftohtml(1), poppler (https://poppler.freedesktop.org/)