.TH pdftoroff 1 "September 12, 2017"
.
.
.
.SH NAME
pdftoroff - convert pdf to various text formats (roff, html, TeX, text)
.
.
.
.SH SYNOPSIS
.TP 10
\fBpdftoroff\fP
[\fI-r\fP|\fI-w\fP|\fI-p\fP|\fI-f\fP|\fI-t\fP|\fI-s fmt\fP]
[\fI-m method\fP [\fI-d distance\fP] [\fI-o order\fP]]
[\fI-i range\fP] [\fI-b box\fP] [\fI-n\fP] [\fI-v\fP]
\fIfile.pdf\fP
.
.
.
.SH DESCRIPTION
Extract text from a pdf file undoing page, column and paragraph formatting if
possible but retaining italic and bold faces. The output is in one of the
following formats: groff(1), html, plain TeX, text with font changes, simple
text or a user-given format.
The groff output can be used to reformat the text to a smaller page size and a
different font to make it more readable on a small tablet or e-ink ebook
reader, as shown in the REFORMAT section. The \fIpdftoebook\fP script does
this.
.
.
.
.SH OPTIONS
.TP
.B
-r
output in groff(1) format; it can be directly compiled by a pipe like
\fIpdftoroff -r file.pdf | groff -Dutf8 -Tutf8 -\fP or prepended by code for
page and character formatting, like in the REFORMAT section, below
.TP
.B
-w
output in html format; only the body of the html file is generated, not the
header
.TP
.B
-p
convert to plain TeX; see BUGS below
.TP
.B
-f
text format; font changes are marked \fI\\[fontname]\fP, and backslashes
escaped to \fI\\\\\fP
.TP
.B
-t
text only
.TP
\fB-s\fP \fIfmt\fP
output using the parameters in \fIfmt\fP;
see OUTPUT FORMAT, below
.TP
\fB-m\fP \fImethod\fP
conversion method:
.RS
.IP 0 4
detect columns on the fly
.IP 1
use the bounding box of the page
.IP 2
use the blocks of text on the page
.IP 3
use the blocks of text on the page, sorted
.IP 4
use rows of text
.RE
.IP
the default method is 1, which is fast and usually gives good results on
single-column documents; method 2 is slower, but often produces better results
on multiple-column documents; method 3 is even slower, but the sorting of the
blocks may be necessary when the characters in the document are not in the
correct order; method 4 is for tables; see \fICONVERSION METHODS\fP, below
.TP
\fB-d\fP \fIdistance\fP
minimal distance between blocks of text in the page;
for conversion method 4 the default is 0, for all others it is 15; a smaller value
like 10 may be appropriate when the document uses small fonts or has little
space between columns or between the header/footer and the text; this value
only affects methods 2, 3 and 4
.TP
\fB-o\fP \fIorder\fP
the method used for sorting the blocks of text in the page:
.RS
.IP 0 4
by their position, quick and approximate
.IP 1
by their position, exact
.IP 2
by the occurrence of their characters in the file
.RE
.TP
\fB-i\fP \fIrange\fP
pages to convert, in the format \fIfirst:last\fP;
negative or zero is from the last page backwards;
for example, \fI-2:0\fP is the range for converting the last three pages
.TP
\fB-b\fP \fI[x1,y1-x2,y2]\fP
convert only the characters that are positioned
within the coordinates \fIx1,y1\fP and \fIx2,y2\fP
.TP
.B -n
do not convert the recurring elements in the page, such as page numbers,
headers and footers; locating these elements takes time, so the conversion
does not start immediately; it may fail, resulting in loss of text or these
elements ending up in the output; see \fIpdfrecur(1)\fP for details
.TP
.B -v
print markers to facilitate checking that the output is correct; see
\fIMARKERS\fP, below
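.P
As an illustration of how these options combine, the following commands use
only the options described above; \fIfile.pdf\fP is a placeholder name, and the
values are examples rather than recommendations:
.nf
.ft I
# plain text of the last three pages
pdftoroff -t -i -2:0 file.pdf
# html body, block-based method, smaller distance between blocks
pdftoroff -w -m 2 -d 10 file.pdf
# plain text without recurring elements, with markers for checking
pdftoroff -t -n -v file.pdf
.ft P
.fi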
.SH REFORMAT
The following script re-formats a pdf file for a 200x250 page with 5pt margins
and Helvetica font, so that it reads better on a small tablet or e-ink reader.
It extracts the text from the pdf file, prepends it with some groff(7) page and
font code and then compiles back to pdf. This is the core of the
\fIpdftoebook\fP script.
.nf
.ft I
{
cat <<!
\[char46]device papersize=200p,250p
\[char46]po 5p
\[char46]ll 190p
\[char46]pl 240p
\[char46]fam H
!
pdftoroff -r file.pdf;
} | \\
groff -Dutf8 -Tpdf - > new.pdf
.ft P
.fi
.
.
.
.SH OUTPUT FORMAT
The text from the pdf file is scanned for font changes and paragraph breaks.
Short lines, indents and vertical spaces are taken as the start of a new
paragraph, otherwise the new line is considered the continuation of the
previous. Font names are matched against "Italic" and "Bold", which indicate the
beginning of an italic or bold face, while their absence marks the end of the font face.
The various output formats are obtained by adding the appropriate strings at
paragraph breaks and font changes, and by substituting some characters (for
example, a plain \fI<\fP is replaced by \fI&lt;\fP for the html format).
The \fI-s fmt\fP option allows arbitrary output strings. For example, the html
format can be alternatively generated by the command:
.nf
\fI
pdftoroff -s '
<p>,</p>
,,,,,,<i>,</i>,<b>,</b>,true,\\,.,&lt;,&gt;,&amp;' file.pdf
\fP
.fi
The format string is a comma-separated list of the following fields. Some may
be empty and some may contain newlines.
.TP
.I
parstart
the string printed when a paragraph begins
.TP
.I
parend
the string printed when a paragraph ends
.TP
.I
fontname
the \fIprintf(3)\fP format for printing the font name;
for example, the \fI-f\fP option uses \fI\\\\[%s]\fP, so that when the text
begins using the font TimesNewRomanCM this is marked
\fI\\[TimesNewRomanCM]\fP in the output
.TP
.I
plain
printed when the font changes to non-italic and non-bold
.br
(example: \fI\\fR\fP in roff)
.TP
.I
italic
printed when the font changes to italic but not bold
.br
(example: \fI\\fI\fP in roff)
.TP
.I
bold
printed when the font changes to bold but not italic
.br
(example: \fI\\fB\fP in roff)
.TP
.I
bolditalic
printed when the font changes to both italic and bold
.br
(example: \fI\\f[BI]\fP in roff)
.TP
.I
italicbegin
printed when the text begins using an italic font
.br
(example: \fI<i>\fP in html)
.TP
.I
italicend
printed when the text ends using an italic font
.br
(example: \fI</i>\fP in html)
.TP
.I
boldbegin
printed when the text begins using a bold font
.br
(example: \fI<b>\fP in html)
.TP
.I
boldend
printed when the text ends using a bold font
.br
(example: \fI</b>\fP in html)
.TP
.I
reset
if this is \fItrue\fP,
turn off all active font faces when a paragraph ends and restore them when the
new one starts; for example, if the pdf starts using a bold font and then ends
it after two paragraphs, the html output is \fI<p><b>first paragraph</b></p>
<p><b>second</b></p>\fP
.TP
.I
backslash
replace every backslash with this string
.TP
.I
firstdot
replace a dot at the start of a line with this string
(this is only useful for roff output)
.TP
.I
less
replace the less-than sign (\fI<\fP) with this
.TP
.I
greater
replace the greater-than sign (\fI>\fP) with this
.TP
.I
and
replace the ampersand (\fI&\fP) with this
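.P
As a further illustration, a format string along the following lines might
produce Markdown-like text. This is only a sketch: the Markdown target is an
assumption not covered by the built-in formats, \fIfile.pdf\fP is a placeholder
name, and text that is both bold and italic is subject to the nesting problem
described in BUGS.
.nf
.ft I
# parstart and parend are a newline each; italic is *...*, bold is **...**
pdftoroff -s '
,
,,,,,,*,*,**,**,true,\\,.,<,>,&' file.pdf
.ft P
.fi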
.
.
.
.SH CONVERSION METHODS
All conversion methods scan the characters in the page in the same order as in
the pdf file. A new line is detected on:
.IP \(bu 4
a large vertical space from the previous character
.IP \(bu
a small vertical space from the previous character, if the previous character
is not at the right of the column (short previous line)
.IP \(bu
a small vertical space from the previous character, if the current character is
not at the left of the column (indented line)
.RE
The second and third conditions depend on the left and right borders of the
current column. The conversion methods differ on how these are found:
.IP 0 4
The left border is the left corner of the leftmost character in the page.
Column changes are detected by large decreases in the y coordinate, and
cause a recalculation of the left border from the remaining characters in the
page. The right border is a fixed position in the page.
.IP 1
The left and right borders are given by the bounding box of the page. This works
on single-column pages. This is the default method.
.IP 2
The blocks of text in the page are determined before scanning the page. The
left and right borders for each character are those of the block of text it is
in.
.IP 3
This is the same as 2, but blocks are sorted before scanning the page. It is
slower than method 2 not because of the sorting but because the whole page
needs to be scanned in search of characters in the first block, again for the
second, the third, etc. This may be necessary if the characters in the file are
not in the order they should be printed.
Three sorting algorithms can be used: the first two try to guess the order of
the blocks based on their position on the page; the third does it based on the
occurrence of their characters in the page. In particular, the algorithms based
on the position of the box sort boxes vertically if they overlap horizontally,
otherwise they order them horizontally. This usually gives reasonable results
on single-column and multiple-column documents. The difference between the two
is that the first is quick and approximate, the second is slower and exact. The
third method scans the characters as they occur in the file; the block
containing the first is the first block; the block containing the first
character not in the first block is the second, and so on.
.IP 4
This method assumes that the document is a single table: a sequence of rows,
each made of a number of cells. The rows are first located in vertical order,
then each is converted to a line of text.
This method allows converting tables even if their cells are ordered by columns
instead of rows, which is often the case.
The usual rules for line breaking and joining are ignored, and every row is
output as a single line. The minimal text distance (option \fI-d\fP) is used as
the minimal distance between rows; if they are very close to each other, a
negative value may be used to separate them.
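.P
A possible invocation for a document made of a single table, shown only as a
sketch: \fItable.pdf\fP is a placeholder name and the distance value is an
arbitrary example, to be adjusted by trial:
.nf
.ft I
# one line of text per table row
pdftoroff -t -m 4 table.pdf
# rows very close to each other: try a negative distance
pdftoroff -t -m 4 -d -2 table.pdf
.ft P
.fi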
.
.
.SH MARKERS
Unformatting text requires introducing line breaks in some places and not in
others and removing the hyphens used to break a word between lines.
This cannot in general be done uniquely. Option \fI-v\fP is for printing
markers that show what has been done and why.
.TP
.I []
a newline was translated into a space because it was considered to
separate two lines of the same paragraph
.TP
.I [-]
a hyphen and the following newline were removed because they looked like a
word broken between two lines
.TP
.I [S]
the following line break is because the current line is short, like the only
or final line of a paragraph
.TP
.I [E]
same, but the line is also at the end of a block of text
.TP
.I [V]
the following line break is due to vertical space between lines
.TP
.I [I]
the following line break is because the next line is indented
.P
These markers are intended for debugging and checking the final result. For
example, a text may look converted correctly, but two dash-separated words have
been merged because the dash fell at the end of the line, and therefore looked
like the hyphen of a single hyphenated word broken between two lines. Marker
.I [-]
helps in checking this kind of error. Spell-checking the two parts that have
been merged and their result may suggest whether merging was correct, but some
cases cannot be automatically solved this way. For example, if the dash in the
sentence "Price is not under 3, is much more -- over 10, I think." is placed at
the end of a line, it looks like the word "moreover" when hyphenated to split
it between two lines.
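.P
A quick way to review the merged words, sketched under the assumption that
grep(1) is available; \fIfile.pdf\fP is a placeholder name:
.nf
.ft I
# list the output lines containing a removed hyphen, with line numbers
pdftoroff -t -v file.pdf | grep -n -F '[-]'
.ft P
.fi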
.
.
.
.SH BUGS
Replacements are limited to some fixed characters (\\, ., <, > and &). Instead,
the \fI-s\fP option should support replacing arbitrary characters (say,
\fI@\fP).
The plain TeX conversion is primitive: it does not convert accented characters
as it should; it does not support fonts that are both bold and italic; it does
not finish with \fI\\end\fP (but the latter is coherent with generating only
the body of the text in the other formats).
A command line option should allow specifying a number of boxes so that text is
extracted from them in order rather than from the whole page. This is because
the method used by pdftoroff to detect the start of a new column does not
always work, and even if it does, characters in the file are not necessarily in
the correct order. Such an option would also allow discarding headers and
footers. As an example, \fI-b box1,box2,box3;box4;box5;2*\fP would extract text
from \fIbox1,box2,box3\fP from the first page, from \fIbox4\fP from the second,
from \fIbox5\fP from the third, and then repeat with \fIbox4\fP and \fIbox5\fP
until the end of the document.
The html output is not always correct. If the document starts with an italic
font, then switches to italic and bold and then to bold only, the resulting
code is \fI<i>...<b>....</i>...</b>\fP, which is not nested correctly. The
right code would be \fI<i>...<b>....</b></i><b>...</b>\fP. Two solutions are
possible:
.IP " * " 4
turn off all faces before starting a new one
.IP " * "
remember which of italic and bold was started first
.P
The numeric parameters for detecting the start of a new paragraph or column are
fixed (the \fIstruct measure\fP in the code). They should be changeable by
command line options.
.SH SEE ALSO
pdftotext(1), pdftohtml(1), poppler (https://poppler.freedesktop.org/)