-
Notifications
You must be signed in to change notification settings - Fork 592
Page Segmentation
Work in Progress Help us by extending these notes...
The command ocropus-gpageseg
will segment the page into smaller parts like columns and lines.
ocropus-gpageseg [add here possible options] <input files>
For input, you can name one or several png
images, which usually you will
be given after the ocropus-nlbin
command. For each file name a binarized
version with the file ending .bin.png
is looked for and the computation
here is actually done on this binarized image. (However, it is still important
that the file name points to an existing file.)
For example all of these calls
ocropus-gpageseg test.jpg
ocropus-gpageseg test.bin.png
ocropus-gpageseg test.nrm.png
will make the computation on test.bin.png
(if this does not exists then
an error is thrown).
For each input image a new folder with this base name is
created and inside this folder the images of all detected
lines are saved. They are named with 6 hexadecimal digits
starting from 010000
.
Moreover, a pseg.png
file is created, OCRopus-File-Formats/Physical Layout for more details.
The following parameters influence the output further:
--noise NOISE noise threshold for removing small components from
lines, default: 8
-p PAD, --pad PAD padding for extracted lines, default: 3
--gray output grayscale lines as well, default: False
-e EXPAND, --expand EXPAND
expand mask for grayscale extraction, default: 3
There is a step which removes small components from the whole image just before outputing the individual lines. These small components are usually just some noise and should be considerable smaller than for example the period as a punctation mark or the point in the i
:
The above image after the noise removal step:
In the noise removal step only components of size at most 8 are remove by default. This threshold can be adjusted by the parameter --noise NOISE
. For example --noise 30
will remove also components of size at most 30 (containing most .
and points in i
for the example) and --noise 0
will deactivate the noise removal completely.
By default only binarized images .bin.png
of the lines are generated. However,
by adding the parameter --gray
also normalized grayscale images .nrm.png
of all the lines are generated. You can also use grayscale images for training
or prediction, see FAQ/When can colorful, grayscale or black-and-white images used? for some
further discussion.
🚧 What does --expand EXPAND
exactly? 🚧
The outputed line images have an additional white space around the actual text (on the top, left bottom and right). The amount of space is determined by the padding value --pad PAD
which is set by default to 3
pixels. If you want more spacing around the actual text lines, then provide higher padding value. If you don't want any padding, then use --pad 0
.
--threshold THRESHOLD
baseline threshold, default: 0.2
--usegauss use gaussian instead of uniform, default: False
It is checked whether any whitespace column separators are present (by default at most 3) which are important for determining the start and end of the different lines. For single column text there is no separators and for two-column text there is exactly one separator in the middle.
--maxcolseps MAXCOLSEPS
maximum # whitespace column separators, default: 3
--csminheight CSMINHEIGHT
minimum column height (units=scale), default: 10
By default, the black column separators are not checked, but this can be
changed by setting the maxseps
parameter to a positive number:
--maxseps MAXSEPS maximum black column separators, default: 0
--sepwiden SEPWIDEN widen black separators (to account for warping),
default: 10
ocropus-gpageseg -d --maxcolseps 0 tests/testpage.png
will segment the page into lines without detecting any columns. This results in long lines which spread across different columns:
However, by calling
ocropus-gpageseg -d --maxcolseps 1 tests/testpage.png
the lines are separated among the two columns, because the white space separator in the middle was found and taken into account:
The column separators are bounds for the following line detection. However, note that the "smearing" does stop at some point. Therefore, there are even different lines detected, when there is no column separator at all. For example the footer in the previous pictures will be put into two lines with both parameters.
There are some consistency checks during the page segmentation. If any of these checks finds an unusual value, then the computation is stopt and an error message is shown. The standard check tries to make sure that the image has a reasonable size and is binarized; moreover, there are checks to make sure that the scale is not too small and the maximum number of lines is not too high. These checks and values can be controlled by these parameters:
-n, --nocheck disable error checking on inputs
--minscale MINSCALE minimum scale permitted, default: 12.0
--maxlines MAXLINES maximum # lines permitted, default: 300
The default values should work fine for an usual A4 paper of an article or book page with 300 dpi. However, the might not work well for very small images or newspaper pages with a lot of text lines. In such cases it might be necessary to deactivate the check and/or adjust the values, e.g.
ocropus-gpageseg -n --minscale 5 --maxlines 900 tests/testpage.png
--scale SCALE the basic scale of the document (roughly, xheight)
0=automatic, default: 0.0
--hscale HSCALE non-standard scaling of horizontal parameters,
default: 1.0
--vscale VSCALE non-standard scaling of vertical parameters, default:
1.0
-q, --quiet be less verbose, default: False
-d, --debug
-h, --help show this help message and exit
-Q PARALLEL, --parallel PARALLEL
number of CPUs to use