Updated misleading documentation about word_margin #407

Merged
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -6,7 +6,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 ## [Unreleased]

 ### Fixed
-
+- Updated misleading documentation for `word_margin` and `char_margin` ([#407](https://github.com/pdfminer/pdfminer.six/pull/407))
 - Ignore ValueError when converting font encoding differences ([#389](https://github.com/pdfminer/pdfminer.six/pull/389))
 - Grouping of text lines outside of parent container bounding box ([#386](https://github.com/pdfminer/pdfminer.six/pull/386))

20 changes: 10 additions & 10 deletions docs/source/topics/converting_pdf_to_text.rst
@@ -50,12 +50,12 @@ meaningful way. Each character has an x-coordinate and a y-coordinate for its
 bottom-left corner and upper-right corner, i.e. its bounding box. Pdfminer
 .six uses these bounding boxes to decide which characters belong together.

-Characters that are both horizontally and vertically close are grouped. How
-close they should be is determined by the `char_margin` (M in figure) and the
-`line_overlap` (not in figure) parameter. The horizontal *distance* between the
-bounding boxes of two characters should be smaller that the `char_margin` and
-the vertical *overlap* between the bounding boxes should be smaller the the
-`line_overlap`.
+Characters that are both horizontally and vertically close are grouped onto
+one line. How close they should be is determined by the `char_margin`
+(M in figure) and the `line_overlap` (not in figure) parameter. The horizontal
+*distance* between the bounding boxes of two characters should be smaller that
+the `char_margin` and the vertical *overlap* between the bounding boxes should
+be smaller the the `line_overlap`.


 .. raw:: html
@@ -69,10 +69,10 @@ relative to the minimum height of either one of the bounding boxes.
 Spaces need to be inserted between characters because the PDF format has no
 notion of the space character. A space is inserted if the characters are
 further apart that the `word_margin` (W in the figure). The `word_margin` is
-relative to the maximum width or height of the new character. Having a larger
-`word_margin` creates smaller words and inserts spaces between characters
-more often. Note that the `word_margin` should be smaller than the
-`char_margin` otherwise all the characters are seperated by a space.
+relative to the maximum width or height of the new character. Having a smaller
+`word_margin` creates smaller words. Note that the `word_margin` should at
+least be smaller than the `char_margin` otherwise none of the characters will
+be separated by a space.

 The result of this stage is a list of lines. Each line consists a list of
 characters. These characters either original `LTChar` characters that
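The corrected wording above can be illustrated with a small, self-contained sketch. This is a simplified model and not pdfminer.six's actual implementation: the `Char` class and `join_chars` function are hypothetical names, the characters are assumed to already sit on one baseline (so `line_overlap` is ignored), and measuring `char_margin` against the wider of the two characters is an assumption made for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical stand-in for a character with a horizontal bounding box."""
    text: str
    x0: float
    x1: float
    height: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0


def join_chars(chars, char_margin=2.0, word_margin=0.1):
    """Group characters (sorted left to right) into lines of text.

    A gap of at least char_margin (relative to character width) starts a new
    line. A smaller gap that still exceeds word_margin (relative to the larger
    of the new character's width and height) inserts a space.
    """
    lines = []
    current = ""
    prev = None
    for ch in chars:
        if prev is None:
            current = ch.text
        else:
            gap = ch.x0 - prev.x1
            if gap >= char_margin * max(prev.width, ch.width):
                # Too far apart to belong to the same line.
                lines.append(current)
                current = ch.text
            else:
                if gap > word_margin * max(ch.width, ch.height):
                    # Same line, but far enough apart to be separate words.
                    current += " "
                current += ch.text
        prev = ch
    if prev is not None:
        lines.append(current)
    return lines


# Four characters, each 10 units wide and 10 units high.
chars = [
    Char("a", 0.0, 10.0, 10.0),
    Char("b", 10.5, 20.5, 10.0),    # gap 0.5: same word
    Char("c", 25.0, 35.0, 10.0),    # gap 4.5: same line, new word
    Char("d", 100.0, 110.0, 10.0),  # gap 65: new line
]

default_lines = join_chars(chars)                    # ["ab c", "d"]
no_space_lines = join_chars(chars, word_margin=3.0)  # ["abc", "d"]
```

With `word_margin` set above `char_margin`, any gap large enough to trigger a space is already large enough to split the line, so no spaces are inserted. That matches the corrected sentence in the documentation.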
14 changes: 6 additions & 8 deletions pdfminer/layout.py
@@ -36,14 +36,12 @@ class LAParams:
 are considered to be on the same line. The overlap is specified
 relative to the minimum height of both characters.
 :param char_margin: If two characters are closer together than this
-margin they are considered to be part of the same word. If
-characters are on the same line but not part of the same word, an
-intermediate space is inserted. The margin is specified relative to
-the width of the character.
-:param word_margin: If two words are are closer together than this
-margin they are considered to be part of the same line. A space is
-added in between for readability. The margin is specified relative
-to the width of the word.
+margin they are considered part of the same line. The margin is
+specified relative to the width of the character.
+:param word_margin: If two characters on the same line are further apart
+than this margin then they are considered to be two separate words, and
+an intermediate space will be added for readability. The margin is
+specified relative to the width of the character.
 :param line_margin: If two lines are are close together they are
 considered to be part of the same paragraph. The margin is
 specified relative to the height of a line.
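For readers who want to try the parameters described in this docstring, here is a minimal sketch of passing them to the high-level API. The wrapper name `extract_text_with_margins` is hypothetical, and the defaults mirror the `2.0` and `0.1` values shown for `pdf2txt.py` in this pull request.

```python
def extract_text_with_margins(pdf_path, char_margin=2.0, word_margin=0.1):
    """Extract text from pdf_path using custom layout margins."""
    # Imported inside the function so that this sketch can be loaded even
    # where pdfminer.six is not installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # char_margin decides which characters share a line; word_margin decides
    # where spaces are inserted within that line.
    laparams = LAParams(char_margin=char_margin, word_margin=word_margin)
    return extract_text(pdf_path, laparams=laparams)
```

A call such as `extract_text_with_margins("example.pdf", word_margin=0.2)` (with `example.pdf` as a placeholder path) would be expected to insert fewer spaces than the default, since characters must be further apart before they count as separate words.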
10 changes: 5 additions & 5 deletions tools/pdf2txt.py
@@ -102,14 +102,14 @@ def maketheparser():
 la_params.add_argument(
 "--char-margin", "-M", type=float, default=2.0,
 help="If two characters are closer together than this margin they "
-"are considered to be part of the same word. The margin is "
+"are considered to be part of the same line. The margin is "
 "specified relative to the width of the character.")
 la_params.add_argument(
 "--word-margin", "-W", type=float, default=0.1,
-help="If two words are are closer together than this margin they "
-"are considered to be part of the same line. A space is added "
-"in between for readability. The margin is specified relative "
-"to the width of the word.")
+help="If two characters on the same line are further apart than this "
+"margin then they are considered to be two separate words, and "
+"an intermediate space will be added for readability. The margin "
+"is specified relative to the width of the character.")
 la_params.add_argument(
 "--line-margin", "-L", type=float, default=0.5,
 help="If two lines are are close together they are considered to "
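To see how the two flags touched in this hunk turn into values, the following standalone sketch rebuilds only those two arguments with `argparse`. The flag names, types, and defaults come from the diff above. The function name `make_margin_parser`, the group title, and the shortened help strings are illustrative assumptions, not the contents of `tools/pdf2txt.py`.

```python
import argparse


def make_margin_parser():
    """Build a parser with only the two margin flags (illustrative subset)."""
    parser = argparse.ArgumentParser(description="Margin flags only (sketch)")
    la_params = parser.add_argument_group("Layout analysis")
    la_params.add_argument(
        "--char-margin", "-M", type=float, default=2.0,
        help="Characters closer together than this margin share a line.")
    la_params.add_argument(
        "--word-margin", "-W", type=float, default=0.1,
        help="Characters on one line further apart than this margin are "
             "separated by a space.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is
# reproducible; argparse maps --char-margin to args.char_margin.
args = make_margin_parser().parse_args(["-M", "3.0", "--word-margin", "0.2"])
defaults = make_margin_parser().parse_args([])
```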