Add x_tolerance_ratio param to extract_text and similar functions (now properly linted!) #1041

Merged 8 commits on Nov 9, 2023
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -8,6 +8,7 @@ All notable changes to this project will be documented in this file. The format

- Add support for marked-content sequences, represented by `mcid` and `tag` attributes on `char`/`rect`/`line`/`curve`/`image` objects (h/t @dhdaines). ([#961](https://github.com/jsvine/pdfplumber/pulls/961))
- Add `gs_path` argument to `pdfplumber.open(...)` and `pdfplumber.repair(...)`, to allow passing a custom Ghostscript path to be used for repairing. ([#953](https://github.com/jsvine/pdfplumber/issues/953))
+- Add `x_tolerance_ratio` to `extract_text` and similar functions, to account for text size when spacing characters (instead of a fixed number of pixels)

### Fixed

4 changes: 2 additions & 2 deletions README.md
@@ -327,9 +327,9 @@ If you're using `pdfplumber` on a Debian-based system and encounter a `PolicyErr

| Method | Description |
|--------|-------------|
-|`.extract_text(x_tolerance=3, y_tolerance=3, layout=False, x_density=7.25, y_density=13, **kwargs)`| Collates all of the page's character objects into a single string.<ul><li><p>When `layout=False`: Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.</p></li><li><p>When `layout=True` (*experimental feature*): Attempts to mimic the structural layout of the text on the page(s), using `x_density` and `y_density` to determine the minimum number of characters/newlines per "point," the PDF unit of measurement. All remaining `**kwargs` are passed to `.extract_words(...)` (see below), the first step in calculating the layout.</p></li></ul>|
+|`.extract_text(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, layout=False, x_density=7.25, y_density=13, **kwargs)`| Collates all of the page's character objects into a single string.<ul><li><p>When `layout=False`: Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.</p></li><li><p>When `layout=True` (*experimental feature*): Attempts to mimic the structural layout of the text on the page(s), using `x_density` and `y_density` to determine the minimum number of characters/newlines per "point," the PDF unit of measurement. All remaining `**kwargs` are passed to `.extract_words(...)` (see below), the first step in calculating the layout.</p></li></ul>|
|`.extract_text_simple(x_tolerance=3, y_tolerance=3)`| A slightly faster but less flexible version of `.extract_text(...)`, using a simpler logic.|
-|`.extract_words(x_tolerance=3, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, horizontal_ltr=True, vertical_ttb=True, extra_attrs=[], split_at_punctuation=False, expand_ligatures=True)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. The parameters `horizontal_ltr` and `vertical_ttb` indicate whether the words should be read from left-to-right (for horizontal words) / top-to-bottom (for vertical words). Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation characters specified by `string.punctuation`; or you can specify the list of separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`).|
+|`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, horizontal_ltr=True, vertical_ttb=True, extra_attrs=[], split_at_punctuation=False, expand_ligatures=True)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character["size"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. The parameters `horizontal_ltr` and `vertical_ttb` indicate whether the words should be read from left-to-right (for horizontal words) / top-to-bottom (for vertical words). Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation characters specified by `string.punctuation`; or you can specify the list of separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`).|
|`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`|*Experimental feature* that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout = True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.|
|`.search(pattern, regex=True, case=True, main_group=0, return_groups=True, return_chars=True, layout=False, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. Setting `main_group` restricts the results to a specific regex group within the `pattern` (default of `0` means the entire match). Setting `return_groups` and/or `return_chars` to `False` will exclude the lists of the matched regex groups and/or characters from being added (as `"groups"` and `"chars"` to the return dicts). The `layout` parameter operates as it does for `.extract_text(...)`. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`. __Note__: Zero-width and all-whitespace matches are discarded, because they (generally) have no explicit position on the page. |
|`.dedupe_chars(tolerance=1)`| Returns a version of the page with duplicate chars — those sharing the same text, fontname, size, and positioning (within `tolerance` x/y) as other characters — removed. (See [Issue #71](https://github.com/jsvine/pdfplumber/issues/71) to understand the motivation.)|
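
The dynamic tolerance described in the `.extract_text(...)` and `.extract_words(...)` rows can be sketched in plain Python. The helper name `effective_x_tolerance` and the sample character dicts are hypothetical and not part of pdfplumber; only the rule itself (`x_tolerance_ratio * previous_character["size"]` when the ratio is not `None`) is taken from the table text.

```python
from typing import Optional, Union

Number = Union[int, float]


def effective_x_tolerance(
    prev_char: dict,
    x_tolerance: Number = 3,
    x_tolerance_ratio: Optional[Number] = None,
) -> Number:
    """Hypothetical helper: pick the horizontal gap threshold for one character pair."""
    if x_tolerance_ratio is None:
        # Fixed threshold, independent of font size.
        return x_tolerance
    # Dynamic threshold, scaled by the size of the previous character.
    return x_tolerance_ratio * prev_char["size"]


# Made-up sample characters; only the "size" key matters here.
big = {"text": "e", "size": 40}
small = {"text": "l", "size": 10}

fixed_big = effective_x_tolerance(big)
ratio_big = effective_x_tolerance(big, x_tolerance_ratio=0.15)
ratio_small = effective_x_tolerance(small, x_tolerance_ratio=0.15)
```

Under these assumed sizes, a ratio of 0.15 tolerates gaps of about 6 points in the 40-point text and about 1.5 points in the 10-point text, while the fixed default of 3 treats both sizes alike. When a ratio is given, the fixed `x_tolerance` is not consulted.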
26 changes: 20 additions & 6 deletions pdfplumber/utils/text.py
@@ -294,6 +294,8 @@ def __init__(
         self,
         x_tolerance: T_num = DEFAULT_X_TOLERANCE,
         y_tolerance: T_num = DEFAULT_Y_TOLERANCE,
+        x_tolerance_ratio: Union[int, float, None] = None,
+        y_tolerance_ratio: Union[int, float, None] = None,
         keep_blank_chars: bool = False,
         use_text_flow: bool = False,
         horizontal_ltr: bool = True,  # Should words be read left-to-right?
@@ -304,6 +306,8 @@ def __init__(
     ):
         self.x_tolerance = x_tolerance
         self.y_tolerance = y_tolerance
+        self.x_tolerance_ratio = x_tolerance_ratio
+        self.y_tolerance_ratio = y_tolerance_ratio
         self.keep_blank_chars = keep_blank_chars
         self.use_text_flow = use_text_flow
         self.horizontal_ltr = horizontal_ltr
@@ -348,6 +352,8 @@ def char_begins_new_word(
         self,
         prev_char: T_obj,
         curr_char: T_obj,
+        x_tolerance: T_num,
+        y_tolerance: T_num,
     ) -> bool:
         """This method takes several factors into account to determine if
         `curr_char` represents the beginning of a new word:
@@ -380,12 +386,11 @@ def char_begins_new_word(
         compare, while horizontal_ltr/vertical_ttb determine the direction
         of the comparison.
         """
-
         # Note: Due to the grouping step earlier in the process,
         # curr_char["upright"] will always equal prev_char["upright"].
         if curr_char["upright"]:
-            x = self.x_tolerance
-            y = self.y_tolerance
+            x = x_tolerance
+            y = y_tolerance
             ay = prev_char["top"]
             cy = curr_char["top"]
             if self.horizontal_ltr:
@@ -398,8 +403,8 @@ def char_begins_new_word(
                 cx = -curr_char["x1"]

         else:
-            x = self.y_tolerance
-            y = self.x_tolerance
+            x = y_tolerance
+            y = x_tolerance
             ay = prev_char["x0"]
             cy = curr_char["x0"]
             if self.vertical_ttb:
@@ -434,6 +439,10 @@ def start_next_word(

             current_word = [] if new_char is None else [new_char]

+        xt = self.x_tolerance
+        xtr = self.x_tolerance_ratio
+        yt = self.y_tolerance
+
         for char in ordered_chars:
             text = char["text"]

@@ -444,7 +453,12 @@ def start_next_word(
                 yield from start_next_word(char)
                 yield from start_next_word(None)

-            elif current_word and self.char_begins_new_word(current_word[-1], char):
+            elif current_word and self.char_begins_new_word(
+                current_word[-1],
+                char,
+                x_tolerance=(xt if xtr is None else xtr * current_word[-1]["size"]),
+                y_tolerance=yt,
+            ):
                 yield from start_next_word(char)

             else:
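
To illustrate the effect of passing a per-pair `x_tolerance` at the call site, here is a simplified, self-contained model of the word-splitting loop. It assumes upright, left-to-right characters on a single line and reduces `char_begins_new_word` to a single gap check, so it is a sketch rather than the library's logic. The names `split_words` and `make_chars` and all coordinates are made up for the example.

```python
from typing import List, Optional, Union

Number = Union[int, float]


def split_words(
    chars: List[dict],
    x_tolerance: Number = 3,
    x_tolerance_ratio: Optional[Number] = None,
) -> List[str]:
    """Group chars into words; a gap wider than the tolerance starts a new word."""
    words: List[str] = []
    current: List[dict] = []
    for char in chars:
        if current:
            prev = current[-1]
            # Same selection rule as the diff: fixed value unless a ratio is given.
            tol = (
                x_tolerance
                if x_tolerance_ratio is None
                else x_tolerance_ratio * prev["size"]
            )
            if char["x0"] - prev["x1"] > tol:
                words.append("".join(c["text"] for c in current))
                current = []
        current.append(char)
    if current:
        words.append("".join(c["text"] for c in current))
    return words


def make_chars(text: str, size: Number, gaps: List[Number]) -> List[dict]:
    """Lay out characters left to right, each `size / 2` wide, with the given gaps."""
    chars, x = [], 0.0
    for ch, gap in zip(text, [0] + gaps):
        x += gap
        chars.append({"text": ch, "size": size, "x0": x, "x1": x + size / 2})
        x += size / 2
    return chars


# Made-up geometry: 40-point "Text" with a 4-point gap after the "e".
big = make_chars("Text", size=40, gaps=[0, 4, 0])

fixed_result = split_words(big)                          # fixed tolerance of 3
ratio_result = split_words(big, x_tolerance_ratio=0.15)  # tolerance of about 6
```

With the fixed tolerance the 4-point gap splits the word, much like the `"Big Te xt"` case in the PR's test; with the ratio the gap falls under the size-scaled threshold and the word stays whole. In the hunks shown here, `y_tolerance_ratio` is stored in `__init__`, but the call site still passes the fixed `y_tolerance`.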
Binary file added tests/pdfs/issue-987-test.pdf
Binary file not shown.
11 changes: 11 additions & 0 deletions tests/test_utils.py
@@ -64,6 +64,17 @@ def test_decode_psl_list(self):
         a = [PSLiteral("test"), "test_2"]
         assert utils.decode_psl_list(a) == ["test", "test_2"]

+    def test_x_tolerance_ratio(self):
+        pdf = pdfplumber.open(os.path.join(HERE, "pdfs/issue-987-test.pdf"))
+        page = pdf.pages[0]
+
+        assert page.extract_text() == "Big Te xt\nSmall Text"
+        assert page.extract_text(x_tolerance=4) == "Big Te xt\nSmallText"
+        assert page.extract_text(x_tolerance_ratio=0.15) == "Big Text\nSmall Text"
+
+        words = page.extract_words(x_tolerance_ratio=0.15)
+        assert "|".join(w["text"] for w in words) == "Big|Text|Small|Text"
+
     def test_extract_words(self):
         path = os.path.join(HERE, "pdfs/issue-192-example.pdf")
         with pdfplumber.open(path) as pdf: