No ligature preservation in searches #3685

Merged: 1 commit, Jul 15, 2024
2 changes: 1 addition & 1 deletion docs/app1.rst
@@ -286,7 +286,7 @@ Text Extraction Flags Defaults
 ========================= ==== ==== ===== === ==== ======= ===== ====== ======
 Indicator                 text html xhtml xml dict rawdict words blocks search
 ========================= ==== ==== ===== === ==== ======= ===== ====== ======
-preserve ligatures        1    1    1     1   1    1       1     1      1
+preserve ligatures        1    1    1     1   1    1       1     1      0
 preserve whitespace       1    1    1     1   1    1       1     1      1
 preserve images           n/a  1    1     n/a 1    1       n/a   0      0
 inhibit spaces            0    0    0     0   0    0       0     0      0
2 changes: 1 addition & 1 deletion docs/vars.rst
@@ -262,7 +262,7 @@ The following constants represent the default combinations of the above for text

 .. py:data:: TEXTFLAGS_SEARCH

-    `TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_DEHYPHENATE`
+    `TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_DEHYPHENATE`


.. _linkDest Kinds:
1 change: 0 additions & 1 deletion src/__init__.py
@@ -13312,7 +13312,6 @@ def width(self):
 TEXTFLAGS_RAWDICT = TEXTFLAGS_DICT

 TEXTFLAGS_SEARCH = (0
-        | TEXT_PRESERVE_LIGATURES
         | TEXT_PRESERVE_WHITESPACE
         | TEXT_MEDIABOX_CLIP
         | TEXT_DEHYPHENATE
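
The effect of dropping one bit from a default flag combination, and of OR-ing it back in as the new test does, can be sketched without the library. This is only an illustration: the `TextFlag` class, its member names, and the numeric values are hypothetical stand-ins and are not taken from this change.

```python
from enum import IntFlag


class TextFlag(IntFlag):
    # Hypothetical stand-ins for the TEXT_* constants; the numeric values
    # are illustrative assumptions only.
    PRESERVE_LIGATURES = 1
    PRESERVE_WHITESPACE = 2
    DEHYPHENATE = 16
    MEDIABOX_CLIP = 64


# Default search combination after the change: no ligature bit.
SEARCH_DEFAULT = (
    TextFlag.PRESERVE_WHITESPACE | TextFlag.MEDIABOX_CLIP | TextFlag.DEHYPHENATE
)

# A caller who wants the previous behaviour ORs the bit back in explicitly.
SEARCH_WITH_LIGATURES = SEARCH_DEFAULT | TextFlag.PRESERVE_LIGATURES

print(TextFlag.PRESERVE_LIGATURES in SEARCH_DEFAULT)         # False
print(TextFlag.PRESERVE_LIGATURES in SEARCH_WITH_LIGATURES)  # True
```

Because the flags are independent bits, removing one from the default leaves the others untouched, and existing callers that pass their own `flags` value are unaffected.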
Binary file added tests/resources/text-find-ligatures.pdf
Binary file not shown.
15 changes: 15 additions & 0 deletions tests/test_textsearch.py
@@ -7,13 +7,15 @@
Text search with 'clip' parameter - clip rectangle contains two occurrences
of searched text. Confirm search locations are inside clip.
"""

import os

import pymupdf

scriptdir = os.path.abspath(os.path.dirname(__file__))
filename1 = os.path.join(scriptdir, "resources", "2.pdf")
filename2 = os.path.join(scriptdir, "resources", "github_sample.pdf")
+filename3 = os.path.join(scriptdir, "resources", "text-find-ligatures.pdf")


def test_search1():
@@ -35,3 +37,16 @@ def test_search2():
     assert len(rl) == 2
     for r in rl:
         assert r in clip
+
+
+def test_search3():
+    """Ensure we find text whether or not it contains ligatures."""
+    doc = pymupdf.open(filename3)
+    page = doc[0]
+    needle = "flag"
+    hits = page.search_for(needle, flags=pymupdf.TEXTFLAGS_SEARCH)
+    assert len(hits) == 2  # all occurrences found
+    hits = page.search_for(
+        needle, flags=pymupdf.TEXTFLAGS_SEARCH | pymupdf.TEXT_PRESERVE_LIGATURES
+    )
+    assert len(hits) == 1  # only found text without ligatures
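
Why the hit count drops from 2 to 1 when ligatures are preserved can be sketched with plain strings. This is a minimal illustration, assuming the test page contains one "flag" spelled with separate letters and one spelled with the single-codepoint ligature U+FB02. The helper `expand_ligatures` is hypothetical, and NFKC normalisation is used here only as a stand-in for whatever expansion the text extractor actually performs.

```python
import unicodedata

plain = "flag"               # four separate code points
with_ligature = "\ufb02ag"   # U+FB02 (fl ligature) followed by "ag"


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters such as ligatures (stand-in: NFKC)."""
    return unicodedata.normalize("NFKC", text)


haystack = f"{plain} and {with_ligature}"
needle = "flag"

# Ligatures preserved: only the plain spelling matches the needle.
print(haystack.count(needle))                    # 1

# Ligatures expanded: both spellings match.
print(expand_ligatures(haystack).count(needle))  # 2
```

A user typing "flag" into a search box almost never enters the ligature code point, which is why expanding ligatures is the more useful default for searching.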