Missing literals #3669

Closed
Bardo-Konrad opened this issue Jul 9, 2024 · 3 comments

Comments

@Bardo-Konrad

Bardo-Konrad commented Jul 9, 2024

Description of the bug

In some documents, get_text returns the wrong characters within words. For instance, the text in the PDF reads "Dort machten die Handelsschiffe auf der Überfahrt", but I get "Dort machten die Handelsschiye auf der Überfahrt".
It happens with "ff" and probably with other letter combinations. When I copy the text from the document in a PDF reader such as SumatraPDF, I also get "Dort machten die Handelsschiye auf der Überfahrt".

PyMuPDF version

1.23.x or earlier

Operating system

Windows

Python version

3.11

@JorjMcKie
Collaborator

You included neither a reproducing file nor a code snippet.
So this post does not yet qualify as a bug report, and we are forced to guess:
Your file may use ligatures in the text. "ff" is one of the 6 standard ligatures in Latin text, which means that one Unicode character (and one glyph) represents multiple characters.
By default, text extraction passes ligatures through unchanged; depending on your output device, they should still look OK.
To confirm, you can try a modified combination of text extraction flag bits, e.g. flags=0. This will dissolve ligatures into their component characters. For details, see the documentation.
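
A minimal sketch of the suggestion above, for readers who want to try it. The helper names `extract_plain_text` and `dissolve_ligatures` are hypothetical and not part of PyMuPDF, and the file path is a placeholder. The first helper passes flags=0 to get_text as suggested. The second is a standard-library alternative that expands ligature characters in text that has already been extracted; it is an assumption about what might help, not something confirmed in this thread.

```python
import unicodedata


def extract_plain_text(pdf_path: str, flags: int = 0) -> str:
    """Extract text from every page using the given text extraction flags.

    flags=0 clears all flag bits, including the one that preserves ligatures.
    """
    # Imported lazily so the stdlib helper below works without PyMuPDF.
    import fitz  # import name of PyMuPDF in 1.23.x

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text("text", flags=flags) for page in doc)


def dissolve_ligatures(text: str) -> str:
    """Expand Latin ligature characters (U+FB00..U+FB06) into plain letters.

    Only the ligature code points are touched, so other characters such as
    umlauts stay exactly as they were.
    """
    return "".join(
        unicodedata.normalize("NFKD", ch) if "\ufb00" <= ch <= "\ufb06" else ch
        for ch in text
    )


if __name__ == "__main__":
    # "input.pdf" is a placeholder path, not a file from this issue.
    # print(extract_plain_text("input.pdf"))
    print(dissolve_ligatures("Handelsschi\ufb00e"))
```

Note that flags=0 also switches off the other default behaviors, such as whitespace preservation. If only the ligature handling should change, clearing just that bit from the defaults (for example `fitz.TEXTFLAGS_TEXT & ~fitz.TEXT_PRESERVE_LIGATURES`) may be a narrower option.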

@JorjMcKie
Collaborator

Closing this for lack of response over an extended period.
In a future release, we will change the default text flags for searches so that ligatures are no longer preserved.

@Bardo-Konrad
Author

Bardo-Konrad commented Jul 17, 2024

Thank you for your effort. I silently applied the change you suggested, so you were not notified. I apologize if it felt like your reply was in vain.
