getTextContent doesn't always return the right fontRef #14755
Comments
Is this a regression from some of the later
Always flushing the textContent on |
I'm pretty new to this lib (and to PDF specs, for that matter), but I spent a few days wrapping my head around it, especially the text retrieval part, so I can't really tell whether this is a newly introduced bug. Oh, if you want to avoid splitting textContentItems where possible, it would probably be better to put my fix in the first condition of the function. Wouldn't a fontAgnostic argument to page.getTextContent() be the way to go for search and text selection? |
Would you be able to help with testing a few older releases, see https://github.com/mozilla/pdf.js/releases, to help answer this?
Considering where that function is being called, I'm really not sure that it'd be correct to flush the textContent there!? @calixteman Any ideas for a good solution to this issue, one that doesn't cause us to speculatively break apart text-runs when not actually necessary?
That sounds pretty orthogonal to this issue, and anyway it'd not actually be correct since the |
Ok here are my results:
[...,
{fontName: 'g_d0_f4', str: 'odem'},
{fontName: 'g_d0_f4', str: ' '},
{fontName: 'g_d0_f4', str: ')'},
{fontName: 'g_d0_f4', str: ' '},
{fontName: 'g_d0_f3', str: 'ou une liturgie de clairvoyance (par exemple'},
...]
pdfjs-2.13.216 ❌ (reason of this post)
[...,
{fontName: 'g_d0_f4', str: 'odem'},
{fontName: 'g_d0_f4', str: ' '},
{fontName: 'g_d0_f4', str: ') ou une liturgie de clairvoyance (par exemple'},
...]
pdfjs-2.12.313 ❌ (not exactly the same results, the item's string is even longer)
[...,
{fontName: 'g_d0_f4', str: 'odem) ou une liturgie de clairvoyance (par exemple'},
...]
pdfjs-2.11.338 ✅ (the text does not split in the same way as the last release but there's no obvious problem in the fontRefs)
[...,
{fontName: 'g_d0_f4', str: ''},
{fontName: 'g_d0_f4', str: 'odem) '},
{fontName: 'g_d0_f4', str: ' '},
{fontName: 'g_d0_f3', str: 'ou une liturgie de clairvoyance (par exemple '},
{fontName: 'g_d0_f3', str: ''},
{fontName: 'g_d0_f3', str: ' '},
...] |
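Since the releases also split the text differently, it may be easier to compare these outputs after collapsing adjacent items that share a fontName. The helper below is a minimal sketch of that idea; `fontRuns` is a hypothetical name and not part of pdf.js. It only assumes items shaped like the `{fontName, str}` objects logged above, and is applied here to the first list and to the list where ') ou une liturgie…' ends up under g_d0_f4.

```javascript
// Hypothetical helper, not part of pdf.js: merge adjacent text items
// that share a fontName, so only the font boundaries remain.
function fontRuns(items) {
  const runs = [];
  for (const { fontName, str } of items) {
    const last = runs[runs.length - 1];
    if (last && last.fontName === fontName) {
      last.str += str; // same font as the previous item: extend the run
    } else {
      runs.push({ fontName, str }); // font changed: start a new run
    }
  }
  return runs;
}

// Items copied from the first list in the results above.
const splitRuns = fontRuns([
  { fontName: "g_d0_f4", str: "odem" },
  { fontName: "g_d0_f4", str: " " },
  { fontName: "g_d0_f4", str: ")" },
  { fontName: "g_d0_f4", str: " " },
  { fontName: "g_d0_f3", str: "ou une liturgie de clairvoyance (par exemple" },
]);

// Items copied from the list where the f3 text is reported under f4.
const mergedRuns = fontRuns([
  { fontName: "g_d0_f4", str: "odem" },
  { fontName: "g_d0_f4", str: " " },
  { fontName: "g_d0_f4", str: ") ou une liturgie de clairvoyance (par exemple" },
]);

console.log(splitRuns.map((r) => r.fontName));
console.log(mergedRuns.map((r) => r.fontName));
```

With the run view, a missing font boundary shows up as a shorter list of runs, regardless of how each release happens to split the strings.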
Do we know which PR caused this? |
For this specific pdf, the issue has been fixed thanks to: |
Flush the current chunk when the font changed because of a restore op (issue #14755)
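The idea in that commit title can be illustrated with a toy model. The sketch below is not pdf.js code: `extractTextItems`, the op names, and the op sequence are all assumptions made up for illustration (the real operator pattern from the PDF is not reproduced here). It only shows why a chunk started under one font needs to be flushed when a restore brings back a different font.

```javascript
// Toy model only: none of these names are pdf.js internals.
// ops is a list of [name, argument] pairs.
function extractTextItems(ops, { flushOnRestore = true } = {}) {
  const items = [];
  const fontStack = []; // fonts saved by "save", popped by "restore"
  let font = null; // current font
  let chunk = null; // pending { fontName, str } text chunk

  const flush = () => {
    if (chunk) {
      items.push(chunk);
      chunk = null;
    }
  };

  for (const [op, arg] of ops) {
    switch (op) {
      case "save":
        fontStack.push(font);
        break;
      case "restore": {
        const previous = font;
        if (fontStack.length > 0) {
          font = fontStack.pop();
        }
        // Flush only when the restore actually changed the font.
        if (flushOnRestore && font !== previous) {
          flush();
        }
        break;
      }
      case "setFont":
        if (arg !== font) {
          flush();
        }
        font = arg;
        break;
      case "showText":
        if (!chunk) {
          chunk = { fontName: font, str: "" };
        }
        chunk.str += arg;
        break;
    }
  }
  flush();
  return items;
}

// Hypothetical op sequence: the font is switched inside a save/restore pair.
const ops = [
  ["setFont", "g_d0_f3"],
  ["save"],
  ["setFont", "g_d0_f4"],
  ["showText", "odem) "],
  ["restore"],
  ["showText", "ou une liturgie de clairvoyance (par exemple"],
];

const withFlush = extractTextItems(ops);
const withoutFlush = extractTextItems(ops, { flushOnRestore: false });
console.log(withFlush);
console.log(withoutFlush);
```

In this model, flushing only when the restored font differs from the current one keeps text runs intact in the common case where a restore does not touch the font, which is the concern raised earlier in the thread about not breaking apart text runs unnecessarily.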
Attach (recommended) or Link to PDF file here: test8.pdf
Configuration:
Steps to reproduce the problem:
What is the expected behavior?
On this particular example, when using the Chrome debugger, I can see that the fontRef used to render the glyphs of (part of) this item is g_d0_f3.
What went wrong?
The item that gets logged has the fontRef g_d0_f4. I think I isolated the problem to this particular pattern of OPS (ignoring irrelevant OPS here):
This should return two different items with different fonts, in my opinion (or is this PDF file broken in some way? It renders fine, though). What I get is a single item with the str value ') ou une liturgie de clairvoyance (par exemple' and the fontRef g_d0_f4. I found a fix, but I'm not sure the way I did it is very elegant.
In the function buildTextContentItem in the file src/core/evaluator.js:2633, I added at the top:
With this, I get two separate items, each with its own correct fontRef.
I think handling the restore case in the switch (same file, line 2861) would be a lot cleaner, but I don't know enough of the specifics of this library to be sure. Is always trying to flushTextContentItem() when reading a restore OPS valid? Anyway, thanks a lot for your time!