Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure the text layer uses the correct font to determine the scaleX transformation of text divs. #18709

Conversation

radiachkik
Copy link

I was having issues with the text layer since I was upgraded from PDF.js version 3 to 4. One of the initially rendered pages (mostly the second page), seemed to be subject to some kind of race condition, which caused problems with the text layer.

Most of the times there was at least a part of the page, where the scaleX transformation of the text divs (spans) had a very small value of ~0.09. After spending some time debugging I noticed that the result of ctx.measureText(div.textContent) in text_layer.js was not consistent for every run. After that I noticed that the font used by the canvas to calculate the text width was off (~150px instead of the ~10px passed to the #layout method.

This causes the the text divs to be very small, which makes text selection impossible.

This PR should hopefully fix the issue while having no side effects. As I am new to the project and do not really know the codebase, I cannot tell if this is the right way to do it.

For now I am using patch-package to get rid of the problem and I just wanted to share the findings here.

@Snuffleupagus
Copy link
Collaborator

Snuffleupagus commented Sep 8, 2024

Sorry, but as-is it's unfortunately not really clear why this would be a necessary/correct change since you've not included any information or test-case that allows your particular situation to be reproduced. Perhaps it would be better to open an issue, with all necessary details included, first?

(Also, generally speaking, please note that commit messages should contain all the same information as #18709 (comment) above.)

@timvandermeij
Copy link
Contributor

In addition to the above, have you run the full test suite to make sure that this doesn't regress any existing test cases?

@calixteman
Copy link
Contributor

calixteman commented Sep 9, 2024

The problem of not having a test case is that doesn't help to write a test.
While a test is useful to prevent any future regressions, a test case is helpful to understand what's wrong exactly.

Comment on lines -409 to -412
if (prevFontSize !== fontSize || prevFontFamily !== fontFamily) {
ctx.font = `${fontSize * this.#scale}px ${fontFamily}`;
params.prevFontSize = fontSize;
params.prevFontFamily = fontFamily;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please note why this code looks the way it does: To avoid needlessly querying the DOM to get the ctx.font data, when the font hasn't changed, for every single textLayer element.

Hence just removing this seems wrong, unfortunately (even if a test-case is provided).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep and maybe I'm missing something but I don't understand how #this.scale can be changed while calling #layout.

@Snuffleupagus
Copy link
Collaborator

Snuffleupagus commented Sep 10, 2024

Having looked at the textLayer code, I now have a theory of what might be happening here. (If I'm right, this is intermittent enough that it's going to be very difficult to test this reliably.)


Assumption: This bug requires that multiple textLayers are parsed/rendered in parallel.
Given that text-extraction happens asynchronously on the worker-thread, creating the DOM textLayers are also asynchronous.

Note that the textLayer is returned in "chunks" through the use of ReadableStreams, and on the main-thread we're (normally) using just one temporary canvas in order to measure the size of the textLayer spans; see the #layout method.

Order of events:

  1. Call render of the textLayer for page A.
  2. Immediately call render of the textLayer for page B.
  3. The first text-chunk for pageA arrives, and it's parsed/layout which means updating the cached fontSize/fontFamily for the textLayer of page A.
  4. The first text-chunk for pageB arrives, which means updating the cached fontSize/fontFamily for the textLayer of page B since this data is unique to each TextLayer-instance.
  5. The second text-chunk for pageA arrives, and we don't update the canvas-font since the cached parameters still apply from step 3 above.

Where this potentially breaks down is between the last steps, since we're using just one temporary canvas for all measurements but have individual fontSize/fontFamily caches for each textLayer.
Hence it's possible that the canvas-font has changed, despite the cached values suggesting otherwise.


Assuming that the above is (somewhat) accurate, a possible solution would be to move caching of fontSize/fontFamily to be global just like the temporary canvases are; maybe something along these lines: master...Snuffleupagus:pdf.js:TextLayer-ensureCtxFont

@calixteman
Copy link
Contributor

@Snuffleupagus, your explanation makes sense to me, thank you for investigating.
I learned that using static elements in a multithreaded context is generally considered a bad practice, but in our case having one ctx per text layer would imply to have too much canvas element attached to the body... which is not great.
At first glance your patch lgtm, I'd prefer having a test but as you said it should pretty hard to have something reliable.
That said maybe it'd be possible to have a unit test where we've two text layers, then provide a chunk to the first one, an other to the second one, and we could just check that everything is ok, something like that, wdyt ?

@radiachkik
Copy link
Author

Thank you @Snuffleupagus @timvandermeij @calixteman for having a look.

As I am fully occupied with the release of another software, I will not get to find the time for writing tests and dedicating the efforts appropriate for contributing to this project. For that reason it might make sense to replace this PR with an issue as suggested by @Snuffleupagus or to delete it altogether.

I just spend some time finding out why the issue is so consistent in our case while most users do not seem to experience this issue at all. Sadly I did not come to a result.

Best I can do for now is offering to replace / remove this PR and providing more details on the issue we experience.

Environment

  • pdfjs-dist@4.6.82
  • node v18.16.1
  • PDFViewer initialized in the useEffect hook of a functional react component

Observations

  • The bug does not disappear when removing the react strict mode (@Snuffleupagus theory lead to my assumption of this being the culprit, as it makes the code be executed twice, but it turned out to be wrong)
  • It appears nearly every time and if it does always on the second page of the PDF rendered by the viewer
  • The issue is consistently fixed by the patch, even though I do not know if it introduces regressions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants