Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a heuristic, in src/core/jpg.js, to handle JPEG images with a wildly incorrect SOF (Start of Frame) scanLines parameter (issue 10880) #11523

Merged
merged 2 commits into from
Mar 6, 2020

Conversation

Snuffleupagus
Copy link
Collaborator

  • Add basic validation of the scanLines parameter in JPEG images, before delegating decoding to the browser

    In some cases PDF documents can contain JPEG images that the native browser decoder cannot handle, e.g. images with DNL (Define Number of Lines) markers or images where the SOF (Start of Frame) marker contains a wildly incorrect scanLines parameter.
    Currently, for "simple" JPEG images, we're relying on native image decoding to fail before falling back to the implementation in src/core/jpg.js. In some cases, note e.g. issue 10880, the native image decoder doesn't outright fail and thus some images may not render.

    In an attempt to improve the current situation, this patch adds additional validation of the JPEG image SOF data to force the use of src/core/jpg.js directly in cases where the native JPEG decoder cannot be trusted to do the right thing.
    The only way to implement this is unfortunately to parse the beginning of the JPEG image data, looking for a SOF marker. To limit the impact of this extra parsing, the result is cached on the JpegStream instance and this code is only run for images which passed all of the pre-existing "can the JPEG image be natively rendered and/or decoded" checks.


    Slightly off-topic: Working on this really makes me start questioning if native rendering/decoding of JPEG images is actually a good idea.
    There's certain kinds of JPEG images not supported natively, and all of the validation which is now necessary isn't "free". At this point, in the NativeImageDecoder, we're having to check for certain properties in the image dictionary, parse the ColorSpace, and finally read the actual image data to find the SOF marker.
    Furthermore, we cannot just send the image to the main-thread and be done in the "JpegStream" case, but we also need to wait for rendering to complete (or fail) before continuing with other parsing.
    In the "JpegDecode" case we're even having to parse part of the image on the main-thread, which seems completely at odds with the principle of doing all heavy parsing in the Worker, and there's also a couple of potentially large (temporary) allocations/copies of TypedArray data involved as well.

  • Add a heuristic, in src/core/jpg.js, to handle JPEG images with a wildly incorrect SOF (Start of Frame) scanLines parameter (issue 10880)

    This whole patch feels somewhat arbitrary, and I'd be slightly worried about possibly breaking something else.

    To limit the impact of these changes, we only re-parse JPEG images using a reduced scanLines value if and only if: An unexpected EOI (End of Image) marker was encountered during decoding of Scan data and the "actual" scanLines value is at least one order of magnitude smaller than expected.

Fixes #10880

@Snuffleupagus
Copy link
Collaborator Author

/botio test

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/ac13d4c85645b30/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/1d743b30fffcebe/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/ac13d4c85645b30/output.txt

Total script time: 26.63 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/ac13d4c85645b30/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/1d743b30fffcebe/output.txt

Total script time: 56.03 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/1d743b30fffcebe/reftest-analyzer.html#web=eq.log

@timvandermeij
Copy link
Contributor

If I recall correctly, there was a discussion a while ago about only using our own JPEG decoder instead of native decoding because native decoding is not always possible. Looking at this patch, it requires quite a lot of additional code to be able to feature-detect if we can use the native decoder or not, so I agree with your analysis that we should perhaps wonder if we shouldn't just always use own own JPEG decoder. I don't know exactly what the pros/cons were of both solutions, but I can imagine that native decoding is faster. Do you happen to know what advantages/disadvantages may be? If so, we could perhaps reconsider this after having worked with native decoding for quite a while now.

@Snuffleupagus
Copy link
Collaborator Author

Snuffleupagus commented Jan 27, 2020

If I recall correctly, there was a discussion a while ago about only using our own JPEG decoder instead of native decoding because native decoding is not always possible.

Correct, there was a brief discussion about this in #6984 (comment) and I also tried it in #8637 (comment).
At that time there were some failures, but I've since fixed those such that src/core/jpg.js can now successfully render all JPEG images in our test-suite (and I also wrote a patch that improved the quality of some JPEG images).

At this point in time, there's two kinds of failure when running the reference tests with master...Snuffleupagus:rm-nativeImageDecoderSupport:

  • Changes which are basically imperceivable to the naked eye, where some pixels in the images are essentially off-by-one (in all components), which could probably be attributed to things such as different rounding behaviour in the browser/PDF.js JPEG decoder.
    This type of "failure" accounts for the vast majority of the total number (227 if I recall correctly) of changes in the reference tests.
  • Changes where the JPEG images now looks slightly blurrier than with the native browser decoder. For quite some time I've just assumed that this pointed to some sort of general deficiency in the src/core/jpg.js implementation, however I very recently discovered when comparing two viewers side-by-side that the differences vanish at higher zoom levels (usually closer to 200% is enough).
    Basically, if you disable this downscaling in canvas.js (which is what happens when zooming in) the differences simply vanish!
    Hence I'm pretty satisfied that there's no significant problems with the src/core/jpg.js implementation, and the problems are rather tied to general image quality (for which there's already open bugs) when downscaling occurs; all-in-all I don't think that this should block a decision to use src/core/jpg.js exclusively!

I don't know exactly what the pros/cons were of both solutions, but I can imagine that native decoding is faster.

That seems like a safe assumption, and one that I'd fully support. However, even if the native browser JPEG decoder is generally faster, actually using it involves sending data between the worker- and main-threads (and waiting for responses). Furthermore, as we both noted above, there's also a fair bit of overhead involved just in deciding what JPEG decoder to use (which probably isn't free).
There's also the question of memory usage with "JpegDecode" (and the issue of main-thread image parsing), but I'm probably not the right person to evaluate this properly though...

In my quick testing though, basically running SKIP_BABEL=true gulp browsertest with master respectively rm-nativeImageDecoderSupport, I'm not seeing any huge and thus easily spotted regressions in test runtime when using only src/core/jpg.js.
However, in order for the comparison to be fair we probably need land this PR first (obviously assuming that its src/core/jpg.js parts seem OK) since otherwise the comparison doesn't seem correct.

@Snuffleupagus Snuffleupagus force-pushed the issue-10880 branch 3 times, most recently from 08f9eb2 to c83948a Compare February 11, 2020 10:26
@Snuffleupagus
Copy link
Collaborator Author

/botio-linux preview

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_preview from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/32bb72a11a819bd/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/32bb72a11a819bd/output.txt

Total script time: 1.74 mins

Published

@Snuffleupagus Snuffleupagus force-pushed the issue-10880 branch 2 times, most recently from 180d77e to 5555b82 Compare February 14, 2020 23:04
@Snuffleupagus Snuffleupagus mentioned this pull request Feb 15, 2020
3 tasks
…ore delegating decoding to the browser

In some cases PDF documents can contain JPEG images that the native browser decoder cannot handle, e.g. images with DNL (Define Number of Lines) markers or images where the SOF (Start of Frame) marker contains a wildly incorrect `scanLines` parameter.
Currently, for "simple" JPEG images, we're relying on native image decoding to *fail* before falling back to the implementation in `src/core/jpg.js`. In some cases, note e.g. issue 10880, the native image decoder doesn't outright fail and thus some images may not render.

In an attempt to improve the current situation, this patch adds additional validation of the JPEG image SOF data to force the use of `src/core/jpg.js` directly in cases where the native JPEG decoder cannot be trusted to do the right thing.
The only way to implement this is unfortunately to parse the *beginning* of the JPEG image data, looking for a SOF marker. To limit the impact of this extra parsing, the result is cached on the `JpegStream` instance and this code is only run for images which passed all of the pre-existing "can the JPEG image be natively rendered and/or decoded" checks.

---

*Slightly off-topic:* Working on this *really* makes me start questioning if native rendering/decoding of JPEG images is actually a good idea.
There's certain kinds of JPEG images not supported natively, and all of the validation which is now necessary isn't "free". At this point, in the `NativeImageDecoder`, we're having to check for certain properties in the image dictionary, parse the `ColorSpace`, and finally read the actual image data to find the SOF marker.
Furthermore, we cannot just send the image to the main-thread and be done in the "JpegStream" case, but we also need to wait for rendering to complete (or fail) before continuing with other parsing.
In the "JpegDecode" case we're even having to parse part of the image on the main-thread, which seems completely at odds with the principle of doing all heavy parsing in the Worker, and there's also a couple of potentially large (temporary) allocations/copies of TypedArray data involved as well.
…ildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10880)

*This whole patch feels somewhat arbitrary, and I'd be slightly worried about possibly breaking something else.*

To limit the impact of these changes, we only re-parse JPEG images using a reduced `scanLines` value if and only if: An unexpected EOI (End of Image) marker was encountered during decoding of Scan data *and* the "actual" `scanLines` value is at least one order of magnitude smaller than expected.
@Snuffleupagus
Copy link
Collaborator Author

/botio test

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/224a80be4393b6d/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/f0c8a14649036b7/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/f0c8a14649036b7/output.txt

Total script time: 19.41 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/f0c8a14649036b7/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/224a80be4393b6d/output.txt

Total script time: 24.74 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/224a80be4393b6d/reftest-analyzer.html#web=eq.log

@timvandermeij
Copy link
Contributor

I'm still unsure about if we should take this approach. As you already noted there are downsides that we should consider, and in general the heuristic feels a bit fragile. Wouldn't it be better to check if #11601 works better instead, since it avoids quite a bit of overhead and ensures that we can always decode these kinds of images properly? I would be convinced with a benchmark of this PR versus #11601; if there are no visible changes in rendering and no significant regressions in runtime, I'd much rather use that approach instead. Would that be possible to check?

@Snuffleupagus
Copy link
Collaborator Author

Snuffleupagus commented Mar 4, 2020

Wouldn't it be better to check if #11601 works better instead

That'd only replace the first commit in this PR, but even with that done you'd still need the src/core/jpg.js changes to actually address the referenced issue.

Edit: The Add basic validation of the scanLines parameter in JPEG images, before delegating decoding to the browser commit simply ensures that we do not let the browser handle JPEG decoding for these sort of corrupt images (since it fails to handle them correctly).
The Add a heuristic, in `src/core/jpg.js`, to handle JPEG images with a wildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10880) commit is what actually fixes the bug here. Please note that the heuristic is limited to images whose correct height is off by at least a factor of 10 when compared with the embedded data.

[...] if there are no visible changes in rendering and no significant regressions in runtime,

There's lots of smaller changes (some visible as outlined elsewhere), and a general switch of JPEG decoders probably requires additional discussion; hence why I hoped to get this landed first.

@timvandermeij timvandermeij merged commit 1a97c14 into mozilla:master Mar 6, 2020
@timvandermeij
Copy link
Contributor

I have thought about this some more and I think it's indeed best to merge this now to get the issue resolved. We can look at the native decoding later. Thanks!

@timvandermeij
Copy link
Contributor

/botio makeref

@pdfjsbot
Copy link

pdfjsbot commented Mar 6, 2020

From: Bot.io (Linux m4)


Received

Command cmd_makeref from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/00bd08118c4edf3/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Mar 6, 2020

From: Bot.io (Windows)


Received

Command cmd_makeref from @timvandermeij received. Current queue size: 1

Live output at: http://54.215.176.217:8877/7bb356dd30fa118/output.txt

@pdfjsbot
Copy link

pdfjsbot commented Mar 6, 2020

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/00bd08118c4edf3/output.txt

Total script time: 17.87 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@pdfjsbot
Copy link

pdfjsbot commented Mar 6, 2020

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/7bb356dd30fa118/output.txt

Total script time: 22.69 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Attached PDF not able View (7th page)
3 participants