Add a heuristic, in `src/core/jpg.js`, to handle JPEG images with a wildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10880) #11523

Snuffleupagus · 2020-01-20T13:13:01Z

Add basic validation of the scanLines parameter in JPEG images, before delegating decoding to the browser

In some cases PDF documents can contain JPEG images that the native browser decoder cannot handle, e.g. images with DNL (Define Number of Lines) markers or images where the SOF (Start of Frame) marker contains a wildly incorrect scanLines parameter.
Currently, for "simple" JPEG images, we're relying on native image decoding to fail before falling back to the implementation in src/core/jpg.js. In some cases, note e.g. issue 10880, the native image decoder doesn't outright fail and thus some images may not render.

In an attempt to improve the current situation, this patch adds additional validation of the JPEG image SOF data to force the use of src/core/jpg.js directly in cases where the native JPEG decoder cannot be trusted to do the right thing.
The only way to implement this is unfortunately to parse the beginning of the JPEG image data, looking for a SOF marker. To limit the impact of this extra parsing, the result is cached on the JpegStream instance and this code is only run for images which passed all of the pre-existing "can the JPEG image be natively rendered and/or decoded" checks.

Slightly off-topic: Working on this really makes me start questioning if native rendering/decoding of JPEG images is actually a good idea.
There's certain kinds of JPEG images not supported natively, and all of the validation which is now necessary isn't "free". At this point, in the NativeImageDecoder, we're having to check for certain properties in the image dictionary, parse the ColorSpace, and finally read the actual image data to find the SOF marker.
Furthermore, we cannot just send the image to the main-thread and be done in the "JpegStream" case, but we also need to wait for rendering to complete (or fail) before continuing with other parsing.
In the "JpegDecode" case we're even having to parse part of the image on the main-thread, which seems completely at odds with the principle of doing all heavy parsing in the Worker, and there's also a couple of potentially large (temporary) allocations/copies of TypedArray data involved as well.
Add a heuristic, in src/core/jpg.js, to handle JPEG images with a wildly incorrect SOF (Start of Frame) scanLines parameter (issue 10880)

This whole patch feels somewhat arbitrary, and I'd be slightly worried about possibly breaking something else.

To limit the impact of these changes, we only re-parse JPEG images using a reduced scanLines value if and only if: An unexpected EOI (End of Image) marker was encountered during decoding of Scan data and the "actual" scanLines value is at least one order of magnitude smaller than expected.

Fixes #10880

Snuffleupagus · 2020-01-20T13:16:09Z

/botio test

pdfjsbot · 2020-01-20T13:16:11Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/ac13d4c85645b30/output.txt

pdfjsbot · 2020-01-20T13:16:11Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/1d743b30fffcebe/output.txt

pdfjsbot · 2020-01-20T13:42:50Z

From: Bot.io (Windows)

Failed

Full output at http://54.215.176.217:8877/ac13d4c85645b30/output.txt

Total script time: 26.63 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/ac13d4c85645b30/reftest-analyzer.html#web=eq.log

pdfjsbot · 2020-01-20T14:12:14Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.67.70.0:8877/1d743b30fffcebe/output.txt

Total script time: 56.03 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/1d743b30fffcebe/reftest-analyzer.html#web=eq.log

timvandermeij · 2020-01-26T20:37:15Z

If I recall correctly, there was a discussion a while ago about only using our own JPEG decoder instead of native decoding because native decoding is not always possible. Looking at this patch, it requires quite a lot of additional code to be able to feature-detect if we can use the native decoder or not, so I agree with your analysis that we should perhaps wonder if we shouldn't just always use own own JPEG decoder. I don't know exactly what the pros/cons were of both solutions, but I can imagine that native decoding is faster. Do you happen to know what advantages/disadvantages may be? If so, we could perhaps reconsider this after having worked with native decoding for quite a while now.

Snuffleupagus · 2020-01-27T13:14:17Z

If I recall correctly, there was a discussion a while ago about only using our own JPEG decoder instead of native decoding because native decoding is not always possible.

Correct, there was a brief discussion about this in #6984 (comment) and I also tried it in #8637 (comment).
At that time there were some failures, but I've since fixed those such that src/core/jpg.js can now successfully render all JPEG images in our test-suite (and I also wrote a patch that improved the quality of some JPEG images).

At this point in time, there's two kinds of failure when running the reference tests with master...Snuffleupagus:rm-nativeImageDecoderSupport:

Changes which are basically imperceivable to the naked eye, where some pixels in the images are essentially off-by-one (in all components), which could probably be attributed to things such as different rounding behaviour in the browser/PDF.js JPEG decoder.
This type of "failure" accounts for the vast majority of the total number (227 if I recall correctly) of changes in the reference tests.
Changes where the JPEG images now looks slightly blurrier than with the native browser decoder. For quite some time I've just assumed that this pointed to some sort of general deficiency in the src/core/jpg.js implementation, however I very recently discovered when comparing two viewers side-by-side that the differences vanish at higher zoom levels (usually closer to 200% is enough).
Basically, if you disable this downscaling in canvas.js (which is what happens when zooming in) the differences simply vanish!
Hence I'm pretty satisfied that there's no significant problems with the src/core/jpg.js implementation, and the problems are rather tied to general image quality (for which there's already open bugs) when downscaling occurs; all-in-all I don't think that this should block a decision to use src/core/jpg.js exclusively!

I don't know exactly what the pros/cons were of both solutions, but I can imagine that native decoding is faster.

That seems like a safe assumption, and one that I'd fully support. However, even if the native browser JPEG decoder is generally faster, actually using it involves sending data between the worker- and main-threads (and waiting for responses). Furthermore, as we both noted above, there's also a fair bit of overhead involved just in deciding what JPEG decoder to use (which probably isn't free).
There's also the question of memory usage with "JpegDecode" (and the issue of main-thread image parsing), but I'm probably not the right person to evaluate this properly though...

In my quick testing though, basically running SKIP_BABEL=true gulp browsertest with master respectively rm-nativeImageDecoderSupport, I'm not seeing any huge and thus easily spotted regressions in test runtime when using only src/core/jpg.js.
However, in order for the comparison to be fair we probably need land this PR first (obviously assuming that its src/core/jpg.js parts seem OK) since otherwise the comparison doesn't seem correct.

Snuffleupagus · 2020-02-11T10:27:10Z

/botio-linux preview

pdfjsbot · 2020-02-11T10:27:11Z

From: Bot.io (Linux m4)

Received

Command cmd_preview from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/32bb72a11a819bd/output.txt

pdfjsbot · 2020-02-11T10:28:56Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/32bb72a11a819bd/output.txt

Total script time: 1.74 mins

Published

Viewer: http://54.67.70.0:8877/32bb72a11a819bd/web/viewer.html

…ore delegating decoding to the browser In some cases PDF documents can contain JPEG images that the native browser decoder cannot handle, e.g. images with DNL (Define Number of Lines) markers or images where the SOF (Start of Frame) marker contains a wildly incorrect `scanLines` parameter. Currently, for "simple" JPEG images, we're relying on native image decoding to *fail* before falling back to the implementation in `src/core/jpg.js`. In some cases, note e.g. issue 10880, the native image decoder doesn't outright fail and thus some images may not render. In an attempt to improve the current situation, this patch adds additional validation of the JPEG image SOF data to force the use of `src/core/jpg.js` directly in cases where the native JPEG decoder cannot be trusted to do the right thing. The only way to implement this is unfortunately to parse the *beginning* of the JPEG image data, looking for a SOF marker. To limit the impact of this extra parsing, the result is cached on the `JpegStream` instance and this code is only run for images which passed all of the pre-existing "can the JPEG image be natively rendered and/or decoded" checks. --- *Slightly off-topic:* Working on this *really* makes me start questioning if native rendering/decoding of JPEG images is actually a good idea. There's certain kinds of JPEG images not supported natively, and all of the validation which is now necessary isn't "free". At this point, in the `NativeImageDecoder`, we're having to check for certain properties in the image dictionary, parse the `ColorSpace`, and finally read the actual image data to find the SOF marker. Furthermore, we cannot just send the image to the main-thread and be done in the "JpegStream" case, but we also need to wait for rendering to complete (or fail) before continuing with other parsing. In the "JpegDecode" case we're even having to parse part of the image on the main-thread, which seems completely at odds with the principle of doing all heavy parsing in the Worker, and there's also a couple of potentially large (temporary) allocations/copies of TypedArray data involved as well.

…ildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10880) *This whole patch feels somewhat arbitrary, and I'd be slightly worried about possibly breaking something else.* To limit the impact of these changes, we only re-parse JPEG images using a reduced `scanLines` value if and only if: An unexpected EOI (End of Image) marker was encountered during decoding of Scan data *and* the "actual" `scanLines` value is at least one order of magnitude smaller than expected.

Snuffleupagus · 2020-02-26T10:37:16Z

/botio test

pdfjsbot · 2020-02-26T10:37:18Z

From: Bot.io (Windows)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/224a80be4393b6d/output.txt

pdfjsbot · 2020-02-26T10:37:18Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/f0c8a14649036b7/output.txt

pdfjsbot · 2020-02-26T10:56:44Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.67.70.0:8877/f0c8a14649036b7/output.txt

Total script time: 19.41 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/f0c8a14649036b7/reftest-analyzer.html#web=eq.log

pdfjsbot · 2020-02-26T11:02:03Z

From: Bot.io (Windows)

Failed

Full output at http://54.215.176.217:8877/224a80be4393b6d/output.txt

Total script time: 24.74 mins

Font tests: Passed
Unit tests: Passed
Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/224a80be4393b6d/reftest-analyzer.html#web=eq.log

timvandermeij · 2020-03-04T22:25:05Z

I'm still unsure about if we should take this approach. As you already noted there are downsides that we should consider, and in general the heuristic feels a bit fragile. Wouldn't it be better to check if #11601 works better instead, since it avoids quite a bit of overhead and ensures that we can always decode these kinds of images properly? I would be convinced with a benchmark of this PR versus #11601; if there are no visible changes in rendering and no significant regressions in runtime, I'd much rather use that approach instead. Would that be possible to check?

Snuffleupagus · 2020-03-04T22:29:34Z

Wouldn't it be better to check if #11601 works better instead

That'd only replace the first commit in this PR, but even with that done you'd still need the src/core/jpg.js changes to actually address the referenced issue.

Edit: The Add basic validation of the scanLines parameter in JPEG images, before delegating decoding to the browser commit simply ensures that we do not let the browser handle JPEG decoding for these sort of corrupt images (since it fails to handle them correctly).
The Add a heuristic, in `src/core/jpg.js`, to handle JPEG images with a wildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10880) commit is what actually fixes the bug here. Please note that the heuristic is limited to images whose correct height is off by at least a factor of 10 when compared with the embedded data.

[...] if there are no visible changes in rendering and no significant regressions in runtime,

There's lots of smaller changes (some visible as outlined elsewhere), and a general switch of JPEG decoders probably requires additional discussion; hence why I hoped to get this landed first.

timvandermeij · 2020-03-06T22:04:09Z

I have thought about this some more and I think it's indeed best to merge this now to get the issue resolved. We can look at the native decoding later. Thanks!

timvandermeij · 2020-03-06T22:04:15Z

/botio makeref

pdfjsbot · 2020-03-06T22:04:16Z

From: Bot.io (Linux m4)

Received

Command cmd_makeref from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/00bd08118c4edf3/output.txt

pdfjsbot · 2020-03-06T22:04:16Z

From: Bot.io (Windows)

Received

Command cmd_makeref from @timvandermeij received. Current queue size: 1

Live output at: http://54.215.176.217:8877/7bb356dd30fa118/output.txt

pdfjsbot · 2020-03-06T22:22:09Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/00bd08118c4edf3/output.txt

Total script time: 17.87 mins

Lint: Passed
Make references: Passed
Check references: Passed

pdfjsbot · 2020-03-06T22:29:04Z

From: Bot.io (Windows)

Success

Full output at http://54.215.176.217:8877/7bb356dd30fa118/output.txt

Total script time: 22.69 mins

Lint: Passed
Make references: Passed
Check references: Passed

timvandermeij added core image-jpeg and removed core labels Jan 20, 2020

Snuffleupagus force-pushed the issue-10880 branch 2 times, most recently from 8b66da9 to 01b9da1 Compare January 25, 2020 21:56

Snuffleupagus force-pushed the issue-10880 branch 3 times, most recently from 08f9eb2 to c83948a Compare February 11, 2020 10:26

Snuffleupagus force-pushed the issue-10880 branch 2 times, most recently from 180d77e to 5555b82 Compare February 14, 2020 23:04

Snuffleupagus mentioned this pull request Feb 15, 2020

Release a new version #11495

Closed

3 tasks

Snuffleupagus added 2 commits February 22, 2020 14:16

Snuffleupagus force-pushed the issue-10880 branch from 5555b82 to c3c3b8c Compare February 22, 2020 13:16

timvandermeij approved these changes Mar 6, 2020

View reviewed changes

timvandermeij merged commit 1a97c14 into mozilla:master Mar 6, 2020

Snuffleupagus deleted the issue-10880 branch March 6, 2020 23:51

Snuffleupagus mentioned this pull request Mar 18, 2020

Always prefer the PDF.js JPEG decoder for very large images, in order to reduce peak memory usage (issue 11694) #11707

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a heuristic, in `src/core/jpg.js`, to handle JPEG images with a wildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10880) #11523

Add a heuristic, in `src/core/jpg.js`, to handle JPEG images with a wildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10880) #11523

Snuffleupagus commented Jan 20, 2020

Snuffleupagus commented Jan 20, 2020

pdfjsbot commented Jan 20, 2020

pdfjsbot commented Jan 20, 2020

pdfjsbot commented Jan 20, 2020

pdfjsbot commented Jan 20, 2020

timvandermeij commented Jan 26, 2020

Snuffleupagus commented Jan 27, 2020 •

edited

Loading

Snuffleupagus commented Feb 11, 2020

pdfjsbot commented Feb 11, 2020

pdfjsbot commented Feb 11, 2020

Snuffleupagus commented Feb 26, 2020

pdfjsbot commented Feb 26, 2020

pdfjsbot commented Feb 26, 2020

pdfjsbot commented Feb 26, 2020

pdfjsbot commented Feb 26, 2020

timvandermeij commented Mar 4, 2020

Snuffleupagus commented Mar 4, 2020 •

edited

Loading

timvandermeij commented Mar 6, 2020

timvandermeij commented Mar 6, 2020

pdfjsbot commented Mar 6, 2020

pdfjsbot commented Mar 6, 2020

pdfjsbot commented Mar 6, 2020

pdfjsbot commented Mar 6, 2020

Add a heuristic, in src/core/jpg.js, to handle JPEG images with a wildly incorrect SOF (Start of Frame) scanLines parameter (issue 10880) #11523

Add a heuristic, in src/core/jpg.js, to handle JPEG images with a wildly incorrect SOF (Start of Frame) scanLines parameter (issue 10880) #11523

Conversation

Snuffleupagus commented Jan 20, 2020

Snuffleupagus commented Jan 20, 2020

pdfjsbot commented Jan 20, 2020

From: Bot.io (Windows)

Received

pdfjsbot commented Jan 20, 2020

From: Bot.io (Linux m4)

Received

pdfjsbot commented Jan 20, 2020

From: Bot.io (Windows)

Failed

pdfjsbot commented Jan 20, 2020

From: Bot.io (Linux m4)

Failed

timvandermeij commented Jan 26, 2020

Snuffleupagus commented Jan 27, 2020 • edited Loading

Snuffleupagus commented Feb 11, 2020

pdfjsbot commented Feb 11, 2020

From: Bot.io (Linux m4)

Received

pdfjsbot commented Feb 11, 2020

From: Bot.io (Linux m4)

Success

Published

Snuffleupagus commented Feb 26, 2020

pdfjsbot commented Feb 26, 2020

From: Bot.io (Windows)

Received

pdfjsbot commented Feb 26, 2020

From: Bot.io (Linux m4)

Received

pdfjsbot commented Feb 26, 2020

From: Bot.io (Linux m4)

Failed

pdfjsbot commented Feb 26, 2020

From: Bot.io (Windows)

Failed

timvandermeij commented Mar 4, 2020

Snuffleupagus commented Mar 4, 2020 • edited Loading

timvandermeij commented Mar 6, 2020

timvandermeij commented Mar 6, 2020

pdfjsbot commented Mar 6, 2020

From: Bot.io (Linux m4)

Received

pdfjsbot commented Mar 6, 2020

From: Bot.io (Windows)

Received

pdfjsbot commented Mar 6, 2020

From: Bot.io (Linux m4)

Success

pdfjsbot commented Mar 6, 2020

From: Bot.io (Windows)

Success

Add a heuristic, in `src/core/jpg.js`, to handle JPEG images with a wildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10880) #11523

Add a heuristic, in `src/core/jpg.js`, to handle JPEG images with a wildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10880) #11523

Snuffleupagus commented Jan 27, 2020 •

edited

Loading

Snuffleupagus commented Mar 4, 2020 •

edited

Loading