-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a heuristic, in src/core/jpg.js
, to handle JPEG images with a wildly incorrect SOF (Start of Frame) scanLines
parameter (issue 10880)
#11523
Conversation
/botio test |
From: Bot.io (Windows)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.215.176.217:8877/ac13d4c85645b30/output.txt |
From: Bot.io (Linux m4)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.67.70.0:8877/1d743b30fffcebe/output.txt |
From: Bot.io (Windows)FailedFull output at http://54.215.176.217:8877/ac13d4c85645b30/output.txt Total script time: 26.63 mins
Image differences available at: http://54.215.176.217:8877/ac13d4c85645b30/reftest-analyzer.html#web=eq.log |
From: Bot.io (Linux m4)FailedFull output at http://54.67.70.0:8877/1d743b30fffcebe/output.txt Total script time: 56.03 mins
Image differences available at: http://54.67.70.0:8877/1d743b30fffcebe/reftest-analyzer.html#web=eq.log |
8b66da9
to
01b9da1
Compare
If I recall correctly, there was a discussion a while ago about only using our own JPEG decoder instead of native decoding because native decoding is not always possible. Looking at this patch, it requires quite a lot of additional code to be able to feature-detect if we can use the native decoder or not, so I agree with your analysis that we should perhaps wonder if we shouldn't just always use own own JPEG decoder. I don't know exactly what the pros/cons were of both solutions, but I can imagine that native decoding is faster. Do you happen to know what advantages/disadvantages may be? If so, we could perhaps reconsider this after having worked with native decoding for quite a while now. |
Correct, there was a brief discussion about this in #6984 (comment) and I also tried it in #8637 (comment). At this point in time, there's two kinds of failure when running the reference tests with master...Snuffleupagus:rm-nativeImageDecoderSupport:
That seems like a safe assumption, and one that I'd fully support. However, even if the native browser JPEG decoder is generally faster, actually using it involves sending data between the worker- and main-threads (and waiting for responses). Furthermore, as we both noted above, there's also a fair bit of overhead involved just in deciding what JPEG decoder to use (which probably isn't free). In my quick testing though, basically running |
08f9eb2
to
c83948a
Compare
/botio-linux preview |
From: Bot.io (Linux m4)ReceivedCommand cmd_preview from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.67.70.0:8877/32bb72a11a819bd/output.txt |
From: Bot.io (Linux m4)SuccessFull output at http://54.67.70.0:8877/32bb72a11a819bd/output.txt Total script time: 1.74 mins Published |
180d77e
to
5555b82
Compare
…ore delegating decoding to the browser In some cases PDF documents can contain JPEG images that the native browser decoder cannot handle, e.g. images with DNL (Define Number of Lines) markers or images where the SOF (Start of Frame) marker contains a wildly incorrect `scanLines` parameter. Currently, for "simple" JPEG images, we're relying on native image decoding to *fail* before falling back to the implementation in `src/core/jpg.js`. In some cases, note e.g. issue 10880, the native image decoder doesn't outright fail and thus some images may not render. In an attempt to improve the current situation, this patch adds additional validation of the JPEG image SOF data to force the use of `src/core/jpg.js` directly in cases where the native JPEG decoder cannot be trusted to do the right thing. The only way to implement this is unfortunately to parse the *beginning* of the JPEG image data, looking for a SOF marker. To limit the impact of this extra parsing, the result is cached on the `JpegStream` instance and this code is only run for images which passed all of the pre-existing "can the JPEG image be natively rendered and/or decoded" checks. --- *Slightly off-topic:* Working on this *really* makes me start questioning if native rendering/decoding of JPEG images is actually a good idea. There's certain kinds of JPEG images not supported natively, and all of the validation which is now necessary isn't "free". At this point, in the `NativeImageDecoder`, we're having to check for certain properties in the image dictionary, parse the `ColorSpace`, and finally read the actual image data to find the SOF marker. Furthermore, we cannot just send the image to the main-thread and be done in the "JpegStream" case, but we also need to wait for rendering to complete (or fail) before continuing with other parsing. In the "JpegDecode" case we're even having to parse part of the image on the main-thread, which seems completely at odds with the principle of doing all heavy parsing in the Worker, and there's also a couple of potentially large (temporary) allocations/copies of TypedArray data involved as well.
…ildly incorrect SOF (Start of Frame) `scanLines` parameter (issue 10880) *This whole patch feels somewhat arbitrary, and I'd be slightly worried about possibly breaking something else.* To limit the impact of these changes, we only re-parse JPEG images using a reduced `scanLines` value if and only if: An unexpected EOI (End of Image) marker was encountered during decoding of Scan data *and* the "actual" `scanLines` value is at least one order of magnitude smaller than expected.
5555b82
to
c3c3b8c
Compare
/botio test |
From: Bot.io (Windows)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.215.176.217:8877/224a80be4393b6d/output.txt |
From: Bot.io (Linux m4)ReceivedCommand cmd_test from @Snuffleupagus received. Current queue size: 0 Live output at: http://54.67.70.0:8877/f0c8a14649036b7/output.txt |
From: Bot.io (Linux m4)FailedFull output at http://54.67.70.0:8877/f0c8a14649036b7/output.txt Total script time: 19.41 mins
Image differences available at: http://54.67.70.0:8877/f0c8a14649036b7/reftest-analyzer.html#web=eq.log |
From: Bot.io (Windows)FailedFull output at http://54.215.176.217:8877/224a80be4393b6d/output.txt Total script time: 24.74 mins
Image differences available at: http://54.215.176.217:8877/224a80be4393b6d/reftest-analyzer.html#web=eq.log |
I'm still unsure about if we should take this approach. As you already noted there are downsides that we should consider, and in general the heuristic feels a bit fragile. Wouldn't it be better to check if #11601 works better instead, since it avoids quite a bit of overhead and ensures that we can always decode these kinds of images properly? I would be convinced with a benchmark of this PR versus #11601; if there are no visible changes in rendering and no significant regressions in runtime, I'd much rather use that approach instead. Would that be possible to check? |
That'd only replace the first commit in this PR, but even with that done you'd still need the Edit: The
There's lots of smaller changes (some visible as outlined elsewhere), and a general switch of JPEG decoders probably requires additional discussion; hence why I hoped to get this landed first. |
I have thought about this some more and I think it's indeed best to merge this now to get the issue resolved. We can look at the native decoding later. Thanks! |
/botio makeref |
From: Bot.io (Linux m4)ReceivedCommand cmd_makeref from @timvandermeij received. Current queue size: 0 Live output at: http://54.67.70.0:8877/00bd08118c4edf3/output.txt |
From: Bot.io (Windows)ReceivedCommand cmd_makeref from @timvandermeij received. Current queue size: 1 Live output at: http://54.215.176.217:8877/7bb356dd30fa118/output.txt |
From: Bot.io (Linux m4)SuccessFull output at http://54.67.70.0:8877/00bd08118c4edf3/output.txt Total script time: 17.87 mins
|
From: Bot.io (Windows)SuccessFull output at http://54.215.176.217:8877/7bb356dd30fa118/output.txt Total script time: 22.69 mins
|
Add basic validation of the
scanLines
parameter in JPEG images, before delegating decoding to the browserIn some cases PDF documents can contain JPEG images that the native browser decoder cannot handle, e.g. images with DNL (Define Number of Lines) markers or images where the SOF (Start of Frame) marker contains a wildly incorrect
scanLines
parameter.Currently, for "simple" JPEG images, we're relying on native image decoding to fail before falling back to the implementation in
src/core/jpg.js
. In some cases, note e.g. issue 10880, the native image decoder doesn't outright fail and thus some images may not render.In an attempt to improve the current situation, this patch adds additional validation of the JPEG image SOF data to force the use of
src/core/jpg.js
directly in cases where the native JPEG decoder cannot be trusted to do the right thing.The only way to implement this is unfortunately to parse the beginning of the JPEG image data, looking for a SOF marker. To limit the impact of this extra parsing, the result is cached on the
JpegStream
instance and this code is only run for images which passed all of the pre-existing "can the JPEG image be natively rendered and/or decoded" checks.Slightly off-topic: Working on this really makes me start questioning if native rendering/decoding of JPEG images is actually a good idea.
There's certain kinds of JPEG images not supported natively, and all of the validation which is now necessary isn't "free". At this point, in the
NativeImageDecoder
, we're having to check for certain properties in the image dictionary, parse theColorSpace
, and finally read the actual image data to find the SOF marker.Furthermore, we cannot just send the image to the main-thread and be done in the "JpegStream" case, but we also need to wait for rendering to complete (or fail) before continuing with other parsing.
In the "JpegDecode" case we're even having to parse part of the image on the main-thread, which seems completely at odds with the principle of doing all heavy parsing in the Worker, and there's also a couple of potentially large (temporary) allocations/copies of TypedArray data involved as well.
Add a heuristic, in
src/core/jpg.js
, to handle JPEG images with a wildly incorrect SOF (Start of Frame)scanLines
parameter (issue 10880)This whole patch feels somewhat arbitrary, and I'd be slightly worried about possibly breaking something else.
To limit the impact of these changes, we only re-parse JPEG images using a reduced
scanLines
value if and only if: An unexpected EOI (End of Image) marker was encountered during decoding of Scan data and the "actual"scanLines
value is at least one order of magnitude smaller than expected.Fixes #10880