
Flushing the output queue WITHOUT invalidating the decode pipeline #698

Closed
RogerSanders opened this issue Jun 30, 2023 · 34 comments

@RogerSanders

First up, let me say thanks to everyone who's worked on this very important spec. I appreciate the time and effort that's gone into this, and I've found it absolutely invaluable in my work. That said, I have hit a snag, which I consider to be a significant issue in the standard.

Some background first so you can understand what I'm trying to do.

  • I'm doing 3D rendering on a cloud server in realtime, using hardware H265 encoding to compress frames, and sending those frames down over a websocket to the client browser, where I use the WebCodecs api (specifically VideoDecoder) to decode those frames and display them to the user.
  • My goal is low latency, which is done by balancing rendering time, encoding time, compressed frame size, and decoding time.
  • My application is an event driven, CAD-style application. Unlike video encoding from a real camera, or 3D games, where a constant or target framerate exists, my application has no such concept. The last image generated remains valid for an indefinite period of time. The next frame I send over the wire may be displayed for milliseconds, or for an hour, it all depends on what state the program is in and what the user does next in terms of input.

So then, based on the above, hopefully my requirements become clearer. When I get a video frame at the client side, I need to ensure that frame gets displayed to the user, without any subsequent frame being required to arrive in order to "flush" it out of the pipeline. My frames come down at an irregular and unpredictable rate. If the pipeline has a frame or two of latency, I need to be able to manually flush that pipeline on demand, to ensure that all frames which have been sent to the codec for decoding are made visible.

At this point, I find myself fighting a few parts of the WebCodecs spec. As stated in the spec, when I call the decode method (https://www.w3.org/TR/webcodecs/#dom-videodecoder-decode), it "enqueues a control message to decode the given chunk". That decode request ends up on the "codec work queue" (https://www.w3.org/TR/webcodecs/#dom-videodecoder-codec-work-queue-slot), and its output may ultimately end up stuck in the "internal pending output" queue (https://www.w3.org/TR/webcodecs/#internal-pending-output). As stated in the spec, "Codec outputs such as VideoFrames that currently reside in the internal pipeline of the underlying codec implementation. The underlying codec implementation may emit new outputs only when new inputs are provided. The underlying codec implementation must emit all outputs in response to a flush." So frames may be held up in this queue, as per the spec, but all pending outputs MUST be emitted in response to a flush. There is, of course, an explicit flush method: https://www.w3.org/TR/webcodecs/#dom-videodecoder-flush
The spec states that this method "Completes all control messages in the control message queue and emits all outputs". Fantastic, that's what I need. Unfortunately, the spec also specifically states that in response to a flush call, the implementation MUST "Set [[key chunk required]] to true." This means that after a flush, I can only provide a key frame. Not so good. In my scenario, without knowing when, or if, a subsequent frame is going to arrive, I end up having to flush after every frame, and now due to this requirement that a key frame must follow a flush, every frame must be a keyframe. This increases my payload size significantly, and can cause noticeable delays and stuttering on slower connections.
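To make the constraint concrete, here is a minimal sketch (the chunk variables and codec string are placeholders, not my actual stream):

// Minimal sketch of the flush()/key-frame constraint described above.
const decoder = new VideoDecoder({
    output: (frame) => { /* display the frame */ frame.close(); },
    error: (e) => console.error(e),
});
decoder.configure({ codec: "hvc1.1.6.L120.00" }); // illustrative codec string

decoder.decode(keyChunk);    // I frame: decodes fine
await decoder.flush();       // emits all pending outputs, but sets [[key chunk required]]
decoder.decode(deltaChunk);  // throws DataError: a key chunk is required after flush()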

When I use a dedicated desktop client, and have full control of the decoding hardware, I can perform a "flush" without invalidating the pipeline state, so I can, for example, process a sequence of five frames such as "IPPPP", flushing the pipeline after each one, and this works without issue. I'd like to be able to achieve the same thing under the WebCodecs API. Currently, this seems impossible, as following a call to flush, the subsequent frame must be an I frame, not a P frame.

My question now is, how can this be overcome? It seems to me, I'd need one of two things:

  1. The ability to guarantee no frames will be held up in the internal pending output queue waiting for another input, OR:
  2. The ability to flush the decoder, or the codec internal pending output queue, without invalidating the decoder pipeline.

At the hardware encoding/decoding API level, my only experience is with NVENC/NVDEC, but I know what I want is possible under this implementation at least. Are there known hardware implementations where what I'm asking for isn't possible? Can anyone see a possible way around this situation?

I can tell you right now, I have two workarounds. One is to encode every frame as a keyframe. This is clearly sub-optimal for bandwidth, and not required for a standalone client. The second workaround is ugly, but functions. I can measure the "queue depth" of the output queue, and send the same frame for decoding multiple times. This works with I or P frames. With a queue depth of 1 for example, which is what I see on Google Chrome, for each frame I receive at the client end, I send it for decoding twice. The second copy of the frame "pushes" the first one out of the pipeline. A hack, for sure, and sub-optimal use of the decoding hardware, but it keeps my bandwidth at the expected level, and I'm able to implement it on the client side alone.
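A rough sketch of that second workaround, assuming a measured queue depth of 1 (the constant and helper names are illustrative, not my production code):

// Decode each chunk (QUEUE_DEPTH + 1) times; the duplicate "pushes" the real
// frame out of the internal pending output queue without needing a flush().
const QUEUE_DEPTH = 1; // output-queue depth observed on Chrome

function submitFrame(decoder, frameData, timestamp, type) {
    for (let i = 0; i <= QUEUE_DEPTH; i++) {
        decoder.decode(new EncodedVideoChunk({ data: frameData, timestamp, type }));
    }
}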

What I would like to see, ideally, is some extra control in the WebCodecs API. Perhaps a boolean flag in the codec configuration? We currently have the "optimizeForLatency" flag. I'd like to see a "forceImmediateOutput" flag or the like, which guarantees that every frame that is sent for decoding WILL be passed to the VideoFrameOutputCallback without the need to flush or force it through the pipeline with extra input. Failing that, an alternate method of flushing that doesn't invalidate the decode pipeline would work. Without either of these solutions though, it seems to me that WebCodecs as it stands is unsuitable for use with variable rate streams, as you have no guarantees about the depth of the internal pending output queue, and no way to flush it without breaking the stream.
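For illustration only, the configuration I have in mind might look something like this (forceImmediateOutput is hypothetical and does not exist in the current spec):

decoder.configure({
    codec: "hvc1.1.6.L120.00",  // illustrative codec string
    optimizeForLatency: true,   // existing hint
    forceImmediateOutput: true, // proposed: every decoded chunk is emitted to the
                                // output callback without flush() or further input
});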

@RogerSanders
Author

Pulling in a reference to #220, as the discussion on that issue directly informed the current spec requirements that a key frame must follow a flush.

@dalecurtis
Contributor

dalecurtis commented Jun 30, 2023

Whether an H.264 or H.265 stream can be decoded 1-in-1-out depends precisely on how the bitstream is set up. Is your bitstream set up correctly? If you have a sample, we can check it. https://crbug.com/1352442 discusses this for H.264.

If the bitstream is set up correctly and this isn't working, that's a bug in that UA. If it's Chrome, you can file an issue at https://crbug.com/new with the chrome://gpu information and a sample bitstream.

If the bitstream is not set up properly, there's nothing WebCodecs can do to help, since the behavior would be up to the hardware decoder.

@sandersdan

@sandersdan
Contributor

First I should note that there is VideoDecoderConfig.optimizeForLatency, which can affect the choices made in setting up a codec. (In Chrome it affects how background threads are set up for software decoding.)

I expect that WebCodecs implementations will output frames as soon as possible, subject to the limitations of the codec implementation. This means that I predict 1-in-1-out behavior unless the bitstream prevents it. The most likely bitstream limitation is when frame reordering is enabled, as for example discussed in the crbug linked above.

Some codec implementations do allow us to "flush without resetting", which can be used to force 1-in-1-out behavior, but I am reluctant to specify this feature knowing that not all implementations can support it (e.g. Android MediaCodec).

If you can post a sample of your bitstream, I'll take a look and let you know if there is a change you can make to improve decode latency.

@RogerSanders
Author

Thanks, your replies have given me some things to check on. I'll dig into the bitstream myself as well and see if I can spot any issues with the configuration.

As requested, I've thrown together a minimal sample in a single html page, with some individual frames base64 encoded in:
index.zip
There's a small animation of a coloured triangle which changes shape over the course of 5 frames. I have five buttons F1_I through F5_I, which are IDR frames. As you press each button, that frame will be sent to the decoder. The first frame includes the SPS header. I have also alternatively encoded frames 2-5 as P frames, under buttons F2_P through F5_P, and there's a button to flush.

What I'd like to see is each frame appear as soon as you hit the corresponding button. Right now you'll observe 1 frame of latency, where I have to send the same frame twice, send a new frame in, or flush the decoder to get it to appear.

@sandersdan
Contributor

I took a quick look at the bitstream you provided, and at first glance it looks correctly configured to me (in particular, max_num_reorder_frames is set to zero). I also ran the sample on Chrome Canary on macOS, and while I was not able to decode the P frames, the I frames did display immediately.

You may be experiencing a browser bug, which should be filed with the appropriate bug tracker (Chrome: https://bugs.chromium.org/p/chromium/issues/entry).

@aboba added the Question label Jul 6, 2023
@RogerSanders
Author

RogerSanders commented Jul 7, 2023

Thanks, I also couldn't find problems in the bitstream. I did, however, diagnose the problem in Chrome. In the Chromium project, under "src\media\gpu\h265_decoder.cc", inside the H265Decoder::Decode() method, there's a bit of code that currently looks like this:

      if (par_res == H265Parser::kEOStream) {
        curr_nalu_.reset();
        // We receive one frame per buffer, so we can output the frame now.
        CHECK_ACCELERATOR_RESULT(FinishPrevFrameIfPresent());
        return kRanOutOfStreamData;
      }

As per the comment, the intent is to output the frame when the end of the bitstream is reached, but the FinishPrevFrameIfPresent() method doesn't actually output the frame; it just submits it for decoding. The fix is simple:

      if (par_res == H265Parser::kEOStream) {
        curr_nalu_.reset();
        // We receive one frame per buffer, so we can output the frame now.
        CHECK_ACCELERATOR_RESULT(FinishPrevFrameIfPresent());
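        // Unlike FinishPrevFrameIfPresent(), which only submits the picture for
        // decoding, this call emits any pictures still held in the pipeline to
        // the output callback.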
        OutputAllRemainingPics();
        return kRanOutOfStreamData;
      }

A corresponding change also needs to occur in h264_decoder.cc, which shares the same logic. I tested this change, and it fixes the issue. I'll prepare a submission for the Chromium project and see what they think.

On the issue of the spec itself though, I'm still concerned by an apparent lack of guarantees around when frames submitted for decoding will be made visible. You mentioned that "I expect that WebCodecs implementations will output frames as soon as possible, subject to the limitations of the codec implementation. This means that I predict 1-in-1-out behavior unless the bitstream prevents it." This is reassuring to me; it's what I'd like to see myself, but I couldn't find any part of the standard that actually requires it. Couldn't a perfectly conforming implementation hold onto 1000 frames right now and claim it meets the spec? This might lead to inconsistent behaviour between implementations, and to problems getting bug reports actioned, as they may be closed as "by design".

From what you said before, I gather the intent of the "optimizeForLatency" flag is much the same as what I was asking for with a "forceImmediateOutput" flag: both are really intended to instruct the decoder not to hold onto any frames beyond what is required by the underlying bitstream. The wording in the spec right now doesn't really guarantee it does much of anything. The docs state that the "optimizeForLatency" flag is a:

"Hint that the selected decoder SHOULD be configured to minimize the number of EncodedVideoChunks that have to be decoded before a VideoFrame is output.
NOTE: In addition to User Agent and hardware limitations, some codec bitstreams require a minimum number of inputs before any output can be produced."

Making it a "hint" that the decoder "SHOULD" do something doesn't really sound reassuring. Can't we use stronger language here, such as "MUST"? It would also be helpful to elaborate on this flag a bit, maybe tie it back to the kind of use case I've outlined in this issue. I'd like to see the standard read more like this under the "optimizeForLatency" flag:

"Instructs the selected decoder that it MUST ensure that the absolute minimum number of EncodedVideoChunks have to be decoded before a VideoFrame is output, subject to the limitations of the codec and the supplied bitstream. Where the codec and supplied bitstream allows frames to be produced immediately upon decoding, setting this flag guarantees that each decoded frame will be produced for output without further interaction, such as requiring a call to flush(), or submitting further frames for decoding."

If I saw that in the spec as a developer making use of VideoDecoder, it would give me confidence that setting that flag would be sufficient to achieve 1-in-1-out behaviour, with appropriately prepared input data. A lack of clarity on that is what led me here to create this issue. It would also give stronger guidance to developers implementing this spec that they really needed to achieve this result to be conforming. Right now it reads more as a suggestion or a nice-to-have, but not required.

@RogerSanders
Author

RogerSanders commented Jul 7, 2023

For completeness, I'll add this issue has been reported to the Chromium project under https://crbug.com/1462868, and I've submitted a proposed fix under https://chromium-review.googlesource.com/c/chromium/src/+/4672706

@dalecurtis
Contributor

Thanks, we can continue discussion of the Chromium issue there.

In terms of the spec language, I think the current text accurately reflects that UAs are beholden to the underlying hardware decoders (and other OS constraints). I don't think we can use SHOULD/MUST here due to that. We could probably instead add some more non-normative text giving a concrete example (1-in-1-out) next to the optimizeForLatency flag.

Ultimately it seems like you're looking for confidence that you can always rely on 1-in-1-out behavior. I don't think you'll find that to be true; e.g., many Android decoders won't work that way. It's probably best to submit a test decode on your target platforms before relying on that functionality.
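A rough sketch of such a test might look like this (config and keyChunk come from your own test vector; the timeout value is arbitrary):

// Probe: feed one key chunk and see whether a frame comes out without calling flush().
function probeOneInOneOut(config, keyChunk, timeoutMs = 500) {
    return new Promise((resolve) => {
        let settled = false;
        const finish = (ok, decoder) => {
            if (settled) return;
            settled = true;
            try { decoder.close(); } catch (_) { /* already closed */ }
            resolve(ok);
        };
        const decoder = new VideoDecoder({
            output: (frame) => { frame.close(); finish(true, decoder); },
            error: () => finish(false, decoder),
        });
        decoder.configure(config);
        decoder.decode(keyChunk);                            // one chunk in...
        setTimeout(() => finish(false, decoder), timeoutMs); // ...did one frame come out?
    });
}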

@aboba
Collaborator

aboba commented May 5, 2024

@Djuffin @padenot It looks like this issue was due to a bug, rather than a spec issue. Can we close it?

@Djuffin closed this as completed May 8, 2024
@jvo203

jvo203 commented Sep 5, 2024

Could you please re-open this issue? I'm having the same problem, using Microsoft Edge on the latest macOS on Apple Silicon.

The existing WebAssembly solution works flawlessly; under the hood it uses WASM-compiled FFmpeg to decode greyscale HEVC frames (received via WebSockets in real time). The integrated WASM C code also applies colormaps to turn greyscale into full RGB prior to rendering in the HTML5 Canvas. The current client-side WebAssembly solution has ZERO latency: incoming frames are decoded by FFmpeg and are available without any delay. On the server, the video frames, coming at irregular intervals from a scientific application (astronomy), are encoded on demand using the x265 C API.

The known downside of the existing setup is that the WebAssembly binary size is inflated by the inclusion of FFmpeg, which is exactly what the WebCodecs API is trying to solve.

Having tried it, however, I find the current WebCodecs API basically unusable. There is a long delay (many, many frames, simply unacceptable) before any output frames appear. Forcing optimizeForLatency: true in the VideoDecoder results in an error:

WebCodecs API Video Worker DataError: Failed to execute 'decode' on 'VideoDecoder': A key frame is required after configure() or flush(). If you're using HEVC formatted H.265 you must fill out the description field in the VideoDecoderConfig.
    at video_worker.js?JS2024-09-06.0:92:26

and no output frames appear.

The same problem in Safari:

WebCodecs API Video Worker
DataError: Key frame is required
decode
(anonymous function) — video_worker.js:92

I have tried the latest Firefox 130, but it fails right at the outset: HEVC does not appear to be supported.

So for real-time streaming applications where the video output needs to be available ASAP without any delays, the current WebCodecs API is unusable, which is a shame.

I will need to stick to the existing (working without a hitch) WebAssembly FFmpeg for the foreseeable future, until the WebCodecs API specification improves and browsers revise their implementations.

@padenot
Collaborator

padenot commented Sep 6, 2024

This issue is best filed in the trackers for the corresponding implementations, and/or with the OS vendors (since implementations are using OS APIs for H265):

cc-ing @youennf and @aboba for Safari and Edge respectively.

You're correct that Firefox doesn't support H265 encoding. Support for H265 isn't planned either; we're focusing on royalty-free formats. I'm not sure about Chromium: wpt.fyi marks it as not supporting H265, but it marks Edge the same way, so I'm not sure how accurate that is.

@jvo203

jvo203 commented Sep 6, 2024

Hi, thanks for the prompt reply. I guess it will take some time before all the kinks in the browser implementations are sorted out. In theory the WebCodecs API seems God-sent; in practice the devil is in the details: imperfect / inconsistent OS implementations, etc.

@youennf
Contributor

youennf commented Sep 6, 2024

https://w3c.github.io/webcodecs/#dom-videodecoder-flush says to set [[key chunk required]] (https://w3c.github.io/webcodecs/#dom-videodecoder-key-chunk-required-slot) to true.

https://w3c.github.io/webcodecs/#dom-videodecoder-decode then throws if [[key chunk required]] is true and the chunk is not of type key.

With regards to the delay, HEVC/H264 decoders need to compute the reordering window size.
It might be that the stream has a long reordering window size, or that there is a user agent bug in its computation.

@jvo203, the error you mention is not unexpected.
Are you calling decode with a key frame after a flush? If so, this is indeed a bug on the user agent side; it would be nice to have some repro steps.

@jvo203

jvo203 commented Sep 6, 2024

I am not calling decoder.flush() explicitly myself. When optimizeForLatency: true is used, it seems flush() must be getting called internally by the browser implementations.

This is my current experimental JS web worker code (just kicking the tires). By the way, it is rather difficult to find correct information about the HEVC codec string; mine was lifted from somewhere on the Internet, and good references are hard to find.

// #define IS_NAL_UNIT_START(buffer_ptr) (!buffer_ptr[0] && !buffer_ptr[1] && !buffer_ptr[2] && (buffer_ptr[3] == 1))
function is_nal_unit_start(buffer_ptr) {
    return (!buffer_ptr[0] && !buffer_ptr[1] && !buffer_ptr[2] && (buffer_ptr[3] == 1));
}

// #define IS_NAL_UNIT_START1(buffer_ptr) (!buffer_ptr[0] && !buffer_ptr[1] && (buffer_ptr[2] == 1))
function is_nal_unit_start1(buffer_ptr) {
    return (!buffer_ptr[0] && !buffer_ptr[1] && (buffer_ptr[2] == 1));
}

// #define GET_H265_NAL_UNIT_TYPE(buffer_ptr) ((buffer_ptr[0] & 0x7E) >> 1)
function get_h265_nal_unit_type(byte) {
    return ((byte & 0x7E) >> 1);
    // (Byte >> 1) & 0x3f
    //return ((byte >> 1) & 0x3f);
}

console.log('WebCodecs API Video Worker initiated');

var timestamp = 0; // [microseconds]

self.addEventListener('message', function (e) {
    try {
        let data = e.data;
        console.log('WebCodecs API Video Worker message received:', data);

        if (data.type == "init_video") {
            timestamp = 0;

            const config = {
                codec: "hev1.1.60.L153.B0.0.0.0.0.0",
                codedWidth: data.width,
                codedHeight: data.height,
                /*optimizeForLatency: true,*/
            };

            VideoDecoder.isConfigSupported(config).then((supported) => {
                if (supported) {
                    console.log("WebCodecs::HEVC is supported");

                    const init = {
                        output: (frame) => {
                            console.log("WebCodecs::HEVC output video frame: ", frame);
                            frame.close();
                        },
                        error: (e) => {
                            console.log(e.message);
                        },
                    };

                    const decoder = new VideoDecoder(init);
                    decoder.configure(config);
                    this.decoder = decoder;

                    console.log("WebCodecs::HEVC decoder created:", this.decoder);
                } else {
                    console.log("WebCodecs::HEVC is not supported");
                }
            }).catch((e) => {
                console.log(e.message);
            });

            return;
        }

        if (data.type == "end_video") {
            try {
                this.decoder.close();
                console.log("WebCodecs::HEVC decoder closed");
            } catch (e) {
                console.log("WebCodecs::HEVC decoder close error:", e);
            }

            return;
        }

        if (data.type == "video") {
            let nal_start = 0;

            if (is_nal_unit_start1(data.frame))
                nal_start = 3;
            else if (is_nal_unit_start(data.frame))
                nal_start = 4;

            let nal_type = get_h265_nal_unit_type(data.frame[nal_start]);
            console.log("HEVC NAL unit type:", nal_type);

            const type = (nal_type == 19 || nal_type == 20) ? "key" : "delta";

            const chunk = new EncodedVideoChunk({ data: data.frame, timestamp: timestamp, type: type });
            timestamp += 1;

            this.decoder.decode(chunk);
            console.log("WebCodecs::HEVC decoded video chunk:", chunk);
        }
    } catch (e) {
        console.log('WebCodecs API Video Worker', e);
    }
}, false);

@jvo203

jvo203 commented Sep 6, 2024

With regards to the delay, HEVC/H264 decoders need to compute the reordering window size.
It might be that the stream has a long reordering window size, or that there is a user agent bug in its computation.

Regarding the unexpected initial delays when optimizeForLatency: true is not being used, WebAssembly FFmpeg code does not seem to have any problems with the same HEVC bitstreams. Decoded frames always appear immediately.

With the WebCodecs API, after an initial delay of even several hundred frames (!) (or 10 seconds or more in user time), incoming HEVC frames are decoded without any delay as they arrive in real time: a new HEVC frame comes in and the corresponding decoded output video frame appears immediately.

@youennf
Contributor

youennf commented Sep 6, 2024

I feel like this might be a user agent bug.
Safari is not yet really using optimizeForLatency: true, so I am unsure whether you are talking about Safari or Chrome.
If there is a Safari issue, filing a bug with repro steps at https://bugs.webkit.org/enter_bug.cgi?product=WebKit&component=Media would be good.

@jvo203

jvo203 commented Sep 6, 2024

That's strange; perhaps a newer Safari uses optimizeForLatency? My Safari is 17.6 (19618.3.11.11.5). The following error comes from Safari when optimizeForLatency is enabled:

WebCodecs API Video Worker
DataError: Key frame is required
decode
(anonymous function) — video_worker.js:92

Microsoft Edge as well as Chrome on macOS give

WebCodecs API Video Worker DataError: Failed to execute 'decode' on 'VideoDecoder': A key frame is required after configure() or flush(). If you're using HEVC formatted H.265 you must fill out the description field in the VideoDecoderConfig.
    at video_worker.js?JS2024-09-06.0:92:26

When the optimizeForLatency is disabled (the flag is commented out in the init object), all three browsers (Safari, Edge and Chrome) start outputting decoded video frames after a long initial delay.

@jvo203

jvo203 commented Sep 6, 2024

When you say "I feel like this might be a user agent bug.", what exactly do you mean by the "user agent"? Do you mean an incorrect HEVC codec string codec: "hev1.1.60.L153.B0.0.0.0.0.0"?

@youennf
Contributor

youennf commented Sep 6, 2024

user agent = browser, like Safari.
If you are using a JS library that is wrapping VideoDecoder, maybe this library is calling flush.
Without a repro case like a jsfiddle, it is hard to know what the issue is.

@jvo203

jvo203 commented Sep 6, 2024

I see. No, I am not using any wrapping JavaScript libraries. Nothing at all. I am calling VideoDecoder directly from JavaScript: no wrapping, no extra third-party WebCodecs API libraries, "no nothing". You can see my JavaScript code a few posts above.

@dalecurtis
Contributor

dalecurtis commented Sep 6, 2024

optimizeForLatency: true doesn't change how key-frame detection works. I'd guess something is going wrong with your bitstream generation. If you can provide your input as an annexb .h264 or .h265 file we can validate the bitstream.

optimizeForLatency: true also doesn't cause flushing. In Chromium, it just means we configure ffmpeg to not use threading for decoding -- since that introduces latency beyond the minimum possible for the bitstream.

@aboba
Collaborator

aboba commented Sep 6, 2024

I ran the WebCodecs Encode/Decode Worker sample on Chrome Canary 130.0.6700.0 on Mac OS X 14.6.1 (MacBook Air M2), with H.265 enabled via the flags:
--enable-features=PlatformHEVCEncoderSupport,WebRtcAllowH265Send,WebRtcAllowH265Receive --force-fieldtrials="WebRTC-Video-H26xPacketBuffer/Enabled/"

Selecting H.265 (codec string "hvc1.1.6.L120.00" (Main profile, level 4.0)) with 500Kbps bitrate, "realtime" (which sets optimizeForLatency: true on the decoder), the encode (< 10ms) and decode (< 2 ms) times are quite low:

(screenshots of the encode/decode timing panels omitted)

@jvo203

jvo203 commented Sep 6, 2024

@aboba For this particular example the typical decode times using WebAssembly-compiled FFmpeg in a browser (non-threaded FFmpeg I guess) are between 1ms and 2ms for the small HEVC frames. Typical frame image dimensions are small, like 150x150 etc. Not bad, good enough for real-time. Even larger HEVC video frames like 1024x786 etc. can be decoded in a few milliseconds.

@dalecurtis OK, I'll see what can be done to extract the raw NAL frames from the server and save them to disk, just prior to sending them via WebSockets to the client browser. This particular application uses Julia on the server, Julia in turn calls x265 C API bindings to generate raw HEVC NAL units (no wrapping video containers are used, just raw HEVC NAL units). In addition I'll look into trying to save the incoming NAL units on the receiving side too (received via WebSockets in JavaScript). Don't know if it can be done easily in JavaScript (disk access rights etc). Perhaps there is some discrepancy in the HEVC NAL units order on the server and the receiver.

Let's assume for a moment that indeed my HEVC bitstreams are a bit dodgy, or some NAL units are getting "lost in transit", or the WebSockets receive NAL units in the wrong order. Then it seems that WASM FFmpeg is extraordinarily resilient to any such errors. This client-server architecture has been in use since about 2016 in various forms (C/C++ on the server, also Rust, Julia too as well as C/FORTRAN server parts, always calling the x265 C API). The WASM FFmpeg decoder has never complained about the incoming NAL units, it has always been able to decode those HEVC bitstreams in real-time without any delays.

@dalecurtis
Contributor

ffmpeg is indeed far more resilient than hardware decoders. Unless you configure it to explode on errors (and even then), it will decode all sorts of content that the GPU will often balk at. Since issues in the GPU process are more serious from a security standpoint, in Chromium we also check the bitstream for conformance before passing it to the hardware decoder.

@jvo203

jvo203 commented Sep 7, 2024

Here is some food for thought. Changing the codec string to the one used by @aboba makes the "keyframe required" problem go away in Chrome and Edge when optimizeForLatency: true is enabled.

 /*codec: "hev1.1.60.L153.B0.0.0.0.0.0",*/
 codec: "hvc1.1.6.L120.00",

In both Chrome and Edge there is now no difference in behaviour with and without optimizeForLatency. In either case video frames are decoded without keyframe errors. However, the problem of a delayed video frame output still remains. After carefully observing the decoded frame timestamps (counts), the initial delay seems to be 258 frames. In other words, the initial frames with timestamps from 0 to 257 never appear in the VideoDecoder output queue. The first output frame always has a timestamp of 258. From the frame timestamp 258 onwards all frames immediately appear in the output queue. I.e. you pass an EncodedVideoChunk with timestamp 260 to the decoder, and almost immediately an output Video Frame with the same timestamp 260 comes out of a decoder output queue. So the mystery is why the first 258 NAL units are completely lost (like falling into a black hole).

There is a new snag: Safari stopped working with codec: "hvc1.1.6.L120.00", with or without optimizeForLatency. There is an unspecified Data Error upon decoding the frames. Oh bummer!

On the subject of the codec string, there is a distinct lack of practical information regarding correct HEVC codec strings. For VP8, VP9, and AV1, googling yields some examples and useful information; less so for HEVC.
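For anyone else hunting for this, here is my rough understanding of how that string breaks down (based on ISO/IEC 14496-15; please double-check before relying on it):

// "hvc1.1.6.L120.00" (the string that works in Chrome / Edge above)
//   hvc1 - HEVC sample entry type ("hev1" is the other variant)
//   1    - general_profile_space + general_profile_idc (1 = Main profile)
//   6    - general_profile_compatibility_flags, in hex with the bits reversed
//   L120 - general_tier_flag ('L' = Main tier, 'H' = High tier) + general_level_idc
//          (level_idc = level x 30, so 120 = level 4.0 and 153 = level 5.1)
//   00   - general_constraint_indicator_flags bytes (trailing zero bytes may be omitted)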

Anyway, I will try to extract the HEVC bitstreams from the application; some clues might be found.

@jvo203

jvo203 commented Sep 7, 2024

For completeness, attached is a raw HEVC bitstream (Annex B file) that plays fine on macOS using ffmpeg (ffplay):

(venv) chris@studio /tmp % ffplay video-123-190.h265
ffplay version 7.0.2 Copyright (c) 2003-2024 the FFmpeg developers
  built with Apple clang version 15.0.0 (clang-1500.3.9.4)
  configuration: --prefix=/opt/homebrew/Cellar/ffmpeg/7.0.2 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags='-Wl,-ld_classic' --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libaribb24 --enable-libbluray --enable-libdav1d --enable-libharfbuzz --enable-libjxl --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libspeex --enable-libsoxr --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack --enable-videotoolbox --enable-audiotoolbox --enable-neon
  libavutil      59.  8.100 / 59.  8.100
  libavcodec     61.  3.100 / 61.  3.100
  libavformat    61.  1.100 / 61.  1.100
  libavdevice    61.  1.100 / 61.  1.100
  libavfilter    10.  1.100 / 10.  1.100
  libswscale      8.  1.100 /  8.  1.100
  libswresample   5.  1.100 /  5.  1.100
  libpostproc    58.  1.100 / 58.  1.100
Input #0, hevc, from 'video-123-190.h265':
  Duration: N/A, bitrate: N/A
  Stream #0:0: Video: hevc (Rext), yuv444p(tv), 123x190, 25 fps, 30 tbr, 1200k tbn
    nan M-V:    nan fd=   0 aq=    0KB vq=    0KB sq=    0B

Filename: video-123-190.h265
https://drive.google.com/file/d/1KkZR0HisRJcWH21VAKheDB7QRbfUPqkL/view?usp=sharing

When you play it the video starts with some noise-like dots so please don't be surprised. Each dot represents an X-ray event count bin from an X-ray orbiting satellite observatory.

Abbreviated server-side Julia code:

# "init_video" event
fname = "/tmp/video-$image_width-$image_height.h265"
annexb = open(fname, "w")

# for each 2D Video Frame
# prepare the frame, encode with x265, then
  # for each NAL
  for idx = 1:iNal[]
      nal = x265_nal(pNals[], idx)

      #binary NAL payload
       payload = Vector{UInt8}(undef, nal.sizeBytes)
                                    unsafe_copyto!(
                                        pointer(payload),
                                        nal.payload,
                                        nal.sizeBytes)

        write(resp, payload) # complete the WebSocket response
         put!(outgoing, resp) # queue the WebSocket response

       # append the NAL unit to the Annex-B file
       write(annexb, payload)
   end

# upon an "end_video" event close the annexb file 
close(annexb)

Such HEVC streams experience that 258-NAL-unit initial delay when being decoded with the WebCodecs API. However, there is no initial delay observed with the WebAssembly FFmpeg on the client side.

Any help / insights would be much appreciated!

Edit: two more HEVC bitstream examples:
https://drive.google.com/file/d/1KlcrqlWqucoyrWwj0H2utKdKuxk96nMb/view?usp=sharing
https://drive.google.com/file/d/1KqtblUuCaCr-0ED3kYWXkgtkuz_YpHQ9/view?usp=sharing

P.S. The meaning of the RGB planes:

Red - Luminance
Green - Alpha mask
Blue - Dummy (ignored)

The WebAssembly C code then turns the decoded luminance and alpha channels into a proper RGBA array, displayed in an HTML5 Canvas.

@dalecurtis
Contributor

@sandersdan ^^^ -- I wonder how hard it'd be to make a JS bitstream analyzer for H.264 / H.265 which verifies whether a bitstream is configured for 1-in / 1-out.
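A very rough starting point could just walk the Annex B start codes and list the NAL unit types; the actual 1-in / 1-out check would still need Exp-Golomb parsing of the SPS/VUI (e.g. max_num_reorder_frames). Sketch only:

// Scan an Annex B buffer (Uint8Array) and return the H.265 NAL unit types found.
// H.265 NAL type = (first header byte >> 1) & 0x3f; VPS = 32, SPS = 33, PPS = 34, IDR = 19/20.
// (For H.264 the type is instead (header byte & 0x1f), with SPS = 7.)
function listHevcNalTypes(buf) {
    const types = [];
    for (let i = 0; i + 3 < buf.length; i++) {
        const short3 = buf[i] === 0 && buf[i + 1] === 0 && buf[i + 2] === 1;
        const long4 = buf[i] === 0 && buf[i + 1] === 0 && buf[i + 2] === 0 && buf[i + 3] === 1;
        if (short3 || long4) {
            const headerIdx = i + (short3 ? 3 : 4);
            if (headerIdx >= buf.length) break;
            types.push((buf[headerIdx] >> 1) & 0x3f);
            i = headerIdx - 1; // continue scanning from the NAL header byte
        }
    }
    return types;
}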

@Djuffin
Contributor

Djuffin commented Sep 10, 2024

I created a demo page for decoding a video from #698 (comment).

Chrome 128 on an M1 MacBook: no lag in decoding is observed. The decoder outputs frames as soon as it gets the corresponding chunks. In order to achieve that, I had to add a small wait after each decode() call, but that is expected: if you keep feeding frames, the output callback simply has no opportunity to run on the main thread.

@jvo203

jvo203 commented Sep 10, 2024

Hmm, it's interesting that you hardcoded type: "key" for all the NAL units, even if they are "delta" or some other NAL control frames. Could this also be making a difference? In my code I was trying to decode the NAL unit types and then set the type to either "key" or "delta".

Another difference: you are incrementing the timestamp by 30000 and not by 1 as I have been doing. Could it be that the WebCodecs API decoder is sensitive to the timestamps (were my initial 258 "lost frames" dropped due to insufficient timestamp differences)?

Plus one major difference: your WebCodecs API code runs on the main thread, not in a Web Worker, though perhaps that should not matter too much. If anything, using a Web Worker should help reduce pressure on the main thread.

@Djuffin
Contributor

Djuffin commented Sep 10, 2024

In my code I was trying to decode the NAL unit types, and then set the type to either "key" or "delta".

Your way is more accurate. But I'm pretty sure it doesn't affect the decoder's lag.

@jvo203

jvo203 commented Sep 10, 2024

@Djuffin There is also another difference; I don't know how significant it might be. When you extract the individual NAL units from the HEVC bitstream, you are skipping the initial start-code sequence of three or four bytes (the 0x00 bytes up to the 0x01).

@Djuffin
Contributor

Djuffin commented Sep 10, 2024

No, start codes are preserved by the demo page

@jvo203

jvo203 commented Sep 10, 2024

I see.

@jvo203

jvo203 commented Sep 10, 2024

Thanks to hints from @Djuffin, victory has finally been achieved. The key was waiting (without calling decoder.decode()) and accumulating incoming NAL units until the first keyframe was found.

The solution in the demo from @Djuffin had the benefit of seeing the whole HEVC bitstream all at once: the rudimentary demuxer in the demo could extract enough NAL units to include the first keyframe right at the start of decoding.

In my case, using a real-time WebSockets stream, individual NAL units were coming one short NAL at a time, and the first keyframe in the HEVC bitstream would only appear after a few non-key control NAL units.

The following code, whilst perhaps not optimal (the buffer merging could probably be done better), finally works. No initial frames are getting lost, and there is perfect "1 frame in / 1 frame out" behaviour.

Thank you all for your patience and advice!

const type = (nal_type == 19 || nal_type == 20) ? "key" : "delta";

if (first) {
    // append the frame to the frame buffer
    frame_buffer.push(data.frame);

    if (type == "key") {
        first = false;
        console.log("WebCodecs::HEVC first keyframe received");

        // merge entries from the frame_buffer into a single byte array
        let frame_buffer_size = frame_buffer.reduce((acc, cur) => acc + cur.length, 0);
        let merged_buffer = new Uint8Array(frame_buffer_size);
        let offset = 0;
        for (let i = 0; i < frame_buffer.length; i++) {
            merged_buffer.set(frame_buffer[i], offset);
            offset += frame_buffer[i].length;
        }

        // the merged buffer (leading control NAL units + the first IDR) is sent as a keyframe
        const chunk = new EncodedVideoChunk({ data: merged_buffer, timestamp: timestamp, type: "key" });
        timestamp += 1; //30000; // 30ms

        this.decoder.decode(chunk);
        console.log("WebCodecs::HEVC decoded video chunk:", chunk);
    }
} else {
    // subsequent NAL units are decoded one at a time with their detected type
    const chunk = new EncodedVideoChunk({ data: data.frame, timestamp: timestamp, type: type });
    timestamp += 1; //30000; // 30ms

    this.decoder.decode(chunk);
    console.log("WebCodecs::HEVC decoded video chunk:", chunk);
}

As mentioned by @dalecurtis, the WebAssembly-compiled FFmpeg used in my production solution is very resilient and can "swallow" incoming NAL units as-is, without any manual NAL unit accumulation. Well, perhaps FFmpeg is doing its own internal buffering. Anyway, victory at last!

Edit: P.S. If only Firefox could support HEVC in the WebCodecs API too ... then we'd have all the major browsers (Chrome, Edge, Firefox and Safari, listed in alphabetical order) using the same client-side JavaScript code. P.S. 2: Even with Firefox on board, there would still be a need to provide a full WASM legacy fallback for older browsers.
