
inaccurate seeking #3

Open
bjudkewitz opened this issue Sep 18, 2024 · 15 comments

Comments

@bjudkewitz

Thanks a lot for making this plugin! I noticed that seeking can be off by a few frames. This becomes apparent when using napari_video for annotation. You can then notice that the annotation layer and video suddenly get out of sync when you seek to another frame and come back. I assume this is due to a known limitation of OpenCV. opencv/opencv#9053
It is possible that this is worse for some codecs than others, but it is a problem for our videos. Has this been observed by others? Is there any interest in switching this plugin e.g. to pyav?

@postpop
Contributor

postpop commented Oct 6, 2024

Hi! 😃 Terribly sorry about the late reply!! If pyav solves this issue, then we would be happy to use it instead of opencv as a backend. We have not noticed your specific issue before but are aware that seeking is a bit imprecise with opencv. Would you be able to share a video we can use to test pyav? Thanks!

@bjudkewitz
Author

bjudkewitz commented Oct 6, 2024

Hi Jan, no worries. Fun timing: I just put something quick together yesterday. It's not as versatile as napari-video, so I am happy to fold anything you find useful into napari-video if there is interest. https://github.com/danionella/napari-pyav

Example video: they are all large, so let me see if I can make a small one with the same problem. Also, note it's hard to see the problem unless you have a reference. We noticed it only during annotation and when seeking back and forth.

@postpop
Contributor

postpop commented Oct 6, 2024

That looks perfect! I think this is much cleaner and more focused than what we have atm, and probably better suited for the use case of working with videos in napari. Would love to incorporate this into napari-video. Wanna open a pull request? I would then also make you a co-owner of the repo and move the package to a new shared organization...

@bjudkewitz
Author

Thank you for the offer. No need to move it and happy to create a branch if you give me rights. Being listed as contributor is more than enough credit. However, I should caution against replacing the old approach without further testing. For example, we only use grayscale videos (more specifically, h264 .mp4) and haven't tested rgb (should work, but just giving an example). Also there is a strict timestamp check to ensure reproducible seeking, but I could imagine it might simply fail for videos with variable frame rate or other peculiarities – raising an exception. We like that, but others might prefer a plugin that gives you a frame no matter what, even if it might be off by a few ms. The nice thing about opencv is that it is high level and others have done the job of testing various formats with it.

@niksirbi
Collaborator

Hi both, just chiming in here to say that I'm really interested in these developments, as a napari-video user.
I'd also love for videos to be reliably seekable.

I gave napari-pyav a quick try with an RGB .mp4 video, but it failed (probably due to the strict timestamp checking that was mentioned above). @bjudkewitz, I can open an issue on the napari-pyav repo if you want, but I also understand if you don't wish to be bombarded with noise at this early stage.

Also, apart from opencv and pyav, I wonder whether imageio-ffmpeg would be another option here. I'm bringing this up because I noticed that sleap-io recently dropped pyav and now handles their video I/O through imageio-ffmpeg. Given that they care about annotating frames (and thus seekability), I wonder if they had a reason that may be relevant for us too. Tagging @talmo in case he has an insight on this.

@talmo

talmo commented Oct 12, 2024

Hello friends!

A general primer on frame-accurate reliable seeking

This is hard for a bunch of reasons, the primary one being that different video container formats have varying support for precise seeking depending on how they store timestamps / seek points / keyframes.

Video container formats most commonly used for behavior are MP4 and AVI, but the container is different from the video codec, which is the algorithm that defines how the pixel data is represented and compressed. Both MP4 and AVI support multiple different codecs, which makes things even more complicated.

MP4 containers are ideal for our purposes. MP4 files have an index of keyframes (I-frames), which makes seeking to those fast and precise. Once you're there, MP4 stores frames with a combination of moov atoms and mdat atoms. The moov atoms contain timestamp metadata (and other stuff), which makes precise indexing way easier, while the mdat atoms contain the actual encoded image data.

AVI containers have an index of keyframes, but not the intermediate metadata, which makes it tricky to seek precisely unless you make some assumptions about the preciseness of frame rate, or you have every frame stored as a keyframe (e.g., MJPEG).

The specifics of the codec will affect the precise performance and seeking strategy. In general, the idea is to seek to the nearest keyframe, then decode forwards or backwards until you hit the exact frame index you want. In FFMPEG, this strategy is enabled through the -accurate_seek flag (on by default since v2.1) when seeking through the -ss flag (ref).
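That strategy can be sketched in a few lines over a mock frame list (the `Frame` class and the GOP-of-5 layout below are illustrative stand-ins, not any real decoder's API):

```python
from dataclasses import dataclass

@dataclass
class Frame:
    index: int
    is_keyframe: bool  # True for I-frames

def accurate_seek(frames, target):
    """Jump to the nearest preceding keyframe, then decode forward to target."""
    # 1) The container's keyframe index only lets us land on I-frames cheaply.
    key = max(f.index for f in frames if f.is_keyframe and f.index <= target)
    # 2) P/B-frames depend on earlier decoded frames, so decode every
    #    frame from that keyframe up to the one we actually want.
    return [f.index for f in frames if key <= f.index <= target]

# Mock stream with a GOP size of 5 (keyframes at 0, 5, 10):
frames = [Frame(i, i % 5 == 0) for i in range(15)]
print(accurate_seek(frames, 8))  # -> [5, 6, 7, 8]
```

Note that the cost of a seek scales with the distance to the preceding keyframe, which is one reason small, regular GOPs seek faster.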

In practice, this is trickier due to how non-keyframe data is stored and dependent on each other, and the fuzziness in the specification of different codecs and container formats.

So why do we get inconsistent results when seeking?

In short: most software that writes video files is bad.

One big stumbling block is that seeking is all designed around timestamps, not frame indices. This means that we have to use the FPS to calculate the timestamp corresponding to our frame. As you might imagine, this creates big issues if the video was recorded with a variable frame rate, stored at a variable frame rate, or with incorrect timing information.
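The index-to-timestamp arithmetic being described is trivial for constant frame rate, and the sketch below also shows how it silently breaks when real frame timing drifts from the nominal FPS (the 29.2 fps "camera" is a made-up example; the rounding convention is one common choice, not universal):

```python
def frame_to_timestamp(index, fps):
    # Seeking APIs want a timestamp, so we derive one from the frame index.
    return index / fps

def timestamp_to_frame(ts, fps):
    # Round to the nearest nominal frame boundary.
    return round(ts * fps)

fps = 30.0  # nominal frame rate stored in the container
ts = frame_to_timestamp(100, fps)
assert timestamp_to_frame(ts, fps) == 100  # round-trips fine at constant FPS

# If the camera actually ran at ~29.2 fps (free-running, no trigger), the
# real timestamp of frame 100 no longer maps back to index 100:
real_ts = 100 / 29.2
print(timestamp_to_frame(real_ts, fps))  # -> 103, off by three frames
```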

Depending on the software used to encode the video, it might try to store the precise time that the frame was received by the camera, by the encoder process, or not at all! When using free running cameras rather than externally triggered ones, this gets annoying real fast.

Other issues are gnarlier:

  • AVI files don't actually enforce that you have an index, which makes seeking a lot more expensive and less precise.
  • Variable keyframe (I-frame) spacing is sometimes supported and can help with improving encoding efficiency, but makes it harder to seek reliably.
  • H.264 and H.265 encoded videos rely on a group of pictures (GOP) structure, which groups keyframes (I-frames) with partial frames that rely on information from past and future frames (B-frames) and/or just past frames (P-frames). The larger the GOP, the more likely it is that a reconstruction gets messed up, especially if there are issues with the timestamping.
  • Variable bit rate encoding (especially in AVIs) can result in miscalculation of keyframe byte positions, which messes everything up.
  • The behavior of decoders can vary across platforms.

A major culprit of a lot of these is opencv's VideoWriter since it doesn't support specifying a lot of the settings needed to prevent these (see more info a couple sections down).

Could we just check when videos are bad then?

Ideally, if we could detect some of these conditions, we could try to mitigate for them (e.g., use a slower frame-by-frame decoding strategy with integrity checks and no skipping), but this is a huge pain in the ass considering how many edge cases and combinations there are.

In principle, you could use ffprobe to dump out the frame metadata (timing and type) so you could calculate the GOP size, I-frame intervals and regularity, etc., but this is a bit of a project and it's not clear how these numbers translate to precise seekability in every scenario. You could also check for variability in frame rate in a similar way, with the same caveats.
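As a sketch of that analysis: once you have the per-frame picture types (e.g., parsed out of ffprobe's JSON output; the exact command shape in the comment below is an assumption and may vary by ffprobe version), computing the I-frame intervals is straightforward:

```python
def iframe_intervals(pict_types):
    """Given per-frame picture types ('I', 'P', 'B'), return the distances
    between consecutive keyframes, i.e. the effective GOP sizes."""
    positions = [i for i, t in enumerate(pict_types) if t == "I"]
    return [b - a for a, b in zip(positions, positions[1:])]

# pict_types would come from something like:
#   ffprobe -v error -select_streams v:0 -show_frames \
#           -show_entries frame=pict_type -of json input.mp4
# followed by json.loads(...) on its stdout.
types = ["I", "P", "B", "P", "I", "P", "P", "P", "I"]
print(iframe_intervals(types))                  # -> [4, 4]
print(len(set(iframe_intervals(types))) <= 1)   # regular spacing -> True
```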

Other issues, like missing or corrupted timestamps or keyframe indices, are detectable by parsing the STDERR from ffmpeg, but I think it pretty much incurs a full file read, which is quite expensive.

Even assuming you could detect these issues, it's not clear that frame-by-frame reading is the answer (and certainly isn't when we want to do a precise random seek).

Can you just force a video to be seekable?

Since these are pretty annoying technical issues, our recommended solution is to just re-encode your video if you have any seekability issues. From the SLEAP docs:

Many types of video acquisition software, however, do not save videos in a format suitable for computer vision-based processing. A very common issue is that videos are not reliably seekable, meaning that you may not get the same data when trying to read a particular frame index. This is because many video formats are optimized for realtime and sequential playback and save space by reconstructing the image using data in adjacent frames. The consequence is that you may not get the exact same image depending on what the last frame you read was. Check out this blog post for more details.

If you think you may be affected by this issue, or just want to be safe, re-encode your videos using the following command:

ffmpeg -y -i "input.mp4" -c:v libx264 -pix_fmt yuv420p -preset superfast -crf 23 "output.mp4"

Breaking down what this does:

  • -i "input.mp4": Specifies the path to the input file. Replace this with your video. Can be .avi or any other video format.
  • -c:v libx264: Sets the video compression to use H264.
  • -pix_fmt yuv420p: Necessary for playback on some systems.
  • -preset superfast: Sets a number of parameters that enable reliable seeking.
  • -crf 23: Sets the quality of the output video. Lower numbers are less lossy, but result in larger files. A CRF of 15 is nearly lossless, while 30 will be highly compressed.
  • "output.mp4": The name of the output video file (must end in .mp4).
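If you want to script that re-encode, one minimal approach is to build the argv list and hand it to subprocess (a sketch; it assumes an ffmpeg binary on the PATH, and the command is deliberately not executed here):

```python
import subprocess

def reencode_cmd(src, dst, crf=23):
    """Build the FFmpeg command for a reliably seekable H.264 MP4."""
    return [
        "ffmpeg", "-y",          # overwrite output without asking
        "-i", src,               # input: any container/codec ffmpeg can read
        "-c:v", "libx264",       # H.264 encoding
        "-pix_fmt", "yuv420p",   # playback compatibility
        "-preset", "superfast",  # the seek-friendly preset discussed above
        "-crf", str(crf),        # quality; lower = less lossy, bigger file
        dst,                     # output must end in .mp4
    ]

cmd = reencode_cmd("input.avi", "output.mp4")
# subprocess.run(cmd, check=True)  # uncomment to actually re-encode
```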

The superfast preset has been a silver bullet for us and our userbase. The specific options that this sets in H264 are:

        param->analyse.inter = X264_ANALYSE_I8x8|X264_ANALYSE_I4x4;
        param->analyse.i_me_method = X264_ME_DIA;
        param->analyse.i_subpel_refine = 1;
        param->i_frame_reference = 1;
        param->analyse.b_mixed_references = 0;
        param->analyse.i_trellis = 0;
        param->rc.b_mb_tree = 0;
        param->analyse.i_weighted_pred = X264_WEIGHTP_NONE;
        param->rc.i_lookahead = 0;

(non-canonical source)

These will overwrite the defaults for the codec.

I haven't tested these extensively, but these are the ones that are likely to interact with seekability:

  • i_frame_reference = 1: Setting it to 1 means only 1 frame is used as a reference, so there are fewer temporal dependencies.
  • analyse.b_mixed_references = 0: This disables the use of different reference frames within a macroblock (a patch of the image), so the dependency is always from a single reference frame.
  • rc.b_mb_tree = 0: This disables lookaheads when computing the macroblock tree, which controls how many bits are used to describe image patches and their inter-frame dependencies. Disabling it doesn't directly affect seekability, but it can interact with other settings that affect the placement (both regularity and frequency) of I-frames.
  • analyse.i_weighted_pred = X264_WEIGHTP_NONE: This disables weights for P-frames, which can help encoding efficiency by modulating the same patch by a (scalar?) weight, which is useful for capturing subtle lighting variation. It probably doesn't mess with seekability too much, but it does have the potential for introducing more temporal dependencies (I'm just not sure how this is prioritized relative to i_frame_reference and the other settings).

We've put a lot of work into just directly encoding to this preset at acquisition time to avoid having to recompress data, but I appreciate that this isn't always possible depending on the acquisition software and available bindings.

What about decoding then?

Since we don't have much control over how people encode their videos, what should we use to maximize compatibility on the decoding side?

Ideally, we'd like to use something that is both fast AND supports the accurate seeking strategies outlined above (failing gracefully when it can't).

The most feature-complete implementation of all of this complex decoding logic is definitely libav, which is the backend for ffmpeg. BUT: ffmpeg itself also has some additional logic on top of the core libav modules which can help with some of the aforementioned frame seeking strategies. The downside is that ffmpeg is a CLI tool, not a library that you can easily use in your own software.

The most commonly used high-level reader that uses ffmpeg is opencv, which does some pretty unholy proxying to ffmpeg and libav with lots of inline monkey patches and a pretty annoying OS-dependent build pipeline.

While it has the advantage that it can decode the video bytes in-memory, it comes with the rest of the opencv bloat and doesn't support working with the full feature-set available in the ffmpeg CLI (e.g., using different -ss modes).

In principle, we could just use libav directly, for which there are native Python bindings via PyAV. This allows for in-memory decoding and low-level control of the decoding strategy -- best of both worlds! We originally targeted this as the default video backend for sleap-io, but in our experience it's been a huge pain. It's been slower, produces insuppressible error messages directly to STDOUT (affecting performance in some cases where it spams errors on every repeated call), more buggy, and is an annoying compiled dependency that didn't used to have native packages across all platforms (e.g., osx-arm64). The last point has been resolved, especially now that it's on conda-forge.

Recently, however, we dropped av as a dependency in sleap-io in v0.1.9. We still support it if it's installed in the environment, but it's no longer in the dependencies since we rarely preferred it over opencv or ffmpeg. We're also trying to get sleap-io compiled to WebAssembly, so we need to get all of our dependencies ported over as well and this one was an easy one to cut.

So what's the best solution? We now default to the tried-and-true (if a bit janky) pipe protocol. This works by doing ffmpeg -i pipe:0 ... (or equivalently ffmpeg -i - ...) which tells FFMPEG to read from STDIN, consuming the bytestream directly as if it were reading from a file. Using this input format, you can read your video bytes, pass them to ffmpeg via a subprocess, then write to STDIN. Alternatively, you can give it the path to the video file directly and let it handle the byte-seeking for you (definitely recommended). Then, you can get the decoded bytes back by reading STDOUT and format it back into a PIL image or numpy array.
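The STDOUT half of that pipe is just a flat rawvideo bytestream, so reassembling frames is a matter of slicing fixed-size chunks. The sketch below runs over an in-memory stream; with a real subprocess (e.g. something along the lines of `ffmpeg -i input.mp4 -f rawvideo -pix_fmt rgb24 pipe:1`) you would read from `proc.stdout` instead:

```python
import io

def iter_frames(stream, width, height, channels=3):
    """Yield raw rgb24 frames from a rawvideo bytestream."""
    frame_size = width * height * channels  # bytes per decoded frame
    while True:
        buf = stream.read(frame_size)
        if len(buf) < frame_size:  # EOF (or a truncated trailing frame)
            return
        # Wrap with numpy.frombuffer(buf, "u1").reshape(h, w, c) if desired.
        yield buf

# Simulate two 2x2 RGB frames (12 bytes each) coming out of the pipe:
fake_pipe = io.BytesIO(bytes(range(12)) * 2)
frames = list(iter_frames(fake_pipe, width=2, height=2))
print(len(frames))  # -> 2
```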

There are like 1000 wrappers for doing this, but we use imageio-ffmpeg since we depend on imageio anyway for other video/image formats and since this plugin plays nicely with the ffmpeg binary in conda-forge as well as system-level ffmpeg binaries available in the PATH. It also has nice checks for all sorts of edge cases and handles the packaging into PIL/numpy gracefully. You can see the core logic for frame reading here -- pretty straightforward.

The only thing that feels icky is having to spin up a subprocess and communicate via pipes. This is unavoidable, but the nice thing is that every OS has solid support for pipes, with overall performance (while governed by multiple factors) generally following the trend of Linux > macOS > Windows. FFMPEG itself also has a pretty fast startup time, so you probably won't feel that overhead. Empirically, when comparing this approach to opencv, av or even some straight-to-GPU-memory decoders like decord, we've seen almost no difference. torchcodec is one to watch since it's in active development, but right now the pipe-based approach is still faster than nvdec even in their implementation.

Takeaways

  • Video encoding wasn't really built for frame-accurate seeking, and video acquisition software even less so.
  • Given a random video file, it's not very easy to check whether it is reliably seekable.
  • Silver bullet is to re-encode with ffmpeg -y -i "input.mp4" -c:v libx264 -pix_fmt yuv420p -preset superfast -crf 23 "output.mp4". This works 100% of the time.
  • The most reliable decoding approach is to use pipes, such as with imageio-ffmpeg.

Hope this deep dive is helpful! I'll try to cross link to here from other places too for findability.

@bjudkewitz
Author

bjudkewitz commented Oct 12, 2024

Hi @talmo, thank you for this very helpful summary!

I'm a big fan of SLEAP and you obviously have a lot of experience working with a wide variety of sources. I'm very inclined to go with your recommendation, but what still holds me back is the speed – at least in my perhaps naive test – of imageio. Here is a quick comparison with an h264 encoded mp4 (on ubuntu linux):

[screenshot: seek-timing benchmark comparing imageio against FastVideoReader on an h264-encoded mp4]

Is there a faster approach for frame-accurate seeking with imageio-ffmpeg? Or do you think another wrapper might be faster?

EDIT: I should add that FastVideoReader is the PyAV-based class currently being used in napari-pyav

@bjudkewitz
Author

I gave napari-pyav a quick try with an RGB .mp4 video, but it failed (probably due to the strict timestamp checking that was mentioned above). @bjudkewitz, I can open an issue on the napari-pyav repo if you want, but I also understand if you don't wish to be bombarded with noise at this early stage.

@niksirbi I guess the future of napari-pyav is a bit uncertain at this stage, but I'd be curious to hear more about the issue (feel very free to open one, especially if you can link to a file that fails)

@postpop
Contributor

postpop commented Oct 13, 2024

Thanks for the in-depth explanation of the issues, @talmo! To me, it looks like using imageio with pyav as the default backend (offering ffmpeg as an option) would be the way forward. It seems sufficiently fast and accurate for working with videos in napari, and it should be straightforward to implement. @bjudkewitz's FastVideoReader is faster but fails for some videos. Not sure why that is; we could look into it in the future to make FastVideoReader work with more video formats.

@bjudkewitz
Author

bjudkewitz commented Oct 13, 2024

Some thoughts on minimal criteria for a video reader in data analysis:

  1. it should seek/read deterministically (random seeking to a given index should always result in the same output frame, regardless of previous frame location)
  2. frame indices should be identical between seek and sequential read of all frames from the start
  3. all frames in the file should be indexed according to their pts (or dts) order

For our work, I would say we need a reader that fulfills at least criteria 1 and 2 (if some frames are out of order it would be acceptable, as long as they are displayed deterministically). For others, all three might be important.

Based on experience, opencv can fail on criteria (1) and (2).
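Criteria (1) and (2) can be checked mechanically against any reader. The sketch below assumes a hypothetical reader exposing a `read(i)` method that returns frame bytes; `GoodReader` and `SeekyReader` are toy stand-ins, the latter mimicking a reader whose output depends on the previous seek position:

```python
import hashlib
import random

def is_deterministic(reader, n_frames, trials=20, seed=0):
    """Criterion 1: random seeks to the same index always yield the same bytes."""
    rng = random.Random(seed)
    seen = {}
    for _ in range(trials):
        i = rng.randrange(n_frames)
        digest = hashlib.sha256(reader.read(i)).hexdigest()
        if seen.setdefault(i, digest) != digest:
            return False  # same index produced different pixels
    return True

def matches_sequential(reader, n_frames):
    """Criterion 2: seeking to i returns the same frame a sequential pass saw."""
    sequential = [reader.read(i) for i in range(n_frames)]  # in-order pass
    return all(reader.read(i) == sequential[i] for i in reversed(range(n_frames)))

class GoodReader:
    """Toy stand-in: frame i is fully determined by i."""
    def read(self, i):
        return str(i).encode()

class SeekyReader:
    """Toy failure mode: output depends on the previously sought index."""
    def __init__(self):
        self.prev = -1
    def read(self, i):
        out = str(self.prev).encode()
        self.prev = i
        return out

print(is_deterministic(GoodReader(), 10))     # -> True
print(matches_sequential(GoodReader(), 10))   # -> True
print(matches_sequential(SeekyReader(), 10))  # -> False
```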

A superficial inspection of imageio-pyav makes me think it could fail silently on (2), but I'd need to inspect it a bit more. Linking to the relevant sections of imageio-pyav seek and read code.

Seeking constant frame rate videos in imageio-pyav is quite similar to FastVideoReader, except that the latter fails if the resulting frame has a different pts from the one requested. One could turn the exception into a mere warning and return the frame anyway. I suspect this would work similarly to how imageio-pyav works now (except with a warning when things are wrong).
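Turning that strict check into a warning could look like the following (the names `check_pts`, `frame_pts`, and `expected_pts` are hypothetical, standing in for whatever the reader tracks internally):

```python
import warnings

def check_pts(frame_pts, expected_pts, strict=False):
    """Compare the decoded frame's pts against the requested one."""
    if frame_pts == expected_pts:
        return
    msg = f"seek landed on pts {frame_pts}, expected {expected_pts}"
    if strict:
        raise ValueError(msg)  # FastVideoReader-style hard failure
    warnings.warn(msg)         # relaxed mode: warn, return the frame anyway

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_pts(1001, 1000, strict=False)
print(len(caught))  # -> 1 warning instead of an exception
```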

imageio-pyav handles variable frame rate videos in a different way: it just rewinds to the first frame and reads frame by frame until the target. That's obviously very slow for frames that are not at the very beginning. FastVideoReader currently doesn't handle variable frame rate and would likely fail due to a pts mismatch. It would be easy to make it handle it like imageio-pyav, but I'm inclined to disallow seeking in such cases and just emit a warning that gives instructions on transcoding (basically Talmo's superfast recipe above). Maybe we should do the same if there are b-frames present in the stream. PyAV exposes this information via container.format.variable_fps and stream.codec_context.has_b_frames, respectively.
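Using the two PyAV attributes just mentioned, that policy reduces to a small predicate. The sketch below duck-types the objects with stubs so it runs without PyAV installed; with a real file you would pass the opened container and its video stream instead:

```python
from types import SimpleNamespace

def seeking_is_risky(container, stream):
    """Flag streams where fast seeking should be disallowed or warned about."""
    return bool(container.format.variable_fps
                or stream.codec_context.has_b_frames)

# Stubs mimicking the PyAV attribute layout named above:
cfr = SimpleNamespace(format=SimpleNamespace(variable_fps=False))
vfr = SimpleNamespace(format=SimpleNamespace(variable_fps=True))
no_b = SimpleNamespace(codec_context=SimpleNamespace(has_b_frames=False))
with_b = SimpleNamespace(codec_context=SimpleNamespace(has_b_frames=True))

print(seeking_is_risky(cfr, no_b))    # -> False: safe to seek
print(seeking_is_risky(vfr, no_b))    # -> True: variable frame rate
print(seeking_is_risky(cfr, with_b))  # -> True: B-frames present
```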

The main difference between the two pyav approaches is how the decoded frame is converted into an RGB numpy array. imageio-pyav seems to use an ffmpeg filter pipeline, while FastVideoReader uses libav's pretty efficient sws_scale (frame_obj.to_ndarray() in pyav). I suspect this is the main reason why their code is slower. I don't quite understand why they chose the less efficient route, except maybe for compatibility across their plugin options.

@niksirbi
Collaborator

@niksirbi I guess the future of napari-pyav is a bit uncertain at this stage, but I'd be curious to hear more about the issue (feel very free to open one, especially if you can link to a file that fails)

I've opened an issue here: danionella/napari-pyav#1

It contains a link to the offending video, to help with debugging/testing.

@niksirbi
Collaborator

Thanks a lot @talmo and @bjudkewitz for the detailed explanations, they have definitely improved my understanding of the problem. This kind of information should be made available as a doi-ed public document, to be honest. Lots of people in the field struggle with similar issues.

I hope this discussion helps us move towards a functional napari plugin for video playback.

@talmo

talmo commented Nov 15, 2024

Circling back here given the relevance: check out the release notes and PR stack around robustifying video seeking support in rerun 0.20.0.

I haven't dug into it yet, but it looks like there are some promising recipes for this issue!

@postpop
Contributor

postpop commented Nov 15, 2024

Thanks @talmo and @bjudkewitz for your input!

It seems there are solutions for robust and accurate video reading, though currently only in Rust. Given that existing Python solutions are imperfect at best, developing a robust and fast video reading framework in Python would benefit us all. Rerun appears to have found some viable solutions that we might be able to adapt.

Regarding napari-video, I suggest a two-phase approach:

  1. Replace the OpenCV backend with a variant of @bjudkewitz's FastVideoReader approach.
  2. Implement some of the recipes @talmo mentioned from Rerun to improve accuracy, robustness, and performance (this will require significant development effort and would be best done in a separate library that napari-video and other packages can rely on).

I will look into implementing step 1, but things are busy around here atm, so this will probably not happen in the very near future. Any help would be greatly appreciated!

@bjudkewitz
Author

Hi all, in case this is useful to anyone: we've consolidated some much-used video (and hdf5) IO tools under https://github.com/danionella/daio (see daio.video.VideoReader; it could be useful for napari with the argument format='rgb').
