
Support yuv444p/yuvj444p in to_ndarray/from_ndarray #788

Merged: 1 commit into PyAV-Org:main on Oct 30, 2023
Conversation

@NPN (Contributor) commented May 21, 2021

No description provided.

@NPN changed the title from "Support yuv444p in to_ndarray" to "Support yuv444p/yuvj444p in to_ndarray/from_ndarray" on May 21, 2021
@WyattBlue (Member) left a comment


This doesn't look right. Applying these changes with a yuv444p or yuvj444p pixel format produces a video with wild distortions.

[Screenshots: original frame (mpv-shot0001) vs. processed frame (mpv-shot0002)]

I'm curious: how did you determine how the data was processed? I'm trying to add support for yuv420p10le, and I'm struggling to find good resources.


Command used to generate the test video:

ffmpeg -f lavfi -i testsrc -t 30 -pix_fmt yuv444p testsrc.mp4

@NPN (Contributor, Author) commented Aug 26, 2021

How are you getting that error? I generated the test video and used this code, which reads in the video, converts each frame to a yuv444p ndarray, then back to a yuv444p frame, and then writes out the frame:

import av

with av.open("testsrc.mp4", "r") as vin, av.open("out.mp4", "w") as vout:
    sout = vout.add_stream("h264", rate=25)
    sout.width = 320
    sout.height = 240
    sout.pix_fmt = "yuv444p"

    for frame in vin.decode(video=0):
        assert frame.format.name == "yuv444p"
        array = frame.to_ndarray()
        assert array.shape == (240, 320, 3)
        new_frame = av.VideoFrame.from_ndarray(array, "yuv444p")
        for packet in sout.encode(new_frame):
            vout.mux(packet)

    for packet in sout.encode():
        vout.mux(packet)

For me, the output out.mp4 looks the same as testsrc.mp4.


As for figuring out the pixel format, I think I searched and read until I knew enough to understand the terminology around pixel formats. Then I could reference FFmpeg's pixfmt.h header, which has some comments about each pixel format. For yuv420p10le, it says:

Notice that each 9/10 bits sample is stored in 16 bits with extra padding.
...
planar YUV 4:2:0, 15bpp, (1 Cr & Cb sample per 2x2 Y samples), little-endian

So we know that this format is:

  • planar: Pixels are stored as YYY...UUU...VVV..., as opposed to packed/interleaved, which is YUVYUV...

  • YUV: as opposed to RGB

  • 4:2:0 chroma subsampling (Wikipedia has a section on the chroma subsampling notation)

  • 15bpp (bits per pixel): 4 real pixels are stored as 4 luma (Y) samples and 2 chroma (Cr, Cb) samples. Each sample is 10 bits, so that's (6 samples * 10 bits) / 4 real pixels = 15 bits per pixel

  • little-endian: This means that each 10-bit sample is packed into 16 bits as follows:

                   MSB      LSB
                   v        v
    10-bit pixel:  0123456789
    
                   Byte 1      Byte 2
                   v           v
    16-bit pixel:  23456789    ......01 
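
To make the byte order concrete, here is a minimal NumPy sketch (the sample value is made up for illustration):

import numpy as np

# A hypothetical 10-bit sample, 0b1011010110 (= 726), stored little-endian in 16 bits:
raw = np.array([0b11010110, 0b00000010], dtype=np.uint8)  # low byte first
sample = raw.view("<u2")[0]  # reinterpret the two bytes as one little-endian uint16
assert sample == 0b1011010110  # recovers the 10-bit value, zero-padded to 16 bits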
    

If you understand all the terminology, the description for each format should be enough. Unfortunately, I don't know of a resource that describes everything in one place, but as long as you search around, look at existing code, etc., you should be good.

@FirefoxMetzger (Contributor)

This is ultimately an API design decision, but I am not 100% sure pyAV should convert yuv444p to channel-last internally. The p indicates a planar pixel format, so I would consequently have expected the axis order [channel, height, width], not [height, width, channel]. The latter I would have expected from yuyv444 (which is a packed yuv444).

Doing this internal conversion may result in interesting behavior. The final np.moveaxis in this PR creates a view into a planar buffer (not a copy!), so depending on which application consumes the data later on, you may see artifacts due to misaligned channels (@WyattBlue's comment). Here is a snippet to demonstrate the "problem":

>>> import av
>>> import numpy as np
>>> foo = av.open("testsrc.mp4")      
>>> frame = next(foo.decode(video=0))
>>> np_frame = np.stack([np.frombuffer(frame.planes[idx], dtype=np.uint8) for idx in range(3)]).reshape(3, frame.height, frame.width)
>>> np_frame2 = np.moveaxis(np_frame, 0, -1)
>>> np_frame2.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

Another point is that np.stack will definitely create a copy of the frame's data, which could be avoided by directly using frame._buffer instead of splitting the data into planes and reassembling later. This suggestion does come with a grain of salt though, because I don't know if FFMPEG guarantees contiguity of planes (though I don't see any obvious reason why it shouldn't be able to do so), and further because I don't know the memory management model that pyAV uses for frames.
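
A quick sketch (with dummy arrays, not PyAV data) showing that np.stack allocates fresh memory:

>>> import numpy as np
>>> planes = [np.zeros((4, 4), dtype=np.uint8) for _ in range(3)]
>>> np.stack(planes).flags["OWNDATA"]  # a new buffer, independent of the inputs
True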


@WyattBlue You can find a good overview/breakdown of the various names used by ffmpeg here: https://ffmpeg.org/doxygen/trunk/pixfmt_8h_source.html (it's C code, but the comments are the important part).

yuv420p10le is a bit more involved than yuv444p, because it is subsampled and not 8-bit. If you want to get it into a similar format as discussed here ([height, width, channels]), you have to upsample the UV channels. (Fortunately, the data comes byte-aligned instead of being a bitstream)

>>> bar = av.open("testsrc.mp4")
>>> frame = next(bar.decode(video=0))
>>> # cast the planes to np.uint16 (they are 10-bit each in little endian, but have 2-byte alignment)
>>> y_plane, u_plane, v_plane = tuple(np.frombuffer(frame.planes[idx], dtype="<u2").reshape(frame.planes[idx].height, frame.planes[idx].width) for idx in range(3))
>>> # upsample U and V to match Y plane (this essentially creates a YUV444p image)
>>> np_frame = np.stack([y_plane, np.repeat(np.repeat(u_plane, 2, axis=0), 2, axis=1), np.repeat(np.repeat(v_plane, 2, axis=0), 2, axis=1)])
>>> # as per the above, if you want channel-last contiguous data you need to copy or use axis=-1 when stacking
>>> np_frame2 = np.moveaxis(np_frame, 0, -1).copy()

The above is a manual example of doing the extraction. You can alternatively use frame.reformat("yuyv444") and have FFMPEG do this conversion for you. While at it, why not reformat to RGB or whichever your target format is?
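
For example, a minimal sketch of the reformat route (assuming the 320x240 testsrc input; FFMPEG's swscale handles the subsampling and byte unpacking):

>>> rgb_frame = frame.reformat(format="rgb24")
>>> rgb_frame.to_ndarray().shape
(240, 320, 3)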

@NPN (Contributor, Author) commented Feb 8, 2022

@FirefoxMetzger Thanks for the suggestion--it does make more sense to put channels first. I think I was blindly following the order of the RGB/BGR/etc formats without realizing that they were packed.

@WyattBlue (Member)

@NPN Your last commit solved the problem I had before. This pull request is ready to be merged.

@FirefoxMetzger (Contributor)

@jlaine When you do your next round of issue reporting, could you let us know pyAV's policy on tests and coverage? Is it mandatory, and if so to what extent?

Side question: How does pyAV manage FFmpeg's frame object? Is it ref-counted and decommissioned by FFmpeg (afaik the default), or are you handling the frame yourself? I.e., is it sane to do something like gray_image = np.frombuffer(frame.planes[0], dtype=np.uint8).reshape(...) in Python land, or do I have to worry that FFmpeg will garbage collect that buffer because it doesn't know about numpy holding a pointer/reference to it? I'm asking because this has performance implications in this PR, but also in other places.

@jlaine (Member) commented Feb 25, 2022

Hi @FirefoxMetzger, yes, we do need unit tests for these new formats. I unfortunately haven't been able to wire up an automatic check for this; I'm not sure how coverage works on Cython code. Could we also have this code rebased on top of main, as there have been some changes in VideoFrame?

I'm not sure I understand the question about memory ownership? As long as you hold a reference to the frame, I don't expect its buffer to disappear under your feet?

EDIT: I've rebased on top of main, but we do need tests

@FirefoxMetzger (Contributor)

> I unfortunately haven't been able to wire up an automatic check for this; I'm not sure how coverage works on Cython code.

I guess you are aware of coverage.py's support for Cython (see here)? I know it exists, but I've never actually set it up myself.

After you get the coverage xml, it should behave like any other project, i.e., upload to CodeCov (or another provider) and have it hook into CI to pass/fail the test suite based on coverage. This part I've set up for ImageIO, so I'm happy to help if you decide to go for it.

> I'm not sure I understand the question about memory ownership? As long as you hold a reference to the frame, I don't expect its buffer to disappear under your feet?

>>> import av
>>> import numpy as np
>>> frame = av.VideoFrame(16, 16, "rgb24")
>>> no_copy_array = np.frombuffer(frame.planes[0], dtype=np.uint8).reshape((16, 16, 3))
>>> del frame  # will this be dangerous?

Here, we created the image buffer via lib.av_image_alloc during the frame's construction, and we free it via av_freep when the frame is destroyed. However, Python land still holds a reference to that buffer (no_copy_array) which FFmpeg doesn't know about, so it will happily free the buffer. At least I think that's what will happen, but I am not sure. I wonder if this could lead to undefined behavior, especially if we do the above on a decoded frame instead.

@jlaine (Member) commented Feb 25, 2022

Feel free to open a new discussion on the memory ownership question, I'd rather keep discussion here narrowly focused on the PR.

@jlaine (Member) commented Feb 25, 2022

I've been trying to put together a unit test for this, and the test would look like this:

    def test_ndarray_yuv444p(self):
        array = numpy.random.randint(0, 256, size=(3, 480, 640), dtype=numpy.uint8)
        frame = VideoFrame.from_ndarray(array, format="yuv444p")
        self.assertEqual(frame.width, 640)
        self.assertEqual(frame.height, 480)
        self.assertEqual(frame.format.name, "yuv444p")
        self.assertNdarraysEqual(frame.to_ndarray(), array)

=> AFAICT this would be the only format for which our ndarray has the channel along the first axis. For all the other formats it looks like (height, width, channel)?

@FirefoxMetzger (Contributor) commented Feb 26, 2022

> AFAICT this would be the only format for which our ndarray has the channel along the first axis. For all the other formats it looks like (height, width, channel)?

There are also yuv420p and yuvj420p, which are supported and technically channel-first; for these formats, however, it seems like we just get an array of lines:

>>> import av
>>> foo = av.VideoFrame(5*16, 5*16, "yuv420p")
>>> bar = foo.to_ndarray()
>>> bar.shape
(120, 80)

So I guess the answer is no: you can get layouts other than (height, width, channel).
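
As a sketch of what that layout means (assuming the 80x80 frame above and that the planes are simply concatenated), you can slice the planes back out:

>>> y = bar[:80]                      # the 80x80 Y plane
>>> u = bar[80:100].reshape(40, 40)   # the 40x40 U plane, flattened into 20 rows of 80
>>> v = bar[100:120].reshape(40, 40)  # the 40x40 V plane, likewise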

I wonder if subsampled formats like this should raise an error, promote to yuv444p, or return macro-pixel strided arrays instead of an array of lines, but that's a different discussion. Is it worth having, @jlaine? If so, I'll open a new issue.

@NPN (Contributor, Author) commented Mar 29, 2022

I have added tests based on @jlaine's comment above. Let me know if anything else is needed.

@jlaine (Member) commented Mar 30, 2022

As stated, I'm still not comfortable with the array dimensions; this creates an element of surprise for the user: the other formats do (height, width, channels). Either convince me or change it :)

@FirefoxMetzger (Contributor) commented Mar 30, 2022

> As stated, I'm still not comfortable with the array dimensions; this creates an element of surprise for the user: the other formats do (height, width, channels). Either convince me or change it :)

@jlaine What about yuv420p and yuvj420p? Both are supported and don't return (height, width, channels). (See #788 (comment) for an example)

@jlaine (Member) commented Mar 30, 2022

>> As stated, I'm still not comfortable with the array dimensions; this creates an element of surprise for the user: the other formats do (height, width, channels). Either convince me or change it :)
>
> @jlaine What about yuv420p and yuvj420p? Both are supported and don't return (height, width, channels). (See #788 (comment) for an example)

Those formats decimate the chrominance, so there is no way they can have a "normal" array, regardless of the order of dimensions :)

@FirefoxMetzger (Contributor)

> Those formats decimate the chrominance, so there is no way they can have a "normal" array, regardless of the order of dimensions :)

@jlaine What would be the difference between copying pixels here (channel-first to channel-last) and copying pixels for yuv420p? (we do need a copy here, because #788 (review))

yuv420p shares a chroma pixel amongst multiple luma pixels, so making a new copy that duplicates values and serves the familiar [height, width, channels] format seems about as surprising to me as yuv444p (which is planar/channel-first) serving [height, width, channels] (which is channel-last). Personally, I would not copy in either case, because the result is really surprising and unexpected (to me). I ask for something that is channel-first (yuv444p), but get something that is channel-last instead; worse, yuv444 is - afaik - not even a valid format in FFMPEG. If we have to do a copy, I would at least like it to be consistent across YUV formats, in that they all make the copy to [height, width, channels].

@NPN (Contributor, Author) commented Mar 30, 2022

For channels first, the benefit is faithfulness to the underlying pixel format. So, code which operates on the raw ndarray data and expects a planar ordering will work correctly (e.g. WyattBlue's example above).

For channels last, the benefit is consistency with the other non-chroma subsampled formats. So, if you just want to get/set pixels using ndarray indexing, then you don't have to worry about whether the underlying format is planar or packed. Also, Pillow seems to always use channels last (though their YCbCr is based on JPEG, anyway).
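
To illustrate the indexing difference, a minimal sketch with dummy arrays (array_cf and array_cl are hypothetical names, not PyAV output):

import numpy as np

array_cf = np.zeros((3, 240, 320), dtype=np.uint8)  # channels-first: (channel, height, width)
array_cl = np.moveaxis(array_cf, 0, -1)             # channels-last:  (height, width, channel)
y, x = 120, 160
assert (array_cf[:, y, x] == array_cl[y, x, :]).all()  # same pixel, two conventions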

Honestly, I'm not sure which is better. Unless someone else has an opinion, we can go with channels last, if that's what you prefer. I can just go back to the old code (after changing it to make a copy instead of a view, since np.moveaxis seems to cause problems?)

Edit: Somehow I missed FirefoxMetzger's comment above when writing this response. Those arguments seem pretty good, so I suppose it's up to what jlaine thinks.

@WyattBlue (Member)

@NPN @FirefoxMetzger @jlaine I have merged these changes as-is in my fork.

@FirefoxMetzger (Contributor)

Nice!

@WyattBlue Are you planning on forking PyAV and maintaining it, or is this more of a "one-off" thing?

@WyattBlue (Member)

> Nice!
>
> @WyattBlue Are you planning on forking PyAV and maintaining it, or is this more of a "one-off" thing?

I definitely plan on maintaining it. I'm already using the fork for my project, auto-editor.

Co-authored-by: Jeremy Lainé <jeremy.laine@m4x.org>
@jlaine jlaine merged commit 29ef112 into PyAV-Org:main Oct 30, 2023
17 checks passed
@jlaine (Member) commented Oct 30, 2023

Thanks @NPN, it's merged!
