
✨ [$4.000 BOUNTY 💰] Improve VideoPipeline (lower-overhead, no OpenGL, all pixel-formats, auto-scaling) #1837

Closed
mrousavy opened this issue Sep 22, 2023 · 5 comments


mrousavy commented Sep 22, 2023

What

$4.000 bounty to anyone who can solve this problem once and for all!!!!

One of VisionCamera's strong suits is its flexibility. You can configure the Camera for photo capture, video recording, frame processing, or even all at once.

This is roughly how VisionCamera's Camera Session is set up:

graph TD;
Camera-->Preview
Camera-->Photo
Camera-->VideoPipeline
VideoPipeline-->FP[Frame Processing]
VideoPipeline-->REC[Video Recording]

For the VideoPipeline, we have 6 requirements:

  1. ✅ It needs to be as efficient as possible. If no Frame Processor is attached, we need fast recording (native pixel format). If no recording is attached, we need fast Frame Processing (no roundtrips or render-passes). And both should work at the same time.
  2. ✅ It needs to support 3 pixel-formats: yuv, rgb and native (most efficient platform format)
  3. ✅ Video Recording should support h.264 and h.265 and should be hardware-accelerated.
  4. ✅ Video Recording should support Camera flipping (back <-> front) while recording and should auto-scale buffers to fit the size of the video.
  5. ✅ Video Recording should be synchronous so that we can theoretically detect faces in MLKit, then draw a dog mask over them for the resulting recording.
  6. ✅ Frame Processing should receive CPU/GPU buffers of the pixel data (see point 2. for formats). Ideally, this should be the android.media.Image type, but if that's not possible we can also use some other form of buffer (ByteBuffer/AHardwareBuffer)

iOS

On iOS, this was relatively easy to implement:

graph TD;
  Camera-->AVCaptureVideoDataOutput
  AVCaptureVideoDataOutput-->|"captureOutput(_:didOutput:from:)"| CMSampleBuffer
  CMSampleBuffer-->CVPixelBuffer-->FP["Frame Processor (e.g. MLKit)"]
  CMSampleBuffer-->AVAssetWriter

The CMSampleBuffer type is amazing: it exposes the GPU-backed buffer to the CPU (via IOSurface), but the data can also stay entirely on the GPU without any CPU copies if we don't use MLKit and only record to a file.

For the 6 main requirements:

  1. ✅ It is as efficient as possible: it uses the most efficient native format (e.g. kCVPixelFormatType_420YpCbCr8BiPlanarFullRange) and doesn't involve any render passes or CPU copies.
  2. ✅ We can fully configure the pixel formats (yuv, rgb or native) and the AVAssetWriter understands all of them.
  3. ✅ AVAssetWriter can record either in h.264 or h.265 HEVC.
  4. ✅ We can easily switch the Camera device (e.g. front <-> back) while recording, and the AVAssetWriter will handle scaling automatically if the buffers are a different size than what we originally configured it to.
  5. ✅ The Video Pipeline is synchronous; we can theoretically first detect faces with ML, then draw a dog mask over them in the recording.
  6. ✅ We have access to a CMSampleBuffer, which is a GPU-backed buffer; we can easily get CPU access to it using AVFoundation APIs.

Android

On Android, however, it seems like this approach is simply not possible. There is a newer type added in API 26 called HardwareBuffer which seems similar to CMSampleBuffer, but it is still not widely exposed in APIs, and all of the Camera/android.media APIs expect Surfaces.

Here are a few things I tried:

1. Separate outputs

Use separate Camera outputs: attach a MediaRecorder and an ImageReader directly to the Camera. This does not work because the Camera only allows 3 outputs and we already have 3 (preview, photo, video); a rough wiring sketch follows below the checklist. Also:

  1. ✅ Is as efficient as possible, no client code involved if MediaRecorder is attached directly to the Camera output
  2. ✅ We can configure pixel format for the ImageReader
  3. ✅ We can configure h.264 or h.265
  4. ❌ We cannot switch the Camera Devices while recording as the MediaRecorder's surface will be destroyed when I re-open the Camera
  5. ❌ Our VideoPipeline is not synchronous.
  6. ✅ We would receive an android.media.Image in the ImageReader's callback
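For illustration, here is roughly what that "separate outputs" wiring looks like with Camera2. This is a minimal sketch, not VisionCamera's actual code - the previewSurface/recorderSurface/handler parameters and the fixed 1920x1080 size are placeholder assumptions, and note that preview + video + frame processing already occupy all 3 outputs, leaving no room for the photo ImageReader:

import android.graphics.ImageFormat
import android.hardware.camera2.CameraCaptureSession
import android.hardware.camera2.CameraDevice
import android.media.ImageReader
import android.os.Handler
import android.view.Surface

// Sketch: one CameraCaptureSession with three separate outputs.
fun createSeparateOutputsSession(
  device: CameraDevice,
  previewSurface: Surface,   // e.g. from a SurfaceView/TextureView
  recorderSurface: Surface,  // MediaRecorder.surface (after prepare())
  handler: Handler
) {
  // Third output: an ImageReader for Frame Processing (YUV for MLKit-style plugins)
  val frameReader = ImageReader.newInstance(1920, 1080, ImageFormat.YUV_420_888, 3)
  frameReader.setOnImageAvailableListener({ reader ->
    val image = reader.acquireLatestImage() ?: return@setOnImageAvailableListener
    // Run the Frame Processor on `image` here - note this runs independently of
    // (in parallel with) the recording, so the pipeline is not synchronous.
    image.close()
  }, handler)

  // All three outputs attach directly to the Camera - no custom pipeline in between.
  device.createCaptureSession(
    listOf(previewSurface, recorderSurface, frameReader.surface),
    object : CameraCaptureSession.StateCallback() {
      override fun onConfigured(session: CameraCaptureSession) { /* start a repeating request */ }
      override fun onConfigureFailed(session: CameraCaptureSession) { /* handle the error */ }
    },
    handler
  )
}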

2. ImageReader/ImageWriter

See #1789 + #1799 + #1834 for code

I tried to create an ImageReader/ImageWriter setup that just receives Images, then passes them through to the output Surface (MediaRecorder/MediaCodec):

graph TD;
  Camera-->ImageReader["VideoPipeline (ImageReader)"]
  ImageReader-->MLKit[MLKit Image Processing]
  ImageReader-->ImageWriter
  ImageWriter-->REC[MediaRecorder/MediaCodec]

This feels like the closest to what we have on iOS, and it seems like ImageReader/ImageWriter are really efficient as they just move buffers around. This is my code:

// TODO: Do I need those flags? Or no?
val flags = HardwareBuffer.USAGE_VIDEO_ENCODE or HardwareBuffer.USAGE_GPU_SAMPLED_IMAGE
val readerFormat = ImageFormat.YUV_420_888 // (or PRIVATE or RGBA_8888)
imageReader = ImageReader.newInstance(width, height, readerFormat, MAX_IMAGES, flags) // <-- API 29+
// ...
val mediaRecorder = ...
mediaRecorder.prepare()
// TODO: Does this need to be ImageFormat.PRIVATE instead?
val writerFormat = readerFormat
imageWriter = ImageWriter.newInstance(mediaRecorder.surface, MAX_IMAGES, writerFormat) // <-- API 29+

imageReader.setOnImageAvailableListener({ reader ->
  val image = reader.acquireNextImage()
  imageWriter.queueInputImage(image)
}, handler)

...but I couldn't really get this to work. A really smart guy on StackOverflow said that it is not guaranteed that MediaRecorder/MediaCodec can be fed Images from an ImageWriter, so sometimes it just silently crashes 🤦‍♂️

Also, I'm not sure what format the MediaRecorder/MediaCodec expects - so maybe this requires an additional conversion step:

graph LR;

R1["ImageReader (YUV)"]-->W1["ImageWriter (YUV)"]-->R2["ImageReader (PRIVATE)"]-->W2["ImageWriter (PRIVATE)"]-->REC["MediaRecorder/MediaCodec"]
R1-->MLKit[MLKit Image Processing]

...which is just ridiculous.
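Related to the "I'm not sure what format the MediaRecorder/MediaCodec expects" question above: one thing that can at least be checked at runtime is which input color formats the device's encoder advertises. A minimal sketch, with the "video/avc" MIME type and the log tag chosen just for illustration:

import android.media.MediaCodecInfo
import android.media.MediaCodecList
import android.util.Log

// Logs the input color formats advertised by the device's H.264 encoders.
// COLOR_FormatSurface = "feed me via a Surface" (implementation-defined/PRIVATE buffers),
// COLOR_FormatYUV420Flexible = the encoder also accepts YUV_420_888-style input buffers.
fun logAvcEncoderColorFormats() {
  for (info in MediaCodecList(MediaCodecList.REGULAR_CODECS).codecInfos) {
    if (!info.isEncoder) continue
    if (info.supportedTypes.none { it.equals("video/avc", ignoreCase = true) }) continue
    val capabilities = info.getCapabilitiesForType("video/avc")
    for (format in capabilities.colorFormats) {
      val name = when (format) {
        MediaCodecInfo.CodecCapabilities.COLOR_FormatSurface -> "COLOR_FormatSurface"
        MediaCodecInfo.CodecCapabilities.COLOR_FormatYUV420Flexible -> "COLOR_FormatYUV420Flexible"
        else -> "0x" + Integer.toHexString(format)
      }
      Log.d("VideoPipeline", "${info.name} accepts input color format $name")
    }
  }
}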

For the 6 main requirements:

  1. ✅ Is efficient as far as I know. No rendering, just moving Images around.
  2. ❌ We cannot configure pixel format for the ImageReader as the MediaRecorder/MediaCodec requires a PRIVATE format. If we feed it RGB/YUV data, it crashes.
  3. ✅ We can configure h.264 or h.265
  4. ❌ We cannot switch the Camera Devices while recording as both the ImageReader and the MediaRecorder are configured with a specific width/height, there's no scaling step involved here.
  5. ✅ Our VideoPipeline would be synchronous starting from the point we receive an Image from the ImageReader.
  6. ✅ We would receive an android.media.Image in the ImageReader's callback

3. Create a custom OpenGL Pipeline

See #1836 for code

I created a custom OpenGL pipeline that the Camera will render to, then we do a pass-through render pass to render the Frame to all the outputs:

graph TD;
Camera-->OpenGL["VideoPipeline (OpenGL)"]
OpenGL-->Pass[Pass-Through Shader]
Pass-->ImageReader-->MLKit[MLKit Image Processing]
Pass-->REC[MediaRecorder/MediaCodec]

But, this has four major drawbacks:

  1. It's really really complex to build (I already built it, see this PR, so not a real problem tbh)
  2. It seems like this is not as efficient as an ImageReader/ImageWriter approach, as we do an implicit RGB conversion and an actual render pass, whereas ImageReader/ImageWriter just move Image Buffers around (at least as far as I understood this)
  3. It only works in RGBA_8888, as OpenGL works in RGB. This means our frame processor (MLKit) does not work if it was trained on YUV_420_888 data - and that is a hard requirement.
  4. It is not synchronous, the ImageReader gets called at a later point. We could not really use information from the Frame to decide what gets rendered later (e.g. to apply a face filter).

As for the 6 main requirements:

  1. ❌ It works, but it has quite a large overhead (creating the GL context and doing 2 render passes with the pass-through shaders) compared to just moving Buffers around.
  2. ❌ We cannot use whatever pixel-format we want - OpenGL always operates in RGBA_8888. I have no idea how to get a YUV_420_888 Image from there.
  3. ✅ We can configure h.264 or h.265
  4. ✅ We can switch the Camera device while recording as OpenGL automatically handles the scaling - nice!
  5. ❌ Our VideoPipeline would not be synchronous, as we render to an ImageReader which calls the Frame Processor at some later point, whenever it has an Image available. This could be solved by rendering to a HardwareBuffer wrapped as a Texture/FBO, which we could then wrap in our Frame type - but then we no longer have an Image.
  6. ✅ We would receive an android.media.Image in the ImageReader's callback (unless we render to a HardwareBuffer; see point 5.)

4. AHardwareBuffer

This is something I couldn't get working yet and I'm not sure if it's possible, but my theory is to receive HardwareBuffers (which should represent GPU memory, afaik) directly, then somehow pass them to a MediaRecorder/MediaCodec for encoding:

graph TD;
Camera-->|???|HW[HardwareBuffer]
HW-->FP["Frame Processing (e.g. MLKit)"]
HW-->REC[MediaRecorder/MediaCodec]

...but I have no idea how to get direct low-level access to such buffers from the Camera. Can the Camera only render to Surfaces? These Surface abstractions are really annoying.
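For what it's worth, the only public path I know of to get at those GPU buffers is still Surface-based: let the Camera render into an ImageReader configured with ImageFormat.PRIVATE plus GPU usage flags, then pull the HardwareBuffer out of each Image via Image.getHardwareBuffer() (API 28+; the usage-flag overload of ImageReader.newInstance is API 29+). A minimal sketch - the size, usage flags and handler are placeholder assumptions, and it still doesn't answer how to hand such a buffer to MediaRecorder/MediaCodec, which only accept Surfaces:

import android.graphics.ImageFormat
import android.hardware.HardwareBuffer
import android.media.ImageReader
import android.os.Handler

// The Camera still renders into the ImageReader's Surface, but every Image exposes
// its underlying GPU buffer through Image.getHardwareBuffer().
fun createHardwareBufferReader(width: Int, height: Int, handler: Handler): ImageReader {
  val usage = HardwareBuffer.USAGE_GPU_SAMPLED_IMAGE or HardwareBuffer.USAGE_VIDEO_ENCODE
  val reader = ImageReader.newInstance(width, height, ImageFormat.PRIVATE, 3, usage)
  reader.setOnImageAvailableListener({ r ->
    val image = r.acquireLatestImage() ?: return@setOnImageAvailableListener
    val hardwareBuffer = image.hardwareBuffer // GPU memory handle (may be null)
    // TODO: hand the buffer to the Frame Processor and/or an encoder here.
    // The open question remains: MediaRecorder/MediaCodec want a Surface,
    // there is no public API to queue an AHardwareBuffer into them directly.
    hardwareBuffer?.close()
    image.close()
  }, handler)
  return reader
}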

5. FFmpeg

This is something I couldn't get working yet and I'm not sure if that's possible, but my theory is to use FFmpeg instead of MediaRecorder/MediaCodec to make the recording step simpler and more flexible:

graph TD;
Camera-->ImageReader
ImageReader-->FP["Frame Processing (e.g. MLKit)"]
ImageReader-->REC[FFmpeg]

...but I'm not sure if that would grant me any advantages. And also, I think FFmpeg just uses MediaCodec under the hood anyway - so that's something I could build myself.


At this point I'm pretty clueless, tbh. Is a synchronous video pipeline simply not possible at all on Android? I'd appreciate any pointers/help here; maybe I'm just not aware of some great APIs.

Happy to pay $4.000 to anyone who comes up with a solution for this problem once and for all.

@mrousavy mrousavy changed the title ✨ Improve VideoPipeline (lower-overhead, no OpenGL, all pixel-formats, auto-scaling) ✨ [$4.000 BOUNTY 💰] Improve VideoPipeline (lower-overhead, no OpenGL, all pixel-formats, auto-scaling) Sep 22, 2023
@mrousavy mrousavy added the 🤖 android Issue affects the Android platform label Sep 22, 2023
@mrousavy mrousavy pinned this issue Sep 22, 2023
@mrousavy (Owner, Author) commented:

TL;DR: As of today, VisionCamera uses an OpenGL pipeline on Android. Downsides:

  • 💥 It crashes when you try to use a format other than RGB. E.g. if you pass pixelFormat="yuv" it will crash (YUV because your MLKit plugin might be trained on YUV). RGB also isn't the most efficient format (PRIVATE or YUV is).
  • 🐌 It introduces some additional overhead for recording videos, as it converts from the native Camera format to RGB and then renders into the MediaRecorder. So it is theoretically slower than a native Camera2 app.
  • 🧡 It is not synchronous, so the Frame Processor is called in parallel. If at some point I decide to add Skia to VisionCamera (e.g. in V4), this architecture does not work: you can't detect faces first and then draw over the detected faces in the recording/preview, because again, it runs in parallel, not synchronously.

Instead, I want to use an approach similar to how it works on iOS - just pass GPU buffers (AHardwareBuffer?) around. I tried to use ImageReader/ImageWriter, but those don't play well with the MediaRecorder/MediaCodec interfaces. So for now, I've run out of ideas. The yuv pixel format just doesn't work in VisionCamera on Android today. I just think Android doesn't have any APIs that support that, lmao.

If someone has any ideas, please comment here. $4.000 bounty if you can solve this problem.

Feel free to share this, e.g. if you know someone who works at Android/Google or someone with Camera2/android.media/OpenGL experience.

@mrousavy (Owner, Author) commented:

I found a temporary solution to the problem: #1874

Basically, I plug an ImageReader and an ImageWriter in between the Camera and the OpenGL pipeline.

The problem here is that this does not support RGB, but at least it works in YUV and PRIVATE. I think most Frame Processors need YUV anyways.

As far as I can see, there is no real solution on Android right now.
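Roughly, the workaround wires up something like the sketch below (simplified, not the exact #1874 code - openGlInputSurface, the fixed size and the handler are placeholders I made up):

import android.graphics.ImageFormat
import android.media.ImageReader
import android.media.ImageWriter
import android.os.Handler
import android.view.Surface

// Camera --> ImageReader (YUV/PRIVATE) --> Frame Processor --> ImageWriter --> OpenGL pipeline's input Surface
fun wireUpPipeline(openGlInputSurface: Surface, handler: Handler): ImageReader {
  val format = ImageFormat.YUV_420_888 // or ImageFormat.PRIVATE - RGB is the one that doesn't fit here
  val reader = ImageReader.newInstance(1920, 1080, format, 3)
  val writer = ImageWriter.newInstance(openGlInputSurface, 3, format) // format overload needs API 29+

  reader.setOnImageAvailableListener({ r ->
    val image = r.acquireNextImage() ?: return@setOnImageAvailableListener
    // 1. Run the Frame Processor synchronously on `image` (a real android.media.Image).
    // 2. Forward the same Image into the OpenGL pipeline; queueInputImage() takes
    //    ownership of the Image and closes it once it has been consumed.
    writer.queueInputImage(image)
  }, handler)
  return reader
}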

Solution

I think this can be achieved if the Android OS adds two features:

  1. ImageWriter should be capable of automatically resizing Images to the target surface's size (width + height)
  2. MediaRecorder Surfaces should be capable of working in ImageFormat.YUV_420_888 and ImageFormat.PRIVATE formats.

I think 2. already works on most phones, but for some reason not on every phone. So 1. is the major point; if that works, we can scrap the entire OpenGL pipeline (a shit ton of C++ files).

I think I will create a feature request for that in the Android issue tracker, but they probably won't care about it, tbh. iOS has supported this since iOS 12 😅

@mrousavy mrousavy unpinned this issue Sep 30, 2023
@mrousavy (Owner, Author) commented:

Lmao even ChatGPT says that in a perfect world Android would be able to handle frame resizing automatically in the GPU Encoder. (Same as iOS in AVAssetWriter)

[screenshot of the ChatGPT response]

It even says that this might be possible with custom SIMD instructions 😂

@mrousavy (Owner, Author) commented:

NEON (SIMD) only works on ARM, and I'm definitely not gonna go down that rabbit hole 😂

@mrousavy (Owner, Author) commented:

CameraX has a stream-sharing feature - this is kinda what I am aiming for. Really cool stuff, no need for me to bring everything into OpenGL.

I hope it's as efficient as on iOS :)
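For reference, consuming that from the outside looks roughly like the sketch below: bind Preview + VideoCapture + ImageAnalysis in a single UseCaseGroup and let CameraX decide internally when it has to share (copy) a stream. This is only a hedged example against the public CameraX API, not VisionCamera code - the function name, the lifecycleOwner/preview parameters and the main-executor choice are my own placeholders:

import android.content.Context
import androidx.camera.core.CameraSelector
import androidx.camera.core.ImageAnalysis
import androidx.camera.core.Preview
import androidx.camera.core.UseCaseGroup
import androidx.camera.lifecycle.ProcessCameraProvider
import androidx.camera.video.Recorder
import androidx.camera.video.VideoCapture
import androidx.core.content.ContextCompat
import androidx.lifecycle.LifecycleOwner

// Preview + video recording + per-frame analysis in one group; if the device can't
// serve all of them as separate Camera streams, newer CameraX versions share one
// stream internally instead of failing - roughly what the OpenGL pipeline does by hand.
fun bindUseCases(context: Context, lifecycleOwner: LifecycleOwner, preview: Preview) {
  val providerFuture = ProcessCameraProvider.getInstance(context)
  providerFuture.addListener({
    val provider = providerFuture.get()

    val analysis = ImageAnalysis.Builder()
      .setOutputImageFormat(ImageAnalysis.OUTPUT_IMAGE_FORMAT_YUV_420_888)
      .build()
    analysis.setAnalyzer(ContextCompat.getMainExecutor(context)) { imageProxy ->
      // A Frame Processor would run here on the YUV ImageProxy.
      imageProxy.close()
    }

    val videoCapture = VideoCapture.withOutput(Recorder.Builder().build())

    val useCases = UseCaseGroup.Builder()
      .addUseCase(preview)
      .addUseCase(videoCapture)
      .addUseCase(analysis)
      .build()

    provider.unbindAll()
    provider.bindToLifecycle(lifecycleOwner, CameraSelector.DEFAULT_BACK_CAMERA, useCases)
  }, ContextCompat.getMainExecutor(context))
}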
