
Virtual Background with WebRTC in Android


Overview

Virtual backgrounds have become a must-have feature in the video-conferencing world. A virtual background lets us replace our natural background with an image or a video, and we can also upload custom images to use as the background.

👉 By the end of this wiki, you can expect the virtual background feature to look like this:

Virtual Background on Android (MediaPipe image segmentation)

Dependencies

Add the dependency for the ML Kit selfie segmentation library to the module's app-level Gradle file, which is usually app/build.gradle:

dependencies {
    implementation 'com.google.mlkit:segmentation-selfie:16.0.0-beta3'
}

Common WebRTC terms you should know

  1. VideoFrame: Contains the buffer of the frame captured by the camera device, in I420 format.
  2. VideoSink: Used to send the frame back to the WebRTC native source.
  3. VideoSource: Reads the camera device, produces VideoFrames, and delivers them to VideoSinks.
  4. VideoProcessor: An interface provided by WebRTC for updating the VideoFrames produced by a VideoSource.
  5. MediaStream: A WebRTC API that provides support for streaming audio and video data. It consists of zero or more MediaStreamTrack objects, representing various audio or video tracks.
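To see how these pieces fit together, here is a minimal sketch of setting up a camera track with the org.webrtc Android SDK. The names `peerConnectionFactory`, `videoCapturer`, `surfaceTextureHelper`, `appContext`, and `localRenderer` are assumed to come from your own WebRTC setup:

```kotlin
// Sketch only: assumes an initialized PeerConnectionFactory, an EglBase-backed
// SurfaceTextureHelper, and a CameraVideoCapturer from Camera1/Camera2Enumerator.
val videoSource = peerConnectionFactory.createVideoSource(/* isScreencast = */ false)

// The capturer pushes camera frames into the VideoSource through its CapturerObserver.
videoCapturer.initialize(surfaceTextureHelper, appContext, videoSource.capturerObserver)
videoCapturer.startCapture(1280, 720, 30)

// The VideoTrack fans frames out to any attached VideoSink (e.g. a SurfaceViewRenderer).
val videoTrack = peerConnectionFactory.createVideoTrack("camera_track", videoSource)
videoTrack.addSink(localRenderer)
```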

Approaches we thought of

  1. Updating the WebRTC MediaStream by passing it to the ML Kit selfie segmentation model and getting the updated stream back. But sadly, Android WebRTC does not give us a replaceTrack method for this.

  2. Updating the stream coming from the camera source and then passing it to WebRTC. We had some success with this, but then ran into issues using the updated stream in WebRTC.

  3. Creating another virtual video source from the camera source and using that as the input to the ML Kit API. But sending the updated stream back to WebRTC gave us issues.

  4. Using the Android CameraX APIs to read frames, but again, WebRTC doesn't support them.

After trying all these approaches without getting suitable results, we figured out that for our use case we need to do the processing on the VideoFrame itself.

Implementation in code

Getting the VideoFrame from WebRTC
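The org.webrtc Android SDK exposes a VideoProcessor interface for exactly this: it intercepts every frame the capturer delivers to the VideoSource and lets us forward a modified frame to the original sink. Below is a minimal sketch; the actual segmentation work is left as a placeholder, and `videoSource` is assumed to be the source created during your WebRTC setup:

```kotlin
import org.webrtc.VideoFrame
import org.webrtc.VideoProcessor
import org.webrtc.VideoSink

class VirtualBackgroundProcessor : VideoProcessor {

    private var sink: VideoSink? = null

    // WebRTC hands us the sink that normally receives the raw camera frames.
    override fun setSink(sink: VideoSink?) {
        this.sink = sink
    }

    override fun onCapturerStarted(success: Boolean) {}

    override fun onCapturerStopped() {}

    override fun onFrameCaptured(frame: VideoFrame) {
        // TODO: run segmentation on the frame and build a new VideoFrame here.
        val processedFrame = frame // placeholder: pass-through
        sink?.onFrame(processedFrame)
    }
}

// Attach the processor to the VideoSource before starting capture.
videoSource.setVideoProcessor(VirtualBackgroundProcessor())
```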

Initialize Mediapipe Image Segmenter
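A minimal sketch of initializing the segmenter, using the ML Kit Selfie Segmentation client declared in the dependency above. STREAM_MODE is chosen because we are processing a continuous stream of camera frames:

```kotlin
import com.google.mlkit.vision.segmentation.Segmentation
import com.google.mlkit.vision.segmentation.Segmenter
import com.google.mlkit.vision.segmentation.selfie.SelfieSegmenterOptions

// STREAM_MODE is intended for live camera feeds: results are smoothed across frames.
val options = SelfieSegmenterOptions.Builder()
    .setDetectorMode(SelfieSegmenterOptions.STREAM_MODE)
    .build()

val segmenter: Segmenter = Segmentation.getClient(options)
```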

Handle Person Mask from Mediapipe
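A sketch of running the segmenter on a frame and receiving the person mask. It assumes the VideoFrame's I420 buffer has already been converted to an ARGB Bitmap (`frameBitmap`, conversion helper not shown), and that `segmenter` is the client created above:

```kotlin
import android.graphics.Bitmap
import android.util.Log
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.segmentation.SegmentationMask

fun segmentFrame(frameBitmap: Bitmap, onMask: (SegmentationMask) -> Unit) {
    val inputImage = InputImage.fromBitmap(frameBitmap, /* rotationDegrees = */ 0)

    segmenter.process(inputImage)
        .addOnSuccessListener { mask ->
            // mask.buffer holds one float per pixel (0.0..1.0): the confidence
            // that the pixel belongs to the person in the foreground.
            onMask(mask)
        }
        .addOnFailureListener { e ->
            Log.e("VirtualBackground", "Segmentation failed", e)
        }
}
```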

Draw the segmented person and background on a canvas
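A sketch of the compositing step: the confidence mask is turned into an alpha mask, the person is cut out of the camera frame with a DST_IN transfer mode, and the result is drawn over the chosen background image. It assumes the mask has the same dimensions as the frame (the ML Kit default when the raw-size-mask option is not enabled):

```kotlin
import android.graphics.Bitmap
import android.graphics.Canvas
import android.graphics.Paint
import android.graphics.PorterDuff
import android.graphics.PorterDuffXfermode
import android.graphics.Rect
import com.google.mlkit.vision.segmentation.SegmentationMask

fun drawVirtualBackground(
    frameBitmap: Bitmap,
    mask: SegmentationMask,
    backgroundBitmap: Bitmap
): Bitmap {
    val width = mask.width
    val height = mask.height

    // Turn the per-pixel foreground confidences into an alpha-only mask bitmap.
    val maskPixels = IntArray(width * height)
    val buffer = mask.buffer
    buffer.rewind()
    for (i in maskPixels.indices) {
        val confidence = buffer.getFloat()
        maskPixels[i] = (confidence * 255).toInt() shl 24 // black, varying alpha
    }
    val maskBitmap = Bitmap.createBitmap(maskPixels, width, height, Bitmap.Config.ARGB_8888)

    // Cut the person out of the camera frame: DST_IN keeps frame pixels only where
    // the mask alpha is high.
    val person = Bitmap.createBitmap(width, height, Bitmap.Config.ARGB_8888)
    val personCanvas = Canvas(person)
    personCanvas.drawBitmap(frameBitmap, 0f, 0f, null)
    val maskPaint = Paint().apply { xfermode = PorterDuffXfermode(PorterDuff.Mode.DST_IN) }
    personCanvas.drawBitmap(maskBitmap, 0f, 0f, maskPaint)

    // Draw the background first, then the segmented person on top.
    val output = Bitmap.createBitmap(width, height, Bitmap.Config.ARGB_8888)
    val canvas = Canvas(output)
    canvas.drawBitmap(backgroundBitmap, null, Rect(0, 0, width, height), null)
    canvas.drawBitmap(person, 0f, 0f, null)
    return output
}
```

The composited bitmap then has to be converted back into an I420 VideoFrame buffer before it is handed to the VideoSink in onFrameCaptured.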

Task benchmarks

Here are the task benchmarks for the whole pipeline, based on the pre-trained model above. The latency figures are the average latency on a Pixel 6, using CPU / GPU respectively.

| Model Name | CPU Latency | GPU Latency |
| --- | --- | --- |
| SelfieSegmenter (square) | 33.46 ms | 35.15 ms |

Reference