Add Support for Video Input in Serverless API #8310
Hi @WorkTimer, could you please explain this in a bit more detail? If there are any code references, please share them as well.
Thank you for your response! In the current CVAT serverless API code, processing is done on individual image frames: the API can only receive and process a single frame per request. However, the new SAM2 model supports predicting across an entire video at once, as demonstrated in this example: https://github.com/facebookresearch/segment-anything-2/blob/main/notebooks/video_predictor_example.ipynb, which shows how to perform segmentation on a whole video. My suggestion is to enhance the CVAT serverless API to accept and process complete video files rather than just individual frames. This would allow models like SAM2, which are designed for video processing, to be integrated directly into CVAT for automatic segmentation and annotation of entire videos.
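For reference, the video predictor API from the linked notebook looks roughly like this (a minimal sketch adapted from the SAM2 README and notebook; the checkpoint/config paths, the frames directory, and the click coordinates are placeholders):

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder paths -- adjust to your local SAM2 checkpoint and config
checkpoint = "./checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # init_state takes a directory of JPEG frames extracted from the video
    state = predictor.init_state(video_path="./video_frames")

    # Prompt a single positive click on frame 0 for object id 1
    predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompt through the whole video to get per-frame masks
    for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()
```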
This update to the CVAT serverless API enables processing of entire video files, allowing for automatic segmentation of video content (demonstrated here with SAM's automatic mask generator; the same request/response pattern would extend to SAM2). The implementation follows the code structure and conventions found in CVAT's existing serverless functions:
```python
import base64
import json
import os
import tempfile

import cv2
import numpy as np


def init_context(context):
    # Load the SAM model once at function start-up and cache it on the context
    from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

    model_type = "vit_b"
    checkpoint_path = "/opt/nuclio/sam_vit_b.pth"
    sam = sam_model_registry[model_type](checkpoint=checkpoint_path)
    context.user_data.model = SamAutomaticMaskGenerator(sam)


def handler(context, event):
    context.logger.info("Handling request for video segmentation")
    try:
        # Decode the base64 video payload; cv2.VideoCapture cannot read from
        # an in-memory buffer, so spill the bytes to a temporary file first
        data = event.body
        video_data = base64.b64decode(data["video"])
        with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_in:
            tmp_in.write(video_data)
            input_path = tmp_in.name

        # Read all frames with OpenCV, converting BGR -> RGB for the model
        video = cv2.VideoCapture(input_path)
        frames = []
        while video.isOpened():
            ret, frame = video.read()
            if not ret:
                break
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        video.release()
        os.remove(input_path)

        if not frames:
            raise ValueError("Could not decode any frames from the input video")

        # Run the mask generator frame by frame and paint each mask with a
        # distinct label value (SamAutomaticMaskGenerator returns no class
        # ids, so the mask index is used as the label here)
        context.logger.info(f"Processing {len(frames)} frames")
        segmented_frames = []
        for frame in frames:
            masks = context.user_data.model.generate(frame)
            mask_image = np.zeros_like(frame)
            for idx, mask in enumerate(masks, start=1):
                mask_image[mask["segmentation"]] = idx  # wraps above 255 (uint8)
            segmented_frames.append(mask_image)

        # Encode the segmented frames into a video; cv2.VideoWriter also needs
        # a real file path, so write to a temporary file and read it back
        tmp_out = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
        output_path = tmp_out.name
        tmp_out.close()
        height, width, _ = segmented_frames[0].shape
        video_writer = cv2.VideoWriter(
            output_path, cv2.VideoWriter_fourcc(*"mp4v"), 24, (width, height)
        )
        for frame in segmented_frames:
            video_writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
        video_writer.release()

        # Base64-encode the output video for the JSON response
        with open(output_path, "rb") as f:
            encoded_video = base64.b64encode(f.read()).decode()
        os.remove(output_path)

        return context.Response(
            body=json.dumps({"video": encoded_video}),
            headers={},
            content_type="application/json",
            status_code=200,
        )
    except Exception as e:
        context.logger.error(f"Error processing video: {str(e)}")
        return context.Response(
            body=json.dumps({"error": str(e)}),
            headers={},
            content_type="application/json",
            status_code=500,
        )
```
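For completeness, a client could call such a function like this (a hypothetical sketch; the function URL depends on how the nuclio function is deployed, and `input.mp4` is a placeholder):

```python
import base64

import requests  # any HTTP client works

# Hypothetical nuclio endpoint -- the actual host/port depends on deployment
FUNCTION_URL = "http://localhost:8080"

with open("input.mp4", "rb") as f:
    payload = {"video": base64.b64encode(f.read()).decode()}

resp = requests.post(FUNCTION_URL, json=payload, timeout=600)
resp.raise_for_status()

# Decode the segmented video returned by the handler and save it to disk
with open("segmented.mp4", "wb") as f:
    f.write(base64.b64decode(resp.json()["video"]))
```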
Thank you for the update and the detailed explanation! I have reviewed the changes you've made to the code and understand the newly implemented features. This solution aligns perfectly with what I was looking for, and I really appreciate the time and effort you've put into making these improvements. I'm looking forward to seeing it in action within CVAT.
Actions before raising this issue
Is your feature request related to a problem? Please describe.
The current serverless API in CVAT only supports processing individual frames, which limits its ability to handle tasks that require video input. My deep learning model needs to process entire videos, and the current frame-by-frame processing is not sufficient for this purpose.
Describe the solution you'd like
I would like the serverless API to be enhanced to support video input, allowing entire videos to be passed to the API rather than just individual frames. This would enable models that require video input for processing to function correctly within the CVAT serverless environment.
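As an illustration of the requested change, the serverless request body could grow a video variant alongside the existing per-frame one (a hypothetical schema sketch; the exact field names would be up to the CVAT maintainers):

```python
# Current per-frame contract (simplified): one base64-encoded image per call
frame_request = {
    "image": "<base64-encoded frame>",
}

# Hypothetical video contract: the whole clip in one call, so a model such
# as SAM2 can propagate masks across frames server-side
video_request = {
    "video": "<base64-encoded video file>",
    # optional prompts, e.g. points on a given frame for SAM2-style models
    "prompts": [{"frame": 0, "points": [[210, 350]], "labels": [1]}],
}
```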
Describe alternatives you've considered
No response
Additional context
No response