Add Support for Video Input in Serverless API #8310
Hi @WorkTimer, could you please explain this in a bit more detail? If there are any code references, please share them as well.
Thank you for your response! In the current CVAT serverless API code, processing is done on individual image frames: the API can only receive and process a single frame per request. However, the new SAM2 model supports predicting across an entire video at once, as demonstrated in this example: https://github.com/facebookresearch/segment-anything-2/blob/main/notebooks/video_predictor_example.ipynb, which shows how to perform segmentation on a whole video. My suggestion is to enhance the CVAT serverless API to accept and process complete video files rather than just individual frames. This would allow models like SAM2, which are designed for video processing, to be integrated directly into CVAT for automatic segmentation and annotation of entire videos.
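For reference, the video predictor API from the linked notebook looks roughly like this (a minimal sketch adapted from the SAM2 README and notebook; the checkpoint/config paths, the frames directory, and the click coordinates are placeholders):

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder paths -- adjust to your local SAM2 checkpoint and config
checkpoint = "./checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # init_state takes a directory of JPEG frames extracted from the video
    state = predictor.init_state(video_path="./video_frames")

    # Prompt a single positive click on frame 0 for object id 1
    predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the prompt through the whole video to get per-frame masks
    for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()
```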
This update to the CVAT serverless API enables processing of entire video files, allowing for automatic segmentation of video content (demonstrated here with SAM's automatic mask generator; the same request/response pattern would extend to SAM2). The implementation follows the code structure and conventions found in CVAT's existing serverless functions:
```python
import base64
import json
import os
import tempfile

import cv2
import numpy as np


def init_context(context):
    # Load the SAM model once at function start-up and cache it on the context
    from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

    model_type = "vit_b"
    checkpoint_path = "/opt/nuclio/sam_vit_b.pth"
    sam = sam_model_registry[model_type](checkpoint=checkpoint_path)
    context.user_data.model = SamAutomaticMaskGenerator(sam)


def handler(context, event):
    context.logger.info("Handling request for video segmentation")
    try:
        # Decode the base64 video payload; cv2.VideoCapture cannot read from
        # an in-memory buffer, so spill the bytes to a temporary file first
        data = event.body
        video_data = base64.b64decode(data["video"])
        with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp_in:
            tmp_in.write(video_data)
            input_path = tmp_in.name

        # Read all frames with OpenCV, converting BGR -> RGB for the model
        video = cv2.VideoCapture(input_path)
        frames = []
        while video.isOpened():
            ret, frame = video.read()
            if not ret:
                break
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        video.release()
        os.remove(input_path)

        if not frames:
            raise ValueError("Could not decode any frames from the input video")

        # Run the mask generator frame by frame and paint each mask with a
        # distinct label value (SamAutomaticMaskGenerator returns no class
        # ids, so the mask index is used as the label here)
        context.logger.info(f"Processing {len(frames)} frames")
        segmented_frames = []
        for frame in frames:
            masks = context.user_data.model.generate(frame)
            mask_image = np.zeros_like(frame)
            for idx, mask in enumerate(masks, start=1):
                mask_image[mask["segmentation"]] = idx  # wraps above 255 (uint8)
            segmented_frames.append(mask_image)

        # Encode the segmented frames into a video; cv2.VideoWriter also needs
        # a real file path, so write to a temporary file and read it back
        tmp_out = tempfile.NamedTemporaryFile(suffix=".mp4", delete=False)
        output_path = tmp_out.name
        tmp_out.close()
        height, width, _ = segmented_frames[0].shape
        video_writer = cv2.VideoWriter(
            output_path, cv2.VideoWriter_fourcc(*"mp4v"), 24, (width, height)
        )
        for frame in segmented_frames:
            video_writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))
        video_writer.release()

        # Base64-encode the output video for the JSON response
        with open(output_path, "rb") as f:
            encoded_video = base64.b64encode(f.read()).decode()
        os.remove(output_path)

        return context.Response(
            body=json.dumps({"video": encoded_video}),
            headers={},
            content_type="application/json",
            status_code=200,
        )
    except Exception as e:
        context.logger.error(f"Error processing video: {str(e)}")
        return context.Response(
            body=json.dumps({"error": str(e)}),
            headers={},
            content_type="application/json",
            status_code=500,
        )
```
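For completeness, a client could call such a function like this (a hypothetical sketch; the function URL depends on how the nuclio function is deployed, and `input.mp4` is a placeholder):

```python
import base64

import requests  # any HTTP client works

# Hypothetical nuclio endpoint -- the actual host/port depends on deployment
FUNCTION_URL = "http://localhost:8080"

with open("input.mp4", "rb") as f:
    payload = {"video": base64.b64encode(f.read()).decode()}

resp = requests.post(FUNCTION_URL, json=payload, timeout=600)
resp.raise_for_status()

# Decode the segmented video returned by the handler and save it to disk
with open("segmented.mp4", "wb") as f:
    f.write(base64.b64decode(resp.json()["video"]))
```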
Thank you for the update and the detailed explanation! I have reviewed the changes you've made to the code and understand the newly implemented features. This solution aligns perfectly with what I was looking for, and I really appreciate the time and effort you've put into making these improvements. I'm looking forward to seeing it in action within CVAT.
Actions before raising this issue
Is your feature request related to a problem? Please describe.
The current serverless API in CVAT only supports processing individual frames, which limits its ability to handle tasks that require video input. My deep learning model needs to process entire videos, and the current frame-by-frame processing is not sufficient for this purpose.
Describe the solution you'd like
I would like the serverless API to be enhanced to support video input, allowing entire videos to be passed to the API rather than just individual frames. This would enable models that require video input for processing to function correctly within the CVAT serverless environment.
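As an illustration of the requested change, the serverless request body could grow a video variant alongside the existing per-frame one (a hypothetical schema sketch; the exact field names would be up to the CVAT maintainers):

```python
# Current per-frame contract (simplified): one base64-encoded image per call
frame_request = {
    "image": "<base64-encoded frame>",
}

# Hypothetical video contract: the whole clip in one call, so a model such
# as SAM2 can propagate masks across frames server-side
video_request = {
    "video": "<base64-encoded video file>",
    # optional prompts, e.g. points on a given frame for SAM2-style models
    "prompts": [{"frame": 0, "points": [[210, 350]], "labels": [1]}],
}
```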
Describe alternatives you've considered
No response
Additional context
No response