diff --git a/comps/dataprep/multimodal/redis/langchain/README.md b/comps/dataprep/multimodal/redis/langchain/README.md index 65e4f5d45f..db24b431fd 100644 --- a/comps/dataprep/multimodal/redis/langchain/README.md +++ b/comps/dataprep/multimodal/redis/langchain/README.md @@ -1,6 +1,10 @@ # Dataprep Microservice for Multimodal Data with Redis -This `dataprep` microservice accepts videos (mp4 files) and their transcripts (optional) from the user and ingests them into Redis vectorstore. +This `dataprep` microservice accepts the following from the user and ingests them into a Redis vector store: + +- Videos (mp4 files) and their transcripts (optional) +- Images (gif, jpg, jpeg, and png files) and their captions (optional) +- Audio (wav files) ## 🚀1. Start Microservice with Python(Option 1) @@ -107,18 +111,18 @@ docker container logs -f dataprep-multimodal-redis ## 🚀4. Consume Microservice -Once this dataprep microservice is started, user can use the below commands to invoke the microservice to convert videos and their transcripts (optional) to embeddings and save to the Redis vector store. +Once this dataprep microservice is started, users can use the commands below to invoke the microservice to convert images and videos, along with their optional transcripts or captions, into embeddings and save them to the Redis vector store. -This mircroservice has provided 3 different ways for users to ingest videos into Redis vector store corresponding to the 3 use cases. +This microservice provides three different ways for users to ingest files into the Redis vector store, corresponding to the three use cases below. -### 4.1 Consume _videos_with_transcripts_ API +### 4.1 Consume _ingest_with_text_ API -**Use case:** This API is used when a transcript file (under `.vtt` format) is available for each video. +**Use case:** This API is used when videos are accompanied by transcript files (`.vtt` format) or images are accompanied by text caption files (`.txt` format). **Important notes:** - Make sure the file paths after `files=@` are correct. -- Every transcript file's name must be identical with its corresponding video file's name (except their extension .vtt and .mp4). For example, `video1.mp4` and `video1.vtt`. Otherwise, if `video1.vtt` is not included correctly in this API call, this microservice will return error `No captions file video1.vtt found for video1.mp4`. +- Every transcript or caption file's name must be identical to its corresponding video or image file's name (except for the extension: `.vtt` pairs with `.mp4`, and `.txt` pairs with `.jpg`, `.jpeg`, `.png`, or `.gif`). For example, `video1.mp4` and `video1.vtt`. Otherwise, if `video1.vtt` is not included correctly in the API call, the microservice will return the error `No captions file video1.vtt found for video1.mp4`. A quick pre-upload check for matching file names is sketched below. 
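Before uploading, a quick shell check like the one below can confirm that every media file has a caption or transcript file with a matching base name. This is an illustrative sketch only, not part of the microservice, and the file names are placeholders.

```bash
# Hypothetical pre-upload check: verify each media file has a matching caption/transcript file.
# Replace the example file names with your own.
for media in ./video1.mp4 ./image1.png; do
  base="${media%.*}"
  case "$media" in
    *.mp4) caption="${base}.vtt" ;;   # videos pair with .vtt transcripts
    *)     caption="${base}.txt" ;;   # images pair with .txt captions
  esac
  [ -f "$caption" ] || echo "Missing caption file for $media (expected $caption)"
done
```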
#### Single video-transcript pair upload @@ -127,10 +131,20 @@ curl -X POST \ -H "Content-Type: multipart/form-data" \ -F "files=@./video1.mp4" \ -F "files=@./video1.vtt" \ - http://localhost:6007/v1/videos_with_transcripts + http://localhost:6007/v1/ingest_with_text +``` + +#### Single image-caption pair upload + +```bash +curl -X POST \ + -H "Content-Type: multipart/form-data" \ + -F "files=@./image.jpg" \ + -F "files=@./image.txt" \ + http://localhost:6007/v1/ingest_with_text ``` -#### Multiple video-transcript pair upload +#### Multiple file pair upload ```bash curl -X POST \ @@ -139,16 +153,20 @@ curl -X POST \ -F "files=@./video1.vtt" \ -F "files=@./video2.mp4" \ -F "files=@./video2.vtt" \ - http://localhost:6007/v1/videos_with_transcripts + -F "files=@./image1.png" \ + -F "files=@./image1.txt" \ + -F "files=@./image2.jpg" \ + -F "files=@./image2.txt" \ + http://localhost:6007/v1/ingest_with_text ``` ### 4.2 Consume _generate_transcripts_ API -**Use case:** This API should be used when a video has meaningful audio or recognizable speech but its transcript file is not available. +**Use case:** This API should be used when a video has meaningful audio or recognizable speech but its transcript file is not available, or when ingesting audio (.wav) files that contain speech. -In this use case, this microservice will use [`whisper`](https://openai.com/index/whisper/) model to generate the `.vtt` transcript for the video. +In this use case, this microservice will use the [`whisper`](https://openai.com/index/whisper/) model to generate a `.vtt` transcript for each video or audio file. -#### Single video upload +#### Single file upload ```bash curl -X POST \ @@ -157,21 +175,22 @@ curl -X POST \ http://localhost:6007/v1/generate_transcripts ``` -#### Multiple video upload +#### Multiple file upload ```bash curl -X POST \ -H "Content-Type: multipart/form-data" \ -F "files=@./video1.mp4" \ -F "files=@./video2.mp4" \ + -F "files=@./audio1.wav" \ http://localhost:6007/v1/generate_transcripts ``` ### 4.3 Consume _generate_captions_ API -**Use case:** This API should be used when a video does not have meaningful audio or does not have audio. +**Use case:** This API should be used when uploading an image, or when uploading a video that does not have meaningful audio or does not have audio. -In this use case, transcript either does not provide any meaningful information or does not exist. Thus, it is preferred to leverage a LVM microservice to summarize the video frames. +In this use case, there is no meaningful transcript to ingest, so it is preferable to leverage an LVM microservice to summarize the frames. - Single video upload @@ -192,22 +211,31 @@ curl -X POST \ http://localhost:6007/v1/generate_captions ``` -### 4.4 Consume get_videos API +- Single image upload + +```bash +curl -X POST \ + -H "Content-Type: multipart/form-data" \ + -F "files=@./image.jpg" \ + http://localhost:6007/v1/generate_captions +``` + +### 4.4 Consume get_files API -To get names of uploaded videos, use the following command. +To get the names of uploaded files, use the following command. ```bash curl -X POST \ -H "Content-Type: application/json" \ - http://localhost:6007/v1/dataprep/get_videos + http://localhost:6007/v1/dataprep/get_files ``` -### 4.5 Consume delete_videos API +### 4.5 Consume delete_files API -To delete uploaded videos and clear the database, use the following command. +To delete uploaded files and clear the database, use the following command. 
```bash curl -X POST \ -H "Content-Type: application/json" \ - http://localhost:6007/v1/dataprep/delete_videos + http://localhost:6007/v1/dataprep/delete_files ``` diff --git a/comps/dataprep/multimodal/redis/langchain/config.py b/comps/dataprep/multimodal/redis/langchain/config.py index 0cae533788..90a73d5a96 100644 --- a/comps/dataprep/multimodal/redis/langchain/config.py +++ b/comps/dataprep/multimodal/redis/langchain/config.py @@ -4,7 +4,7 @@ import os # Models -EMBED_MODEL = os.getenv("EMBED_MODEL", "BridgeTower/bridgetower-large-itm-mlm-itc") +EMBED_MODEL = os.getenv("EMBEDDING_MODEL_ID", "BridgeTower/bridgetower-large-itm-mlm-itc") WHISPER_MODEL = os.getenv("WHISPER_MODEL", "small") # Redis Connection Information diff --git a/comps/dataprep/multimodal/redis/langchain/multimodal_utils.py b/comps/dataprep/multimodal/redis/langchain/multimodal_utils.py index f8168b582a..09abad5641 100644 --- a/comps/dataprep/multimodal/redis/langchain/multimodal_utils.py +++ b/comps/dataprep/multimodal/redis/langchain/multimodal_utils.py @@ -39,8 +39,8 @@ def clear_upload_folder(upload_path): os.rmdir(dir_path) -def generate_video_id(): - """Generates a unique identifier for a video file.""" +def generate_id(): + """Generates a unique identifier for a file.""" return str(uuid.uuid4()) @@ -128,8 +128,49 @@ def convert_img_to_base64(image): return encoded_string.decode() +def generate_annotations_from_transcript(file_id: str, file_path: str, vtt_path: str, output_dir: str): + """Generates an annotations.json from the transcript file.""" + + # Set up location to store frames and annotations + os.makedirs(output_dir, exist_ok=True) + + # read captions file + captions = webvtt.read(vtt_path) + + annotations = [] + for idx, caption in enumerate(captions): + start_time = str2time(caption.start) + end_time = str2time(caption.end) + mid_time = (end_time + start_time) / 2 + mid_time_ms = mid_time * 1000 + text = caption.text.replace("\n", " ") + + # Create annotations for frame from transcripts with an empty image + annotations.append( + { + "video_id": file_id, + "video_name": os.path.basename(file_path), + "b64_img_str": "", + "caption": text, + "time": mid_time_ms, + "frame_no": 0, + "sub_video_id": idx, + } + ) + + # Save transcript annotations as json file for further processing + with open(os.path.join(output_dir, "annotations.json"), "w") as f: + json.dump(annotations, f) + + return annotations + + def extract_frames_and_annotations_from_transcripts(video_id: str, video_path: str, vtt_path: str, output_dir: str): - """Extract frames (.png) and annotations (.json) from video file (.mp4) and captions file (.vtt)""" + """Extract frames (.png) and annotations (.json) from media-text file pairs. 
+ + File pairs can be a video + file (.mp4) and transcript file (.vtt) or an image file (.png, .jpg, .jpeg, .gif) and caption file (.txt) + """ # Set up location to store frames and annotations os.makedirs(output_dir, exist_ok=True) os.makedirs(os.path.join(output_dir, "frames"), exist_ok=True) @@ -139,18 +180,28 @@ def extract_frames_and_annotations_from_transcripts(video_id: str, video_path: s fps = vidcap.get(cv2.CAP_PROP_FPS) # read captions file - captions = webvtt.read(vtt_path) + if os.path.splitext(vtt_path)[-1] == ".vtt": + captions = webvtt.read(vtt_path) + else: + with open(vtt_path, "r") as f: + # Wrap the caption text in a list so the loop below runs once per caption file + captions = [f.read()] annotations = [] for idx, caption in enumerate(captions): - start_time = str2time(caption.start) - end_time = str2time(caption.end) + if os.path.splitext(vtt_path)[-1] == ".vtt": + start_time = str2time(caption.start) + end_time = str2time(caption.end) - mid_time = (end_time + start_time) / 2 - text = caption.text.replace("\n", " ") + mid_time = (end_time + start_time) / 2 + text = caption.text.replace("\n", " ") + + frame_no = time_to_frame(mid_time, fps) + mid_time_ms = mid_time * 1000 + else: + frame_no = 0 + mid_time_ms = 0 + text = caption.replace("\n", " ") - frame_no = time_to_frame(mid_time, fps) - mid_time_ms = mid_time * 1000 vidcap.set(cv2.CAP_PROP_POS_MSEC, mid_time_ms) success, frame = vidcap.read() diff --git a/comps/dataprep/multimodal/redis/langchain/prepare_videodoc_redis.py b/comps/dataprep/multimodal/redis/langchain/prepare_videodoc_redis.py index 2784e54300..fa8ed4896e 100644 --- a/comps/dataprep/multimodal/redis/langchain/prepare_videodoc_redis.py +++ b/comps/dataprep/multimodal/redis/langchain/prepare_videodoc_redis.py @@ -23,7 +23,8 @@ extract_frames_and_annotations_from_transcripts, extract_frames_and_generate_captions, extract_transcript_from_audio, - generate_video_id, + generate_annotations_from_transcript, + generate_id, load_json_file, load_whisper_model, write_vtt, @@ -44,7 +45,7 @@ class MultimodalRedis(Redis): def from_text_image_pairs_return_keys( cls: Type[Redis], texts: List[str], - images: List[str], + images: List[str] = None, embedding: Embeddings = BridgeTowerEmbedding, metadatas: Optional[List[dict]] = None, index_name: Optional[str] = None, @@ -55,7 +56,8 @@ def from_text_image_pairs_return_keys( """ Args: texts (List[str]): List of texts to add to the vectorstore. - images (List[str]): List of path-to-images to add to the vectorstore. + images (List[str]): Optional list of path-to-images to add to the vectorstore. If provided, the length of + the list of images must match the length of the list of text strings. embedding (Embeddings): Embeddings to use for the vectorstore. metadatas (Optional[List[dict]], optional): Optional list of metadata dicts to add to the vectorstore. Defaults to None. @@ -74,8 +76,8 @@ def from_text_image_pairs_return_keys( ValueError: If the number of texts does not equal the number of images. ValueError: If the number of metadatas does not match the number of texts. 
""" - # the length of texts must be equal to the length of images - if len(texts) != len(images): + # If images are provided, the length of texts must be equal to the length of images + if images and len(texts) != len(images): raise ValueError(f"the len of captions {len(texts)} does not equal the len of images {len(images)}") redis_url = get_from_dict_or_env(kwargs, "redis_url", "REDIS_URL") @@ -117,7 +119,11 @@ def from_text_image_pairs_return_keys( **kwargs, ) # Add data to Redis - keys = instance.add_text_image_pairs(texts, images, metadatas, keys=keys) + keys = ( + instance.add_text_image_pairs(texts, images, metadatas, keys=keys) + if images + else instance.add_text(texts, metadatas, keys=keys) + ) return instance, keys def add_text_image_pairs( @@ -188,6 +194,66 @@ def add_text_image_pairs( pipeline.execute() return ids + def add_text( + self, + texts: Iterable[str], + metadatas: Optional[List[dict]] = None, + embeddings: Optional[List[List[float]]] = None, + clean_metadata: bool = True, + **kwargs: Any, + ) -> List[str]: + """Add more embeddings of text to the vectorstore. + + Args: + texts (Iterable[str]): Iterable of strings/text to add to the vectorstore. + metadatas (Optional[List[dict]], optional): Optional list of metadatas. + Defaults to None. + embeddings (Optional[List[List[float]]], optional): Optional pre-generated + embeddings. Defaults to None. + keys (List[str]) or ids (List[str]): Identifiers of entries. + Defaults to None. + Returns: + List[str]: List of ids added to the vectorstore + """ + ids = [] + # Get keys or ids from kwargs + # Other vectorstores use ids + keys_or_ids = kwargs.get("keys", kwargs.get("ids")) + + # type check for metadata + if metadatas: + if isinstance(metadatas, list) and len(metadatas) != len(texts): # type: ignore # noqa: E501 + raise ValueError("Number of metadatas must match number of texts") + if not (isinstance(metadatas, list) and isinstance(metadatas[0], dict)): + raise ValueError("Metadatas must be a list of dicts") + + if not embeddings: + embeddings = self._embeddings.embed_documents(list(texts)) + self._create_index_if_not_exist(dim=len(embeddings[0])) + + # Write data to redis + pipeline = self.client.pipeline(transaction=False) + for i, text in enumerate(texts): + # Use provided values by default or fallback + key = keys_or_ids[i] if keys_or_ids else str(uuid.uuid4().hex) + if not key.startswith(self.key_prefix + ":"): + key = self.key_prefix + ":" + key + metadata = metadatas[i] if metadatas else {} + metadata = _prepare_metadata(metadata) if clean_metadata else metadata + pipeline.hset( + key, + mapping={ + self._schema.content_key: text, + self._schema.content_vector_key: _array_to_buffer(embeddings[i], self._schema.vector_dtype), + **metadata, + }, + ) + ids.append(key) + + # Cleanup final batch + pipeline.execute() + return ids + def prepare_data_and_metadata_from_annotation( annotation, path_to_frames, title, num_transcript_concat_for_ingesting=2, num_transcript_concat_for_inference=7 @@ -211,11 +277,14 @@ def prepare_data_and_metadata_from_annotation( video_id = frame["video_id"] b64_img_str = frame["b64_img_str"] time_of_frame = frame["time"] - embedding_type = "pair" + embedding_type = "pair" if b64_img_str else "text" source_video = frame["video_name"] text_list.append(caption_for_ingesting) - image_list.append(path_to_frame) + + if b64_img_str: + image_list.append(path_to_frame) + metadatas.append( { "content": caption_for_ingesting, @@ -268,45 +337,52 @@ def drop_index(index_name, redis_url=REDIS_URL): 
@register_microservice( name="opea_service@prepare_videodoc_redis", endpoint="/v1/generate_transcripts", host="0.0.0.0", port=6007 ) -async def ingest_videos_generate_transcripts(files: List[UploadFile] = File(None)): - """Upload videos with speech, generate transcripts using whisper and ingest into redis.""" +async def ingest_generate_transcripts(files: List[UploadFile] = File(None)): + """Upload videos or audio files with speech, generate transcripts using whisper and ingest into redis.""" if files: - video_files = [] - uploaded_videos_saved_videos_map = {} + files_to_ingest = [] + uploaded_files_map = {} for file in files: - if os.path.splitext(file.filename)[1] == ".mp4": - video_files.append(file) + if os.path.splitext(file.filename)[1] in [".mp4", ".wav"]: + files_to_ingest.append(file) else: raise HTTPException( status_code=400, detail=f"File {file.filename} is not an mp4 file. Please upload mp4 files only." ) - for video_file in video_files: + for file_to_ingest in files_to_ingest: st = time.time() - print(f"Processing video {video_file.filename}") + file_extension = os.path.splitext(file_to_ingest.filename)[1] + is_video = file_extension == ".mp4" + file_type_str = "video" if is_video else "audio file" + print(f"Processing {file_type_str} {file_to_ingest.filename}") # Assign unique identifier to video - video_id = generate_video_id() + file_id = generate_id() # Create video file name by appending identifier - video_name = os.path.splitext(video_file.filename)[0] - video_file_name = f"{video_name}_{video_id}.mp4" - video_dir_name = os.path.splitext(video_file_name)[0] - - # Save video file in upload_directory - with open(os.path.join(upload_folder, video_file_name), "wb") as f: - shutil.copyfileobj(video_file.file, f) - - uploaded_videos_saved_videos_map[video_name] = video_file_name - - # Extract temporary audio wav file from video mp4 - audio_file = video_dir_name + ".wav" - print(f"Extracting {audio_file}") - convert_video_to_audio( - os.path.join(upload_folder, video_file_name), os.path.join(upload_folder, audio_file) - ) - print(f"Done extracting {audio_file}") + base_file_name = os.path.splitext(file_to_ingest.filename)[0] + file_name_with_id = f"{base_file_name}_{file_id}{file_extension}" + dir_name = os.path.splitext(file_name_with_id)[0] + + # Save file in upload_directory + with open(os.path.join(upload_folder, file_name_with_id), "wb") as f: + shutil.copyfileobj(file_to_ingest.file, f) + + uploaded_files_map[base_file_name] = file_name_with_id + + if is_video: + # Extract temporary audio wav file from video mp4 + audio_file = dir_name + ".wav" + print(f"Extracting {audio_file}") + convert_video_to_audio( + os.path.join(upload_folder, file_name_with_id), os.path.join(upload_folder, audio_file) + ) + print(f"Done extracting {audio_file}") + else: + # We already have an audio file + audio_file = file_name_with_id # Load whisper model print("Loading whisper model....") @@ -318,196 +394,215 @@ async def ingest_videos_generate_transcripts(files: List[UploadFile] = File(None transcripts = extract_transcript_from_audio(whisper_model, os.path.join(upload_folder, audio_file)) # Save transcript as vtt file and delete audio file - vtt_file = video_dir_name + ".vtt" + vtt_file = dir_name + ".vtt" write_vtt(transcripts, os.path.join(upload_folder, vtt_file)) - delete_audio_file(os.path.join(upload_folder, audio_file)) + if is_video: + delete_audio_file(os.path.join(upload_folder, audio_file)) print("Done extracting transcript.") - # Store frames and caption annotations in a new 
directory - print("Extracting frames and generating annotation") - extract_frames_and_annotations_from_transcripts( - video_id, - os.path.join(upload_folder, video_file_name), - os.path.join(upload_folder, vtt_file), - os.path.join(upload_folder, video_dir_name), - ) + if is_video: + # Store frames and caption annotations in a new directory + print("Extracting frames and generating annotation") + extract_frames_and_annotations_from_transcripts( + file_id, + os.path.join(upload_folder, file_name_with_id), + os.path.join(upload_folder, vtt_file), + os.path.join(upload_folder, dir_name), + ) + else: + # Generate annotations based on the transcript + print("Generating annotations for the transcription") + generate_annotations_from_transcript( + file_id, + os.path.join(upload_folder, file_name_with_id), + os.path.join(upload_folder, vtt_file), + os.path.join(upload_folder, dir_name), + ) + print("Done extracting frames and generating annotation") # Delete temporary vtt file os.remove(os.path.join(upload_folder, vtt_file)) # Ingest multimodal data into redis print("Ingesting data to redis vector store") - ingest_multimodal(video_name, os.path.join(upload_folder, video_dir_name), embeddings) + ingest_multimodal(base_file_name, os.path.join(upload_folder, dir_name), embeddings) # Delete temporary video directory containing frames and annotations - shutil.rmtree(os.path.join(upload_folder, video_dir_name)) + shutil.rmtree(os.path.join(upload_folder, dir_name)) - print(f"Processed video {video_file.filename}") + print(f"Processed file {file_to_ingest.filename}") end = time.time() print(str(end - st)) return { "status": 200, "message": "Data preparation succeeded", - "video_id_maps": uploaded_videos_saved_videos_map, + "file_id_maps": uploaded_files_map, } - raise HTTPException(status_code=400, detail="Must provide at least one video (.mp4) file.") + raise HTTPException(status_code=400, detail="Must provide at least one video (.mp4) or audio (.wav) file.") @register_microservice( name="opea_service@prepare_videodoc_redis", endpoint="/v1/generate_captions", host="0.0.0.0", port=6007 ) -async def ingest_videos_generate_caption(files: List[UploadFile] = File(None)): - """Upload videos without speech (only background music or no audio), generate captions using lvm microservice and ingest into redis.""" +async def ingest_generate_caption(files: List[UploadFile] = File(None)): + """Upload images and videos without speech (only background music or no audio), generate captions using lvm microservice and ingest into redis.""" if files: - video_files = [] - uploaded_videos_saved_videos_map = {} + file_paths = [] + uploaded_files_saved_files_map = {} for file in files: - if os.path.splitext(file.filename)[1] == ".mp4": - video_files.append(file) + if os.path.splitext(file.filename)[1] in [".mp4", ".png", ".jpg", ".jpeg", ".gif"]: + file_paths.append(file) else: raise HTTPException( - status_code=400, detail=f"File {file.filename} is not an mp4 file. Please upload mp4 files only." + status_code=400, + detail=f"File {file.filename} is not a supported file type. 
Please upload mp4, png, jpg, jpeg, and gif files only.", ) - for video_file in video_files: - print(f"Processing video {video_file.filename}") + for file in file_paths: + print(f"Processing file {file.filename}") - # Assign unique identifier to video - video_id = generate_video_id() + # Assign unique identifier to file + id = generate_id() - # Create video file name by appending identifier - video_name = os.path.splitext(video_file.filename)[0] - video_file_name = f"{video_name}_{video_id}.mp4" - video_dir_name = os.path.splitext(video_file_name)[0] + # Create file name by appending identifier + name, ext = os.path.splitext(file.filename) + file_name = f"{name}_{id}{ext}" + dir_name = os.path.splitext(file_name)[0] - # Save video file in upload_directory - with open(os.path.join(upload_folder, video_file_name), "wb") as f: - shutil.copyfileobj(video_file.file, f) - uploaded_videos_saved_videos_map[video_name] = video_file_name + # Save file in upload_directory + with open(os.path.join(upload_folder, file_name), "wb") as f: + shutil.copyfileobj(file.file, f) + uploaded_files_saved_files_map[name] = file_name # Store frames and caption annotations in a new directory extract_frames_and_generate_captions( - video_id, - os.path.join(upload_folder, video_file_name), + id, + os.path.join(upload_folder, file_name), LVM_ENDPOINT, - os.path.join(upload_folder, video_dir_name), + os.path.join(upload_folder, dir_name), ) # Ingest multimodal data into redis - ingest_multimodal(video_name, os.path.join(upload_folder, video_dir_name), embeddings) + ingest_multimodal(name, os.path.join(upload_folder, dir_name), embeddings) - # Delete temporary video directory containing frames and annotations - # shutil.rmtree(os.path.join(upload_folder, video_dir_name)) + # Delete temporary directory containing frames and annotations + # shutil.rmtree(os.path.join(upload_folder, dir_name)) - print(f"Processed video {video_file.filename}") + print(f"Processed file {file.filename}") return { "status": 200, "message": "Data preparation succeeded", - "video_id_maps": uploaded_videos_saved_videos_map, + "file_id_maps": uploaded_files_saved_files_map, } - raise HTTPException(status_code=400, detail="Must provide at least one video (.mp4) file.") + raise HTTPException(status_code=400, detail="Must provide at least one file.") @register_microservice( name="opea_service@prepare_videodoc_redis", - endpoint="/v1/videos_with_transcripts", + endpoint="/v1/ingest_with_text", host="0.0.0.0", port=6007, ) -async def ingest_videos_with_transcripts(files: List[UploadFile] = File(None)): - +async def ingest_with_text(files: List[UploadFile] = File(None)): if files: - video_files, video_file_names = [], [] - captions_files, captions_file_names = [], [] - uploaded_videos_saved_videos_map = {} + accepted_media_formats = [".mp4", ".png", ".jpg", ".jpeg", ".gif"] + # Create a lookup dictionary containing all media files + matched_files = {f.filename: [f] for f in files if os.path.splitext(f.filename)[1] in accepted_media_formats} + uploaded_files_map = {} + + # Go through files again and match caption files to media files for file in files: - if os.path.splitext(file.filename)[1] == ".mp4": - video_files.append(file) - video_file_names.append(file.filename) - elif os.path.splitext(file.filename)[1] == ".vtt": - captions_files.append(file) - captions_file_names.append(file.filename) - else: + file_base, file_extension = os.path.splitext(file.filename) + if file_extension == ".vtt": + if "{}.mp4".format(file_base) in matched_files: + 
matched_files["{}.mp4".format(file_base)].append(file) + else: + print(f"No video was found for caption file {file.filename}.") + elif file_extension == ".txt": + if "{}.png".format(file_base) in matched_files: + matched_files["{}.png".format(file_base)].append(file) + elif "{}.jpg".format(file_base) in matched_files: + matched_files["{}.jpg".format(file_base)].append(file) + elif "{}.jpeg".format(file_base) in matched_files: + matched_files["{}.jpeg".format(file_base)].append(file) + elif "{}.gif".format(file_base) in matched_files: + matched_files["{}.gif".format(file_base)].append(file) + else: + print(f"No image was found for caption file {file.filename}.") + elif file_extension not in accepted_media_formats: print(f"Skipping file {file.filename} because of unsupported format.") - # Check if every video file has a captions file - for video_file_name in video_file_names: - file_prefix = os.path.splitext(video_file_name)[0] - if (file_prefix + ".vtt") not in captions_file_names: - raise HTTPException( - status_code=400, detail=f"No captions file {file_prefix}.vtt found for {video_file_name}" - ) + # Check if every media file has a caption file + for media_file_name, file_pair in matched_files.items(): + if len(file_pair) != 2: + raise HTTPException(status_code=400, detail=f"No caption file found for {media_file_name}") - if len(video_files) == 0: + if len(matched_files.keys()) == 0: return HTTPException( status_code=400, - detail="The uploaded files have unsupported formats. Please upload at least one video file (.mp4) with captions (.vtt)", + detail="The uploaded files have unsupported formats. Please upload at least one video file (.mp4) with captions (.vtt) or one image (.png, .jpg, .jpeg, or .gif) with caption (.txt)", ) - for video_file in video_files: - print(f"Processing video {video_file.filename}") + for media_file in matched_files: + print(f"Processing file {media_file}") - # Assign unique identifier to video - video_id = generate_video_id() + # Assign unique identifier to file + file_id = generate_id() - # Create video file name by appending identifier - video_name = os.path.splitext(video_file.filename)[0] - video_file_name = f"{video_name}_{video_id}.mp4" - video_dir_name = os.path.splitext(video_file_name)[0] - - # Save video file in upload_directory - with open(os.path.join(upload_folder, video_file_name), "wb") as f: - shutil.copyfileobj(video_file.file, f) - uploaded_videos_saved_videos_map[video_name] = video_file_name - - # Save captions file in upload directory - vtt_file_name = os.path.splitext(video_file.filename)[0] + ".vtt" - vtt_idx = None - for idx, caption_file in enumerate(captions_files): - if caption_file.filename == vtt_file_name: - vtt_idx = idx - break - vtt_file = video_dir_name + ".vtt" - with open(os.path.join(upload_folder, vtt_file), "wb") as f: - shutil.copyfileobj(captions_files[vtt_idx].file, f) + # Create file name by appending identifier + file_name, file_extension = os.path.splitext(media_file) + media_file_name = f"{file_name}_{file_id}{file_extension}" + media_dir_name = os.path.splitext(media_file_name)[0] + + # Save file in upload_directory + with open(os.path.join(upload_folder, media_file_name), "wb") as f: + shutil.copyfileobj(matched_files[media_file][0].file, f) + uploaded_files_map[file_name] = media_file_name + + # Save caption file in upload directory + caption_file_extension = os.path.splitext(matched_files[media_file][1].filename)[1] + caption_file = f"{media_dir_name}{caption_file_extension}" + with 
open(os.path.join(upload_folder, caption_file), "wb") as f: + shutil.copyfileobj(matched_files[media_file][1].file, f) # Store frames and caption annotations in a new directory extract_frames_and_annotations_from_transcripts( - video_id, - os.path.join(upload_folder, video_file_name), - os.path.join(upload_folder, vtt_file), - os.path.join(upload_folder, video_dir_name), + file_id, + os.path.join(upload_folder, media_file_name), + os.path.join(upload_folder, caption_file), + os.path.join(upload_folder, media_dir_name), ) - # Delete temporary vtt file - os.remove(os.path.join(upload_folder, vtt_file)) + # Delete temporary caption file + os.remove(os.path.join(upload_folder, caption_file)) # Ingest multimodal data into redis - ingest_multimodal(video_name, os.path.join(upload_folder, video_dir_name), embeddings) + ingest_multimodal(file_name, os.path.join(upload_folder, media_dir_name), embeddings) - # Delete temporary video directory containing frames and annotations - shutil.rmtree(os.path.join(upload_folder, video_dir_name)) + # Delete temporary media directory containing frames and annotations + shutil.rmtree(os.path.join(upload_folder, media_dir_name)) - print(f"Processed video {video_file.filename}") + print(f"Processed file {media_file}") return { "status": 200, "message": "Data preparation succeeded", - "video_id_maps": uploaded_videos_saved_videos_map, + "file_id_maps": uploaded_files_map, } raise HTTPException( - status_code=400, detail="Must provide at least one pair consisting of video (.mp4) and captions (.vtt)" + status_code=400, + detail="Must provide at least one pair consisting of video (.mp4) and captions (.vtt) or image (.png, .jpg, .jpeg, .gif) with caption (.txt)", ) @register_microservice( - name="opea_service@prepare_videodoc_redis", endpoint="/v1/dataprep/get_videos", host="0.0.0.0", port=6007 + name="opea_service@prepare_videodoc_redis", endpoint="/v1/dataprep/get_files", host="0.0.0.0", port=6007 ) async def rag_get_file_structure(): """Returns list of names of uploaded videos saved on the server.""" @@ -521,17 +616,17 @@ async def rag_get_file_structure(): @register_microservice( - name="opea_service@prepare_videodoc_redis", endpoint="/v1/dataprep/delete_videos", host="0.0.0.0", port=6007 + name="opea_service@prepare_videodoc_redis", endpoint="/v1/dataprep/delete_files", host="0.0.0.0", port=6007 ) -async def delete_videos(): - """Delete all uploaded videos along with redis index.""" +async def delete_files(): + """Delete all uploaded files along with redis index.""" index_deleted = drop_index(index_name=INDEX_NAME) if not index_deleted: - raise HTTPException(status_code=409, detail="Uploaded videos could not be deleted. Index does not exist") + raise HTTPException(status_code=409, detail="Uploaded files could not be deleted. Index does not exist") clear_upload_folder(upload_folder) - print("Successfully deleted all uploaded videos.") + print("Successfully deleted all uploaded files.") return {"status": True} diff --git a/comps/embeddings/multimodal/bridgetower/bridgetower_embedding.py b/comps/embeddings/multimodal/bridgetower/bridgetower_embedding.py index f61d8e1c33..fda2bc7902 100644 --- a/comps/embeddings/multimodal/bridgetower/bridgetower_embedding.py +++ b/comps/embeddings/multimodal/bridgetower/bridgetower_embedding.py @@ -56,7 +56,9 @@ def embed_documents(self, texts: List[str]) -> List[List[float]]: Returns: List of embeddings, one for each text. 
""" - encodings = self.PROCESSOR.tokenizer(texts, return_tensors="pt").to(self.device) + encodings = self.PROCESSOR.tokenizer( + texts, return_tensors="pt", max_length=200, padding="max_length", truncation=True + ).to(self.device) with torch.no_grad(): outputs = self.TEXT_MODEL(**encodings) embeddings = outputs.cpu().numpy().tolist() diff --git a/comps/lvms/llava/README.md b/comps/lvms/llava/README.md index adef9ef2e8..998eb4b664 100644 --- a/comps/lvms/llava/README.md +++ b/comps/lvms/llava/README.md @@ -103,9 +103,12 @@ docker run -p 9399:9399 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$htt ```bash # Use curl/python -# curl +# curl with an image and a prompt http_proxy="" curl http://localhost:9399/v1/lvm -XPOST -d '{"image": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "prompt":"What is this?"}' -H 'Content-Type: application/json' +# curl with a prompt only (no image) +http_proxy="" curl http://localhost:9399/v1/lvm -XPOST -d '{"image": "", "prompt":"What is deep learning?"}' -H 'Content-Type: application/json' + # python python check_lvm.py ``` diff --git a/comps/lvms/llava/dependency/llava_server.py b/comps/lvms/llava/dependency/llava_server.py index 52a22f1b1e..644e15a82e 100644 --- a/comps/lvms/llava/dependency/llava_server.py +++ b/comps/lvms/llava/dependency/llava_server.py @@ -14,6 +14,7 @@ from fastapi import FastAPI, Request from fastapi.responses import JSONResponse, Response from transformers import pipeline +from transformers.image_utils import load_image model_name_or_path = None model_dtype = None @@ -25,6 +26,72 @@ app = FastAPI() +def pipeline_preprocess(self, image, prompt=None, timeout=None): + """ + This replaces the preprocess function used by the image-to-text pipeline + (https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/image_to_text.py). + The original transformers image-to-text pipeline preprocess function requires that an image is passed in, and will + fail if the image parameter is null/empty. In order to support multimodal use cases with the same pipeline, this + preprocess function handles the case where there is no image with the prompt. + """ + + if image: + image = load_image(image, timeout=timeout) + + if prompt is not None: + if not isinstance(prompt, str): + raise ValueError( + f"Received an invalid text input, got - {type(prompt)} - but expected a single string. " + "Note also that one single text can be provided for conditional image to text generation." 
+ ) + + model_type = self.model.config.model_type + + if model_type == "git": + if image: + model_inputs = self.image_processor(images=image, return_tensors=self.framework) + if self.framework == "pt": + model_inputs = model_inputs.to(self.torch_dtype) + else: + model_inputs = {} + input_ids = self.tokenizer(text=prompt, add_special_tokens=False).input_ids + input_ids = [self.tokenizer.cls_token_id] + input_ids + input_ids = torch.tensor(input_ids).unsqueeze(0) + model_inputs.update({"input_ids": input_ids}) + elif model_type == "pix2struct": + model_inputs = self.image_processor(images=image, header_text=prompt, return_tensors=self.framework) + if self.framework == "pt": + model_inputs = model_inputs.to(self.torch_dtype) + + elif model_type != "vision-encoder-decoder": + if image: + # vision-encoder-decoder does not support conditional generation + model_inputs = self.image_processor(images=image, return_tensors=self.framework) + + if self.framework == "pt": + model_inputs = model_inputs.to(self.torch_dtype) + else: + model_inputs = {} + + text_inputs = self.tokenizer(prompt, return_tensors=self.framework) + model_inputs.update(text_inputs) + + else: + raise ValueError(f"Model type {model_type} does not support conditional text generation") + + elif image: + model_inputs = self.image_processor(images=image, return_tensors=self.framework) + if self.framework == "pt": + model_inputs = model_inputs.to(self.torch_dtype) + else: + raise ValueError("Both image and prompt cannot be empty.") + + if self.model.config.model_type == "git" and prompt is None: + model_inputs["input_ids"] = None + + return model_inputs + + def process_image(image, max_len=1344, min_len=672): if max(image.size) > max_len: max_hw, min_hw = max(image.size), min(image.size) @@ -54,12 +121,16 @@ async def generate(request: Request) -> Response: # FIXME batch_size=1 for now, img_b64_str = request_dict.pop("img_b64_str") max_new_tokens = request_dict.pop("max_new_tokens", 100) - # format the prompt - prompt = f"\nUSER: {prompt}\nASSISTANT:" - - # Decode and Resize the image - image = PIL.Image.open(BytesIO(base64.b64decode(img_b64_str))) - image = process_image(image) + if img_b64_str: + # Decode and Resize the image + image = PIL.Image.open(BytesIO(base64.b64decode(img_b64_str))) + image = process_image(image) + # format the prompt with an image + prompt = f"\nUSER: {prompt}\nASSISTANT:" + else: + image = None + # format the prompt with text only + prompt = f"USER: {prompt}\nASSISTANT:" if args.device == "hpu": generate_kwargs = { @@ -74,11 +145,17 @@ async def generate(request: Request) -> Response: # FIXME batch_size=1 for now, } start = time.time() + + # Override the pipeline preprocessing + generator.preprocess = pipeline_preprocess.__get__(generator, type(generator)) + result = generator(image, prompt=prompt, batch_size=1, generate_kwargs=generate_kwargs) end = time.time() result = result[0]["generated_text"].split("ASSISTANT: ")[-1] print(f"LLaVA result = {result}, time = {(end-start) * 1000 }ms") - image.close() + if image: + image.close() + ret = {"text": result} return JSONResponse(ret) diff --git a/comps/lvms/llava/lvm.py b/comps/lvms/llava/lvm.py index 26dac1dda9..897f7cbbe4 100644 --- a/comps/lvms/llava/lvm.py +++ b/comps/lvms/llava/lvm.py @@ -52,11 +52,12 @@ async def lvm(request: Union[LVMDoc, LVMSearchedMultimodalDoc]) -> Union[TextDoc raise HTTPException(status_code=500, detail="There is no video segments retrieved given the query!") img_b64_str = retrieved_metadatas[0]["b64_img_str"] + has_image = 
img_b64_str != "" initial_query = request.initial_query context = retrieved_metadatas[0]["transcript_for_inference"] prompt = initial_query if request.chat_template is None: - prompt = ChatTemplate.generate_multimodal_rag_on_videos_prompt(initial_query, context) + prompt = ChatTemplate.generate_multimodal_rag_on_videos_prompt(initial_query, context, has_image) else: prompt_template = PromptTemplate.from_template(request.chat_template) input_variables = prompt_template.input_variables diff --git a/comps/lvms/llava/template.py b/comps/lvms/llava/template.py index 71c2b26679..01992d2f85 100644 --- a/comps/lvms/llava/template.py +++ b/comps/lvms/llava/template.py @@ -5,6 +5,13 @@ class ChatTemplate: @staticmethod - def generate_multimodal_rag_on_videos_prompt(question: str, context: str): - template = """The transcript associated with the image is '{context}'. {question}""" + def generate_multimodal_rag_on_videos_prompt(question: str, context: str, has_image: bool = False): + + if has_image: + template = """The transcript associated with the image is '{context}'. {question}""" + else: + template = ( + """Refer to the following results obtained from the local knowledge base: '{context}'. {question}""" + ) + return template.format(context=context, question=question) diff --git a/comps/lvms/tgi-llava/lvm_tgi.py b/comps/lvms/tgi-llava/lvm_tgi.py index b4367c181c..38b492c395 100644 --- a/comps/lvms/tgi-llava/lvm_tgi.py +++ b/comps/lvms/tgi-llava/lvm_tgi.py @@ -54,11 +54,12 @@ async def lvm(request: Union[LVMDoc, LVMSearchedMultimodalDoc]) -> Union[TextDoc # due to llava-tgi-gaudi should receive image as input; Otherwise, the generated text is bad. raise HTTPException(status_code=500, detail="There is no video segments retrieved given the query!") img_b64_str = retrieved_metadatas[0]["b64_img_str"] + has_image = img_b64_str != "" initial_query = request.initial_query context = retrieved_metadatas[0]["transcript_for_inference"] prompt = initial_query if request.chat_template is None: - prompt = ChatTemplate.generate_multimodal_rag_on_videos_prompt(initial_query, context) + prompt = ChatTemplate.generate_multimodal_rag_on_videos_prompt(initial_query, context, has_image) else: prompt_template = PromptTemplate.from_template(request.chat_template) input_variables = prompt_template.input_variables @@ -87,6 +88,13 @@ async def lvm(request: Union[LVMDoc, LVMSearchedMultimodalDoc]) -> Union[TextDoc top_k = request.top_k top_p = request.top_p + if not img_b64_str: + # Work around an issue where LLaVA-NeXT is not providing good responses when prompted without an image. + # Provide an image and then instruct the model to ignore the image. 
The base64 string below is the encoded png: + # https://raw.githubusercontent.com/opea-project/GenAIExamples/refs/tags/v1.0/AudioQnA/ui/svelte/src/lib/assets/icons/png/audio1.png + img_b64_str = "iVBORw0KGgoAAAANSUhEUgAAADUAAAAlCAYAAADiMKHrAAAACXBIWXMAAAsTAAALEwEAmpwYAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAKPSURBVHgB7Zl/btowFMefnUTqf+MAHYMTjN4gvcGOABpM+8E0doLSE4xpsE3rKuAG3KC5Ad0J6MYOkP07YnvvhR9y0lVzupTIVT5SwDjB9fd97WfsMkCef1rUXM8dY9HHK4hWUevzi/oVWAqnF8fzLmAtiPA3Aq0lFsVA1fRKxlgNLIbDPaQUZQuu6YO98aIipHOiFGtIqaYfn1UnUCDds6WPyeANlTFbv9WztbFTK+HNUVAPiz7nbPzq7HsPCoKWIBREGfsJXZit5xT07X0jp6iRdIbEHOnjyyD97OvzH00lVS2K5OS2ax11cBXxJgYxlEIE6XZclzdTX6n8XjkkcEIfbj2nMO0/SNd1vy4vsCNjYPyEovfyy88GZIQCSKOCMf6ORgStoboLJuSWKDYCfK2q4jjrMZ+GOh7Pib/gek5DHxVUJtcgA7mJ4kwZRbN7viQXFzQn0Nl52gXG4Fo7DKAYp0yI3VHQ16oaWV0wYa+iGE8nG+wAdx5DzpS/KGyhFGULpShbKEXZQinqLlBK/IKc2asoh4sZvoXJWhlAzuxV1KBVD3HrfYTFAK8ZHgu0hu36DHLG+Izinw250WUkXHJht02QUnxLP7fZxR7f1I6S7Ir2GgmYvIQM5OYUuYBdainATq2ZjTqPBlnbGXYeBrg9Od18DKmc1U0jpw4OIIwEJFxQSl2b4MN2lf74fw8nFNbHt/5N9xWKTZvJ2S6YZk6RC3j2cKpVhSIShZ0mea6caCOCAjyNHd5gPPxGncMBTvI6hunYdaJ6kf8VoSCP2odxX6RkR6NOtanfj13EswKVqEQrPzzFL1lK+YvCFraiEqs8TrwQLGYraqpX4kr/Hixml+63Z+CoM9DTo438AUmP+KyMWT+tAAAAAElFTkSuQmCC" + prompt = f"Please disregard the image and answer the question. {prompt}" + image = f"data:image/png;base64,{img_b64_str}" image_prompt = f"![]({image})\n{prompt}\nASSISTANT:" diff --git a/comps/lvms/tgi-llava/template.py b/comps/lvms/tgi-llava/template.py index 71c2b26679..01992d2f85 100644 --- a/comps/lvms/tgi-llava/template.py +++ b/comps/lvms/tgi-llava/template.py @@ -5,6 +5,13 @@ class ChatTemplate: @staticmethod - def generate_multimodal_rag_on_videos_prompt(question: str, context: str): - template = """The transcript associated with the image is '{context}'. {question}""" + def generate_multimodal_rag_on_videos_prompt(question: str, context: str, has_image: bool = False): + + if has_image: + template = """The transcript associated with the image is '{context}'. {question}""" + else: + template = ( + """Refer to the following results obtained from the local knowledge base: '{context}'. {question}""" + ) + return template.format(context=context, question=question) diff --git a/tests/dataprep/test_dataprep_multimodal_redis_langchain.sh b/tests/dataprep/test_dataprep_multimodal_redis_langchain.sh index a7461a8abb..664f72c628 100644 --- a/tests/dataprep/test_dataprep_multimodal_redis_langchain.sh +++ b/tests/dataprep/test_dataprep_multimodal_redis_langchain.sh @@ -11,9 +11,15 @@ LVM_PORT=5028 LVM_ENDPOINT="http://${ip_address}:${LVM_PORT}/v1/lvm" WHISPER_MODEL="base" INDEX_NAME="dataprep" +tmp_dir=$(mktemp -d) video_name="WeAreGoingOnBullrun" -transcript_fn="${video_name}.vtt" -video_fn="${video_name}.mp4" +transcript_fn="${tmp_dir}/${video_name}.vtt" +video_fn="${tmp_dir}/${video_name}.mp4" +audio_name="AudioSample" +audio_fn="${tmp_dir}/${audio_name}.wav" +image_name="apple" +image_fn="${tmp_dir}/${image_name}.png" +caption_fn="${tmp_dir}/${image_name}.txt" function build_docker_images() { cd $WORKPATH @@ -115,9 +121,17 @@ place to watch it is on BlackmagicShine.com. We're right here on the smoking 00:00:45.240 --> 00:00:47.440 tire.""" > ${transcript_fn} + echo "This is an apple." 
> ${caption_fn} + + echo "Downloading Image" + wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn} + echo "Downloading Video" wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoingOnBullrun.mp4 -O ${video_fn} + echo "Downloading Audio" + wget https://github.com/intel/intel-extension-for-transformers/raw/main/intel_extension_for_transformers/neural_chat/assets/audio/sample.wav -O ${audio_fn} + } function validate_microservice() { @@ -126,7 +140,55 @@ function validate_microservice() { # test v1/generate_transcripts upload file echo "Testing generate_transcripts API" URL="http://${ip_address}:$dataprep_service_port/v1/generate_transcripts" - HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./$video_fn" -H 'Content-Type: multipart/form-data' "$URL") + HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@$video_fn" -F "files=@$audio_fn" -H 'Content-Type: multipart/form-data' "$URL") + HTTP_STATUS=$(echo $HTTP_RESPONSE | tr -d '\n' | sed -e 's/.*HTTPSTATUS://') + RESPONSE_BODY=$(echo $HTTP_RESPONSE | sed -e 's/HTTPSTATUS\:.*//g') + SERVICE_NAME="dataprep - upload - file" + + if [ "$HTTP_STATUS" -ne "200" ]; then + echo "[ $SERVICE_NAME ] HTTP status is not 200. Received status was $HTTP_STATUS" + docker logs test-comps-dataprep-multimodal-redis >> ${LOG_PATH}/dataprep_upload_file.log + exit 1 + else + echo "[ $SERVICE_NAME ] HTTP status is 200. Checking content..." + fi + if [[ "$RESPONSE_BODY" != *"Data preparation succeeded"* ]]; then + echo "[ $SERVICE_NAME ] Content does not match the expected result: $RESPONSE_BODY" + docker logs test-comps-dataprep-multimodal-redis >> ${LOG_PATH}/dataprep_upload_file.log + exit 1 + else + echo "[ $SERVICE_NAME ] Content is as expected." + fi + + # test v1/ingest_with_text upload video file + echo "Testing ingest_with_text API with video+transcripts" + URL="http://${ip_address}:$dataprep_service_port/v1/ingest_with_text" + + HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@$video_fn" -F "files=@$transcript_fn" -H 'Content-Type: multipart/form-data' "$URL") + HTTP_STATUS=$(echo $HTTP_RESPONSE | tr -d '\n' | sed -e 's/.*HTTPSTATUS://') + RESPONSE_BODY=$(echo $HTTP_RESPONSE | sed -e 's/HTTPSTATUS\:.*//g') + SERVICE_NAME="dataprep - upload - file" + + if [ "$HTTP_STATUS" -ne "200" ]; then + echo "[ $SERVICE_NAME ] HTTP status is not 200. Received status was $HTTP_STATUS" + docker logs test-comps-dataprep-multimodal-redis >> ${LOG_PATH}/dataprep_upload_file.log + exit 1 + else + echo "[ $SERVICE_NAME ] HTTP status is 200. Checking content..." + fi + if [[ "$RESPONSE_BODY" != *"Data preparation succeeded"* ]]; then + echo "[ $SERVICE_NAME ] Content does not match the expected result: $RESPONSE_BODY" + docker logs test-comps-dataprep-multimodal-redis >> ${LOG_PATH}/dataprep_upload_file.log + exit 1 + else + echo "[ $SERVICE_NAME ] Content is as expected." 
+ fi + + # test v1/ingest_with_text upload image file + echo "Testing ingest_with_text API with image+caption" + URL="http://${ip_address}:$dataprep_service_port/v1/ingest_with_text" + + HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@$image_fn" -F "files=@$caption_fn" -H 'Content-Type: multipart/form-data' "$URL") HTTP_STATUS=$(echo $HTTP_RESPONSE | tr -d '\n' | sed -e 's/.*HTTPSTATUS://') RESPONSE_BODY=$(echo $HTTP_RESPONSE | sed -e 's/HTTPSTATUS\:.*//g') SERVICE_NAME="dataprep - upload - file" @@ -146,11 +208,11 @@ function validate_microservice() { echo "[ $SERVICE_NAME ] Content is as expected." fi - # test v1/videos_with_transcripts upload file - echo "Testing videos_with_transcripts API" - URL="http://${ip_address}:$dataprep_service_port/v1/videos_with_transcripts" + # test v1/ingest_with_text with video and image + echo "Testing ingest_with_text API with both video+transcript and image+caption" + URL="http://${ip_address}:$dataprep_service_port/v1/ingest_with_text" - HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./$video_fn" -F "files=@./$transcript_fn" -H 'Content-Type: multipart/form-data' "$URL") + HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@$image_fn" -F "files=@$caption_fn" -F "files=@$video_fn" -F "files=@$transcript_fn" -H 'Content-Type: multipart/form-data' "$URL") HTTP_STATUS=$(echo $HTTP_RESPONSE | tr -d '\n' | sed -e 's/.*HTTPSTATUS://') RESPONSE_BODY=$(echo $HTTP_RESPONSE | sed -e 's/HTTPSTATUS\:.*//g') SERVICE_NAME="dataprep - upload - file" @@ -170,11 +232,35 @@ function validate_microservice() { echo "[ $SERVICE_NAME ] Content is as expected." fi - # test v1/generate_captions upload file - echo "Testing generate_captions API" + # test v1/ingest_with_text with invalid input (.png image with .vtt transcript) + echo "Testing ingest_with_text API with invalid input (.png and .vtt)" + URL="http://${ip_address}:$dataprep_service_port/v1/ingest_with_text" + + HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@$image_fn" -F "files=@$transcript_fn" -H 'Content-Type: multipart/form-data' "$URL") + HTTP_STATUS=$(echo $HTTP_RESPONSE | tr -d '\n' | sed -e 's/.*HTTPSTATUS://') + RESPONSE_BODY=$(echo $HTTP_RESPONSE | sed -e 's/HTTPSTATUS\:.*//g') + SERVICE_NAME="dataprep - upload - file" + + if [ "$HTTP_STATUS" -ne "400" ]; then + echo "[ $SERVICE_NAME ] HTTP status is not 400. Received status was $HTTP_STATUS" + docker logs test-comps-dataprep-multimodal-redis >> ${LOG_PATH}/dataprep_upload_file.log + exit 1 + else + echo "[ $SERVICE_NAME ] HTTP status is 400. Checking content..." + fi + if [[ "$RESPONSE_BODY" != *"No caption file found for $image_name"* ]]; then + echo "[ $SERVICE_NAME ] Content does not match the expected result: $RESPONSE_BODY" + docker logs test-comps-dataprep-multimodal-redis >> ${LOG_PATH}/dataprep_upload_file.log + exit 1 + else + echo "[ $SERVICE_NAME ] Content is as expected." 
+ fi + + # test v1/generate_captions upload video file + echo "Testing generate_captions API with video" URL="http://${ip_address}:$dataprep_service_port/v1/generate_captions" - HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@./$video_fn" -H 'Content-Type: multipart/form-data' "$URL") + HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@$video_fn" -H 'Content-Type: multipart/form-data' "$URL") HTTP_STATUS=$(echo $HTTP_RESPONSE | tr -d '\n' | sed -e 's/.*HTTPSTATUS://') RESPONSE_BODY=$(echo $HTTP_RESPONSE | sed -e 's/HTTPSTATUS\:.*//g') SERVICE_NAME="dataprep - upload - file" @@ -194,11 +280,33 @@ function validate_microservice() { echo "[ $SERVICE_NAME ] Content is as expected." fi + # test v1/generate_captions upload image file + echo "Testing generate_captions API with image" + URL="http://${ip_address}:$dataprep_service_port/v1/generate_captions" + + HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -F "files=@$image_fn" -H 'Content-Type: multipart/form-data' "$URL") + HTTP_STATUS=$(echo $HTTP_RESPONSE | tr -d '\n' | sed -e 's/.*HTTPSTATUS://') + RESPONSE_BODY=$(echo $HTTP_RESPONSE | sed -e 's/HTTPSTATUS\:.*//g') + SERVICE_NAME="dataprep - upload - file" + if [ "$HTTP_STATUS" -ne "200" ]; then + echo "[ $SERVICE_NAME ] HTTP status is not 200. Received status was $HTTP_STATUS" + docker logs test-comps-dataprep-multimodal-redis >> ${LOG_PATH}/dataprep_upload_file.log + exit 1 + else + echo "[ $SERVICE_NAME ] HTTP status is 200. Checking content..." + fi + if [[ "$RESPONSE_BODY" != *"Data preparation succeeded"* ]]; then + echo "[ $SERVICE_NAME ] Content does not match the expected result: $RESPONSE_BODY" + docker logs test-comps-dataprep-multimodal-redis >> ${LOG_PATH}/dataprep_upload_file.log + exit 1 + else + echo "[ $SERVICE_NAME ] Content is as expected." + fi - # test /v1/dataprep/get_videos - echo "Testing get_videos API" - URL="http://${ip_address}:$dataprep_service_port/v1/dataprep/get_videos" + # test /v1/dataprep/get_files + echo "Testing get_files API" + URL="http://${ip_address}:$dataprep_service_port/v1/dataprep/get_files" HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST "$URL") HTTP_STATUS=$(echo $HTTP_RESPONSE | tr -d '\n' | sed -e 's/.*HTTPSTATUS://') RESPONSE_BODY=$(echo $HTTP_RESPONSE | sed -e 's/HTTPSTATUS\:.*//g') @@ -211,7 +319,7 @@ function validate_microservice() { else echo "[ $SERVICE_NAME ] HTTP status is 200. Checking content..." fi - if [[ "$RESPONSE_BODY" != *${video_name}* ]]; then + if [[ "$RESPONSE_BODY" != *${image_name}* || "$RESPONSE_BODY" != *${video_name}* || "$RESPONSE_BODY" != *${audio_name}* ]]; then echo "[ $SERVICE_NAME ] Content does not match the expected result: $RESPONSE_BODY" docker logs test-comps-dataprep-multimodal-redis >> ${LOG_PATH}/dataprep_file.log exit 1 @@ -219,9 +327,9 @@ function validate_microservice() { echo "[ $SERVICE_NAME ] Content is as expected." 
fi - # test /v1/dataprep/delete_videos - echo "Testing delete_videos API" - URL="http://${ip_address}:$dataprep_service_port/v1/dataprep/delete_videos" + # test /v1/dataprep/delete_files + echo "Testing delete_files API" + URL="http://${ip_address}:$dataprep_service_port/v1/dataprep/delete_files" HTTP_RESPONSE=$(curl --silent --write-out "HTTPSTATUS:%{http_code}" -X POST -d '{"file_path": "dataprep_file.txt"}' -H 'Content-Type: application/json' "$URL") HTTP_STATUS=$(echo $HTTP_RESPONSE | tr -d '\n' | sed -e 's/.*HTTPSTATUS://') RESPONSE_BODY=$(echo $HTTP_RESPONSE | sed -e 's/HTTPSTATUS\:.*//g') @@ -255,8 +363,7 @@ function stop_docker() { function delete_data() { cd ${LOG_PATH} - rm -rf WeAreGoingOnBullrun.vtt - rm -rf WeAreGoingOnBullrun.mp4 + rm -rf ${tmp_dir} sleep 1s } diff --git a/tests/lvms/test_lvms_llava.sh b/tests/lvms/test_lvms_llava.sh index 18db0c40e5..4627ec6ee7 100644 --- a/tests/lvms/test_lvms_llava.sh +++ b/tests/lvms/test_lvms_llava.sh @@ -68,6 +68,17 @@ function validate_microservice() { exit 1 fi + # Test the LVM with text only (no image) + result=$(http_proxy="" curl http://localhost:$lvm_port/v1/lvm -XPOST -d '{"image": "", "prompt":"What is deep learning?"}' -H 'Content-Type: application/json') + if [[ $result == *"Deep learning is"* ]]; then + echo "Result correct." + else + echo "Result wrong." + docker logs test-comps-lvm-llava >> ${LOG_PATH}/llava-dependency.log + docker logs test-comps-lvm-llava-svc >> ${LOG_PATH}/llava-server.log + exit 1 + fi + } function stop_docker() { diff --git a/tests/lvms/test_lvms_tgi-llava_on_intel_hpu.sh b/tests/lvms/test_lvms_tgi-llava_on_intel_hpu.sh index 25af4f8691..1fa0155266 100644 --- a/tests/lvms/test_lvms_tgi-llava_on_intel_hpu.sh +++ b/tests/lvms/test_lvms_tgi-llava_on_intel_hpu.sh @@ -34,14 +34,26 @@ function validate_microservice() { lvm_port=5050 result=$(http_proxy="" curl http://localhost:$lvm_port/v1/lvm -XPOST -d '{"image": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "prompt":"What is this?"}' -H 'Content-Type: application/json') if [[ $result == *"yellow"* ]]; then - echo "Result correct." + echo "LVM prompt with an image - Result correct." else - echo "Result wrong." + echo "LVM prompt with an image - Result wrong." docker logs test-comps-lvm-tgi-llava >> ${LOG_PATH}/llava-dependency.log docker logs test-comps-lvm-tgi >> ${LOG_PATH}/llava-server.log exit 1 fi + result=$(http_proxy="" curl http://localhost:$lvm_port/v1/lvm --silent --write-out "HTTPSTATUS:%{http_code}" -XPOST -d '{"image": "", "prompt":"What is deep learning?"}' -H 'Content-Type: application/json') + http_status=$(echo $result | tr -d '\n' | sed -e 's/.*HTTPSTATUS://') + if [ "$http_status" -ne "200" ]; then + + echo "LVM prompt without image - HTTP status is not 200. Received status was $http_status" + docker logs test-comps-lvm-tgi-llava >> ${LOG_PATH}/llava-dependency.log + docker logs test-comps-lvm-tgi >> ${LOG_PATH}/llava-server.log + exit 1 + else + echo "LVM prompt without image - HTTP status (successful)" + fi + } function stop_docker() {