MultimodalQnA Image and Audio Support Phase 1 (#852)
* Adds an endpoint for image ingestion

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>

* Combined image and video endpoint

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>

* Add test and update README

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>

* fixed variable name for embedding model (#1)

Signed-off-by: okhleif-IL <omar.khleif@intel.com>

* Fixed test script

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>

* Remove redundant function

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>

* get_videos, delete_videos --> get_files, delete_files (#3)

Signed-off-by: okhleif-IL <omar.khleif@intel.com>

* Updates test per review feedback

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>

* Fixed test

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>

* Add support for audio files multimodal data ingestion (#4)

* Add support for audio files multimodal data ingestion

Signed-off-by: dmsuehir <dina.s.jones@intel.com>

* Update function name

Signed-off-by: dmsuehir <dina.s.jones@intel.com>

---------

Signed-off-by: dmsuehir <dina.s.jones@intel.com>

* Change videos_with_transcripts to ingest_with_text

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>

* Add image support to video ingestion with transcript functionality

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>

* Update test and README

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>

* Updated for review suggestions

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>

* Add two tests for ingest_with_text

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>

* LVM TGI Gaudi update for prompts without images (#7)

* LVM Gaudi TGI update for prompts without images

Signed-off-by: dmsuehir <dina.s.jones@intel.com>

* Wording

Signed-off-by: dmsuehir <dina.s.jones@intel.com>

* Add a test

Signed-off-by: dmsuehir <dina.s.jones@intel.com>

---------

Signed-off-by: dmsuehir <dina.s.jones@intel.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change dummy image to be b64 encoded instead of the url (#9)

Signed-off-by: dmsuehir <dina.s.jones@intel.com>

* Updates based on review feedback (#10)

Signed-off-by: dmsuehir <dina.s.jones@intel.com>

* Test fix (#11)

Signed-off-by: dmsuehir <dina.s.jones@intel.com>

---------

Signed-off-by: Melanie Buehler <melanie.h.buehler@intel.com>
Signed-off-by: okhleif-IL <omar.khleif@intel.com>
Signed-off-by: dmsuehir <dina.s.jones@intel.com>
Co-authored-by: dmsuehir <dina.s.jones@intel.com>
Co-authored-by: Omar Khleif <omar.khleif@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Abolfazl Shahbazi <12436063+ashahba@users.noreply.github.com>
5 people authored Nov 8, 2024
1 parent 786cabe commit 29ef642
Showing 14 changed files with 618 additions and 209 deletions.
70 changes: 49 additions & 21 deletions comps/dataprep/multimodal/redis/langchain/README.md
@@ -1,6 +1,10 @@
# Dataprep Microservice for Multimodal Data with Redis

This `dataprep` microservice accepts videos (mp4 files) and their transcripts (optional) from the user and ingests them into Redis vectorstore.
This `dataprep` microservice accepts the following from the user and ingests them into a Redis vector store:

- Videos (mp4 files) and their transcripts (optional)
- Images (gif, jpg, jpeg, and png files) and their captions (optional)
- Audio (wav files)

## 🚀1. Start Microservice with Python(Option 1)

@@ -107,18 +111,18 @@ docker container logs -f dataprep-multimodal-redis

## 🚀4. Consume Microservice

Once this dataprep microservice is started, user can use the below commands to invoke the microservice to convert videos and their transcripts (optional) to embeddings and save to the Redis vector store.
Once this dataprep microservice is started, user can use the below commands to invoke the microservice to convert images and videos and their transcripts (optional) to embeddings and save to the Redis vector store.

This mircroservice has provided 3 different ways for users to ingest videos into Redis vector store corresponding to the 3 use cases.
This microservice provides 3 different ways for users to ingest files into Redis vector store corresponding to the 3 use cases.
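As a rough illustration of how the three ingestion use cases map onto endpoints, the sketch below picks an endpoint path from a file's extension and two caller-supplied flags. The function name `choose_endpoint` and its parameters are hypothetical and not part of this commit; the extension lists and endpoint paths are taken from the README text in this diff.

```python
from pathlib import Path

# Extensions listed in the README introduction of this commit.
VIDEO_EXTS = {".mp4"}
IMAGE_EXTS = {".gif", ".jpg", ".jpeg", ".png"}
AUDIO_EXTS = {".wav"}


def choose_endpoint(media_file: str, has_text_file: bool = False, has_speech: bool = False) -> str:
    """Hypothetical helper: suggest which ingestion endpoint fits a media file.

    has_text_file: a matching .vtt (video) or .txt (image) file is available.
    has_speech:    the video contains meaningful audio or recognizable speech.
    """
    ext = Path(media_file).suffix.lower()
    if ext in AUDIO_EXTS:
        # Audio files are transcribed by the whisper model.
        return "/v1/generate_transcripts"
    if ext not in VIDEO_EXTS | IMAGE_EXTS:
        raise ValueError(f"Unsupported file type: {ext}")
    if has_text_file:
        # Transcript or caption supplied by the user.
        return "/v1/ingest_with_text"
    if ext in VIDEO_EXTS and has_speech:
        # Video with speech but no transcript file.
        return "/v1/generate_transcripts"
    # Images, or videos without meaningful audio, are summarized by the LVM.
    return "/v1/generate_captions"
```

This only mirrors the decision described in the README; the service itself does not expose such a helper, and the caller still has to know whether a video contains usable speech.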

### 4.1 Consume _videos_with_transcripts_ API
### 4.1 Consume _ingest_with_text_ API

**Use case:** This API is used when a transcript file (under `.vtt` format) is available for each video.
**Use case:** This API is used when videos are accompanied by transcript files (`.vtt` format) or images are accompanied by text caption files (`.txt` format).

**Important notes:**

- Make sure the file paths after `files=@` are correct.
- Every transcript file's name must be identical with its corresponding video file's name (except their extension .vtt and .mp4). For example, `video1.mp4` and `video1.vtt`. Otherwise, if `video1.vtt` is not included correctly in this API call, this microservice will return error `No captions file video1.vtt found for video1.mp4`.
- Every transcript or caption file's name must be identical to its corresponding video or image file's name (except their extension - .vtt goes with .mp4 and .txt goes with .jpg, .jpeg, .png, or .gif). For example, `video1.mp4` and `video1.vtt`. Otherwise, if `video1.vtt` is not included correctly in the API call, the microservice will return an error `No captions file video1.vtt found for video1.mp4`.
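The naming rule above can be checked on the client side before calling the API. The following is a minimal sketch, assuming the pairing described in the note (`.vtt` with `.mp4`, `.txt` with the image extensions); the function `find_missing_text_files` is a hypothetical helper, not something the microservice provides.

```python
from pathlib import Path

# Expected text-file extension for each media extension, per the note above.
TEXT_EXT_FOR_MEDIA = {
    ".mp4": ".vtt",
    ".jpg": ".txt",
    ".jpeg": ".txt",
    ".png": ".txt",
    ".gif": ".txt",
}


def find_missing_text_files(filenames):
    """Return (media_name, expected_text_name) for every media file without its pair."""
    names = {Path(n).name for n in filenames}
    missing = []
    for name in sorted(names):
        path = Path(name)
        expected_ext = TEXT_EXT_FOR_MEDIA.get(path.suffix.lower())
        if expected_ext is None:
            continue  # not a media file (for example, the .vtt or .txt itself)
        expected_name = path.stem + expected_ext
        if expected_name not in names:
            missing.append((name, expected_name))
    return missing


if __name__ == "__main__":
    uploads = ["./video1.mp4", "./video1.vtt", "./image1.png"]
    for media, expected in find_missing_text_files(uploads):
        print(f"No captions file {expected} found for {media}")
```

The comparison is on base names only, so files from different directories with the same stem would be treated as a pair; whether the service behaves the same way is not stated in this diff.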

#### Single video-transcript pair upload

@@ -127,10 +131,20 @@ curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./video1.mp4" \
-F "files=@./video1.vtt" \
http://localhost:6007/v1/videos_with_transcripts
http://localhost:6007/v1/ingest_with_text
```

#### Single image-caption pair upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./image.jpg" \
-F "files=@./image.txt" \
http://localhost:6007/v1/ingest_with_text
```

#### Multiple video-transcript pair upload
#### Multiple file pair upload

```bash
curl -X POST \
@@ -139,16 +153,20 @@ curl -X POST \
-F "files=@./video1.vtt" \
-F "files=@./video2.mp4" \
-F "files=@./video2.vtt" \
http://localhost:6007/v1/videos_with_transcripts
-F "files=@./image1.png" \
-F "files=@./image1.txt" \
-F "files=@./image2.jpg" \
-F "files=@./image2.txt" \
http://localhost:6007/v1/ingest_with_text
```

### 4.2 Consume _generate_transcripts_ API

**Use case:** This API should be used when a video has meaningful audio or recognizable speech but its transcript file is not available.
**Use case:** This API should be used when a video has meaningful audio or recognizable speech but its transcript file is not available, or for audio files with speech.

In this use case, this microservice will use [`whisper`](https://openai.com/index/whisper/) model to generate the `.vtt` transcript for the video.
In this use case, this microservice will use [`whisper`](https://openai.com/index/whisper/) model to generate the `.vtt` transcript for the video or audio files.

#### Single video upload
#### Single file upload

```bash
curl -X POST \
@@ -157,21 +175,22 @@ curl -X POST \
http://localhost:6007/v1/generate_transcripts
```

#### Multiple video upload
#### Multiple file upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./video1.mp4" \
-F "files=@./video2.mp4" \
-F "files=@./audio1.wav" \
http://localhost:6007/v1/generate_transcripts
```

### 4.3 Consume _generate_captions_ API

**Use case:** This API should be used when a video does not have meaningful audio or does not have audio.
**Use case:** This API should be used when uploading an image, or when uploading a video that does not have meaningful audio or does not have audio.

In this use case, transcript either does not provide any meaningful information or does not exist. Thus, it is preferred to leverage a LVM microservice to summarize the video frames.
In this use case, there is no meaningful language transcription. Thus, it is preferred to leverage a LVM microservice to summarize the frames.

- Single video upload

@@ -192,22 +211,31 @@ curl -X POST \
http://localhost:6007/v1/generate_captions
```

### 4.4 Consume get_videos API
- Single image upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./image.jpg" \
http://localhost:6007/v1/generate_captions
```

### 4.4 Consume get_files API

To get names of uploaded videos, use the following command.
To get names of uploaded files, use the following command.

```bash
curl -X POST \
-H "Content-Type: application/json" \
http://localhost:6007/v1/dataprep/get_videos
http://localhost:6007/v1/dataprep/get_files
```

### 4.5 Consume delete_videos API
### 4.5 Consume delete_files API

To delete uploaded videos and clear the database, use the following command.
To delete uploaded files and clear the database, use the following command.

```bash
curl -X POST \
-H "Content-Type: application/json" \
http://localhost:6007/v1/dataprep/delete_videos
http://localhost:6007/v1/dataprep/delete_files
```
2 changes: 1 addition & 1 deletion comps/dataprep/multimodal/redis/langchain/config.py
@@ -4,7 +4,7 @@
import os

# Models
EMBED_MODEL = os.getenv("EMBED_MODEL", "BridgeTower/bridgetower-large-itm-mlm-itc")
EMBED_MODEL = os.getenv("EMBEDDING_MODEL_ID", "BridgeTower/bridgetower-large-itm-mlm-itc")
WHISPER_MODEL = os.getenv("WHISPER_MODEL", "small")

# Redis Connection Information
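The practical effect of the `EMBED_MODEL` change above is that the Python constant keeps its name while the environment variable it reads becomes `EMBEDDING_MODEL_ID`. A minimal sketch of that lookup follows; the function `resolve_embed_model` and its `env` parameter are hypothetical and exist only to make the behavior easy to test.

```python
import os

DEFAULT_EMBED_MODEL = "BridgeTower/bridgetower-large-itm-mlm-itc"


def resolve_embed_model(env=None):
    """Mimic the new lookup: read EMBEDDING_MODEL_ID, fall back to the default."""
    env = os.environ if env is None else env
    return env.get("EMBEDDING_MODEL_ID", DEFAULT_EMBED_MODEL)


if __name__ == "__main__":
    # A value exported under the old variable name is no longer consulted.
    print(resolve_embed_model({"EMBED_MODEL": "my/custom-model"}))
    print(resolve_embed_model({"EMBEDDING_MODEL_ID": "my/custom-model"}))
```

Deployments that previously exported `EMBED_MODEL` would therefore silently fall back to the default model until the variable is renamed.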
71 changes: 61 additions & 10 deletions comps/dataprep/multimodal/redis/langchain/multimodal_utils.py
@@ -39,8 +39,8 @@ def clear_upload_folder(upload_path):
os.rmdir(dir_path)


def generate_video_id():
    """Generates a unique identifier for a video file."""
def generate_id():
    """Generates a unique identifier for a file."""
    return str(uuid.uuid4())


@@ -128,8 +128,49 @@ def convert_img_to_base64(image):
return encoded_string.decode()


def generate_annotations_from_transcript(file_id: str, file_path: str, vtt_path: str, output_dir: str):
    """Generates an annotations.json from the transcript file."""

    # Set up location to store frames and annotations
    os.makedirs(output_dir, exist_ok=True)

    # read captions file
    captions = webvtt.read(vtt_path)

    annotations = []
    for idx, caption in enumerate(captions):
        start_time = str2time(caption.start)
        end_time = str2time(caption.end)
        mid_time = (end_time + start_time) / 2
        mid_time_ms = mid_time * 1000
        text = caption.text.replace("\n", " ")

        # Create annotations for frame from transcripts with an empty image
        annotations.append(
            {
                "video_id": file_id,
                "video_name": os.path.basename(file_path),
                "b64_img_str": "",
                "caption": text,
                "time": mid_time_ms,
                "frame_no": 0,
                "sub_video_id": idx,
            }
        )

    # Save transcript annotations as json file for further processing
    with open(os.path.join(output_dir, "annotations.json"), "w") as f:
        json.dump(annotations, f)

    return annotations
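The `time` field written by `generate_annotations_from_transcript` is the midpoint of each caption, in milliseconds. The sketch below reproduces that arithmetic on its own. `str2time` is not shown in this diff, so `timestamp_to_seconds` is a hypothetical stand-in that assumes WebVTT-style `HH:MM:SS.mmm` timestamps.

```python
def timestamp_to_seconds(timestamp: str) -> float:
    """Hypothetical stand-in for str2time: parse 'HH:MM:SS.mmm' into seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def caption_midpoint_ms(start: str, end: str) -> float:
    """Midpoint of a caption interval in milliseconds, as in the function above."""
    start_time = timestamp_to_seconds(start)
    end_time = timestamp_to_seconds(end)
    mid_time = (end_time + start_time) / 2
    return mid_time * 1000


if __name__ == "__main__":
    # A caption spanning seconds 1 to 3 is annotated at its midpoint.
    print(caption_midpoint_ms("00:00:01.000", "00:00:03.000"))
```

Using the midpoint gives one representative timestamp per caption; for audio-only input there is no frame to extract, which is consistent with the empty `b64_img_str` and `frame_no` of 0 in the annotations.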


def extract_frames_and_annotations_from_transcripts(video_id: str, video_path: str, vtt_path: str, output_dir: str):
    """Extract frames (.png) and annotations (.json) from video file (.mp4) and captions file (.vtt)"""
    """Extract frames (.png) and annotations (.json) from media-text file pairs.
    File pairs can be a video
    file (.mp4) and transcript file (.vtt) or an image file (.png, .jpg, .jpeg, .gif) and caption file (.txt)
    """
    # Set up location to store frames and annotations
    os.makedirs(output_dir, exist_ok=True)
    os.makedirs(os.path.join(output_dir, "frames"), exist_ok=True)
@@ -139,18 +180,28 @@ def extract_frames_and_annotations_from_transcripts(video_id: str, video_path: s
    fps = vidcap.get(cv2.CAP_PROP_FPS)

    # read captions file
    captions = webvtt.read(vtt_path)
    if os.path.splitext(vtt_path)[-1] == ".vtt":
        captions = webvtt.read(vtt_path)
    else:
        with open(vtt_path, "r") as f:
            captions = f.read()

    annotations = []
    for idx, caption in enumerate(captions):
        start_time = str2time(caption.start)
        end_time = str2time(caption.end)
        if os.path.splitext(vtt_path)[-1] == ".vtt":
            start_time = str2time(caption.start)
            end_time = str2time(caption.end)

        mid_time = (end_time + start_time) / 2
        text = caption.text.replace("\n", " ")
            mid_time = (end_time + start_time) / 2
            text = caption.text.replace("\n", " ")

            frame_no = time_to_frame(mid_time, fps)
            mid_time_ms = mid_time * 1000
        else:
            frame_no = 0
            mid_time_ms = 0
            text = captions.replace("\n", " ")

        frame_no = time_to_frame(mid_time, fps)
        mid_time_ms = mid_time * 1000
        vidcap.set(cv2.CAP_PROP_POS_MSEC, mid_time_ms)
        success, frame = vidcap.read()
