Multi-modal starter kit for AI video understanding and narration. Works with Ollama (Llava, bakllava), GPT-4v

tigrisdata-community/multi-modal-starter-kit

Multi Modal Starter Kit 🤖📽️

A multi-modal starter kit that can have AI narrate a video or scene of your choice. Includes examples of how to do video processing, frame extraction, and sending frames to AI models efficiently. Costs $0 to run.

Works with the following models 👇🦙: Ollama (Llava, bakllava) and GPT-4v

Have questions? Join AI Stack devs #multi-modal-starter-kit

🎉 Demo (Sound ON 🔊)

MM-demo.mp4

Stack

Overview

Quickstart

Step 0: Fork this repo and clone it

git clone git@github.com:[YOUR_GITHUB_ACCOUNT_NAME]/multi-modal-starter-kit.git

Install dependencies

If you are using Homebrew on your machine, run brew bundle to install all the needed dependencies. If you need to install them manually, install these from your package manager of choice:

  • ffmpeg (ideally with a wide range of codecs supported; if you don't know what this means, the default package is probably fine)
  • Node.js 20.x or higher

Step 1: Set up Tigris

  1. Create an .env file
cd multi-modal-starter-kit
cp .env.example .env
  2. Set up Tigris
  • Make sure you have a fly.io account and the fly CLI installed on your computer
  • cd multi-modal-starter-kit
  • Pick a name for your version of the app. App names on fly are global, so the name has to be unique. For example, multi-modal-awesomeness
  • Create the app on fly with fly app create <your app name>, for example fly app create multi-modal-awesomeness
  • Create the storage with fly storage create
  • You should get a list of credentials (screenshot: credentials printed by fly storage create)
  • If you get a list of keys without values, destroy the bucket with fly storage destroy and try again.
  • Copy and paste these values into your .env under "Tigris"
  • Note that the .env variable for the storage bucket is NEXT_PUBLIC_BUCKET_NAME. If you copy/paste the printed BUCKET_NAME line, add the NEXT_PUBLIC_ prefix at the beginning
  3. Set the Tigris bucket CORS policy and bucket access policy
  • fly storage update YOUR_BUCKET_NAME --public
  • Make sure you have the aws CLI installed and run aws configure. Enter the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY printed above. Note that these are not actual Amazon Web Services credentials, but Tigris credentials. If you already have the aws CLI configured for Amazon, this will overwrite those values.
  • Run the following command to update the CORS policy on the bucket
    aws s3api put-bucket-cors --bucket BUCKET_NAME --cors-configuration file://cors.json --endpoint-url https://fly.storage.tigris.dev/
    

Step 2: Create a test video

We have a sample video in the assets directory that you can use to test the app. Run the following command if you want to test the app with this video:

aws s3 cp ./assets/pasta-making.mp4 s3://BUCKET_NAME --endpoint-url https://fly.storage.tigris.dev

Alternatively, you can upload your own videos.

Step 3: Set up Ollama / Llava

By default the app uses Ollama / llava for vision. If you want to use OpenAI GPT-4V instead, you can set INFERENCE_PLATFORM="OpenAI" and fill in OPENAI_API_KEY in .env
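
The platform switch described above can be sketched as follows. This is a minimal illustration, not the starter kit's actual code: the function name resolveInferencePlatform, the case-insensitive comparison, and the early error on a missing key are all assumptions. The variable name INFERENCE_PLATFORM matches the spelling used in the Step 9 example.

```typescript
// Hypothetical sketch: these names are illustrative and not taken from the repo.
type InferencePlatform = "ollama" | "openai";

function resolveInferencePlatform(
  env: Record<string, string | undefined>
): InferencePlatform {
  // Assumption: the value is compared case-insensitively.
  const raw = (env.INFERENCE_PLATFORM ?? "").trim().toLowerCase();

  if (raw === "openai") {
    // Fail early rather than at the first inference request.
    if (!env.OPENAI_API_KEY) {
      throw new Error("INFERENCE_PLATFORM is OpenAI but OPENAI_API_KEY is not set");
    }
    return "openai";
  }

  // Unset or unrecognized values fall back to the default, Ollama / llava.
  return "ollama";
}
```

In a Node.js process you would typically pass process.env as the argument.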

There are two ways to get Ollama up and running. You can either use Fly GPU, which provides very fast inference, or use your laptop.

Option 1: Fly GPU

  • Make sure you have a Fly account and flyctl installed
  • Fork ollama-demo, edit fly.toml to rename the app, and run fly launch
  • Under the ollama-demo directory, run fly ssh console. Once you have ssh'd into the instance, run ollama pull llava. By default, this pulls the llava 7B model, but you could also pull other vision models to use with your app, such as:
ollama pull llava:34b
ollama pull llava:7b-v1.6-vicuna-q4_0
...
  • You should get a hostname once fly launch succeeds. Copy and paste this value into OLLAMA_HOST in .env. Your app will now use this Fly GPU for inference.

Option 2: Your laptop

  • Install Ollama
  • Run ollama pull llava in your terminal. As mentioned under Option 1, you can also pull other models to compare the results.
  • (optional) Watch requests coming into Ollama by running this in a new terminal tab tail -f ~/.ollama/logs/server.log
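
Whichever option you choose, the app talks to Ollama over HTTP. The sketch below shows roughly what a vision request to Ollama's /api/generate endpoint looks like, with frames passed as base64 strings in the images field. The helper name buildOllamaRequest and the OllamaRequest type are hypothetical, and it assumes OLLAMA_HOST holds a full URL including the scheme (a local install listens on http://localhost:11434 by default). The starter kit's own request code may differ.

```typescript
// Sketch only: OllamaRequest and buildOllamaRequest are hypothetical names.
interface OllamaRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildOllamaRequest(
  host: string, // assumed to be the OLLAMA_HOST value, e.g. "http://localhost:11434"
  model: string, // e.g. "llava"
  prompt: string,
  framesBase64: string[] // base64-encoded frames, without a "data:" prefix
): OllamaRequest {
  // Drop trailing slashes so the path is joined with exactly one "/".
  const base = host.replace(/\/+$/, "");
  return {
    url: `${base}/api/generate`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // stream: false asks Ollama for a single JSON response instead of chunks.
    body: JSON.stringify({ model, prompt, images: framesBase64, stream: false }),
  };
}
```

The returned object can be handed to fetch as fetch(req.url, req).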

Step 4: Set up ElevenLabs

  • Go to https://elevenlabs.io/, log in, and click on your profile picture on the lower left. Select "Profile + API key". Copy the API key and save it as XI_API_KEY in the .env file
  • Select an ElevenLabs voice by clicking on "Voices" in the left side nav bar and navigating to "VoiceLab". Copy the voice ID and save it as XI_VOICE_ID in .env
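
To show where the two values end up, here is a minimal sketch of a text-to-speech request against the ElevenLabs REST API, which takes the voice ID in the URL path and the API key in the xi-api-key header. The helper name buildTtsRequest and the TtsRequest type are hypothetical, and the starter kit may call ElevenLabs differently.

```typescript
// Sketch only: TtsRequest and buildTtsRequest are hypothetical names.
interface TtsRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildTtsRequest(
  apiKey: string, // XI_API_KEY from .env
  voiceId: string, // XI_VOICE_ID from .env
  text: string // narration text to synthesize
): TtsRequest {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}`,
    method: "POST",
    headers: {
      "xi-api-key": apiKey,
      "Content-Type": "application/json",
      Accept: "audio/mpeg",
    },
    body: JSON.stringify({ text }),
  };
}
```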

Step 5: Set up Upstash

When narrating a very long video, Upstash Redis is used for pub/sub and notifies the client when new snippets of the reply come back. Upstash is also used for caching video/images so that subsequent requests don't take long.

  • Go to https://console.upstash.com/ and select "Create Database" (screenshot: database settings)
  • Once created, under the 'Node' - 'io-redis' tab, copy the whole string starting with "rediss://" and set it as the UPSTASH_REDIS_URL value in .env (screenshot: connection string)
  • On the same page, scroll down to the "Rest API" section and copy and paste everything under the ".env" tab into your .env file (screenshot: REST API variables)
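
The caching role described at the top of this step follows the usual cache-aside pattern: look up a key, and only do the slow work on a miss. The sketch below illustrates the idea with an in-memory Map standing in for Redis so that it runs without any credentials. The class name DescriptionCache and its method are hypothetical and do not come from the starter kit.

```typescript
// Hypothetical sketch: a Map stands in for Upstash Redis.
class DescriptionCache {
  private store = new Map<string, string>();

  // Return the cached value for `key`, or compute it, store it, and return it.
  async getOrCompute(
    key: string,
    compute: () => Promise<string>
  ): Promise<string> {
    const hit = this.store.get(key);
    if (hit !== undefined) {
      return hit; // cache hit: skip the expensive call
    }
    const value = await compute(); // cache miss: do the slow work once
    this.store.set(key, value);
    return value;
  }
}
```

With a real Redis client, the Map lookups would become GET and SET calls against UPSTASH_REDIS_URL, typically with an expiry so stale entries age out.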

Step 6: Run App

npm install
npm run dev

Step 7: Deploying on fly

By now you should have a functional app. Let's deploy it to the fly.io account that you set up in Step 1.

  • First, let's see what secrets are already available in our app using fly secrets list:
$ ➔  fly secrets list
NAME                            DIGEST         CREATED AT
AWS_ACCESS_KEY_ID               xxxxxxx        Feb 23 2024 20:33
AWS_ENDPOINT_URL_S3             xxxxxxx        Feb 23 2024 20:33
AWS_REGION                      xxxxxxx        Feb 23 2024 20:33
AWS_SECRET_ACCESS_KEY           xxxxxxx        Feb 23 2024 20:33
BUCKET_NAME                     xxxxxxx        Feb 23 2024 20:33
  • The secrets need to match the names in the .env.example file. Rename the BUCKET_NAME secret to NEXT_PUBLIC_BUCKET_NAME:
$ ➔ fly secrets set NEXT_PUBLIC_BUCKET_NAME=<YOUR BUCKET NAME>
$ ➔ fly secrets unset BUCKET_NAME
  • Now set all the other environment variables:
$ ➔ fly secrets set OPENAI_API_KEY=<YOUR KEY HERE>
$ ➔ fly secrets set UPSTASH_REDIS_URL=<UPSTASH REDIS URL HERE>
$ ➔ fly secrets set UPSTASH_REDIS_REST_URL=<UPSTASH REDIS REST URL HERE>
$ ➔ fly secrets set UPSTASH_REDIS_REST_TOKEN=<UPSTASH REDIS REST TOKEN HERE>
$ ➔ fly secrets set XI_API_KEY=<XI API KEY>
$ ➔ fly secrets set XI_VOICE_ID=<XI VOICE ID>
  • Once the environment is all set, we can make the app fly:
$ ➔ fly launch
$ ➔ fly deploy
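
Because a missing secret usually only shows up at runtime, a small check like the one below can be useful before deploying. It is a sketch under assumptions: the helper name missingEnvVars is hypothetical, and the list simply collects the variable names that appear in this README. OPENAI_API_KEY is left out because it is only needed when OpenAI is the inference platform.

```typescript
// Variable names collected from this README; adjust to match .env.example.
const REQUIRED_ENV_VARS: string[] = [
  "AWS_ACCESS_KEY_ID",
  "AWS_SECRET_ACCESS_KEY",
  "AWS_ENDPOINT_URL_S3",
  "AWS_REGION",
  "NEXT_PUBLIC_BUCKET_NAME",
  "UPSTASH_REDIS_URL",
  "UPSTASH_REDIS_REST_URL",
  "UPSTASH_REDIS_REST_TOKEN",
  "XI_API_KEY",
  "XI_VOICE_ID",
];

// Hypothetical helper: returns the names that are unset or empty.
function missingEnvVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV_VARS.filter((name) => !env[name]);
}
```

Calling missingEnvVars(process.env) at startup and logging the result gives a quick signal when a secret was forgotten.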

See the fly.io instructions for Next.js.

[Optional] Step 8: Production-ready workflow orchestration

There is an example in the repo that uses Inngest for workflow orchestration. Inngest is especially helpful when you have a long-running workflow, because it retries failed steps automatically. Example code is in src/inngest/functions.ts.

In this example, Inngest waits for new images to upload to Tigris, then sends the image to Ollama/OpenAI for processing. The "describe-image" step is auto-retried when there is a failure or returned JSON is malformed.

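// Runs every minute: fetch the latest snapshot, wait for the upload event, then describe the image.
export const inngestTick = inngest.createFunction(
  { id: "tick" },
  { cron: "* * * * *" },
  async ({ step }) => {
    await step.run("fetch-latest-snapshot", async () => {
      return await fetchLatestFromTigris();
    });

    // Wait up to one minute for the upload to finish; resolves to null on timeout.
    const result = await step.waitForEvent("Tigris.complete", {
      event: "Tigris.complete",
      timeout: "1m",
    });

    const url = result?.data.url;
    console.log("url", url);
    if (url) {
      // An error thrown inside step.run makes Inngest retry this step.
      await step.run("describe-image", async () => {
        return await describeImage(url);
      });
    }
  }
);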
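
Since Inngest retries a step when it throws, one way to get the "retry on malformed JSON" behavior is to validate the model's reply inside the step and throw if it does not parse. The sketch below illustrates that idea only. The function name parseDescription and the { description: string } shape are assumptions, not the schema used in src/inngest/functions.ts.

```typescript
// Hypothetical shape of the model's reply; the real schema may differ.
interface ImageDescription {
  description: string;
}

// Parse the raw model output, throwing on anything unusable so that
// the surrounding step fails and gets retried.
function parseDescription(raw: string): ImageDescription {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("Model returned malformed JSON");
  }

  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Model reply is not a JSON object");
  }

  const description = (parsed as Record<string, unknown>).description;
  if (typeof description !== "string") {
    throw new Error("Model reply is missing a string 'description' field");
  }
  return { description };
}
```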

[Optional] Step 9: Change Inference Platforms

fal

fal.ai is an inference platform that specializes in fast media model inference. To use fal with the multi-modal starter kit demo, set the INFERENCE_PLATFORM environment variable to "fal" and add a new FAL_KEY environment variable from the fal.ai website. First, create an account with fal.ai, navigate to the keys page, and follow the steps to create a key. Copy the result into the .env file and save it as FAL_KEY.

INFERENCE_PLATFORM=fal
FAL_KEY=***

Currently, only the moondream model is available with fal. Stay tuned for llava7B and llava34B.

Useful Commands

Tigris is 100% AWS CLI compatible. Here are some commands that are frequently used during active development:

Pause voice

Press 'v' to toggle the voice. This pauses the voice so it will resume at the point it was paused.

Check Tigris Dashboard

fly storage dashboard BUCKET_NAME

Periodic cleanup

Currently, the temporary snapshot files that get passed to the model and the ElevenLabs voice files are stored in the bucket and are not cleaned up. To clean these up, you can run the following from the CLI:

aws s3 rm s3://BUCKET_NAME/ --endpoint-url https://fly.storage.tigris.dev --recursive --exclude "*.mp4"
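
The effect of --exclude "*.mp4" is that every object except the uploaded videos is deleted. As a rough illustration of that selection rule (not part of the starter kit, and with a hypothetical function name), the same filter over a list of object keys looks like this:

```typescript
// Hypothetical helper mirroring --recursive --exclude "*.mp4":
// keep every key ending in ".mp4", delete the rest.
function keysToDelete(keys: string[]): string[] {
  return keys.filter((key) => !key.endsWith(".mp4"));
}
```

Running the aws command with the --dryrun flag first shows which objects would be removed without deleting anything.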

Upload videos

aws s3 cp PATH_TO_YOUR_VIDEO s3://BUCKET_NAME --endpoint-url https://fly.storage.tigris.dev
