
[PhotoBooth] mode with image persistence alongside audio file. Also, streaming capabilities, setup script, readme updates, and narrator prompt update #39

Open

wants to merge 14 commits into main
8 changes: 8 additions & 0 deletions .env.example
@@ -0,0 +1,8 @@
# Required variables:
export OPENAI_API_KEY=
export ELEVENLABS_API_KEY=
export ELEVENLABS_VOICE_ID=

# Optional variables:
export ELEVENLABS_STREAMING=
export PHOTOBOOTH_MODE=
3 changes: 2 additions & 1 deletion .gitignore
@@ -2,4 +2,5 @@
/venv
/narration
/frames/*
!/frames/.gitkeep
.trunk
38 changes: 35 additions & 3 deletions README.md
@@ -1,8 +1,9 @@
# David Attenborough narrates your life

https://twitter.com/charliebholtz/status/1724815159590293764

## Want to make your own AI app?

Check out [Replicate](https://replicate.com). We make it easy to run machine learning models with an API.

## Setup
@@ -20,26 +21,57 @@ Then, install the dependencies:

Make a [Replicate](https://replicate.com), [OpenAI](https://beta.openai.com/), and [ElevenLabs](https://elevenlabs.io) account and set your tokens:

```bash
export OPENAI_API_KEY=<token>
export ELEVENLABS_API_KEY=<eleven-token>
```

Make a new voice in Eleven and get the voice id of that voice using their [get voices](https://elevenlabs.io/docs/api-reference/voices) API, or by clicking the flask icon next to the voice in the VoiceLab tab.

```bash
export ELEVENLABS_VOICE_ID=<voice-id>
```
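
If you would rather look the id up programmatically, here is a minimal sketch of picking a voice id out of the get-voices response. The endpoint URL and `voices`/`voice_id`/`name` field names follow the ElevenLabs API docs, but the sample ids and names below are hypothetical placeholders; the live request is shown only in comments:

```python
# Live call sketch (requires the requests package and a valid key):
#   import os, requests
#   resp = requests.get(
#       "https://api.elevenlabs.io/v1/voices",
#       headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
#   ).json()

# Sample response shape (hypothetical ids/names, for illustration only)
sample_resp = {
    "voices": [
        {"voice_id": "abc123", "name": "Rachel"},
        {"voice_id": "dav1d", "name": "Sir David"},
    ]
}


def find_voice_id(resp, name):
    """Return the voice_id for the voice with the given name, or None."""
    for voice in resp.get("voices", []):
        if voice.get("name") == name:
            return voice["voice_id"]
    return None


print(find_voice_id(sample_resp, "Sir David"))  # dav1d
```

The value this returns is what you export as `ELEVENLABS_VOICE_ID`.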

## Run it!

In one terminal, run the webcam capture:

```bash
python capture.py
```

In another terminal, run the narrator:

```bash
python narrator.py
```

## Options

### Setup

#### Script

As an alternative to running the [Setup](#setup) commands above individually, you can use the `setup.sh` script to get the virtual environment and dependencies ready.

_Note: you will have to run `source venv/bin/activate` afterwards to activate the virtual env._


#### Dotenv

You can set the environment variables via the `.env` file, which is read every time the process starts. It is recommended to copy the `.env.example` file and rename it to `.env`.

### Streaming

If you would like the speech to start sooner, enable streaming by setting the environment variable below. The trade-off is that the audio and corresponding image are not saved in the `/narration` directory.

```bash
export ELEVENLABS_STREAMING=true
```

### PhotoBooth

The default behavior of this app is to continually analyze images. If you would like to use it in a mode more similar to a photo booth, set the environment variable below. In this mode, the image will only be analyzed when the spacebar is pressed.

```bash
export PHOTOBOOTH_MODE=true
```
9 changes: 5 additions & 4 deletions capture.py
@@ -1,8 +1,9 @@
import os
import time

import cv2
import numpy as np
from PIL import Image

# Folder
folder = "frames"
@@ -30,7 +31,7 @@
        # Resize the image
        max_size = 250
        ratio = max_size / max(pil_img.size)
        new_size = tuple([int(x * ratio) for x in pil_img.size])
        resized_img = pil_img.resize(new_size, Image.LANCZOS)

        # Convert the PIL image back to an OpenCV image
130 changes: 101 additions & 29 deletions narrator.py
@@ -1,16 +1,31 @@
import base64
import errno
import json
import os
import shutil
import time

from dotenv import load_dotenv
from elevenlabs import generate, play, set_api_key, stream
from openai import OpenAI
from pynput import (  # Using pynput to listen for a keypress instead of the native keyboard module, which required admin privileges
    keyboard,
)

# import environment variables from .env file
load_dotenv()

client = OpenAI()

set_api_key(os.environ.get("ELEVENLABS_API_KEY"))

# Initializes the variables based on their respective environment variable values, defaulting to false
isStreaming = os.environ.get("ELEVENLABS_STREAMING", "false") == "true"
isPhotoBooth = os.environ.get("PHOTOBOOTH_MODE", "false") == "true"

script = []
narrator = "Sir David Attenborough"


def encode_image(image_path):
    while True:
        try:
@@ -24,17 +39,30 @@ def encode_image(image_path):
            time.sleep(0.1)


def play_audio(text, dir_path=None):
    audio = generate(
        text,
        voice=os.environ.get("ELEVENLABS_VOICE_ID"),
        model="eleven_turbo_v2",
        stream=isStreaming,
    )

    if isStreaming:
        # Stream the audio for more real-time responsiveness
        stream(audio)
        return

    # Save the audio file to the directory created by the caller
    file_path = os.path.join(dir_path, "audio.wav")

    with open(file_path, "wb") as f:
        f.write(audio)

    play(audio)


@@ -43,7 +71,10 @@ def generate_new_line(base64_image):
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": f"Describe this image as if you are {narrator}",
                },
                {
                    "type": "image_url",
                    "image_url": f"data:image/jpeg;base64,{base64_image}",
@@ -59,8 +90,8 @@ def analyze_image(base64_image, script):
        messages=[
            {
                "role": "system",
                "content": f"""
                You are {narrator}. Narrate the picture of the human as if it is a nature documentary.
                Make it snarky and funny. Don't repeat yourself. Make it short. If I do anything remotely interesting, make a big deal about it!
                """,
            },
@@ -73,30 +104,71 @@ def main():
    return response_text


def _main():
    global script

    # path to your image
    image_path = os.path.join(os.getcwd(), "./frames/frame.jpg")

    dir_path = None
    if not isStreaming:
        # create a unique directory to store the audio and image
        unique_id = base64.urlsafe_b64encode(os.urandom(30)).decode("utf-8").rstrip("=")
        dir_path = os.path.join("narration", unique_id)
        os.makedirs(dir_path, exist_ok=True)

        # copy the image to the directory
        new_image_path = os.path.join(dir_path, "image.jpg")
        shutil.copy(image_path, new_image_path)
        image_path = new_image_path

    # getting the base64 encoding
    base64_image = encode_image(image_path)

    # analyze the image
    print(f"👀 {narrator} is watching...")
    analysis = analyze_image(base64_image, script=script)

    print(f"🎙️ {narrator} says:")
    print(analysis)

    # generate and play audio
    play_audio(analysis, dir_path)

    script = script + [{"role": "assistant", "content": analysis}]


def main():
    while True:
        if not isPhotoBooth:
            _main()

        # wait for 5 seconds
        time.sleep(5)


def on_press(key):
    if key == keyboard.Key.space:
        # When space bar is pressed, run the main function which analyzes the image and generates the audio
        _main()


def on_release(key):
    if key == keyboard.Key.esc:
        # Stop listener
        return False


if isPhotoBooth:
    # Create and start a keyboard listener for the spacebar trigger
    listener = keyboard.Listener(on_press=on_press, on_release=on_release)
    listener.start()
    print(f"Press the spacebar to trigger {narrator}")

if __name__ == "__main__":
main()
4 changes: 3 additions & 1 deletion requirements.txt
@@ -28,6 +28,8 @@ pure-eval==0.2.2
pydantic==2.4.2
pydantic_core==2.10.1
Pygments==2.16.1
pynput==1.7.6
python-dotenv==1.0.0
requests==2.31.0
simpleaudio==1.0.4
six==1.16.0
@@ -38,4 +40,4 @@ traitlets==5.13.0
typing_extensions==4.8.0
urllib3==2.0.7
wcwidth==0.2.10
websockets==12.0
13 changes: 13 additions & 0 deletions setup.sh
@@ -0,0 +1,13 @@
#!/bin/bash

# create a virtual environment
python3 -m pip install virtualenv
python3 -m virtualenv venv

# source the virtual environment to install dependencies
source venv/bin/activate

# install the dependencies
pip install -r requirements.txt

echo -e "\n\n\nSetup complete. Run 'source venv/bin/activate' to activate the virtual environment.\n\nAlso, please ensure your environment variables are set correctly in the .env file."