YoutubeAudioLoader and updates to OpenAIWhisperParser #5772

Merged

Collaborator

@rlancemartin rlancemartin commented Jun 6, 2023

This introduces the YoutubeAudioLoader, which will load blobs from a YouTube URL and write them to disk. Blobs are then parsed by OpenAIWhisperParser(), as shown in this PR (#5580), but we extend the parser to split audio such that each chunk meets the 25MB OpenAI size limit. As shown in the notebook, this enables a very simple UX:

# Transcribe the video to text
loader = GenericLoader(YoutubeAudioLoader([url],save_dir),OpenAIWhisperParser())
docs = loader.load()
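
The splitting step mentioned above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `chunk_bounds` is a hypothetical helper, and the 45-minute recording and 20-minute chunk length are assumed values chosen only for the example.

```python
from typing import List, Tuple


def chunk_bounds(total_ms: int, chunk_duration_ms: int) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) pairs that cover the audio without overlap."""
    if chunk_duration_ms <= 0:
        raise ValueError("chunk_duration_ms must be positive")
    return [
        (start, min(start + chunk_duration_ms, total_ms))
        for start in range(0, total_ms, chunk_duration_ms)
    ]


# Assumed values: a 45-minute recording split into 20-minute chunks.
bounds = chunk_bounds(45 * 60 * 1000, 20 * 60 * 1000)
for split_number, (start, end) in enumerate(bounds):
    print(f"part {split_number}: {start} ms to {end} ms")
```

With pydub, an `AudioSegment` is sliced in milliseconds, so each pair could be applied as `audio[start:end]` before the chunk is exported and sent for transcription.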

Tested on full set of Karpathy lecture videos:

# Karpathy lecture videos
urls = ["https://youtu.be/VMj-3S1tku0",
        "https://youtu.be/PaCmpygFfXo",
        "https://youtu.be/TCH_1BHY58I",
        "https://youtu.be/P6sfmUTpUmc",
        "https://youtu.be/q8SA3rM6ckI",
        "https://youtu.be/t3YJ5hKiMQ0",
        "https://youtu.be/kCc8FmEb1nY"]

# Directory to save audio files 
save_dir = "~/Downloads/YouTube"
 
# Transcribe the videos to text
loader = GenericLoader(YoutubeAudioLoader(urls,save_dir),OpenAIWhisperParser())
docs = loader.load()

# Split the audio into chunk_duration_ms chunks
for split_number, i in enumerate(range(0, len(audio), chunk_duration_ms)):

    print(f"Transcribing part {split_number}!")
Collaborator

Suggested change
print(f"Transcribing part {split_number}!")


with blob.as_bytes_io() as f:
    transcript = openai.Audio.transcribe("whisper-1", f)
    yield Document(
Collaborator

Should we yield a single document when the input is a single audio file, since we're trying to hide the chunking that happens under the hood? We could collect the transcripts and concatenate them. The only problem is that it's unclear which delimiter to join on.

Collaborator Author

It would be easy to do this. E.g., we can build a single blob from the combined docs:

combined_docs = " ".join(doc.page_content for doc in docs)  # the space delimiter is a placeholder

But, as discussed, it's kind of nice to have the intermediate outputs.

(The latency is somewhat high - 15 min for 2 hr video.)
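
For reference, collapsing the per-chunk outputs into one document could look something like the sketch below. The `Document` dataclass is a hypothetical stand-in for the library's class, `combine_documents` is an illustrative helper that is not part of this PR, and the default space delimiter is an assumption, since the right delimiter is still an open question in this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for the library's Document class."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def combine_documents(docs: List[Document], delimiter: str = " ") -> Document:
    """Join chunk transcripts into one Document, keeping the first chunk's metadata."""
    text = delimiter.join(doc.page_content for doc in docs)
    metadata = dict(docs[0].metadata) if docs else {}
    return Document(page_content=text, metadata=metadata)


chunks = [
    Document("First part of the talk.", {"source": "lecture.m4a"}),
    Document("Second part of the talk.", {"source": "lecture.m4a"}),
]
combined = combine_documents(chunks)
print(combined.page_content)
```

Keeping the join in a separate helper leaves the intermediate per-chunk documents available to callers who want them.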

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info(url, download=False)
    title = info.get('title', 'video')
    print(f"Writing file: {title} to {self.save_dir}")
Collaborator

Suggested change
print(f"Writing file: {title} to {self.save_dir}")

@rlancemartin rlancemartin force-pushed the rlm/simple_audio_load_and_split branch 7 times, most recently from 01a5729 to 74326d6 Compare June 6, 2023 18:39
try:
    from pydub import AudioSegment
except ImportError:
    print("Please install pydub : pip install pydub")
Collaborator

Replace the print with a raised ValueError or ImportError.

try:
    import openai
except ImportError:
    print("Please install openai : pip install openai")
Collaborator

Needs to be raised as well
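
One way to apply both of these review comments is sketched below. `require` is a hypothetical helper that is not part of this PR, and the message wording is an assumption. The point is only that a missing optional dependency stops execution with a clear error instead of printing and continuing.

```python
import importlib
from types import ModuleType


def require(module_name: str, pip_name: str) -> ModuleType:
    """Import a module, or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} package not found, please install it with "
            f"`pip install {pip_name}`"
        ) from exc


# A standard-library module is used here so the example runs anywhere.
json_module = require("json", "json")
print(json_module.dumps([1]))
```

In the loader and parser this would be called as `require("pydub", "pydub")` and `require("openai", "openai")`.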

@rlancemartin rlancemartin force-pushed the rlm/simple_audio_load_and_split branch from 74326d6 to 4f0e4ca Compare June 6, 2023 21:59
@rlancemartin rlancemartin force-pushed the rlm/simple_audio_load_and_split branch from 4f0e4ca to e1fa1a4 Compare June 6, 2023 22:03
@rlancemartin rlancemartin merged commit 4092fd2 into langchain-ai:master Jun 6, 2023
Undertone0809 pushed a commit to Undertone0809/langchain that referenced this pull request Jun 19, 2023

This introduces the `YoutubeAudioLoader`, which will load blobs from a
YouTube URL and write them to disk. Blobs are then parsed by
`OpenAIWhisperParser()`, as shown in this
[PR](langchain-ai#5580), but we extend
the parser to split audio such that each chunk meets the 25MB OpenAI
size limit. As shown in the notebook, this enables a very simple UX:

```
# Transcribe the video to text
loader = GenericLoader(YoutubeAudioLoader([url],save_dir),OpenAIWhisperParser())
docs = loader.load()
``` 

Tested on full set of Karpathy lecture videos:

```
# Karpathy lecture videos
urls = ["https://youtu.be/VMj-3S1tku0",
        "https://youtu.be/PaCmpygFfXo",
        "https://youtu.be/TCH_1BHY58I",
        "https://youtu.be/P6sfmUTpUmc",
        "https://youtu.be/q8SA3rM6ckI",
        "https://youtu.be/t3YJ5hKiMQ0",
        "https://youtu.be/kCc8FmEb1nY"]

# Directory to save audio files 
save_dir = "~/Downloads/YouTube"
 
# Transcribe the videos to text
loader = GenericLoader(YoutubeAudioLoader(urls,save_dir),OpenAIWhisperParser())
docs = loader.load()
```