Create OpenAIWhisperParser for generating Documents from audio files #5580
Conversation
Force-pushed 288a86e to 8330c55.
`langchain/document_loaders/audio.py` (Outdated)

```python
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader


class AudioFileLoader(BaseLoader):
```
This audio file loader is coupled to the file system right now, and allows loading only a single file at a time. This is OK, but users may find it more convenient to be able to scan an entire directory of audio files and decide which kinds of file to pick up (e.g., mp3, wav), get a tqdm progress bar, and eventually get both sync and async implementations that are concurrent (the concurrency part is not implemented yet).

Let's replace:

- Instead of inheriting from `BaseLoader`, let's inherit from `GenericLoader`: https://github.com/hwchase17/langchain/blob/5aa7264a63e13ed550b7c85cbabdf0afcdab390b/langchain/document_loaders/generic.py#L17-L17 The base class will introduce convenience classmethods that allow picking up audio files from the file system, add progress bars, etc.
- Let's introduce a `BlobParser` for audio (say `OpenAIWhisperParser`, or something like that with a good name). It should live in `document_loaders/parsers/audio.py` (it can follow the structure of any other blob parser).

Testing: Ideally we can add a unit test; we should be able to patch `openai.Audio.transcribe` to return a mock transcription.
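A minimal sketch of what such a test could look like (hypothetical, not part of this PR): register a stub `openai` module in `sys.modules` so the test runs without the real dependency and without hitting the Whisper API. The file name and transcription text below are made up for illustration.

```python
import io
import sys
import types
from unittest.mock import MagicMock

# Build a stub "openai" module so no real API call (or install) is needed.
fake_openai = types.ModuleType("openai")
fake_openai.Audio = MagicMock()
fake_openai.Audio.transcribe.return_value = types.SimpleNamespace(text="hello world")
sys.modules["openai"] = fake_openai

# The code under test would do `import openai` and call Audio.transcribe;
# here we exercise the stub directly, the way the parser would.
import openai  # resolves to the stub registered above

audio_file = io.BytesIO(b"fake-mp3-bytes")
audio_file.name = "lecture.mp3"  # Whisper reads .name to infer the format
transcript = openai.Audio.transcribe("whisper-1", audio_file)
print(transcript.text)  # -> hello world
```

In a real test suite this would be done with `unittest.mock.patch` on the imported module rather than by mutating `sys.modules` directly.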
`langchain/document_loaders/audio.py` (Outdated)

```python
    loader.load()
    """

    def __init__(self, audio_file_path: str = "text"):
```
Suggested change:

```diff
-    def __init__(self, audio_file_path: str = "text"):
+    def __init__(self, audio_file_path: str):
```
`langchain/document_loaders/audio.py` (Outdated)

```python
    def lazy_load(self) -> Document:
        """Transcribe audio file to text w/ OpenAI Whisper API."""
        audio_file = open(self.audio_file_path , "rb")
```
Generally for opening files, it's best to always use a context manager so that no file descriptors remain open.

Suggested change:

```diff
-        audio_file = open(self.audio_file_path , "rb")
+        with open(self.audio_file_path, 'rb') as f:
+            audio_file = f.read()
```
Minor nit on this: the `bytes` object created by calling `f.read()` inside a `with` statement doesn't have a `name` attribute, which throws an error w/ the Whisper API:

```
packages/openai/api_resources/audio.py:55, in Audio.transcribe(cls, model, file, api_key, api_base, api_type, api_version, organization, **params)
     43 @classmethod
     44 def transcribe(
     45     cls,
    (...)
     53     **params,
     54 ):
---> 55     requestor, files, data = cls._prepare_request(file, file.name, model, **params)
     56     url = cls._get_url("transcriptions")
     57     response, _, api_key = requestor.request("post", url, files=files, params=data)

AttributeError: 'bytes' object has no attribute 'name'
```

We may work around it with something like:

```python
import io

with open(self.audio_file_path, 'rb') as f:
    audio_data = f.read()

# Wrap the bytes in a file-like object and give it a name attribute,
# which the Whisper API uses to infer the audio format.
audio_file = io.BytesIO(audio_data)
audio_file.name = self.audio_file_path
```
`langchain/document_loaders/audio.py` (Outdated)

```python
        audio_file = open(self.audio_file_path , "rb")
        fpath , fname = os.path.split(self.audio_file_path)
        transcript = openai.Audio.transcribe("whisper-1",audio_file)
        result = Document(page_content=transcript.text,metadata={"source":fname})
```
Suggested change:

```diff
-        result = Document(page_content=transcript.text,metadata={"source":fname})
+        yield Document(page_content=transcript.text,metadata={"source":fname})
```
`langchain/document_loaders/audio.py` (Outdated)

```python
        fpath , fname = os.path.split(self.audio_file_path)
        transcript = openai.Audio.transcribe("whisper-1",audio_file)
        result = Document(page_content=transcript.text,metadata={"source":fname})
        return result
```
Suggested change (delete the line):

```diff
-        return result
```
`langchain/document_loaders/audio.py` (Outdated)

```python
        result = Document(page_content=transcript.text,metadata={"source":fname})
        return result

    def load(self) -> List:
```
Suggested change:

```diff
-    def load(self) -> List:
+    def load(self) -> List[Document]:
```
`langchain/document_loaders/audio.py` (Outdated)

```python
        """
        self.audio_file_path = audio_file_path

    def lazy_load(self) -> Document:
```
Suggested change:

```diff
-    def lazy_load(self) -> Document:
+    def lazy_load(self) -> Iterator[Document]:
```
```python
from langchain.document_loaders.blob_loaders import Blob
from langchain.schema import Document


class OpenAIWhisperParser(BaseBlobParser):
```
@eyurtsev may be helpful to discuss this briefly in case I misunderstood. I followed the logic of other parsers, but I also notice that this will reproduce much of the logic we already have in the document loader. Perhaps this is OK because the parser operates on a `Blob` whereas the document loader is initialized w/ an `audio_file_path`. But I'd like to be sure I understand loaders vs. parsers, because, at least in this case, the loader also returns a `Document`. The difference is strictly the input.
The role of the parser is to make sure that we can implement parsing/transformation logic on raw bytes without caring about where the raw bytes came from (they can be in memory, on disk, on S3, or on a website).

```python
from typing import Iterator

from langchain.document_loaders.base import BaseBlobParser
from langchain.document_loaders.blob_loaders import Blob
from langchain.document_loaders.generic import GenericLoader
from langchain.schema import Document


class OpenAIWhisperParser(BaseBlobParser):
    """Transcribe and parse audio files using audio-to-text transcription with the OpenAI Whisper model."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Lazily parse the blob."""
        import openai

        with blob.as_bytes_io() as f:
            transcript = openai.Audio.transcribe("whisper-1", f)
        yield Document(page_content=transcript.text, metadata={"source": blob.source})


loader = GenericLoader.from_filesystem("directory", glob="*.mp3", parser=OpenAIWhisperParser())
docs = loader.load()
```

In principle, we don't even need to introduce a new loader, since the `GenericLoader` can handle this. One inconvenience is that at the moment it only supports loading patterns from the file system, so we can augment it to pick up a blob from a specific filepath via `GenericLoader.from_path()` or something of this sort.
> One inconvenience is that at the moment it only supports loading patterns from the file system, so we can augment it to pick up a blob from a specific filepath

imo `glob` for mp3 files is OK, since a specific file name can always be added to the pattern.
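To illustrate that point with a plain stdlib sketch (not langchain code; the file names are made up): a glob pattern can match a whole extension, and a specific file name is just a narrower pattern.

```python
import fnmatch

names = ["talk.mp3", "intro.mp3", "notes.txt"]

# "*.mp3" picks up every audio file in the listing...
print([n for n in names if fnmatch.fnmatch(n, "*.mp3")])      # -> ['talk.mp3', 'intro.mp3']

# ...while a specific file name selects exactly one entry.
print([n for n in names if fnmatch.fnmatch(n, "intro.mp3")])  # -> ['intro.mp3']
```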
`langchain/document_loaders/audio.py` (Outdated)

```python
    loader.load()
    """

    def __init__(self, audio_file_path: str):
```
Let's avoid this kind of overloading of the `__init__` with any specific audio file path, since that makes it impossible to use the classmethods on the generic loader to pick up files by pattern. I suspect we might need to update how the parser is being specified if we want to decouple the loader from parsing more; we likely need to lift the parser to be a class-level attribute.
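A rough sketch of what lifting the parser to a class-level attribute could look like. All names here are illustrative stand-ins, not the actual langchain classes:

```python
from typing import Iterator, List


class Blob:
    """Minimal stand-in for langchain's Blob (hypothetical, for illustration)."""
    def __init__(self, data: bytes, source: str):
        self.data, self.source = data, source


class BaseBlobParser:
    def lazy_parse(self, blob: Blob) -> Iterator[str]:
        raise NotImplementedError


class UpperCaseParser(BaseBlobParser):
    """Toy parser standing in for OpenAIWhisperParser."""
    def lazy_parse(self, blob: Blob) -> Iterator[str]:
        yield blob.data.decode().upper()


class GenericLoaderSketch:
    # Parser lifted to a class-level attribute, so classmethod constructors
    # (e.g. a from_filesystem()) can build loaders without threading a
    # parser argument through __init__.
    parser: BaseBlobParser = UpperCaseParser()

    def __init__(self, blobs: List[Blob]):
        self.blobs = blobs

    def load(self) -> List[str]:
        return [doc for blob in self.blobs for doc in self.parser.lazy_parse(blob)]


docs = GenericLoaderSketch([Blob(b"hello", "a.txt")]).load()
print(docs)  # -> ['HELLO']
```

Subclasses (or callers) can then swap the parser by overriding the class attribute, without changing how loaders are constructed.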
> In principle, we don't even need to introduce a new loader since the GenericLoader can handle this.

Right, this makes sense; the duplication of logic between the loader and the parser was odd. Let me test this.
Force-pushed 78de636 to b181b1b, then b181b1b to ac00b73.
This introduces the `YoutubeAudioLoader`, which will load blobs from a YouTube URL and write them. Blobs are then parsed by `OpenAIWhisperParser()`, as shown in this [PR](#5580), but we extend the parser to split audio such that each chunk meets the 25MB OpenAI size limit. As shown in the notebook, this enables a very simple UX:

```python
# Transcribe the video to text
loader = GenericLoader(YoutubeAudioLoader([url], save_dir), OpenAIWhisperParser())
docs = loader.load()
```

Tested on the full set of Karpathy lecture videos:

```python
# Karpathy lecture videos
urls = ["https://youtu.be/VMj-3S1tku0",
        "https://youtu.be/PaCmpygFfXo",
        "https://youtu.be/TCH_1BHY58I",
        "https://youtu.be/P6sfmUTpUmc",
        "https://youtu.be/q8SA3rM6ckI",
        "https://youtu.be/t3YJ5hKiMQ0",
        "https://youtu.be/kCc8FmEb1nY"]

# Directory to save audio files
save_dir = "~/Downloads/YouTube"

# Transcribe the videos to text
loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
docs = loader.load()
```
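The 25MB-limit arithmetic can be sketched as follows. This is a hypothetical helper for illustration only, not the actual splitting code, which would also need to cut the audio on valid frame boundaries rather than at raw byte offsets:

```python
import math

API_LIMIT_BYTES = 25 * 1024 * 1024  # OpenAI's per-request file size limit


def n_chunks(total_bytes: int, limit_bytes: int = API_LIMIT_BYTES) -> int:
    """Number of equal-sized pieces needed so each stays within the limit."""
    return max(1, math.ceil(total_bytes / limit_bytes))


# A ~60MB lecture recording needs three chunks.
print(n_chunks(60 * 1024 * 1024))  # -> 3
```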
…angchain-ai#5580)

# OpenAIWhisperParser

This PR creates a new parser, `OpenAIWhisperParser`, that uses the [OpenAI Whisper model](https://platform.openai.com/docs/guides/speech-to-text/quickstart) to perform transcription of audio files to text (`Documents`). Please see the notebook for usage.