Create OpenAIWhisperParser for generating Documents from audio files #5580

Merged

Conversation

@rlancemartin rlancemartin commented Jun 1, 2023

OpenAIWhisperParser

This PR creates a new parser, OpenAIWhisperParser, that uses the OpenAI Whisper model to perform transcription of audio files to text (Documents). Please see the notebook for usage.

@rlancemartin rlancemartin force-pushed the rlm/audio_text_loader branch from 288a86e to 8330c55 on June 1, 2023 21:59
langchain/document_loaders/audio.py (Outdated)
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class AudioFileLoader(BaseLoader):
Collaborator

@rlancemartin 👋

This audio file loader is coupled to the file system right now and allows loading only a single file at a time.

This is OK, but users may find it more convenient to be able to scan an entire directory of audio files and decide which kinds of files to pick up (e.g., mp3, wav), get a tqdm progress bar, and eventually get concurrent sync and async implementations (the concurrency part is not implemented yet).

Let's replace:

  • Instead of inheriting from BaseLoader, let's inherit from GenericLoader. The base class will introduce convenience classmethods that will allow picking up audio files from the file system, add progress bars, etc.

  • Let's introduce a BlobParser for audio (say OpenAIWhisperParser, or something like that with a good name). It should live in document_loaders/parsers/audio.py (it can follow the structure of any other blob parser).

Testing:

Ideally, we can add a unit test: we should be able to patch openai.Audio.transcribe to return a mock transcription.
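A minimal sketch of such a test, assuming the parser ends up at langchain.document_loaders.parsers.audio and using Blob.from_data to build an in-memory blob (names are illustrative until the PR settles):

```
from unittest.mock import patch

from langchain.document_loaders.blob_loaders import Blob
from langchain.document_loaders.parsers.audio import OpenAIWhisperParser


class _FakeTranscript:
    """Stands in for the object returned by openai.Audio.transcribe."""

    text = "hello world"


def test_openai_whisper_parser() -> None:
    parser = OpenAIWhisperParser()
    blob = Blob.from_data(b"fake mp3 bytes", path="lecture.mp3")
    # Patch the OpenAI call so the test needs no network access or API key
    with patch("openai.Audio.transcribe", return_value=_FakeTranscript()):
        docs = list(parser.lazy_parse(blob))
    assert docs[0].page_content == "hello world"
    assert docs[0].metadata["source"] == "lecture.mp3"
```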

Collaborator Author

Thanks for the review!

Instead of inheriting from BaseLoader let's inherit from GenericLoader

Done. I see GenericLoader uses FileSystemBlobLoader, which (per your doc) gives us some useful features, e.g., the progress bar you mention.
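For reference, a minimal sketch of using FileSystemBlobLoader directly, assuming the show_progress flag described in that doc:

```
from langchain.document_loaders.blob_loaders import FileSystemBlobLoader

# Enumerate audio blobs from a directory with a tqdm progress bar
blob_loader = FileSystemBlobLoader("audio/", glob="*.mp3", show_progress=True)
for blob in blob_loader.yield_blobs():
    print(blob.path)
```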

loader.load()
"""

def __init__(self, audio_file_path: str = "text"):
Collaborator

Suggested change
def __init__(self, audio_file_path: str = "text"):
def __init__(self, audio_file_path: str):


def lazy_load(self) -> Document:
"""Transcribe audio file to text w/ OpenAI Whisper API."""
audio_file = open(self.audio_file_path , "rb")
Collaborator

Generally, when opening files, it's best to always use a context manager so that no file descriptors remain open:

Suggested change
audio_file = open(self.audio_file_path , "rb")
with open(self.audio_file_path, 'rb') as f:
audio_file = f.read()

Collaborator Author

@rlancemartin rlancemartin Jun 2, 2023

Minor nit on this: the bytes object created by calling f.read() inside a with statement doesn't have a name attribute associated with it, which throws an error with the Whisper API:

packages/openai/api_resources/audio.py:55, in Audio.transcribe(cls, model, file, api_key, api_base, api_type, api_version, organization, **params)
     43 @classmethod
     44 def transcribe(
     45     cls,
   (...)
     53     **params,
     54 ):
---> 55     requestor, files, data = cls._prepare_request(file, file.name, model, **params)
     56     url = cls._get_url("transcriptions")
     57     response, _, api_key = requestor.request("post", url, files=files, params=data)

AttributeError: 'bytes' object has no attribute 'name'

We may work around it with something like:

import io

with open(self.audio_file_path, "rb") as f:
    audio_data = f.read()
# Wrap the bytes in BytesIO and restore the name attribute the API expects
audio_file = io.BytesIO(audio_data)
audio_file.name = self.audio_file_path

audio_file = open(self.audio_file_path , "rb")
fpath , fname = os.path.split(self.audio_file_path)
transcript = openai.Audio.transcribe("whisper-1",audio_file)
result = Document(page_content=transcript.text,metadata={"source":fname})
Collaborator

Suggested change
result = Document(page_content=transcript.text,metadata={"source":fname})
yield Document(page_content=transcript.text,metadata={"source":fname})

fpath , fname = os.path.split(self.audio_file_path)
transcript = openai.Audio.transcribe("whisper-1",audio_file)
result = Document(page_content=transcript.text,metadata={"source":fname})
return result
Collaborator

Suggested change
return result

result = Document(page_content=transcript.text,metadata={"source":fname})
return result

def load(self) -> List:
Collaborator

Suggested change
def load(self) -> List:
def load(self) -> List[Document]:

"""
self.audio_file_path = audio_file_path

def lazy_load(self) -> Document:
Collaborator

Suggested change
def lazy_load(self) -> Document:
def lazy_load(self) -> Iterator[Document]:

from langchain.document_loaders.blob_loaders import Blob
from langchain.schema import Document

class OpenAIWhisperParser(BaseBlobParser):
Collaborator Author

@rlancemartin rlancemartin Jun 2, 2023

@eyurtsev it may be helpful to discuss this briefly in case I misunderstood. I followed the logic of other parsers, but I also notice that this reproduces much of the logic we already have in the document loader. Perhaps this is OK because the parser operates on a blob, whereas the document loader is initialized with an audio_file_path. But I'd like to be sure I understand loaders vs. parsers, because (at least in this case) the loader also returns a Document; the difference is strictly the input.

Collaborator

The role of the parser is to make sure that we can implement parsing/transformation logic on raw bytes without caring about where the raw bytes came from (they can be in memory or on disk or on s3 or on a website).

from typing import Iterator

from langchain.document_loaders.base import BaseBlobParser
from langchain.document_loaders.blob_loaders import Blob
from langchain.document_loaders.generic import GenericLoader
from langchain.schema import Document

class OpenAIWhisperParser(BaseBlobParser):
    """Transcribe and parse audio files using audio-to-text transcription with the OpenAI Whisper model."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Lazily parse the blob."""
        import openai  # imported lazily so the openai dependency stays optional

        with blob.as_bytes_io() as f:
            transcript = openai.Audio.transcribe("whisper-1", f)
        yield Document(page_content=transcript.text, metadata={"source": blob.source})

loader = GenericLoader.from_filesystem("directory", glob="*.mp3", parser=OpenAIWhisperParser())
docs = loader.load()

In principle, we don't even need to introduce a new loader since the GenericLoader can handle this.

One inconvenience is that at the moment it only supports loading patterns from the file system, so we can augment it to pick up a blob from a specific filepath:

GenericLoader.from_path() or something of this sort
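A purely hypothetical sketch of that idea, written as a single-file blob loader that GenericLoader could wrap (SingleFileBlobLoader is illustrative only):

```
from typing import Iterable

from langchain.document_loaders.blob_loaders import Blob, BlobLoader


class SingleFileBlobLoader(BlobLoader):
    """Hypothetical loader that yields exactly one blob for a given path."""

    def __init__(self, path: str) -> None:
        self.path = path

    def yield_blobs(self) -> Iterable[Blob]:
        yield Blob.from_path(self.path)


# Usage sketch:
# loader = GenericLoader(SingleFileBlobLoader("talk.mp3"), OpenAIWhisperParser())
```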

Collaborator Author

One inconvenience is that at the moment it only supports loading patterns from the file system, so we can augment it to pick up a blob from a specific filepath

IMO, globbing for mp3 files is OK, since a specific file name can always be used in the glob pattern.
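For example (the file name here is illustrative):

```
loader = GenericLoader.from_filesystem(
    "directory", glob="lecture_01.mp3", parser=OpenAIWhisperParser()
)
```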


loader.load()
"""

def __init__(self, audio_file_path: str):
Collaborator

Let's avoid this kind of overloading of the init with any specific audio file path since that makes it impossible to use the classmethods for the generic loader to pick up files by pattern.


I suspect we might need to update how the parser is being specified if we want to decouple the loader from parsing further; we likely need to lift the parser to be a class-level attribute.
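Hypothetically, that lift could look something like this (illustrative only; not how GenericLoader is structured today):

```
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers.audio import OpenAIWhisperParser


class AudioLoader(GenericLoader):
    # Hypothetical: the parser is a class-level attribute, so classmethods
    # like from_filesystem would not need a parser argument at call time
    parser = OpenAIWhisperParser()
```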

Collaborator Author

In principle, we don't even need to introduce a new loader since the GenericLoader can handle this.

right, this makes sense. the duplication of logic between loader and parser was odd. let me test this.

@rlancemartin rlancemartin changed the title Create new document loader for audio files Create OpenAIWhisperParser for generating text from audio files Jun 2, 2023
@rlancemartin rlancemartin changed the title Create OpenAIWhisperParser for generating text from audio files Create OpenAIWhisperParser for generating Documents from audio files Jun 2, 2023
@rlancemartin rlancemartin force-pushed the rlm/audio_text_loader branch 2 times, most recently from 78de636 to b181b1b on June 2, 2023 17:52
@rlancemartin rlancemartin merged commit aea0900 into langchain-ai:master Jun 5, 2023
rlancemartin added a commit that referenced this pull request Jun 6, 2023
This introduces the `YoutubeAudioLoader`, which will load blobs from a YouTube URL and write them to disk. Blobs are then parsed by `OpenAIWhisperParser()`, as shown in this [PR](#5580), but we extend the parser to split audio such that each chunk meets the 25MB OpenAI size limit. As shown in the notebook, this enables a very simple UX:

```
# Transcribe the video to text
loader = GenericLoader(YoutubeAudioLoader([url],save_dir),OpenAIWhisperParser())
docs = loader.load()
``` 

Tested on full set of Karpathy lecture videos:

```
# Karpathy lecture videos
urls = ["https://youtu.be/VMj-3S1tku0",
        "https://youtu.be/PaCmpygFfXo",
        "https://youtu.be/TCH_1BHY58I",
        "https://youtu.be/P6sfmUTpUmc",
        "https://youtu.be/q8SA3rM6ckI",
        "https://youtu.be/t3YJ5hKiMQ0",
        "https://youtu.be/kCc8FmEb1nY"]

# Directory to save audio files
save_dir = "~/Downloads/YouTube"

# Transcribe the videos to text
loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
docs = loader.load()
```
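For context, a rough sketch of the chunking idea using pydub (the chunk length here is an assumption; the actual split parameters live in the extended parser):

```
from pydub import AudioSegment

# Split audio into fixed-length chunks so each upload stays under
# the Whisper API's 25MB file size limit
audio = AudioSegment.from_file("lecture.mp3")
chunk_ms = 20 * 60 * 1000  # 20 minutes per chunk (assumed)
chunks = [audio[i : i + chunk_ms] for i in range(0, len(audio), chunk_ms)]
```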
Undertone0809 pushed a commit to Undertone0809/langchain that referenced this pull request Jun 19, 2023
Undertone0809 pushed a commit to Undertone0809/langchain that referenced this pull request Jun 19, 2023