Create OpenAIWhisperParser for generating Documents from audio files #5580

Merged

Conversation

@rlancemartin rlancemartin commented Jun 1, 2023

OpenAIWhisperParser

This PR creates a new parser, OpenAIWhisperParser, that uses the OpenAI Whisper model to perform transcription of audio files to text (Documents). Please see the notebook for usage.

@rlancemartin rlancemartin force-pushed the rlm/audio_text_loader branch from 288a86e to 8330c55 on June 1, 2023 21:59
langchain/document_loaders/audio.py (Outdated)
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class AudioFileLoader(BaseLoader):
Collaborator

@rlancemartin 👋

This audio file loader is coupled to the file system right now and allows loading only a single file at a time.

This is OK, but users may find it more convenient to be able to scan an entire directory of audio files and decide which kinds of files to pick up (e.g., mp3, wav), get a tqdm progress bar, and eventually get concurrent sync and async implementations (the concurrency part is not implemented yet).

Let's replace:

  • Instead of inheriting from BaseLoader, let's inherit from GenericLoader. The base class will introduce convenience classmethods that will allow picking up audio files from the file system, add progress bars, etc.

  • Let's introduce a BlobParser for audio (say OpenAIWhisperParser, or something like that with a good name). It should live in document_loaders/parsers/audio.py (it can follow the structure of any other blob parser).

Testing:

Ideally, we can add a unit test: we should be able to patch openai.Audio.transcribe to return a mock transcription.
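A minimal sketch of such a test, assuming the parser ends up at langchain.document_loaders.parsers.audio and using Blob.from_data to build an in-memory blob (names are illustrative until the PR settles):

```
from unittest.mock import patch

from langchain.document_loaders.blob_loaders import Blob
from langchain.document_loaders.parsers.audio import OpenAIWhisperParser


class _FakeTranscript:
    """Stands in for the object returned by openai.Audio.transcribe."""

    text = "hello world"


def test_openai_whisper_parser() -> None:
    parser = OpenAIWhisperParser()
    blob = Blob.from_data(b"fake mp3 bytes", path="lecture.mp3")
    # Patch the OpenAI call so the test needs no network access or API key
    with patch("openai.Audio.transcribe", return_value=_FakeTranscript()):
        docs = list(parser.lazy_parse(blob))
    assert docs[0].page_content == "hello world"
    assert docs[0].metadata["source"] == "lecture.mp3"
```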

Collaborator Author

Thanks for the review!

Instead of inheriting from BaseLoader let's inherit from GenericLoader

Done. I see GenericLoader uses FileSystemBlobLoader, which (per your doc) gives us some useful features, e.g., the progress bar you mention.
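For reference, a minimal sketch of using FileSystemBlobLoader directly, assuming the show_progress flag described in that doc:

```
from langchain.document_loaders.blob_loaders import FileSystemBlobLoader

# Enumerate audio blobs from a directory with a tqdm progress bar
blob_loader = FileSystemBlobLoader("audio/", glob="*.mp3", show_progress=True)
for blob in blob_loader.yield_blobs():
    print(blob.path)
```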

loader.load()
"""

def __init__(self, audio_file_path: str = "text"):
Collaborator

Suggested change
def __init__(self, audio_file_path: str = "text"):
def __init__(self, audio_file_path: str):


def lazy_load(self) -> Document:
"""Transcribe audio file to text w/ OpenAI Whisper API."""
audio_file = open(self.audio_file_path , "rb")
Collaborator

Generally, when opening files, it's best to always use a context manager so that no file descriptors remain open:

Suggested change
audio_file = open(self.audio_file_path , "rb")
with open(self.audio_file_path, 'rb') as f:
audio_file = f.read()

Collaborator Author

@rlancemartin rlancemartin Jun 2, 2023

Minor nit on this: the bytes object created by calling f.read() inside a with statement doesn't have a name attribute associated with it, which throws an error with the Whisper API:

packages/openai/api_resources/audio.py:55, in Audio.transcribe(cls, model, file, api_key, api_base, api_type, api_version, organization, **params)
     43 @classmethod
     44 def transcribe(
     45     cls,
   (...)
     53     **params,
     54 ):
---> 55     requestor, files, data = cls._prepare_request(file, file.name, model, **params)
     56     url = cls._get_url("transcriptions")
     57     response, _, api_key = requestor.request("post", url, files=files, params=data)

AttributeError: 'bytes' object has no attribute 'name'

We may work around it with something like:

import io

with open(self.audio_file_path, "rb") as f:
    audio_data = f.read()
# Wrap the bytes in BytesIO and restore the name attribute the API expects
audio_file = io.BytesIO(audio_data)
audio_file.name = self.audio_file_path

audio_file = open(self.audio_file_path , "rb")
fpath , fname = os.path.split(self.audio_file_path)
transcript = openai.Audio.transcribe("whisper-1",audio_file)
result = Document(page_content=transcript.text,metadata={"source":fname})
Collaborator

Suggested change
result = Document(page_content=transcript.text,metadata={"source":fname})
yield Document(page_content=transcript.text,metadata={"source":fname})

fpath , fname = os.path.split(self.audio_file_path)
transcript = openai.Audio.transcribe("whisper-1",audio_file)
result = Document(page_content=transcript.text,metadata={"source":fname})
return result
Collaborator

Suggested change
return result

result = Document(page_content=transcript.text,metadata={"source":fname})
return result

def load(self) -> List:
Collaborator

Suggested change
def load(self) -> List:
def load(self) -> List[Document]:

"""
self.audio_file_path = audio_file_path

def lazy_load(self) -> Document:
Collaborator

Suggested change
def lazy_load(self) -> Document:
def lazy_load(self) -> Iterator[Document]:

from langchain.document_loaders.blob_loaders import Blob
from langchain.schema import Document

class OpenAIWhisperParser(BaseBlobParser):
Collaborator Author

@rlancemartin rlancemartin Jun 2, 2023

@eyurtsev it may be helpful to discuss this briefly in case I misunderstood. I followed the logic of other parsers, but I also notice that this reproduces much of the logic we already have in the document loader. Perhaps this is OK because the parser operates on a blob, whereas the document loader is initialized with an audio_file_path. But I'd like to be sure I understand loaders vs. parsers, because (at least in this case) the loader also returns a Document; the difference is strictly the input.

Collaborator

The role of the parser is to make sure that we can implement parsing/transformation logic on raw bytes without caring about where the raw bytes came from (they can be in memory or on disk or on s3 or on a website).

from typing import Iterator

from langchain.document_loaders.base import BaseBlobParser
from langchain.document_loaders.blob_loaders import Blob
from langchain.document_loaders.generic import GenericLoader
from langchain.schema import Document

class OpenAIWhisperParser(BaseBlobParser):
    """Transcribe and parse audio files using audio-to-text transcription with the OpenAI Whisper model."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        """Lazily parse the blob."""
        import openai  # imported lazily so the openai dependency stays optional

        with blob.as_bytes_io() as f:
            transcript = openai.Audio.transcribe("whisper-1", f)
        yield Document(page_content=transcript.text, metadata={"source": blob.source})

loader = GenericLoader.from_filesystem("directory", glob="*.mp3", parser=OpenAIWhisperParser())
docs = loader.load()

In principle, we don't even need to introduce a new loader since the GenericLoader can handle this.

One inconvenience is that at the moment it only supports loading patterns from the file system, so we can augment it to pick up a blob from a specific filepath:

GenericLoader.from_path() or something of this sort
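A purely hypothetical sketch of that idea, written as a single-file blob loader that GenericLoader could wrap (SingleFileBlobLoader is illustrative only):

```
from typing import Iterable

from langchain.document_loaders.blob_loaders import Blob, BlobLoader


class SingleFileBlobLoader(BlobLoader):
    """Hypothetical loader that yields exactly one blob for a given path."""

    def __init__(self, path: str) -> None:
        self.path = path

    def yield_blobs(self) -> Iterable[Blob]:
        yield Blob.from_path(self.path)


# Usage sketch:
# loader = GenericLoader(SingleFileBlobLoader("talk.mp3"), OpenAIWhisperParser())
```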

Collaborator Author

One inconvenience is that at the moment it only supports loading patterns from the file system, so we can augment it to pick up a blob from a specific filepath

IMO, globbing for mp3 files is OK, since a specific file name can always be used in the glob pattern.
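For example (the file name here is illustrative):

```
loader = GenericLoader.from_filesystem(
    "directory", glob="lecture_01.mp3", parser=OpenAIWhisperParser()
)
```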


loader.load()
"""

def __init__(self, audio_file_path: str):
Collaborator

Let's avoid this kind of overloading of the init with any specific audio file path since that makes it impossible to use the classmethods for the generic loader to pick up files by pattern.


I suspect we might need to update how the parser is being specified if we want to decouple the loader from parsing further; we likely need to lift the parser to be a class-level attribute.
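Hypothetically, that lift could look something like this (illustrative only; not how GenericLoader is structured today):

```
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers.audio import OpenAIWhisperParser


class AudioLoader(GenericLoader):
    # Hypothetical: the parser is a class-level attribute, so classmethods
    # like from_filesystem would not need a parser argument at call time
    parser = OpenAIWhisperParser()
```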

Collaborator Author

In principle, we don't even need to introduce a new loader since the GenericLoader can handle this.

right, this makes sense. the duplication of logic between loader and parser was odd. let me test this.

@rlancemartin rlancemartin changed the title Create new document loader for audio files Create OpenAIWhisperParser for generating text from audio files Jun 2, 2023
@rlancemartin rlancemartin changed the title Create OpenAIWhisperParser for generating text from audio files Create OpenAIWhisperParser for generating Documents from audio files Jun 2, 2023
@rlancemartin rlancemartin force-pushed the rlm/audio_text_loader branch 2 times, most recently from 78de636 to b181b1b on June 2, 2023 17:52
@rlancemartin rlancemartin merged commit aea0900 into langchain-ai:master Jun 5, 2023
rlancemartin added a commit that referenced this pull request Jun 6, 2023
This introduces the `YoutubeAudioLoader`, which will load blobs from a YouTube URL and write them to disk. Blobs are then parsed by `OpenAIWhisperParser()`, as shown in this [PR](#5580), but we extend the parser to split audio such that each chunk meets the 25MB OpenAI size limit. As shown in the notebook, this enables a very simple UX:

```
# Transcribe the video to text
loader = GenericLoader(YoutubeAudioLoader([url],save_dir),OpenAIWhisperParser())
docs = loader.load()
``` 

Tested on full set of Karpathy lecture videos:

```
# Karpathy lecture videos
urls = ["https://youtu.be/VMj-3S1tku0",
        "https://youtu.be/PaCmpygFfXo",
        "https://youtu.be/TCH_1BHY58I",
        "https://youtu.be/P6sfmUTpUmc",
        "https://youtu.be/q8SA3rM6ckI",
        "https://youtu.be/t3YJ5hKiMQ0",
        "https://youtu.be/kCc8FmEb1nY"]

# Directory to save audio files
save_dir = "~/Downloads/YouTube"

# Transcribe the videos to text
loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
docs = loader.load()
```
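For context, a rough sketch of the chunking idea using pydub (the chunk length here is an assumption; the actual split parameters live in the extended parser):

```
from pydub import AudioSegment

# Split audio into fixed-length chunks so each upload stays under
# the Whisper API's 25MB file size limit
audio = AudioSegment.from_file("lecture.mp3")
chunk_ms = 20 * 60 * 1000  # 20 minutes per chunk (assumed)
chunks = [audio[i : i + chunk_ms] for i in range(0, len(audio), chunk_ms)]
```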
Undertone0809 pushed a commit to Undertone0809/langchain that referenced this pull request Jun 19, 2023
Undertone0809 pushed a commit to Undertone0809/langchain that referenced this pull request Jun 19, 2023