Whisper transcribes in 30 s chunks: are the vector embeddings and attention available to later chunks? #2325
Replies: 0 comments
I don't fully understand this concept, so I'm asking for clarification.
Whisper transcribes audio in 30 s chunks. Are the vector embeddings and attention from earlier chunks still available when it processes later chunks?
Take an example:
Chunk 1: "The bank manager told me to sign the papers at the branch. Later, when I returned..."
Chunk 2: "...to the branch, I noticed that the teller was gone."
Chunk 1 clearly establishes the context for the embedding of "branch": the earlier word "bank" points to a bank branch.
Chunk 2, taken on its own, may not know whether "branch" refers to a tree, a bank, or a river, unless attention over the earlier context is still active.
The reason I ask: will transcription quality differ between splitting the audio into 30 s chunks externally (say, for a stream) and passing the full audio and letting Whisper split it into 30 s windows itself? As I understand it, the first approach resets the embeddings and attention at every chunk. Even if I pass chunks with some overlap, say 5 s, only the overlapping part carries over, not context from a chunk 5 minutes earlier.
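To make the external-chunking case concrete, here is a minimal sketch of the window arithmetic I have in mind. It does not call Whisper at all. The helper names `chunk_bounds` and `shared_audio` are hypothetical, and the 30 s window and 5 s overlap are the values assumed above. It only shows how much audio each externally cut chunk shares with an earlier one.

```python
from typing import List, Tuple

Window = Tuple[float, float]  # (start_s, end_s)


def chunk_bounds(total_s: float, window_s: float = 30.0,
                 overlap_s: float = 5.0) -> List[Window]:
    """Hypothetical helper: cut [0, total_s] into overlapping windows."""
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap must be >= 0 and shorter than the window")
    step = window_s - overlap_s  # how far each new window advances
    bounds: List[Window] = []
    start = 0.0
    while start < total_s:
        end = min(start + window_s, total_s)
        bounds.append((start, end))
        if end >= total_s:  # last window reached the end of the audio
            break
        start += step
    return bounds


def shared_audio(earlier: Window, later: Window) -> float:
    """Seconds of audio that two windows have in common (0 if disjoint)."""
    return max(0.0, min(earlier[1], later[1]) - max(earlier[0], later[0]))


if __name__ == "__main__":
    windows = chunk_bounds(70.0)  # assumed 70 s recording
    for i, w in enumerate(windows):
        # Audio shared with the very first window, where "bank" was spoken.
        print(i, w, shared_audio(windows[0], w))
```

With these assumptions, a chunk shares audio only with its immediate neighbour. Any chunk further along has zero audio in common with the chunk where the disambiguating word was spoken, which is exactly the situation I am asking about.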