Reduce memory usage during Whisper inference #431

Merged
merged 15 commits into from
Feb 4, 2023
@ggerganov (Owner) commented Jan 19, 2023

The idea is to avoid keeping all intermediate tensors of the computation graph in memory by introducing "scratch" buffers in ggml (see #272 (comment)).
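To make the idea concrete, here is a minimal sketch of a scratch arena: a bump allocator whose offset is reset between layers, so peak memory tracks the largest layer rather than the sum of all intermediates. The type `scratch_arena` and the functions `arena_init`, `arena_alloc` and `arena_reset` are hypothetical names for illustration and are not ggml's actual API or data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical bump allocator used only to illustrate the scratch idea.
typedef struct {
    uint8_t *data;  // backing memory supplied by the caller (assumed 16-byte aligned)
    size_t   size;  // capacity in bytes
    size_t   offs;  // current bump offset
    size_t   peak;  // highest offset reached so far
} scratch_arena;

static void arena_init(scratch_arena *a, void *mem, size_t size) {
    assert(mem != NULL);
    a->data = (uint8_t *) mem;
    a->size = size;
    a->offs = 0;
    a->peak = 0;
}

// Returns NULL on overflow instead of silently corrupting memory.
static void *arena_alloc(scratch_arena *a, size_t n) {
    n = (n + 15) & ~(size_t) 15;             // round the request up to 16 bytes
    if (n > a->size - a->offs) return NULL;  // not enough room left
    void *p = a->data + a->offs;
    a->offs += n;
    if (a->offs > a->peak) a->peak = a->offs;
    return p;
}

// Forget all intermediates; the next layer reuses the same bytes.
static void arena_reset(scratch_arena *a) {
    a->offs = 0;
}
```

Calling `arena_reset` at the start of each layer keeps `peak` at the size of one layer's intermediates, however many layers are evaluated.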

I initially thought it would be enough to keep just the last 2 intermediate tensors at each point.
However, that is not the case, since we have operations like this:

cur = ggml_add(ctx0,
        ggml_repeat(ctx0,
            model.e_conv_2_b,
            cur),
        cur);

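The tensor cur is used to create 2 new intermediate tensors: it is read by ggml_repeat and then again by ggml_add, so it must stay alive while the result of ggml_repeat exists.
So we need to keep more than 2 tensors in the "scratch" buffer.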
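The problem can be illustrated with a toy model in which the k-th intermediate tensor is simply placed in slot `k % n_slots` of a ring. This is an assumption made for illustration, not a description of how ggml places tensors, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

// Toy placement rule: the k-th intermediate tensor goes to slot k % n_slots.
static int ring_slot(int k, int n_slots) {
    assert(n_slots > 0);
    return k % n_slots;
}

// Models cur' = add(repeat(bias, cur), cur):
//   tensor k     = cur          (input)
//   tensor k + 1 = repeat(...)  (reads cur)
//   tensor k + 2 = add(...)     (reads tensor k + 1 AND cur)
// Returns true if the output of add would land in a slot that still
// holds one of its own inputs.
static bool add_output_clobbers_input(int k, int n_slots) {
    int s_cur = ring_slot(k,     n_slots);
    int s_rep = ring_slot(k + 1, n_slots);
    int s_out = ring_slot(k + 2, n_slots);
    return s_out == s_cur || s_out == s_rep;
}
```

With 2 slots the output of the add always lands on the slot that holds cur, while 3 slots are enough for this particular pattern.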

Initial results

Using scratch buffers during inference reduces the total memory usage for the base model from 500 MB to just 213 MB. As an added bonus, the decoder appears to be about 30% faster on an M1 Pro, with no loss of precision compared to master.

The main drawback is that scratch buffer selection is currently done manually in whisper.cpp.
This makes the code hard to read and very error-prone. I think the selection could be automated by analysing the nodes of the created compute graphs and assigning them to the correct scratch buffers, but the assignment algorithm is not trivial to implement and would require major refactoring in ggml. For now, I think it is better to clean up the code a little and wait to see if a better idea comes up.
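One way such an automated assignment might look is a liveness pass over a topologically ordered graph followed by a greedy buffer choice. The sketch below is a simplification and not the approach taken in this PR: the `node` struct, `assign_buffers` and the constants are hypothetical, and it ignores tensor sizes, views and in-place operations, which is a large part of why the real problem is not trivial.

```c
#include <assert.h>

#define MAX_NODES 64
#define NO_SRC    (-1)

// Hypothetical graph node: up to two inputs, given as indices of earlier nodes.
typedef struct { int src0, src1; } node;

// Assigns each node of a topologically ordered graph to a buffer index such
// that a node never overwrites a value that is still needed.
// Returns the number of buffers used, or -1 if n_buf_max is not enough.
static int assign_buffers(const node *g, int n, int *buf, int n_buf_max) {
    int last_use[MAX_NODES];  // last_use[i] = index of the last node reading node i
    int owner[MAX_NODES];     // owner[b]    = node currently stored in buffer b, or -1
    assert(n <= MAX_NODES && n_buf_max <= MAX_NODES);

    // Pass 1: liveness. A node with no readers is only live at its own step.
    for (int i = 0; i < n; ++i) last_use[i] = i;
    for (int i = 0; i < n; ++i) {
        if (g[i].src0 != NO_SRC) last_use[g[i].src0] = i;
        if (g[i].src1 != NO_SRC) last_use[g[i].src1] = i;
    }

    // Pass 2: greedy. Take the first buffer whose occupant is no longer read
    // by node i or by any later node.
    for (int b = 0; b < n_buf_max; ++b) owner[b] = -1;
    int n_used = 0;
    for (int i = 0; i < n; ++i) {
        int b = 0;
        while (b < n_buf_max && owner[b] != -1 && last_use[owner[b]] >= i) ++b;
        if (b == n_buf_max) return -1;
        owner[b] = i;
        buf[i]   = b;
        if (b + 1 > n_used) n_used = b + 1;
    }
    return n_used;
}
```

For a plain chain where each node reads only its predecessor, this alternates between two buffers. For a graph shaped like the `ggml_add(ggml_repeat(...), cur)` pattern, it needs three.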

Memory usage change:

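| Model  | Disk   | Mem (Old) | Mem (New) |
|--------|--------|-----------|-----------|
| tiny   | 75 MB  | ~390 MB   | ~125 MB   |
| base   | 142 MB | ~500 MB   | ~210 MB   |
| small  | 466 MB | ~1.0 GB   | ~600 MB   |
| medium | 1.5 GB | ~2.6 GB   | ~1.7 GB   |
| large  | 2.9 GB | ~4.7 GB   | ~3.3 GB   |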

Development notes:

  • Cannot use ggml_cpy with scratch tensors
  • Special-cased constant ggml tensors - need a better fix
  • We now only compute the logits for the last token in whisper_decode()
  • Use different scratch buffers for every other layer?
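The note about computing logits only for the last token can be sketched in plain C as follows. This is an illustration under assumed row-major layouts, not the code in whisper_decode(): the function name `logits_last_token` and the parameter names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

// hidden: [n_tokens][n_state] row-major decoder output (assumed layout)
// w_out:  [n_vocab][n_state]  row-major output projection (assumed layout)
// logits: [n_vocab]           result for the last token only
static void logits_last_token(const float *hidden, int n_tokens, int n_state,
                              const float *w_out, int n_vocab, float *logits) {
    assert(n_tokens > 0);
    // Only the last row of the hidden states is projected onto the vocabulary.
    const float *h = hidden + (size_t) (n_tokens - 1) * n_state;
    for (int v = 0; v < n_vocab; ++v) {
        float sum = 0.0f;
        for (int k = 0; k < n_state; ++k) {
            sum += w_out[(size_t) v * n_state + k] * h[k];
        }
        logits[v] = sum;
    }
}
```

Compared with projecting every token, the output buffer shrinks from n_tokens * n_vocab floats to n_vocab floats, and the work drops by the same factor of n_tokens.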

@ggerganov ggerganov force-pushed the mem branch 3 times, most recently from 1f7cd04 to 60d0f9d Compare January 25, 2023 20:16
@ggerganov ggerganov marked this pull request as ready for review January 29, 2023 07:33
@ggerganov ggerganov merged commit f3ee4a9 into master Feb 4, 2023
@ggerganov ggerganov deleted the mem branch February 4, 2023 07:45
rock3125 pushed a commit to rock3125/whisper.cpp that referenced this pull request Feb 21, 2023
* ggml : add "scratch" buffer support

* ggml : support for scratch ring-buffer

* ggml : bug fix in ggml_repeat()

* ggml : error on scratch buffer overflow

* whisper : use scratch buffers during inference (base model only)

* whisper : update memory usage for all models

* whisper : fix encoder memory usage

* whisper : use whisper_context functions instead of macros

* whisper : fix FF + remove it from README

* ggml : reuse ggml_new_i32

* ggml : refactor the scratch buffer storage

* whisper : reorder scratch buffers in the decoder

* main : add option to disable temp fallback

* Update README.md
anandijain pushed a commit to anandijain/whisper.cpp that referenced this pull request Apr 28, 2023
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024