Maybe I missed something, but one thing that was converted into a ring buffer is the "generated token storage".
A while ago, all generated tokens were appended to an integer vector output_tokens; once the context size was reached, part of that storage was cut away and the remaining tokens had to be re-evaluated, causing quite big delays in processing.
The sequential storage was just a "hacky" start to get going and remained that way for a while.
That mechanism was changed quite neatly: it's still a vector of integers (now in a sampling_context struct), but it's initialized to the context size, and whenever a token is generated it's appended at the end while the oldest token (the first one) is dropped.
This way the output vector now represents the actual context window.
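A minimal sketch of that idea (the struct and field names here are illustrative, not the actual llama.cpp identifiers):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical fixed-size token window: the vector is created at context size
// and then behaves like a ring / sliding window -- the oldest token is dropped
// from the front whenever a new one is pushed at the back, so the vector
// always mirrors the current context window.
struct token_window {
    std::vector<int32_t> prev; // most recent n_ctx token ids, oldest first

    explicit token_window(size_t n_ctx) : prev(n_ctx, 0) {}

    void push(int32_t token) {
        prev.erase(prev.begin()); // drop the oldest token
        prev.push_back(token);    // append the newly generated one
    }
};
```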
In addition, there are routines to modify the KV cache itself, which stores the evaluated key/value states for each processed token.
Modifying the KV cache was a bit more "raw" before that (there was no API, so you had to get into libllama.cpp and understand the cache tensor structure), but now there are quite a few functions.
Refer to this PR that added them: #3228
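If I remember right, the typical "context shift" pattern built on those functions looks roughly like the sketch below. The names and signatures (llama_kv_cache_seq_rm, llama_kv_cache_seq_shift) are from my memory of llama.h around the time of that PR and may have been renamed since, so treat this as a rough sketch rather than a drop-in snippet:

```cpp
// Rough sketch of a context shift on sequence 0 using the KV-cache API.
// n_keep : number of initial tokens (e.g. the prompt) to preserve
// n_past : number of tokens currently in the cache for this sequence
static void shift_context(struct llama_context * ctx, int n_keep, int & n_past) {
    const int n_discard = (n_past - n_keep) / 2;

    // drop the oldest n_discard tokens that come after the kept prefix
    llama_kv_cache_seq_rm(ctx, 0, n_keep, n_keep + n_discard);

    // move the remaining tokens back so their positions stay contiguous
    llama_kv_cache_seq_shift(ctx, 0, n_keep + n_discard, n_past, -n_discard);

    n_past -= n_discard;
}
```

The point is that instead of throwing away the cache and re-evaluating the kept tokens (the old, slow behaviour), the cached entries are removed and re-positioned in place.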
What does kv_cache's ring-buffer mean and how does it work?