
src: make tail invalid when kv cell is intersection for mamba #9249

Merged
1 commit merged into ggerganov:master on Sep 2, 2024

Conversation

kylo5aby (Contributor)

Reset the tail of the kv cell for the common prefix tokens for Mamba. This may fix #9170.

@mofosyne added the Review Complexity : Low label Aug 30, 2024
@compilade self-requested a review August 30, 2024 16:45
src/llama.cpp Outdated
Comment on lines 3689 to 3691
if ((0 < p0 && p0 <= cell.pos) || (0 < p1 && p1 <= cell.pos)) {
return false;
}
if (p0 <= cell.pos && p1 < cell.pos) {
tail_id = -1;
return false;
}
@compilade (Collaborator) Aug 30, 2024

This does not fix the problem. Building llama-server from this branch, I still get

llama_kv_cache_find_slot: wrong tail for seq_id 1, (0 instead of -1)
/.../src/llama.cpp:3507: GGML_ASSERT(cell.has_seq_id(seq_id)) failed

This happens when using the first request from #9170 with curl and then using the server Web UI for something unrelated. I'll investigate the root cause, but it does seem like the tail is not invalidated properly somewhere (which, by the way, also has to be done in sync with clearing the actual tail cell, which is why the tail was previously invalidated here only when not returning and only when the cell (which was the tail) would be cleared by the range conditions below).

Strangely enough, there is no problem when I try using the server UI directly.

Hmm, re-reading my parenthetical above, the problem in the code could actually simply be:

diff --git a/src/llama.cpp b/src/llama.cpp
index 2274296b..6c3d4691 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -3689,7 +3689,8 @@ static bool llama_kv_cache_seq_rm(
                 if ((0 < p0 && p0 <= cell.pos) || (0 < p1 && p1 <= cell.pos)) {
                     return false;
                 }
-                if (p0 <= cell.pos && p1 < cell.pos) {
+                // invalidate tails which will be cleared
+                if (p0 <= cell.pos && cell.pos < p1) {
                     tail_id = -1;
                 }
             }

(when diffed from master)

Re-running the requests from #9170 no longer seems to fail with this.
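
For illustration, here is a small standalone sketch, not the actual llama.cpp code (the struct and function names here are invented), of the half-open range semantics behind that one-character change: llama_kv_cache_seq_rm removes positions in [p0, p1), so the tail cell is cleared, and its tail id must be reset, exactly when cell.pos falls inside that interval. The corrected condition p0 <= cell.pos && cell.pos < p1 tests this; the original p1 < cell.pos does not.

// standalone sketch, not llama.cpp; the cell here only stores its position
#include <cassert>
#include <cstdint>

struct kv_cell {
    int32_t pos; // position of the token held by this cell
};

// a seq_rm over [p0, p1) clears this cell, so its tail id must be reset
static bool cell_cleared_by_range(const kv_cell & cell, int32_t p0, int32_t p1) {
    return p0 <= cell.pos && cell.pos < p1;
}

int main() {
    kv_cell tail = { /*pos =*/ 5 };

    // removing [0, 6) wipes the tail cell, so the tail must be invalidated
    assert(cell_cleared_by_range(tail, 0, 6));

    // the original condition `p1 < cell.pos` misses this case (6 < 5 is false),
    // leaving a dangling tail that later trips
    // GGML_ASSERT(cell.has_seq_id(seq_id))
    assert(!(0 <= tail.pos && 6 < tail.pos));
    return 0;
}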

kylo5aby (Contributor, Author)

Thanks for the feedback! I haven't tried this through curl, and I don't know why the behavior differs between curl and the web UI; maybe it is because of differences in response timing? I have seen the response times be inconsistent for the same request (same parameters).

@compilade (Collaborator) left a comment

@jploski Does this fix the problem you've seen in #9170?

jploski (Contributor) commented Sep 2, 2024

@jploski Does this fix the problem you've seen in #9170?

Yes, it does!

Although I see that the full prompt/history is being reprocessed on each request despite the prefix being the same as in the previous request (i.e. no hidden state/context caching). Is the cache supposed to work for Mamba at this point, or is it maybe still a work in progress?

compilade (Collaborator) commented Sep 2, 2024

Although I see that the full prompt/history is being reprocessed on each request despite the prefix being the same as in the previous request (i.e. no hidden state/context caching). Is the cache supposed to work for Mamba at this point, or is it maybe still a work in progress?

This happens when llama-server removes more than one token at the end of the generated text or when it trims whitespace. But caching should work otherwise.

It should be fixed in #7531, but it will probably be "unfixed" before merging because the method used there is over-engineered and too complicated (a tree of sequences to fairly share the available cells adds a lot of complexity).

The plan is to add explicit checkpoints requested by llama-server so that guarding against trimming requires fewer checkpoints.
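
To make the trimming point above concrete, here is a minimal sketch, not the actual llama.cpp code (the function and variable names are invented), of why a recurrent model such as Mamba has to reprocess from the start after the server trims tokens: the cache effectively holds one rolled-up state at the last processed position, so it can only be reused when the new prompt is an exact extension of the tokens it was computed from.

// sketch only, not llama.cpp: how many prompt tokens a single cached
// recurrent state lets us skip
#include <cstddef>
#include <cstdio>
#include <vector>

static size_t reusable_tokens(const std::vector<int> & cached,
                              const std::vector<int> & prompt) {
    // the new prompt must start with *all* cached tokens; a recurrent state
    // cannot be rewound, so any mismatch or truncation means starting over
    if (prompt.size() < cached.size()) {
        return 0;
    }
    for (size_t i = 0; i < cached.size(); ++i) {
        if (prompt[i] != cached[i]) {
            return 0;
        }
    }
    return cached.size();
}

int main() {
    std::vector<int> cached = {1, 2, 3, 4, 5};

    // prompt extends the cached tokens -> only the new tokens are processed
    printf("%zu\n", reusable_tokens(cached, {1, 2, 3, 4, 5, 6, 7})); // 5

    // the server trimmed the last token before the next request -> start over
    printf("%zu\n", reusable_tokens(cached, {1, 2, 3, 4}));          // 0
    return 0;
}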

@compilade added the merge ready and bugfix labels Sep 2, 2024
@compilade merged commit f148516 into ggerganov:master Sep 2, 2024
52 checks passed
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Labels
bugfix, merge ready, Review Complexity : Low
Development

Successfully merging this pull request may close these issues.

Bug: llama-server crashes while serving a Mamba model - GGML_ASSERT(cell.has_seq_id(seq_id)) failed
4 participants