llama: use sliding window for phi3 #8627
Conversation
if (data_swa) {
data_swa is also modified by the code above (line 14158). It is used by gemma2. Overwriting it here may break gemma2.
Sorry, I'm not familiar with gemma2, so I haven't tested this PR on gemma2. I have only tested it with Phi3 on CPU.
I do not understand when the mask should be padded to GGML_KQ_MASK_PAD and when not. I pad data_swa to GGML_KQ_MASK_PAD because it looks like the original code simply forgot to do so. In practice, padding data_swa or not does not affect the correctness of Phi3 on CPU.
I'm confused by the original code: data is explicitly padded to GGML_KQ_MASK_PAD but data_swa is not. Is this the intended behavior? If so, I'm happy to revert the change (padding data_swa to GGML_KQ_MASK_PAD), but I'd still like someone to explain what GGML_KQ_MASK_PAD actually means.
@ngxson I agree with @FanShupei here. I think data_swa should also be padded; I don't see why not, since the ranges of data written here and above do not overlap. Not sure why this worked before, though. Padding data_swa seems saner than leaving the values uninitialized.
Both should be padded. The padding is necessary so that GPU kernels (such as the Metal Flash-Attention kernel) do not have to perform extra checks for out-of-bounds access when working on chunks of data.
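To illustrate the point above, here is a minimal, hypothetical sketch of padding a KQ mask (the function name, buffer layout, and the value of GGML_KQ_MASK_PAD are illustrative, not the actual llama.cpp code): rows beyond the real token count are filled with -INFINITY, so kernels can process whole padded chunks and the extra rows vanish after softmax.

```cpp
#include <cmath>
#include <vector>

// Illustrative value only; the real constant lives in llama.cpp.
constexpr int GGML_KQ_MASK_PAD = 32;

// Hypothetical sketch: round the mask's row count up to a multiple of
// GGML_KQ_MASK_PAD. Padded rows are set to -INFINITY so that GPU kernels
// (e.g. a Flash-Attention kernel) can read full chunks without
// out-of-bounds checks; -INFINITY rows contribute nothing after softmax.
static std::vector<float> pad_kq_mask(const std::vector<float> &mask,
                                      int n_tokens, int n_kv) {
    const int n_padded =
        ((n_tokens + GGML_KQ_MASK_PAD - 1) / GGML_KQ_MASK_PAD) * GGML_KQ_MASK_PAD;
    std::vector<float> padded(static_cast<size_t>(n_padded) * n_kv, -INFINITY);
    // Copy the real rows; rows [n_tokens, n_padded) stay at -INFINITY.
    for (int i = 0; i < n_tokens; ++i) {
        for (int j = 0; j < n_kv; ++j) {
            padded[static_cast<size_t>(i) * n_kv + j] =
                mask[static_cast<size_t>(i) * n_kv + j];
        }
    }
    return padded;
}
```

The same treatment would apply to both data and data_swa, which is why padding only one of them looks like an oversight.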
* use sliding window for phi3
* fix typo, "data_swa" -> "data"
* [convert_hf_to_gguf.py] add phi3 sliding window
This PR broke loading of all existing Phi models, even those published by Microsoft. What gives? From LM Studio:
From the Ollama maintainers:
Looks like this PR was a fix for an issue where Phi was outputting garbage after 2k tokens, and any model that fails to load would be broken by 2k tokens if we allowed it to load anyway. There are a number of fine-tuned Phi variants out there; I guess they all need to be separately updated before they can be used again? That's no small task, and it kinda fragments the model ecosystem. Is there anything in place to ease the process or notify owners of fine-tuned variants? Not even the official ones seem to have been updated yet.
Me too. I can't load phi-3-mini-4k-instruct-gguf.
Same for me.
I spent a lot of effort finding Phi 3 Mini versions specialized in specific languages. None of them work now. A cursory check shows there are no updates for these models, and those updates probably won't come. Is there perhaps a parameter that can force the models to load anyway?
Whoa! You actually created an upstream fix! I'm so impressed! And it works too! Thank you so much for this. Gobsmacked.
Related issue report: #7709
This PR switches the Phi3 model to use sliding window attention. After this PR, it no longer generates broken output after 2,048 tokens. Tested on the "phi3-mini-4k-instruct" model.
TODO (DONE): convert_hf_to_gguf.py changes
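As a rough sketch of what sliding window attention means for the mask (hypothetical helper, not the PR's actual code): a query token i may attend to a key token j only if j <= i (causal) and i - j < n_swa (inside the window). Everything else is masked with -INFINITY, which is why a model expecting a 2k window produces garbage once the context exceeds it without this mask.

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch of building a sliding-window causal attention mask.
// mask[i * n_tokens + j] == 0.0f  -> token i may attend to token j
// mask[i * n_tokens + j] == -inf  -> attention is masked out
static std::vector<float> build_swa_mask(int n_tokens, int n_swa) {
    std::vector<float> mask(static_cast<size_t>(n_tokens) * n_tokens, -INFINITY);
    for (int i = 0; i < n_tokens; ++i) {
        // Causal: only j <= i; sliding window: only the last n_swa positions.
        for (int j = 0; j <= i; ++j) {
            if (i - j < n_swa) {
                mask[static_cast<size_t>(i) * n_tokens + j] = 0.0f;
            }
        }
    }
    return mask;
}
```

With n_swa equal to the model's configured window (4,096 minus one token of slack for phi3-mini-4k, per its config), each row keeps only the most recent n_swa positions visible, independent of total context length.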