
llama: use sliding window for phi3 #8627

Merged: 3 commits into ggerganov:master from phi3-swa on Jul 25, 2024

Conversation

FanShupei (Contributor) commented Jul 22, 2024

Related issue report: #7709

This PR switches the Phi3 model to use sliding window attention. After this PR, it no longer generates broken output after the 2,048th token. Tested on the "phi3-mini-4k-instruct" model.

TODO: (DONE) convert_hf_to_gguf.py changes
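For context, here is a minimal sketch of the difference between a plain causal mask and the sliding-window causal mask this PR enables (illustrative Python with NumPy, not the llama.cpp implementation; the window size in the example is arbitrary):

```python
import numpy as np

def build_masks(n_tokens: int, window: int) -> tuple[np.ndarray, np.ndarray]:
    """Illustrative only: plain causal mask vs. sliding-window causal mask.
    0.0 means "may attend"; -inf means "masked out"."""
    causal = np.full((n_tokens, n_tokens), -np.inf)
    swa = np.full((n_tokens, n_tokens), -np.inf)
    for i in range(n_tokens):          # query position
        for j in range(i + 1):         # causal: keys up to and including i
            causal[i, j] = 0.0
            if i - j < window:         # SWA: keep only the last `window` keys
                swa[i, j] = 0.0
    return causal, swa

# Example: with window = 4, query position 6 may attend to positions 3..6 only.
causal, swa = build_masks(n_tokens=8, window=4)
```

phi3-mini-4k-instruct's config specifies a sliding window of 2,047 tokens, which would line up with the breakage around the 2,048th token that this PR fixes.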

Review thread on src/llama.cpp (outdated, resolved):
}
}

if (data_swa) {
Collaborator

data_swa is also modified by the code above (line 14158). It is used by gemma2.

Overwriting it here may break gemma2

Contributor Author

Sorry, I'm not familiar with gemma2, so I haven't tested the PR on gemma2. I have only tested this PR with Phi3 on CPU.

I do not understand when it should be padded to GGML_KQ_MASK_PAD or not. I pad data_swa to GGML_KQ_MASK_PAD because it looks like the original code simply forgets to do so. In practice, padding data_swa or not does not affect the correctness of Phi3 on CPU.

I'm confused by the original code: data is explicitly padded to GGML_KQ_MASK_PAD but data_swa is not. Is this the intended behavior? If yes, I'm happy to revert the change (padding data_swa to GGML_KQ_MASK_PAD), but I would still like someone to explain what GGML_KQ_MASK_PAD actually means.

Collaborator

@ngxson I agree with @FanShupei here, I think data_swa should also be padded. I don't see why not, since the ranges of data written here and above do not overlap.

Not sure why this worked before though. Padding data_swa seems saner than leaving the values uninitialized.

Owner

Both should be padded. The padding is necessary so that GPU kernels (such as the Metal Flash-Attention kernel) do not have to perform extra checks for out-of-bounds access when working on chunks of data.
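To illustrate the point, here is a hedged sketch (plain Python with NumPy, not the llama.cpp code; the GGML_KQ_MASK_PAD value here is assumed) of rounding the mask's token dimension up to a fixed multiple and filling the extra rows with -inf, so a kernel can process whole tiles without per-element bounds checks:

```python
import numpy as np

GGML_KQ_MASK_PAD = 32  # assumed value, for illustration only

def pad_kq_mask(mask: np.ndarray) -> np.ndarray:
    """Round the mask's row count (token dimension) up to a multiple of
    GGML_KQ_MASK_PAD and fill the padding rows with -inf."""
    n_tokens, n_kv = mask.shape
    n_padded = -(-n_tokens // GGML_KQ_MASK_PAD) * GGML_KQ_MASK_PAD  # ceil to multiple
    padded = np.full((n_padded, n_kv), -np.inf, dtype=mask.dtype)
    padded[:n_tokens, :] = mask
    return padded
```

In that picture, the same treatment would apply to both the regular mask buffer and data_swa, since a kernel reading either buffer benefits from the same tiling guarantee.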

@github-actions github-actions bot added the python label (python script changes) Jul 24, 2024
@FanShupei FanShupei marked this pull request as ready for review July 24, 2024 07:10
@ggerganov ggerganov merged commit 8a4bad5 into ggerganov:master Jul 25, 2024
55 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024
* use sliding window for phi3

* fix typo, "data_swa" -> "data"

* [convert_hf_to_gguf.py] add phi3 sliding window
Arlodotexe commented Jul 30, 2024

This PR broke loading of all existing Phi models, even those published by Microsoft. What gives?

From LM Studio:

"llama.cpp error: 'error loading model hyperparameters: key not found in model: phi3.attention.sliding_window'"

Arlodotexe commented Jul 31, 2024

From the Ollama maintainers:

there's additional data required in the GGUF to make Phi3 fully work correctly. Latest llama.cpp adds it and requires it. It's just an update of the model file.

Looks like this PR was a fix for an issue where Phi was outputting garbage after 2k tokens, and any model that now fails to load would have produced broken output past 2k tokens anyway if we allowed it to load.

There are a number of fine-tuned Phi variants out there; I guess they all need to be separately updated before they can be used again? That's no small task, and it kinda fragments the model ecosystem. Is anything in place to ease the process or notify owners of fine-tuned variants? Not even the official ones seem to have been updated yet.
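For context on the error being reported: the updated llama.cpp loader expects a phi3.attention.sliding_window key in the GGUF metadata, and GGUF files converted before this PR do not contain it. Below is a hedged sketch of writing such a key with the gguf Python package (the file name and value are placeholders, and this is not the convert script itself; re-running an up-to-date convert_hf_to_gguf.py on the original model is the supported fix):

```python
from gguf import GGUFWriter  # gguf-py, shipped with llama.cpp

# Hypothetical, stripped-down example; a real conversion also writes tensors
# and many other hyperparameters.
writer = GGUFWriter("phi3-mini-4k-instruct.gguf", arch="phi3")
writer.add_architecture()
# The value comes from "sliding_window" in the original model's config.json.
writer.add_uint32("phi3.attention.sliding_window", 2047)
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.close()
```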

ThantZinWinnnnn

This PR broke loading of all existing Phi models, even those published by Microsoft. What gives?

From LM Studio:

"llama.cpp error: 'error loading model hyperparameters: key not found in model: phi3.attention.sliding_window'"

Me too. I can't load phi-3-mini-4k-instruct-gguf.

cllamastii

It's the same for me.
llama_model_load: error loading model: error loading model hyperparameters: key not found in model: phi3.attention.sliding_window

vladfaust

Try this one: https://huggingface.co/vladfaust/Phi-3-mini-4k-instruct-Q4_K_M-GGUF.

flatsiedatsie

I spent a lot of effort finding Phi 3 Mini versions that are specialized in specific languages. None of them work anymore. A cursory check shows that there are no updates for these models, and those updates probably won't come.

Is there perhaps a parameter that can force the models to load anyway?

@ngxson ngxson mentioned this pull request Aug 8, 2024
flatsiedatsie

Whoa! You actually created an upstream fix!

.. I'm so impressed!

And it works too!

Thank you so much for this. Gobsmacked.

@njsyw1997 njsyw1997 deleted the phi3-swa branch August 16, 2024 22:15