Bug: Gemma2 adapter weights lm_head skipped on gguf conversion #9065
Comments
I have ~zero context here, but I will note that the Gemma family of models uses a reversible embedding, so the lm_head layer is tied to be identical to the embedding layer.
Gemma 2 uses weight tying, so the lm_head weights are the same as the token embedding weights.
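For reference, a minimal sketch of how to confirm this tying in the Hugging Face checkpoint (the model id is an assumption, not taken from the thread; any Gemma2 checkpoint should behave the same):

```python
# With tied embeddings, lm_head.weight and embed_tokens.weight share storage.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")  # assumed model id

print(model.config.tie_word_embeddings)  # expected True for Gemma2
print(model.lm_head.weight.data_ptr() == model.model.embed_tokens.weight.data_ptr())
```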
@qnixsynapse @josharian Thanks, yes, I suspect this is the reason why these weights are skipped, but I still feel the conversion should not succeed, as the adapter will not work correctly.
Yes that's correct, the adapter should not have an lm_head tensor in this case. The simple solution for now is to add a check in llama.cpp/convert_lora_to_gguf.py (lines 365 to 369 in 2339a0b).

Add the check:

```python
def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
    dest = super().modify_tensors(data_torch, name, bid)
    if name == "lm_head.weight" and len(dest) == 0:
        raise ValueError("lm_head is present in adapter, but is ignored in base model")
```

What do you think? @compilade
I think it's still applied in the compute graph (where the token embeddings tensor is duplicated for the output), so it should probably not be ignored from LoRA adapters (see lines 12020 to 12021 in 2339a0b).

Although this doesn't affect …, I'm not sure how to bypass the check. Maybe something like "don't call …". Or yes, maybe an error could be appropriate.
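To make the point concrete, here is an illustrative sketch (shapes and values invented, not from the thread): with tied weights the output projection reuses the embedding matrix, yet an lm_head LoRA delta still changes the logits, so dropping it at conversion changes what the adapter computes.

```python
# The output projection reuses the embedding matrix E, but the lm_head LoRA delta
# B @ A still shifts the logits.
import torch

hidden, vocab, rank = 16, 32, 4
E = torch.randn(vocab, hidden)   # tied token-embedding / output matrix
A = torch.randn(rank, hidden)    # lora_A trained on lm_head
B = torch.randn(vocab, rank)     # lora_B trained on lm_head
x = torch.randn(1, hidden)       # final hidden state

logits_base = x @ E.T
logits_lora = x @ (E + B @ A).T
print(torch.allclose(logits_base, logits_lora))  # False in general
```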
Yeah, you're right. The … (see line 7039 and line 4107 in 2339a0b).

So I think at least we're good on the cpp side, as it can handle LoRA tensors separately even if output and tok_embd are the same.
I think that should not be a problem (not 100% sure). I suppose that PEFT sees output and tok_embd as two different tensors. Maybe we need to check this later on, @ltoniazzi.
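A hedged sketch of that check (the model id and LoRA config are assumptions, not from the thread): list the adapter tensors PEFT creates when both lm_head and embed_tokens are targeted, and see whether they show up as separate entries despite the tied base weights.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")  # assumed model id
lora = get_peft_model(base, LoraConfig(r=8, target_modules=["lm_head", "embed_tokens"]))

# Linear targets get lora_A/lora_B, Embedding targets get lora_embedding_A/B;
# separate entries for the two modules mean PEFT treats them independently.
for name in lora.state_dict():
    if "lora" in name and ("lm_head" in name or "embed_tokens" in name):
        print(name)
```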
The problem was that calling …

Option 1: Override the default behavior of … We can do this by changing/adding … in convert_hf_to_gguf.py (lines 2696 to 2700 in 2339a0b).

Option 2: Or, as I suggested in my last comment, just throw an error if lm_head is present in the adapter.
@compilade I think currently … As you mentioned, one could avoid skipping the lm_head tensor … But I think the tricky part might be merging the adapter (…). So it looks to me that either raising an error at conversion or handling the lm_head tensor explicitly are the cleaner options.
Had a quick look and it looks like for Gemma2 (and probably all models with weight tying) … It indeed happens that merging the … Also at inference, if the adapter is in an unmerged state (…).
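For reference, a self-contained sketch of the kind of merge check described above (the model id, rank, and init choices are all assumptions): merge an lm_head-only adapter and see how the tied embedding reacts.

```python
# Because lm_head.weight and embed_tokens.weight share storage in Gemma2, merging
# the lm_head delta may also modify the embedding; the prints show whether it did.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")  # assumed model id
embed_before = base.model.embed_tokens.weight.detach().clone()

# init_lora_weights=False gives a non-zero delta so the merge actually changes weights.
lora = get_peft_model(base, LoraConfig(r=8, target_modules=["lm_head"], init_lora_weights=False))
merged = lora.merge_and_unload()

print(torch.equal(merged.model.embed_tokens.weight, embed_before))           # did the embedding move?
print(torch.equal(merged.lm_head.weight, merged.model.embed_tokens.weight))  # are they still tied?
```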
Hmm interesting. The problem with llama.cpp is that we currently don't support LoRA for …

We could, for example, detect if the model is missing … Probably we should go with the simple way for now: don't allow the user to convert such an adapter in the first place. Then we will fix it when more users use LoRA adapters with Gemma.
Agree, I can have a go at it later this week.
What happened?
The `lm_head` layer for a Gemma2 LoRA adapter is not converted by `convert_lora_to_gguf.py`, and therefore not applied at inference (ruining the performance of the adapter).

How to reproduce:
1. Train a LoRA adapter with pytorch/peft, including `lm_head` in the `target_modules` param (a sketch of this setup is given after these steps).
2. Convert the adapter and notice that the `lm_head` layer is skipped by this line in `convert_hf_to_gguf.py` (and no error is raised).
3. Run `llama-cli` to check that indeed no lora layer is applied in the respective line in llama.cpp:

```sh
./llama-cli -m base/model/path/Base-F32.gguf \
    --lora lora/model/path/Lora-F32-LoRA.gguf \
    -p "Hello Gemma2" -n 50
```
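A minimal sketch of the adapter setup in step 1 (the model id, hyperparameters, and output path are placeholders, not from the original report):

```python
# Create a Gemma2 LoRA adapter whose target_modules include lm_head, then save it
# so it can be passed to convert_lora_to_gguf.py.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")  # assumed model id
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj", "lm_head"],  # lm_head is the layer that gets skipped
)
adapter = get_peft_model(base, config)
# ... fine-tune as usual ...
adapter.save_pretrained("lora/model/path")  # placeholder output dir
```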
Expected behaviour
I think this is a bug because a user might have trained an adapter that is applied to the `lm_head` layer, so skipping it on conversion will destroy the adapter's performance. I think the code should either:

- raise an error like `Cannot convert Gemma2 adapter with lm_head layer`, or
- handle the `lm_head` lora layer correctly (the `lm_head` layer shares the weights with the `embed` layer in Gemma2, probably leading to having to create a new tensor for the `lm_head` to merge the adapter to).
to merge the adapter to).Comments
- `convert_lora_to_gguf.py` was introduced in PR Refactor lora adapter support #8332, so maybe @ngxson knows if skipping the `lm_head` is the desired outcome or if it is actually a bug. Otherwise I'm happy to try to figure out why this happens.
- … the `lm_head` lora layer correctly.
version: 3524 (bc0f887)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0
What operating system are you seeing the problem on?
MacOS, but it should be a platform-independent problem.
Relevant log output
No response