Problems with saving standalone gemma-2b-it after fine-tuning with LoRA on TPU v3-8 #29659
Comments
I modified the code a little bit to make some sanity checks.

def train():
    gemma2it = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")  # sanity check model
    tokenizer = AutoTokenizer.from_pretrained("NousResearch/gemma-2b-it-tokenizer")
    model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", torch_dtype=torch.bfloat16)
    dataset = load_dataset("pawkanarek/poke_test", split="train")
    lora_config = LoraConfig(r=8, target_modules=["k_proj", "v_proj"], task_type="CAUSAL_LM")
    fsdp_config = {"fsdp_transformer_layer_cls_to_wrap": ["GemmaDecoderLayer"], "xla": True, "xla_fsdp_v2": True, "xla_fsdp_grad_ckpt": True}
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        tokenizer=tokenizer,
        args=TrainingArguments(
            per_device_train_batch_size=64,
            num_train_epochs=4,
            output_dir="output/trained_model",
            optim="adafactor",
            dataloader_drop_last=True,  # Required for SPMD.
            fsdp="full_shard",
            fsdp_config=fsdp_config,
        ),
        peft_config=lora_config,
        max_seq_length=2048,
    )
    # 1
    trainer.train()
    print("comparing gemma2it with trainer.model")
    compare_weights(gemma2it, trainer.model)  # different GemmaForCausalLM:2506172416 params vs SpmdFullyShardedDataParallel:3031123968 params
    # 2
    merged_model = trainer.model.merge_and_unload()
    print("comparing gemma2it with merged_model")
    compare_weights(gemma2it, merged_model)  # different GemmaForCausalLM:2506172416 params vs GemmaForCausalLM:3030460416 params
    # 3
    print("saving merged_model")
    merged_model.to("cpu")
    merged_model.save_pretrained("output/merged_model")
    compare_weights(gemma2it, merged_model)  # different GemmaForCausalLM:2506172416 params vs GemmaForCausalLM:3030460416 params
    # 4
    print("comparing loaded merged_model from disk with in-memory merged_model")
    loaded_merged_model = AutoModelForCausalLM.from_pretrained("output/merged_model")
    compare_weights(merged_model, loaded_merged_model)  # different GemmaForCausalLM:3030460416 params vs GemmaForCausalLM:2506172416 params
    # 5
    print("comparing gemma2it with loaded merged_model from disk")
    compare_weights(gemma2it, loaded_merged_model)  # models GemmaForCausalLM and GemmaForCausalLM are the same

I added these sanity checks against the base, untouched model. Looks like there is something fishy with my code when saving / loading the model from disk... I'll update if I notice what's wrong. I will check why my weights are saved to something called
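One way to see what actually ended up on disk is to list the checkpoint keys directly. A minimal sketch (the shard file name is an assumption; check what save_pretrained actually wrote in output/merged_model):

from safetensors import safe_open

# Inspect the first few tensor names in the saved checkpoint to spot unexpected prefixes.
with safe_open("output/merged_model/model.safetensors", framework="pt") as f:
    for key in list(f.keys())[:10]:
        print(key)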
Hi @PawKanarek, please see #29388. By the way, have you tested LoRA fine-tuning performance on TPU XLA? I have explored LoRA a bit, but it had no effect on the base model and the generated text is essentially the same as the base model's.
Hi @zorrofox, and thanks for the insight! Looks like my transformers fork didn't include the change from that PR.
I used the
I think that I fixed it, but I won't recommend this fix to anyone, so I'm not even thinking about making a PR. It's a patch rather than a fix, but I think it works. To check if it really works I will train gemma-2b-it until it overfits on the training dataset and then I will take a look at the inference output.

To apply my patch you would have to add a new parameter to save_pretrained:

formatting_weights_func = None,

and also add this code before sharding (https://github.com/huggingface/transformers/blob/03847ef45189d328a51f428b0a61a6b891e69f88/src/transformers/modeling_utils.py#L2429C1-L2437C111):

# apply formatting to the weights before saving
if formatting_weights_func is not None:
    for old_key in list(state_dict.keys()):
        new_key = formatting_weights_func(old_key)
        logger.debug(f"changed {old_key=} to {new_key=}")
        state_dict[new_key] = state_dict.pop(old_key)

With these changes I can finally spot a difference between a trained model loaded from disk and the base model it was trained from, and the warning is also gone:
def compare_weights(model1, model2):
    name1, name2 = model1.__class__.__name__, model2.__class__.__name__
    params1, params2 = model1.parameters(), model2.parameters()
    sum1, sum2 = sum(p.numel() for p in params1), sum(p.numel() for p in params2)
    if (sum1 != sum2):
        print(f"!!! different in {name1}:{sum1} params vs {name2}:{sum2} params")
    for (n1, p1), (n2, p2) in zip(model1.named_parameters(), model2.named_parameters()):
        if n1 != n2:
            print(f"!!! Parameter names differ: {n1} != {n2}")
            return False
        if not torch.equal(p1.data, p2.data):
            print(f"!!! Parameter values differ: {n1}, {p1.data}, {p2.data}")
            return False

def formmating_func(old_key):
    return old_key.replace('._orig_module', '')

def train():
    # the same training config as before
    trainer.train()
    trainer_model = trainer.model.to('cpu')
    merged_model = trainer_model.merge_and_unload()
    merged_model.save_pretrained("output/merged_model", formatting_weights_func=formmating_func)
    loaded_merged_model = AutoModelForCausalLM.from_pretrained("output/merged_model")
    gemma2it = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
    print("!!! comparing gemma2it with loaded merged_model from disk")
    compare_weights(gemma2it, loaded_merged_model)  # !!! FINALLY !!! Parameter values differ: model.layers.0.self_attn.k_proj.weight, tensor([[-3.2043e-04, 8.1177e-03, 3.0365e-03, ..., -5.3101e-03,

I'm not closing this issue, because I didn't fix it; the true issue is still hidden somewhere. That's only a workaround.
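For reference, a similar effect can probably be achieved without patching transformers at all, because save_pretrained already accepts a state_dict argument. A rough sketch of that variant (it assumes the stray key prefix really is "._orig_module", as observed above):

# Strip the wrapper prefix from the keys and hand the cleaned state_dict to save_pretrained,
# instead of modifying modeling_utils.py.
merged_model = trainer_model.merge_and_unload()
clean_state_dict = {k.replace("._orig_module", ""): v for k, v in merged_model.state_dict().items()}
merged_model.save_pretrained("output/merged_model", state_dict=clean_state_dict)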
@PawKanarek Thanks a lot for your advice, I have the same issue as you. I think you have found the root cause of why the trained model doesn't change.
@PawKanarek just to isolate the error, what happens if you run the same code on a GPU instead of a TPU?
@PawKanarek can you also provide the training logs please, and run with
@PawKanarek also, after training, can you try saving with
@PawKanarek also, did it work with your patch?
@PawKanarek one last thing that I would like to see: does the generation differ when using this: model =
@shub-kris thanks,
I don't have a GPU capable of training the gemma-2b-it model. I have only my local MacBook with the MPS backend and Google Cloud TPUs (thanks to https://sites.research.google/trc/about/).
I will try to give you the logs tomorrow. Today the machine is busy with training :)
I tried it many times, no success.
Yes. It works.
Sadly, I don't have an NVIDIA GPU.
@PawKanarek thanks for your answers. I am having a look and will post here once I get to the root of the issue.
I tried the following script on a GPU:

import torch
import peft
import trl
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer

print(f"{torch.__version__=}")
print(f"{peft.__version__=}")
print(f"{trl.__version__=}")

def check_model_weights_equality(model1, model2):
    params1, params2 = model1.parameters(), model2.parameters()
    sum1 = sum(p.numel() for p in params1)
    sum2 = sum(p.numel() for p in params2)
    if (sum1 != sum2):
        print(f"Number of parameters are different in {model1.__class__}:{sum1} and {model2.__class__}:{sum2} are different")
        return False
    for p1, p2 in zip(params1, params2):
        if not torch.equal(p1, p2):
            print(f"weights of {model1.__class__} and {model2.__class__} are different")
            return False
    print(f"models {model1.__class__} and {model2.__class__} are the same")
    return True

def count_parameters(model):
    return sum(p.numel() for p in model.parameters())

def train():
    model_id = "google/gemma-2b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    dataset = load_dataset("pawkanarek/poke_test", split="train")
    lora_config = LoraConfig(r=16, target_modules=["k_proj", "v_proj"], task_type="CAUSAL_LM", lora_alpha=16, lora_dropout=0.05,)
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        tokenizer=tokenizer,
        args=TrainingArguments(
            per_device_train_batch_size=2,
            max_steps=40,  # few steps for brevity, but the same happens with longer runs
            output_dir="output/trained_model",
            optim="adafactor",
            logging_steps=1,
            learning_rate=3e-4,
        ),
        peft_config=lora_config,
        max_seq_length=512,
    )
    trainer.train()
    trainer.save_model()
    merged_model = trainer.model.merge_and_unload()  # merge LoRA with base model
    merged_model.to("cpu")
    print(type(merged_model), count_parameters(merged_model))
    merged_model.save_pretrained("adapters_merged")
    ### VERIFICATION, ENSURE THAT MODEL WAS TRAINED
    trained_model = AutoModelForCausalLM.from_pretrained("adapters_merged")
    print(type(trained_model), count_parameters(trained_model))
    original_model = AutoModelForCausalLM.from_pretrained(model_id)
    print(type(original_model), count_parameters(original_model))
    check_model_weights_equality(trained_model, original_model)

if __name__ == "__main__":
    train()

And here was the output:
So, the issue has nothing to do with the TPU for sure. However, one thing I would still like to verify is whether your way of checking model-weight equality is correct. I will get back to you on that.
Logs:
@moficodes I think you did misunderstand my intentions. I want to save a standalone model, not just the LoRA adapter. You saved only the LoRA adapter (with

Please take a look at this updated script. I changed the comparing function to be more descriptive, and I added more logging as @shub-kris asked.

# Make sure to run the script with the following envs:
# PJRT_DEVICE=TPU XLA_USE_SPMD=1
import torch
import torch_xla
import peft
import trl
import torch_xla.core.xla_model as xm
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft.peft_model import PeftModel
from trl import SFTTrainer
from transformers import logging, IntervalStrategy

device = xm.xla_device()  # Set up TPU device.

def models_equal(model1, model2):
    name1, name2 = model1.__class__.__name__, model2.__class__.__name__
    params1, params2 = model1.parameters(), model2.parameters()
    sum1, sum2 = sum(p.numel() for p in params1), sum(p.numel() for p in params2)
    if (sum1 != sum2):
        print(f"!!! number of params are different in {name1}:{sum1} params vs {name2}:{sum2} params")
    for (n1, p1), (n2, p2) in zip(model1.named_parameters(), model2.named_parameters()):
        if n1 != n2:
            print(f"!!! Parameter names differ: {n1} != {n2}")
            return False
        if not torch.equal(p1.data, p2.data):
            print(f"!!! Parameter values differ: {n1}, {p1.data}, {p2.data}")
            return False
    print(f"!!! models {name1} and {name2} are the same")
    return True

def train():
    model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
    dataset = load_dataset("pawkanarek/poke_test", split="train")
    lora_config = LoraConfig(r=8, target_modules=["k_proj", "v_proj"], task_type="CAUSAL_LM")
    fsdp_config = {"fsdp_transformer_layer_cls_to_wrap": ["GemmaDecoderLayer"], "xla": True, "xla_fsdp_v2": True, "xla_fsdp_grad_ckpt": True}
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        tokenizer=tokenizer,
        args=TrainingArguments(
            logging_steps=1,
            save_strategy=IntervalStrategy.EPOCH,
            per_device_train_batch_size=64,
            num_train_epochs=1,
            output_dir="output/trained_model",
            optim="adafactor",
            dataloader_drop_last=True,  # Required for SPMD.
            fsdp="full_shard",
            fsdp_config=fsdp_config,
        ),
        peft_config=lora_config,
        max_seq_length=2048,
    )
    trainer.train()
    trainer.save_model()
    base_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", return_dict=True, torch_dtype=torch.bfloat16)
    new_model = PeftModel.from_pretrained(base_model, "output/trained_model")
    new_model = new_model.merge_and_unload()
    new_model.save_pretrained("output/new_model")
    new_model_from_disk = AutoModelForCausalLM.from_pretrained("output/new_model", torch_dtype=torch.bfloat16)
    base_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", torch_dtype=torch.bfloat16)
    print(f"are equal after load from disk? {models_equal(base_model, new_model_from_disk)}")  # they are equal after loading from disk
    print(1)

if __name__ == "__main__":
    logging.set_verbosity(logging.DEBUG)
    train()

As you can see, at the end I again get the information that the base model and the model loaded from disk are the same.
I'm open to investigate further.
Hi @PawKanarek, I tried a new script which is very similar to yours, and I ran inference before and after training; the results are different, which verifies that the model was trained and also saved correctly.

Script:

# PJRT_DEVICE=TPU XLA_USE_SPMD=1
import torch
import torch_xla
import peft
import trl
import torch_xla.core.xla_model as xm
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer

print(f"{torch.__version__=}")
print(f"{torch_xla.__version__=}")
print(f"{peft.__version__=}")
print(f"{trl.__version__=}")

device = xm.xla_device()  # Set up TPU device.

def inference(model, tokenizer):
    text = "Quote: Imagination is more"
    device = "cpu"
    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=20)  # generate only supported on GPU and CPU
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

def train():
    model_id = "google/gemma-2b"
    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # tokenizer.pad_token = tokenizer.eos_token

    # Load and process dataset
    raw_dataset = load_dataset("Abirate/english_quotes", split="train")
    lora_config = LoraConfig(r=8, target_modules="all-linear", task_type="CAUSAL_LM", lora_alpha=16, lora_dropout=0.05,)
    fsdp_config = {"fsdp_transformer_layer_cls_to_wrap": ["GemmaDecoderLayer"], "xla": True, "xla_fsdp_v2": True, "xla_fsdp_grad_ckpt": True}
    trainer = SFTTrainer(
        model=model,
        # train_dataset=format_dataset,
        train_dataset=raw_dataset,
        tokenizer=tokenizer,
        args=TrainingArguments(
            per_device_train_batch_size=32,
            num_train_epochs=10,
            output_dir="output",
            optim="adafactor",
            logging_steps=1,
            learning_rate=3e-4,
            save_strategy="no",
            dataloader_drop_last=True,  # Required for SPMD.
            fsdp="full_shard",
            fsdp_config=fsdp_config,
        ),
        peft_config=lora_config,
        max_seq_length=1024,
        packing=True,
        dataset_text_field="quote",
    )
    trainer.train()
    trainer.save_model()
    merged_model = trainer.model.merge_and_unload()  # merge LoRA with base model
    merged_model.to("cpu")
    merged_model.save_pretrained("adapters_merged")

    ### VERIFICATION, ENSURE THAT MODEL WAS TRAINED
    trained_model = AutoModelForCausalLM.from_pretrained("adapters_merged")
    original_model = AutoModelForCausalLM.from_pretrained(model_id)
    print("Inference with base model: \n\n")
    inference(original_model, tokenizer)
    print("Inference with trained model: \n\n")
    inference(trained_model, tokenizer)

if __name__ == "__main__":
    train()

Logs:

Inference results:

@amyeroberts we can close this issue #29659 and also the issue #29608.
I also tried without FSDP. With the fine-tuned model I got this result:
Thank you @shub-kris! I will run this script on my local machine and then I will share the results. One question: what does tokenizer.pad_token = tokenizer.eos_token do?
It configures the tokenizer's padding token to be the same as its end-of-sequence (EOS) token. But you don't need it for this use case, as the tokenizer already has a pad_token defined here.
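A quick way to check this yourself (a minimal sketch; it assumes, as stated above, that the Gemma tokenizer ships with its own pad token):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
print(tokenizer.pad_token, tokenizer.eos_token)  # Gemma defines a dedicated pad token, so no override is needed

# For tokenizers that ship without a pad token, you would set one explicitly:
# tokenizer.pad_token = tokenizer.eos_token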
The inference result I got from the TPU is not like your logs:
@zorrofox it's like mine: #29659 (comment)
But the inference result is very different.
@zorrofox nothing is different. Please go through the comment once again; if it's different, what exactly is different? Are you referring to this comment: #29659 (comment)? There I tried without FSDP.
I think that my original method for comparing weights was broken. When I access the parameters with model1.parameters(), the returned generator is exhausted after the first full pass over it:

params1 = model1.parameters()
print(len(list(params1)))  # prints 164
print(len(list(params1)))  # prints 0

I tried your code @shub-kris and I get exactly the same result from the merged model:

That looks kinda broken, and I still experience this warning when loading the merged model:

But maybe this should be addressed in another issue. Thanks once more for investigating and debugging.
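For what it's worth, a comparison that side-steps the exhausted-generator pitfall could look roughly like this (a sketch, not the exact code used elsewhere in this thread):

import torch

def weights_equal(model1, model2):
    # Materialize named parameters once so nothing is consumed twice.
    params1 = dict(model1.named_parameters())
    params2 = dict(model2.named_parameters())
    if params1.keys() != params2.keys():
        print("parameter names differ")
        return False
    for name, p1 in params1.items():
        if not torch.equal(p1.data, params2[name].data):
            print(f"parameter values differ: {name}")
            return False
    return True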
Is this happening when you're loading a saved model?
@amyeroberts No, I copied that warning message from @zorrofox's comment (#29659 (comment)), but I remember that I also experienced this warning. To be 100% certain, I once again launched the code from this comment of @shub-kris (#29659 (comment)), and this is my output:

As you can see, I also get that kind of warning when loading the merged model.
@PawKanarek I am now able to replicate the error/warning you get; earlier I couldn't. While debugging, I encountered this error when running with

Can you please re-run the script with these commented out:

and reduce the batch size according to your TPU, and post the results here again. cc @amyeroberts
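The exact snippet is not preserved here; judging from the follow-up ("with commented-out FSDP"), the request was presumably to disable the FSDP settings in TrainingArguments, roughly like this (illustrative values only):

args = TrainingArguments(
    per_device_train_batch_size=8,   # reduced batch size, pick whatever fits your TPU
    num_train_epochs=10,
    output_dir="output",
    optim="adafactor",
    logging_steps=1,
    learning_rate=3e-4,
    save_strategy="no",
    dataloader_drop_last=True,
    # fsdp="full_shard",             # commented out
    # fsdp_config=fsdp_config,       # commented out
)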
@shub-kris with commented-out FSDP and a reduced batch size, here is the output:
@PawKanarek thank you for the confirmation. I now need to look into what's going wrong when we use FSDP. cc @amyeroberts
@alanwaketan can you please take a look into it?
I think the issue is that, for the FSDP-wrapped model, we need to unwrap the model before saving it. I have given instructions to @shub-kris for fixing the unwrap logic in HF. If things don't work out in HF, I will provide a utility in torch-xla to unwrap the model.
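For intuition, "unwrapping" here means recovering the original module from the FSDP wrapper before saving. A rough sketch (it assumes the wrapper exposes the wrapped model via a module attribute, which is how the PyTorch/XLA FSDP wrappers generally behave; this is only an illustration, not the actual fix that landed):

# trainer.model may be wrapped, e.g. as SpmdFullyShardedDataParallel(PeftModel(...))
wrapped = trainer.model
inner = wrapped.module if hasattr(wrapped, "module") else wrapped  # peel off the FSDP wrapper
inner.save_pretrained("output/unwrapped_model")                    # keys no longer carry the wrapper prefix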
@PawKanarek @zorrofox can you now try with the PR #29780? For me everything works perfectly now.
@shub-kris This time the merged-model loading warning has disappeared, but the inference result is not very good.

train output:

torch.__version__='2.3.0.dev20240312+cu121'
torch_xla.__version__='2.3.0+git97acc14'
peft.__version__='0.9.0'
trl.__version__='0.7.11'
Loading checkpoint shards: 100%|...| 2/2 [00:01<00:00, 1.82it/s]
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1711089816.717824 13174 pjrt_api.cc:100] GetPjrtApi was found for tpu at /home/admin_greghuang_altostrat_com/.local/lib/python3.10/site-packages/torch_xla/lib/libtpu.so
I0000 00:00:1711089816.717896 13174 pjrt_api.cc:79] PJRT_Api is set for device type tpu
I0000 00:00:1711089816.717907 13174 pjrt_api.cc:146] The PJRT plugin has PJRT API version 0.40. The framework PJRT API version is 0.40.
/home/admin_greghuang_altostrat_com/.local/lib/python3.10/site-packages/torch_xla/core/xla_model.py:104: UserWarning: `devkind` argument is deprecated and will be removed in a future release.
  warnings.warn("`devkind` argument is deprecated and will be removed in a "
/home/admin_greghuang_altostrat_com/.local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:294: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
  warnings.warn(
/home/admin_greghuang_altostrat_com/.local/lib/python3.10/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
0%| | 0/30 [00:00
  warnings.warn("For backward hooks to be called,"
/home/admin_greghuang_altostrat_com/.local/lib/python3.10/site-packages/torch/autograd/graph.py:744: UserWarning: aten::reshape: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
  return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
3%|... | 1/30 [01:31<44:15, 91.58s/it]
{'loss': 4.273, 'grad_norm': 3.734375, 'learning_rate': 0.00029, 'epoch': 0.33}
{'loss': 4.1188, 'grad_norm': 4.09375, 'learning_rate': 0.00028, 'epoch': 0.67}
{'loss': 3.7833, 'grad_norm': 5.21875, 'learning_rate': 0.00027, 'epoch': 1.0}
{'loss': 3.3963, 'grad_norm': 4.375, 'learning_rate': 0.00026, 'epoch': 1.33}
{'loss': 3.2293, 'grad_norm': 2.75, 'learning_rate': 0.00025, 'epoch': 1.67}
{'loss': 2.9096, 'grad_norm': 1.9296875, 'learning_rate': 0.00023999999999999998, 'epoch': 2.0}
{'loss': 2.7546, 'grad_norm': 1.953125, 'learning_rate': 0.00023, 'epoch': 2.33}
{'loss': 2.7717, 'grad_norm': 2.453125, 'learning_rate': 0.00021999999999999995, 'epoch': 2.67}
{'loss': 2.6428, 'grad_norm': 2.921875, 'learning_rate': 0.00020999999999999998, 'epoch': 3.0}
{'loss': 2.6672, 'grad_norm': 1.6640625, 'learning_rate': 0.00019999999999999998, 'epoch': 3.33}
{'loss': 2.5239, 'grad_norm': 4.4375, 'learning_rate': 0.00018999999999999998, 'epoch': 3.67}
{'loss': 2.4252, 'grad_norm': 1.578125, 'learning_rate': 0.00017999999999999998, 'epoch': 4.0}
{'loss': 2.455, 'grad_norm': 2.265625, 'learning_rate': 0.00016999999999999999, 'epoch': 4.33}
{'loss': 2.5093, 'grad_norm': 1.4921875, 'learning_rate': 0.00015999999999999999, 'epoch': 4.67}
{'loss': 2.2936, 'grad_norm': 2.0, 'learning_rate': 0.00015, 'epoch': 5.0}
{'loss': 2.3667, 'grad_norm': 2.1875, 'learning_rate': 0.00014, 'epoch': 5.33}
{'loss': 2.3081, 'grad_norm': 2.21875, 'learning_rate': 0.00013, 'epoch': 5.67}
{'loss': 2.2664, 'grad_norm': 3.59375, 'learning_rate': 0.00011999999999999999, 'epoch': 6.0}
{'loss': 2.2403, 'grad_norm': 6.5625, 'learning_rate': 0.00010999999999999998, 'epoch': 6.33}
{'loss': 2.269, 'grad_norm': 5.28125, 'learning_rate': 9.999999999999999e-05, 'epoch': 6.67}
{'loss': 2.2112, 'grad_norm': 1.15625, 'learning_rate': 8.999999999999999e-05, 'epoch': 7.0}
{'loss': 2.2353, 'grad_norm': 1.2265625, 'learning_rate': 7.999999999999999e-05, 'epoch': 7.33}
{'loss': 2.15, 'grad_norm': 1.3984375, 'learning_rate': 7e-05, 'epoch': 7.67}
{'loss': 2.2592, 'grad_norm': 1.359375, 'learning_rate': 5.9999999999999995e-05, 'epoch': 8.0}
{'loss': 2.185, 'grad_norm': 0.8359375, 'learning_rate': 4.9999999999999996e-05, 'epoch': 8.33}
{'loss': 2.1976, 'grad_norm': 0.75390625, 'learning_rate': 3.9999999999999996e-05, 'epoch': 8.67}
{'loss': 2.1421, 'grad_norm': 0.8203125, 'learning_rate': 2.9999999999999997e-05, 'epoch': 9.0}
{'loss': 2.2024, 'grad_norm': 0.81640625, 'learning_rate': 1.9999999999999998e-05, 'epoch': 9.33}
{'loss': 2.0441, 'grad_norm': 0.90625, 'learning_rate': 9.999999999999999e-06, 'epoch': 9.67}
{'loss': 2.1902, 'grad_norm': 0.70703125, 'learning_rate': 0.0, 'epoch': 10.0}
{'train_runtime': 574.9741, 'train_samples_per_second': 1.67, 'train_steps_per_second': 0.052, 'train_loss': 2.600708842277527, 'epoch': 10.0}
100%|...| 30/30 [09:34<00:00, 19.17s/it]
Loading checkpoint shards: 100%|...| 2/2 [00:01<00:00, 1.70it/s]
Loading checkpoint shards: 100%|...| 2/2 [00:01<00:00, 1.74it/s]

Inference with base model:

Quote: Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world.
- Albert Einstein
I am

Quote: Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world. Imagination is the only way to
@zorrofox try training longer, as your losses are still high. I don't remember the exact hyperparameters I tried, but I was able to get decent results. Thanks for confirming that the issue is resolved regarding saving and reloading the weights.
That's great news @shub-kris! Thank you for the quick fix and hard work! I will post an update when I'm done with my current trainings (because my workaround still works and I don't want to break my pipeline). Could you provide minimal pseudo-code with the correct pattern for unloading and merging the LoRA adapter into a standalone model?

trainer = SFTTrainer(...)
trainer.train()
merged_model = trainer.model.merge_and_unload()  # merge LoRA with base model
merged_model.to("cpu")
merged_model.save_pretrained("adapters_merged")

Is this OK? Or do I also need to make
@PawKanarek in the training script I would recommend just doing the training and saving the model.
Merging can be done in a separate script, to avoid any kind of TPU or FSDP wrapper issues; I follow what is mentioned here: https://huggingface.co/docs/trl/en/use_model#use-adapters-peft (see the sketch below).
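Roughly, that two-step pattern looks like this (a sketch based on the scripts earlier in this thread and the TRL docs; paths and the adapter directory name are placeholders):

# --- Script 1: train on TPU and save only the adapter ---
trainer = SFTTrainer(...)   # same configuration as above
trainer.train()
trainer.save_model()        # writes the LoRA adapter to args.output_dir

# --- Script 2: merge the adapter into the base model on CPU, with no FSDP involved ---
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base_model, "output")   # adapter directory from script 1
model = model.merge_and_unload()                          # fold the LoRA weights into the base model
model.save_pretrained("merged_model")                     # standalone model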
The fix, together with saving using the pattern above, works flawlessly. Thank you @shub-kris 👨‍💻
System Info
torch.__version__='2.3.0.dev20240307'
torch_xla.__version__='2.3.0+git46e2230'
peft.__version__='0.9.0'
trl.__version__='0.7.12.dev0'
Python 3.10.13
Who can help?
@ArthurZucker , @younesbelkada, @muellerzr, @pacman100
Information
Tasks
Reproduction
Hello, I have a problem with training the gemma-2b-it model on a Google TPU v3-8. My goal is to train it with a PEFT LoRA adapter and then save it as a standalone model.
For merging the base model with the LoRA adapter I was following the guide: https://huggingface.co/docs/trl/main/en/use_model
The training code is based on this blog post: https://huggingface.co/blog/gemma-peft
The problem is that the training takes a while (for 300k rows in a data loader it might take even 8 hours), but after training the model seems… untrained. The inference output looks almost identical to the output of the base model.
Furthermore, when I check the weights of the trained and original models, they appear to be identical.
I also consistently encounter the following error message while loading the saved model:
Below is the minimal working code that trains and saves the model.
And this is the output
I'm stuck, so I'm asking for help. I tried many combinations of PeftModel.merge_and_unload(), save_pretrained(), and trainer.save_model(), and nothing seems to work. Every idea to push this issue forward will be appreciated. Thanks.
Expected behavior
Training trains the model.