
model.save_pretrained fails with error when using Pytorch XLA #29608

Closed
moficodes opened this issue Mar 12, 2024 · 13 comments

Comments

moficodes commented Mar 12, 2024

System Info

Transformers == 4.38.2
Platform == TPU V4 on GKE
Python == 3.10

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I ran some tests on a GKE cluster with TPU v4 across 4 nodes.

https://gist.github.com/moficodes/1492228c80a3c08747a973b519cc7cda

This run fails with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 13, in storage_ptr
    return tensor.untyped_storage().data_ptr()
RuntimeError: Attempted to access the data pointer on an invalid python storage.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "//fsdp.py", line 112, in <module>
    model.save_pretrained(new_model_id)
  File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2448, in save_pretrained
    safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
  File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 281, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 470, in _flatten
    shared_pointers = _find_shared_tensors(tensors)
  File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 72, in _find_shared_tensors
    if v.device != torch.device("meta") and storage_ptr(v) != 0 and storage_size(v) != 0:
  File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 17, in storage_ptr
    return tensor.storage().data_ptr()
  File "/usr/local/lib/python3.10/site-packages/torch/storage.py", line 956, in data_ptr
    return self._data_ptr()
  File "/usr/local/lib/python3.10/site-packages/torch/storage.py", line 960, in _data_ptr
    return self._untyped_storage.data_ptr()
RuntimeError: Attempted to access the data pointer on an invalid python storage.

Expected behavior

Save the model and push it to Hugging Face.

@amyeroberts (Collaborator)

cc @muellerzr

shub-kris (Contributor) commented Mar 13, 2024

Hi @moficodes, I looked into your script. Right now model.save_pretrained doesn't support saving models that are still on a TPU, so you need to move them to CPU first.

You can use either of these options:

# This saves adapter weights
trainer.save_model(new_model_id)

OR

# This doesn't save adapter weights 
model.to('cpu')
model.save_pretrained(new_model_id)

@moficodes (Author)

@shub-kris does model.to('cpu') or trainer.save_model(new_model_id) require your PR to be merged?

shub-kris (Contributor) commented Mar 13, 2024

@moficodes saving the adapter weights with trainer.save_model(new_model_id) requires my PR to be merged. But since your Dockerfile builds on top of my branch, it should work.

@moficodes (Author)

Awesome. After saving, I would then like to push the model back to HF if possible.

Does the rest of the code look ok for that?

@moficodes (Author)

@shub-kris trainer.save_model(new_model_id) worked, but it failed to push to HF.

Traceback (most recent call last):
  File "//fsdp.py", line 128, in <module>
    model.push_to_hub(new_model_id, check_pr=True)
  File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2499, in push_to_hub
    return super().push_to_hub(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/utils/hub.py", line 878, in push_to_hub
    model_card = create_and_tag_model_card(
  File "/usr/local/lib/python3.10/site-packages/transformers/utils/hub.py", line 1133, in create_and_tag_model_card
    model_card = ModelCard.load(repo_id, token=token, ignore_metadata_errors=ignore_metadata_errors)
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/repocard.py", line 187, in load
    with card_path.open(mode="r", newline="", encoding="utf-8") as f:
  File "/usr/local/lib/python3.10/pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
IsADirectoryError: [Errno 21] Is a directory: 'moficodes/gemma-7b-sql-context'

Does anything jump out at you?

@moficodes (Author)

I had to name the new_model_id gemma-7b-sql-context instead of moficodes/gemma-7b-sql-context.

This worked.
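A small sketch of why the prefixed name failed (paths here are illustrative): judging from the traceback, ModelCard.load resolves its argument as a local path when a matching file or directory exists, and save_pretrained("moficodes/gemma-7b-sql-context") had already created exactly such a directory, so the loader tried to open a directory as a file.

```python
from pathlib import Path

def load_card_path(repo_id_or_path: str) -> Path:
    """Roughly mimic ModelCard.load's resolution (an assumption based on
    the traceback): an existing local path wins over a Hub repo id."""
    p = Path(repo_id_or_path)
    if p.exists():
        return p  # a local file -- or, problematically, a directory
    raise FileNotFoundError(repo_id_or_path)  # would fall back to the Hub

# save_pretrained("moficodes/gemma-7b-sql-context") created this directory:
save_dir = Path("moficodes/gemma-7b-sql-context")
save_dir.mkdir(parents=True, exist_ok=True)

resolved = load_card_path("moficodes/gemma-7b-sql-context")
try:
    resolved.open()  # opening a directory as a file
except IsADirectoryError as err:
    print(type(err).__name__)  # the error from the traceback above
```

With the unprefixed name, the collision likely disappears because push_to_hub expands the id to the full "user/name" form, which no longer matches any local path.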

@shub-kris (Contributor)

@moficodes Nice ;)

@zorrofox

Can someone reopen this issue? I don't think it is really fixed. #29659

amyeroberts reopened this Mar 18, 2024
@shub-kris (Contributor)

@moficodes did you encounter this issue: #29659

@moficodes (Author)

Hmm. I did not compare my final fine-tuned model to confirm. I am now reading through the other issue, and it might not have worked exactly how we wanted it to.

@shub-kris (Contributor)

@moficodes did you run any inference to verify whether the outputs changed?

@amyeroberts (Collaborator)

#29659 (comment)
