
model.save_pretrained fails with error when using Pytorch XLA #29608

Closed
moficodes opened this issue Mar 12, 2024 · 13 comments

Comments

moficodes commented Mar 12, 2024

System Info

Transformers == 4.38.2
Platform == TPU V4 on GKE
Python == 3.10

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I ran some tests on a GKE cluster with TPU v4 across 4 nodes.

https://gist.github.com/moficodes/1492228c80a3c08747a973b519cc7cda

This run fails with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 13, in storage_ptr
    return tensor.untyped_storage().data_ptr()
RuntimeError: Attempted to access the data pointer on an invalid python storage.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "//fsdp.py", line 112, in <module>
    model.save_pretrained(new_model_id)
  File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2448, in save_pretrained
    safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
  File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 281, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 470, in _flatten
    shared_pointers = _find_shared_tensors(tensors)
  File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 72, in _find_shared_tensors
    if v.device != torch.device("meta") and storage_ptr(v) != 0 and storage_size(v) != 0:
  File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 17, in storage_ptr
    return tensor.storage().data_ptr()
  File "/usr/local/lib/python3.10/site-packages/torch/storage.py", line 956, in data_ptr
    return self._data_ptr()
  File "/usr/local/lib/python3.10/site-packages/torch/storage.py", line 960, in _data_ptr
    return self._untyped_storage.data_ptr()
RuntimeError: Attempted to access the data pointer on an invalid python storage.

Expected behavior

Save the model and push it to Hugging Face.

@amyeroberts (Collaborator)

cc @muellerzr

shub-kris (Contributor) commented Mar 13, 2024

Hi @moficodes, I looked into your script. Right now model.save_pretrained doesn't support saving models that are still on a TPU, so you need to move them to CPU first.

You can use either of these options:

# This saves adapter weights
trainer.save_model(new_model_id)

OR

# This doesn't save adapter weights 
model.to('cpu')
model.save_pretrained(new_model_id)

@moficodes (Author)

@shub-kris does model.to('cpu') or trainer.save_model(new_model_id) require your PR to be merged?

shub-kris (Contributor) commented Mar 13, 2024

@moficodes saving the adapter weights with trainer.save_model(new_model_id) requires my PR to be merged. But since your Dockerfile builds on top of my branch, it should work.

@moficodes (Author)

Awesome. After saving, I would then like to push the model back to HF if possible.

Does the rest of the code look ok for that?

@moficodes (Author)

@shub-kris trainer.save_model(new_model_id) worked, but it failed to push to HF.

Traceback (most recent call last):
  File "//fsdp.py", line 128, in <module>
    model.push_to_hub(new_model_id, check_pr=True)
  File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2499, in push_to_hub
    return super().push_to_hub(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/utils/hub.py", line 878, in push_to_hub
    model_card = create_and_tag_model_card(
  File "/usr/local/lib/python3.10/site-packages/transformers/utils/hub.py", line 1133, in create_and_tag_model_card
    model_card = ModelCard.load(repo_id, token=token, ignore_metadata_errors=ignore_metadata_errors)
  File "/usr/local/lib/python3.10/site-packages/huggingface_hub/repocard.py", line 187, in load
    with card_path.open(mode="r", newline="", encoding="utf-8") as f:
  File "/usr/local/lib/python3.10/pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
IsADirectoryError: [Errno 21] Is a directory: 'moficodes/gemma-7b-sql-context'

Does anything jump out at you?

@moficodes (Author)

I had to name the new_model_id gemma-7b-sql-context instead of moficodes/gemma-7b-sql-context.

This worked.
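A small sketch of why the prefixed name failed (paths here are illustrative): judging from the traceback, ModelCard.load resolves its argument as a local path when a matching file or directory exists, and save_pretrained("moficodes/gemma-7b-sql-context") had already created exactly such a directory, so the loader tried to open a directory as a file.

```python
from pathlib import Path

def load_card_path(repo_id_or_path: str) -> Path:
    """Roughly mimic ModelCard.load's resolution (an assumption based on
    the traceback): an existing local path wins over a Hub repo id."""
    p = Path(repo_id_or_path)
    if p.exists():
        return p  # a local file -- or, problematically, a directory
    raise FileNotFoundError(repo_id_or_path)  # would fall back to the Hub

# save_pretrained("moficodes/gemma-7b-sql-context") created this directory:
save_dir = Path("moficodes/gemma-7b-sql-context")
save_dir.mkdir(parents=True, exist_ok=True)

resolved = load_card_path("moficodes/gemma-7b-sql-context")
try:
    resolved.open()  # opening a directory as a file
except IsADirectoryError as err:
    print(type(err).__name__)  # the error from the traceback above
```

With the unprefixed name, the collision likely disappears because push_to_hub expands the id to the full "user/name" form, which no longer matches any local path.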

@shub-kris (Contributor)

@moficodes Nice ;)

@zorrofox

Can someone reopen this issue? I don't think it is really fixed. #29659

amyeroberts reopened this Mar 18, 2024
@shub-kris (Contributor)

@moficodes did you encounter this issue: #29659

@moficodes (Author)

Hmm. I did not compare my final fine-tuned model to confirm. I am now reading through the other issue, and it might not have worked exactly how we wanted it to.

@shub-kris (Contributor)

@moficodes did you run any inference to verify whether the outputs changed?

@amyeroberts (Collaborator)

#29659 (comment)
