model.save_pretrained fails with error when using Pytorch XLA #29608
Comments
cc @muellerzr |
Hi @moficodes, I looked into your script. Right now you can use either of these options:
OR
|
@shub-kris |
@moficodes in order to save adapter-weights by using |
Awesome. After the save, I would like to then push the model back out to HF if possible. Does the rest of the code look ok for that? |
@shub-kris
Traceback (most recent call last):
File "//fsdp.py", line 128, in <module>
model.push_to_hub(new_model_id, check_pr=True)
File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2499, in push_to_hub
return super().push_to_hub(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/transformers/utils/hub.py", line 878, in push_to_hub
model_card = create_and_tag_model_card(
File "/usr/local/lib/python3.10/site-packages/transformers/utils/hub.py", line 1133, in create_and_tag_model_card
model_card = ModelCard.load(repo_id, token=token, ignore_metadata_errors=ignore_metadata_errors)
File "/usr/local/lib/python3.10/site-packages/huggingface_hub/repocard.py", line 187, in load
with card_path.open(mode="r", newline="", encoding="utf-8") as f:
File "/usr/local/lib/python3.10/pathlib.py", line 1119, in open
return self._accessor.open(self, mode, buffering, encoding, errors,
IsADirectoryError: [Errno 21] Is a directory: 'moficodes/gemma-7b-sql-context'
Does anything jump out to you? |
had to name the new_model_id as This worked. |
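One plausible reading of the traceback above: the failing call is `card_path.open(...)` on the path 'moficodes/gemma-7b-sql-context', so the repo ID resolved to a local directory of the same name, presumably the one created earlier by `model.save_pretrained(new_model_id)`. The sketch below reproduces that kind of collision with the standard library only. The helper names and the repo ID `some-user/some-model` are hypothetical and are not part of transformers or huggingface_hub.

```python
import tempfile
from pathlib import Path


def is_shadowed_by_local_dir(repo_id: str, base_dir: Path) -> bool:
    """Return True if a local directory matching repo_id exists under base_dir."""
    return (base_dir / repo_id).is_dir()


def demo_directory_collision() -> tuple[bool, str]:
    """Reproduce the collision in a temporary directory (stdlib only)."""
    repo_id = "some-user/some-model"  # hypothetical Hub repo ID
    with tempfile.TemporaryDirectory() as tmp:
        base_dir = Path(tmp)
        # Saving a model to a path equal to the repo ID creates this directory.
        (base_dir / repo_id).mkdir(parents=True)
        shadowed = is_shadowed_by_local_dir(repo_id, base_dir)
        try:
            # Opening a directory as a text file fails, as in the traceback.
            with (base_dir / repo_id).open(mode="r", encoding="utf-8"):
                error_name = "no error"
        except OSError as exc:
            error_name = type(exc).__name__
    return shadowed, error_name
```

If this reading is right, saving to a local directory whose path differs from the Hub repo ID should avoid the clash.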
@moficodes Nice ;) |
Can someone reopen this issue? I don't think it is really fixed. See #29659 |
@moficodes did you encounter this issue: #29659? |
Hmm. I did not compare my final fine-tuned model to confirm. I am now reading through the other issue and it might not have worked exactly how we wanted it to. |
@moficodes did you run any inference or something to verify if outputs changed? |
System Info
Transformers == 4.38.2
Platform == TPU V4 on GKE
Python == 3.10
Who can help?
@ArthurZucker
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
I ran some tests on a GKE Cluster with TPU V4 with 4 nodes.
https://gist.github.com/moficodes/1492228c80a3c08747a973b519cc7cda
This run fails with:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 13, in storage_ptr
return tensor.untyped_storage().data_ptr()
RuntimeError: Attempted to access the data pointer on an invalid python storage.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "//fsdp.py", line 112, in <module>
model.save_pretrained(new_model_id)
File "/usr/local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2448, in save_pretrained
safe_save_file(shard, os.path.join(save_directory, shard_file), metadata={"format": "pt"})
File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 281, in save_file
serialize_file(_flatten(tensors), filename, metadata=metadata)
File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 470, in _flatten
shared_pointers = _find_shared_tensors(tensors)
File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 72, in _find_shared_tensors
if v.device != torch.device("meta") and storage_ptr(v) != 0 and storage_size(v) != 0:
File "/usr/local/lib/python3.10/site-packages/safetensors/torch.py", line 17, in storage_ptr
return tensor.storage().data_ptr()
File "/usr/local/lib/python3.10/site-packages/torch/storage.py", line 956, in data_ptr
return self._data_ptr()
File "/usr/local/lib/python3.10/site-packages/torch/storage.py", line 960, in _data_ptr
return self._untyped_storage.data_ptr()
RuntimeError: Attempted to access the data pointer on an invalid python storage.
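The traceback shows safetensors failing while reading `data_ptr()` from a tensor's storage, which tensors on an XLA device do not expose the way CPU tensors do. A minimal sketch of one common workaround is to gather the weights onto the CPU before serializing. This is an assumption about a workable approach, not necessarily the fix suggested in this thread, and the helper name `state_dict_on_cpu` is hypothetical. A plain `torch.nn.Linear` stands in for the real model so the snippet runs without a TPU.

```python
import torch


def state_dict_on_cpu(model: torch.nn.Module) -> dict[str, torch.Tensor]:
    """Return the model's state dict with every tensor moved to the CPU."""
    return {
        name: tensor.detach().cpu()
        for name, tensor in model.state_dict().items()
    }


# Stand-in module; on a TPU the model would live on an XLA device instead.
toy_model = torch.nn.Linear(4, 2)
cpu_weights = state_dict_on_cpu(toy_model)
```

The resulting dictionary could then be passed to `save_pretrained` through its `state_dict` argument. Models wrapped for FSDP may need additional handling to consolidate shards first, which this sketch does not cover.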
Expected behavior
The model is saved and pushed to the Hugging Face Hub.