I found that I couldn't train on more than one Trainium instance with Optimum Neuron. However, if I comment out the code related to the Neuron cache, it seems to work. I commented out https://github.com/huggingface/optimum-neuron/blob/ee0c1f4104ee817daf84107776d9a2d7b92499dd/optimum/neuron/trainers.py#L132-L147 and set the path to the cache_dir from the get method, then commented out https://github.com/huggingface/optimum-neuron/blob/ee0c1f4104ee817daf84107776d9a2d7b92499dd/optimum/neuron/trainers.py#L200C9-L208C36, and after that training worked on multiple nodes.
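To illustrate the workaround without editing the trainer by hand, here is a minimal sketch of a guard that detects a multi-node launch and could be used to skip cache setup in that case. It assumes the job is started with torchrun, which exports `WORLD_SIZE` and `LOCAL_WORLD_SIZE` to each worker. The names `is_multi_node` and `should_set_up_neuron_cache` are hypothetical and not part of optimum-neuron, and this is not the library's fix, only a way to express "disable the cache logic on more than one node".

```python
import os
from typing import Mapping, Optional


def is_multi_node(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the job appears to span more than one node.

    Reads WORLD_SIZE and LOCAL_WORLD_SIZE, which torchrun exports to
    every worker. Missing variables default to 1, so a process started
    without torchrun is treated as single-node only if WORLD_SIZE is
    also unset.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    local_world_size = int(env.get("LOCAL_WORLD_SIZE", "1"))
    # More workers in total than on this node means other nodes exist.
    return world_size > local_world_size


def should_set_up_neuron_cache(env: Optional[Mapping[str, str]] = None) -> bool:
    # Hypothetical guard: mirror the manual workaround from this report
    # by skipping the cache-related code on multi-node runs.
    return not is_multi_node(env)


if __name__ == "__main__":
    print("multi-node:", is_multi_node())
    print("set up cache:", should_set_up_neuron_cache())
```

A trainer subclass or a local patch could call such a guard before running the cache-related blocks linked above. Whether skipping them is safe in general is untested here.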
Who can help?
No response
Information
The official example scripts
My own modified scripts
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction (minimal, reproducible, runnable)
Try to run mlm.py from the examples on more than one Trainium node; training fails.
Expected behavior
MLM training should run on multiple nodes.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!