More than 1 Trainium Instance #485

mathephysicist · 2024-02-15T16:45:28Z

System Info

I couldn't train on more than one Trainium instance with Optimum Neuron. However, if I comment out the code related to the Neuron cache, it seems to work.

I commented out

https://github.com/huggingface/optimum-neuron/blob/ee0c1f4104ee817daf84107776d9a2d7b92499dd/optimum/neuron/trainers.py#L132-L147

set the path to the cache_dir from the get method, and then commented out

https://github.com/huggingface/optimum-neuron/blob/ee0c1f4104ee817daf84107776d9a2d7b92499dd/optimum/neuron/trainers.py#L200C9-L208C36

With those changes, training worked on multiple nodes.

Who can help?

No response

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

Run mlm.py from the examples on more than one Trainium node. The run fails.
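The issue does not include the exact launch command, so the following is only a sketch of how a multi-node run might be started with `torchrun`. The helper name `build_torchrun_command`, the address `10.0.0.1`, the port, and the choice of 32 processes per node are all assumptions for illustration and are not taken from the report. Only the script name `mlm.py` and the "more than one node" condition come from the issue.

```python
import shlex
from typing import List, Optional, Sequence


def build_torchrun_command(
    script: str,
    nnodes: int,
    node_rank: int,
    nproc_per_node: int,
    master_addr: str,
    master_port: int = 29500,
    script_args: Optional[Sequence[str]] = None,
) -> List[str]:
    """Return the torchrun argv for one node of a multi-node job (hypothetical helper)."""
    if nnodes < 2:
        raise ValueError("the issue only concerns runs with more than one node")
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    return [
        "torchrun",
        f"--nnodes={nnodes}",                   # total number of instances
        f"--node_rank={node_rank}",             # index of this instance
        f"--nproc_per_node={nproc_per_node}",   # worker processes on this instance
        f"--master_addr={master_addr}",         # rendezvous host (assumed value)
        f"--master_port={master_port}",         # rendezvous port (assumed value)
        script,
        *(script_args or []),
    ]


if __name__ == "__main__":
    # Assumed setup: two nodes, 32 workers each, node 0 reachable at 10.0.0.1.
    for rank in range(2):
        cmd = build_torchrun_command("mlm.py", 2, rank, 32, "10.0.0.1")
        print(shlex.join(cmd))
```

Each printed line would be run on the corresponding instance. Any extra environment setup a Trainium cluster needs is outside what this sketch covers.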

Expected behavior

MLM training should run on more than one Trainium node.

michaelbenayoun · 2024-02-16T10:31:44Z

Hi, it is under development in #440 and should be fixed soon.

mathephysicist · 2024-02-21T21:52:24Z

Thanks for the update, @michaelbenayoun. Is there an ETA for this feature, or is there any way I can help it ship faster?

philschmid · 2024-03-27T08:26:41Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!

github-actions · 2024-10-23T08:05:11Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions · 2024-10-28T08:05:22Z

This issue was closed because it has been stalled for 5 days with no activity.

mathephysicist added the bug label Feb 15, 2024

github-actions bot added the Stale label Oct 23, 2024

github-actions bot closed this as not planned Oct 28, 2024
