Fine-Tuning Crashes for no reason when Eight GPU cards are used. #816
Thanks for your interest in and recognition of LMFlow! Some of our collaborators have met a similar issue. We were using CUDA 12.0 and PyTorch built for CUDA 12.1, and similar problems occurred. It was resolved by using a PyTorch build corresponding to an older CUDA (like 11.8). We suspect this problem is caused by a mismatch between the latest PyTorch version and the CUDA version. You may try adjusting the PyTorch version to see if the problem occurs again. Hope this information is helpful 😄
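As a rough illustration of the version check described above, here is a minimal sketch of comparing the CUDA version a PyTorch wheel was built against with the system CUDA toolkit. The helper names and example version strings are my own for illustration; they are not part of LMFlow or PyTorch.

```python
# Sketch only: parse the CUDA build tag out of a PyTorch version string
# (e.g. "2.1.0+cu121") and compare it with the system CUDA version.
from typing import Optional


def torch_cuda_version(torch_version: str) -> Optional[str]:
    """Extract the CUDA build tag, e.g. '2.1.0+cu121' -> '12.1'."""
    if "+cu" not in torch_version:
        return None  # CPU-only build
    tag = torch_version.split("+cu", 1)[1]  # e.g. '121'
    return f"{tag[:-1]}.{tag[-1]}"          # '121' -> '12.1'


def versions_match(torch_version: str, system_cuda: str) -> bool:
    """Require the CUDA major versions to agree (a common rule of thumb)."""
    built = torch_cuda_version(torch_version)
    if built is None:
        return False
    return built.split(".")[0] == system_cuda.split(".")[0]


if __name__ == "__main__":
    # In a real environment you would use torch.__version__ and the CUDA
    # version reported by `nvcc --version` instead of these literals.
    print(torch_cuda_version("2.1.0+cu121"))      # 12.1
    print(versions_match("2.0.1+cu118", "11.8"))  # True
    print(versions_match("2.1.0+cu121", "11.8"))  # False
```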
Thanks for your reply! Another issue is with run_all_benchmark.sh; specifically, when I run the script, it just gives an error saying:
Looking into the code, I suspect that some parts have not yet been fully implemented, which then leads to the error? I run the code by Thanks again for your help!
@2003pro I am wondering if you can take a look at this?
I suggest switching
Also, if there are any further issues, you may check whether your transformers version is compatible. My environment's version is 4.33.3.
Dear Developers at LMFlow:
I have been using LMFlow for a long time and the experience has been great!
But recently, after cloning the latest LMFlow and using it to fine-tune my model, I encountered an unexpected issue.
Specifically, when I use all 8 of my NVIDIA A100 cards, the fine-tuning program crashes without indicating any error. However, when I use only 6 or 7 cards, things go well.
Below is the output of the program:
I am pretty sure I use the correct way to specify the GPU cards to use by setting DeepSpeed Arguments:
deepspeed_args="--master_port=11012 --include localhost:0,1,2,3,4,5,6,7"
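Since runs with 6 or 7 cards succeed but 8-card runs crash, one way to narrow the problem down is to try leave-one-out GPU combinations and see whether a particular card (or full-occupancy itself) triggers the crash. Below is a hypothetical helper of my own that generates the corresponding `--include` strings; it is not part of LMFlow or DeepSpeed.

```python
# Sketch only: generate DeepSpeed --include arguments that leave out one
# GPU at a time, to bisect which card triggers the crash.
from typing import List


def leave_one_out_includes(num_gpus: int, host: str = "localhost") -> List[str]:
    """Return one '--include host:...' string per excluded GPU index."""
    args = []
    for excluded in range(num_gpus):
        kept = ",".join(str(g) for g in range(num_gpus) if g != excluded)
        args.append(f"--include {host}:{kept}")
    return args


if __name__ == "__main__":
    for arg in leave_one_out_includes(8):
        print(arg)  # e.g. --include localhost:1,2,3,4,5,6,7
```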
As I never encountered this problem with the older version, after several failed experiments I have come here to seek help.
I am not sure if it is the problem on the LMFlow side / on my side.
Thanks for your help ~