Finetune of deepseek-coder fails #262
Comments
Thanks for reporting! Sergey @JegernOUTT can you please look if we can fix this quickly?
I see this error in the logs
We probably have to increase the shm (shared memory) limit in our Docker container.
FWIW: I ran into the
@matthusby yeah, 16384m might be too much.
@ryancu7 @matthusby I've tried different shm sizes; it looks like 256m was enough for me.
Checked with smallcloud/refact_self_hosting:nightly.
Awesome! I am in the middle of some tuning now, but I will check with 256m when finished. I will update this thread if I run into any problems.
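To make the shm sizes discussed above easier to compare, here is a minimal Python sketch that could be run inside the container to see how much shared memory is actually mounted. It assumes the container exposes shared memory at `/dev/shm` (the usual location on Linux) and that the size was set with Docker's `--shm-size` option. The helper names `parse_size` and `shm_report` are made up for this example and are not part of Refact.

```python
import shutil

# Binary multipliers for the size suffixes used in this thread (e.g. "256m").
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text):
    """Convert a size string such as '256m' or '16384m' to bytes (hypothetical helper)."""
    text = text.strip().lower()
    if text and text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)  # no suffix: treat the value as plain bytes


def shm_report(path="/dev/shm", required="256m"):
    """Report total/free bytes at `path` and whether the total meets `required`."""
    usage = shutil.disk_usage(path)
    return {
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "enough": usage.total >= parse_size(required),
    }


if __name__ == "__main__":
    # Assumes a Linux container where /dev/shm exists.
    print(shm_report())
```

If `enough` comes back `False`, the container was probably started with a smaller shared memory size than the 256m that worked in the comment above. Whether 256m is sufficient for other models or dataset sizes is not something this thread establishes.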
When I try to train deepseek-coder/5.7b/mqa-base, I get:
DataLoader worker (pid(s) 10222) exited unexpectedly
I have tried this several times, always with the same result: filtering eventually completes, but then finetuning fails before the first iteration. Finetune settings are at their defaults. No models were being served at the time.
Another person on the Discord channel experiences the same problem.
I have previously successfully tuned Refact/1.6B with basically the same source files.
I am using the current Docker image (with the 'latest' tag) on Ubuntu 22.04 with an NVIDIA GeForce RTX 3090 with 24 GB VRAM.
Log files: refact_logs.zip
(I redacted 3 filenames in the attached logs and deleted some repetitive lines.)
The log also contains some errors like this a few seconds before the bus error; I am not sure whether they are relevant:
Token indices sequence length is longer than the specified maximum sequence length for this model (28538 > 16384). Running this sequence through the model will result in indexing errors
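For context on the warning above: it is raised when a tokenized file is longer than the model's maximum sequence length (here 28538 tokens against a limit of 16384). The sketch below shows one generic way to keep every sequence within the limit, by splitting the token IDs into consecutive windows. This is only an illustration. It is not taken from Refact's finetuning code, the function name `split_into_windows` is invented here, and the thread does not establish that this warning is related to the DataLoader crash.

```python
def split_into_windows(token_ids, max_len=16384):
    """Split a list of token IDs into consecutive chunks of at most `max_len` tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


if __name__ == "__main__":
    # Stand-in for a tokenized file of the length reported in the warning.
    ids = list(range(28538))
    windows = split_into_windows(ids, max_len=16384)
    print([len(w) for w in windows])
```

With the lengths from the warning, the file would be split into two windows, one of 16384 tokens and one holding the remaining 12154.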