
Finetune of deepseek-coder fails #262

Closed
ryancu7 opened this issue Jan 4, 2024 · 7 comments · Fixed by #279
ryancu7 commented Jan 4, 2024

When I try to train deepseek-coder/5.7b/mqa-base, I get: DataLoader worker (pid(s) 10222) exited unexpectedly
I have tried this several times, always with the same result: filtering eventually completes, but finetuning fails before the first iteration. Finetune settings are the defaults. No models were being served at the time.

Another person on the Discord channel is experiencing the same problem.
I have previously finetuned Refact/1.6B successfully with essentially the same source files.
I am using the current Docker image (the 'latest' tag) on Ubuntu 22.04 with an NVIDIA GeForce RTX 3090 with 24 GB VRAM.

Log files: refact_logs.zip
(I redacted 3 filenames in the attached logs and deleted some repetitive lines.)
The log also contains several errors like this a few seconds before the bus error; I am not sure whether it is relevant:
Token indices sequence length is longer than the specified maximum sequence length for this model (28538 > 16384). Running this sequence through the model will result in indexing errors

@olegklimov
Contributor

Thanks for reporting! Sergey @JegernOUTT can you please look if we can fix this quickly?

@JegernOUTT
Member

I see this error in the logs

Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit

Probably we have to increase the shm limit in our Docker container.
@olegklimov
I'll try to reproduce it, though, to check possible fixes.

@matthusby

FWIW: I ran into the DataLoader worker (pid(s) xxx) exited unexpectedly error last night too. This was my first time trying to finetune on a larger number of files (about 8k files, a mix of Ruby and TypeScript). I was able to resolve it by adding --shm-size=16384m to my docker run command. I did not do any testing to see what value would resolve the DataLoader issue, so 16G might be far more than is needed in my case.
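For context, the workaround described above is Docker's --shm-size flag on the run command. A minimal sketch, assuming the image tag, port mapping, and volume name from the standard self-hosting instructions (adjust these to your setup):

```shell
# Raise the shared-memory limit available to the dataloader workers.
# 16384m is the value that worked for this commenter; a smaller value
# may well be sufficient.
docker run -d --rm --gpus all \
  --shm-size=16384m \
  -p 8008:8008 \
  -v perm-storage:/perm_storage \
  smallcloud/refact_self_hosting:latest
```

PyTorch dataloader workers pass tensors between processes through /dev/shm, so a too-small shm mount surfaces as exactly this kind of bus error.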

@JegernOUTT
Member

@matthusby yeah, 16384m is probably more than necessary.
We're figuring out the smallest value that works and will then add it to the instructions.
Thank you for testing!

@JegernOUTT JegernOUTT linked a pull request Jan 22, 2024 that will close this issue
@JegernOUTT
Member

@ryancu7 @matthusby I've tried different shm sizes; 256m looks like it was enough for me.
We are going to add that value to the docker run instructions.
If you have time, you can check whether it's enough on your systems.
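One way to confirm which limit actually took effect is to check the shm mount inside the running container. A sketch (the container name refact is a placeholder; /dev/shm is Docker's default shared-memory tmpfs, which --shm-size resizes):

```shell
# Report the size of the shared-memory mount inside the container.
# "refact" is a hypothetical container name; substitute your own
# (see `docker ps`). With --shm-size=256m, the Size column of the
# /dev/shm line should read 256M.
docker exec refact df -h /dev/shm
```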

@hazratisulton

Checked smallcloud/refact_self_hosting:nightly
sha256:caf0d0b8cbe153b9e6e5ceef5b974b222c44c56c1103f936ca1fc081ccd753f0 - OK.

@matthusby

Awesome! I am in the middle of some tuning now, but I will check with 256m when finished. I will update this thread if I run into any problems.
