
Finetune of deepseek-coder fails #262

Closed
ryancu7 opened this issue Jan 4, 2024 · 7 comments · Fixed by #279
ryancu7 commented Jan 4, 2024

When I try to train deepseek-coder/5.7b/mqa-base, I get: DataLoader worker (pid(s) 10222) exited unexpectedly
I have tried this several times, always with the same result: filtering eventually completes, but finetuning fails before the first iteration. Finetune settings are the defaults. No models were being served at the time.

Another person on the Discord channel is experiencing the same problem.
I have previously finetuned Refact/1.6B successfully with essentially the same source files.
I am using the current Docker image (the 'latest' tag) on Ubuntu 22.04 with an NVIDIA GeForce RTX 3090 with 24 GB VRAM.

Log files: refact_logs.zip
(I redacted 3 filenames in the attached logs and deleted some repetitive lines.)
The log also contains several errors like this a few seconds before the bus error; I am not sure whether it is relevant:
Token indices sequence length is longer than the specified maximum sequence length for this model (28538 > 16384). Running this sequence through the model will result in indexing errors

@olegklimov
Contributor

Thanks for reporting! Sergey @JegernOUTT can you please look if we can fix this quickly?

@JegernOUTT
Member

I see this error in the logs

Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit

Probably we have to increase the shm limit in our Docker container.
@olegklimov
I'll try to reproduce it, though, to check possible fixes.

@matthusby

FWIW: I ran into the DataLoader worker (pid(s) xxx) exited unexpectedly error last night too. This was my first time trying to finetune on a larger number of files (about 8k files, a mix of Ruby and TypeScript). I was able to resolve it by adding --shm-size=16384m to my docker run command. I did not do any testing to see what value would resolve the DataLoader issue, so 16G might be far more than is needed in my case.
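For context, the workaround described above is Docker's --shm-size flag on the run command. A minimal sketch, assuming the image tag, port mapping, and volume name from the standard self-hosting instructions (adjust these to your setup):

```shell
# Raise the shared-memory limit available to the dataloader workers.
# 16384m is the value that worked for this commenter; a smaller value
# may well be sufficient.
docker run -d --rm --gpus all \
  --shm-size=16384m \
  -p 8008:8008 \
  -v perm-storage:/perm_storage \
  smallcloud/refact_self_hosting:latest
```

PyTorch dataloader workers pass tensors between processes through /dev/shm, so a too-small shm mount surfaces as exactly this kind of bus error.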

@JegernOUTT
Member

@matthusby yeah, 16384m is probably more than necessary.
We're figuring out the smallest value that works and will then add it to the instructions.
Thank you for testing!

@JegernOUTT JegernOUTT linked a pull request Jan 22, 2024 that will close this issue
@JegernOUTT
Member

@ryancu7 @matthusby I've tried different shm sizes; 256m looks like it was enough for me.
We are going to add that value to the docker run instructions.
If you have time, you can check whether it's enough on your systems.
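One way to confirm which limit actually took effect is to check the shm mount inside the running container. A sketch (the container name refact is a placeholder; /dev/shm is Docker's default shared-memory tmpfs, which --shm-size resizes):

```shell
# Report the size of the shared-memory mount inside the container.
# "refact" is a hypothetical container name; substitute your own
# (see `docker ps`). With --shm-size=256m, the Size column of the
# /dev/shm line should read 256M.
docker exec refact df -h /dev/shm
```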

@hazratisulton

Checked smallcloud/refact_self_hosting:nightly
sha256:caf0d0b8cbe153b9e6e5ceef5b974b222c44c56c1103f936ca1fc081ccd753f0 - OK.

@matthusby

Awesome! I am in the middle of some tuning now, but I will check with 256m when finished. I will update this thread if I run into any problems.
