
How to pretrain on a single machine (not using SLURM) #14

geonoon opened this issue Jun 13, 2024 · 1 comment

geonoon commented Jun 13, 2024

Thank you for this amazing project.

I tried to perform pretraining on a single machine, with an NVIDIA A100 GPU or just with a CPU, but could not get it to work.

It seems the script file main_pretrain.py needs to be modified somehow.

Could you offer detailed help on this matter?

Thanks in advance.

DotWang (Collaborator) commented Jun 18, 2024

@geonoon In fact, we have considered two cases for distributed pretraining: SLURM and a regular server. However, I'm not sure whether MTP's main_pretrain.py can be run on a server as-is; maybe you can refer to this to revise the code related to distributed pretraining.
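For reference, here is a minimal sketch of what the two initialization cases (SLURM vs. plain server) typically look like in PyTorch. It is an illustration only, not the actual MTP code; the function name init_distributed and the single-process fallback are assumptions:

import os
import torch
import torch.distributed as dist

def init_distributed():
    """Set up torch.distributed for either a SLURM launch or a plain server."""
    if "SLURM_PROCID" in os.environ:
        # Case 1: launched through SLURM (srun); ranks come from SLURM variables.
        # MASTER_ADDR / MASTER_PORT still need to be exported, e.g. in the sbatch script.
        rank = int(os.environ["SLURM_PROCID"])
        world_size = int(os.environ["SLURM_NTASKS"])
        local_rank = rank % max(torch.cuda.device_count(), 1)
    elif "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # Case 2: launched with torch.distributed.launch / torchrun on a plain server;
        # the launcher exports RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and
        # (on recent PyTorch versions) LOCAL_RANK.
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
    else:
        # Fallback: single process, e.g. debugging on one GPU or on CPU only.
        return 0, 1, 0

    backend = "nccl" if torch.cuda.is_available() else "gloo"
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    dist.init_process_group(backend=backend, init_method="env://",
                            world_size=world_size, rank=rank)
    dist.barrier()
    return rank, world_size, local_rank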

Here is a command example:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 \
    --nnodes=1 --master_port=10001 --master_addr=[server ip] main_pretrain.py
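On a single machine, 127.0.0.1 normally works as the master address. Note that torch.distributed.launch is deprecated in recent PyTorch releases; a roughly equivalent torchrun invocation (assuming a recent PyTorch and a single node) would be:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 \
    --nnodes=1 --master_port=10001 --master_addr=127.0.0.1 main_pretrain.py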
