
How to pretrain on a single machine (not using SLURM) #14

geonoon opened this issue Jun 13, 2024 · 1 comment

geonoon commented Jun 13, 2024

Thank you for this amazing project.

I tried to perform pretraining on a single machine, with an NVIDIA A100 GPU or just with a CPU, but could not get it to work.

It seems the script file main_pretrain.py needs to be modified somehow.

Could you offer detailed help on this matter?

Thanks in advance.

DotWang (Collaborator) commented Jun 18, 2024

@geonoon In fact, we have considered two cases for distributed pretraining: SLURM and a regular server. However, I'm not sure whether MTP's main_pretrain.py can be run on a server as-is; maybe you can refer to this to revise the code related to distributed pretraining.
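For reference, here is a minimal sketch of what the two initialization cases (SLURM vs. plain server) typically look like in PyTorch. It is an illustration only, not the actual MTP code; the function name init_distributed and the single-process fallback are assumptions:

import os
import torch
import torch.distributed as dist

def init_distributed():
    """Set up torch.distributed for either a SLURM launch or a plain server."""
    if "SLURM_PROCID" in os.environ:
        # Case 1: launched through SLURM (srun); ranks come from SLURM variables.
        # MASTER_ADDR / MASTER_PORT still need to be exported, e.g. in the sbatch script.
        rank = int(os.environ["SLURM_PROCID"])
        world_size = int(os.environ["SLURM_NTASKS"])
        local_rank = rank % max(torch.cuda.device_count(), 1)
    elif "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # Case 2: launched with torch.distributed.launch / torchrun on a plain server;
        # the launcher exports RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and
        # (on recent PyTorch versions) LOCAL_RANK.
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
    else:
        # Fallback: single process, e.g. debugging on one GPU or on CPU only.
        return 0, 1, 0

    backend = "nccl" if torch.cuda.is_available() else "gloo"
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
    dist.init_process_group(backend=backend, init_method="env://",
                            world_size=world_size, rank=rank)
    dist.barrier()
    return rank, world_size, local_rank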

Here is a command example:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 \
    --nnodes=1 --master_port=10001 --master_addr=[server ip] main_pretrain.py
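On a single machine, 127.0.0.1 normally works as the master address. Note that torch.distributed.launch is deprecated in recent PyTorch releases; a roughly equivalent torchrun invocation (assuming a recent PyTorch and a single node) would be:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 \
    --nnodes=1 --master_port=10001 --master_addr=127.0.0.1 main_pretrain.py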
