When I finish one epoch of training, the main_worker function calls ts.collect_state_dict(model, state_dict).
But because of limited GPU resources on my machine, calling ts.collect_state_dict(model, state_dict) raises an out-of-memory error.
I found that it gathers the state_dict on the GPU. Is there any way to gather it on the CPU?
It is not possible to perform the gather operation on the CPU because it relies on the NCCL backend. However, there is a way to avoid gathering on the GPU on the fly: save the state_dict of each shard locally, then write a post-processing script to merge them together. For example, when training with 16 GPUs across 16 ranks, save 16 checkpoints during training, such as model_state_rank_001.pth, model_state_rank_002.pth, …, model_state_rank_016.pth. After training finishes, write a post-processing script to gather these 16 checkpoints into one. Pay attention to keeping the correct order of the shard states, and run an inference test to verify the result.
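A minimal sketch of that workflow is below. The helper names `save_local_shard` and `merge_shards`, the checkpoint directory, and especially the shard-merging rule are assumptions for illustration, not a torchshard API: the concatenation dimension depends on how each layer was parallelized in your model, so adjust it per layer.

```python
import os
import torch
import torch.distributed as dist


def save_local_shard(model, save_dir):
    """Each rank saves its own (already sharded) state_dict to disk,
    moving tensors to CPU first so no extra GPU memory is needed."""
    rank = dist.get_rank()
    cpu_state = {k: v.cpu() for k, v in model.state_dict().items()}
    path = os.path.join(save_dir, f"model_state_rank_{rank + 1:03d}.pth")
    torch.save(cpu_state, path)
    return path


def merge_shards(save_dir, world_size, out_path="model_state_full.pth"):
    """Post-processing on a single machine: load all rank checkpoints on CPU
    and stitch them back into one full state_dict, keeping rank order."""
    shards = [
        torch.load(os.path.join(save_dir, f"model_state_rank_{r + 1:03d}.pth"),
                   map_location="cpu")
        for r in range(world_size)  # rank order matters
    ]
    merged = {}
    for key in shards[0]:
        tensors = [s[key] for s in shards]
        if all(torch.equal(t, tensors[0]) for t in tensors):
            # Replicated (non-parallel) parameter: identical on every rank.
            merged[key] = tensors[0]
        else:
            # Sharded parameter: concatenate the per-rank pieces in rank order.
            # NOTE: the correct dim depends on how the layer was parallelized
            # (e.g. dim 0 vs. dim 1); adjust this per layer for your model.
            merged[key] = torch.cat(tensors, dim=0)
    torch.save(merged, out_path)
    return out_path
```

After merging, load the resulting checkpoint into an unsharded copy of the model and run the inference test mentioned above to confirm the shards were recombined in the right order.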