
How to enable multi-GPU training (1 model, multiple GPUs) on a server with limited memory? #10

Open
fantasysee opened this issue Apr 20, 2022 · 3 comments

Comments

@fantasysee

Description

Hi @lengstrom, thanks for your wonderful work!

My goal is to train a ResNet-18 on ImageNet on my server using a multi-GPU strategy to speed up training. The server has 4 RTX 2080 Ti GPUs and 46 GB of RAM, which is not large enough to load ImageNet into memory.

I have read the instructions at https://docs.ffcv.io/parameter_tuning.html (Scenario: Large-scale datasets, and Scenario: Multi-GPU training (1 model, multiple GPUs)).

Right now, I can train a ResNet-18 on a single card by using os_cache=False. However, if I use in_memory=0 and distributed=1 to run the provided train_imagenet.py as follows, the errors listed at the bottom are reported. Could you please tell me how to solve this issue?


Command

python train_imagenet.py --config-file rn18_configs/rn18_16_epochs.yaml \
    ... \
    --data.in_memory=0 \
    --training.distributed=1
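
For reference, my assumption (not verified against the script) is that these two flags feed straight into the FFCV Loader constructor, roughly:

# Assumed mapping inside train_imagenet.py (hypothetical simplification):
#   --data.in_memory=0        ->  Loader(..., os_cache=False)    # FFCV's process cache instead of the OS page cache
#   --training.distributed=1  ->  Loader(..., distributed=True)  # one loader / model replica per GPU rank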

Message

Warning: no ordering seed was specified with distributed=True. Setting seed to 0 to match PyTorch distributed sampler.

=> Logging in ...

Not enough memory; try setting quasi-random ordering
(OrderOption.QUASI_RANDOM) in the dataloader constructor's order argument.

Full error below:
0%| | 0/1251 [00:01<?, ?it/s]
Exception ignored in: <function EpochIterator.__del__ at 0x7f528d4f04c0>
Traceback (most recent call last):
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 161, in __del__
self.close()
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 158, in close
self.memory_context.__exit__(None, None, None)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/context.py", line 59, in __exit__
self.executor.__exit__(*args)
AttributeError: 'ProcessCacheContext' object has no attribute 'executor'
Traceback (most recent call last):
File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 510, in
ImageNetTrainer.launch_from_args()
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
return func(*args, **kwargs)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in call
return self.func(*args, **filled_args)
File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 461, in launch_from_args
ch.multiprocessing.spawn(cls._exec_wrapper, nprocs=world_size, join=True)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 468, in _exec_wrapper
cls.exec(*args, **kwargs)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
return func(*args, **kwargs)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in call
return self.func(*args, **filled_args)
File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 478, in exec
trainer.train()
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
return func(*args, **kwargs)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in call
return self.func(*args, **filled_args)
File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 300, in train
train_loss = self.train_loop(epoch)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
return func(*args, **kwargs)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in call
return self.func(*args, **filled_args)
File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 361, in train_loop
for ix, (images, target) in enumerate(iterator):
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/tqdm/std.py", line 1195, in iter
for obj in iterable:
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/loader.py", line 214, in iter
return EpochIterator(self, selected_order)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 43, in init
raise e
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 37, in init
self.memory_context.enter()
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/context.py", line 32, in enter
self.memory = np.zeros((self.schedule.num_slots, self.page_size),
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 229. GiB for an array with shape (29251, 8388608) and data type uint8
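
If it helps, the 229 GiB figure seems to be just the process cache reserving num_slots × page_size bytes; a quick sanity check of my own (my arithmetic, not code from FFCV):

# Reproducing the size of the failed allocation from the traceback above
num_slots = 29251                      # pages the process cache wants to hold at once
page_size = 8388608                    # 8 MiB per page, in bytes
print(num_slots * page_size / 2**30)   # ~228.5 GiB, far beyond the 46 GB of system RAM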

@lengstrom
Contributor

lengstrom commented Jun 29, 2022

You should try using OrderOption.RANDOM; QUASI_RANDOM isn't implemented for distributed training yet. Let me know if that fixes it for you, and if not we can try something else!
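
Roughly, that means something like this where the train loader is built (a sketch with placeholder values, not the exact code from the example script):

from ffcv.loader import Loader, OrderOption

# Sketch: force RANDOM ordering for the distributed train loader (placeholder path/sizes)
train_loader = Loader(
    '/path/to/train.beton',        # placeholder path to your FFCV dataset file
    batch_size=512,                # placeholder per-GPU batch size
    num_workers=12,                # placeholder
    order=OrderOption.RANDOM,      # RANDOM instead of QUASI_RANDOM for distributed runs
    os_cache=False,                # matches --data.in_memory=0
    distributed=True,              # one loader process per GPU
    drop_last=True,
)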

@fantasysee
Author

Thank you very much!

It works! Multi-GPU training is enabled when I set dist.world_size=4, data.in_memory=0, and training.distributed=1 and use OrderOption.RANDOM.

However, I found that training is unexpectedly slow when distributed training is enabled for ResNet-18 on ImageNet (about 31.73 s/it at batch size 512). The estimated training time is almost the same as with the single-GPU strategy.

Would you please tell me how I could speed up the training process?

@netw0rkf10w

Same issue!
libffcv/ffcv#268
