
How to enable multi-GPU training (1 model, multiple GPUs) on a server with limited memory? #10

Open
fantasysee opened this issue Apr 20, 2022 · 3 comments

Comments

@fantasysee

Description

Hi @lengstrom, thanks for your wonderful work!

My goal is to train a ResNet-18 on ImageNet on my server using a multi-GPU strategy to speed up training. The server has 4 RTX 2080 Ti GPUs and 46 GB of RAM, which is not large enough to load ImageNet into memory.

I have read the instructions at https://docs.ffcv.io/parameter_tuning.html (Scenario: Large-scale datasets, and Scenario: Multi-GPU training (1 model, multiple GPUs)).

Right now, I can train a ResNet-18 on a single card by using os_cache=False. However, if I use in_memory=0 and distributed=1 to run the provided train_imagenet.py as follows, the errors listed at the bottom are reported. Could you please tell me how to solve this issue?


Command

python train_imagenet.py --config-file rn18_configs/rn18_16_epochs.yaml \
    ... \
    --data.in_memory=0 \
    --training.distributed=1
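
For reference, my assumption (not verified against the script) is that these two flags feed straight into the FFCV Loader constructor, roughly:

# Assumed mapping inside train_imagenet.py (hypothetical simplification):
#   --data.in_memory=0        ->  Loader(..., os_cache=False)    # FFCV's process cache instead of the OS page cache
#   --training.distributed=1  ->  Loader(..., distributed=True)  # one loader / model replica per GPU rank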

Message

Warning: no ordering seed was specified with distributed=True. Setting seed to 0 to match PyTorch distributed sampler.

=> Logging in ...

Not enough memory; try setting quasi-random ordering
(OrderOption.QUASI_RANDOM) in the dataloader constructor's order argument.

Full error below:
0%| | 0/1251 [00:01<?, ?it/s]
Exception ignored in: <function EpochIterator.__del__ at 0x7f528d4f04c0>
Traceback (most recent call last):
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 161, in __del__
self.close()
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 158, in close
self.memory_context.__exit__(None, None, None)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/context.py", line 59, in __exit__
self.executor.__exit__(*args)
AttributeError: 'ProcessCacheContext' object has no attribute 'executor'
Traceback (most recent call last):
File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 510, in
ImageNetTrainer.launch_from_args()
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
return func(*args, **kwargs)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in call
return self.func(*args, **filled_args)
File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 461, in launch_from_args
ch.multiprocessing.spawn(cls._exec_wrapper, nprocs=world_size, join=True)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 468, in _exec_wrapper
cls.exec(*args, **kwargs)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
return func(*args, **kwargs)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in call
return self.func(*args, **filled_args)
File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 478, in exec
trainer.train()
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
return func(*args, **kwargs)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in call
return self.func(*args, **filled_args)
File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 300, in train
train_loss = self.train_loop(epoch)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 63, in result
return func(*args, **kwargs)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/fastargs/decorators.py", line 35, in call
return self.func(*args, **filled_args)
File "/mnt/sdb2/fangchao/Workspace/proj_base/ffcv/examples/imagenet-example/train_imagenet.py", line 361, in train_loop
for ix, (images, target) in enumerate(iterator):
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/tqdm/std.py", line 1195, in iter
for obj in iterable:
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/loader.py", line 214, in iter
return EpochIterator(self, selected_order)
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 43, in init
raise e
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 37, in init
self.memory_context.enter()
File "/home/fangchao/miniconda3/envs/ffcv_11.3/lib/python3.9/site-packages/ffcv/memory_managers/process_cache/context.py", line 32, in enter
self.memory = np.zeros((self.schedule.num_slots, self.page_size),
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 229. GiB for an array with shape (29251, 8388608) and data type uint8
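
If it helps, the 229 GiB figure seems to be just the process cache reserving num_slots × page_size bytes; a quick sanity check of my own (my arithmetic, not code from FFCV):

# Reproducing the size of the failed allocation from the traceback above
num_slots = 29251                      # pages the process cache wants to hold at once
page_size = 8388608                    # 8 MiB per page, in bytes
print(num_slots * page_size / 2**30)   # ~228.5 GiB, far beyond the 46 GB of system RAM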

@lengstrom
Contributor

lengstrom commented Jun 29, 2022

You should try using OrderOption.RANDOM; QUASI_RANDOM isn't implemented for distributed training yet. Let me know if that fixes it for you, and if not we can try something else!
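
Roughly, that means something like this where the train loader is built (a sketch with placeholder values, not the exact code from the example script):

from ffcv.loader import Loader, OrderOption

# Sketch: force RANDOM ordering for the distributed train loader (placeholder path/sizes)
train_loader = Loader(
    '/path/to/train.beton',        # placeholder path to your FFCV dataset file
    batch_size=512,                # placeholder per-GPU batch size
    num_workers=12,                # placeholder
    order=OrderOption.RANDOM,      # RANDOM instead of QUASI_RANDOM for distributed runs
    os_cache=False,                # matches --data.in_memory=0
    distributed=True,              # one loader process per GPU
    drop_last=True,
)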

@fantasysee
Author

Thank you very much!

It works! Multi-GPU training is enabled when I set dist.world_size=4, data.in_memory=0, and training.distributed=1 and use OrderOption.RANDOM.

However, I found that training is unexpectedly slow when distributed training is enabled for ResNet-18 on ImageNet (about 31.73 s/it at batch size 512). The estimated training time is almost the same as with the single-GPU strategy.

Would you please tell me how I could speed up the training process?

@netw0rkf10w

Same issue!
libffcv/ffcv#268
