DDP (multi-GPU) IterableDataset is not working as expected? #15734
Comments
Yes, this is expected. Lightning can't know how to shard the data/iterator you provide. You need to make sure your iterator returns half of the data on GPU 0 and the other half on GPU 1. You can do this, for example, by changing your for-loop to something like this (typos expected): for item in imdb_tokenized['train'][rank::num_gpus]:
... This shards your data. The rank can be accessed, for example, through the trainer (e.g., trainer.global_rank) or torch.distributed.get_rank(). Another way would be to use the DistributedSampler inside your iterable dataset. |
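To make that slicing concrete, here is a minimal hypothetical sketch of an IterableDataset that shards by rank (my own illustration, not the issue author's or the notebook's code):

import itertools
import torch.distributed as dist
from torch.utils.data import IterableDataset

class ShardedIterableDataset(IterableDataset):
    """Hypothetical sketch: each process yields a disjoint, interleaved shard of the data."""

    def __init__(self, data):
        self.data = data  # any iterable of examples, e.g. a list of tokenized items

    def __iter__(self):
        # Fall back to single-process behaviour when torch.distributed is not initialized.
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        # Process `rank` yields items rank, rank + world_size, rank + 2*world_size, ...
        return itertools.islice(iter(self.data), rank, None, world_size)

Note that this only splits the stream across processes; with num_workers > 0 each worker would still see the full shard, which is addressed further down in the thread.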
Makes sense. One more doubt: in my DataLoader, if I set num_workers, does that change how the data is sampled? Also, is there any concept like Keras' steps_per_epoch here, i.e., a way to limit the number of batches per epoch? |
No. num_workers has nothing to do with the sampling of the data.
Read more about workers here: https://pytorch.org/docs/stable/data.html#multi-process-data-loading
Trainer(limit_train_batches=1000, max_epochs=10) |
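As an illustration (a hedged sketch; the commented-out model and dataloader are placeholders), limit_train_batches plays roughly the role of Keras' steps_per_epoch:

from pytorch_lightning import Trainer

# Cap each epoch at 1000 training batches, independent of the dataset's true length,
# similar in spirit to Keras' steps_per_epoch.
trainer = Trainer(limit_train_batches=1000, max_epochs=10)
# trainer.fit(model, train_dataloaders=train_loader)  # model / train_loader defined elsewhere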
Thanks. But it would help if you could execute the above code and share the results. |
Any update on this, please? |
Any update? |
@awaelchli, after trying to use PyTorch's DistributedSampler with an IterableDataset in my application, I observed that DistributedSampler raised an error saying it requires each input dataset to have a len() property. Does this match your understanding, given the context of this discussion? If so, how might we use the DistributedSampler to circumvent the original concern in this issue? |
@awaelchli, I second this issue. I am also having difficulties figuring out the simplest way to enable multi-GPU and multi-dataloader-worker support for IterableDatasets when using PyTorch Lightning. All the examples I have worked through so far do not seem to work when considering both of the following cases: (1) multiple GPUs (e.g., with DDP) and (2) multiple DataLoader workers. Would it be possible to put together a simple PyTorch Lightning example of how one can structure their IterableDataset and PyTorch Lightning DataModule to support the two use cases above? |
Yes, these observations are all expected. This is not special behavior in Lightning; it's just how the IterableDataset and DataLoader work in PyTorch. In short: when using an iterable dataset, you need to take care of the sampler inside your dataset yourself, and shard/partition the data yourself across workers and devices. Yes, I can put together an example, but it has to wait a few days until the new year. |
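One concrete reading of "shard across workers and devices" is the sketch below (an assumption of mine, not the notebook's exact code): combine the DataLoader worker id with the distributed rank into a single global shard index.

import itertools
import torch.distributed as dist
from torch.utils.data import IterableDataset, get_worker_info

class DoublyShardedDataset(IterableDataset):
    """Sketch: shard across DDP processes *and* DataLoader workers."""

    def __init__(self, data):
        self.data = data  # any iterable of examples

    def __iter__(self):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        worker_info = get_worker_info()
        worker_id = worker_info.id if worker_info is not None else 0
        num_workers = worker_info.num_workers if worker_info is not None else 1
        # Every (process, worker) pair gets a unique offset into a single interleaved stream.
        shard_id = rank * num_workers + worker_id
        num_shards = world_size * num_workers
        return itertools.islice(iter(self.data), shard_id, None, num_shards)

This is the same idea the notebook and the later comments in this thread implement with DistributedSampler or itertools.islice.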
I really don't know why PyTorch is so preferred, despite such complicated and clumsy distribution strategies. In TensorFlow it's a cakewalk.
|
@s4sarath let's stay on topic. @s4sarath @amorehead Here is a notebook (https://colab.research.google.com/drive/1OFLZnX9y5QUFNONuvFsxOizq4M-tFvk-?usp=sharing) that explains the difference between the map-style dataset and the iterable dataset with several examples, uses DataLoader workers, and also shows how it behaves across multiple processes. At the bottom, I also show the example with Lightning. I hope this helps your understanding. |
Thanks man. Will have a look
|
Thanks @awaelchli for the detailed notebook, it's really helpful! One question I have is, does this only work with PyTorch's inbuilt parallel strategies, or will it work for other strategies like DeepSpeed? Do we need to call a PyTorch Lightning get_rank/worker_info function which abstracts away the underlying strategy, or does calling the torch function always guarantee we get the correct information regardless of strategy? |
@gorold It should work with DeepSpeed, yes, but probably not with the TPU strategy. I haven't mentioned it in the notebook, but PyTorch is developing torchdata, which will address these issues completely, as it is heavily focused on performant iterable-style data loading together with DataLoader2. It would eliminate essentially all of the boilerplate code I show in that notebook. |
Thanks a lot! |
Thanks so much for sharing! @awaelchli #15734 (comment) |
With a pure PyTorch Iterable dataset, I don't know how to do that cleanly. |
This notebook seems to work in its most basic form for me, but for some reason, when I implement this strategy with a batch of tensors (rather than a batch of integers), the distributed sampler doesn't return the tensors in their original form. Instead, I get tensors with totally shuffled numbers from my dataloader. Any idea why this would be the case? |
@keenjo By default, the distributed sampler shuffles the data. It has a shuffle argument that you can set to False. |
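As a small illustration (my own snippet; my_dataset is a placeholder for your dataset):

from torch.utils.data import DistributedSampler

# shuffle defaults to True, which is why the batches come back reordered;
# setting it to False preserves the original order within each shard.
sampler = DistributedSampler(my_dataset, shuffle=False)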
I don't understand how this code from that Colab notebook actually works:
Where is the data actually coming from in this example? |
When I add the two lines to get world size and process rank to my |
For distributed training, each process will call the dataset independently: iterator = iter(dataset). If we don't put any distributed sampling inside our dataset, each process would get the same samples, e.g. process 0: next(iterator), next(iterator), next(iterator) -> [0, 1, 2] and process 1: next(iterator), next(iterator), next(iterator) -> [0, 1, 2]. This would render data-parallel training completely useless. Instead, if we add the distributed sampler, each process returns different data, e.g. process 0: next(iterator), next(iterator), next(iterator) -> [0, 2, 4]. You can study the output of the notebook cell to see the same thing. |
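A quick way to see the same effect without launching two processes is to construct the sampler with an explicit rank (a toy snippet of mine, not from the notebook):

from torch.utils.data import DistributedSampler

data = list(range(6))  # a toy "dataset" of indices 0..5

# Simulate what each of 2 processes would draw from a DistributedSampler.
for rank in range(2):
    sampler = DistributedSampler(data, num_replicas=2, rank=rank, shuffle=False)
    print(f"process {rank}: {list(sampler)}")
# process 0: [0, 2, 4]
# process 1: [1, 3, 5]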
What I'm failing to understand is how in practice to pass the rank and world_size to the dataset when that is being created by my DataModule, before the Trainer is created. It seems that for this to work the Trainer is supposed to pass the rank somehow to the dataset. I can't figure out from your example notebook how to do this. When I try to access the rank and/or world_size in my Dataset before the Trainer is created, it either freezes during runtime or says I need to use |
@EvanZ I was also confused about this at first, but then figured it out. The Trainer does not need any information about the data to be instantiated. So I would recommend instantiating the Trainer first, then you can pass the trainer.world_size and trainer.global_rank to your data module without any issues. Hope this helps! |
One question that I guess seems obvious to you guys but not to me, do I have to explicitly call |
Hmm indeed that is helpful (in theory haha). Currently my training script looks like:
You're saying if I just flip the Trainer before the DataModule then I will be able to access the rank and the world size inside the Dataset? |
Yes, that's exactly it! |
Ok...maybe that's the missing detail I needed. I'll work on it some more! |
I should also mention that I used a different strategy to solve this problem in the end, using itertools.islice to avoid repeating data. My IterableDataset ended up looking like this:

import itertools
import torch
from torch.utils.data import IterableDataset

class CustomDataset(IterableDataset):
    def __init__(self, tokenizer, filepath, rank, world_size, stage='train'):
        super().__init__()
        self.tokenizer = tokenizer
        self.filepath = filepath
        self.stage = stage
        self.rank = rank
        self.world_size = world_size

    def __iter__(self):
        assert self.stage in ['train', 'val', 'test']
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is not None:
            num_workers = worker_info.num_workers
            worker_id = worker_info.id
        world_size = self.world_size
        rank = self.rank
        if self.stage == 'train':
            train_iter_source = open(f'{self.filepath}/train.source')
            train_iter_target = open(f'{self.filepath}/train.target')
            train_set = zip(train_iter_source, train_iter_target)
            mapped_itr = map(self.no_newlines, train_set)
            tok_itr = map(self.tokenize_inputs, mapped_itr)
        elif self.stage == 'val':
            val_iter_source = open(f'{self.filepath}/val.source')
            val_iter_target = open(f'{self.filepath}/val.target')
            val_set = zip(val_iter_source, val_iter_target)
            mapped_itr = map(self.no_newlines, val_set)
            tok_itr = map(self.tokenize_inputs, mapped_itr)
        elif self.stage == 'test':
            test_iter_source = open(f'{self.filepath}/test_both.source')
            test_iter_target = open(f'{self.filepath}/test_both.target')
            test_set = zip(test_iter_source, test_iter_target)
            mapped_itr = map(self.no_newlines, test_set)
            tok_itr = map(self.tokenize_inputs, mapped_itr)
        if worker_info is not None:
            # Each (worker, rank) pair takes every (num_workers * world_size)-th item,
            # offset by its own position, so no example is repeated across GPUs/workers.
            if rank == 0:
                tok_itr = itertools.islice(tok_itr, worker_id, None, (num_workers * world_size))
            else:
                tok_itr = itertools.islice(tok_itr, worker_id + (num_workers * rank), None, (num_workers * world_size))
        return tok_itr

    def no_newlines(self, lines):
        '''
        Function to take newlines out of inputs
        '''
        lines = list(lines)
        for idx, line in enumerate(lines):
            lines[idx] = line.strip('\n')
        return lines

    def tokenize_inputs(self, lines):
        '''
        Function to tokenize a batch of lines that are read
        '''
        lines_tok = self.tokenizer.batch_encode_plus(lines,
                                                     return_special_tokens_mask=False,
                                                     add_special_tokens=False)['input_ids']
        return lines_tok |
Hmm that's interesting and a different organization than I use. I define
|
@keenjo Where do you pass in the rank and world size from? I assume you have a custom |
If I try to use
Even though I have instantiated the Trainer before the DataModule...Do I need to call |
Basically anywhere in my script I try to call |
I have a main training script, which is where I get the initial rank and world size from the Trainer:

# Instantiate the trainer
trainer = Trainer(accelerator='gpu',
                  devices=n_gpu,
                  precision=16,
                  val_check_interval=args.val_check_interval,
                  strategy='deepspeed_stage_2',
                  # logger=logger,
                  max_epochs=args.max_train_epochs,
                  callbacks=[checkpoint_callback,
                             # grad_accumulation,
                             lr_monitor,
                             early_stop_callback,
                             cb.TQDMProgressBar(),
                             pred_writer])

rank = trainer.global_rank
world_size = trainer.world_size

# Instantiate the data collator and data module
collate_fn = DataCollatorForSeq2SeqWithMaskingAndPadding(tokenizer=tok, max_length=args.max_length, padding=True)
dataset = LitRDF2TextDataModule(tokenizer=tok, train_batch_size=args.train_batch_size,
                                eval_batch_size=args.eval_batch_size, collate_fn=collate_fn,
                                data_path=args.data_path, rank=rank, world_size=world_size, buffer_size=args.buffer_size)

Then within my DataModule I pass the rank and world size on to the dataset:
# Pytorch Lightning Data Module
class LitRDF2TextDataModule(pl.LightningDataModule):
    """
    Class to prepare data and split it into Dataloaders
    - Look into Pytorch Lightning 1.9.4 Documentation for more information:
    - https://lightning.ai/docs/stable/
    """
    def __init__(self, tokenizer, train_batch_size, eval_batch_size, data_path, collate_fn,
                 rank, world_size, buffer_size):
        super().__init__()
        self.tokenizer = tokenizer                 # tokenizer to use on the dataset
        self.data_path = data_path                 # path to dataset
        self.train_batch_size = train_batch_size   # train batch size
        self.eval_batch_size = eval_batch_size     # eval batch size
        self.collate_fn = collate_fn
        self.rank = rank
        self.world_size = world_size
        self.buffer_size = buffer_size

    def setup(self, stage):
        '''
        Method to prepare data to be passed to dataloaders
        - Specifically this creates a pytorch Dataset object for each split of the data
          and prepares the data for masking
        '''
        print(f'Preparing {stage} data...')
        if stage == 'fit' or stage == 'validation':
            self.dataset_train = CustomDataset(tokenizer=self.tokenizer,
                                               filepath=self.data_path,
                                               rank=self.rank,
                                               world_size=self.world_size,
                                               stage='train')
        ...
        ... |
I'm confused. I thought the rank refers to each GPU. How does that work if you are calling it in the main script? How does it ever change? Are you sure your model is utilizing all the GPUs? |
I've ensured that my model is using all of the GPUs by printing out the ranks during various processes. Plus, before I solved my problem, I had the issue of my data being duplicated across all of the GPUs. I'm not exactly sure what could be causing your error, but just in case there is an incompatibility between the Lightning versions we are working with, I'm using pytorch-lightning 1.9.4. |
I am using 1.9.4 as well. It boggles my mind. The rank is 0 in the main script. You are passing that to your dataset. How can it ever change? I just don't get the logic. |
I finally got everything to "work" in the sense that I can see batches of data being sent to each GPU and the results appear to be similar to what I was getting before with just 1 GPU... the only problem is I am not actually seeing any speedup. It seems all this effort was for naught. :/ |
I want to ask: if I want to read a file line by line using this class, and I initialize one instance of it, where can I put the data path? In other words, how do I replace the |
I’m also having difficulty understanding how the rank can change when it’s initially set to 0 in the main script and then passed to the dataset.
@EvanZ did you ever find an answer to your question? I’m facing a similar issue and would really appreciate some insight. In the script @awaelchli provided, this seems to work in the non-Lightning version with:
However, I’m struggling to understand how to achieve the same behavior in the Lightning version, where mp.start_processes and init_process don’t seem to be utilized. |
I'm also quite confused by this. I hoped PyTorch Lightning would provide clearer guidance + examples (though the notebook is an excellent start!). Would using worker_init_fn be a solution here? See how it is used in the PyTorch documentation: https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset |
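For what it's worth, the worker_init_fn pattern shown in those docs only splits the work across DataLoader workers within one process, not across GPUs, so it solves at most half of the problem discussed here. A hedged sketch adapted from the docs' example (the class and variable names are illustrative):

import math
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class RangeIterable(IterableDataset):
    """Illustrative stream over the integers [start, end)."""
    def __init__(self, start, end):
        super().__init__()
        self.start, self.end = start, end

    def __iter__(self):
        return iter(range(self.start, self.end))

def worker_init_fn(worker_id):
    # Runs inside each DataLoader worker: narrow that worker's dataset copy to one slice.
    info = get_worker_info()
    dataset = info.dataset  # this worker's copy of the dataset
    overall_start, overall_end = dataset.start, dataset.end
    per_worker = int(math.ceil((overall_end - overall_start) / float(info.num_workers)))
    dataset.start = overall_start + worker_id * per_worker
    dataset.end = min(dataset.start + per_worker, overall_end)

loader = DataLoader(RangeIterable(0, 100), batch_size=10, num_workers=2,
                    worker_init_fn=worker_init_fn)

To also avoid duplicates across GPUs, you would still have to fold torch.distributed.get_rank() into the slicing, as in the earlier comments.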
A couple quick questions about the DataParallelIterableDataset from the notebook:

class DataParallelIterableDataset(IterableDataset):
    def __len__(self):
        # Caveat: When using DistributedSampler, we need to know the number of samples in our dataset!
        # Hence, we need to implement `__len__`.
        return NUM_SAMPLES

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        num_workers = worker_info.num_workers if worker_info is not None else 1
        worker_id = worker_info.id if worker_info is not None else 0

        world_size = torch.distributed.get_world_size()
        process_rank = torch.distributed.get_rank()

        sampler = torch.utils.data.DistributedSampler(
            self,
            num_replicas=num_workers * world_size,
            rank=process_rank * num_workers + worker_id,
            shuffle=False
        )

        for i in iter(sampler):
            yield i
|
There are 2 things that finally made this work for me:
Implement instead the sharding logic directly into a custom
|
Bug description
Hi,
I am currently testing with IterableDataset and DDP.
Total examples: 10000
Batch size: 32
Num GPUs: 2

While using IterableDataset, ideally with 2 GPUs we are supposed to run 157 steps (10000 / 32 batch / 2 GPUs) in one epoch. But instead of that, it is running for 314 steps (10000 / 32 batch). This issue only occurs with IterableDataset. When I am using a normal (map-style) Dataset from torch, things are good and fine. Is there any reason for this particular behaviour?
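For reference, a minimal sketch of the kind of setup being described, assuming an IterableDataset that naively yields all 10000 examples (the class, numbers, and commented-out model below are placeholders, not the original report's code):

import pytorch_lightning as pl
from torch.utils.data import DataLoader, IterableDataset

class NaiveIterable(IterableDataset):
    """Yields every example in every DDP process -- no sharding."""
    def __iter__(self):
        return iter(range(10_000))  # stand-in for 10000 real examples

train_loader = DataLoader(NaiveIterable(), batch_size=32)

# With a map-style Dataset, Lightning injects a DistributedSampler, so each of the
# 2 GPUs sees ~157 batches. With this IterableDataset, every GPU iterates the full
# stream, so the per-epoch step count is not halved.
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp", max_epochs=1)
# trainer.fit(model, train_loader)  # model omitted; sketch only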
How to reproduce the bug
Error messages and logs
Environment
More info
No response