Should we use BackgroundGenerator when we already have DataLoader? #5
To the best of my knowledge, the DataLoader in PyTorch creates a set of worker threads that all prefetch new data at once when all workers are empty. So if, for example, you create 8 worker threads:

Using the prefetch generator, we make sure that each of those workers always has at least one additional data item loaded. You can see this behavior if you create a very shallow network. I have two Colab notebooks (based on the CIFAR10 example from the official tutorial): Here with the data loader and 2 workers: https://colab.research.google.com/drive/10wJIfCw5moPc-Yx9rSqWFEXkNceAOPpc Here with the additional prefetch_generator:

This is why keeping track of computing vs. data loading time (aka compute efficiency) is important. In this simple example, we even save a lot of training time. If anyone knows how to fix this behavior in the PyTorch data loader, let me know :)
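To make the idea above concrete, here is a minimal sketch of what a background prefetching generator like `prefetch_generator`'s `BackgroundGenerator` does: a worker thread eagerly pulls items from the wrapped iterator into a bounded queue, so the consumer rarely has to wait for the next item. This is an illustration only, using the standard library; the real package handles more (e.g. exception propagation) and the class name `SimpleBackgroundGenerator` is my own.

```python
import queue
import threading

class SimpleBackgroundGenerator:
    """Illustrative stand-in for prefetch_generator.BackgroundGenerator."""

    _SENTINEL = object()

    def __init__(self, iterable, max_prefetch=1):
        # Bounded queue: the worker blocks once max_prefetch items are buffered.
        self.queue = queue.Queue(maxsize=max_prefetch)
        self.thread = threading.Thread(
            target=self._worker, args=(iterable,), daemon=True
        )
        self.thread.start()

    def _worker(self, iterable):
        for item in iterable:
            self.queue.put(item)        # blocks when the buffer is full
        self.queue.put(self._SENTINEL)  # signal that the source is exhausted

    def __iter__(self):
        return self

    def __next__(self):
        item = self.queue.get()
        if item is self._SENTINEL:
            raise StopIteration
        return item

# The consumer can do work on each item while the next one loads in the background.
gen = SimpleBackgroundGenerator(range(5), max_prefetch=2)
print(list(gen))  # [0, 1, 2, 3, 4]
```

The bounded queue is the key design choice: it keeps at least one item ready without letting the producer run arbitrarily far ahead of the consumer.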
Thank you for your wonderful example!

```python
from torch.utils.data import DataLoader
from prefetch_generator import BackgroundGenerator

class DataLoaderX(DataLoader):
    def __iter__(self):
        # Wrap the standard iterator so batches are prefetched in a background thread
        return BackgroundGenerator(super().__iter__())
```
I had a problem using BackgroundGenerator with PyTorch Distributed Data Parallel.
Technically no, it creates worker processes.
PyTorch does not do this.
This is a flawed benchmark that doesn't actually show the importance of prefetching -- it runs fastest without any prefetching: when setting
A quick update on this one. PyTorch 1.7 introduced a configurable prefetching parameter for the DataLoader. I didn't do any benchmarking yet, but I can imagine that the integrated prefetching makes this workaround unnecessary.
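For reference, a minimal sketch of the built-in prefetching mentioned above (assumes PyTorch >= 1.7). The tiny `TensorDataset` is just for illustration; the point is the `prefetch_factor` parameter, which controls how many batches each worker keeps buffered, so roughly `num_workers * prefetch_factor` batches are ready in advance.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A toy dataset of 16 scalar samples, purely for demonstration.
dataset = TensorDataset(torch.arange(16).float())

loader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=2,       # prefetch_factor only takes effect when num_workers > 0
    prefetch_factor=2,   # batches buffered per worker (2 is the default)
)

batches = list(loader)   # 16 samples / batch_size 4 -> 4 batches
```

With this built in, the separate BackgroundGenerator wrapper may no longer be needed for plain (non-DDP) training loops.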
I got exactly the same problem. But turning off BackgroundGenerator in DDP makes the data sampling phase much slower. Are there any better solutions for this?
I really enjoy this guide! However, I am not sure what the advantage of prefetch_generator is. It seems that DataLoader in PyTorch already supports prefetching. Thank you!