
Progress bar missing with litdata.StreamingDataset and wrong number of steps in an epoch #112

Closed · yhl48 opened this issue Apr 25, 2024 · 4 comments · Fixed by #122
Labels: bug, help wanted

yhl48 (Contributor) commented Apr 25, 2024

🐛 Bug

There are two separate issues here.

When training a model using litdata.StreamingDataset, the tqdm progress bar shows {steps}/? and the estimated time is missing.

Moreover, the total number of steps in an epoch seems to be independent of the number of GPUs. Instead of total_steps = num_samples / (num_gpus * batch_size), the log reports total_steps = num_samples / batch_size.
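
For reference, a minimal reproduction sketch (the dataset path, model, and item shape are hypothetical placeholders; any optimized streaming dataset trained through the Trainer shows the same behaviour):

```python
import lightning as L
import litdata as ld
import torch


class BoringModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


# Placeholder location: any dataset previously prepared with litdata.optimize.
dataset = ld.StreamingDataset("s3://my-bucket/optimized-dataset")
dataloader = ld.StreamingDataLoader(dataset, batch_size=8)

trainer = L.Trainer(max_epochs=1, accelerator="gpu", devices=4, strategy="ddp")
trainer.fit(BoringModel(), dataloader)
# Observed: the tqdm bar renders "{steps}/?" with no time estimate, and the
# reported epoch length is num_samples / batch_size rather than
# num_samples / (num_gpus * batch_size).
```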

Expected behavior

The progress bar should show the estimated time and the fraction of steps that have been completed.

total_steps = num_samples / (num_gpus * batch_size)
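
For example, with hypothetical numbers (10,000 samples, 4 GPUs, batch size 8) an epoch should be 312 steps, not the 1,250 currently reported:

```python
num_samples, num_gpus, batch_size = 10_000, 4, 8
total_steps = num_samples // (num_gpus * batch_size)  # 312 (expected)
wrong_steps = num_samples // batch_size               # 1250 (currently logged)
```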

cc @tchaton

yhl48 added the bug and help wanted labels on Apr 25, 2024
Hi! Thanks for your contribution, great first issue!

yhl48 (Contributor, Author) commented Apr 25, 2024

I might be stating the obvious here, but the issue seems to originate from this line, where self.trainer.num_training_batches == inf.
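
As an illustration of the symptom only (a sketch, not the actual Lightning progress-bar code): with an infinite num_training_batches the bar has no total to work with, so it can show neither a fraction nor an ETA.

```python
import math


def progress_total(num_training_batches: float):
    # Hypothetical helper mirroring how an infinite batch count leaves the
    # progress bar without a total: inf is mapped to "unknown".
    return None if math.isinf(num_training_batches) else int(num_training_batches)


print(progress_total(float("inf")))  # None  -> bar renders "{steps}/?", no ETA
print(progress_total(1250.0))        # 1250  -> bar renders "123/1250 [... <ETA ...]"
```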

yhl48 (Contributor, Author) commented Apr 25, 2024

Related issue: Lightning-AI/pytorch-lightning#15734

yhl48 changed the title from "Progress bar missing with litdata.StreamingDataset" to "Progress bar missing with litdata.StreamingDataset and wrong number of steps in an epoch" on Apr 25, 2024
yhl48 (Contributor, Author) commented May 6, 2024

With regard to the data distribution across GPUs, I believe this line

self.distributed_env = _DistributedEnv.detect()

should be called again in __iter__: the DDP process group is initialised by the Trainer, which runs after StreamingDataset is initialised, so this check

if torch.distributed.is_available() and torch.distributed.is_initialized():

always evaluates to False.
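
A rough sketch of the suggested re-detection (the import path and the surrounding class body are assumptions, not the actual litdata source; only the two quoted lines come from the repository):

```python
import torch.distributed
from litdata.utilities.env import _DistributedEnv  # import path assumed


class StreamingDatasetSketch:
    def __init__(self):
        # Runs before the Trainer initialises the DDP process group, so
        # torch.distributed.is_initialized() is still False here and the
        # detected world size is 1, however many GPUs will be used.
        self.distributed_env = _DistributedEnv.detect()

    def __iter__(self):
        # Suggested change: detect again once iteration starts. By then the
        # Trainer has set up distributed training, so world_size and rank are
        # correct and each rank gets num_samples / num_gpus samples, which
        # also fixes the reported epoch length.
        if torch.distributed.is_available() and torch.distributed.is_initialized():
            self.distributed_env = _DistributedEnv.detect()
        ...  # rest of the iteration logic unchanged
```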

@tchaton
