Adding aggregated logs for training run #411

MayankChaturvedi · 2024-09-04T19:25:07Z

What this does

This change adds four aggregated profiling metrics to the training logs: average policy update time, max policy update time, average data loading time, and max data loading time.
The comments on the issue describe the thought process behind the metrics.
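As a rough illustration of the aggregation described above, here is a minimal sketch that collects per-step durations between two log lines and reports their max and average. The `TimingWindow` class and its method names are hypothetical and are not taken from this PR's diff; the `pu_mx_av` label simply mirrors the new log format.

```python
from dataclasses import dataclass, field


@dataclass
class TimingWindow:
    """Hypothetical helper: collects per-step durations (in seconds) between two log lines."""

    samples: list[float] = field(default_factory=list)

    def add(self, seconds: float) -> None:
        # Record the duration of one step (e.g. one policy update).
        self.samples.append(seconds)

    def max_avg(self) -> tuple[float, float]:
        # Return (max, average) over the window, or zeros if nothing was recorded.
        if not self.samples:
            return 0.0, 0.0
        return max(self.samples), sum(self.samples) / len(self.samples)

    def reset(self) -> None:
        # Start a fresh window after each log line is emitted.
        self.samples.clear()


# Made-up durations, purely for illustration.
update_window = TimingWindow()
for seconds in (1.0, 1.5, 0.5):
    update_window.add(seconds)

mx, av = update_window.max_avg()
print(f"pu_mx_av:{mx:.3f}|{av:.3f}")  # prints "pu_mx_av:1.500|1.000"
update_window.reset()
```

Resetting the window after every log line keeps each reported max and average local to the steps since the previous line, rather than to the whole run.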

Examples:

Log	Example
New log	smpl:2K ep:3 epch:0.06 loss:3.706 grdn:94.749 lr:1.0e-05 pu_mx_av:1.472|1.159 dl_mx_av:0.022|0.010
Old log	step:200 smpl:2K ep:3 epch:0.06 Loss:2.806 grdn: 14.267 lr: 10e-05 updt_s:1.278 data_s:0.010

How it was tested

Ran the training script:

python lerobot/scripts/train.py policy=act env=aloha env.task=AlohaInsertion-v0 dataset_repo_id=lerobot/aloha_sim_insertion_human device=cpu

How to checkout & try? (for the reviewer)

Reviewers can run the command above to validate the output.

alexander-soare

Looking mostly good to me! Please take a look at my comments, and once done I'll try it out on my local machine.

lerobot/scripts/train.py

alexander-soare · 2024-09-06T09:32:17Z

Thanks @MayankChaturvedi

I tried this out, and I think the pipe symbol is quite jarring on the eyes when trying to scan the line quickly. Maybe a comma (,) looks better?

# Current version.
INFO 2024-09-06 10:28:34 ts/train.py:197 smpl:3K ep:26 epch:0.13 loss:0.391 grdn:8.145 lr:1.0e-05 updt_max|avg:89|84 data_max|avg:106|28

# Suggestion.
# Here I also add "ms" to make it clear it's milliseconds. To make space for that, I removed the "ep" entry.
INFO 2024-09-06 10:28:34 ts/train.py:197 smpl:3K epch:0.13 loss:0.391 grdn:8.145 lr:1.0e-05 updt_ms_max,avg:89,84 data_ms_max,avg:106,28

Let's also get @Cadene's input
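For reference, the suggested format could be produced by something like the sketch below. It assumes the timings are held in seconds and rounded to whole milliseconds for display; `format_timing_ms` is a hypothetical helper, not a function from `train.py`.

```python
def format_timing_ms(label: str, max_s: float, avg_s: float) -> str:
    """Hypothetical formatter: render max/avg durations (seconds) as whole milliseconds."""
    max_ms = round(max_s * 1000)
    avg_ms = round(avg_s * 1000)
    # A comma separates the two values, as in the suggestion above.
    return f"{label}_ms_max,avg:{max_ms},{avg_ms}"


print(format_timing_ms("updt", 0.089, 0.084))  # prints "updt_ms_max,avg:89,84"
print(format_timing_ms("data", 0.106, 0.028))  # prints "data_ms_max,avg:106,28"
```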

Cadene · 2024-09-16T11:54:55Z

Hello @MayankChaturvedi,
Thanks a lot for your contribution on this PR.
We will come back to it when time allows.
Thanks for your understanding.
Best

alexander-soare requested changes Sep 5, 2024


MayankChaturvedi added 4 commits September 5, 2024 21:36

Make profiling during training more informative

aa2c620

Removing the "step" count log from training logger

eee71be

Adding documentation for the new logging

d459ab6

Logging milliseconds, and improving documentation

781bd7f

MayankChaturvedi force-pushed the aggregated-profiling branch from c2308f1 to 781bd7f Compare September 5, 2024 21:42

Adding logging example in the train policy example file

7eaa58a

MayankChaturvedi force-pushed the aggregated-profiling branch from 86bf01e to 7eaa58a Compare September 5, 2024 21:49

alexander-soare self-assigned this Sep 6, 2024

Merge branch 'main' into aggregated-profiling

602a73d

Merge branch 'main' into aggregated-profiling

a7fb0f7
