Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding aggregated logs for training run #411

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

MayankChaturvedi
Copy link

What this does

This change adds four aggregated profiling metrics in training logs- average policy update time, max policy update time, average data loading time, and max data loading time.
The comments of issue describe the thought process behind the metrics.

Examples:

Example
New log smpl:2K ep:3 epch:0.06 loss:3.706 grdn:94.749 lr:1.0e-05 pu_mx_av:1.472|1.159 dl_mx_av:0.022|0.010
Old log step:200 smpl:2K ep:3 epch:0.06 Loss:2.806 grdn: 14.267 lr: 10e-05 updt_s:1.278 data_s:0.010

How it was tested

Ran the training script

python lerobot/scripts/train.py     policy=act     env=aloha     env.task=AlohaInsertion-v0     dataset_repo_id=lerobot/aloha_sim_insertion_human device=cpu

How to checkout & try? (for the reviewer)

Reviewers can run the above command to validate the output

Copy link
Collaborator

@alexander-soare alexander-soare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking mostly good to me! Please take a review at my comments, and once done I'll try it out on my local machine.

lerobot/scripts/train.py Outdated Show resolved Hide resolved
lerobot/scripts/train.py Outdated Show resolved Hide resolved
@alexander-soare
Copy link
Collaborator

alexander-soare commented Sep 6, 2024

Thanks @MayankChaturvedi

I tried this out, and I think maybe the pipe symbol is quite jarring on the eyes trying to scan the line quickly. Maybe a , looks better?

# Current version.
INFO 2024-09-06 10:28:34 ts/train.py:197 smpl:3K ep:26 epch:0.13 loss:0.391 grdn:8.145 lr:1.0e-05 updt_max|avg:89|84 data_max|avg:106|28

# Suggestion.
# Here I also add "ms" to make it clear it's milliseconds. To make space for that, I removed the "ep" entry.
INFO 2024-09-06 10:28:34 ts/train.py:197 smpl:3K epch:0.13 loss:0.391 grdn:8.145 lr:1.0e-05 updt_ms_max,avg:89,84 data_ms_max,avg:106,28

Let's also get @Cadene's input

@Cadene
Copy link
Collaborator

Cadene commented Sep 16, 2024

Hello @MayankChaturvedi ,
Thanks a lot of your contribution on this PR.
We will come back to it when time allows.
Thanks for your understanding.
Best

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants