Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve push_dataset_to_hub API + Add unit tests #231

Merged
merged 21 commits into from
Jun 13, 2024
Merged

Conversation

thomwolf
Copy link
Member

@thomwolf thomwolf commented May 30, 2024

What this does

  • Improve push_dataset_to_hub.py API
    • Instead of --data-dir data and implicit location for raw directory, uses explicit --raw-dir data/pusht_raw with full path to directory
    • instead of --save-to-disk 1 and implicit location for output directory, uses explicit --local-dir data/pusht with full path to directory
    • Instead of --community-id lerobot and --dataset-id pusht, uses simpler --repo-id lerobot/pusht
    • Instead of --debug 1 to run on first episode, uses more configurable --episodes 0
    • Instead of --save-tests-to-disk 1 and implicit location for dataset test directory, uses explicit --tests-data-dir tests/data/pusht
    • Instead of implicit --dry-run 1, uses explicit --push-to-hub 0 (by default to 1)
    • Instead of confusing --revision, hardcode CODEBASE_VERSION
    • Adds --force-override 1 (by default 0), to not delete local directory by mistake
  • When no --local-dir is provided, uses hardcoded /tmp/{REPO_ID} to cache images if needed
  • Adds generic tests for push_dataset_to_hub.py
  • Simplifies raw datasets download
    • Pushed all raw datasets to hub in cadene page
    • Instead of using one raw download per raw format, uses download_hub for all
  • Fix a bug in xarm raw datasets that could generate push datasets instead of lift (see below)
Screenshot 2024-06-12 at 10 53 19 - BREAKING CHANGE: sort the hdf5 files for aloha datasets, so episode 0 is not the same Screenshot 2024-06-12 at 11 38 07

How to try the code

# Before
python lerobot/scripts/push_dataset_to_hub.py \
--data-dir data \
--raw-format pusht_zarr \
--dataset-id pusht \
--repo-id lerobot/pusht \
--dry-run 1

# After
python lerobot/scripts/push_dataset_to_hub.py \
--raw-dir data/pusht_raw \
--raw-format pusht_zarr \
--repo-id lerobot/pusht \
--push-to-hub 0 \
--local-dir data/pusht
# Generic tests
pytest -sx tests/test_push_dataset_to_hub.py

# Backward compatibility tests
DATA_DIR=tests/data python -m pytest --run-skipped tests/test_push_dataset_to_hub.py::test_push_dataset_to_hub_pusht_backward_compatibility

How it was tested

CI

$ DATA_DIR=tests/data python -m pytest tests/test_push_dataset_to_hub.py::test_push_dataset_to_hub_pusht_zarr_format_backward_compatibility
=============================================================================== test session starts ===============================================================================
platform linux -- Python 3.10.12, pytest-8.2.1, pluggy-1.5.0
rootdir: /admin/home/remi_cadene/code/lerobot
configfile: pyproject.toml
plugins: hydra-core-1.3.2, cov-5.0.0
collected 1 item

Testing with DEVICE='cuda'

tests/test_push_dataset_to_hub.py .                                                                                                                                         [100%]

This change is Reviewable

@thomwolf thomwolf requested a review from Cadene May 30, 2024 21:25
Copy link
Collaborator

@Cadene Cadene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I prefer raw_dir and out_dir.
We will need to propagate the change to the README and other parts of the code.

Let's do it once we converged on the API for push_dataset_to_hub.
For instance, I am still wondering if we should rename it convert_to_lerobot_dataset.py with an option --push-to-hub.
With the current state of this PR, we currently have push_dataset_to_hub.py with an option --out-dir and --dry-run 1 to save locally without pushing to hub.

lerobot/scripts/push_dataset_to_hub.py Outdated Show resolved Hide resolved
@thomwolf
Copy link
Member Author

Updated it to fix the readme and doc. ok to merge @Cadene so I can push a next fix?

@thomwolf
Copy link
Member Author

Quick updated behavior: default "save-tests-to-disk" to 0

Copy link
Collaborator

@Cadene Cadene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR :)
We should definitly improve push_dataset_to_hub.py.

I just want to highlight that we dont have proper unit tests for push_dataset_to_hub.py for now. We prioritized other features.

If we start iterating heavily on push_dataset_to_hub.py, we might need to implement unit tests in a future PR to ensure we dont break things.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
lerobot/scripts/push_dataset_to_hub.py Outdated Show resolved Hide resolved
lerobot/scripts/push_dataset_to_hub.py Outdated Show resolved Hide resolved
lerobot/scripts/push_dataset_to_hub.py Outdated Show resolved Hide resolved
lerobot/scripts/push_dataset_to_hub.py Outdated Show resolved Hide resolved
@Cadene Cadene changed the title update script names Rename push_dataset_to_hub arguments to --raw-dir and --out-dir May 31, 2024
@Cadene Cadene changed the title Rename push_dataset_to_hub arguments to --raw-dir and --out-dir Small renaming of push_dataset_to_hub arguments May 31, 2024
thomwolf and others added 6 commits June 3, 2024 11:36
Co-authored-by: Remi <re.cadene@gmail.com>
Co-authored-by: Remi <re.cadene@gmail.com>
Co-authored-by: Remi <re.cadene@gmail.com>
Co-authored-by: Remi <re.cadene@gmail.com>
Co-authored-by: Remi <re.cadene@gmail.com>
Co-authored-by: Remi <re.cadene@gmail.com>
@thomwolf
Copy link
Member Author

thomwolf commented Jun 3, 2024

all good to merge

@Cadene Cadene changed the title Small renaming of push_dataset_to_hub arguments Improve push_dataset_to_hub API + Add unit tests Jun 12, 2024
@Cadene Cadene requested a review from aliberts June 12, 2024 13:59
Copy link
Member Author

@thomwolf thomwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Collaborator

@aliberts aliberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments mainly about the interface.
I didn't look into each _format.py script, tell me if you want another pair of eyes on those.

README.md Outdated Show resolved Hide resolved
@@ -18,58 +18,39 @@
or store it locally. LeRobot dataset format is lightweight, fast to load from, and does not require any
installation of neural net specific packages like pytorch, tensorflow, jax.

Example:
Example of how to download raw datasets, convert them into LeRobotDataset format, and push them to the hub:
```
python lerobot/scripts/push_dataset_to_hub.py \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be nice to add a comment above each of these commands explaining in plain English what these do, kind of like what we have in here or here and maybe use some of the options to illustrate more cases (like --push-to-hub 0 or --local-dir)

lerobot/scripts/push_dataset_to_hub.py Outdated Show resolved Hide resolved
lerobot/scripts/push_dataset_to_hub.py Outdated Show resolved Hide resolved
lerobot/scripts/push_dataset_to_hub.py Outdated Show resolved Hide resolved
lerobot/scripts/push_dataset_to_hub.py Outdated Show resolved Hide resolved
lerobot/scripts/push_dataset_to_hub.py Show resolved Hide resolved
tests/test_push_dataset_to_hub.py Outdated Show resolved Hide resolved
@aliberts aliberts added ✨ Enhancement New feature or request 🗃️ Dataset Something dataset-related labels Jun 13, 2024
Copy link
Collaborator

@aliberts aliberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Cadene & @thomwolf for this!

@aliberts aliberts merged commit 125bd93 into main Jun 13, 2024
6 checks passed
@aliberts aliberts deleted the update_push_to_hub branch June 13, 2024 13:18
@aliberts aliberts linked an issue Jun 13, 2024 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🗃️ Dataset Something dataset-related ✨ Enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make naming scheme for local data folders less strict
3 participants