Minimal Port of Octo Model #209

Open · wants to merge 31 commits into main

Conversation

@kashyapakshay kashyapakshay commented May 24, 2024

What this does

Minimal port of OctoModel. List of follow-up tasks:

  • Eval on PushT, tune hyperparams for best performance.
  • Compare results with OG Octo on PushT.
  • Add relevant datasets (like Open X-Embodiment).
  • Add new sim tasks similar to real tasks the authors have tested on.

How to checkout & try? (for the reviewer)

Provide a simple way for the reviewer to try out your changes.

Examples:

python lerobot/scripts/train.py policy=octo env=pusht env.task=PushT-v0 dataset_repo_id=lerobot/pusht


@alexander-soare alexander-soare self-requested a review May 24, 2024 11:10
@alexander-soare alexander-soare added the 🧠 Policies Something policies-related label May 24, 2024
@kashyapakshay kashyapakshay marked this pull request as ready for review May 25, 2024 08:01
@kashyapakshay kashyapakshay changed the title [wip] octo model poc Minimal Port of Octo Model May 25, 2024
Returns:
eps_pred: torch.Tensor, shape [batch_size, pred_horizon * action_dim]
"""
mean_readouts_embed = readout_embeds.mean(dim=(1, 2))
@kashyapakshay kashyapakshay (Contributor Author) May 25, 2024

I'm unsure if there are any pitfalls to doing this, if anyone has opinions. The original Octo uses the averages over readout embeddings of each obs timestep (and generates multiple trajectories). Another option here may be to just use the last readout embedding?
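
For illustration, a minimal sketch of the two options being weighed, assuming `readout_embeds` has shape (batch, n_obs_steps, n_readouts_per_step, embed_dim) (the shape and variable names here are assumptions, not necessarily the PR's actual code):

import torch

readout_embeds = torch.randn(8, 2, 1, 384)  # hypothetical (batch, n_obs_steps, n_readouts_per_step, embed_dim)

# Option in the PR: mean over both the obs-step and readout dims.
mean_readouts_embed = readout_embeds.mean(dim=(1, 2))  # (8, 384)

# Alternative raised above: condition only on the last obs step's readout embedding.
last_readout_embed = readout_embeds[:, -1, -1]  # (8, 384)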


class OctoModel(nn.Module):
def __init__(self, config: OctoConfig):
"""An overview of this minimal Octo model implementation:
@kashyapakshay kashyapakshay (Contributor Author) May 25, 2024

request to reviewers: review this documentation for accuracy and quality

@alexander-soare alexander-soare (Collaborator) left a comment

Thanks so much for this Akshay. I've got quite a few comments here, mostly related to the documentation and my understanding of it. I'm mostly confused about the structure of observations/actions. To avoid repeating myself, I've stopped reviewing in order to give you a chance to address what's already there.

Reviewed 7 of 11 files at r1, all commit messages.
Reviewable status: Some CI tests are either incomplete or have failed!, 7 of 11 files reviewed, 29 unresolved discussions (waiting on @kashyapakshay)


poetry.lock line 1655 at r2 (raw file):

    {file = "lxml-5.2.2-cp36-cp36m-musllinux_1_2_x86_64.whl", hash = "sha256:2304d3c93f2258ccf2cf7a6ba8c761d76ef84948d87bf9664e14d203da2cd264"},
    {file = "lxml-5.2.2-cp36-cp36m-win32.whl", hash = "sha256:02437fb7308386867c8b7b0e5bc4cd4b04548b1c5d089ffb8e7b31009b961dc3"},
    {file = "lxml-5.2.2-cp36-cp36m-win_amd64.whl", hash = "sha256:edcfa83e03370032a489430215c1e7783128808fd3e2e0a3225deee278585196"},

Reminder to self. Make sure this file is unchanged if we don't add transformers as a dependency.


pyproject.toml line 61 at r2 (raw file):

moviepy = ">=1.0.3"
rerun-sdk = ">=0.15.1"
transformers = "^4.41.1"

As I mentioned in another comment, I think we should port the scheduler instead of adding this heavy dependency (I'll reach out to you about this offline as I can chip in with this addition if you like).


lerobot/common/policies/octo/components.py line 1 at r2 (raw file):

from typing import Iterable

As I've noted before, sorry I haven't given you a guide on contributing policies. One thing we're trying to do is follow the single file policy in transformers. Here's a blog post explaining it if you're interested: https://huggingface.co/blog/transformers-design-philosophy

For this PR, all that means is moving all this into modeling_octo.py and prepending all class names with Octo.


lerobot/common/policies/octo/configuration_octo.py line 3 at r2 (raw file):

#!/usr/bin/env python

# Copyright 2024 Columbia Artificial Intelligence, Robotics Lab,

Suggestion:

Robotic AI & Learning Lab Berkeley

lerobot/common/policies/octo/modeling_octo.py line 90 at r2 (raw file):

        self._queues = None

        self.octo = OctoModel(config)

nit: DP uses self.diffusion but I want to change this to self.model to be consistent with ACT and TD-MPC. I know it's nitty but please humour me 🙏 while I figure it out.


lerobot/common/policies/octo/modeling_octo.py line 114 at r2 (raw file):

        """Select a single action given environment observations.

        This method handles caching a history of observations and an action trajectory generated by the

Is this an accurate explanation of what's going on here? Below it looks like there's one readout per observation step.


lerobot/common/policies/octo/modeling_octo.py line 180 at r2 (raw file):

            "observation tokens") and readout tokens as follows:
            - Normalized states and flattened feature maps from all observation steps are projected using FFNs into
                observation tokens and concattenated.

When you say "concatenated" (nit: spelling), do you mean along the sequence dimension to form the input tokens? Next point uses "appended".


lerobot/common/policies/octo/modeling_octo.py line 182 at r2 (raw file):

                observation tokens and concattenated.
            - Learned "readout tokens" are appended next to tokens of each observation step.
            - A block-wise causal mask (see make_mask for an example viz) that is used to prevent

nit: Could this be clearer (added "token-wise", removed "that" to keep consistent with the grammatical structure of other points, use correct fn name, don't abbreviate visualization)

Suggestion:

            - A combination of token-wise and block-wise causal masking (see make_blockwise_causal_mask for an example diagram) is used to prevent

lerobot/common/policies/octo/modeling_octo.py line 183 at r2 (raw file):

            - Learned "readout tokens" are appended next to tokens of each observation step.
            - A block-wise causal mask (see make_mask for an example viz) that is used to prevent
                a) observation and readout tokens from attending to any future tokens,

Might this be clearer?

Suggestion:

                a) observation and prior readout tokens from attending to any future readout tokens,

lerobot/common/policies/octo/modeling_octo.py line 184 at r2 (raw file):

            - A block-wise causal mask (see make_mask for an example viz) that is used to prevent
                a) observation and readout tokens from attending to any future tokens,
                b) observation tokens from attending to any readout tokens, and

Is this not already covered in (a)?


lerobot/common/policies/octo/modeling_octo.py line 185 at r2 (raw file):

                a) observation and readout tokens from attending to any future tokens,
                b) observation tokens from attending to any readout tokens, and
                c) readout tokens from attending to each other.

Do we not want readout tokens attending to relatively prior readout tokens? And if I understand it right, is this not already covered in (a)?


lerobot/common/policies/octo/modeling_octo.py line 186 at r2 (raw file):

                b) observation tokens from attending to any readout tokens, and
                c) readout tokens from attending to each other.
        2) DiffusionActionHead, which predicts the noise to remove from a noisy trajectory, conditioned on the mean of

note to self: I need to come back to this line once I get more context from the code


lerobot/common/policies/octo/modeling_octo.py line 189 at r2 (raw file):

            embeddings of the readout tokens from 1) and a projection of the denoising iteration K.

        Say we use n_obs=2 with a num_dof=7 state (joint angles, gripper, etc.) and a 96x96 wrist image.

nit: Since num_dof is not actually a variable name in the code can we just use English?

Suggestion:

        Say we use n_obs=2 with a 7 DoF state (joint angles, gripper, etc.) and a 96x96 wrist image.

lerobot/common/policies/octo/modeling_octo.py line 190 at r2 (raw file):

        Say we use n_obs=2 with a num_dof=7 state (joint angles, gripper, etc.) and a 96x96 wrist image.
        With token_dim=384, we have:

nit: If we are using variable names can we please make sure they match actual variables?

Suggestion:

With `embed_dim`=384, we have:

lerobot/common/policies/octo/modeling_octo.py line 192 at r2 (raw file):

        With token_dim=384, we have:
        - Feature maps of shape (2, 256, 6, 6) from resnet18 that are flattened to patches (2, 256, 36), projected
          and rearranged to (2, 36, 384).

This threw me off for a moment. How about just projected to (2, 384, 36)? The rearrangement part is a detail that doesn't add to the understanding of how it works, and if anyone cares, they'll know how to track it in the code. Ditto below.
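
To make the shape flow concrete, a hedged sketch under the docstring's numbers (the projection modules named here are illustrative, not the PR's actual layers):

import torch
from einops import rearrange
from torch import nn

n_obs_steps, embed_dim = 2, 384
feat_maps = torch.randn(n_obs_steps, 256, 6, 6)      # resnet18 feature maps, one per obs step
img_proj = nn.Conv2d(256, embed_dim, kernel_size=1)  # hypothetical 1x1 projection over channels

img_tokens = img_proj(feat_maps)                            # (2, 384, 6, 6) -- "projected"
img_tokens = rearrange(img_tokens, "t c h w -> t (h w) c")  # (2, 36, 384)  -- "rearranged"

state = torch.randn(n_obs_steps, 7)                  # 7-DoF state per obs step
state_proj = nn.Linear(7, embed_dim)                 # hypothetical FFN projection
state_tokens = state_proj(state).unsqueeze(1)        # (2, 1, 384)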


lerobot/common/policies/octo/modeling_octo.py line 194 at r2 (raw file):

          and rearranged to (2, 36, 384).
        - State observations of shape (2, 7) are projected and rearranged to (2, 1, 384).
        - Readout tokens of shape (2, 1, 384) that are learnt.

I'm a little confused about why there is a sequence of 2 readout tokens and I'll have to come back to this after reading the code.


lerobot/common/policies/octo/modeling_octo.py line 212 at r2 (raw file):

              <noisy_sample>, <K_proj> --> (Action Diffusion Head) --> <noise_pred>

        Note that this implementation does not (yet) include certain features from the original Octo implementation:

Thanks!


lerobot/common/policies/octo/modeling_octo.py line 229 at r2 (raw file):

        # TODO: generalize to multiple proprioceptive observations in the future.
        num_state_obs = 1
        n_obs_tokens = (feat_map_shape[1] * feat_map_shape[2]) + num_state_obs

To me, this looks like you are also only using one image observation step. If you were using two, you'd want 2 * (feat_map_shape[1] * feat_map_shape[2]) + num_state_obs.


lerobot/common/policies/octo/modeling_octo.py line 265 at r2 (raw file):

            self.num_inference_steps = config.num_inference_steps

    # ========= inference  ============

nit: (I know you probably copied this from DP; I still have a little bit of cleaning to do there.) Let's drop this comment please. Not everything underneath it is related to inference.


lerobot/common/policies/octo/modeling_octo.py line 386 at r2 (raw file):

class OctoRgbEncoder(nn.Module):

It would be helpful if you can confirm for this, and any other code, whether it was a straight copy or you modified something. Saves some review time. 🙏


lerobot/common/policies/octo/modeling_octo.py line 489 at r2 (raw file):

    """Make the block-wise causal mask for the OctoTransformer.
    The layout is expected to be a dictionary with the following keys:
    - n_obs_tokens: number of observation tokens in input sequence.

nit: pass these as params instead of a dict for more readable code


lerobot/common/policies/octo/modeling_octo.py line 506 at r2 (raw file):

    ----------------------------
              ^           ^
            readout    readout

Thanks, this is really helpful! Some nit suggestions: be explicit about what the axes mean, and use ticks and crosses for added clarity.

Suggestion:

    -----------------------------------------
    Token index  | 0 | 1 | 2 | 3 | 4 | 5 |
    -------------|---|---|---|---|---|---|
    0 attends to | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
    1 attends to | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
    2 attends to | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | < readout
    3 attends to | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ |
    4 attends to | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ |
    5 attends to | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | < readout
    -----------------------------------------
                           ^           ^
                           readout     readout

lerobot/common/policies/octo/modeling_octo.py line 527 at r2 (raw file):

class OctoTransformer(nn.Module):

My apologies. I need to make a guide for contributing policies. But one thing that we're going for is trying to pass configs around rather than lots of args. There are several advantages (see the sketch after this list):

  • avoid repetition in the function signature and self.param = param
  • avoid the possibility of renaming. If we just pass the config, it's really easy to trace everything back to the documentation in the config class
  • avoid having to repeat documentation!
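
A minimal sketch of the requested pattern (the config fields here are illustrative):

from dataclasses import dataclass
from torch import nn

@dataclass
class OctoConfig:
    embed_dim: int = 384
    n_layers: int = 12

class OctoTransformer(nn.Module):
    def __init__(self, config: OctoConfig):
        super().__init__()
        self.config = config  # single attribute; no per-arg repetition or renaming
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(config.embed_dim, nhead=6, batch_first=True)
            for _ in range(config.n_layers)
        )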

lerobot/configs/policy/octo.yaml line 5 at r2 (raw file):

# Defaults for training for the PushT dataset as per https://github.com/real-stanford/diffusion_policy.
# Note: We do not track EMA model weights as we discovered it does not improve the results. See
#       https://github.com/huggingface/lerobot/pull/134 for more details.

I think these 3 lines should be removed.

Code quote:

# Defaults for training for the PushT dataset as per https://github.com/real-stanford/diffusion_policy.
# Note: We do not track EMA model weights as we discovered it does not improve the results. See
#       https://github.com/huggingface/lerobot/pull/134 for more details.

lerobot/configs/policy/octo.yaml line 22 at r2 (raw file):

  action:
    min: [12.0, 25.0]
    max: [511.0, 511.0]

Can this be removed?

Code quote:

  # TODO(rcadene, alexander-soare): we override state and action stats to use the same as the pretrained model
  # from the original codebase, but we should remove these and train our own pretrained model
  observation.state:
    min: [13.456424, 32.938293]
    max: [496.14618, 510.9579]
  action:
    min: [12.0, 25.0]
    max: [511.0, 511.0]

lerobot/scripts/train.py line 44 at r2 (raw file):

def make_optimizer_and_scheduler(cfg, policy):
    print(policy.name)

Perhaps you meant to remove this?


lerobot/scripts/train.py line 90 at r2 (raw file):

            weight_decay=cfg.training.adamw_weight_decay,
        )
        # we use the transformers library because diffusers doesn't have the

Could we just port it? We're trying to avoid adding transformers as a dependency (and any other heavy dependencies for that matter).
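
For reference, a hedged sketch of what the port could look like (mirroring, to my reading, the cosine-with-warmup schedule in transformers; exact parity should still be verified against that implementation):

import math
from torch.optim.lr_scheduler import LambdaLR

def get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, num_cycles=0.5):
    # Linear warmup to the base LR, then cosine decay to zero.
    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            return current_step / max(1, num_warmup_steps)
        progress = (current_step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))
    return LambdaLR(optimizer, lr_lambda)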


tests/test_policies.py line 285 at r2 (raw file):

@pytest.mark.parametrize(

Can we also add a backwards compatibility test here? (FYI we find this test to be a bit janky but so far it's proven to be better than not having it at all)

@alexander-soare alexander-soare (Collaborator) left a comment

Reviewable status: Some CI tests are either incomplete or have failed!, 7 of 11 files reviewed, 30 unresolved discussions (waiting on @kashyapakshay)


lerobot/common/policies/octo/modeling_octo.py line 645 at r1 (raw file):

Previously, kashyapakshay (Akshay Kashyap) wrote…

I'm unsure if there are any pitfalls to doing this, if anyone has opinions. The original Octo uses the averages over readout embeddings of each obs timestep (and generates multiple trajectories). Another option here may be to just use the last readout embedding?

If the original is doing this too, then I think it's fine. You can add a comment explaining the potential for an experiment/enhancement.

My 2 cents and gut feeling is that using the last readout embedding should probably suffice.


lerobot/common/policies/octo/modeling_octo.py line 195 at r2 (raw file):

        - State observations of shape (2, 7) are projected and rearranged to (2, 1, 384).
        - Readout tokens of shape (2, 1, 384) that are learnt.
        These are appended to get the input sequence of shape (2, 38, 384) --> (76, 384).

Are these numbers right? Reading above I would think that the observation sequence has shape (2, 37, 384) so far, and when you interleave the readout tokens into that structure you get (2, 38, 384)... Oh I think I just realized what's happening. I'll leave my thought process here though. I think I got confused because too much was happening in one step with the rearrange.


lerobot/common/policies/octo/modeling_octo.py line 494 at r2 (raw file):

    Example:
    layout = {"n_obs_tokens": 4, "n_obs": 2, "n_readouts": 1}

I'm a bit confused about these params. n_obs_tokens is the total number, while n_readouts is per observation. Should we try to keep things consistent and be more explicit with naming (like n_obs_tokens_per_obs_step, n_readouts_per_obs_step, n_obs_steps)?
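
With those explicit names, a hedged sketch of the masking rules from the docstring (True = may attend; this mirrors the stated rules, not necessarily the PR's exact implementation):

import torch

def make_blockwise_causal_mask(n_obs_tokens_per_obs_step: int, n_readouts_per_obs_step: int, n_obs_steps: int):
    tokens_per_step = n_obs_tokens_per_obs_step + n_readouts_per_obs_step
    seq_len = n_obs_steps * tokens_per_step
    idx = torch.arange(seq_len)
    step = idx // tokens_per_step  # obs step each token belongs to
    is_readout = (idx % tokens_per_step) >= n_obs_tokens_per_obs_step
    # (a) causal across obs steps: attend only to tokens of steps <= own step.
    mask = step.unsqueeze(1) >= step.unsqueeze(0)
    # (b)+(c): no token attends to a readout token...
    mask &= ~is_readout.unsqueeze(0)
    # ...except each readout token attending to itself.
    mask |= torch.eye(seq_len, dtype=torch.bool)
    return mask

make_blockwise_causal_mask(2, 1, 2) reproduces the 6-token example diagram suggested earlier.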


lerobot/common/policies/octo/modeling_octo.py line 575 at r2 (raw file):

        blockwise_causal_mask = None
        if use_blockwise_causal_mask:

So would we ever want to not use this?


lerobot/common/policies/octo/modeling_octo.py line 632 at r2 (raw file):

        super().__init__()

        self.ff = FourierFeatures(time_dim)

nit: we try to avoid abbreviations where possible (I thought this was feed-forward at first glance of the forward method). ff -> fourier_feature_embedder or something like that
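
For context, a hedged sketch of what a Fourier-feature timestep embedder typically computes (the class and argument names here are assumptions, not the PR's code):

import math
import torch
from torch import nn

class OctoFourierFeatures(nn.Module):
    def __init__(self, time_dim: int, max_period: float = 10000.0):
        super().__init__()
        half = time_dim // 2
        # Log-spaced frequencies, as in standard sinusoidal position embeddings.
        self.register_buffer("freqs", torch.exp(-math.log(max_period) * torch.arange(half) / half))

    def forward(self, t: torch.Tensor) -> torch.Tensor:  # t: (batch,) denoising iteration K
        angles = t[:, None].float() * self.freqs[None, :]
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (batch, time_dim)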

@alexander-soare alexander-soare (Collaborator) left a comment

Reviewable status: Some CI tests are either incomplete or have failed!, 7 of 11 files reviewed, 30 unresolved discussions (waiting on @kashyapakshay)


lerobot/common/policies/octo/modeling_octo.py line 180 at r2 (raw file):

            "observation tokens") and readout tokens as follows:
            - Normalized states and flattened feature maps from all observation steps are projected using FFNs into
                observation tokens and concattenated.

Suggestion:

concatenated

@aliberts aliberts (Collaborator) left a comment

Thank you very much for this!

Just a reminder: we should add copyright headers on all the new files, with the appropriate mentions (I'll add a test for this in the CI at some point). I think this should be fine:

#!/usr/bin/env python

# Copyright 2024 Robotic AI, Learning Lab Berkeley,
# and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

There should also be an addition in the ## Acknowledgment section of the readme.

@alexander-soare alexander-soare (Collaborator) left a comment

Reviewed 4 of 7 files at r4, all commit messages.
Reviewable status: Some CI tests are either incomplete or have failed!, 8 of 11 files reviewed, 17 unresolved discussions (waiting on @kashyapakshay)

@kashyapakshay kashyapakshay (Contributor Author) left a comment

Reviewable status: Some CI tests are either incomplete or have failed!, 8 of 11 files reviewed, 12 unresolved discussions (waiting on @alexander-soare)


lerobot/common/policies/octo/configuration_octo.py line 3 at r2 (raw file):

#!/usr/bin/env python

# Copyright 2024 Columbia Artificial Intelligence, Robotics Lab,

Done.


lerobot/common/policies/octo/modeling_octo.py line 180 at r2 (raw file):

            "observation tokens") and readout tokens as follows:
            - Normalized states and flattened feature maps from all observation steps are projected using FFNs into
                observation tokens and concattenated.

Done.


lerobot/common/policies/octo/modeling_octo.py line 182 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

nit: Could this be clearer (added "token-wise", removed "that" to keep consistent with the grammatical structure of other points, use correct fn name, don't abbreviate visualization)

Removed usage of the phrase "blockwise"; it feels unnecessary given the explanation.


lerobot/common/policies/octo/modeling_octo.py line 185 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

Do we not want readout tokens attending to relatively prior readout tokens? And if I understand it right, is this not already covered in (a)?

a) mentions that no token will attend to any future tokens, but this clarifies that readout tokens can't attend to prior readout tokens of other timesteps either. It's not explicitly mentioned in the paper (they just say readout tokens are passive observers), but I see they're doing that masking in the code:

[screenshot: the readout-token masking code from the original Octo implementation]


lerobot/common/policies/octo/modeling_octo.py line 192 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

This threw me off for a moment. How about just projected to (2, 384, 42)? The rearrangement part is a detail that doesn't add to the understanding of how it works, and if anyone cares, they'll know how to track it in the code. Ditto below.

Done.


lerobot/common/policies/octo/modeling_octo.py line 194 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

I'm a little confused about why there is a sequence of 2 readout tokens and I'll have to come back to this after reading the code.

it's one readout token per obs timestep. tried to improve this whole documentation.


lerobot/common/policies/octo/modeling_octo.py line 195 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

Are these numbers right? Reading above I would think that the observation sequence has shape (2, 37, 384) so far, and when you interleave the readout tokens into that structure you get (2, 38, 384)... Oh I think I just realized what's happening. I'll leave my thought process here though. I think I got confused because too much as happening in one step with the rearrange.

Rewrote the whole documentation, please let me know if it's still confusing


lerobot/common/policies/octo/modeling_octo.py line 229 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

To me, this looks like you are also only using one image observation step. If you were using two, you'd want 2 * (feat_map_shape[1] * feat_map_shape[2]) + num_state_obs.

Clarified the variable name here, it's supposed to be the number of observation tokens in each obs timestep. That multiplication happens elsewhere when needed.


lerobot/common/policies/octo/modeling_octo.py line 386 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

It would be helpful if you can confirm for this, and any other code, whether it was a straight copy or you modified something. Saves some review time. 🙏

Done.


lerobot/common/policies/octo/modeling_octo.py line 494 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

I'm a bit confused about these params. n_obs_tokens is the total number, while n_readouts is per observation. Should we try to keep things consistent and be more explicit with naming (like n_obs_tokens_per_obs_step, n_readouts_per_obs_step, n_obs_steps.

Done.


lerobot/common/policies/octo/modeling_octo.py line 527 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

My apologies. I need to make a guide for contributing policies. But one thing that we're going for us trying to pass configs around rather than lots of args. There are several advantages:

  • avoid repetition in function signature and self.param = param
  • avoid the possibility of renaming. If we just pass the config its really easy to trace everything back to the documentation in the config class
  • avoid having to repeat documentation!

Done.


lerobot/common/policies/octo/modeling_octo.py line 575 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

So would we ever want to not use this?

I'm not sure. This was configurable in the original implementation so left it as is. But as I think about it, that was likely for their experimentation so I can remove this toggle and simplify.


lerobot/configs/policy/octo.yaml line 5 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

I think these 3 lines should be removed.

Done.


lerobot/scripts/train.py line 44 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

Perhaps you meant to remove this?

Done.


lerobot/common/policies/octo/components.py line 1 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

As I've noted before, sorry I haven't given you a guide on contributing policies. One thing we're trying to do is follow the single file policy in transformers. Here's a blog post explaining it if you're interested: https://huggingface.co/blog/transformers-design-philosophy

For this PR, all that means is moving all this into modeling_octo.py and prepending all class names with Octo.

Done.

@kashyapakshay kashyapakshay (Contributor Author)

lerobot/common/policies/octo/modeling_octo.py line 185 at r2 (raw file):

Previously, kashyapakshay (Akshay Kashyap) wrote…

a) mentions that no token will attend to any future tokens, but this clarifies that readout tokens can't attend to prior readout tokens of other timesteps either. It's not explicitly mentioned in the paper (they just say readout tokens are passive observers), but I see they're doing that masking in the code:

[screenshot: the readout-token masking code from the original Octo implementation]

  • of other observation steps

@kashyapakshay kashyapakshay (Contributor Author) left a comment

Added this copyright header. In the Acknowledgement, we just acknowledge the first authors for open sourcing their code, yeah?

Reviewable status: Some CI tests are either incomplete or have failed!, 7 of 11 files reviewed, 12 unresolved discussions (waiting on @alexander-soare)


lerobot/configs/policy/octo.yaml line 22 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

Can this be removed?

Removed these. Since these stats are computed by make_dataset, I think that's OK? We don't have anything from the original Octo codebase to override with.


tests/test_policies.py line 285 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

Can we also add a backwards compatibility test here? (FYI we find this test to be a bit janky but so far it's proven to be better than not having it at all)

Done. Just added to and ran save_policy_to_safetensors.py, not sure if there's anything more I need to do?

@alexander-soare alexander-soare (Collaborator) left a comment

Reviewed 2 of 3 files at r5, 1 of 2 files at r6, 7 of 7 files at r7, all commit messages.
Reviewable status: Some CI tests are either incomplete or have failed!, all files reviewed, 6 unresolved discussions (waiting on @kashyapakshay)


lerobot/common/policies/octo/configuration_octo.py line 64 at r7 (raw file):

        n_layers: Number of transformer encoder layers.
        n_heads: Number of heads in the transformer.
        d_ffn: Dimension of the feedforward network in the transformer.

nit: please use more verbose naming for clarity. Taken from ACT you could try dim_feedforward.


lerobot/common/policies/octo/configuration_octo.py line 143 at r7 (raw file):

    # Inference
    num_inference_steps: int | None = None

This defaults to 20 in octo.yaml but None here. Can you please make them match?


lerobot/common/policies/octo/configuration_octo.py line 161 at r7 (raw file):

            )
        image_key = next(iter(image_keys))
        if (

Can you please incorporate this PR (here and wherever else crop_shape=None needs to be handled)? #219


lerobot/common/policies/octo/modeling_octo.py line 195 at r2 (raw file):

Previously, kashyapakshay (Akshay Kashyap) wrote…

Rewrote the whole documentation, please let me know if it's still confusing

Thanks a lot! Admittedly, I should have done a deeper reading of the paper in advance. Doing so, plus your rewrite makes this much clearer now.


lerobot/common/policies/octo/modeling_octo.py line 191 at r7 (raw file):

            and a 96x96 wrist image (which would result in 256x6x6 feature maps using resnet18):

        `n_obs_tokens_per_step` = 6*6 + 1 + 1 = 38

At least semantically, I think it makes more sense to think of there being 37 observation tokens per step and 1 readout token, making for a total of 38 tokens per step. I think I've confirmed this by reading further into the code below. Right?
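
For concreteness, under that reading: 6*6 = 36 image tokens plus 1 state token gives 37 observation tokens per step; adding 1 readout token gives 38 tokens per step, and 2 * 38 = 76 input tokens in total.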


lerobot/common/policies/octo/modeling_octo.py line 477 at r7 (raw file):

    Example:
    --------------------------------------
    Obs Timestep | 0 | 0 | 0 | 1 | 1 | 1 |

Nice!


lerobot/common/policies/octo/modeling_octo.py line 512 at r7 (raw file):

    """Transformer Encoder for Octo, as described above.

    Args:

nit: Sorry, I think these should go under the init. Afaik the class docstring is where you'd document attributes and methods. Same everywhere else.


lerobot/configs/policy/octo.yaml line 22 at r2 (raw file):

Previously, kashyapakshay (Akshay Kashyap) wrote…

removed these. since these stats are computed by make_dataset, I think that's ok? we dont have anything from original Octo codebase to override with

Perfect


tests/test_policies.py line 285 at r2 (raw file):

Previously, kashyapakshay (Akshay Kashyap) wrote…

Done. Just added to and ran save_policy_to_safetensors.py, not sure if there's anything more I need to do?

That's it :)

@kashyapakshay kashyapakshay (Contributor Author) left a comment

Reviewable status: Some CI tests are either incomplete or have failed!, 12 of 16 files reviewed, 4 unresolved discussions (waiting on @alexander-soare)


lerobot/common/policies/octo/configuration_octo.py line 143 at r7 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

This defaults to 20 in octo.yaml but None here. Can you please make them match?

Done.


lerobot/common/policies/octo/configuration_octo.py line 161 at r7 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

Can you please incorporate this PR (here and wherever else crop_shape=None needs to be handled)? #219

Done.


lerobot/common/policies/octo/modeling_octo.py line 114 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

Is this an accurate explanation of what's going on here? Below it looks like there's one readout per observation step.

this should be right, we are in fact only generating one trajectory (using all the readout tokens, explanation in the OctoModel docstring).

[note to self: triple check this logic makes sense for Octo]


lerobot/common/policies/octo/modeling_octo.py line 184 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

Is this not already covered in (a)?

updated to say they can't attend to any future or prior readout tokens


lerobot/common/policies/octo/modeling_octo.py line 191 at r7 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

At least semantically, I think it makes more sense to think of there being 37 observation tokens per step and 1 readout token. Making for a total of 38 tokens per step. I think I've confirmed this by reading further into the code below. Right?

you're right, corrected this line and added explicit input seq len calculation


lerobot/scripts/train.py line 90 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

Could we just port it? We're trying to avoid adding transformers as a dependency (and any other heavy dependencies for that matter.

Yes, but I was thinking that can be a separate PR? Maybe we can merge that first, since it can be common to all models in train.py.


poetry.lock line 1655 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

Reminder to self. Make sure this file is unchanged if we don't add transformers as a dependency.

reverted this change

@kashyapakshay kashyapakshay (Contributor Author) left a comment

Opened a new PR for porting schedulers: #225

Reviewable status: Some CI tests are either incomplete or have failed!, 8 of 16 files reviewed, 4 unresolved discussions (waiting on @alexander-soare)

@alexander-soare alexander-soare (Collaborator) left a comment

Reviewed 4 of 4 files at r8, 4 of 4 files at r9, all commit messages.
Reviewable status: Some CI tests are either incomplete or have failed!, all files reviewed, 2 unresolved discussions (waiting on @kashyapakshay)


lerobot/common/policies/octo/configuration_octo.py line 161 at r7 (raw file):

Previously, kashyapakshay (Akshay Kashyap) wrote…

Done.

I'm not seeing the change I was alluding to. Sorry if I wasn't clear. It's not just a matter of merging main but actually looking at the diffs in the PR and applying them to Octo.


lerobot/common/policies/octo/modeling_octo.py line 114 at r2 (raw file):

Previously, kashyapakshay (Akshay Kashyap) wrote…

this should be right, we are in fact only generating one trajectory (using all the readout tokens, explanation in the OctoModel docstring).

[note to self: triple check this logic makes sense for Octo]

Yes, sorry this was clear to me in the last revision but I forgot to close this thread.


lerobot/common/policies/octo/modeling_octo.py line 187 at r9 (raw file):

            c) readout tokens from attending to prior readout tokens.
    2) OctoDiffusionActionHead, which predicts the noise to remove from a noisy trajectory, conditioned on the mean
        of all readout embeddings and a projection of the denoising iteration K.

nit: and this is super nitty but might as well since we have another blocking comment. "projection" threw me off here, but I think I understand what you mean. You seem to be referring to the update rule used on the NN's output (which differs depending on whether we are predicting epsilon, or the action directly). Also, I think the way this is written assumes that the user has selected epsilon as the prediction_type. How about:

Suggestion:

    2) OctoDiffusionActionHead, which acts as the diffusion model (it predicts either the
       noise or the denoised action sequence directly depending on `prediction_type`).
       It is conditioned on the prior denoising step's output, and the mean of all readout
       embeddings.
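
To make the prediction_type point concrete, a hedged sketch of how it changes the training target, following diffusers' DDPMScheduler conventions (the shapes here are hypothetical, not the PR's code):

import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=100, prediction_type="epsilon")  # or "sample"

clean_actions = torch.randn(8, 16, 2)  # hypothetical (batch, horizon, action_dim)
noise = torch.randn_like(clean_actions)
timesteps = torch.randint(0, 100, (8,))
noisy_actions = scheduler.add_noise(clean_actions, noise, timesteps)

# The head is trained to predict `target` from (noisy_actions, timesteps, mean readout embedding):
target = noise if scheduler.config.prediction_type == "epsilon" else clean_actions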

lerobot/scripts/train.py line 90 at r2 (raw file):

Previously, kashyapakshay (Akshay Kashyap) wrote…

Yes, but I was thinking that can be a separate PR? Maybe we can merge that first, since it can be common to all models in train.py.

Excellent!

@alexander-soare alexander-soare (Collaborator) left a comment

Reviewable status: Some CI tests are either incomplete or have failed!, all files reviewed, 2 unresolved discussions (waiting on @kashyapakshay)


lerobot/common/policies/octo/modeling_octo.py line 187 at r9 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

nit: and this is super nitty but might as well since we have another blocking comment. "projection" threw me off here, but I think I understand what you mean. You seem to be referring to the update rule used on the NN's output (which differs depending on whether we are predicting epsilon, or the action directly). Also, I think the way this is written assumes that the user has selected epsilon as the prediction_type. How about:

Sorry, "which acts as the diffusion model's denoising network"

@alexander-soare alexander-soare (Collaborator) left a comment

Reviewable status: Some CI tests are either incomplete or have failed!, all files reviewed, 3 unresolved discussions (waiting on @kashyapakshay)


lerobot/common/policies/octo/modeling_octo.py line 645 at r1 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

If the original is doing this too, then I think it's fine. You can add a comment explaining the potential for an experiment/enhancement.

My 2 cents and gut feeling is that using the last readout embedding should probably suffice.

Based on our discussion on Discord, switching this to blocking till we have clarity on how this is done in Octo exactly. We want to know if the mean pooling is done over the n_readouts_per_step dimension or the n_obs_steps dimension or both. Ultimately, we want to know how we reduce those two dimensions to get to a single output.

@Cadene Cadene self-requested a review May 30, 2024 08:31