Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal(ir): state based implementation #972

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

kemingy
Copy link
Member

@kemingy kemingy commented Oct 7, 2022

Signed-off-by: Keming <kemingyang@tensorchord.ai>
@muniu-bot
Copy link

muniu-bot bot commented Oct 7, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kemingy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@muniu-bot muniu-bot bot added the approved label Oct 7, 2022
@VoVAllen
Copy link
Member

VoVAllen commented Oct 7, 2022

How to merge different different stages like llb.Merge?

Also I think bramble's syntax can be an option

@kemingy
Copy link
Member Author

kemingy commented Oct 7, 2022

How to merge different different stages like llb.Merge?

Will add Merge, Diff, File later.

Also I think bramble's syntax can be an option

Will take a look.

@VoVAllen
Copy link
Member

VoVAllen commented Oct 9, 2022

I think we should have an idea about what the llb graph will look like after more dependency is set by the user (such as gcc and pypi packages), that can also utilize caches as much as possible.

Simple things should be simple, complex things should be possible.

@gaocegege
Copy link
Member

I think bramble's example there looks complex.

https://github.com/maxmcd/bramble/blob/eea4aee51e6ad881166412d61190012fb0d97c56/internal/project/testdata/project/default.bramble

It declares the input arguments explicitly. Personally, I prefer the func chain.

@kemingy
Copy link
Member Author

kemingy commented Oct 10, 2022

I think we should have an idea about what the llb graph will look like after more dependency is set by the user (such as gcc and pypi packages), that can also utilize caches as much as possible.

Simple things should be simple, complex things should be possible.

Agree. What we have now should be simple. Others like parallelism should be possible.

Method chaining should be enough for sequence commands. Each function should return a state. (ExecState should be hidden by introducing more parameters)

Signed-off-by: Keming <kemingyang@tensorchord.ai>
@kemingy kemingy marked this pull request as ready for review October 10, 2022 10:17
@kemingy kemingy requested a review from VoVAllen as a code owner October 10, 2022 10:17
@gaocegege
Copy link
Member

Some questions:

  • Is it possible to auto-merge the two chains? Merge/diff should be advanced statements.
  • How to integrate it with the envdlib?

Signed-off-by: Keming <kemingyang@tensorchord.ai>
@kemingy
Copy link
Member Author

kemingy commented Oct 11, 2022

* Is it possible to auto-merge the two chains? Merge/diff should be advanced statements.

I think it's hard. (correct me if I'm wrong)

Starlark doesn't support operator overloading like conda_state + vscode_state. But I think we can introduce a new method like conda_state.merge([vscode_state]) if it's helpful.

Besides, we need explicitly use root.state() to get a copy if another branch is not built from scratch. Otherwise, we don't know when to diverge.

* How to integrate it with the envdlib?

Some ideas:

  1. envdlib can provide functions built from scratch or from a user-provided state.
    • from scratch: root.merge([envdlib.compile_rust_serving()])
    • from a state: root = envdlib.tensorboard(conda_state) or root.apply(envdlib.tensorboard, host_port=9000) so users can continue chaining
  2. We should provide Source.envd_python() which is equivalent to base(os="ubuntu20.04", language="python3").

@gaocegege
Copy link
Member

Then could the new language syntax be compatible with the existing design?

Or is it a total breaking change?

BTW, could you please provide the example for python-basic with the new design?

Signed-off-by: Keming <kemingyang@tensorchord.ai>
Signed-off-by: Keming <kemingyang@tensorchord.ai>
@kemingy
Copy link
Member Author

kemingy commented Oct 12, 2022

Then could the new language syntax be compatible with the existing design?

Or is it a total breaking change?

I think it will be a breaking change.

BTW, could you please provide the example for python-basic with the new design?

Already been added to the proposal. PTAL.

@gaocegege
Copy link
Member

@VoVAllen WDYT

I have no opinion on it, let's start researching if starlark supports it.

@VoVAllen
Copy link
Member

VoVAllen commented Oct 12, 2022

I'm a bit concerned about the current proposal. The current design is detail-oriented, which is more complicated than original design.
Also current design looks similar to llb, we can also consider expose llb-like primitives directly. Explicit dependency declaration is an advanced function, llb primitives would be easier for us to maintain and ensures that "complex thing is possible"

Some personal thoughts:
Explicitly define two/three stages. base stages, envd-managed stages(install.python_packages etc.), user-managed stages(run(XXX))

The difference between them is:

  • base stages can be overwritten by custom images, and managed by envd if not specified
  • envd-managed stages will parallelize and use cache as much as possible to accelerate the process, thus no dependency can be set here.
  • user-managed stages can be fully customized, with explicit dependency.

Other ideas:

All functions provided by envd can add a new argument, such as called state.

In user stages user can do:

state = stage('user')
state1 = install.apt_packages(["g++"], state=state)
new_state = install.python_packages(["package_needs_g++"], state=state1)

and to define it as a custom function:

def install_inhouse_package():
  state = stage('user')
  state1 = install.apt_packages(["g++"], state=state)
  new_state = install.python_packages(["package_needs_g++"], state=state1)
  # envd_output is an builtin variable, add means merge state with the final output
  envd_output.add(new_state)
  return new_state

To use

def build():
   install.python_packages(['torch'])
   install_inhouse_packages()

WDYT

@kemingy
Copy link
Member Author

kemingy commented Oct 12, 2022

I'm a bit concerned about the current proposal. The current design is detail-oriented, which is more complicated than original design. Also current design looks similar to llb, we can also consider expose llb-like primitives directly. Explicit dependency declaration is an advanced function, llb primitives would be easier for us to maintain and ensures that "complex thing is possible"

Some personal thoughts: Explicitly define two/three stages. base stages, envd-managed stages(install.python_packages etc.), user-managed stages(run(XXX))

The difference between them is:

* base stages can be overwritten by custom images, and managed by envd if not specified

* envd-managed stages will parallelize and use cache as much as possible to accelerate the process, thus no dependency can be set here.

* user-managed stages can be fully customized, with explicit dependency.

This is similar to the current implementation and this proposal. We do have different stages, it's just not explicit.

We can provide the install.conda_python() function. So users who start with the custom images can use it to install the python environment.

Other ideas:

All functions provided by envd can add a new argument, such as called state.

In user stages user can do:

state = stage('user')
state1 = install.apt_packages(["g++"], state=state)
new_state = install.python_packages(["package_needs_g++"], state=state1)

and to define it as a custom function:

def install_inhouse_package():
  state = stage('user')
  state1 = install.apt_packages(["g++"], state=state)
  new_state = install.python_packages(["package_needs_g++"], state=state1)
  # envd_output is an builtin variable, add means merge state with the final output
  envd_output.add(new_state)
  return new_state

To use

def build():
   install.python_packages(['torch'])
   install_inhouse_packages()

WDYT

Defining the dependencies with an extra state argument is acceptable but not very user-friendly.

The LLB-like syntax is only complex when you need to use diff and merge. Otherwise, the method chaining should be a simple solution.

Signed-off-by: Keming <kemingyang@tensorchord.ai>
@kemingy
Copy link
Member Author

kemingy commented Oct 13, 2022

One more thing, this is incompatible with config.envd.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants