Design document of RemoteExecutor. #7720
Conversation
Thanks, this is a very good starting point! I put some comments as suggestions; I would love to know what you think about them.
> We propose some details to implement `RemoteExecutor`:
>
> - Each `Partitioned IR(intermediate representation)` has unique `IRID`.
> - Each `Partitioned IR` with its relationship with others and its resource need are stored into etcd.
What is "others"? Maybe replace others with what others really are here.
> - Each `Partitioned IR(intermediate representation)` has unique `IRID`.
> - Each `Partitioned IR` with its relationship with others and its resource need are stored into etcd.
> - Each `PaddlePaddle Runtime` runs `Partitioned IR` got from etcd by `IRID`.
This makes our runtime depend on etcd. I think the dependency direction is wrong. If we really want to use etcd as a communication method, our runtime should define a communication interface, and we should implement an etcd adapter class that implements that interface. That way the framework does not depend on etcd; only the adapter does. This is called the Dependency Inversion Principle (in my opinion, a major benefit of using an OOP programming language).
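To illustrate the direction I mean, here is a minimal sketch. All class and method names below are hypothetical, not an existing Paddle API: the runtime is written only against an abstract communication interface, and only the adapter depends on etcd, so an RPC adapter could be swapped in without touching the runtime.

```python
# Hedged sketch: hypothetical interface names, not an existing Paddle API.
from abc import ABC, abstractmethod


class Channel(ABC):
    """Communication interface the runtime depends on."""

    @abstractmethod
    def send(self, key: str, value: bytes) -> None: ...

    @abstractmethod
    def recv(self, key: str) -> bytes: ...


class EtcdChannel(Channel):
    """Adapter: only this class knows about etcd."""

    def __init__(self, etcd_client):
        self._etcd = etcd_client

    def send(self, key: str, value: bytes) -> None:
        self._etcd.put(key, value)

    def recv(self, key: str) -> bytes:
        value, _ = self._etcd.get(key)
        return value


class RpcChannel(Channel):
    """An RPC-based adapter can be swapped in without changing the runtime."""

    def __init__(self, rpc_client):
        self._rpc = rpc_client

    def send(self, key: str, value: bytes) -> None:
        self._rpc.call("send", key, value)

    def recv(self, key: str) -> bytes:
        return self._rpc.call("recv", key)


def run_partitioned_ir(ir_id: str, channel: Channel) -> None:
    # The runtime is written against Channel only; etcd never appears here.
    ir_bytes = channel.recv(f"/ir/{ir_id}")
    ...
```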
Anyway, is etcd necessary? Why is communicating through etcd better than communicating through RPC? As described in the PR, it's so that "the executed Partitioned IR can communicate with each other by IRID even if some of them restart." However, isn't fault tolerance unrelated to the communication method? We can still communicate using RPC; when we want to support fault tolerance, the runtime can store its state in disk/etcd.
> Anyway, is etcd necessary? Why is communicating through etcd better than communicating through RPC?
The partitioned IRs communicate with each other via RPC. When the Paddle runtime starts, it gets the IR from etcd (storage) and caches it locally. The local cache should be updated when a send/get hits an error (the remote may no longer exist).
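To make that cache-refresh behaviour concrete, here is a hedged sketch. The injected callables, key layout, and the use of `ConnectionError` as the failure signal are all assumptions, not part of the design:

```python
# Hedged sketch of the cache-refresh behaviour described above.
class IrCache:
    """Caches partitioned IR metadata fetched from storage (e.g. etcd)."""

    def __init__(self, fetch_from_storage, rpc_send):
        # Both callables are injected so the sketch stays storage/RPC agnostic.
        self._fetch = fetch_from_storage   # ir_id -> IR metadata (incl. endpoint)
        self._send = rpc_send              # (endpoint, message), raises on failure
        self._cache = {}

    def get(self, ir_id):
        # Use the local copy if we have one; otherwise load it from storage.
        if ir_id not in self._cache:
            self._cache[ir_id] = self._fetch(ir_id)
        return self._cache[ir_id]

    def send(self, ir_id, message):
        try:
            self._send(self.get(ir_id)["endpoint"], message)
        except ConnectionError:
            # The remote may have restarted; refresh the cached entry and retry once.
            self._cache.pop(ir_id, None)
            self._send(self.get(ir_id)["endpoint"], message)
```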
> ## Architect graph
> <div style="align: center">
> <img src="src/remote_executor2.png" width="700" align=center/>
I think it makes more sense for the transpiler to run in the cloud, because the transpiler should know the resource information (e.g., how many trainers are running; with autoscaling, the number of running trainers changes).
I think the controller should tell the remote executor how many trainers are currently running, not the other way around. In this way, we can run the remote executor on bare metal, where there is no controller.
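For example, a hypothetical sketch of that direction of information flow (names and arguments are assumptions): the controller pushes the trainer count into the executor, and on bare metal the user simply sets it.

```python
# Hedged sketch; class and method names are hypothetical, not an existing Paddle API.
class RemoteExecutor:
    def __init__(self, trainer_count=1):
        # On bare metal there is no controller, so the caller sets the count directly.
        self.trainer_count = trainer_count

    def set_trainer_count(self, n):
        # A k8s controller (or autoscaler) calls this when the number of trainers
        # changes; the executor never queries the controller itself.
        self.trainer_count = n


# Bare metal: the user provides the count.
executor = RemoteExecutor(trainer_count=4)

# Kubernetes: the controller pushes updates into the executor, e.g.
# controller.on_scale(lambda n: executor.set_trainer_count(n))
```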
Suggestion: draw one graph that specifies how this system runs on a bare-metal cluster (e.g., no k8s controller, no etcd), and another graph specifying how we integrate it with a k8s cloud. In the current graph, the remote executor cannot run without etcd or the controller.
I have one question: where can the RemoteExecutor save its state (the IR)? On a distributed storage system?
It looks to me like there is one global gatekeeper (RemoteExecutor) orchestrating all the roles and traffic in and out of the cluster. Shouldn't this kind of remote executor be one per job?
> - Foreground Job: when the client exits, the jobs will be killed.
>   - It's convenient for the users to debug their program.
>   - It needs a `HeartBeat` to `RemoteExecutor` to report that the client is living. Otherwise, the `RemoteExecutor` will kill the job.
> - Background Job: the client's death doesn't affect the job.
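For illustration, a minimal sketch of the `HeartBeat` mechanism described above; the function names, interval, and timeout are assumptions, not part of the design:

```python
# Hedged sketch of the client-liveness heartbeat described above.
import threading
import time


def start_heartbeat(report, interval=5.0):
    """Client side: report liveness to the RemoteExecutor every `interval` seconds."""
    def loop():
        while True:
            report()            # e.g. an RPC such as remote_executor.heartbeat(job_id)
            time.sleep(interval)
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t


def should_kill_job(last_heartbeat_time, timeout=30.0):
    """RemoteExecutor side: kill a foreground job whose client stopped reporting."""
    return time.time() - last_heartbeat_time > timeout
```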
I think we just need the background job; all jobs should be killed explicitly by the Paddle cloud management Python API.
> There is no fundamental difference between the `Trainer` and the `Parameter server`; they are all `PaddleRunTime`. They do different tasks just because they execute different `ProgramDesc`s.
> Although we keep the `Trainer` and `Pserver` concepts in the pseudo code below, that is just for users to distinguish among different `ProgramDesc`s. They are just names for `ProgramDesc`s.
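A tiny illustrative sketch of that point (names are hypothetical): the same runtime entry point acts as a trainer or a parameter server depending only on the `ProgramDesc` it receives.

```python
# Hedged sketch: one runtime for all roles; names are hypothetical.
def paddle_runtime(program_desc, scope):
    """Behaviour differs only by the ProgramDesc this runtime is given."""
    for op in program_desc.ops:
        op.run(scope)


# "Trainer" and "Pserver" are just names for different ProgramDescs, e.g.:
#   paddle_runtime(trainer_program_desc, scope)   # acts as a trainer
#   paddle_runtime(pserver_program_desc, scope)   # acts as a parameter server
```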
> ## Pseudo code of users
I think we should aim for the user only having to change the executor to a remote executor in the Python code, and nothing else. This pseudo code requires the user to change too many lines.
I know that currently we need the user to change many lines, but we should aim for changing only a few.
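For illustration, a hedged sketch of that goal; `RemoteExecutor`, its arguments, and the elided model/feeder names are assumptions, not a finalized API. The training script stays the same except for the line that creates the executor.

```python
# Hedged sketch: only the executor line changes; RemoteExecutor and its
# arguments are assumptions, not a finalized API.
import paddle.fluid as fluid

# ... model definition producing avg_cost, a DataFeeder `feeder`, and a
#     train_reader, exactly as in a normal local script (unchanged) ...

# Local execution today:
#   exe = fluid.Executor(fluid.CPUPlace())
# Remote execution -- ideally the only change the user makes:
exe = fluid.RemoteExecutor(job_name="mnist", trainers=4, pservers=2)

for data in train_reader():
    exe.run(fluid.default_main_program(),
            feed=feeder.feed(data),
            fetch_list=[avg_cost])
```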
> ## Storage
I think we are mixing fault tolerance with communication here (as mentioned in #7720 (comment)). In my opinion we should separate them, because the two are very different and have different libraries/tools that fit each best.
Another point: if we store all the IR communication in etcd, who will delete the IRs that are no longer necessary? In my opinion, we should use the best library for communication and not persist the messages, and use the best library for fault tolerance, overwriting only the necessary state.
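For illustration, a hedged sketch of that separation (all names are assumptions): tensors travel over RPC and are never persisted, while fault tolerance is a periodic checkpoint that overwrites a single key per job, so nothing accumulates that would need garbage collection.

```python
# Hedged sketch of separating communication from fault tolerance; names are assumptions.
class Runtime:
    def __init__(self, rpc_send, checkpoint_store, job_id):
        self._send = rpc_send            # communication: RPC only, messages never persisted
        self._store = checkpoint_store   # fault tolerance: e.g. etcd or a filesystem
        self._job_id = job_id

    def send_grad(self, peer, tensor):
        # Hot path: plain RPC, nothing written to etcd.
        self._send(peer, tensor)

    def checkpoint(self, step, state):
        # Cold path: overwrite a single key per job, so there is nothing to garbage-collect.
        self._store.put(f"/checkpoint/{self._job_id}", {"step": step, "state": state})

    def recover(self):
        return self._store.get(f"/checkpoint/{self._job_id}")
```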