Add roadmap. #1317
Changes from 1 commit
@@ -0,0 +1,52 @@
# SkyPilot Roadmap

## v0.3

### Managed Spot
- Minimize the cost of the controller
- Support spot controller on existing cluster (or local cluster)
- Reducing the fixed cost of the controller (e.g., allow setting controller VM type)
- Increasing parallelism (number of concurrent jobs)
- Pushing the scale (e.g., support a high number of pending/concurrent jobs)

Review comment: This reads similar to
Reply: +1. IIUC, this should refer to scaling up the controller itself. Probably, we can change it to

- Framework-specific guides to add checkpointing/reloading using SkyPilot Storage

### Smarter Optimizer
- Fine-grained optimizer: pick by cheapest zone order
- Better consider data egress time/cost
- Consider buckets/Storage objects in file_mounts
- Optimizing the data placement for SkyPilot Storage local uploads
- Use the optimizer to decide the bucket location

### Programmatic API
- Refactor/extend the current API to *make it easy to programmatically use SkyPilot*
- Expose core classes in docs

### More clouds

Review comment: nit:

- Refactoring of interfaces to ease adding new clouds
- IBM Cloud

Review comment: Add Lambda Labs/RunPod/Jarvis Labs?

### On-prem
- Design for switching between cloud and on-prem

Review comment: Do we want to add an item before this:

- Explore/design of "local mode" to run SkyPilot tasks locally

### Faster launching speed
- Consider a more minimal image
- Azure speed investigation

### k8s support
- Ray-on-k8s backend
- To figure out: Launch a new k8s cluster? Launch SkyPilot Tasks to an existing k8s cluster?

### Cost: Optimization, Tracking, and Reporting
- Track and show costs related to a job/cluster
- For managed spot jobs, track and show %savings vs. on-demand
- Optimizer: take into account disk costs

### Serverless
- Design and prototype of a "serverless jobs" submission API and CLI
- Initial use case: hundreds of hyperparameter tuning trials

### Backend
- Support heterogeneous node types in a cluster (e.g., in RL, CPU actor(s) and GPU learner(s) in the same cluster)
- Support CPUs as resource requirements
- General robustness/UX improvements

Review comment: I would like to call this `v1.0.0-dev0` or `v1.0.0` to align with the current master version. Reason: I believe `v0.3` will not include all the features listed here, and we should not promise features we are not going to have in `v0.3`. This doc is a longer-term plan than `v0.3`.