Add roadmap. #1317
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
# SkyPilot Roadmap | ||
|
||
This doc lists general directions of interest to facilitate community contributions. | ||
|
||
Note that | ||
- This list is not meant to be comprehensive (i.e., new work items of interest may pop up) | ||
- Even though listed under a specific version, not all items need to be completed before we ship that version (i.e., some items can go into future versions) | ||
|
||
## v0.3 | ||
|
||
### Managed Spot | ||
- Minimize the cost of the controller | ||
- Support running spot controller on an existing/local cluster | ||
- Reducing the fixed cost of the controller (e.g., allow setting controller VM type) | ||
- Supporting a higher number of pending/concurrent jobs | ||
- Framework-specific guides to add checkpointing/reloading using SkyPilot Storage | ||
|
||
### Smarter Optimizer | ||
- Fine-grained optimizer: pick by cheapest zone order | ||
- Better consider data egress time/cost | ||
- Consider buckets/Storage objects in file_mounts | ||
- Optimizing the data placement for SkyPilot Storage local uploads | ||
- Use the optimizer to decide the bucket location | ||
|
||
### Programmatic API | ||
- Refactor/extend the current API to *make it easy to programmatically use SkyPilot* | ||
- Expose core classes in docs | ||
|
||
### Support more clouds | ||
- Refactoring of interfaces to ease adding new clouds | ||
- IBM Cloud | ||
Review comment: Add lambda labs/runpod/jarvis labs? |
||
- Explore support for low-cost clouds (e.g., lambda labs/runpod/jarvis labs) | ||
|
||
### On-prem | ||
- Robustify the on-prem feature | ||
- Design for switching between cloud and on-prem | ||
Review comment: Do we want to add an item before this: |
||
- Explore/design of "local mode" to run SkyPilot tasks locally | ||
|
||
### Faster launching speed | ||
- Consider a more minimal image | ||
- Azure speed investigation | ||
|
||
### k8s support | ||
- Ray-on-k8s backend | ||
- To figure out: Launch a new k8s cluster? Launch SkyPilot Tasks to an existing k8s cluster? | ||
|
||
### Cost: Optimization, Tracking, and Reporting | ||
- Track and show costs related to a job/cluster | ||
- For managed spot jobs, track and show %savings vs. on-demand | ||
- Optimizer: take into account disk costs | ||
|
||
### Serverless | ||
- Design and prototype of a "serverless jobs" submission API and CLI | ||
- Initial use case: hundreds of hyperparameter tuning trials | ||
|
||
### Backend | ||
- Support heterogeneous node types in a cluster (e.g., in RL, CPU actor(s) and GPU learner(s) in the same cluster) | ||
- Support CPUs as resource requirements | ||
- General robustness/UX improvements |
Review comment: I would like to call this `v1.0.0-dev0` or `v1.0.0` to align with the current master version. Reason: I believe `v0.3` will not include all the features listed here, and we should not promise features that we are not going to have in `v0.3`. This doc is a longer-term plan than `v0.3`.
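For the roadmap item "For managed spot jobs, track and show %savings vs. on-demand", the arithmetic could be sketched roughly as below. This is only an illustration, not SkyPilot code: the function name `spot_savings_percent`, the segment representation, and the prices are all hypothetical. It assumes a job's runtime is recorded as segments of (hours, spot hourly price), since a spot job may be preempted and relaunched at a different price.

```python
from typing import List, Tuple


def spot_savings_percent(
    segments: List[Tuple[float, float]],
    on_demand_hourly: float,
) -> float:
    """Return the percentage saved by running on spot instead of on-demand.

    `segments` is a hypothetical record of (hours, spot_hourly_price) pairs,
    one per continuous run of the job. `on_demand_hourly` is the on-demand
    price of the same instance type.
    """
    total_hours = sum(hours for hours, _ in segments)
    if total_hours <= 0 or on_demand_hourly <= 0:
        raise ValueError("total hours and on-demand price must be positive")

    # Cost actually paid on spot, summed across all run segments.
    spot_cost = sum(hours * price for hours, price in segments)
    # Cost the same number of hours would have incurred on-demand.
    on_demand_cost = total_hours * on_demand_hourly

    return 100.0 * (on_demand_cost - spot_cost) / on_demand_cost
```

A negative result would mean that spot cost more than on-demand for that job. This sketch ignores controller, disk, and egress costs, which the roadmap lists as separate items.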