[discussion] Capacity planning #708
Comments
I think it is an important feature, though it may not be implemented by the operator itself. Maybe it could be built on top of it, or in the pipeline.
I think it could be a bit of a chicken-and-egg problem in some cases, i.e. see #628 (comment).
The question is from AI engineers. They said they are not sure how many resources they should assign to the container. I think it should be the …
I have personally faced this issue of OOM and the cluster not having enough CPU, etc. Ideally, the solution would be to have … We can even have a …
Thanks for bringing up the issue. The overall issue has a much broader scope than tf-operator, but we can probably start with something minimal and see what we can do in tf-operator first. I believe this is what Borg does for resource estimation: gradually increase allocated resources and restart killed apps. For us, a big prerequisite is that the ML apps have checkpoint support, unless it's explicitly a dry_run.
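For illustration, here is a minimal sketch (not tf-operator code) of that Borg-style loop using the kubernetes Python client: run the pod, and if it is OOMKilled, recreate it with the next larger memory size. The image, namespace, and memory steps are hypothetical, and it assumes the job checkpoints so a restart resumes rather than recomputes.

```python
# Hypothetical sketch: escalate memory on OOMKill, assuming the job checkpoints.
import time
from kubernetes import client, config

MEMORY_STEPS = ["2Gi", "4Gi", "8Gi"]  # assumed escalation ladder

def run_with_escalating_memory(name="train-job", namespace="default"):
    config.load_kube_config()
    core = client.CoreV1Api()
    for mem in MEMORY_STEPS:
        pod = client.V1Pod(
            metadata=client.V1ObjectMeta(name=name),
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="trainer",
                    image="tensorflow/tensorflow:1.12.0",  # placeholder image
                    resources=client.V1ResourceRequirements(
                        requests={"memory": mem}, limits={"memory": mem}),
                )]))
        core.create_namespaced_pod(namespace, pod)
        reason = None
        while reason is None:  # wait for the training container to terminate
            status = core.read_namespaced_pod(name, namespace).status
            cs = (status.container_statuses or [None])[0]
            if cs and cs.state.terminated:
                reason = cs.state.terminated.reason
            else:
                time.sleep(10)
        core.delete_namespaced_pod(name, namespace)
        time.sleep(30)  # crude wait for deletion; a real version would watch
        if reason != "OOMKilled":
            return reason  # e.g. "Completed", or a non-OOM failure to surface
    raise RuntimeError("still OOMKilled at the largest memory step")
```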
There are a lot of moving parts, such as whether it's a dedicated cluster for Kubeflow or a shared cluster, whether to use priority/preemption, whether to leverage QoS support in Kubernetes, etc. Solving the issue would take a long time, but for tf-operator I would go for …
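On the priority/preemption point, a rough sketch of what that could look like on a shared cluster, using plain Kubernetes rather than anything tf-operator-specific; the class name, value, and image are made up for illustration and assume a cluster that serves scheduling.k8s.io/v1.

```python
# Illustrative only: a PriorityClass for training jobs on a shared cluster,
# plus a pod spec that references it so the scheduler may preempt
# lower-priority pods for it. Names and the value are hypothetical.
from kubernetes import client, config

config.load_kube_config()

client.SchedulingV1Api().create_priority_class(
    client.V1PriorityClass(
        metadata=client.V1ObjectMeta(name="training-high"),
        value=100000,
        global_default=False,
        description="Hypothetical class for important training jobs",
    ))

pod_spec = client.V1PodSpec(
    priority_class_name="training-high",
    containers=[client.V1Container(
        name="trainer",
        image="tensorflow/tensorflow:1.12.0",  # placeholder image
    )],
)
```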
Does vertical pod autoscaling give us what we need? Can't we use vertical pod autoscaling to automatically bump up the pod requests if it OOMs?
It could solve the problem partially. Initial resources + VPA equals … BTW, VPA will restart the pod if it decides the pod needs to be autoscaled, which is not what the user expects.
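To make the VPA option concrete, here is a rough sketch of a VerticalPodAutoscaler object created through the Python client, assuming the VPA CRDs are installed in the cluster; the target name and namespace are hypothetical. In "Off" mode it only records recommendations, while "Auto" lets it evict and recreate pods with new requests, which is the restart behavior mentioned above.

```python
# Rough sketch, assuming the VPA custom resource definitions are installed.
from kubernetes import client, config

config.load_kube_config()

vpa = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "trainer-vpa"},
    "spec": {
        "targetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "trainer"},
        "updatePolicy": {"updateMode": "Off"},  # recommendations only, no restarts
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="autoscaling.k8s.io",
    version="v1",
    namespace="default",
    plural="verticalpodautoscalers",
    body=vpa,
)
```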
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
When applying Kubeflow in our company, AI engineers ran into some problems with capacity. I am glad to discuss it with the community to find a better way to avoid them.
They complain that they always encounter OOM problems, since they may not know how to limit the resources used by the job. They want us to tell them how to set the resource limits.
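For reference, these are the knobs in question: per-container resource requests and limits, shown here with the kubernetes Python client. The numbers are placeholders, not recommendations.

```python
# Placeholder values, for illustration of where requests/limits are set.
from kubernetes import client

trainer = client.V1Container(
    name="tensorflow",
    image="tensorflow/tensorflow:1.12.0",  # placeholder image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "2", "memory": "4Gi"},  # what the scheduler reserves
        limits={"cpu": "4", "memory": "8Gi"},    # exceeding the memory limit => OOMKill
    ),
)
```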
Then I investigated how to achieve it; there are some ways I know of:
I'd appreciate it if you could give me some advice.
/cc @ddysher @jlewi @ddutta @YujiOshima @cheyang @jian-he @bhack @ankushagarwal