
[discussion] Capacity planning #708

Closed
gaocegege opened this issue Jul 1, 2018 · 8 comments

Comments

@gaocegege
Member

When applying Kubeflow in our company, AI engineers run into capacity problems. I would like to discuss this with the community to find a better way to avoid them.

They complain that they frequently hit OOM errors because they do not know how to limit the resources used by a job, and they want us to tell them how to set the resource limit.
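
For context, this is roughly what they are being asked to write by hand for each container today. A minimal sketch using the Kubernetes Python client; the image name and the numbers are placeholders, not a recommendation:

```python
# Minimal sketch: the requests/limits an engineer currently has to pick by hand
# for each replica container in a TFJob pod template. Values are placeholders.
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={"cpu": "2", "memory": "4Gi"},  # what the scheduler reserves
    limits={"cpu": "4", "memory": "8Gi"},    # exceeding the memory limit => OOMKilled
)

container = client.V1Container(
    name="tensorflow",
    image="my-registry/train:latest",        # hypothetical training image
    resources=resources,
)
```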

I then investigated how to achieve this; here are the approaches I know of:

  • Compute the resource limit statically
    • If we have the model and know which layers will be placed on which device, we can calculate how much resource the container needs.
    • Pros: Easy to understand
    • Cons: Hard to implement; it requires information about the graph and device placement.
  • Profiling
    • Profile the job before actually running it.
    • Pros: Easy to implement
    • Cons: Needs pipeline support (run the profiling first, then generate the resource limit).
  • Closed-loop (feedback) control
    • Keep historical records of similar jobs and use them to set and adjust the resource limit of a new job (see the sketch after this list).
    • Pros: No overhead
    • Cons: Maybe not very accurate.
  • Closed-loop (feedback) control with profiling or static computation
    • Apply profiling or static computation for the first run; once similar historical jobs exist, use closed-loop (feedback) control to predict the resources a new job will need.
  • Katib
    • Treat the resource limit as a hyperparameter and tune it with Katib.
    • Pros: Reuses Katib
    • Cons: High overhead
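
To make the closed-loop option concrete, here is a minimal sketch. It assumes we keep peak-usage records of similar jobs (e.g. scraped from Prometheus); the helper name and the 20% headroom are illustrative, not an existing tf-operator API:

```python
# Minimal sketch of closed-loop estimation: suggest a memory limit from the
# observed peak usage of similar historical jobs.
def estimate_memory_limit(history_peak_bytes, headroom=0.2, default="8Gi"):
    """Return a suggested memory limit for a new job of the same family."""
    if not history_peak_bytes:                     # no similar job yet: fall back
        return default
    worst_peak = max(history_peak_bytes)           # worst peak among similar jobs
    return str(int(worst_peak * (1 + headroom)))   # bytes, with safety headroom

# Example: three previous runs of the same model peaked at roughly 3-3.5 GiB.
print(estimate_memory_limit([3.2e9, 3.5e9, 3.0e9]))  # -> "4200000000"
```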

I'd appreciate it if you could give me some advice.

/cc @ddysher @jlewi @ddutta @YujiOshima @cheyang @jian-he @bhack @ankushagarwal

@gaocegege
Member Author

I think this is an important feature, though it may not belong in the operator itself. Maybe it could be built on top of the operator or in the pipeline.

@bhack

bhack commented Jul 1, 2018

I think it could be a bit of a chicken-and-egg problem in some cases, e.g. see #628 (comment).
So are you just talking about the minimum required resources?

@gaocegege
Member Author

gaocegege commented Jul 1, 2018

The question comes from AI engineers. They said they are not sure how many resources they should assign to the container. I think it should be the minimum required resources.

@amsaha

amsaha commented Jul 1, 2018

I have personally faced this issue of OOM and of the cluster not having enough CPU, etc. Ideally, the solution would be to build elasticity into our installation scripts, so that they start with some calculated minimum and, in case of failure, increase the resource(s) that caused the error.

We could even have a dry_run option for installation to figure out what the minimum required resources are. What I am not clear about is whether it is easy to figure out from the error messages on Kubernetes which resources have run out.
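
For the memory case, at least, the signal is in the pod status rather than in an error message: a container killed by the OOM killer terminates with reason OOMKilled. A minimal sketch with the Kubernetes Python client (CPU exhaustion is harder to detect, since it usually shows up as throttling or pending pods rather than a hard failure):

```python
# Minimal sketch: list containers that were OOM-killed, i.e. the signal an
# "increase and retry" loop in the installer or controller could react to.
from kubernetes import client, config

def find_oom_killed(namespace, label_selector):
    config.load_kube_config()   # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    oom_killed = []
    for pod in pods.items:
        for status in pod.status.container_statuses or []:
            terminated = (status.last_state and status.last_state.terminated) or \
                         (status.state and status.state.terminated)
            if terminated and terminated.reason == "OOMKilled":
                oom_killed.append((pod.metadata.name, status.name))
    return oom_killed
```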

@ddysher
Member

ddysher commented Jul 2, 2018

Thanks for bringing up the issue. The overall problem has a much broader scope than tf-operator, but we can probably start minimal and see what we can do in tf-operator first.

I believe this is what Borg does for resource estimation: gradually increase the allocated resources and restart killed apps. For us, a big prerequisite is that ML apps need checkpoint support, unless it is explicitly a dry_run.

> Ideally, the solution would be to build elasticity into our installation scripts, so that they start with some calculated minimum and, in case of failure, increase the resource(s) that caused the error.

There are a lot of moving parts: whether it is a dedicated cluster for Kubeflow or a shared cluster, whether to use priority/preemption, whether to leverage QoS support in Kubernetes, etc. Solving the whole issue would take a long time, but for tf-operator I would go for closed-loop (feedback) control with profiling or static computation first, as an opt-in feature. Once the feature is enabled, tf-operator would do resource estimation and create pods accordingly; admins are responsible for making sure monitoring is available, etc. This is sub-optimal since we do not take cluster-wide information into account, but it does sound like a viable solution.

@jlewi
Contributor

jlewi commented Jul 2, 2018

Does vertical pod autoscaling give us what we need? Can't we use vertical pod autoscaling to automatically bump up the pod requests if it OOMs?

@gaocegege
Member Author

@jlewi

It could partially solve the problem. Initial resources + VPA is equivalent to the closed-loop (feedback) control described above.

BTW, VPA will restart the pod when it decides the pod needs to be scaled, which is not what users expect.
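
One possible middle ground is to run VPA in recommendation-only mode (updateMode: "Off") and copy its target into the TFJob spec before (re)submission, so running pods are never restarted in place. A minimal sketch, assuming a VPA object already targets the training pods; the object name is hypothetical and the API version depends on the VPA release installed:

```python
# Minimal sketch: read a VPA recommendation and reuse it when submitting the
# next TFJob, instead of letting VPA evict running pods. Names are illustrative.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()
vpa = api.get_namespaced_custom_object(
    group="autoscaling.k8s.io",
    version="v1",                    # older VPA releases use v1beta2
    namespace="kubeflow",
    plural="verticalpodautoscalers",
    name="my-tfjob-vpa",             # hypothetical VPA object
)
for rec in vpa["status"]["recommendation"]["containerRecommendations"]:
    print(rec["containerName"], rec["target"])  # e.g. {'cpu': '2', 'memory': '6Gi'}
```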

@stale

stale bot commented Apr 20, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot closed this as completed Apr 27, 2020