
[discussion] Capacity planning #708

Closed
gaocegege opened this issue Jul 1, 2018 · 8 comments

Comments

@gaocegege
Member

When applying Kubeflow in our company, AI engineers run into capacity problems. I would like to discuss this with the community to find a better way to avoid them.

They complain that they frequently hit OOM errors because they do not know how to limit the resources used by a job, and they want us to tell them how to set the resource limit.
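
For context, this is roughly what they are being asked to write by hand for each container today. A minimal sketch using the Kubernetes Python client; the image name and the numbers are placeholders, not a recommendation:

```python
# Minimal sketch: the requests/limits an engineer currently has to pick by hand
# for each replica container in a TFJob pod template. Values are placeholders.
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={"cpu": "2", "memory": "4Gi"},  # what the scheduler reserves
    limits={"cpu": "4", "memory": "8Gi"},    # exceeding the memory limit => OOMKilled
)

container = client.V1Container(
    name="tensorflow",
    image="my-registry/train:latest",        # hypothetical training image
    resources=resources,
)
```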

I then investigated how to achieve this; here are the approaches I know of:

  • Compute the resource limit statically
    • If we have the model and know which layers will be placed on which device, we can calculate how much resource the container needs.
    • Pros: Easy to understand
    • Cons: Hard to implement; it requires information about the graph and device placement.
  • Profiling
    • Profile the job before actually running it.
    • Pros: Easy to implement
    • Cons: Needs pipeline support (run the profiling first, then generate the resource limit).
  • Closed-loop (feedback) control
    • Keep historical records of similar jobs and use them to set and adjust the resource limit of a new job (see the sketch after this list).
    • Pros: No overhead
    • Cons: Maybe not very accurate.
  • Closed-loop (feedback) control with profiling or static computation
    • Apply profiling or static computation for the first run; once similar historical jobs exist, use closed-loop (feedback) control to predict the resources a new job will need.
  • Katib
    • Treat the resource limit as a hyperparameter and tune it with Katib.
    • Pros: Reuses Katib
    • Cons: High overhead
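
To make the closed-loop option concrete, here is a minimal sketch. It assumes we keep peak-usage records of similar jobs (e.g. scraped from Prometheus); the helper name and the 20% headroom are illustrative, not an existing tf-operator API:

```python
# Minimal sketch of closed-loop estimation: suggest a memory limit from the
# observed peak usage of similar historical jobs.
def estimate_memory_limit(history_peak_bytes, headroom=0.2, default="8Gi"):
    """Return a suggested memory limit for a new job of the same family."""
    if not history_peak_bytes:                     # no similar job yet: fall back
        return default
    worst_peak = max(history_peak_bytes)           # worst peak among similar jobs
    return str(int(worst_peak * (1 + headroom)))   # bytes, with safety headroom

# Example: three previous runs of the same model peaked at roughly 3-3.5 GiB.
print(estimate_memory_limit([3.2e9, 3.5e9, 3.0e9]))  # -> "4200000000"
```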

I'd appreciate it if you could give me some advice.

/cc @ddysher @jlewi @ddutta @YujiOshima @cheyang @jian-he @bhack @ankushagarwal

@gaocegege
Member Author

I think this is an important feature, though it may not belong in the operator itself. Maybe it could be built on top of the operator or in the pipeline.

@bhack

bhack commented Jul 1, 2018

I think it could be a bit of a chicken-and-egg problem in some cases, e.g. see #628 (comment).
So are you just talking about the minimum required resources?

@gaocegege
Member Author

gaocegege commented Jul 1, 2018

The question comes from AI engineers. They said they are not sure how many resources they should assign to the container. I think it should be the minimum required resources.

@amsaha

amsaha commented Jul 1, 2018

I have personally faced this issue of OOM and of the cluster not having enough CPU, etc. Ideally, the solution would be to build elasticity into our installation scripts, so that they start with some calculated minimum and, in case of failure, increase the resource(s) that caused the error.

We could even have a dry_run option for installation to figure out what the minimum required resources are. What I am not clear about is whether it is easy to figure out from the error messages on Kubernetes which resources have run out.
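
For the memory case, at least, the signal is in the pod status rather than in an error message: a container killed by the OOM killer terminates with reason OOMKilled. A minimal sketch with the Kubernetes Python client (CPU exhaustion is harder to detect, since it usually shows up as throttling or pending pods rather than a hard failure):

```python
# Minimal sketch: list containers that were OOM-killed, i.e. the signal an
# "increase and retry" loop in the installer or controller could react to.
from kubernetes import client, config

def find_oom_killed(namespace, label_selector):
    config.load_kube_config()   # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector)
    oom_killed = []
    for pod in pods.items:
        for status in pod.status.container_statuses or []:
            terminated = (status.last_state and status.last_state.terminated) or \
                         (status.state and status.state.terminated)
            if terminated and terminated.reason == "OOMKilled":
                oom_killed.append((pod.metadata.name, status.name))
    return oom_killed
```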

@ddysher
Member

ddysher commented Jul 2, 2018

Thanks for bringing up the issue. The overall problem has a much broader scope than tf-operator, but we can probably start minimal and see what we can do in tf-operator first.

I believe this is what Borg does for resource estimation: gradually increase the allocated resources and restart killed apps. For us, a big prerequisite is that ML apps need checkpoint support, unless it is explicitly a dry_run.

> Ideally, the solution would be to build elasticity into our installation scripts, so that they start with some calculated minimum and, in case of failure, increase the resource(s) that caused the error.

There are a lot of moving parts: whether it is a dedicated cluster for Kubeflow or a shared cluster, whether to use priority/preemption, whether to leverage QoS support in Kubernetes, etc. Solving the whole issue would take a long time, but for tf-operator I would go for closed-loop (feedback) control with profiling or static computation first, as an opt-in feature. Once the feature is enabled, tf-operator would do resource estimation and create pods accordingly; admins are responsible for making sure monitoring is available, etc. This is sub-optimal since we do not take cluster-wide information into account, but it does sound like a viable solution.

@jlewi
Contributor

jlewi commented Jul 2, 2018

Does vertical pod autoscaling give us what we need? Can't we use vertical pod autoscaling to automatically bump up the pod requests if it OOMs?

@gaocegege
Member Author

@jlewi

It could partially solve the problem. Initial resources + VPA is equivalent to the closed-loop (feedback) control described above.

BTW, VPA will restart the pod when it decides the pod needs to be scaled, which is not what users expect.
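
One possible middle ground is to run VPA in recommendation-only mode (updateMode: "Off") and copy its target into the TFJob spec before (re)submission, so running pods are never restarted in place. A minimal sketch, assuming a VPA object already targets the training pods; the object name is hypothetical and the API version depends on the VPA release installed:

```python
# Minimal sketch: read a VPA recommendation and reuse it when submitting the
# next TFJob, instead of letting VPA evict running pods. Names are illustrative.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()
vpa = api.get_namespaced_custom_object(
    group="autoscaling.k8s.io",
    version="v1",                    # older VPA releases use v1beta2
    namespace="kubeflow",
    plural="verticalpodautoscalers",
    name="my-tfjob-vpa",             # hypothetical VPA object
)
for rec in vpa["status"]["recommendation"]["containerRecommendations"]:
    print(rec["containerName"], rec["target"])  # e.g. {'cpu': '2', 'memory': '6Gi'}
```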

@stale

stale bot commented Apr 20, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot closed this as completed Apr 27, 2020