Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify CPU Request for the SDK Server Sidecar #390

Merged

Conversation

markmandel
Copy link
Member

This provides the mechanism (and defaults) for being able to set both the CPU request, and CPU limits for the SDK Server GameServer sidecar.

I've only set the Request level, as it seems that the major issue is not CPU usage, but actually how the scheduler allots space for the sidecar (by default 100/0.1 vCPU is alloted to each container.

I've set the default request level to be 5m/0.005 vCPU -- while this is above what load tests have shown, I wanted to be conservative. Also, the controls exist to tweak this value yourself via the Helm chart.

I've not set a CPU limit, as I found when setting a low (<= 20m) CPU limit on the sidecar it mostly stopped working. But if people want to experiment with this, it is also configurable via the Helm chart.

Closes #344

@markmandel markmandel added kind/feature New features for Agones area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc labels Oct 18, 2018
@markmandel markmandel added this to the 0.6.0 milestone Oct 18, 2018
@markmandel
Copy link
Member Author

/cc @KamiShepard

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 345e8292-9184-47ea-92eb-707caf13f672

The following development artifacts have been built, and will exist for the next 30 days:

(experimental) To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/390/head:pr_390 && git checkout pr_390
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.6.0-0b46fb4

@victor-prodan
Copy link
Contributor

If it stopa working when you set the limit to 20ms it means that at runtime it has cpu usage spikes that top that value.

As the scheduling is done by requested usage and not by limit, my concern is that with bin packing the servers will be packed tighter than they should and these sidecar cpu spikes will impact the performance/functionality of game servers

@markmandel
Copy link
Member Author

markmandel commented Oct 18, 2018

Looking at the graphs, it never gets anywhere close to 20m/0.02 (the max is ~ 0.003) 🤷‍♂️ but maybe it's being sampled?

Looking at the limits description:

The spec.containers[].resources.limits.cpu is converted to its millicore value and multiplied by 100. The resulting value is the total amount of CPU time that a container can use every 100ms. A container cannot use more than its share of CPU time during this interval.

Let's see if I can get this math right.

  • 20m = 0.02
  • 0.02 * 100 = 2
  • So, the container gets 2ms every 100ms, which is only 20ms per 1s.
  • Maybe this is actually too little to get things done? (But you likely have a better idea of lower level cpu cycles than I do)

For requests:

The spec.containers[].resources.requests.cpu is converted to its core value, which is potentially fractional, and multiplied by 1024. The greater of this number or 2 is used as the value of the --cpu-shares flag in the docker run command.

Looking at cpu-shares, this is a CPU limiter, but:

The proportion will only apply when CPU-intensive processes are running. When tasks in one container are idle, other containers can use the left-over CPU time. The actual amount of CPU time will vary depending on the number of containers running on the system.

That being said - this is also why I allowed both to configured, in case this is all totally wrong, it can be adjusted as necessary (and in fact, I'd definitely like this to be configured with real, proper data and evidence to be sure)

20m mostly worked -- 30m was more consistently good -- at least on the CPUs I was running in my cluster.

WDYT?

@victor-prodan
Copy link
Contributor

The sidecar is normally Idle, it only does stuff at discrete events. This is why looking at the average cpu usage is misleading.

I dont think we should put a default limit though, as if we're wrong (not all cloud cpus are equal) it will break it.

But I think that the default value for any parameter of a service should err on the safe side. Its better to have a robust service that can be optimized.

Therefore I suggest to go with a default closer to the observed consistent limit, 20-30 ms.

@markmandel
Copy link
Member Author

That sounds quite reasonable to me. Agreed that we should err on the side of robustness.

Also, in the long term, if we find our numbers are too big, I don't think people will complain about shrinking them down. Pushing them up may cause some concerns.

So to confirm:

  1. We all agree, no default hard imit
  2. We set a request level of 30m.

Question: I'm thinking about adding a "Advanced" doc on CPU (and memory?) limits and requests - both for this, and for general gameservers. Should this be part of this PR?

@KamiShepard - any thoughts on the above?

This provides the mechanism (and defaults) for being able to set
both the CPU request, and CPU limits for the SDK Server `GameServer` sidecar.

I've only set the Request level, as it seems that the major issue is not
CPU usage, but actually how the scheduler allots space for the sidecar
(by default 100/0.1 vCPU is alloted to each container.

After discussion, the CPU request has been set to 30m, but is also
configurable via the helm chart.

I've not set a CPU limit, as I found when setting a low (<= 20m) CPU limit
on the sidecar it mostly stopped working. But if people want to experiment with
this, it is also configurable via the Helm chart.

Closes googleforgames#344
@markmandel
Copy link
Member Author

Not hearing any objections - so updated to 30m on the cpu request, and also wrote an advanced doc on cpu and memory limiting.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: f00eddfe-0ab6-496f-ab81-28c5c3bfa461

The following development artifacts have been built, and will exist for the next 30 days:

(experimental) To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/390/head:pr_390 && git checkout pr_390
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.6.0-c6f714a

Copy link
Collaborator

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@markmandel markmandel merged commit c30c70c into googleforgames:master Oct 22, 2018
@markmandel markmandel deleted the feature/sidecar-limit-cpu branch October 22, 2018 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/feature New features for Agones
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants