
Move to using CRD Subresources for all Agones CRDs #329

Closed
markmandel opened this issue Aug 27, 2018 · 19 comments · Fixed by #959
Assignees
Labels
area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/design Proposal discussing new features / fixes and how they should be implemented kind/feature New features for Agones
Milestone

Comments

@markmandel
Member

Design

We would want to generate the client code and write the kubectl configuration for Subresources.

I'm not 100% positive that this is enabled by default / available on cloud providers in Kubernetes 1.10. We may have to wait until Kubernetes 1.11 is available across the board.

That being said, this looks to break down into two major things:

Status Subresource

This doesn't give us any extra functionality, but it does give us more fine-grained RBAC control over what can be changed by which part of the system. For example, #150 will only be possible once this is in effect.

There are also some potential performance optimisations, but I expect these to be minor. This is more of a security++ operation.
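For reference, enabling a status subresource is a small addition to the CRD definition. A sketch, assuming the apiextensions.k8s.io/v1beta1 API shipped with Kubernetes 1.11 (the group/version and names here mirror the GameServer CRD; the actual Agones helm template may differ):

```yaml
# Sketch: enabling the status subresource on the GameServer CRD.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: gameservers.stable.agones.dev
spec:
  group: stable.agones.dev
  version: v1alpha1
  names:
    kind: GameServer
    plural: gameservers
  scope: Namespaced
  subresources:
    # With this enabled, writes to the main resource ignore .status,
    # and writes to the /status endpoint ignore everything but .status,
    # which is what allows separate RBAC rules for each.
    status: {}
```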

Scale on Fleets

This may be the easier one to start with - the end goal being that it would enable kubectl scale fleet simple-udp --replicas=10, which would be super nice, and would make custom autoscalers much easier.

As an implementation detail, this may also drive us to implement the scale subresource on the backing GameServerSet as well.
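A sketch of what the scale subresource stanza would look like on the Fleet CRD, using the v1beta1 syntax (the JSONPaths are assumptions based on the Fleet spec/status shape):

```yaml
# Sketch: subresources stanza for the Fleet CRD.
subresources:
  status: {}
  scale:
    specReplicasPath: .spec.replicas          # read/written by kubectl scale
    statusReplicasPath: .status.replicas      # current replica count reported back
    labelSelectorPath: .status.labelSelector  # optional; needed for HPA support
```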

Research

@markmandel markmandel added kind/feature New features for Agones kind/design Proposal discussing new features / fixes and how they should be implemented area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc labels Aug 27, 2018
@victor-prodan
Contributor

Subresource implementation example found here: https://github.com/kubernetes/sample-controller

@markmandel
Member Author

We can do this now that #447 is merged!

@aLekSer
Collaborator

aLekSer commented Jan 28, 2019

I started adding the new Subresources into the helm templates, and it is working for Fleet, for instance:

>kubectl scale --replicas=15 fleet/simple-udp
fleet.stable.agones.dev/simple-udp scaled

I followed the approach documented under the Subresources section here: https://v1-11.docs.kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#advanced-topics

@aLekSer
Collaborator

aLekSer commented Jan 31, 2019

Regarding adding the Scale subresource: it makes it possible to configure a Fleet with a Horizontal Pod Autoscaler. And we can use metrics that were previously exported to Stackdriver to control the scaling. This is just one of the gains of the scale subresource.
https://cloud.google.com/blog/products/gcp/beyond-cpu-horizontal-pod-autoscaling-comes-to-google-kubernetes-engine

Allows systems like HorizontalPodAutoscaler and PodDisruptionBudget to interact with your new resource

https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#api-server-aggregation
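To illustrate the HPA angle: once the Fleet exposes a scale subresource, a HorizontalPodAutoscaler can target it through scaleTargetRef. A hedged sketch - the metric name and target value are purely illustrative, not actual Agones metric names:

```yaml
# Sketch: an HPA scaling a Fleet on an external metric.
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: simple-udp-autoscaler
spec:
  scaleTargetRef:           # points at the custom resource, not a Deployment
    apiVersion: stable.agones.dev/v1alpha1
    kind: Fleet
    name: simple-udp
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metricName: example_ready_gameservers  # hypothetical metric name
        targetValue: "10"
```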

@markmandel
Member Author

Was just looking at the autoscaler - should that also have a status subresource?

@aLekSer
Collaborator

aLekSer commented Apr 15, 2019

@markmandel Let me check that.
Do we also need a scale subresource for the Buffer policyType's maxReplicas parameter only?
https://github.com/GoogleCloudPlatform/agones/blob/b19408187de9e643698b115f93e255d0279ae85a/install/helm/agones/templates/crds/fleetautoscaler.yaml#L66

@markmandel
Member Author

Oops, missed your comment. I don't think so. I don't think scale makes sense here. WDYT?

@aLekSer
Collaborator

aLekSer commented Apr 18, 2019

Yes, that would make things more complicated. So I will prepare a change adding a Status subresource to FleetAutoscaler.

@markmandel markmandel modified the milestones: 0.10.0, 0.11.0 May 7, 2019
@markmandel markmandel modified the milestones: 0.11.0, 0.12.0 Jun 18, 2019
@markmandel markmandel self-assigned this Jul 25, 2019
@markmandel
Member Author

markmandel commented Jul 25, 2019

Created a draft of the GameServer/status implementation here #947

I'm waiting on the results of the e2e tests, but I'm leaning towards having GameServer be a special case, where we don't implement a status subresource - essentially so that we can change the entirety of a GameServer in one go.

Allocation is something that moves away from the K8s paradigm, and it only affects GameServers, so it may be that we want to keep GameServer as a special case because of that.

There are several reasons for this:

  1. (low concern) An extra step and API operation to set the default status values for a gameserver. Not the end of the world, but it would be nice to have less traffic when scaling gameservers up and down.
  2. (high concern) When we allocate, we have to apply several operations, with no way to make them transactional. If the second one fails (master goes down, etc.), we have partially applied data on an allocated game server. This is also an extra k8s API hit per allocation.
    • We could maybe have a "cleanup" queue that marks gameservers as unhealthy if the allocation fails. But this seems like a lot of effort for... what gain?

Waiting for the e2e tests to finish - I am slightly worried that the k8s resource locking mechanism we rely on to ensure allocations are safe from fleet changes will be nullified. From here:

The .metadata.generation value is incremented for all changes, except for changes to .metadata or .status.

But will wait on e2e tests to complete before drawing conclusions. Just wanted to throw this out there as an additional concern.

@markmandel
Member Author

In local e2e testing - both TestGameServerAllocationDuringGameServerDeletion and TestScaleFleetUpAndDownWithGameServerAllocation seem to be passing 😮 so I may well be wrong about the last item.

But how do people feel about the rest of the above?

@markmandel
Member Author

It all passed. I'm actually quite surprised.

@markmandel
Member Author

markmandel commented Jul 25, 2019

/cc @aLekSer @roberthbailey - wdyt? Would love some second opinions about this. Please also let me know if what I wrote was not clear.

@roberthbailey
Member

I'll take a look later today.

@roberthbailey
Member

I'm not an expert on this code path but I have a basic question regarding your second concern:

In all of the other code paths you updated, it was a simple change from Update --> UpdateStatus, but for allocation it's more complicated. I assume there is a good reason that allocation isn't possible by only changing fields in the status and that it requires updates to metadata?

Assuming that there is a reasonable answer to the above, I tend to agree with your conclusion that this is adding more complexity and trouble than it is worth. If we did need to do two updates, they need to be done in way that is automatically recoverable by the controller such that it can drive the second update by seeing the first one (which led me to the above question, since that would mean putting enough info in status to make the change, at which point why not have everything in status).

Based on your code changes, I think it's a breaking api change to introduce the status sub-resource later, so it's sort of now-or-never to add it in. What do we gain by having it as a subresource? How much are we breaking with conventions to not make it a subresource, especially when all of the other types do have a status subresource?

@markmandel
Member Author

Excellent questions, let me tackle each one:

In all of the other code paths you updated, it was a simple change from Update --> UpdateStatus, but for allocation it's more complicated. I assume there is a good reason that allocation isn't possible by only changing fields in the status and that it requires updates to metadata?

So there are two things at play here. In all the other cases, we have the workqueue + sync method, so we have a self-healing system that eventually ends up in the declared state - basically what K8s was built for. For Allocation, we're essentially doing an imperative command, which is not quite what K8s was built for.

So in the previous cases, even if we did add an extra step with the new subresource, it's fine, because even if something goes wrong, we will eventually self-heal - so we always end up in a happy place.

But in Allocation's case, I feel it's ideal to do it all in one go - we don't have a way to self-heal if something goes wrong (although we could build one in theory, it seems like a lot of work).

The update-metadata step exists because you can pass metadata changes along with the Allocation - this is useful for passing information to the GameServer. A classic example is passing which map the gameserver needs to load before the players can play, or how many players the gameserver can expect. We talk about it a bit here: https://agones.dev/site/docs/reference/gameserverallocation/
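For readers unfamiliar with that pattern, an allocation carrying metadata looks roughly like this (based on the reference docs linked above; the label and annotation values are illustrative game-specific data, not Agones conventions):

```yaml
# Sketch: a GameServerAllocation that also applies metadata to the
# allocated GameServer, which is why allocation touches more than .status.
apiVersion: stable.agones.dev/v1alpha1
kind: GameServerAllocation
spec:
  required:
    matchLabels:
      stable.agones.dev/fleet: simple-udp  # allocate from this Fleet
  metadata:
    labels:
      mode: deathmatch    # illustrative: searchable via label selectors
    annotations:
      map: garden22       # illustrative: which map the server should load
```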

Assuming that there is a reasonable answer to the above, I tend to agree with your conclusion that this is adding more complexity and trouble than it is worth. If we did need to do two updates, they need to be done in way that is automatically recoverable by the controller such that it can drive the second update by seeing the first one (which led me to the above question, since that would mean putting enough info in status to make the change, at which point why not have everything in status).

Yeah, I was originally against moving it to the status, since Labels and Annotations seemed like a pre-built mechanism that K8s already had for storing arbitrary data about the GameServer. (I have some memories of us talking about this back in the original design stage - I can probably dig up the tickets if need be.) It also has some nice crossover in that labels are naturally searchable in the k8s API already, which can be useful.

Based on your code changes, I think it's a breaking api change to introduce the status sub-resource later, so it's sort of now-or-never to add it in. What do we gain by having it as a subresource? How much are we breaking with conventions to not make it a subresource, especially when all of the other types do have a status subresource?

These are good questions. The only thing I think is a big win is being able to grant an RBAC rule on gameserver/status. We would probably take a small performance hit if we add it, because of the extra API hits for this particular resource (the others don't get their metadata or spec updated).

How much are we breaking conventions? Honestly, I can't imagine anyone would really care. We can document in the generated client-go code that there is no status endpoint (and we shouldn't generate the UpdateStatus function in the Go code either - it's currently there even though it doesn't function). It's not like we are removing the actual ability to update the Status. In my mind, we're doing something a little different from the norm with the GameServer resource, so I don't think it's going to be a great detriment to anyone - but it gives us the ability to edit as many aspects of the GameServer as we need in a single transaction, and I think that will be hugely valuable going forward.

Hopefully that answers your questions. Please let me know if you have more.

@markmandel
Member Author

Actually, to add a note - assuming we go with keeping the code as is - aside from removing the UpdateStatus function from GameServer, we should write comprehensive documentation that explains the decision, and why GameServer is the special case it is - so that any user attempting to interact with a GameServer directly is not confused.

I figure we can add it to the gameserver.go header, and it will filter through to all the generated documentation.

@markmandel
Member Author

Created draft PR #959 so whichever way we decide to go - we have a PR we can merge when ready.

The more I think about it, I can't actually see any real net positives to adding status as a subresource on GameServer anyway - so I don't think we should add it. But I definitely still want to hear others' opinions.

@roberthbailey
Member

You've convinced me that we should go with #959 and remove UpdateStatus.

@markmandel
Member Author

I think that is the best way forward. 👍 I'll get the PR ready.


4 participants