
Move to using CRD Subresources for all Agones CRDs #329

Closed
markmandel opened this issue Aug 27, 2018 · 19 comments · Fixed by #959
Assignees
Labels
area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/design Proposal discussing new features / fixes and how they should be implemented kind/feature New features for Agones
Milestone

Comments

@markmandel
Member

Design

We would want to generate the client code and write the kubectl configuration for Subresources.

I'm not 100% positive that this is enabled by default / available on cloud providers in Kubernetes 1.10. We may have to wait until Kubernetes 1.11 is available across the board.

That being said, this looks to break down into two major things:

Status Subresource

This doesn't give us any extra functionality, but it does give us more fine-grained RBAC control over what can be changed by which part of the system. For example, #150 will only be possible once this is in effect.

There are also some potential performance optimisations, but I expect these to be minor. This is more of a security++ operation.
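For reference, enabling a status subresource is a small addition to the CRD definition. A sketch, assuming the apiextensions.k8s.io/v1beta1 API shipped with Kubernetes 1.11 (the group/version and names here mirror the GameServer CRD; the actual Agones helm template may differ):

```yaml
# Sketch: enabling the status subresource on the GameServer CRD.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: gameservers.stable.agones.dev
spec:
  group: stable.agones.dev
  version: v1alpha1
  names:
    kind: GameServer
    plural: gameservers
  scope: Namespaced
  subresources:
    # With this enabled, writes to the main resource ignore .status,
    # and writes to the /status endpoint ignore everything but .status,
    # which is what allows separate RBAC rules for each.
    status: {}
```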

Scale on Fleets

This may be the easier one to start with - the end goal being that it would enable kubectl scale fleet simple-udp --replicas=10, which would be super nice, and would make custom autoscalers much easier.

As an implementation detail, this may also drive us to implement the scale subresource on the backing GameServerSet as well.
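A sketch of what the scale subresource stanza would look like on the Fleet CRD, using the v1beta1 syntax (the JSONPaths are assumptions based on the Fleet spec/status shape):

```yaml
# Sketch: subresources stanza for the Fleet CRD.
subresources:
  status: {}
  scale:
    specReplicasPath: .spec.replicas          # read/written by kubectl scale
    statusReplicasPath: .status.replicas      # current replica count reported back
    labelSelectorPath: .status.labelSelector  # optional; needed for HPA support
```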

Research

@markmandel markmandel added kind/feature New features for Agones kind/design Proposal discussing new features / fixes and how they should be implemented area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc labels Aug 27, 2018
@victor-prodan
Contributor

Subresource implementation example found here: https://github.com/kubernetes/sample-controller

@markmandel
Member Author

We can do this now that #447 is merged!

@aLekSer
Collaborator

aLekSer commented Jan 28, 2019

I started adding the new Subresources into the helm templates, and it is working for Fleet, for instance:

>kubectl scale --replicas=15 fleet/simple-udp
fleet.stable.agones.dev/simple-udp scaled

I followed the approach documented under the Subresources section here: https://v1-11.docs.kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#advanced-topics

@aLekSer
Collaborator

aLekSer commented Jan 31, 2019

Regarding adding the Scale subresource: it makes it possible to configure a Fleet with a Horizontal Pod Autoscaler. And we can use metrics that were previously exported to Stackdriver to control the scaling. This is just one of the gains of the scale subresource.
https://cloud.google.com/blog/products/gcp/beyond-cpu-horizontal-pod-autoscaling-comes-to-google-kubernetes-engine

Allows systems like HorizontalPodAutoscaler and PodDisruptionBudget to interact with your new resource

https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#api-server-aggregation
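To illustrate the HPA angle: once the Fleet exposes a scale subresource, a HorizontalPodAutoscaler can target it through scaleTargetRef. A hedged sketch - the metric name and target value are purely illustrative, not actual Agones metric names:

```yaml
# Sketch: an HPA scaling a Fleet on an external metric.
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: simple-udp-autoscaler
spec:
  scaleTargetRef:           # points at the custom resource, not a Deployment
    apiVersion: stable.agones.dev/v1alpha1
    kind: Fleet
    name: simple-udp
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metricName: example_ready_gameservers  # hypothetical metric name
        targetValue: "10"
```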

@markmandel
Member Author

Was just looking at the autoscaler - should that also have a status subresource?

@aLekSer
Collaborator

aLekSer commented Apr 15, 2019

@markmandel Let me check that.
Do we also need a scale subresource for the Buffer policyType's maxReplicas parameter only?
https://github.com/GoogleCloudPlatform/agones/blob/b19408187de9e643698b115f93e255d0279ae85a/install/helm/agones/templates/crds/fleetautoscaler.yaml#L66

@markmandel
Member Author

Oops, missed your comment. I don't think so. I don't think scale makes sense here. WDYT?

@aLekSer
Collaborator

aLekSer commented Apr 18, 2019

Yes, that would make things more complicated. So I will prepare a change adding a Status subresource to FleetAutoscaler.

@markmandel markmandel modified the milestones: 0.10.0, 0.11.0 May 7, 2019
@markmandel markmandel modified the milestones: 0.11.0, 0.12.0 Jun 18, 2019
@markmandel markmandel self-assigned this Jul 25, 2019
@markmandel
Member Author

markmandel commented Jul 25, 2019

Created a draft of the GameServer/status implementation here #947

I'm waiting on the results of the e2e tests, but I'm leaning towards having GameServer be a special case, where we don't implement a status subresource - essentially so that we can change the entirety of a GameServer in one go.

Allocation is something that moves away from the K8s paradigm, and it only affects GameServers, so it may be that we want to keep GameServer as a special case because of that.

There are several reasons for this:

  1. (low concern) An extra step and API operation to set the default status values for a gameserver. Not the end of the world, but it would be nice to have less traffic when scaling gameservers up and down.
  2. (high concern) When we allocate, we have to apply several operations, with no way to make them transactional. If the second one fails (master goes down, etc.), we have partially applied data on an allocated game server. This is also an extra k8s API hit per allocation.
    • We could maybe have a "cleanup" queue that marks gameservers as unhealthy if the allocation fails. But this seems like a lot of effort for... what gain?

Waiting for the e2e tests to finish - I am slightly worried that the k8s resource locking mechanism we rely on to ensure allocations are safe from fleet changes will be nullified. From here:

The .metadata.generation value is incremented for all changes, except for changes to .metadata or .status.

But will wait on e2e tests to complete before drawing conclusions. Just wanted to throw this out there as an additional concern.

@markmandel
Member Author

In local e2e testing - both TestGameServerAllocationDuringGameServerDeletion and TestScaleFleetUpAndDownWithGameServerAllocation seem to be passing 😮 so I may well be wrong about the last item.

But how do people feel about the rest of the above?

@markmandel
Member Author

It all passed. I'm actually quite surprised.

@markmandel
Member Author

markmandel commented Jul 25, 2019

/cc @aLekSer @roberthbailey - wdyt? Would love some second opinions about this. Please also let me know if what I wrote was not clear.

@roberthbailey
Member

I'll take a look later today.

@roberthbailey
Member

I'm not an expert on this code path but I have a basic question regarding your second concern:

In all of the other code paths you updated, it was a simple change from Update --> UpdateStatus, but for allocation it's more complicated. I assume there is a good reason that allocation isn't possible by only changing fields in the status and that it requires updates to metadata?

Assuming that there is a reasonable answer to the above, I tend to agree with your conclusion that this is adding more complexity and trouble than it is worth. If we did need to do two updates, they need to be done in way that is automatically recoverable by the controller such that it can drive the second update by seeing the first one (which led me to the above question, since that would mean putting enough info in status to make the change, at which point why not have everything in status).

Based on your code changes, I think it's a breaking api change to introduce the status sub-resource later, so it's sort of now-or-never to add it in. What do we gain by having it as a subresource? How much are we breaking with conventions to not make it a subresource, especially when all of the other types do have a status subresource?

@markmandel
Member Author

Excellent questions, let me tackle each one:

In all of the other code paths you updated, it was a simple change from Update --> UpdateStatus, but for allocation it's more complicated. I assume there is a good reason that allocation isn't possible by only changing fields in the status and that it requires updates to metadata?

So there are two things at play here. In all the other cases, we have the workqueue + sync method, so we have a self-healing system that eventually ends up in the declared state - basically what K8s was built for. For Allocation, we're essentially doing an imperative command, which is not quite what K8s was built for.

So in the previous cases, even if we did add an extra step with the new subresource, it's fine, because even if something goes wrong, we will eventually self-heal - so we always end up in a happy place.

But in Allocation's case, I feel it's ideal to do it all in one go - we don't have a way to self-heal if something goes wrong (although we could build one in theory, it seems like a lot of work).

The update-metadata step exists because you can pass metadata changes along with the Allocation - this is useful for passing information to the GameServer. A classic example is passing which map the gameserver needs to load before the players can play, or how many players the gameserver can expect. We talk about it a bit here: https://agones.dev/site/docs/reference/gameserverallocation/
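For readers unfamiliar with that pattern, an allocation carrying metadata looks roughly like this (based on the reference docs linked above; the label and annotation values are illustrative game-specific data, not Agones conventions):

```yaml
# Sketch: a GameServerAllocation that also applies metadata to the
# allocated GameServer, which is why allocation touches more than .status.
apiVersion: stable.agones.dev/v1alpha1
kind: GameServerAllocation
spec:
  required:
    matchLabels:
      stable.agones.dev/fleet: simple-udp  # allocate from this Fleet
  metadata:
    labels:
      mode: deathmatch    # illustrative: searchable via label selectors
    annotations:
      map: garden22       # illustrative: which map the server should load
```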

Assuming that there is a reasonable answer to the above, I tend to agree with your conclusion that this is adding more complexity and trouble than it is worth. If we did need to do two updates, they need to be done in way that is automatically recoverable by the controller such that it can drive the second update by seeing the first one (which led me to the above question, since that would mean putting enough info in status to make the change, at which point why not have everything in status).

Yeah, I was originally against moving it to the status, since Labels and Annotations seemed like a pre-built mechanism that K8s already had for storing arbitrary data about the GameServer. (I have some memories of us talking about this back in the original design stage - I can probably dig up the tickets if need be.) It also has some nice crossover in that labels are naturally searchable in the k8s API already, which can be useful.

Based on your code changes, I think it's a breaking api change to introduce the status sub-resource later, so it's sort of now-or-never to add it in. What do we gain by having it as a subresource? How much are we breaking with conventions to not make it a subresource, especially when all of the other types do have a status subresource?

These are good questions. The only thing I think is a big win is being able to grant an RBAC rule on gameserver/status. We would probably take a small performance hit if we add it, because of the extra API hits for this particular resource (the others don't get their metadata or spec updated).

How much are we breaking conventions? Honestly, I can't imagine anyone would really care. We can document in the generated client-go code that there is no status endpoint (and we shouldn't generate the UpdateStatus function in the Go code either - it's currently there even though it doesn't function). It's not like we are removing the actual ability to update the Status. In my mind, we're doing something a little different from the norm with the GameServer resource, so I don't think it's going to be a great detriment to anyone - but it gives us the ability to edit as many aspects of the GameServer as we need in a single transaction, and I think that will be hugely valuable going forward.

Hopefully that answers your questions. Please let me know if you have more.

@markmandel
Member Author

Actually, to add a note - assuming we go with keeping the code as is - aside from removing the UpdateStatus function from GameServer, we should write comprehensive documentation that explains the decision, and why GameServer is the special case it is - so that any user attempting to interact with a GameServer directly is not confused.

I figure we can add it to the gameserver.go header, and it will filter through to all the generated documentation.

@markmandel
Member Author

Created draft PR #959 so whichever way we decide to go - we have a PR we can merge when ready.

The more I think about it, I can't actually see any real net positives to adding status as a subresource on GameServer anyway - so I don't think we should add it. But I definitely still want to hear others' opinions.

@roberthbailey
Member

You've convinced me that we should go with #959 and remove UpdateStatus.

@markmandel
Member Author

I think that is the best way forward. 👍 I'll get the PR ready.


4 participants