Proposal: Multi-Cluster Allocation Policies #597

jkowalski · 2019-02-15T16:37:32Z

Background

When operating multiple Agones clusters to support world-wide game launch it is often necessary to perform multi-cluster allocations (allocations from a set of clusters instead of just one) based on defined policies.

Examples of common policies include:

burst to cloud - first allocate from a fleet in a cluster located on premises but when it's out of capacity, start allocating from a fleet hosted in the cloud.
provider preference - prefer certain cloud providers over others based on factors such as cost or quality of service, with the ability to fall back to others when capacity is not available
load spreading - distribute allocations across clusters from different providers to limit the blast radius of an outage of a single cluster (e.g. network fiber cut)

Proposal

This document proposes changing how GameServerAllocation API works by adding support for forwarding allocation requests to other clusters based on policies that can be applied on a per-request basis.

We will add a new CRD called GameServerAllocationPolicy that controls how multi-cluster allocations will be performed. The policy will contain a list of clusters to allocate from with corresponding priorities and weights. The credentials for accessing those clusters will be stored in secrets.

apiVersion: "stable.agones.dev/v1alpha1"
kind: GameServerAllocationPolicy
metadata:
  name: my-policy
  namespace: default
spec:
  clusters:
  - name: on-prem
    priority: 1
    server: "https://some-endpoint"
    credentials: "some-secret-name"
  - name: another-cloud
    priority: 2
    weight: 100
    server: "https://some-other-endpoint"
    credentials: "some-other-secret-name"
  - name: this-cluster
    priority: 2
    weight: 300
    server: "local"

The name of the policy can be specified when creating GameServerAllocation.

apiVersion: "stable.agones.dev/v1alpha1"
kind: GameServerAllocation
metadata:
  generateName: my-allocation-
spec:
  policy: my-policy
  required:
    matchLabels:
      game: my-game
  preferred:
    - matchLabels:
        stable.agones.dev/fleet: green-fleet
  scheduling: Packed
  metadata:
    ...

If policy is not specified, the allocation will be attempted from the local cluster as it is done today.

When policy is present on a the GameServerAllocation request, the API handler would become a router that calls the specified clusters in their priority order, and for clusters with equal priority it would randomly pick a clusters, with probability of choosing a cluster proportional to its weight. If a cluster is out of capacity, the handler would try other clusters until the allocation succeeds or the list of clusters to try is exhausted.

In the example above, when the allocation request comes in, we would always try allocating from on-prem cluster first, because it's a cluster with highest priority.

If allocation from on-prem fails, we proceed to next highest priority which includes two possible clusters: this-cluster (with 75% probability) or other-cloud (with 25% probability).

Deployment Topologies

In a multi-cluster scenario, several allocation topologies are possible, based on a decision where to put the GameServerAllocationPolicy objects:

Single Cluster Responsible For Routing

In this mode, a single cluster is selected to server multi-cluster allocation APIs and Match Maker is pointed at its allocation endpoint. The cluster has GameServerAllocationPolicy that points at all other clusters. This has the benefit of simplicity, but has a single point of failure, which is the chosen cluster.

Pros:

simple configuration of policies (one place)
simple configuration of secrets (one place)

Cons:

single point of failure
secrets hosted in the cluster that's running game servers

Dedicated Cluster Responsible For Routing

Another option, similar to the single cluster is to create a dedicated cluster that's only responsible for allocations, but does not host game servers or fleets. This cluster will have only routing policies and secrets to talk to other clusters.

Pros:

simplified management of policies (one cluster only)
simplified management of secrets (one cluster only)
no secrets are stored in clusters where game servers run

Cons:

still single point of failure
additional Agones cluster to manage

All Clusters Responsible For Routing

In this mode, all clusters will have policies and secrets that allow them to route allocation requests to all other clusters when necessary. A global load balancer (could be a VIP or DNS-based) will randomly pick a cluster to allocate from, which will perform an extra "hop" to the cluster based on policy.

Pros:

resilient to cluster outages
no additional clusters to manage

Cons:

complex configuration of policies
complex configuration of secrets

Other Topologies

Other, more complex topologies are possible, including hierarchical ones where routing-only clusters form a hierarchy, routing-only-clusters behind load-balancers, etc.

The text was updated successfully, but these errors were encountered:

markmandel · 2019-02-18T06:27:04Z

This looks pretty good to me, I'm not seeing any glaring red flags, and looks to solve the HA problem as well. - only question I have - what if I don't want to have a probability of going to another cluster - say I want to go allocate against "on-prem" until it fills up, and then move to "cloud-1", do I set the weight to 1? 0? All the same weights? (does weight have a default?)

@Kuqd I know you've been working with multiple clusters a lot -- what are your thoughts?

ilkercelikyilmaz · 2019-02-18T19:18:55Z

only question I have - what if I don't want to have a probability of going to another cluster - say I want to go allocate against "on-prem" until it fills up, and then move to "cloud-1", do I set the weight to 1? 0? All the same weights? (does weight have a default?)

This is controlled by the Priority. If there is only one entry with Priority=1 then there is no probability and the allocation will happen only on this cluster (until it is out of capacity). If there are multiple entries with same priority, then it will use weight of each entry to distribute allocation between clusters.
If there is no weight is set, we can assign a default weight.

markmandel · 2019-02-18T23:37:36Z

Actually, I see this is also covered in @jkowalski 's example. on-prem has no weight, so it is given top priorty, but because No.2 and No.3 have weights, then it is randomly distributed based on weight. 👍

That makes sense to me.

cyriltovena · 2019-02-19T00:53:55Z

I think it would be nice to describe what kind of credentials we are supporting, my guess is service account token.

One con of 3 is security but that’s the price for no single point of failure. Would a regional cluster with option 2 be good enough to remove the single point of failure?

Would be nice to be able to target a cluster from the GSA, so matchmaking can make some ping requests and select the right cluster, WDYT?

Basically being able to say I would prefer us-east but if it’s full follow the policy. Seems already possible if you create a bunch of policy so that’s good.

markmandel · 2019-02-19T01:13:26Z

So I've assumed the credentials are Kubernetes credentials (bearer token)? So essentially a service account + rbac permissions (although you are right - we should be explicit about this).

Since you can create your own topology with this design - the tradeoffs are up to the user. If you feel that 3 GSA router clusters is enough HA, then that's fine. But if you want more than that, you can add more. In fact -- you can adjust on the fly.

Originally I had thought a director/agent model would be better -- but looking at this, I think this is better because:

Under the agent model, you still have to have some kind of auth (certs probably), and they all still need to be managed. Also then the bots need service accounts anyway, so you're doing double the work.
We basically reuse existing tech - k8s creds, vpcs, etc to lock everything down if people want to (especially b/w on prem and cloud) - so it's less work for us - and less security risk.

We will want to have some pretty explicit docs on how to create and manage these tokens through - or point to some documentation that does this (I haven't seen anything yet on my travels).

Would be nice to be able to target a cluster from the GSA, so matchmaking can make some ping requests and select the right cluster, WDYT?

Basically being able to say I would prefer us-east but if it’s full follow the policy. Seems already possible if you create a bunch of policy so that’s good.

I think this is covered by the policies. I think the idea here is that the game server ops team can decide on the policy set that is in place - and the team working on matchmaking / game logic isn't able to accidentally override that when attempting to get a game server. They can only choose from a pre-approved set.

markmandel · 2019-02-19T01:18:33Z

We should also point people at: https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/

To ensure they know to encrypt secrets at rest. Do we need to do more security wise here?

pooneh-m · 2019-03-06T17:32:43Z

What do you think of changing GameServerAllocationPolicy to support time, while policies are ordered chronologically? For example for

t1<---policy1----->t2<----policy2----...

The CRD for GameServerAllocationPolicy will be something like:

 spec:
  policies:
  - name: policy1
    start time: t1
    clusters:
    - name: on-prem
    ...
    - name: another-cloud
    ...
  - name: policy2
    start time: t2
    clusters:
    - name ...

ilkercelikyilmaz · 2019-03-06T17:58:05Z

So Agones will select the policies the ones that have the start time closest to the current time?

markmandel · 2019-03-06T18:12:22Z

@pooneh-m - I'm wondering if the motivation for this is to be able to automatically declare the start and end times of a policy, so that a user doesn't have to do this manually? (Then it becomes a question of how to handle crossover periods). Is that correct?

pooneh-m · 2019-03-06T20:54:58Z

@ilkercelikyilmaz Yes. Agones will pick the policy that its current time is passed the start time and before the next start time.
@markmandel yes, I proposed a simplified version of that.

pooneh-m · 2019-04-11T16:51:46Z

What do you think of naming allocation policy CR?

GameServerAllocationPolicy
MultiClusterAllocationPolicy
FleetAllocationPolicy
GameServerMultiClusterAllocationPolicy

markmandel · 2019-04-12T20:11:27Z

I think I'm leaning more towards:
MultiClusterAllocationPolicy or GameServerMultiClusterAllocationPolicy (even though it's a huge name, it is descriptive).

WDYT?

markmandel · 2019-04-12T20:12:41Z

Another potentially fun question - should the AllocationPolicy be under a multicluster group of some kind? (Rather than stable?)

(And maybe stable => core?)

pooneh-m · 2019-04-13T00:04:40Z

I am more leaning towards keeping GameServer prefix for Agones CRs because there is less risk of having the same CRD name for two CRs in the same cluster. I think either GameServerAllocationPolicy or GameServerMultiClusterAllocationPolicy is fine.

GameServerMultiClusterAllocationPolicy has two votes.

pooneh-m · 2019-04-13T00:05:57Z

Another potentially fun question - should the AllocationPolicy be under a multicluster group of some kind? (Rather than stable?)

(And maybe stable => core?)

Lets discuss this in issue #703 that you opened.

markmandel · 2019-04-13T00:07:42Z

Yeah I agree - we should let grouping dictate naming - and 100% agreed on GameServer as a prefix for the reasons described above!

pooneh-m · 2019-04-15T18:58:46Z

Based on the group naming suggestion in #703, I am choosing GameServerAllocationPolicy, since the full name <plural>.<group> has multicluster in it
-> gameserverallocationpolicies.multicluster.agones.dev

pooneh-m · 2019-04-17T22:58:39Z

I'll be adding a new field to GameServerAllocation to extend it for multicluster allocation.

MultiClusterPolicySelector metav1.LabelSelector

By default the multicluster policy will not be effective for allocation. If MultiClusterPolicySelector is specified, multicluster policy is enforced per request.

There are two benefits to it:

If cluster 1 forwards allocation requests to cluster 2 per multicluster policy definition, the multicluster policy on cluster 2 will be disabled to avoid rerouting.
We can reuse the existing gameserverallocation API for multicluster allocation, instead of introducing a new API.

We can also make enabling and disabling the multicluster policy by introducing an explicit flag, but I don't think it is necessary.

MultiClusterPolicy {
   Enable bool
   PolicySelector metav1.LabelSelector
}

markmandel · 2019-04-18T13:37:16Z

Just so I'm 100% clear, PolicySelector would then match all the Policies on the cluster and apply all of them? Is there any control over the order, or is it essentially random?

pooneh-m · 2019-04-18T17:16:40Z

A list of multicluster policies are selected using PolicySelector per incoming allocation request. Policies are then ordered based on their priority and weight and the first policy in the ordered list is selected. Based on the policy, the allocation request is either handled locally or redirected to another cluster.

markmandel · 2019-04-18T17:25:03Z

Oh neat - so it's more of a merge operation really - all the spec.clusters are merged into a single, sorted list based on weight. Nice! SGTM!

pooneh-m · 2019-04-23T20:58:24Z

About the cluster to cluster connectivity, apparently, service accounts are not forever. One way to solve the connectivity is to introduce allocation as a service that can call other cluster's allocation services directly instead of through API servers using pre-installed certificates. WDYT?

markmandel · 2019-04-23T21:55:54Z

Not sure I understand the above tbh. Sounds like a re-architecting of how the kubernetes API is authenticated (if I read it correctly)? How does that impact connectivity? If kubectl can be used from outside a cluster, we should be able to do the same thing, no? (it all uses client-go, after all)

pooneh-m · 2019-04-26T00:37:52Z

Yes, we need a slight re-architecture to handle authentication for cluster to cluster allocation requests.

For match making service calling to an allocation service from a different cluster or for multi-cluster allocation scenario, we cannot store a service account token and assume it lasts forever. We cannot also assume customers can enable a plugins for authenticating with an identity provider e.g. OIDC on their cluster or accept a client or TLS cert.

The solution is to (1) introduce a reverse proxy with external IP on the cluster that performs the authentication of allocation requests and then forward the requests to API server. For better performance (2) we should move the allocation service logic (controller) to the proxy and call it allocation service. Then (3) remove API server extension for allocation, which is a breaking change and should be done before 1.0 release.

The solution will be similar to this sample.

For talking to GKE, kubectl is using user account authenticated with google identity provider, instead of service account and it has expiry on the token.

markmandel · 2019-04-26T00:43:37Z

I have a strange emotional attachment to GameServerAllocations 🤷‍♂️ I feel like it's so nice to be able to do allocations with kubectl on the command line for testing and development etc. I'd like to have some more user input on that aspect.

I still like the idea of keeping them around, also because it's a nice in-cluster and/or developer experience -- but I totally get the reasoning of potentially removing them. Maybe I'm being overly sentimental? (I can admit to that).

We had a short discussion earlier about completing (1) above, and then seeing how our performance goes? Is that our first step, and then maybe I can live in hope that they may stay around? 😄

But yes - I 100% agree we need to make a decision before 1.0, as it affects the API surface, and we need to lock that down before 1.0.

markmandel · 2019-05-08T19:02:42Z

@pooneh-m - just running a smoke test on the latest RC, noticed in the logs:

{"message":"agones.dev/agones/vendor/k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Secret: secrets is forbidden: User \"system:serviceaccount:agones-system:agones-controller\" cannot list secrets at the cluster scope","severity":"error","time":"2019-05-08T18:59:53.210002337Z"}
{"message":"agones.dev/agones/vendor/k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Secret: secrets is forbidden: User \"system:serviceaccount:agones-system:agones-controller\" cannot list secrets at the cluster scope","severity":"error","time":"2019-05-08T18:59:54.212493604Z"}
{"message":"agones.dev/agones/vendor/k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Secret: secrets is forbidden: User \"system:serviceaccount:agones-system:agones-controller\" cannot list secrets at the cluster scope","severity":"error","time":"2019-05-08T18:59:55.214727668Z"}

Looks like we need to add a RBAC permission 😢

Doesn't affect functionality at the moment, but would be good to get a fix in while we're in feature freeze I think.

pooneh-m · 2019-05-08T19:44:50Z

Thanks! I am on it.

pooneh-m · 2019-05-22T19:10:43Z

Here are the remaining work items for the allocator service:

gRPC API.
Load testing and moving allocation APIServer extention to allocator service if helps with performance.
Convert allocation errors to meaningful http status (currently they are all 500 http status).
Add retry logic for transient errors when calling cross cluster allocator endpoints.
Cache allocation client.
Add E2E tests for multi-cluster scenarios.
Remove ClusterName from allocation policies. Allocation endpoint suffice.
Add Healthz for the service.
Revisit whether we need to expose namespace to the caller of the allocator service.
Add samples that send request for allocation to gameserver--allocator
Add metrics for multicluster allocation failures
Add documentations

markmandel · 2019-08-01T05:11:51Z

@pooneh-m just wanted to gently nudge this - see where we up to date on this?

We should probably add "documentation" to the above list as well 😃

This isn't on the 1.0 roadmap, but I was just curious.

pooneh-m · 2019-08-02T05:22:04Z

Yes, I am going to tackle the list before going to v1.0. I added documentation to the list.

markmandel · 2019-08-02T05:47:12Z

Nice! Very cool!

Davidnovarro · 2019-08-19T14:17:42Z

@pooneh-m Hi! Any update on this?

pooneh-m · 2019-08-20T07:29:58Z

Hi @Davidnovarro, I just started working on this again. Hopefully before v1.0 there will be plenty of updates. :) I'm planning to do a refactoring to move allocation handler to its own stand alone library that both allocator service and the API server extension reference to help with the scale. Then I will introduce the gRPC API, add more testing for cross cluster calls and add documentations.

markmandel · 2020-12-10T22:16:05Z

@pooneh-m is this closeable now?

jkowalski added kind/design Proposal discussing new features / fixes and how they should be implemented area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc labels Feb 15, 2019

jkowalski mentioned this issue Feb 25, 2019

[Breaking Change] Move GameServerAllocation to an API Extension Server. #600

Closed

pooneh-m mentioned this issue Apr 9, 2019

Add allocation policy CRD and schema definition. #698

Merged

pooneh-m mentioned this issue Apr 20, 2019

Implementing cluster selector from multi-cluster allocation policies. #733

Merged

pooneh-m mentioned this issue May 1, 2019

Rename APIServerEndpoint to AllocationEndpoint for multi-cluster allocation #755

Merged

This was referenced May 8, 2019

Add secret list and watch permissions to RBAC rules #762

Merged

Adding an allocator service that acts as a reverse proxy. #768

Merged

ilkercelikyilmaz mentioned this issue May 28, 2019

Request to become an Approver on Agones #796

Closed

pooneh-m mentioned this issue Aug 27, 2019

Refactor gameserverallocations to its components #1015

Merged

This was referenced Sep 5, 2019

Adding make file to generate allocation go from proto #1041

Merged

Define the proto definition for the allocator service #1025

Merged

pooneh-m mentioned this issue Sep 27, 2019

Fixes, more e2e tests and logging for multi-cluster allocation #1077

Merged

pooneh-m mentioned this issue Feb 3, 2020

Changing the allocator API to gRPC #1314

Merged

roberthbailey mentioned this issue May 21, 2020

[Docs] Multi-cluster Allocation #1582

Closed

pooneh-m closed this as completed Dec 10, 2020

markmandel added this to the 1.11.0 milestone Dec 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Multi-Cluster Allocation Policies #597

Proposal: Multi-Cluster Allocation Policies #597

jkowalski commented Feb 15, 2019

markmandel commented Feb 18, 2019

ilkercelikyilmaz commented Feb 18, 2019

markmandel commented Feb 18, 2019 •

edited

Loading

cyriltovena commented Feb 19, 2019 •

edited

Loading

markmandel commented Feb 19, 2019

markmandel commented Feb 19, 2019

pooneh-m commented Mar 6, 2019

ilkercelikyilmaz commented Mar 6, 2019

markmandel commented Mar 6, 2019

pooneh-m commented Mar 6, 2019

pooneh-m commented Apr 11, 2019

markmandel commented Apr 12, 2019

markmandel commented Apr 12, 2019

pooneh-m commented Apr 13, 2019

pooneh-m commented Apr 13, 2019

markmandel commented Apr 13, 2019

pooneh-m commented Apr 15, 2019 •

edited

Loading

pooneh-m commented Apr 17, 2019 •

edited

Loading

markmandel commented Apr 18, 2019

pooneh-m commented Apr 18, 2019

markmandel commented Apr 18, 2019

pooneh-m commented Apr 23, 2019

markmandel commented Apr 23, 2019

pooneh-m commented Apr 26, 2019 •

edited

Loading

markmandel commented Apr 26, 2019

markmandel commented May 8, 2019

pooneh-m commented May 8, 2019

pooneh-m commented May 22, 2019 •

edited

Loading

markmandel commented Aug 1, 2019

pooneh-m commented Aug 2, 2019 •

edited

Loading

markmandel commented Aug 2, 2019

Davidnovarro commented Aug 19, 2019

pooneh-m commented Aug 20, 2019

markmandel commented Dec 10, 2020

Proposal: Multi-Cluster Allocation Policies #597

Proposal: Multi-Cluster Allocation Policies #597

Comments

jkowalski commented Feb 15, 2019

Background

Proposal

Deployment Topologies

markmandel commented Feb 18, 2019

ilkercelikyilmaz commented Feb 18, 2019

markmandel commented Feb 18, 2019 • edited Loading

cyriltovena commented Feb 19, 2019 • edited Loading

markmandel commented Feb 19, 2019

markmandel commented Feb 19, 2019

pooneh-m commented Mar 6, 2019

ilkercelikyilmaz commented Mar 6, 2019

markmandel commented Mar 6, 2019

pooneh-m commented Mar 6, 2019

pooneh-m commented Apr 11, 2019

markmandel commented Apr 12, 2019

markmandel commented Apr 12, 2019

pooneh-m commented Apr 13, 2019

pooneh-m commented Apr 13, 2019

markmandel commented Apr 13, 2019

pooneh-m commented Apr 15, 2019 • edited Loading

pooneh-m commented Apr 17, 2019 • edited Loading

markmandel commented Apr 18, 2019

pooneh-m commented Apr 18, 2019

markmandel commented Apr 18, 2019

pooneh-m commented Apr 23, 2019

markmandel commented Apr 23, 2019

pooneh-m commented Apr 26, 2019 • edited Loading

markmandel commented Apr 26, 2019

markmandel commented May 8, 2019

pooneh-m commented May 8, 2019

pooneh-m commented May 22, 2019 • edited Loading

markmandel commented Aug 1, 2019

pooneh-m commented Aug 2, 2019 • edited Loading

markmandel commented Aug 2, 2019

Davidnovarro commented Aug 19, 2019

pooneh-m commented Aug 20, 2019

markmandel commented Dec 10, 2020

markmandel commented Feb 18, 2019 •

edited

Loading

cyriltovena commented Feb 19, 2019 •

edited

Loading

pooneh-m commented Apr 15, 2019 •

edited

Loading

pooneh-m commented Apr 17, 2019 •

edited

Loading

pooneh-m commented Apr 26, 2019 •

edited

Loading

pooneh-m commented May 22, 2019 •

edited

Loading

pooneh-m commented Aug 2, 2019 •

edited

Loading