-
Notifications
You must be signed in to change notification settings - Fork 813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add webhook functionality into FleetAutoscaler #460
Conversation
Build Succeeded 👏 Build Id: bf3d2c2f-7e7f-4cb1-b3bc-957e1cacadef The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
|
|
||
### 2. Deploy a Webhook service for autoscaling | ||
|
||
We need to create a pod which will handle HTTP requests with json payload `FleetAutoscaleReview` and return back it with `FleetAutoscaleResponse` populated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add links of those two payload types to the endpoint specifications ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added links
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice feature, I think this will give a lot of flexibility, good job !
I've got couple of questions and issues to fix, but the overall PR is great, Thanks !
|
||
When the endpoint is called and a target Replica count is calculated and scale flag. It returns the JSON encoded FleetAutoscaleReview. | ||
|
||
To learn how to deploy the fleet to GKE, please see the tutorial [Create a Fleet (Go)](../../docs/create_fleet.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like more details of that autoscaler implementation with some example on how it behave.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the autoscaling flow example with some calculations into this file.
// AutoscalerService is the description of the webhook autoscaler service | ||
// used to form url which is accessible inside the cluster | ||
type AutoscalerService struct { | ||
Name string `json:"name"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 - we should embed the existing types, in this case WebhookClientConfig
2723e9f
to
9c8e512
Compare
Thanks for the comments. |
Build Succeeded 👏 Build Id: de15ba7e-4748-4e00-83ce-accd0d0d725b The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
|
9c8e512
to
4ed6397
Compare
Build Succeeded 👏 Build Id: bb7f5b33-7c5a-4ea9-b22d-35fbf3102a8c The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
|
4ed6397
to
22ce700
Compare
examples/autoscaler-webhook/main.go
Outdated
"io/ioutil" | ||
"net/http" | ||
|
||
v1alpha1 "agones.dev/agones/pkg/apis/stable/v1alpha1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
v1alpha1 "agones.dev/agones/pkg/apis/stable/v1alpha1" | |
"agones.dev/agones/pkg/apis/stable/v1alpha1" |
Build Succeeded 👏 Build Id: 6b215867-cab4-4387-b6df-3a4b98f08982 The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
|
22ce700
to
906ca4a
Compare
Build Succeeded 👏 Build Id: 425376df-55d3-4aa3-8702-20584f50e234 The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
|
I have added the e2e test. I need Mark to push newest change with Scale bool to agones-images/autoscaler-webhook:0.1. So e2e test would be fixed |
906ca4a
to
dea76e9
Compare
Build Failed 😱 Build Id: b23cbf69-5db6-4374-b9e3-b6cbb312951d Build Logs
|
gcr.io/agones-images/autoscaler-webhook:0.1 image updated. Rerunning the test! |
Build Succeeded 👏 Build Id: 2b63e3c8-3f46-4032-90d4-f8704b598d18 The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving some more notes - apart from some documentation things, and some smaller issues - this is looking really great.
In order to track the list of gameservers which run in your fleet you can run this command in a separate terminal tab: | ||
|
||
``` | ||
watch "kubectl get gs -o=custom-columns=NAME:.metadata.name,STATUS:.status.state,IP:.status.address,PORT:.status.ports[0].port -n default" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're now on 1.11 and have additionalPrintercolumns, this can now be:
watch "kubectl get gs -o=custom-columns=NAME:.metadata.name,STATUS:.status.state,IP:.status.address,PORT:.status.ports[0].port -n default" | |
watch "kubectl get gs -n default" |
### 2. Deploy a Webhook service for autoscaling | ||
|
||
We need to create a pod which will handle HTTP requests with json payload [`FleetAutoscaleReview`](./fleetautoscaler_spec.md#webhook-endpoint-specification) and return back it with [`FleetAutoscaleResponse`](./fleetautoscaler_spec.md#webhook-endpoint-specification) populated. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we should link to the source code here?
Something like:
You can see the source for this webhook [here](../examples/autoscaler-webhook)
And maybe some kind of description about what this auto scaling webhook does?
Also something like:
The values returned from the `FleetAutoscaleResponse` tell the Autoscaler what target size the backing Fleet should be scaled up or down to
Just to be extra explicit. How does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds much better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added more description to this step.
Normal AutoScalingFleet 35s fleetautoscaler-controller Scaling fleet simple-udp from 2 to 4 | ||
``` | ||
|
||
You can see that the fleet size has increased in particular case doubled to 4 gameservers, the autoscaler having compensated for the two allocated instances. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see that the fleet size has increased in particular case doubled to 4 gameservers, the autoscaler having compensated for the two allocated instances. | |
You can see that the fleet size has increased in particular case doubled to 4 gameservers (based on our custom logic in our webhook), the autoscaler having compensated for the two allocated instances. |
|
||
### 7. Check down scaling using Webhook Autoscaler policy | ||
|
||
If the fraction of allocated replicas in whole Replicas count would be less that threshold (0.3) than fleet would scale down by scaleFactor in our example 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the fraction of allocated replicas in whole Replicas count would be less that threshold (0.3) than fleet would scale down by scaleFactor in our example 2. | |
Based on our custom webhook deployed earlier, if the fraction of allocated replicas in whole Replicas count would be less that threshold (0.3) than fleet would scale down by scaleFactor in our example 2. |
|
||
And get gameservers command output: | ||
``` | ||
kubectl get gs -o=custom-columns=NAME:.metadata.name,STATUS:.status.state,IP:.status.address,PORT:.status.ports[0].port -n default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kubectl get gs -o=custom-columns=NAME:.metadata.name,STATUS:.status.state,IP:.status.address,PORT:.status.ports[0].port -n default | |
kubectl get gs -n default |
} | ||
defer res.Body.Close() // nolint: errcheck | ||
if res.StatusCode != http.StatusOK { | ||
return f.Status.Replicas, false, fmt.Errorf("Bad Status code from the server") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than fmt.Errorf - use the [errors](https://godoc.org/github.com/pkg/errors)
package to create / wrap errors - it gives us stack tracing and some more extra nice things. We use it all over the place, but here is one example: https://github.com/GoogleCloudPlatform/agones/blob/master/cmd/controller/main.go#L244
(Not sure if you are doing this everywhere, but we should change it in other places too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, switching to this package
} | ||
|
||
return f.Status.Replicas, false, fmt.Errorf("Wrong Policy Type") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as below, switch to errors pkg - also error messages usually start with a lower case letter. I forget why ;)
url = fmt.Sprintf("http://%s.%s.svc:8000/%s", w.Service.Name, w.Service.Namespace, servicePath) | ||
} | ||
if url == "" { | ||
return f.Status.Replicas, false, fmt.Errorf("URL was not provided") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here again.
return f.Status.Replicas, false, fmt.Errorf("URL was not provided") | |
return f.Status.Replicas, false, errors.New("URL was not provided") |
I've done this review in reverse order 😕 sorry
dc8c087
to
b8c928b
Compare
Build Failed 😱 Build Id: cc69c1ed-9aaa-4f74-9873-b13fe5b654ac Build Logs
|
Build Succeeded 👏 Build Id: 4e1bc244-43ae-4512-b15f-2b71432db096 The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
|
eb8d73a
to
388b60e
Compare
Build Failed 😱 Build Id: 00688f48-80c7-4d63-850c-a28da594cc04 Build Logs
|
388b60e
to
d2c86f1
Compare
Build Succeeded 👏 Build Id: df343635-aec8-4957-a9e0-16443c3aafd7 The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
|
d2c86f1
to
352d519
Compare
Build Failed 😱 Build Id: 22ffdc88-88b0-4eaf-99ab-f55b17f3a791 Build Logs
|
Looked like a flaky http endpoint issue. So kicking off a rebuild. Only saw 2 small things left on the todo list, I resolved all other conversations in there that were good to go. Once those are done and in, I'm happy to merge this in. We may just slip it in to 0.7.0! 🎆 |
Build Succeeded 👏 Build Id: 2b245b3d-ef40-4fb9-a0b3-85aadf4c2d91 The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
|
3c2b19c
to
1d1e8fa
Compare
Build Succeeded 👏 Build Id: 0afddda3-fef3-4dfa-8e6f-d5db07f8eabf The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
|
Build Failed 😱 Build Id: 8e49fdac-d254-4bd4-b147-8a3ee9649fca Build Logs
|
1d1e8fa
to
9e9a364
Compare
Build Succeeded 👏 Build Id: ba648615-927c-4ec3-a11d-5c81558601cc The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I found the flaky issue (and then found some other small things as I was going through), and also found some logging enhancements on the webhook that should make introspection easier.
Here are the changes as a patch too, if that is easier.
test/e2e/fleetautoscaler_test.go
Outdated
defer framework.KubeClient.CoreV1().Pods(defaultNs).Delete(pod.ObjectMeta.Name, nil) // nolint:errcheck | ||
} else { | ||
// if we could not create the autoscaler, their is no point going further | ||
logrus.Error("Failed creating autoscaler, aborting TestAutoscalerBasicFunctions") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logrus.Error
and return - this could be assert.FailNow()
:
logrus.Error("Failed creating autoscaler, aborting TestAutoscalerBasicFunctions") | |
assert.FailNow(t, "Failed creating autoscaler, aborting TestAutoscalerBasicFunctions") |
(And remove the return
statement)
Same on below as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Mark, I have applied the patch. That PullAlways parameter should help.
examples/autoscaler-webhook/main.go
Outdated
Request: faReq.Request, | ||
Response: &faResp, | ||
} | ||
logger.Infof("FleetAutoscaleReview Request: %+v ; Response: %+v", *review.Request, *review.Response) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.Infof("FleetAutoscaleReview Request: %+v ; Response: %+v", *review.Request, *review.Response) | |
logger.WithField("review", review).Info("FleetAutoscaleReview") |
Then we'll get a proper structured log, and can see what the results are.
test/e2e/fleetautoscaler_test.go
Outdated
Spec: corev1.PodSpec{ | ||
Containers: []corev1.Container{{Name: "webhook", | ||
Image: "gcr.io/agones-images/autoscaler-webhook:0.1", | ||
ImagePullPolicy: corev1.PullIfNotPresent, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ImagePullPolicy: corev1.PullIfNotPresent, | |
ImagePullPolicy: corev1.PullAlways, |
I think this is the flaky issue! There is likely older versions of the webhook on some of the nodes in the cluster, and if this gets scheduled on them, the newer version isn't pulled down.
9e9a364
to
8411467
Compare
Build Failed 😱 Build Id: 06ab90da-f0bb-4712-85c5-0f725a0c99ad Build Logs
|
Add webhook policy type into FleetAutoscaler declaration. Provided an example of a webhook pod which receives FleetAutoscaleReview with a Fleet status and based on that information calculates target replica set. This process is performed on FleetAustoscaler Sync every 30 seconds. Extends Buffer policy functionality with Webhook policy which is proposed in googleforgames#334 comments.
8411467
to
0051024
Compare
Build Succeeded 👏 Build Id: c6aef50c-b474-4dbb-a0d8-568bafd01fa9 The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
|
Just pushed the most recent version of the webhook image up to gcr. Will rerun the tests, and if all passes, I think this is good to go! |
Build Succeeded 👏 Build Id: f59057fa-4f70-4402-924e-a0dc3577e177 The following development artifacts have been built, and will exist for the next 30 days:
To install this version:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! This is good to go! 👍🙌👌
This has all been covered. And I got approval from Cyril.
Add webhook policy type into FleetAutoscaler declaration.
Provided an example of a webhook pod which receives FleetAutoscaleReview
with a Fleet status and based on that information calculates target
replica set. This process is performed on FleetAustoscaler Sync every 30
seconds. Extends Buffer policy functionality with Webhook policy which is proposed by Mark in #334 comment.