Skip to content

Commit

Permalink
agones-{extensions,allocator}: Pause after cancelling context (#3843)
Browse files Browse the repository at this point in the history
* Revert "agones-{extensions,allocator}: Be more defensive about draining"

This reverts commit 68b04ee.

* agones-{extensions,allocator}: Pause after cancelling context

After #3839 went in, we noticed the flakes in
TestAllocatorAfterDeleteReplica disappear, but
TestGameServerCreationRightAfterDeletingOneExtensionsPod remained. I
looked more closely at this and I developed a theory: My guess is that
`kube-apiserver` connections to webhooks are actually "sticky" using
http(s) keepalives. If the TCP connection never closes, we'd see the
behavior we do, which is the `EOF` from kube-apiserver going to write
on a dead socket.

I looked at how to close sockets on the server side to "finish" the
drain, but realized that we actually do, by cancelling the context.
The thing is that we cancel the context and immediately exit, but it
leaves no time for anything to shut down.

I tested this and it seems to drive away the flake, without adding in
the extra delay.

Reverts #3839
  • Loading branch information
zmerlynn committed May 30, 2024
1 parent 138551e commit 80dd2ea
Show file tree
Hide file tree
Showing 7 changed files with 14 additions and 12 deletions.
3 changes: 2 additions & 1 deletion cmd/allocator/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -247,7 +247,8 @@ func main() {
podReady = false
time.Sleep(conf.ReadinessShutdownDuration)
cancelCtx()
logger.Infof("Readiness shutdown duration has passed, exiting pod")
logger.Infof("Readiness shutdown duration has passed, context cancelled")
time.Sleep(1 * time.Second) // allow a brief time for cleanup, but force exit if main doesn't
os.Exit(0)
})

Expand Down
3 changes: 2 additions & 1 deletion cmd/extensions/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -190,7 +190,8 @@ func main() {
podReady = false
time.Sleep(ctlConf.ReadinessShutdownDuration)
cancelCtx()
logger.Infof("Readiness shutdown duration has passed, exiting pod")
logger.Infof("Readiness shutdown duration has passed, context cancelled")
time.Sleep(1 * time.Second) // allow a brief time for cleanup, but force exit if main doesn't
os.Exit(0)
})

Expand Down
4 changes: 2 additions & 2 deletions install/helm/agones/templates/extensions-deployment.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -92,7 +92,7 @@ spec:
priorityClassName: {{ .Values.agones.priorityClassName }}
{{- end }}
serviceAccountName: {{ .Values.agones.serviceaccount.controller.name }}
terminationGracePeriodSeconds: {{ mul .Values.agones.extensions.readiness.periodSeconds .Values.agones.extensions.readiness.failureThreshold 5 }}
terminationGracePeriodSeconds: {{ mul .Values.agones.extensions.readiness.periodSeconds .Values.agones.extensions.readiness.failureThreshold 3 }}
containers:
- name: agones-extensions
image: "{{ .Values.agones.image.registry }}/{{ .Values.agones.image.extensions.name}}:{{ default .Values.agones.image.tag .Values.agones.image.extensions.tag }}"
Expand Down Expand Up @@ -137,7 +137,7 @@ spec:
- name: CONTAINER_NAME
value: "agones-extensions"
- name: READINESS_SHUTDOWN_DURATION
value: {{ mul .Values.agones.extensions.readiness.periodSeconds .Values.agones.extensions.readiness.failureThreshold 4 }}s
value: {{ mul .Values.agones.extensions.readiness.periodSeconds .Values.agones.extensions.readiness.failureThreshold 2 }}s
ports:
- name: webhooks
containerPort: 8081
Expand Down
4 changes: 2 additions & 2 deletions install/helm/agones/templates/service/allocation.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -213,7 +213,7 @@ spec:
{{ toYaml .Values.agones.allocator.tolerations | indent 8 }}
{{- end }}
serviceAccountName: {{ $.Values.agones.serviceaccount.allocator.name }}
terminationGracePeriodSeconds: {{ mul .Values.agones.allocator.readiness.periodSeconds .Values.agones.allocator.readiness.failureThreshold 5 }}
terminationGracePeriodSeconds: {{ mul .Values.agones.allocator.readiness.periodSeconds .Values.agones.allocator.readiness.failureThreshold 3 }}
{{- if eq .Values.agones.allocator.disableTLS false }}
volumes:
- name: tls
Expand Down Expand Up @@ -292,7 +292,7 @@ spec:
- name: ALLOCATION_BATCH_WAIT_TIME
value: {{ .Values.agones.allocator.allocationBatchWaitTime | quote }}
- name: READINESS_SHUTDOWN_DURATION
value: {{ mul .Values.agones.allocator.readiness.periodSeconds .Values.agones.extensions.readiness.failureThreshold 4 }}s
value: {{ mul .Values.agones.allocator.readiness.periodSeconds .Values.agones.extensions.readiness.failureThreshold 2 }}s
ports:
{{- if .Values.agones.allocator.service.http.enabled }}
- name: {{ .Values.agones.allocator.service.http.portName }}
Expand Down
8 changes: 4 additions & 4 deletions install/yaml/install.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -17216,7 +17216,7 @@ spec:
value: "true"
priorityClassName: agones-system
serviceAccountName: agones-controller
terminationGracePeriodSeconds: 45
terminationGracePeriodSeconds: 27
containers:
- name: agones-extensions
image: "us-docker.pkg.dev/agones-images/release/agones-extensions:1.41.0-dev"
Expand Down Expand Up @@ -17259,7 +17259,7 @@ spec:
- name: CONTAINER_NAME
value: "agones-extensions"
- name: READINESS_SHUTDOWN_DURATION
value: 36s
value: 18s
ports:
- name: webhooks
containerPort: 8081
Expand Down Expand Up @@ -17420,7 +17420,7 @@ spec:
operator: Equal
value: "true"
serviceAccountName: agones-allocator
terminationGracePeriodSeconds: 45
terminationGracePeriodSeconds: 27
volumes:
- name: tls
secret:
Expand Down Expand Up @@ -17491,7 +17491,7 @@ spec:
- name: ALLOCATION_BATCH_WAIT_TIME
value: "500ms"
- name: READINESS_SHUTDOWN_DURATION
value: 36s
value: 18s
ports:
- name: https
containerPort: 8443
Expand Down
2 changes: 1 addition & 1 deletion test/e2e/allocator/pod_termination_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ import (

const (
retryInterval = 5 * time.Second
retryTimeout = 60 * time.Second
retryTimeout = 45 * time.Second
)

func TestAllocatorAfterDeleteReplica(t *testing.T) {
Expand Down
2 changes: 1 addition & 1 deletion test/e2e/extensions/high_availability_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -84,7 +84,7 @@ func TestGameServerCreationRightAfterDeletingOneExtensionsPod(t *testing.T) {
logger.Infof("Removing one of the Extensions Pods: %v", list.Items[1].ObjectMeta.Name)
deleteAgonesExtensionsPod(ctx, t, false)

endTime := time.Now().Add(60 * time.Second)
endTime := time.Now().Add(30 * time.Second)
for time.Now().Before(endTime) {
gs := framework.DefaultGameServer(defaultNs)
logger.Infof("Creating game-server %s...", gs.Name)
Expand Down

0 comments on commit 80dd2ea

Please sign in to comment.