Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for Pod deletion during unavailable controller #1279

Merged
merged 2 commits into from
Feb 7, 2020

Conversation

markmandel
Copy link
Collaborator

@markmandel markmandel commented Jan 16, 2020

If a Pod gets deleted, especially during GameServer Ready or Allocated state, and the controller is either crashed, missing or unable to access master, when the controller comes back up, the GameServer is left in a zombie state in which it could be Allocated, but there is no Pod process
to back it.

Ideally, scenarios like this shouldn't happen, but it is possible, depending on user interaction with Kubernetes, so we should cover the scenario, as it requires manual intervention to fix otherwise.

This PR implements a controller that periodically checks GameServers to ensure they have backing Pods, such that if this happens the GameServer is marked as Unhealthy, and a Fleet can eventually return to a healed, stable state, and not require manual intervention.

Closes #1170
Closes #398 (especially combined with fix for #1245)

@markmandel markmandel added kind/bug These are bugs. area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc labels Jan 16, 2020
@markmandel
Copy link
Collaborator Author

/cc @KamiMay just letting you know this is in the queue.

Would like to merge it after the stable release, to give it some time to bake.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: cb21efef-1805-474b-b047-a851eb46de71

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Copy link
Collaborator Author

markmandel commented Jan 16, 2020

Pasting here, to keep track. Looks like I introduced a panic.

"Reconciling GameServerSet"
Observed a panic: "invalid memory address or nil pointer dereference" (runtime : invalid memory address or nil pointer dereference)
/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/panic.go:679
/usr/local/go/src/runtime/panic.go:199
/usr/local/go/src/runtime/signal_unix.go:394
/go/src/agones.dev/agones/pkg/gameservers/missing.go:155
/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:153
/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:129
/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/agones.dev/agones/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/go/src/agones.dev/agones/pkg/util/workerqueue/workerqueue.go:187
/usr/local/go/src/runtime/asm_amd64.s:1357
panic: runtime : invalid memory address or nil pointer dereference [recovered]
panic: runtime : invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x12492f4

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 988c3e1b-faa4-40e0-b0f4-84c9f047a92f

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 33be285f-6095-4308-ac18-0184536fecb8

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 40d1a33f-fb9b-46d8-9a85-c376f5174350

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1279/head:pr_1279 && git checkout pr_1279
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.3.0-46afcac

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 76db447b-5b82-4711-92d1-52b9b7438da0

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1279/head:pr_1279 && git checkout pr_1279
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.3.0-d7ef191

@roberthbailey roberthbailey added the feature-freeze-do-not-merge Only eligible to be merged once we are out of feature freeze (next full release) label Jan 17, 2020
@markmandel markmandel removed the feature-freeze-do-not-merge Only eligible to be merged once we are out of feature freeze (next full release) label Jan 22, 2020
@markmandel markmandel marked this pull request as ready for review January 22, 2020 01:00
@markmandel
Copy link
Collaborator Author

Oooh! I think I know how to e2e test this. Please still free to review, but will add an e2e test shortly.

@markmandel
Copy link
Collaborator Author

Added e2e test suite for controller crashes.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 75552c42-6e57-44d5-ac05-29cb11ae41f1

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: b80cb67c-5b59-4987-b758-7d8f715e105a

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: e8ea7b45-f8a5-49cf-bd7c-ba56cd16cdcc

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1279/head:pr_1279 && git checkout pr_1279
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.4.0-1e81aca

@markmandel
Copy link
Collaborator Author

@aLekSer

@markmandel , great implementation. I was thinking of doing this logic on a start of Agones controller, but this approach way better, more general and more state of the art!
Just was thinking of a Dev Gameserver, I think they might go to unhealthy, is there any E2E test for this case?

That's a great question!

There is a check in here for dev gameservers:
https://github.com/googleforgames/agones/pull/1279/files#diff-d0f163ec75bb0e3e6d1b8db74fda705bR87

And a Unit test as well:
https://github.com/googleforgames/agones/pull/1279/files#diff-7285517b020e1edd73726337f6ba8b8fR258-R262

I didn't write a e2e, as I figure the e2e dev gameserver test would suffice (it would at least fail sometimes). The only way I could think to e2e test it is to wait 30 seconds - but do we want that slow an e2e test?

WDYT?

@roberthbailey rebased!

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: d098bbf3-2190-4785-8f8d-aae8795ccbef

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 6213e9fb-cc3e-4b61-b803-b2b9aaca5578

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1279/head:pr_1279 && git checkout pr_1279
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.4.0-7eda264

@aLekSer
Copy link
Collaborator

aLekSer commented Jan 30, 2020

That's great that Dev Gameservers are now also supported.
I think that we don't need a special E2E test case or make it 30s as long as existent one is able to fail from time to time.

@markmandel markmandel requested review from roberthbailey and removed request for EricFortin and dzlier-gcp February 5, 2020 14:56
pkg/gameservers/gameservers.go Show resolved Hide resolved
pkg/gameservers/missing.go Outdated Show resolved Hide resolved
pkg/gameservers/missing.go Outdated Show resolved Hide resolved
pkg/gameservers/missing.go Outdated Show resolved Hide resolved
pkg/gameservers/missing.go Outdated Show resolved Hide resolved
pkg/gameservers/missing.go Outdated Show resolved Hide resolved
pkg/gameservers/missing.go Outdated Show resolved Hide resolved
pkg/gameservers/missing.go Show resolved Hide resolved
If a Pod gets deleted, especially during GameServer Ready or Allocated
state, and the controller is either crashed, missing or unable to access
master, when the controller comes back up, the GameServer is left in a
zombie state in which it could be Allocated, but there is no Pod process
to back it.

Ideally, scenarios like this shouldn't happen, but it is possible,
depending on user interaction with Kubernetes, so we should cover the
scenario, as it requires manual intervention to fix otherwise.

This PR implements a controller that periodically checks GameServers to
ensure they have backing Pods, such that if this happens the GameServer
is marked as Unhealthy, and a Fleet can eventually return to a healed,
stable state, and not require manual intervention.

Closes googleforgames#1170
Closes googleforgames#398 (especially combined with fix for googleforgames#1245)
@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 1c60176f-02b8-421d-8218-451a3e566cd6

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 95345276-e7ab-4433-b60e-91408651485a

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 499b7b58-ce15-4793-8efa-39b8d45c6dfe

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: aae1b4f0-cfec-44b2-900a-7942646551b6

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1279/head:pr_1279 && git checkout pr_1279
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.4.0-fda22a9

"k8s.io/client-go/tools/cache"
)

func TestIsBeforePodCreated(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment was lost in my last review, but this test seems like it belongs in pkg/gameservers/gameservers_test.go instead of here, since it's testing code in pkg/gameservers/gameservers.go

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oooh! yeah. Moving!

pkg/gameservers/gameservers.go Show resolved Hide resolved
Copy link
Collaborator

@aLekSer aLekSer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, nothing to add to Robert's comments.

@google-oss-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aLekSer, markmandel, roberthbailey

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [markmandel,roberthbailey]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: e953efe1-a8f8-4f23-8f66-d0878c68b5a9

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1279/head:pr_1279 && git checkout pr_1279
  • helm install ./install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.4.0-118c7af

@markmandel markmandel merged commit bb05ab0 into googleforgames:master Feb 7, 2020
@markmandel markmandel deleted the bug/pod-missing branch February 7, 2020 21:06
@markmandel markmandel added this to the 1.4.0 milestone Feb 7, 2020
ilkercelikyilmaz pushed a commit to ilkercelikyilmaz/agones that referenced this pull request Oct 23, 2020
* Fix for Pod deletion during unavailable controller

If a Pod gets deleted, especially during GameServer Ready or Allocated
state, and the controller is either crashed, missing or unable to access
master, when the controller comes back up, the GameServer is left in a
zombie state in which it could be Allocated, but there is no Pod process
to back it.

Ideally, scenarios like this shouldn't happen, but it is possible,
depending on user interaction with Kubernetes, so we should cover the
scenario, as it requires manual intervention to fix otherwise.

This PR implements a controller that periodically checks GameServers to
ensure they have backing Pods, such that if this happens the GameServer
is marked as Unhealthy, and a Fleet can eventually return to a healed,
stable state, and not require manual intervention.

Closes googleforgames#1170
Closes googleforgames#398 (especially combined with fix for googleforgames#1245)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/bug These are bugs. lgtm size/XL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Agones controller shut down Moving cluster to a new node pool doesn't recreate all fleets
5 participants