Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GameServer container restart before Ready, move to Unhealthy state After (v2) #1099

Merged

Conversation

markmandel
Copy link
Member

This brings our implementation inline with what our health checking documentation states that we do.

This is done by implementing extra checks in the HealthController to determine if it's appropriate to move to Unhealthy rather than allow a restart to occur.

Replaced PR #1069

Closes #956

@markmandel markmandel added the kind/bug These are bugs. label Oct 4, 2019
@markmandel markmandel changed the title GameServer container restart before Ready, move to Unhealthy state After GameServer container restart before Ready, move to Unhealthy state After (v2) Oct 4, 2019
@markmandel markmandel added the area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc label Oct 4, 2019
@agones-bot
Copy link
Collaborator

Build Failed 😱

Build Id: 5b35e2d0-dd16-4745-8bec-ff6a360671c5

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@markmandel
Copy link
Member Author

/hold

e2e test found a bug in my implementation. Need to fix!

@roberthbailey
Copy link
Member

I'm waiting for @markmandel to remove the hold before I review.

@markmandel markmandel added the feature-freeze-do-not-merge Only eligible to be merged once we are out of feature freeze (next full release) label Oct 22, 2019
@markmandel
Copy link
Member Author

/hold cancel

While I think this should stay in feature freeze, as I would like to go through a development cycle for extra confidence, it is ready for review again now.

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 2b82f742-c6c7-431b-9123-024ed8b01fb6

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1099/head:pr_1099 && git checkout pr_1099
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.1.0-2aab79d

@markmandel markmandel removed the feature-freeze-do-not-merge Only eligible to be merged once we are out of feature freeze (next full release) label Oct 29, 2019
@markmandel
Copy link
Member Author

OMG! We can finally get this reviewed an in a release! 💃 I feel like I've been working on this bug forever 😄

@roberthbailey
Copy link
Member

Now that the release is out this is the next review up on my list.

@@ -91,6 +91,10 @@ const (
// DevAddressAnnotation is an annotation to indicate that a GameServer hosted outside of Agones.
// A locally hosted GameServer is not managed by Agones it is just simply registered.
DevAddressAnnotation = "agones.dev/dev-address"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(aside) Is there a reason that this one is prefixed with "agones.dev" instead of being agones.GroupName + "/dev-address"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, we should fix this.

logger.WithError(err).WithField("gs", newGs.ObjectMeta.Name).WithField("state", newGs.Status.State).Info("no ready, trying again")
return false, nil
})
if !assert.NoError(t, err) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elsewhere I just see assert.NoError() but this one wraps that and calls FailNow. Is that to short circuit running the remainder of the test? If so, would it be simpler to just write

if err != nil {
    assert.FailNow(t, ...)
}

I find the statements that check the boolean result of the assert to be hard to read, and they are basically just checking if err is or isn't nil and calling another assert, which can be done more simply.

// also gives the process a chance to come back up
_, err = conn.Write([]byte("CRASH"))
logger.WithError(err).Info("no crash, trying again")
return false, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this return false, err instead of nil? Or will returning an error short circuit the polling function? It seems weird that the error writing the udp packet is logged but otherwise ignored.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are indeed correct - returning the error will short circuit the poll. It may return an error as the receiver may not be up and running yet, so it's useful for debugging purposes, but otherwise it is expected.

logger.WithError(err).Warn("could not send CRASH udp packet")
return false, nil
}
return true, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is successfully sending the packet sufficient to break the loop? Shouldn't we verify that the pod actually exited (e.g. the game server received the packet)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, now we have a helper method.

// now crash, should be unhealthy, since it's after being Ready
logger.Info("crashing again, should be unhealthy")
// retry on crash, as with the restarts, sometimes Go takes a moment to send this through.
err = wait.Poll(time.Second, 3*time.Minute, func() (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All three of these udp sending loops are very similar in structure (write a packet, check for state, try again). Have you considered moving them into a helper function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Done!

}

hc.recorder.Event(gs, corev1.EventTypeWarning, string(gsCopy.Status.State), "Issue with Gameserver pod")
if !skip {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be inverted.

if skip {
  // The game server hasn't been marked ready, so it doesn't transition to unhealthy. 
  return nil
}

To leave the other code with one less level of indentation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah excellent suggestion!


return nil
}

// skipUnhealthy checks the GameServer is Ready or after Ready, and what the container crash state
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is really hard to read. I tried taking a stab at rewriting it but didn't have any particularly great ways to phrase it. If you can think of a way to say it more clearly that'd be awesome.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a stab. PTAL when I push my changes up. I think it's better though.

v.setup(gs, pod)

m.KubeClient.AddReactor("list", "pods", func(action k8stesting.Action) (bool, runtime.Object, error) {
return true, &corev1.PodList{Items: []corev1.Pod{*pod}}, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this doesn't cover the branches where pod doesn't exist or k8s returns an error to put the request back into the queue.

Name: gs.Spec.Container,
State: corev1.ContainerState{Terminated: &corev1.ContainerStateTerminated{}},
}}
}, expected: true},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please put expected on a new line

},
"before ready, with no terminated container": {
setup: func(gs *agonesv1.GameServer, pod *corev1.Pod) {
gs.Status.State = agonesv1.GameServerStateScheduled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the before ready states are all the same, given the other test coverage it may not matter but you could vary which state is set to one of the other 4 states that are before ready.

This brings our implementation inline with what our health checking
documentation states that we do.

This is done by implementing extra checks in the HealthController to
determine if it's appropriate to move to Unhealthy rather than allow
a restart to occur.

Replaced PR googleforgames#1069

Closes googleforgames#956
@markmandel
Copy link
Member Author

@roberthbailey this should be good to review again 👍

@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 8344dedc-9848-4200-8c4e-c35d71ca9555

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/1099/head:pr_1099 && git checkout pr_1099
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=1.2.0-ee1c3dd

@google-oss-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: markmandel, roberthbailey

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [markmandel,roberthbailey]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@roberthbailey roberthbailey merged commit 3549367 into googleforgames:master Nov 2, 2019
@markmandel markmandel deleted the bug/restart-before-ready-v2 branch November 2, 2019 17:45
@markmandel
Copy link
Member Author

dancing!

@markmandel markmandel added this to the 1.2.0 milestone Nov 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/bug These are bugs. lgtm size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Game server container crash before Ready, should restart, not move to Unhealthy
4 participants