
K8s: more robust stack error detection on deploy #948

Merged
merged 2 commits on Jun 4, 2018

Conversation

@simonferquel (Contributor) commented Mar 15, 2018

This supersedes #846.

This takes advantage of k8s informers (which are much more robust than using only a watch, but somewhat more complex as well). We look both at stack reconciliation errors and at pod scheduling. The logic is now: when at least one pod of each service is Ready, we consider the stack to be ready.

On a terminal, service status reporting is now interactive (instead of being forward-only), and it is correctly interrupted in case of a reconciliation failure.
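As a rough illustration of that readiness rule (a minimal sketch, not the code in this diff; the struct only mirrors the per-service pod counters that show up later in the review):

// serviceStatus mirrors the per-service pod counters used in the diff
// (illustrative shape only, not the exact struct).
type serviceStatus struct {
    name        string
    podsReady   int
    podsPending int
    podsFailed  int
}

// allServicesReady applies the rule above: the stack is considered ready
// once every service has at least one pod in the Ready state.
func allServicesReady(services []serviceStatus) bool {
    for _, s := range services {
        if s.podsReady == 0 {
            return false
        }
    }
    return true
}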

@codecov-io commented Mar 15, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@0089f17).
The diff coverage is 51.53%.

@@            Coverage Diff            @@
##             master     #948   +/-   ##
=========================================
  Coverage          ?   54.09%           
=========================================
  Files             ?      262           
  Lines             ?    16740           
  Branches          ?        0           
=========================================
  Hits              ?     9056           
  Misses            ?     7079           
  Partials          ?      605

)

func metaStateFromStatus(status serviceStatus) metaServiceState {
if status.podsReady > 0 {
Contributor

Suggestion: a switch covers all three states here:

switch {
case status.podsReady > 0:
    return metaServiceStateReady
case status.podsPending > 0:
    return metaServiceStatePending
default:
    return metaServiceStateFailed
}

}

func (d *interactiveStatusDisplay) OnStatus(status serviceStatus) {

Contributor

nit: empty line

color := aec.DefaultF
switch state {
case metaServiceStateFailed:
color = aec.RedF
Contributor

I don't know if we want some colors in the CLI. @vdemeester?

Contributor Author

This serves as a way to identify status changes more clearly (manual testing showed that just modifying the text was not clear enough).
Note that on a non-terminal, the output is sequential and without coloring (and it doesn't update as often, only when the meta-state of a service changes).

Member

I think this should be in a separate PR; so far the CLI does not use any colouring, so this should be looked at in a wider scope, and doesn't seem critical

return stop
func (w *deployWatcher) Watch(stack *apiv1beta1.Stack, serviceNames []string, statusUpdates chan serviceStatus) error {
errC := make(chan error, 1)
defer close(errC)
Contributor

I think closing the channel is mandatory only if the receiver is watching for the "close" event. Otherwise it is perfectly valid to leave it open until it is garbage-collected.

errC <- e
})
defer func() {
runtimeutil.ErrorHandlers = runtimeutil.ErrorHandlers[:len(runtimeutil.ErrorHandlers)-1]
Contributor

What happens if another part of the code appends an error handler after this one? It will be removed without notice...

Contributor Author

That is the main issue with this model, and it is not mutex-protected. I changed the code a bit to at least avoid keeping an error handler that writes to a closed channel, but still, it is not the best piece of software coming out of K8s.

Contributor

Ok...
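For reference, the guard the author describes could look roughly like this; a hypothetical sketch, not the exact PR code, assuming the client-go vintage used here where runtimeutil.ErrorHandlers is a plain []func(error):

import (
    "sync/atomic"

    runtimeutil "k8s.io/apimachinery/pkg/util/runtime"
)

// runWithErrorHandler registers a temporary global error handler and makes it
// a no-op once the watch has finished, so a stale handler (left behind if a
// racing append/remove truncates the wrong entry) can never write to errC
// after it has been closed.
func runWithErrorHandler(run func(errC chan error) error) error {
    errC := make(chan error, 1)
    var stopped int32

    runtimeutil.ErrorHandlers = append(runtimeutil.ErrorHandlers, func(err error) {
        if atomic.LoadInt32(&stopped) == 1 {
            return // watcher is done; never touch errC again
        }
        select {
        case errC <- err:
        default: // an error is already pending; drop the extra one
        }
    })
    defer func() {
        atomic.StoreInt32(&stopped, 1)
        runtimeutil.ErrorHandlers = runtimeutil.ErrorHandlers[:len(runtimeutil.ErrorHandlers)-1]
        close(errC)
    }()

    return run(errC)
}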

func (sw *stackWatcher) OnUpdate(oldObj, newObj interface{}) {
sw.OnAdd(newObj)
}
func (sw *stackWatcher) OnDelete(obj interface{}) {
Contributor

nit: empty line missing

}
}
func (sw *stackWatcher) OnUpdate(oldObj, newObj interface{}) {
Contributor

nit: empty line missing

}
func (sw *stackWatcher) OnAdd(obj interface{}) {
stack, ok := obj.(*apiv1beta1.Stack)
if !ok {
Contributor

Suggestion: the type check and the failure check can be merged into one switch:

switch {
case !ok:
    sw.resultChan <- errors.Errorf("stack %s has not the correct type", sw.stackName)
case stack.Status.Phase == apiv1beta1.StackFailure:
    sw.resultChan <- errors.Errorf("stack %s failed with status %s", sw.stackName, stack.Status.Phase)
}

pw.services[serviceName] = status
}
func (pw *podWatcher) allReady() bool {
for _, status := range pw.services {
Contributor

Can we have a data race here while iterating over pw.services?

Contributor Author

No, informers guarantee that OnAdd/OnUpdate/OnDelete are called sequentially for a given handler, so there is no concurrency here.

Contributor

Ok!
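For context, the wiring looks roughly like this (a hypothetical sketch, not the exact PR code): a single handler is registered on the shared informer, and client-go invokes its callbacks one at a time for that handler, which is why pw.services can stay an unsynchronized map.

import (
    apiv1 "k8s.io/api/core/v1"
    "k8s.io/client-go/tools/cache"
)

// startPodInformer registers the pod watcher as the informer's event handler.
// Events are delivered sequentially to this handler, so its callbacks never
// run concurrently with each other.
func startPodInformer(lw cache.ListerWatcher, pw *podWatcher, stop <-chan struct{}) {
    informer := cache.NewSharedInformer(lw, &apiv1.Pod{}, 0)
    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    pw.OnAdd,
        UpdateFunc: pw.OnUpdate,
        DeleteFunc: pw.OnDelete,
    })
    go informer.Run(stop)
}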


return true
func (pw *podWatcher) OnDelete(obj interface{}) {
p, ok := obj.(*apiv1.Pod)
Contributor

It seems that this code is duplicated 3 times (OnAdd, OnUpdate, OnDelete).
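A later diff hunk in this review shows a handlePod helper; delegating to it from all three callbacks would remove the duplication, roughly like this sketch:

func (pw *podWatcher) OnAdd(obj interface{})          { pw.handlePod(obj) }
func (pw *podWatcher) OnUpdate(_, newObj interface{}) { pw.handlePod(newObj) }
func (pw *podWatcher) OnDelete(obj interface{})       { pw.handlePod(obj) }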

@silvin-lubecki (Contributor) left a comment

LGTM

@vdemeester (Collaborator) left a comment

@simonferquel can you squash the commits and split them into 2 (one for the vendoring update and one for the actual change)?

I think it would have made the review simpler if the display changes were split from the actual error-detection pieces, but now it's done 😛

@simonferquel (Contributor Author)

@vdemeester I agree that splitting out the display changes would have made reviewing easier, but the problem I found is that the nature of the service updates we previously presented to the user did not make sense in a K8s world. So it was difficult to map the new status updates (with the number of pods per service in a running/failed state) onto the old display code (and its weird container-restart reporting).
Btw, working on this PR made me realize it would be nice to introduce readiness probes in Compose/Kubernetes (but that is a completely out-of-scope concern).

@simonferquel (Contributor Author)

Squashing done

@vdemeester (Collaborator) commented Mar 21, 2018

@simonferquel lint failure — the e2e is getting fixed in #955 (you're gonna need to rebase against master)

These files were changed:

 M vendor/k8s.io/client-go/tools/cache/store.go

@simonferquel (Contributor Author)

@vdemeester just fixed that. There is still the strange permission flag on this file that breaks vendoring from a Windows machine.

@simonferquel (Contributor Author)

@vdemeester could you update your review?

@thaJeztah (Member)

ping @simonferquel this needs a rebase 😢

// Add adds a value to the cache. Returns true if an eviction occurred.
=======
// Add adds a value to the cache. Returns true if an eviction occured.
>>>>>>> Vendoring for stack status watch + tests
Member

There's a merge conflict in here

vendor.conf Outdated
@@ -34,6 +34,7 @@ github.com/go-openapi/spec 6aced65f8501fe1217321abf0749d354824ba2ff
github.com/go-openapi/swag 1d0bd113de87027671077d3c71eb3ac5d7dbba72
github.com/gregjones/httpcache c1f8028e62adb3d518b823a2f8e6a95c38bdd3aa
github.com/grpc-ecosystem/grpc-gateway 1a03ca3bad1e1ebadaedd3abb76bc58d4ac8143b
github.com/hashicorp/golang-lru a0d98a5f288019575c6d1f4bb1573fef2d1fcdc4
Member

This is included twice in vendor.conf (there's another entry for it at the bottom of this file as well).


@@ -47,7 +47,7 @@ func (c *LRU) Purge() {
c.evictList.Init()
}

// Add adds a value to the cache. Returns true if an eviction occurred.
// Add adds a value to the cache. Returns true if an eviction occured.
Member

Wondering: if this is the only change, should we bump it at all?

@@ -47,11 +47,7 @@ func (c *LRU) Purge() {
c.evictList.Init()
}

<<<<<<< HEAD
Member

Can you move this to the vendor commit?

@@ -178,24 +174,3 @@ func newStatusDisplay(o *command.OutStream) statusDisplay {
}
return &interactiveStatusDisplay{o: o}
}

// createFileBasedConfigMaps creates a Kubernetes ConfigMap for each Compose global file-based config.
func createFileBasedConfigMaps(stackName string, globalConfigs map[string]composetypes.ConfigObjConfig, configMaps corev1.ConfigMapInterface) error {
Member

Same here, please squash this with the other commit

@simonferquel (Contributor Author)

Removed the coloring, and re-squashed everything into the vendor commit + code commit layout.

func (d *interactiveStatusDisplay) OnStatus(status serviceStatus) {
b := aec.EmptyBuilder
for ix := 0; ix < len(d.statuses); ix++ {
b = b.Up(1).EraseLine(aec.EraseModes.All)
Member

I notice we add a new dependency for this; looks like there may be duplication with the existing https://github.com/Nvveen/Gotty package that we use, for example here; https://github.com/docker/cli/blob/master/vendor/github.com/docker/docker/pkg/jsonmessage/jsonmessage.go#L165-L221

ping @ijc perhaps you could have a look?

(perhaps it's ok to have both, just want someone to double-check 🤗)

Contributor

The original issue which led (eventually) to the Gotty use was #28111 (fixed in #28238, Gotty added in follow-up #28304), solving a problem with Up(0) (and Down(0), etc.) being undefined. That isn't used here, and this aec library looks to do the right thing, but it uses plain ANSI codes rather than terminfo, so it may theoretically not be as portable; in practice, by sticking to ANSI it is probably going to work in most reasonable places.

Gotty is not maintained (my PRs made as part of the above still aren't merged), whereas aec has been maintained more recently (and has no issues or PRs open). Maybe it'd be worth moving the jsonmessage code over to it.

Member

Thanks for looking @ijc - let me open an issue in moby/moby to discuss replacing Gotty

Contributor Author

FYI, I chose this lib because of its simplicity, and because it is already used in BuildKit (which has, IMO, a very modern and good-looking UX).
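For reference, the aec pattern used by the interactive display is roughly this (a sketch, not the exact PR code): accumulate cursor-up and erase-line escapes with the builder, emit them in one write, then reprint the refreshed lines.

import (
    "fmt"
    "io"

    "github.com/morikuni/aec"
)

// redrawStatuses moves the cursor up over the previously printed block,
// erases those lines, and prints the updated lines in their place.
func redrawStatuses(o io.Writer, lines []string) {
    b := aec.EmptyBuilder
    for range lines {
        b = b.Up(1).EraseLine(aec.EraseModes.All)
    }
    fmt.Fprint(o, b.ANSI) // single write containing the accumulated escapes
    for _, l := range lines {
        fmt.Fprintln(o, l)
    }
}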

@thaJeztah (Member) left a comment

some nits and questions


func displayInteractiveServiceStatus(status serviceStatus, o io.Writer) {
state := metaStateFromStatus(status)
fmt.Fprintf(o, "%s: %s\t\t[pods details: %v ready, %v pending, %v failed]", status.name, state,
Member

  • perhaps this should be [pods status ...]? (also, should it be plural (pods status) or singular (pod status)?)
  • can you use %d for the integers?
  • would it be useful to show the total number of pods as well? (1/5 ready, 2/5 pending, 2/5 failed)


handlers := runtimeutil.ErrorHandlers

// informers errors are reported using global error handlers
Member

nit: s/informers/informer/

stack, ok := obj.(*apiv1beta1.Stack)
switch {
case !ok:
sw.resultChan <- errors.Errorf("stack %s has not the correct type", sw.stackName)
Member

stack %s has incorrect type

or invalid type?

or

incorrect type for stack %s

Wondering: should we (can we?) show the actual type (e.g. if it's an apiv1beta2.Stack)?

Contributor Author

Honestly, this error should really never happen (we are just covering for potential bugs in the way we initialize the informer). I agree we should use consistent wording, but adding extra work to put more info into the message seems a bit overkill.

func (pw *podWatcher) handlePod(obj interface{}) {
pod, ok := obj.(*apiv1.Pod)
if !ok {
pw.resultChan <- errors.New("unexpected type for a Pod")
Member

"for a Pod" seems a bit vague; can we print the Pod's name (or something?)

unexpected type for pod %s

Also, for consistency with the stacks above, we should pick the same wording (unexpected type, invalid type, or incorrect type) - not sure which one's best.

"byservice": func(obj interface{}) ([]string, error) {
pod, ok := obj.(*apiv1.Pod)
if !ok {
return nil, errors.New("pod has an unexpected type")
Member

Use the same wording here as above (or vice versa).

return cache.NewSharedInformer(
&cache.ListWatch{
ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
options.LabelSelector = "com.docker.stack.namespace=" + stackName
Member

Looks like we have defined the name for this label as a constant in two locations:

const (
// LabelNamespace is the label used to track stack resources
LabelNamespace = "com.docker.stack.namespace"
)

and

const (
// ForServiceName is the label for the service name.
ForServiceName = "com.docker.service.name"
// ForStackName is the label for the stack name.
ForStackName = "com.docker.stack.namespace"
// ForServiceID is the label for the service id.
ForServiceID = "com.docker.service.id"
)

Can you use a constant here? (Also, we should discuss unifying those constants, but not for this PR)

},

WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
options.LabelSelector = "com.docker.stack.namespace=" + stackName
Member

Same here (constant)
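A sketch of what the constant could look like in both ListFunc and WatchFunc; this assumes the ForStackName constant quoted above is importable from its labels package (package name is an assumption here, not shown in the diff):

// stackSelector builds the label selector from the shared constant instead
// of repeating the "com.docker.stack.namespace" literal in both closures.
func stackSelector(stackName string) string {
    return labels.ForStackName + "=" + stackName
}

// ...then, in both ListFunc and WatchFunc:
//     options.LabelSelector = stackSelector(stackName)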

}
func (p *testPodListWatch) Watch(opts metav1.ListOptions) (watch.Interface, error) {
return p.fake.
InvokesWatch(k8stesting.NewWatchAction(podsResource, p.ns, opts))
Member

nit: wondering why you split these (and other locations) over two lines, instead of

return s.fake.InvokesWatch(k8stesting.NewWatchAction(stacksResource, s.ns, opts))

(I don't think we do that in most places, but it may be my personal preference, so just a nit)

if d.states[status.name] != state {
d.states[status.name] = state
fmt.Fprintf(d.o, "%s: %s", status.name, state)
fmt.Fprintln(d.o)
Member

Can you add a \n to the previous fmt.Fprintf() instead?

fmt.Fprintf(d.o, "%s: %s\n", status.name, state)

Contributor Author

This is for consistency with the platform EOL conventions (\n on *nix, \r\n on Windows). But sure, I can simplify that.

Signed-off-by: Simon Ferquel <simon.ferquel@docker.com>
@thaJeztah (Member) left a comment

changes LGTM, but can you squash the last two commits?

Signed-off-by: Simon Ferquel <simon.ferquel@docker.com>
@simonferquel force-pushed the k8s-watch-stack-status branch 2 times, most recently from 935613a to f38510b on May 25, 2018 at 12:57
@simonferquel (Contributor Author)

@thaJeztah Squashing done! :)

func displayInteractiveServiceStatus(status serviceStatus, o io.Writer) {
state := metaStateFromStatus(status)
totalFailed := status.podsFailed + status.podsSucceeded + status.podsUnknown
fmt.Fprintf(o, "%[1]s: %[2]s\t\t[pod status: %[3]d/%[6]d ready, %[4]d/%[6]d pending, %[5]d/%[6]d failed]\n", status.name, state,
Contributor

I got this:

backend: Failed         [pod status: 0/0 ready, 0/0 pending, 0/0 failed]
cpp: Failed             [pod status: 0/0 ready, 0/0 pending, 0/0 failed]
dockerfile: Failed              [pod status: 0/0 ready, 0/0 pending, 0/0 failed]
golang: Pending         [pod status: 0/1 ready, 1/1 pending, 0/1 failed]
python: Failed          [pod status: 0/0 ready, 0/0 pending, 0/0 failed]

Maybe it's missing a tab between the service name and the state?

fmt.Fprintf(o, "%[1]s:\t%[2]s

Or maybe it's more complicated than that?

Member

Hm, looks like we need (the equivalent of) a tabwriter for that; I guess the problem is that "dockerfile: Failed" already reaches a tab stop, so the tab being written bumps it to the next tab stop.

Contributor

Ok that's what I had in mind...

Member

I think that would involve: instead of clearing and overwriting each line separately, buffer all lines first, then clear all lines and write them out (using a tabwriter).
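A minimal sketch of that approach (hypothetical, not code from this PR): render the rows through a tabwriter into a buffer so the columns line up, then erase the previously printed block and print the aligned result.

import (
    "bytes"
    "fmt"
    "io"
    "text/tabwriter"

    "github.com/morikuni/aec"
)

// redraw erases the previously printed block of `previous` lines and replaces
// it with freshly tab-aligned rows of (name, details) pairs.
func redraw(o io.Writer, rows [][2]string, previous int) {
    var buf bytes.Buffer
    tw := tabwriter.NewWriter(&buf, 0, 4, 2, ' ', 0)
    for _, r := range rows {
        fmt.Fprintf(tw, "%s:\t%s\n", r[0], r[1])
    }
    tw.Flush() // align all columns before anything is written to the terminal

    b := aec.EmptyBuilder
    for i := 0; i < previous; i++ {
        b = b.Up(1).EraseLine(aec.EraseModes.All)
    }
    fmt.Fprint(o, b.ANSI)
    fmt.Fprint(o, buf.String())
}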

@vdemeester (Collaborator) left a comment

LGTM 🐸

@vdemeester vdemeester merged commit eaa9149 into docker:master Jun 4, 2018
@GordonTheTurtle GordonTheTurtle added this to the 18.06.0 milestone Jun 4, 2018