
K8s: more robust stack error detection on deploy #948

Merged
merged 2 commits on Jun 4, 2018

Conversation

@simonferquel (Contributor) commented Mar 15, 2018

This supersedes #846.

This takes advantage of k8s informers (which are much more robust than using only a watch, but somewhat more complex as well). We look both at stack reconciliation errors and at pod scheduling. The logic is now: when at least one pod of each service is Ready, we consider the stack to be ready.

On a terminal, service status reporting is now interactive (instead of being forward-only), and it is correctly interrupted in case of a reconciliation failure.
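As a rough illustration of that readiness rule (a minimal sketch, not the code in this diff; the struct only mirrors the per-service pod counters that show up later in the review):

// serviceStatus mirrors the per-service pod counters used in the diff
// (illustrative shape only, not the exact struct).
type serviceStatus struct {
    name        string
    podsReady   int
    podsPending int
    podsFailed  int
}

// allServicesReady applies the rule above: the stack is considered ready
// once every service has at least one pod in the Ready state.
func allServicesReady(services []serviceStatus) bool {
    for _, s := range services {
        if s.podsReady == 0 {
            return false
        }
    }
    return true
}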

@codecov-io commented Mar 15, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@0089f17).
The diff coverage is 51.53%.

@@            Coverage Diff            @@
##             master     #948   +/-   ##
=========================================
  Coverage          ?   54.09%           
=========================================
  Files             ?      262           
  Lines             ?    16740           
  Branches          ?        0           
=========================================
  Hits              ?     9056           
  Misses            ?     7079           
  Partials          ?      605

)

func metaStateFromStatus(status serviceStatus) metaServiceState {
if status.podsReady > 0 {
Contributor

Suggestion: a switch covers all three states here:

switch {
case status.podsReady > 0:
    return metaServiceStateReady
case status.podsPending > 0:
    return metaServiceStatePending
default:
    return metaServiceStateFailed
}

}

func (d *interactiveStatusDisplay) OnStatus(status serviceStatus) {

Contributor

nit: empty line

color := aec.DefaultF
switch state {
case metaServiceStateFailed:
color = aec.RedF
Contributor

I don't know if we want some colors in the CLI. @vdemeester?

Contributor Author

This serves as a way to identify status changes more clearly (manual testing showed that just modifying the text was not clear enough).
Note that on a non-terminal, the output is sequential and without coloring (and it doesn't update as often, only when the meta-state of a service changes).

Member

I think this should be in a separate PR; so far the CLI does not use any colouring, so this should be looked at in a wider scope, and doesn't seem critical

return stop
func (w *deployWatcher) Watch(stack *apiv1beta1.Stack, serviceNames []string, statusUpdates chan serviceStatus) error {
errC := make(chan error, 1)
defer close(errC)
Contributor

I think closing the channel is mandatory only if the receiver is watching for the "close" event. Otherwise it is perfectly valid to leave it open until it is garbage-collected.

errC <- e
})
defer func() {
runtimeutil.ErrorHandlers = runtimeutil.ErrorHandlers[:len(runtimeutil.ErrorHandlers)-1]
Contributor

What happens if another part of the code appends an error handler after this one? It will be removed without notice...

Contributor Author

That is the main issue with this model, and it is not mutex-protected. I changed the code a bit to at least avoid keeping an error handler that writes to a closed channel, but still, it is not the best piece of software coming out of K8s.

Contributor

Ok...
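For reference, the guard the author describes could look roughly like this; a hypothetical sketch, not the exact PR code, assuming the client-go vintage used here where runtimeutil.ErrorHandlers is a plain []func(error):

import (
    "sync/atomic"

    runtimeutil "k8s.io/apimachinery/pkg/util/runtime"
)

// runWithErrorHandler registers a temporary global error handler and makes it
// a no-op once the watch has finished, so a stale handler (left behind if a
// racing append/remove truncates the wrong entry) can never write to errC
// after it has been closed.
func runWithErrorHandler(run func(errC chan error) error) error {
    errC := make(chan error, 1)
    var stopped int32

    runtimeutil.ErrorHandlers = append(runtimeutil.ErrorHandlers, func(err error) {
        if atomic.LoadInt32(&stopped) == 1 {
            return // watcher is done; never touch errC again
        }
        select {
        case errC <- err:
        default: // an error is already pending; drop the extra one
        }
    })
    defer func() {
        atomic.StoreInt32(&stopped, 1)
        runtimeutil.ErrorHandlers = runtimeutil.ErrorHandlers[:len(runtimeutil.ErrorHandlers)-1]
        close(errC)
    }()

    return run(errC)
}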

func (sw *stackWatcher) OnUpdate(oldObj, newObj interface{}) {
sw.OnAdd(newObj)
}
func (sw *stackWatcher) OnDelete(obj interface{}) {
Contributor

nit: empty line missing

}
}
func (sw *stackWatcher) OnUpdate(oldObj, newObj interface{}) {
Contributor

nit: empty line missing

}
func (sw *stackWatcher) OnAdd(obj interface{}) {
stack, ok := obj.(*apiv1beta1.Stack)
if !ok {
Contributor

Suggestion: the type check and the failure check can be merged into one switch:

switch {
case !ok:
    sw.resultChan <- errors.Errorf("stack %s has not the correct type", sw.stackName)
case stack.Status.Phase == apiv1beta1.StackFailure:
    sw.resultChan <- errors.Errorf("stack %s failed with status %s", sw.stackName, stack.Status.Phase)
}

pw.services[serviceName] = status
}
func (pw *podWatcher) allReady() bool {
for _, status := range pw.services {
Contributor

Can we have a data race here while iterating over pw.services?

Contributor Author

No, informers guarantee that OnAdd/OnUpdate/OnDelete are called sequentially for a given handler, so there is no concurrency here.

Contributor

Ok!
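For context, the wiring looks roughly like this (a hypothetical sketch, not the exact PR code): a single handler is registered on the shared informer, and client-go invokes its callbacks one at a time for that handler, which is why pw.services can stay an unsynchronized map.

import (
    apiv1 "k8s.io/api/core/v1"
    "k8s.io/client-go/tools/cache"
)

// startPodInformer registers the pod watcher as the informer's event handler.
// Events are delivered sequentially to this handler, so its callbacks never
// run concurrently with each other.
func startPodInformer(lw cache.ListerWatcher, pw *podWatcher, stop <-chan struct{}) {
    informer := cache.NewSharedInformer(lw, &apiv1.Pod{}, 0)
    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    pw.OnAdd,
        UpdateFunc: pw.OnUpdate,
        DeleteFunc: pw.OnDelete,
    })
    go informer.Run(stop)
}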


return true
func (pw *podWatcher) OnDelete(obj interface{}) {
p, ok := obj.(*apiv1.Pod)
Contributor

It seems that this code is duplicated 3 times (OnAdd, OnUpdate, OnDelete).
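A later diff hunk in this review shows a handlePod helper; delegating to it from all three callbacks would remove the duplication, roughly like this sketch:

func (pw *podWatcher) OnAdd(obj interface{})          { pw.handlePod(obj) }
func (pw *podWatcher) OnUpdate(_, newObj interface{}) { pw.handlePod(newObj) }
func (pw *podWatcher) OnDelete(obj interface{})       { pw.handlePod(obj) }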

@silvin-lubecki (Contributor) left a comment

LGTM

@vdemeester (Collaborator) left a comment

@simonferquel can you squash the commits and split them into 2 (one for the vendoring update and one for the actual change)?

I think it would have made the review simpler if the display changes were split from the actual error-detection pieces, but now it's done 😛

@simonferquel (Contributor Author)

@vdemeester I agree that splitting out the display changes would have made reviewing easier, but the problem I found is that the nature of the service updates we previously presented to the user did not make sense in a K8s world. So it was difficult to map the new status updates (with the number of pods per service in a running/failed state) onto the old display code (and its weird container-restart reporting).
Btw, working on this PR made me realize it would be nice to introduce readiness probes in Compose/Kubernetes (but that is a completely out-of-scope concern).

@simonferquel (Contributor Author)

Squashing done

@vdemeester (Collaborator) commented Mar 21, 2018

@simonferquel lint failure — the e2e is getting fixed in #955 (you're gonna need to rebase against master)

These files were changed:

 M vendor/k8s.io/client-go/tools/cache/store.go

@simonferquel (Contributor Author)

@vdemeester just fixed that. There is still the strange permission flag on this file that breaks vendoring from a Windows machine.

@simonferquel (Contributor Author)

@vdemeester could you update your review?

@thaJeztah (Member)

ping @simonferquel this needs a rebase 😢

// Add adds a value to the cache. Returns true if an eviction occurred.
=======
// Add adds a value to the cache. Returns true if an eviction occured.
>>>>>>> Vendoring for stack status watch + tests
Member

There's a merge conflict in here

vendor.conf Outdated
@@ -34,6 +34,7 @@ github.com/go-openapi/spec 6aced65f8501fe1217321abf0749d354824ba2ff
github.com/go-openapi/swag 1d0bd113de87027671077d3c71eb3ac5d7dbba72
github.com/gregjones/httpcache c1f8028e62adb3d518b823a2f8e6a95c38bdd3aa
github.com/grpc-ecosystem/grpc-gateway 1a03ca3bad1e1ebadaedd3abb76bc58d4ac8143b
github.com/hashicorp/golang-lru a0d98a5f288019575c6d1f4bb1573fef2d1fcdc4
Member

This is included twice in vendor.conf (there's another entry for it at the bottom of this file as well).


@@ -47,7 +47,7 @@ func (c *LRU) Purge() {
c.evictList.Init()
}

// Add adds a value to the cache. Returns true if an eviction occurred.
// Add adds a value to the cache. Returns true if an eviction occured.
Member

Wondering: if this is the only change, should we bump it at all?

@@ -47,11 +47,7 @@ func (c *LRU) Purge() {
c.evictList.Init()
}

<<<<<<< HEAD
Member

Can you move this to the vendor commit?

@@ -178,24 +174,3 @@ func newStatusDisplay(o *command.OutStream) statusDisplay {
}
return &interactiveStatusDisplay{o: o}
}

// createFileBasedConfigMaps creates a Kubernetes ConfigMap for each Compose global file-based config.
func createFileBasedConfigMaps(stackName string, globalConfigs map[string]composetypes.ConfigObjConfig, configMaps corev1.ConfigMapInterface) error {
Member

Same here, please squash this with the other commit

@simonferquel (Contributor Author)

Removed the coloring, and re-squashed everything into the vendor commit + code commit layout.

func (d *interactiveStatusDisplay) OnStatus(status serviceStatus) {
b := aec.EmptyBuilder
for ix := 0; ix < len(d.statuses); ix++ {
b = b.Up(1).EraseLine(aec.EraseModes.All)
Member

I notice we add a new dependency for this; looks like there may be duplication with the existing https://github.com/Nvveen/Gotty package that we use, for example here; https://github.com/docker/cli/blob/master/vendor/github.com/docker/docker/pkg/jsonmessage/jsonmessage.go#L165-L221

ping @ijc perhaps you could have a look?

(perhaps it's ok to have both, just want someone to double-check 🤗)

Contributor

The original issue which led (eventually) to the Gotty use was #28111 (fixed in #28238, Gotty added in follow-up #28304), solving a problem with Up(0) (and Down(0), etc.) being undefined. That isn't used here, and this aec library looks to do the right thing, but it uses plain ANSI codes rather than terminfo, so it may theoretically not be as portable; in practice, by sticking to ANSI it is probably going to work in most reasonable places.

Gotty is not maintained (my PRs made as part of the above still aren't merged), whereas aec has been maintained more recently (and has no issues or PRs open). Maybe it'd be worth moving the jsonmessage code over to it.

Member

Thanks for looking @ijc - let me open an issue in moby/moby to discuss replacing Gotty

Contributor Author

FYI, I chose this lib because of its simplicity, and because it is already used in BuildKit (which has, IMO, a very modern and good-looking UX).
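For reference, the aec pattern used by the interactive display is roughly this (a sketch, not the exact PR code): accumulate cursor-up and erase-line escapes with the builder, emit them in one write, then reprint the refreshed lines.

import (
    "fmt"
    "io"

    "github.com/morikuni/aec"
)

// redrawStatuses moves the cursor up over the previously printed block,
// erases those lines, and prints the updated lines in their place.
func redrawStatuses(o io.Writer, lines []string) {
    b := aec.EmptyBuilder
    for range lines {
        b = b.Up(1).EraseLine(aec.EraseModes.All)
    }
    fmt.Fprint(o, b.ANSI) // single write containing the accumulated escapes
    for _, l := range lines {
        fmt.Fprintln(o, l)
    }
}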

@thaJeztah (Member) left a comment

some nits and questions


func displayInteractiveServiceStatus(status serviceStatus, o io.Writer) {
state := metaStateFromStatus(status)
fmt.Fprintf(o, "%s: %s\t\t[pods details: %v ready, %v pending, %v failed]", status.name, state,
Member

  • perhaps this should be [pods status ...]? (also, should it be plural (pods status) or singular (pod status)?)
  • can you use %d for the integers?
  • would it be useful to show the total number of pods as well? (1/5 ready, 2/5 pending, 2/5 failed)


handlers := runtimeutil.ErrorHandlers

// informers errors are reported using global error handlers
Member

nit: s/informers/informer/

stack, ok := obj.(*apiv1beta1.Stack)
switch {
case !ok:
sw.resultChan <- errors.Errorf("stack %s has not the correct type", sw.stackName)
Member

stack %s has incorrect type

or invalid type?

or

incorrect type for stack %s

Wondering: should we (can we?) show the actual type (e.g. if it's an apiv1beta2.Stack)?

Contributor Author

Honestly, this error should really never happen (we are just covering for potential bugs in the way we initialize the informer). I agree we should use consistent wording, but adding extra work to put more info into the message seems a bit overkill.

func (pw *podWatcher) handlePod(obj interface{}) {
pod, ok := obj.(*apiv1.Pod)
if !ok {
pw.resultChan <- errors.New("unexpected type for a Pod")
Member

"for a Pod" seems a bit vague; can we print the Pod's name (or something?)

unexpected type for pod %s

Also, for consistency with the stacks above, we should pick the same wording (unexpected type, invalid type, or incorrect type) - not sure which one's best.

"byservice": func(obj interface{}) ([]string, error) {
pod, ok := obj.(*apiv1.Pod)
if !ok {
return nil, errors.New("pod has an unexpected type")
Member

Use the same wording here as above (or vice versa).

return cache.NewSharedInformer(
&cache.ListWatch{
ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
options.LabelSelector = "com.docker.stack.namespace=" + stackName
Member

Looks like we have defined the name for this label as a constant in two locations:

const (
// LabelNamespace is the label used to track stack resources
LabelNamespace = "com.docker.stack.namespace"
)

and

const (
// ForServiceName is the label for the service name.
ForServiceName = "com.docker.service.name"
// ForStackName is the label for the stack name.
ForStackName = "com.docker.stack.namespace"
// ForServiceID is the label for the service id.
ForServiceID = "com.docker.service.id"
)

Can you use a constant here? (Also, we should discuss unifying those constants, but not for this PR)

},

WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
options.LabelSelector = "com.docker.stack.namespace=" + stackName
Member

Same here (constant)
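A sketch of what the constant could look like in both ListFunc and WatchFunc; this assumes the ForStackName constant quoted above is importable from its labels package (package name is an assumption here, not shown in the diff):

// stackSelector builds the label selector from the shared constant instead
// of repeating the "com.docker.stack.namespace" literal in both closures.
func stackSelector(stackName string) string {
    return labels.ForStackName + "=" + stackName
}

// ...then, in both ListFunc and WatchFunc:
//     options.LabelSelector = stackSelector(stackName)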

}
func (p *testPodListWatch) Watch(opts metav1.ListOptions) (watch.Interface, error) {
return p.fake.
InvokesWatch(k8stesting.NewWatchAction(podsResource, p.ns, opts))
Member

nit: wondering why you split these (and other locations) over two lines, instead of

return s.fake.InvokesWatch(k8stesting.NewWatchAction(stacksResource, s.ns, opts))

(I don't think we do that in most places, but it may be my personal preference, so just a nit)

if d.states[status.name] != state {
d.states[status.name] = state
fmt.Fprintf(d.o, "%s: %s", status.name, state)
fmt.Fprintln(d.o)
Member

Can you add a \n to the previous fmt.Fprintf() instead?

fmt.Fprintf(d.o, "%s: %s\n", status.name, state)

Contributor Author

This is for consistency with the platform EOL conventions (\n on *nix, \r\n on Windows). But sure, I can simplify that.

Signed-off-by: Simon Ferquel <simon.ferquel@docker.com>
@thaJeztah (Member) left a comment

changes LGTM, but can you squash the last two commits?

Signed-off-by: Simon Ferquel <simon.ferquel@docker.com>
@simonferquel force-pushed the k8s-watch-stack-status branch 2 times, most recently from 935613a to f38510b on May 25, 2018 at 12:57
@simonferquel (Contributor Author)

@thaJeztah Squashing done! :)

func displayInteractiveServiceStatus(status serviceStatus, o io.Writer) {
state := metaStateFromStatus(status)
totalFailed := status.podsFailed + status.podsSucceeded + status.podsUnknown
fmt.Fprintf(o, "%[1]s: %[2]s\t\t[pod status: %[3]d/%[6]d ready, %[4]d/%[6]d pending, %[5]d/%[6]d failed]\n", status.name, state,
Contributor

I got this:

backend: Failed         [pod status: 0/0 ready, 0/0 pending, 0/0 failed]
cpp: Failed             [pod status: 0/0 ready, 0/0 pending, 0/0 failed]
dockerfile: Failed              [pod status: 0/0 ready, 0/0 pending, 0/0 failed]
golang: Pending         [pod status: 0/1 ready, 1/1 pending, 0/1 failed]
python: Failed          [pod status: 0/0 ready, 0/0 pending, 0/0 failed]

Maybe it's missing a tab between the service name and the state?

fmt.Fprintf(o, "%[1]s:\t%[2]s

Or maybe it's more complicated than that?

Member

Hm, looks like we need (the equivalent of) a tabwriter for that; I guess the problem is that "dockerfile: Failed" already reaches a tab stop, so the tab being written bumps it to the next tab stop.

Contributor

Ok that's what I had in mind...

Member

I think that would involve: instead of clearing and overwriting each line separately, buffer all lines first, then clear all lines and write them out (using a tabwriter).
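A minimal sketch of that approach (hypothetical, not code from this PR): render the rows through a tabwriter into a buffer so the columns line up, then erase the previously printed block and print the aligned result.

import (
    "bytes"
    "fmt"
    "io"
    "text/tabwriter"

    "github.com/morikuni/aec"
)

// redraw erases the previously printed block of `previous` lines and replaces
// it with freshly tab-aligned rows of (name, details) pairs.
func redraw(o io.Writer, rows [][2]string, previous int) {
    var buf bytes.Buffer
    tw := tabwriter.NewWriter(&buf, 0, 4, 2, ' ', 0)
    for _, r := range rows {
        fmt.Fprintf(tw, "%s:\t%s\n", r[0], r[1])
    }
    tw.Flush() // align all columns before anything is written to the terminal

    b := aec.EmptyBuilder
    for i := 0; i < previous; i++ {
        b = b.Up(1).EraseLine(aec.EraseModes.All)
    }
    fmt.Fprint(o, b.ANSI)
    fmt.Fprint(o, buf.String())
}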

@vdemeester (Collaborator) left a comment

LGTM 🐸

@vdemeester vdemeester merged commit eaa9149 into docker:master Jun 4, 2018
@GordonTheTurtle GordonTheTurtle added this to the 18.06.0 milestone Jun 4, 2018