K8s: more robust stack error detection on deploy #948
Conversation
Force-pushed from f9db2cc to 55db783
Codecov Report
@@ Coverage Diff @@
## master #948 +/- ##
=========================================
Coverage ? 54.09%
=========================================
Files ? 262
Lines ? 16740
Branches ? 0
=========================================
Hits ? 9056
Misses ? 7079
Partials ? 605
Force-pushed from 55db783 to b608cc5
Force-pushed from 000bca4 to a992e87
func metaStateFromStatus(status serviceStatus) metaServiceState {
    if status.podsReady > 0 {
switch {
case status.podsReady > 0:
    return metaServiceStateReady
case status.podsPending > 0:
    return metaServiceStatePending
default:
    return metaServiceStateFailed
}
}

func (d *interactiveStatusDisplay) OnStatus(status serviceStatus) {
nit: empty line
color := aec.DefaultF
switch state {
case metaServiceStateFailed:
    color = aec.RedF
I don't know if we want some colors in the CLI. @vdemeester ?
This serves as a way to identify status changes more clearly (manual testing showed that just modifying the text was not clear enough).
Note that on a non-terminal, output is sequential and without coloring (and doesn't update as often, only when the meta-state of a service changes).
I think this should be in a separate PR; so far the CLI does not use any colouring, so this should be looked at in a wider scope, and doesn't seem critical
return stop

func (w *deployWatcher) Watch(stack *apiv1beta1.Stack, serviceNames []string, statusUpdates chan serviceStatus) error {
    errC := make(chan error, 1)
    defer close(errC)
I think that closing the channel is mandatory only if the receiver is watching for the "close" event. Otherwise it is perfectly valid to leave it open until it is garbage-collected.
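A tiny illustration of that point (hypothetical, not PR code): the receive below completes whether or not the channel is ever closed; close only matters when the receiver ranges over the channel or checks the closed flag.

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	errC := make(chan error, 1)

	go func() {
		errC <- errors.New("stack failed") // one value, channel never closed
	}()

	// A plain receive does not need the channel to be closed; the channel is
	// garbage-collected once unreachable, closed or not.
	fmt.Println(<-errC)

	// close(errC) would only be required for receivers that rely on the close
	// signal, e.g. `for err := range errC { ... }`.
}
```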
errC <- e
})
defer func() {
    runtimeutil.ErrorHandlers = runtimeutil.ErrorHandlers[:len(runtimeutil.ErrorHandlers)-1]
What happens if another part of the code adds an error handler after this one? It will be removed without notice...
That is the main issue with this model, and this is not mutex protected. I changed the code a bit to at least avoid keeping an error handler that writes on a closed channel, but still, it is not the best piece of software coming from K8s
Ok...
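A sketch of the pattern under discussion (hypothetical function name; runtimeutil is k8s.io/apimachinery/pkg/util/runtime, whose ErrorHandlers was a plain package-level slice of func(error) at the time of this PR). The select guard is the mitigation Simon describes (no write after cleanup), but popping by slicing is still racy if other code appended a handler in between, which is the concern raised above.

```go
package stack

import (
	runtimeutil "k8s.io/apimachinery/pkg/util/runtime"
)

// pushErrorHandler appends a handler that forwards informer errors to errC and
// returns a cleanup func. The guard stops writes once cleanup has run, but the
// pop is unsafe if other code appended handlers meanwhile, since ErrorHandlers
// is a global slice with no mutex.
func pushErrorHandler(errC chan error) (cleanup func()) {
	done := make(chan struct{})
	runtimeutil.ErrorHandlers = append(runtimeutil.ErrorHandlers, func(err error) {
		select {
		case <-done: // watcher finished; drop the error
		case errC <- err:
		default: // channel full; drop rather than block the handler
		}
	})
	return func() {
		close(done)
		runtimeutil.ErrorHandlers = runtimeutil.ErrorHandlers[:len(runtimeutil.ErrorHandlers)-1]
	}
}
```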
func (sw *stackWatcher) OnUpdate(oldObj, newObj interface{}) {
    sw.OnAdd(newObj)
}
func (sw *stackWatcher) OnDelete(obj interface{}) {
nit: empty line missing
    }
}
func (sw *stackWatcher) OnUpdate(oldObj, newObj interface{}) {
nit: empty line missing
}
func (sw *stackWatcher) OnAdd(obj interface{}) {
    stack, ok := obj.(*apiv1beta1.Stack)
    if !ok {
switch {
case !ok:
    sw.resultChan <- errors.Errorf("stack %s has not the correct type", sw.stackName)
case stack.Status.Phase == apiv1beta1.StackFailure:
    sw.resultChan <- errors.Errorf("stack %s failed with status %s", sw.stackName, stack.Status.Phase)
}
pw.services[serviceName] = status
}
func (pw *podWatcher) allReady() bool {
    for _, status := range pw.services {
Can we have a data race condition here while iterating on pw.services?
No, informers guarantee that OnAdd/OnUpdate/OnDelete are called sequentially for a given handler, so there is no concurrency there.
Ok!
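For context, a minimal client-go sketch (not from this PR) of how a single handler receives all three callbacks; the shared informer delivers events to each registered handler one at a time, which is why the map needs no mutex:

```go
package stack

import "k8s.io/client-go/tools/cache"

// registerPodHandler registers one handler; its AddFunc/UpdateFunc/DeleteFunc
// are invoked sequentially for this handler, never concurrently, so the
// counters map can be mutated without locking.
func registerPodHandler(informer cache.SharedInformer, counters map[string]int) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { counters["adds"]++ },
		UpdateFunc: func(oldObj, newObj interface{}) { counters["updates"]++ },
		DeleteFunc: func(obj interface{}) { counters["deletes"]++ },
	})
}
```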
return true

func (pw *podWatcher) OnDelete(obj interface{}) {
    p, ok := obj.(*apiv1.Pod)
It seems that this code is duplicated 3 times (OnAdd, OnUpdate, OnDelete).
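A rough sketch of the suggested de-duplication (the final revision does end up with a handlePod helper, visible further down in this conversation); the struct fields and helper body here are simplified placeholders:

```go
package stack

import (
	"errors"

	apiv1 "k8s.io/api/core/v1"
)

type podWatcher struct {
	resultChan chan error
}

// handlePod centralizes the type assertion that OnAdd/OnUpdate/OnDelete
// previously each duplicated.
func (pw *podWatcher) handlePod(obj interface{}) {
	pod, ok := obj.(*apiv1.Pod)
	if !ok {
		pw.resultChan <- errors.New("unexpected type for a Pod")
		return
	}
	_ = pod // update per-service pod counters here
}

func (pw *podWatcher) OnAdd(obj interface{})          { pw.handlePod(obj) }
func (pw *podWatcher) OnUpdate(_, newObj interface{}) { pw.handlePod(newObj) }
func (pw *podWatcher) OnDelete(obj interface{})       { pw.handlePod(obj) }
```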
LGTM
Force-pushed from 0b11e8f to 119d682
@simonferquel can you squash the commits and split them into 2 (one for the vendoring update and one for the actual change)?
I think it would have made the review simpler if the display changes were split from the actual error detection pieces, but now it's done 😛
@vdemeester I agree that splitting out the display changes would have made reviewing easier, but the problem I found is that the nature of the service updates we presented to the user previously did not make sense in a K8s world. So it was difficult to map the new status updates (with the number of PODs per service in a running/failed state) onto the old display code (and its weird container-restarts thing).
Force-pushed from 119d682 to be0fd97
Squashing done
Force-pushed from be0fd97 to 23f4ce6
@simonferquel lint failure — the e2e is getting fixed in #955 (you're gonna need to rebase against master)
@vdemeester just fixed that. There's still the strange permission flag on this file that breaks vendoring from a Windows machine.
Force-pushed from 23f4ce6 to 5f4fd3c
@vdemeester could you update your review?
ping @simonferquel this needs a rebase 😢
Force-pushed from 5f4fd3c to bf94013
Force-pushed from 5452829 to 832574a
// Add adds a value to the cache. Returns true if an eviction occurred.
=======
// Add adds a value to the cache. Returns true if an eviction occured.
>>>>>>> Vendoring for stack status watch + tests
There's a merge conflict in here
vendor.conf (Outdated)
@@ -34,6 +34,7 @@ github.com/go-openapi/spec 6aced65f8501fe1217321abf0749d354824ba2ff
github.com/go-openapi/swag 1d0bd113de87027671077d3c71eb3ac5d7dbba72
github.com/gregjones/httpcache c1f8028e62adb3d518b823a2f8e6a95c38bdd3aa
github.com/grpc-ecosystem/grpc-gateway 1a03ca3bad1e1ebadaedd3abb76bc58d4ac8143b
github.com/hashicorp/golang-lru a0d98a5f288019575c6d1f4bb1573fef2d1fcdc4
This is included twice in vendor.conf (there's another entry for it at the bottom of this file as well).
@@ -47,7 +47,7 @@ func (c *LRU) Purge() {
    c.evictList.Init()
}

// Add adds a value to the cache. Returns true if an eviction occurred.
// Add adds a value to the cache. Returns true if an eviction occured.
Wondering, if this is the only change, whether we should bump at all.
@@ -47,11 +47,7 @@ func (c *LRU) Purge() {
    c.evictList.Init()
}

<<<<<<< HEAD
Can you move this to the vendor commit?
@@ -178,24 +174,3 @@ func newStatusDisplay(o *command.OutStream) statusDisplay {
    }
    return &interactiveStatusDisplay{o: o}
}

// createFileBasedConfigMaps creates a Kubernetes ConfigMap for each Compose global file-based config.
func createFileBasedConfigMaps(stackName string, globalConfigs map[string]composetypes.ConfigObjConfig, configMaps corev1.ConfigMapInterface) error {
Same here, please squash this with the other commit
Force-pushed from 832574a to ec3c96f
Removed coloring, and re-squashed everything into the vendor+code PR style.
func (d *interactiveStatusDisplay) OnStatus(status serviceStatus) {
    b := aec.EmptyBuilder
    for ix := 0; ix < len(d.statuses); ix++ {
        b = b.Up(1).EraseLine(aec.EraseModes.All)
I notice we add a new dependency for this; looks like there may be duplication with the existing https://github.com/Nvveen/Gotty package that we use, for example here; https://github.com/docker/cli/blob/master/vendor/github.com/docker/docker/pkg/jsonmessage/jsonmessage.go#L165-L221
ping @ijc perhaps you could have a look?
(perhaps it's ok to have both, just want someone to double-check 🤗)
The original issue which led (eventually) to Gotty use was #28111 (fixed in #28238, Gotty in followup #28304), solving a problem with Up(0) (and Down(0) etc.) being undefined, but that isn't used here. This aec library looks to do the right thing, but it is just using plain ANSI codes rather than terminfo, so it may theoretically not be as portable; in practice, by sticking to ANSI it's probably going to work in most reasonable places.
Gotty is not maintained (my PRs made as part of the above still aren't merged), whereas aec has been maintained more recently (and has no issues or PRs open). Maybe it'd be worth moving the jsonmessage stuff over to that.
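For anyone comparing the two libraries: the aec usage in this PR boils down to building a cursor-up plus erase-line sequence per previously drawn status and emitting it before redrawing. A minimal standalone sketch, assuming github.com/morikuni/aec (the library vendored here):

```go
package main

import (
	"fmt"

	"github.com/morikuni/aec"
)

func main() {
	statuses := []string{"front: Pending", "back: Pending"}
	for _, s := range statuses {
		fmt.Println(s)
	}

	// Rewind over the lines just printed and erase them, using plain ANSI
	// escape codes (no terminfo lookup, as discussed above).
	b := aec.EmptyBuilder
	for range statuses {
		b = b.Up(1).EraseLine(aec.EraseModes.All)
	}
	fmt.Print(b.ANSI)

	// Redraw the updated statuses in place.
	fmt.Println("front: Ready")
	fmt.Println("back: Ready")
}
```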
Thanks for looking @ijc - let me open an issue in moby/moby to discuss replacing Gotty
Fyi, I chose this lib because of its simplicity, and because it is already used in buildkit (which has IMO a very modern and good looking UX)
some nits and questions
func displayInteractiveServiceStatus(status serviceStatus, o io.Writer) {
    state := metaStateFromStatus(status)
    fmt.Fprintf(o, "%s: %s\t\t[pods details: %v ready, %v pending, %v failed]", status.name, state,
- perhaps this should be "[pods status .." (also, should it be plural ("pods status") or singular ("pod status"))?
- can you use %d for the integers?
- would it be useful to show the total number of pods as well? (1/5 ready, 2/5 pending, 2/5 failed)
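If it helps, the three suggestions combined could look roughly like this (a sketch only: serviceStatus, its pods* counters, and metaStateFromStatus are the names from the diff above; summing the three counters into a total is an assumption about how the denominator would be computed):

```go
func displayInteractiveServiceStatus(status serviceStatus, o io.Writer) {
	state := metaStateFromStatus(status)
	// Assumed total: every pod is either ready, pending, or failed.
	total := status.podsReady + status.podsPending + status.podsFailed
	fmt.Fprintf(o, "%s: %s\t\t[pod status: %d/%d ready, %d/%d pending, %d/%d failed]\n",
		status.name, state,
		status.podsReady, total, status.podsPending, total, status.podsFailed, total)
}
```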
handlers := runtimeutil.ErrorHandlers

// informers errors are reported using global error handlers
nit: s/informers/informer/
stack, ok := obj.(*apiv1beta1.Stack)
switch {
case !ok:
    sw.resultChan <- errors.Errorf("stack %s has not the correct type", sw.stackName)
"stack %s has incorrect type" or "invalid type"? Or "incorrect type for stack %s"?
Wondering; should we (can we?) show the actual type? (e.g. if it's an apiv1beta2.Stack)
Honestly, this error should really not happen (we are just covering for potential bugs in the way we initialize the informer). I agree we should use consistent wording, but the extra work of adding more info to the message seems a bit overkill.
func (pw *podWatcher) handlePod(obj interface{}) {
    pod, ok := obj.(*apiv1.Pod)
    if !ok {
        pw.resultChan <- errors.New("unexpected type for a Pod")
"for a Pod" seems a bit vague; can we print the Pod's name (or something?)
unexpected type for pod %s
Also, for consistency with stacks above; we should pick the same wording (unexpected type
, invalid type
, or incorrect type
) - pick one (not sure which one's best)
"byservice": func(obj interface{}) ([]string, error) { | ||
pod, ok := obj.(*apiv1.Pod) | ||
if !ok { | ||
return nil, errors.New("pod has an unexpected type") |
Use same wording here as above (or vice-versa)
return cache.NewSharedInformer(
    &cache.ListWatch{
        ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
            options.LabelSelector = "com.docker.stack.namespace=" + stackName
Looks like we have defined the name for this label as a constant in two locations;
cli/cli/compose/convert/compose.go
Lines 13 to 16 in 236a847
const (
    // LabelNamespace is the label used to track stack resources
    LabelNamespace = "com.docker.stack.namespace"
)
and
cli/kubernetes/labels/labels.go
Lines 8 to 15 in 7ad3036
const (
    // ForServiceName is the label for the service name.
    ForServiceName = "com.docker.service.name"
    // ForStackName is the label for the stack name.
    ForStackName = "com.docker.stack.namespace"
    // ForServiceID is the label for the service id.
    ForServiceID = "com.docker.service.id"
)
Can you use a constant here? (Also, we should discuss unifying those constants, but not for this PR)
},
WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
    options.LabelSelector = "com.docker.stack.namespace=" + stackName
Same here (constant)
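A sketch of what the constant-based selector could look like, assuming the labels package quoted above is importable as github.com/docker/cli/kubernetes/labels (the helper names here are hypothetical):

```go
package stack

import (
	"github.com/docker/cli/kubernetes/labels"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// withStackSelector sets the selector from the shared constant so that
// ListFunc and WatchFunc cannot drift from a hand-typed
// "com.docker.stack.namespace" string.
func withStackSelector(opts metav1.ListOptions, stackName string) metav1.ListOptions {
	opts.LabelSelector = labels.ForStackName + "=" + stackName
	return opts
}
```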
}
func (p *testPodListWatch) Watch(opts metav1.ListOptions) (watch.Interface, error) {
    return p.fake.
        InvokesWatch(k8stesting.NewWatchAction(podsResource, p.ns, opts))
nit: wondering why you split these (and other locations) over two lines, instead of
return s.fake.InvokesWatch(k8stesting.NewWatchAction(stacksResource, s.ns, opts))
(don't think we do that in most places, but may be my personal preference, so just a nit)
if d.states[status.name] != state {
    d.states[status.name] = state
    fmt.Fprintf(d.o, "%s: %s", status.name, state)
    fmt.Fprintln(d.o)
Can you add a \n to the previous fmt.Fprintf() instead?
fmt.Fprintf(d.o, "%s: %s\n", status.name, state)
This is for consistency with the platform EOL conventions (\n on *nix, \r\n on Windows). But sure, I can simplify that.
Signed-off-by: Simon Ferquel <simon.ferquel@docker.com>
Force-pushed from aaa57fb to fff839c
changes LGTM, but can you squash the last two commits?
Signed-off-by: Simon Ferquel <simon.ferquel@docker.com>
Force-pushed from 935613a to f38510b
@thaJeztah Squashing done! :)
func displayInteractiveServiceStatus(status serviceStatus, o io.Writer) {
    state := metaStateFromStatus(status)
    totalFailed := status.podsFailed + status.podsSucceeded + status.podsUnknown
    fmt.Fprintf(o, "%[1]s: %[2]s\t\t[pod status: %[3]d/%[6]d ready, %[4]d/%[6]d pending, %[5]d/%[6]d failed]\n", status.name, state,
I got this:
backend: Failed [pod status: 0/0 ready, 0/0 pending, 0/0 failed]
cpp: Failed [pod status: 0/0 ready, 0/0 pending, 0/0 failed]
dockerfile: Failed [pod status: 0/0 ready, 0/0 pending, 0/0 failed]
golang: Pending [pod status: 0/1 ready, 1/1 pending, 0/1 failed]
python: Failed [pod status: 0/0 ready, 0/0 pending, 0/0 failed]
Maybe it's missing a tab between the service name and the state?
fmt.Fprintf(o, "%[1]s:\t%[2]s
Or maybe it's more complicated than that?
Hm, looks like we need (the equivalent of) a tabwriter for that; guess the problem is that "dockerfile: Failed" would just reach a tab-stop, therefore the tab that is being written bumps it to the next tab-stop.
Ok that's what I had in mind...
I think that would involve: instead of clearing + overwriting each line separately, buffer all lines first, then clear all lines and write them (using a tabwriter) - see the sketch below.
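A minimal standalone sketch of that buffered-tabwriter approach (hypothetical, not part of this PR): render all status lines into a text/tabwriter so column widths are computed over the whole block, then emit it in one go (the aec erase of the previously drawn block is omitted here).

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"text/tabwriter"
)

func main() {
	lines := []struct{ name, state, detail string }{
		{"backend", "Failed", "[pod status: 0/0 ready, 0/0 pending, 0/0 failed]"},
		{"dockerfile", "Failed", "[pod status: 0/0 ready, 0/0 pending, 0/0 failed]"},
		{"golang", "Pending", "[pod status: 0/1 ready, 1/1 pending, 0/1 failed]"},
	}

	// Buffer all lines first so the tabwriter can align the state column,
	// then write the whole block at once.
	var buf bytes.Buffer
	w := tabwriter.NewWriter(&buf, 0, 4, 2, ' ', 0)
	for _, l := range lines {
		fmt.Fprintf(w, "%s:\t%s\t%s\n", l.name, l.state, l.detail)
	}
	w.Flush()
	os.Stdout.Write(buf.Bytes())
}
```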
LGTM 🐸
This supersedes #846
This takes advantage of k8s informers (which are much more robust than using only a watch but somewhat more complex as well).
We look both at stack reconciliation errors and POD scheduling. The logic is now:
when at least one POD of each service is Ready, we consider the stack to be ready.
On a terminal, service status reporting is now interactive (instead of being forward only), and is correctly interrupted in case of a reconciliation failure.
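A rough sketch of that readiness rule, mirroring the allReady loop visible in the diff above (the exact condition inside the loop is an assumption based on the description):

```go
// allReady: the stack is considered ready once every watched service has at
// least one pod in the Ready state; pending or failed pods elsewhere do not
// block a service that already has a ready pod.
func (pw *podWatcher) allReady() bool {
	for _, status := range pw.services {
		if status.podsReady < 1 {
			return false
		}
	}
	return true
}
```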