Rework deploy output #98

KnVerey · 2017-05-25T00:35:04Z

This PR is all about UX, so examining the output in addition to the code is very important. I've posted a screenshot of my favourite scenario below, but please:

Tophat this PR by running the integration tests with PRINT_LOGS=1
Try to think of important success/failure cases that may not be covered by the integration tests

I'd love it if everyone on @Shopify/cloudplatform could take a moment to weigh in on the output. I'm no UX designer, and as you all know, this experience is a key part of our app onboarding. 😄

Context

Currently, the KUBESTATUS logging makes up the large majority of the output for most deploys. Originally, I was envisioning this becoming a visualization, and having that be what people watched. I now believe that even if we do implement a visualization, it should not be at the expense of deploy log clarity.

Goals

These are ambitious, and I'm not claiming to have achieved them here. But this is what I'm striving for:

All output should be readable by a somewhat motivated human.
Users should be able to read the new summary section to find out what happened during the deploy. This is especially important for debugging information on deploy failures, but also includes actions taken during successful deploys.
Users should only see a backtrace when the failure is our fault, not theirs (i.e. the gem actually 💥 )
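
To illustrate the backtrace goal, here is a minimal sketch, not the gem's actual code. `ExpectedDeployError` and `run_deploy` are hypothetical names invented for this example. The assumption is that user-caused failures are raised as one dedicated error class, and anything else is treated as a bug in the tool.

```ruby
require "logger"
require "stringio"

# Hypothetical error class for failures caused by the user's templates or cluster state.
class ExpectedDeployError < StandardError; end

# Hypothetical wrapper: runs the block and returns true on success, false on failure.
def run_deploy(logger)
  yield
  true
rescue ExpectedDeployError => e
  # User-facing failure: print the message only, no backtrace.
  logger.error(e.message)
  false
rescue StandardError => e
  # Unexpected failure inside the tool itself: include the backtrace for bug reports.
  logger.fatal("#{e.class}: #{e.message}\n#{Array(e.backtrace).join("\n")}")
  false
end
```

With this shape, only the second `rescue` branch ever prints a backtrace, so template or cluster problems produce a short, readable message.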

Not done here

Special messaging for each resource type. I've done the most obvious ones, but we'll need to continue to improve this in the future.
Functional changes to make deploys friendlier (e.g. fail faster instead of time out)

Screenshot

Todos:

Address Policial failures
Run the executables
Run a local deploy of a test app with TPRs to a Shopify cluster

KnVerey · 2017-05-30T23:55:19Z

lib/kubernetes-deploy/kubernetes_resource/cloudsql.rb

-      @status = if cloudsql_proxy_deployment_exists? && mysql_service_exists?
+      @deployment_exists = cloudsql_proxy_deployment_exists?
+      @service_exists = mysql_service_exists?
+      @status = if @deployment_exists && @service_exists


These changes aren't strictly related to this PR. I just noticed that these resources were effectively re-syncing every time we called deploy_succeeded?
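
A rough sketch of the idea, assuming the cluster lookups happen only in `sync` and `deploy_succeeded?` reads the cached instance variables. `CachedResource`, `lookup`, the `cluster_state` hash and the status strings are hypothetical stand-ins for illustration, not the real class.

```ruby
# Hypothetical stand-in for a resource whose existence checks query the cluster.
class CachedResource
  attr_reader :lookups

  def initialize(cluster_state)
    @cluster_state = cluster_state # hypothetical: a hash standing in for cluster queries
    @lookups = 0
  end

  # Query the cluster once per sync and cache the answers.
  def sync
    @deployment_exists = lookup(:deployment)
    @service_exists = lookup(:service)
    @status = @deployment_exists && @service_exists ? "Provisioned" : "Unknown"
  end

  # Reads only cached state, so repeated calls trigger no further lookups.
  def deploy_succeeded?
    @deployment_exists && @service_exists
  end

  private

  # Counts each simulated cluster query so the caching is observable.
  def lookup(kind)
    @lookups += 1
    @cluster_state.fetch(kind, false)
  end
end
```

Calling `deploy_succeeded?` any number of times after a single `sync` leaves `lookups` at 2 in this sketch.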

KnVerey · 2017-05-31T00:07:59Z

lib/kubernetes-deploy/kubernetes_resource/deployment.rb

@@ -30,12 +31,24 @@ def sync
            )
            pod.deploy_started = @deploy_started
            pod.interpret_json_data(pod_json)
+
+            if !@representative_pod && pod_probably_new?(pod_json)


This makes me 😢 / 😳 , but it lets us provide the huge benefit of pod logs/events on deployment failures without waiting for much bigger (200+ LOC) changes required to do this properly. That code is already mostly written here so ideally the hack will be in the wild for a very short time. I think this is worth doing to get the output changes in front of users sooner. Risks:

If there is massive clock skew between the cluster and the deployer, no representative pod will be chosen, and we'll show the "go look in Splunk" message anyway.

If someone manages to deploy twice in 30 seconds and the second deploy fails, we may select the wrong pod.

If a previous deploy is failing so hard that the pods are being re-created repeatedly (not just restarted, which would be fine) and the new deploy fails too, we may select the wrong pod.
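
For readers following along, the heuristic could look roughly like the sketch below. This is an assumption about its shape, not the code in this PR: the two-argument signature and the 30-second tolerance are invented for the example. `metadata.creationTimestamp` is the standard Kubernetes field.

```ruby
require "time"

# Assumed tolerance for clock skew between the deployer and the cluster.
SKEW_TOLERANCE_SECONDS = 30

# Returns true if the pod was created after the deploy began,
# or no more than the tolerance before it.
def pod_probably_new?(pod_json, deploy_started)
  created_at = Time.parse(pod_json.dig("metadata", "creationTimestamp"))
  created_at >= deploy_started - SKEW_TOLERANCE_SECONDS
end
```

The tolerance is what produces the risks above: skew larger than the window means no pod qualifies, while pods created by a very recent earlier deploy can qualify by mistake.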

ibawt · 2017-05-31T13:17:11Z

looks great to me. I imagine I'll have more opinions when I debug some failures with the new output.

n1koo · 2017-05-31T19:43:28Z

Looks really nice. A bit of a bikeshed, but there's still quite a lot of output (understandable, since it's useful), and it's hard for the eye to jump to what matters, e.g. the app logs. Should we emphasise them somehow (eg. colouring etc?)

KnVerey · 2017-05-31T21:24:25Z

Should we emphasise them somehow (eg. colouring etc?)

I'm reluctant to colour the logs themselves, since they might already have colouring that might be more useful (e.g. rails migration logs do). A couple ideas to reduce output / emphasize logs, with their tradeoffs:

Blacklist certain types of events that probably aren't interesting, e.g. Created, Pulled, Starting. Could the partial story / difference vs Splunk cause confusion though? I thought maybe we could exclude "Normal"-level events as a whole, but a ton of errors (e.g. Backoff, NodeNotReady, Killing) actually have that level, sadly.
Put logs before events in the output. The logs could theoretically be massive though... perhaps I should be limiting to last 50 lines in general?
Suppress the logs/events line entirely instead of printing "none found, check Splunk". The idea behind the current message was to discourage people from assuming that Splunk will not be helpful just because the deploy failed to get the logs for some reason.

WDYT?
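
To make the first two ideas concrete, here is a minimal sketch. The constant and method names are hypothetical, and events are modelled as plain hashes with a `:reason` key. It filters by event reason rather than by level, since many errors are reported at the "Normal" level.

```ruby
# Assumed list of event reasons that rarely help explain a failure.
BORING_EVENT_REASONS = %w(Created Pulled Starting).freeze
# Assumed cap on how many log lines to print.
MAX_LOG_LINES = 50

# Drops events whose reason is on the list above; keeps everything else.
def interesting_events(events)
  events.reject { |event| BORING_EVENT_REASONS.include?(event[:reason]) }
end

# Keeps only the last `limit` lines of the raw log text.
def log_tail(raw_logs, limit = MAX_LOG_LINES)
  raw_logs.lines.last(limit).join
end
```

Both helpers are pure functions over already-fetched data, so either idea could be tried without changing how events and logs are retrieved.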

Edouard-chin

Pretty nice work. I haven't gone through all the files yet, as there are a lot of changes. I'll continue with the second batch later/tomorrow and will 🎩 as well.

Edouard-chin · 2017-05-31T21:21:30Z