Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework deploy output #98

Merged
merged 1 commit into from
Jun 1, 2017
Merged

Rework deploy output #98

merged 1 commit into from
Jun 1, 2017

Conversation

KnVerey
Copy link
Contributor

@KnVerey KnVerey commented May 25, 2017

This PR is all about UX, so examining the output in addition to the code is very important. I've posted a screenshot of my favourite scenario below, but please:

  • Tophat this PR by running the integration tests with PRINT_LOGS=1
  • Try to think of important success/failure cases that may not be covered by the integration tests

I'd love it if everyone on @Shopify/cloudplatform could take a moment to weigh in on the output. I'm no UX designer, and as you all know, this experience is a key part of our app onboarding. 😄

Context

Currently, the KUBESTATUS logging makes up the large majority of the output for most deploys. Originally, I was envisioning this becoming a visualization, and having that be what people watched. I now believe that even if we do implement a visualization, it should not be at the expense of deploy log clarity.

Goals

These are ambitious, and I'm not claiming to have achieved them here. But this is what I'm striving for:

  • All output should be readable by a somewhat motivated human.
  • Users should be able to read the new summary section to find out what happened during the deploy. This is especially important for debugging information on deploy failures, but also includes actions taken during successful deploys.
  • Users should only see a backtrace when the failure is our fault, not theirs (i.e. the gem actually 💥 )

Not done here

  • Special messaging for each resource type. I've done the most obvious ones, but we'll need to continue to improve this in the future.
  • Functional changes to make deploys friendlier (e.g. fail faster instead of time out)

Screenshot

image

Todos:

  • Address Policial failures
  • Run the executables
  • Run a local deploy of a test app with TPRs to a Shopify cluster

@status = if cloudsql_proxy_deployment_exists? && mysql_service_exists?
@deployment_exists = cloudsql_proxy_deployment_exists?
@service_exists = mysql_service_exists?
@status = if @deployment_exists && @service_exists
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes aren't strictly related to this PR. I just noticed that these resources were effectively re-syncing every time we called deploy_succeeded?

@@ -30,12 +31,24 @@ def sync
)
pod.deploy_started = @deploy_started
pod.interpret_json_data(pod_json)

if !@representative_pod && pod_probably_new?(pod_json)
Copy link
Contributor Author

@KnVerey KnVerey May 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes me 😢 / 😳 , but it lets us provide the huge benefit of pod logs/events on deployment failures without waiting for much bigger (200+ LOC) changes required to do this properly. That code is already mostly written here so ideally the hack will be in the wild for a very short time. I think this is worth doing to get the output changes in front of users sooner. Risks:

  • If there is massive clock skew between the cluster and the deployer, no representative pod will be chosen, and we'll show the "go look in Splunk" message anyway.
  • If someone manages to deploy twice in 30 seconds and the second deploy fails, we may select the wrong pod.
  • If a previous deploy is failing so hard that the pods are being re-created repeatedly (not just restarted, which would be fine) and the new deploy fails too, we may select the wrong pod.

@KnVerey KnVerey changed the title [WIP] Rework deploy output Rework deploy output May 31, 2017
@KnVerey KnVerey requested a review from kirs May 31, 2017 01:40
@ibawt
Copy link
Contributor

ibawt commented May 31, 2017

looks great to me. I imagine I'll have more opinions when I debug some failures with the new output.

@n1koo
Copy link
Contributor

n1koo commented May 31, 2017

Looks really nice. Kinda bikesheddish is that theres still quite a bit of output (understandable since its useful) but its hard for eyes to jump to what matters, eg the app logs. Should we emphasise them somehow (eg. colouring etc?)

@KnVerey KnVerey requested a review from Edouard-chin May 31, 2017 19:53
@KnVerey
Copy link
Contributor Author

KnVerey commented May 31, 2017

Should we emphasise them somehow (eg. colouring etc?)

I'm reluctant to colour the logs themselves, since they might already have colouring that might be more useful (e.g. rails migration logs do). A couple ideas to reduce output / emphasize logs, with their tradeoffs:

  • Blacklist certain types of events that probably aren't interesting, e.g. Created, Pulled, Starting. Could the partial story / difference vs Splunk cause confusion though? I thought maybe we could exclude "Normal"-level events as a whole, but a ton of errors (e.g. Backoff, NodeNotReady, Killing) actually have that level, sadly.
  • Put logs before events in the output. The logs could theoretically be massive though... perhaps I should be limiting to last 50 lines in general?
  • Suppress the logs/events line entirely instead of printing "none found, check splunk". The idea was to discourage people from assuming that Splunk will not be helpful just because the deploy failed to get the logs for some reason.

WDYT?

Copy link
Contributor

@Edouard-chin Edouard-chin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty nice work, I didn't went through all files yet as there is a lot of changes, will continue the second batch later/tomorrow and will 🎩 as well.

def heading(text, secondary_msg='', secondary_msg_color=:blue)
padding = (100.0 - (text.length + secondary_msg.length)) / 2
blank_line
part1 = ColorizedString.new("#{'-' * padding.floor}#{text}").blue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we replace https://github.com/Shopify/kubernetes-deploy/blob/5021384388f37792e272e605f9fa7b71df4ec8cb/lib/kubernetes-deploy.rb#L8
with require colorize

We can just do "#{'-' * padding.floor}#{text}".blue and secondary_msg.colorize(secondary_msg_color)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have it this way because I personally prefer the separate class over patching String with all those extra color methods, and I don't find this ugly. I'll change it if the consensus view is to the contrary though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nop you right, that's a legit reason not to polute the String namespace. 👌

sent.capitalize
end

# Saves a sentence fragment to be displayed in the first sentence of the summary section
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's cool that you added documentation, I'm wondering if we should use the standard rdoc syntax so that your doc would be parsable

# The resulting summary will begin with "Created 3 secrets and failed to deploy 2 resources"
#
# @example
#   @logger.summary.add_action("created 3 secrets")
#   @logger.summary.add_cation("failed to deploy 2 resources")
def add_action(sentence_fragment)
  @actions_taken << sentence_fragment
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a good reference for that syntax? I'm happy to use it, but I'm not familiar with it. @kirs made a similar suggestion previously but just linked to a specific example, so I still don't know what to refer to in general. I don't see the @example you're using in https://rdoc.github.io/rdoc/, for instance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops my bad, the snippet above is not rdoc is yard. We started using yard very recently on shopify core, so maybe we can us it here as well? Yard as a neat example feature and I believe it's easier to read when going in the source code directly http://www.rubydoc.info/gems/yard/file/docs/Tags.md#List_of_Available_Tags

def initialize
@actions_taken = []
@paragraphs = []
@failure_reason = ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, like the actions_taken accessor above, it's an artifact from a previous version that I missed cleaning up. Thanks!

when 1 then actions_taken.first
when 2 then actions_taken.join(" and ")
else
actions_taken[0..-2].join(", ") + " and " + actions_taken[-1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have active_support on this project so you can use actions_taken.to_sentence it will do the same for case 1, 2 and else.

def actions_sentence
  sent = case actions_taken.length
    when 0 then '...'
    else
      actions_taken.to_sentence
  end
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've only selectively included pieces of ActiveSupport we need. I didn't think it was worth pulling in conversions for this, but I can do it.

end

class DeferredSummary
attr_reader :paragraphs, :actions_taken
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see actions_taken being used outside of this class, should we remove the getter?

@@ -10,6 +11,16 @@ class KubernetesResource

TIMEOUT = 5.minutes

DEBUG_RESOURCE_NOT_FOUND_MESSAGE = "None found. Please check your usual logging service (e.g. Splunk)."
UNUSUAL_FAILURE_MESSAGE = <<-MSG.strip_heredoc.strip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MSG.strip_heredoc should be enough

file_name, file_content = find_bad_file_from_kubectl_output(err)
if file_name
debug_msg = <<-HELPFUL_MESSAGE.strip_heredoc
This usually means your template named '#{file_name}' is invalid.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicks but it's easier to read if the message is indented 2 spaces in the heredoc block

<<-MSG
  hello
MSG

@dwradcliffe
Copy link
Member

This is a great improvement! Let's try to get this shipped this week - even if we want to make more changes in a week or two.

@Edouard-chin
Copy link
Contributor

Edouard-chin commented Jun 1, 2017

As David say, it's a great improvement and I think we can work on top this. I ran the test with multiple scenarios, it looks great although my terminal doesn't render the color properly

Edit: I'm dumb I was running the test on the wrong branch 🤦‍♂️

Copy link
Contributor

@wfarr wfarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Copy link
Contributor

@ibawt ibawt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@KnVerey
Copy link
Contributor Author

KnVerey commented Jun 1, 2017

I did a tophat of the deploy exe against a test app in a production cluster and fixed a little bug where the pod id was displayed twice in log lines. No problems found with the other two exes. Re: CI, test_long_running_deployment seems to be flakey--I've seen it fail twice recently but haven't been able to reproduce. I'm going to re-run CI and merge if it passes.

Potential future improvements:

  • Consider turning off the timestamps setting on get logs, as Rails logs (and probably most?) already have them.
  • Somehow reduce the number of events dumped (maybe blacklist certain types)
  • Since the Watcher isn't started right when the resource is actually deployed a lot of things will claim to deploy in 0.0s if they were quite (but not actually that) fast, which sounds like blatant lies. Consider passing in the real start time.
  • I just rediscovered that kubectl logs can take a deployment name as of 1.6 (k8s PR). Integrate that strategy into the in-progress smarter log retrieval fix (need to test with multi-container deployments).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants