
tarmak cluster logs #575

Merged: 16 commits into jetstack:master on Feb 6, 2019

Conversation

@JoshVanL (Contributor) commented Oct 12, 2018:

What this PR does / why we need it:
This adds a new cluster sub-command, logs. Its arguments are taken as instance pool names: the sub-command fetches the logs from every service on each instance of every given instance pool, then bundles them into a tarball for convenient file sharing.

fixes #574

Adds `tarmak cluster logs` for convenient fetching of logs of any instance or target group
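
For illustration, a minimal, self-contained sketch of the bundling step (not the PR's actual code): it packs a directory of already-fetched log files into a gzipped tarball using only the standard library. The source directory and output path are made-up examples.

package main

import (
    "archive/tar"
    "compress/gzip"
    "io"
    "log"
    "os"
    "path/filepath"
)

// bundleLogs packs every regular file under srcDir into a gzipped
// tarball at dstPath, mirroring the "fetch then bundle" flow the PR
// describes.
func bundleLogs(srcDir, dstPath string) error {
    out, err := os.Create(dstPath)
    if err != nil {
        return err
    }
    defer out.Close()

    gzw := gzip.NewWriter(out)
    defer gzw.Close()
    tw := tar.NewWriter(gzw)
    defer tw.Close()

    return filepath.Walk(srcDir, func(path string, info os.FileInfo, err error) error {
        if err != nil || !info.Mode().IsRegular() {
            return err
        }
        rel, err := filepath.Rel(srcDir, path)
        if err != nil {
            return err
        }
        hdr, err := tar.FileInfoHeader(info, "")
        if err != nil {
            return err
        }
        hdr.Name = rel // keep paths relative inside the archive
        if err := tw.WriteHeader(hdr); err != nil {
            return err
        }
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        _, err = io.Copy(tw, f)
        return err
    })
}

func main() {
    // Example paths only; the real command derives these from flags.
    if err := bundleLogs("/tmp/tarmak-logs", "logs.tar.gz"); err != nil {
        log.Fatal(err)
    }
}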

/assign
TODO:

  • fix cmd hang
  • more efficiency

@jetstack-bot added the release-note, do-not-merge/work-in-progress, dco-signoff: yes and size/L labels on Oct 12, 2018
@JoshVanL (Contributor, Author):

/cc @MattiasGees @alljames @simonswine
Any ideas on how to keep the ssh connection open to improve the speed of fetching logs?
Thoughts on the current implementation?

@MattiasGees (Member) left a comment:

In general I like the idea that you try to fetch all services; that way we never miss anything. But I feel this makes us transfer a lot of logs that we don't need. A predefined list would lower the amount of logs, but it also adds the burden of maintaining that list if things change.

A few things I'd like to see changed:

  1. Make sure the extension of the log files ends with .log.

  2. Right now we fetch all the logs, each over a separate ssh connection. I propose the following steps instead. The disadvantage is that we need to use local disk space on the instances.

    1. Select the services from which we get the logs
    2. Save all logs to local disk in one temp folder
    3. Get all logs from local disk by downloading that folder

  3. Add extra arguments to get logs from/between certain dates. journalctl has support for --since and --until; I think this can be valuable. By default I think we should only get the last 24 hours of logs, otherwise the bundle can become big. (See the sketch after this list.)
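
A minimal sketch of item 3, under the assumption that the fetch shells out to journalctl on the instance (in tarmak this would happen over SSH rather than locally). It defaults the window to the last 24 hours; the unit name is a placeholder.

package main

import (
    "log"
    "os"
    "os/exec"
    "time"
)

func main() {
    // Default to the last 24 hours of logs, as suggested above.
    now := time.Now()
    since := now.Add(-24 * time.Hour)

    // journalctl accepts timestamps in "YYYY-MM-DD HH:MM:SS" form.
    cmd := exec.Command("journalctl",
        "-u", "kubelet.service", // placeholder unit
        "--since", since.Format("2006-01-02 15:04:05"),
        "--until", now.Format("2006-01-02 15:04:05"),
    )
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    if err := cmd.Run(); err != nil {
        log.Fatal(err)
    }
}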

Also, I get an error when I try to get logs from the worker instance group, which has multiple instances.

tarmak cluster logs --path ~/Downloads/logs.tar.gz worker

INFO[0001] fetching service logs from instance 'worker'  app=tarmak
ssh: Could not resolve hostname worker: nodename nor servname provided, or not known
ERRO[0001] failed to gather unit service list from instance 'worker', skipping...  app=tarmak
INFO[0001] fetching service logs from instance 'worker'  app=tarmak
ssh: Could not resolve hostname worker: nodename nor servname provided, or not known
ERRO[0001] failed to gather unit service list from instance 'worker', skipping...  app=tarmak
INFO[0001] fetching service logs from instance 'worker'  app=tarmak
ssh: Could not resolve hostname worker: nodename nor servname provided, or not known
ERRO[0001] failed to gather unit service list from instance 'worker', skipping...  app=tarmak
INFO[0001] logs bundle written to '/Users/mattias/Downloads/logs.tar.gz'  app=tarmak
INFO[0001] Tarmak performed all tasks successfully.


// clusterLogsCmd handles `tarmak clusters logs`
var clusterLogsCmd = &cobra.Command{
Use: "logs",
MattiasGees (Member): Change to `logs [INSTANCE POOL]` to make clear that we need to define an instance pool.
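
A small self-contained sketch of that suggestion (not the PR's code); the cobra.MinimumNArgs validation is an extra illustration, not something the diff contains.

package main

import (
    "fmt"

    "github.com/spf13/cobra"
)

// Naming the positional arguments in Use makes clear that at least
// one instance pool is expected.
var clusterLogsCmd = &cobra.Command{
    Use:   "logs [INSTANCE POOL] ...",
    Short: "Gather logs from a list of instances or target groups",
    Args:  cobra.MinimumNArgs(1), // reject calls without an instance pool
    RunE: func(cmd *cobra.Command, args []string) error {
        fmt.Println("would gather logs for instance pools:", args)
        return nil
    },
}

func main() {
    if err := clusterLogsCmd.Execute(); err != nil {
        fmt.Println(err)
    }
}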

t.Log().Fatal("expecting at least a one instance pool name")
}

t.CancellationContext().WaitOrCancel(t.NewCmdTerraform(args).Logs)
MattiasGees (Member): Why NewCmdTerraform? I don't feel this belongs here, as the logs have nothing to do with Terraform.

JoshVanL (Author): I agree, however neither is SSH(). I like using this struct for consistency and for using the context etc. Perhaps we should rename it to CmdTarmak, though that doesn't make much sense while it lives in terraform.go.
Thoughts?

@JoshVanL (Contributor, Author):

  • Sane defaults
    • Retrieve logs for the last 24h
    • Control plane + Etcds + Bastion + Vault as default targets
  • Look into the rotation time of journald
  • Allow a specific instance or instances (similar naming to cluster ssh instance)
  • Use a single journalctl -o json stream per instance (see the sketch after this list)
  • Filter them per service / kernel / target on the tarmak host
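
A minimal sketch of the last two items, assuming journald's JSON export field names (_SYSTEMD_UNIT, MESSAGE) and reading the stream from stdin rather than an SSH session:

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "log"
    "os"
)

// entry models two fields of journald's JSON export format
// (journalctl -o json). MESSAGE can also be a byte array for
// non-UTF-8 data; that case is ignored here for brevity.
type entry struct {
    Unit    string `json:"_SYSTEMD_UNIT"`
    Message string `json:"MESSAGE"`
}

func main() {
    // One stream per instance, e.g.: journalctl -o json | ./filter
    dec := json.NewDecoder(os.Stdin)

    // Split the single stream into per-service buckets on the host.
    perUnit := make(map[string][]string)
    for {
        var e entry
        if err := dec.Decode(&e); err == io.EOF {
            break
        } else if err != nil {
            log.Fatalf("failed to decode entry: %s", err)
        }
        perUnit[e.Unit] = append(perUnit[e.Unit], e.Message)
    }
    for unit, msgs := range perUnit {
        fmt.Printf("%s: %d entries\n", unit, len(msgs))
    }
}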

@jetstack-bot added the size/XL label and removed the size/L label on Oct 18, 2018
@JoshVanL (Contributor, Author):

/unassign
/assign @simonswine @MattiasGees

Logs are now JSON streamed and piped straight to file, to be bundled together.
All hosts in a group are gathered concurrently.
Sane defaults for since + until.

@JoshVanL changed the title from "WIP: tarmak cluster logs" to "tarmak cluster logs" on Oct 18, 2018
@jetstack-bot removed the do-not-merge/work-in-progress label on Oct 18, 2018
@simonswine (Contributor) left a comment:

Thanks for that feature @JoshVanL

Some minor comments, but in general great work!

/assign @JoshVanL
/unassign

until string
targets []string

mu sync.Mutex
simonswine: Make it clearer what that lock is protecting.

simonswine: ...in its name, and also in a comment.

wg sync.WaitGroup
hosts []interfaces.Host
tmpDir string
tmpFiles map[string]*os.File
simonswine: I think you should make sure that this has its own lock protecting reads/writes to the map.
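
A sketch of both suggestions applied; the struct and method names here are made up, not the PR's. The mutex is named after what it guards, and the shared map gets its own dedicated lock.

package main

import (
    "os"
    "sync"
)

type logsGatherer struct {
    wg sync.WaitGroup

    // resultMu guards result, the accumulated multi-error.
    resultMu sync.Mutex
    result   error

    // tmpFilesMu guards tmpFiles; every goroutine reading or
    // writing the map must hold it.
    tmpFilesMu sync.Mutex
    tmpFiles   map[string]*os.File
}

func (l *logsGatherer) addTmpFile(name string, f *os.File) {
    l.tmpFilesMu.Lock()
    defer l.tmpFilesMu.Unlock()
    l.tmpFiles[name] = f
}

func main() {
    l := &logsGatherer{tmpFiles: make(map[string]*os.File)}
    l.addTmpFile("example", nil) // nil stands in for a real temp file
}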

pkg/tarmak/logs/logs.go (resolved)
}

entry := new(SystemdEntry)
if err := json.Unmarshal(r, entry); err != nil {
simonswine: Use json.NewDecoder and a decode loop instead of a scanner + individual decodes.
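
A self-contained sketch of that decode loop; SystemdEntry here is a stand-in with a single field, and the fixed string stands in for the journalctl stream.

package main

import (
    "encoding/json"
    "io"
    "log"
    "strings"
)

// SystemdEntry stands in for the PR's type of the same name.
type SystemdEntry struct {
    Message string `json:"MESSAGE"`
}

func main() {
    // In the PR this reader would be the journalctl -o json stream.
    stream := strings.NewReader(`{"MESSAGE":"a"}` + "\n" + `{"MESSAGE":"b"}`)

    // One decoder over the whole stream replaces the
    // bufio.Scanner + per-line json.Unmarshal pattern.
    dec := json.NewDecoder(stream)
    for {
        entry := new(SystemdEntry)
        if err := dec.Decode(entry); err == io.EOF {
            break
        } else if err != nil {
            log.Fatalf("failed to decode entry: %s", err)
        }
        log.Printf("got entry: %s", entry.Message)
    }
}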

l.mu.Lock()
err = fmt.Errorf("failed to unmarshal entry [%s]: %s", r, err)
result = multierror.Append(result, err)
l.mu.Unlock()
simonswine: Keep the area with the lock separate from other code (new lines before and after). Also keep it minimal, i.e. without the Errorf.
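
A sketch of the suggested shape for the hunk above: the error is built outside the critical section, and only the shared append stays under the lock.

err = fmt.Errorf("failed to unmarshal entry [%s]: %s", r, err)

l.mu.Lock()
result = multierror.Append(result, err)
l.mu.Unlock()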

"path",
utils.DefaultLogsPathPlaceholder,
"target tar ball path",
)
simonswine: Maybe we default to the current working directory; config files for tarmak should not be polluted.

MattiasGees (Member): Yeah, I also think the current working directory is a better solution.

Contributor: 👍
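
A minimal sketch of the agreed change, defaulting --path to the current working directory instead of a placeholder under the tarmak config dir; the file name is illustrative.

package main

import (
    "fmt"
    "log"
    "os"
    "path/filepath"
)

func main() {
    // Resolve the default bundle path relative to where the user
    // invoked the command, as suggested above.
    wd, err := os.Getwd()
    if err != nil {
        log.Fatal(err)
    }
    defaultPath := filepath.Join(wd, "logs.tar.gz")
    fmt.Println("default --path:", defaultPath)
}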

),
Short: "Gather logs from a list of instances or target groups",
PreRunE: func(cmd *cobra.Command, args []string) error {
if len(args) == 0 {
simonswine: Some better validation of the targets would be great.

JoshVanL (Author): I've done more validation further down. I don't want to call ListHosts() here again because it's so expensive.


switch group {
case "control-plane":
l.targets = []string{"master", "etcd"}
simonswine: Maybe we define vault as being part of the control-plane?
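
The change that suggestion implies for the hunk above, as a sketch:

switch group {
case "control-plane":
    // treat vault as part of the control plane
    l.targets = []string{"master", "etcd", "vault"}
}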

@jetstack-bot assigned JoshVanL and unassigned simonswine on Oct 23, 2018
@jetstack-bot added the needs-rebase label on Oct 23, 2018
}

type Logs interface {
Gather(group string, flags tarmakv1alpha1.ClusterLogsFlags) error
simonswine: ...or Aggregate?

@JoshVanL force-pushed the 574-tarmak-cluster-logs branch 2 times, most recently from f38a295 to 14c3df8 on October 23, 2018 16:45
Signed-off-by: JoshVanL <vleeuwenjoshua@gmail.com> (repeated across 15 commits)
@JoshVanL (Contributor, Author) commented Feb 1, 2019:

/test verify quick
/test verify docs

@JoshVanL (Contributor, Author) commented Feb 1, 2019:

/test verify quick

1 similar comment
@JoshVanL (Contributor, Author) commented Feb 1, 2019:

/test verify quick

@simonswine (Contributor):

I think it's almost there, but I find this behaviour odd:

./_output/tarmak cluster logs --path test.tar.gz master vault
INFO[0000] fetching logs from target 'master'            app=tarmak
INFO[0001] fetching journald logs from instance 'master-3'  app=tarmak
INFO[0001] fetching journald logs from instance 'master-1'  app=tarmak
INFO[0001] fetching journald logs from instance 'master-2'  app=tarmak
INFO[0001] new connection to bastion host successful     app=tarmak
INFO[0004] logs bundle written to '/home/simon/.golang/packages/src/github.com/jetstack/tarmak/test.tar.gz'  app=tarmak
INFO[0004] fetching logs from target 'vault'             app=tarmak
INFO[0004] fetching journald logs from instance 'vault-3'  app=tarmak
INFO[0004] fetching journald logs from instance 'vault-2'  app=tarmak
INFO[0004] fetching journald logs from instance 'vault-1'  app=tarmak
INFO[0005] logs bundle written to '/home/simon/.golang/packages/src/github.com/jetstack/tarmak/test.tar.gz'  app=tarmak
INFO[0005] tarmak performed all tasks successfully

They should not overwrite each other, I guess.

control-plane and master etcd vault should do exactly the same.
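
A minimal, self-contained sketch of the fix this implies (not the PR's code): open the archive once, append every target's files, and close it at the end, so later targets do not overwrite earlier ones. Targets and file contents are made up.

package main

import (
    "archive/tar"
    "compress/gzip"
    "log"
    "os"
)

func main() {
    out, err := os.Create("logs.tar.gz")
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()

    // One writer for the whole run, shared by all targets.
    gzw := gzip.NewWriter(out)
    defer gzw.Close()
    tw := tar.NewWriter(gzw)
    defer tw.Close()

    for _, target := range []string{"master", "vault"} {
        body := []byte("logs for " + target + "\n")
        hdr := &tar.Header{
            Name: target + "/journal.log",
            Mode: 0o644,
            Size: int64(len(body)),
        }
        if err := tw.WriteHeader(hdr); err != nil {
            log.Fatal(err)
        }
        if _, err := tw.Write(body); err != nil {
            log.Fatal(err)
        }
    }
}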

@simonswine (Contributor):

/assign @JoshVanL

Signed-off-by: JoshVanL <vleeuwenjoshua@gmail.com>
@JoshVanL (Contributor, Author) commented Feb 4, 2019:

/unassign

@simonswine (Contributor):
Thanks @JoshVanL, looking good now

/approve
/lgtm

@jetstack-bot added the lgtm label on Feb 6, 2019
@jetstack-bot (Collaborator):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoshVanL, MattiasGees, simonswine

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [JoshVanL,simonswine]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jetstack-bot jetstack-bot merged commit 463b70c into jetstack:master Feb 6, 2019
Labels: approved, dco-signoff: yes, kind/documentation, lgtm, release-note, size/XL
Development: successfully merging this pull request may close the issue "Tarmak export logs command" (#574).

5 participants