tarmak cluster logs #575
/cc @MattiasGees @alljames @simonswine
In general I like the idea of fetching the logs of all services, because that way we never miss anything. But I feel this makes us transfer a lot of logs that we don't need. A predefined list would lower the amount of logs, but it adds the burden of maintaining that list when things change.
A few things I would like to see changed:
- Make sure the log file names end with the extension .log
- Currently we fetch every log over its own separate SSH connection. I propose the following steps instead. The disadvantage is that we need to use local disk space on the instances.
  - Select the services from which we get the logs
  - Save all logs to local disk in one temp folder
  - Get all logs from local disk by downloading that folder
- Add extra arguments to get logs from/between certain dates. journalctl has support for --since and --until, which I think can be valuable. By default we should only get the last 24 hours of logs; otherwise they can become big.
Also, I get an error when I try to get logs from the worker instance group, which has multiple instances:
tarmak cluster logs --path ~/Downloads/logs.tar.gz worker
INFO[0001] fetching service logs from instance 'worker' app=tarmak
ssh: Could not resolve hostname worker: nodename nor servname provided, or not known
ERRO[0001] failed to gather unit service list from instance 'worker', skipping... app=tarmak
INFO[0001] fetching service logs from instance 'worker' app=tarmak
ssh: Could not resolve hostname worker: nodename nor servname provided, or not known
ERRO[0001] failed to gather unit service list from instance 'worker', skipping... app=tarmak
INFO[0001] fetching service logs from instance 'worker' app=tarmak
ssh: Could not resolve hostname worker: nodename nor servname provided, or not known
ERRO[0001] failed to gather unit service list from instance 'worker', skipping... app=tarmak
INFO[0001] logs bundle written to '/Users/mattias/Downloads/logs.tar.gz' app=tarmak
INFO[0001] Tarmak performed all tasks successfully.
cmd/tarmak/cmd/cluster_logs.go

// clusterLogsCmd handles `tarmak clusters logs`
var clusterLogsCmd = &cobra.Command{
	Use: "logs",
Change to logs [INSTANCE POOL] to say that an instance pool needs to be given.
cmd/tarmak/cmd/cluster_logs.go
		t.Log().Fatal("expecting at least a one instance pool name")
	}

	t.CancellationContext().WaitOrCancel(t.NewCmdTerraform(args).Logs)
Why NewCmdTerraform? I don't feel this belongs here, as the logs have nothing to do with Terraform.
I agree, but SSH() has nothing to do with Terraform either. I like using this struct for consistency, for its use of the context, etc. Perhaps we should rename it to CmdTarmak, although that doesn't make much sense while it lives in terraform.go.
Thoughts?
Force-pushed from af51d76 to 76fde99.
/unassign
Logs are now streamed as JSON and piped straight to file, to be bundled together.
pkg/tarmak/logs/logs.go
	until string
	targets []string

	mu sync.Mutex
Make it clearer what that lock is protecting.
Both in its name and in a comment.
pkg/tarmak/logs/logs.go
	wg sync.WaitGroup
	hosts []interfaces.Host
	tmpDir string
	tmpFiles map[string]*os.File
I think you should make sure that this has its own lock protecting reads and writes to the map.
pkg/tarmak/logs/logs.go
	}

	entry := new(SystemdEntry)
	if err := json.Unmarshal(r, entry); err != nil {
Use json.NewDecoder and a decode loop instead of a scanner plus individual decodes.
pkg/tarmak/logs/logs.go
	l.mu.Lock()
	err = fmt.Errorf("failed to unmarshal entry [%s]: %s", r, err)
	result = multierror.Append(result, err)
	l.mu.Unlock()
Keep the area under the lock separate from the other code (blank lines before and after). Also keep it minimal, i.e. without the Errorf.
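The request above might look something like this. It is a sketch, not the PR's code: errCollector and parseAll are hypothetical names, and a plain slice of errors stands in for the multierror result so that the example needs only the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// errCollector is a hypothetical helper that gathers errors from several
// goroutines.
type errCollector struct {
	errsMu sync.Mutex // guards errs
	errs   []error
}

// add appends err; the critical section contains only the append.
func (c *errCollector) add(err error) {
	c.errsMu.Lock()
	c.errs = append(c.errs, err)
	c.errsMu.Unlock()
}

// parseAll tries to unmarshal every record concurrently and returns the
// collected errors.
func parseAll(records []string) []error {
	var c errCollector
	var wg sync.WaitGroup
	for _, r := range records {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			var v map[string]interface{}
			if err := json.Unmarshal([]byte(r), &v); err != nil {
				// The message is built before the lock is taken.
				err = fmt.Errorf("failed to unmarshal entry [%s]: %s", r, err)

				c.add(err)
			}
		}(r)
	}
	wg.Wait()
	return c.errs
}

func main() {
	errs := parseAll([]string{`{"MESSAGE":"ok"}`, `{bad`})
	fmt.Println(len(errs))
}
```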
"path", | ||
utils.DefaultLogsPathPlaceholder, | ||
"target tar ball path", | ||
) |
Maybe we should default to the current working directory; tarmak's config files should not be polluted.
Yeah, I also think the current working directory is a better solution.
👍
	),
	Short: "Gather logs from a list of instances or target groups",
	PreRunE: func(cmd *cobra.Command, args []string) error {
		if len(args) == 0 {
Some better validation of targets would be great.
I've done more validation further down. I don't want to call ListHosts() here again because it's so expensive
pkg/tarmak/logs/logs.go

	switch group {
	case "control-plane":
		l.targets = []string{"master", "etcd"}
Maybe we define vault as being part of the control-plane?
pkg/tarmak/interfaces/interfaces.go
}

type Logs interface {
	Gather(group string, flags tarmakv1alpha1.ClusterLogsFlags) error
or Aggregate?
Force-pushed from f38a295 to 14c3df8.
Signed-off-by: JoshVanL <vleeuwenjoshua@gmail.com>
/test verify quick
I think we are almost there, but I find this behaviour odd:
They should not overwrite each other, I guess.
/assign @JoshVanL |
Signed-off-by: JoshVanL <vleeuwenjoshua@gmail.com>
Force-pushed from 0b9a084 to 9b792e7.
/unassign
Thanks @JoshVanL, looking good now.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: JoshVanL, MattiasGees, simonswine
What this PR does / why we need it:
This adds a new cluster sub-command, logs. The arguments that follow it are taken as instance pool names. For every instance pool given, the sub-command fetches the logs of every service on each instance in that pool. The logs are then bundled together into a tarball for convenient file sharing.
fixes #574
/assign
TODO: