Improve logging around agent checkins. #1477
It should be `Checkin`.
🤦
I am wondering if we should only update the local reporter to the degraded state after multiple repeated failures, instead of just the first one. In the Fleet UI the agent is marked as degraded after multiple missed checkins, not just one.
Sounds good, we can move it down to where it checks the fail counter.
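A minimal sketch of that idea; `failThreshold` and every other name here are illustrative, not the agent's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// failThreshold mirrors the Fleet UI, which marks an agent degraded
// only after multiple missed checkins. The value is illustrative.
const failThreshold = 2

func main() {
	// Simulated checkin results: two failures, then a success.
	results := []error{errors.New("timeout"), errors.New("timeout"), nil}

	failCount := 0
	for _, err := range results {
		if err == nil {
			failCount = 0
			fmt.Println("status: healthy")
			continue
		}
		failCount++
		// Only report degraded once the fail counter passes the
		// threshold, not on the first failure.
		if failCount >= failThreshold {
			fmt.Printf("status: degraded after %d failed checkins: %v\n", failCount, err)
		} else {
			fmt.Printf("checkin failed (%d/%d), staying healthy: %v\n", failCount, failThreshold, err)
		}
	}
}
```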
Done
Just asking, but did you check how the `retry_after` duration will be printed? Will it be human readable, or in milliseconds?
It will be human readable according to https://pkg.go.dev/time#Duration.String, since `NextWait()` returns a `time.Duration`.
Actually, I apparently need to manually call the `.String()` method if I want it to be human readable; otherwise the units are nanoseconds.
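A minimal illustration of the difference using `fmt` (how a bare `time.Duration` field is serialized by the agent's structured logger depends on its encoder, which isn't shown here):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	d := 1500 * time.Millisecond

	// time.Duration is an int64 count of nanoseconds, so formatting
	// it as a plain number yields the raw nanosecond value...
	fmt.Printf("retry_after=%d\n", d) // retry_after=1500000000

	// ...while Duration.String() produces the human readable form.
	fmt.Printf("retry_after=%s\n", d.String()) // retry_after=1.5s
}
```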
I just added the `_ns` suffix to specify the units. The ns durations are easier to graph and work with in Kibana, so I stuck with those. Nanoseconds are the units for `event.duration` if we ever wanted to make our logs ECS compliant.
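A sketch of what the resulting log field could look like, using the standard library's `log/slog` rather than the agent's logger; the field name is illustrative:

```go
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	start := time.Now()
	time.Sleep(25 * time.Millisecond) // stand-in for the checkin request
	took := time.Since(start)

	// Duration.Nanoseconds() returns an int64, so the field is a plain
	// number that is easy to graph in Kibana, and it matches the units
	// ECS specifies for event.duration.
	logger.Info("checkin complete", "request_duration_ns", took.Nanoseconds())
}
```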
I think we can put a nolint directive on this line.
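For reference, golangci-lint inline directives take this form; `somelinter` is a placeholder, since the thread doesn't name the linter that flagged the line:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	start := time.Now()
	// golangci-lint reads the trailing comment as a suppression.
	elapsed := time.Since(start).Nanoseconds() //nolint:somelinter // placeholder; reason goes here
	fmt.Println("elapsed_ns:", elapsed)
}
```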
Will do. I went further and just turned off this linter in #1478.
I don't know that I particularly like manually timing the request like this and returning it all the way up the call stack, but this was the easiest way to do it. In an ideal world we could use request tracing to obtain this information without having to modify our network client.
Perhaps a more sustainable way to do this would be for the request tracer to output traced requests to our logs when a request fails.
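A rough sketch of the manual-timing approach under discussion, with hypothetical names rather than the agent's actual API; the duration is returned alongside the response so callers up the stack can log it:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// doCheckin performs the request and returns how long the round trip
// took alongside the response, so callers up the stack can log it.
func doCheckin(client *http.Client, req *http.Request) (*http.Response, time.Duration, error) {
	start := time.Now()
	resp, err := client.Do(req)
	return resp, time.Since(start), err
}

func main() {
	req, err := http.NewRequest(http.MethodGet, "https://example.com", nil)
	if err != nil {
		panic(err)
	}
	resp, took, err := doCheckin(http.DefaultClient, req)
	if err != nil {
		fmt.Printf("checkin failed after %s: %v\n", took, err)
		return
	}
	defer resp.Body.Close()
	fmt.Printf("checkin returned %d in %s (%d ns)\n", resp.StatusCode, took, took.Nanoseconds())
}
```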
We found ourselves trying to figure out checkin request durations by looking at the difference in timestamps between the agent and the cloud proxy on a few occasions, so this is worth doing for convenience, at least for this one request.
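For comparison, the request-tracing alternative mentioned above could look roughly like this with the standard library's `net/http/httptrace`, which observes timing without modifying the client itself:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	var start time.Time
	trace := &httptrace.ClientTrace{
		// Record when the client starts looking for a connection.
		GetConn: func(hostPort string) { start = time.Now() },
		// Fires once the first response byte arrives.
		GotFirstResponseByte: func() {
			fmt.Printf("time to first response byte: %s\n", time.Since(start))
		},
	}

	req, err := http.NewRequest(http.MethodGet, "https://example.com", nil)
	if err != nil {
		panic(err)
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.StatusCode)
}
```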