Rename to management.Manager, add UpdateStatus to Manager interface. #19114
Conversation
Pinging @elastic/ingest-management (Team:Ingest Management)
Pinging @elastic/integrations-services (Team:Services)
💚 Build Succeeded
func (*nilManager) Stop() {}
func (*nilManager) CheckRawConfig(cfg *common.Config) error { return nil }
func (n *nilManager) UpdateStatus(status Status, msg string) {
	n.lock.Lock()
Do we need this here if this is the nil manager, i.e. a no-op manager?
I do it only to log the status changes. I think it is good to show status updates in the logger as well, so it's clear what is happening in the beat, even if it's not controlled by Elastic Agent.
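A minimal, self-contained sketch of that idea — a no-op manager that takes no control actions but still records and logs status transitions. The `Status` type and plain `fmt` logging here are stand-ins for libbeat's `management.Status` and logger, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// Status stands in for libbeat's management.Status; the names follow
// the statuses discussed in this PR.
type Status int

const (
	Unknown Status = iota
	Starting
	Configuring
	Running
	Degraded
	Failed
	Stopping
)

func (s Status) String() string {
	return [...]string{"Unknown", "Starting", "Configuring", "Running",
		"Degraded", "Failed", "Stopping"}[s]
}

// nilManager performs no management actions, but logging transitions
// keeps the beat's behavior visible even without Elastic Agent.
type nilManager struct {
	lock   sync.Mutex
	status Status
	msg    string
}

func (n *nilManager) UpdateStatus(status Status, msg string) {
	n.lock.Lock()
	defer n.lock.Unlock()
	// Only log actual transitions, not repeated identical updates.
	if n.status != status || n.msg != msg {
		n.status = status
		n.msg = msg
		fmt.Printf("Status change to %s: %s\n", status, msg)
	}
}

func main() {
	m := &nilManager{}
	m.UpdateStatus(Running, "Running")
	m.UpdateStatus(Running, "Running") // no change: nothing logged
	m.UpdateStatus(Degraded, "output unavailable")
}
```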
func (cm *Manager) OnConfig(s string) {
	cm.client.Status(proto.StateObserved_CONFIGURING, "Updating configuration")
	cm.UpdateStatus(management.Configuring, "Updating configuration")
Should we switch back after the config is applied, or do we let the beat do that?
It does switch back to Running at the end of OnConfig.
You're right, it was hidden so I overlooked it.
If I read this right OnConfig sets the status to 'Healthy' at the end. Are we sure this is the correct state to report? Do we 'forget' the internal state here?
How about removing the 'Configuring' status? The fewer states (and less protocol behavior) we have, the less we can get wrong.
I personally do not like the idea of setting the status in the OnConfig handler; I feel that should really only be done by the beat itself. So if OnConfig passes the config to the reloader, then the reloader should really set the status.
At the moment I keep it this way so it works, and once status is being reported correctly by the beat, this should be removed (probably at the same time).
I do not think we should remove Configuring. I think it is a key state for the observability of the Beat and for propagating that information to Fleet. It is good to know that a transition Healthy -> Configuring -> Failed means everything was good, and then updating the configuration caused it to fail.
I personally do not like the idea of setting the status in the OnConfig handler; I feel that should really only be done by the beat itself. So if OnConfig passes the config to the reloader, then the reloader should really set the status.
At the moment I keep it this way so it works, and once status is being reported correctly by the beat, this should be removed (probably at the same time).
+1
Is this required change/cleanup documented in an issue?
@@ -285,3 +303,25 @@ func (cm *Manager) toConfigBlocks(cfg common.MapStr) (api.ConfigBlocks, error) {
	return res, nil
}

func statusToProtoStatus(status management.Status) proto.StateObserved_Status {
Can we add a unit test that switches statuses back and forth with some noop client?
At the moment I don't believe the client from elastic-agent-client is exposed as an interface, which would allow me to completely replace it with a noop client.
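The mapping itself can still be exercised without the real client, using a table-driven test over stand-in types. A sketch under assumptions: `Status` and the proto values are local stand-ins, and the exact mapping (e.g. what `Unknown` reports) is illustrative, not the PR's code:

```go
package main

import "fmt"

// Stand-ins for management.Status and proto.StateObserved_Status so the
// sketch compiles without libbeat or elastic-agent-client.
type Status int
type ProtoStatus int

const (
	Unknown Status = iota
	Starting
	Configuring
	Running
	Degraded
	Failed
	Stopping
)

const (
	ProtoStarting ProtoStatus = iota
	ProtoConfiguring
	ProtoHealthy
	ProtoDegraded
	ProtoFailed
	ProtoStopping
)

// statusToProtoStatus mirrors the kind of mapping under test; the real
// function lives in the Fleet manager.
func statusToProtoStatus(s Status) ProtoStatus {
	switch s {
	case Starting:
		return ProtoStarting
	case Configuring:
		return ProtoConfiguring
	case Degraded:
		return ProtoDegraded
	case Failed:
		return ProtoFailed
	case Stopping:
		return ProtoStopping
	}
	// Assumed here: Unknown and Running both report healthy.
	return ProtoHealthy
}

func main() {
	cases := []struct {
		in   Status
		want ProtoStatus
	}{
		{Unknown, ProtoHealthy},
		{Running, ProtoHealthy},
		{Configuring, ProtoConfiguring},
		{Failed, ProtoFailed},
	}
	for _, c := range cases {
		if got := statusToProtoStatus(c.in); got != c.want {
			panic(fmt.Sprintf("statusToProtoStatus(%v) = %v, want %v", c.in, got, c.want))
		}
	}
	fmt.Println("all cases pass")
}
```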
if cm.status != status || cm.msg != msg {
	cm.status = status
	cm.msg = msg
	cm.lock.Unlock()
We should test this with concurrent requests. What I fear may occur is that with many concurrent status changes we will end up with the internal state differing from the reported state. Do you see any pros/cons of doing the reporting through a channel, or of holding the lock for the full flow?
We can also try to 'forget' outdated status updates if we get new ones while we are waiting. This could be implemented using sync.Cond.
My only reason to track the state at this level is so it can be logged when it changes. We could remove the internal tracking and locking in the Fleet Manager if we just log every time UpdateStatus is called; then the locking inside of client will handle updating the correct state to the manager.
Or we could place the call to client inside of the lock to ensure that the two always stay in sync.
I think the client option is also OK. I think logging inconsistent states will cause us more trouble than it helps during SDH triaging.
I went with moving the logging and client into the lock.
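A sketch of what "logging and client inside the lock" buys. The `statusClient` interface here is hypothetical (the real elastic-agent-client is a concrete type, as noted above); holding the lock across both the bookkeeping and the client call means concurrent callers can never leave internal and reported state out of sync:

```go
package main

import (
	"fmt"
	"sync"
)

type Status int

// statusClient is a hypothetical interface standing in for the
// elastic-agent-client; extracting such an interface is what makes a
// noop test client possible.
type statusClient interface {
	Status(status Status, msg string)
}

// recordingClient is a noop client that only remembers the last report.
type recordingClient struct {
	status Status
	msg    string
}

func (r *recordingClient) Status(status Status, msg string) {
	r.status, r.msg = status, msg
}

type Manager struct {
	lock   sync.Mutex
	client statusClient
	status Status
	msg    string
}

// UpdateStatus holds the lock across both the internal bookkeeping and
// the client call, so the reported state cannot lag behind the internal
// state under concurrent updates.
func (cm *Manager) UpdateStatus(status Status, msg string) {
	cm.lock.Lock()
	defer cm.lock.Unlock()
	if cm.status == status && cm.msg == msg {
		return
	}
	cm.status = status
	cm.msg = msg
	cm.client.Status(status, msg)
}

func main() {
	rc := &recordingClient{}
	cm := &Manager{client: rc}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		i := i
		wg.Add(1)
		go func() {
			defer wg.Done()
			cm.UpdateStatus(Status(i%3), fmt.Sprintf("update %d", i))
		}()
	}
	wg.Wait()
	// Whatever order the goroutines ran in, internal and reported agree.
	fmt.Println(cm.status == rc.status && cm.msg == rc.msg) // prints "true"
}
```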
libbeat/management/management.go
CheckRawConfig(cfg *common.Config) error

// UpdateStatus called when the status of the beat has changed.
UpdateStatus(status Status, msg string)
How about moving the UpdateStatus method into its own interface? I'd like to push only 'status' reporting to the inputs/outputs and provide a set of helper functions that wrap a 'StatusReporter', merging status changes from multiple subsystems. But in the end we can also introduce the minimal interface in another package when we need it.
I can move it to its own interface; would it also still be in the management.Manager interface?
type Manager interface {
StatusReporter
...
}
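For illustration, a sketch of that split, plus a hypothetical helper (not part of this PR) that merges per-subsystem statuses into one report, always surfacing the worst. Names like `merged` and `logReporter` are invented for the sketch:

```go
package main

import "fmt"

type Status int

const (
	Running Status = iota
	Degraded
	Failed
)

func (s Status) String() string {
	return [...]string{"Running", "Degraded", "Failed"}[s]
}

// StatusReporter is the minimal interface split out of Manager.
type StatusReporter interface {
	UpdateStatus(status Status, msg string)
}

// logReporter is a stand-in sink for status updates.
type logReporter struct{}

func (logReporter) UpdateStatus(status Status, msg string) {
	fmt.Printf("Status change to %s: %s\n", status, msg)
}

// merged wraps a StatusReporter and combines per-subsystem statuses,
// reporting the worst one. Not goroutine-safe; a real helper would
// need its own locking.
type merged struct {
	out  StatusReporter
	subs map[string]Status
}

func newMerged(out StatusReporter) *merged {
	return &merged{out: out, subs: map[string]Status{}}
}

func (m *merged) Report(subsystem string, status Status, msg string) {
	m.subs[subsystem] = status
	worst := Running
	for _, s := range m.subs {
		if s > worst {
			worst = s
		}
	}
	m.out.UpdateStatus(worst, msg)
}

func main() {
	m := newMerged(logReporter{})
	m.Report("input", Running, "input ok")
	m.Report("output", Degraded, "output slow")
	m.Report("input", Running, "still ok") // worst is still Degraded
}
```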
	cm.logger.Infof("Status change to %s: %s", status, msg)
	return
}
cm.lock.Unlock()
Why unlock before the call to client.Status? If we have a race between the unlock and the status being sent by two goroutines, how can you tell which one is the newer status update?
Yeah, I was commenting on that as well. I can move the client.Status call inside of the lock; I think that would be best.
…or statusToProtoStatus.
@michalpristas @urso This is ready for another review.
cm.status = status
cm.msg = msg
cm.lock.Unlock()
cm.client.Status(statusToProtoStatus(status), msg)
Here we update status on the client, which will result in some action either during check-in or in the watcher after some timeout.
If somebody reports a failure, I think (correct me if my memory fails) we talked about restarting the application immediately; can you make sure this flow is valid?
This flow is valid from the side of the Elastic Agent, as reporting Failed will cause the application to be killed and restarted. This is fully covered by the Elastic Agent unit tests, both in the case that the application returns Failed over the protocol and in the case that the application just crashes.
	return
}

blocks, err := cm.toConfigBlocks(configMap)
if err != nil {
	err = errors.Wrap(err, "could not apply the configuration")
	cm.logger.Error(err)
	cm.client.Status(proto.StateObserved_FAILED, err.Error())
	cm.UpdateStatus(management.Failed, err.Error())
Given all these errors I wonder what 'Failed' means. Is the beat really 'failed' or are we (the overall application with agent) running in a degraded mode because not all configurations could have been applied? Even if we fail here the Beat continues to publish events for existing inputs, right?
Are we mixing the status of the 'operation' with the beat health here? If the operation fails, are we required to restart the Beat?
As stated in the list of statuses, there is Failed and Degraded.
// Degraded is status describing application is degraded.
Degraded
// Failed is status describing application is failed. This status should
// only be used in the case the beat should stop running as the failure
// cannot be recovered.
Failed
In the Degraded state the beat can keep sending events and the agent will not stop the process. Failed means: I have completely failed, restart me.
Yeah, I've seen these status types. Is this method run on startup only, or also while the beat is already active? In the latter case the beat will continue with the old state if there was an error decoding the configuration. Do we want to enforce a 'restart' if the config update operation failed? Will Failed trigger a restart by the Agent?
Are we mixing the status of the update 'operation' with the beat health here? If the operation fails, are we required to restart the Beat?
I'm more or less wondering about the overall semantics of OnConfig and the individual errors we can get.
It will run both on startup and while running.
The beat doesn't need to worry about handling the restart on Failed status, the Agent will do that work for it, restart it and bring it back up.
I think it's good for a first iteration, and the exact form will crystallize once we have at least some health check implemented.
…lastic#19114) * Rename management.ConfigManager to management.Manager, add UpdateStatus to Manager interface. * Update docstring for Failed status. * Add to developer changelog. * Add StatusReporter interface, wrap client.Status in lock. Add tests for statusToProtoStatus. (cherry picked from commit 05c9065)
…19114) (#19172) * Rename management.ConfigManager to management.Manager, add UpdateStatus to Manager interface. * Update docstring for Failed status. * Add to developer changelog. * Add StatusReporter interface, wrap client.Status in lock. Add tests for statusToProtoStatus. (cherry picked from commit 05c9065)
What does this PR do?
Renames management.ConfigManager to management.Manager, being that it now does more than just manage configurations. Adds UpdateStatus to management.Manager to allow a beat to update the status. Using FleetManager, the status will be propagated over the elastic-agent-client to the controlling agent.
Why is it important?
So status can be reported both to the controlling Agent and up to Fleet for observability of the beats that Agent is running.
Reporting a Failed status to Agent will result in the Agent stopping and restarting the beat.
Checklist
- I have made corresponding changes to the documentation
- I have made corresponding change to the default configuration files
- I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc
Author's Checklist
How to test this PR locally
Related issues
HealthCheck function in every beats. #17737
Use cases
Screenshots
Logs