Component Status Reporting #8169
component/component.go
```go
// GlobalID uniquely identifies a component
type GlobalID struct {
    ID         ID
    Kind       Kind
    PipelineID ID // Not empty only if the Kind is Processor
}
```
There are a couple of nuances to the way we instance components within the service, which I think likely need to be reconciled in this PR.

Receivers and exporters are instanced per data type. For example, users will often configure a single otlpexporter for exporting multiple types of data, but the underlying implementation creates separate instances, which would share the same GlobalID. This is probably the correct model to present externally, because it aligns with the user's understanding, but we will have to define how to reconcile it with the underlying implementation. Minimally, instances of the "same" component must be able to safely publish updates concurrently.

This raises a usability complication though, because we may expect status thrashing or burying of problems, e.g. the traces instance of a component reports a problem, but the metrics instance immediately clobbers the status by reporting that it is healthy. I think we probably need to maintain a separate status for each instance of the "same" component, and aggregate them whenever an event is published, e.g. report the "lowest" health state of any instance that shares the GlobalID, regardless of which instance sent the most recent event.

To further complicate things, we have the sharedcomponent package, which allows component authors to opt in to a unified model. In this case, the service still reasons about separate instances, but there is in fact only one instance after all. For example, the otlpreceiver uses this mechanism so that the shared instance can present a single port. I'm not certain how best to handle this, but perhaps the sharedcomponent package needs to manage status reporting for the components that use it.

Connectors share the same concerns as receivers and exporters, except instancing is per ordered pair of data types. sharedcomponent may or may not be used here as well.
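A tiny sketch of the clobbering problem with a single shared status slot per GlobalID; the map, component name, and status values below are hypothetical, just to make the failure mode concrete:

```go
package main

import "fmt"

func main() {
    // With a single status slot per GlobalID, the last report wins, so a
    // problem reported by one instance is immediately buried by another.
    status := map[string]string{}
    status["otlpexporter"] = "RecoverableError" // traces instance reports a problem
    status["otlpexporter"] = "OK"               // metrics instance clobbers it
    fmt.Println(status["otlpexporter"])         // prints "OK"; the problem is hidden
}
```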
🤯 this sounds somewhat complicated, but thanks for the explanation.

I'm wondering if we should assume that each instance of a component should report status individually, and if so, perhaps we can rename GlobalID to InstanceID, where the ID would have enough metadata to identify the underlying component. We could add some additional fields to the struct if needed.

I'm also wondering if this is accidentally already happening with the GlobalID as it's currently implemented. We are passing around pointers to GlobalID instances, via the host wrapper. Is the code that generates the GlobalIDs in graph generating a GlobalID per component instance? If so, we would have a GlobalID instance per component, each with a distinct pointer, but possibly the same struct values.

I'd also be interested in any other reasonable ways to approach this.
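A minimal illustration of the pointer-versus-value distinction in question; the types are simplified stand-ins, not the actual collector definitions:

```go
package main

import "fmt"

// Simplified stand-ins for component.ID, component.Kind, and GlobalID,
// used only to show pointer identity vs. struct-value equality.
type ID string
type Kind int

type GlobalID struct {
    ID         ID
    Kind       Kind
    PipelineID ID
}

func main() {
    // Two instances of the "same" configured exporter could end up with
    // distinct *GlobalID pointers that carry identical struct values.
    traces := &GlobalID{ID: "otlp", Kind: 1}
    metrics := &GlobalID{ID: "otlp", Kind: 1}

    fmt.Println(traces == metrics)   // false: distinct pointers
    fmt.Println(*traces == *metrics) // true: equal struct values
}
```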
> I'm also wondering if this is accidentally already happening with the GlobalID as it's currently implemented. We are passing around pointers to GlobalID instances, via the host wrapper. Is the code that generates the GlobalIDs in graph generating a GlobalID per component instance? If so, we would have a GlobalID instance per component, each with a distinct pointer, but possibly the same struct values.

To clarify, are you talking about graph.nodeID as the current implementation of GlobalID, and service.Host's graph.Graph as the mechanism by which we're passing around pointers to the IDs?

Assuming so, I don't believe there are uniqueness problems there, because the nodeIDs are hashes that are generated based on the considerations I mentioned above. Processor IDs are hashed with a specific pipeline ID. Receiver IDs and Exporter IDs are hashed with a data type, and Connector IDs are hashed with an ordered pair of data types. nodeID is also not accessible outside of the graph package, and I don't think there's any way to indirectly reference or modify them.
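A rough sketch of the kind of per-instance hashing described here, purely illustrative rather than the actual graph package code:

```go
package main

import (
    "fmt"
    "hash/fnv"
)

// nodeID is a hash of the parts that make an instance unique: kind and ID,
// plus a data type (receivers/exporters), a pipeline (processors), or an
// ordered pair of data types (connectors).
type nodeID int64

func newNodeID(parts ...string) nodeID {
    h := fnv.New64a()
    for _, p := range parts {
        _, _ = h.Write([]byte(p))
        _, _ = h.Write([]byte{0})
    }
    return nodeID(h.Sum64())
}

func main() {
    // The same configured exporter yields distinct node IDs per data type.
    fmt.Println(newNodeID("exporter", "otlp", "traces"))
    fmt.Println(newNodeID("exporter", "otlp", "metrics"))
    // A processor is unique per pipeline.
    fmt.Println(newNodeID("processor", "batch", "pipeline:traces"))
}
```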
> I'm wondering if we should assume that each instance of a component should report status individually, and if so, perhaps we can rename GlobalID to InstanceID, where the ID would have enough metadata to identify the underlying component. We could add some additional fields to the struct if needed.

I think it would be very hard for consumers of the events to make sense of this. The instancing is really an internal concern of the collector and not how we expect users to reason about the components.

> I'd also be interested in any other reasonable ways to approach this.

The only way I can think to do this is by modeling components and instances separately:
- A component model is presented to users, where each component is 1:1 with those defined in the configuration, except for processors, which are understood to be unique to each pipeline.
- An instance model underlies the component model but is not directly exposed to users.
- Every instance belongs to a component, i.e. a component may have one or more instances.
- Internally, each instance reports its own status.
- Externally, each component reports an aggregation of the statuses of its instances. (Often, a component will have only one instance, so its status is that of the instance.)

Based on this, I think the host's ReportComponentStatus should effectively be ReportInstanceStatus, and the reported status should be logged into a data structure which:
- Maintains status for all instances
- Understands the relationship between components and instances
- Roughly, map[GlobalComponentID]map[GlobalInstanceID]Status

After updating the instance status, the ReportInstanceStatus function should aggregate the statuses for the relevant component and emit an event to StatusWatchers.

There's some plumbing to figure out, but what do you think about this model conceptually?
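A rough sketch of this registry idea; all names below are hypothetical placeholders (the ID types are simplified to strings and a callback stands in for notifying StatusWatchers), not the collector's API:

```go
package main

import (
    "fmt"
    "sync"
)

type Status int

const (
    StatusOK Status = iota
    StatusRecoverableError
    StatusPermanentError
)

type GlobalComponentID string // 1:1 with the configuration (per pipeline for processors)
type GlobalInstanceID string  // internal: one per underlying instance

// Registry keeps a status per instance and aggregates per component.
type Registry struct {
    mu       sync.Mutex
    statuses map[GlobalComponentID]map[GlobalInstanceID]Status
    onChange func(GlobalComponentID, Status) // stands in for emitting to StatusWatchers
}

func NewRegistry(onChange func(GlobalComponentID, Status)) *Registry {
    return &Registry{
        statuses: map[GlobalComponentID]map[GlobalInstanceID]Status{},
        onChange: onChange,
    }
}

// ReportInstanceStatus records one instance's status and emits the
// aggregated ("lowest health", i.e. worst) status for the owning component.
func (r *Registry) ReportInstanceStatus(c GlobalComponentID, i GlobalInstanceID, s Status) {
    r.mu.Lock()
    defer r.mu.Unlock()
    if r.statuses[c] == nil {
        r.statuses[c] = map[GlobalInstanceID]Status{}
    }
    r.statuses[c][i] = s

    worst := StatusOK
    for _, st := range r.statuses[c] {
        if st > worst {
            worst = st
        }
    }
    r.onChange(c, worst)
}

func main() {
    r := NewRegistry(func(c GlobalComponentID, s Status) {
        fmt.Printf("component %s -> %d\n", c, s)
    })
    r.ReportInstanceStatus("otlpexporter", "otlpexporter/traces", StatusRecoverableError)
    r.ReportInstanceStatus("otlpexporter", "otlpexporter/metrics", StatusOK) // aggregate stays degraded
}
```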
> To clarify, are you talking about graph.nodeID as the current implementation of GlobalID, and service.Host's graph.Graph as the mechanism by which we're passing around pointers to the IDs?

I made some changes to graph in this PR. I ended up making a map[graph.nodeID]*component.GlobalID, and due to idiosyncrasies of the implementation, the component.GlobalID pointer is effectively an instance ID. That's not super important though.

> There's some plumbing to figure out, but what do you think about this model conceptually?

I'm ultimately open to any and all options, but there is one thing I wanted to bring up that likely complicates the ability to aggregate status at the component level. Right now we only have StatusOK and StatusError, but I would like a more expressive set of statuses for components to choose from. I was looking at GRPC status as a possibility. I think there is a status for many, if not all, issues a component could encounter. The 17 GRPC status codes are well defined and carry a lot of information about the health of a component as a simple code and message. I'm not sure how you would aggregate multiple statuses from a system like GRPC into a single code, though.

I wanted to do a little more homework on this idea before bringing it up, but now that I've mentioned it, what are your thoughts about adopting GRPC status for component status reporting? Would this change your thoughts about reporting status per instance vs. status per component?
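One way the GRPC codes could be folded into a coarser health level for aggregation, purely as an illustration; the mapping below is an assumption, not something proposed in this thread:

```go
package main

import (
    "fmt"

    "google.golang.org/grpc/codes"
)

// healthOf collapses a gRPC status code into a coarse health level so that
// multiple instance statuses can be aggregated by "worst wins".
func healthOf(c codes.Code) int {
    switch c {
    case codes.OK:
        return 0 // healthy
    case codes.Unavailable, codes.DeadlineExceeded, codes.ResourceExhausted:
        return 1 // likely transient
    default:
        return 2 // treat everything else as a hard error
    }
}

func main() {
    // Aggregate by taking the worst level across instances.
    worst := 0
    for _, c := range []codes.Code{codes.OK, codes.Unavailable} {
        if h := healthOf(c); h > worst {
            worst = h
        }
    }
    fmt.Println(worst) // 1: degraded but probably recoverable
}
```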
If we did not want to go with GRPC status, we could come up with our own. At a minimum, I'd like a component to be able to report:
Status | Description |
---|---|
StatusOK | The component is operating as expected |
StatusError | The component has encountered an "unexpected" error, look at the attached error for details |
StatusStartupError | The component failed to start |
StatusRecoverableError | The component is experiencing a likely transient error |
StatusPermanentError | The component has encountered an error from which it will not recover; the component is giving up and will become non-operational |
StatusFatalError | The error from this component should trigger a shutdown of the entire collector |
Setting aside the question of aggregation, I think there is an important distinction between what we are trying to represent and what GRPC is representing with status codes. A GRPC status code describes the response to a specific request. On the other hand, a component is a persistent thing which can be described as being in various states over time, and where the possible future states depend not only on current events but also on the previous state. For example, if we have a "Permanent Error" state, this should be a terminal state where no future event is going to move it back to "OK". Similarly, if we have a "Stopping" state, we would enforce constraints on future possible states. I think a finite state machine is probably an appropriate model here. The vanilla path would be simple enough, and GRPC codes might be a reasonable set of states to represent, but the question to me is whether we can clearly define the relationships between the states. My intuition is that the full set of GRPC codes may be too verbose.
I made a rough attempt to diagram the simpler set of states that you mentioned and this seems to be a bit more manageable. W/r/t aggregating multiple instances of a component into a single component status, perhaps the solution lies in having a well-defined state diagram for component status, but which takes into account instance-level information as needed to determine whether and/or when to transition to another status. For example, let's say an exporter has two instances, one for traces and one for metrics:
1. The traces instance reports a recoverable error.
2. The metrics instance reports a recoverable error.
3. The metrics instance reports that it has recovered.
4. The traces instance reports that it has recovered.
The aggregate behavior would be that the component enters the "Recoverable Error" state at step 1, and returns to the "OK" state at step 4. Optionally, we might want steps 2 and 3 to result in events to the status watcher, but these events are not changing the status, only the error message.
The implementation of this doesn't necessarily have to revolve around instances. For example, every recoverable error could be associated with a callback or ID which can be used to clear the error. Then we only need to track that all errors have been cleared. Anyways, does this seem like a design worth exploring further? I'm also open to anything that works, but it seems like we need some clarity on how instances reporting status-related events will contribute to overall component status.
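A sketch of what such a finite state machine could look like. The status names match the set this PR's description eventually settled on; the allowed-transition table itself is an illustrative assumption, not a finalized design:

```go
package main

import "fmt"

type Status string

const (
    Starting         Status = "Starting"
    OK               Status = "OK"
    RecoverableError Status = "RecoverableError"
    PermanentError   Status = "PermanentError"
    FatalError       Status = "FatalError"
    Stopping         Status = "Stopping"
    Stopped          Status = "Stopped"
)

// allowed lists the states reachable from each state. PermanentError,
// FatalError, and Stopped are terminal: nothing transitions out of them.
var allowed = map[Status][]Status{
    Starting:         {OK, RecoverableError, PermanentError, FatalError, Stopping},
    OK:               {RecoverableError, PermanentError, FatalError, Stopping},
    RecoverableError: {OK, PermanentError, FatalError, Stopping},
    Stopping:         {Stopped, PermanentError},
}

func canTransition(from, to Status) bool {
    for _, s := range allowed[from] {
        if s == to {
            return true
        }
    }
    return false
}

func main() {
    fmt.Println(canTransition(RecoverableError, OK)) // true: the error cleared
    fmt.Println(canTransition(PermanentError, OK))   // false: terminal state
}
```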
I was considering using GRPC status because many of the components are clients or servers of network requests, and they could return the result of their last request as their status. However, I share some of your concerns about GRPC status possibly being too verbose and not a good fit for modeling stateful components. I think modeling status as a finite state machine makes sense, and probably has some benefits. As an FSM, we would only have to report changes in status, rather than reporting the same status (possibly repeatedly). The other big benefit is that we could constrain the set of states to things that we know will be useful for monitoring collector health. A smaller set of states, so long as it effectively communicates component status, will be easier for component authors to use properly and easier for StatusWatchers to consume. I think this is a design that is worth exploring further.
I'm a HUGE fan of this state machine diagram. I think there are really two FSMs here, as mentioned: a component and the collector. I.e., would a permanent failure in a processor mean a permanent failure in the collector? Not necessarily. If we permanently fail to apply a transformation, would that stop the flow of telemetry? I would definitely like to see further design on this (probably in a separate issue). I think for this PR it's fine to use the system we have today, with the option of expanding this in the future.
A permanent failure in a component would affect the pipeline it's part of, but the collector would continue to run in a degraded mode. A fatal error would be one that would trigger a collector shutdown. I agree we need to hash out the various failure modes, make sure we have statuses for them, and that we can surface this information as best as possible.
Thanks for working through the gnarly details of how to aggregate status at a component level. The rules you came up with make sense. One piece of feedback I received from @mhausenblas at the Agent Management Working Group is that while users are interested in component status, they are very interested in knowing the health of each pipeline. I'd assume users would want to look at pipeline health first, and then look at component statuses to help diagnose pipeline issues. Is the reason that some components are represented by multiple instances that they belong to multiple pipelines? If that is the case, would it be easier to tie component health to pipeline health if we report status on an instance basis? If we wanted to link component health to pipeline health, I'd assume that we'd include the pipeline (or possibly pipelines) a component belongs to as part of the status event (likely as part of the GlobalID).
Not exactly, but that's almost correct. Each instance of a component corresponds to an independent behavior of the component (e.g. exporting logs vs. exporting traces).

This is a great idea. I think that all instances can be uniquely identified this way, and that it gives a clear way to roll up a status for a pipeline.

```go
type InstanceID struct {
    ID          ID
    Kind        Kind
    PipelineIDs []ID // order does not matter, so alternately could use map[ID]struct{}
}
```
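A sketch of how instance statuses could be rolled up per pipeline using the PipelineIDs on each InstanceID; the types are simplified and the worst-status-wins rule is an assumption carried over from the earlier aggregation discussion:

```go
package main

import "fmt"

type Status int

const (
    StatusOK Status = iota
    StatusRecoverableError
    StatusPermanentError
)

// Simplified InstanceID: just a name plus the pipelines it participates in.
type InstanceID struct {
    ID          string
    PipelineIDs []string
}

// pipelineHealth reports the worst status among the instances in each pipeline.
func pipelineHealth(statuses map[*InstanceID]Status) map[string]Status {
    health := map[string]Status{}
    for id, st := range statuses {
        for _, p := range id.PipelineIDs {
            if st > health[p] {
                health[p] = st
            }
        }
    }
    return health
}

func main() {
    otlpTraces := &InstanceID{ID: "otlp", PipelineIDs: []string{"traces"}}
    batchTraces := &InstanceID{ID: "batch", PipelineIDs: []string{"traces"}}
    fmt.Println(pipelineHealth(map[*InstanceID]Status{
        otlpTraces:  StatusRecoverableError,
        batchTraces: StatusOK,
    })) // map[traces:1] -> the traces pipeline is degraded
}
```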
Great. I think my next steps will be to replace GlobalID with InstanceID as discussed. OpAMP will be a consumer of this data, and we briefly discussed a rough plan of how this could work going forward. We'll have to modify the AgentHealth message to accommodate the additional component-level details.
@mwear Please open an issue in https://github.com/open-telemetry/opamp-spec/issues to discuss and work on this. Depending on what the changes are, this may need to happen before OpAMP 1.0 (unless the changes are additive).

@tigrannajaryan, I created open-telemetry/opamp-spec#165 to discuss.
Automatic status reporting for a SharedComponent needs to go through its telemetry settings in order to keep status reporting for the component and its instances in sync. This overrides the automatic status reporting that occurs in graph, and makes the reporting in graph essentially a no-op, as the SharedComponent preemptively transitions state during start and shutdown.
Description:
This PR introduces component status reporting. There have been several attempts to introduce this functionality previously, with the most recent being #6560. This PR was originally based off of #6560, but has evolved based on the feedback received and some additional enhancements to improve the ease of use of the ReportComponentStatus API.

In earlier discussions (see #8169 (comment)) we decided to model status as a finite state machine with the following statuses: Starting, OK, RecoverableError, PermanentError, FatalError, Stopping, and Stopped. A benefit of this design is that StatusWatchers will be notified on changes in status rather than on potentially repetitive reports of the same status.

With the additional statuses and modeling them using a finite state machine, there are more statuses to report. Rather than having each component be responsible for reporting all of the statuses, I automated status reporting where possible. A component's status will automatically be set to Starting at startup. If the component's Start returns an error, the status will automatically be set to PermanentError. A component is expected to report StatusOK when it has successfully started (if it has successfully started) and from there can report changes in status as it runs. It will likely be a common scenario for components to transition between StatusOK and StatusRecoverableError during their lifetime. In extenuating circumstances they can transition into the terminal states of PermanentError and FatalError (where a fatal error initiates collector shutdown). Additionally, during component Shutdown, statuses are automatically reported where possible. A component's status is set to Stopping when Shutdown is initially called; if Shutdown returns an error, the status will be set to PermanentError; if it does not return an error, the status is set to Stopped.

In #6560, ReportComponentStatus was implemented on the Host interface. I found that few components use the Host interface, and none of them save a handle to it (to be used outside of the start method). I found that many components keep a handle to the TelemetrySettings that they are initialized with, and this seemed like a more natural, convenient place for the ReportComponentStatus API. I'm ultimately flexible on where this method resides, but feel that TelemetrySettings is a more user-friendly place for it.

Regardless of where the ReportComponentStatus method resides (Host or TelemetrySettings), there is a difference in the method signature for the API based on whether it is used from the service or from a component. As the service is not bound to a specific component, it needs to take the instanceID of a component as a parameter, whereas the component version of the method already knows the instanceID. In #6560 this led to having both component.Host and servicehost.Host versions of the Host interface, to be used at the component or service levels. In this version, we have the same for TelemetrySettings. There is a component.TelemetrySettings and a servicetelemetry.Settings, with the only difference being the method signature of ReportComponentStatus.

Lastly, this PR sets up the machinery for reporting component status, and allows extensions to be StatusWatchers, but it does not introduce any StatusWatchers. We expect the OpAMP extension to be a StatusWatcher and use data from this system as part of its AgentHealth message (the message is currently being extended to accommodate more component-level details). We also expect there to be a non-OpAMP StatusWatcher implementation, likely via the HealthCheck extension (or something similar).

Link to tracking Issue: #7682
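A rough illustration of the component-side reporting flow described above: the ReportComponentStatus name and the statuses come from this PR, but the concrete types below are hypothetical stand-ins rather than the collector's actual API.

```go
package main

import (
    "errors"
    "fmt"
)

type status string

const (
    statusOK               status = "OK"
    statusRecoverableError status = "RecoverableError"
)

type statusEvent struct {
    status status
    err    error
}

// telemetrySettings stands in for the component-scoped settings the PR
// describes: it already knows which instance it belongs to.
type telemetrySettings struct {
    instance string
}

func (t telemetrySettings) ReportComponentStatus(ev statusEvent) {
    fmt.Printf("instance %s -> %s (err: %v)\n", t.instance, ev.status, ev.err)
}

type myExporter struct {
    telemetry telemetrySettings
}

func (e *myExporter) Start() error {
    // Starting and Start-failure statuses are reported automatically by the
    // service; the component itself reports OK once it is actually ready.
    e.telemetry.ReportComponentStatus(statusEvent{status: statusOK})
    return nil
}

func (e *myExporter) export() {
    // During operation a component may bounce between OK and
    // RecoverableError, e.g. while its backend is unreachable.
    e.telemetry.ReportComponentStatus(statusEvent{
        status: statusRecoverableError,
        err:    errors.New("backend unreachable"),
    })
}

func main() {
    exp := &myExporter{telemetry: telemetrySettings{instance: "otlp/traces"}}
    _ = exp.Start()
    exp.export()
}
```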
Testing: Unit tests
cc: @tigrannajaryan @djaglowski @evan-bradley