
Component Status Reporting #8169

Merged
merged 40 commits on Oct 6, 2023

Conversation

mwear
Member

@mwear mwear commented Aug 2, 2023

Description:
This PR introduces component status reporting. There have been several attempts to introduce this functionality previously, with the most recent being #6560.

This PR was originally based on #6560, but has evolved based on the feedback received and some additional enhancements to improve the ease of use of the ReportComponentStatus API.

In earlier discussions (see #8169 (comment)) we decided to model status as a finite state machine with the following statuses: Starting, OK, RecoverableError, PermanentError, FatalError, Stopping, and Stopped. A benefit of this design is that StatusWatchers will be notified on changes in status rather than on potentially repetitive reports of the same status.

With the additional statuses and modeling them using a finite state machine, there are more statuses to report. Rather than having each component be responsible for reporting all of the statuses, I automated status reporting where possible. A component's status will automatically be set to Starting at startup. If the component's Start returns an error, the status will automatically be set to PermanentError. A component is expected to report StatusOK once it has successfully started, and from there can report changes in status as it runs. It will likely be a common scenario for components to transition between StatusOK and StatusRecoverableError during their lifetime. In extenuating circumstances they can transition into the terminal states of PermanentError and FatalError (where a fatal error initiates collector shutdown). Additionally, during component Shutdown statuses are automatically reported where possible. A component's status is set to Stopping when Shutdown is initially called; if Shutdown returns an error, the status is set to PermanentError, and if it does not return an error, the status is set to Stopped.
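
To make the intended flow concrete, here is a rough sketch of how a component is expected to use the API (placeholder types and signatures for illustration; the exact names are in the PR diff):

// Sketch with placeholder types standing in for the component package; the
// exact names and signatures live in the PR diff, not here.
type Status int

const (
	StatusOK Status = iota
	StatusRecoverableError
	// ...plus Starting, PermanentError, FatalError, Stopping, Stopped
)

// Stand-in for component.TelemetrySettings carrying the ReportComponentStatus API.
type TelemetrySettings struct {
	ReportComponentStatus func(status Status, err error) error
}

type myReceiver struct {
	telemetry TelemetrySettings
}

func (r *myReceiver) Start() error {
	// The service has already set this component's status to Starting.
	if err := r.listen(); err != nil {
		return err // returning an error automatically results in PermanentError
	}
	// Report StatusOK once the component has actually started successfully.
	return r.telemetry.ReportComponentStatus(StatusOK, nil)
}

func (r *myReceiver) onTransientFailure(err error) {
	// Transient problems move the component to RecoverableError...
	_ = r.telemetry.ReportComponentStatus(StatusRecoverableError, err)
}

func (r *myReceiver) onRecovered() {
	// ...and back to StatusOK once the problem clears.
	_ = r.telemetry.ReportComponentStatus(StatusOK, nil)
}

func (r *myReceiver) listen() error { return nil } // placeholder for real startup work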

In #6560 ReportComponentStatus was implemented on the Host interface. I found that few components use the Host interface, and none of them save a handle to it (to be used outside of the start method). I found that many components keep a handle to the TelemetrySettings that they are initialized with, and this seemed like a more natural, convenient place for the ReportComponentStatus API. I'm ultimately flexible on where this method resides, but feel that TelemetrySettings is a more user-friendly place for it.

Regardless of where the ReportComponentStatus method resides (Host or TelemetrySettings), there is a difference in the method signature for the API based on whether it is used from the service or from a component. As the service is not bound to a specific component, it needs to take the instanceID of a component as a parameter, whereas the component version of the method already knows the instanceID. In #6560 this led to having both component.Host and servicehost.Host versions of the Host interface to be used at the component or service levels. In this version, we have the same for TelemetrySettings. There is a component.TelemetrySettings and a servicetelemetry.Settings with the only difference being the method signature of ReportComponentStatus.
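
As a rough illustration of the difference (placeholder types; not the exact definitions from this PR):

// Illustrative sketch of the two signatures; StatusEvent and InstanceID are
// placeholders for the real types.
type StatusEvent struct{ /* status, error, timestamp */ }

type InstanceID struct{ /* component ID, kind, pipeline(s) */ }

// Component-scoped (component.TelemetrySettings): the instance is implied by
// the settings the component was initialized with.
type componentStatusReporter interface {
	ReportComponentStatus(ev *StatusEvent) error
}

// Service-scoped (servicetelemetry.Settings): the service is not bound to a
// single component, so it must name the instance explicitly.
type serviceStatusReporter interface {
	ReportComponentStatus(id *InstanceID, ev *StatusEvent) error
}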

Lastly, this PR sets up the machinery for reporting component status, and allows extensions to be StatusWatchers, but it does not introduce any StatusWatchers. We expect the OpAMP extension to be a StatusWatcher and use data from this system as part of its AgentHealth message (the message is currently being extended to accommodate more component level details). We also expect there to be a non-OpAMP StatusWatcher implementation, likely via the HealthCheck extension (or something similar).
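
For context, a StatusWatcher would look roughly like the following, reusing the placeholder StatusEvent and InstanceID types from the sketch above (the real interface lives in the extension package):

// Rough sketch of a StatusWatcher extension; not the exact interface.
type StatusWatcher interface {
	// Called by the service whenever a component's status changes.
	ComponentStatusChanged(source *InstanceID, event *StatusEvent)
}

// A health-reporting extension could implement it by recording the most
// recent event per component instance.
type healthWatcher struct {
	latest map[*InstanceID]*StatusEvent
}

func (w *healthWatcher) ComponentStatusChanged(source *InstanceID, event *StatusEvent) {
	if w.latest == nil {
		w.latest = map[*InstanceID]*StatusEvent{}
	}
	w.latest[source] = event
	// A real watcher would surface this via OpAMP AgentHealth or a health endpoint.
}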

Link to tracking Issue: #7682

Testing: Unit tests

cc: @tigrannajaryan @djaglowski @evan-bradley

@codecov

codecov bot commented Aug 2, 2023

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Files Coverage Δ
component/component.go 38.63% <ø> (ø)
extension/extension.go 97.61% <ø> (ø)
extension/extensiontest/statuswatcher_extension.go 100.00% <100.00%> (ø)
internal/sharedcomponent/sharedcomponent.go 100.00% <100.00%> (ø)
processor/processortest/unhealthy_processor.go 100.00% <100.00%> (ø)
receiver/otlpreceiver/factory.go 78.57% <100.00%> (+2.00%) ⬆️
receiver/otlpreceiver/otlp.go 82.94% <100.00%> (ø)
service/host.go 91.66% <100.00%> (+2.19%) ⬆️
service/internal/graph/graph.go 98.88% <100.00%> (+0.22%) ⬆️
...nternal/servicetelemetry/nop_telemetry_settings.go 100.00% <100.00%> (ø)
... and 7 more

... and 2 files with indirect coverage changes


@mwear mwear force-pushed the component-status branch 2 times, most recently from beb7ade to cec36a7 on August 2, 2023 21:58
Comment on lines 179 to 184
// GlobalID uniquely identifies a component
type GlobalID struct {
ID ID
Kind Kind
PipelineID ID // Not empty only if the Kind is Processor
}
Member

@djaglowski djaglowski Aug 3, 2023

There are a couple nuances to the way we instance components within the service, which I think likely need to be reconciled in this PR.

Receivers and exporters are instanced per data type. For example, users will often configure a single otlpexporter for exporting multiple types of data, but the underlying implementation creates separate instances, which would share the same GlobalID.

This is probably the correct model to present externally, because it aligns with the user's understanding, but we will have to define how to reconcile with the underlying implementation. Minimally, instances of the "same" component must be able to safely publish updates concurrently.

This raises a usability complication though, because we may expect status thrashing or burying of problems. e.g. the traces instance of a component reports a problem, but the metrics instance immediately clobbers the status by reporting that it is healthy.

I think we probably need to maintain a separate status for each instance of the "same" component, and aggregate them whenever an event is published. e.g. report "lowest" health state of any instance that shares the GlobalID, regardless of which instance sent the most recent event.

To further complicate things, we have the sharedcomponent package, which allows component authors to opt in to a unified model. In this case, the service still reasons about separate instances, but there is in fact only one instance after all. For example, the otlpreceiver uses this mechanism so that the shared instance can present a single port. I'm not certain how best to handle this, but perhaps the sharedcomponent package needs to manage status reporting for the components that use it.

Connectors share the same concerns as receivers and exporters, except instancing is per ordered pair of data types. sharedcomponent may or may not be used here as well.

Member Author

@mwear mwear Aug 3, 2023

🤯 this sounds somewhat complicated, but thanks for the explanation.

I'm wondering if we should assume that each instance of a component should report status individually, and if so, perhaps we can rename GlobalID to InstanceID, where the ID would have enough metadata to identify the underlying component. We could add some additional fields to the struct if needed.

I'm also wondering if this is accidentally already happening with the GlobalID as it's currently implemented. We are passing around pointers to GlobalID instances, via the host wrapper. Is the code that generates the GlobalIDs in graph generating a GlobalID per component instance? If so, we would have a GlobalID instance per component, each with a distinct pointer, but possibly the same struct values.

I'd also be interested in any other reasonable ways to approach this.

Member

I'm also wondering if this is accidentally already happening with the GlobalID as it's currently implemented. We are passing around pointers to GlobalID instances, via the host wrapper. Is the code that generates the GlobalIDs in graph generating a GlobalID per component instance? If so, we would have a GlobalID instance per component, each with a distinct pointer, but possibly the same struct values.

To clarify, are you talking about graph.nodeID as the current implementation of GlobalID, and service.Host's graph.Graph as the mechanism by which we're passing around pointers to the IDs?

Assuming so, I don't believe there are uniqueness problems there because the nodeIDs are hashes that are generated based on the considerations I mentioned above. Processor IDs are hashed with a specific pipeline ID. Receiver IDs and Exporter IDs are hashed with a data type, and Connector IDs are hashed with an ordered pair of data types. nodeID is also not accessible outside of the graph package, and I don't think there's any way to indirectly reference or modify them.

I'm wondering if we should assume that each instance of a component should report status individually, and if so, perhaps we can rename GlobalID to InstanceID, where the ID would have enough metadata to identify the underlying component. We could add some additional fields to the struct if needed.

I think it would be very hard for consumers of the events to make sense of this. The instancing is really an internal concern of the collector and not how we expect users to reason about the components.

I'd also be interested in any other reasonable ways to approach this.

The only way I can think to do this is by modeling component and instance separately:

  1. A component model is presented to users, where each component is 1:1 with those defined in the configuration, except for processors which are understood to be unique to each pipeline.
  2. An instance model underlies the component model but is not directly exposed to users.
  3. Every instance belongs to a component. i.e. A component may have one or more instances.
  4. Internally, each instance reports its own status.
  5. Externally, each component reports an aggregation of the statuses of its instances. (Often, a component will have only 1 instance, so its status is that of the instance)

Based on this, I think the host's ReportComponentStatus should effectively be ReportInstanceStatus and the reported status should be logged into a data structure which:

  1. Maintains status for all instances
  2. Understands the relationship between components and instances
  3. Roughly, map[GlobalComponentID]map[GlobalInstanceID]Status

After updating the instance status, the ReportInstanceStatus function should aggregate the statuses for the relevant component and emit an event to StatusWatchers.
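
A very rough sketch of that structure (every name below is a placeholder, not code from this PR):

import "sync"

// Placeholder types; Status is assumed to be ordered so that lower values
// mean less healthy.
type (
	GlobalComponentID string
	GlobalInstanceID  string
	Status            int
)

type statusRegistry struct {
	mu       sync.Mutex
	statuses map[GlobalComponentID]map[GlobalInstanceID]Status
	notify   func(GlobalComponentID, Status) // fans out to StatusWatchers
}

func (r *statusRegistry) ReportInstanceStatus(cid GlobalComponentID, iid GlobalInstanceID, s Status) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.statuses[cid] == nil {
		r.statuses[cid] = map[GlobalInstanceID]Status{}
	}
	r.statuses[cid][iid] = s

	// Aggregate across all instances of the component, e.g. report the
	// "lowest" health among them, then emit one component-level event.
	agg := s
	for _, other := range r.statuses[cid] {
		if other < agg {
			agg = other
		}
	}
	r.notify(cid, agg)
}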

There's some plumbing to figure out, but what do you think about this model conceptually?

Member Author

@mwear mwear Aug 4, 2023

To clarify, are you talking about graph.nodeID as the current implementation of GlobalID, and service.Host's graph.Graph as the mechanism by which we're passing around pointers to the IDs?

I made some changes to graph in this PR. I ended up making a map[graph.nodeID]*component.GlobalID and due to idiosyncrasies of the implementation, the component.GlobalID pointer is effectively an instance id. That's not super important though.

There's some plumbing to figure out, but what do you think about this model conceptually?

I'm ultimately open to any and all options, but there is one thing I wanted to bring up that likely complicates the ability to aggregate status at the component level. Right now we only have StatusOK and StatusError, but I would like a more expressive set of statuses for components to choose from. I was looking at GRPC status as a possibility. I think there is a status for many, if not all issues a component could encounter. The 17 GRPC status codes are well-defined and carry a lot of information about the health of a component as a simple code and message. I'm not sure how you would aggregate multiple statuses from a system like GRPC into a single code though.

I wanted to do a little more homework on this idea before bringing it up, but now that I've mentioned it, what are your thoughts about adopting GRPC status for component status reporting? Would this change your thoughts about reporting status per instance vs status per component?

Member Author

@mwear mwear Aug 4, 2023

If we did not want to go with GRPC status, we could come up with our own. At a minimum, I'd like a component to be able to report:

Status: Description
StatusOK: The component is operating as expected
StatusError: The component has encountered an "unexpected" error, look at the attached error for details
StatusStartupError: The component failed to start
StatusRecoverableError: The component is experiencing a likely transient error
StatusPermanentError: The component has encountered an error from which it will not recover; the component is giving up and will become non-operational
StatusFatalError: The error from this component should trigger a shutdown of the entire collector

@djaglowski
Member

I was looking at GRPC status as a possibility. I think there is a status for many, if not all issues a component could encounter. The 17 GRPC status codes are well-defined and carry a lot of information about the health of a component as a simple code and message. I'm not sure how you would aggregate multiple statuses from a system like GRPC into a single code though.

Setting aside the question of aggregation, I think there is an important distinction between what we are trying to represent and what GRPC is representing with status codes. A GRPC status code describes the response to a specific request. On the other hand, a component is a persistent thing which can be described as being in various states over time, and where the possible future states depend not only on current events but also on the previous state. For example, if we have a "Permanent Error" state, this should be a terminal state where no future event is going to move it back to "OK". Similarly, if we have a "Stopping" state, we would enforce constraints on future possible states.

I think a finite state machine is probably an appropriate model here. The vanilla path might be Starting -> OK -> Stopping -> Stopped. We can have various error states, only some of which are recoverable, etc.

GRPC codes might be a reasonable set of states to represent, but the question to me is whether we can clearly define the relationships between the states. My intuition is that the full set of GRPC codes may be too verbose.


Right now we only have StatusOK and StatusError, but I would like a more expressive set of statuses for components to choose from.

I made a rough attempt to diagram the simpler set of states that you mentioned and this seems to be a bit more manageable.

[state diagram: proposed component status state machine]

W/r/t aggregating multiple instances of a component into a single component status, perhaps the solution lies in having a well defined state diagram for component status, but which takes into account instance-level information as needed to determine whether and/or when to transition to another status.

For example, let's say an otlpexporter is configured for use in both a traces and metrics pipeline, and that everything has started up successfully so we are in an "OK" state. Then the following occurs:

  1. The metrics instance reports a recoverable error
  2. The traces instance reports a recoverable error
  3. The traces instance reports that it has recovered
  4. The metrics instance reports that it has recovered

The aggregate behavior would be that the component enters the "Recoverable Error" state at step 1, and returns to the "OK" state at step 4. Optionally, we might want steps 2 and 3 to result in events to the status watcher, but these events are not changing the status, only the error message. E.g.

  1. Error: "metrics..."
  2. Error: "metrics ..., traces ..."
  3. Error: "metrics..."
  4. OK

The implementation of this doesn't necessarily have to revolve around instances. For example, every recoverable error could be associated with a callback or ID which can be used to clear the error. Then we only need to track that all errors have been cleared.
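
Purely for illustration, that alternative could look something like this (none of it is part of this PR):

import "sync"

// Illustrative only.
type recoverableTracker struct {
	mu     sync.Mutex
	nextID int
	open   map[int]error // outstanding recoverable errors by ID
}

// report registers a recoverable error and returns a function that clears it.
func (t *recoverableTracker) report(err error) (clear func()) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.open == nil {
		t.open = map[int]error{}
	}
	id := t.nextID
	t.nextID++
	t.open[id] = err
	return func() {
		t.mu.Lock()
		defer t.mu.Unlock()
		delete(t.open, id)
	}
}

// healthy reports whether every reported error has since been cleared.
func (t *recoverableTracker) healthy() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.open) == 0
}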

Anyway, does this seem like a design worth exploring further? I'm also open to anything that works, but it seems like we need some clarity on how instances reporting status-related events will contribute to overall component status.

@mwear
Member Author

mwear commented Aug 7, 2023

I was considering using GRPC status because many of the components are clients or servers of network requests, and they could return the result of their last request as their status. However, I share some of your concerns about GRPC status possibly being too verbose and not a good fit for modeling stateful components. I think modeling status as a finite state machine makes sense, and probably has some benefits. As an FSM we would only have to report changes in status, rather than reporting the same status (possibly repeatedly). The other big benefit is that we could constrain the set of states to things that we know will be useful for monitoring collector health. A smaller set of states, so long as they effectively communicate component status, will be easier for component authors to use properly, and easier for StatusWatchers to consume. I think this is a design that is worth exploring further.

@jaronoff97
Contributor

I'm a HUGE fan of this state machine diagram. I think there are really two FSMs here as mentioned: a component and the collector. For example, would a permanent failure in a processor mean a permanent failure in the collector? Not necessarily. If we permanently fail to apply a transformation, would that stop the flow of telemetry? I would definitely like to see further design on this (probably in a separate issue), but for this PR I think it's fine to use the system we have today, with the option of expanding it in the future.

@mwear
Member Author

mwear commented Aug 8, 2023

would a permanent failure in a processor mean a permanent failure in the collector?

A permanent failure in a component would affect the pipeline it's part of, but the collector would continue to run in a degraded mode. A fatal error would be one that would trigger a collector shutdown. I agree we need to hash out the various failure modes, make sure we have statuses for them and that we can surface this information as best as possible.

@djaglowski
Member

djaglowski commented Aug 8, 2023

I iterated on the initial state machine and came up with the following, which has just a couple changes:

  • Clarifies that both "Permanent" and "Fatal" are directly reachable from any state besides "Stopped"
  • Allows transition from "Recoverable" directly to "Stopping". The idea being that we still have control so should be able to issue a stop command before it recovers.
[state diagram: revised instance status state machine]
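
One way to read the revised diagram is as a transition table. A sketch of that reading (illustrative only; other transitions, e.g. Starting directly to Stopping, may also be reasonable):

// Sketch of the revised transitions: Permanent and Fatal are reachable from
// every state except Stopped; Recoverable may go directly to Stopping.
type Status int

const (
	Starting Status = iota
	OK
	Recoverable
	Permanent
	Fatal
	Stopping
	Stopped
)

var allowed = map[Status][]Status{
	Starting:    {OK, Permanent, Fatal},
	OK:          {Recoverable, Stopping, Permanent, Fatal},
	Recoverable: {OK, Stopping, Permanent, Fatal},
	Stopping:    {Stopped, Permanent, Fatal},
	// Permanent, Fatal, and Stopped are terminal.
}

func transitionAllowed(from, to Status) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}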

Then I tried to figure out how this would apply to components vs instances. The only way I could make this work was to say that the above diagram defines the lifecycle of an instance. Then I worked out some aggregation logic.

To work through this, I imagined two instances moving through the state machine independently, and then represented this as a compound state machine, where each state represents a pair of states (one for each instance) and the transitions between states each represent a change to one of the instances.

I found that the aggregate states seem somewhat intuitive, based on some rules (see below). Here's the compound machine:

[state diagram: compound state machine for two instances]

The rules that I think make sense, in order of priority are as follows:

  1. If any instance encounters a fatal error, the component is in a Fatal Error state (if this even matters, since the collector will be killed).
  2. If any instance is in a Permanent Error state, the component status is Permanent Error.
  3. If any instance is Stopping, the component is in a Stopping state. That is, there is a clear intention to stop all instances and so the component is Stopping until it is fully Stopped, or encounters a Fatal or Permanent Error.
  4. If any instance is Stopped, but no instances are Stopping, we must be in the process of stopping the component. This is an edge case where one instance has complied with a stop command but the stop command hasn't been issued to another instance yet.
  5. If all instances are Stopped, the component is Stopped.
  6. If any instance is in a Recoverable Error state, the component status is Recoverable Error.
  7. If any instance is Starting, the component status is Starting.
  8. None of the above were true, so the component is OK. (In other words, all instances are OK.)

In code:

// Aggregates the statuses of a component's instances into a single
// component-level status, applying the priority rules above.
func (c Component) Status() Status {

  for _, i := range c.instances {
    if i.status == Fatal {
      return Fatal
    }
  }

  for _, i := range c.instances {
    if i.status == Permanent {
      return Permanent
    }
  }

  for _, i := range c.instances {
    if i.status == Stopping {
      return Stopping
    }
  }

  var stopped, nonstopped bool
  for _, i := range c.instances {
    if i.status == Stopped {
      stopped = true
    } else {
      nonstopped = true
    }
  }
  if stopped && nonstopped {
    return Stopping
  } else if stopped {
    return Stopped
  }

  for _, i := range c.instances {
    if i.status == Recoverable {
      return Recoverable
    }
  }

  for _, i := range c.instances {
    if i.status == Starting {
      return Starting
    }
  }

  return OK
}

@mwear
Member Author

mwear commented Aug 9, 2023

Thanks for working through the gnarly details of how to aggregate status at a component level. The rules you came up with make sense.

One piece of feedback I received from @mhausenblas at the Agent Management Working Group is that while users are interested in component status, they are very interested in knowing the health of each pipeline. I'd assume users would want to look at pipeline health first, and then look at component statuses to help diagnose pipeline issues. Is the reason that some components are represented by multiple instances that they belong to multiple pipelines? If that is the case, would it be easier to tie component health to pipeline health if we report status on a per-instance basis?

If we wanted to link component health to pipeline health, I'd assume that we'd include the pipeline (or possibly pipelines) a component belongs to as part of the status event (likely part of the InstanceID of the component). A plausible way for a status watcher to determine pipeline status would be to group by pipeline and follow rules similar to what you came up with for aggregating component status. I'm wondering if this scenario is any more difficult if we aggregate status at a component level, before it is sent to a status watcher. I fear that we may lose the ability to map components to pipelines in this situation, but I'm still trying to understand how components, component instances, and pipelines are related under the hood.

@djaglowski
Member

Is the reason that some components are represented by multiple instances that they belong to multiple pipelines?

Not exactly, but that's almost correct. Each instance of a component corresponds to an independent behavior of the component (e.g. exporting logs vs exporting traces).

If we wanted to link component health to pipeline health, I'd assume that we'd include the pipeline (or possibly pipelines) a component belongs to as part of the status event (likely part of the InstanceID of the component). A plausible way for a status watcher to determine pipeline status would be to group by pipeline

This is a great idea. I think that all instances can be uniquely identified this way, and that it gives a clear way to roll up a status for a pipeline.

type InstanceID struct {
	ID          ID
	Kind        Kind
	PipelineIDs []ID // order does not matter, so alternately could use map[ID]struct{}
}
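
A watcher could then bucket instance statuses by pipeline and reduce each bucket with the same rules we use to aggregate a component's instances. Placeholder sketch (Status stands in for whatever status type we land on):

// Placeholder sketch; InstanceID is the struct above, ID and Status stand in
// for the component ID and status types.
type Status int

func pipelineStatuses(instances map[*InstanceID]Status) map[ID][]Status {
	byPipeline := map[ID][]Status{}
	for id, status := range instances {
		for _, pid := range id.PipelineIDs {
			byPipeline[pid] = append(byPipeline[pid], status)
		}
	}
	// Each slice can then be reduced with the same rules used to aggregate a
	// component's instances into a single status.
	return byPipeline
}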

@mwear
Member Author

mwear commented Aug 10, 2023

Great. I think my next steps will be to replace GlobalID with InstanceID and add the additional statuses. I think the goal is to have enough information on the status events so that status watchers can make use of them.

OpAMP will be a consumer of this data and we briefly discussed a rough plan of how this could work going forward. We'll have to modify the AgentHealth message to accommodate a map of values so that it can hold health information about multiple entities. The OpAMP extension can implement the StatusWatcher interface and surface health information out of the collector. We can probably apply similar ideas to the health check extension for the non-OpAMP use case.

@tigrannajaryan
Member

OpAMP will be a consumer of this data and we briefly discussed a rough plan of how this could work going forward. We'll have to modify the AgentHealth message to accommodate a map of values so that it can hold health information about multiple entities. The OpAMP extension can implement the StatusWatcher interface and surface health information out of the collector. We can probably apply similar ideas to the health check extension for the non-OpAMP use case.

@mwear Please open an issue in https://github.com/open-telemetry/opamp-spec/issues to discuss and work on this. Depending on what the changes are, this may need to happen before OpAMP 1.0 (unless the changes are additive).

@mwear
Member Author

mwear commented Aug 17, 2023

@tigrannajaryan, I created open-telemetry/opamp-spec#165 to discuss.

@mwear mwear force-pushed the component-status branch 2 times, most recently from 5dcb80b to d837d72 on August 24, 2023 00:23
@github-actions
Contributor

github-actions bot commented Sep 8, 2023

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Sep 8, 2023
@mwear mwear force-pushed the component-status branch 6 times, most recently from 49a4bb0 to 97aa3cb on September 11, 2023 03:14
@github-actions github-actions bot removed the Stale label Sep 11, 2023
@mwear mwear force-pushed the component-status branch 5 times, most recently from 4fe8a3f to b02c2fe on September 11, 2023 15:16
mwear and others added 18 commits October 6, 2023 10:52
Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>
Co-authored-by: Evan Bradley <11745660+evan-bradley@users.noreply.github.com>
Co-authored-by: Evan Bradley <11745660+evan-bradley@users.noreply.github.com>
Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com>
Automatic status reporting for a SharedComponent needs to go through
its telemetry settings in order to keep status reporting for the component
and its instances in sync. This overrides the automatic status reporting
that occurs in graph, and makes the reporting in graph essentially a no-op
as the SharedComponent preemptively transitions state during start and
shutdown.
Co-authored-by: Alex Boten <alex@boten.ca>