feat: opamp persist health message #1398
Merged
14 commits
82b8dee feat: record agent health into CRD (blumamir)
88f243a feat: report opamp client no heartbeat correctly (blumamir)
d6b3fee feat: write health message to CRD and in cli describe (blumamir)
049601e feat: new health status Starting (blumamir)
6855bab chore: make api (blumamir)
e8383d5 Merge remote-tracking branch 'upstream/main' into agent-health-message (blumamir)
c6029cb Merge remote-tracking branch 'upstream/main' into agent-health-message (blumamir)
df511e9 fix: instrumentaiton instance override only what specified (blumamir)
942652d fix: new attribute in nodejs identifying attributes (blumamir)
d8092c9 fix: test add attribute in assert for other test (blumamir)
61169c5 fix: test ordered attributes correctly (blumamir)
ae90a0c fix: mark python as healthy null until it is implemented (blumamir)
df2c7b8 fix: test update python health (blumamir)
8fabac2 chore: rename types and functions for better readability (blumamir)
@@ -0,0 +1,25 @@
+package common

+type AgentHealthStatus string

+const (
+    // AgentHealthStatusHealthy represents the healthy status of an agent.
+    // It started the OpenTelemetry SDK with no errors, processed any configuration and is ready to receive data.
+    AgentHealthStatusHealthy AgentHealthStatus = "Healthy"

+    // AgentHealthStatusStarting represents that the agent is starting and no health status is available yet.
+    // Once the agent finishes starting, it should report either a healthy or an unhealthy status, depending on the result.
+    AgentHealthStatusStarting AgentHealthStatus = "Starting"

+    // AgentHealthStatusUnsupportedRuntimeVersion represents that the agent is running on an unsupported runtime version.
+    // For example: the OTel SDK supports Node.js >= 14 and the workload is running with Node.js 12.
+    AgentHealthStatusUnsupportedRuntimeVersion = "UnsupportedRuntimeVersion"

+    // AgentHealthStatusNoHeartbeat is when the server did not receive 3 heartbeats from the agent, thus it is considered unhealthy.
+    AgentHealthStatusNoHeartbeat = "NoHeartbeat"

+    // AgentHealthProcessTerminated is when the agent process is terminated.
+    // The termination can be due to normal shutdown (e.g. the event loop runs out of work),
+    // due to explicit termination (e.g. code calls exit(), or an OS signal), or due to an error (e.g. an unhandled exception).
+    AgentHealthProcessTerminated = "ProcessTerminated"
+)
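
The `NoHeartbeat` comment above describes an agent as unhealthy once 3 heartbeats have been missed. A minimal sketch of such a check, assuming a fixed heartbeat interval. `isHeartbeatStale`, `maxMissedHeartbeats`, and the 10-second interval are hypothetical names and values chosen for illustration; they are not taken from this PR.

```go
package main

import (
	"fmt"
	"time"
)

// maxMissedHeartbeats mirrors the "3 heartbeats" mentioned in the
// AgentHealthStatusNoHeartbeat comment. The constant name is hypothetical.
const maxMissedHeartbeats = 3

// isHeartbeatStale reports whether at least maxMissedHeartbeats intervals
// have passed since the last message arrived from the agent.
// heartbeatInterval is an assumed parameter; the real server may obtain it differently.
func isHeartbeatStale(sinceLastMessage, heartbeatInterval time.Duration) bool {
	return sinceLastMessage >= maxMissedHeartbeats*heartbeatInterval
}

func main() {
	interval := 10 * time.Second // assumed value, for illustration only
	now := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

	recent := now.Add(-15 * time.Second)
	old := now.Add(-30 * time.Second)

	fmt.Println(isHeartbeatStale(now.Sub(recent), interval)) // false
	fmt.Println(isHeartbeatStale(now.Sub(old), interval))    // true
}
```

Passing the elapsed duration rather than reading the clock inside the function keeps the check deterministic and easy to test.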
@@ -16,8 +16,6 @@ import (
     "github.com/odigos-io/odigos/opampserver/pkg/sdkconfig/configresolvers"
     "github.com/odigos-io/odigos/opampserver/protobufs"
     semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
-    apierrors "k8s.io/apimachinery/pkg/api/errors"
-    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
     "k8s.io/apimachinery/pkg/runtime"
     "sigs.k8s.io/controller-runtime/pkg/client"
 )
@@ -79,7 +77,6 @@ func (c *ConnectionHandlers) OnNewConnection(ctx context.Context, deviceId strin
         c.logger.Error(err, "failed to get full config", "k8sAttributes", k8sAttributes)
         return nil, nil, err
     }

     c.logger.Info("new OpAMP client connected", "deviceId", deviceId, "namespace", k8sAttributes.Namespace, "podName", k8sAttributes.PodName, "instrumentedAppName", instrumentedAppName, "workloadKind", k8sAttributes.WorkloadKind, "workloadName", k8sAttributes.WorkloadName, "containerName", k8sAttributes.ContainerName, "otelServiceName", k8sAttributes.OtelServiceName)

     connectionInfo := &connection.ConnectionInfo{
@@ -118,25 +115,42 @@ func (c *ConnectionHandlers) OnAgentToServerMessage(ctx context.Context, request
 }

 func (c *ConnectionHandlers) OnConnectionClosed(ctx context.Context, connectionInfo *connection.ConnectionInfo) {
-    c.logger.Info("Connection closed for device", "deviceId", connectionInfo.DeviceId)
-    instrumentationInstanceName := instrumentation_instance.InstrumentationInstanceName(connectionInfo.Pod, int(connectionInfo.Pid))
-    err := c.kubeclient.Delete(ctx, &odigosv1.InstrumentationInstance{
-        TypeMeta: metav1.TypeMeta{
-            APIVersion: "odigos.io/v1alpha1",
-            Kind:       "InstrumentationInstance",
-        },
-        ObjectMeta: metav1.ObjectMeta{
-            Name:      instrumentationInstanceName,
-            Namespace: connectionInfo.Pod.GetNamespace(),
-        },
-    })

-    if err != nil && !apierrors.IsNotFound(err) {
-        c.logger.Error(err, "failed to delete instrumentation instance", "instrumentationInstanceName", instrumentationInstanceName)
+    // keep the instrumentation instance CR in unhealthy state so it can be used for troubleshooting
 }

+func (c *ConnectionHandlers) OnConnectionNoHeartbeat(ctx context.Context, connectionInfo *connection.ConnectionInfo) error {
+    healthy := false
+    message := fmt.Sprintf("OpAMP server did not receive heartbeat from the agent, last message time: %s", connectionInfo.LastMessageTime.Format("2006-01-02 15:04:05 MST"))
+    // keep the instrumentation instance CR in unhealthy state so it can be used for troubleshooting
+    err := instrumentation_instance.UpdateInstrumentationInstanceStatus(ctx, connectionInfo.Pod, connectionInfo.ContainerName, c.kubeclient, connectionInfo.InstrumentedAppName, int(connectionInfo.Pid), c.scheme,
+        instrumentation_instance.WithHealthy(&healthy, common.AgentHealthStatusNoHeartbeat, &message),
+    )
+    if err != nil {
+        return fmt.Errorf("failed to persist instrumentation instance health status on connection timedout: %w", err)
+    }

+    return nil
 }

-func (c *ConnectionHandlers) PersistInstrumentationDeviceStatus(ctx context.Context, message *protobufs.AgentToServer, connectionInfo *connection.ConnectionInfo) error {
+func (c *ConnectionHandlers) UpdateInstrumentationInstanceStatus(ctx context.Context, message *protobufs.AgentToServer, connectionInfo *connection.ConnectionInfo) error {

+    isAgentDisconnect := message.AgentDisconnect != nil
Review comment: maybe better to check at the beginning of this func?
Reply: moved to the top 👍
+    hasHealth := message.Health != nil
+    // when the agent disconnects, it needs to report that it is unhealthy and disconnected
+    if isAgentDisconnect {
+        if !hasHealth {
+            return fmt.Errorf("missing health in agent disconnect message")
+        }
+        if message.Health.Healthy {
+            return fmt.Errorf("agent disconnect message with healthy status")
+        }
+        if message.Health.LastError == "" {
+            return fmt.Errorf("missing last error in unhealthy message")
+        }
+    }

+    dynamicOptions := make([]instrumentation_instance.InstrumentationInstanceOption, 0)

     if message.AgentDescription != nil {
         identifyingAttributes := make([]odigosv1.Attribute, 0, len(message.AgentDescription.IdentifyingAttributes))
         for _, attr := range message.AgentDescription.IdentifyingAttributes {
@@ -146,13 +160,18 @@ func (c *ConnectionHandlers) PersistInstrumentationDeviceStatus(ctx context.Cont
                 Value: strValue,
             })
         }
+        dynamicOptions = append(dynamicOptions, instrumentation_instance.WithAttributes(identifyingAttributes, []odigosv1.Attribute{}))
+    }

+    // the agent is only expected to send its health status when it changes, so if present, persist it to the CRD as the new status
+    if hasHealth {
+        // always record healthy status into the CRD, to reflect the current state
+        healthy := message.Health.Healthy
+        dynamicOptions = append(dynamicOptions, instrumentation_instance.WithHealthy(&healthy, message.Health.Status, &message.Health.LastError))
+    }

-        healthy := true // TODO: populate this field with real health status
-        err := instrumentation_instance.PersistInstrumentationInstanceStatus(ctx, connectionInfo.Pod, connectionInfo.ContainerName, c.kubeclient, connectionInfo.InstrumentedAppName, int(connectionInfo.Pid), c.scheme,
-            instrumentation_instance.WithIdentifyingAttributes(identifyingAttributes),
-            instrumentation_instance.WithMessage("Agent connected"),
-            instrumentation_instance.WithHealthy(&healthy),
-        )
+    if len(dynamicOptions) > 0 {
+        err := instrumentation_instance.UpdateInstrumentationInstanceStatus(ctx, connectionInfo.Pod, connectionInfo.ContainerName, c.kubeclient, connectionInfo.InstrumentedAppName, int(connectionInfo.Pid), c.scheme, dynamicOptions...)
         if err != nil {
             return fmt.Errorf("failed to persist instrumentation instance status: %w", err)
         }
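
The handler in this diff follows a pattern that may be easier to see in isolation: validate a disconnect message first, collect an option only for each field the agent actually sent, and write the status only when at least one option was collected. The sketch below is a simplified, self-contained illustration of that flow. `Health`, `AgentToServer`, `InstanceStatus`, `Option`, `validateDisconnect` and `updateStatus` are hypothetical stand-ins, not the real protobuf, CRD or `instrumentation_instance` types.

```go
package main

import (
	"errors"
	"fmt"
)

// Health and AgentToServer are simplified stand-ins for the OpAMP messages.
type Health struct {
	Healthy   bool
	Status    string
	LastError string
}

type AgentToServer struct {
	Health          *Health
	AgentDisconnect *struct{}
}

// InstanceStatus is a hypothetical stand-in for the persisted status.
type InstanceStatus struct {
	Healthy *bool
	Reason  string
	Message string
}

// Option mutates only the fields it is responsible for.
type Option func(*InstanceStatus)

func WithHealthy(healthy *bool, reason string, message *string) Option {
	return func(s *InstanceStatus) {
		s.Healthy = healthy
		s.Reason = reason
		if message != nil {
			s.Message = *message
		}
	}
}

// validateDisconnect rejects a disconnect that does not explain itself.
func validateDisconnect(msg *AgentToServer) error {
	if msg.AgentDisconnect == nil {
		return nil
	}
	if msg.Health == nil {
		return errors.New("missing health in agent disconnect message")
	}
	if msg.Health.Healthy {
		return errors.New("agent disconnect message with healthy status")
	}
	if msg.Health.LastError == "" {
		return errors.New("missing last error in unhealthy message")
	}
	return nil
}

// updateStatus validates first, then applies only the options that the
// message provides, leaving every other field of the status untouched.
func updateStatus(status *InstanceStatus, msg *AgentToServer) error {
	if err := validateDisconnect(msg); err != nil {
		return err
	}
	opts := make([]Option, 0)
	if msg.Health != nil {
		healthy := msg.Health.Healthy
		opts = append(opts, WithHealthy(&healthy, msg.Health.Status, &msg.Health.LastError))
	}
	if len(opts) > 0 {
		for _, opt := range opts {
			opt(status)
		}
	}
	return nil
}

func main() {
	status := &InstanceStatus{}

	// A disconnect without health information is rejected; nothing is written.
	err := updateStatus(status, &AgentToServer{AgentDisconnect: &struct{}{}})
	fmt.Println(err, status.Healthy == nil)

	// A health report is persisted as the new status.
	err = updateStatus(status, &AgentToServer{
		Health: &Health{Healthy: false, Status: "ProcessTerminated", LastError: "exit code 1"},
	})
	fmt.Println(err, *status.Healthy, status.Reason, status.Message)
}
```

Because each option touches only its own fields, a message that carries no health information leaves a previously recorded health status in place.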
any reason why in some of the cases you mention that it is AgentHealthStatus (which is string) and sometimes set it explicitly to a string?
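
For context on the question above: in a Go `const` block, a constant declared with an explicit type has that type, while a constant declared with only `= "..."` is an untyped string constant that takes its type from where it is used. A minimal sketch of the difference, using shortened hypothetical names (`StatusHealthy`, `StatusNoHeartbeat`, `describe`) rather than the identifiers from the PR:

```go
package main

import "fmt"

type AgentHealthStatus string

const (
	// Typed constant: its type is AgentHealthStatus.
	StatusHealthy AgentHealthStatus = "Healthy"

	// Untyped string constant: it has no fixed type and defaults to string.
	StatusNoHeartbeat = "NoHeartbeat"
)

// describe accepts only AgentHealthStatus values.
func describe(s AgentHealthStatus) string {
	return "status: " + string(s)
}

func main() {
	fmt.Printf("%T\n", StatusHealthy)     // main.AgentHealthStatus
	fmt.Printf("%T\n", StatusNoHeartbeat) // string

	// Both compile: the untyped constant converts implicitly to the parameter type.
	fmt.Println(describe(StatusHealthy))
	fmt.Println(describe(StatusNoHeartbeat))

	// Only the untyped constant can be assigned to a plain string without a conversion.
	var plain string = StatusNoHeartbeat
	fmt.Println(plain)
}
```

Both forms work at call sites that expect `AgentHealthStatus`, but only the typed form prevents the constant from being used silently as an ordinary `string`.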