
feat: collect pod metrics of duration of multiple stage in pod lifetime #217

Merged

zyy17 merged 2 commits into GreptimeTeam:main from refactor/add-metrics on Nov 22, 2024

Conversation

@zyy17 (Collaborator) commented Nov 20, 2024

Summary by CodeRabbit

  • New Features

    • Introduced a metrics collection system for monitoring Kubernetes pod performance in the GreptimeDB operator.
    • Added functionality to collect and track various lifecycle metrics of pods.
  • Bug Fixes

    • Improved error handling during metrics collection to ensure it does not disrupt the reconciliation process.
  • Tests

    • Added unit tests to validate the functionality of parsing image pulling durations and image names.

@zyy17 zyy17 marked this pull request as ready for review November 20, 2024 04:42
coderabbitai bot (Contributor) commented Nov 20, 2024

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The changes introduce a metrics collection system to the GreptimeDB operator. A new MetricsCollector field is added to the Reconciler struct to facilitate the collection of pod metrics during the reconciliation process. The Setup function initializes this collector, while the Reconcile method includes logic to gather metrics when the cluster is running. Additionally, a new metrics.go file defines various metrics and methods for collecting and processing pod lifecycle data, accompanied by a test file to ensure the correctness of parsing functions.

Changes

File Path — Change Summary

  • controllers/greptimedbcluster/controller.go — Added MetricsCollector field to the Reconciler struct, updated the Setup function to initialize MetricsCollector, and modified the Reconcile method to include metric collection for running clusters.
  • pkg/metrics/metrics.go — Introduced the MetricsCollector struct with methods for collecting pod metrics, including lifecycle durations and error handling. Added utility functions for parsing image names and durations. Defined several Prometheus histogram metrics for monitoring.
  • pkg/metrics/metrics_test.go — Created a test file with TestParseImagePullingDuration and TestParseImageName to validate the parsing of image pulling durations and names through defined test cases.

Poem

In the cluster where pods do play,
Metrics gather, bright as day.
With each heartbeat, data flows,
Tracking time as the garden grows.
A rabbit hops, with joy to see,
The metrics dance, so wild and free! 🐇✨



@zyy17 zyy17 requested a review from daviderli614 November 20, 2024 04:43
@zyy17 zyy17 changed the title feat: add metrics collector feat: collect pod metrics of duration in multiple stage of pod lifetime Nov 20, 2024
@zyy17 zyy17 changed the title feat: collect pod metrics of duration in multiple stage of pod lifetime feat: collect pod metrics of duration of multiple stage in pod lifetime Nov 20, 2024
@zyy17 zyy17 force-pushed the refactor/add-metrics branch from 7116264 to 8f585f4 Compare November 20, 2024 06:20
@zyy17 (Collaborator, Author) commented Nov 20, 2024

@coderabbitai review

coderabbitai bot (Contributor) commented Nov 20, 2024

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (6)
pkg/metrics/metrics_test.go (2)

8-33: Consider enhancing test coverage and readability

While the table-driven test pattern is well implemented, consider these improvements:

  1. Add test cases for edge cases and invalid inputs (e.g., malformed messages, missing duration)
  2. Include descriptive names for test cases to improve readability
  3. Make duration calculations more readable

Here's a suggested improvement:

 func TestParseImagePullingDuration(t *testing.T) {
 	tests := []struct {
+		name     string
 		message  string
 		duration time.Duration
+		wantErr  bool
 	}{
 		{
+			name:     "valid duration in milliseconds",
 			message:  `Successfully pulled image "greptime/greptimedb:latest" in 314.950966ms (6.159733714s including waiting)`,
-			duration: time.Duration(314950966 * time.Nanosecond),
+			duration: 314*time.Millisecond + 950966*time.Nanosecond,
 		},
 		{
+			name:     "valid duration in seconds",
 			message:  `Successfully pulled image "greptime/greptimedb:latest" in 10s (6.159733714s including waiting)`,
 			duration: time.Duration(10 * time.Second),
 		},
+		{
+			name:     "invalid message format",
+			message:  `Invalid message without duration`,
+			wantErr:  true,
+		},
 	}

 	for _, test := range tests {
+		t.Run(test.name, func(t *testing.T) {
 			duration, err := parseImagePullingDuration(test.message)
-			if err != nil {
+			if (err != nil) != test.wantErr {
 				t.Fatalf("parse image pulling duration from message %q: %v", test.message, err)
 			}

+			if test.wantErr {
+				return
+			}
+
 			if duration != test.duration {
 				t.Fatalf("expected duration: %v, got: %v", test.duration, duration)
 			}
+		})
 	}
 }

35-60: Enhance test coverage for image name parsing

The test structure is good, but consider these improvements:

  1. Add test cases for various image formats (e.g., with/without registry, different tag formats)
  2. Include negative test cases
  3. Add descriptive names for test cases

Here's a suggested improvement:

 func TestParseImageName(t *testing.T) {
 	tests := []struct {
+		name     string
 		message  string
-		name     string
+		imageName string
+		wantErr   bool
 	}{
 		{
+			name:      "simple image with latest tag",
 			message:   `Successfully pulled image "greptime/greptimedb:latest" in 314.950966ms (6.159733714s including waiting)`,
-			name:      "greptime/greptimedb:latest",
+			imageName: "greptime/greptimedb:latest",
 		},
 		{
+			name:      "image with version tag",
 			message:   `Successfully pulled image "greptime/greptimedb:v0.9.5" in 314.950966ms (6.159733714s including waiting)`,
-			name:      "greptime/greptimedb:v0.9.5",
+			imageName: "greptime/greptimedb:v0.9.5",
 		},
+		{
+			name:      "image with registry",
+			message:   `Successfully pulled image "registry.example.com/greptime/greptimedb:latest" in 10s`,
+			imageName: "registry.example.com/greptime/greptimedb:latest",
+		},
+		{
+			name:     "invalid message format",
+			message:  `Invalid message without image`,
+			wantErr:  true,
+		},
 	}

 	for _, test := range tests {
+		t.Run(test.name, func(t *testing.T) {
-			name, err := parseImageName(test.message)
-			if err != nil {
+			imageName, err := parseImageName(test.message)
+			if (err != nil) != test.wantErr {
 				t.Fatalf("parse image name from message %q: %v", test.message, err)
 			}

-			if name != test.name {
-				t.Fatalf("expected name: %q, got: %q", test.name, name)
+			if test.wantErr {
+				return
+			}
+
+			if imageName != test.imageName {
+				t.Fatalf("expected name: %q, got: %q", test.imageName, imageName)
 			}
+		})
 	}
 }
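Similarly, the image-name extraction these tests cover can be sketched with a single capture group. Again, this is an assumed implementation rather than the PR's actual `parseImageName`; the pattern is inferred from the message formats in the test cases, and it naturally handles registry-prefixed references since it captures everything between the quotes.

```go
package main

import (
	"fmt"
	"regexp"
)

// imageNameRe captures the quoted image reference from a kubelet
// "Successfully pulled image" event message. The pattern is an
// assumption based on the message formats shown in the tests.
var imageNameRe = regexp.MustCompile(`Successfully pulled image "([^"]+)"`)

func parseImageName(message string) (string, error) {
	m := imageNameRe.FindStringSubmatch(message)
	if len(m) != 2 {
		return "", fmt.Errorf("no image name found in message %q", message)
	}
	return m[1], nil
}

func main() {
	name, err := parseImageName(`Successfully pulled image "registry.example.com/greptime/greptimedb:latest" in 10s`)
	if err != nil {
		panic(err)
	}
	fmt.Println(name) // registry.example.com/greptime/greptimedb:latest
}
```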
controllers/greptimedbcluster/controller.go (2)

68-72: Consider enhancing error handling with more context

While the error handling is present, it would be more helpful to wrap the error with additional context about the failure point.

Consider updating the error handling:

 metricsCollector, err := metrics.NewMetricsCollector()
 if err != nil {
-    return err
+    return fmt.Errorf("failed to create metrics collector: %w", err)
 }

233-240: Consider structured logging and simplified return

While the error handling approach is correct, there are two potential improvements:

  1. Use structured logging for better error tracking
  2. Simplify the return statement after error

Consider these improvements:

 if cluster.Status.ClusterPhase == v1alpha1.PhaseRunning && r.MetricsCollector != nil {
     if err := r.MetricsCollector.CollectClusterPodMetrics(ctx, cluster); err != nil {
-        klog.Errorf("Failed to collect cluster pod metrics: '%v'", err)
+        klog.ErrorS(err, "Failed to collect cluster pod metrics",
+            "namespace", cluster.Namespace,
+            "name", cluster.Name)
 
         // We will not return error here because it is not a critical issue.
-        return ctrl.Result{}, nil
+        return ctrl.Result{RequeueAfter: defaultRequeueAfter}, nil
     }
 }

The RequeueAfter ensures we retry the metrics collection after a delay, which is more robust than simply continuing without retry.
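The retry-after-delay pattern suggested above can be sketched as follows. The `result` type and `defaultRequeueAfter` value here are minimal stand-ins for controller-runtime's `ctrl.Result` and an assumed requeue interval, used only to illustrate the control flow of treating a metrics failure as non-fatal while still scheduling a retry.

```go
package main

import (
	"fmt"
	"time"
)

// result is a minimal stand-in for controller-runtime's ctrl.Result,
// not the real API surface.
type result struct {
	RequeueAfter time.Duration
}

// defaultRequeueAfter is an assumed retry interval.
const defaultRequeueAfter = 30 * time.Second

// collectMetrics is a hypothetical stand-in that always fails, to
// exercise the non-fatal error path.
func collectMetrics() error {
	return fmt.Errorf("scrape failed")
}

func reconcile() (result, error) {
	if err := collectMetrics(); err != nil {
		// Metrics failures are non-critical: log, then requeue with a
		// delay instead of returning the error (which would trigger
		// the controller's backoff) or silently giving up.
		fmt.Println("failed to collect cluster pod metrics:", err)
		return result{RequeueAfter: defaultRequeueAfter}, nil
	}
	return result{}, nil
}

func main() {
	res, err := reconcile()
	fmt.Println(res.RequeueAfter, err)
}
```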

pkg/metrics/metrics.go (2)

164-167: Continue processing pods even if one fails

In collectPodMetricsByRole, returning an error immediately upon failure to collect metrics for a pod will skip metrics collection for the remaining pods. To improve resilience, consider logging the error and continuing with the next pod.

Apply this diff to handle errors gracefully:

 for _, pod := range pods {
     if err := c.collectPodMetrics(ctx, cluster.Name, &pod, role); err != nil {
-        return err
+        // Log the error and continue processing
+        log.Error(err, "Failed to collect metrics for pod", "pod", pod.Name)
     }
 }

Ensure that a logger (log) is available in this context.
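The log-and-continue loop this comment recommends is the standard pattern for best-effort collection over many items. A minimal sketch, with a hypothetical `collectPodMetrics` standing in for the real per-pod collection call:

```go
package main

import "fmt"

// collectPodMetrics is a hypothetical stand-in for per-pod metrics
// collection; it fails for one pod to exercise the error path.
func collectPodMetrics(podName string) error {
	if podName == "broken-pod" {
		return fmt.Errorf("collect metrics for %s: container status missing", podName)
	}
	return nil
}

func main() {
	pods := []string{"frontend-0", "broken-pod", "datanode-0"}
	failed := 0
	for _, p := range pods {
		if err := collectPodMetrics(p); err != nil {
			// Log and move on so one bad pod doesn't skip the rest.
			fmt.Println("failed:", err)
			failed++
			continue
		}
	}
	fmt.Printf("collected %d/%d pods\n", len(pods)-failed, len(pods))
}
```

The trade-off is that a persistent per-pod failure becomes visible only in logs, so pairing this with an error counter metric is worth considering.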


44-93: Reduce label cardinality in Prometheus metrics

Using dynamic labels like pod and container names can lead to high cardinality in Prometheus metrics, impacting performance and storage. It's recommended to use more static labels.

Apply this diff to remove high-cardinality labels:

- []string{"namespace", "resource", "pod", "node", "role"},
+ []string{"namespace", "resource", "node", "role"},

This change applies to all metric definitions where pod and container labels are used.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 76d8f13 and 8f585f4.

📒 Files selected for processing (3)
  • controllers/greptimedbcluster/controller.go (4 hunks)
  • pkg/metrics/metrics.go (1 hunks)
  • pkg/metrics/metrics_test.go (1 hunks)
🔇 Additional comments (3)
pkg/metrics/metrics_test.go (1)

1-6: LGTM! Clean package structure and imports

The package name matches the directory structure, and imports are minimal and appropriate for the testing requirements.

controllers/greptimedbcluster/controller.go (2)

40-40: LGTM: Import statement is correctly placed

The metrics package import follows the project's import organization pattern.


55-58: LGTM: Reconciler struct field addition is well-structured

The MetricsCollector field is properly added as a pointer type and follows the existing field organization pattern.

Resolved review threads:
  • pkg/metrics/metrics_test.go (1)
  • pkg/metrics/metrics.go (3, one marked outdated)
@zyy17 zyy17 merged commit 15a3726 into GreptimeTeam:main Nov 22, 2024
4 checks passed
@zyy17 zyy17 deleted the refactor/add-metrics branch November 22, 2024 06:24
2 participants