
feat: collect pod metrics of duration of multiple stage in pod lifetime #217

Merged

zyy17 merged 2 commits into GreptimeTeam:main from refactor/add-metrics on Nov 22, 2024

Conversation

@zyy17 (Collaborator) commented Nov 20, 2024

Summary by CodeRabbit

  • New Features

    • Introduced a metrics collection system for monitoring Kubernetes pod performance in the GreptimeDB operator.
    • Added functionality to collect and track various lifecycle metrics of pods.
  • Bug Fixes

    • Improved error handling during metrics collection to ensure it does not disrupt the reconciliation process.
  • Tests

    • Added unit tests to validate the functionality of parsing image pulling durations and image names.

@zyy17 zyy17 marked this pull request as ready for review November 20, 2024 04:42
coderabbitai bot (Contributor) commented Nov 20, 2024

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The changes introduce a metrics collection system to the GreptimeDB operator. A new MetricsCollector field is added to the Reconciler struct to facilitate the collection of pod metrics during the reconciliation process. The Setup function initializes this collector, while the Reconcile method includes logic to gather metrics when the cluster is running. Additionally, a new metrics.go file defines various metrics and methods for collecting and processing pod lifecycle data, accompanied by a test file to ensure the correctness of parsing functions.

Changes

File Path — Change Summary

  • controllers/greptimedbcluster/controller.go — Added MetricsCollector field to the Reconciler struct, updated the Setup function to initialize MetricsCollector, and modified the Reconcile method to include metric collection for running clusters.
  • pkg/metrics/metrics.go — Introduced the MetricsCollector struct with methods for collecting pod metrics, including lifecycle durations and error handling. Added utility functions for parsing image names and durations. Defined several Prometheus histogram metrics for monitoring.
  • pkg/metrics/metrics_test.go — Created a test file with TestParseImagePullingDuration and TestParseImageName to validate the parsing of image pulling durations and names through defined test cases.

Poem

In the cluster where pods do play,
Metrics gather, bright as day.
With each heartbeat, data flows,
Tracking time as the garden grows.
A rabbit hops, with joy to see,
The metrics dance, so wild and free! 🐇✨



@zyy17 zyy17 requested a review from daviderli614 November 20, 2024 04:43
@zyy17 zyy17 changed the title feat: add metrics collector feat: collect pod metrics of duration in multiple stage of pod lifetime Nov 20, 2024
@zyy17 zyy17 changed the title feat: collect pod metrics of duration in multiple stage of pod lifetime feat: collect pod metrics of duration of multiple stage in pod lifetime Nov 20, 2024
@zyy17 zyy17 force-pushed the refactor/add-metrics branch from 7116264 to 8f585f4 Compare November 20, 2024 06:20
@zyy17 (Collaborator, Author) commented Nov 20, 2024

@coderabbitai review

coderabbitai bot (Contributor) commented Nov 20, 2024

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (6)
pkg/metrics/metrics_test.go (2)

8-33: Consider enhancing test coverage and readability

While the table-driven test pattern is well implemented, consider these improvements:

  1. Add test cases for edge cases and invalid inputs (e.g., malformed messages, missing duration)
  2. Include descriptive names for test cases to improve readability
  3. Make duration calculations more readable

Here's a suggested improvement:

 func TestParseImagePullingDuration(t *testing.T) {
 	tests := []struct {
+		name     string
 		message  string
 		duration time.Duration
+		wantErr  bool
 	}{
 		{
+			name:     "valid duration in milliseconds",
 			message:  `Successfully pulled image "greptime/greptimedb:latest" in 314.950966ms (6.159733714s including waiting)`,
-			duration: time.Duration(314950966 * time.Nanosecond),
+			duration: 314*time.Millisecond + 950966*time.Nanosecond,
 		},
 		{
+			name:     "valid duration in seconds",
 			message:  `Successfully pulled image "greptime/greptimedb:latest" in 10s (6.159733714s including waiting)`,
 			duration: time.Duration(10 * time.Second),
 		},
+		{
+			name:     "invalid message format",
+			message:  `Invalid message without duration`,
+			wantErr:  true,
+		},
 	}

 	for _, test := range tests {
+		t.Run(test.name, func(t *testing.T) {
 			duration, err := parseImagePullingDuration(test.message)
-			if err != nil {
+			if (err != nil) != test.wantErr {
 				t.Fatalf("parse image pulling duration from message %q: %v", test.message, err)
 			}

+			if test.wantErr {
+				return
+			}
+
 			if duration != test.duration {
 				t.Fatalf("expected duration: %v, got: %v", test.duration, duration)
 			}
+		})
 	}
 }

35-60: Enhance test coverage for image name parsing

The test structure is good, but consider these improvements:

  1. Add test cases for various image formats (e.g., with/without registry, different tag formats)
  2. Include negative test cases
  3. Add descriptive names for test cases

Here's a suggested improvement:

 func TestParseImageName(t *testing.T) {
 	tests := []struct {
+		name     string
 		message  string
-		name     string
+		imageName string
+		wantErr   bool
 	}{
 		{
+			name:      "simple image with latest tag",
 			message:   `Successfully pulled image "greptime/greptimedb:latest" in 314.950966ms (6.159733714s including waiting)`,
-			name:      "greptime/greptimedb:latest",
+			imageName: "greptime/greptimedb:latest",
 		},
 		{
+			name:      "image with version tag",
 			message:   `Successfully pulled image "greptime/greptimedb:v0.9.5" in 314.950966ms (6.159733714s including waiting)`,
-			name:      "greptime/greptimedb:v0.9.5",
+			imageName: "greptime/greptimedb:v0.9.5",
 		},
+		{
+			name:      "image with registry",
+			message:   `Successfully pulled image "registry.example.com/greptime/greptimedb:latest" in 10s`,
+			imageName: "registry.example.com/greptime/greptimedb:latest",
+		},
+		{
+			name:     "invalid message format",
+			message:  `Invalid message without image`,
+			wantErr:  true,
+		},
 	}

 	for _, test := range tests {
+		t.Run(test.name, func(t *testing.T) {
-			name, err := parseImageName(test.message)
-			if err != nil {
+			imageName, err := parseImageName(test.message)
+			if (err != nil) != test.wantErr {
 				t.Fatalf("parse image name from message %q: %v", test.message, err)
 			}

-			if name != test.name {
-				t.Fatalf("expected name: %q, got: %q", test.name, name)
+			if test.wantErr {
+				return
+			}
+
+			if imageName != test.imageName {
+				t.Fatalf("expected name: %q, got: %q", test.imageName, imageName)
 			}
+		})
 	}
 }
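Similarly, the image-name extraction these tests cover can be sketched with a single capture group. Again, this is an assumed implementation rather than the PR's actual `parseImageName`; the pattern is inferred from the message formats in the test cases, and it naturally handles registry-prefixed references since it captures everything between the quotes.

```go
package main

import (
	"fmt"
	"regexp"
)

// imageNameRe captures the quoted image reference from a kubelet
// "Successfully pulled image" event message. The pattern is an
// assumption based on the message formats shown in the tests.
var imageNameRe = regexp.MustCompile(`Successfully pulled image "([^"]+)"`)

func parseImageName(message string) (string, error) {
	m := imageNameRe.FindStringSubmatch(message)
	if len(m) != 2 {
		return "", fmt.Errorf("no image name found in message %q", message)
	}
	return m[1], nil
}

func main() {
	name, err := parseImageName(`Successfully pulled image "registry.example.com/greptime/greptimedb:latest" in 10s`)
	if err != nil {
		panic(err)
	}
	fmt.Println(name) // registry.example.com/greptime/greptimedb:latest
}
```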
controllers/greptimedbcluster/controller.go (2)

68-72: Consider enhancing error handling with more context

While the error handling is present, it would be more helpful to wrap the error with additional context about the failure point.

Consider updating the error handling:

 metricsCollector, err := metrics.NewMetricsCollector()
 if err != nil {
-    return err
+    return fmt.Errorf("failed to create metrics collector: %w", err)
 }

233-240: Consider structured logging and simplified return

While the error handling approach is correct, there are two potential improvements:

  1. Use structured logging for better error tracking
  2. Simplify the return statement after error

Consider these improvements:

 if cluster.Status.ClusterPhase == v1alpha1.PhaseRunning && r.MetricsCollector != nil {
     if err := r.MetricsCollector.CollectClusterPodMetrics(ctx, cluster); err != nil {
-        klog.Errorf("Failed to collect cluster pod metrics: '%v'", err)
+        klog.ErrorS(err, "Failed to collect cluster pod metrics",
+            "namespace", cluster.Namespace,
+            "name", cluster.Name)
 
         // We will not return error here because it is not a critical issue.
-        return ctrl.Result{}, nil
+        return ctrl.Result{RequeueAfter: defaultRequeueAfter}, nil
     }
 }

The RequeueAfter ensures we retry the metrics collection after a delay, which is more robust than simply continuing without retry.
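The retry-after-delay pattern suggested above can be sketched as follows. The `result` type and `defaultRequeueAfter` value here are minimal stand-ins for controller-runtime's `ctrl.Result` and an assumed requeue interval, used only to illustrate the control flow of treating a metrics failure as non-fatal while still scheduling a retry.

```go
package main

import (
	"fmt"
	"time"
)

// result is a minimal stand-in for controller-runtime's ctrl.Result,
// not the real API surface.
type result struct {
	RequeueAfter time.Duration
}

// defaultRequeueAfter is an assumed retry interval.
const defaultRequeueAfter = 30 * time.Second

// collectMetrics is a hypothetical stand-in that always fails, to
// exercise the non-fatal error path.
func collectMetrics() error {
	return fmt.Errorf("scrape failed")
}

func reconcile() (result, error) {
	if err := collectMetrics(); err != nil {
		// Metrics failures are non-critical: log, then requeue with a
		// delay instead of returning the error (which would trigger
		// the controller's backoff) or silently giving up.
		fmt.Println("failed to collect cluster pod metrics:", err)
		return result{RequeueAfter: defaultRequeueAfter}, nil
	}
	return result{}, nil
}

func main() {
	res, err := reconcile()
	fmt.Println(res.RequeueAfter, err)
}
```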

pkg/metrics/metrics.go (2)

164-167: Continue processing pods even if one fails

In collectPodMetricsByRole, returning an error immediately upon failure to collect metrics for a pod will skip metrics collection for the remaining pods. To improve resilience, consider logging the error and continuing with the next pod.

Apply this diff to handle errors gracefully:

 for _, pod := range pods {
     if err := c.collectPodMetrics(ctx, cluster.Name, &pod, role); err != nil {
-        return err
+        // Log the error and continue processing
+        log.Error(err, "Failed to collect metrics for pod", "pod", pod.Name)
     }
 }

Ensure that a logger (log) is available in this context.
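The log-and-continue loop this comment recommends is the standard pattern for best-effort collection over many items. A minimal sketch, with a hypothetical `collectPodMetrics` standing in for the real per-pod collection call:

```go
package main

import "fmt"

// collectPodMetrics is a hypothetical stand-in for per-pod metrics
// collection; it fails for one pod to exercise the error path.
func collectPodMetrics(podName string) error {
	if podName == "broken-pod" {
		return fmt.Errorf("collect metrics for %s: container status missing", podName)
	}
	return nil
}

func main() {
	pods := []string{"frontend-0", "broken-pod", "datanode-0"}
	failed := 0
	for _, p := range pods {
		if err := collectPodMetrics(p); err != nil {
			// Log and move on so one bad pod doesn't skip the rest.
			fmt.Println("failed:", err)
			failed++
			continue
		}
	}
	fmt.Printf("collected %d/%d pods\n", len(pods)-failed, len(pods))
}
```

The trade-off is that a persistent per-pod failure becomes visible only in logs, so pairing this with an error counter metric is worth considering.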


44-93: Reduce label cardinality in Prometheus metrics

Using dynamic labels like pod and container names can lead to high cardinality in Prometheus metrics, impacting performance and storage. It's recommended to use more static labels.

Apply this diff to remove high-cardinality labels:

- []string{"namespace", "resource", "pod", "node", "role"},
+ []string{"namespace", "resource", "node", "role"},

This change applies to all metric definitions where pod and container labels are used.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 76d8f13 and 8f585f4.

📒 Files selected for processing (3)
  • controllers/greptimedbcluster/controller.go (4 hunks)
  • pkg/metrics/metrics.go (1 hunks)
  • pkg/metrics/metrics_test.go (1 hunks)
🔇 Additional comments (3)
pkg/metrics/metrics_test.go (1)

1-6: LGTM! Clean package structure and imports

The package name matches the directory structure, and imports are minimal and appropriate for the testing requirements.

controllers/greptimedbcluster/controller.go (2)

40-40: LGTM: Import statement is correctly placed

The metrics package import follows the project's import organization pattern.


55-58: LGTM: Reconciler struct field addition is well-structured

The MetricsCollector field is properly added as a pointer type and follows the existing field organization pattern.

Resolved review threads:
  • pkg/metrics/metrics_test.go (1)
  • pkg/metrics/metrics.go (3, one marked outdated)
@zyy17 zyy17 merged commit 15a3726 into GreptimeTeam:main Nov 22, 2024
4 checks passed
@zyy17 zyy17 deleted the refactor/add-metrics branch November 22, 2024 06:24
2 participants