Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[vpj][controller] Emit push job status metrics from controller #1185

Merged
merged 8 commits into from
Sep 24, 2024

Conversation

m-nagarajan
Copy link
Contributor

@m-nagarajan m-nagarajan commented Sep 18, 2024

VPJ communicates with the controller to write PushJobDetails to the PUSH_JOB_DETAILS_STORE_NAME system store. This PR introduces new metrics emitted by the parent controller for push job success/failure.

New Metrics Added (Count and CountSinceLastMeasurement added via Tehuti PR, hence using tehuti:0.12.2):
batch_push_job_success, batch_push_job_failed_user_error, batch_push_job_failed_non_user_error
incremental_push_job_success, incremental_push_job_failed_user_error, incremental_push_job_failed_non_user_error

Current flow is VPJ checks push.job.status.upload.enable config and sends PushJobDetails to /send_push_job_details path in Venice-controller, which writes it to the push job details system store. Approaches considered to emit metrics:

  1. Derive success/failure in the controller and emit metrics, tying metric emission and push job details upload via push.job.status.upload.enable config. This config is enabled everywhere.
  2. Add a new controller endpoint to emit metrics, duplicating work to talk to controllers and serialize/deserialize data.
  3. Add two new controller endpoints: one to emit metrics if upload is not enabled, and one to do both (emit metrics and upload push status store) to avoid duplicate work.
  4. Evolve PushJobDetails to hold push.job.status.upload.enable config and move the logic of determining success/failure status to VenicePushJob.

Chose approach 1 as it’s the simplest and doesn’t require deployment ordering (controllers -> VPJ) unlike other options and no schema evolution needed.

Configs introduced:
Added parent controller config push.job.failure.checkpoints.to.define.user.error to provide a custom list of these checkpoints based on the usecases to emit the metrics accordingly. DEFAULT_PUSH_JOB_USER_ERROR_CHECKPOINTS will be used by default.

For Reviewers:
Main code changes are in VeniceHelixAdmin.java

How was this PR tested?

GH CI

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.
  • [] Yes. Make sure to explain your proposed changes and call out the behavior change.

@nisargthakkar nisargthakkar self-assigned this Sep 18, 2024
Copy link
Contributor

@nisargthakkar nisargthakkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to go through the tests, but putting a preliminary review out

Copy link
Contributor Author

@m-nagarajan m-nagarajan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nisargthakkar for the review. Addressed the comments.

Copy link
Contributor Author

@m-nagarajan m-nagarajan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nisargthakkar. Addressed the comments.

Copy link
Contributor Author

@m-nagarajan m-nagarajan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nisargthakkar addressed the comments

Copy link
Contributor

@nisargthakkar nisargthakkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. A couple of nitpicks

Copy link
Contributor

@nisargthakkar nisargthakkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @m-nagarajan!

@m-nagarajan m-nagarajan changed the title [controller] Emit push job status metrics from controller [vpj][controller] Emit push job status metrics from controller Sep 24, 2024
@m-nagarajan m-nagarajan merged commit 328d72a into linkedin:main Sep 24, 2024
46 checks passed
@m-nagarajan
Copy link
Contributor Author

Thanks @nisargthakkar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants