feat(api/cli): add verification management endpoints #1611

Merged: 16 commits merged into akuity:main from add-verification-endpoints on Mar 15, 2024

Contributor

@hiddeco hiddeco commented Mar 13, 2024

Fixes: #1581

This pull request adds mechanisms to:

  1. Reverify the Freight of a Stage
  2. Abort a (long-)running verification for the Freight of a Stage

Both are implemented through annotations (kargo.akuity.io/reverify and kargo.akuity.io/abort). This allows the behavior to be triggered outside of the API/CLI by annotating the Stage with the respective annotation, using the ID of the Stage's existing verification information as the value.

In addition to this, a kargo verify stage (NAME) [--abort] command has been added to make the features available to CLI users.
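
As a rough illustration of the annotation mechanism described above, the sketch below builds a JSON merge patch that sets one of the two annotations on a Stage. Only the annotation keys come from this pull request; the helper name `annotationPatch`, the patch shape, and the example ID are assumptions made for illustration and are not taken from the Kargo code base.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Annotation keys as named in this pull request.
const (
	annotationKeyReverify = "kargo.akuity.io/reverify"
	annotationKeyAbort    = "kargo.akuity.io/abort"
)

// annotationPatch builds a JSON merge patch that sets a single annotation.
// The helper is hypothetical; the value is expected to be the ID of the
// Stage's existing verification information.
func annotationPatch(key, verificationID string) ([]byte, error) {
	patch := map[string]any{
		"metadata": map[string]any{
			"annotations": map[string]string{key: verificationID},
		},
	}
	return json.Marshal(patch)
}

func main() {
	// "example-verification-id" is a placeholder, not a real ID.
	for _, key := range []string{annotationKeyReverify, annotationKeyAbort} {
		patch, err := annotationPatch(key, "example-verification-id")
		if err != nil {
			panic(err)
		}
		fmt.Println(string(patch))
	}
}
```

A patch like this could be sent with any Kubernetes client that supports merge patches; how the controller reacts to it is described in the rest of this thread.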


There is more to cover here in the future, e.g. keeping track of a "stack" of VerificationInfo objects to provide an historical overview, and to detect changes as mentioned in #1581 (comment). However, they are out of scope for this pull request and should be handled as separate feature requests.

netlify bot commented Mar 13, 2024

Deploy Preview for docs-kargo-akuity-io ready!

Name Link
🔨 Latest commit 10657fc
🔍 Latest deploy log https://app.netlify.com/sites/docs-kargo-akuity-io/deploys/65f46fdfbeee9c00084deccf
😎 Deploy Preview https://deploy-preview-1611.kargo.akuity.io

codecov bot commented Mar 13, 2024

Codecov Report

Attention: Patch coverage is 53.05466% with 146 lines in your changes are missing coverage. Please review.

Project coverage is 43.96%. Comparing base (9b4c986) to head (1e67282).

❗ Current head 1e67282 differs from pull request most recent head 10657fc. Consider uploading reports for the commit 10657fc to get more accurate results

Files Patch % Lines
internal/cli/cmd/verify/stage.go 0.00% 69 Missing ⚠️
internal/api/abort_verification_v1alpha1.go 0.00% 15 Missing ⚠️
internal/api/reverify_v1alpha1.go 0.00% 15 Missing ⚠️
internal/cli/cmd/verify/verify.go 0.00% 11 Missing ⚠️
internal/controller/stages/stages.go 73.17% 11 Missing ⚠️
api/v1alpha1/warehouse_helpers.go 0.00% 7 Missing ⚠️
internal/controller/promotions/promotions.go 45.45% 5 Missing and 1 partial ⚠️
internal/controller/warehouses/warehouses.go 0.00% 6 Missing ⚠️
internal/controller/stages/verification.go 96.25% 2 Missing and 1 partial ⚠️
api/v1alpha1/promotion_helpers.go 0.00% 1 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1611      +/-   ##
==========================================
+ Coverage   43.77%   43.96%   +0.18%     
==========================================
  Files         195      199       +4     
  Lines       12472    12673     +201     
==========================================
+ Hits         5460     5572     +112     
- Misses       6776     6864      +88     
- Partials      236      237       +1     


@hiddeco hiddeco force-pushed the add-verification-endpoints branch from 974730e to dd14f03 Compare March 13, 2024 13:52
api/v1alpha1/annotations.go Outdated Show resolved Hide resolved
@hiddeco hiddeco force-pushed the add-verification-endpoints branch from 99d5462 to 0148e0d Compare March 13, 2024 15:19
@hiddeco hiddeco force-pushed the add-verification-endpoints branch 2 times, most recently from 8e0e5d5 to 3d6d82a Compare March 14, 2024 09:45
api/v1alpha1/stage_types.go Outdated Show resolved Hide resolved
Comment on lines 777 to 775
if err := r.abortVerificationFn(ctx, stage); err != nil {
	return status, fmt.Errorf(
		"error aborting verification for Stage %q in namespace %q: %w",
		stage.Name,
		stage.Namespace,
		err,
	)
}
Contributor Author

@hiddeco hiddeco Mar 14, 2024

I am on the fence here about returning an error, or returning a VerificationInfo struct with the error.

Another interesting thing I have noticed is that if you provide Argo Rollouts with a terminate request, it appears that in some cases the AnalysisRun will end up as Successful. I wonder if this is actually what we want to happen, or if we should construct a custom VerificationInfo with data which reflects the abort operation having taken place.

Contributor Author

One of the reasons I would not return a VerificationInfo struct here (for errors) is that any issue would wipe out the existing information, including references to the AnalysisRun, even on transient errors.

Essentially, I think this is also an existing flaw in how we receive verification information at present. In the current implementation, we do not distinguish an "object not found" error from e.g. "we temporarily can't reach the Kubernetes API server". This means that even if we could eventually recover from the error, the user has to rerun the verification because the controller has given up on the previous AnalysisRun.

Member

> Another interesting thing I have noticed is that if you provide Argo Rollouts with a terminate request, it appears that in some cases the AnalysisRun will end up as Successful. I wonder if this is actually what we want to happen, or if we should construct a custom VerificationInfo with data which reflects the abort operation having taken place.

I think we have a separate field for the VerificationInfo phase and then, under AnalysisRunReference, it has its own phase.

So we should be able to capture that the AnalysisRun succeeded and still record that the verification was aborted.

In determining the VerificationInfo's phase, I agree that the attempt to abort takes precedence over the fact that the attempt raced the AnalysisRun and lost with the AnalysisRun succeeding.
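
To make the "two separate phases" idea concrete, here is a minimal sketch. The struct and field names are simplified stand-ins for the Kargo types mentioned above, and the `phaseAborted` value is a placeholder assumption; the point is only that the top-level phase can record the abort while the nested AnalysisRun phase is left exactly as Argo Rollouts reported it.

```go
package main

import "fmt"

// phaseAborted is a placeholder value used only for this illustration.
const phaseAborted = "Aborted"

// analysisRunReference is a simplified stand-in for the reference that
// carries the phase reported by Argo Rollouts.
type analysisRunReference struct {
	Name  string
	Phase string
}

// verificationInfo is a simplified stand-in carrying Kargo's own phase.
type verificationInfo struct {
	ID          string
	Phase       string
	Message     string
	AnalysisRun *analysisRunReference
}

// markAborted returns a copy of info whose top-level phase reflects the
// abort. The nested AnalysisRun phase is deliberately left untouched.
func markAborted(info verificationInfo) verificationInfo {
	info.Phase = phaseAborted
	info.Message = "verification aborted by user request"
	return info
}

func main() {
	info := verificationInfo{
		ID:          "example-verification-id",
		Phase:       "Running",
		AnalysisRun: &analysisRunReference{Name: "example-run", Phase: "Successful"},
	}
	aborted := markAborted(info)
	fmt.Println(aborted.Phase, aborted.AnalysisRun.Phase)
}
```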

Member

> In the current implementation, we do not distinguish an "object not found" error from e.g. "we temporarily can't reach the Kubernetes API server". This means that even if we could eventually recover from the error, the user has to rerun the verification because the controller has given up on the previous AnalysisRun.

Right. We make no attempt currently to distinguish between errors where a retry helps and ones where it doesn't.

I support however you might wish to improve that.

Contributor Author

> In determining the VerificationInfo's phase, I agree that the attempt to abort takes precedence over the fact that the attempt raced the AnalysisRun and lost with the AnalysisRun succeeding.

It is actually not a race, it is the AnalysisRun reflecting a success while it was the result from a termination request. I am not sure if this is default behavior for Argo Rollouts, or if it depends on the command which is used to perform the analysis and the exit code it returns when it e.g. receives a SIGKILL. FWIW, I am testing things with a simple sleep 600; exit 0;.

I covered a potential race by checking the current state of the AnalysisRun before adhering to the abort request, which prevents an unnecessary patch operation.
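
A guard of this kind could look roughly like the sketch below. It is not the code from this pull request: the function names are hypothetical, the termination call is injected as a plain function, and the list of terminal phases is an assumption based on the phase names Argo Rollouts uses for AnalysisRuns.

```go
package main

import "fmt"

// isTerminal reports whether an AnalysisRun phase is final. The set of
// phase names is an assumption made for this sketch.
func isTerminal(phase string) bool {
	switch phase {
	case "Successful", "Failed", "Error", "Inconclusive":
		return true
	default:
		return false
	}
}

// abortIfRunning invokes terminate only when the run is still in progress.
// It returns true when a termination request was actually issued.
func abortIfRunning(phase string, terminate func() error) (bool, error) {
	if isTerminal(phase) {
		// Nothing left to abort, so skip the unnecessary patch.
		return false, nil
	}
	if err := terminate(); err != nil {
		return false, fmt.Errorf("error terminating AnalysisRun: %w", err)
	}
	return true, nil
}

func main() {
	calls := 0
	terminate := func() error { calls++; return nil }
	for _, phase := range []string{"Running", "Successful"} {
		patched, err := abortIfRunning(phase, terminate)
		fmt.Println(phase, patched, err)
	}
	fmt.Println("terminate calls:", calls)
}
```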

> We make no attempt currently to distinguish between errors where a retry helps and ones where it doesn't.
>
> I support however you might wish to improve that.

I think improving this is too big of a change for the current scope of the PR, as it would require bubbling up Kubernetes errors (or translating them into our own typed errors), whereas at present we have a pattern of translating not found to a nil object while returning other errors.

Given this, I am inclined to keep the pattern of reflecting errors in VerificationInfo for now, and to address this in full in a follow-up PR.
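
For reference, the follow-up could take a shape like the sketch below, which separates "the object is gone" from "we could not ask right now" using typed errors from the standard library only. All names here (`errNotFound`, `transientError`, `shouldRetry`) are hypothetical and not part of Kargo; a real implementation would more likely translate the errors returned by the Kubernetes client.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound marks a lookup whose target no longer exists. Retrying such a
// lookup does not help.
var errNotFound = errors.New("analysis run not found")

// transientError wraps failures that may succeed on a later attempt, such as
// a temporarily unreachable API server.
type transientError struct{ err error }

func (e *transientError) Error() string { return "transient: " + e.err.Error() }
func (e *transientError) Unwrap() error { return e.err }

// shouldRetry reports whether err, or anything it wraps, is transient.
func shouldRetry(err error) bool {
	var te *transientError
	return errors.As(err, &te)
}

func main() {
	unreachable := fmt.Errorf(
		"error getting AnalysisRun: %w",
		&transientError{err: errors.New("connection refused")},
	)
	fmt.Println(shouldRetry(unreachable)) // keep existing info and requeue
	fmt.Println(shouldRetry(errNotFound)) // give up on the previous run
}
```

With a split like this, a transient failure would not need to overwrite the existing VerificationInfo, which is the concern raised earlier in this thread.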

Member

> It is actually not a race, it is the AnalysisRun reflecting a success while it was the result from a termination request.

Ah. I was mistakenly thinking the AnalysisRun succeeded between the time the abort annotation was added and the time the Stage was reconciled.

This behavior seems peculiar indeed.

@jessesuen would probably be able to give us some insight on this...

@jessesuen you know anything about aborted AnalysisRuns having phase Success?

Member

Thinking about this more, this shouldn't block the PR.

As long as the verification phase correctly identifies that the verification was aborted, I don't think the phase of the analysis run, even if it is arguably incorrect, matters anywhere else in Kargo.

Member

> it appears that in some cases the AnalysisRun will end up as Successful

Is it some cases or all?

> it is the AnalysisRun reflecting a success while it was the result from a termination request

I seem to recall that this was designed behavior for AnalysisRun. There is such a thing as an "indefinite" AnalysisRun that needs to be stopped manually / programmatically.

Let me check on this.

Member

Yes, when we terminate an AnalysisRun, we consider it a Successful run:

https://github.com/argoproj/argo-rollouts/blob/master/analysis/analysis.go#L581-L585
https://github.com/argoproj/argo-rollouts/blob/master/analysis/analysis.go#L610-L614

It's because Argo Rollouts will terminate indefinite/background AnalysisRuns when the update is completed. So in the case of Rollouts, termination is a happy path (for background analysis).

I'm thinking about how Kargo should handle this. I think in the case of Kargo, termination is more of a user decision, whereas in Argo Rollouts, the controller decides to stop the run because the update completed.

Given that, perhaps a terminated analysis run should be treated as a failed analysis.

@hiddeco hiddeco marked this pull request as ready for review March 14, 2024 12:47
@hiddeco hiddeco requested a review from a team as a code owner March 14, 2024 12:47
@hiddeco hiddeco requested a review from krancour March 14, 2024 12:47
@hiddeco hiddeco added this to the v0.5.0 milestone Mar 14, 2024
@hiddeco hiddeco self-assigned this Mar 14, 2024
@hiddeco hiddeco force-pushed the add-verification-endpoints branch 3 times, most recently from 312ec6c to fe258a7 Compare March 14, 2024 16:47
# Abort the verification of a stage in the default project
kargo config set-project my-project
kargo verify stage my-stage --abort
`,
Member

Nit: We need to fix this in a bunch of places. We're getting an extra line break at the end of the examples section (two instead of one), so we should probably move these ending ticks to the end of the previous line.

Not a blocker. Just a thing I've noticed -- and definitely not unique to this PR.

Contributor Author

It also actually has a leading newline everywhere. We should probably just do:

Example: `# Verify a stage
kargo verify stage --project=my-project my-stage`,

Member

> It also actually has a leading newline everywhere.

Funny... I noticed the double blank line after examples and never noticed that the one blank line between "Examples:" and the actual examples was still one blank line more than what is between all the other help section headings and their content.

Good eye.

Contributor Author

@hiddeco hiddeco Mar 15, 2024

What I have seen some projects do is add a tiny wrapper which applies strings.TrimSpace (e.g. printDescription), as this allows you to keep the visual advantage of having everything start at the same point in code and/or to move it to constants defined outside of the struct.
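
A minimal sketch of such a wrapper follows. The name `formatExample` is hypothetical; the example text reuses the commands from this pull request. Because the surrounding whitespace is trimmed, the raw string can start and end on its own lines without leaking extra blank lines into the help output.

```go
package main

import (
	"fmt"
	"strings"
)

// formatExample strips leading and trailing whitespace so that example text
// can be written as an aligned multi-line raw string.
func formatExample(s string) string {
	return strings.TrimSpace(s)
}

func main() {
	example := formatExample(`
# Verify a stage
kargo verify stage --project=my-project my-stage

# Abort the verification of a stage's current freight
kargo verify stage --project=my-project my-stage --abort
`)
	// %q makes the absence of leading and trailing newlines visible.
	fmt.Printf("%q\n", example)
}
```

The result would then be assigned to the command's example field instead of the raw string literal.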

Abort bool
}

func newVerifyStageCommand(cfg config.CLIConfig) *cobra.Command {
Member

This is another inconsistency we can fix when time permits. (Not specifically introduced here.)

These constructor-like funcs for building cobra commands are sometimes named like newNounCommand and other times are named like newVerbNounCommand.

I'll open a separate issue.

kargo verify stage my-stage

# Abort the verification of a stage's current freight
kargo verify stage --project=my-project my-stage --abort
Member

I'm a little iffy on --abort as a flag on kargo verify stage.

We've gotten a lot better about consistently adhering to kargo verb noun and this feels like a departure from that.

I get why you did it this way. There's currently nothing else "abort" applies to, which would make it feel like some awkward one-off sub-command...

I feel we will soon have other things that can be aborted as well. I know @jessesuen has mentioned aborting promotions a time or two before...

We might want to plan ahead for that and consider making abort its own subcommand after all.

Contributor Author

@hiddeco hiddeco Mar 15, 2024

What would the full abort command be called in this case? With something like kargo abort verification --stage <stage>, I feel it would be counterintuitive coming from kargo verify stage.

Member

Ya... now that you say that... that seems clunky because verification isn't a resource -- nor do I think it needs to be.

Obviously this was non-blocking anyway.

@hiddeco hiddeco force-pushed the add-verification-endpoints branch from d79121f to e0629ca Compare March 15, 2024 08:50
The `kargo.akuity.io/reconfirm` annotation is intended to be used to
signal a request to reconfirm e.g. the verification of a Stage.

Signed-off-by: Hidde Beydals <hidde@hhh.computer>
hiddeco added 11 commits March 15, 2024 09:58
@hiddeco hiddeco force-pushed the add-verification-endpoints branch from e0629ca to fcd8f0d Compare March 15, 2024 09:08
@hiddeco hiddeco force-pushed the add-verification-endpoints branch from fcd8f0d to 66f3f03 Compare March 15, 2024 09:22
hiddeco added 2 commits March 15, 2024 11:36
@hiddeco hiddeco force-pushed the add-verification-endpoints branch from fed36b7 to 1e67282 Compare March 15, 2024 11:15
// If the stage does not have a reverification annotation, check if there is
// an existing AnalysisRun for the Stage and Freight. If there is, return
// the status of this AnalysisRun.
if _, ok := stage.GetAnnotations()[kargoapi.AnnotationKeyReverify]; !ok {
Member

What happens here if we previously succeeded in restarting verification, but failed to clear the annotation?

It looks to me like that would result in a duplicate AnalysisRun.

Contributor Author

In that case (assuming the status update patch did work), we would not end up here.
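
One way a stale annotation can be made harmless is sketched below. It assumes the annotation value is compared against the ID of the Stage's current verification information, which matches how the annotation is described in this pull request but is not confirmed to be the exact check the controller performs. The function name `reverifyRequested` is hypothetical.

```go
package main

import "fmt"

// Annotation key as named in this pull request.
const annotationKeyReverify = "kargo.akuity.io/reverify"

// reverifyRequested reports whether the annotation asks to reverify the
// verification identified by currentID. An annotation that names an older
// ID is treated as already handled.
func reverifyRequested(annotations map[string]string, currentID string) bool {
	requestedID, ok := annotations[annotationKeyReverify]
	if !ok || requestedID == "" {
		return false
	}
	return requestedID == currentID
}

func main() {
	annotations := map[string]string{annotationKeyReverify: "verification-1"}

	// The annotation matches the current verification: a new run is wanted.
	fmt.Println(reverifyRequested(annotations, "verification-1"))

	// The status already points at a newer verification, so the leftover
	// annotation no longer matches and no duplicate run is started.
	fmt.Println(reverifyRequested(annotations, "verification-2"))
}
```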

Member

🤦‍♂️

// will indicate a "Succeeded" phase due to Argo Rollouts behavior.
return &kargoapi.VerificationInfo{
ID: stage.Status.CurrentFreight.VerificationInfo.ID,
Phase: kargoapi.VerificationPhaseFailed,
Member

I think we should introduce an Aborted phase, then there are three distinct and non-overlapping negative outcomes:

  • Kargo encountered a problem (error)
  • Verification completed with a negative result (failure)
  • Verification was aborted

wdyt?

Contributor Author

This crossed my mind as well, so more than happy to add it!

@hiddeco hiddeco added this pull request to the merge queue Mar 15, 2024
Merged via the queue into akuity:main with commit 028d7a5 Mar 15, 2024
14 checks passed
@hiddeco hiddeco deleted the add-verification-endpoints branch March 15, 2024 16:35
Successfully merging this pull request may close these issues.

add verification management endpoints
3 participants