feat(api/cli): add verification management endpoints #1611

hiddeco · 2024-03-13T13:46:02Z

This pull request adds mechanisms to:

- Reverify the Freight of a Stage
- Abort a (long-)running verification for the Freight of a Stage

Both are implemented through annotations (kargo.akuity.io/reverify and kargo.akuity.io/abort). This allows the behavior to be triggered outside of the API/CLI by annotating the Stage with the respective annotation, using the ID of the Stage's existing verification information as the value.
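For anyone who wants to trigger this without the CLI, here is a minimal sketch of what such an annotation patch could look like as a JSON merge patch. Only the two annotation keys come from this PR; the helper name `annotationPatch` and the ID `abc123` are made up for illustration, and how the patch is sent to the API server is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Annotation keys named in the pull request description.
const (
	annotationKeyReverify = "kargo.akuity.io/reverify"
	annotationKeyAbort    = "kargo.akuity.io/abort"
)

// annotationPatch builds a JSON merge patch that sets a single annotation
// on an object. Hypothetical helper for illustration only; it is not taken
// from the Kargo code base.
func annotationPatch(key, verificationID string) ([]byte, error) {
	patch := map[string]any{
		"metadata": map[string]any{
			"annotations": map[string]string{key: verificationID},
		},
	}
	return json.Marshal(patch)
}

func main() {
	// "abc123" stands in for the ID found in the Stage's current
	// verification information.
	p, err := annotationPatch(annotationKeyReverify, "abc123")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(p))
}
```

The same helper works for an abort request by passing `annotationKeyAbort` instead.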

In addition to this, a kargo verify stage (NAME) [--abort] command has been added to make the features available to CLI users.

There is more to cover here in the future, e.g. keeping track of a "stack" of VerificationInfo objects to provide a historical overview and to detect changes, as mentioned in #1581 (comment). However, these are out of scope for this pull request and should be handled as separate feature requests.

netlify · 2024-03-13T13:46:19Z

✅ Deploy Preview for docs-kargo-akuity-io ready!

Name	Link
🔨 Latest commit	`10657fc`
🔍 Latest deploy log	https://app.netlify.com/sites/docs-kargo-akuity-io/deploys/65f46fdfbeee9c00084deccf
😎 Deploy Preview	https://deploy-preview-1611.kargo.akuity.io
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

codecov · 2024-03-13T13:49:18Z

Codecov Report

Attention: Patch coverage is 53.05466%, with 146 lines in your changes missing coverage. Please review.

Project coverage is 43.96%. Comparing base (9b4c986) to head (1e67282).

❗ Current head 1e67282 differs from pull request most recent head 10657fc. Consider uploading reports for the commit 10657fc to get more accurate results

Files	Patch %	Lines
internal/cli/cmd/verify/stage.go	0.00%	69 Missing ⚠️
internal/api/abort_verification_v1alpha1.go	0.00%	15 Missing ⚠️
internal/api/reverify_v1alpha1.go	0.00%	15 Missing ⚠️
internal/cli/cmd/verify/verify.go	0.00%	11 Missing ⚠️
internal/controller/stages/stages.go	73.17%	11 Missing ⚠️
api/v1alpha1/warehouse_helpers.go	0.00%	7 Missing ⚠️
internal/controller/promotions/promotions.go	45.45%	5 Missing and 1 partial ⚠️
internal/controller/warehouses/warehouses.go	0.00%	6 Missing ⚠️
internal/controller/stages/verification.go	96.25%	2 Missing and 1 partial ⚠️
api/v1alpha1/promotion_helpers.go	0.00%	1 Missing ⚠️
... and 2 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1611      +/-   ##
==========================================
+ Coverage   43.77%   43.96%   +0.18%     
==========================================
  Files         195      199       +4     
  Lines       12472    12673     +201     
==========================================
+ Hits         5460     5572     +112     
- Misses       6776     6864      +88     
- Partials      236      237       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

hiddeco · 2024-03-14T12:44:43Z

internal/controller/stages/stages.go

+							if err := r.abortVerificationFn(ctx, stage); err != nil {
+								return status, fmt.Errorf(
+									"error aborting verification for Stage %q in namespace %q: %w",
+									stage.Name,
+									stage.Namespace,
+									err,
+								)
+							}


I am on the fence here about returning an error, or returning a VerificationInfo struct with the error.

Another interesting thing I have noticed is that if you provide Argo Rollouts with a terminate request, it appears that in some cases the AnalysisRun will end up as Successful. I wonder if this is actually what we want to happen, or if we should construct a custom VerificationInfo with data which reflects the abort operation having taken place.

One of the reasons I would not return a VerificationInfo struct here (for errors) is that any issue would wipe out existing information, including references to the AnalysisRun, even on transient errors.

Essentially, I think this is also an existing flaw in how we receive verification information at present: in the current implementation, we do not distinguish an "object not found" error from e.g. "we temporarily can't reach the Kubernetes API server". This means that even if we could eventually recover from the error, the user has to rerun the verification because the controller has given up on the previous AnalysisRun.

> Another interesting thing I have noticed is that if you provide Argo Rollouts with a terminate request, it appears that in some cases the AnalysisRun will end up as Successful. I wonder if this is actually what we want to happen, or if we should construct a custom VerificationInfo with data which reflects the abort operation having taken place.

I think we have a separate field for the VerificationInfo phase and then, under AnalysisRunReference, it has its own phase.

So we should be able to capture that the AnalysisRun succeeded and still record that the verification was aborted.

In determining the VerificationInfo's phase, I agree that the attempt to abort takes precedence over the fact that the attempt raced the AnalysisRun and lost with the AnalysisRun succeeding.

> In the current implementation, we do not distinguish an "object not found" error from e.g. "we temporarily can't reach the Kubernetes API server". This means that even if we could eventually recover from the error, the user has to rerun the verification because the controller has given up on the previous AnalysisRun.

Right. We make no attempt currently to distinguish between errors where a retry helps and ones where it doesn't.

I support however you might wish to improve that.

> In determining the VerificationInfo's phase, I agree that the attempt to abort takes precedence over the fact that the attempt raced the AnalysisRun and lost with the AnalysisRun succeeding.

It is actually not a race; it is the AnalysisRun reflecting a success while it was the result of a termination request. I am not sure if this is default behavior for Argo Rollouts, or if it depends on the command which is used to perform the analysis and the exit code it gives when it e.g. receives a SIGKILL. FWIW, I am testing things with a simple sleep 600; exit 0;.

I covered a potential race by checking the current state of the AnalysisRun before adhering to the abort request, which prevents an unnecessary patch operation.
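To make the guard concrete, here is a rough sketch of the idea: look at the current phase first and only issue the terminate patch when the run has not already finished. The phase names are my assumption of what Argo Rollouts reports, and `abortIfRunning` and the `terminate` callback are hypothetical stand-ins, not the code in this PR.

```go
package main

import "fmt"

// analysisPhase holds the phase of an AnalysisRun. The values below are
// assumed to match the phases Argo Rollouts reports.
type analysisPhase string

const (
	phasePending      analysisPhase = "Pending"
	phaseRunning      analysisPhase = "Running"
	phaseSuccessful   analysisPhase = "Successful"
	phaseFailed       analysisPhase = "Failed"
	phaseError        analysisPhase = "Error"
	phaseInconclusive analysisPhase = "Inconclusive"
)

// isTerminal reports whether the run has already finished.
func isTerminal(p analysisPhase) bool {
	switch p {
	case phaseSuccessful, phaseFailed, phaseError, phaseInconclusive:
		return true
	default:
		return false
	}
}

// abortIfRunning calls terminate only when the run is still in progress.
// It returns true when a terminate request was actually issued.
// Hypothetical helper; terminate stands in for the real patch operation.
func abortIfRunning(current analysisPhase, terminate func() error) (bool, error) {
	if isTerminal(current) {
		// Nothing left to abort, so skip the unnecessary patch.
		return false, nil
	}
	if err := terminate(); err != nil {
		return false, fmt.Errorf("error terminating AnalysisRun: %w", err)
	}
	return true, nil
}

func main() {
	patched, err := abortIfRunning(phaseRunning, func() error { return nil })
	fmt.Println(patched, err)
}
```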

> We make no attempt currently to distinguish between errors where a retry helps and ones where it doesn't.
>
> I support however you might wish to improve that.

I think improving this is too big of a change for the current scope of the PR, as it would require bubbling up Kubernetes errors (or translating them into our own typed errors), whereas at present we have a pattern of translating not found to a nil object while returning other errors.

Given this, I am inclined to keep the pattern of reflecting errors in VerificationInfo for now, and to address this in full in a follow-up PR.

> It is actually not a race, it is the AnalysisRun reflecting a success while it was the result from a termination request.

Ah. I was mistakenly thinking the AnalysisRun succeeded between the time the abort annotation was added and the time the Stage was reconciled.

This behavior seems peculiar indeed.

@jessesuen would probably be able to give us some insight on this...

@jessesuen you know anything about aborted AnalysisRuns having phase Success?

Thinking about this more, this shouldn't block the PR.

As long as the verification phase correctly identifies that the verification was aborted, I don't think the phase of the analysis run, even if it is arguably incorrect, matters anywhere else in Kargo.

> it appears that in some cases the AnalysisRun will end up as Successful

Is it some cases or all?

> it is the AnalysisRun reflecting a success while it was the result from a termination request

I seem to recall that this was designed behavior for AnalysisRun. There is such a thing as an "indefinite" AnalysisRun that needs to be stopped manually / programmatically.

Let me check on this.

Yes, when we terminate an AnalysisRun, we consider it a Successful run:

https://github.com/argoproj/argo-rollouts/blob/master/analysis/analysis.go#L581-L585
https://github.com/argoproj/argo-rollouts/blob/master/analysis/analysis.go#L610-L614

It's because Argo Rollouts will terminate indefinite/background AnalysisRuns when the update is completed. So in the case of Rollouts, termination is a happy path (for background analysis).

I'm thinking about how Kargo should handle this. I think in the case of Kargo, termination is more of a user decision, whereas in Argo Rollouts, the controller decides to stop the run because the update completed.

Given that, perhaps a terminated analysis run should be treated as a failed analysis.

krancour · 2024-03-14T18:19:53Z

internal/cli/cmd/verify/stage.go

+# Abort the verification of a stage in the default project
+kargo config set-project my-project
+kargo verify stage my-stage --abort
+`,


Nit: We need to fix this in a bunch of places. We're getting an extra line break at the end of the examples section (two instead of one), so we should probably move these ending ticks to the end of the previous line.

Not a blocker. Just a thing I've noticed -- and definitely not unique to this PR.

It also actually has a leading newline everywhere. We should probably just do:

Example: `# Verify a stage
kargo verify stage --project=my-project my-stage`,

> It also actually has a leading newline everywhere.

Funny... I noticed the double blank line after examples and never noticed that the one blank line between "Examples:" and the actual examples was still one blank line more than what is between all the other help section headings and their content.

Good eye.

What I have seen some projects do is add a tiny wrapper which applies strings.TrimSpace (e.g. printDescription), as this allows you to keep the visual advantage of having everything start at the same point in code and/or to move the text to constants defined outside of the struct.

krancour · 2024-03-14T20:48:49Z

internal/cli/cmd/verify/stage.go

+	Abort   bool
+}
+
+func newVerifyStageCommand(cfg config.CLIConfig) *cobra.Command {


This is another inconsistency we can fix when time permits. (Not specifically introduced here.)

These constructor-like funcs for building cobra commands are sometimes named like newNounCommand and other times are named like newVerbNounCommand.

I'll open a separate issue.

krancour · 2024-03-14T20:53:30Z

internal/cli/cmd/verify/stage.go

+kargo verify stage my-stage
+
+# Abort the verification of a stage's current freight
+kargo verify stage --project=my-project my-stage --abort


I'm a little iffy on --abort as a flag on kargo verify stage.

We've gotten a lot better about consistently adhering to kargo verb noun and this feels like a departure from that.

I get why you did it this way. There's currently nothing else "abort" applies to, which would make it feel like some awkward one-off sub-command...

I feel we will soon have other things that can be aborted as well. I know @jessesuen has mentioned aborting promotions a time or two before...

We might want to plan ahead for that and consider making abort its own subcommand after all.

What would the full abort command be called in this case? With something like kargo abort verification --stage <stage>, I feel it would be counterintuitive coming from kargo verify stage.

Ya... now that you say that... that seems clunky because verification isn't a resource -- nor do I think it needs to be.

Obviously this was non-blocking anyway.

krancour · 2024-03-15T15:18:02Z

internal/controller/stages/verification.go

+	// If the stage does not have a reverification annotation, check if there is
+	// an existing AnalysisRun for the Stage and Freight. If there is, return
+	// the status of this AnalysisRun.
+	if _, ok := stage.GetAnnotations()[kargoapi.AnnotationKeyReverify]; !ok {


What happens here if we previously succeeded in restarting verification, but failed to clear the annotation?

It looks to me like that would result in a duplicate AnalysisRun.

In that case (assuming the status update patch did work), we would not end up here.
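One way to read this, combined with the PR description saying the annotation value is the ID of the existing verification information: once a new verification has been recorded in status under a new ID, a leftover annotation no longer matches and can be ignored. That reading is an assumption on my part, and `reverifyRequested` below is a hypothetical helper sketching it, not the check in this diff.

```go
package main

import "fmt"

// Annotation key named in the pull request description.
const annotationKeyReverify = "kargo.akuity.io/reverify"

// reverifyRequested reports whether the annotation asks for the verification
// that is currently recorded in status to be run again. A stale annotation
// that still carries an older ID does not match. Hypothetical helper based
// on an assumed reading of the thread.
func reverifyRequested(annotations map[string]string, currentVerificationID string) bool {
	requestedID, ok := annotations[annotationKeyReverify]
	return ok && requestedID != "" && requestedID == currentVerificationID
}

func main() {
	annotations := map[string]string{annotationKeyReverify: "id-1"}

	// Status still points at id-1: the request is live.
	fmt.Println(reverifyRequested(annotations, "id-1"))

	// Status already moved on to id-2 but the annotation was never cleared.
	fmt.Println(reverifyRequested(annotations, "id-2"))
}
```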

🤦‍♂️

krancour · 2024-03-15T15:41:23Z

internal/controller/stages/verification.go

+	// will indicate a "Succeeded" phase due to Argo Rollouts behavior.
+	return &kargoapi.VerificationInfo{
+		ID:      stage.Status.CurrentFreight.VerificationInfo.ID,
+		Phase:   kargoapi.VerificationPhaseFailed,


I think we should introduce an Aborted phase, then there are three distinct and non-overlapping negative outcomes:

1. Kargo encountered a problem (error)
2. Verification completed with a negative result (failure)
3. Verification was aborted

wdyt?

This crossed my mind as well, so more than happy to add it!

hiddeco added kind/enhancement priority/normal area/controller area/cli area/api labels Mar 13, 2024

hiddeco force-pushed the add-verification-endpoints branch from 974730e to dd14f03 Compare March 13, 2024 13:52

hiddeco force-pushed the add-verification-endpoints branch from 99d5462 to 0148e0d Compare March 13, 2024 15:19

krancour reviewed Mar 13, 2024

hiddeco force-pushed the add-verification-endpoints branch 2 times, most recently from 8e0e5d5 to 3d6d82a Compare March 14, 2024 09:45

hiddeco marked this pull request as ready for review March 14, 2024 12:47

hiddeco requested a review from a team as a code owner March 14, 2024 12:47

hiddeco requested a review from krancour March 14, 2024 12:47

hiddeco added this to the v0.5.0 milestone Mar 14, 2024

hiddeco self-assigned this Mar 14, 2024

hiddeco force-pushed the add-verification-endpoints branch 3 times, most recently from 312ec6c to fe258a7 Compare March 14, 2024 16:47

krancour reviewed Mar 14, 2024

hiddeco force-pushed the add-verification-endpoints branch from d79121f to e0629ca Compare March 15, 2024 08:50

feat(api): introduce "reconfirm" annotation

b0af651

The `kargo.akuity.io/reconfirm` annotation is intended to be used to signal a request to reconfirm e.g. the verification of a Stage. Signed-off-by: Hidde Beydals <hidde@hhh.computer>

hiddeco added 11 commits March 15, 2024 09:58

feat(api): add stage verification request endpoint

463ebd4

Signed-off-by: Hidde Beydals <hidde@hhh.computer>

feat(api): implement stage verification request

c565cd6

Signed-off-by: Hidde Beydals <hidde@hhh.computer>

feat(controller): implement Stage re-verification

1ccb7b4

Signed-off-by: Hidde Beydals <hidde@hhh.computer>

feat(cmd): add kargo verify stage command

c329092

Signed-off-by: Hidde Beydals <hidde@hhh.computer>

feat(controller): ignore removal of annotation

28b94ec

Signed-off-by: Hidde Beydals <hidde@hhh.computer>

feat(api)!: introduce ID in VerificationInfo

4be308e

Signed-off-by: Hidde Beydals <hidde@hhh.computer>

api: change terminology to "reverify"

efeb7ea

Signed-off-by: Hidde Beydals <hidde@hhh.computer>

feat(api): add kargo.akuity.io/abort annotation

983718c

Signed-off-by: Hidde Beydals <hidde@hhh.computer>

feat(controller): allow abort of AnalysisRun

43aeece

Signed-off-by: Hidde Beydals <hidde@hhh.computer>

feat(api/cli): allow aborting AnalysisRun

92b2082

Signed-off-by: Hidde Beydals <hidde@hhh.computer>

chore(api): simplify annotation patch helper

59dc19e

Signed-off-by: Hidde Beydals <hidde@hhh.computer>

hiddeco force-pushed the add-verification-endpoints branch from e0629ca to fcd8f0d Compare March 15, 2024 09:08

chore(api): simplify annotation clearing

66f3f03

Signed-off-by: Hidde Beydals <hidde@hhh.computer>

hiddeco force-pushed the add-verification-endpoints branch from fcd8f0d to 66f3f03 Compare March 15, 2024 09:22

hiddeco added 2 commits March 15, 2024 11:36

feat(verification): reflect abort in status

04d6174

Signed-off-by: Hidde Beydals <hidde@hhh.computer>

chore: simplify IgnoreAnnotationRemoval predicate

1e67282

Signed-off-by: Hidde Beydals <hidde@hhh.computer>

hiddeco force-pushed the add-verification-endpoints branch from fed36b7 to 1e67282 Compare March 15, 2024 11:15

krancour reviewed Mar 15, 2024

feat(api): add VerificationPhaseAborted

10657fc

Signed-off-by: Hidde Beydals <hidde@hhh.computer>

krancour approved these changes Mar 15, 2024

hiddeco added this pull request to the merge queue Mar 15, 2024

Merged via the queue into akuity:main with commit 028d7a5 Mar 15, 2024
14 checks passed

hiddeco deleted the add-verification-endpoints branch March 15, 2024 16:35

This was referenced Mar 15, 2024

Detect changes to AnalysisTemplate #1636

Closed

Maintain verification history for a Stage #1638

Closed

Distinguish permanent API errors from transient ones #1640

Open

chore(cli): tidy all command examples #1657

Merged
