-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(api/cli): add verification management endpoints #1611
Conversation
✅ Deploy Preview for docs-kargo-akuity-io ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1611 +/- ##
==========================================
+ Coverage 43.77% 43.96% +0.18%
==========================================
Files 195 199 +4
Lines 12472 12673 +201
==========================================
+ Hits 5460 5572 +112
- Misses 6776 6864 +88
- Partials 236 237 +1 ☔ View full report in Codecov by Sentry. |
974730e
to
dd14f03
Compare
99d5462
to
0148e0d
Compare
8e0e5d5
to
3d6d82a
Compare
internal/controller/stages/stages.go
Outdated
if err := r.abortVerificationFn(ctx, stage); err != nil { | ||
return status, fmt.Errorf( | ||
"error aborting verification for Stage %q in namespace %q: %w", | ||
stage.Name, | ||
stage.Namespace, | ||
err, | ||
) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am on the fence here about returning an error, or returning a VerificationInfo
struct with the error.
Another interesting thing I have noticed is that if you provide Argo Rollouts with a terminate request, it appears that in some cases the AnalysisRun will end up as Successful
. I wonder if this is actually what we want to happen, or if we should construct a custom VerificationInfo
with data which reflects the abort operation having taken place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the reasons I would not return a VerificationInfo
struct here (for errors) is that any issue would wipe out information data including references to the AnalysisRun, even on transient errors.
Essentially, I think this is also an existing flaw in how we receive verification information at present. As in the current implementation, we do not distinguish an "object not found" error from e.g. "we temporary can't reach the Kubernetes API server". Which means that even if we could eventually recover from the error, the user has to rerun the verification because the controller has given up on the previous AnalysisRun.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another interesting thing I have noticed is that if you provide Argo Rollouts with a terminate request, it appears that in some cases the AnalysisRun will end up as Successful. I wonder if this is actually what we want to happen, or if we should construct a custom VerificationInfo with data which reflects the abort operation having taken place.
I think we have separate field for VerificationInfo phase and then, under AnalysRunReference, it has its own phase.
So we should be able to capture that the AnalysisRun succeeded and still record that the verification was aborted.
In determining the VerificationInfo's phase, I agree that the attempt to abort takes precedence over the fact that the attempt raced the AnalysisRun and lost with the AnalysisRun succeeding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in the current implementation, we do not distinguish an "object not found" error from e.g. "we temporary can't reach the Kubernetes API server". Which means that even if we could eventually recover from the error, the user has to rerun the verification because the controller has given up on the previous AnalysisRun.
Right. We make no attempt currently to distinguish between errors where a retry helps and one where it doesn't.
I support however you might wish to improve that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In determining the VerificationInfo's phase, I agree that the attempt to abort takes precedence over the fact that the attempt raced the AnalysisRun and lost with the AnalysisRun succeeding.
It is actually not a race, it is the AnalysisRun reflecting a success while it was the result from a termination request. I am not sure if this is default behavior for Kargo Rollouts, or if it depends on the command which is used to perform the analysis and the potential exit code it gives when it e.g. receives a SIGKILL
. FWIW, I am testing things with a simple sleep 600; exit 0;
.
I covered a potential race by checking the current state of the AnalysisRun before adhering to the abort request, which prevents an unnecessary patch operation.
We make no attempt currently to distinguish between errors where a retry helps and one where it doesn't.
I support however you might wish to improve that.
I think improving this is too big of a change for the current scope of the PR, as it would require bubbling up Kubernetes errors (or translating them into our own typed errors). Where as at present, we have a pattern of translating not found to a nil
object while returning other errors.
Given this, I am inclined to keep the pattern of reflecting errors in VerificationInfo
for now. To address this in full in a follow up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is actually not a race, it is the AnalysisRun reflecting a success while it was the result from a termination request.
Ah. I was mistakenly thinking the AnalysisRun succeeded between the time the abort annotation was added and the time the Stage was reconciled.
This behavior seems peculiar indeed.
@jessesuen would probably be able to give us some insight on this...
@jessesuen you know anything about aborted AnalysisRuns having phase Success?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this more, this shouldn't block the PR.
As long as the verification phase correctly identifies that the verification was aborted, I don't think the phase of the analysis run, even if it is arguably incorrect, matters anywhere else in Kargo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it appears that in some cases the AnalysisRun will end up as Successful
Is it some cases or all?
it is the AnalysisRun reflecting a success while it was the result from a termination request
I seem to recall that this was designed behavior for AnalysisRun. There is such thing as an "indefinite" analysisrun that needs to stopped manually / programmatically.
Let me check on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, when we terminate an AnalysisRun, we consider it a Successful run:
https://github.com/argoproj/argo-rollouts/blob/master/analysis/analysis.go#L581-L585
https://github.com/argoproj/argo-rollouts/blob/master/analysis/analysis.go#L610-L614
It's because Argo Rollouts will terminate indefinite/background AnalyisRuns when the update is completed. So in the case of Rollouts, termination is a happy path (for background analysis)
I'm thinking about how Kargo should handle this. I think in the case of Kargo, termination is more of a user decision, whereas in Argo Rollouts, the controller deciding to stop the run because the update completed.
Given that, perhaps an terminated analysis run should be treated as failed analysis.
312ec6c
to
fe258a7
Compare
# Abort the verification of a stage in the default project | ||
kargo config set-project my-project | ||
kargo verify stage my-stage --abort | ||
`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: We need to fix this in a bunch of places. We're getting an extra line break at the end of the examples section (two instead of one), so we should probably move these ending ticks to the end of the previous line.
Not a blocker. Just a thing I've noticed -- and definitely not unique to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also actually has a leading newline everywhere. We should probably just do:
Example: `# Verify a stage
kargo verify stage --project=my-project my-stage`,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also actually has a leading newline everywhere.
Funny... I noticed the double blank line after examples and never noticed that the one blank line between "Examples:" and the actual examples was still one blank line more than what is between all the other help section headings and their content.
Good eye.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I have see some projects do is adding a tiny wrapper which adds strings.TrimSpace
(e.g. printDescription
), as this allows you to continue to maintain the visible advantage of having everything start at the same point in-code and/or to move it to constants defined outside of the struct.
Abort bool | ||
} | ||
|
||
func newVerifyStageCommand(cfg config.CLIConfig) *cobra.Command { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another inconsistency we can fix when time permits. (Not specifically introduced here.)
These constructor-like funcs for building cobra commands are sometimes named like newNounCommand
and other times are named like newVerbNounCommand
.
I'll open a separate issue.
kargo verify stage my-stage | ||
|
||
# Abort the verification of a stage's current freight | ||
kargo verify stage --project=my-project my-stage --abort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little iffy on --abort
as a flag on kargo verify stage
.
We've gotten a lot better about consistently adhering to kargo verb nound
and this feels like a departure from that.
I get why you did it this way. There's currently nothing else "abort" applies to, which would make it feel like some awkward one-off sub-command...
I feel we will soon have other things that can be aborted as well. I know @jessesuen has mentioned aborting promotions a time or two before...
We might want to plan ahead for that and consider making abort
its own subcommand after all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would the full abort command be called in this case? As with kargo abort verification --stage <stage>
, I feel it would be counterintuitive from kargo verify stage
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya... now that you say that... that seems clunky because verification
isn't a resource -- nor do I think it needs to be.
Obviously this was non-blocking anyway.
d79121f
to
e0629ca
Compare
The `kargo.akuity.io/reconfirm` annotation is intended to be used to signal a request to reconfirm e.g. the verification of a Stage. Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
e0629ca
to
fcd8f0d
Compare
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
fcd8f0d
to
66f3f03
Compare
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
fed36b7
to
1e67282
Compare
// If the stage does not have a reverification annotation, check if there is | ||
// an existing AnalysisRun for the Stage and Freight. If there is, return | ||
// the status of this AnalysisRun. | ||
if _, ok := stage.GetAnnotations()[kargoapi.AnnotationKeyReverify]; !ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens here if we previously succeeded in restarting verification, but failed to clear the annotation?
It looks to me like that would result in a duplicate AnalysisRun.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case (assuming the status update patch did work), we would not end up here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦♂️
// will indicate a "Succeeded" phase due to Argo Rollouts behavior. | ||
return &kargoapi.VerificationInfo{ | ||
ID: stage.Status.CurrentFreight.VerificationInfo.ID, | ||
Phase: kargoapi.VerificationPhaseFailed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should introduce an Aborted phase, then there are three distinct and non-overlapping negative outcomes:
- Kargo encountered a problem (error)
- Verification completed with a negative result (failure)
- Verification was aborted
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This crossed my mind as well, so more than happy to add it!
Signed-off-by: Hidde Beydals <hidde@hhh.computer>
Fixes: #1581
This pull requests adds mechanisms to:
Both are implemented through annotations (
kargo.akuity.io/reverify
andkargo.akuity.io/abort
), allowing the behavior to be triggered outside of the API/CLI by annotating the Stage with the respective annotation using the ID of the existing verification information for the Stage as the value.In addition to this, a
kargo verify stage (NAME) [--abort]
command has been added to make the features available to CLI users.There is more to cover here in the future, e.g. keeping track of a "stack" of
VerificationInfo
objects to provide an historical overview, and to detect changes as mentioned in #1581 (comment). However, they are out of scope for this pull request and should be handled as separate feature requests.