feat(controller): health checks for multi-source argo cd apps #2160
@krancour, this is a WIP, but I've made some good progress fixing the issues I found.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2160      +/-   ##
==========================================
+ Coverage   48.27%   49.23%   +0.95%
==========================================
  Files         246      250       +4
  Lines       17739    18125     +386
==========================================
+ Hits         8564     8923     +359
+ Misses       8749     8724      -25
- Partials      426      478      +52
☔ View full report in Codecov by Sentry.
This is on our list to look at, but given the (upstream) complexity it may take a while before we have time to validate this end-to-end. Please bear with us until then, thanks 🙇
Hey @hiddeco, I made some good progress on trying to test this with Argo Rollouts verifications, and found additional code bits in health.go that needed to be updated to support multi-source ArgoCD apps. I think it would make sense to add those fixes in this PR, thoughts?
That's quite likely, as I think what you are working on complements what someone else tried to do before in #2088.
Added health check fix and tests for multi-source app support.
Review please?
@gnadaban we are working at a furious pace to get v0.8.0 released and this isn't something we need for v0.8.0, so we need to ask for some patience please. 🙏
Hey @krancour, thanks for the update! It's literally fixing previously broken/unavailable paths, so I don't think it would bother anyone.
While the logic appears quite sound now (great job! 🙇), I feel we should trim down the number of logged messages (I count 28 additions, covering 40-something LOC), and if we log them, ensure they match the style of the logs we already have throughout Kargo.
More concretely:
- No capital letter (except for names, e.g. Application), and no period at the end.
- Present participle form (e.g. "evaluating ArgoCD Application health" instead of "about to ...").
To trim down the number of logged messages, my suggestion continues to be to let the caller handle most of the logging based on a positive or negative outcome, rather than meticulously logging every detail, which is something I would expect at trace level instead.
@@ -136,6 +148,9 @@ func (h *applicationHealth) GetApplicationHealth(
}
All of the logging within this method could simply be handled by the caller logging before it is about to assess the health, and after the method has returned. More fine-grain information does not add much besides noise, and it should be possible to derive any other information (or the point where something failed) based on e.g. the returned error (collection).
In my case, the information these fine-grained details added was quite necessary when diagnosing what is broken and why. I am convinced that debug is the right level for such messages, but I'm not against setting some to trace level.
That however would go against the expectation people usually have when wanting to see "all" logs. Debug is the most verbose in most popular projects, few have a distinct "trace" setting, so I wouldn't think to look for that.
I think that having detailed logging for such a complex machinery is necessary, and as time goes by and the app matures and evolves it will only become more important.
Please advise.
internal/argocd/health.go
Outdated
// We follow ArgoCD shadow-array implementation here that preserves the order of app.spec.sources for
// the revisions.

misaligned_sources := make([]string, 0)
- misaligned_sources := make([]string, 0)
+ misalignedSources := make([]string, 0)
Good catch. Out of curiosity, how come the linter doesn't catch these?
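To illustrate the index-aligned comparison that the code comment in the hunk above describes, here is a rough sketch. It assumes that desired and synced revisions are both slices ordered like app.spec.sources, and that an empty desired revision means "no expectation for this source". The function name and the message format are hypothetical, not the PR's actual code.

```go
package main

import "fmt"

// misalignedSources compares desired and synced revisions index by index.
// Both slices are assumed to follow the order of app.spec.sources, so the
// revision at index i in each slice belongs to the same source.
func misalignedSources(desired, synced []string) []string {
	misaligned := make([]string, 0)
	for i, want := range desired {
		if want == "" {
			// Assumption: no desired revision is known for this source.
			continue
		}
		got := ""
		if i < len(synced) {
			got = synced[i]
		}
		if got != want {
			misaligned = append(
				misaligned,
				fmt.Sprintf("source %d: desired %q, synced %q", i, want, got),
			)
		}
	}
	return misaligned
}

func main() {
	desired := []string{"abc", "", "def"}
	synced := []string{"abc", "zzz", "xyz"}
	for _, m := range misalignedSources(desired, synced) {
		fmt.Println(m)
	}
}
```

Only index 2 is reported here: index 0 matches, and index 1 is skipped because it has no desired revision.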
ReleaseName string `json:"releaseName,omitempty"`
ValueFiles []string `json:"valueFiles,omitempty"`
// +kubebuilder:validation:Schemaless
// +kubebuilder:pruning:PreserveUnknownFields
// +kubebuilder:validation:Type=object
ValuesObject json.RawMessage `json:"valuesObject,omitempty"`
These should be removed, as this has been addressed by #2428
If I remove ValuesObject, would it make it impossible to patch it later on?
internal/argocd/health.go
Outdated
switch {
case revision == "":
case revisions == nil || (len(revisions) == 1 && revisions[0] == ""):
Another spot where the nil check is not needed.
Also this seems to treat one empty string "desired revision" as a special case, but I think the special case should actually be "all desired revisions are empty strings."
It may be that the best way to handle that is when finding the desired revisions. If we come up empty-handed for each and every source, we should probably just return nil/empty slice so that everywhere else we don't need to have the logic that checks the whole collection to see if it's all empties.
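A small sketch of that idea, assuming desired revisions are collected as a []string with one entry per source. The helper name normalizeDesiredRevisions is hypothetical; in practice the check could live wherever the desired revisions are gathered.

```go
package main

import "fmt"

// normalizeDesiredRevisions returns nil when no source has a desired
// revision, so callers only need a len(revisions) == 0 check instead of
// scanning the slice for "all empty strings" themselves.
func normalizeDesiredRevisions(revisions []string) []string {
	for _, r := range revisions {
		if r != "" {
			// At least one source has a desired revision; keep the slice
			// intact so its indices still line up with the sources.
			return revisions
		}
	}
	return nil
}

func main() {
	fmt.Println(len(normalizeDesiredRevisions([]string{"", ""})))       // prints 0
	fmt.Println(len(normalizeDesiredRevisions([]string{"", "abc123"}))) // prints 2
}
```

With this in place, a check such as `len(revisions) == 0` covers nil, empty, and all-empty inputs, which also makes a separate nil check unnecessary.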
tbh, I would prefer seeing all the new log lines removed. The signal to noise ratio in this PR is low mostly on account of the logging. To be frank, the noise has been a hindrance in reviewing the relatively few substantive changes in this PR.
Status: kargoapi.StageStatus{
FreightHistory: testCase.freightHistory,
},
There is no compelling reason to set this. There's actually a comment in GetDesiredRevisions about Freight not being obtained from the status and being obtained instead from the last argument to the function, because that grants the flexibility to use that function in the context of a Promotion or a health check of a Stage with its current Freight.
Similarly, all the test cases that explicitly build out a Stage status with Freight history do not need to.
Although I see this was a flaw in the original test as well.
So, can we keep this for now?
err = fmt.Errorf("error finding Argo CD Application %q in namespace %q: %w", key.Name, key.Namespace, err)
if client.IgnoreNotFound(err) == nil {
err = fmt.Errorf("unable to find Argo CD Application %q in namespace %q", key.Name, key.Namespace)
if app.Status.OperationState != nil {
This new nil-check looks like it probably should have been there all along, however, when you added it, you also added the possibility that one or more desired revisions are known, but we still fall through to the logic at the very end of the function that deems the Application healthy without further conditions.
"but we still fall through to the logic at the very end of the function that deems the Application healthy without further conditions."
Sorry. I read something wrong. This isn't the case.
@hiddeco knows this part of the code better. What's the right thing to do if we don't have any operation state?
stageHealthForAppSync used to error out if there was no OperationState. This is because ArgoCD resets the OperationState when a new Operation is requested (https://github.com/argoproj/argo-cd/blob/473665795c9bcc40e3a60e3bddb7edf76747c974/util/argo/argo.go#L800-L801).
Given this, the lack of an OperationState can indicate that a new (sync) operation is being attempted (or has never happened).
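One way to read that behaviour, sketched with hypothetical stand-in types rather than the real Argo CD API structs: treat a missing OperationState as "cannot assess yet" and return a sentinel error instead of falling through. Whether that is the right outcome for this PR is exactly the open question in this thread, so the names and the error text below are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// operationState and appStatus are simplified, hypothetical stand-ins for
// the corresponding Argo CD Application status fields.
type operationState struct {
	Phase string
}

type appStatus struct {
	OperationState *operationState
}

// errNoOperationState signals that no operation state is recorded, which
// can mean a new sync operation was just requested or none has ever run.
var errNoOperationState = errors.New(
	"no operation state: a sync operation may be pending or may never have happened",
)

// syncPhase returns the phase of the last operation, or errNoOperationState
// when the Application has no operation state to inspect.
func syncPhase(status appStatus) (string, error) {
	if status.OperationState == nil {
		return "", errNoOperationState
	}
	return status.OperationState.Phase, nil
}

func main() {
	if _, err := syncPhase(appStatus{}); err != nil {
		fmt.Println("cannot assess sync:", err)
	}
	phase, _ := syncPhase(appStatus{OperationState: &operationState{Phase: "Succeeded"}})
	fmt.Println("phase:", phase)
}
```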
So, keep, remove, what is expected?
…or multi-source apps Signed-off-by: Gyorgy Nadaban <gyorgy.nadaban@gmail.com>
Signed-off-by: Gyorgy Nadaban <gyorgy.nadaban@gmail.com>
@krancour, @hiddeco: I've addressed the majority of the asks, and would like your help to finally complete this.
The debug logging messages added in this PR were essential for debugging what was preventing successful sync operations or causing issues, and for understanding what parts of the call chain were involved. Understanding what the app does and how it reacts to things is crucial for cluster admins using Kargo. Since there's no equivalent, I'm strongly against complete removal of detailed logging. I'm not opposed to using Trace level for some, although I can't recall ever having to use it for any open source project; in my experience, "debug" is usually what people use for, well, debugging.
Unless there's something broken, I'm hesitant to change or refactor anything else at this point, and would ask to forgo changes to parts of the code that are non-essential to this improvement.
As always, your suggestions are greatly appreciated, and I would again like to ask you to help expedite completion.
Superseded by #2552
Fixes #1399
Changes in PR: