Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(controller): health checks for multi-source argo cd apps #2160

Closed
wants to merge 2 commits into from

Conversation

gnadaban
Copy link
Contributor

@gnadaban gnadaban commented Jun 14, 2024

Fixes #1399

Changes in PR:

  • Fix determinining desired revision when application has multiple sources
  • Fix detecting revision for apps with multiple sources targeting the same repository
  • Add debug log messages to help diagnose issues
  • Add missing fields to prevent destroying application settings in Helm sources such as valueFiles, valuesObject, etc.

Copy link

netlify bot commented Jun 14, 2024

Deploy Preview for docs-kargo-akuity-io failed.

Name Link
🔨 Latest commit f7824d5
🔍 Latest deploy log https://app.netlify.com/sites/docs-kargo-akuity-io/deploys/66d745bfdf293c00089cb3ba

@gnadaban
Copy link
Contributor Author

@krancour , this is a WIP but I've made some good progress fixing the issues I found.

@gnadaban gnadaban marked this pull request as ready for review June 14, 2024 18:34
@gnadaban gnadaban requested a review from a team as a code owner June 14, 2024 18:34
@gnadaban gnadaban force-pushed the fix-multisource-app-support branch from df220d0 to 09fe1ee Compare June 17, 2024 12:58
@gnadaban gnadaban force-pushed the fix-multisource-app-support branch from 5609c3d to 3b0aa71 Compare June 17, 2024 14:16
Copy link

codecov bot commented Jun 17, 2024

Codecov Report

Attention: Patch coverage is 88.75502% with 28 lines in your changes missing coverage. Please review.

Project coverage is 49.23%. Comparing base (97452fd) to head (f7824d5).
Report is 33 commits behind head on main.

Files with missing lines Patch % Lines
internal/controller/promotion/argocd.go 76.92% 10 Missing and 2 partials ⚠️
internal/argocd/health.go 91.13% 5 Missing and 2 partials ⚠️
internal/argocd/revision.go 91.02% 6 Missing and 1 partial ⚠️
internal/controller/promotion/render.go 50.00% 1 Missing ⚠️
internal/controller/stages/stages.go 92.30% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2160      +/-   ##
==========================================
+ Coverage   48.27%   49.23%   +0.95%     
==========================================
  Files         246      250       +4     
  Lines       17739    18125     +386     
==========================================
+ Hits         8564     8923     +359     
+ Misses       8749     8724      -25     
- Partials      426      478      +52     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gnadaban gnadaban requested a review from hiddeco June 17, 2024 16:51
@gnadaban
Copy link
Contributor Author

@hiddeco , any other concerns with this one, other than having to wait until #2156 is merged?

@gnadaban gnadaban requested a review from krancour June 18, 2024 14:23
@gnadaban gnadaban force-pushed the fix-multisource-app-support branch from 065a95f to 4e632ef Compare June 18, 2024 20:25
@gnadaban gnadaban force-pushed the fix-multisource-app-support branch 3 times, most recently from f1bd673 to fbd8ded Compare June 24, 2024 13:02
@gnadaban
Copy link
Contributor Author

Is there anything else to address for this one @krancour or @hiddeco ?

@hiddeco
Copy link
Contributor

hiddeco commented Jun 26, 2024

This is on our list to look at, but given the (upstream) complexity it may take a while before we have time to validate this end-to-end. Please bear with us until then, thanks 🙇

@gnadaban
Copy link
Contributor Author

Hey @hiddeco , I made some good progress on trying to test this with Argo Rollouts verifications, and found additional code bits in health.go that needed to be updated to support multi-source ArgoCD apps.

I think it would make sense to add those fixes in this PR, thoughts?

@hiddeco
Copy link
Contributor

hiddeco commented Jun 27, 2024

That's quite likely, as I think what you are working on compliments what someone else tried to do before in #2088.

@gnadaban
Copy link
Contributor Author

Added health check fix and tests for multi-source app support.

@gnadaban gnadaban force-pushed the fix-multisource-app-support branch from 68abfe3 to 17a02e0 Compare July 8, 2024 13:35
@gnadaban
Copy link
Contributor Author

gnadaban commented Jul 8, 2024

Hello @hiddeco / @krancour , can one of you review please?

@gnadaban
Copy link
Contributor Author

Review please?

@krancour
Copy link
Member

@gnadaban we are working at a furious pace to get v0.8.0 released and this isn't something we need for v0.8.0, so we need to ask for some patience please. 🙏

@gnadaban
Copy link
Contributor Author

Hey @krancour , thanks for the update!
Any chance to include this in v0.8.0? If there's anything I can do to help to make that happen let me know.

It's literally fixing previously broken/unavailable paths, so I don't think it would bother anyone.

@krancour krancour changed the title fix: Multi-source Helm ArgoCD app support feat: health checks for multi-source argo cd apps Aug 19, 2024
@krancour krancour changed the title feat: health checks for multi-source argo cd apps feat(controller): health checks for multi-source argo cd apps Aug 19, 2024
@gnadaban gnadaban force-pushed the fix-multisource-app-support branch from ed63cf5 to 070bde3 Compare August 20, 2024 14:00
Copy link
Contributor

@hiddeco hiddeco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the logic appears quite sound now (great job! 🙇), I feel we should trim down the number of logged messages (I count 28 additions, covering 40-something LOC) — and if we log them, ensure they match the style of the logs we already have throughout Kargo.

More concretely:

  • No capital letter (except for names, i.e. Application), and no period at the end.
  • Present participle form (i.e. evaluating ArgoCD Application health instead of about to ...)

To trim down the number of logged messages, my suggestion continues to be to let the caller handle most of the logging based on a positive or negative outcome — rather than meticulously logging every detail, which is something I would expect at trace level instead.

@@ -136,6 +148,9 @@ func (h *applicationHealth) GetApplicationHealth(
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the logging within this method could simply be handled by the caller logging before it is about to assess the health, and after the method has returned. More fine-grain information does not add much besides noise, and it should be possible to derive any other information (or the point where something failed) based on e.g. the returned error (collection).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my case, the information these fine-grained details added was quite necessary when diagnosing what is broken and why. I am convinced that debugging is the right level of setting for such, but I'm not against setting some to trace level.

That however would go against the expectation people usually have when wanting to see "all" logs. Debug is the most verbose in most popular projects, few have a distinct "trace" setting, so I wouldn't think to look for that.

I think that having detailed logging for such a complex machinery is necessary, and as time goes by and the app matures and evolves it will only become more important.

Please advise.

// We follow ArgoCD shadow-array implementation here that preserves the order of app.spec.sources for
// the revisions.

misaligned_sources := make([]string, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
misaligned_sources := make([]string, 0)
misalignedSources := make([]string, 0)

https://google.github.io/styleguide/go/guide#mixedcaps

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Out of curiosity, how come the linter doesn't catch these?

Comment on lines +75 to +80
ReleaseName string `json:"releaseName,omitempty"`
ValueFiles []string `json:"valueFiles,omitempty"`
// +kubebuilder:validation:Schemaless
// +kubebuilder:pruning:PreserveUnknownFields
// +kubebuilder:validation:Type=object
ValuesObject json.RawMessage `json:"valuesObject,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be removed, as it has been addressed by #2428

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remove ValuesObject, would it make it impossible to patch it later on?

internal/argocd/revision.go Outdated Show resolved Hide resolved
internal/argocd/revision.go Outdated Show resolved Hide resolved
switch {
case revision == "":
case revisions == nil || (len(revisions) == 1 && revisions[0] == ""):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another spot where the nil check is not needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this seems to treat one empty string "desired revision" as a special case, but I think the special case should actually be "all desired revisions are empty strings."

It may be that the best way to handle that is when finding the desired revisions. If we come up empty-handed for each and every source, we should probably just return nil/empty slice so that everywhere else we don't need to have the logic that checks the whole collection to see if its all empties.

internal/argocd/health.go Outdated Show resolved Hide resolved
@krancour
Copy link
Member

I feel we should trim down the number of logged messages

tbh, I would prefer seeing all the new log lines removed. The signal to noise ratio in this PR is low mostly on account of the logging. To be frank, the noise has been a hindrance in reviewing the relatively few substantive changes in this PR.

Comment on lines +313 to +315
Status: kargoapi.StageStatus{
FreightHistory: testCase.freightHistory,
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no compelling reason to set this. There's actually a comment in GetDesiredRevisions about Freight not being obtained from the status and being obtained instead from the last argument to the function because that grants the flexibility to use that function in the context of a Promotion or a health check of a Stage with its current Freight.

Similarly, all the test cases that explicitly build out a State status with Freight history do not need to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I see this was a flaw in the original test as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, can we keep this for now?

err = fmt.Errorf("error finding Argo CD Application %q in namespace %q: %w", key.Name, key.Namespace, err)
if client.IgnoreNotFound(err) == nil {
err = fmt.Errorf("unable to find Argo CD Application %q in namespace %q", key.Name, key.Namespace)
if app.Status.OperationState != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new nil-check looks like it probably should have been there all along, however, when you added it, you also added the possibility that one or more desired revisions are known, but we still fall through to the logic at the very end of the function that deems the Application healthy without further conditions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we still fall through to the logic at the very end of the function that deems the Application healthy without further conditions.

Sorry. I read something wrong. This isn't the case.

@hiddeco knows this part of the code better. What's the right thing to do if we don't have any operation state?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stageHealthForAppSync used to error out if there was no OperationState, this is because ArgoCD resets the OperationState when a new Operation is requested (https://github.com/argoproj/argo-cd/blob/473665795c9bcc40e3a60e3bddb7edf76747c974/util/argo/argo.go#L800-L801).

Given this, the lack of an OperationState can indicate that a new (sync) operation is being attempted (or has never happened).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, keep, remove, what is expected?

…or multi-source apps

Signed-off-by: Gyorgy Nadaban <gyorgy.nadaban@gmail.com>
@gnadaban gnadaban force-pushed the fix-multisource-app-support branch from 070bde3 to 4ed19c5 Compare September 3, 2024 14:19
Signed-off-by: Gyorgy Nadaban <gyorgy.nadaban@gmail.com>
@gnadaban
Copy link
Contributor Author

gnadaban commented Sep 3, 2024

@krancour , @hiddeco : I've addressed the majority of the asks, and would like your help to finally complete this.

The debug logging messages added in this PR were essential for debugging what was preventing successful sync operations or causing issues, and to understand what parts of the call chain were involved. Understanding what the app does and how it reacts to things is crucial for cluster admins using Kargo. Since there's no equivalent, I'm strongly against complete removal of detailed logging. I'm not opposed to using Trace level for some, albeit I can't recall if I ever had to use such for any open source project, in my experience "debug" is usually what people use for well, debugging.

Unless there's something broken I'm hesitant to change or refactor anything else at this point, and would ask to forgo changes to parts of the code that are non-essential to this improvement.

As always, your suggestions are greatly appreciated, and I would again like to ask you to help expedite completion.

@krancour
Copy link
Member

Superseded by #2552

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

health checks for Applications with multiple sources can probably work now
4 participants