Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: refactor retries #3060

Merged
merged 6 commits into from
Dec 5, 2024
Merged

fix: refactor retries #3060

merged 6 commits into from
Dec 5, 2024

Conversation

krancour
Copy link
Member

@krancour krancour commented Dec 5, 2024

Fixes #3052

Never fear... a lot of this is just codegen.

None of this is breaking because the bits that are refactored haven't been released yet. 😄

@hiddeco this follows the plan you and I discussed offline earlier.

You also get your wish of us tracking start and end time for each step. I bet @Marvin9 could make good use of this information in the future.

I did my very best to preserve the general spirit and structure of #2940.

This is successfully having all steps fail on their first error with no retries by default, but you can make a step retry after an error by explicitly configuring a error threshold > 1.

As long a step says it is running, it will be retried ad infinitum by default, but the argocd-update step has a lower default of 10m, you can make any step have a finite timeout if you explicitly configure a non-nil and non-zero duration.

As a bonus, I found and fixed two other bugs related to step execution:

  • Step execution engine was only ever returning health checks on success. This meant that, if in a single run, step 1 succeeded and had health checks to return, but step 2 was running and needed to be requeued, those health checks step 1 wanted to register never make it back to the reconciler.

  • The reconciler was only ever doing anything with the healthchecks on promotion success. A single promotion succeeds a maximum of once. This meant that even if the previous bug hadn't existed, the reconciler would ignore all health checks except those returned from its final call out the the engine. Other steps could have succeeded in previous calls and their health checks would be nowhere to be found.

    We never happened to notice this, because the only step that registers health checks currently is argocd-update and that tends to be the last step in a promo process, which meant things always have worked out in its favor despite these bugs.

This still needs corresponding doc updates, but I will tackle them in a follow-up.

Copy link

netlify bot commented Dec 5, 2024

Deploy Preview for docs-kargo-io ready!

Name Link
🔨 Latest commit 743a0d8
🔍 Latest deploy log https://app.netlify.com/sites/docs-kargo-io/deploys/6751e3841e5f2400087a6280
😎 Deploy Preview https://deploy-preview-3060.docs.kargo.io
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Signed-off-by: Kent Rancourt <kent.rancourt@gmail.com>
Signed-off-by: Kent Rancourt <kent.rancourt@gmail.com>
@krancour krancour force-pushed the krancour/retry-redux branch from fe0543f to ca48855 Compare December 5, 2024 05:26
Signed-off-by: Kent Rancourt <kent.rancourt@gmail.com>
Signed-off-by: Kent Rancourt <kent.rancourt@gmail.com>
Signed-off-by: Kent Rancourt <kent.rancourt@gmail.com>
@krancour krancour force-pushed the krancour/retry-redux branch from ca48855 to 13a191c Compare December 5, 2024 05:28
@krancour krancour added priority/urgent Fix right away and removed priority/high labels Dec 5, 2024
Copy link

codecov bot commented Dec 5, 2024

Codecov Report

Attention: Patch coverage is 76.08696% with 33 lines in your changes missing coverage. Please review.

Project coverage is 51.22%. Comparing base (3a7cbb3) to head (743a0d8).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
internal/controller/promotions/promotions.go 0.00% 20 Missing ⚠️
internal/directives/simple_engine_promote.go 92.55% 6 Missing and 1 partial ⚠️
internal/directives/argocd_updater.go 0.00% 4 Missing ⚠️
api/v1alpha1/promotion_types.go 80.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3060      +/-   ##
==========================================
+ Coverage   51.10%   51.22%   +0.11%     
==========================================
  Files         283      283              
  Lines       25410    25456      +46     
==========================================
+ Hits        12987    13039      +52     
+ Misses      11724    11720       -4     
+ Partials      699      697       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@hiddeco hiddeco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First sweep with a few minor nits. Overall, it looks absolutely great.

api/v1alpha1/promotion_types_test.go Outdated Show resolved Hide resolved
api/v1alpha1/promotion_types_test.go Outdated Show resolved Hide resolved
internal/directives/promotions.go Show resolved Hide resolved
internal/directives/simple_engine_promote.go Outdated Show resolved Hide resolved
internal/directives/simple_engine_promote.go Outdated Show resolved Hide resolved
Copy link
Contributor

@hiddeco hiddeco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs DCO and linter fixes, but otherwise LGTM. Great work @krancour 🙇

@Marvin9
Copy link
Contributor

Marvin9 commented Dec 5, 2024

You also get your wish of us tracking start and end time for each step. I bet @Marvin9 could make good use of this information in the future.

definitely following on this.. took "x" duration or running since "y" duration will add up value

@hiddeco
Copy link
Contributor

hiddeco commented Dec 5, 2024

Actually, think this also needs a slight update of documentation.

@krancour
Copy link
Member Author

krancour commented Dec 5, 2024

@hiddeco I mentioned docs will be in a follow up. I'll fix DCO and linter errors. This is what I get for editing directly on github on my phone!

Signed-off-by: Kent Rancourt <kent.rancourt@gmail.com>
@krancour krancour force-pushed the krancour/retry-redux branch from 99a990a to 743a0d8 Compare December 5, 2024 17:31
@krancour krancour enabled auto-merge December 5, 2024 17:40
@krancour krancour added this pull request to the merge queue Dec 5, 2024
Merged via the queue into akuity:main with commit 58588b8 Dec 5, 2024
17 checks passed
@krancour krancour deleted the krancour/retry-redux branch December 5, 2024 18:01
@krancour krancour mentioned this pull request Dec 5, 2024
krancour added a commit to krancour/kargo that referenced this pull request Dec 5, 2024
Signed-off-by: Kent Rancourt <kent.rancourt@gmail.com>
krancour added a commit that referenced this pull request Dec 5, 2024
Signed-off-by: Kent Rancourt <kent.rancourt@gmail.com>
fykaa pushed a commit to fykaa/kargo that referenced this pull request Dec 20, 2024
Signed-off-by: Kent Rancourt <kent.rancourt@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

undesired behavior from promotion step retries
3 participants