
If an error propagates all the way out, bail execution #3009

Merged

Conversation

@olemarkus (Contributor) commented Sep 7, 2022

Description

Related to #3008: when external-dns is configured with multiple sources and one of them fails, all other sources are skipped as well. As originally proposed, this PR logged the error and continued with the remaining sources rather than aborting completely.

As merged, if an error propagates all the way out to the control loop, external-dns logs it and exits rather than logging and carrying on.
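
For illustration, a minimal sketch of the originally proposed per-source behavior, assuming logrus-style logging (which external-dns uses). The `source` interface, `fakeSource`, and `collectEndpoints` below are hypothetical stand-ins, not the real external-dns `source.Source` types:

```go
package main

import (
	"context"
	"errors"
	"fmt"

	log "github.com/sirupsen/logrus"
)

// source is a hypothetical stand-in for an external-dns source such as the
// ingress or ambassador-host source; it is not the real source.Source interface.
type source interface {
	Name() string
	Endpoints(ctx context.Context) ([]string, error)
}

type fakeSource struct {
	name string
	err  error
}

func (f fakeSource) Name() string { return f.name }

func (f fakeSource) Endpoints(ctx context.Context) ([]string, error) {
	if f.err != nil {
		return nil, f.err
	}
	return []string{f.name + ".example.org"}, nil
}

// collectEndpoints shows the originally proposed behavior: if one source
// fails, log the error and continue with the remaining sources instead of
// skipping all of them.
func collectEndpoints(ctx context.Context, sources []source) []string {
	var all []string
	for _, s := range sources {
		eps, err := s.Endpoints(ctx)
		if err != nil {
			// Log and move on to the next source instead of aborting the pass.
			log.Errorf("source %s failed, skipping: %v", s.Name(), err)
			continue
		}
		all = append(all, eps...)
	}
	return all
}

func main() {
	sources := []source{
		fakeSource{name: "ingress"},
		fakeSource{name: "ambassador-host", err: errors.New("hosts could not be listed")},
		fakeSource{name: "service"},
	}
	// The failing source is skipped; the others still have their endpoints collected.
	fmt.Println(collectEndpoints(context.Background(), sources))
}
```

The merged version instead handles the failure one level up, in the control loop, as discussed later in this thread.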

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 7, 2022
@szuecs (Contributor) commented Sep 7, 2022

@olemarkus in general I don't like the idea of ignoring an error in one source. Consider the controller update scenario where you run current version A and update to version B. In A, all sources M, N, and K work, but in version B source M does not work for some reason.
I think the safe behavior is to fail fast in this case.

@olemarkus (Contributor, author)

Is that to increase the visibility of the failure?

In our case, where the ambassador host source failed, it broke all the other sources as well. This means it's very easy for one cluster tenant to break external-dns for everyone else. Of course, if the sources behave more nicely, such as ambassador hosts after #3008, then it makes sense to fail faster. Maybe.

@szuecs (Contributor) commented Sep 7, 2022

If you, for example, deploy a new version of external-dns, it should either work or break, but it should not delete records if one source cannot be fetched, which would likely break half of the cluster's ingress.
So my point is that it's more severe to break partially without notice than to fail completely.
Operators can alert on whether the control-loop pod is in the right state and will notice if it is crash-looping.
The logs will then show what caused the break, and they can get it fixed or manually stop using the source if that is applicable for their environment.

@olemarkus (Contributor, author)

I could agree if it did crash-loop, but it doesn't. It just logs and restarts the iteration.
It ends up on this line: https://github.com/kubernetes-sigs/external-dns/blob/master/controller/controller.go#L295

It may very well be that one should call os.Exit(1) if the controller returns an error all the way up there.
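
For context, a minimal sketch of the two behaviors being compared at that point in the control loop. `runOnce` is a hypothetical stub, and the real loop in controller/controller.go does considerably more:

```go
package main

import (
	"context"
	"errors"

	log "github.com/sirupsen/logrus"
)

// runOnce is a hypothetical stub for a single reconciliation pass; the real
// external-dns controller does much more here.
func runOnce(ctx context.Context) error {
	return errors.New("source failed to list endpoints")
}

func main() {
	ctx := context.Background()

	// Before this PR (simplified): the error is only logged and the loop
	// continues, so a persistent failure shows up in the logs but the pod
	// keeps running.
	if err := runOnce(ctx); err != nil {
		log.Errorf("reconciliation failed: %v", err)
	}

	// After this PR (simplified): the error is fatal. log.Fatalf logs the
	// message and then exits the process with status 1, so the pod
	// crash-loops and operators can alert on it.
	if err := runOnce(ctx); err != nil {
		log.Fatalf("reconciliation failed: %v", err)
	}
}
```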

@szuecs (Contributor) commented Sep 7, 2022

@olemarkus great catch that this one should be fatal!
@Raffo @njuettner can you check if you agree here (last comment and the linked source, error -> fatal)?

@olemarkus (Contributor, author)

How do we proceed with this?

@szuecs (Contributor) commented Sep 14, 2022

@olemarkus do you mean the change to fatal?

@olemarkus (Contributor, author)

Either this PR or change to fatal, yes.

@szuecs (Contributor) commented Nov 10, 2022

Fatal would be preferred

@olemarkus olemarkus changed the title If one source fails, continue to the next source instead of bailing If an error propagates all the way out, bail execution Nov 20, 2022
@olemarkus (Contributor, author)

Done.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 18, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 20, 2023
@olemarkus (Contributor, author)

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 23, 2023
@johngmyers (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 7, 2023
@szuecs (Contributor) commented May 8, 2023

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: olemarkus, szuecs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of the required OWNERS files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 8, 2023
@k8s-ci-robot k8s-ci-robot merged commit 128bcf8 into kubernetes-sigs:master May 8, 2023
@alfredkrohmer (Contributor)

This seems to be partially responsible for #3754. I think this change should be reverted. What's the point of a reconciling controller if it exits and crashes on every error it encounters?

@jbilliau-rcd

Agreed, this completely breaks the controller. So if any errors occur, the controller just blows up, after never doing that since its inception? Terrible idea.

@olemarkus (Contributor, author)

As mentioned in the first comment, my original PR was to log and carry on, but this approach was preferred instead. In the long run I do think this approach is better, but it may be a bit painful in the beginning, as some sources/providers may propagate errors for what is actually an acceptable path.

@jbilliau-rcd commented Sep 11, 2023

Sorry, I'm confused... how is this approach better? As far as I can tell, the controller is now completely broken and non-functional, as ANY issue with creating DNS records causes the entire thing to blow up. We have a ton of clusters with multiple teams and multiple apps on them. As it is today, if one batch of records is erroring, it doesn't affect anything else; other teams can still deploy Ingress objects and have their records created. Now, if ONE batch of records is incorrect, the entire controller ceases to function.

In an attempt to make an analogy, imagine if AWS functioned so that if someone created an S3 bucket with a bad bucket policy, no one could then create an S3 bucket because the S3 console would crash and become non-functional. How would that ever be a good idea?

@xrl commented Nov 15, 2023

I agree: fail fast is great for a single request, but for a long-lived controller it makes more sense to log and continue. There's no guarantee that all good resources will be serviced before the bad resource forces an exit; the controller is orphaning good resources.

I will be downgrading my external-dns and not participating in this flavor of execution. Please consider making the fail-fast behavior something that can be disabled via a command-line argument.

@ctwilleager-alio

I just wanted to add my two cents on this. We are using AWS EKS and Route53. I deployed a set of Kubernetes Ingress objects with annotations to create records and external-dns created them without issue. However, on the next run of updates, the external-dns pod kept crashing with a fatal error complaining that those same records already exist. I wasted hours troubleshooting this problem before finding this PR.

The external-dns controller should not even be complaining about records already existing, let alone completely crashing, especially when it created those records in the first place. This is a bad design choice and this PR should either be reverted entirely or reworked so that the controller does not explode when it finds records that it created.

I have pinned my external-dns chart version to v1.12.2 until this bug is fixed.
