
If an error propagates all the way out, bail execution #3009

Merged

Conversation

@olemarkus (Contributor) commented Sep 7, 2022

Description

Related to #3008: when external-dns is configured with multiple sources and one of them fails, all other sources are skipped as well. As originally proposed, this PR logged the error and continued with the remaining sources rather than aborting completely.

As merged, if an error propagates all the way out to the control loop, external-dns logs it and exits rather than logging and carrying on.
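
For illustration, a minimal sketch of the originally proposed per-source behavior, assuming logrus-style logging (which external-dns uses). The `source` interface, `fakeSource`, and `collectEndpoints` below are hypothetical stand-ins, not the real external-dns `source.Source` types:

```go
package main

import (
	"context"
	"errors"
	"fmt"

	log "github.com/sirupsen/logrus"
)

// source is a hypothetical stand-in for an external-dns source such as the
// ingress or ambassador-host source; it is not the real source.Source interface.
type source interface {
	Name() string
	Endpoints(ctx context.Context) ([]string, error)
}

type fakeSource struct {
	name string
	err  error
}

func (f fakeSource) Name() string { return f.name }

func (f fakeSource) Endpoints(ctx context.Context) ([]string, error) {
	if f.err != nil {
		return nil, f.err
	}
	return []string{f.name + ".example.org"}, nil
}

// collectEndpoints shows the originally proposed behavior: if one source
// fails, log the error and continue with the remaining sources instead of
// skipping all of them.
func collectEndpoints(ctx context.Context, sources []source) []string {
	var all []string
	for _, s := range sources {
		eps, err := s.Endpoints(ctx)
		if err != nil {
			// Log and move on to the next source instead of aborting the pass.
			log.Errorf("source %s failed, skipping: %v", s.Name(), err)
			continue
		}
		all = append(all, eps...)
	}
	return all
}

func main() {
	sources := []source{
		fakeSource{name: "ingress"},
		fakeSource{name: "ambassador-host", err: errors.New("hosts could not be listed")},
		fakeSource{name: "service"},
	}
	// The failing source is skipped; the others still have their endpoints collected.
	fmt.Println(collectEndpoints(context.Background(), sources))
}
```

The merged version instead handles the failure one level up, in the control loop, as discussed later in this thread.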

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 7, 2022
@szuecs (Contributor) commented Sep 7, 2022

@olemarkus in general I don't like the idea of ignoring an error in one source. Consider the controller update scenario where you run current version A and update to version B. In A, all sources M, N, and K work, but in version B source M does not work for some reason.
I think the safe behavior is to fail fast in this case.

@olemarkus (Contributor, author)

Is that to increase the visibility of the failure?

In our case, where the ambassador host source failed, it broke all the other sources as well. This means it's very easy for one cluster tenant to break external-dns for everyone else. Of course, if the sources behave more nicely, such as ambassador hosts after #3008, then it makes sense to fail faster. Maybe.

@szuecs (Contributor) commented Sep 7, 2022

If you, for example, deploy a new version of external-dns, it should either work or break, but it should not delete records if one source cannot be fetched, which would likely break half of the cluster's ingress.
So my point is that it's more severe to break partially without notice than to fail completely.
Operators can alert on whether the control-loop pod is in the right state and will notice if it is crash-looping.
The logs will then show what caused the break, and they can get it fixed or manually stop using the source if that is applicable for their environment.

@olemarkus (Contributor, author)

I could agree if it did crash-loop, but it doesn't. It just logs and restarts the iteration.
It ends up on this line: https://github.com/kubernetes-sigs/external-dns/blob/master/controller/controller.go#L295

It may very well be that one should call os.Exit(1) if the controller returns an error all the way up there.
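
For context, a minimal sketch of the two behaviors being compared at that point in the control loop. `runOnce` is a hypothetical stub, and the real loop in controller/controller.go does considerably more:

```go
package main

import (
	"context"
	"errors"

	log "github.com/sirupsen/logrus"
)

// runOnce is a hypothetical stub for a single reconciliation pass; the real
// external-dns controller does much more here.
func runOnce(ctx context.Context) error {
	return errors.New("source failed to list endpoints")
}

func main() {
	ctx := context.Background()

	// Before this PR (simplified): the error is only logged and the loop
	// continues, so a persistent failure shows up in the logs but the pod
	// keeps running.
	if err := runOnce(ctx); err != nil {
		log.Errorf("reconciliation failed: %v", err)
	}

	// After this PR (simplified): the error is fatal. log.Fatalf logs the
	// message and then exits the process with status 1, so the pod
	// crash-loops and operators can alert on it.
	if err := runOnce(ctx); err != nil {
		log.Fatalf("reconciliation failed: %v", err)
	}
}
```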

@szuecs (Contributor) commented Sep 7, 2022

@olemarkus great catch that this one should be fatal!
@Raffo @njuettner can you check if you agree here (last comment and the linked source, error -> fatal)?

@olemarkus (Contributor, author)

How do we proceed with this?

@szuecs (Contributor) commented Sep 14, 2022

@olemarkus do you mean the change to fatal?

@olemarkus (Contributor, author)

Either this PR or change to fatal, yes.

@szuecs (Contributor) commented Nov 10, 2022

Fatal would be preferred

@olemarkus olemarkus changed the title If one source fails, continue to the next source instead of bailing If an error propagates all the way out, bail execution Nov 20, 2022
@olemarkus (Contributor, author)

Done.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 18, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 20, 2023
@olemarkus (Contributor, author)

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 23, 2023
@johngmyers (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 7, 2023
@szuecs (Contributor) commented May 8, 2023

/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: olemarkus, szuecs

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of the required OWNERS files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 8, 2023
@k8s-ci-robot k8s-ci-robot merged commit 128bcf8 into kubernetes-sigs:master May 8, 2023
@alfredkrohmer (Contributor)

This seems to be partially responsible for #3754. I think this change should be reverted. What's the point of a reconciling controller if it exits and crashes on every error it encounters?

@jbilliau-rcd

Agreed, this completely breaks the controller. So if any errors occur, the controller just blows up, after never doing that since its inception? Terrible idea.

@olemarkus (Contributor, author)

As mentioned in the first comment, my original PR was to log and carry on, but this approach was preferred instead. In the long run I do think this approach is better, but it may be a bit painful in the beginning, as some sources/providers may propagate errors for what is actually an acceptable path.

@jbilliau-rcd commented Sep 11, 2023

Sorry, I'm confused... how is this approach better? As far as I can tell, the controller is now completely broken and non-functional, as ANY issue with creating DNS records causes the entire thing to blow up. We have a ton of clusters with multiple teams and multiple apps on them. As it is today, if one batch of records is erroring, it doesn't affect anything else; other teams can still deploy Ingress objects and have their records created. Now, if ONE batch of records is incorrect, the entire controller ceases to function.

In an attempt to make an analogy, imagine if AWS functioned so that if someone created an S3 bucket with a bad bucket policy, no one could then create an S3 bucket because the S3 console would crash and become non-functional. How would that ever be a good idea?

@xrl commented Nov 15, 2023

I agree: fail fast is great for a single request, but for a long-lived controller it makes more sense to log and continue. There's no guarantee that all good resources will be serviced before the bad resource forces an exit; the controller is orphaning good resources.

I will be downgrading my external-dns and not participating in this flavor of execution. Please consider making the fail-fast behavior something that can be disabled via a command-line argument.

@ctwilleager-alio

I just wanted to add my two cents on this. We are using AWS EKS and Route53. I deployed a set of Kubernetes Ingress objects with annotations to create records and external-dns created them without issue. However, on the next run of updates, the external-dns pod kept crashing with a fatal error complaining that those same records already exist. I wasted hours troubleshooting this problem before finding this PR.

The external-dns controller should not even be complaining about records already existing, let alone completely crashing, especially when it created those records in the first place. This is a bad design choice and this PR should either be reverted entirely or reworked so that the controller does not explode when it finds records that it created.

I have pinned my external-dns chart version to v1.12.2 until this bug is fixed.
