Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore services in endpoint controller using consul.hashicorp.com/service-ignore #858

Merged
merged 37 commits into from
Dec 1, 2021

Conversation

t-eckert
Copy link
Contributor

@t-eckert t-eckert commented Nov 11, 2021

Fixes #592

Changes proposed in this PR:

  • If a service has the label consul.hashicorp.com/service-ignore: "true", it will not be registered by the endpoints controller. If is has been registered before, it will be deregistered.
  • If the user does not have the correct number of services registered, they will get a log instructing them to add the "connect ignore" label to services which should not be routed.
  • Some comments have been added where it took me time to understand what processes were doing.
  • mapAddresses function created as a refactor. I spun off reading this "subprocess" in the Reconcile method and got sidetracked. I think it makes sense to abstract over it. Feel free to disagree.

How I've tested this PR:

  • Unit tests.
  • Manual test (detailed below).

How I expect reviewers to test this PR:

  • Running the unit tests.
  • The manual testing setup I have been using is here
    To test the behavior run the install script which will install Consul onto the cluster; create two services, one with the connect-ignore label; and create a pod which will come online.

Here is a demo of the change, initially the services are deployed without the connect-ignore label on the "headless" service. The pod crashes as consul-connect-inject-init cannot register the service. When the label is added to the "headless" service, the pod comes up in a healthy state:

Screen.Recording.2021-11-19.at.1.02.38.PM.mov

Checklist:

  • Tests added
  • CHANGELOG entry added

    HashiCorp engineers only, community PRs should not add a changelog entry.
    Entries should use present tense (e.g. Add support for...)

@t-eckert t-eckert changed the title Ignore services in endpoint controller using consul.hashicorp.com/connect-ignore Ignore services in endpoint controller using consul.hashicorp.com/service-ignore Nov 12, 2021
@t-eckert t-eckert force-pushed the endpoints-controller-connect-ignore branch from 1745150 to 7a21d1f Compare November 15, 2021 18:14
@t-eckert t-eckert marked this pull request as ready for review November 16, 2021 15:58
Copy link
Contributor

@thisisnotashwin thisisnotashwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @t-eckert. I took a quick look and the approach looks sound. Left some comment inline.

control-plane/connect-inject/endpoints_controller_test.go Outdated Show resolved Hide resolved
control-plane/connect-inject/endpoints_controller_test.go Outdated Show resolved Hide resolved
control-plane/connect-inject/endpoints_controller.go Outdated Show resolved Hide resolved
@t-eckert t-eckert force-pushed the endpoints-controller-connect-ignore branch from 2ae5ec2 to 970ff02 Compare November 19, 2021 17:26
CHANGELOG.md Outdated Show resolved Hide resolved
control-plane/connect-inject/endpoints_controller.go Outdated Show resolved Hide resolved
control-plane/connect-inject/endpoints_controller.go Outdated Show resolved Hide resolved
control-plane/connect-inject/endpoints_controller_test.go Outdated Show resolved Hide resolved
control-plane/connect-inject/endpoints_controller_test.go Outdated Show resolved Hide resolved
control-plane/connect-inject/endpoints_controller_test.go Outdated Show resolved Hide resolved
control-plane/subcommand/connect-init/command.go Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
Copy link
Contributor

@kschoche kschoche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @t-eckert, great work on this one.
I left a couple comments, 👍🏽

@t-eckert t-eckert force-pushed the endpoints-controller-connect-ignore branch from 8d55c60 to d978d20 Compare December 1, 2021 17:30
Copy link
Contributor

@ishustava ishustava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great Thomas!

@nflaig
Copy link

nflaig commented Dec 16, 2021

@t-eckert is there a specific reason why this has to be set as a label? most/all helm charts support setting custom annotations through values but labels are sometimes not supported. Everything else in consul also uses annotations this seems to be the only case where a label is used instead.

@@ -179,6 +179,12 @@ func (c *Command) Run(args []string) int {
c.logger.Info("Check to ensure a Kubernetes service has been created for this application." +
" If your pod is not starting also check the connect-inject deployment logs.")
}
if len(serviceList) > 2 {
c.logger.Error("There are multiple Consul services registered for this pod when there must only be one." +
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@t-eckert I am running multiple stateful sets in kubernets (kafka, zookeeper, elasticsearch, redis) and all of those have two services normal + headless but I dont get this error? shouldnt this fail the pod scheduling?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes it should be failing scheduling if both the headless and non-headless services select the same pods. Maybe open up an issue about this.

@lkysow
Copy link
Member

lkysow commented Dec 17, 2021

is there a specific reason why this has to be set as a label?

Because we're reacting on changes to the Endpoint object which gets its label from the Service but it doesn't get annotations from the Service.

Note that if you have ACLs enabled we're planning on automatically ignoring all Services that don't have the same name as the serviceAccountName. In which case you won't need to label the other services with ignore.

@nflaig
Copy link

nflaig commented Dec 17, 2021

Note that if you have ACLs enabled we're planning on automatically ignoring all Services that don't have the same name as the serviceAccountName. In which case you won't need to label the other services with ignore.

It that a good idea? If you run stateful sets the headless service is used by most client-to-server connections. Or is there a solution for this? Maybe you could first test that with the most common stateful sets like kafka, zookeeper, etc.

What is recommendation, should stateful sets be used with the service mesh? I feel like the transparent proxy design does not consider stateful sets correctly, for example the issues mentioned in #932

@lkysow
Copy link
Member

lkysow commented Dec 20, 2021

It that a good idea?

Currently Consul only supports one service pointing at one set of pods. If you have two services that point at the same set of pods then none of the pods will start. In addition, when ACLs are enabled the serviceAccountName of the pod must match the service name otherwise the pods won't start. So our plan to ignore services that don't match the serviceAccountName simply means that you won't get in a broken situation. Those services would never have worked before, they would have caused the pods to not come up.

RE statefulsets we definitely want to support them. We need to triage and get to your issue.

@nflaig
Copy link

nflaig commented Dec 20, 2021

statefulsets we definitely want to support them. We need to triage and get to your issue

ok thanks, that is definitely good to hear and of course this is not an easy problem to solve. The work you guys are doing is great and I really appreciate it

rrondeau pushed a commit to rrondeau/consul-k8s that referenced this pull request Dec 21, 2021
* Pass either build URL or build number to terraform resources for GCP/AWS/Azure. On GKE, we can't pass build URL because / is not a valid character in labels
* Remove scripting logic that makes tests fail early (we don't need it in nightly acceptance tests)
* Fix metrics merging test on GKE. We are running tests there with enterprise license enabled, but since we're setting the consul image to OSS one, we should also unset enterprise license values so that we don't try to run the enterprise license commands against OSS binary, where they don't exist
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

If two Services select the same pods then pods won't start
6 participants