Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: drop namespace from zeebe advertisedHost and initialContactPoints #2170

Merged
merged 3 commits into from
Aug 7, 2024

Conversation

jessesimpson36
Copy link
Contributor

Which problem does the PR fix?

closes #1754

What's in this PR?

This change drops the namespace and svc from the initialcontactpoints and advertised host.

The idea behind this change is that since most kubernetes deployments will do this in their dns resolv.conf

search default.svc.cluster.local svc.cluster.local cluster.local

That the first and second dns search will result in a failure since the namespace and svc will be duplicated.

Checklist

Please make sure to follow our Contributing Guide.

Before opening the PR:

  • In the repo's root dir, run make go.update-golden-only.
  • There is no other open pull request for the same update/change.
  • Tests for charts are added (if needed).
  • In-repo documentation are updated (if needed).

After opening the PR:

  • Did you sign our CLA (Contributor License Agreement)? It will show once you open the PR.
  • Did all checks/tests pass in the PR?

@jessesimpson36 jessesimpson36 requested a review from megglos August 2, 2024 15:28
@jessesimpson36 jessesimpson36 marked this pull request as ready for review August 2, 2024 15:28
Copy link
Member

@aabouzaid aabouzaid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you 👍

@megglos megglos requested review from lenaschoenburg and removed request for megglos August 2, 2024 17:23
Copy link
Member

@lenaschoenburg lenaschoenburg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw your comment that this might hurt deployments spread over multiple namespaces. Could this affect multi-region?

Other than that this change seems fine to me, it'll result in fewer failing DNS resolutions for the most common case.

@jessesimpson36
Copy link
Contributor Author

jessesimpson36 commented Aug 6, 2024

I am concerned about multi-region deployments breaking as result of this patch. However, I don't think that concern is valid because in multi-region deployments, those variables are explicitly set.

https://github.com/camunda/c8-multi-region/blob/main/aws/dual-region/kubernetes/camunda-values.yml#L61-L62

I think something could break from this. perhaps in a patch version. but in that scenario, we'd just revert the patch and release again. it's not a huge deal. But personally I don't think DNS failures are a huge deal either. If it's just me deciding, I'd say let's close the PR and not merge. I'm looking for someone to tell me they explicitly want it.

@jessesimpson36 jessesimpson36 added the cycle/alpha4 Tasks will be done in alpha4 cycle label Aug 6, 2024
@npepinpe
Copy link
Member

npepinpe commented Aug 7, 2024

Agreed on the multi-region side, we would be overwriting this explicitly anyway, so I wouldn't worry too much there.

About existing deployments, I'm not sure this would break anything. Do you have a specific case in mind?

As mentioned in the issue, the main benefit is removing a source of noise which may hide real DNS errors in our metrics, and also avoid doing pointless work. The fix worked by doing that. However, the impact is low, so if it were really to break anything, then it may not be worth it depending on what it breaks.

As I said, it's not clear to me what would break. Say we publish a patch with the above - it gets rolled out progressively. The DNS name should be resolved regardless, assuming a standard K8S installation right?

The nodes are matched by their member IDs - so as long as we don't have two nodes with different advertised hosts but same IDs running at the same time, it should be fine. This is something we can quickly test however.

@jessesimpson36
Copy link
Contributor Author

Do you have a specific case in mind?

Nope. it's just a risk.

the main benefit is removing a source of noise which may hide real DNS errors in our metrics, and also avoid doing pointless work

Good enough for me. Lets merge.

The nodes are matched by their member IDs - so as long as we don't have two nodes with different advertised hosts but same IDs running at the same time, it should be fine. This is something we can quickly test however.

I'm not concerned about this scenario. I feel confident in the node + member id matching.

@jessesimpson36 jessesimpson36 merged commit 564581d into main Aug 7, 2024
23 of 24 checks passed
@jessesimpson36 jessesimpson36 deleted the 1754-dns-errors-in-zeebe branch August 7, 2024 13:32
@jessesimpson36 jessesimpson36 removed the cycle/alpha4 Tasks will be done in alpha4 cycle label Aug 7, 2024
@jessesimpson36 jessesimpson36 self-assigned this Aug 7, 2024
@jessesimpson36 jessesimpson36 added the version:8.6.0-alpha4 Label that represents issues released on version 8.6.0-alpha4 label Aug 7, 2024
@maxdanilov
Copy link
Member

maxdanilov commented Aug 12, 2024

@jessesimpson36 while investigating this issue https://camunda.slack.com/archives/C0486FPV076/p1723476618760859 I've discovered this PR and this may be the culprit.

While you're right that in dual region we overwrite/specify ZEEBE_BROKER_CLUSTER_INITIALCONTACTPOINTS and this PR should have no effect, it's not the case for ZEEBE_BROKER_NETWORK_ADVERTISEDHOST (which is not overridden and relies upon the default value the Helm chart sets). Keeping in mind that we heavily rely on the DNS resolution to work via *.svc.cluster.local DNS names cross-cluster (ref) I have a suspicion that this PR was the reason the multi-region tests are broken now for Helm charts of snapshot and 10.3.0 versions (they are working in 10.2.1, 9.3.9 and 8.3.16 though where this change is not present - example)

@maxdanilov
Copy link
Member

maxdanilov commented Aug 12, 2024

verified my hypothesis in camunda/c8-multi-region@14659bc, that actually fixed the dual-region test issue for 10.3.0 /* haven't tried with the snapshot yet */ (workflow run)

@jessesimpson36 @npepinpe @lenaschoenburg what do you think about this PR retrospectively, is this something you want to keep? otherwise the customers for multi-region setup will have slight complication of the setup by having to adjust an additional env var to make C8 work via the Helm chart.

@jessesimpson36
Copy link
Contributor Author

jessesimpson36 commented Aug 13, 2024

I initially thought the problem @maxdanilov posed would only affect test environments where multi-region was enabled (one cluster with multiple namespaces), but you're right that it would affect more than that because they all rely on namespaces in the dns request for the dns forwarding to work between cluster1 and cluster2.

I think we need to revert this and release a 10.3.1 version of the helm chart. The zeebe problem that prompted us to make this patch doesn't seem worth the trouble of breaking multi-region customers (especially since we have to take into account customers using multiregion without the c8-multi-region repo).

@npepinpe @lenaschoenburg , let me know if you have objections.

@npepinpe
Copy link
Member

As this breaks existing deployments, lets revert it. We can look into software based solutions for our issue.

That said, FYI, I believe it doesn't make sense for deployments to specify host (which is the binding host), but instead just specify the advertised host (i.e. where other nodes should reply to).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
version:8.6.0-alpha4 Label that represents issues released on version 8.6.0-alpha4
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ISSUE] Zeebe shows DNS errors in benchmarks
5 participants