Fix failover strategy with 3 or more clusters #1705

Merged
merged 1 commit into k8gb-io:master from fix/multiclusterbug on Aug 29, 2024

Conversation

@abaguas (Collaborator) commented on Aug 12, 2024

In a setup with 3 or more clusters, the failover strategy does not return the correct targets if the DNS query hits a non-primary region.

How to reproduce

Create a 3 cluster setup that contains a GSLB with the failover strategy and `eu` as the primary cluster. The podinfo app is up on all clusters:

```
K8GB_LOCAL_VERSION=test CLUSTERS_NUMBER=3 make deploy-full-local-setup
```

Retrieve the local targets to see which IP addresses are exposed by each cluster (`eu`, `us` and `cz` respectively):

```
➜  ~ dig +short -p 5053 @localhost localtargets-failover.cloud.example.com
172.19.0.6
172.19.0.7
➜  ~ dig +short -p 5054 @localhost localtargets-failover.cloud.example.com
172.19.0.10
172.19.0.11
➜  ~ dig +short -p 5055 @localhost localtargets-failover.cloud.example.com
172.19.0.14
172.19.0.15
```

Query the domain `failover.cloud.example.com` on each cluster:

```
➜  ~ dig +short -p 5053 @localhost failover.cloud.example.com
172.19.0.6
172.19.0.7
➜  ~ dig +short -p 5054 @localhost failover.cloud.example.com
172.19.0.6
172.19.0.7
172.19.0.14
172.19.0.15
➜  ~ dig +short -p 5055 @localhost failover.cloud.example.com
172.19.0.6
172.19.0.7
172.19.0.10
172.19.0.11
```

We would expect all clusters to return 172.19.0.6 and 172.19.0.7; however, the non-primary clusters also return the IP addresses of the other non-primary cluster:

* `eu` returns `eu`
* `us` returns `eu` and `cz`
* `cz` returns `eu` and `us`

This happens because a non-primary cluster returns the IP addresses of all other clusters, which is correct in a 2 cluster setup but not in a 3 cluster setup.

Fix

To fix the issue the logic for non-primary clusters needs to be adapted. If the app is healthy on the primary cluster then the targets of that cluster must be returned. If the application is unhealthy on the primary cluster then the addresses of all healthy clusters should be returned.
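For illustration, here is a minimal Go sketch of that selection logic, using a hypothetical `cluster` type and `failoverTargets` function rather than the actual k8gb implementation:

```go
package main

import "fmt"

// cluster is a hypothetical representation of one member cluster: its geo tag,
// whether the application is healthy there, and the addresses it exposes.
type cluster struct {
	geoTag  string
	healthy bool
	targets []string
}

// failoverTargets returns the primary cluster's targets while the application
// is healthy there; otherwise it falls back to the targets of every healthy
// cluster, regardless of how many clusters participate.
func failoverTargets(primaryGeoTag string, clusters []cluster) []string {
	for _, c := range clusters {
		if c.geoTag == primaryGeoTag && c.healthy {
			return c.targets
		}
	}
	var fallback []string
	for _, c := range clusters {
		if c.healthy {
			fallback = append(fallback, c.targets...)
		}
	}
	return fallback
}

func main() {
	clusters := []cluster{
		{geoTag: "eu", healthy: true, targets: []string{"172.19.0.6", "172.19.0.7"}},
		{geoTag: "us", healthy: true, targets: []string{"172.19.0.10", "172.19.0.11"}},
		{geoTag: "cz", healthy: true, targets: []string{"172.19.0.14", "172.19.0.15"}},
	}
	fmt.Println(failoverTargets("eu", clusters)) // [172.19.0.6 172.19.0.7]

	clusters[0].healthy = false // simulate scaling podinfo down on eu
	fmt.Println(failoverTargets("eu", clusters)) // [172.19.0.10 172.19.0.11 172.19.0.14 172.19.0.15]
}
```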

After the change:

```
➜  ~ dig +short -p 5053 @localhost failover.cloud.example.com
172.19.0.6
172.19.0.7
➜  ~ dig +short -p 5054 @localhost failover.cloud.example.com
172.19.0.6
172.19.0.7
➜  ~ dig +short -p 5055 @localhost failover.cloud.example.com
172.19.0.6
172.19.0.7
```

After scaling down podinfo on the primary cluster (`eu`):

```
➜  ~ k scale deploy frontend-podinfo --replicas=0 -n test-gslb --context k3d-test-gslb1
deployment.apps/frontend-podinfo scaled
```

```
➜  ~ dig +short -p 5053 @localhost failover.cloud.example.com
172.19.0.10
172.19.0.11
172.19.0.14
172.19.0.15
➜  ~ dig +short -p 5054 @localhost failover.cloud.example.com
172.19.0.10
172.19.0.11
172.19.0.14
172.19.0.15
➜  ~ dig +short -p 5055 @localhost failover.cloud.example.com
172.19.0.14
172.19.0.15
172.19.0.10
172.19.0.11
```

Tests

Two unit tests needed to be adapted since they did not have the correct geo tags.
The tests were overwriting the `ClusterGeoTag` to `za` and the GSLB's `PrimaryGeoTag` to `eu`, which resulted in evaluating the targets on a non-primary cluster (`za` != `eu`). However, the external targets only contain records from `us-east-1` (the value of `ExtClustersGeoTags`). Interestingly, albeit unintentionally, this actually simulates a 3 cluster setup in a non-conventional way:

* `za` -> localtargets up
* `eu` -> down
* `us-east-1` -> external targets up

Due to the bug described above, the addresses of `us-east-1` (all external targets) were being returned and the test was passing. With the fix, the correct output is the targets of both `za` and `us-east-1` (local targets + external targets), which is why the tests failed.
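Reusing the hypothetical `cluster` type and `failoverTargets` function from the sketch above, the scenario the old test accidentally encoded looks roughly like this (the addresses are placeholders, not the test's actual values):

```go
// za exposes healthy local targets, the primary eu has no healthy targets, and
// the external cluster us-east-1 is healthy (placeholder addresses only).
clusters := []cluster{
	{geoTag: "za", healthy: true, targets: []string{"10.0.1.38", "10.0.1.39"}},
	{geoTag: "eu", healthy: false},
	{geoTag: "us-east-1", healthy: true, targets: []string{"10.1.0.1", "10.1.0.2"}},
}
// Before the fix only the us-east-1 (external) targets came back; with the fix
// the fallback contains the za and us-east-1 targets together.
fmt.Println(failoverTargets("eu", clusters))
```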

To fix this, the tags were updated so that only two clusters remain, `us-east-1` and `us-west-1`, since this was the intended scenario for the test.

@abaguas marked this pull request as draft on August 12, 2024 20:34
@abaguas force-pushed the fix/multiclusterbug branch from 30df86d to a28696f on August 12, 2024 21:19
@abaguas marked this pull request as ready for review on August 13, 2024 05:41
@ytsarev (Member) left a comment:

Cool stuff 👍

@ytsarev merged commit d98d51e into k8gb-io:master on Aug 29, 2024
15 checks passed
@abaguas deleted the fix/multiclusterbug branch on November 1, 2024 10:56