Fix failover strategy with 3 or more clusters #1705
Merged
In a setup with 3 or more clusters, the `failover` strategy does not return the correct targets if the DNS query hits a non-primary region.
How to reproduce
Create a 3-cluster setup that contains a GSLB with the failover strategy and `eu` as the primary cluster. The podinfo app is up on all clusters:
Retrieve the local targets to see which IP addresses are exposed by each cluster (`eu`, `us` and `cz` respectively):
Query the domain `failover.cloud.example` on each cluster:
We would expect all clusters to return 172.19.0.6 and 172.19.0.7; however, the non-primary clusters also return the IP addresses of the other non-primary cluster:
- `eu` returns `eu`
- `us` returns `eu` and `cz`
- `cz` returns `eu` and `us`
This happens because a non-primary cluster returns the IP addresses of all other clusters, which is correct in a 2-cluster setup but not in a setup with 3 or more clusters.
Fix
To fix the issue, the logic for non-primary clusters needs to be adapted. If the app is healthy on the primary cluster, then the targets of that cluster must be returned. If the app is unhealthy on the primary cluster, then the addresses of all healthy clusters should be returned.
After the change:
After scaling down podinfo on the primary cluster (`eu`):
Tests
Two unit tests needed to be adapted since they did not have the correct geo tags.
The tests were overwriting the `ClusterGeoTag` to `za` and the GSLB's `PrimaryGeoTag` to `eu`, which resulted in evaluating the targets on a non-primary cluster (`za` != `eu`). However, the external targets only contain records from `us-east-1` (the value of `ExtClustersGeoTags`). Interestingly, this unintentionally simulates a 3-cluster setup in an unconventional way:
- `za` -> local targets up
- `eu` -> down
- `us-east-1` -> external targets up
Due to the bug described above, the addresses of `us-east-1` (all external targets) were being returned and the test was passing. With the fix, the correct output is the targets of both `za` and `us-east-1` (local targets + external targets), which is why the tests failed.
To fix this, the tags were updated to have only two clusters, `us-east-1` and `us-west-1`, since this was the intended scenario for the test.