
[bitnami/redis] Redis and redis-sentinel mix hostnames between config on simultaneous STS rollout restart when running multiple redis instances on single k8s cluster #10016

Closed
JDKnobloch opened this issue May 3, 2022 · 21 comments

@JDKnobloch
Contributor

JDKnobloch commented May 3, 2022

Name and Version

bitnami/redis 16.8.7

What steps will reproduce the bug?

  1. Within k8s deployment (occurred in both v1.22.6 and v1.17.2), set up GitOps (Argo v2.3.3 used in this case).
    a. GitOps is likely not necessary but used in this case for visibility and replicability from our production environment.
  2. Apply bitnami/redis helm charts.
    a. Apply two or more redis instances w/ sentinel enabled and auth disabled (base and sentinel)
  3. Once all pods have started successfully, restart the StatefulSet of each instance using the kubectl rollout restart statefulset redis-node command
    a. This should be done as quickly as possible - we want them all restarting at the same time.
  4. All pods will restart, and in the process the configs will mix hostnames. This can be seen in both the redis and sentinel pod logs, and in the config located at opt/bitnami/redis-sentinel/etc/sentinel.config
    a. It may take multiple restarts before the configs mix - typically it happens within the first two restarts, and it may be timing related.

Note: These redis instances can be installed in separate namespaces and will still experience this same issue.
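
For anyone reproducing without Argo, a minimal sketch might look like the following (a hedged sketch: the release names, namespaces, and StatefulSet names are assumptions following the chart's <release>-node convention seen in the logs below):

# Sketch: deploy three sentinel-enabled releases, then restart their StatefulSets simultaneously.
helm repo add bitnami https://charts.bitnami.com/bitnami
for ns in qa ci np; do
  kubectl create namespace "$ns"
  helm install "${ns}-redis" bitnami/redis --version 16.8.7 -n "$ns" \
    --set auth.enabled=false --set auth.sentinel=false --set sentinel.enabled=true
done

# Once all pods are Ready, restart every StatefulSet as close together as possible.
for ns in qa ci np; do
  kubectl -n "$ns" rollout restart statefulset "${ns}-redis-node" &
done
wait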

Are you using any custom parameters or values?

The only values required to be set are:

auth:
  enabled: false
  sentinel: false
sentinel:
  enabled: true

What is the expected behavior?

Multiple instances should perform StatefulSet restarts as expected and as they do when restarted alone.

What do you see instead?

As the pods restart, the configuration of both redis and redis-sentinel can become mixed, containing hostnames from any other redis replication clusters that are restarted at the same time with auth disabled. The issue occurs even though the configs use hostnames and the redis instances are in separate namespaces.

It may take two StatefulSet rollout restarts for this to occur, allowing pods to fully regenerate between restarts.

Additional information

This first occurred within our nonprod env where we have 4 redis-sentinel replication clusters running. We were simultaneously bumping versions of some of the instances, triggering a restart that caused this issue. Redis nodes in separate namespaces were stuck communicating with each other - while still appearing 'healthy' from a cluster standpoint.

After discovering this issue and seeing its potential scope, we elected to migrate all redis-sentinel replication instances on our clusters to explicitly defined, exclusive ports - necessary for both redis and redis-sentinel - to circumvent the vulnerability.

Further testing allowed me to pare our values file down to the minimal one defined above - I have tested this issue rigorously and verified the results multiple times.

This issue addresses topics that were discussed in #1682 and #5418

I ran a minikube cluster locally for testing.

Below are logs from an example test:

Argo Application(s) applied:

---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  finalizers:
    - resources-finalizer.argocd.argoproj.io
  name: qa-redis
spec:
  destination:
    namespace: qa
    server: 'https://kubernetes.default.svc'
  project: default
  revisionHistoryLimit: 2
  source:
    chart: redis
    helm:
      releaseName: qa-redis
      values: |+
        auth:
          enabled: false
          sentinel: false
        sentinel:
          enabled: true
    repoURL: 'https://charts.bitnami.com/bitnami'
    targetRevision: 16.8.7
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - ApplyOutOfSyncOnly=true

Two more applications were deployed - one with the names updated to ci-redis, the other to np-redis - each in its own ci/np namespace.

After deploying the three applications into the k8s cluster, I allowed all nodes to fully launch, then ran kubectl rollout restart for all StatefulSets in quick succession. On this test it took two restarts before the issue occurred - all nodes had fully launched prior to both restarts.

As the nodes restart, cross-pollination can be observed in the logs, and by the end the configurations are thoroughly mixed.

Here are logs and config from a replica node on the cluster - np-redis-node-1.
Redis container logs:

19:23:25.26 INFO  ==> about to run the command: timeout 220 redis-cli -h np-redis.np.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
19:23:25.35 INFO  ==> about to run the command: timeout 220 redis-cli -h np-redis.np.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
19:23:25.38 INFO  ==> Current master: REDIS_SENTINEL_INFO=(ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local,6379)
19:23:25.39 INFO  ==> Configuring the node as replica
1:C 03 May 2022 19:23:25.404 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 03 May 2022 19:23:25.404 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 03 May 2022 19:23:25.404 # Configuration loaded
1:S 03 May 2022 19:23:25.405 * monotonic clock: POSIX clock_gettime
1:S 03 May 2022 19:23:25.406 * Running mode=standalone, port=6379.
1:S 03 May 2022 19:23:25.406 # Server initialized
1:S 03 May 2022 19:23:25.406 * Reading RDB preamble from AOF file...
1:S 03 May 2022 19:23:25.406 * Loading RDB produced by version 6.2.6
1:S 03 May 2022 19:23:25.406 * RDB age 349 seconds
1:S 03 May 2022 19:23:25.406 * RDB memory usage when created 1.79 Mb
1:S 03 May 2022 19:23:25.406 * RDB has an AOF tail
1:S 03 May 2022 19:23:25.406 # Done loading RDB, keys loaded: 0, keys expired: 0.
1:S 03 May 2022 19:23:25.406 * Reading the remaining AOF tail...
1:S 03 May 2022 19:23:25.406 * DB loaded from append only file: 0.000 seconds
1:S 03 May 2022 19:23:25.406 * Ready to accept connections
1:S 03 May 2022 19:23:25.407 * Connecting to MASTER ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local:6379
1:S 03 May 2022 19:23:25.408 * MASTER <-> REPLICA sync started
1:S 03 May 2022 19:23:25.408 * Non blocking connect for SYNC fired the event.
1:S 03 May 2022 19:23:25.408 * Master replied to PING, replication can continue...
1:S 03 May 2022 19:23:25.408 * Partial resynchronization not possible (no cached master)
1:S 03 May 2022 19:23:25.409 * Full resync from master: 2bde88f8684c77e8024ea51b38ef0d214cc0e530:237454
1:S 03 May 2022 19:23:25.656 * MASTER <-> REPLICA sync: receiving 178 bytes from master to disk
1:S 03 May 2022 19:23:25.656 * MASTER <-> REPLICA sync: Flushing old data
1:S 03 May 2022 19:23:25.757 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 03 May 2022 19:23:26.532 * Loading RDB produced by version 6.2.6
1:S 03 May 2022 19:23:26.532 * RDB age 1 seconds
1:S 03 May 2022 19:23:26.532 * RDB memory usage when created 2.68 Mb
1:S 03 May 2022 19:23:26.532 # Done loading RDB, keys loaded: 0, keys expired: 0.
1:S 03 May 2022 19:23:26.532 * MASTER <-> REPLICA sync: Finished with success
1:S 03 May 2022 19:23:26.532 * Background append only file rewriting started by pid 34
1:S 03 May 2022 19:23:27.284 * AOF rewrite child asks to stop sending diffs.
34:C 03 May 2022 19:23:27.284 * Parent agreed to stop sending diffs. Finalizing AOF...
34:C 03 May 2022 19:23:27.284 * Concatenating 0.00 MB of AOF diff received from parent.
34:C 03 May 2022 19:23:27.436 * SYNC append only file rewrite performed
34:C 03 May 2022 19:23:27.442 * AOF rewrite: 0 MB of memory used by copy-on-write
1:S 03 May 2022 19:23:27.444 * Background AOF rewrite terminated with success
1:S 03 May 2022 19:23:27.444 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
1:S 03 May 2022 19:23:27.444 * Background AOF rewrite finished successfully

Sentinel container logs:

19:23:31.26 INFO  ==> about to run the command: redis-cli -h np-redis.np.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local
6379
19:23:32.77 INFO  ==> about to run the command: redis-cli -h np-redis.np.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
19:23:33.18 INFO  ==> printing REDIS_SENTINEL_INFO=(ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local,6379)
1:X 03 May 2022 19:23:33.402 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:X 03 May 2022 19:23:33.402 # Redis version=6.2.6, bits=64, commit=00000000, modified=0, pid=1, just started
1:X 03 May 2022 19:23:33.402 # Configuration loaded
1:X 03 May 2022 19:23:33.403 * monotonic clock: POSIX clock_gettime
1:X 03 May 2022 19:23:33.404 * Running mode=sentinel, port=26379.
1:X 03 May 2022 19:23:33.404 # Sentinel ID is 859b1464b451d86bed47ad351716ebfc8278db1b
1:X 03 May 2022 19:23:33.404 # +monitor master mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379 quorum 2
1:X 03 May 2022 19:23:33.408 * +slave slave ci-redis-node-0.ci-redis-headless.ci.svc.cluster.local:6379 ci-redis-node-0.ci-redis-headless.ci.svc.cluster.local 6379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:23:33.887 * +slave slave ci-redis-node-1.ci-redis-headless.ci.svc.cluster.local:6379 ci-redis-node-1.ci-redis-headless.ci.svc.cluster.local 6379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:23:34.495 * +slave slave qa-redis-node-1.qa-redis-headless.qa.svc.cluster.local:6379 qa-redis-node-1.qa-redis-headless.qa.svc.cluster.local 6379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:23:34.939 * +slave slave qa-redis-node-0.qa-redis-headless.qa.svc.cluster.local:6379 qa-redis-node-0.qa-redis-headless.qa.svc.cluster.local 6379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:23:35.353 * +slave slave qa-redis-node-2.qa-redis-headless.qa.svc.cluster.local:6379 qa-redis-node-2.qa-redis-headless.qa.svc.cluster.local 6379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:23:35.703 * +sentinel sentinel 225901f21200955c907690cf889082d7040d262a ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 26379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:23:36.477 # +new-epoch 2
1:X 03 May 2022 19:23:36.479 * +sentinel sentinel 477647859e5e13a8031bcf89f2c4ee65bfa5013c qa-redis-node-1.qa-redis-headless.qa.svc.cluster.local 26379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:23:36.920 * +sentinel sentinel d095c031c93fc8863c8da37e7863f56c321236ab qa-redis-node-2.qa-redis-headless.qa.svc.cluster.local 26379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:23:37.454 * +sentinel sentinel 99953b14a2e7ac560bc8eb18bfbd35b04ab50ae3 ci-redis-node-0.ci-redis-headless.ci.svc.cluster.local 26379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:23:38.062 * +sentinel sentinel ae21e437610dbcc0a1e4ec2fea0f059f85ca3aae ci-redis-node-1.ci-redis-headless.ci.svc.cluster.local 26379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:23:39.138 * +sentinel sentinel d09b0415f0a74561756769ce826ce26edaf1c134 qa-redis-node-0.qa-redis-headless.qa.svc.cluster.local 26379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:23:39.988 # +tilt #tilt mode entered
1:X 03 May 2022 19:24:10.012 # -tilt #tilt mode exited
1:X 03 May 2022 19:24:27.661 * +reboot slave qa-redis-node-0.qa-redis-headless.qa.svc.cluster.local:6379 qa-redis-node-0.qa-redis-headless.qa.svc.cluster.local 6379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:24:37.539 # +sdown sentinel 99953b14a2e7ac560bc8eb18bfbd35b04ab50ae3 ci-redis-node-0.ci-redis-headless.ci.svc.cluster.local 26379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:24:40.513 * +sentinel-address-switch master mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379 ip qa-redis-node-0.qa-redis-headless.qa.svc.cluster.local port 26379 for d09b0415f0a74561756769ce826ce26edaf1c134
1:X 03 May 2022 19:24:42.108 * +sentinel-address-switch master mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379 ip ci-redis-node-0.ci-redis-headless.ci.svc.cluster.local port 26379 for 99953b14a2e7ac560bc8eb18bfbd35b04ab50ae3
1:X 03 May 2022 19:24:48.170 * +reboot slave np-redis-node-0.np-redis-headless.np.svc.cluster.local:6379 np-redis-node-0.np-redis-headless.np.svc.cluster.local 6379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:24:55.968 * +sentinel-address-switch master mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379 ip np-redis-node-0.np-redis-headless.np.svc.cluster.local port 26379 for d60e97c65451987f45e8ec25b11bcb72eaab9090
1:X 03 May 2022 19:25:03.353 * +sentinel sentinel 99953b14a2e7ac560bc8eb18bfbd35b04ab50ae3 ci-redis-node-0.ci-redis-headless.ci.svc.cluster.local 26379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:25:04.064 * +sentinel sentinel d09b0415f0a74561756769ce826ce26edaf1c134 qa-redis-node-0.qa-redis-headless.qa.svc.cluster.local 26379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379
1:X 03 May 2022 19:25:17.886 * +sentinel sentinel d60e97c65451987f45e8ec25b11bcb72eaab9090 np-redis-node-0.np-redis-headless.np.svc.cluster.local 26379 @ mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379

Redis config from opt/bitnami/redis/etc/redis.conf:

# User-supplied common configuration:
# Enable AOF https://redis.io/topics/persistence#append-only-file
appendonly yes
# Disable RDB persistence, AOF persistence already enabled.
save ""
# End of common configuration

Replica config from opt/bitnami/redis/etc/replica.conf:

dir /data
slave-read-only yes
# User-supplied replica configuration:
rename-command FLUSHDB ""
rename-command FLUSHALL ""
# End of replica configuration
replica-announce-port 6379
replica-announce-ip np-redis-node-1.np-redis-headless.np.svc.cluster.local

Sentinel config from opt/bitnami/redis-sentinel/etc/sentinel.config:

dir "/tmp"
port 26379
sentinel monitor mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 6379 2
sentinel down-after-milliseconds mymaster 60000
sentinel failover-timeout mymaster 18000

# User-supplied sentinel configuration:
# End of sentinel configuration
sentinel myid 859b1464b451d86bed47ad351716ebfc8278db1b

sentinel known-sentinel mymaster qa-redis-node-2.qa-redis-headless.qa.svc.cluster.local 26379 d095c031c93fc8863c8da37e7863f56c321236ab

sentinel known-replica mymaster np-redis-node-0.np-redis-headless.np.svc.cluster.local 6379

sentinel known-replica mymaster np-redis-node-1.np-redis-headless.np.svc.cluster.local 6379

sentinel known-sentinel mymaster np-redis-node-0.np-redis-headless.np.svc.cluster.local 26379 d60e97c65451987f45e8ec25b11bcb72eaab9090

sentinel known-replica mymaster ci-redis-node-0.ci-redis-headless.ci.svc.cluster.local 6379

sentinel announce-hostnames yes
sentinel resolve-hostnames yes
sentinel announce-port 26379
sentinel announce-ip "np-redis-node-1.np-redis-headless.np.svc.cluster.local"
# Generated by CONFIG REWRITE
protected-mode no
user default on nopass ~* &* +@all
sentinel config-epoch mymaster 2
sentinel leader-epoch mymaster 0
sentinel known-replica mymaster ci-redis-node-1.ci-redis-headless.ci.svc.cluster.local 6379
sentinel current-epoch 2
sentinel known-replica mymaster qa-redis-node-1.qa-redis-headless.qa.svc.cluster.local 6379
sentinel known-replica mymaster qa-redis-node-0.qa-redis-headless.qa.svc.cluster.local 6379
sentinel known-replica mymaster qa-redis-node-2.qa-redis-headless.qa.svc.cluster.local 6379
sentinel known-replica mymaster np-redis-node-2.np-redis-headless.np.svc.cluster.local 6379
sentinel known-sentinel mymaster ci-redis-node-1.ci-redis-headless.ci.svc.cluster.local 26379 ae21e437610dbcc0a1e4ec2fea0f059f85ca3aae
sentinel known-sentinel mymaster ci-redis-node-0.ci-redis-headless.ci.svc.cluster.local 26379 99953b14a2e7ac560bc8eb18bfbd35b04ab50ae3
sentinel known-sentinel mymaster np-redis-node-2.np-redis-headless.np.svc.cluster.local 26379 875696576dbb4eaf5f0ea54c286b2870e1eefa17
sentinel known-sentinel mymaster qa-redis-node-0.qa-redis-headless.qa.svc.cluster.local 26379 d09b0415f0a74561756769ce826ce26edaf1c134

sentinel known-sentinel mymaster qa-redis-node-1.qa-redis-headless.qa.svc.cluster.local 26379 477647859e5e13a8031bcf89f2c4ee65bfa5013c
sentinel known-sentinel mymaster ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local 26379 225901f21200955c907690cf889082d7040d262a

I chose these logs because I believe this node may have been the source of the issue in this scenario - right at the start of the redis container logs we can see the following:

19:23:25.35 INFO  ==> about to run the command: timeout 220 redis-cli -h np-redis.np.svc.cluster.local -p 26379 sentinel get-master-addr-by-name mymaster
19:23:25.38 INFO  ==> Current master: REDIS_SENTINEL_INFO=(ci-redis-node-2.ci-redis-headless.ci.svc.cluster.local,6379)

This appears to indicate that the redis-cli sentinel get-master-addr-by-name query can pick up any redis instance monitoring a master named mymaster on the network, and is not necessarily designed with k8s deployments in mind (which would explain why it returns an 'incorrect' hostname). Even when the master is renamed via masterSet, replicas are still able to attach across instances - which again suggests that however Redis discovers replicas, it is not designed for a multi-instance Kubernetes scenario. Changing the port lets us circumvent the issue entirely.

This script is one source (there are more within that file) that calls the redis-cli get-master-addr-by-name function and is returned the erroneous results. I believe functionality should be added to these scripts (and to the replica discovery scripts) to ensure only hostnames within the current namespace are accepted, OR this issue should be raised upstream with Redis - however, I believe Redis is likely behaving as expected in this scenario.
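
As an illustrative check (a sketch; the service names, port, and client image are assumptions based on the defaults shown in the logs above), you can ask each release's Sentinel service which master it currently reports and see whether the hostname belongs to a different namespace:

# Sketch: query each Sentinel service for its current master (service names/port assumed).
for ns in qa ci np; do
  echo "== ${ns} =="
  kubectl -n "$ns" run redis-client --rm -i --restart=Never --image=bitnami/redis -- \
    redis-cli -h "${ns}-redis.${ns}.svc.cluster.local" -p 26379 \
    sentinel get-master-addr-by-name mymaster
done
# A healthy setup should return a <release>-node-N hostname from the same namespace.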

Please let me know any questions regarding this! I am happy to provide additional information as needed.

@miguelaeh
Contributor

Hi @JDKnobloch ,
So if I understood the issue properly, what is happening is that the nodes of a Redis release in one namespace are getting mixed with those of a release in another namespace when both are restarted at the same time, right?

Is there any reason for not keeping the k8s namespaces isolated? I would expect this not to happen if there is no traffic mixed between namespaces.

@JDKnobloch
Contributor Author

Hey @miguelaeh,

Correct - the nodes mix across any redis-sentinel replication instances on the same network. To boil it down further: cross-namespace or same-namespace, as long as two sentinel replication instances exist on a cluster they can interact (lines 1-3 of the Redis container log show a CLI command meant to locate the np master returning the ci master instead).

While isolating the namespaces would solve the problem for our original nonprod occurrence, it ultimately cannot be done in some scenarios - for example, one of our production k8s clusters runs two redis-sentinel replication instances within the default namespace, and both communicate with various apps across the cluster. We need those instances communicating across the cluster, so swapping ports was the simpler solution for us.

@miguelaeh
Contributor

I would need to take a deeper look at how Sentinel locates the master. In lines 1-3 we are using the FQDN of the headless service, including the namespace, so the issue is probably in how Sentinel locates the master - maybe there is some wrong configuration.

@miguelaeh added the on-hold label (Issues or Pull Requests with this label will never be considered stale) on May 20, 2022
@miguelaeh
Contributor

I created an internal task to investigate the issue; nevertheless, we cannot provide an ETA for when it will be done.

@miguelaeh
Contributor

Hi @JDKnobloch ,
I have been trying to reproduce the issue but was unable to get the traffic to mix between the namespaces.

Could you share the list of commands step by step to reproduce using Helm and Kubectl? (I mean, without Argo)

@JDKnobloch
Contributor Author

Okay, I will try to get around to this in the coming days and get back to you.

@carrodher removed the on-hold label on May 30, 2022
@github-actions

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@github-actions bot added the stale label (15 days without activity) on Jun 15, 2022
@JDKnobloch
Contributor Author

Bump as I still plan to return to this.

@github-actions bot removed the stale label on Jun 16, 2022
@github-actions

github-actions bot commented Jul 1, 2022

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@github-actions bot added the stale label on Jul 1, 2022
@github-actions

github-actions bot commented Jul 6, 2022

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.

@github-actions bot closed this as completed on Jul 6, 2022
@michalpenka
Contributor

michalpenka commented Sep 22, 2022

We have the same issue. As said, it is possible to reproduce by deploying redis twice within the same namespace.

Please note that we tried setting a unique sentinel.masterSet value for each release, but it did not seem to help.

@michalpenka
Contributor

What seems to help, though (still need to verify), is to always deploy with a unique port set... which is a bit unmaintainable for us, as we deploy 36 redis sets (3 redis nodes, one of them the master, plus 3 sentinel nodes) to a single cluster.

@BenHesketh21

BenHesketh21 commented Jan 6, 2023

I have just experienced this. We also have lots of redis sets across a cluster in different namespaces. It happened when I increased resource requests across all instances and they all got applied together, around the same time; now they are stuck trying to assign a sentinel node from another namespace, for which they have no configuration, so they fail and restart.

It came along with this error:

*** FATAL CONFIG FILE ERROR (Redis 6.2.7) ***
Reading the configuration file, at line 3
>>> 'sentinel monitor mymaster    2'
Unrecognized sentinel configuration statement

@leqduyvp

I'm facing the mismatched master issue too. A sentinel suddenly claims the master of another deployment as its own master.

@Voolodimer

@JDKnobloch did you find a solution to this problem?

@JDKnobloch
Contributor Author

@Voolodimer Our solution was to simply migrate each redis instance to use different ports so they could no longer cross streams.

So for each instance the following values were updated (we are running replication & sentinel):

master.containerPorts.redis
master.service.ports.redis
replica.containerPorts.redis
replica.service.ports.redis
sentinel.containerPorts.sentinel
sentinel.service.ports.redis
sentinel.service.ports.sentinel

We moved redis to port 6385 and sentinel to 26385 for one instance; each redis/sentinel instance then gets its own unique pair of ports.
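
As a hedged illustration of those overrides (the release name, namespace, and port pair below are examples, not the exact values we used):

# Sketch: give one release (np-redis in namespace np, assumed) its own redis/sentinel ports.
helm upgrade np-redis bitnami/redis --version 16.8.7 -n np --reuse-values \
  --set master.containerPorts.redis=6385 \
  --set master.service.ports.redis=6385 \
  --set replica.containerPorts.redis=6385 \
  --set replica.service.ports.redis=6385 \
  --set sentinel.containerPorts.sentinel=26385 \
  --set sentinel.service.ports.redis=6385 \
  --set sentinel.service.ports.sentinel=26385
# Repeat with a different pair (e.g. 6386/26386) for each other instance on the cluster.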

@phucnv282

We have the same issue.
I deployed two Redis Sentinel releases into different namespaces, but they still pub/sub to each other over the __sentinel__:hello channel, so each Sentinel discovers the other deployment's nodes as members.
Both deployments have the same masterSet: mymaster, but I don't think that matters.
Correct me if I'm wrong :((
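
A quick way to observe this cross-talk (a sketch; pod and container names are assumed from the chart's defaults) is to subscribe to the Sentinel hello channel on a data node and watch for announcements that mention hostnames from other namespaces:

# Sketch: watch Sentinel hello announcements on one data node (pod/container names assumed).
kubectl -n qa exec -it qa-redis-node-0 -c redis -- \
  redis-cli -p 6379 SUBSCRIBE __sentinel__:hello
# Messages containing hostnames from another namespace indicate cross-instance discovery.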

@whiskeysierra

Same issue here. We also saw this happen when we changed resource requests and updated versions for multiple redis installations in different namespaces.

@phucnv282

I observed that this happens whenever a previous Redis Sentinel deployment used the option useHostnames: false and a later one uses useHostnames: true.
As mentioned in the documentation, Sentinel instances never forget the replicas of a monitored master (https://redis.io/docs/management/sentinel/#removing-the-old-master-or-unreachable-replicas).
The previous deployment may declare its known replicas by IP, and that list is never automatically reset; when a sentinel or replica restarts (common in K8s), it gets a new IP, so the known list keeps growing.
When the later deployment comes up, one of its IPs may already be in the previous deployment's known list; discovery then goes wrong because the new deployment updates its sentinel config from the old one, and all redis servers end up as its replicas.
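
Following the documentation linked above, one cleanup approach (a sketch; pod names, container name, and master name are assumptions) is to issue SENTINEL RESET on every Sentinel so it drops stale replica and sentinel entries:

# Sketch: make each Sentinel forget stale replicas/sentinels for the monitored master.
for pod in np-redis-node-0 np-redis-node-1 np-redis-node-2; do
  kubectl -n np exec "$pod" -c sentinel -- redis-cli -p 26379 SENTINEL RESET mymaster
done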

@vijayrl

vijayrl commented Sep 19, 2023

I'm facing the same issue - was this fixed in later versions?

@github-actions bot added the triage label (Triage is needed) on Sep 19, 2023
@carrodher
Member

Could you please create a new ticket describing your specific use case and configuration? Thanks
