
[bitnami/redis] Sentinel return wrong IP #1682

Closed
baznikin opened this issue Nov 28, 2019 · 24 comments
Labels
on-hold Issues or Pull Requests with this label will never be considered stale

Comments

@baznikin

Which chart:

bitnami/redis version 9.5.5

Description

Sentinel does not update its config, so we end up with stale IP addresses.

Steps to reproduce the issue:

I set up Redis and Stolon in the same namespace and played with them for a while. When I tried to actually use Redis, I found that Sentinel gave me the address of a Postgres pod! As far as I understand, Sentinel compiles its config on pod start and assumes addresses do not change.

$ telnet redis.db.svc 26379
Trying 192.168.47.226...
Connected to 192.168.47.226.
Escape character is '^]'.
sentinel get-master-addr-by-name spt-redis
*2
$14
192.168.47.214
$4
6379
$ kubectl -n db get pod -o custom-columns=NAME:.metadata.name,IP:.status.podIP
NAME                               IP
redis-master-0                     192.168.47.235
redis-slave-0                      192.168.47.234
redis-slave-1                      192.168.47.226
stolon-create-cluster-8xj76        192.168.47.209
stolon-keeper-0                    192.168.47.201
stolon-keeper-1                    192.168.47.202
stolon-proxy-c49bdd5c5-bwp2l       192.168.47.243
stolon-proxy-c49bdd5c5-wzvzb       192.168.47.214
stolon-sentinel-6cb88b84c8-gdw4r   192.168.47.198
stolon-sentinel-6cb88b84c8-m8dpn   192.168.47.250
stolon-update-cluster-spec-fpc5h   192.168.47.200
$ kubectl -n db exec -it pod/redis-master-0 -c sentinel -- cat /opt/bitnami/redis-sentinel/etc/sentinel.conf
dir "/tmp"
bind 0.0.0.0
port 26379
sentinel myid 85dd23902cf42f9b601086f5e4814f704d15937f
sentinel deny-scripts-reconfig yes
sentinel monitor spt-redis 192.168.47.214 6379 2
sentinel down-after-milliseconds spt-redis 60000
sentinel failover-timeout spt-redis 18000
# Generated by CONFIG REWRITE
protected-mode no
sentinel auth-pass spt-redis **********
sentinel config-epoch spt-redis 0
sentinel leader-epoch spt-redis 0
sentinel known-replica spt-redis 192.168.47.218 6379
sentinel known-replica spt-redis 192.168.47.215 6379
sentinel known-sentinel spt-redis 192.168.47.218 26379 f13e511bad9f51182ba73b03697126b6bf1c752f
sentinel known-sentinel spt-redis 192.168.47.215 26379 91bef7dfdfda7aadd047839cc78f5acf14ade2c5
sentinel current-epoch 0
-rw-r--r-- 1 1001 1001 743 Nov 23 17:14 /opt/bitnami/redis-sentinel/etc/sentinel.conf
Thu Nov 28 19:45:40 MSK 2019
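
For reference, the same Sentinel query can be made with redis-cli instead of raw telnet (a minimal sketch, reusing the service name and master set shown above):

$ redis-cli -h redis.db.svc -p 26379 sentinel get-master-addr-by-name spt-redis
1) "192.168.47.214"
2) "6379"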

@baznikin baznikin changed the title [bitnami/redis] Sentenel return wrong IP [bitnami/redis] Sentinel return wrong IP Nov 28, 2019
@javsalgar
Contributor

Hi,

This is strange, because the generated config map uses the domain name, not the IP:

{{- if .Values.sentinel.enabled }}
  sentinel.conf: |-
    dir "/tmp"
    bind 0.0.0.0
    port {{ .Values.sentinel.port }}
    sentinel monitor {{ .Values.sentinel.masterSet }} {{ template "redis.fullname" . }}-master-0.{{ template "redis.fullname" . }}-headless.{{ .Release.Namespace }}.svc.{{ .Values.clusterDomain }} {{ .Values.redisPort }} {{ .Values.sentinel.quorum }}
    sentinel down-after-milliseconds {{ .Values.sentinel.masterSet }} {{ .Values.sentinel.downAfterMilliseconds }}
    sentinel failover-timeout {{ .Values.sentinel.masterSet }} {{ .Values.sentinel.failoverTimeout }}
    sentinel parallel-syncs {{ .Values.sentinel.masterSet }} {{ .Values.sentinel.parallelSyncs }}

Could you show the generated config map using kubectl?
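
For instance, something like this should print the rendered ConfigMap (a sketch assuming the release is named redis and deployed in namespace db, as in the commands above):

kubectl -n db get configmap redis -o yaml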

@baznikin
Author

baznikin commented Dec 2, 2019

Hmmm, domain name here... Very strange

apiVersion: v1
data:
  master.conf: |-
    dir /data
    rename-command FLUSHDB ""
    rename-command FLUSHALL ""
  redis.conf: |-
    # User-supplied configuration:
    # Enable AOF https://redis.io/topics/persistence#append-only-file
    appendonly yes
    # Disable RDB persistence, AOF persistence already enabled.
    save ""
  replica.conf: |-
    dir /data
    slave-read-only yes
    rename-command FLUSHDB ""
    rename-command FLUSHALL ""
  sentinel.conf: |-
    dir "/tmp"
    bind 0.0.0.0
    port 26379                                                                         
    sentinel monitor spt-redis redis-master-0.redis-headless.db.svc.cluster.local 6379 2
    sentinel down-after-milliseconds spt-redis 60000
    sentinel failover-timeout spt-redis 18000
    sentinel parallel-syncs spt-redis 1
kind: ConfigMap
metadata:
  creationTimestamp: "2019-11-23T17:05:00Z"
  labels:
    app: redis
    chart: redis-9.5.5
    heritage: Tiller
    release: redis
  name: redis
  namespace: db
  resourceVersion: "3099795"
  selfLink: /api/v1/namespaces/db/configmaps/redis
  uid: 6d867ed4-7175-433c-9bcd-58d887b64ecc

@baznikin
Author

baznikin commented Dec 2, 2019

I have not reloaded Redis yet; are there any tests I can do to track down the issue?

@javsalgar
Copy link
Contributor

Maybe deploy a new one and see if the address gets changed to an IP. Maybe it's something that Redis does automatically.

@baznikin
Author

baznikin commented Dec 2, 2019

I deployed a new Redis with helm install --name redis2 --namespace db bitnami/redis --version 9.5.5 -f deploy/helm/redis.yaml and these values:

password: "<password here>"
cluster:
  enabled: true
  slaveCount: 2
sentinel:
  enabled: true
  masterSet: spt-redis
persistence: {}
  # existingClaim:
master:
  statefulset:
    updateStrategy: RollingUpdate
slave:
  statefulset:
    updateStrategy: RollingUpdate
metrics:
  enabled: true
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9121"
  serviceMonitor:
    enabled: false

## Redis config file
## ref: https://redis.io/topics/config
##
configmap: |-
  # Enable AOF https://redis.io/topics/persistence#append-only-file
  appendonly yes
  # Disable RDB persistence, AOF persistence already enabled.
  save ""

Same result, IP in config:

$ kubectl -n db get configmap -o yaml redis2 | grep monitor
    sentinel monitor spt-redis redis2-master-0.redis2-headless.db.svc.cluster.local 6379 2
$ kubectl -n db exec -it pod/redis2-master-0 -c sentinel -- cat /opt/bitnami/redis-sentinel/etc/sentinel.conf | grep monitor
sentinel monitor spt-redis 192.168.47.208 6379 2
$ kubectl -n db get pod -o custom-columns=NAME:.metadata.name,IP:.status.podIP | grep 192.168.47.208
redis2-master-0                    192.168.47.208

@baznikin
Author

baznikin commented Dec 2, 2019

According to the docs there should be an IP address there. I also found a pretty old issue on this topic. I suppose we have to work around the master pod IP change (or check how others run Redis Sentinel on Kubernetes); correct me if I am wrong.

@baznikin
Author

baznikin commented Dec 2, 2019

Another point: if I restart pod/redis2-master-0, its config gets updated. However, the slaves' sentinels do not:

$ kubectl delete pod/redis2-master-0 -n db
pod "redis2-master-0" deleted

$ kubectl -n db exec -it pod/redis2-master-0 -c sentinel -- cat /opt/bitnami/redis-sentinel/etc/sentinel.conf | grep monitor
sentinel monitor spt-redis 192.168.47.253 6379 2

$ kubectl -n db get pod -o custom-columns=NAME:.metadata.name,IP:.status.podIP | grep redis2-master-0
redis2-master-0                    192.168.47.253

$ kubectl -n db exec -it pod/redis2-slave-0 -c sentinel -- cat /opt/bitnami/redis-sentinel/etc/sentinel.conf | grep monitor
sentinel monitor spt-redis 192.168.47.208 6379 2
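
The on-disk file can also be compared against the sentinel's runtime view by querying it directly (a sketch; it assumes redis-cli is available inside the sentinel container):

$ kubectl -n db exec -it pod/redis2-slave-0 -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name spt-redis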

@javsalgar
Copy link
Contributor

Hi,

Thanks for letting us know. I think this will require further investigation. Let me open an internal task. I will let you know when we have more details.

@stale

stale bot commented Dec 18, 2019

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@stale stale bot added the stale 15 days without activity label Dec 18, 2019
@alemorcuq alemorcuq added on-hold Issues or Pull Requests with this label will never be considered stale and removed stale 15 days without activity labels Dec 19, 2019
@ajcann

ajcann commented Mar 13, 2020

@javsalgar Would this issue suggest it's not wise to rely on using Sentinel mode for HA in a production situation?

@javsalgar
Contributor

Hi,

We still need to investigate how to properly deal with Sentinel and the ephemerality of IP addresses. For the time being, until this issue is fixed, I would recommend sticking to a regular master-slave configuration. We are also working on a redis-cluster chart, which has a different failover mechanism and could be better suited for this kind of scenario. We will let you know when we have more updates on this.

@miguelaeh
Contributor

Hi @baznikin,
I have been testing what you explained here and it seems to be a temporary issue. Once you kill the master, there is a period during which one of the slaves needs to be promoted to master. During that time, the sentinels on both slaves will still point to the old master, and once the new master pod has been created it will point to itself, because the hostname in the configmap points to the pod called master. One thing to clarify: at this moment the actual master will be one of the pods called slave, and the pod called master will be a slave.
Once the cluster reaches a stable state, the sentinel pods start an auto-reconfiguration process, and after some time they all point to the new master (which is actually a pod called slave).
Let me illustrate this:

  • Right after the first deploy of the chart, the cluster is in a stable state:
10:43:56 › kgp -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE                                NOMINATED NODE   READINESS GATES
sentinel-redis-master-0   3/3     Running   0          3m50s   10.244.2.86   aks-agentpool-38805687-vmss000002   <none>           <none>
sentinel-redis-slave-0    3/3     Running   2          3m50s   10.244.3.89   aks-agentpool-38805687-vmss000003   <none>           <none>
sentinel-redis-slave-1    3/3     Running   0          2m15s   10.244.1.83   aks-agentpool-38805687-vmss000001   <none>           <none>

As you can see in the sentinel configuration of one of the pods, it is pointing to the master pod, which is correct:

10:44:03 › k exec -it sentinel-redis-master-0 -c sentinel bash
I have no name!@sentinel-redis-master-0:/$ cat /opt/bitnami/redis-sentinel/etc/sentinel.conf
dir "/tmp"
bind 0.0.0.0
port 26379
sentinel myid 3b9bba815cc15706f7b66f7ef85eefe215cb4c1b
sentinel deny-scripts-reconfig yes
sentinel monitor spt-redis 10.244.2.86 6379 2
.
.
.

Then, I killed the master pod:

10:45:51 › k delete pod sentinel-redis-master-0
pod "sentinel-redis-master-0" deleted

And now there is an unstable period where a slave should be promoted to master. Checking the logs of one of the slaves, you will see the following:

1:S 08 Apr 2020 10:41:54.951 # CONFIG REWRITE executed with success.
1:S 08 Apr 2020 10:41:55.265 * Connecting to MASTER 10.244.2.86:6379
1:S 08 Apr 2020 10:41:55.265 * MASTER <-> REPLICA sync started
1:S 08 Apr 2020 10:41:55.266 * Non blocking connect for SYNC fired the event.
1:S 08 Apr 2020 10:41:55.266 * Master replied to PING, replication can continue...
1:S 08 Apr 2020 10:41:55.268 * Trying a partial resynchronization (request e071e2ecae29225177b980a90e8afea809390681:2550).
1:S 08 Apr 2020 10:41:55.269 * Successful partial resynchronization with master.
1:S 08 Apr 2020 10:41:55.269 * MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.
1:S 08 Apr 2020 10:45:55.030 # Connection with master lost.
1:S 08 Apr 2020 10:45:55.030 * Caching the disconnected master state.
1:S 08 Apr 2020 10:45:55.078 * Connecting to MASTER 10.244.2.86:6379
1:S 08 Apr 2020 10:45:55.078 * MASTER <-> REPLICA sync started
1:S 08 Apr 2020 10:45:55.079 # Error condition on socket for SYNC: Connection refused
1:S 08 Apr 2020 10:45:56.081 * Connecting to MASTER 10.244.2.86:6379
1:S 08 Apr 2020 10:45:56.081 * MASTER <-> REPLICA sync started
1:S 08 Apr 2020 10:46:14.230 # Error condition on socket for SYNC: No route to host
1:S 08 Apr 2020 10:46:15.159 * Connecting to MASTER 10.244.2.86:6379
1:S 08 Apr 2020 10:46:15.160 * MASTER <-> REPLICA sync started
1:S 08 Apr 2020 10:46:23.454 # Error condition on socket for SYNC: No route to host
1:S 08 Apr 2020 10:46:24.190 * Connecting to MASTER 10.244.2.86:6379
1:S 08 Apr 2020 10:46:24.191 * MASTER <-> REPLICA sync started
1:S 08 Apr 2020 10:46:27.254 # Error condition on socket for SYNC: No route to host
1:S 08 Apr 2020 10:46:28.208 * Connecting to MASTER 10.244.2.86:6379
1:S 08 Apr 2020 10:46:28.208 * MASTER <-> REPLICA sync started

The new master pod is created automatically and its IP will be different from the previous one:

10:48:08 › kgp -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE                                NOMINATED NODE   READINESS GATES
sentinel-redis-master-0   3/3     Running   0          2m15s   10.244.2.87   aks-agentpool-38805687-vmss000002   <none>           <none>
sentinel-redis-slave-0    3/3     Running   2          8m      10.244.3.89   aks-agentpool-38805687-vmss000003   <none>           <none>
sentinel-redis-slave-1    3/3     Running   0          6m25s   10.244.1.83   aks-agentpool-38805687-vmss000001   <none>           <none>

If you exec into the new pod right after it is created, you will see that its sentinel configuration points to itself, i.e. to its new IP.
Now, in one of the slaves, the following will appear, indicating that it is now the master:

1:M 08 Apr 2020 10:48:07.934 * Discarding previously cached master state.
1:M 08 Apr 2020 10:48:07.934 * MASTER MODE enabled (user request from 'id=5 addr=10.244.3.89:39847 fd=10 name=sentinel-ce6ec014-cmd age=357 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=140 qbuf-free=32628 obl=36 oll=0 omem=0 events=r cmd=exec')
1:M 08 Apr 2020 10:48:07.934 # CONFIG REWRITE executed with success.
1:M 08 Apr 2020 10:48:09.553 * Replica 10.244.3.89:6379 asks for synchronization
1:M 08 Apr 2020 10:48:09.554 * Partial resynchronization request from 10.244.3.89:6379 accepted. Sending 437 bytes of backlog starting from offset 50657.
1:M 08 Apr 2020 10:48:21.036 * Replica 10.244.2.87:6379 asks for synchronization
1:M 08 Apr 2020 10:48:21.036 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '567f79774695fcba15dc30133c6a398d59c74b4d', my replication IDs are '38c06c6daab43ea958f6fb2f0d615e8e02376c14' and 'e071e2ecae29225177b980a90e8afea809390681')
1:M 08 Apr 2020 10:48:21.036 * Starting BGSAVE for SYNC with target: disk
1:M 08 Apr 2020 10:48:21.037 * Background saving started by pid 741
741:C 08 Apr 2020 10:48:21.051 * DB saved on disk
741:C 08 Apr 2020 10:48:21.052 * RDB: 10 MB of memory used by copy-on-write
1:M 08 Apr 2020 10:48:21.085 * Background saving terminated with success
1:M 08 Apr 2020 10:48:21.086 * Synchronization with replica 10.244.2.87:6379 succeeded

And after some time, if we go to the old master we will see that the sentinel configuration is now pointing to the new master (which is sentinel-redis-slave-1):

10:49:11 › k exec -it sentinel-redis-master-0 -c sentinel bash
I have no name!@sentinel-redis-master-0:/$ cat /opt/bitnami/redis-sentinel/etc/sentinel.conf
dir "/tmp"
bind 0.0.0.0
port 26379
sentinel myid d25053c91626dbfabd456c4cdeab9bed39ea33fc
sentinel deny-scripts-reconfig yes
sentinel monitor spt-redis 10.244.1.83 6379 2

And now the cluster is stable again. I guess this is the behaviour you were expecting, but you didn't give Sentinel enough time to update the IPs.
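
One way to watch that convergence is to poll any of the sentinels for the advertised master address while the failover happens (a sketch; pod, container and master-set names are taken from the walkthrough above, and redis-cli is assumed to be available in the sentinel container):

watch -n 2 'kubectl exec sentinel-redis-slave-0 -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name spt-redis'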

@baznikin
Author

baznikin commented Apr 9, 2020 via email

@miguelaeh
Contributor

Thank you for the confirmation @baznikin !!

@albertocsm

albertocsm commented Apr 24, 2020

I've also experienced the issue described by @baznikin here.

When killing the pod named redis-master-0 while it is actually the current elected master, a condition occurs where Sentinel is not able to fail over successfully for an undetermined amount of time (hours).

1:X 24 Apr 2020 11:38:20.750 # +new-epoch 204
1:X 24 Apr 2020 11:38:20.750 # +try-failover master redis 10.145.159.239 6379
1:X 24 Apr 2020 11:38:20.751 # +vote-for-leader c855252ac9d60e833a37eb9a19b867fcbf9eec72 204
1:X 24 Apr 2020 11:38:20.754 # cde3c17ddd887f4c1fedc7f172b94073f3b3eef0 voted for c855252ac9d60e833a37eb9a19b867fcbf9eec72 204
1:X 24 Apr 2020 11:38:20.805 # +elected-leader master redis 10.145.159.239 6379
1:X 24 Apr 2020 11:38:20.805 # +failover-state-select-slave master redis 10.145.159.239 6379
1:X 24 Apr 2020 11:38:20.867 # -failover-abort-no-good-slave master redis 10.145.159.239 6379
1:X 24 Apr 2020 11:38:20.967 # Next failover delay: I will not start a failover before Fri Apr 24 11:38:50 2020

10.145.159.239 is the old IP of the dead redis-master-0 pod.

When in this state, querying each sentinel instance with sentinel ckquorum resulted in an error where each of the (3) sentinel instances in the cluster reported itself as not OK -> not enough Sentinels / majority not reached.
Not sure why this is - s_down instances should not be considered for majority purposes, right?

After a sentinel reset on all the sentinel instances, followed by another sentinel ckquorum, each sentinel instance now reports OK:

OK <N> usable Sentinels. Quorum and failover authorization can be reached

Even though not all sentinel instances are aware of all the other instances (?), all of them are aware of a majority (majority == 2 in the 1 master + 2 replicas cluster I'm running).

At this point, the failover procedure is still not able to complete successfully.

Unfortunately, I have yet to make this reproducible 100% of the time.

For reference, here is the output of sentinel master for each of the 3 members of the ensemble:

m0-----------
 1) "name"
 2) "redis"
 3) "ip"
 4) "10.145.159.239"
 5) "port"
 6) "6379"
 7) "runid"
 8) ""
 9) "flags"
10) "s_down,o_down,master,disconnected"
11) "link-pending-commands"
12) "32"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "2099287"
17) "last-ok-ping-reply"
18) "2099287"
19) "last-ping-reply"
20) "2099287"
21) "s-down-time"
22) "2094225"
23) "o-down-time"
24) "2094173"
25) "down-after-milliseconds"
26) "5000"
27) "info-refresh"
28) "4864481"
29) "role-reported"
30) "master"
31) "role-reported-time"
32) "2099287"
33) "config-epoch"
34) "6"
35) "num-slaves"
36) "0"
37) "num-other-sentinels"
38) "1"
39) "quorum"
40) "2"
41) "failover-timeout"
42) "15000"
43) "parallel-syncs"
44) "1"
s0-----------
 1) "name"
 2) "redis"
 3) "ip"
 4) "10.145.159.239"
 5) "port"
 6) "6379"
 7) "runid"
 8) ""
 9) "flags"
10) "s_down,o_down,master"
11) "link-pending-commands"
12) "101"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "2549571"
17) "last-ok-ping-reply"
18) "2549571"
19) "last-ping-reply"
20) "2549571"
21) "s-down-time"
22) "2544547"
23) "o-down-time"
24) "2284604"
25) "down-after-milliseconds"
26) "5000"
27) "info-refresh"
28) "4868280"
29) "role-reported"
30) "master"
31) "role-reported-time"
32) "2549571"
33) "config-epoch"
34) "6"
35) "num-slaves"
36) "0"
37) "num-other-sentinels"
38) "1"
39) "quorum"
40) "2"
41) "failover-timeout"
42) "15000"
43) "parallel-syncs"
44) "1"
s1-----------
 1) "name"
 2) "redis"
 3) "ip"
 4) "10.145.159.239"
 5) "port"
 6) "6379"
 7) "runid"
 8) ""
 9) "flags"
10) "s_down,o_down,master"
11) "link-pending-commands"
12) "101"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "2290182"
17) "last-ok-ping-reply"
18) "2290182"
19) "last-ping-reply"
20) "2290182"
21) "s-down-time"
22) "2285127"
23) "o-down-time"
24) "2285060"
25) "down-after-milliseconds"
26) "5000"
27) "info-refresh"
28) "4876023"
29) "role-reported"
30) "master"
31) "role-reported-time"
32) "2290182"
33) "config-epoch"
34) "6"
35) "num-slaves"
36) "0"
37) "num-other-sentinels"
38) "2"
39) "quorum"
40) "2"
41) "failover-timeout"
42) "15000"
43) "parallel-syncs"
44) "1"

I would be happy to provide more info / logs.
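
For completeness, the ckquorum / reset sequence described above would look roughly like this when run against every member (a sketch; pod names, namespace and container name are illustrative, while the master set name redis is taken from the logs above):

for pod in redis-master-0 redis-slave-0 redis-slave-1; do
  kubectl -n db exec "$pod" -c sentinel -- redis-cli -p 26379 sentinel reset redis
  kubectl -n db exec "$pod" -c sentinel -- redis-cli -p 26379 sentinel ckquorum redis
done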

@miguelaeh
Contributor

Hi @albertocsm,
This seems to be a different case from the one explained in this issue; we would appreciate it if you could open a new issue for it. There is no need to write all the information again, just reference your comment in this thread.
Regarding your issue, the sentinel failover time is configurable; judging by this line from your output, it seems it is configured to wait some time before retrying:

Next failover delay: I will not start a failover before Fri Apr 24 11:38:50 2020

Regards.
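
Both of those timings are exposed as chart parameters feeding the sentinel.conf template quoted earlier in this thread; a hedged values sketch matching the numbers in the output above:

sentinel:
  enabled: true
  masterSet: redis
  downAfterMilliseconds: 5000   # ms before the master is flagged as subjectively down
  failoverTimeout: 15000        # also governs how long Sentinel waits before retrying a failover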

@rtriveurbana

I experienced the same problem tonight. One node of my cluster went down and a failover started. One sentinel returned the old master IP instead of the new one, causing problems for other services that use Redis.

@miguelaeh
Contributor

Hi @rtriveurbana ,
Could you provide more information? Did you wait enough time for the cluster to recover?

@tomislater
Contributor

I have noticed that old IPs are not removed from sentinel.conf. I have the default sentinel setup, the newest version (3 nodes):

cat /opt/bitnami/redis-sentinel/etc/sentinel.conf:

dir "/tmp"
port 26379
sentinel monitor mymaster 10.110.37.122 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 18000

# User-supplied sentinel configuration:
# End of sentinel configuration
sentinel myid 80d3b99129c4f9bb62cdcec2399499f2c4666789

sentinel known-sentinel mymaster 10.110.59.80 26379 475427d0cefa44aff518ab6b888c3510314d882c

sentinel known-sentinel mymaster 10.110.17.137 26379 cae9f9b5ed951299f4bdc7de1da8e996e7b7197c

sentinel known-replica mymaster 10.110.51.41 6379

sentinel known-replica mymaster 10.110.59.80 6379
# Generated by CONFIG REWRITE
protected-mode no
user default on nopass ~* &* +@all
sentinel config-epoch mymaster 3
sentinel leader-epoch mymaster 3
sentinel current-epoch 3
sentinel known-replica mymaster 10.110.16.155 6379
sentinel known-replica mymaster 10.110.18.152 6379
sentinel known-replica mymaster 10.110.17.137 6379

10.110.37.122 is my current master and 10.110.59.80, 10.110.17.137 are my replicas.

And here are the problems with these IPs:

  • 10.110.51.41 - I do not know what it is
  • 10.110.16.155 - that was an old replica (I deleted one pod, just for fun)
  • 10.110.18.152 - that was also an old replica (I also deleted this pod, but not for fun; for tests)

Looks like the sentinel configuration is not updated properly when I (or my cluster, e.g. the autoscaler) delete pods.

It is even funnier! 🙈

I spawned a new Redis release (standalone, only one pod, in another namespace) and k8s gave it the 10.110.18.152 IP address, which, as you can see, is the same one the old (deleted) replica had! And now sentinel "thinks" that this new Redis in another namespace is a replica :D and starts syncing it with the master...

I think we need something that is able to remove old IPs from sentinel.conf 🤔. I need to study the chart, then I can help.

@miguelaeh
Contributor

Hi @tomislater,
We would appreciate it if you could create a new issue for your case, since it seems to be unrelated to the original one.
What you are describing should not happen; we will probably need to re-check that.

@tomislater
Contributor

@miguelaeh hey, I have added a comment here: #5418 (comment)

Looks like we should use:

SENTINEL resolve-hostnames yes
SENTINEL announce-hostnames yes

I am going to debug this further.
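
For context, a hostname-based sentinel.conf could look roughly like this (a sketch only; it assumes Redis/Sentinel 6.2 or newer, where these directives exist, and reuses the headless-service hostname pattern from the chart template quoted earlier):

dir "/tmp"
port 26379
sentinel resolve-hostnames yes
sentinel announce-hostnames yes
sentinel monitor mymaster redis-master-0.redis-headless.db.svc.cluster.local 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 18000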

@tomislater
Contributor

I have a solution and it seems to be working 🤔 I will propose a PR soon.

@miguelaeh
Contributor

Hi @tomislater,
Thank you for the PR! We will take a look at it.

@carrodher
Member

Unfortunately, this issue was created a long time ago and, although there is an internal task to fix it, it was not prioritized as something to address in the short/mid term. It is not for a technical reason but rather a matter of capacity, since we are a small team.

That being said, contributions via PRs are more than welcome in both repositories (containers and charts), in case you would like to contribute.

In the meantime, there have been several releases of this asset and it is possible the issue has gone away as part of other changes. If that is not the case and you are still experiencing this issue, please feel free to reopen it and we will re-evaluate it.
