
[bitnami/redis] Sentinel return wrong IP #1682

Closed
baznikin opened this issue Nov 28, 2019 · 24 comments
Labels
on-hold Issues or Pull Requests with this label will never be considered stale

Comments

@baznikin

Which chart:

bitnami/redis version 9.5.5

Description

Sentinel does not update its config, so we end up with stale IP addresses.

Steps to reproduce the issue:

I set up Redis and Stolon in the same namespace and played with them for a while. When I tried to actually use Redis, I found that Sentinel gave me the address of a Postgres pod! As far as I understand, Sentinel compiles its config on pod start and assumes addresses do not change.

$ telnet redis.db.svc 26379
Trying 192.168.47.226...
Connected to 192.168.47.226.
Escape character is '^]'.
sentinel get-master-addr-by-name spt-redis
*2
$14
192.168.47.214
$4
6379
$ kubectl -n db get pod -o custom-columns=NAME:.metadata.name,IP:.status.podIP
NAME                               IP
redis-master-0                     192.168.47.235
redis-slave-0                      192.168.47.234
redis-slave-1                      192.168.47.226
stolon-create-cluster-8xj76        192.168.47.209
stolon-keeper-0                    192.168.47.201
stolon-keeper-1                    192.168.47.202
stolon-proxy-c49bdd5c5-bwp2l       192.168.47.243
stolon-proxy-c49bdd5c5-wzvzb       192.168.47.214
stolon-sentinel-6cb88b84c8-gdw4r   192.168.47.198
stolon-sentinel-6cb88b84c8-m8dpn   192.168.47.250
stolon-update-cluster-spec-fpc5h   192.168.47.200
$ kubectl -n db exec -it pod/redis-master-0 -c sentinel -- cat /opt/bitnami/redis-sentinel/etc/sentinel.conf
dir "/tmp"
bind 0.0.0.0
port 26379
sentinel myid 85dd23902cf42f9b601086f5e4814f704d15937f
sentinel deny-scripts-reconfig yes
sentinel monitor spt-redis 192.168.47.214 6379 2
sentinel down-after-milliseconds spt-redis 60000
sentinel failover-timeout spt-redis 18000
# Generated by CONFIG REWRITE
protected-mode no
sentinel auth-pass spt-redis **********
sentinel config-epoch spt-redis 0
sentinel leader-epoch spt-redis 0
sentinel known-replica spt-redis 192.168.47.218 6379
sentinel known-replica spt-redis 192.168.47.215 6379
sentinel known-sentinel spt-redis 192.168.47.218 26379 f13e511bad9f51182ba73b03697126b6bf1c752f
sentinel known-sentinel spt-redis 192.168.47.215 26379 91bef7dfdfda7aadd047839cc78f5acf14ade2c5
sentinel current-epoch 0
-rw-r--r-- 1 1001 1001 743 Nov 23 17:14 /opt/bitnami/redis-sentinel/etc/sentinel.conf
Thu Nov 28 19:45:40 MSK 2019
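
For reference, the same Sentinel query can be made with redis-cli instead of raw telnet (a minimal sketch, reusing the service name and master set shown above):

$ redis-cli -h redis.db.svc -p 26379 sentinel get-master-addr-by-name spt-redis
1) "192.168.47.214"
2) "6379"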

@baznikin baznikin changed the title [bitnami/redis] Sentenel return wrong IP [bitnami/redis] Sentinel return wrong IP Nov 28, 2019
@javsalgar
Contributor

Hi,

This is strange, because the generated config map uses the domain name, not the IP:

{{- if .Values.sentinel.enabled }}
  sentinel.conf: |-
    dir "/tmp"
    bind 0.0.0.0
    port {{ .Values.sentinel.port }}
    sentinel monitor {{ .Values.sentinel.masterSet }} {{ template "redis.fullname" . }}-master-0.{{ template "redis.fullname" . }}-headless.{{ .Release.Namespace }}.svc.{{ .Values.clusterDomain }} {{ .Values.redisPort }} {{ .Values.sentinel.quorum }}
    sentinel down-after-milliseconds {{ .Values.sentinel.masterSet }} {{ .Values.sentinel.downAfterMilliseconds }}
    sentinel failover-timeout {{ .Values.sentinel.masterSet }} {{ .Values.sentinel.failoverTimeout }}
    sentinel parallel-syncs {{ .Values.sentinel.masterSet }} {{ .Values.sentinel.parallelSyncs }}

Could you show the generated config map using kubectl?
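
For instance, something like this should print the rendered ConfigMap (a sketch assuming the release is named redis and deployed in namespace db, as in the commands above):

kubectl -n db get configmap redis -o yaml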

@baznikin
Author

baznikin commented Dec 2, 2019

Hmmm, domain name here... Very strange

apiVersion: v1
data:
  master.conf: |-
    dir /data
    rename-command FLUSHDB ""
    rename-command FLUSHALL ""
  redis.conf: |-
    # User-supplied configuration:
    # Enable AOF https://redis.io/topics/persistence#append-only-file
    appendonly yes
    # Disable RDB persistence, AOF persistence already enabled.
    save ""
  replica.conf: |-
    dir /data
    slave-read-only yes
    rename-command FLUSHDB ""
    rename-command FLUSHALL ""
  sentinel.conf: |-
    dir "/tmp"
    bind 0.0.0.0
    port 26379                                                                         
    sentinel monitor spt-redis redis-master-0.redis-headless.db.svc.cluster.local 6379 2
    sentinel down-after-milliseconds spt-redis 60000
    sentinel failover-timeout spt-redis 18000
    sentinel parallel-syncs spt-redis 1
kind: ConfigMap
metadata:
  creationTimestamp: "2019-11-23T17:05:00Z"
  labels:
    app: redis
    chart: redis-9.5.5
    heritage: Tiller
    release: redis
  name: redis
  namespace: db
  resourceVersion: "3099795"
  selfLink: /api/v1/namespaces/db/configmaps/redis
  uid: 6d867ed4-7175-433c-9bcd-58d887b64ecc

@baznikin
Author

baznikin commented Dec 2, 2019

I have not reloaded Redis yet; are there any tests I can do to track down the issue?

@javsalgar
Copy link
Contributor

Maybe deploy a new one and see if the address gets changed to an IP. Maybe it's something that Redis does automatically.

@baznikin
Author

baznikin commented Dec 2, 2019

I deployed a new Redis with helm install --name redis2 --namespace db bitnami/redis --version 9.5.5 -f deploy/helm/redis.yaml and these values:

password: "<password here>"
cluster:
  enabled: true
  slaveCount: 2
sentinel:
  enabled: true
  masterSet: spt-redis
persistence: {}
  # existingClaim:
master:
  statefulset:
    updateStrategy: RollingUpdate
slave:
  statefulset:
    updateStrategy: RollingUpdate
metrics:
  enabled: true
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9121"
  serviceMonitor:
    enabled: false

## Redis config file
## ref: https://redis.io/topics/config
##
configmap: |-
  # Enable AOF https://redis.io/topics/persistence#append-only-file
  appendonly yes
  # Disable RDB persistence, AOF persistence already enabled.
  save ""

Same result, IP in config:

$ kubectl -n db get configmap -o yaml redis2 | grep monitor
    sentinel monitor spt-redis redis2-master-0.redis2-headless.db.svc.cluster.local 6379 2
$ kubectl -n db exec -it pod/redis2-master-0 -c sentinel -- cat /opt/bitnami/redis-sentinel/etc/sentinel.conf | grep monitor
sentinel monitor spt-redis 192.168.47.208 6379 2
$ kubectl -n db get pod -o custom-columns=NAME:.metadata.name,IP:.status.podIP | grep 192.168.47.208
redis2-master-0                    192.168.47.208

@baznikin
Author

baznikin commented Dec 2, 2019

According to the docs there should be an IP address there. I also found a pretty old issue on this topic. I suppose we have to work around the master pod IP change (or check how others run Redis Sentinel on Kubernetes); correct me if I am wrong.

@baznikin
Author

baznikin commented Dec 2, 2019

Another point: if I restart pod/redis2-master-0, its config gets updated. However, the slaves' sentinels do not:

$ kubectl delete pod/redis2-master-0 -n db
pod "redis2-master-0" deleted

$ kubectl -n db exec -it pod/redis2-master-0 -c sentinel -- cat /opt/bitnami/redis-sentinel/etc/sentinel.conf | grep monitor
sentinel monitor spt-redis 192.168.47.253 6379 2

$ kubectl -n db get pod -o custom-columns=NAME:.metadata.name,IP:.status.podIP | grep redis2-master-0
redis2-master-0                    192.168.47.253

$ kubectl -n db exec -it pod/redis2-slave-0 -c sentinel -- cat /opt/bitnami/redis-sentinel/etc/sentinel.conf | grep monitor
sentinel monitor spt-redis 192.168.47.208 6379 2
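
The on-disk file can also be compared against the sentinel's runtime view by querying it directly (a sketch; it assumes redis-cli is available inside the sentinel container):

$ kubectl -n db exec -it pod/redis2-slave-0 -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name spt-redis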

@javsalgar
Copy link
Contributor

Hi,

Thanks for letting us know. I think this will require further investigation. Let me open an internal task. I will let you know when we have more details.

@stale

stale bot commented Dec 18, 2019

This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@stale stale bot added the stale 15 days without activity label Dec 18, 2019
@alemorcuq alemorcuq added on-hold Issues or Pull Requests with this label will never be considered stale and removed stale 15 days without activity labels Dec 19, 2019
@ajcann

ajcann commented Mar 13, 2020

@javsalgar Would this issue suggest it's not wise to rely on using Sentinel mode for HA in a production situation?

@javsalgar
Contributor

Hi,

We still need to investigate how to properly deal with Sentinel and the ephemerality of IP addresses. For the time being, until this issue is fixed, I would recommend sticking to a regular master-slave configuration. We are also working on a redis-cluster chart, which has a different failover mechanism and could be better suited for this kind of scenario. We will let you know when we have more updates on this.

@miguelaeh
Contributor

Hi @baznikin,
I have been testing what you explained here and it seems to be a temporary issue. Once you kill the master, there is a period during which one of the slaves needs to be promoted to master. During that time, the sentinels on both slaves will still point to the old master, and once the new master pod has been created it will point to itself, because the hostname in the configmap points to the pod called master. One thing to clarify: at this moment the actual master will be one of the pods called slave, and the pod called master will be a slave.
Once the cluster reaches a stable state, the sentinel pods start an auto-reconfiguration process, and after some time they all point to the new master (which is actually a pod called slave).
Let me illustrate this:

  • Right after the first deploy of the chart, the cluster is in a stable state:
10:43:56 › kgp -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE                                NOMINATED NODE   READINESS GATES
sentinel-redis-master-0   3/3     Running   0          3m50s   10.244.2.86   aks-agentpool-38805687-vmss000002   <none>           <none>
sentinel-redis-slave-0    3/3     Running   2          3m50s   10.244.3.89   aks-agentpool-38805687-vmss000003   <none>           <none>
sentinel-redis-slave-1    3/3     Running   0          2m15s   10.244.1.83   aks-agentpool-38805687-vmss000001   <none>           <none>

As you can see in the sentinel configuration of one of the pods, it is pointing to the master pod, which is correct:

10:44:03 › k exec -it sentinel-redis-master-0 -c sentinel bash
I have no name!@sentinel-redis-master-0:/$ cat /opt/bitnami/redis-sentinel/etc/sentinel.conf
dir "/tmp"
bind 0.0.0.0
port 26379
sentinel myid 3b9bba815cc15706f7b66f7ef85eefe215cb4c1b
sentinel deny-scripts-reconfig yes
sentinel monitor spt-redis 10.244.2.86 6379 2
.
.
.

Then, I killed the master pod:

10:45:51 › k delete pod sentinel-redis-master-0
pod "sentinel-redis-master-0" deleted

And now there is an unstable period where a slave should be promoted to master. Checking the logs of one of the slaves, you will see the following:

1:S 08 Apr 2020 10:41:54.951 # CONFIG REWRITE executed with success.
1:S 08 Apr 2020 10:41:55.265 * Connecting to MASTER 10.244.2.86:6379
1:S 08 Apr 2020 10:41:55.265 * MASTER <-> REPLICA sync started
1:S 08 Apr 2020 10:41:55.266 * Non blocking connect for SYNC fired the event.
1:S 08 Apr 2020 10:41:55.266 * Master replied to PING, replication can continue...
1:S 08 Apr 2020 10:41:55.268 * Trying a partial resynchronization (request e071e2ecae29225177b980a90e8afea809390681:2550).
1:S 08 Apr 2020 10:41:55.269 * Successful partial resynchronization with master.
1:S 08 Apr 2020 10:41:55.269 * MASTER <-> REPLICA sync: Master accepted a Partial Resynchronization.
1:S 08 Apr 2020 10:45:55.030 # Connection with master lost.
1:S 08 Apr 2020 10:45:55.030 * Caching the disconnected master state.
1:S 08 Apr 2020 10:45:55.078 * Connecting to MASTER 10.244.2.86:6379
1:S 08 Apr 2020 10:45:55.078 * MASTER <-> REPLICA sync started
1:S 08 Apr 2020 10:45:55.079 # Error condition on socket for SYNC: Connection refused
1:S 08 Apr 2020 10:45:56.081 * Connecting to MASTER 10.244.2.86:6379
1:S 08 Apr 2020 10:45:56.081 * MASTER <-> REPLICA sync started
1:S 08 Apr 2020 10:46:14.230 # Error condition on socket for SYNC: No route to host
1:S 08 Apr 2020 10:46:15.159 * Connecting to MASTER 10.244.2.86:6379
1:S 08 Apr 2020 10:46:15.160 * MASTER <-> REPLICA sync started
1:S 08 Apr 2020 10:46:23.454 # Error condition on socket for SYNC: No route to host
1:S 08 Apr 2020 10:46:24.190 * Connecting to MASTER 10.244.2.86:6379
1:S 08 Apr 2020 10:46:24.191 * MASTER <-> REPLICA sync started
1:S 08 Apr 2020 10:46:27.254 # Error condition on socket for SYNC: No route to host
1:S 08 Apr 2020 10:46:28.208 * Connecting to MASTER 10.244.2.86:6379
1:S 08 Apr 2020 10:46:28.208 * MASTER <-> REPLICA sync started

The new master pod is created automatically and its IP will be different from the previous one:

10:48:08 › kgp -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE                                NOMINATED NODE   READINESS GATES
sentinel-redis-master-0   3/3     Running   0          2m15s   10.244.2.87   aks-agentpool-38805687-vmss000002   <none>           <none>
sentinel-redis-slave-0    3/3     Running   2          8m      10.244.3.89   aks-agentpool-38805687-vmss000003   <none>           <none>
sentinel-redis-slave-1    3/3     Running   0          6m25s   10.244.1.83   aks-agentpool-38805687-vmss000001   <none>           <none>

If you exec into the new pod right after it is created, you will see that its sentinel configuration points to itself, i.e. to its new IP.
Now, in one of the slaves, the following will appear, indicating that it is now the master:

1:M 08 Apr 2020 10:48:07.934 * Discarding previously cached master state.
1:M 08 Apr 2020 10:48:07.934 * MASTER MODE enabled (user request from 'id=5 addr=10.244.3.89:39847 fd=10 name=sentinel-ce6ec014-cmd age=357 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=140 qbuf-free=32628 obl=36 oll=0 omem=0 events=r cmd=exec')
1:M 08 Apr 2020 10:48:07.934 # CONFIG REWRITE executed with success.
1:M 08 Apr 2020 10:48:09.553 * Replica 10.244.3.89:6379 asks for synchronization
1:M 08 Apr 2020 10:48:09.554 * Partial resynchronization request from 10.244.3.89:6379 accepted. Sending 437 bytes of backlog starting from offset 50657.
1:M 08 Apr 2020 10:48:21.036 * Replica 10.244.2.87:6379 asks for synchronization
1:M 08 Apr 2020 10:48:21.036 * Partial resynchronization not accepted: Replication ID mismatch (Replica asked for '567f79774695fcba15dc30133c6a398d59c74b4d', my replication IDs are '38c06c6daab43ea958f6fb2f0d615e8e02376c14' and 'e071e2ecae29225177b980a90e8afea809390681')
1:M 08 Apr 2020 10:48:21.036 * Starting BGSAVE for SYNC with target: disk
1:M 08 Apr 2020 10:48:21.037 * Background saving started by pid 741
741:C 08 Apr 2020 10:48:21.051 * DB saved on disk
741:C 08 Apr 2020 10:48:21.052 * RDB: 10 MB of memory used by copy-on-write
1:M 08 Apr 2020 10:48:21.085 * Background saving terminated with success
1:M 08 Apr 2020 10:48:21.086 * Synchronization with replica 10.244.2.87:6379 succeeded

And after some time, if we go to the old master we will see that the sentinel configuration is now pointing to the new master (which is sentinel-redis-slave-1):

10:49:11 › k exec -it sentinel-redis-master-0 -c sentinel bash
I have no name!@sentinel-redis-master-0:/$ cat /opt/bitnami/redis-sentinel/etc/sentinel.conf
dir "/tmp"
bind 0.0.0.0
port 26379
sentinel myid d25053c91626dbfabd456c4cdeab9bed39ea33fc
sentinel deny-scripts-reconfig yes
sentinel monitor spt-redis 10.244.1.83 6379 2

And now the cluster is stable again. I guess this is the behaviour you were expecting, but you didn't give Sentinel enough time to update the IPs.
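
One way to watch that convergence is to poll any of the sentinels for the advertised master address while the failover happens (a sketch; pod, container and master-set names are taken from the walkthrough above, and redis-cli is assumed to be available in the sentinel container):

watch -n 2 'kubectl exec sentinel-redis-slave-0 -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name spt-redis'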

@baznikin
Author

baznikin commented Apr 9, 2020 via email

@miguelaeh
Contributor

Thank you for the confirmation @baznikin !!

@albertocsm

albertocsm commented Apr 24, 2020

I've also experienced the issue described by @baznikin here.

When killing the pod named redis-master-0 while it is actually the current elected master, a condition occurs where Sentinel is not able to fail over successfully for an undetermined amount of time (hours).

1:X 24 Apr 2020 11:38:20.750 # +new-epoch 204
1:X 24 Apr 2020 11:38:20.750 # +try-failover master redis 10.145.159.239 6379
1:X 24 Apr 2020 11:38:20.751 # +vote-for-leader c855252ac9d60e833a37eb9a19b867fcbf9eec72 204
1:X 24 Apr 2020 11:38:20.754 # cde3c17ddd887f4c1fedc7f172b94073f3b3eef0 voted for c855252ac9d60e833a37eb9a19b867fcbf9eec72 204
1:X 24 Apr 2020 11:38:20.805 # +elected-leader master redis 10.145.159.239 6379
1:X 24 Apr 2020 11:38:20.805 # +failover-state-select-slave master redis 10.145.159.239 6379
1:X 24 Apr 2020 11:38:20.867 # -failover-abort-no-good-slave master redis 10.145.159.239 6379
1:X 24 Apr 2020 11:38:20.967 # Next failover delay: I will not start a failover before Fri Apr 24 11:38:50 2020

10.145.159.239 is the old IP of the dead redis-master-0 pod.

When in this state, querying each sentinel instance with sentinel ckquorum resulted in an error where each of the (3) sentinel instances in the cluster reported itself as not OK -> not enough Sentinels / majority not reached.
Not sure why this is - s_down instances should not be considered for majority purposes, right?

After a sentinel reset on all the sentinel instances, followed by another sentinel ckquorum, each sentinel instance now reports OK:

OK <N> usable Sentinels. Quorum and failover authorization can be reached

Even though not all sentinel instances are aware of all the other instances (?), all of them are aware of a majority (majority == 2 in the 1 master + 2 replicas cluster I'm running).

At this point, the failover procedure is still not able to complete successfully.

Unfortunately, I have yet to make this reproducible 100% of the time.

For reference, here is the output of sentinel master for each of the 3 members of the ensemble:

m0-----------
 1) "name"
 2) "redis"
 3) "ip"
 4) "10.145.159.239"
 5) "port"
 6) "6379"
 7) "runid"
 8) ""
 9) "flags"
10) "s_down,o_down,master,disconnected"
11) "link-pending-commands"
12) "32"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "2099287"
17) "last-ok-ping-reply"
18) "2099287"
19) "last-ping-reply"
20) "2099287"
21) "s-down-time"
22) "2094225"
23) "o-down-time"
24) "2094173"
25) "down-after-milliseconds"
26) "5000"
27) "info-refresh"
28) "4864481"
29) "role-reported"
30) "master"
31) "role-reported-time"
32) "2099287"
33) "config-epoch"
34) "6"
35) "num-slaves"
36) "0"
37) "num-other-sentinels"
38) "1"
39) "quorum"
40) "2"
41) "failover-timeout"
42) "15000"
43) "parallel-syncs"
44) "1"
s0-----------
 1) "name"
 2) "redis"
 3) "ip"
 4) "10.145.159.239"
 5) "port"
 6) "6379"
 7) "runid"
 8) ""
 9) "flags"
10) "s_down,o_down,master"
11) "link-pending-commands"
12) "101"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "2549571"
17) "last-ok-ping-reply"
18) "2549571"
19) "last-ping-reply"
20) "2549571"
21) "s-down-time"
22) "2544547"
23) "o-down-time"
24) "2284604"
25) "down-after-milliseconds"
26) "5000"
27) "info-refresh"
28) "4868280"
29) "role-reported"
30) "master"
31) "role-reported-time"
32) "2549571"
33) "config-epoch"
34) "6"
35) "num-slaves"
36) "0"
37) "num-other-sentinels"
38) "1"
39) "quorum"
40) "2"
41) "failover-timeout"
42) "15000"
43) "parallel-syncs"
44) "1"
s1-----------
 1) "name"
 2) "redis"
 3) "ip"
 4) "10.145.159.239"
 5) "port"
 6) "6379"
 7) "runid"
 8) ""
 9) "flags"
10) "s_down,o_down,master"
11) "link-pending-commands"
12) "101"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "2290182"
17) "last-ok-ping-reply"
18) "2290182"
19) "last-ping-reply"
20) "2290182"
21) "s-down-time"
22) "2285127"
23) "o-down-time"
24) "2285060"
25) "down-after-milliseconds"
26) "5000"
27) "info-refresh"
28) "4876023"
29) "role-reported"
30) "master"
31) "role-reported-time"
32) "2290182"
33) "config-epoch"
34) "6"
35) "num-slaves"
36) "0"
37) "num-other-sentinels"
38) "2"
39) "quorum"
40) "2"
41) "failover-timeout"
42) "15000"
43) "parallel-syncs"
44) "1"

I would be happy to provide more info / logs.
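
For completeness, the ckquorum / reset sequence described above would look roughly like this when run against every member (a sketch; pod names, namespace and container name are illustrative, while the master set name redis is taken from the logs above):

for pod in redis-master-0 redis-slave-0 redis-slave-1; do
  kubectl -n db exec "$pod" -c sentinel -- redis-cli -p 26379 sentinel reset redis
  kubectl -n db exec "$pod" -c sentinel -- redis-cli -p 26379 sentinel ckquorum redis
done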

@miguelaeh
Contributor

Hi @albertocsm,
This seems to be a different case from the one explained in this issue; we would appreciate it if you could open a new issue for it. There is no need to write all the information again, just reference your comment in this thread.
Regarding your issue, the sentinel failover time is configurable; judging by this line from your output, it seems it is configured to wait some time before retrying:

Next failover delay: I will not start a failover before Fri Apr 24 11:38:50 2020

Regards.
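
Both of those timings are exposed as chart parameters feeding the sentinel.conf template quoted earlier in this thread; a hedged values sketch matching the numbers in the output above:

sentinel:
  enabled: true
  masterSet: redis
  downAfterMilliseconds: 5000   # ms before the master is flagged as subjectively down
  failoverTimeout: 15000        # also governs how long Sentinel waits before retrying a failover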

@rtriveurbana

I experienced the same problem tonight. One node of my cluster went down and a failover started. One sentinel returned the old master IP instead of the new one, causing problems for other services that use Redis.

@miguelaeh
Contributor

Hi @rtriveurbana ,
Could you provide more information? Did you wait enough time for the cluster to recover?

@tomislater
Contributor

I have noticed that old IPs are not removed from sentinel.conf. I have the default sentinel setup, the newest version (3 nodes):

cat /opt/bitnami/redis-sentinel/etc/sentinel.conf:

dir "/tmp"
port 26379
sentinel monitor mymaster 10.110.37.122 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 18000

# User-supplied sentinel configuration:
# End of sentinel configuration
sentinel myid 80d3b99129c4f9bb62cdcec2399499f2c4666789

sentinel known-sentinel mymaster 10.110.59.80 26379 475427d0cefa44aff518ab6b888c3510314d882c

sentinel known-sentinel mymaster 10.110.17.137 26379 cae9f9b5ed951299f4bdc7de1da8e996e7b7197c

sentinel known-replica mymaster 10.110.51.41 6379

sentinel known-replica mymaster 10.110.59.80 6379
# Generated by CONFIG REWRITE
protected-mode no
user default on nopass ~* &* +@all
sentinel config-epoch mymaster 3
sentinel leader-epoch mymaster 3
sentinel current-epoch 3
sentinel known-replica mymaster 10.110.16.155 6379
sentinel known-replica mymaster 10.110.18.152 6379
sentinel known-replica mymaster 10.110.17.137 6379

10.110.37.122 is my current master and 10.110.59.80, 10.110.17.137 are my replicas.

And here are the problems with these IPs:

  • 10.110.51.41 - I do not know what it is
  • 10.110.16.155 - that was an old replica (I deleted one pod, just for fun)
  • 10.110.18.152 - that was also an old replica (I also deleted this pod, but not for fun; for tests)

Looks like the sentinel configuration is not updated properly when I (or my cluster, e.g. the autoscaler) delete pods.

It is even funnier! 🙈

I spawned a new Redis release (standalone, only one pod, in another namespace) and k8s gave it the 10.110.18.152 IP address, which, as you can see, is the same one the old (deleted) replica had! And now sentinel "thinks" that this new Redis in another namespace is a replica :D and starts syncing it with the master...

I think we need something that is able to remove old IPs from sentinel.conf 🤔. I need to study the chart, then I can help.

@miguelaeh
Contributor

Hi @tomislater,
We would appreciate it if you could create a new issue for your case, since it seems to be unrelated to the original one.
What you are describing should not happen; we will probably need to re-check that.

@tomislater
Contributor

@miguelaeh hey, I have added a comment here: #5418 (comment)

Looks like we should use:

SENTINEL resolve-hostnames yes
SENTINEL announce-hostnames yes

I am going to debug this further.
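
For context, a hostname-based sentinel.conf could look roughly like this (a sketch only; it assumes Redis/Sentinel 6.2 or newer, where these directives exist, and reuses the headless-service hostname pattern from the chart template quoted earlier):

dir "/tmp"
port 26379
sentinel resolve-hostnames yes
sentinel announce-hostnames yes
sentinel monitor mymaster redis-master-0.redis-headless.db.svc.cluster.local 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 18000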

@tomislater
Contributor

I have a solution and it seems to be working 🤔 I will propose a PR soon.

@miguelaeh
Contributor

Hi @tomislater,
Thank you for the PR! We will take a look at it.

@carrodher
Member

Unfortunately, this issue was created a long time ago and, although there is an internal task to fix it, it was not prioritized as something to address in the short/mid term. It is not for a technical reason but rather a matter of capacity, since we are a small team.

That being said, contributions via PRs are more than welcome in both repositories (containers and charts), in case you would like to contribute.

In the meantime, there have been several releases of this asset and it is possible the issue has gone away as part of other changes. If that is not the case and you are still experiencing this issue, please feel free to reopen it and we will re-evaluate it.
