Patroni Postgresql instance is down due to k8s api unavailable #669

Open
cbartz opened this issue Aug 30, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@cbartz

cbartz commented Aug 30, 2024

Steps to reproduce

  1. Deploy postgresql-k8s with three units.
  2. Integrate it with another k8s app.

Example deployment: https://pastebin.canonical.com/p/YRyVgZpxC5/

Expected behavior

The cluster does not fail when the K8s API is temporarily unavailable.

Actual behavior

We received the alert

[[FIRING:1] PatroniPostgresqlDown (alertname="PatroniPostgresqlDown", instance="prod-repo-policy-compliance_e3901794-73c7-43f0-85ae-5179a6dc1b16_postgresql_postgresql/0", job="juju_prod-repo-policy-compliance_e3901794_postgresql_prometheus_scrape-0_60d7da0f2146e3c09ca949795f3ca709e4494a27fcf14b443a684b940f4d44fe", juju_charm="postgresql-k8s", juju_unit="postgresql/0", scope="patroni-postgresql", severity="critical")](http://alertmanager-0.alertmanager-endpoints.prod-cos-k8s-ps6-is-charms.svc.cluster.local:9093/#/alerts?receiver=mattermost-notifications)
Alert: Patroni Posrgresql Down (instance prod-repo-policy-compliance_e3901794-73c7-43f0-85ae-5179a6dc1b16_postgresql_postgresql/0) - critical

Description: Patroni Postgresql instance is down
  VALUE = 0
  LABELS = map[name:patroni_postgres_running instance:prod-repo-policy-compliance_e3901794-73c7-43f0-85ae-5179a6dc1b16_postgresql_postgresql/0 job:juju_prod-repo-policy-compliance_e3901794_postgresql_prometheus_scrape-0_60d7da0f2146e3c09ca949795f3ca709e4494a27fcf14b443a684b940f4d44fe juju_application:postgresql juju_charm:postgresql-k8s juju_model:prod-repo-policy-compliance juju_model_uuid:e3901794-73c7-43f0-85ae-5179a6dc1b16 juju_unit:postgresql/0 scope:patroni-postgresql]

Details:
   • alertname: PatroniPostgresqlDown
   • instance: prod-repo-policy-compliance_e3901794-73c7-43f0-85ae-5179a6dc1b16_postgresql_postgresql/0
   • job: juju_prod-repo-policy-compliance_e3901794_postgresql_prometheus_scrape-0_60d7da0f2146e3c09ca949795f3ca709e4494a27fcf14b443a684b940f4d44fe
   • juju_application: postgresql
   • juju_charm: postgresql-k8s
   • juju_model: prod-repo-policy-compliance
   • juju_model_uuid: e3901794-73c7-43f0-85ae-5179a6dc1b16
   • juju_unit: postgresql/0
   • scope: patroni-postgresql
   • severity: critical

which caused our application to fail because the database would only accept read-only transactions:

2024-08-29 14:08:04.851	
sqlalchemy.exc.InternalError: (psycopg2.errors.ReadOnlySqlTransaction) cannot execute DELETE in a read-only transaction

2024-08-29 14:08:04.851	

2024-08-29 14:08:04.851	
[SQL: DELETE FROM one_time_token WHERE one_time_token.value = %(value_1)s]
2024-08-29 14:08:04.851	
[parameters: {'value_1': 'foobar''}]
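On the application side, a brief failover window like this surfaces as read-only errors on writes. A minimal, hypothetical sketch of a retry wrapper that tolerates such a window (the exception class here is a stand-in for `psycopg2.errors.ReadOnlySqlTransaction`, not the project's actual code):

```python
import time


class ReadOnlySqlTransaction(Exception):
    """Stand-in for psycopg2.errors.ReadOnlySqlTransaction (hypothetical)."""


def retry_on_readonly(fn, attempts=5, delay=0.5):
    """Call fn, retrying while the cluster is mid-failover and read-only.

    Re-raises the error if the failover outlasts the retry budget.
    """
    for i in range(attempts):
        try:
            return fn()
        except ReadOnlySqlTransaction:
            if i == attempts - 1:
                raise  # failover took longer than attempts * delay
            time.sleep(delay)
```

This only papers over short failovers; it would not have helped with the roughly 10-minute outage described below.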

The Patroni logs indicate that the root cause was the unavailability of the k8s API:

WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.87.0.1', port=443): Read timed out. (read timeout=4.976570268161595)")': /api/v1/namespaces/prod-repo-policy-compliance/endpoints?labelSelector=application%3Dpatroni%2Ccluster-name%3Dpatroni-postgresql 

It took about 10 minutes to get the cluster back to a healthy state.

Versions

Operating system:

Juju CLI: 3.5.3-ubuntu-amd64

Juju agent: 3.5.3

Charm revision: 239

microk8s:

Log output

Patroni logs:

2024-08-29 14:03:29.723	
2024-08-29 12:03:28 UTC [360554]: INFO: no action. I am (postgresql-2), a secondary, and following a leader (postgresql-0) 
2024-08-29 14:03:29.825	
2024-08-29 12:03:28 UTC [223009]: INFO: no action. I am (postgresql-1), a secondary, and following a leader (postgresql-0) 
2024-08-29 14:03:40.325	
2024-08-29 12:03:38 UTC [223009]: INFO: Got response from postgresql-0 http://postgresql-0.postgresql-endpoints:8008/patroni: {"state": "stopping", "role": "master", "cluster_unlocked": true, "dcs_last_seen": 1724932997, "database_system_identifier": "7398399066156073000", "patroni": {"version": "3.1.2", "scope": "patroni-postgresql"}, "logger_queue_size": 14} 
2024-08-29 14:03:40.325	
2024-08-29 12:03:38 UTC [223009]: WARNING: Request failed to postgresql-0: GET http://postgresql-0.postgresql-endpoints:8008/patroni ('xlog') 
2024-08-29 14:03:43.021	
2024-08-29 12:03:17 UTC [232335]: INFO: Lock owner: postgresql-0; I am postgresql-0 
2024-08-29 14:03:43.021	
2024-08-29 12:03:22 UTC [232335]: ERROR: Request to server https://10.87.0.1:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.87.0.1', port=443): Read timed out. (read timeout=4.999772991985083)") 
2024-08-29 14:03:43.021	
2024-08-29 12:03:25 UTC [232335]: ERROR: Request to server https://10.87.0.1:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.87.0.1', port=443): Read timed out. (read timeout=2.355637992732227)") 
2024-08-29 14:03:43.021	
2024-08-29 12:03:27 UTC [232335]: ERROR: Request to server https://10.87.0.1:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.87.0.1', port=443): Read timed out. (read timeout=1.8544748411513865)") 
2024-08-29 14:03:43.021	
2024-08-29 12:03:27 UTC [232335]: ERROR: Error communicating with DCS 
2024-08-29 14:03:43.021	
2024-08-29 12:03:27 UTC [232335]: INFO: demoting self because DCS is not accessible and I was a leader 
2024-08-29 14:03:43.021	
2024-08-29 12:03:27 UTC [232335]: INFO: Demoting self (offline) 
2024-08-29 14:03:43.021	
2024-08-29 12:03:27 UTC [232335]: WARNING: Loop time exceeded, rescheduling immediately. 
2024-08-29 14:03:44.322	
2024-08-29 12:03:28 UTC [232335]: WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.87.0.1', port=443): Read timed out. (read timeout=4.976570268161595)")': /api/v1/namespaces/prod-repo-policy-compliance/endpoints?labelSelector=application%3Dpatroni%2Ccluster-name%3Dpatroni-postgresql 
2024-08-29 14:03:44.322	
2024-08-29 12:03:28 UTC [232335]: ERROR: Request to server https://10.87.0.1:443 failed: ClosedPoolError("HTTPSConnectionPool(host='10.87.0.1', port=443): Pool is closed.") 
2024-08-29 14:03:44.322	
2024-08-29 12:03:37 UTC [232335]: ERROR: get_cluster 
2024-08-29 14:03:44.322	
Traceback (most recent call last):
2024-08-29 14:03:44.322	
  File "/usr/lib/python3/dist-packages/patroni/dcs/kubernetes.py", line 977, in __load_cluster
2024-08-29 14:03:44.322	
    self._wait_caches(stop_time)
2024-08-29 14:03:44.323	
  File "/usr/lib/python3/dist-packages/patroni/dcs/kubernetes.py", line 859, in _wait_caches
2024-08-29 14:03:44.323	
    raise RetryFailedError('Exceeded retry deadline')
2024-08-29 14:03:44.323	
patroni.utils.RetryFailedError: Exceeded retry deadline
2024-08-29 14:03:44.323	
2024-08-29 12:03:37 UTC [232335]: ERROR: Error communicating with DCS 
2024-08-29 14:03:44.323	
2024-08-29 12:03:37 UTC [232335]: WARNING: AsyncExecutor is busy, demoting from the main thread 
2024-08-29 14:03:44.323	
2024-08-29 12:03:37 UTC [232335]: INFO: Demoting self (offline) 
2024-08-29 14:03:44.323	
2024-08-29 12:03:39 UTC [232335]: ERROR: Request to server https://10.87.0.1:443 failed: MaxRetryError('HTTPSConnectionPool(host=\'10.87.0.1\', port=443): Max retries exceeded with url: /api/v1/namespaces/prod-repo-policy-compliance/endpoints?labelSelector=application%3Dpatroni%2Ccluster-name%3Dpatroni-postgresql (Caused by ReadTimeoutError("HTTPSConnectionPool(host=\'10.87.0.1\', port=443): Read timed out. (read timeout=4.995564742013812)"))') 
2024-08-29 14:03:44.323	
2024-08-29 12:03:40 UTC [232335]: ERROR: ObjectCache.run K8sConnectionFailed('No more API server nodes in the cluster') 
2024-08-29 14:03:44.323	
2024-08-29 12:03:43 UTC [232335]: INFO: closed patroni connection to the postgresql cluster 
2024-08-29 14:03:44.525	
2024-08-29 12:03:43 UTC [223009]: ERROR: Request to server https://10.87.0.1:443 failed: ReadTimeoutError("HTTPSConnectionPool(host='10.87.0.1', port=443): Read timed out. (read timeout=4.994864068925381)") 
2024-08-29 14:03:45.822	
2024-08-29 12:03:44 UTC [232335]: INFO: postmaster pid=342084 
2024-08-29 14:03:45.822	
2024-08-29 12:03:44 UTC [232335]: INFO: postmaster pid=342086 
2024-08-29 14:03:45.822	
2024-08-29 12:03:44 UTC [232335]: INFO: demoted self because DCS is not accessible and I was a leader 
2024-08-29 14:03:45.822	
2024-08-29 12:03:44 UTC [232335]: WARNING: Loop time exceeded, rescheduling immediately. 
2024-08-29 14:03:47.122	
2024-08-29 12:03:45 UTC [232335]: WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='10.87.0.1', port=443): Read timed out. (read timeout=4.993611121084541)")': /api/v1/namespaces/prod-repo-policy-compliance/endpoints?labelSelector=application%3Dpatroni%2Ccluster-name%3Dpatroni-postgresql 
2024-08-29 14:03:47.122	
2024-08-29 12:03:46 UTC [232335]: INFO: Lock owner: postgresql-1; I am postgresql-0 
2024-08-29 14:03:47.122	
2024-08-29 12:03:46 UTC [232335]: INFO: demoting self because DCS is not accessible and I was a leader in progress 
2024-08-29 14:03:47.325	
2024-08-29 12:03:46 UTC [223009]: INFO: promoted self to leader by acquiring session lock 
2024-08-29 14:03:49.327	
2024-08-29 12:03:48 UTC [223009]: INFO: Lock owner: postgresql-1; I am postgresql-1 
2024-08-29 14:03:49.327	
2024-08-29 12:03:48 UTC [223009]: INFO: updated leader lock during promote 
2024-08-29 14:03:49.622	
2024-08-29 12:03:48 UTC [232335]: INFO: Lock owner: postgresql-1; I am postgresql-0 
2024-08-29 14:03:49.622	
2024-08-29 12:03:48 UTC [232335]: INFO: establishing a new patroni connection to the postgres cluster 
2024-08-29 14:03:49.622	
2024-08-29 12:03:48 UTC [232335]: INFO: Local timeline=247 lsn=0/CE0000A0 
2024-08-29 14:03:49.622	
2024-08-29 12:03:49 UTC [232335]: INFO: Dropped unknown replication slot 'postgresql_1' 
2024-08-29 14:03:49.622	
2024-08-29 12:03:49 UTC [232335]: INFO: Dropped unknown replication slot 'postgresql_2' 
2024-08-29 14:03:49.622	
2024-08-29 12:03:49 UTC [232335]: INFO: no action. I am (postgresql-0), a secondary, and following a leader (postgresql-1) 
2024-08-29 14:03:50.124	
2024-08-29 12:03:38 UTC [360554]: INFO: following a different leader because i am not the healthiest node 
2024-08-29 14:03:50.124	
2024-08-29 12:03:46 UTC [360554]: INFO: Lock owner: postgresql-1; I am postgresql-2 
2024-08-29 14:03:50.124	
2024-08-29 12:03:46 UTC [360554]: INFO: Local timeline=247 lsn=0/CD348B40 
2024-08-29 14:03:50.124	
2024-08-29 12:03:47 UTC [360554]: INFO: no action. I am (postgresql-2), a secondary, and following a leader (postgresql-1) 
2024-08-29 14:03:50.124	
2024-08-29 12:03:48 UTC [360554]: INFO: Lock owner: postgresql-1; I am postgresql-2 
2024-08-29 14:03:50.124	
2024-08-29 12:03:48 UTC [360554]: INFO: Local timeline=247 lsn=0/CD348B40 
2024-08-29 14:03:50.424	
2024-08-29 12:03:49 UTC [223009]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-08-29 14:03:50.424	
2024-08-29 12:03:49 UTC [223009]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-08-29 14:03:50.822	
2024-08-29 12:03:49 UTC [232335]: INFO: Lock owner: postgresql-1; I am postgresql-0 
2024-08-29 14:03:50.822	
2024-08-29 12:03:49 UTC [232335]: INFO: Local timeline=247 lsn=0/CE0000A0 
2024-08-29 14:03:50.822	
2024-08-29 12:03:49 UTC [232335]: INFO: primary_timeline=248 
2024-08-29 14:03:50.822	
2024-08-29 12:03:49 UTC [232335]: INFO: primary: history=244	0/CD0000A0	no recovery target specified
2024-08-29 14:03:50.822	
245	0/CD1E0B78	no recovery target specified
2024-08-29 14:03:50.822	
246	0/CD1E7C88	no recovery target specified
2024-08-29 14:03:50.822	
247	0/CE0000A0	no recovery target specified 
2024-08-29 14:03:50.822	
2024-08-29 12:03:49 UTC [232335]: INFO: no action. I am (postgresql-0), a secondary, and following a leader (postgresql-1) 
2024-08-29 14:03:50.822	
2024-08-29 12:03:49 UTC [232335]: INFO: no action. I am (postgresql-0), a secondary, and following a leader (postgresql-1) 
2024-08-29 14:03:54.424	
2024-08-29 12:03:48 UTC [360554]: INFO: no action. I am (postgresql-2), a secondary, and following a leader (postgresql-1) 
2024-08-29 14:03:54.424	
2024-08-29 12:03:48 UTC [360554]: INFO: Lock owner: postgresql-1; I am postgresql-2 
2024-08-29 14:03:54.424	
2024-08-29 12:03:48 UTC [360554]: INFO: Local timeline=247 lsn=0/CD348B40 
2024-08-29 14:03:54.424	
2024-08-29 12:03:48 UTC [360554]: INFO: primary_timeline=248 
2024-08-29 14:03:54.424	
2024-08-29 12:03:48 UTC [360554]: INFO: primary: history=244	0/CD0000A0	no recovery target specified
2024-08-29 14:03:54.424	
245	0/CD1E0B78	no recovery target specified
2024-08-29 14:03:54.424	
246	0/CD1E7C88	no recovery target specified
2024-08-29 14:03:54.424	
247	0/CE0000A0	no recovery target specified 
2024-08-29 14:03:56.124	
2024-08-29 12:03:48 UTC [360554]: INFO: no action. I am (postgresql-2), a secondary, and following a leader (postgresql-1) 
2024-08-29 14:03:56.124	
2024-08-29 12:03:49 UTC [360554]: INFO: no action. I am (postgresql-2), a secondary, and following a leader (postgresql-1) 
2024-08-29 14:04:00.122	
2024-08-29 12:03:59 UTC [232335]: INFO: no action. I am (postgresql-0), a secondary, and following a leader (postgresql-1) 
2024-08-29 14:04:00.325	
2024-08-29 12:03:59 UTC [223009]: INFO: Lock owner: postgresql-1; I am postgresql-1 
2024-08-29 14:04:00.325	
2024-08-29 12:03:59 UTC [223009]: INFO: Assigning synchronous standby status to ['postgresql-2'] 
2024-08-29 14:04:00.624	
2024-08-29 12:03:59 UTC [360554]: INFO: no action. I am (postgresql-2), a secondary, and following a leader (postgresql-1) 
2024-08-29 14:04:03.225	
2024-08-29 12:04:01 UTC [223009]: INFO: Synchronous standby status assigned to ['postgresql-2'] 
2024-08-29 14:04:03.225	
2024-08-29 12:04:02 UTC [223009]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-08-29 14:04:07.725	
2024-08-29 12:04:06 UTC [223009]: INFO: acquired session lock as a leader 
2024-08-29 14:04:08.122	
2024-08-29 12:04:07 UTC [232335]: INFO: following a different leader because i am not the healthiest node 
2024-08-29 14:04:08.122	
2024-08-29 12:04:07 UTC [232335]: INFO: Lock owner: postgresql-1; I am postgresql-0 
2024-08-29 14:04:08.122	
2024-08-29 12:04:07 UTC [232335]: INFO: Local timeline=247 lsn=0/CE0000A0 
2024-08-29 14:04:08.122	
2024-08-29 12:04:07 UTC [232335]: ERROR: Exception when working with primary via replication connection 
2024-08-29 14:04:08.122	
Traceback (most recent call last):
2024-08-29 14:04:08.122	
  File "/usr/lib/python3/dist-packages/patroni/postgresql/rewind.py", line 219, in _check_timeline_and_lsn
2024-08-29 14:04:08.122	
    with self._postgresql.get_replication_connection_cursor(**leader.conn_kwargs()) as cur:
2024-08-29 14:04:08.122	
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
2024-08-29 14:04:08.122	
    return next(self.gen)
2024-08-29 14:04:08.122	
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 1027, in get_replication_connection_cursor
2024-08-29 14:04:08.122	
    with get_connection_cursor(**conn_kwargs) as cur:
2024-08-29 14:04:08.122	
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
2024-08-29 14:04:08.122	
    return next(self.gen)
2024-08-29 14:04:08.122	
  File "/usr/lib/python3/dist-packages/patroni/postgresql/connection.py", line 48, in get_connection_cursor
2024-08-29 14:04:08.122	
    conn = psycopg.connect(**kwargs)
2024-08-29 14:04:08.122	
  File "/usr/lib/python3/dist-packages/patroni/psycopg.py", line 103, in connect
2024-08-29 14:04:08.122	
    ret = _connect(*args, **kwargs)
2024-08-29 14:04:08.122	
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
2024-08-29 14:04:08.122	
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
2024-08-29 14:04:08.122	
psycopg2.OperationalError: connection to server at "postgresql-1.postgresql-endpoints" (192.168.100.212), port 5432 failed: FATAL:  no pg_hba.conf entry for replication connection from host "192.168.102.57", user "replication", no encryption
2024-08-29 14:04:08.122	

2024-08-29 14:04:08.424	
2024-08-29 12:04:06 UTC [360554]: INFO: Got response from postgresql-1 http://postgresql-1.postgresql-endpoints:8008/patroni: {"state": "running", "postmaster_start_time": "2024-08-28 19:54:15.898293+00:00", "role": "master", "server_version": 140012, "xlog": {"location": 3456107208}, "timeline": 248, "replication": [{"usename": "replication", "application_name": "postgresql-2", "client_addr": "192.168.103.49", "state": "streaming", "sync_state": "sync", "sync_priority": 1}], "cluster_unlocked": true, "dcs_last_seen": 1724933046, "database_system_identifier": "7398399066156073000", "patroni": {"version": "3.1.2", "scope": "patroni-postgresql"}} 
2024-08-29 14:04:08.424	
2024-08-29 12:04:06 UTC [360554]: WARNING: Primary (postgresql-1) is still alive 
2024-08-29 14:04:08.424	
2024-08-29 12:04:07 UTC [360554]: INFO: following a different leader because i am not the healthiest node 
2024-08-29 14:04:08.424	
2024-08-29 12:04:07 UTC [360554]: INFO: Lock owner: postgresql-1; I am postgresql-2 
2024-08-29 14:04:08.424	
2024-08-29 12:04:07 UTC [360554]: INFO: Local timeline=248 lsn=0/CE0002C8 
2024-08-29 14:04:08.424	
2024-08-29 12:04:07 UTC [360554]: INFO: primary_timeline=248 
2024-08-29 14:04:08.424	
2024-08-29 12:04:08 UTC [360554]: INFO: no action. I am (postgresql-2), a secondary, and following a leader (postgresql-1) 
2024-08-29 14:04:08.825	
2024-08-29 12:04:06 UTC [223009]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-08-29 14:04:08.825	
2024-08-29 12:04:07 UTC [223009]: INFO: Enabled synchronous replication 
2024-08-29 14:04:08.825	
2024-08-29 12:04:07 UTC [223009]: INFO: Assigning synchronous standby status to ['postgresql-2'] 
2024-08-29 14:04:08.825	
2024-08-29 12:04:07 UTC [223009]: INFO: Synchronous standby status assigned to ['postgresql-2'] 
2024-08-29 14:04:08.825	
2024-08-29 12:04:07 UTC [223009]: INFO: acquired session lock as a leader 
2024-08-29 14:04:08.825	
2024-08-29 12:04:07 UTC [223009]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-08-29 14:04:09.422	
2024-08-29 12:04:07 UTC [232335]: INFO: no action. I am (postgresql-0), a secondary, and following a leader (postgresql-1) 
2024-08-29 14:04:09.422	
2024-08-29 12:04:07 UTC [232335]: INFO: Lock owner: postgresql-1; I am postgresql-0 
2024-08-29 14:04:09.422	
2024-08-29 12:04:07 UTC [232335]: INFO: Local timeline=247 lsn=0/CE0000A0 
2024-08-29 14:04:09.422	
2024-08-29 12:04:07 UTC [232335]: ERROR: Exception when working with primary via replication connection 
2024-08-29 14:04:09.422	
Traceback (most recent call last):
2024-08-29 14:04:09.422	
  File "/usr/lib/python3/dist-packages/patroni/postgresql/rewind.py", line 219, in _check_timeline_and_lsn
2024-08-29 14:04:09.422	
    with self._postgresql.get_replication_connection_cursor(**leader.conn_kwargs()) as cur:
2024-08-29 14:04:09.422	
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
2024-08-29 14:04:09.422	
    return next(self.gen)
2024-08-29 14:04:09.422	
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 1027, in get_replication_connection_cursor
2024-08-29 14:04:09.422	
    with get_connection_cursor(**conn_kwargs) as cur:
2024-08-29 14:04:09.422	
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
2024-08-29 14:04:09.422	
    return next(self.gen)
2024-08-29 14:04:09.422	
  File "/usr/lib/python3/dist-packages/patroni/postgresql/connection.py", line 48, in get_connection_cursor
2024-08-29 14:04:09.422	
    conn = psycopg.connect(**kwargs)
2024-08-29 14:04:09.422	
  File "/usr/lib/python3/dist-packages/patroni/psycopg.py", line 103, in connect
2024-08-29 14:04:09.422	
    ret = _connect(*args, **kwargs)
2024-08-29 14:04:09.422	
  File "/usr/lib/python3/dist-packages/psycopg2/__init__.py", line 122, in connect
2024-08-29 14:04:09.422	
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
2024-08-29 14:04:09.422	
psycopg2.OperationalError: connection to server at "postgresql-1.postgresql-endpoints" (192.168.100.212), port 5432 failed: FATAL:  no pg_hba.conf entry for replication connection from host "192.168.102.57", user "replication", no encryption
2024-08-29 14:04:09.422	

2024-08-29 14:04:09.422	
2024-08-29 12:04:07 UTC [232335]: INFO: no action. I am (postgresql-0), a secondary, and following a leader (postgresql-1) 
2024-08-29 14:04:10.124	
2024-08-29 12:04:08 UTC [360554]: INFO: Removing data directory: /var/lib/postgresql/data/pgdata 
2024-08-29 14:04:10.124	
2024-08-29 12:04:08 UTC [360554]: ERROR: Could not remove data directory /var/lib/postgresql/data/pgdata 
2024-08-29 14:04:10.124	
Traceback (most recent call last):
2024-08-29 14:04:10.124	
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 1314, in remove_data_directory
2024-08-29 14:04:10.124	
    shutil.rmtree(self._data_dir)
2024-08-29 14:04:10.124	
  File "/usr/lib/python3.10/shutil.py", line 731, in rmtree
2024-08-29 14:04:10.124	
    onerror(os.rmdir, path, sys.exc_info())
2024-08-29 14:04:10.124	
  File "/usr/lib/python3.10/shutil.py", line 729, in rmtree
2024-08-29 14:04:10.124	
    os.rmdir(path)
2024-08-29 14:04:10.124	
PermissionError: [Errno 13] Permission denied: '/var/lib/postgresql/data/pgdata'
2024-08-29 14:04:10.124	
2024-08-29 12:04:08 UTC [360554]: INFO: renaming data directory to /var/lib/postgresql/data/pgdata.failed 
2024-08-29 14:04:10.124	
2024-08-29 12:04:08 UTC [360554]: ERROR: Could not rename data directory /var/lib/postgresql/data/pgdata 
2024-08-29 14:04:10.124	
Traceback (most recent call last):

2024-08-29 14:04:10.124	
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 1314, in remove_data_directory
2024-08-29 14:04:10.124	
    shutil.rmtree(self._data_dir)
2024-08-29 14:04:10.124	
  File "/usr/lib/python3.10/shutil.py", line 731, in rmtree
2024-08-29 14:04:10.124	
    onerror(os.rmdir, path, sys.exc_info())
2024-08-29 14:04:10.124	
  File "/usr/lib/python3.10/shutil.py", line 729, in rmtree
2024-08-29 14:04:10.124	
    os.rmdir(path)
2024-08-29 14:04:10.124	
PermissionError: [Errno 13] Permission denied: '/var/lib/postgresql/data/pgdata'
2024-08-29 14:04:10.124	

2024-08-29 14:04:10.124	
During handling of the above exception, another exception occurred:
2024-08-29 14:04:10.124	

2024-08-29 14:04:10.124	
Traceback (most recent call last):
2024-08-29 14:04:10.124	
  File "/usr/lib/python3/dist-packages/patroni/postgresql/__init__.py", line 1287, in move_data_directory
2024-08-29 14:04:10.124	
    os.rename(self._data_dir, new_name)
2024-08-29 14:04:10.124	
PermissionError: [Errno 13] Permission denied: '/var/lib/postgresql/data/pgdata' -> '/var/lib/postgresql/data/pgdata.failed'
2024-08-29 14:04:13.923	
2024-08-29 12:04:12 UTC [360554]: INFO: replica has been created using basebackup 
2024-08-29 14:04:13.923	
2024-08-29 12:04:12 UTC [360554]: INFO: bootstrapped from leader 'postgresql-1' 
2024-08-29 14:04:13.923	
2024-08-29 12:04:12 UTC [360554]: INFO: closed patroni connection to the postgresql cluster 
2024-08-29 14:04:13.923	
2024-08-29 12:04:13 UTC [360554]: INFO: postmaster pid=368840 
2024-08-29 14:04:15.224	
2024-08-29 12:04:14 UTC [360554]: INFO: Lock owner: postgresql-1; I am postgresql-2 
2024-08-29 14:04:15.224	
2024-08-29 12:04:14 UTC [360554]: INFO: establishing a new patroni connection to the postgres cluster 
2024-08-29 14:04:16.725	
2024-08-29 12:04:14 UTC [360554]: INFO: no action. I am (postgresql-2), a secondary, and following a leader (postgresql-1) 
2024-08-29 14:04:18.222	
2024-08-29 12:04:17 UTC [232335]: INFO: Lock owner: postgresql-1; I am postgresql-0 
2024-08-29 14:04:18.222	
2024-08-29 12:04:17 UTC [232335]: INFO: Local timeline=248 lsn=0/D0000000 
2024-08-29 14:04:18.222	
2024-08-29 12:04:17 UTC [232335]: INFO: primary_timeline=248 
2024-08-29 14:04:18.222	
2024-08-29 12:04:17 UTC [232335]: INFO: no action. I am (postgresql-0), a secondary, and following a leader (postgresql-1) 
2024-08-29 14:04:18.325	
2024-08-29 12:04:17 UTC [223009]: INFO: Lock owner: postgresql-1; I am postgresql-1 
2024-08-29 14:04:18.325	
2024-08-29 12:04:17 UTC [223009]: INFO: Updating synchronous privilege temporarily from ['postgresql-2'] to [] 
2024-08-29 14:04:18.325	
2024-08-29 12:04:17 UTC [223009]: INFO: Assigning synchronous standby status to ['postgresql-0'] 
2024-08-29 14:04:18.423	
2024-08-29 12:04:17 UTC [360554]: INFO: no action. I am (postgresql-2), a secondary, and following a leader (postgresql-1) 
2024-08-29 14:04:20.325	
2024-08-29 12:04:20 UTC [223009]: INFO: Synchronous standby status assigned to ['postgresql-0'] 

Additional context

@cbartz cbartz added the bug Something isn't working label Aug 30, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5311.

This message was autogenerated

@dragomirp
Contributor

Sounds the same as, or similar to, #616. Patroni needs the k8s API to track the primary. DCS failsafe mode should keep the cluster up until the k8s API is back.
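For context, Patroni (since 3.0) exposes this via the `failsafe_mode` dynamic configuration option; a hedged sketch of what enabling it looks like in the DCS config (whether and how the charm should expose this safely is exactly what is under discussion here):

```yaml
# Patroni dynamic (DCS) configuration, e.g. edited via `patronictl edit-config`.
# With failsafe_mode on, the primary keeps running during a DCS / k8s API
# outage as long as it can still reach all other cluster members directly.
failsafe_mode: true
ttl: 30
loop_wait: 10
retry_timeout: 10
```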

@cbartz
Author

cbartz commented Aug 30, 2024

Thanks @dragomirp. So, if I understand correctly, your team will be working on a feature to enable this mode via the charm? We have an application that cannot tolerate downtime, and we are considering switching to the machine version of this charm to avoid k8s-related instability. Thank you very much.

@taurus-forever
Contributor

Hi @cbartz, this is a provisional update. We are still discussing the safety of Patroni's failsafe mode. You can follow the discussion in the PR. Hope to have good news for you soon.

Thank you!

@cbartz
Author

cbartz commented Sep 17, 2024

Thank you @taurus-forever @dragomirp You all rock!
