*: overcome postgres sync repl limit causing lost transactions under some events. #514

@sgotti sgotti commented Jun 13, 2018

Postgres synchronous replication has a downside explained in the docs:
https://www.postgresql.org/docs/current/static/warm-standby.html

If primary restarts while commits are waiting for acknowledgement, those waiting transactions will be marked fully committed once the primary database recovers. There is no way to be certain that all standbys have received all outstanding WAL data at time of the crash of the primary. Some transactions may not show as committed on the standby, even though they show as committed on the primary. The guarantee we offer is that the application will not receive explicit acknowledgement of the successful commit of a transaction until the WAL data is known to be safely received by all the synchronous standbys.

In some sequences of events this causes lost transactions. For example:

  • The sync standby goes down.
  • A client commits a transaction and blocks waiting for acknowledgement.
  • The primary restarts and marks the above transaction as fully committed.
    All clients now see that transaction.
  • The primary dies.
  • The standby comes back.
  • The sentinel elects the standby as the new master since it's in the
    synchronous_standby_names list.
  • The above transaction is lost despite synchronous replication being
    enabled.
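
The sequence above can be replayed with a toy model. This is only an illustrative sketch, not stolon or PostgreSQL code: `Node`, `commit` and `visible_after_restart` are hypothetical names, and the WAL is reduced to a list of transaction ids.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy PostgreSQL instance: the WAL is modeled as a list of transaction ids."""
    name: str
    wal: list = field(default_factory=list)
    up: bool = True


def commit(primary, sync_standby, txid):
    """Write txid on the primary; return True only if the commit is acknowledged."""
    primary.wal.append(txid)           # the commit record reaches the primary's WAL first
    if sync_standby.up:
        sync_standby.wal.append(txid)  # streamed to the synchronous standby
        return True                    # the client receives the acknowledgement
    return False                       # the client stays blocked, no acknowledgement


def visible_after_restart(primary):
    """After a restart, every transaction in the local WAL shows as committed."""
    return list(primary.wal)


primary = Node("primary")
standby = Node("standby")

commit(primary, standby, "tx1")                   # normal, acknowledged commit
standby.up = False                                # sync standby goes down
acked = commit(primary, standby, "tx2")           # client blocks, never acknowledged
seen_by_clients = visible_after_restart(primary)  # primary restarts
primary.up = False                                # primary dies
standby.up = True                                 # standby comes back

# The standby is elected only because it is listed as synchronous.
new_master = standby
lost = [tx for tx in seen_by_clients if tx not in new_master.wal]
print(lost)  # ['tx2']
```

`tx2` was never acknowledged to the committing client, yet other clients could see it after the primary restarted, and it is missing on the elected standby.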

So under some conditions a sync standby can be elected even though it's missing
the latest transactions, because it was down at commit time.

This is not easy to fix in the sentinel: when the master is down, the sentinel
cannot know whether a sync standby is really in sync (we cannot query the
master's last WAL position, and the reporting from the keeper is asynchronous).

But with stolon we can overcome this issue: since we control the primary, we
can notice when it restarts and allow only "internal" connections until all the
defined synchronous standbys are really in sync.
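
A minimal sketch of that decision, assuming the keeper knows whether the primary was just restarted and which standbys are currently confirmed in sync. The function name, its parameters and the `"internal"`/`"all"` labels are hypothetical and are not stolon's actual API.

```python
def connection_mode(primary_restarted, sync_standbys, in_sync_standbys):
    """Return "internal" while a restarted primary waits for its sync standbys.

    sync_standbys:    names configured as synchronous standbys
    in_sync_standbys: names currently confirmed as in sync (how this is
                      detected is left out of this sketch)
    """
    if not primary_restarted:
        return "all"
    # Any configured sync standby that is not yet confirmed keeps clients out.
    missing = set(sync_standbys) - set(in_sync_standbys)
    return "internal" if missing else "all"


print(connection_mode(True, ["standby1"], []))            # internal
print(connection_mode(True, ["standby1"], ["standby1"]))  # all
```

Normal clients are let back in only once no configured synchronous standby is missing from the in-sync set.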

Allowing only "internal" connections means not adding the default rules or the
user-defined pgHBA rules, but only the rules needed for replication (and local
communication from the keeper).
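
As a rough illustration of the difference between the two rule sets, the sketch below builds `pg_hba.conf` lines. The helper name, the concrete rules, the addresses and the auth method are assumptions made for the example; the rules stolon really generates may differ.

```python
# Assumed default client rules for this example only.
DEFAULT_RULES = [
    "host all all 0.0.0.0/0 md5",
    "host all all ::0/0 md5",
]


def build_pg_hba(superuser, repl_user, user_rules, internal_only):
    """Return pg_hba.conf lines (TYPE DATABASE USER [ADDRESS] METHOD)."""
    # "Internal" rules: local keeper access and replication connections.
    rules = [
        f"local all {superuser} md5",
        f"host replication {repl_user} 0.0.0.0/0 md5",
        f"host replication {repl_user} ::0/0 md5",
    ]
    if not internal_only:
        # Normal operation: add the user-defined rules, or the defaults if none.
        rules += user_rules if user_rules else DEFAULT_RULES
    return rules


restricted = build_pg_hba("stolon", "repluser", [], internal_only=True)
normal = build_pg_hba("stolon", "repluser", [], internal_only=False)
print("\n".join(restricted))
```

With `internal_only=True` no rule matches ordinary client roles, so only the keeper and the replication user can connect.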

Since the "internal" rules accept the defined superuser and replication users,
clients should not use these roles for normal operation or the above solution
won't work (they shouldn't do so anyway, since this could exhaust the reserved
superuser connections needed by the keeper to check the instance).

@sgotti sgotti changed the title *: overcome postgres sync repl limit causing lost transactions under … *: overcome postgres sync repl limit causing lost transactions under some events. Jun 13, 2018
@sgotti sgotti force-pushed the overcome_postgres_sync_repl_limit_causing_lost_transactions_under_some_events branch 4 times, most recently from 5fe441c to 9c3a923 Compare June 16, 2018 15:10
@sgotti sgotti force-pushed the overcome_postgres_sync_repl_limit_causing_lost_transactions_under_some_events branch 2 times, most recently from 7cb5629 to 52c4c92 Compare August 24, 2018 15:13
@sgotti sgotti force-pushed the overcome_postgres_sync_repl_limit_causing_lost_transactions_under_some_events branch from 52c4c92 to 8ed7d42 Compare September 4, 2018 11:59
@sgotti sgotti force-pushed the overcome_postgres_sync_repl_limit_causing_lost_transactions_under_some_events branch from 8ed7d42 to 87766c9 Compare September 5, 2018 11:24
@sgotti sgotti merged commit 87766c9 into sorintlab:master Sep 10, 2018
sgotti added a commit that referenced this pull request Sep 10, 2018
…_causing_lost_transactions_under_some_events

*: overcome postgres sync repl limit causing lost transactions under some events.
@sgotti sgotti added this to the v0.13.0 milestone Sep 17, 2018