
Split Brain Scenarios

EdVassie edited this page Sep 10, 2019 · 8 revisions

This page gives some general advice on Availability Group (AG) Split Brain situations and their resolutions.

In a normal situation, one server in an AG will hold the Primary Role, and will replicate data to all other servers that hold a Secondary Role within that AG.

A Split Brain situation is where the normal Primary and Secondary server roles for Availability Group replication are no longer clear.
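One way to see which role each server currently holds is to run a query like the one below on every server in the AG and compare the results. This is a minimal sketch using the standard Always On catalog and dynamic management views; it is not taken from SQL FineBuild itself.

```sql
-- Run on each server in turn: shows the role this instance holds for each AG.
SELECT ag.name                    AS ag_name,
       ars.role_desc,             -- PRIMARY, SECONDARY or RESOLVING
       ars.operational_state_desc,
       ars.connected_state_desc
FROM sys.availability_groups AS ag
JOIN sys.dm_hadr_availability_replica_states AS ars
  ON ars.group_id = ag.group_id
WHERE ars.is_local = 1;           -- only the row describing this instance
```

If more than one server reports PRIMARY for the same AG, or no server does, the roles are no longer clear.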

The potential causes of these situations and their resolutions are described in the following sections:

Split Brain Causes

This section describes how a Split Brain situation may occur.

In normal operation, a Split Brain situation should not occur. However, problems can happen and a Split Brain can sometimes be the result.

Problem During Cluster Failover

There have been some issues where the failover of the underlying cluster does not cause the expected failover of the Availability Group roles. This can leave the Availability Group in a Split Brain situation, normally with multiple servers taking on the Primary role.

Most situations where a cluster failover has caused a Split Brain have been fixed by Microsoft. If you are up to date with SQL Server fixes and still experience this problem, it is recommended you raise it with Microsoft to find the best way to prevent it from happening again.

The resolution will depend on which scenario the problem has caused. Proceed with either All Servers Hold Secondary Role or Multiple Servers Hold Primary Role as appropriate.

Disaster Recovery Test

This is the most common cause of a Split Brain situation.

In a Disaster Recovery Test where a Secondary server is isolated from the main network and promoted to a Primary Server, a Split Brain situation will exist when the server is rejoined to the main network. This is a deliberate and planned Split Brain situation.

The resolution is given at Multiple Servers Hold Primary Role. This work can be done either before or after the server is rejoined to the main network.



All Servers Hold Secondary Role

In this situation there is no server that holds the Primary Role for a given AG. This will have the following impacts for that AG:

  • No update activity is possible on any of the servers
  • No replication activity is taking place between any of the servers
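The state of each database copy can be checked with a query along these lines, run on every server. It is a sketch only; comparing `last_hardened_lsn` and `last_commit_time` between servers is one possible input when deciding which server holds the most recent data, not a rule stated by this page.

```sql
-- Run on each server: synchronisation state of the local copy of every AG database.
SELECT DB_NAME(drs.database_id)   AS database_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc,
       drs.last_hardened_lsn,     -- how far the local log has been hardened
       drs.last_commit_time       -- time of the last commit seen locally
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_local = 1;
```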

The resolution for this situation is given below:

  1. Identify which server you want to hold the Primary Role

    It is very important you correctly select which server you want to hold the Primary Role. Selecting the wrong server will revert all your data to a previous point in time.

  2. Resynchronise databases on all Secondary Servers

    Normal or Basic AG:

    • Remove each database from the AG
    • Delete all copies of the database from the Secondary servers
    • Add each database back into the AG
    • For SQL 2016, manually reinitialise the databases on the Secondary servers
    • For SQL 2017 and above, allow Reseeding to automatically reinitialise the databases on the Secondary servers

    Distributed AG:

    • Delete all copies of the database from the Secondary servers
    • Allow Reseeding to automatically reinitialise the databases on the Secondary servers
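The steps above could be sketched in T-SQL as shown below for a Normal AG. The names `AGName` and `DbName` are placeholders, and using a forced failover to give the chosen server the Primary Role is an assumption, since this page does not name the command. Each statement must be run on the server named in its comment, not as a single batch.

```sql
-- Step 1, on the server chosen to hold the Primary Role (assumed approach):
-- a forced failover accepts possible data loss, so check the choice first.
ALTER AVAILABILITY GROUP [AGName] FORCE_FAILOVER_ALLOW_DATA_LOSS;

-- Step 2, on the Primary: remove the database from the AG.
ALTER AVAILABILITY GROUP [AGName] REMOVE DATABASE [DbName];

-- Step 2, on each Secondary: delete the stale copy of the database.
DROP DATABASE [DbName];

-- Step 2, on the Primary: add the database back into the AG.
-- The database must use the full recovery model and have a full backup.
ALTER AVAILABILITY GROUP [AGName] ADD DATABASE [DbName];
```

If a Secondary cannot contact the Primary, its copy may still be joined to the AG and might first need `ALTER DATABASE [DbName] SET HADR OFF;` before it can be dropped.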



Multiple Servers Hold Primary Role

In this situation there is more than one server that is holding the Primary Role for a given AG. This will have the following impacts for that AG:

  • All Primary Role servers will be trying to replicate data to Secondary Role servers
  • All Primary Role servers will be rejecting replication from other Primary Role servers
  • Log Files for all databases on the Primary Role servers cannot be maintained and may quickly increase in size
  • Databases on any Secondary Role servers should be considered as having unknown content
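Log growth on a Primary Role server can be watched with a query such as the following. It is a sketch based on standard catalog views; a `log_reuse_wait_desc` value of `AVAILABILITY_REPLICA` indicates that the log is being held for an AG replica.

```sql
-- Run on each Primary Role server: log file size and the reason log space is not reused.
SELECT d.name                               AS database_name,
       d.log_reuse_wait_desc,
       mf.name                              AS log_file_name,
       CAST(mf.size AS bigint) * 8 / 1024   AS log_size_mb  -- size is in 8 KB pages
FROM sys.databases AS d
JOIN sys.master_files AS mf
  ON mf.database_id = d.database_id
WHERE mf.type_desc = 'LOG'
ORDER BY log_size_mb DESC;
```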

The resolution for this situation is given below:

  1. Identify which server you want to continue as Primary Role

    It is very important you correctly select which server you want to continue as Primary Role. Selecting the wrong server will revert all your data to a previous point in time.

  2. Force all other Primary Role servers into Secondary Role

    On each server being forced into Secondary Role, run the following command in SSMS, replacing dAGName with your Distributed Availability Group name:

  ALTER AVAILABILITY GROUP [dAGName] SET (ROLE=SECONDARY);
  3. Resynchronise databases on all Secondary Servers

    Normal or Basic AG:

    • Remove each database from the AG
    • Delete all copies of the database from the Secondary servers
    • Add each database back into the AG
    • For SQL 2016, manually reinitialise the databases on the Secondary servers
    • For SQL 2017 and above, allow Reseeding to automatically reinitialise the databases on the Secondary servers

    Distributed AG:

    • Delete all copies of the database from the Secondary servers
    • Allow Reseeding to automatically reinitialise the databases on the Secondary servers

  4. Review database log size and plan to reduce file size when convenient
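The reseeding and log review steps might look like the sketch below. `AGName`, `SecondaryServer`, `DbName` and the logical log file name `DbName_log` are placeholders, and the 1024 MB target is only an illustration. Whether automatic seeding is already configured depends on how the AG was built, so the first two statements may not be needed.

```sql
-- On the Primary: request automatic seeding for a given Secondary replica.
ALTER AVAILABILITY GROUP [AGName]
    MODIFY REPLICA ON N'SecondaryServer' WITH (SEEDING_MODE = AUTOMATIC);

-- On that Secondary: allow the AG to create the seeded databases.
ALTER AVAILABILITY GROUP [AGName] GRANT CREATE ANY DATABASE;

-- On the Primary: follow seeding progress and any failures.
SELECT start_time, completion_time, current_state,
       failure_state_desc, number_of_attempts
FROM sys.dm_hadr_automatic_seeding
ORDER BY start_time DESC;

-- Later, on the Primary, once the log has been backed up and can be reused:
USE [DbName];
DBCC SHRINKFILE (N'DbName_log', 1024);  -- target size in MB (placeholder value)
```

The shrink is best left until the AG is healthy again, since the log cannot be truncated while it is still being held for a replica.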


Copyright FineBuild Team © 2019. License and Acknowledgements

