Thursday, 18 February 2016

What is Voting Disk & Split Brain Syndrome in RAC ?


Voting Disk

Oracle Clusterware uses the voting disk to determine which instances are members of a cluster. The voting disk must reside on a shared disk. Basically all nodes in the RAC cluster register their heart-beat information on thes voting disks. The number decides the number of active nodes in the RAC cluster. These are also used for checking the availability of instances in RAC and remove the unavailable nodes out of the cluster. It helps in preventing split-brain condition and keeps database information intact. The split brain syndrome and its affects and how it has been managed in oracle is mentioned below.

For high availability, Oracle recommends that you have a minimum of three voting disks. If you configure a single voting disk, then you should use external mirroring to provide redundancy. You can have up to 32 voting disks in your cluster. What I could understand about the odd value of the number of voting disks is that a noe should see maximun number of voting disk to continue to function, so with 2, if it can see only 1, its not the maximum value but a half value of voting disk. I am still trying to search more on this concept.
Split Brain Syndrome:

In a Oracle RAC environment all the instances/servers communicate with each other using high-speed interconnects on the private network. This private network interface or interconnect are redundant and are only used for inter-instance oracle data block transfers. Now talking about split-brain concept with respect to oracle rac systems, it occurs when the instance members in a RAC fail to ping/connect to each other via this private interconnect, but the servers are all pysically up and running and the database instance on each of these servers is also running. These individual nodes are running fine and can conceptually accept user connections and work independently. So basically due to lack of commincation the instance thinks that the other instance that it is not able to connect is down and it needs to do something about the situation. The problem is if we leave these instance running, the sane block might get read, updated in these individual instances and there would be data integrity issue, as the blocks changed in one instance, will not be locked and could be over-written by another instance. Oracle has efficiently implemented check for the split brain syndrome. 
What does RAC do incase node becomes inactive:

In RAC if any node becomes inactive, or if other nodes are unable to ping/connect to a node in the RAC, then the node which first detects that one of the node is not accessible, it will evict that node from the RAC group. e.g. there are 4 nodes in a rac instance, and node 3 becomes unavailable, and node 1 tries to connect to node 3 and finds it not responding, then node 1 will evict node 3 out of the RAC groups and will leave only Node1, Node2 & Node4 in the RAC group to continue functioning.
The split brain concepts can become more complicated in large RAC setups. For example there are 10 RAC nodes in a cluster. And say 4 nodes are not able to communicate with the other 6. So there are 2 groups formed in this 10 node RAC cluster ( one group of 4 nodes and other of 6 nodes). Now the nodes will quickly try to affirm their membership by locking controlfile, then the node that lock the controlfile will try to check the votes of the other nodes. The group with the most number of active nodes gets the preference and the others are evicted. Moreover, I have seen this node eviction issue with only 1 node getting evicted and the rest function fine, so I cannot really testify that if thats how it work by experience, but this is the theory behind it.
When we see that the node is evicted, usually oracle rac will reboot that node and try to do a cluster reconfiguration to include back the evicted node.
You will see oracle error: ORA-29740, when there is a node eviction in RAC. There are many reasons for a node eviction like heart beat not received by the controlfile, unable to communicate with the clusterware etc.
A good metalink note on understanding node eviction and how to address is Note ID: 219361.1
The CSS (Cluster Synchronization Service) daemon in the clusterware maintains the heart beat to the voting disk.

No comments:

Post a Comment