Channel: VOIP-info.org Wiki Changes

Asterisk High Availability Design

High Availability (HA) is normally achieved through "clustering", which means two machines acting as one for a specific purpose. There are many ways to create a cluster, each with its own benefits, risks, costs, and trade-offs. The terms "High Availability" (HA) and "Clustering" are severely overused, so beware of the hype. If you are responsible for creating a high availability cluster for Asterisk, these are the issues you should consider. This page is intended to be a starting point for the design, creation, or selection of a High Availability or Clustering solution for Asterisk.

Please do not add specific product names or links to this page; it is intended to be product-neutral.

Co-Dependence and Autonomy

In order to be a true cluster, the two machines (or "peers") must share nothing. Many HA solutions involve sharing hardware, software, a logical device, etc. The problem with this approach is that it creates a single point of failure. For example, if a cluster shares a hardware channel bank (e.g. connected to two machines via two USB cables), then when the channel bank fails the entire cluster fails. As another example, if a cluster shares a disk (e.g. DRBD), then corruption of the disk content by a failing peer immediately corrupts the disk content of the other peer. In a true cluster the peers must be autonomous, i.e. they must not share any hardware, software, logical devices, etc. Beware of solutions that place a single device in front of the Asterisk servers: that device is itself a single point of failure.

Data Synchronization

In order for a cluster to remain useful, the data on the peers must remain in sync. This allows one peer to pick up where the other left off in the event of a failure. However, synchronization is one of the greatest challenges for clusters. Do-it-yourself solutions typically share a block or file level device (e.g. DRBD, an NFS mount, iSCSI), but this reintroduces the autonomy problem described above. Such shared block or file level devices are the primary reason that simplistic clustering solutions fail in real-life scenarios.

Since peers are often located in different data centers (see below), there may need to be subtle configuration differences between the peers. Simplistic synchronization solutions do not allow for such differences and blindly overwrite the data on one peer with data from the other. It is important that you can configure which data is synchronized and which data is left alone on each peer.
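As a rough illustration of configurable synchronization, the sketch below decides per file whether it should be copied to the other peer. The pattern lists and the function name are assumptions made for this example, not part of any particular product; `sip_local.conf` in particular is a hypothetical peer-specific file.

```python
from fnmatch import fnmatch

# Illustrative patterns only (assumptions for this sketch):
# files matching SYNC_PATTERNS are candidates for synchronization,
# files matching LOCAL_ONLY_PATTERNS always stay peer-specific.
SYNC_PATTERNS = ["/etc/asterisk/*.conf"]
LOCAL_ONLY_PATTERNS = [
    "/etc/asterisk/sip_local.conf",  # hypothetical per-site settings
    "/etc/asterisk/rtp.conf",        # e.g. port ranges that differ per data center
]

def should_sync(path, sync=SYNC_PATTERNS, local_only=LOCAL_ONLY_PATTERNS):
    """Return True if `path` should be copied to the other peer."""
    # Local-only rules win over sync rules, so a peer-specific file
    # is never overwritten by the other peer's copy.
    if any(fnmatch(path, pattern) for pattern in local_only):
        return False
    return any(fnmatch(path, pattern) for pattern in sync)

def select_files(paths):
    """Filter a list of paths down to those that should be synchronized."""
    return [p for p in paths if should_sync(p)]
```

Giving the local-only list precedence is a deliberate choice here: a mistake in the sync patterns then cannot overwrite settings that are meant to differ between data centers.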

In clustering solutions without shared block or file level devices, the next consideration is preventing corrupted data on one peer from negatively affecting the other. Simplistic synchronization tools like rsync or scp will copy corrupted data from a failing peer to the healthy peer. It is critical that synchronization occurs only when the source of the data (i.e. the source peer) has been verified as healthy.
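The idea of gating synchronization on the health of the source can be sketched as follows. This is a minimal in-memory model, not a real transfer tool: the dictionaries stand in for each peer's files, and `source_is_healthy` is a hypothetical callable that a real deployment would replace with its own health verification.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Content hash used to detect files that differ between peers."""
    return hashlib.sha256(data).hexdigest()

def sync_if_healthy(source_is_healthy, source_files, target_files):
    """Copy changed files from source to target only if the source is healthy.

    source_is_healthy: callable returning bool (hypothetical health check).
    source_files, target_files: dicts mapping path -> bytes, standing in
    for the file trees of the two peers.
    Returns the list of paths that were copied.
    """
    # Refuse to sync from an unverified source, so a failing peer
    # cannot push corrupted data to the healthy one.
    if not source_is_healthy():
        return []
    copied = []
    for path, data in source_files.items():
        if path not in target_files or checksum(target_files[path]) != checksum(data):
            target_files[path] = data
            copied.append(path)
    return copied
```

Note that the health check runs before any data is touched; a check that runs only after copying would already be too late.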

Defining Peer Failure

In order for a cluster to remain productive, it must know when a peer is healthy, when it is deteriorating, and when it has failed. The simplest failure scenario (and the easiest to handle) is when a peer completely shuts down and disappears from the cluster. However, this is rarely the case in real life, where peers deteriorate slowly: they introduce issues, corrupt data, stop bridging calls, etc. ...
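One way to picture the three states named above is to classify a peer from a window of recent health-check results. The sketch below is an assumption-laden illustration: the state names, the threshold of three consecutive failures, and the rule that an empty history counts as failed are all choices made for this example, not definitions from any specific clustering product.

```python
from enum import Enum

class PeerState(Enum):
    HEALTHY = "healthy"
    DETERIORATING = "deteriorating"
    FAILED = "failed"

def classify_peer(recent_checks, fail_threshold=3):
    """Classify a peer from recent health-check results.

    recent_checks: list of bools, oldest first; True means the check passed.
    fail_threshold: consecutive trailing failures that count as a full
    failure (an assumed value for this sketch).
    """
    if not recent_checks:
        # No results at all: treat the peer as unreachable (assumption).
        return PeerState.FAILED
    tail = recent_checks[-fail_threshold:]
    if len(tail) == fail_threshold and not any(tail):
        return PeerState.FAILED
    if not all(recent_checks):
        # Some checks failed, but not enough in a row to call it dead.
        return PeerState.DETERIORATING
    return PeerState.HEALTHY
```

A real health check would need to look at more than reachability, since a deteriorating peer may still answer pings while no longer bridging calls.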
