Phi Accrual Failure Detection

Motivation

A heartbeat only tells you is a system is down or up, no in between, but in reality it might just be overwhelmed. Heartbeating uses a fixed timeout, and if there is no heartbeat from a server, the system, after the timeout assumes that the server has crashed, which makes the value of the timeout critical.

Solution

We can use an adaptive failure detection algorithm:

  • accrual = accumulation or the act of accumulating over time

  • uses historical heartbeat information to make the threshold adaptive

  • instead of telling you is a server is alive or not, it outputs the suspicion level about a server

  • if a node does not respond, its suspicion level is increased and could be declared dead later

  • as a node’s suspicion level increases, the system can gradually stop sending new requests to it

  • makes a system efficient as it takes into account fluctuations in the network environment and other intermittent server issues before declaring a system completely dead.

Application

Cassandra uses the Phi Accrual Failure Detector algorithm to determine the state of the nodes in the cluster.

Last updated