High Availability & Fault Tolerance
Distinguish between High Availability, Fault Tolerance, and Resilience, and learn strategies to maximize system uptime.
Concept Overview
In system design, "staying online" is not a binary state. Engineers often conflate High Availability (HA), Fault Tolerance, and Resilience, but they represent distinct guarantees with vastly different cost implications.
1. High Availability (HA)
Goal: Minimize downtime. An HA system accepts that failures will occur but is designed to recover quickly. There is a brief period of disruption (seconds or minutes) while the system fails over to a backup.
- Analogy: Getting a flat tire, but having a spare and the skills to change it in 10 minutes.
2. Fault Tolerance
Goal: Zero downtime. A fault-tolerant system continues operating without any interruption when a component fails. This usually requires expensive, active-active redundancy where multiple components process the same task simultaneously.
- Analogy: A twin-engine jet losing one engine but continuing to fly safely.
3. Resilience
Goal: Survive the unexpected. While HA and Fault Tolerance handle known failures (disk crash, network cut), Resilience is about graceful degradation under unknown or catastrophic conditions (unexpected traffic spike, cascading dependency failure).
Availability Metrics (The Nines)
Availability is mathematically defined as:
Availability = MTBF / (MTBF + MTTR)
Where:
- MTBF (Mean Time Between Failures): The average time the system runs between one failure and the next.
- MTTR (Mean Time To Recovery): The average time it takes to restore service after a failure. (HA focuses heavily on shrinking this.)
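As a quick illustration of the formula, the sketch below computes availability from MTBF and MTTR. The function name and the sample figures (one failure every 1,000 hours, recovered either by hand in an hour or by automation in 36 seconds) are made up for the example.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers: the system fails once every 1,000 hours on average.
slow_recovery = availability(mtbf_hours=1000, mttr_hours=1.0)   # manual fix, 1 hour
fast_recovery = availability(mtbf_hours=1000, mttr_hours=0.01)  # automated failover, 36 seconds

print(f"manual recovery:    {slow_recovery:.5%}")
print(f"automated recovery: {fast_recovery:.5%}")
```

With the same failure rate, cutting recovery time from an hour to 36 seconds moves the result from roughly three nines to roughly five nines, which is why HA work concentrates on MTTR.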
| Availability Level | Downtime per Year | Typical Use Case |
|---|---|---|
| 99% (2 Nines) | ~3 days, 15.6 hours | Internal admin tools, Batch jobs. |
| 99.9% (3 Nines) | ~8 hours, 46 minutes | Standard web applications, e-commerce. |
| 99.99% (4 Nines) | ~52.6 minutes | Enterprise SaaS, Payment Gateways. |
| 99.999% (5 Nines) | ~5.3 minutes | Telecommunications, Critical Infrastructure. |
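Moving from 4 nines to 5 nines shrinks the yearly downtime budget from about 52 minutes to about 5, and it often involves a roughly 10x increase in cost and complexity. With a budget that small there is no time for a human to respond, so detection and failover must be fully automated. You stop fighting software bugs and start fighting physics (network latency, the speed of light between data centers).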
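The downtime figures for each availability level follow from simple arithmetic. A minimal sketch, assuming a 365-day year (the function and constant names are illustrative):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year


def downtime_per_year_seconds(availability_pct: float) -> float:
    """Allowed downtime per year, in seconds, for a given availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_per_year_seconds(pct) / 60
    print(f"{pct}% -> {minutes:.1f} minutes of downtime per year")
```

Each additional nine divides the budget by ten, so the same incident that is tolerable at 3 nines can consume an entire year's allowance at 5 nines.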
Architectural Patterns
Redundancy
The core principle of HA is eliminating Single Points of Failure (SPOF).
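To see why a single point of failure matters, it helps to combine component availabilities. The sketch below uses the standard series and parallel formulas and assumes failures are independent, which real systems only approximate (shared power, shared deployments, and correlated bugs all break that assumption). The component figures are hypothetical.

```python
from math import prod


def series(*availabilities: float) -> float:
    """All components are required: the chain is up only if every link is up."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Redundant components: the group is down only if every member is down."""
    return 1 - prod(1 - a for a in availabilities)


single_node = 0.99                                # one app server
redundant_pair = parallel(0.99, 0.99)             # two app servers, either can serve
behind_single_lb = series(0.999, redundant_pair)  # pair behind one non-redundant load balancer

print(f"single node:      {single_node:.4%}")
print(f"redundant pair:   {redundant_pair:.4%}")
print(f"behind single LB: {behind_single_lb:.4%}")
```

Duplicating the app server lifts that tier to about 99.99%, but the non-redundant load balancer in front of it caps the whole chain below its own 99.9%. The SPOF has moved rather than disappeared.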
Active-Passive (Failover)
One node takes traffic (Active); the other waits (Passive).
- Pros: Simple configuration. Only one node serves requests at a time, so there are no conflicting writes between nodes.
- Cons: Wasted resources (passive node sits idle). Failover takes time (downtime).
Active-Active
All nodes take traffic.
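- Pros: No failover gap when a node dies, because the remaining nodes are already serving traffic (this approaches fault tolerance). No node sits idle.
- Cons: Complex consistency management. If one of two nodes dies, the survivor must absorb double its normal load, so each node has to run with enough headroom to take over (below 50% of capacity in a two-node setup).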
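The load concern in the Cons bullet can be made concrete with a small capacity calculation. This is a sketch that assumes identical nodes and evenly spread traffic; the function name is illustrative.

```python
def max_safe_utilization(total_nodes: int, tolerated_failures: int = 1) -> float:
    """Highest per-node utilization at which the survivors can still carry the full load."""
    if total_nodes < 1 or not 0 <= tolerated_failures < total_nodes:
        raise ValueError("need 0 <= tolerated_failures < total_nodes")
    return (total_nodes - tolerated_failures) / total_nodes


two_node_limit = max_safe_utilization(2)   # survivor must absorb the other node's traffic
four_node_limit = max_safe_utilization(4)  # load of one lost node is split three ways

print(f"2 nodes: keep each below {two_node_limit:.0%} utilization")
print(f"4 nodes: keep each below {four_node_limit:.0%} utilization")
```

Larger clusters waste less capacity on headroom, which is one reason active-active deployments tend to use more than two nodes.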
Failover Mechanisms
How does traffic move when a node dies?
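- DNS Failover: Update DNS records to point to a new IP. (Slow: clients keep using the old address until their cached record's TTL expires.)
- Load Balancer Health Checks: The LB probes each node and stops routing to one that fails several consecutive checks. (Fast: typically seconds, depending on the check interval and failure threshold.)
- Floating IP / VIP: A virtual IP address is reassigned to a healthy server, so clients keep using the same address. (Very fast: usually a few seconds or less.)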
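The health-check mechanism can be sketched as a counter of consecutive failed probes per node. This is a simplified model, not any particular load balancer's implementation: the class, the `unhealthy_threshold` parameter, and the node names are hypothetical, and real load balancers usually also require several successful probes before readmitting a node.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HealthChecker:
    probe: Callable[[str], bool]          # returns True if the node answered the check
    unhealthy_threshold: int = 3          # consecutive failures before removal
    failures: Dict[str, int] = field(default_factory=dict)

    def run_check(self, nodes: List[str]) -> List[str]:
        """Probe every node once and return those still eligible for traffic."""
        eligible = []
        for node in nodes:
            if self.probe(node):
                self.failures[node] = 0   # any success resets the counter
            else:
                self.failures[node] = self.failures.get(node, 0) + 1
            if self.failures[node] < self.unhealthy_threshold:
                eligible.append(node)
        return eligible


# Simulated cluster: "app-2" has crashed and fails every probe.
down = {"app-2"}
checker = HealthChecker(probe=lambda node: node not in down)
nodes = ["app-1", "app-2"]
rounds = [checker.run_check(nodes) for _ in range(3)]
print(rounds)
```

In this model the dead node keeps receiving traffic for the first two rounds and is dropped on the third, so worst-case detection time is roughly the check interval multiplied by the threshold.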
Resilience Patterns
- Circuit Breakers: Stop calling a failing service to prevent cascading failures.
- Bulkheads: Isolate resources so a crash in the "Image Processing" service doesn't take down the "User Login" service.
- Graceful Degradation: If the "Recommendations" service fails, show a generic "Popular Item" list instead of a 500 error page.
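A circuit breaker and graceful degradation often work together: once the breaker trips, callers get a fallback instead of waiting on a failing dependency. The following is a minimal single-threaded sketch, assuming a consecutive-failure threshold and a fixed reset timeout. The class, its parameters, and the two service functions are hypothetical, and production libraries add thread safety, metrics, and finer-grained half-open handling.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: do not touch the failing service
            # Timeout elapsed (half-open): let this one trial call through.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            return fallback()
        self.failures = 0       # success closes the circuit again
        self.opened_at = None
        return result


attempts = {"count": 0}


def fetch_recommendations() -> list:
    """Stand-in for a remote call to a service that is currently down."""
    attempts["count"] += 1
    raise ConnectionError("recommendations service unavailable")


def popular_items() -> list:
    """Degraded but useful response shown instead of an error page."""
    return ["popular-item-1", "popular-item-2"]


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
results = [breaker.call(fetch_recommendations, popular_items) for _ in range(5)]
print(attempts["count"], results[-1])
```

Every caller receives the generic list, but only the first three requests actually reach the failing service. After that the breaker short-circuits, which gives the dependency room to recover instead of piling more load onto it.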
Comparisons & Trade-offs
| Feature | High Availability | Fault Tolerance |
|---|---|---|
| Downtime | Minimal (Seconds/Minutes). | Zero (Invisible to user). |
| Cost | Moderate ($$). | High ($$$$). |
| Complexity | Standard (Load Balancers, Auto-scaling). | Extreme (Consensus algorithms, Lock-step execution). |
| Recovery | Automated recovery script. | Redundant parallel processing. |