System Latency: Designing for Speed & Responsiveness

Understand the critical impact of latency on system performance. Master percentiles (P99), typical system timings, and strategies to minimize delay.

Concept Overview

Latency is the time it takes for a system to process a request and return a response to the user. While availability asks "Is the system up?", latency asks "How fast is the system?".

In distributed systems, latency is composed of multiple segments:

  1. Network Propagation: Time for data to travel through physical cables.
  2. Processing Time: Time for the CPU to execute logic.
  3. I/O Wait: Time waiting for disk reads or database queries.
  4. Queueing Delay: Time spent waiting in a backlog before processing begins.
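As a back-of-envelope sketch, end-to-end latency is just the sum of these four segments. The numbers below are illustrative, not measured:

```python
# Illustrative latency budget for a single request (values are made up).
# End-to-end latency is the sum of the four segments above.
segments_ms = {
    "network_propagation": 40.0,  # physical transit, e.g. a long-haul hop
    "processing": 5.0,            # CPU time executing business logic
    "io_wait": 20.0,              # disk reads / database queries
    "queueing_delay": 15.0,       # time in a backlog before work starts
}

total_ms = sum(segments_ms.values())
print(f"end-to-end latency: {total_ms:.0f} ms")

# The biggest win usually comes from attacking the largest segment first.
bottleneck = max(segments_ms, key=segments_ms.get)
print(f"largest segment: {bottleneck}")
```

Optimizing a segment that contributes 5 ms out of 80 ms caps your improvement at ~6%, which is why profiling the breakdown comes before optimizing.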

Minimizing latency is crucial because it directly correlates with user engagement and revenue. Amazon found that every 100ms of latency cost them 1% in sales, and Google saw a 20% traffic drop for an extra 0.5s delay.


Latency vs. Throughput vs. Bandwidth

These terms are often confused but measure different aspects of performance.

Analogy: A Highway System

  • Latency: The time it takes for a single car to drive from City A to City B (measured in minutes).
  • Throughput: The number of cars that arrive at City B per hour (measured in cars/hour).
  • Bandwidth: The width of the highway (number of lanes).

Key Distinction

You can have a system with high throughput but high latency (e.g., a batch processing job that processes 1M records per hour but takes 10 minutes to start).
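One way to make the distinction concrete is Little's Law (concurrency = throughput × latency), which relates the two quantities without equating them. A minimal sketch with illustrative numbers:

```python
# Little's Law: L = lambda * W
#   L      = average number of requests in flight (concurrency)
#   lambda = throughput (requests per second)
#   W      = average latency per request (seconds)

def in_flight(throughput_rps: float, latency_s: float) -> float:
    return throughput_rps * latency_s

# A batch pipeline: high throughput AND high latency.
# 1,000,000 records/hour ~= 278 records/sec, each taking 600 s end to end.
print(in_flight(1_000_000 / 3600, 600))   # hundreds of thousands in flight

# An interactive API: modest throughput, low latency.
print(in_flight(100, 0.050))              # only a handful in flight
```

The batch system keeps an enormous amount of work in flight at once; the API keeps almost none. Both can report the same "requests per second" while delivering wildly different user experiences.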


The "False God" of Averages

When measuring performance, never rely on the average (mean) alone. Averages hide outliers and give a false sense of security.

Instead, use percentiles:

  • P50 (Median): 50% of requests are faster than this. Measures the "typical" user experience.
  • P95: 95% of requests are faster than this. Exposes issues affecting 1 in 20 users.
  • P99 (Tail Latency): 99% of requests are faster than this. Critical for strict SLAs.
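Percentiles are straightforward to compute directly. The sketch below (plain Python, synthetic data) also shows how a handful of slow outliers can leave the mean looking healthy while P99 explodes:

```python
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: the value that p% of samples fall at or below."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Synthetic sample: 985 requests at 100 ms plus 15 outliers stuck at 6 s.
latencies_ms = [100] * 985 + [6000] * 15

print(f"mean: {statistics.mean(latencies_ms):.1f} ms")  # looks fine
print(f"P50:  {percentile(latencies_ms, 50)} ms")
print(f"P95:  {percentile(latencies_ms, 95)} ms")
print(f"P99:  {percentile(latencies_ms, 99)} ms")       # the real story
```

Here the mean is under 200 ms while P99 is 6,000 ms: roughly 1 in 100 users is hitting what looks like a timeout, and the average never shows it.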

Your dashboard shows an average latency of 150ms. However, users are complaining about timeouts (errors > 5s). What is the most likely explanation?


Latency Numbers Every Engineer Should Know

Jeff Dean (Google) famously popularized these numbers. Absolute values shift as hardware improves, but the relative orders of magnitude remain instructive.

Operation                                Approximate Time
L1 Cache Reference                       0.5 ns
Branch Mispredict                        5 ns
L2 Cache Reference                       7 ns
Mutex Lock/Unlock                        100 ns
Main Memory Reference                    100 ns
Read 1MB Sequentially from Memory        250,000 ns (250 µs)
Round Trip in Same Datacenter            500,000 ns (0.5 ms)
Disk Seek                                10,000,000 ns (10 ms)
Read 1MB Sequentially from Network       10,000,000 ns (10 ms)
Read 1MB Sequentially from Disk          30,000,000 ns (30 ms)
Send Packet CA -> Netherlands -> CA      150,000,000 ns (150 ms)

Rule of Thumb
  • Memory (RAM) is fast (~nanoseconds).
  • Disk (SSD/HDD) is slow (~milliseconds).
  • Network (Cross-Region) is very slow (~hundreds of milliseconds). Avoid network calls in critical loops!
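These numbers lend themselves to back-of-envelope estimates. A sketch using the figures above:

```python
# Latency numbers (in nanoseconds) from the list above,
# used for back-of-envelope estimation.
NS = {
    "read_1mb_memory": 250_000,
    "read_1mb_disk": 30_000_000,
    "round_trip_datacenter": 500_000,
}

# Estimate: read 1 GB sequentially from disk vs. from memory.
gb_disk_ms = 1024 * NS["read_1mb_disk"] / 1e6
gb_mem_ms = 1024 * NS["read_1mb_memory"] / 1e6
print(f"1 GB from disk:   ~{gb_disk_ms:.0f} ms")   # tens of seconds
print(f"1 GB from memory: ~{gb_mem_ms:.0f} ms")    # a fraction of a second

# Estimate: 20 sequential same-datacenter RPC hops on one request path.
print(f"20 RPC hops: ~{20 * NS['round_trip_datacenter'] / 1e6:.0f} ms")
```

The disk-vs-memory gap (~120x here) is why caching dominates the optimization strategies later in this section, and why 20 sequential network hops can quietly eat a 10 ms budget.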

Real-World Use Cases & Strategies

1. Multiplayer FPS Game (e.g., Call of Duty)

  • Requirement: P99 Latency < 50ms (Real-time).
  • Challenge: If latency is high, players see "lag" (rubber-banding).
  • Strategy:
    • UDP instead of TCP: Tolerate dropped packets rather than waiting for retransmission.
    • Edge Servers: Match players to servers physically close to them.
    • Client-Side Prediction: The game client simulates movement instantly before the server confirms it.
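The UDP trade-off is visible in a minimal socket sketch (Python stdlib, a localhost echo rather than a real game server): there is no handshake and no retransmission, so a lost datagram is simply gone instead of stalling the stream.

```python
import socket

# UDP echo over localhost: connectionless, no handshake, no retransmission.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))          # let the OS pick a free port
addr = server.getsockname()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(1.0)                 # don't wait forever: a lost packet is lost
client.sendto(b"player_position:10,42", addr)   # fire-and-forget datagram

data, peer = server.recvfrom(1024)     # server receives the position update
server.sendto(data, peer)              # and echoes it back

reply, _ = client.recvfrom(1024)
print(reply)

client.close()
server.close()
```

In a real game loop the client would keep sending fresh position updates on a timer; if one datagram vanishes, the next one supersedes it, which is exactly why stale retransmissions (TCP's behavior) are worse than silence here.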

2. Global Search Engine (e.g., Google)

  • Requirement: P95 Latency < 500ms (Interactive).
  • Challenge: Searching billions of indexed pages instantly.
  • Strategy:
    • In-Memory Index: Keep the most frequently accessed index data in RAM (Redis/Memcached).
    • Parallelization: Scatter the query to 1000 nodes, gather results, and return the top 10. The latency is determined by the slowest node (straggler problem).
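Scatter-gather and the straggler effect can be sketched with a thread pool (simulated shards; the per-shard timings are illustrative):

```python
import concurrent.futures
import time

def query_shard(shard_id: int) -> list[tuple[float, str]]:
    """Simulated index shard: sleeps for its 'latency', returns scored hits."""
    simulated_latency_s = 0.01 + 0.02 * (shard_id % 5)  # some shards are stragglers
    time.sleep(simulated_latency_s)
    return [(1.0 / (shard_id + 1), f"doc-{shard_id}-{i}") for i in range(3)]

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    results = pool.map(query_shard, range(20))                    # scatter
    hits = [hit for shard_hits in results for hit in shard_hits]  # gather
elapsed = time.perf_counter() - start

top10 = sorted(hits, reverse=True)[:10]   # merge: keep only the best results
# Wall time ~= the SLOWEST shard (~90 ms), not the sum of all 20 (~1 s).
print(f"gathered {len(hits)} hits in {elapsed * 1000:.0f} ms")
```

Sequentially these 20 shards would take about a second; in parallel the request costs only as much as the slowest one, which is why taming the P99 of individual shards matters so much at fan-out.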

3. Background Image Processing

  • Requirement: Latency is irrelevant (Minutes/Hours).
  • Challenge: Processing terabytes of data efficiently.
  • Strategy:
    • Throughput Optimization: Focus on processing as many images as possible per hour, not how fast one image finishes. Queueing is acceptable.
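Here the metric to optimize is jobs per hour, not per-job latency. A back-of-envelope sketch with illustrative numbers:

```python
# Throughput math: per-image latency stays fixed; workers multiply throughput.
seconds_per_image = 30   # one image takes 30 s no matter how many workers exist
workers = 200

images_per_hour = workers * (3600 / seconds_per_image)
print(f"{images_per_hour:.0f} images/hour")

# Per-image latency is unchanged (still 30 s, plus any time spent queued),
# but system-wide throughput scaled 200x -- the metric that matters here.
```

This is the mirror image of the FPS example: there, shaving milliseconds off one request was everything; here, a single image could take twice as long and nobody would notice, as long as the hourly total keeps climbing.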

Architecture: The Request Path

Every hop adds latency. Service-oriented architectures (microservices) typically incur higher request latency than monoliths because each internal call crosses the network.

Latency Compounds in Microservices

[Diagram: a request passing through Auth, Inventory, and Pricing services, each network hop adding latency]

Optimization: Parallelize calls where possible! If Inventory and Pricing don't depend on each other, call them simultaneously.

Total Latency = Max(Inventory, Pricing) + Auth instead of Sum(All).
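A minimal asyncio sketch of this fan-out (the service names and timings are illustrative stand-ins for real RPC calls):

```python
import asyncio
import time

async def call_service(name: str, latency_s: float) -> str:
    await asyncio.sleep(latency_s)   # stand-in for an HTTP/RPC call
    return f"{name}: ok"

async def handle_request() -> list[str]:
    # Auth must run first (the other calls need an authenticated user)...
    auth = await call_service("auth", 0.05)
    # ...but Inventory and Pricing are independent: run them concurrently.
    inventory, pricing = await asyncio.gather(
        call_service("inventory", 0.10),
        call_service("pricing", 0.08),
    )
    return [auth, inventory, pricing]

start = time.perf_counter()
responses = asyncio.run(handle_request())
elapsed = time.perf_counter() - start

# Total ~= Auth + max(Inventory, Pricing) = 0.05 + 0.10 = 0.15 s,
# not the sequential sum 0.05 + 0.10 + 0.08 = 0.23 s.
print(responses, f"{elapsed:.2f} s")
```

The dependency structure dictates the floor: anything that must be sequential (Auth) adds its full latency, while everything independent collapses to the slowest member of the group.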


Optimization Strategies

  • Caching: Store computed results in RAM (Redis) to avoid slow DB/disk access. Best for read-heavy workloads (news feeds, profiles).
  • CDN (Content Delivery Network): Cache static assets (files, images) on servers geographically close to the user. Best for static content serving.
  • Compression: Gzip/Brotli the payload to reduce network transfer time. Best for large JSON/HTML responses.
  • Connection Pooling: Reuse TCP connections to avoid the 3-way handshake overhead. Best for database connections.
  • Parallel Execution: Execute independent tasks concurrently. Best for aggregating data from multiple microservices.
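Caching, the first strategy above, can be sketched with a tiny in-process TTL cache. The "database" here is a stub counter; in production this role is typically played by Redis or Memcached:

```python
import time

class TTLCache:
    """Minimal in-process cache: values expire after ttl_s seconds."""
    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self._store = {}   # key -> (value, expiry_timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]
        return None        # missing or expired

    def put(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl_s)

db_reads = 0

def fetch_profile(user_id: int) -> dict:
    """Stub for a slow database read (the thing we want to avoid)."""
    global db_reads
    db_reads += 1
    return {"id": user_id, "name": f"user-{user_id}"}

cache = TTLCache(ttl_s=60)

def get_profile(user_id: int) -> dict:
    profile = cache.get(user_id)
    if profile is None:               # cache miss: pay the DB cost once...
        profile = fetch_profile(user_id)
        cache.put(user_id, profile)   # ...then serve later reads from RAM
    return profile

for _ in range(1000):
    get_profile(42)
print(f"database reads: {db_reads}")  # the other 999 reads hit the cache
```

The TTL is the classic freshness trade-off: a 60-second expiry means profile edits can be up to a minute stale, in exchange for turning 1,000 database reads into 1.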

You are designing a notification service. You need to send emails to 1 million users. The email provider API takes 1 second per email. What is the most important metric to optimize?