Latency vs. Throughput: Mastering System Performance Metrics

Understand the critical difference between latency and throughput, their trade-offs, and how to optimize for each in distributed systems.

Concept Overview

In distributed systems, Latency and Throughput are the two fundamental metrics that define performance. While they are often discussed together, they represent distinct properties of a system. Confusing them, especially during a system design interview, is a red flag that signals a lack of experience with large-scale architecture.

This guide clarifies these concepts, explores their trade-offs, and provides strategies to optimize for each.

What is Latency?

Latency is a measure of time. It answers the question: "How long does it take for a single operation to complete?"

In a client-server model, latency includes the time for:

  1. The request to travel from client to server (Network Propagation).
  2. The server to process the request (Processing Time).
  3. The response to travel back to the client.

It is typically measured in milliseconds (ms).
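A minimal sketch of measuring single-operation latency in Python; `handle_request` is a hypothetical stand-in for real work:

```python
import time

def handle_request():
    """Hypothetical stand-in for a real request handler."""
    time.sleep(0.005)  # simulate ~5 ms of processing

def measure_latency_ms(fn):
    """Return how long a single call to fn takes, in milliseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000

print(f"latency: {measure_latency_ms(handle_request):.1f} ms")  # roughly 5 ms here
```

`time.perf_counter()` is preferred over `time.time()` for intervals because it is monotonic and high-resolution.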

What is Throughput?

Throughput is a measure of capacity or rate. It answers the question: "How much work can the system handle in a given unit of time?"

It focuses on the volume of data or requests processed rather than the speed of individual requests.

  • For Web Servers: Measured in Requests Per Second (RPS) or Queries Per Second (QPS).
  • For Data Pipelines: Measured in Megabytes per second (MB/s) or Gigabytes per second (GB/s).
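The distinction shows up directly in how you measure: latency times one call, while throughput counts completed calls per unit of time. A sketch, again with a hypothetical handler:

```python
import time

def handle_request():
    """Hypothetical handler taking ~1 ms."""
    time.sleep(0.001)

def measure_throughput_rps(fn, duration_s=0.5):
    """Count how many sequential calls to fn complete within duration_s."""
    count = 0
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        fn()
        count += 1
    return count / duration_s  # requests per second

print(f"{measure_throughput_rps(handle_request):.0f} requests/second")
```

Note that this measures a single serial caller; real throughput is measured under concurrent load.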

The Highway Analogy

Think of a highway:

  • Latency is how fast a single car drives from Point A to Point B (e.g., 60 minutes).
  • Throughput is how many cars pass a specific checkpoint per hour (e.g., 1,000 cars/hour).

Where They Fit in a System

Latency and throughput manifest at every layer of a system. Understanding where bottlenecks occur is key to optimization.

Real-World Use Cases

Different systems prioritize these metrics differently based on business requirements.

1. AdTech: Real-Time Bidding (Latency Critical)

In programmatic advertising, when a user loads a webpage, an auction happens in milliseconds to decide which ad to show.

  • Requirement: The entire auction must complete within 100ms. If a bidder takes 200ms to respond, their bid is ignored.
  • Focus: Ultra-low latency. Throughput is important, but latency is a hard constraint.

2. Log Ingestion Pipeline (Throughput Critical)

Consider a system collecting logs from millions of devices (e.g., Datadog or Splunk).

  • Requirement: Ingest terabytes of data per hour. It doesn't matter if a specific log entry takes 2 seconds or 5 seconds to appear in the dashboard, as long as the system doesn't drop data under load.
  • Focus: High Throughput.

3. Video Streaming (Hybrid)

Netflix or YouTube needs to deliver high-quality video to millions of users.

  • Latency: Critical for the initial "Time to First Frame" (play start time).
  • Throughput: Critical for sustaining high-definition video data flow without buffering.

You are designing a payment gateway's core transaction engine. ACID compliance and avoiding double-charges is the top priority. What should be your primary optimization focus?


Read vs. Write Considerations

Optimizing for reads typically allows for lower latency compared to writes, which often require stronger consistency guarantees.

| Feature | Read-Heavy Systems | Write-Heavy Systems |
| --- | --- | --- |
| Primary Goal | Fast retrieval, low latency. | High ingestion rate, reliable storage. |
| Consistency | Can often tolerate Eventual Consistency. | Often requires Strong Consistency or effective conflict resolution. |
| Bottleneck | Database connections, specific hot rows. | Disk I/O, database locks, index updates. |
| Example | X (Twitter) Timeline, News Site. | IoT Sensor Telemetry, Chat Messages. |

Design Strategies

How do we manipulate latency and throughput? Often, improving one comes at the cost of the other.

1. Caching (Improves Latency, Improves Read Throughput)

Storing frequently accessed data in memory (Redis/Memcached) closer to the application avoids repeated slow database calls.

  • Pros: Drastically reduces read latency.
  • Cons: Cache invalidation complexity; Stale data.

2. Batching (Improves Throughput, Degrades Latency)

Instead of processing requests one by one, group them and process them together.

  • Mechanism: A database commit for 1 record takes 10ms. A commit for 100 records might take 20ms.
  • Trade-off: The first request in the batch waits for the last request to arrive before processing starts, increasing latency for individual requests while raising overall system throughput.
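The arithmetic above works out as follows (the 10 ms and 20 ms commit costs are the illustrative figures from the mechanism, not real benchmarks):

```python
single_commit_ms = 10   # hypothetical cost of committing 1 record
batch_commit_ms = 20    # hypothetical cost of committing 100 records at once
batch_size = 100

# One-by-one: 100 commits at 10 ms each.
serial_total_ms = batch_size * single_commit_ms          # 1000 ms
serial_throughput = batch_size * 1000 / serial_total_ms  # 100 records/s

# Batched: one 20 ms commit covers all 100 records.
batch_throughput = batch_size * 1000 / batch_commit_ms   # 5000 records/s

# Trade-off: an individual record may wait for the batch to fill
# before its commit even starts, so its own latency goes up.
print(serial_throughput, batch_throughput)
```

A 50x throughput gain, paid for with extra per-request wait time.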

3. Asynchrony (Decouples Latency from Throughput)

Offload heavy work to background workers using message queues (Kafka, RabbitMQ).

  • Scenario: A user uploads a video.
  • Sync: User waits for encoding to finish (High Latency).
  • Async: Server acknowledges upload immediately (Low Latency), and queues encoding job (High Throughput processing in background).
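The async pattern can be sketched with Python's standard library (an in-process `queue.Queue` standing in for Kafka/RabbitMQ, and `encode_video` a hypothetical slow job):

```python
import queue
import threading
import time

jobs = queue.Queue()

def encode_video(upload_id):
    """Hypothetical slow encoding job."""
    time.sleep(0.1)

def worker():
    """Background worker draining the queue."""
    while True:
        upload_id = jobs.get()
        if upload_id is None:  # shutdown sentinel
            break
        encode_video(upload_id)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_upload(upload_id):
    """Acknowledge immediately; the heavy work happens in the background."""
    jobs.put(upload_id)
    return {"status": "accepted", "id": upload_id}  # low-latency response

response = handle_upload("video-123")  # returns without waiting for encoding
```

The client-facing latency is now just the cost of enqueueing, while the workers determine overall encoding throughput.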

The Serialization Trap

A common design flaw is processing requests serially on a single thread. This might yield low latency for one user, but destroys throughput. Concurrency (multithreading, event loops) is essential for throughput.
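A sketch of the difference, using simulated I/O-bound handlers: each request has the same individual latency, but running them serially caps throughput at one request at a time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(i):
    time.sleep(0.05)  # simulate 50 ms of I/O-bound work
    return i

# Serial: 20 requests * 50 ms ≈ 1 second of wall-clock time.
start = time.perf_counter()
serial_results = [handle_request(i) for i in range(20)]
serial_s = time.perf_counter() - start

# Concurrent: the same 20 requests overlap their waits.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=20) as pool:
    concurrent_results = list(pool.map(handle_request, range(20)))
concurrent_s = time.perf_counter() - start

print(f"serial: {serial_s:.2f}s, concurrent: {concurrent_s:.2f}s")
```

Threads help here because the work is sleeping on I/O; CPU-bound work in Python would need processes or an async runtime instead.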

Strategy Comparison Table

| Strategy | Impact on Latency | Impact on Throughput | Best For |
| --- | --- | --- | --- |
| Vertical Scaling | Neutral / Slight Improvement | Moderate Increase | Quick fix for unpredictable load. |
| Horizontal Scaling | Neutral | Huge Increase | Long-term growth; massive scale. |
| Caching | Significant Decrease | Significant Increase | Read-heavy workloads. |
| Batching | Increase (Worse) | Significant Increase | Write-heavy, non-interactive tasks. |
| Compression | Increase (CPU overhead) | Increase (Network/Disk) | Bandwidth-constrained environments. |

Why does Batching (e.g., sending 100 SQL inserts in one transaction) typically increase latency for the individual request?


Failure & Scale Considerations

As systems scale, the relationship between latency and throughput becomes non-linear.

The Knee of the Curve

Every system has a breaking point.

  1. Low Load: Latency is low and constant.
  2. High Load: Latency stays stable as throughput increases.
  3. Saturation: Once resources (CPU, DB connections) are exhausted, requests queue up. Latency spikes exponentially, and throughput levels off or crashes.

Defense: Implement Rate Limiting and Load Shedding to prevent your system from entering this failure state. It is better to reject 10% of requests (lower throughput) to keep latency acceptable for the remaining 90%.
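One common load-shedding mechanism is a token bucket; here is a minimal single-threaded sketch (rate and capacity values are arbitrary, and a production limiter would also need locking and per-client buckets):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (a sketch, not production code)."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        """Return True if the request may proceed, False if it is shed."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed this request to keep latency low for the rest

bucket = TokenBucket(rate=100, capacity=10)
accepted = sum(bucket.allow() for _ in range(50))  # sudden burst of 50 requests
print(f"accepted {accepted}, shed {50 - accepted}")
```

During a burst, the bucket admits roughly its capacity and rejects the rest immediately, which is the "reject 10% to protect the 90%" behavior described above.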

Consistency vs. Latency (CAP Theorem)

In a distributed database, if you want strong consistency (all nodes see the same data at once), you must wait for data to replicate to multiple nodes. This coordination takes time, increasing latency.

  • Lower Latency: Choose Eventual Consistency (AP systems).
  • Stronger Consistency: Architect for higher latency on writes (CP systems).

Your system is experiencing an exponential spike in latency while throughput has plateaued. What is the most likely cause?