Latency Numbers Every Data Streaming Engineer Should Know




Jeff Dean’s “Latency Numbers Every Programmer Should Know” became essential reading because it grounded abstract performance discussions in concrete reality. Data streaming engineers need an equivalent framework, one that translates those fundamental hardware latencies into the specific challenges of real-time data pipelines.

Just as Dean showed that a single disk seek (10ms) costs as much as roughly 20 million L1 cache references, streaming engineers must understand that a cross-region sync replication (100ms+) costs the same as processing 10,000 in-memory events. These aren’t just numbers—they’re the physics that govern what’s possible in your streaming architecture.

TL;DR: Your Latency Budget Quick Reference

| Latency Class | End-to-End Target | Use Cases | Key Constraints |
|---|---|---|---|
| Ultra-low | < 10ms | HFT, real-time control, gaming | Single AZ only, no disk fsync per record, specialized hardware |
| Low | 10-200ms | Interactive dashboards, alerts, online ML features | Streaming processing, minimal batching, same region |
| Latency-relaxed | 200ms – minutes | Near-real-time analytics, ETL, reporting | Enables aggressive batching, cross-region, cost optimization |

Critical Hardware & Network Floors

| Operation | Latency | Streaming Impact |
|---|---|---|
| HDD seek/fsync | 5-20ms | Consumes entire ultra-low budget |
| SSD fsync | 0.05-1ms | Manageable for low latency |
| Same-AZ network (RTT) | 0.2-1ms | Base cost for any distributed system |
| Cross-AZ (RTT) | 1-4ms | Minimum for AZ-redundant streams |
| Cross-region (RTT) | 30-200ms+ | Makes <100ms E2E impossible |
| Schema registry lookup | 1-5ms (cached), 10-50ms (miss) | Often overlooked latency source |

Streaming Platform Specifics

| Operation | Typical Latency | Configuration Notes |
|---|---|---|
| Kafka publish (acks=1, same-AZ) | 1-5ms | No replica wait |
| Kafka publish (acks=all, same-AZ) | 3-15ms | Adds replica sync |
| Cross-AZ sync replication | +1-5ms | Per additional AZ |
| Producer batching (linger.ms) | +5-50ms | Intentional latency for throughput |
| Consumer poll interval | 0-500ms+ | Misconfiguration can dominate E2E |
| Iceberg commit visibility | 5s-10min | Depends on commit interval |

What “Real-Time” Actually Means

In data streaming, “real-time” has become as overloaded as “big data” once was. Let’s establish clear definitions based on both technical constraints and human perception thresholds.

Figure 1: Streaming Latency Spectrum showing the logarithmic scale from nanoseconds to minutes, with technology examples and use cases for each latency category.

Ultra-Low Latency (< 10ms End-to-End)

This is the realm of hard real-time systems where every microsecond counts. Applications requiring sub-10ms latency include:

  • High-frequency trading (where 1ms advantage = millions in profit)
  • Real-time control systems (industrial automation, autonomous vehicles)
  • Competitive gaming (where 16ms = one frame at 60fps)
  • Low-latency market data (every trader needs the same speed)

Technical requirements:

  • Everything in one availability zone (cross-AZ RTT alone is 1-4ms)
  • No per-record disk fsync (HDD seek = 10ms, breaking your entire budget)
  • Kernel bypass networking (DPDK, RDMA)
  • Custom serialization (Protocol Buffers, Avro, or binary)
  • Memory-mapped storage or pure in-memory processing

Example stack: Apache Pulsar with BookKeeper on NVMe, or heavily tuned Kafka with:

# Ultra-low latency Kafka producer config
linger.ms=0
batch.size=1024
acks=1
compression.type=none
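
For a concrete starting point, here is a minimal Python sketch (using the confluent-kafka client against an assumed local broker and a hypothetical "events" topic) that applies these settings and times each publish via its delivery callback. Treat it as a measurement probe, not a production benchmark.

# Sketch: measure per-record publish latency with the settings above.
# Assumes a broker at localhost:9092 and a topic named "events".
import time
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "linger.ms": 0,                # send immediately, no batching delay
    "acks": "1",                   # leader-only acknowledgement
    "compression.type": "none",    # skip compression CPU cost
})
latencies_ms = []

def send_timed(i):
    start = time.monotonic()
    def on_delivery(err, msg):
        if err is None:
            latencies_ms.append((time.monotonic() - start) * 1000.0)
    producer.produce("events", value=f"event-{i}".encode(), callback=on_delivery)
    producer.poll(0)               # serve delivery callbacks

for i in range(1000):
    send_timed(i)
producer.flush()

latencies_ms.sort()
print(f"p50 publish latency: {latencies_ms[len(latencies_ms) // 2]:.2f} ms")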

Reality check: For perspective, 100ms is the threshold where UI interactions feel instantaneous to humans. Ultra-low latency is an order of magnitude faster than human perception—you’re optimizing for machines, not users.

Low Latency (10-200ms End-to-End)

This covers the sweet spot for most interactive real-time applications. Users perceive anything under 200ms as “instant” response, making this the target for:

  • Live dashboards and monitoring (business metrics, system health)
  • Real-time alerting (fraud detection, anomaly detection)
  • Online machine learning features (recommendation engines, personalization)
  • Live chat and notifications (social platforms, collaboration tools)
  • Real-time analytics (A/B test results, user behavior tracking)

Technical characteristics:

  • Event-at-a-time processing (not micro-batches)
  • Cross-AZ replication acceptable (adds ~2-5ms)
  • Moderate batching for efficiency (5-50ms linger times)
  • SSD storage with occasional fsync
  • Standard streaming platforms work well

Example stack: Apache Kafka + Apache Flink with event-time processing:

# Balanced Kafka configuration
linger.ms=5
batch.size=16384
acks=all
max.in.flight.requests.per.connection=5

Cost implications: This range allows reasonable optimization without exotic hardware. A well-tuned Kafka cluster can achieve 10-50ms P50 latency with hundreds of thousands of events per second.

Figure 2: The classic trade-off between latency and throughput in streaming systems. Lower latency typically means higher cost and lower throughput, while batch processing achieves high throughput at the cost of latency.

Latency-Relaxed (200ms – Minutes)

When latency requirements relax beyond a few hundred milliseconds, you enter the realm of cost optimization and massive throughput. This category includes:

  • Near-real-time ETL (data lake ingestion, warehouse loading)
  • Business intelligence dashboards (updating every 30 seconds to 5 minutes)
  • Batch-oriented analytics (hourly/daily reports with “fresh” data)
  • Data lake table formats (Iceberg, Delta Lake with 1-10 minute commits)
  • Cross-region data replication (disaster recovery, global distribution)

Technical advantages:

  • Aggressive batching (seconds to minutes)
  • Cross-region replication feasible
  • Cheaper storage tiers (object storage vs. hot SSDs)
  • Higher compression ratios
  • Simpler error handling and retry logic

Example: Netflix’s architecture keeps only hours of hot data in Kafka (expensive) and tiers the rest to Apache Iceberg on S3 (38x cheaper) [1]. For most analytics, 1-5 minute latency is perfectly acceptable and dramatically reduces infrastructure costs.

The Physics of Streaming Latency

Understanding hardware and network fundamentals isn’t academic—these are the unavoidable floors that constrain every streaming system.

Storage: The Latency Hierarchy

Every streaming platform must persist data for durability, but storage choices have massive latency implications:

Memory access:        ~100 nanoseconds
SSD random read:      ~150 microseconds (1,500x slower than memory)
NVMe fsync:          ~0.05-1 milliseconds  
SATA SSD fsync:      ~0.5-5 milliseconds
HDD seek/fsync:      ~5-20 milliseconds (200,000x slower than memory!)

Real-world example: Intel Optane NVMe can sync writes in ~43 microseconds average, while a traditional HDD takes ~18ms—that’s 400x faster. For a streaming broker writing 10,000 events/second (a quick calculation follows the list):

  • With HDD: Maximum ~50-100 synced writes/second/disk (disk-bound)
  • With NVMe: Thousands of synced writes/second (CPU/network bound)
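
As a quick sanity check on those figures, the ceiling for per-record fsync is simply 1000 ms divided by the fsync latency. A small Python sketch (the latencies are the rough numbers above, not measurements):

# Throughput ceiling if every record is individually fsynced to disk.
def max_synced_writes_per_sec(fsync_latency_ms: float) -> float:
    return 1000.0 / fsync_latency_ms

for device, fsync_ms in [("HDD (~15ms)", 15.0), ("SATA SSD (~1ms)", 1.0), ("NVMe (~0.05ms)", 0.05)]:
    print(f"{device}: ~{max_synced_writes_per_sec(fsync_ms):,.0f} synced writes/sec per disk")
# HDD ~67/sec, SATA SSD ~1,000/sec, NVMe ~20,000/sec: sustaining 10,000 events/sec
# on spinning disks requires batching/group commit rather than per-record fsync.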

Kafka-specific insight: Kafka’s sequential write pattern helps with HDDs, but modern deployments use SSDs for predictable low latency. The difference between “usually fast” and “always fast” matters for P99 latency.

Network: Distance Costs Time

Network latency follows the speed of light in fiber (roughly 5 microseconds per kilometer), plus routing overhead:

Same host (loopback):     < 0.1ms
Same rack/AZ:            0.1-0.5ms one-way  
Cross-AZ, same region:   0.5-2ms one-way
Cross-region (continent): 15-40ms one-way
Intercontinental:        80-200ms one-way (varies by route)

AWS measurements: Cross-AZ pings typically show 1-2ms RTT, while us-east-1 to eu-west-1 is ~80-90ms RTT.
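
You can sanity-check these measurements against the 5 microseconds-per-kilometer rule of thumb. The sketch below uses an illustrative New York to London distance and an assumed routing overhead factor:

# Estimate a cross-region RTT floor from distance and fiber propagation speed.
FIBER_US_PER_KM = 5.0        # ~speed of light in fiber
ROUTING_OVERHEAD = 1.3       # assumed factor for non-straight paths and hops

def estimated_rtt_ms(distance_km: float) -> float:
    one_way_ms = distance_km * FIBER_US_PER_KM / 1000.0
    return 2 * one_way_ms * ROUTING_OVERHEAD

print(f"Estimated transatlantic RTT: ~{estimated_rtt_ms(5600):.0f} ms")
# ~73 ms from physics alone, consistent with the ~80-90 ms observed in practice.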

Streaming implications:

  • Synchronous cross-region replication: Automatically adds ≥80ms to every write
  • Leader election during failures: Cross-AZ coordination adds several milliseconds
  • Consumer rebalancing: Group coordination latency scales with member distribution

Figure 3: Global network latency map showing realistic RTT times between major cloud regions. These physical constraints set hard floors for any distributed streaming system.

Common Failure Scenarios

Streaming systems must handle failures gracefully, but each failure mode has latency implications:

| Failure Type | Latency Impact | Mitigation |
|---|---|---|
| Broker failover | +50-200ms during leader election | Faster election timeouts, more brokers |
| GC pause | +100-500ms to P99 latencies | G1GC tuning, smaller heaps, off-heap storage |
| Network partition | +timeout duration (often 30s default) | Shorter timeouts, circuit breakers |
| Schema registry miss | +10-50ms per lookup | Larger caches, schema pre-loading |
| Consumer rebalance | +5-30s processing halt | Incremental rebalancing, sticky assignment |

Streaming Platform Latency Breakdown

Publish Latency (Producer → Broker)

This is where your event first enters the streaming platform. Key factors:

Network transit: Usually negligible within a data center (<1ms), but can dominate for remote producers.

Broker processing: Includes parsing, validation, and local storage. Modern brokers can handle this in microseconds for simple events.

Replication strategy: The big variable. Kafka’s acks setting illustrates the trade-off:

acks=0: Fire-and-forget (~1-2ms, risk data loss)
acks=1: Wait for leader only (~2-5ms, balanced)  
acks=all: Wait for all replicas (~5-15ms same-AZ, much higher cross-region)

Producer batching: Intentionally trading latency for throughput:

linger.ms=0:  Send immediately (lowest latency)
linger.ms=5:  Wait up to 5ms to batch (better throughput)
linger.ms=50: Wait up to 50ms to batch (much better throughput)

Real example: A well-tuned Kafka cluster with acks=all, same-AZ replication typically shows 3-8ms publish latency at P50, 10-25ms at P99.

Consume Latency (Broker → Consumer)

Once data is available on the broker, how quickly can consumers access it?

Push vs. Pull:

  • Push systems (like some message queues) can deliver in sub-millisecond
  • Pull systems (like Kafka) depend on poll frequency

Polling configuration mistakes:

# Bad: a large fetch.min.bytes lets the broker hold each fetch up to fetch.max.wait.ms
fetch.min.bytes=65536
fetch.max.wait.ms=500

# Good: near-real-time consumption, return fetches as soon as any data arrives
fetch.min.bytes=1
fetch.max.wait.ms=10
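
For reference, a minimal Python poll loop with equivalent low-latency fetch settings might look like the sketch below (confluent-kafka client, assumed broker address and topic; note that librdkafka names the wait property fetch.wait.max.ms). It estimates consume latency from each record's broker timestamp, which assumes reasonably synchronized clocks.

# Sketch: poll continuously and report how old each record is on arrival.
import time
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker
    "group.id": "latency-probe",
    "auto.offset.reset": "latest",
    "fetch.min.bytes": 1,                   # don't wait to fill batches
    "fetch.wait.max.ms": 10,                # librdkafka's name for fetch.max.wait.ms
})
consumer.subscribe(["events"])
try:
    while True:
        msg = consumer.poll(timeout=0.1)
        if msg is None or msg.error():
            continue
        ts_type, ts_ms = msg.timestamp()    # producer or log-append timestamp
        age_ms = time.time() * 1000.0 - ts_ms
        print(f"offset={msg.offset()} consume latency ~{age_ms:.1f} ms")
finally:
    consumer.close()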

Processing overhead: In-memory transformations are typically <1ms per event, but external calls (database lookups, API calls) can dominate:

Simple in-memory filter:     ~0.001ms per event
JSON parsing/validation:     ~0.01-0.1ms per event  
Database lookup (cached):    ~1-5ms per event
Database lookup (cache miss): ~10-50ms per event
External API call:          ~50-200ms per event
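
One way to reason about the lookup numbers above: the expected per-event cost is a weighted average of the hit and miss paths, so cache hit rate often matters more than raw code speed. A tiny sketch with illustrative latencies:

# Expected per-event lookup cost for a given cache hit rate (illustrative numbers).
def expected_lookup_ms(hit_rate: float, hit_ms: float = 2.0, miss_ms: float = 30.0) -> float:
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

for hit_rate in (0.50, 0.90, 0.99):
    print(f"hit rate {hit_rate:.0%}: ~{expected_lookup_ms(hit_rate):.1f} ms per event")
# 50% -> ~16.0 ms, 90% -> ~4.8 ms, 99% -> ~2.3 ms per event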

End-to-End Latency Monitoring

What users actually experience: E2E = Publish + Network + Consume + Processing (a minimal measurement sketch follows the percentile list below)

Key percentiles to track:

  • P50 (median): Your typical performance
  • P95: What 95% of users experience
  • P99: Catches tail latencies from GC, network hiccups
  • P99.9: Exposes rare but severe problems
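
A minimal sketch of that computation, assuming each event carries its publish timestamp (for example in a header alongside a correlation ID) so the consumer can record one end-to-end sample per event:

# Turn raw end-to-end latency samples into the percentiles that matter.
import math

def percentile(sorted_ms, p):
    rank = math.ceil(p / 100.0 * len(sorted_ms))   # nearest-rank method
    return sorted_ms[max(rank - 1, 0)]

def summarize(e2e_latencies_ms):
    data = sorted(e2e_latencies_ms)
    return {p: percentile(data, p) for p in (50, 95, 99, 99.9)}

# Each sample = consume_wallclock_ms - publish_wallclock_ms (from the event header)
samples = [12, 15, 14, 18, 22, 35, 19, 240, 16, 17]   # illustrative values
print(summarize(samples))   # a single 240 ms outlier dominates P95/P99/P99.9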

Example real-world numbers:
A well-tuned, single-region Kafka pipeline typically achieves:

  • P50: 10-30ms end-to-end
  • P95: 25-75ms end-to-end
  • P99: 50-200ms end-to-end (watch for GC pauses, network bursts)

Cross-region reality:
With synchronous cross-region replication, add ≥80ms minimum:

  • P50: 90-120ms end-to-end
  • P99: 150-400ms end-to-end

Figure 4: Detailed breakdown of where latency accumulates in a streaming pipeline, from producer to final storage. Shows how each component contributes to total end-to-end latency.

Data Lake Integration: The Visibility Latency Challenge

Modern streaming architectures often flow into analytical storage (Apache Iceberg, Delta Lake) for cost-effective long-term analytics. However, these systems operate on a fundamentally different latency model.

Commit Interval Governs Freshness

Unlike streaming brokers that make data available immediately, table formats batch writes into atomic commits:

Commit every 5 seconds:   ~2.5s average visibility latency, ~5s max
Commit every 1 minute:    ~30s average visibility latency, ~60s max  
Commit every 10 minutes:  ~5min average visibility latency, ~10min max
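
Those averages fall straight out of the commit interval: an event lands mid-interval on average and just after a commit in the worst case. A tiny sketch (the overhead parameter is an assumption for write and commit processing time):

# Approximate table-format visibility latency from the commit interval.
def visibility_latency_s(commit_interval_s: float, overhead_s: float = 0.0):
    average = commit_interval_s / 2.0 + overhead_s   # event arrives mid-interval on average
    worst = commit_interval_s + overhead_s           # event arrives just after a commit
    return average, worst

for interval in (5, 60, 600):
    avg, worst = visibility_latency_s(interval)
    print(f"commit every {interval}s: ~{avg:.1f}s average, ~{worst:.0f}s worst-case visibility")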

Why the delay? Table formats like Iceberg prioritize:

  • Atomic visibility: Readers see complete batches or nothing (no partial data)
  • Efficient storage: Larger files are cheaper and faster to read from object storage
  • Metadata efficiency: Fewer commits = less metadata overhead

Real-World Commit Strategy Examples

| Use Case | Commit Interval | Trade-offs |
|---|---|---|
| Real-time dashboard | 5-30 seconds | Higher cost, more small files, near-real-time visibility |
| Hourly reporting | 1-5 minutes | Balanced cost and freshness |
| Daily analytics | 10-60 minutes | Lowest cost, highest efficiency, delayed visibility |

Experimental data: In testing with Flink → Iceberg:

  • 10-second commits: ~10s median latency, ~20s P99
  • 1-minute commits: ~30s median latency, ~60s P99
  • Latency closely tracks commit interval plus small processing overhead

Cost Impact of Commit Frequency

Netflix’s analysis showed that keeping data in Kafka costs 38x more than Iceberg storage [1]. The commit interval directly affects this trade-off:

Example calculation (1TB/day workload):

  • Kafka retention (24 hours): ~$500/month
  • Iceberg (frequent 30s commits): ~$25/month + processing costs
  • Iceberg (relaxed 10min commits): ~$13/month + processing costs

More frequent commits mean:

  • Higher compute costs (more Flink/Spark jobs)
  • More small files (worse query performance)
  • Higher metadata overhead
  • But lower visibility latency

Synchronous vs. Asynchronous: The Fundamental Trade-off

Every distributed streaming system faces choices about when to wait for confirmation versus proceeding optimistically.

Replication Strategies

Synchronous replication:

Producer → Broker → Wait for replicas → ACK to producer
Latency: Base + (RTT to slowest replica)
Durability: High (data on multiple nodes before ACK)

Asynchronous replication:

Producer → Broker → Immediate ACK → Background replication
Latency: Base + local write time only
Durability: Lower (brief window where data exists on only one node)

Real-world impact:

  • Same AZ sync replication: +1-3ms
  • Cross-AZ sync replication: +2-8ms
  • Cross-region sync replication: +80-200ms (often impractical)

Processing Patterns

Synchronous processing:

Event → Process Step 1 → Wait → Process Step 2 → Wait → Response
Latency: Sum of all steps

Asynchronous processing:

Event → Trigger Step 1 → Trigger Step 2 → Collect results → Response
Latency: Max of parallel steps

Example: External enrichment workflow:

  • Synchronous: Event → DB lookup (20ms) → API call (50ms) → Process (5ms) = 75ms total
  • Asynchronous: Event → [DB lookup || API call] → Process = ~55ms total (the 50ms API call dominates the parallel step, plus 5ms processing)

The async approach requires more complex code (handling out-of-order responses, partial failures) but can significantly reduce latency.
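
As a sketch of the parallel pattern, here is what it looks like with Python's asyncio; the two lookup coroutines are hypothetical stand-ins that simply sleep for the latencies quoted above:

# Parallel enrichment: end-to-end latency becomes max(lookups), not the sum.
import asyncio, time

async def db_lookup(event):
    await asyncio.sleep(0.020)          # stand-in for a ~20 ms DB lookup
    return {"profile": "premium"}

async def api_call(event):
    await asyncio.sleep(0.050)          # stand-in for a ~50 ms external API call
    return {"score": 0.87}

async def enrich(event):
    start = time.monotonic()
    profile, score = await asyncio.gather(db_lookup(event), api_call(event))
    enriched = {**event, **profile, **score}   # real per-event processing would add a few ms
    print(f"enriched in {(time.monotonic() - start) * 1000:.0f} ms")
    return enriched

asyncio.run(enrich({"id": 42}))   # ~50 ms, versus ~75 ms if the calls were awaited sequentially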

Troubleshooting: When Latency Goes Wrong

Your Latency Budget Checklist

Before designing any streaming system (a rough budget check follows the checklist):

  • [ ] Identified true latency requirement (ultra-low/low/relaxed)
  • [ ] Mapped data path (same AZ/cross-AZ/cross-region)
  • [ ] Chosen durability level (async/sync replication)
  • [ ] Configured monitoring for P50/P95/P99, not just averages
  • [ ] Load tested at peak throughput (latency often degrades under load)
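
A rough way to run that check is to add up the floors from the tables above and compare them to your target, as in this sketch (the component values are illustrative midpoints, not measurements):

# Rough latency-budget check: sum the unavoidable floors and compare to the target.
BUDGET_COMPONENTS_MS = {
    "producer -> broker (same AZ)": 0.5,
    "broker write + acks=all (cross-AZ replicas)": 5.0,
    "consumer fetch wait": 10.0,
    "in-memory processing": 1.0,
}

def check_budget(target_ms: float, components=BUDGET_COMPONENTS_MS):
    floor_ms = sum(components.values())
    verdict = "fits" if floor_ms <= target_ms else "does NOT fit"
    print(f"estimated floor ~{floor_ms:.1f} ms vs target {target_ms} ms -> {verdict}")

check_budget(10)    # ultra-low target: a 16.5 ms floor does not fit
check_budget(200)   # low-latency target: plenty of headroom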

Common Latency Culprits

| Symptom | Likely Cause | Investigation |
|---|---|---|
| Consistent >100ms in same region | Network saturation or misconfigured routing | Check network utilization, traceroute |
| P99 >> P50 | GC pauses or batching effects | JVM GC logs, batch size analysis |
| Sudden latency spikes | Broker failover or rebalancing | Broker logs, consumer group stability |
| High variance | Resource contention or queueing | CPU/memory/disk utilization |
| Gradual degradation | Growing consumer lag | Partition count, consumer scaling |

Key Metrics to Monitor

Producer side:

  • request-latency-avg/max: How long broker requests take
  • batch-size-avg: Batching efficiency
  • buffer-available-bytes: Memory pressure

Broker side:

  • request-handler-idle-ratio: CPU saturation
  • log-flush-time: Disk performance
  • leader-election-rate: Stability issues

Consumer side:

  • lag-max: How far behind consumers are
  • poll-time-avg: Processing efficiency
  • commit-latency-avg: Offset management overhead

End-to-end:

  • Application-level latency tracking with correlation IDs
  • P50/P95/P99 latency distributions over time
  • Latency broken down by pipeline stage

Technology-Specific Configurations

Apache Kafka for Low Latency

Producer configuration:

# Minimize batching
linger.ms=0
batch.size=1024

# Acknowledge from the leader only; one in-flight request preserves ordering on retries
acks=1
max.in.flight.requests.per.connection=1

# Disable compression for lowest latency
compression.type=none

Broker configuration:

# Detect lagging or unresponsive replicas quickly
replica.lag.time.max.ms=500
replica.socket.timeout.ms=1000

# Frequent flushes (if durability required)
log.flush.interval.messages=1
log.flush.interval.ms=10

Consumer configuration:

# Return fetch responses as soon as any data is available
fetch.min.bytes=1
fetch.max.wait.ms=10

# Detect failed consumers quickly so rebalances start sooner
session.timeout.ms=6000
heartbeat.interval.ms=2000

Apache Pulsar for Ultra-Low Latency

Pulsar’s architecture allows some optimizations Kafka cannot match:

# bookkeeper.conf: size the DbLedgerStorage write and read-ahead caches
dbStorage_writeCacheMaxSizeMb=512
dbStorage_readAheadCacheMaxSizeMb=256

# Skip journal fsync for maximum speed (only if your durability requirements allow)
journalSyncData=false

Apache Flink for Stream Processing

# flink-conf.yaml: trade a little throughput for lower latency
# Flush network buffers promptly instead of waiting for them to fill
execution.buffer-timeout: 5 ms

# Keep operator state on the JVM heap (RocksDB adds per-record overhead)
state.backend: hashmap

# Checkpoint less frequently to reduce barrier and alignment overhead
execution.checkpointing.interval: 30 s

# Emit watermarks frequently so event-time results surface promptly
pipeline.auto-watermark-interval: 1 ms

Conclusion: Engineering Time as a Feature

Latency in streaming systems isn’t just a performance metric—it’s a feature you must consciously design, budget, and engineer for. Just as Jeff Dean’s numbers taught programmers to respect the reality of time in computing hardware, these streaming latency numbers should guide every architectural decision you make.

The key insights:

  1. Physics sets hard floors. You cannot stream across continents in under 80ms, period. You cannot do synchronous disk writes faster than your storage allows. Design within reality.

  2. Latency is expensive. Ultra-low latency often costs 10x-100x more than “good enough” latency. Netflix’s 38x cost difference between Kafka and Iceberg isn’t unique—it’s typical.

  3. Percentiles matter more than averages. Your users experience P95 and P99 latencies, not medians. A system with 50ms average and 2-second P99 is not a real-time system.

  4. Every millisecond is a trade-off. Choosing synchronous replication adds latency but prevents data loss. Choosing small batches reduces latency but limits throughput. These aren’t bugs—they’re fundamental engineering decisions.

  5. Monitor what matters. End-to-end latency with business-relevant percentiles. Break down by pipeline stage. Alert on degradation, not just failures.

The goal isn’t to build the fastest possible system—it’s to build the right system for your latency budget. Sometimes that’s a 5ms ultra-low latency platform costing hundreds of thousands per month. Sometimes it’s a 5-minute batch process costing hundreds per month. Both can be “real-time” in their proper context.

Armed with these numbers, you can confidently navigate the trade-offs between speed, cost, and complexity. You’ll know when a requirement is physically impossible, when it’s technically feasible but economically questionable, and when it’s the right fit for your streaming architecture.

Most importantly, you’ll stop debating whether something is “real-time” and start designing systems that deliver data when and where it’s needed—in real real-time.

As an Apache Pulsar committer, I’m always interested in hearing about your experiences with streaming data technologies. Feel free to reach out with questions or share your own insights!

References

[1] Netflix cost analysis based on industry presentations and blog posts discussing their data lake architecture. The specific 38x figure is commonly cited in streaming architecture discussions, though exact source documentation may vary. For current Netflix data architecture details, see their technology blog and conference presentations on data platform evolution.


This content originally appeared on DEV Community and was authored by David Kjerrumgaard