This content originally appeared on DEV Community and was authored by Gregory Chris

Building a Chat System Like WhatsApp: Real-Time at Scale

Real-time messaging systems are the backbone of modern communication platforms like WhatsApp, Signal, and Telegram. Designing a system that supports billions of users and delivers messages in real time, across devices, and with high reliability is a hallmark challenge for senior software engineers preparing for system design interviews.

In this blog post, we’ll walk through the design of a scalable chat system that supports one-on-one and group messaging, tackling key challenges like WebSocket connections, message queuing, push notifications, and ensuring data consistency across multiple devices. Along the way, we’ll address common interview pitfalls and provide actionable strategies to ace system design interviews.

Why This Matters: The Scale of Real-Time Messaging

Imagine handling 2 billion users, each exchanging hundreds of messages per day, with latency requirements as low as milliseconds. Add features like message delivery guarantees, synchronization across multiple devices, encryption, and rich media support. The complexity is immense, but understanding the architectural principles behind such systems is crucial for system design interviews.

Designing a chat system is not just about writing code; it’s about:

Scalability: Can the system handle exponential growth?
Reliability: How do we ensure messages are delivered even during failures?
Consistency: How do we sync messages across devices without conflicts?

Let’s dive in.

Key Requirements and Features

Before jumping into the architecture, let’s define the functional and non-functional requirements:

Functional Requirements

One-on-one chats: Users can send and receive messages in real time.
Group chats: Support for group messaging with delivery guarantees.
Read receipts: Indicate when messages are delivered and read.
Device synchronization: Messages should sync across multiple devices.
Push notifications: Notify users of new messages when offline.

Non-Functional Requirements

Scalability: Support billions of users and millions of concurrent connections.
Latency: Ensure sub-second message delivery.
Reliability: Handle intermittent network failures gracefully.
Security: Encrypt messages in transit and at rest.

High-Level Architecture

Let’s break the system into core components:

1. Client Communication Layer

The client communicates with the server using WebSockets for real-time messaging. Long-lived WebSocket connections enable low-latency bidirectional communication.

Client ↔ WebSocket Server ↔ Backend Services

Advantages of WebSockets:

Persistent connection reduces overhead compared to HTTP polling.
Enables low-latency, bidirectional communication.

Diagram:

+-------------+       +------------------+       +------------------+
|   Client    | <---> | WebSocket Server | <---> | Backend Services |
+-------------+       +------------------+       +------------------+

2. Message Queuing

Messages are queued and processed asynchronously using a system like Apache Kafka or RabbitMQ. Queues ensure reliability and decouple message ingestion from processing.
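To make the decoupling concrete, here is a minimal in-process sketch that uses Python's standard `queue.Queue` as a stand-in for Kafka or RabbitMQ. The function name `run_pipeline` and the uppercase "processing" step are illustrative assumptions, not part of any real broker API.

```python
import queue
import threading


def run_pipeline(messages):
    """Ingest messages on one side of a queue and process them on the other."""
    q = queue.Queue()
    processed = []

    def worker():
        # The consumer runs independently of ingestion.
        while True:
            item = q.get()
            if item is None:  # sentinel value: no more messages
                q.task_done()
                break
            # Stand-in for real work such as persistence or fan-out.
            processed.append(item.upper())
            q.task_done()

    consumer = threading.Thread(target=worker)
    consumer.start()

    for message in messages:
        q.put(message)  # ingestion returns without waiting for processing

    q.put(None)
    q.join()          # block until every queued item has been handled
    consumer.join()
    return processed


print(run_pipeline(["hello", "world"]))
```

A real broker adds durability and replication on top of this pattern, which an in-memory queue does not provide.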

3. Storage Layer

Messages are persisted in a distributed database like Cassandra or MongoDB. These databases are optimized for high write throughput and low-latency reads.

4. Push Notification Service

When users are offline, a push notification service (e.g., Firebase Cloud Messaging or APNs) alerts them to new messages.

Detailed Design

WebSocket Server

The WebSocket server manages millions of concurrent connections. Each user establishes a persistent WebSocket connection with the server.

Challenges:

Connection Management: How do you maintain millions of concurrent WebSocket connections?
- Solution: Use load balancers (e.g., HAProxy, Nginx) and horizontal scaling with WebSocket server clusters.
Session Persistence: How do you route messages to the correct WebSocket server during reconnections?
- Solution: Use sticky sessions or consistent hashing based on user IDs.
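As a rough illustration of consistent hashing on user IDs, the sketch below maps users to WebSocket servers on a hash ring. The class name `ConsistentHashRing`, the server names, and the choice of 100 virtual nodes per server are assumptions made for this example.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps user IDs to servers so that removing a server only moves its own users."""

    def __init__(self, servers, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, server) entries
        for server in servers:
            self.add_server(server)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add_server(self, server):
        # Several virtual nodes per server spread the load more evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def remove_server(self, server):
        self._ring = [entry for entry in self._ring if entry[1] != server]

    def server_for(self, user_id):
        if not self._ring:
            raise RuntimeError("no servers in the ring")
        # First ring entry at or after the user's hash, wrapping around at the end.
        idx = bisect.bisect(self._ring, (self._hash(user_id), ""))
        return self._ring[idx % len(self._ring)][1]


ring = ConsistentHashRing(["ws-1", "ws-2", "ws-3"])
print(ring.server_for("user-42"))
```

When a server leaves the ring, only the users that were mapped to it are reassigned; everyone else reconnects to the same server as before.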

Message Queuing and Delivery Guarantees

Messages are routed through a message queue system (e.g., Apache Kafka) to ensure reliable delivery.

Why Kafka?

Kafka provides:

High throughput for message ingestion.
Partitioning for scalability.
Durability with replicated logs.

Message Flow:

Client → WebSocket Server → Kafka → Message Processor → Database

Delivery Guarantees:

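At-Least-Once Delivery: A retry mechanism ensures that messages are delivered even if transient failures occur. Because retries can deliver the same message more than once, consumers should deduplicate by MessageID.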
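Ordering: Kafka guarantees ordering only within a single partition, not across a whole topic. To keep a conversation in order, route all of its messages to the same partition by using a stable key (for example, a chat or group ID).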
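The two guarantees above can be sketched without a real broker. In this example, `partition_for` is a simplified stand-in for a keyed partitioner (Kafka's own default partitioner uses a different hash function), and `IdempotentProcessor` is a hypothetical consumer that drops redelivered messages by ID. The dictionary field names are assumptions.

```python
import hashlib
from dataclasses import dataclass, field


def partition_for(conversation_id, num_partitions):
    """Pick a partition from a stable key so one conversation stays in one partition."""
    digest = hashlib.md5(conversation_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


@dataclass
class IdempotentProcessor:
    """Consumer that tolerates at-least-once delivery by ignoring repeated IDs."""

    seen_ids: set = field(default_factory=set)
    stored: list = field(default_factory=list)

    def handle(self, message):
        if message["message_id"] in self.seen_ids:
            return False  # duplicate from a retry; already processed
        self.seen_ids.add(message["message_id"])
        self.stored.append(message)
        return True


processor = IdempotentProcessor()
msg = {"message_id": "m-1", "conversation_id": "chat-7", "content": "hi"}
processor.handle(msg)
processor.handle(msg)  # simulated redelivery after a retry
print(len(processor.stored), partition_for("chat-7", 12))
```

In production the set of seen IDs would need to live in durable storage or be replaced by an idempotent write keyed on the message ID.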

Database Design

Messages are stored in a distributed database optimized for high write throughput. A common schema is:

Table: Messages  
- MessageID (Primary Key)  
- SenderID  
- ReceiverID  
- GroupID (optional)  
- Timestamp  
- Content

Key Considerations:

Partitioning: Partition data by ReceiverID or GroupID for efficient querying.
Replication: Use multi-region replication for disaster recovery.
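One way to express the schema and the partitioning rule in code is shown below. The snake_case field names mirror the table above, while `partition_key` and `new_message` are hypothetical helpers introduced for illustration rather than part of any database driver.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    message_id: str
    sender_id: str
    receiver_id: Optional[str]  # set for one-on-one chats
    group_id: Optional[str]     # set for group chats
    timestamp_ms: int
    content: str

    def partition_key(self):
        """Group chats partition by GroupID, direct chats by ReceiverID."""
        if self.group_id is not None:
            return f"group:{self.group_id}"
        return f"user:{self.receiver_id}"


def new_message(sender_id, content, receiver_id=None, group_id=None):
    # Exactly one destination must be given.
    if (receiver_id is None) == (group_id is None):
        raise ValueError("set exactly one of receiver_id or group_id")
    return Message(
        message_id=str(uuid.uuid4()),
        sender_id=sender_id,
        receiver_id=receiver_id,
        group_id=group_id,
        timestamp_ms=int(time.time() * 1000),
        content=content,
    )


print(new_message("alice", "hello", receiver_id="bob").partition_key())
```

Keeping all rows for one recipient or group under a single partition key lets a device fetch recent history with one partition read.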

Synchronization Across Devices

To sync messages across devices:

Store messages in a central database.
Use event sourcing or change data capture (CDC) to notify devices of new updates.

Example Solution:

Use Kafka to stream changes (new messages) to the WebSocket servers, which forward them to each of the user’s connected devices.
On the client side, reconcile message state using timestamps or version numbers.
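A minimal sketch of client-side reconciliation by version number follows. It assumes each message carries `message_id`, `version`, and `timestamp_ms` fields; those names and the `reconcile` function are illustrative, not taken from a specific client SDK.

```python
def reconcile(local, incoming):
    """Merge server updates into the local store and return messages in display order.

    local: dict mapping message_id to the message the device already has.
    incoming: list of messages received from the server, possibly with repeats.
    """
    for message in incoming:
        current = local.get(message["message_id"])
        # Keep the copy with the higher version; ignore stale or repeated updates.
        if current is None or message["version"] > current["version"]:
            local[message["message_id"]] = message
    # Sort by timestamp, using the ID as a tie-breaker for a stable order.
    return sorted(local.values(), key=lambda m: (m["timestamp_ms"], m["message_id"]))


local_store = {
    "m-1": {"message_id": "m-1", "version": 1, "timestamp_ms": 1000, "content": "helo"},
}
updates = [
    {"message_id": "m-1", "version": 2, "timestamp_ms": 1000, "content": "hello"},
    {"message_id": "m-2", "version": 1, "timestamp_ms": 2000, "content": "world"},
]
for message in reconcile(local_store, updates):
    print(message["content"])
```

Applying the same batch of updates twice leaves the local store unchanged, which makes the merge safe under at-least-once delivery.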

Push Notifications

When users are offline and have no active WebSocket connection, the system sends push notifications via services like Firebase Cloud Messaging (FCM) or Apple Push Notification service (APNs).
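The online/offline decision can be sketched as a small router. `DeliveryRouter` and its `send_push` callback are hypothetical; in a real system the callback would wrap the push provider's client, and the connection table would be shared across WebSocket servers rather than held in one process.

```python
class DeliveryRouter:
    """Delivers over a live connection when one exists, otherwise falls back to push."""

    def __init__(self, send_push):
        self._connections = {}      # user_id -> callable that writes to the socket
        self._send_push = send_push  # assumed wrapper around a push provider

    def connect(self, user_id, send):
        self._connections[user_id] = send

    def disconnect(self, user_id):
        self._connections.pop(user_id, None)

    def deliver(self, user_id, text):
        send = self._connections.get(user_id)
        if send is not None:
            send(text)
            return "websocket"
        # Offline: send a generic alert; the device fetches the message on reconnect.
        self._send_push(user_id, "You have a new message")
        return "push"


pushed = []
router = DeliveryRouter(send_push=lambda user_id, body: pushed.append((user_id, body)))
inbox = []
router.connect("alice", inbox.append)
print(router.deliver("alice", "hi"))     # alice is connected
router.disconnect("alice")
print(router.deliver("alice", "again"))  # alice is offline
```

The push body here is deliberately generic, which keeps message content out of the notification payload.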