Building a Chat System Like WhatsApp: Real-time at Scale



This content originally appeared on DEV Community and was authored by Gregory Chris

Real-time messaging systems are the backbone of modern communication platforms like WhatsApp, Signal, and Telegram. Designing a system that supports billions of users and delivers messages in real time, across devices, and with high reliability is a hallmark challenge for senior software engineers preparing for system design interviews.

In this blog post, we’ll walk through the design of a scalable chat system that supports one-on-one and group messaging, tackling key challenges like WebSocket connections, message queuing, push notifications, and ensuring data consistency across multiple devices. Along the way, we’ll address common interview pitfalls and provide actionable strategies to ace system design interviews.

Why This Matters: The Scale of Real-Time Messaging

Imagine handling 2 billion users, each exchanging hundreds of messages per day, with latency requirements as low as milliseconds. Add features like message delivery guarantees, synchronization across multiple devices, encryption, and rich media support. The complexity is immense, but understanding the architectural principles behind such systems is crucial for system design interviews.

Designing a chat system is not just about writing code. It's about:

  1. Scalability: Can the system handle exponential growth?
  2. Reliability: How do we ensure messages are delivered even during failures?
  3. Consistency: How do we sync messages across devices without conflicts?

Let’s dive in.

Key Requirements and Features

Before jumping into the architecture, let’s define the functional and non-functional requirements:

Functional Requirements

  • One-on-one chats: Users can send and receive messages in real time.
  • Group chats: Support for group messaging with delivery guarantees.
  • Read receipts: Indicate when messages are delivered and read.
  • Device synchronization: Messages should sync across multiple devices.
  • Push notifications: Notify users of new messages when offline.

Non-Functional Requirements

  • Scalability: Support billions of users and millions of concurrent connections.
  • Latency: Ensure sub-second message delivery.
  • Reliability: Handle intermittent network failures gracefully.
  • Security: Encrypt messages in transit and at rest.

High-Level Architecture

Let’s break the system into core components:

1. Client Communication Layer

The client communicates with the server using WebSockets for real-time messaging. Long-lived WebSocket connections enable low-latency bidirectional communication.

Client ↔ WebSocket Server ↔ Backend Services

Advantages of WebSockets:

  • Persistent connection reduces overhead compared to HTTP polling.
  • Enables low-latency, bidirectional communication.

Diagram:

+-------------+       +------------------+       +------------------+
|   Client    | <---> | WebSocket Server | <---> | Backend Services |
+-------------+       +------------------+       +------------------+

2. Message Queuing

Messages are queued and processed asynchronously using a system like Apache Kafka or RabbitMQ. A durable queue absorbs traffic bursts and lets consumers retry after a failure, which decouples message ingestion from processing.

3. Storage Layer

Messages are persisted in a distributed database like Cassandra or MongoDB. Cassandra in particular is built for high write throughput, and reads stay fast as long as each query targets a single partition.

4. Push Notification Service

When users are offline, a push notification service (e.g., Firebase Cloud Messaging or APNs) alerts them to new messages.
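To make the online/offline split concrete, here is a minimal sketch of the routing decision. The `DeliveryRouter` class, its `online` map, and the `push_outbox` list are hypothetical stand-ins invented for this example: a real system would hold WebSocket connections and call FCM or APNs instead of appending to lists.

```python
from dataclasses import dataclass, field


@dataclass
class DeliveryRouter:
    """Chooses between a live connection and a push notification (illustrative only)."""

    # user_id -> list standing in for an open WebSocket connection (assumption)
    online: dict = field(default_factory=dict)
    # (user_id, message) pairs that would be handed to a push provider (assumption)
    push_outbox: list = field(default_factory=list)

    def deliver(self, user_id: str, message: str) -> str:
        connection = self.online.get(user_id)
        if connection is not None:
            connection.append(message)  # user is connected: send over the socket
            return "websocket"
        self.push_outbox.append((user_id, message))  # user is offline: notify instead
        return "push"


router = DeliveryRouter()
router.online["alice"] = []
print(router.deliver("alice", "hi"))   # alice has a live connection
print(router.deliver("bob", "hello"))  # bob does not
```

The message itself is still persisted either way; the push notification only tells the offline device that something is waiting.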

Detailed Design

WebSocket Server

The WebSocket server manages millions of concurrent connections. Each user establishes a persistent WebSocket connection with the server.

Challenges:

  1. Connection Management: How do you maintain millions of concurrent WebSocket connections?

    • Solution: Use load balancers (e.g., HAProxy, Nginx) and horizontal scaling with WebSocket server clusters.
  2. Session Persistence: How do you route messages to the correct WebSocket server during reconnections?

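    • Solution: Use consistent hashing on user IDs so any service can compute which server owns a user's connection, or use sticky sessions together with a lookup table from user ID to server.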
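Consistent hashing on user IDs can be sketched as follows. This is a minimal illustration, not a production router: the `HashRing` class, the `vnodes` parameter, and the server names are assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (unlike hash(), stable across processes)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring mapping user IDs to WebSocket server names."""

    def __init__(self, servers, vnodes=100):
        self._vnodes = vnodes  # virtual nodes per server, to even out the load
        self._ring = []        # sorted list of (hash, server) points
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove(self, server: str) -> None:
        self._ring = [point for point in self._ring if point[1] != server]

    def server_for(self, user_id: str) -> str:
        if not self._ring:
            raise LookupError("no servers in ring")
        # First point clockwise from the user's hash, wrapping around at the end.
        idx = bisect.bisect_left(self._ring, (_hash(user_id), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["ws-1", "ws-2", "ws-3"])
print(ring.server_for("user-42"))
```

The useful property is that removing a server only remaps the users who were on that server; everyone else keeps their assignment, so a crash does not trigger a reconnect storm across the whole cluster.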

Message Queuing and Delivery Guarantees

Messages are routed through a message queue system (e.g., Apache Kafka) to ensure reliable delivery.

Why Kafka?

Kafka provides:

  • High throughput for message ingestion.
  • Partitioning for scalability.
  • Durability with replicated logs.

Message Flow:

Client → WebSocket Server → Kafka → Message Processor → Database

Delivery Guarantees:

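  • At-Least-Once Delivery: Retry mechanism ensures that messages are delivered even if transient failures occur. Because a retry can deliver the same message twice, consumers should deduplicate by MessageID.
  • Ordering: Kafka guarantees ordering within a partition, not across a whole topic. Keying each message by its conversation ID sends every message in a chat to the same partition, which preserves per-chat order.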
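Both guarantees can be sketched without a running broker. The code below is an assumption-laden illustration: `partition_for` is a simplified stand-in for a keyed partitioner (it is not Kafka's own hashing algorithm), and `IdempotentConsumer` keeps seen IDs in memory where a real consumer would use a durable store.

```python
import hashlib


def partition_for(conversation_id: str, num_partitions: int) -> int:
    """Pick a partition from the conversation ID so one chat always maps to one partition."""
    digest = hashlib.md5(conversation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


class IdempotentConsumer:
    """Applies each message at most once, even if the queue redelivers it."""

    def __init__(self):
        self._seen_ids = set()  # in-memory for the sketch; durable in practice
        self.applied = []       # stand-in for the database write

    def handle(self, message: dict) -> bool:
        if message["message_id"] in self._seen_ids:
            return False  # duplicate from a retry: safe to acknowledge and skip
        self.applied.append(message)
        self._seen_ids.add(message["message_id"])
        return True


consumer = IdempotentConsumer()
msg = {"message_id": "m1", "content": "hi"}
consumer.handle(msg)
consumer.handle(msg)          # simulated redelivery
print(len(consumer.applied))  # the message was applied only once
```

At-least-once delivery plus an idempotent consumer gives the user-visible effect of exactly-once processing, which is usually what an interviewer is probing for.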

Database Design

Messages are stored in a distributed database optimized for high write throughput. A common schema is:

Table: Messages  
- MessageID (Primary Key)  
- SenderID  
- ReceiverID  
- GroupID (optional)  
- Timestamp  
- Content  

Key Considerations:

  • Partitioning: Partition data by ReceiverID or GroupID for efficient querying.
  • Replication: Use multi-region replication for disaster recovery.
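As a rough sketch of how the schema and partitioning choice fit together, the example below models the Messages table in memory. The `MessageStore` class and the `group:` / `user:` key prefixes are hypothetical; a real deployment would express the same idea through the database's partition key.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    message_id: str
    sender_id: str
    receiver_id: Optional[str]  # set for one-on-one messages
    group_id: Optional[str]     # set for group messages
    timestamp: int              # milliseconds since epoch
    content: str


def partition_key(msg: Message) -> str:
    """GroupID for group messages, otherwise ReceiverID."""
    return f"group:{msg.group_id}" if msg.group_id else f"user:{msg.receiver_id}"


class MessageStore:
    """In-memory stand-in for a partitioned table."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, msg: Message) -> None:
        self._partitions[partition_key(msg)].append(msg)

    def recent(self, key: str, limit: int = 50):
        """Newest-first read that touches a single partition."""
        rows = sorted(self._partitions.get(key, []), key=lambda m: m.timestamp, reverse=True)
        return rows[:limit]


store = MessageStore()
store.insert(Message("m1", "alice", "bob", None, 1000, "hi"))
store.insert(Message("m2", "carol", "bob", None, 2000, "hey"))
print([m.message_id for m in store.recent("user:bob")])
```

Because "load this user's recent messages" only reads one partition, the query cost stays flat as the total number of users grows.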

Synchronization Across Devices

To sync messages across devices:

  1. Store messages in a central database.
  2. Use event sourcing or change data capture (CDC) to notify devices of new updates.

Example Solution:

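  • Use Kafka to stream changes (new messages) to the WebSocket servers, which push them to each of the user's connected devices.
  • On the client side, reconcile message state using timestamps or version numbers. Prefer server-assigned values, since device clocks can drift.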
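Client-side reconciliation might look like the sketch below. It assumes each message is a dict carrying `message_id`, `timestamp`, and a server-assigned `version`; those field names and the `reconcile` function are illustrative, not taken from any particular client.

```python
def reconcile(local: list, incoming: list) -> list:
    """Merge server updates into local state, keeping the highest version of each message."""
    merged = {m["message_id"]: m for m in local}
    for msg in incoming:
        current = merged.get(msg["message_id"])
        if current is None or msg["version"] > current["version"]:
            merged[msg["message_id"]] = msg
    # Sort by timestamp, with message_id as a tie-breaker so every device agrees on order.
    return sorted(merged.values(), key=lambda m: (m["timestamp"], m["message_id"]))


local = [
    {"message_id": "m1", "timestamp": 1, "version": 1, "content": "hi"},
    {"message_id": "m2", "timestamp": 2, "version": 1, "content": "helo"},
]
incoming = [
    {"message_id": "m2", "timestamp": 2, "version": 2, "content": "hello"},  # edited
    {"message_id": "m3", "timestamp": 3, "version": 1, "content": "bye"},    # new
]
for m in reconcile(local, incoming):
    print(m["message_id"], m["content"])
```

The merge is idempotent: applying the same batch of updates twice leaves the state unchanged, so a device can safely replay updates after a reconnect.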

Push Notifications

When users are offline, the system sends push notifications via services like Firebase or APNs.

Challenges:

  1. Notification Deduplication: Ensure users don’t receive duplicate notifications.

    • Solution: Use a notification queue that deduplicates on the recipient and message ID before sending.
  2. Battery Optimizations: Avoid excessive notifications that drain the user’s battery.

    • Solution: Batch notifications for chat groups.
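Deduplication and batching can share one small component. The sketch below is an assumption: the `NotificationBatcher` class, its `max_remembered` bound, and the shape of the flushed batches are invented for illustration, and a real service would flush on a timer and hand each batch to the push provider.

```python
from collections import OrderedDict, defaultdict


class NotificationBatcher:
    """Drops duplicate notifications and groups the rest per (user, chat)."""

    def __init__(self, max_remembered: int = 10_000):
        self._seen = OrderedDict()         # bounded memory of (user_id, message_id)
        self._max = max_remembered
        self._pending = defaultdict(list)  # (user_id, chat_id) -> message previews

    def enqueue(self, user_id: str, chat_id: str, message_id: str, preview: str) -> bool:
        key = (user_id, message_id)
        if key in self._seen:
            return False  # already queued for this user: skip the duplicate
        self._seen[key] = True
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # forget the oldest entry
        self._pending[(user_id, chat_id)].append(preview)
        return True

    def flush(self) -> list:
        """Return one summary notification per (user, chat) and reset the buffer."""
        batches = [
            {"user_id": u, "chat_id": c, "count": len(p), "latest": p[-1]}
            for (u, c), p in self._pending.items()
        ]
        self._pending.clear()
        return batches


batcher = NotificationBatcher()
batcher.enqueue("bob", "g1", "m1", "hi")
batcher.enqueue("bob", "g1", "m1", "hi")     # duplicate, ignored
batcher.enqueue("bob", "g1", "m2", "there")
print(batcher.flush())
```

A busy group chat then produces one "2 new messages" notification per flush instead of waking the device for every message.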

Scaling Considerations

  1. Horizontal Scaling: Scale WebSocket servers, backend services, and databases independently.
  2. Sharding: Use database sharding to distribute load across multiple clusters.
  3. Rate Limiting: Prevent abuse by limiting the number of messages sent per user per second.
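Per-user rate limiting is often done with a token bucket. The version below is a minimal single-process sketch; the `TokenBucket` class and its parameters are assumptions for the example, and a distributed deployment would need shared state (for instance in a cache) rather than an in-memory counter.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


# One bucket per user; here a single user sends a burst of four messages.
bucket = TokenBucket(rate_per_sec=2, capacity=3)
print([bucket.allow() for _ in range(4)])
```

The bucket lets a user send a short burst (up to `capacity` messages) while capping the sustained rate, which suits chat traffic better than a hard per-second cutoff.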

Common Interview Pitfalls

  1. Skipping the Basics: Start with one-on-one chats before jumping to group messaging and advanced features.
  2. Ignoring Failure Scenarios: Discuss how the system handles failures (e.g., server crashes, network partitions).
  3. Overlooking Data Consistency: Explain how you ensure message ordering and avoid duplication.

Interview Talking Points

Framework for Discussing System Design

  1. Clarify Requirements: Start by asking clarifying questions about features, scale, and constraints.
  2. Define Core Components: Identify the major subsystems (e.g., WebSocket server, message queue).
  3. Discuss Trade-Offs: Explain why you chose a specific database, queue system, or protocol.

Example Talking Points

  • “I’d use Kafka for message queuing because it provides durability and per-partition ordering guarantees, which are critical for chat systems.”
  • “WebSockets are ideal for real-time messaging because they enable low-latency, bidirectional communication without polling overhead.”
  • “To ensure consistency across devices, I’d implement event sourcing and use timestamps to reconcile conflicts.”

Key Takeaways

  1. Start Simple: Begin with one-on-one messaging before tackling group chats.
  2. Focus on Scalability: Design for billions of users with distributed systems principles.
  3. Prioritize Consistency: Ensure reliable message delivery and synchronization across devices.
  4. Address Failure Scenarios: Highlight how the system handles crashes, retries, and network issues.

Actionable Next Steps

  1. Practice System Design: Sketch out architectures for other real-time systems like Uber’s location tracking or Twitter’s live feed.
  2. Learn Distributed Systems: Dive into topics like Kafka, Cassandra, and event sourcing.
  3. Mock Interviews: Practice explaining your design to peers and get feedback.

Real-time messaging systems like WhatsApp are among the most challenging architectures to design, but they offer a perfect opportunity to showcase your distributed systems expertise in interviews. By mastering the principles outlined in this post, you’ll be well-equipped to design scalable, reliable systems and impress interviewers with your technical depth and clarity.

Good luck with your interviews! 🚀

