Overview
Source: Neo Kim — systemdesign.one
Slack sends billions of messages daily across millions of channels. It's a hybrid between email and IRC, combining real-time delivery with persistent, searchable history. Built for enterprise-scale organizations, Slack's architecture had to evolve significantly from its startup days to support 500K+ organizations globally.
Key Concepts
WebSocket — Persistent bidirectional connection used for real-time message delivery. Each connected client holds an open WebSocket to a channel server.
Presence Status — Real-time tracking of which users are online, idle, or offline. Updated via heartbeat signals from connected clients.
Workspace — Top-level organizational unit in Slack. Contains channels, users, and messages.
Scale Baseline
- 10M DAU, 7M simultaneously connected users
- Up to 10,000 users per channel; 200,000 users per workspace
- Peak traffic: 11:00–14:00 weekdays
- 60%+ traffic from outside the US
- Billions of messages per day
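A rough back-of-envelope pass turns these figures into per-second rates. It rests on two loudly stated assumptions: "billions" is taken at its lower bound of 1 billion messages per day, and the 11:00-14:00 window is assumed to carry 40% of daily traffic (a made-up share for illustration). Because 60%+ of traffic comes from outside the US, local peaks are spread across time zones, so treating the window as a single global peak overstates how concentrated the load is.

```python
# Back-of-envelope rates derived from the scale baseline.
SECONDS_PER_DAY = 24 * 60 * 60

MESSAGES_PER_DAY = 1_000_000_000  # ASSUMPTION: lower bound of "billions"
PEAK_SHARE = 0.4                  # ASSUMPTION: share of daily traffic in the peak window
PEAK_HOURS = 3                    # 11:00-14:00, from the baseline
MAX_CHANNEL_MEMBERS = 10_000      # from the baseline


def avg_rate(messages_per_day: float) -> float:
    """Average messages per second across a whole day."""
    return messages_per_day / SECONDS_PER_DAY


def peak_rate(messages_per_day: float, peak_share: float, peak_hours: float) -> float:
    """Messages per second if `peak_share` of daily traffic lands in `peak_hours`."""
    return messages_per_day * peak_share / (peak_hours * 3600)


if __name__ == "__main__":
    print(f"average: {avg_rate(MESSAGES_PER_DAY):,.0f} msg/s")  # average: 11,574 msg/s
    print(f"peak:    {peak_rate(MESSAGES_PER_DAY, PEAK_SHARE, PEAK_HOURS):,.0f} msg/s")  # peak:    37,037 msg/s
    # A single message in a full channel fans out to every member.
    print(f"max fan-out per message: {MAX_CHANNEL_MEMBERS:,} deliveries")
```

The fan-out line is the reason delivery cost scales with channel size, not just with message count: one post in a 10,000-member channel can mean up to 10,000 socket writes.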
Core Components
- Gateway / Load Balancer — Routes WebSocket connections to the appropriate channel server based on workspace.
- Channel Server — Manages persistent WebSocket connections for a set of channels. Routes messages to connected clients.
- Message Store (MySQL) — Persists all messages with channel ID, user ID, timestamp, and content.
- Search Index (Elasticsearch) — Full-text search across message history.
- Presence Service — Aggregates heartbeats and maintains online/idle/offline state per user.
- Push Notification Service — Sends mobile push notifications to offline/backgrounded clients.
- File Storage — Object storage for shared files and images.
- Notification Preferences Service — Per-user, per-channel notification settings.
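The source says only that the gateway routes connections "based on workspace"; it does not say how the mapping is computed. One common technique for this kind of routing is consistent hashing, sketched here as an assumption rather than Slack's actual scheme. The `HashRing` class and server names like `cs-1` are hypothetical. The appeal of a hash ring is that adding a channel server moves only the workspaces whose ring segments it takes over, instead of reshuffling every workspace.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    # 64-bit hash that is stable across processes (built-in hash() is salted per process).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical workspace -> channel-server router using consistent hashing."""

    def __init__(self, servers, vnodes: int = 100):
        self.vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> server name
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        # Each server gets several virtual nodes so load spreads more evenly.
        for i in range(self.vnodes):
            point = stable_hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def server_for(self, workspace_id: str) -> str:
        # Walk clockwise to the first point at or after the workspace's hash,
        # wrapping around to the start of the ring.
        idx = bisect.bisect_left(self._points, stable_hash(workspace_id))
        return self._owner[self._points[idx % len(self._points)]]


ring = HashRing(["cs-1", "cs-2", "cs-3"])
print(ring.server_for("workspace-42"))
```

With 100 virtual nodes per server, adding a fourth server should move roughly a quarter of workspaces, and every workspace that moves lands on the new server.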
Message Flow (Send a Message)
- The user types a message; the client sends it over its WebSocket to the Channel Server
- The Channel Server persists the message to MySQL before broadcasting it (write-ahead), so it is durable before anyone sees it
- The Channel Server looks up which members of the channel are currently connected
- It broadcasts the message to those clients over their WebSocket connections
- For offline members, the Push Notification Service sends notifications via APNs/FCM
- An Elasticsearch indexer consumes the message asynchronously to make it searchable
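The send path above can be sketched in a few lines. Everything here is a hypothetical in-memory stand-in: a list plays MySQL, plain callables play WebSocket writes and the APNs/FCM push call, and a `queue.Queue` plays whatever feeds the Elasticsearch indexer. What the sketch illustrates is the ordering: persist first, fan out to connected members, push to the rest, then hand off to search without waiting on it.

```python
import queue
import time
from dataclasses import dataclass, field


@dataclass
class Message:
    channel_id: str
    user_id: str
    content: str
    ts: float = field(default_factory=time.time)


class ChannelServer:
    """Hypothetical stand-in for the Channel Server's send path."""

    def __init__(self, members, connections, store, push, index_queue):
        self.members = members          # channel_id -> set of member user_ids
        self.connections = connections  # user_id -> callable that writes to that user's socket
        self.store = store              # list standing in for the MySQL message table
        self.push = push                # callable(user_id, msg) standing in for APNs/FCM
        self.index_queue = index_queue  # queue read later by a search indexer

    def send(self, msg: Message) -> dict:
        # 1. Persist before broadcasting so the message survives a crash mid-fan-out.
        self.store.append(msg)
        delivered, pushed = [], []
        # 2. Fan out to every channel member (sorted only for deterministic output).
        for user_id in sorted(self.members.get(msg.channel_id, set())):
            write = self.connections.get(user_id)
            if write is not None:
                write(msg)               # connected: real-time delivery
                delivered.append(user_id)
            else:
                self.push(user_id, msg)  # offline: mobile push notification
                pushed.append(user_id)
        # 3. Search indexing is asynchronous: enqueue and return without waiting.
        self.index_queue.put(msg)
        return {"delivered": delivered, "pushed": pushed}


inboxes = {"U1": [], "U2": []}
server = ChannelServer(
    members={"C1": {"U1", "U2", "U3"}},
    connections={u: box.append for u, box in inboxes.items()},
    store=[],
    push=lambda user_id, msg: print(f"push -> {user_id}: {msg.content}"),
    index_queue=queue.Queue(),
)
print(server.send(Message("C1", "U1", "hello")))  # {'delivered': ['U1', 'U2'], 'pushed': ['U3']}
```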
Presence System Design
- Every client sends a heartbeat every 5–30 seconds
- The Presence Service aggregates heartbeats: when a client's heartbeats stop, the user transitions to idle, then offline
- Eventually consistent: a slight delay is acceptable for presence because it is not on the critical path
- Presence state stored in Redis for low-latency reads
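A minimal sketch of deriving presence from heartbeats, assuming the service records only each user's last heartbeat time and computes the state on read. The idle and offline thresholds are hypothetical values chosen to tolerate a few missed heartbeats at the 5 to 30 second interval. A dict stands in for Redis, and an injectable clock makes the transitions easy to test.

```python
import time
from enum import Enum


class Presence(Enum):
    ONLINE = "online"
    IDLE = "idle"
    OFFLINE = "offline"


IDLE_AFTER_S = 60      # ASSUMPTION: silence before a user is shown as idle
OFFLINE_AFTER_S = 300  # ASSUMPTION: silence before a user is shown as offline


class PresenceTracker:
    """Hypothetical Presence Service core; a dict stands in for Redis."""

    def __init__(self, clock=time.monotonic):
        self._last_seen = {}  # user_id -> timestamp of last heartbeat
        self._clock = clock

    def heartbeat(self, user_id: str) -> None:
        self._last_seen[user_id] = self._clock()

    def status(self, user_id: str) -> Presence:
        last = self._last_seen.get(user_id)
        if last is None:
            return Presence.OFFLINE  # never seen
        silent = self._clock() - last
        if silent < IDLE_AFTER_S:
            return Presence.ONLINE
        if silent < OFFLINE_AFTER_S:
            return Presence.IDLE
        return Presence.OFFLINE


fake_now = [0.0]
tracker = PresenceTracker(clock=lambda: fake_now[0])
tracker.heartbeat("U1")
fake_now[0] = 90
print(tracker.status("U1"))  # Presence.IDLE
```

Computing state on read keeps heartbeat writes cheap, which fits the eventually consistent stance above. In Redis, a comparable effect could come from rewriting a per-user key with an expiry on each heartbeat; the source does not describe the actual key layout.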
High Availability & Scaling
- Database sharding by workspace ID — each shard owns the full message history for a set of workspaces
- Read replicas for search and message history queries
- Vitess (MySQL clustering) used to manage sharded MySQL at scale
- Channel servers are stateless — session state stored in distributed cache
- Global traffic routing via Anycast DNS
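To make "shard by workspace ID" concrete, here is a sketch of a shard lookup. It is loosely modeled on range-based sharding (Vitess, for example, maps a sharding key to a keyspace ID and assigns contiguous key ranges to shards), but the hashing, the `ShardMap` class, and the shard names are hypothetical and are not Vitess's API. Because the only input is the workspace ID, every message in a workspace lands on the same shard, which is what keeps its history co-located.

```python
import bisect
import hashlib

KEYSPACE = 2 ** 64


def keyspace_id(workspace_id: str) -> int:
    # Hash the workspace ID into a 64-bit keyspace (the hash choice is an assumption).
    return int.from_bytes(hashlib.sha256(workspace_id.encode()).digest()[:8], "big")


class ShardMap:
    """Hypothetical map where shard i owns keyspace range [starts[i], starts[i+1])."""

    def __init__(self, starts, shards):
        if not starts or starts[0] != 0 or list(starts) != sorted(starts) or len(starts) != len(shards):
            raise ValueError("starts must be sorted, begin at 0, and match shards one-to-one")
        self.starts = list(starts)
        self.shards = list(shards)

    def shard_for(self, workspace_id: str) -> str:
        # The owning shard is the last range whose start is <= the keyspace ID.
        idx = bisect.bisect_right(self.starts, keyspace_id(workspace_id)) - 1
        return self.shards[idx]


def even_split(n: int) -> ShardMap:
    """Split the keyspace into n equal ranges named shard-0 .. shard-(n-1)."""
    step = KEYSPACE // n
    return ShardMap([i * step for i in range(n)], [f"shard-{i}" for i in range(n)])


shards = even_split(4)
print(shards.shard_for("workspace-42"))
```

Splitting a hot shard then means cutting one range in two and moving only that range's workspaces, rather than rehashing every workspace as a plain `hash % n` scheme would.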
Key Trade-offs
| Decision | Reasoning |
|---|---|
| WebSocket over HTTP polling | True real-time; polling adds latency and server load |
| MySQL over NoSQL | Structured queries (threads, reactions) benefit from relational model |
| Shard by workspace | Keeps workspace data co-located; simplifies consistency |
| Async search indexing | Doesn't block message delivery; slight search lag acceptable |
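The first row's trade-off can be put in numbers. The 7M simultaneously connected users come from the scale baseline; the 5-second polling interval is a hypothetical figure chosen for comparison. Polling costs a request per client per interval whether or not anything changed, and a new message waits on average half an interval for the next poll, whereas a WebSocket push can go out as soon as the message is persisted.

```python
CONNECTED_CLIENTS = 7_000_000  # from the scale baseline
POLL_INTERVAL_S = 5.0          # ASSUMPTION: hypothetical polling interval


def polling_requests_per_second(clients: int, interval_s: float) -> float:
    # Every client asks "anything new?" once per interval, even when nothing changed.
    return clients / interval_s


def mean_polling_delay_s(interval_s: float) -> float:
    # A message arriving at a uniformly random moment waits half an interval on average.
    return interval_s / 2


if __name__ == "__main__":
    rps = polling_requests_per_second(CONNECTED_CLIENTS, POLL_INTERVAL_S)
    print(f"polling load: {rps:,.0f} requests/s")                          # polling load: 1,400,000 requests/s
    print(f"mean added delay: {mean_polling_delay_s(POLL_INTERVAL_S)} s")  # mean added delay: 2.5 s
```

Shortening the interval cuts the delay but raises the request rate proportionally, which is the load-versus-latency bind the table refers to.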