Overview

Source: Neo Kim — systemdesign.one
Slack delivers billions of messages daily across millions of channels. It is a hybrid of email and IRC, combining real-time delivery with persistent, searchable history. Built for enterprise-scale organizations, Slack's architecture had to evolve significantly from its startup origins to support more than 500K organizations globally.

Key Concepts

WebSocket — Persistent bidirectional connection used for real-time message delivery. Each connected client holds an open WebSocket to a channel server.
Presence Status — Real-time tracking of which users are online, idle, or offline. Updated via heartbeat signals from connected clients.
Workspace — Top-level organizational unit in Slack. Contains channels, users, and messages.

Scale Baseline

  • 10M DAU, 7M simultaneously connected users
  • Up to 10,000 users per channel; 200,000 users per workspace
  • Peak traffic: 11:00–14:00 weekdays
  • 60%+ traffic from outside the US
  • Billions of messages per day
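
To get a feel for these numbers, the baseline can be turned into rough per-second rates. This is a back-of-envelope sketch only: the source says "billions" without giving a figure, so `ASSUMED_MESSAGES_PER_DAY` uses one billion as an assumed lower bound, and the fan-out figure is the worst case of a full 10,000-user channel with every other member connected.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: "billions of messages per day" is taken as a 1 billion lower bound.
ASSUMED_MESSAGES_PER_DAY = 1_000_000_000
# From the baseline: up to 10,000 users per channel.
MAX_CHANNEL_SIZE = 10_000


def average_rate_per_second(messages_per_day: int) -> float:
    """Average send rate if traffic were spread evenly over the day."""
    return messages_per_day / SECONDS_PER_DAY


def worst_case_fanout(channel_size: int) -> int:
    """Deliveries caused by one message when every other member is connected."""
    return channel_size - 1


avg_rate = average_rate_per_second(ASSUMED_MESSAGES_PER_DAY)
fanout = worst_case_fanout(MAX_CHANNEL_SIZE)

print(f"average send rate: {avg_rate:,.0f} messages/s")
print(f"worst-case fan-out per message: {fanout:,} deliveries")
```

Real traffic is concentrated in the peak window, so the peak rate will be well above this average.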

Core Components

  • Gateway / Load Balancer — Routes WebSocket connections to the appropriate channel server based on workspace.
  • Channel Server — Manages persistent WebSocket connections for a set of channels. Routes messages to connected clients.
  • Message Store (MySQL) — Persists all messages with channel ID, user ID, timestamp, and content.
  • Search Index (Elasticsearch) — Full-text search across message history.
  • Presence Service — Aggregates heartbeats and maintains online/idle/offline state per user.
  • Push Notification Service — Sends mobile push notifications to offline/backgrounded clients.
  • File Storage — Object storage for shared files and images.
  • Notification Preferences Service — Per-user, per-channel notification settings.
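
The gateway's workspace-based routing could be sketched as below. The source only says routing is "based on workspace"; using a consistent-hash ring is an assumption made here for illustration, and `WorkspaceRouter`, the server names, and the virtual-node count are all hypothetical.

```python
import bisect
import hashlib


class WorkspaceRouter:
    """Maps a workspace ID to a channel server using a consistent-hash ring."""

    def __init__(self, servers, vnodes=64):
        # Each server is placed on the ring at `vnodes` pseudo-random points.
        ring = []
        for server in servers:
            for i in range(vnodes):
                ring.append((self._hash(f"{server}#{i}"), server))
        ring.sort()
        self._ring = ring
        self._keys = [point for point, _ in ring]

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def server_for(self, workspace_id: str) -> str:
        # Walk clockwise to the first ring point after the workspace's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._keys, self._hash(workspace_id))
        return self._ring[idx % len(self._ring)][1]


router = WorkspaceRouter(["cs-1", "cs-2", "cs-3"])
print(router.server_for("T024"))
```

With this scheme, removing a server only remaps the workspaces that were assigned to it; every other workspace keeps its server.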

Message Flow (Send a Message)

  1. User types a message; the client sends it over its WebSocket to the Channel Server
  2. Channel Server persists the message to MySQL before delivering it (write-ahead)
  3. Channel Server identifies all connected members of the channel
  4. Channel Server broadcasts the message to those clients over their WebSocket connections
  5. For offline members, the Push Notification Service sends a notification via APNs/FCM
  6. Elasticsearch indexer consumes the message asynchronously for search
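
The steps above can be sketched in a few lines. This is an in-memory illustration, not Slack's implementation: plain lists stand in for MySQL, WebSocket connections, and the push service, and the names `ChannelServer`, `sockets`, `push_queue`, and `index_queue` are hypothetical. Skipping the sender during broadcast is also an assumption.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Message:
    channel_id: str
    user_id: str
    ts: float
    content: str


@dataclass
class ChannelServer:
    members: dict   # channel_id -> set of member user IDs
    sockets: dict   # user_id -> list standing in for an open WebSocket
    store: list = field(default_factory=list)           # stand-in for MySQL
    push_queue: list = field(default_factory=list)      # stand-in for APNs/FCM requests
    index_queue: deque = field(default_factory=deque)   # drained later by the search indexer

    def send(self, msg: Message) -> None:
        # Step 2: persist first, so a crash after this point cannot lose the message.
        self.store.append(msg)
        # Steps 3-5: deliver to connected members, queue a push for the rest.
        for user_id in sorted(self.members.get(msg.channel_id, ())):
            if user_id == msg.user_id:
                continue  # assumption: the sender's client already shows the message
            socket = self.sockets.get(user_id)
            if socket is not None:
                socket.append(msg)
            else:
                self.push_queue.append((user_id, msg))
        # Step 6: hand off to the indexer without waiting for it.
        self.index_queue.append(msg)


server = ChannelServer(
    members={"C1": {"alice", "bob", "carol"}},
    sockets={"alice": [], "bob": []},  # carol is offline
)
server.send(Message("C1", "alice", 1.0, "hi"))
print(len(server.sockets["bob"]), len(server.push_queue), len(server.index_queue))
```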

Presence System Design

  • Every client sends a heartbeat every 5–30 seconds
  • Presence Service aggregates heartbeats: when a user's heartbeats stop, the user transitions to idle, then to offline
  • Eventually consistent — slight delay acceptable for presence (not critical path)
  • Presence state stored in Redis for low-latency reads
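
A minimal sketch of the heartbeat-to-status logic follows. The source does not give the idle and offline thresholds, so the 60-second and 300-second defaults are assumptions, and the dictionary is only a stand-in for the Redis store. `PresenceTracker` is a hypothetical name.

```python
import time


class PresenceTracker:
    """Derives online/idle/offline from the time since the last heartbeat."""

    def __init__(self, idle_after: float = 60.0, offline_after: float = 300.0):
        # Assumed thresholds, in seconds of heartbeat silence.
        self.idle_after = idle_after
        self.offline_after = offline_after
        self._last_seen = {}  # user_id -> timestamp; stand-in for Redis

    def heartbeat(self, user_id: str, now: float = None) -> None:
        self._last_seen[user_id] = time.monotonic() if now is None else now

    def status(self, user_id: str, now: float = None) -> str:
        now = time.monotonic() if now is None else now
        last = self._last_seen.get(user_id)
        if last is None:
            return "offline"  # never seen a heartbeat from this user
        silence = now - last
        if silence >= self.offline_after:
            return "offline"
        if silence >= self.idle_after:
            return "idle"
        return "online"


tracker = PresenceTracker()
tracker.heartbeat("u1", now=0.0)
print(tracker.status("u1", now=10.0), tracker.status("u1", now=90.0), tracker.status("u1", now=400.0))
```

Because status is computed at read time from the last heartbeat, a missed heartbeat needs no explicit cleanup, which fits the eventually consistent model described above.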

High Availability & Scaling

  • Database sharding by workspace ID — each shard owns the full message history for a set of workspaces
  • Read replicas for search and message history queries
  • Vitess (MySQL clustering) used to manage sharded MySQL at scale
  • Channel servers hold open WebSocket connections but keep no durable state of their own; session state is stored in a distributed cache
  • Global traffic routing via Anycast DNS
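
Sharding by workspace ID can be illustrated with a range-based lookup. The shard names loosely follow the key-range naming used by Vitess (for example `-40` and `c0-`), but the four-shard layout, the SHA-256 hash, and the function names are assumptions for this sketch and not Vitess's actual vindex implementation.

```python
import bisect
import hashlib

# Assumed layout: four shards, each owning a contiguous range of the first key byte.
SHARD_NAMES = ["-40", "40-80", "80-c0", "c0-"]
SHARD_UPPER_BOUNDS = [0x40, 0x80, 0xC0]  # exclusive upper bounds of the first three shards


def keyspace_id(workspace_id: str) -> bytes:
    """Hashes the workspace ID to an 8-byte key (hash choice is an assumption)."""
    return hashlib.sha256(workspace_id.encode("utf-8")).digest()[:8]


def shard_for_keyspace_id(ksid: bytes) -> str:
    """Finds the shard whose range contains the key's first byte."""
    return SHARD_NAMES[bisect.bisect_right(SHARD_UPPER_BOUNDS, ksid[0])]


def shard_for_workspace(workspace_id: str) -> str:
    # Every message in a workspace maps to the same shard, so a workspace's
    # full history is co-located.
    return shard_for_keyspace_id(keyspace_id(workspace_id))


print(shard_for_workspace("T024"))
```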

Key Trade-offs

| Decision | Reasoning |
| --- | --- |
| WebSocket over HTTP polling | True real-time; polling adds latency and server load |
| MySQL over NoSQL | Structured queries (threads, reactions) benefit from relational model |
| Shard by workspace | Keeps workspace data co-located; simplifies consistency |
| Async search indexing | Doesn't block message delivery; slight search lag acceptable |
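
The polling trade-off can be made concrete with simple arithmetic. This is a simplified model that assumes every connected client polls at a fixed interval and that messages arrive at a uniformly random point within it; the 5-second interval is an assumption, while the 7M connected users figure comes from the scale baseline.

```python
CONNECTED_USERS = 7_000_000  # from the scale baseline


def polling_cost(connected_users: int, poll_interval_s: float):
    """Returns (requests per second, average added delivery latency in seconds)."""
    requests_per_second = connected_users / poll_interval_s
    # A message waits, on average, half an interval for the next poll.
    avg_added_latency_s = poll_interval_s / 2
    return requests_per_second, avg_added_latency_s


rps, latency = polling_cost(CONNECTED_USERS, poll_interval_s=5.0)
print(f"{rps:,.0f} requests/s, {latency} s average added latency")
```

Shortening the interval reduces latency but raises the request rate proportionally, and most of those requests return nothing. A WebSocket avoids that coupling by pushing each message as soon as it is available.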