Overview
Source: Neo Kim — systemdesign.one
Pastebin is an online text storage service where clients store and share text snippets (source code, configs, logs) via a unique short URL. Despite its simple appearance, designing Pastebin at scale surfaces important decisions around ID generation, storage tiering, caching, and cleanup.
Key Concepts
Paste ID — A short, unique, human-readable identifier (≤10 characters) generated for each paste. Functions similarly to a URL shortener.
Object Storage — Raw paste content is stored in object storage (like S3), while metadata (ID, user, expiry, timestamps) lives in a relational DB.
Bloom Filter — Used to check whether a generated paste ID already exists, avoiding a DB lookup in most cases.
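The Bloom filter check can be illustrated with a minimal sketch. The class name `BloomFilter`, the method `might_contain`, and the default sizes are assumptions made for illustration, not details from the source. A negative answer is definitive (the ID was never added), while a positive answer may be a false positive and still needs a DB lookup to confirm.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        # Sizes are illustrative; real values depend on the expected ID count.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

In this sketch, a candidate paste ID for which `might_contain` returns `False` can be used immediately; only a `True` result requires the slower DB check.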
Functional Requirements
- Create a paste → receive a unique URL
- View a paste via its URL
- Delete a paste
- Pastes expire after 5 years by default
- Support 1 million daily active users (DAU), each creating roughly one paste per day, at a 10:1 read:write ratio
Core Components
- API Gateway — Exposes REST endpoints for create, read, delete
- Paste Service — Orchestrates ID generation, storage, and retrieval
- ID Generator — Produces unique paste IDs (see strategies below)
- Metadata Store (SQL) — Stores paste ID, user ID, expiry, creation timestamp
- Object Storage — Stores the raw paste content
- Cache (Redis) — Caches hot pastes (most pastes accessed only ~2x after creation)
- Cleanup Service — Purges expired pastes from storage
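How the Paste Service splits a paste between the metadata store and object storage can be sketched as follows. This is a toy model under stated assumptions: two in-memory dicts stand in for the SQL table and the object store, and the names `PasteService`, `PasteMetadata`, and `object_key` are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PasteMetadata:
    paste_id: str
    user_id: str
    created_at: float
    expires_at: float
    object_key: str  # hypothetical pointer to the content in object storage


class PasteService:
    def __init__(self) -> None:
        self.metadata: Dict[str, PasteMetadata] = {}  # stand-in for the SQL table
        self.objects: Dict[str, bytes] = {}           # stand-in for object storage

    def create(self, paste_id: str, user_id: str, content: str,
               ttl_seconds: float, now: Optional[float] = None) -> PasteMetadata:
        now = time.time() if now is None else now
        object_key = f"pastes/{paste_id}"
        # Write the content first so metadata never points at a missing object.
        self.objects[object_key] = content.encode("utf-8")
        meta = PasteMetadata(paste_id, user_id, now, now + ttl_seconds, object_key)
        self.metadata[paste_id] = meta
        return meta

    def read(self, paste_id: str) -> Optional[str]:
        meta = self.metadata.get(paste_id)
        if meta is None:
            return None
        return self.objects[meta.object_key].decode("utf-8")

    def delete(self, paste_id: str) -> bool:
        meta = self.metadata.pop(paste_id, None)
        if meta is None:
            return False
        self.objects.pop(meta.object_key, None)
        return True
```

Keeping only a small pointer row in the relational store is what lets the large, immutable content live in cheaper object storage.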
ID Generation Strategies
| Strategy | Description | Trade-off |
| --- | --- | --- |
| Random UUID | Generate random string | Collision possible; needs Bloom filter |
| Hashing | Hash paste content | Identical content = same ID |
| Token Range | Pre-allocate ID ranges to servers | Coordination overhead |
| Custom ID | User-provided vanity slug | Conflict checking required |
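The random strategy from the table can be sketched as below. This is one possible shape, not the source's implementation: the 62-character alphabet, the default length of 8, the retry limit, and the function names are assumptions, and the plain `existing` set stands in for the Bloom filter plus DB existence check.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 symbols (base62)


def random_paste_id(length: int = 8) -> str:
    # secrets draws from the OS CSPRNG, so IDs are hard to guess.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def allocate_paste_id(existing: set, length: int = 8, max_attempts: int = 5) -> str:
    # Retry on collision; `existing` is a stand-in for the real existence check.
    for _ in range(max_attempts):
        candidate = random_paste_id(length)
        if candidate not in existing:
            existing.add(candidate)
            return candidate
    raise RuntimeError("could not allocate a unique paste ID")
```

With 62 symbols and 8 characters there are 62^8 (about 2.2 × 10^14) possible IDs, so retries should be rare at the scale described here.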
Capacity Estimates
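- Writes: 1M DAU × ~1 paste per user per day ÷ 86,400 s/day ≈ 12 writes/sec
- Reads: 10:1 read:write ratio → ~120 reads/sec
- Storage: avg 10 KB/paste × 1M pastes/day × 365 days × 5 years ≈ 18 TB
- Bandwidth: ~120 KB/s ingress (12 writes/sec × 10 KB), ~1.2 MB/s egress (120 reads/sec × 10 KB)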
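The request-rate and storage figures can be reproduced with a few lines of arithmetic. The sketch assumes one paste per daily active user and decimal units (1 KB = 1,000 bytes); the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

pastes_per_day = 1_000_000     # assumes one paste per daily active user
read_write_ratio = 10          # 10 reads for every write
avg_paste_bytes = 10 * 1000    # 10 KB, decimal units
retention_days = 365 * 5       # default 5-year expiry

writes_per_sec = pastes_per_day / SECONDS_PER_DAY
reads_per_sec = writes_per_sec * read_write_ratio
storage_tb = pastes_per_day * avg_paste_bytes * retention_days / 1e12

print(f"writes/sec ~ {writes_per_sec:.1f}")   # 11.6
print(f"reads/sec  ~ {reads_per_sec:.1f}")    # 115.7
print(f"storage    ~ {storage_tb:.2f} TB")    # 18.25 TB
```

The estimates in the list round these values up to 12 writes/sec and 120 reads/sec.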
Design Deep Dive
Caching: Because most pastes are read only about twice after creation, caching every paste wastes memory. Cache only the most frequently accessed ~20% of pastes (the 80/20 rule applies loosely).
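One way to keep only hot pastes in memory is a cache-aside read path in front of a bounded cache. The sketch below uses LRU eviction, which is an assumption (the source does not name an eviction policy); `LRUCache`, `read_paste`, and `load_from_storage` are hypothetical names, and a real deployment would use Redis rather than an in-process dict.

```python
from collections import OrderedDict
from typing import Callable, Optional


class LRUCache:
    """Bounded cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used


def read_paste(paste_id: str, cache: LRUCache,
               load_from_storage: Callable[[str], str]) -> str:
    # Cache-aside: try the cache, fall back to storage, then populate the cache.
    cached = cache.get(paste_id)
    if cached is not None:
        return cached
    content = load_from_storage(paste_id)
    cache.put(paste_id, content)
    return content
```

Sizing `capacity` to roughly 20% of the daily read set would approximate the rule of thumb above.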
Cleanup: Two strategies:
- Lazy removal — check expiry on read, delete if expired
- Dedicated cleanup job — periodic batch deletion of expired records
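The two cleanup strategies can be sketched side by side. This is a simplified model: a dict mapping paste ID to `(content, expires_at)` stands in for the metadata store and object storage, and the function names and `batch_size` parameter are assumptions.

```python
import time
from typing import Optional


def read_with_lazy_removal(store: dict, paste_id: str,
                           now: Optional[float] = None) -> Optional[str]:
    # Lazy removal: expiry is checked only when someone reads the paste.
    now = time.time() if now is None else now
    record = store.get(paste_id)
    if record is None:
        return None
    content, expires_at = record
    if expires_at <= now:
        del store[paste_id]
        return None
    return content


def purge_expired(store: dict, now: Optional[float] = None,
                  batch_size: int = 1000) -> int:
    # Dedicated job: delete up to batch_size expired records per run.
    now = time.time() if now is None else now
    expired = [pid for pid, (_, exp) in store.items() if exp <= now][:batch_size]
    for pid in expired:
        del store[pid]
    return len(expired)
```

The two approaches complement each other: lazy removal guarantees that an expired paste is never served, while the batch job reclaims storage for pastes that are never read again.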
Rate Limiting: Cap paste creation per IP/user to prevent abuse.
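A per-client cap can be implemented in several ways; the sketch below uses a token bucket, which is an assumption since the source does not name an algorithm. The class name `RateLimiter` and its parameters are hypothetical, and the caller passes the current time explicitly to keep the example deterministic.

```python
class RateLimiter:
    """Token bucket per client key (an IP address or a user ID)."""

    def __init__(self, capacity: int, refill_per_sec: float) -> None:
        self.capacity = capacity              # maximum burst size
        self.refill_per_sec = refill_per_sec  # sustained request rate
        self._buckets = {}                    # key -> (tokens, last_seen)

    def allow(self, key: str, now: float) -> bool:
        # New clients start with a full bucket.
        tokens, last = self._buckets.get(key, (float(self.capacity), now))
        # Refill based on elapsed time, capped at the bucket capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens >= 1:
            self._buckets[key] = (tokens - 1, now)
            return True
        self._buckets[key] = (tokens, now)
        return False
```

For example, `RateLimiter(capacity=2, refill_per_sec=1.0)` lets a client create two pastes in a burst and then one per second afterwards.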
Partitioning: Shard metadata DB by paste ID hash for horizontal scalability.
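Routing a paste to a shard by hashing its ID can be as small as the sketch below. The function name `shard_for` and the choice of SHA-256 are assumptions; any stable hash works, but Python's built-in `hash()` is unsuitable because it is randomized per process for strings.

```python
import hashlib


def shard_for(paste_id: str, num_shards: int) -> int:
    # A stable hash gives every server the same answer for the same ID.
    digest = hashlib.sha256(paste_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo sharding remaps most keys when `num_shards` changes; consistent hashing is a common alternative when shards are added or removed often.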
Summary
Pastebin's architecture mirrors a URL shortener with a content storage layer. The core challenges are unique ID generation at scale, efficient storage tiering (metadata vs. content), and cleaning up billions of expired records over time.