Overview

Source: Neo Kim — systemdesign.one
Pastebin is an online text storage service where clients store and share text snippets (source code, configs, logs) via a unique short URL. Despite its simple appearance, designing Pastebin at scale surfaces important decisions around ID generation, storage tiering, caching, and cleanup.

Key Concepts

Paste ID — A short, unique, human-readable identifier (≤10 characters) generated for each paste. Functions similarly to a URL shortener.
Object Storage — Raw paste content is stored in object storage (like S3), while metadata (ID, user, expiry, timestamps) lives in a relational DB.
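Bloom Filter — Used to check whether a generated paste ID already exists. A negative answer is definitive, so the DB lookup is skipped; a positive answer may be a false positive and still has to be confirmed against the DB.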
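
The Bloom filter check can be sketched as follows. This is a minimal illustration rather than a tuned implementation: the class name `BloomFilter`, the bit-array size, the number of hash functions, and the use of salted SHA-256 are all assumptions, not details from the source.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over string keys (illustrative sizes)."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )
```

A caller would generate a candidate ID, call `might_contain`, and query the metadata DB only when the result is True.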

Functional Requirements

  • Create a paste → receive a unique URL
  • View a paste via its URL
  • Delete a paste
  • Pastes expire after 5 years by default
  • Support 1 million daily active users (DAU) creating pastes, with a 10:1 read:write ratio

Core Components

  • API Gateway — Exposes REST endpoints for create, read, delete
  • Paste Service — Orchestrates ID generation, storage, and retrieval
  • ID Generator — Produces unique paste IDs (see strategies below)
  • Metadata Store (SQL) — Stores paste ID, user ID, expiry, creation timestamp
  • Object Storage — Stores the raw paste content
  • Cache (Redis) — Caches hot pastes; most pastes are read only about twice after creation, so only a small share is worth caching
  • Cleanup Service — Purges expired pastes from storage
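
How the Paste Service ties the metadata store and object storage together can be sketched with in-memory stand-ins. Everything named here (`PasteService`, `PasteMetadata`, the two dictionaries, the 8-character ID from `secrets.token_urlsafe`) is a hypothetical simplification; a real deployment would use a SQL database and an object store such as S3.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional

# 5-year default expiry, taken from the functional requirements.
DEFAULT_TTL_SECONDS = 5 * 365 * 24 * 60 * 60


@dataclass
class PasteMetadata:
    paste_id: str
    user_id: str
    created_at: float
    expires_at: float


class PasteService:
    def __init__(self) -> None:
        # Stand-ins (assumptions): dicts instead of SQL and object storage.
        self.metadata: dict[str, PasteMetadata] = {}
        self.objects: dict[str, bytes] = {}

    def create(self, user_id: str, content: str, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        paste_id = secrets.token_urlsafe(6)  # 6 random bytes -> 8 URL-safe chars
        while paste_id in self.metadata:     # regenerate on the rare collision
            paste_id = secrets.token_urlsafe(6)
        # Content goes to "object storage", everything else to "metadata".
        self.objects[paste_id] = content.encode("utf-8")
        self.metadata[paste_id] = PasteMetadata(
            paste_id, user_id, now, now + DEFAULT_TTL_SECONDS
        )
        return paste_id

    def read(self, paste_id: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        meta = self.metadata.get(paste_id)
        if meta is None or meta.expires_at <= now:
            return None  # unknown or expired paste
        return self.objects[paste_id].decode("utf-8")

    def delete(self, paste_id: str) -> bool:
        self.objects.pop(paste_id, None)
        return self.metadata.pop(paste_id, None) is not None
```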

ID Generation Strategies

Strategy    | Description                       | Trade-off
------------|-----------------------------------|---------------------------------------
Random UUID | Generate random string            | Collision possible; needs Bloom filter
Hashing     | Hash paste content                | Identical content = same ID
Token Range | Pre-allocate ID ranges to servers | Coordination overhead
Custom ID   | User-provided vanity slug         | Conflict checking required
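
Three of these strategies can be sketched in a few lines. This is an illustration under assumptions: the base62 alphabet, the 8-character length (the source only requires 10 characters or fewer), SHA-256 as the content hash, and the names `random_paste_id`, `content_hash_id`, and `TokenRange` are all hypothetical.

```python
import hashlib
import secrets
import string

ALPHABET = string.digits + string.ascii_letters  # 62 characters
ID_LENGTH = 8  # assumption; the source only requires <= 10 characters


def base62(n: int) -> str:
    """Encode a non-negative integer using ALPHABET."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))


def random_paste_id(length: int = ID_LENGTH) -> str:
    # Random strategy: collisions are possible, so callers must check.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def content_hash_id(content: str, length: int = ID_LENGTH) -> str:
    # Hashing strategy: identical content always yields the same ID.
    digest = hashlib.sha256(content.encode("utf-8")).digest()
    return base62(int.from_bytes(digest, "big"))[:length]


class TokenRange:
    """Token-range strategy: a server owns [start, end) and counts through it."""

    def __init__(self, start: int, end: int) -> None:
        self._next = start
        self._end = end

    def next_id(self) -> str:
        if self._next >= self._end:
            # In a real system the server would request a new range here.
            raise RuntimeError("range exhausted")
        value = self._next
        self._next += 1
        return base62(value)
```

Within a range no collision check is needed, which is what the coordination overhead of handing out ranges buys.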

Capacity Estimates

  • Writes: 1M DAU, assuming one paste per user per day → 1M ÷ 86,400 s ≈ 12 writes/sec
  • Reads: 10:1 ratio → ~120 reads/sec
  • Storage: avg 10KB/paste × 1M/day × 365 × 5 years ≈ 18 TB
  • Bandwidth: ~0.12 MB/s ingress (12 writes/sec × 10KB), ~1.2 MB/s egress (120 reads/sec × 10KB)
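
These figures can be rederived from the stated inputs, with bandwidth computed as request rate times average paste size. The sketch below assumes one paste per daily active user, decimal units (1 KB = 1,000 bytes), and 365-day years; the function name `estimate_capacity` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_capacity(
    writes_per_day: int = 1_000_000,   # 1M DAU, one paste each (assumption)
    read_write_ratio: int = 10,
    avg_paste_bytes: int = 10 * 1000,  # 10 KB, decimal units (assumption)
    retention_years: int = 5,
) -> dict:
    writes_per_sec = writes_per_day / SECONDS_PER_DAY
    reads_per_sec = writes_per_sec * read_write_ratio
    storage_bytes = avg_paste_bytes * writes_per_day * 365 * retention_years
    return {
        "writes_per_sec": writes_per_sec,
        "reads_per_sec": reads_per_sec,
        "storage_tb": storage_bytes / 1e12,
        "ingress_mb_per_sec": writes_per_sec * avg_paste_bytes / 1e6,
        "egress_mb_per_sec": reads_per_sec * avg_paste_bytes / 1e6,
    }


if __name__ == "__main__":
    for name, value in estimate_capacity().items():
        print(f"{name}: {value:.2f}")
```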

Design Deep Dive

Caching: Most pastes are read only about twice, so caching every paste wastes memory. Cache roughly the top 20% of most frequently accessed pastes, following a loose 80/20 rule.
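
A bounded cache in front of storage can be sketched as a cache-aside read path. The eviction policy (least recently used), the `LRUCache` class, and the `load_from_storage` callback are assumptions for illustration; the source names Redis as the cache but does not specify an eviction policy.

```python
from collections import OrderedDict
from typing import Callable, Optional


class LRUCache:
    """In-memory stand-in for a size-bounded cache such as Redis."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


def read_paste(
    paste_id: str,
    cache: LRUCache,
    load_from_storage: Callable[[str], str],
) -> str:
    # Cache-aside: try the cache, fall back to storage, then populate.
    cached = cache.get(paste_id)
    if cached is not None:
        return cached
    content = load_from_storage(paste_id)
    cache.put(paste_id, content)
    return content
```

Sizing `capacity` to about 20% of the working set is one way to apply the rule of thumb above.
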
Cleanup: Two strategies:
  • Lazy removal — check expiry on read, delete if expired
  • Dedicated cleanup job — periodic batch deletion of expired records
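
Both strategies can be sketched against an in-memory SQLite table. This is a simplification under assumptions: content sits in the same table as the metadata (the design above keeps it in object storage), and the table layout, function names, and batch size are hypothetical.

```python
import sqlite3
import time
from typing import Optional


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pastes ("
        "id TEXT PRIMARY KEY, content TEXT NOT NULL, expires_at REAL NOT NULL)"
    )
    return conn


def read_with_lazy_removal(
    conn: sqlite3.Connection, paste_id: str, now: Optional[float] = None
) -> Optional[str]:
    """Lazy removal: check expiry on read and delete if expired."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT content, expires_at FROM pastes WHERE id = ?", (paste_id,)
    ).fetchone()
    if row is None:
        return None
    content, expires_at = row
    if expires_at <= now:
        conn.execute("DELETE FROM pastes WHERE id = ?", (paste_id,))
        conn.commit()
        return None
    return content


def purge_expired(
    conn: sqlite3.Connection, now: Optional[float] = None, batch_size: int = 1000
) -> int:
    """Cleanup job: delete expired rows in bounded batches; returns the count."""
    now = time.time() if now is None else now
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM pastes WHERE id IN ("
            "SELECT id FROM pastes WHERE expires_at <= ? LIMIT ?)",
            (now, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
```

Lazy removal alone never touches pastes that are not read again, which is why the two strategies are typically combined.
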
Rate Limiting: Cap paste creation per IP/user to prevent abuse.
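
One common way to cap creation per IP or user is a token bucket. The source does not name an algorithm, so the choice of token bucket, the class name `TokenBucketLimiter`, and the parameters are assumptions; a production limiter would keep this state in a shared store rather than process memory.

```python
import time
from typing import Optional


class TokenBucketLimiter:
    """Per-key token bucket: `capacity` burst, refilled at `refill_per_sec`."""

    def __init__(self, capacity: int, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        # key -> (tokens remaining, timestamp of last update)
        self._buckets: dict[str, tuple[float, float]] = {}

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self._buckets.get(key, (float(self.capacity), now))
        # Refill for the elapsed time, never exceeding the bucket capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens >= 1:
            self._buckets[key] = (tokens - 1, now)
            return True
        self._buckets[key] = (tokens, now)
        return False
```

The key would be the client IP or user ID; a rejected request would typically map to HTTP 429.
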
Partitioning: Shard metadata DB by paste ID hash for horizontal scalability.
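
Hash-based shard selection can be sketched as below. The function name `shard_for`, the use of SHA-256, and the shard count are assumptions for illustration.

```python
import hashlib


def shard_for(paste_id: str, num_shards: int) -> int:
    """Map a paste ID to a shard index in [0, num_shards)."""
    # A stable hash is used on purpose: Python's built-in hash() for strings
    # is randomized per process, so it cannot be used for routing.
    digest = hashlib.sha256(paste_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

With plain modulo placement, changing `num_shards` remaps most IDs; consistent hashing is a common alternative when shards are added or removed often.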

Summary

Pastebin's architecture mirrors a URL shortener with a content storage layer. The core challenges are unique ID generation at scale, efficient storage tiering (metadata vs. content), and cleaning up billions of expired records over time.