06 · How Pastebin Works

Overview

Source: Neo Kim — systemdesign.one

Pastebin is an online text storage service where clients store and share text snippets (source code, configs, logs) via a unique short URL. Despite its simple appearance, designing Pastebin at scale surfaces important decisions around ID generation, storage tiering, caching, and cleanup.

Key Concepts

Paste ID — A short, unique, human-readable identifier (≤10 characters) generated for each paste. It plays the same role as the short code in a URL shortener.

Object Storage — Raw paste content is stored in object storage (like S3), while metadata (ID, user, expiry, timestamps) lives in a relational DB.

Bloom Filter — A probabilistic set used to check whether a generated paste ID may already exist. A negative answer is definitive, so a DB lookup is needed only when the filter reports a possible match.
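The Bloom filter check can be sketched as follows. This is a minimal illustration, not the article's implementation: the class `BloomFilter`, the helper `id_is_taken`, the bit-array size, the hash count, and the use of a Python set as a stand-in for the metadata DB are all assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        # Odd step: with the default power-of-two size, the probes are distinct.
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def id_is_taken(paste_id: str, bloom: BloomFilter, db_ids: set) -> bool:
    """db_ids is a hypothetical stand-in for a metadata DB lookup."""
    if not bloom.might_contain(paste_id):
        return False  # definite miss, no DB lookup needed
    return paste_id in db_ids  # possible false positive, confirm against the DB
```

Only the "possibly present" branch reaches the database, which is why the filter removes most lookups when generated IDs rarely collide.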

Functional Requirements

Create a paste → receive a unique URL

View a paste via its URL

Delete a paste

Pastes expire after 5 years by default

Support 1 million DAU producing about 1 million writes per day, with a 10:1 read:write ratio

Core Components

API Gateway — Exposes REST endpoints for create, read, delete

Paste Service — Orchestrates ID generation, storage, and retrieval

ID Generator — Produces unique paste IDs (see strategies below)

Metadata Store (SQL) — Stores paste ID, user ID, expiry, creation timestamp

Object Storage — Stores the raw paste content

Cache (Redis) — Caches hot pastes; most pastes are read only about twice after creation, so the hot set is small

Cleanup Service — Purges expired pastes from storage
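A minimal sketch of how the Paste Service might tie these components together, assuming in-memory dictionaries stand in for the SQL metadata store and the object storage. The names `PasteService`, `PasteMeta`, and `FIVE_YEARS_SECONDS` are hypothetical, and the random ID call is only a placeholder for the ID Generator.

```python
import secrets
import time
from dataclasses import dataclass

FIVE_YEARS_SECONDS = 5 * 365 * 24 * 3600  # default expiry from the requirements


@dataclass
class PasteMeta:
    paste_id: str
    user_id: str
    created_at: float
    expires_at: float


class PasteService:
    def __init__(self) -> None:
        self.metadata = {}  # stand-in for the SQL metadata store
        self.objects = {}   # stand-in for object storage (raw content)

    def create(self, user_id: str, content: str, now: float = None) -> str:
        now = time.time() if now is None else now
        paste_id = secrets.token_urlsafe(6)  # 6 random bytes -> 8 URL-safe chars
        while paste_id in self.metadata:     # retry on the rare collision
            paste_id = secrets.token_urlsafe(6)
        # Content and metadata are written to separate stores.
        self.objects[paste_id] = content.encode("utf-8")
        self.metadata[paste_id] = PasteMeta(
            paste_id, user_id, now, now + FIVE_YEARS_SECONDS
        )
        return paste_id

    def read(self, paste_id: str, now: float = None):
        now = time.time() if now is None else now
        meta = self.metadata.get(paste_id)
        if meta is None or meta.expires_at <= now:
            return None  # unknown or expired paste
        return self.objects[paste_id].decode("utf-8")

    def delete(self, paste_id: str) -> bool:
        existed = self.metadata.pop(paste_id, None) is not None
        self.objects.pop(paste_id, None)
        return existed
```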

ID Generation Strategies

Strategy	Description	Trade-off
Random UUID	Generate random string	Collision possible; needs Bloom filter
Hashing	Hash paste content	Identical content = same ID
Token Range	Pre-allocate ID ranges to servers	Coordination overhead
Custom ID	User-provided vanity slug	Conflict checking required
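The first three strategies in the table can be sketched as below. This is an illustration under assumptions: the base62 alphabet, the 8-character length, SHA-256 as the content hash, and the names `random_id`, `content_hash_id`, `base62`, and `TokenRange` are not taken from the source.

```python
import hashlib
import secrets
import string

ALPHABET = string.digits + string.ascii_letters  # 62 characters: 0-9, a-z, A-Z


def base62(n: int) -> str:
    """Encode a non-negative integer using the 62-character alphabet."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n > 0:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))


def random_id(length: int = 8) -> str:
    """Random strategy: collisions are possible, so existence must be checked."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def content_hash_id(content: str, length: int = 8) -> str:
    """Hashing strategy: identical content always maps to the same ID."""
    digest = hashlib.sha256(content.encode("utf-8")).digest()
    return base62(int.from_bytes(digest, "big"))[:length]


class TokenRange:
    """Token range strategy: one server's pre-allocated counters [start, end)."""

    def __init__(self, start: int, end: int) -> None:
        self.next_value = start
        self.end = end

    def next_id(self) -> str:
        if self.next_value >= self.end:
            # In a real system the server would ask the coordinator for a new range.
            raise RuntimeError("token range exhausted")
        value = self.next_value
        self.next_value += 1
        return base62(value)
```

IDs from a token range are unique without any collision check, at the cost of a coordinator that hands out non-overlapping ranges.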

Capacity Estimates

Writes: 1M writes/day ÷ 86,400 s/day ≈ 12 writes/sec

Reads: 10:1 ratio → ~120 reads/sec

Storage: avg 10KB/paste × 1M/day × 365 × 5 years ≈ 18 TB

Bandwidth: ~120 KB/s ingress (12 writes/sec × 10 KB), ~1.2 MB/s egress (120 reads/sec × 10 KB)
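The estimates can be reproduced with a few lines of arithmetic. This sketch assumes decimal units (1 KB = 1,000 bytes) and that bandwidth is simply request rate times average paste size; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

writes_per_day = 1_000_000
read_write_ratio = 10
avg_paste_bytes = 10_000  # 10 KB, decimal units
retention_years = 5

writes_per_sec = writes_per_day / SECONDS_PER_DAY   # about 11.6
reads_per_sec = writes_per_sec * read_write_ratio   # about 115.7
storage_bytes = avg_paste_bytes * writes_per_day * 365 * retention_years
ingress_bytes_per_sec = writes_per_sec * avg_paste_bytes
egress_bytes_per_sec = reads_per_sec * avg_paste_bytes

print(f"writes/sec:  {writes_per_sec:.1f}")
print(f"reads/sec:   {reads_per_sec:.1f}")
print(f"storage:     {storage_bytes / 1e12:.2f} TB")
print(f"ingress:     {ingress_bytes_per_sec / 1e3:.0f} KB/s")
print(f"egress:      {egress_bytes_per_sec / 1e6:.2f} MB/s")
```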

Design Deep Dive

Caching: Most pastes are read only about twice, so caching every paste wastes memory. Cache only the most frequently accessed ~20% of pastes, on the loose assumption (80/20 rule) that they serve most of the reads.
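One way to keep only a bounded hot set is a cache-aside read path with LRU eviction. The eviction policy is an assumption here, since the source does not name one, and `LRUCache`, `read_paste`, and `load_from_storage` are hypothetical names. In production the cache would be Redis rather than an in-process dictionary.

```python
from collections import OrderedDict


class LRUCache:
    """Bounded cache that evicts the least recently used entry when full."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry


def read_paste(paste_id, cache: LRUCache, load_from_storage):
    """Cache-aside read: try the cache, fall back to storage, then populate."""
    cached = cache.get(paste_id)
    if cached is not None:
        return cached
    content = load_from_storage(paste_id)
    if content is not None:
        cache.put(paste_id, content)
    return content
```

Sizing `capacity` to roughly 20% of the active pastes would match the rule of thumb above; the right number depends on the observed access pattern.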

Cleanup: Two strategies:

Lazy removal — check expiry on read, delete if expired

Dedicated cleanup job — periodic batch deletion of expired records
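Both cleanup strategies can be sketched against the same data. This is a simplified illustration: `metadata` maps paste IDs to expiry timestamps and `objects` maps paste IDs to content, both plain dictionaries standing in for the real stores. The function names and the `batch_size` parameter are assumptions.

```python
import time


def read_with_lazy_expiry(paste_id, metadata: dict, objects: dict, now: float = None):
    """Lazy removal: check expiry on read and delete the paste if it has expired."""
    now = time.time() if now is None else now
    expires_at = metadata.get(paste_id)
    if expires_at is None:
        return None
    if expires_at <= now:
        metadata.pop(paste_id, None)
        objects.pop(paste_id, None)
        return None
    return objects.get(paste_id)


def purge_expired(metadata: dict, objects: dict, now: float = None,
                  batch_size: int = 1000) -> int:
    """Dedicated cleanup job: delete up to batch_size expired pastes per run."""
    now = time.time() if now is None else now
    expired = [pid for pid, exp in metadata.items() if exp <= now][:batch_size]
    for pid in expired:
        objects.pop(pid, None)
        del metadata[pid]
    return len(expired)
```

Lazy removal never reclaims pastes that are not read again, so the two strategies are usually combined: reads hide expired pastes immediately and the batch job reclaims the storage later.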

Rate Limiting: Cap paste creation per IP/user to prevent abuse.
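A per-client cap could be implemented with a token bucket, sketched below. The algorithm choice is an assumption, since the source does not specify one, and `TokenBucket`, `PasteRateLimiter`, and the default limits are hypothetical.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_sec` tokens/second."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float = None) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated_at = time.monotonic() if now is None else now

    def allow(self, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PasteRateLimiter:
    """Keeps one bucket per client key (an IP address or a user ID)."""

    def __init__(self, capacity: float = 5, refill_per_sec: float = 0.1) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._buckets = {}

    def allow(self, client_key: str, now: float = None) -> bool:
        bucket = self._buckets.get(client_key)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_sec, now)
            self._buckets[client_key] = bucket
        return bucket.allow(now)
```

With several API servers, the bucket state would need to live in a shared store such as the Redis cache rather than in process memory.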

Partitioning: Shard the metadata DB by a hash of the paste ID for horizontal scalability.
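Shard selection by paste ID hash can be sketched as a stable hash taken modulo the shard count. The function name `shard_for` and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def shard_for(paste_id: str, num_shards: int) -> int:
    """Map a paste ID to a shard index in [0, num_shards)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # A stable hash is required: Python's built-in hash() for strings is
    # randomized per process, so it would route the same ID inconsistently.
    digest = hashlib.sha256(paste_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo hashing remaps most keys whenever the shard count changes; consistent hashing is a common alternative when shards are added or removed often.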

Summary

Pastebin's architecture mirrors a URL shortener with a content storage layer. The core challenges are unique ID generation at scale, efficient storage tiering (metadata vs. content), and cleaning up billions of expired records over time.