
Overview

Source: The System Design Newsletter — Neo Kim
Amazon S3 stores trillions of objects and serves hundreds of billions of requests per day. It provides eleven 9s of durability (99.999999999%) and has been running since 2006. Understanding S3's architecture reveals deep lessons about distributed storage, consistency, and fault tolerance.

Key Concepts

Object Storage — Files stored as immutable objects with a flat namespace (bucket/key). Unlike block storage, objects can't be partially updated — writes create new versions.
Erasure Coding — Data is split into K data chunks + M parity chunks. Any K chunks can reconstruct the original data. More storage-efficient than full replication for cold data.
Strong Consistency — After a successful PUT or DELETE, subsequent GET requests will always return the updated value. S3 became strongly consistent in December 2020.
Multipart Upload — Large objects are split and uploaded in parallel parts, then reassembled by S3. AWS recommends multipart for objects over 100 MB and requires it above the 5 GB single-PUT limit. Enables resumable uploads and higher throughput.
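To make erasure coding concrete, here is a minimal sketch in Python using a single XOR parity chunk (the M = 1 case). Real systems, S3 included, use Reed-Solomon-style codes so that any K of the K + M chunks can reconstruct the data; with one XOR parity, any single lost chunk can be rebuilt from the rest. The function names and the K = 4 split are illustrative, not S3's actual parameters.

```python
import functools

def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> list:
    """Split data into k equal chunks (zero-padded) plus one XOR parity chunk."""
    size = -(-len(data) // k)  # ceiling division
    chunks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    chunks.append(functools.reduce(xor, chunks))  # parity = XOR of all data chunks
    return chunks

def reconstruct(chunks: list) -> list:
    """Rebuild at most one missing chunk (marked None) from the survivors."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    assert len(missing) <= 1, "single XOR parity tolerates only one loss"
    if missing:
        survivors = [c for c in chunks if c is not None]
        chunks[missing[0]] = functools.reduce(xor, survivors)
    return chunks
```

Because parity = c0 ⊕ c1 ⊕ … ⊕ cK-1, XOR-ing the survivors cancels every present chunk and yields exactly the missing one.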

Core Components

  • S3 API Layer — Handles HTTP requests (PUT, GET, DELETE, LIST). Authenticates via IAM, routes to appropriate backend.
  • Metadata Service — Maps object keys to physical storage locations. Critical hot path — must be fast and highly available.
  • Storage Nodes — Physical servers storing object data. Data distributed across multiple Availability Zones.
  • Replication Manager — Ensures data is replicated across AZs (and optionally regions for Cross-Region Replication).
  • Garbage Collector — Cleans up orphaned data chunks from failed writes or deleted objects.
  • Index / Catalog — Stores bucket metadata, object metadata (size, ETag, permissions, user metadata).
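The Metadata Service and Index/Catalog can be pictured as a key-value map from (bucket, key) to an object record. The sketch below is a toy in-memory stand-in (the class and field names are my own, not S3 internals); it also shows why a flat namespace makes LIST-by-prefix cheap: "folders" are just prefix filters.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectMeta:
    size: int
    etag: str                 # content checksum reported to clients
    chunk_locations: list     # e.g. ["node-3:/vol1/ab12", ...] (hypothetical format)
    user_metadata: dict = field(default_factory=dict)

class Catalog:
    """Toy stand-in for the Metadata Service / Index: key -> object record."""

    def __init__(self):
        self._objects = {}

    def put(self, bucket: str, key: str, meta: ObjectMeta) -> None:
        self._objects[(bucket, key)] = meta

    def get(self, bucket: str, key: str) -> ObjectMeta:
        return self._objects[(bucket, key)]

    def list(self, bucket: str, prefix: str = "") -> list:
        # "Folders" are just key prefixes over the flat namespace.
        return sorted(k for b, k in self._objects
                      if b == bucket and k.startswith(prefix))
```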

Object Write Flow

  1. Client sends PUT request to S3 API layer
  2. API layer authenticates + authorizes (IAM)
  3. Data chunked and written to K storage nodes in parallel
  4. Erasure coding computes parity chunks; stored on M additional nodes
  5. Metadata (key → chunk locations) written to Metadata Service
  6. 200 OK returned to client once all writes confirmed
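The write flow above can be sketched end to end. This is a single-process toy: writes are sequential rather than fanned out, auth and the parity step are omitted, and the node/catalog structures are invented for illustration. The key ordering point survives, though: metadata is recorded and the 200 OK is returned only after every chunk write has landed.

```python
import hashlib

NODES = {f"node-{i}": {} for i in range(6)}  # toy storage nodes
CATALOG = {}                                 # key -> (etag, chunk locations)

def put_object(key: str, body: bytes, k: int = 4) -> int:
    """Toy PUT: chunk the body, write chunks to k nodes, record metadata, ack."""
    size = -(-len(body) // k)  # ceiling division
    locations = []
    for i in range(k):                         # parallel fan-out in real S3
        chunk = body[i * size:(i + 1) * size]
        node = f"node-{i % len(NODES)}"
        NODES[node][(key, i)] = chunk
        locations.append((node, i))
    # (erasure-coded parity chunks to M more nodes omitted for brevity)
    etag = hashlib.md5(body).hexdigest()
    CATALOG[key] = (etag, locations)           # metadata written last
    return 200                                 # only after all writes confirmed
```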

Object Read Flow

  1. Client sends GET request with bucket + key
  2. API layer queries Metadata Service for chunk locations
  3. Chunks fetched from storage nodes in parallel
  4. Chunks reassembled and streamed to client
  5. ETag checksum verified for data integrity
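The tail end of the read flow (steps 4–5) can be sketched as follows. For single-part uploads the ETag is the MD5 of the object body, which is what this toy check assumes; multipart ETags are computed differently, and the function name here is my own.

```python
import hashlib

def reassemble_and_verify(chunks: list, expected_etag: str) -> bytes:
    """Join chunks fetched from storage nodes, then verify the ETag
    (MD5 of the body for single-part objects) before returning."""
    body = b"".join(chunks)
    if hashlib.md5(body).hexdigest() != expected_etag:
        raise IOError("integrity check failed: corrupt or stale chunks")
    return body
```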

Durability: Eleven 9s

How does S3 achieve 99.999999999% durability?
  • 3+ copies across multiple AZs (or erasure coding for Standard-IA and Glacier tiers)
  • Continuous background integrity checking via checksums
  • Automatic repair — if a chunk fails integrity check, it's reconstructed from remaining chunks
  • Hardware failure, bit rot, and silent data corruption are all detected and corrected automatically
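The scrub-and-repair loop behind those bullets can be sketched like this: per-chunk checksums catch bit rot, and a corrupted chunk is rebuilt from the survivors. This toy uses a single XOR parity chunk over the data chunks (so it tolerates one bad chunk); real erasure codes tolerate more, and the function names are illustrative.

```python
import functools
import hashlib

def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

def checksum(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

def scrub(chunks: list, checksums: list) -> list:
    """Toy background scrub: detect chunks whose stored checksum no longer
    matches (bit rot / silent corruption) and rebuild them from the rest.
    Assumes the last chunk is an XOR parity over the data chunks."""
    bad = [i for i, c in enumerate(chunks) if checksum(c) != checksums[i]]
    assert len(bad) <= 1, "single XOR parity repairs only one bad chunk"
    if bad:
        survivors = [c for i, c in enumerate(chunks) if i != bad[0]]
        chunks[bad[0]] = functools.reduce(xor, survivors)
    return chunks
```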

Storage Tiers

Tier                   | Use Case                | Durability | Retrieval
-----------------------|-------------------------|------------|------------------
S3 Standard            | Frequently accessed     | 11 9s      | Milliseconds
S3 Standard-IA         | Infrequent access       | 11 9s      | Milliseconds
S3 Glacier             | Archives                | 11 9s      | Minutes to hours
S3 Intelligent-Tiering | Unknown access patterns | 11 9s      | Automatic

Key Trade-offs

Decision                             | Reasoning
-------------------------------------|----------------------------------------------------------------
Erasure coding over full replication | 50% storage overhead vs. 200% for 3x replication
Flat namespace                       | Simplifies distribution; "folders" are just key prefixes
Strong consistency (since 2020)      | Simplifies client code; no more eventual consistency bugs
Immutable objects                    | Enables deduplication, caching, and versioning without conflicts
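The overhead numbers in the first row fall out of simple arithmetic: extra storage is M parity chunks per K data chunks. A 6+3 code (an illustrative scheme; S3's actual parameters aren't public) gives the stated 50%, while keeping 3 full copies means two extra copies, i.e. 200%.

```python
def overhead(k_data: int, m_parity: int) -> float:
    """Extra storage as a fraction of the original data size: M / K."""
    return m_parity / k_data
```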