Overview
Source: The System Design Newsletter — Neo Kim
Amazon S3 stores trillions of objects and serves hundreds of billions of requests per day. It provides eleven 9s of durability (99.999999999%) and has been running since 2006. Understanding S3's architecture reveals deep lessons about distributed storage, consistency, and fault tolerance.
Key Concepts
Object Storage — Files stored as immutable objects with a flat namespace (bucket/key). Unlike block storage, objects can't be partially updated — writes create new versions.
Erasure Coding — Data is split into K data chunks + M parity chunks. Any K chunks can reconstruct the original data. More storage-efficient than full replication for cold data.
Strong Consistency — After a successful PUT or DELETE, subsequent GET requests will always return the updated value. S3 became strongly consistent in December 2020.
Multipart Upload — Large objects are split and uploaded in parallel parts (recommended above 100 MB, and required above the 5 GB single-PUT limit), then reassembled by S3. Enables resumable uploads and higher throughput.
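The multipart bookkeeping can be sketched in a few lines: split an object into numbered parts, let them complete in any order, then reassemble by part number. Names and the demo part size are illustrative, not the S3 API.

```python
# Sketch of multipart upload bookkeeping (not the real S3 API).
# Parts are numbered from 1 and may finish uploading out of order;
# completion reassembles them in part-number order.

PART_SIZE = 5 * 1024 * 1024  # S3's minimum part size is 5 MiB (except the last part)

def split_into_parts(data: bytes, part_size: int = PART_SIZE) -> dict[int, bytes]:
    return {
        part_number: data[offset:offset + part_size]
        for part_number, offset in enumerate(range(0, len(data), part_size), start=1)
    }

def complete_upload(parts: dict[int, bytes]) -> bytes:
    # Sorting by part number makes reassembly order-independent,
    # which is what allows parallel and resumable uploads.
    return b"".join(parts[n] for n in sorted(parts))

obj = bytes(range(256)) * 1000
parts = split_into_parts(obj, part_size=4096)
assert complete_upload(parts) == obj
```

Because each part is independent, a failed part can be retried alone — the basis for resumability.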
Core Components
- S3 API Layer — Handles HTTP requests (PUT, GET, DELETE, LIST). Authenticates via IAM, routes to appropriate backend.
- Metadata Service — Maps object keys to physical storage locations. Critical hot path — must be fast and highly available.
- Storage Nodes — Physical servers storing object data. Data distributed across multiple Availability Zones.
- Replication Manager — Ensures data is replicated across AZs (and optionally regions for Cross-Region Replication).
- Garbage Collector — Cleans up orphaned data chunks from failed writes or deleted objects.
- Index / Catalog — Stores bucket metadata, object metadata (size, ETag, permissions, user metadata).
Object Write Flow
- Client sends PUT request to S3 API layer
- API layer authenticates + authorizes (IAM)
- Data chunked and written to K storage nodes in parallel
- Erasure coding computes parity chunks; stored on M additional nodes
- Metadata (key → chunk locations) written to Metadata Service
- 200 OK returned to client once all writes confirmed
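The write path above can be sketched as an in-memory toy: chunk the object, place chunks on storage nodes, then commit locations to a metadata map. The placement policy and all names here are invented for illustration; auth and parity are omitted.

```python
import hashlib

# Toy write path: chunk -> place on storage nodes -> commit metadata.
# The metadata write is the commit point: the object becomes visible
# to readers only once its chunk locations are recorded.

NUM_NODES = 6
storage_nodes = [{} for _ in range(NUM_NODES)]  # node index -> {chunk_id: bytes}
metadata = {}  # (bucket, key) -> list of (node_id, chunk_id)

def put_object(bucket: str, key: str, data: bytes, chunk_size: int = 4) -> int:
    locations = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        chunk_id = hashlib.sha256(chunk + offset.to_bytes(8, "big")).hexdigest()
        node_id = offset // chunk_size % NUM_NODES  # toy round-robin placement
        storage_nodes[node_id][chunk_id] = chunk
        locations.append((node_id, chunk_id))
    metadata[(bucket, key)] = locations  # commit: object now visible
    return 200

assert put_object("photos", "cat.jpg", b"not-really-a-jpeg") == 200
```

Writing metadata last means a crash mid-write leaves only orphaned chunks — exactly what the Garbage Collector component cleans up.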
Object Read Flow
- Client sends GET request with bucket + key
- API layer queries Metadata Service for chunk locations
- Chunks fetched from storage nodes in parallel
- Chunks reassembled and streamed to client
- ETag checksum verified for data integrity
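The final verification step can be shown concretely. For single-part, non-encrypted uploads, the ETag S3 returns is the hex MD5 of the object body (multipart objects use a different MD5-of-MD5s scheme), so a client-side check is a one-liner:

```python
import hashlib

# Client-side ETag verification for a single-part object.
# S3 returns ETags wrapped in double quotes, so strip them first.

def verify_etag(body: bytes, etag: str) -> bool:
    return hashlib.md5(body).hexdigest() == etag.strip('"')

body = b"hello world"
etag = '"' + hashlib.md5(body).hexdigest() + '"'  # what a GET response would carry
assert verify_etag(body, etag)
assert not verify_etag(b"corrupted body", etag)
```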
Durability: Eleven 9s
How does S3 achieve 99.999999999% durability?
- 3+ copies across multiple AZs (or erasure coding for Standard-IA and Glacier tiers)
- Continuous background integrity checking via checksums
- Automatic repair — if a chunk fails integrity check, it's reconstructed from remaining chunks
- Hardware failure, bit rot, and silent data corruption are all detected and corrected automatically
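A back-of-envelope model shows how redundancy compounds into eleven 9s. Assume each chunk independently fails with probability p within one repair window; the object is lost only if more than M of its K+M chunks fail before repair. The parameters below are hypothetical — AWS does not publish the real ones.

```python
from math import comb

# Binomial model of object loss: with K data + M parity chunks, the
# object survives unless more than M chunks fail in one repair window.

def loss_probability(k: int, m: int, p: float) -> float:
    n = k + m
    # Sum the probabilities of losing m+1, m+2, ..., n chunks.
    return sum(comb(n, f) * p**f * (1 - p)**(n - f) for f in range(m + 1, n + 1))

# Hypothetical: 10 data + 4 parity chunks, 0.1% chunk-loss chance per window.
p_loss = loss_probability(10, 4, 0.001)
assert p_loss < 1e-11  # below the eleven-9s loss budget
```

The key insight: loss requires five *simultaneous* unrepaired failures, so fast background repair (shrinking the window, hence p) multiplies durability far more cheaply than adding copies.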
Storage Tiers
| Tier | Use Case | Durability | Retrieval |
| --- | --- | --- | --- |
| S3 Standard | Frequently accessed | 11 9s | Milliseconds |
| S3 Standard-IA | Infrequent access | 11 9s | Milliseconds |
| S3 Glacier | Archives | 11 9s | Minutes to hours |
| S3 Intelligent-Tiering | Unknown access patterns | 11 9s | Automatic |
Key Trade-offs
| Decision | Reasoning |
| --- | --- |
| Erasure coding over full replication | 50% storage overhead vs. 200% for 3x replication |
| Flat namespace | Simplifies distribution; "folders" are just key prefixes |
| Strong consistency (since 2020) | Simplifies client code; no more eventual-consistency bugs |
| Immutable objects | Enables deduplication, caching, and versioning without conflicts |
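The storage-overhead numbers in the first trade-off row follow directly from the definitions; a quick check, using K=10, M=5 as an illustrative erasure-coding geometry:

```python
# Overhead = extra bytes stored per byte of user data, as a percentage.
# 3x replication stores 3 bytes per byte; (K+M)-of-K erasure coding
# stores (K+M)/K bytes per byte.

def overhead_percent(stored_bytes_per_byte: float) -> float:
    return (stored_bytes_per_byte - 1) * 100

replication = 3.0          # three full copies
k, m = 10, 5               # illustrative geometry
erasure = (k + m) / k      # 1.5 bytes stored per byte

assert overhead_percent(replication) == 200.0
assert overhead_percent(erasure) == 50.0
```

Both configurations tolerate multiple failures, but erasure coding does so at a quarter of the extra storage — the reason it wins for cold data, where the reconstruction cost on reads matters less.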