Overview
Source: The System Design Newsletter — Neo Kim
Amazon S3 stores trillions of objects and serves hundreds of billions of requests per day. It provides eleven 9s of durability (99.999999999%) and has been running since 2006. Understanding S3's architecture reveals deep lessons about distributed storage, consistency, and fault tolerance.
Key Concepts
Object Storage — Files stored as immutable objects with a flat namespace (bucket/key). Unlike block storage, objects can't be partially updated — writes create new versions.
Erasure Coding — Data is split into K data chunks + M parity chunks. Any K chunks can reconstruct the original data. More storage-efficient than full replication for cold data.
Strong Consistency — After a successful PUT or DELETE, subsequent GET requests will always return the updated value. S3 became strongly consistent in December 2020.
Multipart Upload — Large objects are split and uploaded in parallel parts (recommended above 100 MB, and required above the 5 GB single-PUT limit), then reassembled by S3. Enables resumable uploads and higher throughput.
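The multipart bookkeeping can be sketched in a few lines: split an object into numbered parts, let them complete in any order, then reassemble by part number. Names and the demo part size are illustrative, not the S3 API.

```python
# Sketch of multipart upload bookkeeping (not the real S3 API).
# Parts are numbered from 1 and may finish uploading out of order;
# completion reassembles them in part-number order.

PART_SIZE = 5 * 1024 * 1024  # S3's minimum part size is 5 MiB (except the last part)

def split_into_parts(data: bytes, part_size: int = PART_SIZE) -> dict[int, bytes]:
    return {
        part_number: data[offset:offset + part_size]
        for part_number, offset in enumerate(range(0, len(data), part_size), start=1)
    }

def complete_upload(parts: dict[int, bytes]) -> bytes:
    # Sorting by part number makes reassembly order-independent,
    # which is what allows parallel and resumable uploads.
    return b"".join(parts[n] for n in sorted(parts))

obj = bytes(range(256)) * 1000
parts = split_into_parts(obj, part_size=4096)
assert complete_upload(parts) == obj
```

Because each part is independent, a failed part can be retried alone — the basis for resumability.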
Core Components
- S3 API Layer — Handles HTTP requests (PUT, GET, DELETE, LIST). Authenticates via IAM, routes to appropriate backend.
- Metadata Service — Maps object keys to physical storage locations. Critical hot path — must be fast and highly available.
- Storage Nodes — Physical servers storing object data. Data distributed across multiple Availability Zones.
- Replication Manager — Ensures data is replicated across AZs (and optionally regions for Cross-Region Replication).
- Garbage Collector — Cleans up orphaned data chunks from failed writes or deleted objects.
- Index / Catalog — Stores bucket metadata, object metadata (size, ETag, permissions, user metadata).
Object Write Flow
- Client sends PUT request to S3 API layer
- API layer authenticates + authorizes (IAM)
- Data chunked and written to K storage nodes in parallel
- Erasure coding computes parity chunks; stored on M additional nodes
- Metadata (key → chunk locations) written to Metadata Service
- 200 OK returned to client once all writes confirmed
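The write path above can be sketched as an in-memory toy: chunk the object, place chunks on storage nodes, then commit locations to a metadata map. The placement policy and all names here are invented for illustration; auth and parity are omitted.

```python
import hashlib

# Toy write path: chunk -> place on storage nodes -> commit metadata.
# The metadata write is the commit point: the object becomes visible
# to readers only once its chunk locations are recorded.

NUM_NODES = 6
storage_nodes = [{} for _ in range(NUM_NODES)]  # node index -> {chunk_id: bytes}
metadata = {}  # (bucket, key) -> list of (node_id, chunk_id)

def put_object(bucket: str, key: str, data: bytes, chunk_size: int = 4) -> int:
    locations = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        chunk_id = hashlib.sha256(chunk + offset.to_bytes(8, "big")).hexdigest()
        node_id = offset // chunk_size % NUM_NODES  # toy round-robin placement
        storage_nodes[node_id][chunk_id] = chunk
        locations.append((node_id, chunk_id))
    metadata[(bucket, key)] = locations  # commit: object now visible
    return 200

assert put_object("photos", "cat.jpg", b"not-really-a-jpeg") == 200
```

Writing metadata last means a crash mid-write leaves only orphaned chunks — exactly what the Garbage Collector component cleans up.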
Object Read Flow
- Client sends GET request with bucket + key
- API layer queries Metadata Service for chunk locations
- Chunks fetched from storage nodes in parallel
- Chunks reassembled and streamed to client
- ETag checksum verified for data integrity
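The final verification step can be shown concretely. For single-part, non-encrypted uploads, the ETag S3 returns is the hex MD5 of the object body (multipart objects use a different MD5-of-MD5s scheme), so a client-side check is a one-liner:

```python
import hashlib

# Client-side ETag verification for a single-part object.
# S3 returns ETags wrapped in double quotes, so strip them first.

def verify_etag(body: bytes, etag: str) -> bool:
    return hashlib.md5(body).hexdigest() == etag.strip('"')

body = b"hello world"
etag = '"' + hashlib.md5(body).hexdigest() + '"'  # what a GET response would carry
assert verify_etag(body, etag)
assert not verify_etag(b"corrupted body", etag)
```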
Durability: Eleven 9s
How does S3 achieve 99.999999999% durability?
- 3+ copies across multiple AZs (or erasure coding for Standard-IA and Glacier tiers)
- Continuous background integrity checking via checksums
- Automatic repair — if a chunk fails integrity check, it's reconstructed from remaining chunks
- Hardware failure, bit rot, and silent data corruption are all detected and corrected automatically
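A back-of-envelope model shows how redundancy compounds into eleven 9s. Assume each chunk independently fails with probability p within one repair window; the object is lost only if more than M of its K+M chunks fail before repair. The parameters below are hypothetical — AWS does not publish the real ones.

```python
from math import comb

# Binomial model of object loss: with K data + M parity chunks, the
# object survives unless more than M chunks fail in one repair window.

def loss_probability(k: int, m: int, p: float) -> float:
    n = k + m
    # Sum the probabilities of losing m+1, m+2, ..., n chunks.
    return sum(comb(n, f) * p**f * (1 - p)**(n - f) for f in range(m + 1, n + 1))

# Hypothetical: 10 data + 4 parity chunks, 0.1% chunk-loss chance per window.
p_loss = loss_probability(10, 4, 0.001)
assert p_loss < 1e-11  # below the eleven-9s loss budget
```

The key insight: loss requires five *simultaneous* unrepaired failures, so fast background repair (shrinking the window, hence p) multiplies durability far more cheaply than adding copies.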
Storage Tiers
| Tier | Use Case | Durability | Retrieval |
| --- | --- | --- | --- |
| S3 Standard | Frequently accessed | 11 9s | Milliseconds |
| S3 Standard-IA | Infrequent access | 11 9s | Milliseconds |
| S3 Glacier | Archives | 11 9s | Minutes to hours |
| S3 Intelligent-Tiering | Unknown access patterns | 11 9s | Automatic |
Key Trade-offs
| Decision | Reasoning |
| --- | --- |
| Erasure coding over full replication | 50% storage overhead vs. 200% for 3x replication |
| Flat namespace | Simplifies distribution; "folders" are just key prefixes |
| Strong consistency (since 2020) | Simplifies client code; no more eventual-consistency bugs |
| Immutable objects | Enables deduplication, caching, and versioning without conflicts |
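The storage-overhead numbers in the first trade-off row follow directly from the definitions; a quick check, using K=10, M=5 as an illustrative erasure-coding geometry:

```python
# Overhead = extra bytes stored per byte of user data, as a percentage.
# 3x replication stores 3 bytes per byte; (K+M)-of-K erasure coding
# stores (K+M)/K bytes per byte.

def overhead_percent(stored_bytes_per_byte: float) -> float:
    return (stored_bytes_per_byte - 1) * 100

replication = 3.0          # three full copies
k, m = 10, 5               # illustrative geometry
erasure = (k + m) / k      # 1.5 bytes stored per byte

assert overhead_percent(replication) == 200.0
assert overhead_percent(erasure) == 50.0
```

Both configurations tolerate multiple failures, but erasure coding does so at a quarter of the extra storage — the reason it wins for cold data, where the reconstruction cost on reads matters less.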