Overview
Source: The System Design Newsletter — Neo Kim
Google Search indexes hundreds of billions of web pages and returns relevant results in under 200ms. The system spans web crawling, indexing, ranking, and serving — one of the most complex distributed systems ever built.
Key Concepts
Web Crawler — Automated bot that discovers and downloads web pages by following hyperlinks. Starting from seed URLs, it expands the frontier continuously.
Inverted Index — Maps each word to the list of documents containing it (with positions and frequency). A hash lookup of the posting list makes per-term retrieval effectively O(1).
PageRank — Algorithm that scores a page by the number and quality of inbound links. A link from a high-authority page counts more than many links from low-authority pages.
Query Understanding — NLP layer that interprets query intent: spelling correction, synonym expansion, entity detection, and personalization.
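The inverted index concept above can be sketched in a few lines. This is an illustrative in-memory version (the class name and tokenizer are assumptions for the sketch); a production index is sharded, compressed, and stores far richer posting data.

```python
from collections import defaultdict

# Minimal in-memory inverted index: term -> {doc_id: [positions]}.
class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(dict)

    def add_document(self, doc_id, text):
        # Naive whitespace tokenizer; records each term's positions.
        for pos, term in enumerate(text.lower().split()):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def lookup(self, term):
        # Hash lookup of the posting list: O(1) per query term.
        return self.postings.get(term.lower(), {})

    def and_query(self, terms):
        # Documents containing every term = intersection of posting lists.
        sets = [set(self.lookup(t)) for t in terms]
        return set.intersection(*sets) if sets else set()

idx = InvertedIndex()
idx.add_document(1, "google search indexes the web")
idx.add_document(2, "the web crawler follows links")
```

Position data is what enables phrase queries and proximity scoring, not just boolean matching.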
Core Components
- Crawler (Googlebot) — Distributed crawler that fetches billions of pages. Respects robots.txt. Prioritizes fresh and high-authority pages for re-crawl.
- URL Frontier — Queue of URLs to be crawled, prioritized by authority and freshness.
- Content Store — Raw and processed page content stored at massive scale (Google Bigtable/Colossus).
- Indexer — Parses HTML, extracts text, computes term frequencies, and writes to the inverted index.
- Inverted Index — Sharded, distributed index mapping terms to document lists. Stored in memory for hot terms.
- Ranking Engine — Combines hundreds of signals: PageRank, freshness, relevance, E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness).
- Serving Layer — Receives user queries, distributes across index shards, merges results, applies ranking, and returns the top 10.
- Knowledge Graph — Structured database of entities and relationships. Powers information boxes and direct answers.
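The URL Frontier's "prioritized by authority and freshness" behavior can be sketched as a priority queue. The scoring blend and its weights are assumptions for illustration; real frontiers also enforce per-host politeness and rate limits.

```python
import heapq
import time

# Sketch of a URL frontier: a max-priority queue keyed on a blend of
# page authority and staleness (time since last crawl). Weights and
# the one-day staleness cap are illustrative assumptions.
class URLFrontier:
    def __init__(self, authority_weight=0.7, freshness_weight=0.3):
        self.heap = []
        self.aw = authority_weight
        self.fw = freshness_weight

    def push(self, url, authority, last_crawled):
        staleness = min((time.time() - last_crawled) / 86400, 1.0)
        score = self.aw * authority + self.fw * staleness
        # heapq is a min-heap, so negate the score for max-priority pop.
        heapq.heappush(self.heap, (-score, url))

    def pop(self):
        return heapq.heappop(self.heap)[1] if self.heap else None

frontier = URLFrontier()
now = time.time()
frontier.push("https://example.com/a", authority=0.9, last_crawled=now - 100)
frontier.push("https://example.com/b", authority=0.1, last_crawled=now - 100)
```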
Crawl → Index → Serve Pipeline
- Crawl: Googlebot fetches URL → HTML stored in Content Store
- Parse: Extract text, links, and structured data (schema.org)
- Index: Tokenize text → update inverted index with document ID, term positions
- Rank: Compute PageRank + quality signals offline; stored as document scores
- Serve: Query arrives → parsed → shards queried in parallel → results merged → ranked → returned
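The offline PageRank step in the pipeline is, at its core, power iteration over the link graph. A toy sketch (the damping factor 0.85 comes from the original PageRank paper; the graph and helper name are assumptions):

```python
# Toy power-iteration PageRank over an adjacency dict {page: [outlinks]}.
def pagerank(links, damping=0.85, iterations=50):
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page gets the (1 - d) "random jump" baseline.
        new_rank = {node: (1 - damping) / n for node in nodes}
        for node, outlinks in links.items():
            if not outlinks:
                # Dangling page: spread its rank across all pages.
                for m in nodes:
                    new_rank[m] += damping * rank[node] / n
            else:
                # A page passes its rank evenly to the pages it links to.
                share = damping * rank[node] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Note how "c", which is linked from both "a" and "b", outranks "b", which has a single inbound link: quality and quantity of inbound links both matter, exactly as the Key Concepts section describes.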
Query Serving (Sub-200ms)
- Query fan-out to thousands of index shards in parallel
- Each shard returns top-K local results
- Results merged and globally re-ranked
- Final top 10 results returned with snippets
- Cache layer for popular queries (significant hit rate)
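The scatter-gather merge at the heart of serving can be sketched as follows. Function and variable names are illustrative; real serving also handles shard timeouts, snippet generation, and final re-ranking.

```python
import heapq

# Sketch of the serve-side merge: each shard returns its local top-K
# (doc_id, score) pairs; the aggregator merges them into a global top-N.
def merge_shard_results(shard_results, n=10):
    merged = []
    for results in shard_results:
        merged.extend(results)
    # nlargest keeps the merge cheap: O(M log N) for M candidates.
    return heapq.nlargest(n, merged, key=lambda pair: pair[1])

shard_a = [("doc1", 0.91), ("doc4", 0.52)]
shard_b = [("doc7", 0.88), ("doc2", 0.47)]
top = merge_shard_results([shard_a, shard_b], n=3)
```

Because each shard only ships its local top-K, the aggregator never sees full posting lists, which is what keeps fan-out across thousands of shards inside the latency budget.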
Scale Characteristics
- 8.5 billion searches per day
- Index covers hundreds of billions of pages
- Crawling: billions of pages per day
- Query latency: < 200ms end-to-end
- Duplicate content detection via SimHash fingerprinting
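The SimHash fingerprinting mentioned above can be sketched in miniature. This toy version (MD5 as the per-token hash and whitespace tokenization are simplifying assumptions) shows the key property: near-duplicate documents produce fingerprints at small Hamming distance.

```python
import hashlib

# Toy SimHash: hash each token to 64 bits, accumulate signed bit
# votes, and keep the sign of each bit as the fingerprint bit.
def simhash(text, bits=64):
    votes = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")
```

Unlike a cryptographic hash, where one changed word yields a completely different digest, SimHash lets the crawler flag two pages as near-duplicates by comparing fingerprints instead of full content.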
Key Trade-offs
| Decision | Reasoning |
| --- | --- |
| Distributed index sharding | A single machine can't hold the full index |
| PageRank computed offline | Too expensive for real-time; acceptable staleness |
| In-memory index for hot terms | "the", "python", etc. must return in microseconds |
| ML-based ranking (RankBrain, BERT) | Human-curated signals don't scale to billions of queries |