
Overview

Source: The System Design Newsletter — Neo Kim
Google Search indexes hundreds of billions of web pages and returns relevant results in under 200ms. The system spans web crawling, indexing, ranking, and serving — one of the most complex distributed systems ever built.

Key Concepts

Web Crawler — Automated bot that discovers and downloads web pages by following hyperlinks. Starting from seed URLs, it expands the frontier continuously.
Inverted Index — Maps each word to a list of documents containing that word (with position and frequency). Enables constant-time lookup of the postings list for each query term.
PageRank — Algorithm that scores a page by the number and quality of inbound links. A link from a high-authority page counts more than many links from low-authority pages.
Query Understanding — NLP layer that interprets query intent: spelling correction, synonym expansion, entity detection, and personalization.
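The inverted index described above can be sketched in a few lines of Python. This is a toy in-memory version with positional postings (the real index is sharded across thousands of machines); the function names are illustrative, not Google's:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings dict of {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def lookup(index, query):
    """AND-query: return the set of doc IDs containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], {}))
    for term in terms[1:]:
        result &= set(index.get(term, {}))
    return result

docs = {
    1: "distributed systems at scale",
    2: "web search is a distributed system",
}
index = build_inverted_index(docs)
# lookup(index, "distributed") -> {1, 2}; lookup(index, "distributed systems") -> {1}
```

Positions are what make phrase queries and proximity scoring possible later in ranking; a plain term-to-doc-ID map would lose that.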

Core Components

  • Crawler (Googlebot) — Distributed crawler that fetches billions of pages. Respects robots.txt. Prioritizes fresh and high-authority pages for re-crawl.
  • URL Frontier — Queue of URLs to be crawled, prioritized by authority and freshness.
  • Content Store — Raw and processed page content stored at massive scale (Google Bigtable/Colossus).
  • Indexer — Parses HTML, extracts text, computes term frequencies, and writes to the inverted index.
  • Inverted Index — Sharded, distributed index mapping terms to document lists. Stored in memory for hot terms.
  • Ranking Engine — Combines hundreds of signals: PageRank, freshness, relevance, E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness).
  • Serving Layer — Receives user queries, distributes across index shards, merges results, applies ranking, and returns the top 10.
  • Knowledge Graph — Structured database of entities and relationships. Powers information boxes and direct answers.
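The URL Frontier above is essentially a priority queue. A minimal sketch using Python's heapq follows; the scoring function is a made-up stand-in for real crawl prioritization, which also enforces per-host politeness limits the sketch omits:

```python
import heapq
import itertools

class URLFrontier:
    """Toy crawl frontier: lower priority value = crawled sooner."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def add(self, url, authority, staleness):
        """authority and staleness in [0, 1]; higher values crawl sooner."""
        if url in self._seen:          # never enqueue the same URL twice
            return
        self._seen.add(url)
        priority = -(authority + staleness)  # negate: heapq is a min-heap
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def next_url(self):
        """Pop the highest-priority URL, or None when the frontier is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None
```

The seen-set gives the dedup a real frontier needs, though at billions of URLs it would be a Bloom filter or a distributed store rather than an in-memory set.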

Crawl → Index → Serve Pipeline

  1. Crawl: Googlebot fetches URL → HTML stored in Content Store
  2. Parse: Extract text, links, and structured data (schema.org)
  3. Index: Tokenize text → update inverted index with document ID, term positions
  4. Rank: Compute PageRank + quality signals offline; stored as document scores
  5. Serve: Query arrives → parsed → shards queried in parallel → results merged → ranked → returned
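The offline ranking step rests on PageRank, which can be sketched as power iteration over the link graph. This is a toy version (the production computation runs as a massive distributed batch job, and this sketch assumes every link target also appears as a key in the graph):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over a dict {page: [outbound links]}."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}       # start uniform
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}   # teleport term
        for p, outs in links.items():
            if outs:
                share = damping * ranks[p] / len(outs)
                for q in outs:                # each outlink passes equal share
                    new[q] += share
            else:                             # dangling node: spread evenly
                for q in pages:
                    new[q] += damping * ranks[p] / n
        ranks = new
    return ranks

graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
scores = pagerank(graph)  # "a" scores highest: it has the most inbound weight
```

The sketch makes the trade-off in the table below concrete: 50 iterations over billions of pages is far too expensive per query, so scores are precomputed and stored as static document signals.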

Query Serving (Sub-200ms)

  • Query fan-out to thousands of index shards in parallel
  • Each shard returns top-K local results
  • Results merged and globally re-ranked
  • Final top 10 results returned with snippets
  • Cache layer for popular queries (significant hit rate)
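The fan-out-and-merge pattern above can be sketched with heapq. Scores here are placeholder floats and the merge is a plain global top-K; in the real system a second ranking pass with many more signals runs after the merge:

```python
import heapq

def merge_shard_results(shard_results, k=10):
    """Merge per-shard top-K lists [(score, doc_id), ...] into a global top-K.

    Each shard has already done its own local ranking, so the serving
    layer only needs to re-rank k * num_shards candidates, not the
    whole index.
    """
    all_hits = (hit for shard in shard_results for hit in shard)
    return heapq.nlargest(k, all_hits)  # highest score first

shards = [
    [(0.92, "doc-a"), (0.55, "doc-b")],   # shard 1's local top results
    [(0.87, "doc-c"), (0.31, "doc-d")],   # shard 2's local top results
]
top = merge_shard_results(shards, k=2)
# -> [(0.92, 'doc-a'), (0.87, 'doc-c')]
```

This is why each shard returning only its local top-K is safe: any document outside a shard's local top-K can never appear in the global top-K.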

Scale Characteristics

  • 8.5 billion searches per day
  • Index covers hundreds of billions of pages
  • Crawling: billions of pages per day
  • Query latency: < 200ms end-to-end
  • Duplicate content detection via SimHash fingerprinting
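The SimHash fingerprinting mentioned above can be sketched as follows. This toy 64-bit version uses MD5 as the per-token hash purely for convenience; the key property is that near-duplicate documents produce fingerprints with a small Hamming distance:

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash over whitespace tokens."""
    v = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            # each token votes +1/-1 on every bit position
            v[i] += 1 if (h >> i) & 1 else -1
    # final fingerprint: bit i is set where the votes were net positive
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Unlike a cryptographic hash, where one changed word flips the whole digest, SimHash changes only a few bits, so duplicates can be detected by thresholding the Hamming distance rather than requiring exact equality.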

Key Trade-offs

  • Distributed index sharding — Single machine can't hold the full index.
  • PageRank computed offline — Too expensive for real-time; acceptable staleness.
  • In-memory index for hot terms — Common terms ("the", "python") must return in microseconds.
  • ML-based ranking (RankBrain, BERT) — Human-curated signals don't scale to billions of queries.