
Overview

Source: The System Design Newsletter — Neo Kim
Google Search indexes hundreds of billions of web pages and returns relevant results in under 200ms. The system spans web crawling, indexing, ranking, and serving — one of the most complex distributed systems ever built.

Key Concepts

Web Crawler — Automated bot that discovers and downloads web pages by following hyperlinks. Starting from seed URLs, it expands the frontier continuously.
Inverted Index — Maps each word to a list of documents containing that word (with position and frequency). Enables constant-time lookup of the postings list for each query term.
PageRank — Algorithm that scores a page by the number and quality of inbound links. A link from a high-authority page counts more than many links from low-authority pages.
Query Understanding — NLP layer that interprets query intent: spelling correction, synonym expansion, entity detection, and personalization.
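The inverted index described above can be sketched in a few lines of Python. This is a toy in-memory version with positional postings (the real index is sharded across thousands of machines); the function names are illustrative, not Google's:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings dict of {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def lookup(index, query):
    """AND-query: return the set of doc IDs containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], {}))
    for term in terms[1:]:
        result &= set(index.get(term, {}))
    return result

docs = {
    1: "distributed systems at scale",
    2: "web search is a distributed system",
}
index = build_inverted_index(docs)
# lookup(index, "distributed") -> {1, 2}; lookup(index, "distributed systems") -> {1}
```

Positions are what make phrase queries and proximity scoring possible later in ranking; a plain term-to-doc-ID map would lose that.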

Core Components

  • Crawler (Googlebot) — Distributed crawler that fetches billions of pages. Respects robots.txt. Prioritizes fresh and high-authority pages for re-crawl.
  • URL Frontier — Queue of URLs to be crawled, prioritized by authority and freshness.
  • Content Store — Raw and processed page content stored at massive scale (Google Bigtable/Colossus).
  • Indexer — Parses HTML, extracts text, computes term frequencies, and writes to the inverted index.
  • Inverted Index — Sharded, distributed index mapping terms to document lists. Stored in memory for hot terms.
  • Ranking Engine — Combines hundreds of signals: PageRank, freshness, relevance, E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness).
  • Serving Layer — Receives user queries, distributes across index shards, merges results, applies ranking, and returns the top 10.
  • Knowledge Graph — Structured database of entities and relationships. Powers information boxes and direct answers.
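The URL Frontier above is essentially a priority queue. A minimal sketch using Python's heapq follows; the scoring function is a made-up stand-in for real crawl prioritization, which also enforces per-host politeness limits the sketch omits:

```python
import heapq
import itertools

class URLFrontier:
    """Toy crawl frontier: lower priority value = crawled sooner."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def add(self, url, authority, staleness):
        """authority and staleness in [0, 1]; higher values crawl sooner."""
        if url in self._seen:          # never enqueue the same URL twice
            return
        self._seen.add(url)
        priority = -(authority + staleness)  # negate: heapq is a min-heap
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def next_url(self):
        """Pop the highest-priority URL, or None when the frontier is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None
```

The seen-set gives the dedup a real frontier needs, though at billions of URLs it would be a Bloom filter or a distributed store rather than an in-memory set.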

Crawl → Index → Serve Pipeline

  1. Crawl: Googlebot fetches URL → HTML stored in Content Store
  2. Parse: Extract text, links, and structured data (schema.org)
  3. Index: Tokenize text → update inverted index with document ID, term positions
  4. Rank: Compute PageRank + quality signals offline; stored as document scores
  5. Serve: Query arrives → parsed → shards queried in parallel → results merged → ranked → returned
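The offline ranking step rests on PageRank, which can be sketched as power iteration over the link graph. This is a toy version (the production computation runs as a massive distributed batch job, and this sketch assumes every link target also appears as a key in the graph):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over a dict {page: [outbound links]}."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}       # start uniform
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}   # teleport term
        for p, outs in links.items():
            if outs:
                share = damping * ranks[p] / len(outs)
                for q in outs:                # each outlink passes equal share
                    new[q] += share
            else:                             # dangling node: spread evenly
                for q in pages:
                    new[q] += damping * ranks[p] / n
        ranks = new
    return ranks

graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
scores = pagerank(graph)  # "a" scores highest: it has the most inbound weight
```

The sketch makes the trade-off in the table below concrete: 50 iterations over billions of pages is far too expensive per query, so scores are precomputed and stored as static document signals.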

Query Serving (Sub-200ms)

  • Query fan-out to thousands of index shards in parallel
  • Each shard returns top-K local results
  • Results merged and globally re-ranked
  • Final top 10 results returned with snippets
  • Cache layer for popular queries (significant hit rate)
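The fan-out-and-merge pattern above can be sketched with heapq. Scores here are placeholder floats and the merge is a plain global top-K; in the real system a second ranking pass with many more signals runs after the merge:

```python
import heapq

def merge_shard_results(shard_results, k=10):
    """Merge per-shard top-K lists [(score, doc_id), ...] into a global top-K.

    Each shard has already done its own local ranking, so the serving
    layer only needs to re-rank k * num_shards candidates, not the
    whole index.
    """
    all_hits = (hit for shard in shard_results for hit in shard)
    return heapq.nlargest(k, all_hits)  # highest score first

shards = [
    [(0.92, "doc-a"), (0.55, "doc-b")],   # shard 1's local top results
    [(0.87, "doc-c"), (0.31, "doc-d")],   # shard 2's local top results
]
top = merge_shard_results(shards, k=2)
# -> [(0.92, 'doc-a'), (0.87, 'doc-c')]
```

This is why each shard returning only its local top-K is safe: any document outside a shard's local top-K can never appear in the global top-K.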

Scale Characteristics

  • 8.5 billion searches per day
  • Index covers hundreds of billions of pages
  • Crawling: billions of pages per day
  • Query latency: < 200ms end-to-end
  • Duplicate content detection via SimHash fingerprinting
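The SimHash fingerprinting mentioned above can be sketched as follows. This toy 64-bit version uses MD5 as the per-token hash purely for convenience; the key property is that near-duplicate documents produce fingerprints with a small Hamming distance:

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash over whitespace tokens."""
    v = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            # each token votes +1/-1 on every bit position
            v[i] += 1 if (h >> i) & 1 else -1
    # final fingerprint: bit i is set where the votes were net positive
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Unlike a cryptographic hash, where one changed word flips the whole digest, SimHash changes only a few bits, so duplicates can be detected by thresholding the Hamming distance rather than requiring exact equality.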

Key Trade-offs

  • Distributed index sharding — Single machine can't hold the full index.
  • PageRank computed offline — Too expensive for real-time; acceptable staleness.
  • In-memory index for hot terms — Common terms ("the", "python") must return in microseconds.
  • ML-based ranking (RankBrain, BERT) — Human-curated signals don't scale to billions of queries.