Overview
Source: The System Design Newsletter — Neo Kim
ChatGPT is a large language model (LLM)-powered chat application built on GPT-4. Understanding how it works requires knowledge of both the underlying ML architecture (Transformer, RLHF) and the distributed-systems challenges of serving a model with hundreds of billions of parameters to millions of users.
Key Concepts
Transformer Architecture — The neural network architecture underlying GPT. Uses self-attention mechanisms to process relationships between all tokens in the input simultaneously. Enables understanding of long-range context.
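A minimal single-head sketch of scaled dot-product self-attention (the `self_attention` function and weight shapes here are illustrative, not GPT's actual implementation). The point to notice is the `(seq_len, seq_len)` score matrix: every token's output mixes information from every other token in one matrix multiply, which is what gives the Transformer its long-range context.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) token embeddings; w_q/w_k/w_v: (d_model, d_head).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)               # (seq_len, seq_len) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key dimension
    return weights @ v                               # each output mixes all value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                 # 5 tokens, 16-dim embeddings
w = [rng.normal(size=(16, 8)) for _ in range(3)]
out = self_attention(x, *w)
print(out.shape)  # (5, 8)
```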
Token — The basic unit of text for LLMs. A token ≈ 4 characters or 0.75 words in English. GPT-4 has a context window of 8K–128K tokens depending on configuration.
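The "≈ 4 characters per token" rule of thumb can be turned into a quick estimator. This is only a heuristic sketch (`estimate_tokens` and `fits_context` are made-up helper names); real BPE tokenizers vary by language and content.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic from the ~4-characters-per-token rule of thumb;
    # a real tokenizer can differ substantially, e.g. for code or non-English text.
    return max(1, len(text) // 4)

def fits_context(text: str, context_window: int = 8_192) -> bool:
    # Checks the estimate against an 8K-token context window.
    return estimate_tokens(text) <= context_window

print(estimate_tokens("ChatGPT generates one token at a time."))  # 9
print(fits_context("hello"))                                      # True
```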
RLHF (Reinforcement Learning from Human Feedback) — Training technique that fine-tunes the base model to follow instructions and produce helpful, harmless, and honest responses. Human raters rank model outputs; a reward model is trained on these rankings.
Autoregressive Generation — The model generates text one token at a time. Each token is conditioned on all previous tokens. This makes generation sequential and latency-sensitive.
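The decoding loop can be sketched as follows, with a stand-in `next_token` callable in place of the real model. Because each step consumes all tokens so far, the steps cannot run in parallel, which is why latency grows linearly with response length.

```python
from typing import Callable, List

def generate(prompt: List[int],
             next_token: Callable[[List[int]], int],
             eos: int,
             max_new: int = 32) -> List[int]:
    """Autoregressive decoding: each new token is conditioned on ALL
    tokens generated so far, so generation is inherently sequential."""
    tokens = list(prompt)
    for _ in range(max_new):
        tok = next_token(tokens)   # one full model forward pass per token
        tokens.append(tok)
        if tok == eos:
            break
    return tokens

# Toy "model": emits incrementing ids until it wraps to the EOS token 0.
out = generate([101, 102], next_token=lambda ts: (ts[-1] + 1) % 105, eos=0, max_new=8)
print(out)  # [101, 102, 103, 104, 0]
```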
Inference — Running the trained model to generate a response. Requires significant GPU memory (hundreds of GBs for GPT-4) and compute per token.
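A back-of-the-envelope check on why memory requirements are so large. Using the publicly known GPT-3 scale (175B parameters) as the illustration, fp16 weights alone far exceed a single 80 GB GPU, before counting KV cache and activations:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    # fp16/bf16 weights take 2 bytes each; the KV cache and
    # activations add further memory on top of this.
    return n_params * bytes_per_param / 1e9

# 175B parameters (GPT-3 scale) in fp16 -> ~350 GB just for weights,
# which is why a single request must span many GPUs.
print(weight_memory_gb(175e9))  # 350.0
```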
Training Pipeline
- Pre-training: GPT trained on hundreds of billions of tokens from the internet (books, code, web pages) using next-token prediction. Learns language, facts, and reasoning.
- Supervised Fine-Tuning (SFT): Model fine-tuned on curated examples of (prompt, ideal response) pairs written by humans.
- Reward Model Training: Human raters rank multiple responses to the same prompt. A reward model learns to predict human preferences.
- RLHF (PPO): The SFT model is fine-tuned using the reward model signal via Proximal Policy Optimization. Output becomes more helpful and aligned.
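The reward-model step above can be illustrated with the standard pairwise ranking loss (a Bradley–Terry-style objective; the function name is mine). The loss is small when the reward model scores the human-preferred response above the rejected one, and large otherwise:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise ranking loss for reward-model training:
    equivalent to -log(sigmoid(r_chosen - r_rejected))."""
    return math.log(1.0 + math.exp(-(r_chosen - r_rejected)))

# Loss shrinks as the margin between preferred and rejected grows.
print(preference_loss(2.0, 0.0))  # small: reward model agrees with the rater
print(preference_loss(0.0, 2.0))  # large: reward model disagrees
```

During the RLHF (PPO) stage, this trained reward model then scores the policy's outputs to provide the optimization signal.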
Inference Architecture
- Model Parallelism — GPT-4 is too large for a single GPU. Parameters split across multiple GPUs (tensor parallelism) and multiple servers (pipeline parallelism).
- KV Cache — Key-Value cache stores attention states for already-processed tokens so they don't need to be recomputed when generating each new token.
- Batching — Multiple user requests processed in the same GPU forward pass to maximize hardware utilization.
- Streaming — Tokens sent to the client as generated (Server-Sent Events) rather than waiting for the full response. Reduces perceived latency.
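The KV cache idea above can be sketched in a few lines (the `KVCache` class is an illustrative toy, not a production implementation): each decode step appends the new token's key/value projections instead of reprojecting the whole sequence, so per-token work stays bounded.

```python
import numpy as np

class KVCache:
    """Stores key/value projections of already-processed tokens so each
    decode step only projects the single new token."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q: np.ndarray) -> np.ndarray:
        K = np.stack(self.keys)                  # (seq_len, d)
        V = np.stack(self.values)
        scores = K @ q / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max())
        w /= w.sum()                             # softmax over cached tokens
        return w @ V                             # attends to ALL cached tokens

rng = np.random.default_rng(1)
cache = KVCache()
for _ in range(4):                               # decode 4 tokens one at a time
    k, v, q = (rng.normal(size=8) for _ in range(3))
    cache.append(k, v)                           # cached once, reused every step
    out = cache.attend(q)
print(len(cache.keys), out.shape)  # 4 (8,)
```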
System Architecture
- API Gateway — Handles authentication, rate limiting, and routing.
- Load Balancer — Distributes inference requests across GPU server clusters.
- Inference Cluster — Thousands of H100 GPUs running the model. Each request may span 8–16 GPUs.
- Context Store — Stores conversation history (chat messages) to provide context for each API call.
- Safety Filters — Input and output moderation filters (content classifiers) run before and after model inference.
- Usage & Billing Service — Tracks token consumption per API key.
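The components above compose into a simple per-request pipeline. This is a hypothetical sketch (all names are mine, and the "model" is an echo stub): moderation runs on both input and output, and the context store accumulates the conversation keyed by API key.

```python
def moderate(text: str) -> bool:
    # Stand-in safety classifier: block a toy denylist term.
    return "blocked-term" not in text

def handle_request(api_key: str, message: str, context_store: dict) -> str:
    if not moderate(message):                    # input safety filter
        return "[request refused]"
    history = context_store.setdefault(api_key, [])
    history.append(("user", message))            # context store: conversation history
    reply = f"echo: {message}"                   # stand-in for the inference cluster
    if not moderate(reply):                      # output safety filter
        return "[response withheld]"
    history.append(("assistant", reply))
    return reply

store: dict = {}
print(handle_request("key-1", "hello", store))         # echo: hello
print(handle_request("key-1", "blocked-term", store))  # [request refused]
```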
Scale Characteristics
- 100M+ users at peak
- ~1 trillion parameters estimated for GPT-4
- Token generation: ~20–100 tokens/sec per request (GPU-bound)
- Thousands of H100 GPUs running 24/7
- Cost: ~$0.03 per 1K tokens (GPT-4) — GPU cost is the dominant expense
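Applying the ~$0.03/1K-token figure above gives a feel for why GPU cost dominates (this uses a single flat rate for simplicity; real pricing distinguishes prompt from completion tokens and changes over time):

```python
def request_cost_usd(prompt_tokens: int, completion_tokens: int,
                     price_per_1k: float = 0.03) -> float:
    # Flat per-1K-token rate; real pricing differs for prompt vs.
    # completion tokens.
    return (prompt_tokens + completion_tokens) / 1000 * price_per_1k

# A 500-token prompt with a 500-token reply costs ~3 cents; at
# 100M+ users, per-token GPU cost becomes the dominant expense.
print(request_cost_usd(500, 500))  # 0.03
```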
Key Trade-offs
| Decision | Reasoning |
| --- | --- |
| Autoregressive generation | Simplest correctness guarantee; no good parallel alternative yet |
| KV Cache | Avoids recomputing attention for prompt tokens on every new token |
| Streaming output | First token in ~1s feels fast even if full response takes 10s |
| RLHF over pure SFT | Better alignment with human intent; reduces harmful outputs |