Overview
Source: The System Design Newsletter — Neo Kim
ChatGPT is a large language model (LLM)-powered chat application built on GPT-4. Understanding how it works requires knowledge of both the underlying ML architecture (Transformer, RLHF) and the distributed-systems challenges of serving a model with hundreds of billions of parameters to millions of users.
Key Concepts
Transformer Architecture — The neural network architecture underlying GPT. Uses self-attention mechanisms to process relationships between all tokens in the input simultaneously. Enables understanding of long-range context.
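A minimal single-head sketch of scaled dot-product self-attention (the `self_attention` function and weight shapes here are illustrative, not GPT's actual implementation). The point to notice is the `(seq_len, seq_len)` score matrix: every token's output mixes information from every other token in one matrix multiply, which is what gives the Transformer its long-range context.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) token embeddings; w_q/w_k/w_v: (d_model, d_head).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)               # (seq_len, seq_len) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key dimension
    return weights @ v                               # each output mixes all value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                 # 5 tokens, 16-dim embeddings
w = [rng.normal(size=(16, 8)) for _ in range(3)]
out = self_attention(x, *w)
print(out.shape)  # (5, 8)
```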
Token — The basic unit of text for LLMs. A token ≈ 4 characters or 0.75 words in English. GPT-4 has a context window of 8K–128K tokens depending on configuration.
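The "≈ 4 characters per token" rule of thumb can be turned into a quick estimator. This is only a heuristic sketch (`estimate_tokens` and `fits_context` are made-up helper names); real BPE tokenizers vary by language and content.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic from the ~4-characters-per-token rule of thumb;
    # a real tokenizer can differ substantially, e.g. for code or non-English text.
    return max(1, len(text) // 4)

def fits_context(text: str, context_window: int = 8_192) -> bool:
    # Checks the estimate against an 8K-token context window.
    return estimate_tokens(text) <= context_window

print(estimate_tokens("ChatGPT generates one token at a time."))  # 9
print(fits_context("hello"))                                      # True
```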
RLHF (Reinforcement Learning from Human Feedback) — Training technique that fine-tunes the base model to follow instructions and produce helpful, harmless, and honest responses. Human raters rank model outputs; a reward model is trained on these rankings.
Autoregressive Generation — The model generates text one token at a time. Each token is conditioned on all previous tokens. This makes generation sequential and latency-sensitive.
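The decoding loop can be sketched as follows, with a stand-in `next_token` callable in place of the real model. Because each step consumes all tokens so far, the steps cannot run in parallel, which is why latency grows linearly with response length.

```python
from typing import Callable, List

def generate(prompt: List[int],
             next_token: Callable[[List[int]], int],
             eos: int,
             max_new: int = 32) -> List[int]:
    """Autoregressive decoding: each new token is conditioned on ALL
    tokens generated so far, so generation is inherently sequential."""
    tokens = list(prompt)
    for _ in range(max_new):
        tok = next_token(tokens)   # one full model forward pass per token
        tokens.append(tok)
        if tok == eos:
            break
    return tokens

# Toy "model": emits incrementing ids until it wraps to the EOS token 0.
out = generate([101, 102], next_token=lambda ts: (ts[-1] + 1) % 105, eos=0, max_new=8)
print(out)  # [101, 102, 103, 104, 0]
```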
Inference — Running the trained model to generate a response. Requires significant GPU memory (hundreds of GBs for GPT-4) and compute per token.
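A back-of-the-envelope check on why memory requirements are so large. Using the publicly known GPT-3 scale (175B parameters) as the illustration, fp16 weights alone far exceed a single 80 GB GPU, before counting KV cache and activations:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    # fp16/bf16 weights take 2 bytes each; the KV cache and
    # activations add further memory on top of this.
    return n_params * bytes_per_param / 1e9

# 175B parameters (GPT-3 scale) in fp16 -> ~350 GB just for weights,
# which is why a single request must span many GPUs.
print(weight_memory_gb(175e9))  # 350.0
```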
Training Pipeline
- Pre-training: GPT trained on hundreds of billions of tokens from the internet (books, code, web pages) using next-token prediction. Learns language, facts, and reasoning.
- Supervised Fine-Tuning (SFT): Model fine-tuned on curated examples of (prompt, ideal response) pairs written by humans.
- Reward Model Training: Human raters rank multiple responses to the same prompt. A reward model learns to predict human preferences.
- RLHF (PPO): The SFT model is fine-tuned using the reward model signal via Proximal Policy Optimization. Output becomes more helpful and aligned.
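The reward-model step above can be illustrated with the standard pairwise ranking loss (a Bradley–Terry-style objective; the function name is mine). The loss is small when the reward model scores the human-preferred response above the rejected one, and large otherwise:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise ranking loss for reward-model training:
    equivalent to -log(sigmoid(r_chosen - r_rejected))."""
    return math.log(1.0 + math.exp(-(r_chosen - r_rejected)))

# Loss shrinks as the margin between preferred and rejected grows.
print(preference_loss(2.0, 0.0))  # small: reward model agrees with the rater
print(preference_loss(0.0, 2.0))  # large: reward model disagrees
```

During the RLHF (PPO) stage, this trained reward model then scores the policy's outputs to provide the optimization signal.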
Inference Architecture
- Model Parallelism — GPT-4 is too large for a single GPU. Parameters split across multiple GPUs (tensor parallelism) and multiple servers (pipeline parallelism).
- KV Cache — Key-Value cache stores attention states for already-processed tokens so they don't need to be recomputed when generating each new token.
- Batching — Multiple user requests processed in the same GPU forward pass to maximize hardware utilization.
- Streaming — Tokens sent to the client as generated (Server-Sent Events) rather than waiting for the full response. Reduces perceived latency.
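The KV cache idea above can be sketched in a few lines (the `KVCache` class is an illustrative toy, not a production implementation): each decode step appends the new token's key/value projections instead of reprojecting the whole sequence, so per-token work stays bounded.

```python
import numpy as np

class KVCache:
    """Stores key/value projections of already-processed tokens so each
    decode step only projects the single new token."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q: np.ndarray) -> np.ndarray:
        K = np.stack(self.keys)                  # (seq_len, d)
        V = np.stack(self.values)
        scores = K @ q / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max())
        w /= w.sum()                             # softmax over cached tokens
        return w @ V                             # attends to ALL cached tokens

rng = np.random.default_rng(1)
cache = KVCache()
for _ in range(4):                               # decode 4 tokens one at a time
    k, v, q = (rng.normal(size=8) for _ in range(3))
    cache.append(k, v)                           # cached once, reused every step
    out = cache.attend(q)
print(len(cache.keys), out.shape)  # 4 (8,)
```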
System Architecture
- API Gateway — Handles authentication, rate limiting, and routing.
- Load Balancer — Distributes inference requests across GPU server clusters.
- Inference Cluster — Thousands of H100 GPUs running the model. Each request may span 8–16 GPUs.
- Context Store — Stores conversation history (chat messages) to provide context for each API call.
- Safety Filters — Input and output moderation filters (content classifiers) run before and after model inference.
- Usage & Billing Service — Tracks token consumption per API key.
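The components above compose into a simple per-request pipeline. This is a hypothetical sketch (all names are mine, and the "model" is an echo stub): moderation runs on both input and output, and the context store accumulates the conversation keyed by API key.

```python
def moderate(text: str) -> bool:
    # Stand-in safety classifier: block a toy denylist term.
    return "blocked-term" not in text

def handle_request(api_key: str, message: str, context_store: dict) -> str:
    if not moderate(message):                    # input safety filter
        return "[request refused]"
    history = context_store.setdefault(api_key, [])
    history.append(("user", message))            # context store: conversation history
    reply = f"echo: {message}"                   # stand-in for the inference cluster
    if not moderate(reply):                      # output safety filter
        return "[response withheld]"
    history.append(("assistant", reply))
    return reply

store: dict = {}
print(handle_request("key-1", "hello", store))         # echo: hello
print(handle_request("key-1", "blocked-term", store))  # [request refused]
```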
Scale Characteristics
- 100M+ users at peak
- ~1 trillion parameters estimated for GPT-4
- Token generation: ~20–100 tokens/sec per request (GPU-bound)
- Thousands of H100 GPUs running 24/7
- Cost: ~$0.03 per 1K tokens (GPT-4) — GPU cost is the dominant expense
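Applying the ~$0.03/1K-token figure above gives a feel for why GPU cost dominates (this uses a single flat rate for simplicity; real pricing distinguishes prompt from completion tokens and changes over time):

```python
def request_cost_usd(prompt_tokens: int, completion_tokens: int,
                     price_per_1k: float = 0.03) -> float:
    # Flat per-1K-token rate; real pricing differs for prompt vs.
    # completion tokens.
    return (prompt_tokens + completion_tokens) / 1000 * price_per_1k

# A 500-token prompt with a 500-token reply costs ~3 cents; at
# 100M+ users, per-token GPU cost becomes the dominant expense.
print(request_cost_usd(500, 500))  # 0.03
```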
Key Trade-offs
| Decision | Reasoning |
| --- | --- |
| Autoregressive generation | Simplest correctness guarantee; no good parallel alternative yet |
| KV Cache | Avoids recomputing attention for prompt tokens on every new token |
| Streaming output | First token in ~1s feels fast even if full response takes 10s |
| RLHF over pure SFT | Better alignment with human intent; reduces harmful outputs |