
Overview

Source: The System Design Newsletter — Neo Kim
ChatGPT is a chat application powered by a large language model (LLM), GPT-4. Understanding how it works requires knowledge of both the underlying ML architecture (the Transformer, RLHF) and the distributed-systems challenges of serving a model with billions of parameters to millions of users.

Key Concepts

Transformer Architecture — The neural network architecture underlying GPT. Uses self-attention mechanisms to process relationships between all tokens in the input simultaneously. Enables understanding of long-range context.
Token — The basic unit of text for LLMs. A token ≈ 4 characters or 0.75 words in English. GPT-4 has a context window of 8K–128K tokens depending on configuration.
RLHF (Reinforcement Learning from Human Feedback) — Training technique that fine-tunes the base model to follow instructions and produce helpful, harmless, and honest responses. Human raters rank model outputs; a reward model is trained on these rankings.
Autoregressive Generation — The model generates text one token at a time. Each token is conditioned on all previous tokens. This makes generation sequential and latency-sensitive.
Inference — Running the trained model to generate a response. Requires significant GPU memory (hundreds of GBs for GPT-4) and compute per token.
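The autoregressive loop above can be sketched in a few lines; `next_token` here is a hypothetical stand-in for a full Transformer forward pass, which is why each generated token costs real GPU time:

```python
def generate(prompt_tokens, next_token, max_new_tokens=16, eos=-1):
    """Greedy autoregressive decoding: each new token is conditioned
    on the prompt plus every token generated so far."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # one full forward pass per token
        if tok == eos:
            break
        tokens.append(tok)
    return tokens

# Toy "model" for illustration: always predicts the previous token + 1.
out = generate([1, 2, 3], next_token=lambda ts: ts[-1] + 1, max_new_tokens=4)
# out == [1, 2, 3, 4, 5, 6, 7]
```

Because the loop is strictly sequential, latency grows linearly with the number of output tokens, which is why the later sections care so much about per-token throughput.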

Training Pipeline

  1. Pre-training: GPT trained on hundreds of billions of tokens from the internet (books, code, web pages) using next-token prediction. Learns language, facts, and reasoning.
  2. Supervised Fine-Tuning (SFT): Model fine-tuned on curated examples of (prompt, ideal response) pairs written by humans.
  3. Reward Model Training: Human raters rank multiple responses to the same prompt. A reward model learns to predict human preferences.
  4. RLHF (PPO): The SFT model is fine-tuned using the reward model signal via Proximal Policy Optimization. Output becomes more helpful and aligned.
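Step 3's reward model is typically trained with a pairwise ranking objective (a Bradley–Terry-style loss); a minimal sketch of that loss, with scalar rewards standing in for real model outputs:

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Pairwise ranking loss for reward-model training:
    -log(sigmoid(r_chosen - r_rejected)). The loss shrinks as the
    model scores the human-preferred response above the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A clear positive margin gives near-zero loss; a wrong ordering is penalized.
good = reward_model_loss(2.0, -1.0)  # preferred response scored higher
bad = reward_model_loss(-1.0, 2.0)   # preferred response scored lower
```

The PPO stage then optimizes the SFT model to maximize this learned reward, usually with a KL penalty against the SFT model to keep outputs from drifting too far.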

Inference Architecture

  • Model Parallelism — GPT-4 is too large for a single GPU. Parameters split across multiple GPUs (tensor parallelism) and multiple servers (pipeline parallelism).
  • KV Cache — Key-Value cache stores attention states for already-processed tokens so they don't need to be recomputed when generating each new token.
  • Batching — Multiple user requests processed in the same GPU forward pass to maximize hardware utilization.
  • Streaming — Tokens sent to the client as generated (Server-Sent Events) rather than waiting for the full response. Reduces perceived latency.
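The KV-cache idea can be sketched as follows; `project_kv` is a hypothetical stand-in for the per-layer key/value projections of a real Transformer:

```python
class KVCache:
    """Minimal per-request key/value cache: attention keys and values
    for already-processed tokens are stored once, so each decode step
    only computes K/V for the single newest token."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

def decode_step(cache, new_token, project_kv):
    # Without a cache, K/V for every prior token would be recomputed here.
    k, v = project_kv(new_token)
    cache.append(k, v)
    # Attention for the new token then runs against cache.keys / cache.values.
    return len(cache)

cache = KVCache()
project_kv = lambda tok: (tok * 0.1, tok * 0.2)  # illustrative projection
decode_step(cache, 42, project_kv)  # cache now holds K/V for one token
```

The trade-off is memory: the cache grows linearly with sequence length per request, which is one reason batching many requests onto the same GPUs is memory-bound as well as compute-bound.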

System Architecture

  • API Gateway — Handles authentication, rate limiting, and routing.
  • Load Balancer — Distributes inference requests across GPU server clusters.
  • Inference Cluster — Thousands of H100 GPUs running the model. Each request may span 8–16 GPUs.
  • Context Store — Stores conversation history (chat messages) to provide context for each API call.
  • Safety Filters — Input and output moderation filters (content classifiers) run before and after model inference.
  • Usage & Billing Service — Tracks token consumption per API key.
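A usage-and-billing service like the one above boils down to metering tokens per API key; a minimal sketch, with an illustrative (not official) price:

```python
from collections import defaultdict

class UsageMeter:
    """Sketch of a usage & billing service: accumulate prompt and
    completion tokens per API key, then price them per 1K tokens.
    The rate here is illustrative, not real pricing."""
    def __init__(self, usd_per_1k_tokens=0.03):
        self.rate = usd_per_1k_tokens
        self.tokens = defaultdict(int)

    def record(self, api_key, prompt_tokens, completion_tokens):
        self.tokens[api_key] += prompt_tokens + completion_tokens

    def bill(self, api_key):
        return self.tokens[api_key] / 1000 * self.rate

meter = UsageMeter()
meter.record("key-123", prompt_tokens=800, completion_tokens=200)
# meter.bill("key-123") -> 0.03 (1K tokens at $0.03/1K)
```

In production this metering typically also feeds the rate limiter at the API gateway, since quotas are usually expressed in tokens per minute as well as requests per minute.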

Scale Characteristics

  • 100M+ users at peak
  • ~1 trillion parameters estimated for GPT-4
  • Token generation: ~20–100 tokens/sec per request (GPU-bound)
  • Thousands of H100 GPUs running 24/7
  • Cost: ~$0.03 per 1K tokens (GPT-4) — GPU cost is the dominant expense
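These numbers combine into simple back-of-envelope estimates; a sketch using the figures above (throughput and price assumed from this section):

```python
def response_latency_s(output_tokens, tokens_per_s):
    """Sequential generation: latency scales linearly with output length."""
    return output_tokens / tokens_per_s

def cost_usd(tokens, usd_per_1k=0.03):
    """Token-priced cost at ~$0.03 per 1K tokens (GPT-4 figure above)."""
    return tokens / 1000 * usd_per_1k

# A 500-token answer at 50 tokens/sec:
latency = response_latency_s(500, 50)  # 10.0 seconds
cost = cost_usd(500)                   # $0.015
```

At the low end of the quoted range (~20 tokens/sec) the same answer takes 25 seconds, which is why streaming the first token early matters so much for perceived latency.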

Key Trade-offs

Autoregressive generation — Simplest correctness guarantee; no good parallel alternative yet.
KV Cache — Avoids recomputing attention for prompt tokens on every new token.
Streaming output — First token in ~1s feels fast even if the full response takes 10s.
RLHF over pure SFT — Better alignment with human intent; reduces harmful outputs.
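The streaming trade-off can be illustrated with a generator: tokens are yielded as soon as they are "generated" instead of buffered into one response. (In production the transport would be Server-Sent Events; this sketch omits the network layer.)

```python
def stream_tokens(tokens):
    """Streaming sketch: yield each token immediately rather than
    returning the full response at once, so the client can start
    rendering after the first token arrives."""
    for tok in tokens:
        yield tok

# The client can render "Hello" right away instead of waiting for all three.
first = next(stream_tokens(["Hello", " ", "world"]))
```

Time-to-first-token, not total generation time, is what the user perceives as responsiveness — the whole reason streaming wins in the table above.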