System Design — ChatGPT

End-to-end architecture for a large-scale, real-time AI chat service.

[Figure: ChatGPT system design architecture diagram]

High-Level System Flow

The system follows a hybrid REST and WebSocket architecture to handle the lifecycle of an AI chat session.

Ingress & Control Plane

Requests from the mobile app hit the API Gateway. This layer handles authentication and coordinates with the Rate Limiter, which fetches quota data from Redis (backed by Postgres) to ensure the user hasn't exceeded their tier limits.

Session Management

To prevent duplicate work and double-billing, the system uses idempotency keys. Before processing a message, the Chat Server checks whether the message's unique request_id already exists in Redis; if it does, the request is a duplicate and is not reprocessed.
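The check above can be sketched as follows. The dict stands in for Redis (where the equivalent is a SET with the NX flag and a TTL), and the function names are illustrative:

```python
# Stands in for Redis; in production these keys would carry a TTL.
seen: dict[str, str] = {}

def process_once(request_id: str, handler) -> tuple[bool, str]:
    """Run handler at most once per request_id; duplicates get the cached result."""
    if request_id in seen:           # Redis: key already set -> duplicate delivery
        return False, seen[request_id]
    result = handler()
    seen[request_id] = result        # Redis: store the result under request_id
    return True, result
```

A retried client request with the same request_id therefore returns the original result instead of billing the user twice.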

The Data Plane (Streaming)

For low-latency responses, a persistent WebSocket connection is established. This allows the AI to stream tokens incrementally, significantly improving the user's perceived performance.
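The streaming path can be illustrated with a send-callback sketch; `stream_tokens` and the word-level tokenization are simplifications of what a real system does (it emits model tokens over the WebSocket as they are generated):

```python
def stream_tokens(reply: str, send) -> int:
    """Push the reply to the client one token at a time instead of one big payload."""
    tokens = reply.split()        # stand-in for model tokens
    for tok in tokens:
        send(tok)                 # each call maps to one WebSocket frame
    send("[DONE]")                # sentinel so the client knows the stream ended
    return len(tokens)
```

Because the first frame arrives as soon as the first token is generated, perceived latency is governed by Time to First Token rather than total generation time.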


Compute: The GPU Fleet

The requirement for 1,000 NVIDIA B200 GPUs is derived from Little's Law ($L = \lambda W$):

Capacity Planning via Little's Law

Arrival Rate ($\lambda$): 10,000 requests per second
Wait Time ($W$): 10-second average response duration
Concurrency ($L = \lambda W$): 100,000 active concurrent streams
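The arithmetic above, worked through:

```python
# Little's Law: L = λW, using the figures from the capacity plan.
arrival_rate = 10_000                      # λ, requests per second
wait_time = 10                             # W, seconds per response
concurrency = arrival_rate * wait_time     # L = 100,000 concurrent streams

streams_per_gpu = 100                      # derived from the KV-cache budget
gpus_needed = concurrency // streams_per_gpu   # 1,000 GPUs
```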

Hardware Optimizations

First-Principle Memory Scaling

A Llama-3 70B model in FP8 takes 70 GB for weights (~1 byte per parameter), leaving ~110 GB of VRAM for the KV Cache. Using PagedAttention, each cached token consumes ~1 MB (a conservative estimate), so the cache holds ~110K tokens; at roughly 1,000 tokens of context per stream, that supports ~100 concurrent users per GPU.

Key insight: 180 GB total VRAM − 70 GB weights = 110 GB for KV Cache → ~100 concurrent streams per GPU (about 1 GB of cache per stream) → 1,000 GPUs needed for 100K concurrent streams.
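The same budget in code; note that the ~1,000-token average context per stream is an assumption, not a figure from the capacity plan:

```python
total_vram_gb = 180
weights_gb = 70                                  # 70B params at FP8, ~1 byte each
kv_cache_gb = total_vram_gb - weights_gb         # 110 GB left for the KV Cache

mb_per_token = 1                                 # conservative PagedAttention figure
cached_tokens = kv_cache_gb * 1024 // mb_per_token   # ~112K tokens fit in cache

tokens_per_stream = 1_000                        # assumed average context length
streams_per_gpu = cached_tokens // tokens_per_stream # ~112, i.e. ~100 streams/GPU
```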

Prefix Caching

To avoid recomputing attention over shared prompt prefixes, the Model Workers use Prefix Caching. Consistent Hashing (by chat_id) at the Load Balancer routes a user's messages to the same worker every time, maximizing KV Cache hits and reducing Time to First Token (TTFT).
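A minimal consistent-hash ring keyed by chat_id; the virtual-node count and worker names are illustrative:

```python
import bisect
import hashlib

class Ring:
    """Consistent-hash ring: each chat_id deterministically maps to one worker."""

    def __init__(self, workers, vnodes: int = 100):
        # Each worker gets `vnodes` points on the ring to smooth the load.
        self.ring = sorted(
            (self._hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def worker_for(self, chat_id: str) -> str:
        # First ring point clockwise from the chat_id's hash position.
        i = bisect.bisect(self.points, self._hash(chat_id)) % len(self.points)
        return self.ring[i][1]
```

Because the mapping depends only on chat_id and the live worker set, removing a failed worker remaps only the chats that hashed to it; every other chat keeps its warm KV Cache.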


Storage Architecture

A multi-tiered storage strategy handles a projected 10 PB of data over 5 years.

Source of Truth

Postgres

Stores relational data like user profiles and billing tiers. To handle 100M users, it is sharded by user_id and utilizes Primary-Replica sets for high availability and read scaling.
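Shard routing can be as simple as hashing the user_id. In this sketch the shard count of 64 and the shard naming are illustrative assumptions:

```python
NUM_SHARDS = 64   # assumed; real systems size this for headroom and resharding

def shard_for(user_id: int) -> str:
    """Route every row for a given user to one Postgres shard (and its replicas)."""
    return f"pg-shard-{user_id % NUM_SHARDS}"
```

Reads can then fan out to that shard's replicas while writes go to its primary.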

Hot Store

DynamoDB

Stores the actual message history; chosen for its horizontal scalability. The estimated 10 PB of raw data grows to ~30 PB with 3x replication, which translates to roughly 300 physical machines at ~100 TB each.
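Back-of-the-envelope machine count; the ~100 TB of usable disk per node is an assumption consistent with the figure above:

```python
raw_pb = 10
replication = 3
stored_pb = raw_pb * replication                 # 30 PB actually written to disk

tb_per_machine = 100                             # assumed usable disk per node
machines = stored_pb * 1_000 // tb_per_machine   # 30,000 TB / 100 TB = 300 machines
```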

Cold Store

S3

To manage costs, history older than 1 year is offloaded from DynamoDB to S3 object storage.

Speed Layer

Redis

Tracks active session IDs, idempotency keys, and temporary token quotas.


Reliability and Fault Tolerance

Self-Healing DBs

If a Postgres Primary fails, an automated Failover promotes a replica to primary.

Blast Radius Reduction

Sharding ensures that a failure in one database partition only affects a small subset of users.
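With the 100M-user base and an assumed shard count, the blast radius is easy to quantify (the 64-shard figure is illustrative):

```python
total_users = 100_000_000              # from the storage section
num_shards = 64                        # assumed shard count
affected = total_users // num_shards   # users impacted if one shard fails
fraction = affected / total_users      # ≈ 1.6% of the user base
```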

Stateless Retries

If a Model Worker node fails, Consistent Hashing simply remaps its chat_ids to healthy nodes. The first request after remapping is a KV Cache miss, but the system remains operational.