1. Overview
A distributed rate limiter is a critical infrastructure component that enforces per-entity request quotas across a fleet of stateless application servers (e.g., microservices behind a load balancer). In a single-node system, rate limiting is trivial (in-memory counter), but in distributed environments, requests for the same entity (user, IP, API key) can hit different nodes, leading to inaccurate counting if not synchronized.
The system must solve the distributed counting problem while adding negligible latency. Modern designs (Cloudflare, Stripe, AWS API Gateway as of 2025) overwhelmingly favor a hybrid approach: a fast centralized store (Redis Cluster) for atomic updates + optional local caching for sub-millisecond decisions.
Core challenges:
- Atomicity: Prevent race conditions when two nodes increment the same counter simultaneously (see the sketch after this list).
- Accuracy vs Latency trade-off: Pure local limiting is fast but inaccurate; pure centralized is accurate but slower.
- Burst handling: Real users send traffic in bursts (page load → multiple API calls).
- Hot keys: Certain users/IPs generate extreme traffic (e.g., crawlers).
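To make the atomicity challenge concrete, here is a minimal sketch of the simplest safe primitive: a fixed-window counter built on Redis's atomic INCR, which avoids the lost-update race that a GET-then-SET sequence would have. It assumes the Jedis client and a reachable Redis instance; the key naming and limits are illustrative:

```java
import redis.clients.jedis.Jedis;

public class FixedWindowLimiter {
    // INCR is a single atomic operation in Redis, so two nodes incrementing
    // the same key concurrently can never lose an update.
    public static boolean allow(Jedis jedis, String entity, int limit, int windowSeconds) {
        // One key per entity per window, e.g. rl:ip:1.2.3.4:28471234
        String key = "rl:" + entity + ":" + (System.currentTimeMillis() / 1000 / windowSeconds);
        long count = jedis.incr(key);
        if (count == 1) {
            jedis.expire(key, windowSeconds); // first hit in this window: set the TTL
        }
        return count <= limit;
    }

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            System.out.println(allow(jedis, "ip:1.2.3.4", 50, 60)); // true until 50 req/min
        }
    }
}
```

This is also exactly the Fixed Window algorithm from Section 2, boundary-burst weakness included; the token bucket in Section 4 removes that weakness.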
2. Requirements
- Granularity levels (applied hierarchically):
- Global (e.g., 1M req/sec total)
- Per-endpoint (e.g., /payments → 10K req/sec)
- Per-user/API-key (e.g., 100 req/min)
- Per-IP (e.g., 50 req/min for unauthenticated)
- Algorithms:
- Token Bucket → default choice (bursty, simple)
- Sliding Window Counter → highest accuracy (used by Cloudflare; sketched after this list)
- Fixed Window → simplest, but vulnerable to boundary bursts (e.g., with a 100/min limit, 100 requests at 0:59 and 100 more at 1:01 yield 200 requests in about two seconds)
- Response headers (standardized):
- X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After on 429
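For the Sliding Window Counter listed above, here is a minimal single-node sketch of the two-counter weighting Cloudflare described: the previous window's count is scaled by how much of it still overlaps the trailing window. Class and field names are illustrative, and a distributed version would keep the per-window counters in Redis (e.g., INCR on per-minute keys) rather than in this map:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SlidingWindowCounter {
    private final Map<Long, Long> counts = new ConcurrentHashMap<>();
    private final long windowMillis;
    private final long limit;

    public SlidingWindowCounter(long windowMillis, long limit) {
        this.windowMillis = windowMillis;
        this.limit = limit;
    }

    public synchronized boolean allow() {
        long now = System.currentTimeMillis();
        long curr = now / windowMillis;            // current window index
        long prevCount = counts.getOrDefault(curr - 1, 0L);
        long currCount = counts.getOrDefault(curr, 0L);
        // Weight the previous window by its remaining overlap with the trailing
        // window ending now: this smooths the fixed-window boundary burst.
        double overlap = 1.0 - (double) (now % windowMillis) / windowMillis;
        double estimated = prevCount * overlap + currCount;
        if (estimated >= limit) return false;
        counts.merge(curr, 1L, Long::sum);
        counts.keySet().removeIf(w -> w < curr - 1); // drop stale windows
        return true;
    }
}
```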
3. Recommended Architecture (2025 Production Standard)
Client → load balancer → stateless app servers → Redis Cluster. Each app server embeds the rate limiter as a library:
- Authoritative state lives in Redis Cluster, updated atomically via Lua scripts (Section 4).
- Each instance keeps a short-lived local cache (Section 5), so most decisions never leave the process.
- Limits are applied hierarchically across global, endpoint, user, and IP keys (Section 6).
- If Redis is unreachable, instances degrade to local-only limiting rather than failing all requests outright.
4. Deep Dive: Token Bucket Implementation in Redis (Most Common Pattern)
Example Scenario
We want to limit each authenticated user to 100 requests per minute, with a burst allowance of 200 requests (i.e., after being idle for 2+ minutes, a user can send 200 requests immediately).
Parameters:
- Capacity (burst) = 200 tokens
- Refill rate = 100 tokens per 60 seconds ≈ 1.6667 tokens/second
Redis Data Structure per User
Key: rl:user:12345 (hash)
Fields:
- tokens → current tokens (double)
- last_refill → unix timestamp of last update
Exact Lua Script (Production-Ready 2025 Version)

```lua
-- KEYS[1] = rate limit key
-- ARGV[1] = capacity (200)
-- ARGV[2] = tokens per second (1.6667)
-- ARGV[3] = current unix timestamp (ms precision)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3]) / 1000 -- convert ms to seconds

local data = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = data[1] and tonumber(data[1]) or capacity
local last_refill = data[2] and tonumber(data[2]) or now

-- Refill based on elapsed time (clamped at 0: app-server clocks can skew slightly)
local elapsed = math.max(0, now - last_refill)
local new_tokens = math.min(capacity, tokens + elapsed * rate)

if new_tokens >= 1 then
  new_tokens = new_tokens - 1
  redis.call("HSET", key, "tokens", new_tokens, "last_refill", now)
  redis.call("EXPIRE", key, 86400) -- keep key for 24h
  -- Remaining tokens and time until the bucket is full again (for headers)
  local reset_in = math.ceil((capacity - new_tokens) / rate)
  return {1, math.floor(new_tokens), reset_in}
else
  -- Persist the refill so last_refill stays current and the TTL is refreshed
  redis.call("HSET", key, "tokens", new_tokens, "last_refill", now)
  redis.call("EXPIRE", key, 86400)
  -- Time until the next whole token is available (used for Retry-After)
  local retry_in = math.ceil((1 - new_tokens) / rate)
  return {0, 0, retry_in}
end
```

Request Flow Example (User 12345)
Time 0s: User idle → bucket full (200 tokens)
Time 0.1s: User sends 150 requests in a burst (page load)
- All 150 allowed instantly
- Tokens left = 50
- Redis updates: tokens=50, last_refill=0.1
Time 30s: User sends 80 more requests
- Since last refill: 30 × 1.6667 ≈ 50 tokens added
- Tokens before request = 50 + 50 = 100
- Consume 80 → allowed
- Tokens left = 20
Time 70s: User sends 50 requests
- Elapsed since last = 40s → +66.67 tokens
- Tokens before = 20 + 66.67 ≈ 86.67
- Consume 50 → allowed
- Tokens left ≈ 36.67
Time 75s: User sends 50 requests
- Elapsed = 5s → +8.33 tokens
- Tokens before ≈ 45
- Only 45 allowed → 5 requests rejected (429)
- Response headers: X-RateLimit-Remaining: 0, Retry-After: 1 (the next whole token arrives in ~0.6 s at 1.6667 tokens/sec; the script returns the ceiling)
This perfectly handles bursts while enforcing the long-term 100/min average.
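On the application side, a minimal calling sketch, assuming the Jedis client (the class name, and loading the script as a plain string rather than registering it via SCRIPT LOAD / EVALSHA, are simplifications):

```java
import java.util.List;
import redis.clients.jedis.Jedis;

public class TokenBucketClient {
    // The Lua script shown above, loaded as a string.
    // Production code would register it once and call EVALSHA instead.
    static final String LUA_SCRIPT = "..."; // paste the full script from Section 4

    /** Returns {allowed (1/0), remaining tokens, reset/retry seconds}. */
    @SuppressWarnings("unchecked")
    public static long[] tryAcquire(Jedis jedis, String key, double capacity, double ratePerSec) {
        List<Long> r = (List<Long>) jedis.eval(
                LUA_SCRIPT,
                List.of(key),                                        // KEYS[1]
                List.of(Double.toString(capacity),                   // ARGV[1]
                        Double.toString(ratePerSec),                 // ARGV[2]
                        Long.toString(System.currentTimeMillis()))); // ARGV[3], ms
        return new long[]{r.get(0), r.get(1), r.get(2)};
    }
}
```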
5. Adding Local Cache (Sub-Millisecond Latency)
Each application instance maintains a Caffeine cache:
```java
import java.util.concurrent.TimeUnit;
import com.github.benmanes.caffeine.cache.*;

// BucketState (application-defined) holds the token count plus its fetch timestamp.
Cache<String, BucketState> localCache = Caffeine.newBuilder()
    .expireAfterWrite(2, TimeUnit.SECONDS) // hard ceiling on staleness
    .build();
```

Flow:
- Check local cache → if hit and <1s old → use it (99.9% of cases)
- On miss → Redis Lua → update local cache with new state
Result: Median latency ~0.1 ms, p99 ~2 ms even at 5M QPS.
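Putting the two layers together, a sketch of the decision path (HybridLimiter, BucketState, and redisTryAcquire are illustrative names; redisTryAcquire stands in for the Lua call shown in Section 4):

```java
import java.util.concurrent.TimeUnit;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class HybridLimiter {
    // Last-seen bucket state plus the time we fetched it from Redis.
    record BucketState(long tokens, long fetchedAtMillis) {}

    private final Cache<String, BucketState> localCache =
            Caffeine.newBuilder().expireAfterWrite(2, TimeUnit.SECONDS).build();

    public boolean allow(String key) {
        BucketState s = localCache.getIfPresent(key);
        long now = System.currentTimeMillis();
        // Fresh (<1 s) local state with tokens to spare: decide without a network hop.
        if (s != null && now - s.fetchedAtMillis() < 1000 && s.tokens() >= 1) {
            localCache.put(key, new BucketState(s.tokens() - 1, s.fetchedAtMillis()));
            return true;
        }
        // Miss or stale: authoritative atomic check in Redis (Lua script, Section 4).
        long[] r = redisTryAcquire(key); // {allowed, remaining, reset/retry}
        localCache.put(key, new BucketState(r[1], now));
        return r[0] == 1;
    }

    // Stand-in for TokenBucketClient.tryAcquire from Section 4.
    private long[] redisTryAcquire(String key) {
        return new long[]{1, 99, 60};
    }
}
```

The accuracy/latency trade-off from Section 1 is visible here: tokens consumed against the cached copy are not reflected in Redis until the next refresh, so the shorter the cache lifetime, the tighter the accuracy.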
6. Hierarchical Limiting Example
Apply rules in order (first failure blocks):
- Global: 1M req/sec → key rl:global
- Endpoint /payments: 10K req/sec → key rl:endpoint:/payments
- User 12345: 100 req/min → key rl:user:12345
- IP 1.2.3.4: 50 req/min → key rl:ip:1.2.3.4
Execute the four Lua calls concurrently (pipelined where the client supports it; in Redis Cluster these keys typically hash to different slots, so parallel async calls are the portable approach) → first rejection wins. A sketch follows.
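A sketch of the hierarchical check, reusing the hypothetical TokenBucketClient.tryAcquire from Section 4 (called sequentially here for readability, where a production client would issue the four calls in parallel; capacities and rates are illustrative):

```java
import java.util.List;
import redis.clients.jedis.Jedis;

public class HierarchicalLimiter {
    // One rule per level: key, bucket capacity, refill rate in tokens/second.
    record Rule(String key, double capacity, double ratePerSec) {}

    public static boolean allow(Jedis jedis, String endpoint, String userId, String ip) {
        List<Rule> rules = List.of(
                new Rule("rl:global", 1_000_000, 1_000_000),         // 1M req/sec
                new Rule("rl:endpoint:" + endpoint, 10_000, 10_000), // 10K req/sec
                new Rule("rl:user:" + userId, 200, 100.0 / 60),      // 100/min, burst 200
                new Rule("rl:ip:" + ip, 50, 50.0 / 60));             // 50/min
        for (Rule r : rules) {
            long[] res = TokenBucketClient.tryAcquire(jedis, r.key(), r.capacity(), r.ratePerSec());
            if (res[0] == 0) {
                return false; // first rejection blocks the request
            }
        }
        return true;
    }
}
```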
7. Real-World Deployment (2025)
| Company | Algorithm | Backend | Peak QPS Handled | Notes |
|---|---|---|---|---|
| Cloudflare | Sliding Window Counter | Kyoto Tycoon + edge | 100M+ | Edge computation |
| Stripe | Token Bucket + local cache | Redis Cluster | 500K+ | Famous blog post |
| AWS API Gateway | Token/Leaky Bucket | Internal | Millions | Managed service |
| Shopify | Token Bucket | Redis | 100K+ | Ruby + Lua |
This architecture is the de facto standard in 2025 for high-throughput distributed systems: it delivers sub-millisecond decisions, near-perfect accuracy, and graceful degradation to local-only limiting if Redis becomes unavailable.