1. Overview
This case study presents a design for a fully distributed, highly available, and linearly scalable key-value store, modeled directly on Amazon's Dynamo (the foundational design behind DynamoDB, Riak, Cassandra, and Voldemort).
The core philosophy is radical decentralization: no single point of failure, no master node, and no special roles. Every node is identical and can act as a coordinator for any request. The system deliberately sacrifices strict consistency in favor of availability and partition tolerance (AP in the CAP theorem), providing eventual consistency by default while allowing clients to tune consistency strength per operation via quorum parameters (R, W, N).
This design has proven itself at extreme scale: Amazon has used it for shopping carts, session stores, and product catalogs, and Dynamo-style stores now back a wide range of high-availability services that need single-digit-millisecond latency at millions of requests per second.
2. Functional Requirements
- Core API:
- Put(key, value, context?) → success/version
- Get(key) → value(s)/version(s)
- Value size: Up to 1 MB (sufficient for most use cases; larger objects go to S3/Blob).
- Tunable consistency: Clients specify R (read quorum), W (write quorum), and N (replication factor) per request or per table; see the quorum sketch after this list.
- Common configurations:
- W=1, R=2, N=3 → high write availability, low latency
- W=2, R=2, N=3 → balanced
- W=N, R=N → strongest consistency (at the cost of availability); any configuration with R + W > N already guarantees that reads overlap the latest successful write
- Conflict detection and resolution: The system detects concurrent writes using vector clocks and returns all conflicting versions on read (the client resolves), or falls back to last-writer-wins by timestamp.
- Conditional writes: Put succeeds only if version matches (via context from previous Get).
- No complex queries: Pure key-value (no secondary indexes in core design; add as separate layer if needed).
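To make the R/W/N knobs concrete, here is a minimal Python sketch of quorum validation; the class and field names are illustrative, not taken from any particular client library.

```python
# Hypothetical quorum-configuration helper; names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class QuorumConfig:
    n: int  # replication factor
    r: int  # replicas that must answer a read
    w: int  # replicas that must acknowledge a write

    def __post_init__(self):
        if not (1 <= self.r <= self.n and 1 <= self.w <= self.n):
            raise ValueError("R and W must each be in [1, N]")

    @property
    def read_your_writes(self) -> bool:
        # R + W > N forces the read set and write set to share at least
        # one replica, so a quorum read always sees the latest quorum write.
        return self.r + self.w > self.n

# W=1, R=2, N=3: fast, highly available writes, but R + W = N, so a read
# can miss the newest write until replicas converge.
print(QuorumConfig(n=3, r=2, w=1).read_your_writes)  # False
print(QuorumConfig(n=3, r=2, w=2).read_your_writes)  # True
```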
3. Non-Functional Requirements
- Latency: p99 ≤ 10 ms under normal conditions with quorum reads and writes (R = W = 2).
- Throughput: 100K+ ops/sec achievable per modern node, though the capacity plan in section 4 conservatively assumes 10K–20K; cluster-wide 10M+ QPS.
- Availability: 99.999% (“five 9s”) — survive multiple node failures and entire data center outages.
- Durability: No acknowledged write is lost with N=3, W ≥ 2, and WAL-backed acknowledgments.
- Scalability: Add/remove nodes with <1% data movement (thanks to virtual nodes); automatic load rebalancing.
- Partition tolerance: System remains operational during arbitrary network partitions.
- Operational simplicity: Zero-downtime scaling, rolling upgrades, automatic failure handling.
4. Capacity Estimation (Back-of-the-Envelope)
- Target workload: 10 million QPS peak, 5:1 read:write ratio.
- Per node: 10K–20K QPS (SSD-bound) → ~500–1,000 nodes for 10M QPS.
- Storage: 1 billion active keys × 1 KB avg value × N=3 replication → ~3 TB raw + overhead → ~5–6 TB.
- With 10 TB of SSD per node, storage is trivially covered; the ~600-node estimate is throughput-driven and leaves comfortable headroom.
- Network: 10M QPS × 1 KB payload → ~10 GB/s ingress cluster-wide → easily handled with 10 Gbps+ NICs and data center fabrics.
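These back-of-the-envelope figures are easy to re-derive; the few lines below reproduce the node, storage, and network numbers.

```python
# Reproducing the capacity estimates above.
peak_qps = 10_000_000
print([peak_qps // q for q in (10_000, 20_000)])  # [1000, 500] -> ~500-1,000 nodes

keys, avg_value_bytes, n = 1_000_000_000, 1_024, 3
raw = keys * avg_value_bytes * n
print(f"{raw / 1e12:.1f} TB raw")                 # ~3.1 TB; ~5-6 TB with overhead

ingress = peak_qps * avg_value_bytes              # bytes/sec cluster-wide
print(f"{ingress / 1e9:.1f} GB/s")                # ~10.2 GB/s
```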
5. High-Level Architecture
All nodes are identical peers arranged on a consistent-hash ring. A client request lands on any node (via a partition-aware client library or a plain load balancer); that node becomes the coordinator, determines the key's preference list, and fans the request out to the N responsible replicas. Membership, failure detection, and repair run peer-to-peer via gossip (sections 6.5 and 6.6); there are no masters, configuration servers, or special roles.
6. Core Design Decisions (Deep Dive)
6.1 Partitioning – Consistent Hashing with Virtual Nodes
- Keys hashed with MD5 (128-bit) or SHA-256 (256-bit) onto a fixed circular token space.
- Each physical node owns 100–400 virtual nodes (vnodes) evenly distributed around the ring.
- Benefits:
- Load balancing: heterogeneous hardware is handled naturally (bigger machines simply own more vnodes).
- Minimal data movement on node add/remove: only ~1/#nodes of the data moves, drawn evenly from across the ring.
- Smooth scaling: a new node takes a small, equal slice of load from every existing node instead of splitting one neighbor's range (see the ring sketch below).
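A minimal sketch of the ring, assuming MD5 tokens and 128 vnodes per node (both illustrative choices); preference_list returns the first N distinct physical nodes clockwise from the key's position.

```python
# Toy consistent-hash ring with virtual nodes; the token derivation
# ("node:i") and VNODES count are illustrative, not a normative format.
import bisect
import hashlib

VNODES = 128  # virtual nodes per physical node

def _token(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)  # 128-bit ring position

class Ring:
    def __init__(self, nodes):
        self._ring = sorted(
            (_token(f"{node}:{i}"), node)
            for node in nodes for i in range(VNODES)
        )
        self._tokens = [t for t, _ in self._ring]

    def preference_list(self, key: str, n: int):
        """First n distinct physical nodes clockwise from the key."""
        start = bisect.bisect_right(self._tokens, _token(key))
        owners, seen = [], set()
        for i in range(len(self._ring)):
            node = self._ring[(start + i) % len(self._ring)][1]
            if node not in seen:
                seen.add(node)
                owners.append(node)
                if len(owners) == n:
                    break
        return owners

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.preference_list("cart:12345", n=3))  # three distinct owners
```

Removing a node simply deletes its vnode entries; every remaining node picks up a small, even share of its ranges, which is exactly the <1% movement property claimed in section 3.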
6.2 Replication – Sloppy Quorum + Hinted Handoff
- Preference list = next N healthy nodes clockwise from key’s position.
- Coordinator sends the write to the top N nodes in the preference list and returns success after W acknowledgments.
- If a node is down → coordinator writes to the next healthy node and records a “hint” (hinted handoff).
- When failed node recovers, hints are replayed.
- This keeps writes available as long as any W reachable nodes exist anywhere in the cluster, at the cost of temporary inconsistency (a sketch follows).
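A sketch of the write path under failure; routing, health checks, and the replica RPC (preference_list, is_healthy, send_put) are stubbed callables, and substitute nodes are assumed healthy for brevity.

```python
# Sloppy-quorum write with hinted handoff (sketch; all I/O is stubbed).
def sloppy_quorum_put(key, value, preference_list, is_healthy, send_put, w, n):
    """Send toward the top N replicas; when one is down, divert the write
    to the next node past the top N with a hint naming the intended owner
    (replayed once gossip reports the owner alive). Succeeds once at
    least W replicas acknowledge."""
    candidates = preference_list(key, 2 * n)   # extras serve as substitutes
    substitutes = iter(candidates[n:])
    acks = 0
    for target in candidates[:n]:
        if is_healthy(target):
            send_put(target, key, value, hint_for=None)
            acks += 1
        else:
            stand_in = next(substitutes, None)
            if stand_in is not None:           # assume substitutes are healthy
                send_put(stand_in, key, value, hint_for=target)
                acks += 1
    return acks >= w

down = {"node-b"}
ok = sloppy_quorum_put(
    "cart:12345", b"v1",
    preference_list=lambda k, n: ["node-a", "node-b", "node-c", "node-d"][:n],
    is_healthy=lambda node: node not in down,
    send_put=lambda node, k, v, hint_for: print(node, "stores", k, "hint:", hint_for),
    w=2, n=3,
)
print(ok)  # True: node-d stored the write on node-b's behalf
```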
6.3 Consistency & Conflict Resolution – Vector Clocks
Every version of a key carries a vector clock: a list of (node, counter) pairs. One version supersedes another only when its clock dominates on every entry; otherwise the versions are concurrent. Example:
- Client A writes v1 → Node1 creates vector clock [(Node1,1)]
- Client B reads v1, writes v2 → [(Node1,1), (Node2,1)]
- Client C reads v1, writes v3 → [(Node1,1), (Node3,1)]
- Next read returns both v2 and v3 (incomparable clocks) → client merges and writes v4 [(Node1,1), (Node2,1), (Node3,1)]
This preserves causal history without requiring global coordination.
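Vector clocks need only a few lines; the sketch below models them as {node: counter} dicts and replays the example above.

```python
# Minimal vector clocks as {node_id: counter} dicts.
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen everything clock b has (a supersedes b)."""
    return all(a.get(node, 0) >= c for node, c in b.items())

def concurrent(a: dict, b: dict) -> bool:
    return not descends(a, b) and not descends(b, a)

def merge(a: dict, b: dict) -> dict:
    """Pointwise max: the clock of a version that supersedes both inputs."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

v2 = {"Node1": 1, "Node2": 1}    # Client B's write, coordinated by Node2
v3 = {"Node1": 1, "Node3": 1}    # Client C's write, coordinated by Node3
assert concurrent(v2, v3)         # incomparable -> both returned to the reader
print(merge(v2, v3))              # v4: {'Node1': 1, 'Node2': 1, 'Node3': 1}
```

(In a real system the client's merged write would also bump the coordinating node's counter; the merge here matches the v4 shown in the example.)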
6.4 Storage Engine – Log-Structured Merge Tree (RocksDB, LevelDB, Cassandra SSTables)
- Writes → Memtable (in-memory skip list) + WAL (durability).
- Memtable flush → immutable SSTable on disk.
- Background compaction merges SSTables → bounds read amplification and reclaims space, at the cost of some extra write amplification.
- Bloom filters on SSTables → fast negative lookups.
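A toy version of this write path, assuming JSON-lines files for both the WAL and the flushed "SSTable"; real engines such as RocksDB add binary block formats, bloom filters, and background compaction, all omitted here.

```python
# Toy LSM write path: WAL append -> memtable insert -> sorted flush.
import json
import os

class LsmWritePath:
    def __init__(self, wal_path: str, flush_threshold: int = 4):
        self.wal = open(wal_path, "a", buffering=1)  # line-buffered append
        self.memtable = {}                           # stand-in for a skip list
        self.flush_threshold = flush_threshold
        self.sstable_seq = 0

    def put(self, key: str, value: str) -> None:
        # 1. WAL first: once fsynced, the write survives a crash.
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        os.fsync(self.wal.fileno())
        # 2. Memtable: absorbs writes in memory, sorted at flush time.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self) -> None:
        # 3. Emit keys in sorted order as an immutable on-disk segment.
        path = f"sstable-{self.sstable_seq:06d}.jsonl"
        with open(path, "w") as f:
            for k in sorted(self.memtable):
                f.write(json.dumps({"k": k, "v": self.memtable[k]}) + "\n")
        self.sstable_seq += 1
        self.memtable.clear()  # the WAL can also be truncated at this point
```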
6.5 Failure Detection – Gossip Protocol + Φ Accrual Detector
- Every second, each node gossips with 2–3 random peers, exchanging liveness ("I am alive"), its membership view, and heartbeat counters.
- Φ accrual detector outputs suspicion level (not binary alive/dead) → adaptive to network conditions.
- Seed nodes (configurable) help bootstrap new nodes.
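A sketch of the Φ calculation under the common exponential-distribution simplification (the approach Cassandra's detector takes); the window size and demo threshold are illustrative.

```python
# Phi accrual failure detector (exponential-distribution simplification).
import math
from collections import deque

class PhiAccrual:
    def __init__(self, window: int = 100):
        self.intervals = deque(maxlen=window)  # recent heartbeat gaps
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level: phi = 8 means roughly a 1-in-10^8 chance the
        node is alive but merely late, given its heartbeat history."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # P(next heartbeat later than elapsed) = exp(-elapsed/mean), so
        # phi = -log10(exp(-elapsed/mean)) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))

detector = PhiAccrual()
for t in (0.0, 1.0, 2.0, 3.1, 4.0):   # ~1 s gossip cadence
    detector.heartbeat(t)
print(detector.phi(now=25.0))          # ~9.1, past a typical threshold of 8
```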
6.6 Anti-Entropy & Repair
- Each node maintains a Merkle tree for each key range it owns.
- Background process: nodes exchange Merkle root hashes → if mismatch, traverse tree to find differing keys → sync.
- Read repair: on every Get, if versions differ across replicas, coordinator writes latest version to lagging nodes.
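A toy Merkle comparison, bucketing keys into eight leaf ranges (an illustrative choice): replicas exchange the root first and, only on mismatch, compare leaves to find the divergent ranges.

```python
# Toy Merkle-tree diff for anti-entropy between two replicas.
import hashlib

def _h(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

def leaf_hashes(kv: dict, buckets: int = 8) -> list:
    """Hash each bucket of keys; a bucket stands in for a key sub-range."""
    groups = [[] for _ in range(buckets)]
    for k in sorted(kv):
        groups[int(_h(k), 16) % buckets].append(f"{k}={kv[k]}")
    return [_h("|".join(g)) for g in groups]

def merkle_root(leaves: list) -> str:
    level = leaves
    while len(level) > 1:  # pairwise hash upward to a single root
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def stale_buckets(a: dict, b: dict) -> list:
    """Leaf ranges needing sync; a cheap no-op when the roots match."""
    la, lb = leaf_hashes(a), leaf_hashes(b)
    if merkle_root(la) == merkle_root(lb):
        return []  # replicas agree after exchanging just two root hashes
    return [i for i, (x, y) in enumerate(zip(la, lb)) if x != y]

replica_a = {"user:1": "alice", "user:2": "bob"}
replica_b = {"user:1": "alice", "user:2": "bobby"}  # one drifted value
print(stale_buckets(replica_a, replica_b))           # index of the stale bucket
```

Only the keys inside mismatched buckets are shipped between replicas, which keeps anti-entropy bandwidth proportional to the divergence rather than to the dataset.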
7. Detailed Operation Flow
Put(key, value, context):
- Client sends the request to any node, which acts as coordinator (routed via a partition-aware client or a load balancer).
- Coordinator increments its own entry in the vector clock carried by the context.
- Sends the Put to the top N nodes in the preference list.
- Waits for W acknowledgments.
- Returns success plus the new version (context) to the client.
Get(key):
- Coordinator sends Get to top R nodes.
- Collects versions + vector clocks.
- If multiple conflicting versions → return all (client resolves).
- Performs read repair in background.
- Returns resolved value(s).
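The two flows fit in a short in-process sketch: replicas are plain dicts, routing is a stubbed callable, _descends repeats the vector-clock rule from section 6.3, and read repair is omitted.

```python
# Toy coordinator for the Put/Get flows above; all replica I/O is in-process.
def _descends(a: dict, b: dict) -> bool:
    return all(a.get(node, 0) >= c for node, c in b.items())

def coordinator_put(key, value, context, route, replicas, me, w, n):
    """Bump the coordinator's entry in the client-supplied clock, send to
    the top N replicas, succeed once at least W acknowledge (every send
    acks immediately here since replicas are local dicts)."""
    clock = dict(context or {})
    clock[me] = clock.get(me, 0) + 1
    acks = 0
    for node in route(key, n):
        replicas[node][key] = (value, clock)  # stand-in for an async RPC
        acks += 1
    if acks < w:
        raise RuntimeError("quorum not met")
    return clock  # handed back to the client as the new version/context

def coordinator_get(key, route, replicas, r, n):
    """Read from the top R replicas; return every version not superseded
    by another returned version (the client resolves conflicts)."""
    versions = []
    for node in route(key, n)[:r]:
        stored = replicas[node].get(key)      # (value, clock) or None
        if stored is not None and stored not in versions:
            versions.append(stored)
    return [v for v in versions
            if not any(_descends(o[1], v[1]) and o[1] != v[1] for o in versions)]

nodes = ["node-a", "node-b", "node-c"]
replicas = {node: {} for node in nodes}
route = lambda key, n: nodes[:n]              # fixed routing for the demo
ctx = coordinator_put("cart:1", "v1", None, route, replicas, me="node-a", w=2, n=3)
print(coordinator_get("cart:1", route, replicas, r=2, n=3))  # [('v1', {'node-a': 1})]
```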
8. Scaling, Trade-offs, and Operational Characteristics
| Aspect | Technique Used | Advantage | Trade-off / Cost |
|---|---|---|---|
| Partitioning | Consistent hashing + vnodes | Excellent load balance, minimal movement | Slightly higher ring metadata |
| Availability | Sloppy quorum + hinted handoff | Writes never fail | Temporary inconsistency during failures |
| Consistency | Vector clocks + client resolution | Causal consistency without coordination | Client complexity in conflict resolution |
| Durability | WAL + replication | Survives crashes | Write amplification |
| Repair | Merkle trees + read repair | Fast, incremental reconciliation | Background bandwidth/CPU usage |
| Strong consistency | R + W > N | Up-to-date reads when needed | Higher latency, lower availability |
9. Real-World Performance (DynamoDB 2025 numbers)
- Trillions of requests per day
- Peak >100M QPS
- p99 latency ~5–15 ms globally
- Automatic scaling from 1 to 100K+ capacity units without downtime
This architecture remains the gold standard for any system requiring extreme availability and operational simplicity at global scale. Modern implementations (DynamoDB, Cassandra 4+, ScyllaDB) have added features like secondary indexes, materialized views, and change data capture, but the core Dynamo principles remain unchanged and battle-tested for nearly two decades.