System Design & Scalability: Patterns That Actually Work
Scalability is not about handling millions of requests on day one. It is about designing systems that can grow without requiring a rewrite. The best architectures make scaling a configuration change, not an engineering project.
Scaling Pattern Taxonomy
Scalability Patterns
├── Compute Scaling
│   ├── Horizontal (add instances)
│   ├── Vertical (bigger instances)
│   └── Serverless (per-request)
├── Data Scaling
│   ├── Read replicas
│   ├── Sharding (horizontal partitioning)
│   ├── Partitioning (vertical / functional)
│   └── Polyglot persistence
├── Network / Traffic
│   ├── Load balancing (L4 / L7)
│   ├── CDN / Edge caching
│   ├── API gateway throttling
│   └── Geographic routing
└── Application-Level
    ├── Caching (multi-layer)
    ├── Async processing (queues)
    ├── CQRS (read/write separation)
    ├── Circuit breakers
    └── Backpressure
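Of the application-level patterns above, the circuit breaker is the one most often hand-rolled. A minimal sketch of the idea (class name, thresholds, and states are illustrative, not taken from any particular library): after enough consecutive failures the breaker opens and fails fast; after a cooldown it allows one trial call through.

```python
import time

class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; retries after `reset_timeout` seconds."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the counter
        return result
```

The key design point is the fail-fast path: while open, the breaker rejects calls without touching the struggling dependency, which is what stops cascading failures.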
Horizontal vs Vertical Scaling
| Dimension | Horizontal (Scale Out) | Vertical (Scale Up) |
|---|---|---|
| Mechanism | Add more instances | Increase instance resources |
| Upper limit | Practically unlimited | Hardware ceiling |
| Cost curve | Roughly linear (pay per node) | Superlinear (premium hardware) |
| Complexity | Higher (distributed state) | Lower (single machine) |
| Downtime | Zero (rolling updates) | Often required (resize) |
| Data consistency | Requires coordination | Simpler (single instance) |
| Failure blast radius | One node | Entire system |
| When to use | Stateless services, web tier | Databases, in-memory workloads |
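The roughly linear cost curve is what makes horizontal capacity arithmetic simple: instances needed is just demand divided by per-instance capacity, with headroom so no node runs at 100%. A toy calculation (all figures are made up for illustration):

```python
import math

def instances_needed(peak_rps: float, rps_per_instance: float,
                     target_utilization: float = 0.7) -> int:
    """Size a horizontally scaled tier, keeping each node below the utilization target."""
    return math.ceil(peak_rps / (rps_per_instance * target_utilization))

# Example: 12,000 RPS peak, 500 RPS per instance, 70% target utilization
# → ceil(12000 / 350) = 35 instances
```

The same formula is effectively what autoscalers compute from a target-utilization metric; the headroom factor is what absorbs traffic spikes between scaling events.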
Multi-Layer Caching Architecture
┌───────────────┐
│    Client     │
└───────┬───────┘
        │
┌───────▼───────┐  Cache hit? → Return immediately
│  CDN / Edge   │  TTL: minutes to hours
│ (CloudFront)  │  Best for: static assets, public API responses
└───────┬───────┘
        │
┌───────▼───────┐  Cache hit? → Return immediately
│  API Gateway  │  TTL: seconds to minutes
│     Cache     │  Best for: authenticated but cacheable responses
└───────┬───────┘
        │
┌───────▼───────┐  Cache hit? → Return immediately
│  Application  │  TTL: seconds to minutes
│ Cache (Redis) │  Best for: session data, computed results, rate limits
└───────┬───────┘
        │
┌───────▼───────┐  Cache hit? → Return immediately
│   Database    │  TTL: managed by DB engine
│  Query Cache  │  Best for: repeated complex queries
└───────┬───────┘
        │
┌───────▼───────┐
│   Database    │  Source of truth
│   (Primary)   │
└───────────────┘
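Each layer in this diagram runs the same cache-aside loop: check for a hit, on a miss fetch from the layer below, store the result with a TTL. A sketch of that loop at the application layer (the dict-backed `TTLCache` is a stand-in for Redis, and the function names are illustrative):

```python
import time

class TTLCache:
    """In-memory stand-in for a Redis-style cache with per-key expiry."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # lazy eviction on read
            return None
        return value

    def set(self, key, value, ttl: float):
        self._store[key] = (value, time.monotonic() + ttl)

def get_or_load(cache: TTLCache, key, loader, ttl: float = 30.0):
    """Cache-aside: return a hit, otherwise load from the layer below and store it."""
    value = cache.get(key)
    if value is None:
        value = loader(key)  # e.g. a DB query or a call to the next layer down
        cache.set(key, value, ttl)
    return value
```

In production the real Redis `GET`/`SET key value EX ttl` commands replace the dict, but the control flow is the same at every layer.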
Load Balancing Strategy Comparison
| Strategy | Algorithm | Best For | Trade-off |
|---|---|---|---|
| Round Robin | Sequential distribution | Homogeneous instances | Ignores load differences |
| Least Connections | Route to least busy | Varying request durations | Slightly more overhead |
| Weighted | Proportional to capacity | Mixed instance sizes | Requires manual config |
| IP Hash | Consistent per client | Sticky sessions needed | Uneven distribution risk |
| Least Response Time | Route to fastest | Latency-sensitive apps | Requires health monitoring |
| Random | Random selection | Large homogeneous pools | Simple, surprisingly effective |
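The two most common strategies in the table, round robin and least connections, each fit in a few lines. A sketch of both (illustrative only, not a production balancer; real ones also track health checks and weights):

```python
import itertools

class RoundRobin:
    """Sequential distribution across a fixed pool."""

    def __init__(self, instances):
        self._cycle = itertools.cycle(instances)

    def pick(self):
        return next(self._cycle)

class LeastConnections:
    """Route to the instance with the fewest in-flight requests."""

    def __init__(self, instances):
        self.active = {instance: 0 for instance in instances}

    def pick(self):
        instance = min(self.active, key=self.active.get)
        self.active[instance] += 1  # caller must release() when the request finishes
        return instance

    def release(self, instance):
        self.active[instance] -= 1
```

The difference in bookkeeping explains the trade-off column: round robin is stateless per request, while least connections must track every in-flight request, which is the "slightly more overhead" the table mentions.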
Capacity Planning Checklist
| Phase | Action | Tool / Method |
|---|---|---|
| Measure | Baseline current throughput (RPS, P99 latency) | Load testing (k6, Locust) |
| Model | Define growth projections (3mo, 6mo, 12mo) | Business metrics + historical data |
| Identify | Find the bottleneck (CPU, memory, I/O, network) | Profiling, APM (Datadog, Grafana) |
| Test | Load test at 2x projected peak | Staged load tests in staging |
| Plan | Define scaling triggers and thresholds | HPA metrics, CloudWatch alarms |
| Budget | Estimate cost at projected scale | Cloud pricing calculators, FinOps |
| Review | Monthly capacity review against actuals | Dashboard + alert review |
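The Model and Test rows combine into one piece of arithmetic: compound current peak load forward by the projected growth rate, then load test at twice that figure. A sketch of the calculation (growth rate and traffic numbers are made up for illustration):

```python
def projected_peak_rps(current_peak: float, monthly_growth: float, months: int) -> float:
    """Compound the current peak forward by a monthly growth rate."""
    return current_peak * (1 + monthly_growth) ** months

def load_test_target(current_peak: float, monthly_growth: float, months: int) -> float:
    """Test at 2x the projected peak, per the checklist above."""
    return 2 * projected_peak_rps(current_peak, monthly_growth, months)

# Example: 1,000 RPS today, 10% monthly growth, 6-month horizon
# → projected peak ≈ 1,772 RPS, so load test at ≈ 3,543 RPS
```

Running this for each horizon in the Model row (3, 6, 12 months) gives the concrete targets the Test and Budget rows need.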
Key Principles
Design for 10x, build for 3x. The architecture should be able to handle 10x current load on paper; the infrastructure you actually provision should handle 3x. This avoids over-engineering while keeping a growth path clear.
Statelessness is the foundation. Every scaling pattern becomes easier when services are stateless. Move session state to Redis, file uploads to object storage, and persistent data to managed databases.
Cache invalidation is the hard part. Adding a cache is easy. Knowing when to invalidate it is the real engineering challenge. Prefer short TTLs over complex invalidation logic when starting out.