AI Compute Landscape: GPUs, TPUs, Custom Silicon, and the Economics of Scale
Compute is the binding constraint of modern AI. The choice of hardware -- and how it is provisioned -- determines what models can be trained, at what cost, and how quickly inference can be served. Understanding the hardware landscape is no longer optional for AI strategy.
Hardware Comparison Table
| Chip | Vendor | Type | Memory | Memory BW | FP16 TFLOPS | TDP | Primary Use |
|---|---|---|---|---|---|---|---|
| H100 SXM | NVIDIA | GPU | 80 GB HBM3 | 3.35 TB/s | 989 | 700W | Training + Inference |
| H200 SXM | NVIDIA | GPU | 141 GB HBM3e | 4.8 TB/s | 989 | 700W | Large model training |
| B200 | NVIDIA | GPU | 192 GB HBM3e | 8 TB/s | 2,250 | 1000W | Next-gen training |
| GB200 (Grace-Blackwell) | NVIDIA | CPU+GPU | 384 GB combined | 8 TB/s | 2,250 | 2700W (module) | Full-stack AI |
| TPU v5p | Google | ASIC | 95 GB HBM | 2.76 TB/s | ~459 | 250W | JAX/TF training |
| TPU v6 (Trillium) | Google | ASIC | 128 GB HBM | ~4 TB/s | ~900 (est.) | TBD | Next-gen training |
| Trainium2 | AWS | ASIC | 96 GB HBM | 3.2 TB/s | ~740 (est.) | 500W | Cost-optimized training |
| Inferentia2 | AWS | ASIC | 32 GB HBM | 820 GB/s | ~190 | 175W | Cost-optimized inference |
| Gaudi 3 | Intel | ASIC | 128 GB HBM2e | 3.7 TB/s | ~1,835 | 600W | Training + Inference |
| MI300X | AMD | GPU | 192 GB HBM3 | 5.3 TB/s | 1,307 | 750W | Training + Inference |
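One way to read the table above is through the roofline "balance point": peak FLOPS divided by memory bandwidth gives the arithmetic intensity (FLOPs per byte moved) a kernel needs before the chip is compute-bound rather than bandwidth-bound. A minimal sketch using the table's peak figures (real workloads rarely reach either peak, so treat these as upper bounds):

```python
# Roofline balance point for a few chips from the table above.
# Figures are the table's peak FP16 TFLOPS and memory bandwidth (TB/s).
chips = {
    "H100 SXM": (989, 3.35),
    "B200": (2250, 8.0),
    "MI300X": (1307, 5.3),
}

def balance_point(tflops: float, tbps: float) -> float:
    """FLOPs a kernel must do per byte moved to be compute-bound."""
    return (tflops * 1e12) / (tbps * 1e12)

for name, (tflops, tbps) in chips.items():
    print(f"{name}: ~{balance_point(tflops, tbps):.0f} FLOPs/byte")
```

The H100 lands near ~295 FLOPs/byte, which is why low-batch inference (low arithmetic intensity) is bandwidth-bound and benefits more from HBM speed than from raw TFLOPS.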
Cost-Per-Token Estimates (Inference, 2026)
Approximate Cost per Million Output Tokens (Llama-3 70B class)
===============================================================
Hardware $/M tokens Relative Cost
───────────── ────────── ─────────────
H100 (cloud) $0.80 1.0x (baseline)
H200 (cloud) $0.60 0.75x
B200 (cloud) $0.35 0.44x
TPU v5p $0.55 0.69x
Trainium2 $0.40 0.50x
Inferentia2 $0.25 0.31x (inference only)
MI300X $0.65 0.81x
Gaudi 3 $0.50 0.63x
Note: Estimates vary significantly by workload, batch size,
quantization level, and cloud provider pricing.
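The table's dollars-per-million-tokens figures fall out of two inputs: the hourly price of the hardware and its sustained token throughput. A quick sketch of the conversion (the 860 tok/s throughput below is a hypothetical placeholder chosen to reproduce the table's H100 baseline, not a benchmark):

```python
# Convert $/GPU-hour and sustained throughput into $/M output tokens.
def cost_per_million_tokens(price_per_hour: float,
                            tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1e6

# e.g. an H100 at $2.49/hr serving ~860 tok/s (hypothetical figure)
# lands near the table's $0.80/M baseline
print(f"${cost_per_million_tokens(2.49, 860):.2f}")  # $0.80
```

This also shows why quantization and batching move the table so much: both raise tokens_per_second without changing the hourly price.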
Compute Scaling Laws
Scaling Laws Relationship Diagram
===================================
Performance (loss) improves as power law of three factors:
Loss ~ f(Compute, Data, Parameters)
Compute (C) ──────┐
│
Data tokens (D) ───┼──► Loss = L(C, D, N)
│
Parameters (N) ────┘
Chinchilla-optimal ratio (Hoffmann et al. 2022):
D ≈ 20 × N (tokens ≈ 20x parameters)
Training compute: C ≈ 6 × N × D FLOPs
Model Size   Optimal Tokens   Compute (FLOPs)
──────────   ──────────────   ───────────────
1B           20B              1.2 × 10^20
7B           140B             5.9 × 10^21
70B          1.4T             5.9 × 10^23
400B         8T               1.9 × 10^25
1T           20T              1.2 × 10^26
Post-Chinchilla insight (Llama, Gemma):
Over-training smaller models on more data improves
inference efficiency at the cost of training efficiency.
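The Chinchilla rule of thumb above reduces to two lines of arithmetic: pick D ≈ 20N tokens, then estimate training compute with the standard approximation C ≈ 6ND FLOPs. A sketch:

```python
# Chinchilla-optimal token budget and training compute, using the
# D ≈ 20N rule and the standard C ≈ 6ND FLOPs approximation.
def chinchilla(params: float) -> tuple[float, float]:
    """Return (optimal training tokens, training FLOPs) for N parameters."""
    tokens = 20 * params
    flops = 6 * params * tokens
    return tokens, flops

tokens, flops = chinchilla(70e9)  # 70B parameters
print(f"{tokens:.1e} tokens, {flops:.1e} FLOPs")  # 1.4e+12 tokens, 5.9e+23 FLOPs
```

Dividing the FLOPs figure by a cluster's sustained throughput (peak TFLOPS times a realistic utilization of 30-50%) converts it into GPU-hours, and from there into dollars.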
Cloud GPU Pricing Comparison (On-Demand, per hour, 2026)
| GPU/Accelerator | AWS | GCP | Azure | CoreWeave | Lambda |
|---|---|---|---|---|---|
| H100 80GB | $12.00 (p5.48xlarge, per GPU) | $11.60 (a3-highgpu) | $11.56 | $2.49 | $2.49 |
| H200 | $15.00 (est.) | $14.20 (est.) | TBD | $3.29 | $3.49 |
| B200 | Not yet | Not yet | Not yet | $4.49 (est.) | TBD |
| A100 80GB | $5.12 (p4de) | $5.07 | $3.67 | $1.35 | $1.29 |
| L4 | $0.81 | $0.81 | $0.72 | N/A | N/A |
| TPU v5p | N/A | $4.20/chip | N/A | N/A | N/A |
| Trainium2 | $5.30 (trn2.48xlarge/16) | N/A | N/A | N/A | N/A |
| Gaudi 3 | Via DL AMI | N/A | Available | N/A | N/A |
Note: GPU cloud pricing is highly dynamic. Spot/preemptible pricing can be 50-70% lower. Reserved instances offer 30-50% discounts.
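The spread in the table compounds quickly at cluster scale. A sketch of total run cost under the on-demand rates above, with an optional spot discount; the 512-GPU, two-week run is a hypothetical workload, not a reference configuration:

```python
# Total cost of a training run: GPUs x hours x hourly rate,
# with an optional spot/preemptible discount applied.
def run_cost(gpus: int, hours: float, price_per_gpu_hour: float,
             spot_discount: float = 0.0) -> float:
    return gpus * hours * price_per_gpu_hour * (1 - spot_discount)

# Hypothetical: 512 H100s for two weeks, hyperscaler vs specialist cloud
hyperscaler = run_cost(512, 24 * 14, 12.00)
specialist = run_cost(512, 24 * 14, 2.49)
print(f"${hyperscaler:,.0f} vs ${specialist:,.0f}")  # $2,064,384 vs $428,360
```

A roughly 5x gap on the same silicon, before spot or reserved discounts, is why provider choice matters as much as chip choice.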
Training vs Inference: Different Hardware Strategies
Training vs Inference Requirements
====================================
Training Inference
──────── ─────────
Memory: Massive (model + Model weights +
optimizer + gradients) KV cache only
Compute: Sustained, high Bursty, latency-
throughput sensitive
Interconnect: Critical (multi-node Less critical
all-reduce) (single node often)
Precision: BF16/FP16, some FP32 INT8/INT4/FP8
(quantized)
Cost driver: GPU-hours × Tokens served ×
number of GPUs latency SLA
Optimal HW: H100/H200/B200/TPU v5p Inferentia2/L4/
(high bandwidth) quantized on B200
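The memory row above is the clearest divergence between the two regimes, and it can be made concrete with the usual rules of thumb: Adam-style mixed-precision training needs roughly 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance), while quantized inference needs about 1 byte per parameter at INT8 plus KV cache. A sketch under those assumptions (byte counts are conventions, not measurements, and exclude activations):

```python
# Rough GPU-memory estimates for the two regimes, in GB per billion params.
def training_mem_gb(params_b: float) -> float:
    # ~16 bytes/param: fp16 weights (2) + fp16 grads (2)
    # + fp32 master weights, momentum, variance (12)
    return params_b * 16

def inference_mem_gb(params_b: float, kv_cache_gb: float = 0.0) -> float:
    # INT8 weights (1 byte/param) + KV cache
    return params_b * 1 + kv_cache_gb

print(f"70B train: ~{training_mem_gb(70):.0f} GB")            # ~1120 GB
print(f"70B INT8 infer: ~{inference_mem_gb(70, 20):.0f} GB")  # ~90 GB
```

The roughly 12x gap is why a 70B model needs a multi-node cluster to train but fits on a single 192 GB MI300X or B200 for quantized inference.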
Strategic Considerations
The hardware landscape favors organizations that can plan their compute strategy across three horizons: immediate (which cloud GPUs to reserve now), medium-term (which custom silicon pipelines to evaluate), and long-term (which architectural bets -- such as inference-optimized chips or optical computing -- to track). The 10x cost differences between providers and chip generations make compute strategy a board-level decision.