
AI Compute Landscape: GPUs, TPUs, Custom Silicon, and the Economics of Scale

#artificial-intelligence#hardware#cloud#infrastructure

Compute is the binding constraint of modern AI. The choice of hardware -- and how it is provisioned -- determines what models can be trained, at what cost, and how quickly inference can be served. Understanding the hardware landscape is no longer optional for AI strategy.

Hardware Comparison Table

Chip                     Vendor  Type     Memory           Memory BW  FP16 TFLOPS  TDP              Primary Use
───────────────────────  ──────  ───────  ───────────────  ─────────  ───────────  ───────────────  ────────────────────────
H100 SXM                 NVIDIA  GPU      80 GB HBM3       3.35 TB/s  989          700W             Training + Inference
H200 SXM                 NVIDIA  GPU      141 GB HBM3e     4.8 TB/s   989          700W             Large model training
B200                     NVIDIA  GPU      192 GB HBM3e     8 TB/s     2,250        1000W            Next-gen training
GB200 (Grace-Blackwell)  NVIDIA  CPU+GPU  384 GB combined  8 TB/s     2,250        2700W (module)   Full-stack AI
TPU v5p                  Google  ASIC     95 GB HBM        2.76 TB/s  ~459         250W             JAX/TF training
TPU v6 (Trillium)        Google  ASIC     128 GB HBM       ~4 TB/s    ~900 (est.)  TBD              Next-gen training
Trainium2                AWS     ASIC     96 GB HBM        3.2 TB/s   ~740 (est.)  500W             Cost-optimized training
Inferentia2              AWS     ASIC     32 GB HBM        820 GB/s   ~190         175W             Cost-optimized inference
Gaudi 3                  Intel   ASIC     128 GB HBM2e     3.7 TB/s   ~1,835       600W             Training + Inference
MI300X                   AMD     GPU      192 GB HBM3      5.3 TB/s   1,307        750W             Training + Inference
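The spec columns can be reduced to two efficiency ratios that often matter more than raw TFLOPS: compute per watt, and memory bandwidth per unit of compute (a proxy for how well a chip feeds memory-bound inference). A minimal sketch using the approximate figures from the table above, not vendor-measured numbers:

```python
# Efficiency ratios derived from the (approximate) spec table above.
chips = {
    # name: (fp16_tflops, tdp_watts, mem_bw_tb_s)
    "H100 SXM": (989, 700, 3.35),
    "B200": (2250, 1000, 8.0),
    "TPU v5p": (459, 250, 2.76),
    "Gaudi 3": (1835, 600, 3.7),
    "MI300X": (1307, 750, 5.3),
}

# Rank by compute per watt; also show bandwidth per TFLOP.
for name, (tflops, tdp, bw) in sorted(
        chips.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True):
    print(f"{name:10} {tflops / tdp:5.2f} TFLOPS/W   "
          f"{bw * 1000 / tflops:5.2f} GB/s per TFLOP")
```

High bandwidth-per-TFLOP (MI300X, H100) favors memory-bound decoding; high TFLOPS-per-watt favors dense training throughput per rack.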

Cost-Per-Token Estimates (Inference, 2026)

Approximate Cost per Million Output Tokens (Llama-3 70B class)
===============================================================

Hardware         $/M tokens    Relative Cost
─────────────    ──────────    ─────────────
H100 (cloud)       $0.80         1.0x (baseline)
H200 (cloud)       $0.60         0.75x
B200 (cloud)       $0.35         0.44x
TPU v5p            $0.55         0.69x
Trainium2          $0.40         0.50x
Inferentia2        $0.25         0.31x (inference only)
MI300X             $0.65         0.81x
Gaudi 3            $0.50         0.63x

Note: Estimates vary significantly by workload, batch size,
      quantization level, and cloud provider pricing.
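These figures reduce to a simple identity: dollars per hour divided by tokens per hour. A sketch of the arithmetic, where the hourly rate matches the H100 cloud pricing later in this piece and the throughput figure is an illustrative assumption rather than a benchmark:

```python
def cost_per_million_tokens(hourly_rate_usd: float,
                            tokens_per_second: float) -> float:
    """Serving cost per 1M output tokens at a sustained decode rate."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# e.g. an H100 at $2.49/hr sustaining ~860 tok/s on a 70B-class model
# (assumed throughput) lands near the $0.80/M baseline in the table.
print(f"${cost_per_million_tokens(2.49, 860):.2f}/M tokens")
```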

Compute Scaling Laws

Scaling Laws Relationship Diagram
===================================

Performance (loss) improves as power law of three factors:

  Loss ~ f(Compute, Data, Parameters)

  Compute (C) ──────┐
                     │
  Data tokens (D) ───┼──► Loss = L(C, D, N)
                     │
  Parameters (N) ────┘

Chinchilla-optimal ratio (Hoffmann et al. 2022):
  D ≈ 20 × N  (tokens ≈ 20x parameters)

  Model Size    Optimal Tokens    Compute (FLOPs, C ≈ 6ND)
  ──────────    ──────────────    ────────────────────────
  1B            20B               1.2 × 10^20
  7B            140B              5.9 × 10^21
  70B           1.4T              5.9 × 10^23
  400B          8T                1.9 × 10^25
  1T            20T               1.2 × 10^26
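Chinchilla-optimal sizing follows from two approximations: the data ratio D ≈ 20N and the standard training-compute estimate C ≈ 6ND FLOPs. A minimal calculator:

```python
def chinchilla(n_params: float) -> tuple[float, float]:
    """Chinchilla-optimal token count and training FLOPs for N params."""
    d_tokens = 20 * n_params            # D ≈ 20N (Hoffmann et al. 2022)
    c_flops = 6 * n_params * d_tokens   # C ≈ 6ND standard approximation
    return d_tokens, c_flops

for n in (1e9, 7e9, 70e9, 400e9, 1e12):
    d, c = chinchilla(n)
    print(f"{n / 1e9:>6.0f}B params  {d / 1e9:>8.0f}B tokens  {c:.1e} FLOPs")
```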

Post-Chinchilla insight (Llama, Gemma):
  Over-training smaller models on more data than the
  Chinchilla-optimal ratio improves inference efficiency
  at the cost of training efficiency.

Cloud GPU Pricing Comparison (On-Demand, per hour, 2026)

GPU/Accelerator  AWS                       GCP                  Azure      CoreWeave     Lambda
───────────────  ────────────────────────  ───────────────────  ─────────  ────────────  ──────
H100 80GB        $12.00 (p5.48xlarge/8)    $11.60 (a3-highgpu)  $11.56     $2.49         $2.49
H200             $15.00 (est.)             $14.20 (est.)        TBD        $3.29         $3.49
B200             Not yet                   Not yet              Not yet    $4.49 (est.)  TBD
A100 80GB        $5.12 (p4de)              $5.07                $3.67      $1.35         $1.29
L4               $0.81                     $0.81                $0.72      N/A           N/A
TPU v5p          N/A                       $4.20/chip           N/A        N/A           N/A
Trainium2        $5.30 (trn2.48xlarge/16)  N/A                  N/A        N/A           N/A
Gaudi 3          Via DL AMI                N/A                  Available  N/A           N/A

Note: GPU cloud pricing is highly dynamic. Spot/preemptible pricing can be 50-70% lower. Reserved instances offer 30-50% discounts.
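The discount bands in the note compound directly with the on-demand rates above. A toy comparison using the AWS H100 rate from the table; the specific discount percentages chosen here are illustrative midpoints, not quoted prices:

```python
def effective_rate(on_demand_usd: float, discount: float) -> float:
    """Effective $/GPU-hour after a fractional discount off on-demand."""
    return on_demand_usd * (1 - discount)

aws_h100 = 12.00  # on-demand $/GPU-hr, from the table above
for label, disc in [("on-demand", 0.00),
                    ("reserved (~40% off)", 0.40),
                    ("spot (~60% off)", 0.60)]:
    print(f"{label:20} ${effective_rate(aws_h100, disc):6.2f}/GPU-hr")
```

Even a 60% spot discount on hyperscaler on-demand pricing only roughly matches the specialist clouds' list prices, which is why workload portability is itself a cost lever.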

Training vs Inference: Different Hardware Strategies

Training vs Inference Requirements
====================================

              Training                    Inference
              ────────                    ─────────
Memory:       Massive (model +           Model weights +
              optimizer + gradients)      KV cache only

Compute:      Sustained, high            Bursty, latency-
              throughput                  sensitive

Interconnect: Critical (multi-node       Less critical
              all-reduce)                (single node often)

Precision:    BF16/FP16, some FP32       INT8/INT4/FP8
                                         (quantized)

Cost driver:  GPU-hours ×                Tokens served ×
              number of GPUs             latency SLA

Optimal HW:   H100/H200/B200/TPU v5p    Inferentia2/L4/
              (high bandwidth)            quantized on B200
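The memory asymmetry in the first row can be made concrete. A rough sketch using common mixed-precision conventions (≈16 bytes/param of training state with Adam; FP16 weights plus KV cache for serving); the model shape is a Llama-3-70B-like assumption with grouped-query attention:

```python
def training_mem_gb(n_params: float) -> float:
    # BF16 weights (2 B) + BF16 grads (2 B) + FP32 Adam state
    # (master weights + 2 moments = 12 B) ≈ 16 bytes/param
    return n_params * 16 / 1e9

def inference_mem_gb(n_params: float, n_layers: int, n_kv_heads: int,
                     head_dim: int, batch: int, seq_len: int,
                     bytes_per: int = 2) -> float:
    weights = n_params * bytes_per
    # KV cache: 2 tensors (K and V) per layer, per KV head, per token
    kv = 2 * n_layers * n_kv_heads * head_dim * batch * seq_len * bytes_per
    return (weights + kv) / 1e9

# 70B params, 80 layers, 8 KV heads (GQA), head_dim 128,
# 8 concurrent sequences at 8k context (assumed shape)
print(f"train:  {training_mem_gb(70e9):.0f} GB")
print(f"infer:  {inference_mem_gb(70e9, 80, 8, 128, 8, 8192):.0f} GB")
```

Training state for a 70B model runs to roughly a terabyte (hence multi-node sharding), while the same model serves from a fraction of that, which is what lets smaller, cheaper accelerators compete on inference.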

Strategic Considerations

The hardware landscape favors organizations that can plan their compute strategy across three horizons: immediate (which cloud GPUs to reserve now), medium-term (which custom silicon pipelines to evaluate), and long-term (which architectural bets -- such as inference-optimized chips or optical computing -- to track). The 10x cost differences between providers and chip generations make compute strategy a board-level decision.

Resources