
Small Language Models: Why Smaller Is Sometimes Better

#artificial-intelligence#llm#edge-computing#efficiency

The frontier AI narrative focuses on ever-larger models, but a parallel revolution is happening at the other end of the spectrum. Small language models (SLMs) -- typically under 10B parameters -- are achieving surprising capability through better training data, distillation, and architectural innovations. For many production use cases, they offer superior economics, latency, and deployability.

Model Size vs Performance Data

Model Size vs Benchmark Performance (MMLU, approximate 2026)
=============================================================

Model               Params    MMLU    $/M tokens (API)
──────────────────   ──────   ─────   ────────────────
GPT-4o               ~1.8T    88.7    $5.00
Claude 3.5 Sonnet    ~???     88.3    $3.00
Llama-3 405B         405B     86.1    $2.00 (hosted)
Llama-3 70B          70B      82.0    $0.80
Mistral Large        ~120B    81.2    $1.00
Qwen2.5 72B          72B      80.5    $0.70
──── capability gap narrows below ────
Phi-3 Medium         14B      78.0    $0.12
Phi-3 Small          7B       75.3    $0.06
Qwen2.5 7B           7B       74.2    $0.05
Gemma 2 9B           9B       71.3    $0.08
Phi-3 Mini           3.8B     69.0    $0.04
Llama-3 8B           8B       68.4    $0.10
Mistral 7B           7.3B     62.5    $0.06
Gemma 2 2B           2.6B     56.2    $0.02

Insight: The 7-14B range captures 75-85% of frontier
performance at 3-5% of the cost.
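
The capability/cost ratios can be checked directly from the table's own figures. The sketch below uses GPT-4o as the frontier baseline; the exact ratio depends on which frontier row you compare against (against hosted Llama-3 405B at $2.00, the cost fraction roughly doubles).

```python
# Capability/cost ratios computed from the table above
# (GPT-4o row as the frontier baseline).
frontier_mmlu, frontier_cost = 88.7, 5.00

slm_rows = {
    "Phi-3 Medium": (78.0, 0.12),
    "Qwen2.5 7B":   (74.2, 0.05),
    "Gemma 2 9B":   (71.3, 0.08),
}

ratios = {
    name: (mmlu / frontier_mmlu, cost / frontier_cost)
    for name, (mmlu, cost) in slm_rows.items()
}
for name, (cap, rel) in ratios.items():
    print(f"{name}: {cap:.0%} of frontier MMLU at {rel:.1%} of the API price")
```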

Compression Technique Comparison

| Technique | How It Works | Size Reduction | Quality Impact | Speed Gain | Complexity |
|---|---|---|---|---|---|
| Quantization (INT8) | Reduce weight precision to 8-bit | 2x | Minimal (<1% loss) | 1.5-2x | Low |
| Quantization (INT4) | Reduce weight precision to 4-bit | 4x | Small (1-3% loss) | 2-3x | Low |
| GPTQ | Post-training quantization with calibration | 4x | Small | 2-3x | Low |
| AWQ | Activation-aware weight quantization | 4x | Very small | 2-3x | Low |
| GGUF (llama.cpp) | CPU-optimized quantization format | 2-8x | Varies by quant level | CPU-friendly | Low |
| Knowledge distillation | Train small model on large-model outputs | Custom (choose size) | Moderate (5-15% loss) | Proportional to size | High |
| Pruning (structured) | Remove entire attention heads or layers | 1.5-3x | Moderate | 1.5-2x | Medium |
| Pruning (unstructured) | Zero out individual weights | 2-10x (sparse) | Small if <50% sparsity | Needs sparse HW | Medium |
| LoRA/QLoRA | Low-rank adaptation (fine-tuning efficiency) | Base + tiny adapter | Improves task quality | N/A (training method) | Low |
| Speculative decoding | Small model drafts, large model verifies | N/A (inference method) | None (exact) | 2-3x | Medium |
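
To make the first row concrete, here is a toy symmetric per-tensor INT8 quantization in a few lines of Python. This is deliberately minimal: production kernels (and methods like GPTQ and AWQ) add per-channel scales, calibration data, and clipping.

```python
def quantize_int8(weights):
    """Map each float weight to a signed 8-bit level: w ~= q * scale."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.82, -1.31, 0.05, 2.47, -0.66]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
# Round-trip error is bounded by half a quantization step (scale / 2),
# which is why INT8 costs well under 1% of benchmark quality.
```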

Deployment Platform Matrix

| Platform | Max Model Size | Quantization | Throughput | Use Cases | Tools |
|---|---|---|---|---|---|
| Smartphone (iOS/Android) | ~3B (4-bit) | INT4/GGUF | 20-50 tok/s | On-device assistant, autocomplete | MLC-LLM, llama.cpp |
| Browser (WebGPU/WASM) | ~3B (4-bit) | INT4 | 10-30 tok/s | Privacy-first apps, demos | WebLLM, Transformers.js |
| Edge device (RPi, Jetson) | ~7B (4-bit) | INT4/INT8 | 5-20 tok/s | IoT, local processing | llama.cpp, TensorRT-LLM |
| Laptop (CPU) | ~13B (4-bit) | GGUF Q4 | 15-40 tok/s | Local dev, offline use | Ollama, LM Studio |
| Laptop (GPU) | ~30B (4-bit) | GPTQ/AWQ | 30-80 tok/s | Development, local inference | Ollama, vLLM |
| Server (single GPU) | ~70B (4-bit) | AWQ/GPTQ | 50-150 tok/s | Production inference | vLLM, TGI, TensorRT-LLM |
| Server (multi-GPU) | 400B+ | FP16/BF16 | 100-300 tok/s | Frontier model serving | vLLM, TGI, DeepSpeed |
| Cloud API | Unlimited | Provider-managed | 50-200 tok/s | All | OpenAI, Anthropic, etc. |
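
The "Max Model Size" column is mostly weight-memory arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below ignores KV cache and activation memory, so real headroom is somewhat tighter.

```python
def weight_gb(params_billion, bits):
    """Approximate weight memory in GB: (params * 1e9 * bits / 8) / 1e9."""
    return params_billion * bits / 8

# ~3B at 4-bit fits in smartphone RAM; ~70B at 4-bit needs a large single GPU.
for params, bits in [(3, 4), (7, 4), (13, 4), (70, 4), (405, 16)]:
    print(f"{params}B @ {bits}-bit ~= {weight_gb(params, bits):.1f} GB")
```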

Cost-Per-Query Comparison by Model Size

Cost Per 1000 Queries (500 input + 200 output tokens each)
============================================================

Model Size     Self-hosted*     API Pricing
──────────     ────────────     ───────────
3B (INT4)      $0.002           $0.01
7B (INT4)      $0.005           $0.03
13B (INT4)     $0.010           $0.06
70B (INT8)     $0.050           $0.56
405B (FP16)    $0.200           $1.40
Frontier       N/A              $3.50

*Self-hosted on appropriate hardware, amortized over 1M queries/month.
Includes compute only, not engineering overhead.

Break-even analysis:
- <10K queries/month: API is cheaper (no infra overhead)
- 10K-100K queries/month: depends on latency requirements
- >100K queries/month: self-hosted SLM often wins
- >1M queries/month: self-hosted SLM significantly cheaper
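
The arithmetic behind these thresholds is a straight multiplication of the per-1000-query costs in the table. Note the footnote: the self-hosted figure is compute only, amortized at 1M queries/month, so the engineering overhead it excludes is what the break-even really trades off.

```python
# Monthly spend from the cost table: frontier API vs self-hosted 7B SLM.
FRONTIER_API_PER_1K = 3.50   # $/1000 queries (table above)
SELF_7B_PER_1K = 0.005       # $/1000 queries, compute only (table footnote)

def monthly_spend(queries_per_month, per_1k):
    return queries_per_month / 1000 * per_1k

spend = {
    q: (monthly_spend(q, FRONTIER_API_PER_1K), monthly_spend(q, SELF_7B_PER_1K))
    for q in (10_000, 100_000, 1_000_000)
}
# At 1M queries/month: $3,500 of frontier API vs ~$5 of SLM compute --
# at that volume even substantial infra/engineering overhead pays for itself.
```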

The SLM Design Philosophy

The best small models are not simply scaled-down versions of large models. They succeed through:

  1. Data quality over quantity: Phi models demonstrated that textbook-quality data at smaller scale outperforms web-crawl data at larger scale
  2. Over-training: Training a 7B model on 2T+ tokens (far beyond Chinchilla-optimal) improves inference-time performance
  3. Architectural efficiency: Grouped-query attention, SwiGLU activations, and RMSNorm reduce overhead per parameter
  4. Targeted distillation: Learning specific capabilities from larger teacher models rather than general knowledge
  5. Task-specific fine-tuning: A 7B model fine-tuned on your domain data often outperforms a 70B general model on that domain
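
Item 4 is classic knowledge distillation: the student is trained to match the teacher's temperature-softened output distribution. A minimal, stdlib-only sketch of the Hinton-style KL loss for one token position (the logits here are illustrative, not from any real model):

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)   # teacher's soft targets
    q = softmax(student_logits, temperature)
    return temperature ** 2 * sum(
        pi * math.log(pi / qi) for pi, qi in zip(p, q)
    )

teacher = [3.2, 1.1, -0.4]   # teacher logits for one token (illustrative)
student = [2.9, 1.3, -0.2]
loss = distill_loss(student, teacher)
```

The temperature spreads probability mass onto the teacher's "wrong" answers, which carry the relational knowledge a one-hot label discards; in practice this KL term is mixed with the ordinary cross-entropy on gold labels.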

When to Choose Small vs Large

Decision Framework: Small vs Large Model
==========================================

Choose SLM (< 13B) when:
  ✓ Single, well-defined task (classification, extraction, routing)
  ✓ Latency is critical (< 100ms)
  ✓ Cost per query must be minimal
  ✓ Privacy requires on-device or on-prem
  ✓ High query volume (> 100K/month)
  ✓ Domain-specific with good fine-tuning data

Choose Large Model (70B+) when:
  ✓ Complex multi-step reasoning required
  ✓ Broad knowledge across many domains
  ✓ Novel/unpredictable task types
  ✓ Quality is paramount, cost is secondary
  ✓ Low volume, high value per query

Hybrid pattern (increasingly common):
  SLM for routing/classification → Large model for complex cases
  SLM drafts → Large model verifies (speculative decoding)
  Large model generates training data → SLM serves production
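
The first hybrid pattern reduces to a confidence-gated router. The sketch below is all stubs: a real system would take the confidence signal from the SLM's own logprobs or a trained router head, and the 0.7 threshold is an assumed value, not a recommendation.

```python
def slm_answer(query):
    """Stub for an SLM call: returns (answer, confidence)."""
    # Toy heuristic standing in for a real confidence signal.
    simple = len(query.split()) < 12 and "?" not in query[:-1]
    return ("slm-draft", 0.9 if simple else 0.4)

def route(query, threshold=0.7):
    """Answer with the SLM when confident; otherwise escalate."""
    answer, confidence = slm_answer(query)
    if confidence >= threshold:
        return answer, "slm"
    return "llm-answer", "llm"   # stub for the large-model fallback

print(route("Classify this support ticket."))
print(route("Explain the proof? And why? Then compare the two approaches?"))
```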

Resources