Foundation Models Beyond Text: The Full Landscape
#artificial-intelligence #llm #foundation-models #deep-learning
Foundation models have expanded far beyond language. The same paradigm of large-scale pretraining on broad data followed by adaptation to specific tasks now spans vision, code, science, audio, and multimodal reasoning. Understanding this landscape is essential for strategic AI investment.
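To make the pretrain-then-adapt paradigm concrete, here is a minimal sketch of the adaptation step using the Hugging Face transformers library. The checkpoint, toy batch, and hyperparameters are illustrative assumptions, not a recommendation; any pretrained backbone with a task head follows the same pattern.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Start from a broadly pretrained checkpoint (illustrative choice)...
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,  # ...and attach a fresh task-specific head
)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# A toy labeled batch standing in for the downstream task.
batch = tokenizer(
    ["great product", "terrible service"], padding=True, return_tensors="pt"
)
labels = torch.tensor([1, 0])

# One adaptation step: the pretrained weights do most of the work;
# fine-tuning only nudges them toward the target task.
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```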
Foundation Model Taxonomy by Modality
Foundation Model Taxonomy Tree
===============================
Foundation Models
├── Language
│   ├── General: GPT-4, Claude, Gemini, Llama-3, Mistral
│   ├── Code: Codex, StarCoder2, CodeLlama, DeepSeek-Coder
│   └── Domain: BloombergGPT (finance), Med-PaLM (medical)
│
├── Vision
│   ├── Classification/Features: DINOv2, CLIP, SigLIP
│   ├── Segmentation: SAM, SAM 2
│   ├── Generation: DALL-E 3, Stable Diffusion 3, Midjourney
│   └── Video: Sora, Runway Gen-3, Kling
│
├── Multimodal
│   ├── Vision+Language: GPT-4o, Gemini 1.5, Claude 3.5
│   ├── Any-to-Any: Gemini Ultra, GPT-4o
│   └── Embodied: RT-2, PaLM-E
│
├── Science
│   ├── Protein: AlphaFold2/3, ESM-2, RoseTTAFold
│   ├── Chemistry: MolBERT, ChemBERTa
│   ├── Climate: ClimaX, Pangu-Weather
│   └── Math: Minerva, Llemma, DeepSeek-Math
│
├── Audio
│   ├── Speech: Whisper, USM, SeamlessM4T
│   ├── Music: MusicLM, Stable Audio
│   └── General: AudioPaLM
│
└── Robotics/Embodied
    ├── Manipulation: RT-2, Octo
    └── Navigation: CLIP-Nav, VLMaps
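CLIP, listed under both the Vision and (via its descendants) Multimodal branches, shows how these families connect in practice: a shared image-text embedding space turns into a zero-shot classifier. A hedged sketch using Hugging Face transformers; the image path and label set are placeholders.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder image
labels = ["a photo of a cat", "a photo of a dog", "a satellite image"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores act as a classifier over arbitrary labels,
# with no task-specific training.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```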
Model Comparison Matrix
| Model | Modality | Parameters | Training Data | Access | Key Capability |
|---|---|---|---|---|---|
| GPT-4o | Text+Vision+Audio | ~1.8T (est.) | Web-scale | Closed API | Native multimodal input/output |
| Claude 3.5 Sonnet | Text+Vision | Undisclosed | Web-scale | Closed API | Long context, safety focus |
| Gemini 1.5 Pro | Text+Vision+Audio | Undisclosed | Web-scale | Closed API | 1M+ token context |
| Llama-3 405B | Text | 405B | 15T tokens | Open weights | Strongest open-weights LLM at release |
| Mistral Large | Text | ~120B (est.) | Undisclosed | Open weights | High capability per parameter |
| DINOv2 | Vision | 1.1B | 142M images | Open | Self-supervised visual features |
| SAM 2 | Vision (segmentation) | 600M | 11M images, 256K videos | Open | Promptable segmentation for images and video |
| StarCoder2 | Code | 15B | 4T tokens, 619 languages | Open | Multi-language code generation |
| AlphaFold3 | Protein+Ligand | Undisclosed | PDB + molecular data | Partially open | Protein-ligand structure prediction |
| Whisper Large v3 | Audio | 1.5B | 5M hours of audio | Open | Multilingual speech recognition |
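For the open rows in this matrix, access is often a few lines of code. A minimal sketch for the Whisper Large v3 row, assuming the transformers pipeline API and a local audio file (the path is a placeholder):

```python
from transformers import pipeline

# Open weights, per the matrix above: the checkpoint downloads locally.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
)

# Whisper transcribes (and can translate) speech across ~100 languages.
result = asr("meeting_recording.wav")  # placeholder path
print(result["text"])
```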
Open vs Closed Model Landscape
Open vs Closed Foundation Model Map (2026)
===========================================
                  Capability ──────────►
             Low        Medium        High
        ┌──────────┬─────────────┬──────────────┐
Open    │ TinyLlama│ Mistral 7B  │ DeepSeek V3  │
Source  │ Pythia   │ Gemma 2     │              │
        ├──────────┼─────────────┼──────────────┤
Open    │          │ Phi-3       │ Llama-3 405B │
Weights │          │ StarCoder2  │ Mistral Large│
        │          │             │ DBRX         │
        ├──────────┼─────────────┼──────────────┤
Closed  │          │             │ GPT-4o       │
API     │          │             │ Claude 3.5   │
        │          │             │ Gemini Ultra │
        └──────────┴─────────────┴──────────────┘
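The rows of this map imply very different integration patterns. The sketch below contrasts the two extremes; the model names, prompt, and client details (the OpenAI Python SDK and Hugging Face transformers) are illustrative assumptions, not an endorsement of either path.

```python
# Closed API: capability rented per token, no weights on disk.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize our Q3 risks."}],
)
print(resp.choices[0].message.content)

# Open weights: the same workload served from hardware you control,
# which is what reshapes the pricing and lock-in calculus.
from transformers import pipeline

local_llm = pipeline(
    "text-generation", model="mistralai/Mistral-7B-Instruct-v0.2"
)
print(local_llm("Summarize our Q3 risks.", max_new_tokens=128)[0]["generated_text"])
```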
Adoption Timeline 2020-2026
Foundation Model Adoption Timeline
====================================
2020 ── GPT-3 (175B) demonstrates few-shot learning
     ── Vision Transformers (ViT) proposed
2021 ── CLIP connects vision and language
     ── Codex enables GitHub Copilot
     ── AlphaFold2 transforms protein structure prediction
2022 ── ChatGPT triggers mainstream adoption
     ── Stable Diffusion open-sources image generation
     ── Whisper open-sources speech recognition
2023 ── GPT-4 brings multimodal capability
     ── Llama 2 kicks off the open-weights era
     ── Claude 2 and Gemini intensify frontier competition
     ── SAM enables promptable segmentation
2024 ── GPT-4o: native multimodal input/output
     ── Claude 3.5 Sonnet: benchmark leader
     ── Llama 3: open models approach closed-model quality
     ── Video generation (Sora, Kling)
     ── AlphaFold3 expands to all biomolecules
2025 ── Gemini 2.0 agent capabilities
     ── Open models close the gap with the frontier
     ── Foundation models enter robotics (RT-2 successors)
     ── Specialized scientific models mature
2026 ── Multimodal agents become a default interface
     ── Open model ecosystem rivals closed offerings
     ── Domain-specific foundation models proliferate
Strategic Implications
The foundation model landscape reveals three key dynamics:
- Modality convergence: top models increasingly handle multiple modalities natively rather than through bolted-on adapters
- Open-source pressure: the gap between open and closed models narrows each cycle, reshaping pricing and lock-in calculus
- Domain specialization: generic models are being complemented by domain-specific models that outperform on targeted tasks with less compute