
Foundation Models Beyond Text: The Full Landscape

#artificial-intelligence#llm#foundation-models#deep-learning

Foundation models have expanded far beyond language. The same paradigm -- large-scale pretraining on broad data, then adaptation to specific tasks -- now spans vision, code, science, audio, and multimodal reasoning. Understanding this landscape is essential for strategic AI investment.
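The adaptation half of this paradigm can be made concrete with a minimal sketch of low-rank adaptation (LoRA-style fine-tuning): the large pretrained weight matrix stays frozen, and only a small pair of low-rank factors is trained for the downstream task. All variable names and sizes here are illustrative, not taken from any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 512, 8                             # hidden size, adapter rank (illustrative)
W = rng.normal(size=(d, d))               # frozen pretrained weight, never updated
A = np.zeros((d, r))                      # trainable low-rank factor (init to zero)
B = rng.normal(scale=0.01, size=(r, d))   # trainable low-rank factor

def adapted_forward(x):
    # Effective weight is W + A @ B; gradients would flow only into A and B.
    return x @ (W + A @ B)

x = rng.normal(size=(4, d))
y = adapted_forward(x)

trainable = A.size + B.size
total = trainable + W.size
print(f"trainable params: {trainable} of {total} "
      f"({100 * trainable / total:.1f}%)")
```

The point of the sketch is the parameter ratio: adapting a frozen backbone touches a few percent of the weights, which is why one pretrained model can serve many downstream tasks cheaply.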

Foundation Model Taxonomy by Modality

Foundation Model Taxonomy Tree
===============================

Foundation Models
├── Language
│   ├── General: GPT-4, Claude, Gemini, Llama-3, Mistral
│   ├── Code: Codex, StarCoder2, CodeLlama, DeepSeek-Coder
│   └── Domain: BloombergGPT (finance), Med-PaLM (medical)
│
├── Vision
│   ├── Classification/Features: DINOv2, CLIP, SigLIP
│   ├── Segmentation: SAM, SAM2
│   ├── Generation: DALL-E 3, Stable Diffusion 3, Midjourney
│   └── Video: Sora, Runway Gen-3, Kling
│
├── Multimodal
│   ├── Vision+Language: GPT-4o, Gemini 1.5, Claude 3.5
│   ├── Any-to-Any: Gemini Ultra, GPT-4o
│   └── Embodied: RT-2, PaLM-E
│
├── Science
│   ├── Protein: AlphaFold2/3, ESM-2, RoseTTAFold
│   ├── Chemistry: MolBERT, ChemBERTa
│   ├── Climate: ClimaX, Pangu-Weather
│   └── Math: Minerva, Llemma, DeepSeek-Math
│
├── Audio
│   ├── Speech: Whisper, USM, SeamlessM4T
│   ├── Music: MusicLM, Stable Audio
│   └── General: AudioPaLM
│
└── Robotics/Embodied
    ├── Manipulation: RT-2, Octo
    └── Navigation: CLIP-Nav, VLMaps
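For tooling or filtering purposes, the tree above maps naturally onto a nested dictionary. The excerpt below encodes a few branches verbatim from the taxonomy; the helper function name is illustrative.

```python
# A subset of the taxonomy tree above, encoded for programmatic lookup.
TAXONOMY = {
    "Language": {
        "General": ["GPT-4", "Claude", "Gemini", "Llama-3", "Mistral"],
        "Code": ["Codex", "StarCoder2", "CodeLlama", "DeepSeek-Coder"],
    },
    "Vision": {
        "Segmentation": ["SAM", "SAM2"],
        "Generation": ["DALL-E 3", "Stable Diffusion 3", "Midjourney"],
    },
    "Science": {
        "Protein": ["AlphaFold2/3", "ESM-2", "RoseTTAFold"],
    },
}

def find_modality(model):
    """Return (modality, category) for a model name, or None if absent."""
    for modality, categories in TAXONOMY.items():
        for category, models in categories.items():
            if model in models:
                return modality, category
    return None

print(find_modality("SAM2"))  # → ('Vision', 'Segmentation')
```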

Model Comparison Matrix

| Model | Modality | Params | Training Data | Open/Closed | Key Capability |
|---|---|---|---|---|---|
| GPT-4o | Text+Vision+Audio | ~1.8T (est.) | Web-scale | Closed | Best multimodal reasoning |
| Claude 3.5 Sonnet | Text+Vision | Undisclosed | Web-scale | Closed | Long context, safety |
| Gemini 1.5 Pro | Text+Vision+Audio | Undisclosed | Web-scale | Closed | 1M+ token context |
| Llama-3 405B | Text | 405B | 15T tokens | Open weights | Best open LLM |
| Mistral Large | Text | ~120B (est.) | Undisclosed | Open weights | Efficiency leader |
| DINOv2 | Vision | 1.1B | 142M images | Open | Self-supervised vision features |
| SAM 2 | Vision (segm.) | 600M | 11M images, 256K videos | Open | Universal segmentation |
| StarCoder2 | Code | 15B | 619 languages, 4T tokens | Open | Multi-language code gen |
| AlphaFold3 | Protein+Ligand | Undisclosed | PDB + molecular data | Partially open | Protein-ligand structure prediction |
| Whisper Large v3 | Audio | 1.5B | 5M hours audio | Open | Multilingual speech recognition |
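The matrix lends itself to simple programmatic queries, for example separating open and closed options when shortlisting models. The rows below are a subset of the table; the record fields are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    modality: str
    access: str  # "Open", "Open weights", "Partially open", or "Closed"

# A few rows copied from the comparison matrix above.
MODELS = [
    Model("GPT-4o", "Text+Vision+Audio", "Closed"),
    Model("Llama-3 405B", "Text", "Open weights"),
    Model("DINOv2", "Vision", "Open"),
    Model("AlphaFold3", "Protein+Ligand", "Partially open"),
    Model("Whisper Large v3", "Audio", "Open"),
]

# Anything whose access level starts with "Open" can be self-hosted in some form.
open_models = [m.name for m in MODELS if m.access.startswith("Open")]
print(open_models)  # → ['Llama-3 405B', 'DINOv2', 'Whisper Large v3']
```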

Open vs Closed Model Landscape

Open vs Closed Foundation Model Map (2026)
===========================================

                    Capability ──────────►
                    Low         Medium        High
              ┌──────────┬─────────────┬──────────────┐
   Open       │ TinyLlama│ Mistral 7B  │ Llama-3 405B │
   Source     │ Pythia   │ Gemma 2     │ DeepSeek V3  │
              ├──────────┼─────────────┼──────────────┤
   Open       │          │ Phi-3       │ Mistral Large│
   Weights    │          │ StarCoder2  │ DBRX         │
              ├──────────┼─────────────┼──────────────┤
   Closed     │          │             │ GPT-4o       │
   API        │          │             │ Claude 3.5   │
              │          │             │ Gemini Ultra │
              └──────────┴─────────────┴──────────────┘

Adoption Timeline 2020-2026

Foundation Model Adoption Timeline
====================================

2020 ── GPT-3 (175B) demonstrates few-shot learning
     ── Vision Transformers (ViT) proposed
2021 ── CLIP connects vision and language
     ── Codex enables GitHub Copilot
     ── AlphaFold2 solves protein structure
2022 ── ChatGPT triggers mainstream adoption
     ── Stable Diffusion open-sources image generation
     ── Whisper open-sources speech recognition
2023 ── GPT-4 multimodal capability
     ── Llama 2 opens the model weight era
     ── Claude 2, Gemini launch competition
     ── SAM enables universal segmentation
2024 ── GPT-4o: native multimodal input/output
     ── Claude 3.5 Sonnet: benchmark leader
     ── Llama 3: open models match closed
     ── Video generation (Sora, Kling)
     ── AlphaFold3 expands to all biomolecules
2025 ── Gemini 2.0 agent capabilities
     ── Open models close gap on frontier
     ── Foundation models enter robotics (RT-2 successors)
     ── Specialized scientific models mature
2026 ── Multimodal agents become default interface
     ── Open model ecosystem rivals closed
     ── Domain-specific foundation models proliferate

Strategic Implications

The foundation model landscape reveals three key dynamics:

  1. Modality convergence: top models increasingly handle multiple modalities natively rather than through bolted-on adapters
  2. Open-source pressure: the gap between open and closed models narrows each cycle, reshaping pricing and lock-in calculus
  3. Domain specialization: generic models are being complemented by domain-specific models that outperform on targeted tasks with less compute

Resources