Map of ML Algorithm Families
Machine learning is far more than supervised classification. This guide covers every major algorithm family — from reinforcement learning to evolutionary computing — with intuitions, trade-offs, and practical pointers for when each shines.
Algorithm Families at a Glance
| Family | Learning Signal | Key Use Cases |
|---|---|---|
| Supervised | Labeled examples (X → Y) | Classification, regression, forecasting |
| Unsupervised | No labels — find structure | Clustering, dimensionality reduction, anomaly detection |
| Self-Supervised | Pseudo-labels from data itself | Language models, image pre-training |
| Reinforcement Learning | Rewards from environment | Robotics, games, resource optimization |
| Evolutionary / Genetic | Fitness function + selection | Optimization, architecture search, scheduling |
| Probabilistic / Bayesian | Prior distributions + evidence | Uncertainty quantification, small data |
| Meta-Learning | Learning across tasks | Few-shot learning, AutoML |
Supervised Learning
The most widely deployed family. You provide labeled training data; the model learns the mapping.
Classification algorithms
| Algorithm | How it works | Best for |
|---|---|---|
| Logistic Regression | Linear decision boundary via sigmoid | Binary classification, interpretable baselines |
| Decision Trees | Recursive feature splits | Interpretable rules, feature importance |
| Random Forest | Ensemble of decorrelated trees (bagging) | Tabular data, robust to overfitting |
| Gradient Boosting (XGBoost, LightGBM, CatBoost) | Sequential trees correcting errors | Tabular data competitions, production ML |
| SVM | Maximize margin between classes | Small-medium datasets, high-dimensional spaces |
| k-NN | Majority vote of nearest neighbors | Simple baselines, recommendation |
| Naive Bayes | Bayes' theorem + feature independence | Text classification, spam filtering |
| Neural Networks | Learned hierarchical features | Images, text, audio, multimodal |
Regression algorithms
Most classification algorithms have regression variants: Linear Regression, Ridge/Lasso, Decision Tree Regressor, Random Forest Regressor, Gradient Boosting Regressor, SVR, k-NN Regressor.
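As a minimal illustration of the regression setting, here is ordinary least squares fit with NumPy alone, the closed-form counterpart of scikit-learn's `LinearRegression` (the data here is synthetic, generated just for the sketch):

```python
import numpy as np

# Synthetic data: y depends linearly on two features plus an intercept.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.5  # intercept 0.5, no noise for clarity

# Append a bias column, then solve the least-squares problem in closed form.
Xb = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(w)  # recovers approximately [2.0, -1.0, 0.5]
```

Ridge and Lasso add an L2 or L1 penalty on `w` to the same objective; the tree and ensemble regressors replace the linear model entirely.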
Key insight: when to use what
```
              Tabular data?
               /        \
             Yes         No
             /             \
     < 100K rows?    Images/Text/Audio?
       /      \              |
     Yes       No           Yes
      |         |            |
Logistic Reg   XGBoost/    Deep Learning
or Random      LightGBM    (CNN, Transformer)
Forest         (performance)
(interpretable)
```
Gradient boosting (XGBoost, LightGBM, CatBoost) dominates tabular data. Deep learning wins on unstructured data. Don't use neural nets on tabular data unless you've already tried boosting.
Unsupervised Learning
No labels. The goal is to discover structure, patterns, or compressed representations.
Clustering
Group similar data points together.
| Algorithm | How it works | Strengths | Weaknesses |
|---|---|---|---|
| k-Means | Assign points to nearest centroid, update centroids | Fast, scalable | Must specify k, assumes spherical clusters |
| DBSCAN | Density-based — clusters are dense regions | Finds arbitrary shapes, detects outliers | Struggles with varying densities |
| HDBSCAN | Hierarchical DBSCAN | No need to specify eps, varying densities | Slower on very large datasets |
| Gaussian Mixture Models | Soft clustering via probability distributions | Probabilistic assignments, elliptical clusters | Must specify k, can overfit |
| Agglomerative | Bottom-up hierarchical merging | Dendrogram visualization, any distance metric | O(n³) time and O(n²) memory, doesn't scale |
| Spectral Clustering | Graph Laplacian + k-Means on eigenvectors | Complex cluster shapes | Expensive for large n, must specify k |
```python
from sklearn.cluster import HDBSCAN
import numpy as np

# Toy data: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])

clusterer = HDBSCAN(min_cluster_size=15, min_samples=5)
labels = clusterer.fit_predict(X)
# labels == -1 means noise/outlier
```
Dimensionality Reduction
Compress high-dimensional data while preserving structure.
| Algorithm | Type | Best for |
|---|---|---|
| PCA | Linear projection | Feature decorrelation, noise reduction, fast |
| t-SNE | Non-linear embedding | 2D/3D visualization of clusters |
| UMAP | Non-linear embedding | Faster than t-SNE, preserves global structure better |
| Autoencoders | Neural network compression | Learned representations, anomaly detection |
| Truncated SVD | Linear (sparse data) | Text data (LSA), recommender systems |
| ICA | Independent components | Signal separation (e.g., EEG, audio) |
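To make PCA concrete, here is the algorithm from scratch via SVD on made-up 3-D data whose variance is concentrated in one axis (NumPy only; in practice you would use `sklearn.decomposition.PCA`):

```python
import numpy as np

# Synthetic 3-D data with per-axis standard deviations of roughly 3, 1, and 0.1
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)                # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                      # project onto the top-2 principal axes

explained = (S**2) / (S**2).sum()      # variance explained per component
print(Z.shape, explained[:2].sum())    # nearly all variance kept in 2 dims
```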
Anomaly Detection
Find data points that don't fit the normal pattern.
| Algorithm | Approach |
|---|---|
| Isolation Forest | Random partitioning — anomalies are isolated faster |
| Local Outlier Factor | Compare local density to neighbors |
| One-Class SVM | Learn a boundary around normal data |
| Autoencoders | High reconstruction error = anomaly |
| Statistical | Z-score, IQR, Mahalanobis distance |
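The statistical row is the simplest to demonstrate. A sketch of z-score outlier detection on synthetic data with a few injected anomalies:

```python
import numpy as np

# Mostly standard-normal data with a few obvious anomalies injected
rng = np.random.default_rng(2)
x = rng.normal(0, 1, size=1000)
x[::250] = 8.0                               # indices 0, 250, 500, 750

z = np.abs((x - x.mean()) / x.std())         # distance from mean in std units
outliers = np.where(z > 3)[0]                # the classic 3-sigma rule
print(outliers)
```

Isolation Forest and LOF apply the same idea (flag what is far from "normal") without assuming a single Gaussian.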
Self-Supervised Learning
The model creates its own labels from raw data — no manual annotation needed. This is how modern foundation models (LLMs, vision transformers) are pre-trained.
Techniques
| Technique | Domain | How it works |
|---|---|---|
| Masked Language Modeling | NLP | Mask tokens, predict them (BERT) |
| Next Token Prediction | NLP | Predict the next word (GPT) |
| Masked Image Modeling | Vision | Mask patches, reconstruct them (MAE) |
| Contrastive Learning | Vision/Multi | Pull similar pairs close, push different pairs apart (SimCLR, CLIP) |
| DINO / DINOv2 | Vision | Self-distillation with no labels |
| JEPA | Vision | Predict latent representations, not pixels |
Why it matters
Self-supervised pre-training on massive unlabeled data → fine-tune on small labeled dataset. This is the dominant paradigm in modern AI: pre-train once, fine-tune for many tasks.
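The "pseudo-labels from data itself" idea is easy to see in code. A sketch of how masked language modeling manufactures training pairs from raw text — no model, just the label-creation step (the 30% mask rate is an arbitrary choice for the example; BERT uses 15%):

```python
import random

random.seed(0)
tokens = "the cat sat on the mat".split()
MASK, mask_prob = "[MASK]", 0.3

inputs, targets = [], []
for tok in tokens:
    if random.random() < mask_prob:
        inputs.append(MASK)
        targets.append(tok)        # the model must predict the original token
    else:
        inputs.append(tok)
        targets.append(None)       # no loss on unmasked positions
print(inputs)
```

Every position with a target came for free from the raw text — that is the entire trick.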
Reinforcement Learning (RL)
An agent learns by interacting with an environment, receiving rewards, and adjusting its policy to maximize cumulative reward.
Core concepts
```
        ┌───────────────┐
        │  Environment  │
        └───┬───────────┘
  state s,  │       ▲
  reward r  │       │ action a
            ▼       │
        ┌───────────┴───┐
        │     Agent     │
        │    π(a|s)     │
        └───────────────┘
```
| Term | Meaning |
|---|---|
| State (s) | Current observation of the environment |
| Action (a) | What the agent does |
| Reward (r) | Scalar feedback signal |
| Policy (π) | Strategy: state → action mapping |
| Value function V(s) | Expected cumulative future reward from state s |
| Q-function Q(s,a) | Expected cumulative future reward from taking action a in state s |
| Episode | One complete run from start to terminal state |
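The value and Q-function rows both rest on the discounted return G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + …, which a few lines make concrete:

```python
# Compute the discounted return by folding rewards backward:
# G_t = r_t + gamma * G_{t+1}
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 1, 1]))  # 1 + 0.9 + 0.81 = 2.71
```

V(s) and Q(s,a) are just expectations of this quantity over the trajectories the policy produces.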
Algorithm families
Value-based methods
Learn Q(s,a) and pick the action with highest Q.
| Algorithm | Key idea |
|---|---|
| Q-Learning | Off-policy, tabular, updates Q via Bellman equation |
| DQN (Deep Q-Network) | Q-Learning with neural net approximation, experience replay |
| Double DQN | Fixes overestimation bias in DQN |
| Dueling DQN | Separate value and advantage streams |
| Rainbow | Combines 6 DQN improvements into one agent |
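Tabular Q-Learning fits in a few lines. A sketch on a made-up 5-state chain where only the rightmost state pays reward — the agent must learn to always go right (hyperparameters here are illustrative, not tuned):

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions = 5, 2             # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.3

for _ in range(500):                   # episodes
    s = 0
    while s != n_states - 1:
        # epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s2 == n_states - 1 else 0.0
        # Bellman update: move Q(s,a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print(Q.argmax(axis=1))                # greedy policy per state
```

DQN replaces the table `Q` with a neural network and the single-step update with minibatches drawn from a replay buffer.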
Policy-based methods
Learn the policy π(a|s) directly.
| Algorithm | Key idea |
|---|---|
| REINFORCE | Monte Carlo policy gradient — simple but high variance |
| PPO (Proximal Policy Optimization) | Clipped objective prevents destructive policy updates. The workhorse of modern RL |
| TRPO | Trust region constraint on policy updates |
| A2C / A3C | Actor (policy) + Critic (value) — reduces variance |
| SAC (Soft Actor-Critic) | Maximum entropy RL — encourages exploration |
Model-based methods
Learn a model of the environment and plan within it.
| Algorithm | Key idea |
|---|---|
| Dyna-Q | Interleave real experience with simulated experience |
| MuZero | Learned world model — mastered Chess, Go, Atari without knowing the rules |
| Dreamer (v3) | Learn a world model in latent space, imagine trajectories |
| MBPO | Model-based policy optimization with short rollouts |
RL applications
| Domain | Example |
|---|---|
| Games | AlphaGo, OpenAI Five (Dota 2), AlphaStar (StarCraft) |
| Robotics | Dexterous manipulation, locomotion, drone control |
| LLM alignment | RLHF — fine-tuning language models with human preferences |
| Resource management | Data center cooling (DeepMind), network routing |
| Finance | Portfolio optimization, order execution |
| Autonomous vehicles | Decision making in complex traffic |
Getting started with RL
```python
import gymnasium as gym

# Classic control problem
env = gym.make("CartPole-v1")
obs, info = env.reset()
for _ in range(1000):
    action = env.action_space.sample()  # random policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
```
Key libraries: Gymnasium (environments), Stable-Baselines3 (algorithms), CleanRL (single-file implementations), RLlib (scalable distributed RL), TorchRL (PyTorch-native).
Evolutionary & Genetic Algorithms
Inspired by biological evolution: population → fitness evaluation → selection → crossover → mutation → repeat.
Core concepts
```
Generation 0:  [A] [B] [C] [D] [E]      ← random population
                ↓ evaluate fitness
Fitness:        7   3   9   5   8
                ↓ selection (fittest survive)
Parents:       [A] [C] [E]
                ↓ crossover + mutation
Generation 1:  [AC'] [CE'] [CA'] [EA'] [EC']
                ↓ repeat...
```
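The loop above can be sketched in plain Python on the classic "one-max" toy problem (maximize the number of 1-bits); population size, generations, and mutation rate are arbitrary choices for the example:

```python
import random

random.seed(0)
LENGTH, POP, GENS, MUT = 20, 30, 60, 0.02
fitness = lambda ind: sum(ind)                 # count of 1-bits

pop = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENS):
    def pick():                                # tournament selection, size 2
        a, b = random.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b
    nxt = []
    for _ in range(POP):
        p1, p2 = pick(), pick()
        cut = random.randrange(1, LENGTH)      # one-point crossover
        child = p1[:cut] + p2[cut:]
        child = [1 - g if random.random() < MUT else g for g in child]  # mutation
        nxt.append(child)
    pop = nxt

best = max(pop, key=fitness)
print(fitness(best))                           # typically near the maximum of 20
```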
Algorithm variants
| Algorithm | How it differs |
|---|---|
| Genetic Algorithm (GA) | Binary/discrete encoding, crossover + mutation |
| Genetic Programming (GP) | Evolves programs/trees rather than fixed-length vectors |
| Evolution Strategies (ES) | Continuous parameters, Gaussian perturbations, no crossover |
| CMA-ES | Covariance Matrix Adaptation — adapts the search distribution. Gold standard for continuous optimization |
| NEAT | Neuroevolution — evolves neural network topology and weights |
| Differential Evolution (DE) | Mutation via vector differences between population members |
| Particle Swarm Optimization (PSO) | Swarm intelligence — particles follow personal and global best |
| Ant Colony Optimization | Pheromone-based path optimization (combinatorial problems) |
When evolutionary beats gradient descent
- Non-differentiable objectives — discrete structures, combinatorial problems
- Multimodal landscapes — many local optima where gradient methods get stuck
- Architecture search — evolving neural network structures (NAS)
- Game AI — evolving strategies for NPCs
- Scheduling/routing — job shop, vehicle routing, timetabling
- Hardware design — circuit optimization, antenna design
Practical example: CMA-ES
```python
import cma

# Minimize a function with CMA-ES
def objective(x):
    return sum((xi - 3)**2 for xi in x)  # minimum at [3, 3, 3, ...]

es = cma.CMAEvolutionStrategy([0] * 10, 0.5)  # 10-dim, initial sigma=0.5
es.optimize(objective)
print(es.result.xbest)  # → close to [3, 3, 3, ...]
```
Key libraries
| Library | Language | Focus |
|---|---|---|
| DEAP | Python | General-purpose evolutionary computation |
| PyGAD | Python | Genetic algorithms, simple API |
| pycma | Python | CMA-ES reference implementation |
| Optuna | Python | Hyperparameter optimization (includes evolutionary samplers) |
| ECJ | Java | Mature evolutionary computation framework |
| Nevergrad | Python | Gradient-free optimization (Meta) |
Probabilistic & Bayesian Methods
Model uncertainty explicitly using probability distributions.
Key approaches
| Method | What it does |
|---|---|
| Bayesian Linear Regression | Linear regression with posterior distributions over weights |
| Gaussian Processes (GP) | Non-parametric — defines a distribution over functions. Perfect for small data with uncertainty |
| Bayesian Neural Networks | Neural nets with distributions over weights — quantifies prediction uncertainty |
| Variational Inference | Approximate intractable posteriors with simpler distributions |
| MCMC (Markov Chain Monte Carlo) | Sample from the posterior — Metropolis-Hastings, Hamiltonian MC, NUTS |
| Bayesian Optimization | Use a GP surrogate to optimize expensive black-box functions |
| Hidden Markov Models | Sequential data with hidden states — speech, genomics |
| Probabilistic Programming | Write models as programs, let the framework do inference |
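The core prior-plus-evidence mechanic can be shown exactly, with no library at all. A Beta-Bernoulli sketch for a coin's unknown bias: the Beta prior is conjugate, so the posterior after h heads and t tails is simply Beta(α + h, β + t):

```python
# Uniform Beta(1, 1) prior over the coin's probability of heads
alpha, beta = 1.0, 1.0
observations = [1, 1, 0, 1, 1, 1, 0, 1]    # 6 heads, 2 tails

# Conjugate update: each head bumps alpha, each tail bumps beta
for x in observations:
    if x == 1:
        alpha += 1
    else:
        beta += 1

posterior_mean = alpha / (alpha + beta)
print(posterior_mean)  # (1 + 6) / (1 + 6 + 1 + 2) = 0.7
```

MCMC and variational inference exist precisely for the models where no such closed-form update is available.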
Bayesian Optimization for hyperparameters
```python
from skopt import gp_minimize
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Assumes X_train, y_train, X_val, y_val are already defined
def train_and_evaluate(params):
    learning_rate, n_estimators = params
    # Train model, return negated validation accuracy (gp_minimize minimizes)
    model = XGBClassifier(learning_rate=learning_rate,
                          n_estimators=int(n_estimators))
    model.fit(X_train, y_train)
    return -accuracy_score(y_val, model.predict(X_val))

result = gp_minimize(
    train_and_evaluate,
    [(1e-4, 1.0, "log-uniform"),  # learning_rate
     (50, 500)],                  # n_estimators
    n_calls=50,
)
```
Key libraries
| Library | Focus |
|---|---|
| PyMC | Probabilistic programming, MCMC/VI |
| Stan / CmdStanPy | High-performance MCMC (NUTS sampler) |
| NumPyro | JAX-based probabilistic programming |
| GPyTorch | Scalable Gaussian Processes on GPU |
| BoTorch | Bayesian optimization (built on GPyTorch) |
| scikit-optimize | Sequential model-based optimization |
Meta-Learning ("Learning to Learn")
Algorithms that improve their learning ability across tasks.
| Approach | How it works | Example |
|---|---|---|
| MAML (Model-Agnostic Meta-Learning) | Learn initialization weights that adapt in few gradient steps | Few-shot image classification |
| Prototypical Networks | Classify by distance to class prototypes in embedding space | Few-shot learning |
| Matching Networks | Attention-based comparison to support set | One-shot learning |
| Neural Architecture Search (NAS) | Search for optimal network architecture | EfficientNet, DARTS |
| Learned Optimizers | Meta-learn the optimizer itself | L2L, VeLO |
| In-Context Learning | Large models learn from examples in the prompt | GPT-4, Claude |
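The nearest-prototype rule at the heart of Prototypical Networks is simple enough to sketch with NumPy, using raw 2-D features in place of a learned embedding (the support points below are invented for the example):

```python
import numpy as np

# 2-way, 3-shot support set: class 0 clustered near (0, 0), class 1 near (5, 5)
support = {
    0: np.array([[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1]]),
    1: np.array([[5.0, 4.9], [5.1, 5.0], [4.9, 5.1]]),
}
# Each class prototype is the mean of its support embeddings
prototypes = {c: pts.mean(axis=0) for c, pts in support.items()}

def classify(x):
    # Assign the class whose prototype is nearest in Euclidean distance
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

print(classify(np.array([0.3, 0.1])), classify(np.array([4.5, 5.2])))  # 0 1
```

The real method meta-trains the embedding function so this rule works across unseen tasks.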
Hybrid & Emerging Approaches
Neuro-symbolic AI
Combine neural networks (pattern recognition) with symbolic reasoning (logic, rules).
- Neural Theorem Proving — use neural nets to guide proof search
- Differentiable Programming with Logic — DeepProbLog, Scallop
- LLM + Knowledge Graphs — ground language model outputs in structured knowledge
Graph Neural Networks (GNNs)
Learn on graph-structured data directly.
| Variant | Mechanism |
|---|---|
| GCN | Aggregate neighbor features (spectral) |
| GraphSAGE | Sample and aggregate (inductive, scales) |
| GAT | Attention-weighted neighbor aggregation |
| GIN | Maximally powerful under WL isomorphism test |
Applications: molecule property prediction, social network analysis, fraud detection, recommendation systems.
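One round of the neighbor aggregation that all four variants share can be sketched with NumPy on a made-up 4-node graph (this is the mean-aggregation flavor; real GCN layers add a learned weight matrix and nonlinearity):

```python
import numpy as np

# Adjacency matrix of a small undirected graph
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)                     # one-hot node features

A_hat = A + np.eye(4)             # add self-loops so each node keeps its own signal
D_inv = np.diag(1.0 / A_hat.sum(axis=1))
H = D_inv @ A_hat @ X             # each node averages itself and its neighbors

print(H.sum(axis=1))              # rows stay normalized: [1. 1. 1. 1.]
```

Stacking k such layers lets information propagate k hops across the graph.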
Diffusion Models
Learn to denoise data — the backbone of modern generative AI.
- Forward process: Gradually add noise to data
- Reverse process: Neural network learns to remove noise step by step
- Generation: Start from pure noise, denoise iteratively
Used in: Stable Diffusion, DALL-E 3, Sora (video), protein structure generation.
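The forward process above has a convenient closed form — x_t can be sampled directly from x_0 as x_t = √ᾱ_t·x_0 + √(1−ᾱ_t)·ε — which a few lines demonstrate (the linear beta schedule and 1000 steps follow the common DDPM setup; the 8-dim vector stands in for an image):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)       # cumulative signal-retention factor

x0 = rng.normal(size=(8,))                # stand-in for a data sample
def noisy_sample(t):
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

# Early steps barely change the data; by t = T-1 it is almost pure noise.
print(alpha_bar[0], alpha_bar[-1])
```

The trained network runs this process in reverse, predicting and subtracting the noise one step at a time.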
Choosing the Right Family
| Your situation | Start here |
|---|---|
| Labeled tabular data | Gradient Boosting (XGBoost/LightGBM) |
| Labeled images/text/audio | Fine-tune a pre-trained model (transfer learning) |
| No labels, find groups | HDBSCAN or Gaussian Mixture Models |
| No labels, reduce dimensions | UMAP for visualization, PCA for preprocessing |
| Sequential decisions with rewards | PPO (policy-based RL) |
| Optimize non-differentiable objective | CMA-ES or Genetic Algorithm |
| Small data, need uncertainty | Gaussian Processes or Bayesian methods |
| Very few labeled examples | Meta-learning (Prototypical Networks) or in-context learning |
| Generate images/text/audio | Diffusion models, Transformers |
| Graph-structured data | GNN (GraphSAGE, GAT) |
Resources
- Reinforcement Learning: An Introduction — Sutton & Barto (free online)
- Spinning Up in Deep RL — OpenAI's RL educational resource
- Introduction to Evolutionary Computing — Eiben & Smith
- Probabilistic Machine Learning — Kevin Murphy (free online)
- Dive into Deep Learning — Interactive deep learning textbook
- Papers With Code — State-of-the-art benchmarks across all ML families