Computer vision has evolved from simple classifiers to systems that understand, segment, and generate visual content. The strategic question is no longer "can we do this?" but "where does vision AI create the most value, and how do we deploy it reliably?"
Task Taxonomy
Computer Vision Tasks
├── Understanding
│ ├── Image Classification (what is in the image?)
│ ├── Object Detection (where are the objects?)
│ ├── Semantic Segmentation (pixel-level class labels)
│ ├── Instance Segmentation (individual object masks)
│ ├── Panoptic Segmentation (stuff + things)
│ └── Pose Estimation (body/hand keypoints)
├── Analysis
│ ├── OCR / Document Understanding
│ ├── Scene Understanding
│ ├── Action Recognition (video)
│ ├── Anomaly Detection (defect, fraud)
│ └── 3D Reconstruction
├── Generation
│ ├── Image Generation (Stable Diffusion, DALL-E, Midjourney)
│ ├── Image Editing (inpainting, style transfer)
│ ├── Video Generation (Sora, Runway)
│ ├── Super Resolution
│ └── Synthetic Data Generation
└── Multimodal
  ├── Visual Question Answering (VQA)
  ├── Image Captioning
  ├── Visual Search
  └── Vision-Language Models (GPT-4V, Claude Vision, Gemini)
Industry Application Matrix
| Industry | Use Case | Task Type | Maturity | Business Impact |
|---|---|---|---|---|
| Manufacturing | Defect detection | Anomaly detection | High | Reduces scrap by 20-40% |
| Manufacturing | Assembly verification | Object detection | High | Prevents downstream errors |
| Retail | Visual search | Embedding + similarity | High | Increases conversion 10-25% |
| Retail | Shelf monitoring | Object detection | Medium | Reduces stockouts |
| Healthcare | Medical imaging (radiology) | Classification + segmentation | Medium | Assists diagnosis, reduces read time |
| Healthcare | Pathology slide analysis | Segmentation | Medium | Scales pathologist capacity |
| Agriculture | Crop disease detection | Classification | Medium | Early intervention, yield protection |
| Autonomous Vehicles | Perception stack | Detection + segmentation | Medium | Core to self-driving |
| Security | Anomaly detection (surveillance) | Action recognition | Medium | Reduces false alarms |
| Insurance | Damage assessment (auto, property) | Detection + classification | Medium | Accelerates claims 50-70% |
| Construction | Progress monitoring | Change detection | Low-Medium | Reduces project delays |
| Media | Content generation | Generation | High | Reduces production costs 60%+ |
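The "Embedding + similarity" task type listed for retail visual search can be sketched as a nearest-neighbor lookup over image embeddings. This is a minimal illustration, not a production design: the three-dimensional vectors are hand-made toy values standing in for the output of an embedding model such as CLIP or DINOv2, and the names `cosine_similarity`, `visual_search`, and `catalog` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def visual_search(query, catalog, top_k=2):
    """Return the top_k catalog items most similar to the query embedding."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in catalog.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (assumed values); a real system would compute these
# with an image encoder and store them in a vector index.
catalog = {
    "red_sneaker": [0.9, 0.1, 0.0],
    "red_boot": [0.7, 0.3, 0.1],
    "blue_jacket": [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # embedding of the shopper's photo (assumed)

results = visual_search(query, catalog)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

A linear scan like this is only practical for small catalogs; at larger scale the same ranking is usually delegated to an approximate nearest-neighbor index.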
Model Architecture Comparison
| Model | Task | Parameters | Speed (FPS) | Accuracy | Open Source | Best For |
|---|---|---|---|---|---|---|
| YOLOv8/v9 | Detection | 3-68M | 80-500+ | High | Yes | Real-time detection, edge |
| RT-DETR | Detection | 32-67M | 60-100 | Very High | Yes | High-accuracy detection |
| SAM 2 | Segmentation | 39-224M | 15-30 | Very High | Yes | Zero-shot segmentation |
| DINOv2 | Foundation (vision) | 21-1100M | 30-100 | Very High | Yes | Transfer learning backbone |
| CLIP | Vision-Language | 63-428M | 50-200 | High | Yes | Zero-shot classification, search |
| EfficientNet | Classification | 5-66M | 100-500 | High | Yes | Classification, mobile |
| Stable Diffusion 3 | Generation | ~2B | 1-5 | Very High | Yes | Image generation |
| GPT-4V / Claude Vision | VQA, understanding | N/A (API) | 1-5 | Very High | Closed | General vision understanding |
Deployment: Cloud vs Edge
| Factor | Cloud Deployment | Edge Deployment |
|---|---|---|
| Latency | 100-2000ms (network) | 10-100ms (local) |
| Bandwidth | Requires continuous upload | Processes locally |
| Cost at Scale | Per-inference (scales linearly) | Fixed hardware cost (amortized) |
| Model Size | Unlimited | Constrained (mobile/embedded) |
| Privacy | Data leaves premises | Data stays on device |
| Updates | Instant (server-side) | Requires OTA update |
| Availability | Requires internet | Works offline |
| Hardware | GPU instances (A100, T4) | Jetson, Coral, mobile NPU |
| Best For | Complex tasks, batch processing | Real-time, privacy-sensitive, remote |
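The "Cost at Scale" row implies a break-even volume: cloud spend grows linearly with inference count, while edge cost is roughly fixed once the hardware is amortized. The sketch below works through that arithmetic. Every number in it is a hypothetical placeholder, not a vendor price, and the function names are illustrative.

```python
def edge_monthly_cost(hardware_cost, amortization_months, monthly_operating_cost):
    """Fixed edge cost per month: amortized hardware plus power/maintenance."""
    return hardware_cost / amortization_months + monthly_operating_cost


def cloud_monthly_cost(inferences_per_month, price_per_inference):
    """Cloud cost grows linearly with the number of inferences."""
    return inferences_per_month * price_per_inference


def break_even_inferences(hardware_cost, amortization_months,
                          monthly_operating_cost, price_per_inference):
    """Monthly volume at which cloud spend equals the fixed edge cost."""
    fixed = edge_monthly_cost(hardware_cost, amortization_months,
                              monthly_operating_cost)
    return fixed / price_per_inference


# Hypothetical inputs: a $1,200 edge device amortized over 24 months,
# $10/month to run, versus a cloud API at $0.001 per inference.
volume = break_even_inferences(1200, 24, 10, 0.001)
print(f"Break-even at about {volume:,.0f} inferences per month")
```

Under these assumed inputs the edge device costs $60 per month, so cloud is cheaper below roughly 60,000 inferences per month and edge is cheaper above it. The model ignores engineering time, fleet management, and cloud volume discounts, all of which can move the crossover point.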
Deployment Decision Framework
Is real-time latency (< 50ms) required?
├── YES: Is the model small enough for edge hardware?
│ ├── YES --> Edge deployment (ONNX, TensorRT, Core ML)
│ └── NO: Can you distill/quantize to fit?
│   ├── YES --> Optimize + edge deploy
│   └── NO --> Cloud GPU with edge caching
└── NO: Is data privacy a constraint?
  ├── YES --> On-premise GPU server or edge
  └── NO: What is your volume?
    ├── Low (< 10K/day) --> Managed API (Amazon Rekognition, Google Cloud Vision)
    ├── Medium (10K-1M/day) --> Self-hosted on cloud GPU
    └── High (> 1M/day) --> Dedicated cluster or edge fleet
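The decision tree above can be encoded as a small function, which makes the branch order explicit and easy to test. This is a sketch under stated assumptions: the function name, parameter names, and returned labels are illustrative, and the thresholds are simply the ones given in the tree.

```python
def recommend_deployment(needs_realtime, fits_on_edge, can_compress,
                         privacy_constrained, inferences_per_day):
    """Walk the decision tree top to bottom and return a deployment label."""
    if needs_realtime:  # latency budget under ~50 ms
        if fits_on_edge:
            return "Edge deployment"
        if can_compress:  # distillation or quantization makes it fit
            return "Optimize + edge deploy"
        return "Cloud GPU with edge caching"
    if privacy_constrained:
        return "On-premise GPU server or edge"
    if inferences_per_day < 10_000:
        return "Managed API"
    if inferences_per_day <= 1_000_000:
        return "Self-hosted on cloud GPU"
    return "Dedicated cluster or edge fleet"


# Example: a batch workload with no latency or privacy constraint
# at 250,000 images per day.
print(recommend_deployment(
    needs_realtime=False,
    fits_on_edge=False,
    can_compress=False,
    privacy_constrained=False,
    inferences_per_day=250_000,
))
```

Note that latency is checked before privacy, as in the tree, so a real-time workload whose model cannot be compressed is routed to cloud GPU regardless of the privacy flag. Teams with strict data-residency rules may want to reorder those checks.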
Resources