Prompt engineering is no longer ad hoc tinkering; it has matured into a discipline with repeatable patterns, evaluation frameworks, and systematic optimization strategies. This post catalogs the core techniques, maps them to use cases, and offers a framework for choosing the right approach.
## Technique Taxonomy
Prompt Engineering Techniques
│
├── Basic Prompting
│   ├── Zero-Shot: Direct instruction, no examples
│   ├── Few-Shot: 2-5 examples demonstrating desired output
│   └── System Prompting: Role and behavior definition
│
├── Reasoning Enhancement
│   ├── Chain-of-Thought (CoT): "Think step by step"
│   ├── Tree-of-Thought (ToT): Explore multiple reasoning paths
│   ├── Self-Consistency: Sample multiple CoT paths, majority vote
│   └── Step-Back Prompting: Abstract before solving
│
├── Agent Patterns
│   ├── ReAct: Reason + Act in alternating steps
│   ├── Reflexion: Self-evaluate and retry
│   ├── Plan-and-Execute: Plan first, then execute steps
│   └── Tool-Use: Route to external tools/APIs
│
├── Output Control
│   ├── Structured Output: JSON/XML schema enforcement
│   ├── Constrained Generation: Grammar-based constraints
│   └── Output Parsing: Post-process with validation
│
└── Optimization
    ├── Prompt Chaining: Break complex tasks into subtasks
    ├── Meta-Prompting: Use LLM to generate/refine prompts
    └── DSPy-style: Programmatic prompt optimization
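To make the "Few-Shot" branch concrete, here is a minimal sketch of assembling a few-shot prompt as a chat message list. It assumes the common OpenAI-style role/content message convention; the sentiment-classification task and example labels are illustrative, not from any particular API.

```python
# Sketch: build a few-shot prompt by interleaving (input, output)
# example pairs as user/assistant turns before the real query.

def build_few_shot_messages(system, examples, query):
    """Return a chat message list: system, example turns, then the query."""
    messages = [{"role": "system", "content": system}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages

examples = [
    ("I love this product!", "positive"),
    ("Terrible support, never again.", "negative"),
]
msgs = build_few_shot_messages(
    "You are a sentiment classifier. Reply with one word.",
    examples,
    "The delivery was fast and the packaging was great.",
)
# 1 system message + 2 examples x 2 turns + 1 query = 6 messages
```

The same builder works for any of the few-shot rows in the tables below; only the system text and example pairs change.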
## Pattern Comparison Table
| Technique | Complexity | Latency Impact | Cost Impact | Reliability Gain | Best Model Tier |
|---|---|---|---|---|---|
| Zero-Shot | Low | None | Lowest | Baseline | Any |
| Few-Shot | Low | Minimal (+tokens) | Low | +15-30% | Any |
| Chain-of-Thought | Low | +30-50% | +30-50% | +20-40% on reasoning | Medium+ |
| Self-Consistency | Medium | 3-5x (parallel) | 3-5x | +10-20% over CoT | Medium+ |
| Tree-of-Thought | High | 5-10x | 5-10x | +15-25% over CoT | Large |
| ReAct | High | Variable (tool calls) | Variable | High for tool tasks | Large |
| Reflexion | High | 2-3x (retry loop) | 2-3x | +10-30% | Large |
| Structured Output | Low | Minimal | Minimal | High for parsing | Any (with support) |
| Prompt Chaining | Medium | Additive per step | Additive | High for complex tasks | Any |
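The Self-Consistency row above (3-5x cost for +10-20% over CoT) can be sketched as: sample several chain-of-thought completions at temperature > 0, extract each final answer, and take the majority vote. The `sample_cot` callable is a hypothetical stand-in for a real model call, and the "Answer:" convention is an assumption about how the prompt asks the model to finish.

```python
# Self-consistency sketch: majority vote over sampled CoT answers.
from collections import Counter

def extract_final_answer(completion):
    # Convention (assumed): the prompt asks the model to end with "Answer: <x>".
    return completion.rsplit("Answer:", 1)[-1].strip()

def self_consistency(sample_cot, question, n=5):
    answers = [extract_final_answer(sample_cot(question)) for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n  # winning answer plus agreement ratio

# Stubbed sampler simulating noisy reasoning paths:
fake_samples = iter([
    "...reasoning... Answer: 42",
    "...reasoning... Answer: 42",
    "...reasoning... Answer: 41",
    "...reasoning... Answer: 42",
    "...reasoning... Answer: 42",
])
answer, agreement = self_consistency(lambda q: next(fake_samples), "q", n=5)
# answer == "42", agreement == 0.8
```

The agreement ratio is a useful free signal: low agreement across samples often flags questions worth escalating to a stronger model or a human.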
## Evaluation Framework
| Dimension | What to Measure | How to Measure | Target |
|---|---|---|---|
| Accuracy | Correctness of output | Human eval or ground truth comparison | Task-specific |
| Consistency | Same input produces similar output | Run N times, measure variance | Variance < 10% |
| Format compliance | Output matches schema | Schema validation pass rate | > 99% |
| Latency | Time to first token + total | API response timing | Task-specific |
| Cost | Tokens consumed per task | Token counting | Budget-constrained |
| Safety | No harmful or biased output | Red-team testing + guardrails | Zero violations |
| Robustness | Performance on edge cases | Adversarial test suite | Graceful degradation |
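The consistency and format-compliance rows can be measured with a small harness like the sketch below: run the same prompt N times, report the schema pass rate and the share of runs that agree with the most common output. The `run_prompt` callable and the `{"label": ...}` schema are hypothetical stand-ins.

```python
# Sketch: measure format compliance and run-to-run consistency.
import json
from collections import Counter

def is_valid_json(text):
    """Assumed schema check: a JSON object containing a "label" key."""
    try:
        obj = json.loads(text)
    except ValueError:
        return False
    return isinstance(obj, dict) and "label" in obj

def evaluate(run_prompt, prompt, n=10):
    outputs = [run_prompt(prompt) for _ in range(n)]
    format_rate = sum(is_valid_json(o) for o in outputs) / n
    # Consistency: share of runs matching the most common output.
    top_count = Counter(outputs).most_common(1)[0][1]
    return {"format_compliance": format_rate, "consistency": top_count / n}

# Stub: 9 identical valid replies, 1 malformed.
outs = iter(['{"label": "a"}'] * 9 + ["not json"])
report = evaluate(lambda p: next(outs), "classify: ...", n=10)
# format_compliance == 0.9, consistency == 0.9
```

In practice you would log these per prompt version, so a regression in either metric is visible before a prompt change ships.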
## Use Case to Technique Decision Matrix
| Use Case | Recommended Primary | Fallback | Avoid |
|---|---|---|---|
| Classification | Few-Shot | Zero-Shot + examples in system | CoT (overkill) |
| Summarization | Zero-Shot + constraints | Few-Shot with style examples | Self-Consistency |
| Code generation | Few-Shot + Structured Output | CoT for complex logic | Zero-Shot for complex |
| Math/reasoning | CoT + Self-Consistency | Tree-of-Thought | Zero-Shot |
| Data extraction | Few-Shot + Structured Output | Prompt Chaining | Unstructured output |
| Multi-step research | ReAct / Plan-and-Execute | Prompt Chaining | Single-shot |
| Creative writing | System prompt + constraints | Few-Shot for style | Over-constraining |
| Conversation/chat | System prompt + context mgmt | RAG for knowledge | Long context stuffing |
| Decision support | CoT + Structured Output | Tree-of-Thought | Zero-Shot |
| Content moderation | Few-Shot + classification | Constitutional AI pattern | Complex reasoning |
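Prompt Chaining, the fallback for several rows above, amounts to decomposing a task into sequential subtasks and feeding each step's output into the next prompt. The sketch below shows the shape with an extract → outline → summarize chain; `call_llm` and the step prompts are hypothetical.

```python
# Prompt-chaining sketch: each subtask's output feeds the next prompt.

def chain(call_llm, document):
    facts = call_llm(f"Extract the key facts:\n{document}")
    outline = call_llm(f"Organize these facts into an outline:\n{facts}")
    summary = call_llm(f"Write a summary from this outline:\n{outline}")
    return summary

# Stub that records each step so the data flow is visible:
log = []
def fake_llm(prompt):
    log.append(prompt.split(":")[0])  # record the instruction line
    return f"<step {len(log)} output>"

result = chain(fake_llm, "...long document...")
# Steps ran in order: extract -> outline -> summarize
```

The win over a single prompt is that each step can be evaluated, cached, and retried independently, at the cost of additive latency per step (as the comparison table notes).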
## Anti-Patterns to Avoid
Common Anti-Patterns
│
├── Prompt Stuffing
│   └── Cramming too much context → model loses focus
│       Fix: Prioritize, summarize, use RAG
│
├── Instruction Overload
│   └── 20+ rules in system prompt → contradictions
│       Fix: Hierarchical instructions, test each rule
│
├── Example Contamination
│   └── Few-shot examples bias toward narrow patterns
│       Fix: Diverse examples, include edge cases
│
├── Output Format Ambiguity
│   └── "Return structured data" without schema
│       Fix: Provide explicit JSON schema or example
│
├── Ignoring Model Strengths
│   └── Using CoT on a task where zero-shot works fine
│       Fix: Start simple, add complexity only when needed
│
└── No Evaluation Pipeline
    └── Changing prompts without measuring impact
        Fix: A/B test prompts, track metrics over time
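The Output Format Ambiguity fix, spelled out: put an explicit schema in the prompt and validate the reply before anything downstream consumes it. The schema and field names here (`category`, `confidence`) are illustrative.

```python
# Sketch: explicit schema in the prompt + strict validation of the reply.
import json

SCHEMA_HINT = (
    "Return ONLY a JSON object matching:\n"
    '{"category": string, "confidence": number between 0 and 1}'
)

def parse_reply(reply):
    """Validate keys, types, and ranges; raise on anything unexpected."""
    obj = json.loads(reply)
    if set(obj) != {"category", "confidence"}:
        raise ValueError(f"unexpected keys: {sorted(obj)}")
    if not isinstance(obj["category"], str):
        raise ValueError("category must be a string")
    if not 0.0 <= float(obj["confidence"]) <= 1.0:
        raise ValueError("confidence out of range")
    return obj

parsed = parse_reply('{"category": "billing", "confidence": 0.93}')
# parsed["category"] == "billing"
```

Rejecting malformed replies at the boundary (and retrying) is what turns the "> 99% format compliance" target from the evaluation table into an enforceable contract rather than a hope.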
## Resources