
Prompt Engineering: Systematic Patterns for Reliable LLM Outputs

#artificial-intelligence #llm #prompt-engineering #best-practices

Prompt engineering is no longer ad hoc tinkering: it has matured into a discipline with repeatable patterns, evaluation frameworks, and systematic optimization strategies. This post catalogs the core techniques, maps them to use cases, and provides a framework for selecting the right approach.

Technique Taxonomy

Prompt Engineering Techniques
│
├── Basic Prompting
│   ├── Zero-Shot: Direct instruction, no examples
│   ├── Few-Shot: 2-5 examples demonstrating desired output
│   └── System Prompting: Role and behavior definition
│
├── Reasoning Enhancement
│   ├── Chain-of-Thought (CoT): "Think step by step"
│   ├── Tree-of-Thought (ToT): Explore multiple reasoning paths
│   ├── Self-Consistency: Sample multiple CoT paths, majority vote
│   └── Step-Back Prompting: Abstract before solving
│
├── Agent Patterns
│   ├── ReAct: Reason + Act in alternating steps
│   ├── Reflexion: Self-evaluate and retry
│   ├── Plan-and-Execute: Plan first, then execute steps
│   └── Tool-Use: Route to external tools/APIs
│
├── Output Control
│   ├── Structured Output: JSON/XML schema enforcement
│   ├── Constrained Generation: Grammar-based constraints
│   └── Output Parsing: Post-process with validation
│
└── Optimization
    ├── Prompt Chaining: Break complex tasks into subtasks
    ├── Meta-Prompting: Use LLM to generate/refine prompts
    └── DSPy-style: Programmatic prompt optimization
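
The basic prompting patterns at the top of the taxonomy reduce to simple template construction. A minimal sketch, assuming a generic `Input:`/`Output:` layout; the `build_*` helpers are illustrative, not any particular library's API:

```python
def build_zero_shot(instruction: str, query: str) -> str:
    """Zero-shot: direct instruction, no examples."""
    return f"{instruction}\n\nInput: {query}\nOutput:"

def build_few_shot(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    """Few-shot: 2-5 input/output pairs demonstrating the desired output."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {query}\nOutput:"

prompt = build_few_shot(
    "Classify the sentiment as positive or negative.",
    [("I loved it", "positive"), ("Terrible service", "negative")],
    "Best purchase I've made all year",
)
```

The same skeleton extends to chain-of-thought by appending a reasoning cue (for example, "Think step by step before the final answer") to the instruction.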

Pattern Comparison Table

| Technique | Complexity | Latency Impact | Cost Impact | Reliability Gain | Best Model Tier |
| --- | --- | --- | --- | --- | --- |
| Zero-Shot | Low | None | Lowest | Baseline | Any |
| Few-Shot | Low | Minimal (+tokens) | Low | +15-30% | Any |
| Chain-of-Thought | Low | +30-50% | +30-50% | +20-40% on reasoning | Medium+ |
| Self-Consistency | Medium | 3-5x (parallel) | 3-5x | +10-20% over CoT | Medium+ |
| Tree-of-Thought | High | 5-10x | 5-10x | +15-25% over CoT | Large |
| ReAct | High | Variable (tool calls) | Variable | High for tool tasks | Large |
| Reflexion | High | 2-3x (retry loop) | 2-3x | +10-30% | Large |
| Structured Output | Low | Minimal | Minimal | High for parsing | Any (with support) |
| Prompt Chaining | Medium | Additive per step | Additive | High for complex tasks | Any |
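
Self-consistency, one of the cheaper reliability boosters in the table, can be sketched in a few lines: sample several chain-of-thought completions at nonzero temperature, extract each final answer, and take the majority vote. Here `sample_fn` is a hypothetical stand-in for one model call:

```python
from collections import Counter

def self_consistency(sample_fn, n: int = 5) -> str:
    """Sample n chain-of-thought completions and majority-vote the answers.

    sample_fn is a hypothetical callable: one model call at temperature > 0
    that returns the extracted final answer as a string.
    """
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Simulated samples for illustration; in practice each comes from a
# separate CoT completion against the model.
samples = iter(["42", "42", "41", "42", "40"])
answer = self_consistency(lambda: next(samples), n=5)
```

The 3-5x cost multiplier in the table comes directly from `n`: every vote is a full completion, though the samples can run in parallel.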

Evaluation Framework

| Dimension | What to Measure | How to Measure | Target |
| --- | --- | --- | --- |
| Accuracy | Correctness of output | Human eval or ground-truth comparison | Task-specific |
| Consistency | Same input produces similar output | Run N times, measure variance | Variance < 10% |
| Format compliance | Output matches schema | Schema validation pass rate | > 99% |
| Latency | Time to first token + total | API response timing | Task-specific |
| Cost | Tokens consumed per task | Token counting | Budget-constrained |
| Safety | No harmful or biased output | Red-team testing + guardrails | Zero violations |
| Robustness | Performance on edge cases | Adversarial test suite | Graceful degradation |
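
Format compliance is the easiest dimension to automate. A minimal pass-rate checker, assuming outputs are expected to be JSON objects with known keys (the sample outputs are illustrative):

```python
import json

def format_compliance(outputs: list[str], required_keys: list[str]) -> float:
    """Fraction of outputs that parse as JSON objects with all required keys."""
    ok = 0
    for raw in outputs:
        try:
            obj = json.loads(raw)
            if isinstance(obj, dict) and all(k in obj for k in required_keys):
                ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)

# Illustrative batch; the table's target is a pass rate > 99%.
outputs = ['{"label": "spam"}', '{"label": "ham"}', "not json"]
rate = format_compliance(outputs, ["label"])
```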

Use Case to Technique Decision Matrix

| Use Case | Recommended Primary | Fallback | Avoid |
| --- | --- | --- | --- |
| Classification | Few-Shot | Zero-Shot + examples in system | CoT (overkill) |
| Summarization | Zero-Shot + constraints | Few-Shot with style examples | Self-Consistency |
| Code generation | Few-Shot + Structured Output | CoT for complex logic | Zero-Shot for complex |
| Math/reasoning | CoT + Self-Consistency | Tree-of-Thought | Zero-Shot |
| Data extraction | Few-Shot + Structured Output | Prompt Chaining | Unstructured output |
| Multi-step research | ReAct / Plan-and-Execute | Prompt Chaining | Single-shot |
| Creative writing | System prompt + constraints | Few-Shot for style | Over-constraining |
| Conversation/chat | System prompt + context mgmt | RAG for knowledge | Long context stuffing |
| Decision support | CoT + Structured Output | Tree-of-Thought | Zero-Shot |
| Content moderation | Few-Shot + classification | Constitutional AI pattern | Complex reasoning |
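
The Structured Output entries in the matrix pair naturally with a validate-and-retry loop: parse the model's JSON, run a schema check, and feed any error back into the next attempt. A sketch, where `call_llm` and the retry-message wording are assumptions to be replaced with your provider's API:

```python
import json

def parse_with_retry(call_llm, prompt, validate, max_retries=2):
    """Parse model output as JSON, validate it, and retry with error feedback.

    call_llm is a hypothetical one-shot model call (prompt -> raw text);
    validate should raise ValueError when the schema check fails.
    """
    last_error = None
    for _ in range(max_retries + 1):
        attempt = prompt if last_error is None else (
            f"{prompt}\n\nYour previous output was invalid ({last_error}). "
            "Return valid JSON only, with no surrounding text."
        )
        raw = call_llm(attempt)
        try:
            obj = json.loads(raw)
            validate(obj)
            return obj
        except ValueError as e:  # json.JSONDecodeError subclasses ValueError
            last_error = e
    raise ValueError(f"No valid output after {max_retries} retries: {last_error}")

# Stubbed model for illustration: fails once, then returns valid JSON.
replies = iter(["Sure! Here you go: name=Ada", '{"name": "Ada"}'])
def validate(obj):
    if "name" not in obj:
        raise ValueError("missing required key: name")
result = parse_with_retry(lambda p: next(replies), "Extract the person as JSON.", validate)
```

If your provider supports grammar- or schema-constrained decoding, prefer that over retries; this loop is the fallback for models without it.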

Anti-Patterns to Avoid

Common Anti-Patterns
│
├── Prompt Stuffing
│   └── Cramming too much context → model loses focus
│       Fix: Prioritize, summarize, use RAG
│
├── Instruction Overload
│   └── 20+ rules in system prompt → contradictions
│       Fix: Hierarchical instructions, test each rule
│
├── Example Contamination
│   └── Few-shot examples bias toward narrow patterns
│       Fix: Diverse examples, include edge cases
│
├── Output Format Ambiguity
│   └── "Return structured data" without schema
│       Fix: Provide explicit JSON schema or example
│
├── Ignoring Model Strengths
│   └── Using CoT on a task where zero-shot works fine
│       Fix: Start simple, add complexity only when needed
│
└── No Evaluation Pipeline
    └── Changing prompts without measuring impact
        Fix: A/B test prompts, track metrics over time
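
The fix for the last anti-pattern is a measurement loop. A minimal A/B harness that scores each prompt variant on the same fixed test cases; `run_task` is a hypothetical scorer that, in practice, would call the model and compare the output against ground truth:

```python
def ab_test(variants: dict, run_task, cases) -> dict:
    """Score each prompt variant on an identical, fixed set of test cases.

    run_task(prompt, case) is a hypothetical callable returning True when
    the model's output for that case is correct; here it is stubbed out.
    """
    return {
        name: sum(run_task(prompt, case) for case in cases) / len(cases)
        for name, prompt in variants.items()
    }

# Stubbed scorer for illustration only: pretends the CoT variant always wins.
scores = ab_test(
    {"v1": "Classify:", "v2": "Classify. Think step by step:"},
    lambda prompt, case: ("step" in prompt) or (case % 2 == 0),
    cases=[0, 1, 2, 3],
)
```

Holding the test cases fixed across variants is the point: it turns "the new prompt feels better" into a per-variant accuracy number you can track over time.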

Resources