Prompt Engineering
Core prompting patterns — zero-shot, few-shot, chain-of-thought, role prompting, system prompts, and structured output — with examples and when-to-use guidance.
Prompt engineering is the art and science of crafting inputs to LLMs that reliably produce the desired output. A well-designed prompt doesn’t just get a good answer — it gets consistently good answers, reducing the variance that makes LLM-based systems unreliable.
Prompting vs. Fine-tuning
Prompting is the first tool to reach for. It’s fast to iterate, costs nothing to change, and works with any model. Fine-tuning is for when you’ve exhausted what prompting can achieve and still need additional performance gains. See Fine-Tuning for when to make the switch.
The Anatomy of a Prompt
Every effective prompt has these components (not all are always needed):
┌──────────────────────────────────────┐
│ System Prompt                        │
│ Role, context, output format rules   │
├──────────────────────────────────────┤
│ Examples (Few-shot)                  │
│ Input/output demonstrations          │
├──────────────────────────────────────┤
│ User Message                         │
│ The actual task/question             │
└──────────────────────────────────────┘
Zero-Shot Prompting
Give the model a task without any examples. Relies entirely on the model’s training.
Classify the sentiment of this text as Positive, Negative, or Neutral.
Text: "The new update completely broke my workflow. Unacceptable."
Sentiment:
When to use: Simple, well-understood tasks. Classification, summarisation, translation, answering factual questions.
When NOT to use: Complex reasoning, unusual formats, tasks where the model hasn’t been trained on your specific domain.
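A zero-shot prompt is just the task description plus the input. As a minimal sketch (the `zeroShotSentimentPrompt` helper is illustrative, not from any library), the example above can be assembled as a template:

```typescript
// Build a zero-shot sentiment prompt: task description plus the input,
// with no examples -- the model relies entirely on its training.
function zeroShotSentimentPrompt(text: string): string {
  return [
    'Classify the sentiment of this text as Positive, Negative, or Neutral.',
    `Text: "${text}"`,
    'Sentiment:',
  ].join('\n');
}

const prompt = zeroShotSentimentPrompt(
  'The new update completely broke my workflow. Unacceptable.'
);
```

Ending the prompt with `Sentiment:` cues the model to complete with just the label rather than a full sentence.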
Few-Shot Prompting
Provide a few examples of input → output pairs before the actual task. This “primes” the model on the pattern you want.
Classify sentiment:
Text: "This product is absolutely fantastic!" → Positive
Text: "Arrived late and damaged." → Negative
Text: "It works as described." → Neutral
Text: "The new update completely broke my workflow. Unacceptable." →
When to use: When zero-shot gives inconsistent results. When you need a specific output format. Custom classification schemes. Domain-specific tasks.
Number of examples: 3-5 is usually enough. More doesn’t always help and adds cost. Make sure the examples cover edge cases and include negative examples.
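In code, few-shot prompts are usually assembled from a list of labelled examples. A sketch (the `fewShotPrompt` helper and `FewShotExample` type are illustrative):

```typescript
// One labelled example for the few-shot block.
interface FewShotExample {
  input: string;
  output: string;
}

// Render each example as a "Text: ... → Label" line, then append the real
// input with a trailing arrow so the model completes the pattern.
function fewShotPrompt(
  task: string,
  examples: FewShotExample[],
  input: string,
): string {
  const demos = examples.map((e) => `Text: "${e.input}" → ${e.output}`);
  return [task, ...demos, `Text: "${input}" →`].join('\n');
}
```

Keeping the examples as data rather than hard-coded text makes it easy to swap them out when testing which examples work best.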
Chain-of-Thought (CoT) Prompting
For complex reasoning tasks, ask the model to “think step by step” before giving the final answer. This dramatically improves performance on multi-step problems.
A train leaves Station A at 9:00 a.m. travelling at 60 mph.
Another train leaves Station B (120 miles away) at 10:00 a.m. travelling towards A at 80 mph.
At what time do they meet?
Think step by step:
The model will reason through the problem before arriving at the answer, catching errors it would make if jumping straight to a conclusion.
When to use: Maths problems, logical reasoning, multi-step analysis, complex decision-making, debugging.
Note: CoT prompting costs more tokens and adds latency. Only use it when the task genuinely requires multi-step reasoning.
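Because the cost is real, it can help to make CoT an explicit, opt-in wrapper rather than baking it into every prompt. A minimal sketch (the helper name is illustrative):

```typescript
// Append a step-by-step cue to any question. Worth the extra tokens and
// latency only for genuinely multi-step problems.
function withChainOfThought(question: string): string {
  return `${question}\n\nThink step by step:`;
}
```

This keeps the decision of whether to pay the CoT cost at the call site, per task.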
Few-Shot CoT
Combine few-shot with chain-of-thought by showing examples that include the reasoning:
Q: If I have 5 apples and give away 2, then buy 4 more, how many do I have?
A: Let me think step by step. Start with 5. Give away 2: 5 - 2 = 3. Buy 4 more: 3 + 4 = 7. Answer: 7.
Q: A store has 48 items. They sell 25% and receive a shipment of 20. How many items now?
A:
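Because the reasoning precedes the final answer, your code has to extract the answer from the completion. A sketch, assuming the few-shot examples establish an `Answer: N` convention as above:

```typescript
// Pull the final numeric answer out of a chain-of-thought completion that
// follows the "Answer: N" convention set by the few-shot examples.
function extractAnswer(completion: string): number | null {
  const match = completion.match(/Answer:\s*(-?\d+(?:\.\d+)?)/);
  return match ? Number(match[1]) : null;
}
```

Returning `null` on a missing match lets the caller detect completions that broke the convention instead of silently mis-parsing them.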
Role Prompting
Assign a persona or expertise role to the model. This activates relevant knowledge and tone.
You are a senior TypeScript engineer specialising in Domain-Driven Design
and Hexagonal Architecture. You write clean, well-typed code with comprehensive
error handling and always follow SOLID principles.
Review this code and identify any architectural violations:
[code here]
Why it works: Role framing activates patterns from the model’s training data associated with that expert persona — better terminology, more relevant knowledge, appropriate tone.
Best practice: Be specific about the role. “Senior TypeScript engineer specialising in DDD” is far better than “a software engineer.”
System Prompts
The system prompt is a persistent context that frames every interaction. It’s the ideal place for:
- Persona definition (role prompting)
- Hard rules and constraints
- Output format requirements
- Context about the application or domain
const systemPrompt = `
You are Aircury's internal engineering assistant, specialised in our
Framework (OpenSpec Extended with DDD and Hexagonal Architecture).
## Rules
- Always produce TypeScript code following SOLID principles
- Domain classes must never import from infrastructure packages
- All dependencies must be constructor-injected as interfaces
- Commit messages must follow Conventional Commits format
## Context
The codebase uses:
- TypeScript 5.4+
- Hexagonal Architecture (src/domain, src/application, src/infrastructure)
- PostgreSQL with a custom ORM wrapper
- Jest for testing, Cucumber for BDD
`;
System Prompt Persistence
In API usage, the system prompt is sent with every request. Keep it focused and concise — verbose system prompts waste tokens and can dilute key instructions. Put task-specific context in the user message, not the system prompt.
Structured Output
Force the model to output in a specific parseable format. Critical for building reliable pipelines.
// Force JSON output via prompt
const prompt = `
Analyse this code review and extract issues in JSON format:
\`\`\`typescript
${code}
\`\`\`
Respond with ONLY valid JSON (no markdown, no explanation):
{
  "issues": [
    {
      "severity": "error" | "warning" | "info",
      "principle": "SRP" | "OCP" | "LSP" | "ISP" | "DIP" | "other",
      "description": "string",
      "line": number | null,
      "suggestion": "string"
    }
  ],
  "summary": "string"
}
`;
Better: Use OpenAI’s JSON mode or function calling / tool use for reliable structured output:
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  response_format: { type: 'json_object' }, // Guaranteed valid JSON
  messages: [{ role: 'user', content: prompt }],
});
Or Anthropic’s tool use (forces specific schema):
const response = await anthropic.messages.create({
  model: 'claude-opus-4-5',
  max_tokens: 1024, // required by the Messages API
  tools: [{
    name: 'extract_code_review',
    description: 'Extract code review findings',
    input_schema: {
      type: 'object',
      properties: {
        issues: { type: 'array', ... },
        summary: { type: 'string' }
      }
    }
  }],
  tool_choice: { type: 'tool', name: 'extract_code_review' },
  messages: [{ role: 'user', content: prompt }],
});
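Even with JSON mode, it is worth validating the parsed result before it enters the pipeline. A minimal runtime check as a sketch (the `ReviewResult` types mirror the schema above; a validation library such as Zod would do this more thoroughly):

```typescript
interface ReviewIssue {
  severity: 'error' | 'warning' | 'info';
  principle: string;
  description: string;
  line: number | null;
  suggestion: string;
}

interface ReviewResult {
  issues: ReviewIssue[];
  summary: string;
}

// Parse the model's JSON output and verify the top-level shape
// before trusting it downstream.
function parseReview(raw: string): ReviewResult {
  const data = JSON.parse(raw);
  if (!Array.isArray(data.issues) || typeof data.summary !== 'string') {
    throw new Error('Model output does not match the expected schema');
  }
  return data as ReviewResult;
}
```

Failing loudly here is deliberate: a schema mismatch caught at the boundary is far easier to debug than a malformed object propagating through the pipeline.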
Prompt Engineering Best Practices
| Practice | Details |
|---|---|
| Be specific | “Write a TypeScript function” < “Write a TypeScript function that takes a UserId, queries PostgreSQL via the UserRepository interface, and throws a UserNotFoundError if the user doesn’t exist.” |
| Put context first | State what the model needs to know before what it needs to do. Context → Task → Format. |
| Use delimiters | Separate content from instructions: code blocks, XML-like <document> tags, or --- separators. |
| Specify the output format | “Respond with only valid JSON.” “Write a bulleted list.” “Provide a single TypeScript function.” |
| Avoid negatives | “Do not use var” → “Use const and let only.” Negatives are less reliable. |
| Test edge cases | Your prompt works on one example. Does it work on 100? On adversarial inputs? |
| Version your prompts | Treat prompts like code — track changes, use variables for dynamic content, test changes. |
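The delimiter practice from the table can be as simple as wrapping untrusted content in XML-like tags so instructions and data stay clearly separated. A sketch (the tag name is arbitrary):

```typescript
// Wrap user-supplied content in XML-like tags so the model can tell
// instructions apart from the data they operate on.
function withDelimiters(
  instruction: string,
  content: string,
  tag: string = 'document',
): string {
  return `${instruction}\n\n<${tag}>\n${content}\n</${tag}>`;
}
```

Beyond clarity, delimiters also reduce the chance that text inside the content is mistaken for an instruction.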
Prompt Optimisation Loop
Write prompt → Test on 10+ examples → Find failure cases →
Diagnose cause → Adjust prompt → Repeat
This is uncomfortable for engineers who expect binary pass/fail. Prompt optimisation is empirical — you’re tuning a fuzzy system. Use Evaluation to make this process rigorous.
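One pass of the loop can be automated: run the prompt over a labelled test set and collect the failures to diagnose. A sketch, where `callModel` stands in for any LLM client (all names here are illustrative):

```typescript
// A labelled test case for prompt evaluation.
interface TestCase {
  input: string;
  expected: string;
}

// Run the prompt over every case and return the pass count plus the
// failing cases, which are the raw material for the next diagnosis step.
async function evaluatePrompt(
  buildPrompt: (input: string) => string,
  callModel: (prompt: string) => Promise<string>,
  cases: TestCase[],
): Promise<{ passed: number; failures: TestCase[] }> {
  const failures: TestCase[] = [];
  for (const c of cases) {
    const output = await callModel(buildPrompt(c.input));
    if (output.trim() !== c.expected) failures.push(c);
  }
  return { passed: cases.length - failures.length, failures };
}
```

Exact-match comparison is the simplest scoring rule; fuzzier tasks need a fuzzier check, which is where a proper evaluation harness comes in.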