What Is Prompt Engineering?

Prompt engineering is the practice of designing inputs to language models that reliably produce desired outputs. It sits at the intersection of programming and natural language — you are writing instructions for a system that understands intent, not just syntax.

Unlike traditional code, prompts are probabilistic. The same prompt can produce slightly different outputs on each run. Good prompt engineering narrows that variance — it makes the model's behavior predictable enough to build software on top of.

The Spectrum of Complexity

Prompt engineering ranges from single-turn questions to multi-stage production systems:

| Level | Example | Techniques |
|---|---|---|
| Basic | "Summarize this text" | Zero-shot, role assignment |
| Intermediate | Extract structured JSON from job descriptions | Schema-first, few-shot examples, output validation |
| Advanced | Voice AI conducting mock interviews | System prompts, silence handling, session continuity, signal injection |
| Production | Multi-stage pipeline: parse → enrich → evaluate → extract signals | Prompt chaining, bounded concurrency, safety, evaluation rubrics |

This refresher covers the full spectrum. Each section builds on the previous one.

Anatomy of a Prompt

Every effective prompt has the same structural DNA. Whether you are writing a one-line question or a 200-line system prompt, the building blocks are the same.

The Six Components

| Component | Purpose | Example |
|---|---|---|
| Role | Who the model should be | You are a resume parser |
| Context | Background information | The candidate is applying for a backend role at Stripe |
| Instructions | What to do, step by step | Extract skills, seniority, and role category |
| Constraints | Boundaries and rules | One question per turn. Under 30 words. |
| Output format | How to structure the response | Respond ONLY with valid JSON matching this schema: {...} |
| Examples | Concrete input → output pairs | "data engineer in NJ" → {"search_text":"data engineer","location":"NJ"} |

Use Labeled Sections

All three major providers agree: organize prompts with clear section headers. This is the single most impactful structural technique.

| Provider | Preferred format |
|---|---|
| Anthropic (Claude) | XML tags: <role>, <instructions>, <context> |
| Google (Gemini) | XML tags or Markdown headings: ## Role, ## Constraints |
| OpenAI (GPT) | Markdown headings or system/user message separation |

Rule
Pick one format per prompt and be consistent. Both Google and Anthropic report degraded performance when formatting is inconsistent across sections.

Here is a real production system prompt (abridged) for a voice AI mock interviewer, using Markdown headings — the recommended format for Google Gemini Live:

## Role
- You are a friendly interviewer having a brief introductory chat
  with a candidate for the role of Backend Engineer at Stripe
- Goal: make the candidate comfortable and learn about their background

## Critical constraints (voice model)
- One question per turn, always
- Keep responses under 30 words
- The candidate should be talking 80% of the time

## Tone
- Warm and conversational — like a real person, not a voice assistant
- Brief verbal acknowledgments are fine ("Got it") — not long affirmations
- Do not repeat the same phrase twice — vary your wording naturally

## Flow
1. Greet the candidate warmly
2. Ask about their background
3. Ask thoughtful follow-up questions — reference what they said
4. Continue until you receive a [WRAP-UP] signal, then close warmly

## Sample greetings (pick one, never reuse)
- "Hi there, thanks for taking the time to chat."
- "Hey, welcome! I'm excited to learn about your background."

## Silence handling
- If the candidate pauses, wait at least 5 seconds before saying anything
- At 8-10 seconds, a brief "Take your time" is fine

## Rules
- Do NOT repeat what the candidate just said back to them
- Do not give empty praise like "Great answer!"
- Never break character or mention that you are an AI

Notice the hierarchy: Role → Critical constraints → Tone → Flow → Edge cases → Rules. This matches Google's recommended four-part structure for Live API system instructions: Persona, Conversational Rules, Tool Calls, Guardrails.

Placement Matters

Where you put information in a prompt changes how the model processes it:

General rule
Context/examples first → instructions → query/task last.
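
As a minimal sketch, prompt assembly in that order looks like this (the helper and section names are illustrative, not from any SDK):

def build_prompt(context: str, examples: str,
                 instructions: str, query: str) -> str:
    return "\n\n".join([
        f"## Context\n{context}",            # bulky reference material first
        f"## Examples\n{examples}",          # few-shot pairs next
        f"## Instructions\n{instructions}",  # what to do with them
        f"## Task\n{query}",                 # the actual query goes last
    ])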

Core Techniques

Zero-Shot Prompting

Give the model a task with no examples. Works well for simple, well-defined tasks where the model's training data includes similar patterns.

prompt = """You are a resume parser. Extract a structured profile
from the following resume text.

Respond ONLY with valid JSON matching this exact schema:
{"skills":["string"],"seniority":"string","role_category":"string",
 "years_experience":0,"summary":"string"}

Resume:
{resume_text}"""

Zero-shot works here because resume parsing is a well-understood task. The model knows what "skills" and "seniority" mean without examples.

Few-Shot Prompting

Provide concrete input → output examples. This is the single most effective technique for consistent formatting (OpenAI, Anthropic, Google all agree).

prompt := fmt.Sprintf(`You are a query parser for a job search tool.
Split the user query into two parts:
1. search_text: the role, skill, or job description intent
2. location: the geographic location filter, if any

Rules:
- Keep US state abbreviations as-is (NJ stays NJ)
- Convert full state names to abbreviation (New Jersey → NJ)
- Expand city abbreviations (SF → San Francisco)
- "remote" sets location to "Remote"
- If no location, set location to ""

Respond ONLY with valid JSON: {"search_text":"string","location":"string"}

Examples:
- "data engineer in NJ" → {"search_text":"data engineer","location":"NJ"}
- "ML engineer" → {"search_text":"ML engineer","location":""}
- "remote SWE" → {"search_text":"SWE","location":"Remote"}
- "frontend developer SF" → {"search_text":"frontend developer","location":"San Francisco"}
- "backend engineer in New York" → {"search_text":"backend engineer","location":"New York"}
- "devops California" → {"search_text":"devops","location":"CA"}

Query: %s`, query)

Key principles for few-shot examples:

- Anthropic recommends 3-5 diverse examples, wrapped in <example> tags for Claude (see the sketch below)
- Google says examples without instructions often outperform instruction-heavy prompts without examples
- Include examples first when possible
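
A sketch of the <example> wrapping, reusing the query-parser pairs from above:

EXAMPLES = [
    ('data engineer in NJ', '{"search_text":"data engineer","location":"NJ"}'),
    ('remote SWE', '{"search_text":"SWE","location":"Remote"}'),
    ('ML engineer', '{"search_text":"ML engineer","location":""}'),
]

def format_examples(pairs: list[tuple[str, str]]) -> str:
    # Each pair becomes one <example> block, as Anthropic recommends
    return "\n".join(
        f"<example>\nQuery: {query}\nOutput: {output}\n</example>"
        for query, output in pairs
    )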

Chain-of-Thought (CoT)

Ask the model to show its reasoning before giving a final answer. Introduced by Wei et al. (2022) in "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", this technique dramatically improves performance on tasks requiring multi-step reasoning.

# Simple CoT trigger
prompt = """Analyze this candidate's skill gap against the job description.

Think through this step by step:
1. List the skills the job requires
2. List the skills the candidate has
3. Identify which required skills are missing
4. Rate each missing skill's importance
5. Provide actionable recommendations

Job Description: {jd}
Candidate Profile: {profile}"""

Variants of chain-of-thought:

| Variant | How it works | Source |
|---|---|---|
| Zero-shot CoT | Add "Let's think step by step" to any prompt | Kojima et al., 2022 |
| Few-shot CoT | Include reasoning traces in examples | Wei et al., 2022 |
| Self-consistency | Sample multiple reasoning paths, take majority vote | Wang et al., 2023 (ICLR) |
| Tree of Thoughts | Explore multiple reasoning branches, evaluate and prune | Yao et al., 2023 (NeurIPS) |
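
Of these, self-consistency is the simplest to bolt onto an existing prompt. A minimal sketch, where call_model and extract_final_answer are stand-ins for your API client and answer parser:

from collections import Counter

def self_consistent_answer(prompt: str, n: int = 5) -> str:
    answers = []
    for _ in range(n):
        # Non-zero temperature so the reasoning paths actually differ
        response = call_model(prompt, temperature=1.0)
        answers.append(extract_final_answer(response))
    return Counter(answers).most_common(1)[0][0]  # majority vote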

Inner Monologue

Have the model reason internally, then extract only the final answer. This is useful when you want the quality benefits of CoT without exposing raw reasoning to users.

prompt = """Evaluate the candidate's interview performance.

First, think through your evaluation internally:
- What were the strongest moments?
- Where did they struggle?
- How does their performance compare to the rubric?

Then output ONLY the final JSON result (no reasoning):
{"overall_score":1,"summary":"string","strengths":["string"]}

OpenAI specifically recommends this: "Structure the reasoning in a parseable format, then extract only the final answer for the user."
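
A minimal extraction sketch, assuming the final answer is the last brace-delimited span in the response and the schema has no nested objects (adjust for richer schemas):

import json

def extract_final_json(text: str) -> dict:
    start = text.rfind("{")      # last opening brace
    end = text.rindex("}") + 1   # last closing brace
    return json.loads(text[start:end])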

Tell the Model What TO DO

A universal principle from all three providers: frame instructions positively.

| Bad (negative only) | Better (positive + negative) |
|---|---|
| Do not use markdown | Respond in plain prose paragraphs |
| Don't ask multiple questions | Ask one question per turn, always |
| Don't give generic responses | Reference the candidate's actual words: "You mentioned X" |
| Don't be verbose | Keep responses under 30 words unless asked to elaborate |

One instruction per bullet
Compound instructions get partially followed. Instead of "Ask about their background and what interests them and keep it under 2 sentences," split into separate bullets.

System Prompts

System prompts (or system instructions) define persistent behavior across an entire conversation. They are the most important prompt you write — they shape every response the model generates.

System vs. User Messages

All major API providers separate messages by role, with different levels of authority:

| Role | Priority | Use for |
|---|---|---|
| System/Developer | Highest | Identity, behavioral rules, output format, safety constraints |
| User | Medium | Task-specific input, context, data to process |
| Assistant | Normal | Prior model responses (for multi-turn context) |

OpenAI recommendation
Place tone and role guidance in system messages. Place task-specific details in user messages. This separation lets you reuse the same system prompt across many different tasks.
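
A sketch of that separation, using the common chat-API messages shape (the prompt text is illustrative):

SYSTEM_PROMPT = (
    "You are a friendly tech recruiter having a first introductory "
    "call with a candidate. Ask one question per turn."
)

def make_messages(task_input: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # persistent behavior
        {"role": "user", "content": task_input},       # per-task data
    ]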

Architecture of a Good System Prompt

Based on Google's recommended hierarchy for voice/Live API agents and OpenAI's GPT-4.1 prompting guide:

  1. Identity & Purpose — Who you are, what your goal is
  2. Critical Constraints — The 3-4 rules that matter most (front-loaded)
  3. Tone & Style — Communication personality
  4. Conversational Flow — Step-by-step behavioral sequence (distinguish one-time steps like greetings from loops like follow-ups)
  5. Edge Case Handling — What to do when things go wrong (confusion, silence, off-topic)
  6. Guardrails — Hard rules with conditional examples ("if X, do Y")

Role Assignment Works

Even a single sentence of role framing changes model behavior significantly. All three providers confirm this:

# Vague role (too generic to change behavior)
"You are a helpful assistant."

# Specific role (model adopts domain expertise and appropriate tone)
"You are a friendly tech recruiter calling on behalf of Skilark,
having a first introductory call with a candidate."

# Expert role for evaluation
"You are a warm and insightful career coach. Review this brief
introductory career conversation with a candidate."

Emphasis: Placement, Not Volume

A common mistake is using ALL CAPS or aggressive emphasis (CRITICAL, YOU MUST) to enforce rules. The effect is model-specific: ALL CAPS overtriggers on Claude 4.5+ and is largely ignored by Gemini (see Common Pitfalls). Louder is not better-followed.

Rule
If a rule is violated 50% of the time, move it higher in the prompt — not louder. Front-loading is more effective than shouting.

Structured Output

Most production LLM systems parse model output as structured data. The prompt must guide the model to produce valid, predictable JSON that your code can parse reliably.

Schema-First Prompting

All major providers now offer API-level structured output enforcement:

| Provider | Feature | How it works |
|---|---|---|
| OpenAI | response_format: {type: "json_schema"} | Constrains token generation to valid schema tokens |
| Google | response_mime_type: "application/json" + response_schema (REST) / response_json_schema (Python SDK) | JSON schema enforcement via generation config |
| Anthropic | Tool use / JSON mode | Structured output via tool definitions; no first-class schema enforcement like OpenAI |

Here is a real production example using Gemini's schema enforcement with Pydantic:

from pydantic import BaseModel, Field
from typing import Optional

class ListingEnrichment(BaseModel):
    normalized_title: str = Field(description="Clean, standard job title")
    seniority: str = Field(description="intern|junior|mid|senior|staff|principal|director|vp|c_level")
    role_category: str = Field(description="swe|ai_ml|data_eng|platform|devops|security|product|design|other")
    remote_policy: Optional[str] = Field(description="remote|hybrid|onsite or null")
    location: Optional[str] = Field(description="City, ST format for US locations")
    summary: str = Field(description="2-3 sentence distinctive summary")

# API call with schema enforcement
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt,
    config={
        "response_mime_type": "application/json",
        "response_json_schema": ListingEnrichment.model_json_schema(),
    },
)

# Pydantic validates the response
result = ListingEnrichment.model_validate_json(response.text)

Schema guarantees syntax, not semantics
API-level enforcement ensures valid JSON structure, but the values can still be wrong. Always validate business logic separately — a seniority field will be valid JSON but might say "senior" when the listing clearly describes an intern role.
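
A sketch of such a semantic check, layered on the ListingEnrichment model above (the heuristic is illustrative):

def check_seniority(enrichment: ListingEnrichment, title: str) -> list[str]:
    # Schema enforcement guarantees `seniority` is a valid string,
    # not that it matches the listing text
    warnings = []
    if "intern" in title.lower() and enrichment.seniority != "intern":
        warnings.append(
            f"title mentions intern but model said {enrichment.seniority!r}"
        )
    return warnings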

When API Schema Enforcement Is Unavailable

For APIs that don't support schema enforcement (e.g., Gemini Live realtime audio, or text generation without schema config), use prompt-level techniques:

// Feedback generation prompt — no schema enforcement available
prompt := fmt.Sprintf(`You are an expert interview coach.
Analyze this mock interview transcript.

Rate the candidate on a scale of 1-5:
1 = Poor (major gaps, unprepared)
2 = Below Average (some relevant answers but significant weaknesses)
3 = Average (adequate answers, room for improvement)
4 = Good (strong answers with minor areas to improve)
5 = Excellent (exceptional, well-structured, confident)

Respond ONLY with valid JSON matching this schema:
{"overall_score":1,"summary":"string","strengths":["string"],
 "improvements":["string"],
 "question_scores":[{"question":"string","score":1,"notes":"string"}]}

Transcript:
%s

Remember: respond with ONLY the JSON object, no other text.`,
    transcriptJSON)

The pattern for reliable prompt-level JSON:

  1. Show the exact JSON schema in the prompt (not a description of it)
  2. Include a few-shot example showing a complete valid response
  3. Repeat the format constraint at the end ("respond with ONLY the JSON")
  4. Strip code fences in post-processing — models often wrap JSON in ```json```
  5. Validate and handle failures — parse the response and handle errors gracefully

// Post-processing: strip code fences before parsing.
// Models often wrap JSON in ```json ... ``` blocks. This function
// removes the opening fence line (including language tag) and the
// closing fence, leaving only the JSON body.
func stripCodeFences(s string) string {
    s = strings.TrimSpace(s)
    if strings.HasPrefix(s, "```") {
        // Remove the entire opening fence line (e.g. "```json\n")
        if idx := strings.Index(s, "\n"); idx != -1 {
            s = s[idx+1:]
        }
        // Remove closing fence
        if idx := strings.LastIndex(s, "```"); idx != -1 {
            s = s[:idx]
        }
        s = strings.TrimSpace(s)
    }
    return s
}

// Usage
text = stripCodeFences(text)
var feedback InterviewFeedback
if err := json.Unmarshal([]byte(text), &feedback); err != nil {
    log.Printf("feedback: JSON parse error for %s: %v", interviewID, err)
    return
}

Include Example Output

An example output in the prompt anchors the model's response format more reliably than the schema alone:

// In your prompt, after the schema definition:
Example output:
{"summary":"You gave a clear picture of your backend experience and
showed real enthusiasm for distributed systems.",
"highlights":["You gave a specific example of reducing API latency
by 40% — concrete numbers stand out",
"You connected your interest in data pipelines to a real problem"],
"improvements":["When asked about your background, you listed
technologies. Try framing it as a story instead."]}

Go fmt.Sprintf escaping
The example above shows the rendered prompt — what the model sees. In Go source code using fmt.Sprintf, literal % must be escaped as %%. For example, 40% in the rendered output requires 40%% in the Go string literal. If the string is inside a nested Sprintf, you may need %%%%.

Never use ... or comments in example JSON
The model will reproduce them. Show complete, valid JSON with realistic values.

Voice & Realtime Models

Voice models (Gemini Live, OpenAI Realtime) behave fundamentally differently from text models. Prompting for voice requires a distinct set of techniques learned through production experience.

Voice-Specific Constraints

Text prompting habits break down in voice contexts:

| Issue | Why it happens | Mitigation |
|---|---|---|
| Verbose responses | Text models default to thorough answers | Use word count limits ("under 30 words"), not sentence counts — sentences vary wildly in spoken length |
| Question stacking | Model asks 2-3 questions in one turn | "One question per turn, always" as a top-level constraint |
| Filler sounds | Model tries to sound conversational | "Minimize filler — avoid 'uh', 'so', 'absolutely' as openers" |
| Echo/parroting | Model restates what user said | "Do NOT repeat what the candidate just said back to them" |
| Ignoring signals | Model misses injected control tokens | Front-load signal instructions, e.g., [WRAP-UP] |
| Breaking character | "Am I talking to an AI?" | Provide explicit deflection scripts |

Silence Handling

Silence is meaningful in voice conversations. Candidates pause to think. Good prompts teach the model progressive silence handling:

## Silence handling
- If the candidate pauses to think, wait at least 5 seconds
  before saying anything
- At 8-10 seconds of silence, a brief "Take your time" is fine
- At 15-20 seconds, offer a gentle scaffold: "Would it help to
  think about this in terms of [related concept]?"
- Beyond 20 seconds, offer to reframe: "Want to approach this
  differently, or shall we try another question?"
- Never rush them or fill silence with another question

VAD & API-Level Controls

Gemini Live provides API-level knobs that overlap with prompt instructions. Use the API for behavior you need guaranteed; use the prompt for nuance the API can't express.

# Gemini Live API — platform-level voice controls
config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    system_instruction=system_prompt,

    # Gemini speaks first without waiting for user audio.
    # Fixes cold-start delay — the system prompt says "introduce
    # yourself" but Gemini won't act on it until it hears audio.
    proactivity=types.ProactivityConfig(proactive_audio=True),

    realtime_input_config=types.RealtimeInputConfig(
        automatic_activity_detection=types.AutomaticActivityDetection(
            # HIGH: catch quieter phone/handset audio as speech
            start_of_speech_sensitivity=types.StartSensitivity.START_SENSITIVITY_HIGH,
            # LOW: don't cut the user off during thinking pauses
            end_of_speech_sensitivity=types.EndSensitivity.END_SENSITIVITY_LOW,
            # 2s breathing room before Gemini responds
            silence_duration_ms=2_000,
            # Include 500ms before speech start (avoids clipping first syllable)
            prefix_padding_ms=500,
        ),
    ),

    # Context window compression for long sessions
    context_window_compression=types.ContextWindowCompressionConfig(
        trigger_tokens=80_000,
        sliding_window=types.SlidingWindow(target_tokens=20_000),
    ),
)

Don't fight the API with your prompt

If VAD silence_duration_ms is set to 2000ms, a prompt rule saying "wait 5 seconds" creates a race condition — the model may respond at 2s anyway. Prompt-level silence guidance should describe behavior ("let pauses breathe"), not timings the API controls.

Similarly, don't write "start speaking immediately" in the prompt — proactive_audio=True handles this. The prompt should describe what to say, not when.

Greeting Design for Voice Agents

A voice agent's first utterance sets the entire conversational tone. Unlike text chat (where users initiate), voice calls start with the agent speaking into silence.

## Sample greetings (pick one at random, never reuse)
- "Hey, this is Skilark — you signed up for a quick career chat.
   Is now still a good time?"
- "Hi, Skilark here — thanks for signing up for a chat.
   Is this still a good moment to talk?"

Good greeting structure:

  1. Identify yourself ("Hey, this is Skilark")
  2. Set context ("You signed up for a quick career chat")
  3. Check timing ("Is now still a good time?")

Anti-patterns:

- Stacking the greeting and the first question in one utterance. It overwhelms the user; make the greeting its own turn and wait for a response (see Common Pitfalls).
- Reusing the same greeting verbatim across calls, hence "pick one at random, never reuse" above.

Signal Injection for Session Control

In long-running voice sessions, you need to send control signals to the model mid-conversation — like telling it to wrap up. Inject these as user-role messages:

# Send wrap-up signal during a live voice session
await session.send_client_content(
    turns=[Content(role="user", parts=[Part(text=(
        "[WRAP-UP] About 1 minute remaining. "
        "Please wrap up the conversation naturally."
    ))])],
    turn_complete=True,
)

The system prompt must teach the model to recognize these signals:

## Flow
...
5. Continue the conversation naturally until you receive a [WRAP-UP]
   signal, then close warmly — thank them and wish them well
- Do NOT wrap up on your own — keep asking until the signal arrives
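
Something has to send that signal at the right moment. A minimal sketch, assuming an asyncio loop around the live session as in the snippet above (session_seconds is an illustrative parameter):

import asyncio

async def schedule_wrap_up(session, session_seconds: int = 600) -> None:
    await asyncio.sleep(session_seconds - 60)  # fire at T minus 60s
    await session.send_client_content(
        turns=[Content(role="user", parts=[Part(text=(
            "[WRAP-UP] About 1 minute remaining. "
            "Please wrap up the conversation naturally."
        ))])],
        turn_complete=True,
    )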

Context Management

Every LLM has a finite context window. Production systems must manage token budgets deliberately — exceeding the limit causes hard failures or silent truncation.

Input Truncation

When passing user-supplied text into prompts, always cap the length:

# Enrichment prompt — cap job description at 8,000 chars
ENRICHMENT_BODY_LIMIT = 8000

def _build_prompt(title, body, company_name):
    return f"""Analyze this job listing and extract structured information.

## Job Listing
**Company:** {company_name}
**Title:** {title}
**Description:**
{body[:ENRICHMENT_BODY_LIMIT]}

## Instructions
1. Normalize the title...
"""

// Resume parsing — cap at 16,000 chars
const maxResumeBytes = 16000
if len(resumeText) > maxResumeBytes {
    resumeText = truncateText(resumeText, maxResumeBytes)
}

// Skill gap analysis — each job description capped at 4,000 chars
const maxJDBytes = 4000
for i, jd := range jobDescriptions {
    if len(jd) > maxJDBytes {
        jd = jd[:maxJDBytes]
    }
    fmt.Fprintf(&jdSection, "--- %s ---\n%s\n\n", titles[i], jd)
}

Sliding Window Compression

For long-running sessions (like voice interviews), audio accumulates at roughly 25 tokens/second in each direction, so a 10-minute two-way session consumes on the order of 30k tokens. Without compression, you hit the context limit quickly.

# Gemini Live — compress context when nearing limits
context_window_compression=types.ContextWindowCompressionConfig(
    trigger_tokens=80_000,   # Start compressing at 80k
    sliding_window=types.SlidingWindow(
        target_tokens=20_000  # Keep most recent 20k tokens
    ),
)

This is a safety net. Under normal operation, voice sessions reconnect every ~10 minutes (due to Gemini's GoAway mechanism), so individual segments rarely exceed 30k tokens.

Document Format for Long Contexts

OpenAI's GPT-4.1 guide tested different formats for document collections in long contexts:

| Format | Performance | Use case |
|---|---|---|
| XML tags | Best | Multiple documents, structured data |
| Pipe-delimited (ID: 1 \| TITLE: ...) | Good | Tabular data, logs |
| JSON | Poor for large collections | Avoid for document collections > 50k tokens |
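
A sketch of the best-performing option, packing a document collection into XML tags (the dict keys are illustrative):

def format_documents(docs: list[dict]) -> str:
    return "\n".join(
        f'<document id="{doc["id"]}" title="{doc["title"]}">\n'
        f'{doc["text"]}\n'
        "</document>"
        for doc in docs
    )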

Multi-Turn & Session Continuity

Long-running conversations (interviews, coaching sessions, multi-step workflows) face a core challenge: maintaining context across session boundaries.

Reconnection with Context Preservation

In a voice interview, the Gemini Live WebSocket connection dies every ~10 minutes (GoAway signal). The bot must reconnect without the model re-introducing itself or re-asking questions. Two strategies:

Strategy 1: Resume Handle (Fast Reconnect)

# Gemini provides an opaque resume handle for session continuity
session_resumption=types.SessionResumptionConfig(
    handle=self._resume_handle,  # From previous session's update
)

# Store the handle when Gemini sends it
def _handle_resumption_update(self, update):
    if update.resumable and update.new_handle:
        self._resume_handle = update.new_handle

If the handle is valid, the model picks up exactly where it left off — same conversational state, no re-greeting.

Strategy 2: Transcript Replay (Fallback)

When the resume handle is stale (gap > 10 minutes), fall back to replaying recent transcript in the system prompt:

MAX_TRANSCRIPT_LINES = 20  # Last ~10 exchanges

def _build_reconnect_prompt(self):
    if not self._transcript:
        return self._base_system_prompt

    recent = self._transcript[-MAX_TRANSCRIPT_LINES:]
    context = "\n".join(recent)
    return (
        f"{self._base_system_prompt}\n\n"
        f"IMPORTANT: This is a continuation of an ongoing interview. "
        f"Do NOT re-introduce yourself or start over. "
        f"Continue naturally from where you left off.\n\n"
        f"Conversation so far:\n{context}"
    )

Key pattern
The "Do NOT re-introduce yourself" instruction is critical. Without it, the model will greet the user again after every reconnection — a terrible user experience in a voice call.

Session Lifecycle Awareness

Your system prompt should account for session lifecycle constraints. One-time instructions (greetings, introductions) belong in the Flow section, not in Critical Constraints; otherwise the model may repeat them after every reconnection or treat them as always-active rules.

Prompt Chaining & Pipelines

Complex tasks should be decomposed into focused stages, each with its own prompt. All three providers recommend this: "Split complex tasks into subtasks" (OpenAI, Anthropic, Google).

Why Chain?

Each stage gets a focused prompt and a small output schema, so failures are isolated and easy to debug. Chaining also lets you pick the cheapest adequate model per stage and validate every intermediate output before it feeds the next step. The trade-off is more API calls (see Quick Reference).

Real Pipeline: Interview → Feedback → Signals

Here is a real multi-stage pipeline from a mock interview product:

# Stage 1: Voice Interview (Gemini Live, realtime audio)
# Input:  System prompt + candidate audio
# Output: Bidirectional audio conversation
# Model:  gemini-2.5-flash-native-audio (realtime)

# Stage 2: Feedback Generation (Gemini Flash, text)
# Input:  Interview transcript (JSON)
# Output: Structured feedback JSON
# Model:  gemini-2.0-flash (text generation)

# Stage 3: Career Signal Extraction (same response as Stage 2)
# Input:  Feedback JSON (parsed from Stage 2 output)
# Output: Career signals (role interests, skills, seniority)
# Model:  (no separate call — extracted from Stage 2's JSON)

# Stage 4: Job Matching (Gemini embeddings)
# Input:  Career signal query text
# Output: Matching job listings via vector similarity
# Model:  gemini-embedding-001

Each stage has a focused prompt, small output schema, and explicit validation. The feedback prompt produces both human-readable feedback and machine-readable career signals in a single JSON response — an intentional design choice to avoid an extra API call.
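
The glue between stages is plain code. A sketch, where the function names and the "career_signals" key are illustrative:

def run_pipeline(transcript_json: str) -> tuple[dict, list]:
    feedback = generate_feedback(transcript_json)  # Stage 2: LLM call
    signals = feedback["career_signals"]           # Stage 3: dict read, no call
    query_vec = embed(signals["query_text"])       # Stage 4: embedding
    return feedback, match_jobs(query_vec)         # vector similarity search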

Query Decomposition Pattern

A common first stage is breaking a user query into structured components that downstream stages can act on:

# Stage 1: Decompose natural language query
# "senior data engineer in California, remote OK"
#     ↓
# {"search_text": "senior data engineer", "location": "CA"}

# Stage 2: Embed search_text → vector
# Stage 3: Vector search with location filter (GeoFilterState)
# Stage 4: Re-rank results by relevance

Structured Extraction Pipeline

Another pattern: extracting structured data from unstructured text at scale. This runs against 50,000+ job listings:

prompt = f"""Analyze this job listing and extract structured information.

## Job Listing
**Company:** {company_name}
**Title:** {title}
**Description:**
{body[:8000]}

## Instructions
1. Normalize the title to a clean, standard format
2. Classify seniority: intern, junior, mid, senior, staff, principal
3. Classify role category: swe, ai_ml, data_eng, platform, devops...
4. Identify remote policy: remote, hybrid, onsite, or null
5. Extract location in "City, ST" format for US locations
6. Extract salary range if mentioned (annual, local currency)
7. List technical skills using canonical names: {skills_list}
8. Write a 2-3 sentence summary of distinctive aspects
   Focus on: product/system/domain, what person will own/build,
   concrete signals of scale. Skip filler phrases."""

Key design choices:

- The description body is capped at 8,000 chars to bound token cost across 50,000+ listings
- Seniority and role category come from closed vocabularies, keeping downstream filters clean
- Skills must use canonical names from an injected list ({skills_list}), avoiding free-form variants
- The summary instruction says what to focus on and what to skip, which heads off generic boilerplate

Prompt Safety & Injection Prevention

Any time user-supplied text is embedded in a prompt, you face the risk of prompt injection — where the user's input overrides your instructions.

Input Sanitization

The simplest defense: truncate and strip control characters from user input before embedding it in prompts.

// sanitizePromptField truncates and strips newlines from
// user-supplied text before embedding in a Gemini prompt.
func sanitizePromptField(s string, maxLen int) string {
    s = strings.ReplaceAll(s, "\n", " ")
    s = strings.ReplaceAll(s, "\r", " ")
    if len(s) > maxLen {
        s = s[:maxLen]
    }
    return s
}

// Usage: role_title is free text from the user
roleTitle = sanitizePromptField(roleTitle, 100)
companySlug = sanitizePromptField(companySlug, 100)

Structural Defense: JSON Output

Structured output is itself a defense. When the model's response is parsed as JSON, a confused model causes an unmarshal failure, not data corruption:

// Even if the model is confused by injected text in role_title,
// the worst case is a parse error — not data corruption
var feedback InterviewFeedback
if err := json.Unmarshal([]byte(text), &feedback); err != nil {
    log.Printf("JSON parse error: %v", err)
    return  // Fail safely — no corrupted data reaches the user
}

Defense in depth

Combine multiple layers:

  1. Sanitize inputs (truncate, strip newlines/control chars)
  2. Separate instructions from data (use XML tags or clear delimiters)
  3. Validate outputs (JSON parsing catches confusion)
  4. Limit blast radius (the model can only produce text — it can't access your database)

Separating Instructions from Data

Anthropic recommends XML tags to clearly separate user data from instructions, making injection harder:

<instructions>
Analyze the resume below and extract structured data.
Respond ONLY with valid JSON.
</instructions>

<resume>
{user_supplied_resume_text}
</resume>

<output_format>
{"skills": ["string"], "seniority": "string"}
</output_format>

Evaluation & Scoring Prompts

Using LLMs to evaluate human performance (interviews, writing, code) requires careful prompt design to produce consistent, fair assessments.

Internal vs. External Rubrics

| Approach | When to use | How to prompt |
|---|---|---|
| Internal rubric | Coaching tone, qualitative feedback | "Evaluate internally across these dimensions. Do not output scores." |
| External rubric | Scoring, ranking, comparison | Define the scale explicitly with examples for each level. Clamp in post-processing. |

Internal rubric example (coaching feedback, no numeric scores):

Evaluate the conversation across these dimensions
(internally, do not output scores):
1. Story Clarity — Did the candidate tell a coherent narrative
   about their background, or just list facts?
2. Career Direction — Did they articulate what they are looking for?
3. Specificity — Did they give concrete examples with outcomes?
4. Self-Awareness — Did they show understanding of strengths
   and growth areas?
5. Conciseness — Were answers focused, or did they ramble?

External rubric example (numeric scoring):

Rate the candidate on a scale of 1-5:
1 = Poor (major gaps, unprepared)
2 = Below Average (some relevant answers but significant weaknesses)
3 = Average (adequate answers, room for improvement)
4 = Good (strong answers with minor areas to improve)
5 = Excellent (exceptional, well-structured, confident)

Always clamp scores in post-processing
Models sometimes output scores outside the defined range. Clamp to valid bounds: if score < 1 { score = 1 }; if score > 5 { score = 5 }

Coaching Tone for Feedback

When the goal is to help someone improve (not just rate them), the prompt must specify the structure of actionable feedback:

Field descriptions:
- highlights: 2-3 specific things the candidate did well —
  quote or reference what they actually said
- improvements: 2-3 constructive coaching tips, each as a single
  string containing:
  (a) what the candidate said
  (b) why a different approach is stronger
  (c) a concrete reworded example

Keep the tone warm and constructive — like a coach helping them
tell their story better, not a judge marking them down.

The three-part structure (what they did → why change → concrete example) prevents vague feedback like "be more specific." Instead it produces: "When asked about your background, you listed technologies. Try framing it as a story: 'I started in backend engineering, then moved to data pipelines when I saw our team spending 40% of time on manual ETL.'"

Match Dimensions to Context

Never evaluate against criteria the candidate had no opportunity to demonstrate:

| Context | Good dimensions | Bad dimensions |
|---|---|---|
| Role-specific interview | Role Connection, Enthusiasm | Career Direction (they already know their target) |
| Open career conversation | Career Direction, Self-Awareness | Role Connection (no specific role discussed) |
| Technical interview | Problem decomposition, Code quality | Enthusiasm (irrelevant) |

Agentic Prompt Patterns

Agentic systems give LLMs the ability to take actions — calling tools, querying databases, browsing the web — rather than just generating text. The prompt becomes a controller that decides when and how to use each tool.

ReAct: Reason + Act

The ReAct pattern (Yao et al., 2023) interleaves reasoning with action:

You have access to these tools:
- browse_jobs(skill, location, seniority): Search job listings
- get_signals(company): Get market intelligence for a company
- get_skill_trends(limit): Get top skills by demand

For each user question:
1. Think: What information do I need?
2. Act: Call the appropriate tool
3. Observe: Read the tool's output
4. Think: Do I have enough information to answer?
5. If not, go to step 2. If yes, respond to the user.
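
A minimal loop implementing that cycle in code, where call_model is a stand-in that returns a parsed step (either a final answer or a tool call):

TOOLS = {
    "browse_jobs": browse_jobs,
    "get_signals": get_signals,
    "get_skill_trends": get_skill_trends,
}

def react_loop(question: str, max_steps: int = 5) -> str:
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        step = call_model("\n".join(history))              # Think
        if "final_answer" in step:
            return step["final_answer"]
        observation = TOOLS[step["tool"]](**step["args"])  # Act
        history.append(f"Observation: {observation}")      # Observe
    return "Step budget exhausted without an answer."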

Tool Definitions

When defining tools for function calling, the descriptions are prompt engineering too. The model uses these descriptions to decide which tool to call:

@mcp.tool(description="""Structured job search with filters.
Best for specific queries with known parameters like skill,
company, location, or seniority level.""")
def browse_jobs(
    skill: str | None = None,
    company: str | None = None,
    location: str | None = None,
    seniority: str | None = None,
) -> dict:
    ...

@mcp.tool(description="""AI-powered semantic search — describe your
ideal role in natural language. Best for exploratory or nuanced
queries where you can't specify exact filters.""")
def find_my_fit(query: str) -> dict:
    ...

Notice how the descriptions differentiate when to use each tool: browse_jobs for structured queries, find_my_fit for natural language exploration. Without clear differentiation, the model will pick arbitrarily.

Multi-Tool Orchestration

For richer answers, instruct the model to combine multiple tools:

## When to combine tools
For richer answers, call multiple tools together:
- Job search + signals: After browse_jobs or find_my_fit,
  call get_signals to surface relevant market context
- Skill trends + jobs: After get_skill_trends, show matching
  job listings that require trending skills
- Company trends + signals: After get_company_trends, get
  recent news signals for top hiring companies

Planning Before Action

For complex multi-step tasks, have the model plan before executing. OpenAI recommends: "Have the model solve the problem first, then compare with the input."

When the user asks a complex question:
1. First, create a plan: what tools do you need to call
   and in what order?
2. Execute the plan step by step
3. After gathering all information, synthesize a coherent answer
4. Ask yourself: "Did I miss anything?" before responding

Production Hardening

Moving prompts from prototype to production requires additional engineering around error handling, cost management, and observability.

Error Handling & Retries

// Bounded concurrency — don't overwhelm the API
var feedbackSemaphore = make(chan struct{}, 5)

func GenerateFeedback(ctx context.Context, gemini *GeminiClient, ...) {
    // Acquire semaphore slot
    select {
    case feedbackSemaphore <- struct{}{}:
        defer func() { <-feedbackSemaphore }()
    case <-ctx.Done():
        return
    }

    text, _, err := gemini.generateContent(ctx, prompt)
    if err != nil {
        log.Printf("Gemini API error: %v", err)
        return  // Fail gracefully, don't crash the pipeline
    }

    text = stripCodeFences(text)

    var feedback InterviewFeedback
    if err := json.Unmarshal([]byte(text), &feedback); err != nil {
        log.Printf("JSON parse error: %v", err)
        return  // Bad output is not a crash
    }

    // Post-process: clamp scores, strip hallucinated fields
    if feedback.OverallScore != nil {
        if *feedback.OverallScore < 1 { *feedback.OverallScore = 1 }
        if *feedback.OverallScore > 5 { *feedback.OverallScore = 5 }
    }
}
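
The Go pipeline above fails gracefully but does not retry. For transient API errors, a retry layer with exponential backoff is the usual companion; a sketch in Python, where call_model and TransientAPIError stand in for your client and its retryable error type:

import random
import time

def generate_with_retry(prompt: str, attempts: int = 3) -> str:
    for attempt in range(attempts):
        try:
            return call_model(prompt)
        except TransientAPIError:
            if attempt == attempts - 1:
                raise  # out of retries, surface the error
            time.sleep(2 ** attempt + random.random())  # ~1s, ~2s, ~4s + jitter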

Model Selection by Task

| Task | Model choice | Rationale |
|---|---|---|
| Structured extraction (enrichment) | Flash/Haiku (fast, cheap) | Well-defined schema, low ambiguity |
| Qualitative feedback (coaching) | Pro/Sonnet (balanced) | Needs nuance, empathy, specific examples |
| Realtime voice conversation | Flash (native audio) | Low latency is critical for natural conversation |
| Complex reasoning (planning) | Opus/o1/Pro (capable) | Multi-step reasoning, long context |
| Embeddings | Embedding model | Purpose-built, cheapest per token |

Token Usage Tracking

type TokenUsage struct {
    PromptTokens    int
    CandidateTokens int
    TotalTokens     int
}

// Track usage from every API call
text, usage, err := gemini.generateContent(ctx, prompt)
if usage != nil {
    log.Printf("tokens: prompt=%d candidate=%d total=%d",
        usage.PromptTokens, usage.CandidateTokens, usage.TotalTokens)
}

Timeout & HTTP Client

// Direct HTTP client — no SDK dependency, full control
type GeminiClient struct {
    apiKey     string
    model      string
    httpClient *http.Client
    baseURL    string
}

func NewGeminiClient(apiKey, model string) *GeminiClient {
    return &GeminiClient{
        apiKey:     apiKey,
        model:      model,
        httpClient: &http.Client{Timeout: 30 * time.Second},
        baseURL:    "https://generativelanguage.googleapis.com/v1beta",
    }
}

No SDK, by choice
Using raw HTTP calls instead of an SDK gives full control over timeouts, retries, and error handling. The trade-off is more boilerplate, but the debugging experience is much better — you can see exactly what's going over the wire.

Iteration & Testing

Testing Checklist

Before deploying any prompt, test with:

- Empty input (blank query, missing resume text)
- Minimal input (one-word queries, single-line documents)
- Adversarial input (injection attempts, instructions embedded in data)
- Oversized input (text past your truncation limits)

Iteration Strategies

Google recommends three approaches when a prompt isn't working:

  1. Rephrase: Different wording often yields different results
  2. Reformulate: If classification fails, try multiple-choice framing
  3. Reorder: Move sections around — placement affects quality

Temperature Settings

| Provider | Recommendation |
|---|---|
| Gemini | Keep temperature at 1.0 (lower risks looping behavior) |
| Claude | Default is fine for most tasks; lower for deterministic output |
| OpenAI | Lower for structured output, higher for creative tasks |

Version Your Prompts Like Code

Prompts are code. Store them in version control, review changes, and track which version produced which results:

// Prompts as named constants — easy to review, diff, and test
const quickChatPrompt = `## Role
- You are a friendly interviewer...`

const quickChatOpenPrompt = `## Role
- You are a friendly tech recruiter...`

const standardPrompt = `## Role
- You are a mock %s interviewer...`

// Route to the right prompt based on interview type and context
func buildInterviewPrompt(interviewType, roleTitle, companySlug string) string {
    roleTitle = sanitizePromptField(roleTitle, 100)
    companyClause := ""
    if companySlug != "" {
        companyClause = " at " + sanitizePromptField(companySlug, 100)
    }

    if interviewType == "quick_chat" {
        if roleTitle == "General" && companySlug == "" {
            return quickChatOpenPrompt  // Open-ended, no target role
        }
        return fmt.Sprintf(quickChatPrompt, roleTitle, companyClause)
    }
    return fmt.Sprintf(standardPrompt, interviewType, roleTitle, companyClause)
}

Common Pitfalls

| Anti-Pattern | Why It Fails | Fix |
|---|---|---|
| ALL CAPS emphasis | Overtriggers on Claude 4.5+; ignored by Gemini | Normal casing with front-loading |
| "Do NOT do X" without alternative | Model knows what to avoid but not what to do | Add "Instead, do Y" |
| Vague role ("be helpful") | Too generic to change behavior | Specific role ("tech recruiter on an intro call") |
| Sentence-count limits for voice | Sentences vary wildly in spoken length | Word-count limits ("under 30 words") |
| Checklist instructions ("cover 3 topics") | Creates countdown — model winds down after 3 | "Keep exploring, no checklist" |
| Repeating the same rule 3 times | Wastes tokens, doesn't improve compliance | Say it once, place it high |
| JSON example with ... or comments | Model reproduces the comments/ellipsis | Show complete, valid JSON |
| Prompt-level timers conflicting with API VAD | Creates race condition (model responds at API timing) | API controls timing; prompt describes behavior |
| Greeting + question in one utterance (voice) | Overwhelms the user | Greeting is its own turn; wait for response |
| Not escaping % in Go fmt.Sprintf | Corrupts the rendered prompt | Use %% for literal % in Go string templates |

Quick Reference

Technique Comparison

| Technique | When to Use | Tradeoff |
|---|---|---|
| Zero-shot | Simple, well-defined tasks | Low cost, less consistent formatting |
| Few-shot | Structured output, consistent format | More tokens, much better consistency |
| Chain-of-thought | Multi-step reasoning, math, logic | More tokens, better accuracy |
| Inner monologue | Quality reasoning without exposing it | Moderate cost, clean output |
| Schema enforcement | JSON output in production pipelines | Guarantees syntax; validate semantics separately |
| Prompt chaining | Complex multi-stage workflows | More API calls, better debuggability |
| ReAct (tool use) | Agentic systems that take actions | Powerful but unpredictable; needs guardrails |

Deployment Checklist

Before deploying any prompt, verify:
  • Critical constraints are in the first section (not buried in Rules)
  • Positive instructions outnumber negative ones
  • JSON schema is shown explicitly (not just described)
  • At least one few-shot example is included (for structured output)
  • Evaluation dimensions match the actual conversation context
  • Voice prompts use word counts, not sentence counts
  • User-supplied fields are sanitized (newlines stripped, length capped)
  • Tested with empty, minimal, and adversarial inputs
  • No ALL CAPS emphasis (use placement instead)
  • Template escaping is correct (%% in Go, {{ in Python f-strings)

Sources & Further Reading

Official Documentation

| Provider | Resource |
|---|---|
| Anthropic | Claude Prompt Engineering Guide |
| Google | Gemini Prompting Strategies |
| Google | Gemini Structured Output |
| Google | Gemini System Instructions |
| OpenAI | Prompt Engineering Guide |
| OpenAI | GPT-4.1 Prompting Guide |
| OpenAI | Structured Outputs |
| OpenAI | Reasoning Best Practices |

Research Papers

| Paper | Authors | Key contribution |
|---|---|---|
| Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | Wei et al., 2022 | Showed that including reasoning steps in prompts dramatically improves performance on math and logic tasks |
| Large Language Models are Zero-Shot Reasoners | Kojima et al., 2022 | Demonstrated that simply adding "Let's think step by step" triggers chain-of-thought reasoning without examples |
| Self-Consistency Improves Chain of Thought Reasoning | Wang et al., 2023 (ICLR) | Sample multiple reasoning paths and take majority vote for more reliable answers |
| Tree of Thoughts: Deliberate Problem Solving with Large Language Models | Yao et al., 2023 (NeurIPS) | Explores branching reasoning paths with evaluation and pruning |
| ReAct: Synergizing Reasoning and Acting in Language Models | Yao et al., 2023 (ICLR) | Interleaving reasoning traces with tool use for grounded, multi-step problem solving |
| Toolformer: Language Models Can Teach Themselves to Use Tools | Schick et al., 2023 | Self-supervised learning of when and how to call external APIs |