Prompt Engineering Refresher
From zero-shot basics to production agentic pipelines — with real-world examples from voice AI, structured extraction, and multi-stage workflows.
What Is Prompt Engineering?
Prompt engineering is the practice of designing inputs to language models that reliably produce desired outputs. It sits at the intersection of programming and natural language — you are writing instructions for a system that understands intent, not just syntax.
Unlike traditional code, prompts are probabilistic. The same prompt can produce slightly different outputs on each run. Good prompt engineering narrows that variance — it makes the model's behavior predictable enough to build software on top of.
The Spectrum of Complexity
Prompt engineering ranges from single-turn questions to multi-stage production systems:
| Level | Example | Techniques |
|---|---|---|
| Basic | "Summarize this text" | Zero-shot, role assignment |
| Intermediate | Extract structured JSON from job descriptions | Schema-first, few-shot examples, output validation |
| Advanced | Voice AI conducting mock interviews | System prompts, silence handling, session continuity, signal injection |
| Production | Multi-stage pipeline: parse → enrich → evaluate → extract signals | Prompt chaining, bounded concurrency, safety, evaluation rubrics |
This refresher covers the full spectrum. Each section builds on the previous one.
Anatomy of a Prompt
Every effective prompt has the same structural DNA. Whether you are writing a one-line question or a 200-line system prompt, the building blocks are the same.
The Six Components
| Component | Purpose | Example |
|---|---|---|
| Role | Who the model should be | You are a resume parser |
| Context | Background information | The candidate is applying for a backend role at Stripe |
| Instructions | What to do, step by step | Extract skills, seniority, and role category |
| Constraints | Boundaries and rules | One question per turn. Under 30 words. |
| Output format | How to structure the response | Respond ONLY with valid JSON matching this schema: {...} |
| Examples | Concrete input → output pairs | "data engineer in NJ" → {"search_text":"data engineer","location":"NJ"} |
Use Labeled Sections
All three major providers agree: organize prompts with clear section headers. This is the single most impactful structural technique.
| Provider | Preferred format |
|---|---|
| Anthropic (Claude) | XML tags: <role>, <instructions>, <context> |
| Google (Gemini) | XML tags or Markdown headings: ## Role, ## Constraints |
| OpenAI (GPT) | Markdown headings or system/user message separation |
Here is a real production system prompt (abridged) for a voice AI mock interviewer, using Markdown headings — the recommended format for Google Gemini Live:
## Role
- You are a friendly interviewer having a brief introductory chat
with a candidate for the role of Backend Engineer at Stripe
- Goal: make the candidate comfortable and learn about their background
## Critical constraints (voice model)
- One question per turn, always
- Keep responses under 30 words
- The candidate should be talking 80% of the time
## Tone
- Warm and conversational — like a real person, not a voice assistant
- Brief verbal acknowledgments are fine ("Got it") — not long affirmations
- Do not repeat the same phrase twice — vary your wording naturally
## Flow
1. Greet the candidate warmly
2. Ask about their background
3. Ask thoughtful follow-up questions — reference what they said
4. Continue until you receive a [WRAP-UP] signal, then close warmly
## Sample greetings (pick one, never reuse)
- "Hi there, thanks for taking the time to chat."
- "Hey, welcome! I'm excited to learn about your background."
## Silence handling
- If the candidate pauses, wait at least 5 seconds before saying anything
- At 8-10 seconds, a brief "Take your time" is fine
## Rules
- Do NOT repeat what the candidate just said back to them
- Do not give empty praise like "Great answer!"
- Never break character or mention that you are an AI
Notice the hierarchy: Role → Critical constraints → Tone → Flow → Edge cases → Rules. This matches Google's recommended four-part structure for Live API system instructions: Persona, Conversational Rules, Tool Calls, Guardrails.
Placement Matters
Where you put information in a prompt changes how the model processes it:
- Claude: Put long documents/context at the top, query at the bottom (up to 30% quality improvement).
- Gemini: All context first, then instructions, with transitional phrases like "Based on the information above..."
- OpenAI GPT-4.1: For 1M-token contexts, place instructions at both beginning AND end for best results.
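For the Claude-style ordering, a small sketch that puts documents at the top and the query at the bottom (the helper and tag names are illustrative; the XML-tag wrapping follows Anthropic's general guidance):

```python
def long_context_prompt(documents, query):
    """Claude-style placement: long documents first, query last.
    Each document is wrapped in XML tags so data stays clearly
    separated from the question. (Illustrative helper.)"""
    docs = "\n".join(
        f'<document index="{i}">\n{doc}\n</document>'
        for i, doc in enumerate(documents, 1)
    )
    return (
        f"<documents>\n{docs}\n</documents>\n\n"
        f"Based on the documents above: {query}"
    )
```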
Core Techniques
Zero-Shot Prompting
Give the model a task with no examples. Works well for simple, well-defined tasks where the model's training data includes similar patterns.
prompt = """You are a resume parser. Extract a structured profile
from the following resume text.
Respond ONLY with valid JSON matching this exact schema:
{"skills":["string"],"seniority":"string","role_category":"string",
"years_experience":0,"summary":"string"}
Resume:
{resume_text}"""
Zero-shot works here because resume parsing is a well-understood task. The model knows what "skills" and "seniority" mean without examples.
Few-Shot Prompting
Provide concrete input → output examples. This is the single most effective technique for consistent formatting (OpenAI, Anthropic, Google all agree).
prompt := fmt.Sprintf(`You are a query parser for a job search tool.
Split the user query into two parts:
1. search_text: the role, skill, or job description intent
2. location: the geographic location filter, if any
Rules:
- Keep US state abbreviations as-is (NJ stays NJ)
- Convert full state names to abbreviation (New Jersey → NJ)
- Expand city abbreviations (SF → San Francisco)
- "remote" sets location to "Remote"
- If no location, set location to ""
Respond ONLY with valid JSON: {"search_text":"string","location":"string"}
Examples:
- "data engineer in NJ" → {"search_text":"data engineer","location":"NJ"}
- "ML engineer" → {"search_text":"ML engineer","location":""}
- "remote SWE" → {"search_text":"SWE","location":"Remote"}
- "frontend developer SF" → {"search_text":"frontend developer","location":"San Francisco"}
- "backend engineer in New York" → {"search_text":"backend engineer","location":"New York"}
- "devops California" → {"search_text":"devops","location":"CA"}
Query: %s`, query)
Key principles for few-shot examples:
- Diversity: Cover edge cases, not just the happy path (abbreviations, missing location, full names)
- Identical formatting: Same JSON keys, same field order, same indentation across all examples
- Boundary cases: Include at least one empty field, null value, or short input
Anthropic recommends wrapping few-shot examples in <example> tags for Claude. Google notes that examples without instructions often outperform instruction-heavy prompts without examples, so include examples whenever possible.
Chain-of-Thought (CoT)
Ask the model to show its reasoning before giving a final answer. Introduced by Wei et al. (2022) in "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", this technique dramatically improves performance on tasks requiring multi-step reasoning.
# Simple CoT trigger
prompt = """Analyze this candidate's skill gap against the job description.
Think through this step by step:
1. List the skills the job requires
2. List the skills the candidate has
3. Identify which required skills are missing
4. Rate each missing skill's importance
5. Provide actionable recommendations
Job Description: {jd}
Candidate Profile: {profile}"""
Variants of chain-of-thought:
| Variant | How it works | Source |
|---|---|---|
| Zero-shot CoT | Add "Let's think step by step" to any prompt | Kojima et al., 2022 |
| Few-shot CoT | Include reasoning traces in examples | Wei et al., 2022 |
| Self-consistency | Sample multiple reasoning paths, take majority vote | Wang et al., 2023 (ICLR) |
| Tree of Thoughts | Explore multiple reasoning branches, evaluate and prune | Yao et al., 2023 (NeurIPS) |
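Self-consistency is simple to sketch. Here `generate` stands in for any model call sampled at temperature > 0, and the `Answer:` marker is an assumed output convention, not part of the original method:

```python
from collections import Counter

def self_consistency(generate, prompt, n=5):
    """Self-consistency (Wang et al., 2023): sample n reasoning
    paths, extract each final answer, return the majority vote.
    Assumes each response ends with 'Answer: <value>'."""
    answers = []
    for _ in range(n):
        text = generate(prompt)  # temperature > 0 so paths differ
        # Take whatever follows the last 'Answer:' marker
        answers.append(text.rsplit("Answer:", 1)[-1].strip())
    return Counter(answers).most_common(1)[0][0]
```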
Inner Monologue
Have the model reason internally, then extract only the final answer. This is useful when you want the quality benefits of CoT without exposing raw reasoning to users.
prompt = """Evaluate the candidate's interview performance.
First, think through your evaluation internally:
- What were the strongest moments?
- Where did they struggle?
- How does their performance compare to the rubric?
Then output ONLY the final JSON result (no reasoning):
{"overall_score":1,"summary":"string","strengths":["string"]}
OpenAI specifically recommends this: "Structure the reasoning in a parseable format, then extract only the final answer for the user."
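A hedged sketch of the extraction step, assuming the model emits free-form reasoning followed by one final JSON object (braces inside JSON strings would defeat this simple scan):

```python
import json

def extract_final_json(text):
    """Inner-monologue post-processing: the model may emit
    reasoning before the answer, so parse only the last
    balanced {...} block in the response."""
    end = text.rindex("}")
    depth = 0
    for i in range(end, -1, -1):
        if text[i] == "}":
            depth += 1
        elif text[i] == "{":
            depth -= 1
            if depth == 0:
                return json.loads(text[i:end + 1])
    raise ValueError("no JSON object found")
```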
Tell the Model What TO DO
A universal principle from all three providers: frame instructions positively.
| Bad (negative only) | Better (positive + negative) |
|---|---|
| Do not use markdown | Respond in plain prose paragraphs |
| Don't ask multiple questions | Ask one question per turn, always |
| Don't give generic responses | Reference the candidate's actual words: "You mentioned X" |
| Don't be verbose | Keep responses under 30 words unless asked to elaborate |
System Prompts
System prompts (or system instructions) define persistent behavior across an entire conversation. They are the most important prompt you write — they shape every response the model generates.
System vs. User Messages
All major API providers separate messages by role, with different levels of authority:
| Role | Priority | Use for |
|---|---|---|
| System/Developer | Highest | Identity, behavioral rules, output format, safety constraints |
| User | Medium | Task-specific input, context, data to process |
| Assistant | Normal | Prior model responses (for multi-turn context) |
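As a chat-style API request, the role separation looks like this (illustrative payload; the shape follows the common OpenAI-style messages format):

```python
# System message carries persistent rules; user messages carry data;
# prior model turns go back in as role "assistant".
messages = [
    {"role": "system",
     "content": "You are a resume parser. Respond ONLY with valid JSON."},
    {"role": "user",
     "content": "Resume:\nJane Doe, Senior Backend Engineer, 8 years Go"},
    {"role": "assistant",
     "content": '{"skills": ["Go"], "seniority": "senior"}'},
    {"role": "user",
     "content": "Now also extract years_experience."},
]
```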
Architecture of a Good System Prompt
Based on Google's recommended hierarchy for voice/Live API agents and OpenAI's GPT-4.1 prompting guide:
- Identity & Purpose — Who you are, what your goal is
- Critical Constraints — The 3-4 rules that matter most (front-loaded)
- Tone & Style — Communication personality
- Conversational Flow — Step-by-step behavioral sequence (distinguish one-time steps like greetings from loops like follow-ups)
- Edge Case Handling — What to do when things go wrong (confusion, silence, off-topic)
- Guardrails — Hard rules with conditional examples ("if X, do Y")
Role Assignment Works
Even a single sentence of role framing changes model behavior significantly. All three providers confirm this:
# Vague role (too generic to change behavior)
"You are a helpful assistant."
# Specific role (model adopts domain expertise and appropriate tone)
"You are a friendly tech recruiter calling on behalf of Skilark,
having a first introductory call with a candidate."
# Expert role for evaluation
"You are a warm and insightful career coach. Review this brief
introductory career conversation with a candidate."
Emphasis: Placement, Not Volume
A common mistake is using ALL CAPS or aggressive emphasis (CRITICAL, YOU MUST) to enforce rules. This has model-specific effects:
- Claude 4.5/4.6: Aggressive emphasis causes overtriggering — the model overcompensates. Use normal phrasing.
- Gemini: State goals clearly without excessive persuasive language.
- OpenAI: Frame instructions positively wherever possible.
Structured Output
Most production LLM systems parse model output as structured data. The prompt must guide the model to produce valid, predictable JSON that your code can parse reliably.
Schema-First Prompting
All major providers now offer API-level structured output enforcement:
| Provider | Feature | How it works |
|---|---|---|
| OpenAI | `response_format: {type: "json_schema"}` | Constrains token generation to valid schema tokens |
| Google | `response_mime_type: "application/json"` + `response_schema` (REST) / `response_json_schema` (Python SDK) | JSON schema enforcement via generation config |
| Anthropic | Tool use / JSON mode | Structured output via tool definitions; no first-class schema enforcement like OpenAI |
Here is a real production example using Gemini's schema enforcement with Pydantic:
from pydantic import BaseModel, Field
from typing import Optional
class ListingEnrichment(BaseModel):
normalized_title: str = Field(description="Clean, standard job title")
seniority: str = Field(description="intern|junior|mid|senior|staff|principal|director|vp|c_level")
role_category: str = Field(description="swe|ai_ml|data_eng|platform|devops|security|product|design|other")
remote_policy: Optional[str] = Field(description="remote|hybrid|onsite or null")
location: Optional[str] = Field(description="City, ST format for US locations")
summary: str = Field(description="2-3 sentence distinctive summary")
# API call with schema enforcement
response = client.models.generate_content(
model="gemini-2.0-flash",
contents=prompt,
config={
"response_mime_type": "application/json",
"response_json_schema": ListingEnrichment.model_json_schema(),
},
)
# Pydantic validates the response
result = ListingEnrichment.model_validate_json(response.text)
Schema enforcement guarantees structure, not semantics: the seniority field will be valid JSON but might say "senior" when the listing clearly describes an intern role. Validate meaning separately.
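Because schema enforcement stops at structure, it helps to layer cheap semantic checks on top. A heuristic sketch (the function name and rules are illustrative; the field names match the ListingEnrichment schema above):

```python
def check_seniority(enrichment: dict) -> list[str]:
    """Semantic sanity checks on an already schema-valid record.
    Returns warnings; empty means the record looks consistent.
    (Heuristic sketch, deliberately not exhaustive.)"""
    warnings = []
    words = enrichment["normalized_title"].lower().split()
    seniority = enrichment["seniority"]
    if "intern" in words and seniority != "intern":
        warnings.append(f"title says intern, model said {seniority!r}")
    if "principal" in words and seniority in ("intern", "junior"):
        warnings.append(f"title says principal, model said {seniority!r}")
    return warnings
```

Records that trip a warning can be re-enriched or routed for review instead of silently indexed.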
When API Schema Enforcement Is Unavailable
For APIs that don't support schema enforcement (e.g., Gemini Live realtime audio, or text generation without schema config), use prompt-level techniques:
// Feedback generation prompt — no schema enforcement available
prompt := fmt.Sprintf(`You are an expert interview coach.
Analyze this mock interview transcript.
Rate the candidate on a scale of 1-5:
1 = Poor (major gaps, unprepared)
2 = Below Average (some relevant answers but significant weaknesses)
3 = Average (adequate answers, room for improvement)
4 = Good (strong answers with minor areas to improve)
5 = Excellent (exceptional, well-structured, confident)
Respond ONLY with valid JSON matching this schema:
{"overall_score":1,"summary":"string","strengths":["string"],
"improvements":["string"],
"question_scores":[{"question":"string","score":1,"notes":"string"}]}
Transcript:
%s
Remember: respond with ONLY the JSON object, no other text.`,
transcriptJSON)
The pattern for reliable prompt-level JSON:
- Show the exact JSON schema in the prompt (not a description of it)
- Include a few-shot example showing a complete valid response
- Repeat the format constraint at the end ("respond with ONLY the JSON")
- Strip code fences in post-processing — models often wrap JSON in ```json ... ``` blocks
- Validate and handle failures — parse the response and handle errors gracefully
// Post-processing: strip code fences before parsing.
// Models often wrap JSON in ```json ... ``` blocks. This function
// removes the opening fence line (including language tag) and the
// closing fence, leaving only the JSON body.
func stripCodeFences(s string) string {
s = strings.TrimSpace(s)
if strings.HasPrefix(s, "```") {
// Remove the entire opening fence line (e.g. "```json\n")
if idx := strings.Index(s, "\n"); idx != -1 {
s = s[idx+1:]
}
// Remove closing fence
if idx := strings.LastIndex(s, "```"); idx != -1 {
s = s[:idx]
}
s = strings.TrimSpace(s)
}
return s
}
// Usage
text = stripCodeFences(text)
var feedback InterviewFeedback
if err := json.Unmarshal([]byte(text), &feedback); err != nil {
log.Printf("feedback: JSON parse error for %s: %v", interviewID, err)
return
}
Include Example Output
An example output in the prompt anchors the model's response format more reliably than the schema alone:
// In your prompt, after the schema definition:
Example output:
{"summary":"You gave a clear picture of your backend experience and
showed real enthusiasm for distributed systems.",
"highlights":["You gave a specific example of reducing API latency
by 40% — concrete numbers stand out",
"You connected your interest in data pipelines to a real problem"],
"improvements":["When asked about your background, you listed
technologies. Try framing it as a story instead."]}
fmt.Sprintf escaping: inside fmt.Sprintf, a literal % must be escaped as %%. For example, 40% in the rendered output requires 40%% in the Go string literal. If the string passes through a nested Sprintf, you may need %%%%.
Avoid placeholders ("...") or comments in example JSON: the model may copy them into its output and break the parse.
Voice & Realtime Models
Voice models (Gemini Live, OpenAI Realtime) behave fundamentally differently from text models. Prompting for voice requires a distinct set of techniques learned through production experience.
Voice-Specific Constraints
Text prompting habits break down in voice contexts:
| Issue | Why it happens | Mitigation |
|---|---|---|
| Verbose responses | Text models default to thorough answers | Use word count limits ("under 30 words"), not sentence counts — sentences vary wildly in spoken length |
| Question stacking | Model asks 2-3 questions in one turn | "One question per turn, always" as a top-level constraint |
| Filler sounds | Model tries to sound conversational | "Minimize filler — avoid 'uh', 'so', 'absolutely' as openers" |
| Echo/parroting | Model restates what user said | "Do NOT repeat what the candidate just said back to them" |
| Ignoring signals | Model misses injected control tokens | Front-load signal instructions, e.g., [WRAP-UP] |
| Breaking character | "Am I talking to an AI?" | Provide explicit deflection scripts |
Silence Handling
Silence is meaningful in voice conversations. Candidates pause to think. Good prompts teach the model progressive silence handling:
## Silence handling
- If the candidate pauses to think, wait at least 5 seconds
before saying anything
- At 8-10 seconds of silence, a brief "Take your time" is fine
- At 15-20 seconds, offer a gentle scaffold: "Would it help to
think about this in terms of [related concept]?"
- Beyond 20 seconds, offer to reframe: "Want to approach this
differently, or shall we try another question?"
- Never rush them or fill silence with another question
VAD & API-Level Controls
Gemini Live provides API-level knobs that overlap with prompt instructions. Use the API for behavior you need guaranteed; use the prompt for nuance the API can't express.
# Gemini Live API — platform-level voice controls
config = types.LiveConnectConfig(
response_modalities=["AUDIO"],
system_instruction=system_prompt,
# Gemini speaks first without waiting for user audio.
# Fixes cold-start delay — the system prompt says "introduce
# yourself" but Gemini won't act on it until it hears audio.
proactivity=types.ProactivityConfig(proactive_audio=True),
realtime_input_config=types.RealtimeInputConfig(
automatic_activity_detection=types.AutomaticActivityDetection(
# HIGH: catch quieter phone/handset audio as speech
start_of_speech_sensitivity=types.StartSensitivity.START_SENSITIVITY_HIGH,
# LOW: don't cut the user off during thinking pauses
end_of_speech_sensitivity=types.EndSensitivity.END_SENSITIVITY_LOW,
# 2s breathing room before Gemini responds
silence_duration_ms=2_000,
# Include 500ms before speech start (avoids clipping first syllable)
prefix_padding_ms=500,
),
),
# Context window compression for long sessions
context_window_compression=types.ContextWindowCompressionConfig(
trigger_tokens=80_000,
sliding_window=types.SlidingWindow(target_tokens=20_000),
),
)
If VAD silence_duration_ms is set to 2000ms, a prompt rule saying "wait 5 seconds" creates a race condition — the model may respond at 2s anyway. Prompt-level silence guidance should describe behavior ("let pauses breathe"), not timings the API controls.
Similarly, don't write "start speaking immediately" in the prompt — proactive_audio=True handles this. The prompt should describe what to say, not when.
Greeting Design for Voice Agents
A voice agent's first utterance sets the entire conversational tone. Unlike text chat (where users initiate), voice calls start with the agent speaking into silence.
## Sample greetings (pick one at random, never reuse)
- "Hey, this is Skilark — you signed up for a quick career chat.
Is now still a good time?"
- "Hi, Skilark here — thanks for signing up for a chat.
Is this still a good moment to talk?"
Good greeting structure:
- Identify yourself ("Hey, this is Skilark")
- Set context ("You signed up for a quick career chat")
- Check timing ("Is now still a good time?")
Anti-patterns:
- Jumping straight to a question ("Tell me about yourself") — feels like a cold call
- Combining greeting + question in one utterance — overwhelms the listener. The greeting should be its own turn: wait for a response ("yeah" or "sure") before asking anything
Signal Injection for Session Control
In long-running voice sessions, you need to send control signals to the model mid-conversation — like telling it to wrap up. Inject these as user-role messages:
# Send wrap-up signal during a live voice session
await session.send_client_content(
turns=[Content(role="user", parts=[Part(text=(
"[WRAP-UP] About 1 minute remaining. "
"Please wrap up the conversation naturally."
))])],
turn_complete=True,
)
The system prompt must teach the model to recognize these signals:
## Flow
...
5. Continue the conversation naturally until you receive a [WRAP-UP]
signal, then close warmly — thank them and wish them well
- Do NOT wrap up on your own — keep asking until the signal arrives
Context Management
Every LLM has a finite context window. Production systems must manage token budgets deliberately — exceeding the limit causes hard failures or silent truncation.
Input Truncation
When passing user-supplied text into prompts, always cap the length:
# Enrichment prompt — cap job description at 8,000 chars
ENRICHMENT_BODY_LIMIT = 8000
def _build_prompt(title, body, company_name):
return f"""Analyze this job listing and extract structured information.
## Job Listing
**Company:** {company_name}
**Title:** {title}
**Description:**
{body[:ENRICHMENT_BODY_LIMIT]}
## Instructions
1. Normalize the title...
"""
// Resume parsing — cap at 16,000 chars
const maxResumeBytes = 16000
if len(resumeText) > maxResumeBytes {
resumeText = truncateText(resumeText, maxResumeBytes)
}
// Skill gap analysis — each job description capped at 4,000 chars
const maxJDBytes = 4000
for i, jd := range jobDescriptions {
if len(jd) > maxJDBytes {
jd = jd[:maxJDBytes]
}
fmt.Fprintf(&jdSection, "--- %s ---\n%s\n\n", titles[i], jd)
}
Sliding Window Compression
For long-running sessions (like voice interviews), audio accumulates at roughly 25 tokens/second per audio stream, so a 10-minute bidirectional session consumes ~30k tokens. Without compression, you hit the context limit quickly.
# Gemini Live — compress context when nearing limits
context_window_compression=types.ContextWindowCompressionConfig(
trigger_tokens=80_000, # Start compressing at 80k
sliding_window=types.SlidingWindow(
target_tokens=20_000 # Keep most recent 20k tokens
),
)
This is a safety net. Under normal operation, voice sessions reconnect every ~10 minutes (due to Gemini's GoAway mechanism), so individual segments rarely exceed 30k tokens.
Document Format for Long Contexts
OpenAI's GPT-4.1 guide tested different formats for document collections in long contexts:
| Format | Performance | Use case |
|---|---|---|
| XML tags | Best | Multiple documents, structured data |
| Pipe-delimited (`ID: 1 \| TITLE: ...`) | Good | Tabular data, logs |
| JSON | Poor for large collections | Avoid for document collections > 50k tokens |
Multi-Turn & Session Continuity
Long-running conversations (interviews, coaching sessions, multi-step workflows) face a core challenge: maintaining context across session boundaries.
Reconnection with Context Preservation
In a voice interview, the Gemini Live WebSocket connection dies every ~10 minutes (GoAway signal). The bot must reconnect without the model re-introducing itself or re-asking questions. Two strategies:
Strategy 1: Resume Handle (Fast Reconnect)
# Gemini provides an opaque resume handle for session continuity
session_resumption=types.SessionResumptionConfig(
handle=self._resume_handle, # From previous session's update
)
# Store the handle when Gemini sends it
def _handle_resumption_update(self, update):
if update.resumable and update.new_handle:
self._resume_handle = update.new_handle
If the handle is valid, the model picks up exactly where it left off — same conversational state, no re-greeting.
Strategy 2: Transcript Replay (Fallback)
When the resume handle is stale (gap > 10 minutes), fall back to replaying recent transcript in the system prompt:
MAX_TRANSCRIPT_LINES = 20 # Last ~10 exchanges
def _build_reconnect_prompt(self):
if not self._transcript:
return self._base_system_prompt
recent = self._transcript[-MAX_TRANSCRIPT_LINES:]
context = "\n".join(recent)
return (
f"{self._base_system_prompt}\n\n"
f"IMPORTANT: This is a continuation of an ongoing interview. "
f"Do NOT re-introduce yourself or start over. "
f"Continue naturally from where you left off.\n\n"
f"Conversation so far:\n{context}"
)
Session Lifecycle Awareness
Your system prompt should account for session lifecycle constraints. One-time instructions (greetings, introductions) should be in the Flow section, not in Critical Constraints:
- Flow step 1: "Greet the candidate warmly" — executes once on initial connect
- On reconnect with resumed context, the model continues from where it was, skipping the greeting
- If you put the greeting in Critical Constraints, the model may re-greet on every reconnection
Prompt Chaining & Pipelines
Complex tasks should be decomposed into focused stages, each with its own prompt. All three providers recommend this: "Split complex tasks into subtasks" (OpenAI, Anthropic, Google).
Why Chain?
- Reliability: Small, focused prompts produce more consistent output than monolithic ones
- Debuggability: When something goes wrong, you know exactly which stage failed
- Cost: Early stages can use cheaper/faster models; expensive models only for nuanced tasks
- Validation: Each stage's output is parsed and validated before the next stage runs
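A minimal two-stage chain illustrating the validate-between-stages point. `generate` stands in for any text-model call, and both prompts are toy versions of earlier examples:

```python
import json

def run_chain(generate, resume_text):
    """Two-stage chain sketch: parse, then analyze. Stage 1's
    output is parsed and validated before stage 2 runs, so a
    malformed extraction fails loudly at the boundary."""
    # Stage 1: extraction (a cheap/fast model is fine here)
    profile_raw = generate(
        'Extract skills and seniority as JSON '
        '{"skills":["string"],"seniority":"string"}:\n' + resume_text
    )
    profile = json.loads(profile_raw)  # validate before continuing
    assert "skills" in profile and "seniority" in profile

    # Stage 2: analysis, consuming only validated fields
    return generate(
        f"Given skills {profile['skills']} at {profile['seniority']} "
        f"level, list the top 3 skill gaps for a backend role."
    )
```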
Real Pipeline: Interview → Feedback → Signals
Here is a real multi-stage pipeline from a mock interview product:
# Stage 1: Voice Interview (Gemini Live, realtime audio)
# Input: System prompt + candidate audio
# Output: Bidirectional audio conversation
# Model: gemini-2.5-flash-native-audio (realtime)
# Stage 2: Feedback Generation (Gemini Flash, text)
# Input: Interview transcript (JSON)
# Output: Structured feedback JSON
# Model: gemini-2.0-flash (text generation)
# Stage 3: Career Signal Extraction (same response as Stage 2)
# Input: Feedback JSON (parsed from Stage 2 output)
# Output: Career signals (role interests, skills, seniority)
# Model: (no separate call — extracted from Stage 2's JSON)
# Stage 4: Job Matching (Gemini embeddings)
# Input: Career signal query text
# Output: Matching job listings via vector similarity
# Model: gemini-embedding-001
Each stage has a focused prompt, small output schema, and explicit validation. The feedback prompt produces both human-readable feedback and machine-readable career signals in a single JSON response — an intentional design choice to avoid an extra API call.
Query Decomposition Pattern
A common first stage is breaking a user query into structured components that downstream stages can act on:
# Stage 1: Decompose natural language query
# "senior data engineer in California, remote OK"
# ↓
# {"search_text": "senior data engineer", "location": "CA"}
# Stage 2: Embed search_text → vector
# Stage 3: Vector search with location filter (GeoFilterState)
# Stage 4: Re-rank results by relevance
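Stage 1 can be sketched as a parse-with-fallback wrapper. `generate` stands in for the model call behind the few-shot parser prompt shown earlier; the fallback keeps the pipeline alive when the model returns unparseable text:

```python
import json

def decompose_query(generate, query):
    """Stage 1 sketch: turn a free-text job query into the
    {"search_text", "location"} structure downstream stages
    consume. Falls back to treating the whole query as
    search_text if the model output doesn't parse."""
    raw = generate(f"Split into search_text and location as JSON: {query}")
    try:
        parsed = json.loads(raw)
        return {
            "search_text": parsed.get("search_text", query),
            "location": parsed.get("location", ""),
        }
    except json.JSONDecodeError:
        return {"search_text": query, "location": ""}  # safe fallback
```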
Structured Extraction Pipeline
Another pattern: extracting structured data from unstructured text at scale. This runs against 50,000+ job listings:
prompt = f"""Analyze this job listing and extract structured information.
## Job Listing
**Company:** {company_name}
**Title:** {title}
**Description:**
{body[:8000]}
## Instructions
1. Normalize the title to a clean, standard format
2. Classify seniority: intern, junior, mid, senior, staff, principal
3. Classify role category: swe, ai_ml, data_eng, platform, devops...
4. Identify remote policy: remote, hybrid, onsite, or null
5. Extract location in "City, ST" format for US locations
6. Extract salary range if mentioned (annual, local currency)
7. List technical skills using canonical names: {skills_list}
8. Write a 2-3 sentence summary of distinctive aspects
Focus on: product/system/domain, what person will own/build,
concrete signals of scale. Skip filler phrases."""
Key design choices:
- Canonical skill taxonomy passed in the prompt — forces consistent naming across 50k listings
- Body truncated to 8,000 chars — token budget management
- Schema enforcement via API (`response_mime_type: "application/json"`)
- Pydantic validation on the response — catches schema violations
Prompt Safety & Injection Prevention
Any time user-supplied text is embedded in a prompt, you face the risk of prompt injection — where the user's input overrides your instructions.
Input Sanitization
The simplest defense: truncate and strip control characters from user input before embedding it in prompts.
// sanitizePromptField truncates and strips newlines from
// user-supplied text before embedding in a Gemini prompt.
func sanitizePromptField(s string, maxLen int) string {
s = strings.ReplaceAll(s, "\n", " ")
s = strings.ReplaceAll(s, "\r", " ")
if len(s) > maxLen {
s = s[:maxLen]
}
return s
}
// Usage: role_title is free text from the user
roleTitle = sanitizePromptField(roleTitle, 100)
companySlug = sanitizePromptField(companySlug, 100)
Structural Defense: JSON Output
Structured output is itself a defense. When the model's response is parsed as JSON, a confused model causes an unmarshal failure, not data corruption:
// Even if the model is confused by injected text in role_title,
// the worst case is a parse error — not data corruption
var feedback InterviewFeedback
if err := json.Unmarshal([]byte(text), &feedback); err != nil {
log.Printf("JSON parse error: %v", err)
return // Fail safely — no corrupted data reaches the user
}
Combine multiple layers:
- Sanitize inputs (truncate, strip newlines/control chars)
- Separate instructions from data (use XML tags or clear delimiters)
- Validate outputs (JSON parsing catches confusion)
- Limit blast radius (the model can only produce text — it can't access your database)
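Layers 1 and 3 can be sketched together in Python (function names are illustrative; the 100-char cap mirrors the Go example above):

```python
import json
import re

MAX_FIELD_LEN = 100

def sanitize_field(s: str) -> str:
    """Layer 1: replace control characters (including newlines)
    and truncate before interpolating user text into a prompt."""
    s = re.sub(r"[\x00-\x1f]", " ", s)
    return s[:MAX_FIELD_LEN]

def safe_parse(model_output: str, required_keys: set):
    """Layer 3: treat unparseable or incomplete output as a
    hard failure rather than passing it downstream."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    if not required_keys <= set(data):
        return None
    return data
```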
Separating Instructions from Data
Anthropic recommends XML tags to clearly separate user data from instructions, making injection harder:
<instructions>
Analyze the resume below and extract structured data.
Respond ONLY with valid JSON.
</instructions>
<resume>
{user_supplied_resume_text}
</resume>
<output_format>
{"skills": ["string"], "seniority": "string"}
</output_format>
Evaluation & Scoring Prompts
Using LLMs to evaluate human performance (interviews, writing, code) requires careful prompt design to produce consistent, fair assessments.
Internal vs. External Rubrics
| Approach | When to use | How to prompt |
|---|---|---|
| Internal rubric | Coaching tone, qualitative feedback | "Evaluate internally across these dimensions. Do not output scores." |
| External rubric | Scoring, ranking, comparison | Define the scale explicitly with examples for each level. Clamp in post-processing. |
Internal rubric example (coaching feedback, no numeric scores):
Evaluate the conversation across these dimensions
(internally, do not output scores):
1. Story Clarity — Did the candidate tell a coherent narrative
about their background, or just list facts?
2. Career Direction — Did they articulate what they are looking for?
3. Specificity — Did they give concrete examples with outcomes?
4. Self-Awareness — Did they show understanding of strengths
and growth areas?
5. Conciseness — Were answers focused, or did they ramble?
External rubric example (numeric scoring):
Rate the candidate on a scale of 1-5:
1 = Poor (major gaps, unprepared)
2 = Below Average (some relevant answers but significant weaknesses)
3 = Average (adequate answers, room for improvement)
4 = Good (strong answers with minor areas to improve)
5 = Excellent (exceptional, well-structured, confident)
// Clamp scores in post-processing — models occasionally emit out-of-range values
if score < 1 { score = 1 }; if score > 5 { score = 5 }
Coaching Tone for Feedback
When the goal is to help someone improve (not just rate them), the prompt must specify the structure of actionable feedback:
Field descriptions:
- highlights: 2-3 specific things the candidate did well —
quote or reference what they actually said
- improvements: 2-3 constructive coaching tips, each as a single
string containing:
(a) what the candidate said
(b) why a different approach is stronger
(c) a concrete reworded example
Keep the tone warm and constructive — like a coach helping them
tell their story better, not a judge marking them down.
The three-part structure (what they did → why change → concrete example) prevents vague feedback like "be more specific." Instead it produces: "When asked about your background, you listed technologies. Try framing it as a story: 'I started in backend engineering, then moved to data pipelines when I saw our team spending 40% of time on manual ETL.'"
Match Dimensions to Context
Never evaluate against criteria the candidate had no opportunity to demonstrate:
| Context | Good dimensions | Bad dimensions |
|---|---|---|
| Role-specific interview | Role Connection, Enthusiasm | Career Direction (they already know their target) |
| Open career conversation | Career Direction, Self-Awareness | Role Connection (no specific role discussed) |
| Technical interview | Problem decomposition, Code quality | Enthusiasm (irrelevant) |
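In code, this mapping can live next to the prompts themselves, so a prompt builder cannot accidentally pair a context with the wrong rubric. A hypothetical Go helper (the context keys and exact dimension sets are illustrative):

```go
package main

import "fmt"

// rubricFor picks evaluation dimensions by conversation type,
// mirroring the table above. Keys and dimensions are assumptions.
func rubricFor(context string) []string {
	switch context {
	case "role_specific":
		return []string{"Role Connection", "Enthusiasm", "Specificity"}
	case "open_career":
		return []string{"Career Direction", "Self-Awareness", "Specificity"}
	case "technical":
		return []string{"Problem Decomposition", "Code Quality"}
	default:
		return []string{"Story Clarity", "Specificity"}
	}
}

func main() {
	fmt.Println(rubricFor("technical")) // [Problem Decomposition Code Quality]
}
```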
Agentic Prompt Patterns
Agentic systems give LLMs the ability to take actions — calling tools, querying databases, browsing the web — rather than just generating text. The prompt becomes a controller that decides when and how to use each tool.
ReAct: Reason + Act
The ReAct pattern (Yao et al., 2023) interleaves reasoning with action:
You have access to these tools:
- browse_jobs(skill, location, seniority): Search job listings
- get_signals(company): Get market intelligence for a company
- get_skill_trends(limit): Get top skills by demand
For each user question:
1. Think: What information do I need?
2. Act: Call the appropriate tool
3. Observe: Read the tool's output
4. Think: Do I have enough information to answer?
5. If not, go to step 2. If yes, respond to the user.
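The five steps above can be sketched as a loop. In this sketch the model's decision is stubbed out as a `decide` callback, and a step budget bounds runaway loops, a common guardrail; everything here (names, signatures) is illustrative, not a real agent framework.

```go
package main

import "fmt"

// A Tool takes a query string and returns an observation.
type Tool func(query string) string

// reactLoop runs the think -> act -> observe cycle: decide either names
// a tool to call next or signals that it has a final answer.
func reactLoop(decide func(history []string) (tool, arg string, done bool, answer string),
	tools map[string]Tool, maxSteps int) string {
	var history []string
	for i := 0; i < maxSteps; i++ {
		tool, arg, done, answer := decide(history) // Think
		if done {
			return answer
		}
		t, ok := tools[tool]
		if !ok {
			history = append(history, "error: unknown tool "+tool)
			continue
		}
		obs := t(arg) // Act, then Observe
		history = append(history, fmt.Sprintf("%s(%q) -> %s", tool, arg, obs))
	}
	return "step budget exhausted"
}

func main() {
	tools := map[string]Tool{
		"browse_jobs": func(q string) string { return "3 Go roles in Berlin" },
	}
	// Stub policy: call browse_jobs once, then answer from the observation.
	decide := func(h []string) (string, string, bool, string) {
		if len(h) == 0 {
			return "browse_jobs", "golang berlin", false, ""
		}
		return "", "", true, "Found: " + h[len(h)-1]
	}
	fmt.Println(reactLoop(decide, tools, 5))
}
```

In production the `decide` callback is a model call whose prompt includes the accumulated history, which is exactly what the ReAct prompt above sets up.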
Tool Definitions
When defining tools for function calling, the descriptions are prompt engineering too. The model uses these descriptions to decide which tool to call:
@mcp.tool(description="""Structured job search with filters.
Best for specific queries with known parameters like skill,
company, location, or seniority level.""")
def browse_jobs(
skill: str | None = None,
company: str | None = None,
location: str | None = None,
seniority: str | None = None,
) -> dict:
...
@mcp.tool(description="""AI-powered semantic search — describe your
ideal role in natural language. Best for exploratory or nuanced
queries where you can't specify exact filters.""")
def find_my_fit(query: str) -> dict:
...
Notice how the descriptions differentiate when to use each tool: browse_jobs for structured queries, find_my_fit for natural language exploration. Without clear differentiation, the model will pick arbitrarily.
Multi-Tool Orchestration
For richer answers, instruct the model to combine multiple tools:
## When to combine tools
For richer answers, call multiple tools together:
- Job search + signals: After browse_jobs or find_my_fit,
call get_signals to surface relevant market context
- Skill trends + jobs: After get_skill_trends, show matching
job listings that require trending skills
- Company trends + signals: After get_company_trends, get
recent news signals for top hiring companies
Planning Before Action
For complex multi-step tasks, have the model plan before executing. OpenAI recommends: "Have the model solve the problem first, then compare with the input."
When the user asks a complex question:
1. First, create a plan: what tools do you need to call
and in what order?
2. Execute the plan step by step
3. After gathering all information, synthesize a coherent answer
4. Ask yourself: "Did I miss anything?" before responding
Production Hardening
Moving prompts from prototype to production requires additional engineering around error handling, cost management, and observability.
Error Handling & Retries
// Bounded concurrency — don't overwhelm the API
var feedbackSemaphore = make(chan struct{}, 5)
func GenerateFeedback(ctx context.Context, gemini *GeminiClient, ...) {
// Acquire semaphore slot
select {
case feedbackSemaphore <- struct{}{}:
defer func() { <-feedbackSemaphore }()
case <-ctx.Done():
return
}
text, _, err := gemini.generateContent(ctx, prompt)
if err != nil {
log.Printf("Gemini API error: %v", err)
return // Fail gracefully, don't crash the pipeline
}
text = stripCodeFences(text)
var feedback InterviewFeedback
if err := json.Unmarshal([]byte(text), &feedback); err != nil {
log.Printf("JSON parse error: %v", err)
return // Bad output is not a crash
}
// Post-process: clamp scores, strip hallucinated fields
if feedback.OverallScore != nil {
if *feedback.OverallScore < 1 { *feedback.OverallScore = 1 }
if *feedback.OverallScore > 5 { *feedback.OverallScore = 5 }
}
}
Model Selection by Task
| Task | Model choice | Rationale |
|---|---|---|
| Structured extraction (enrichment) | Flash/Haiku (fast, cheap) | Well-defined schema, low ambiguity |
| Qualitative feedback (coaching) | Pro/Sonnet (balanced) | Needs nuance, empathy, specific examples |
| Realtime voice conversation | Flash (native audio) | Low latency is critical for natural conversation |
| Complex reasoning (planning) | Opus/o1/Pro (capable) | Multi-step reasoning, long context |
| Embeddings | Embedding model | Purpose-built, cheapest per token |
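A routing table like this is easy to encode so the choice is made in one place. A sketch with placeholder model tiers (the task keys and model IDs are assumptions, not pinned versions):

```go
package main

import "fmt"

// modelFor routes each task type to a model tier per the table above.
// Model names are illustrative placeholders.
func modelFor(task string) string {
	switch task {
	case "extraction":
		return "flash" // well-defined schema, low ambiguity: go cheap
	case "feedback":
		return "pro" // nuance and empathy matter
	case "voice":
		return "flash-native-audio" // latency-critical
	case "planning":
		return "opus" // multi-step reasoning, long context
	case "embedding":
		return "embedding-model" // purpose-built, cheapest per token
	default:
		return "pro"
	}
}

func main() {
	fmt.Println(modelFor("voice")) // flash-native-audio
}
```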
Token Usage Tracking
type TokenUsage struct {
PromptTokens int
CandidateTokens int
TotalTokens int
}
// Track usage from every API call
text, usage, err := gemini.generateContent(ctx, prompt)
if usage != nil {
log.Printf("tokens: prompt=%d candidate=%d total=%d",
usage.PromptTokens, usage.CandidateTokens, usage.TotalTokens)
}
Timeout & HTTP Client
// Direct HTTP client — no SDK dependency, full control
type GeminiClient struct {
apiKey string
model string
httpClient *http.Client
baseURL string
}
func NewGeminiClient(apiKey, model string) *GeminiClient {
return &GeminiClient{
apiKey: apiKey,
model: model,
httpClient: &http.Client{Timeout: 30 * time.Second},
baseURL: "https://generativelanguage.googleapis.com/v1beta",
}
}
Iteration & Testing
Testing Checklist
Before deploying any prompt, test with:
- Empty/minimal input — one-word transcript, blank resume
- Off-topic input — candidate talks about unrelated things
- Hostile input — prompt injection attempts in user fields
- Edge-case formatting — very long responses, unicode, special characters
- Boundary values — exactly at truncation limits
Iteration Strategies
Google recommends three approaches when a prompt isn't working:
- Rephrase: Different wording often yields different results
- Reformulate: If classification fails, try multiple-choice framing
- Reorder: Move sections around — placement affects quality
Temperature Settings
| Provider | Recommendation |
|---|---|
| Gemini | Keep temperature at 1.0 (lower risks looping behavior) |
| Claude | Default is fine for most tasks; lower for deterministic output |
| OpenAI | Lower for structured output, higher for creative tasks |
Version Your Prompts Like Code
Prompts are code. Store them in version control, review changes, and track which version produced which results:
// Prompts as named constants — easy to review, diff, and test
const quickChatPrompt = `## Role
- You are a friendly interviewer...`
const quickChatOpenPrompt = `## Role
- You are a friendly tech recruiter...`
const standardPrompt = `## Role
- You are a mock %s interviewer...`
// Route to the right prompt based on interview type and context
func buildInterviewPrompt(interviewType, roleTitle, companySlug string) string {
roleTitle = sanitizePromptField(roleTitle, 100)
companyClause := ""
if companySlug != "" {
companyClause = " at " + sanitizePromptField(companySlug, 100)
}
if interviewType == "quick_chat" {
if roleTitle == "General" && companySlug == "" {
return quickChatOpenPrompt // Open-ended, no target role
}
return fmt.Sprintf(quickChatPrompt, roleTitle, companyClause)
}
return fmt.Sprintf(standardPrompt, interviewType, roleTitle, companyClause)
}
Common Pitfalls
| Anti-Pattern | Why It Fails | Fix |
|---|---|---|
| ALL CAPS emphasis | Overtriggers on Claude 4.5+; ignored by Gemini | Normal casing with front-loading |
| "Do NOT do X" without alternative | Model knows what to avoid but not what to do | Add "Instead, do Y" |
| Vague role ("be helpful") | Too generic to change behavior | Specific role ("tech recruiter on an intro call") |
| Sentence-count limits for voice | Sentences vary wildly in spoken length | Word-count limits ("under 30 words") |
| Checklist instructions ("cover 3 topics") | Creates countdown — model winds down after 3 | "Keep exploring, no checklist" |
| Repeating the same rule 3 times | Wastes tokens, doesn't improve compliance | Say it once, place it high |
| JSON example with "..." or comments | Model reproduces the comments/ellipsis | Show complete, valid JSON |
| Prompt-level timers conflicting with API VAD | Creates race condition (model responds at API timing) | API controls timing; prompt describes behavior |
| Greeting + question in one utterance (voice) | Overwhelms the user | Greeting is its own turn; wait for response |
| Not escaping % in Go fmt.Sprintf | Corrupts the rendered prompt | Use %% for literal % in Go string templates |
Quick Reference
Technique Comparison
| Technique | When to Use | Tradeoff |
|---|---|---|
| Zero-shot | Simple, well-defined tasks | Low cost, less consistent formatting |
| Few-shot | Structured output, consistent format | More tokens, much better consistency |
| Chain-of-thought | Multi-step reasoning, math, logic | More tokens, better accuracy |
| Inner monologue | Quality reasoning without exposing it | Moderate cost, clean output |
| Schema enforcement | JSON output in production pipelines | Guarantees syntax; validate semantics separately |
| Prompt chaining | Complex multi-stage workflows | More API calls, better debuggability |
| ReAct (tool use) | Agentic systems that take actions | Powerful but unpredictable; needs guardrails |
Deployment Checklist
- Critical constraints are in the first section (not buried in Rules)
- Positive instructions outnumber negative ones
- JSON schema is shown explicitly (not just described)
- At least one few-shot example is included (for structured output)
- Evaluation dimensions match the actual conversation context
- Voice prompts use word counts, not sentence counts
- User-supplied fields are sanitized (newlines stripped, length capped)
- Tested with empty, minimal, and adversarial inputs
- No ALL CAPS emphasis (use placement instead)
- Template escaping is correct (`%%` in Go, `{{` in Python f-strings)
Sources & Further Reading
Official Documentation
| Provider | Resource |
|---|---|
| Anthropic | Claude Prompt Engineering Guide |
| Google | Gemini Prompting Strategies |
| Google | Gemini Structured Output |
| Google | Gemini System Instructions |
| OpenAI | Prompt Engineering Guide |
| OpenAI | GPT-4.1 Prompting Guide |
| OpenAI | Structured Outputs |
| OpenAI | Reasoning Best Practices |
Research Papers
| Paper | Authors | Key contribution |
|---|---|---|
| Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | Wei et al., 2022 | Showed that including reasoning steps in prompts dramatically improves performance on math and logic tasks |
| Large Language Models are Zero-Shot Reasoners | Kojima et al., 2022 | Demonstrated that simply adding "Let's think step by step" triggers chain-of-thought reasoning without examples |
| Self-Consistency Improves Chain of Thought Reasoning | Wang et al., 2023 | Sample multiple reasoning paths and take majority vote for more reliable answers |
| Tree of Thoughts: Deliberate Problem Solving with Large Language Models | Yao et al., 2023 (NeurIPS) | Explores branching reasoning paths with evaluation and pruning |
| ReAct: Synergizing Reasoning and Acting in Language Models | Yao et al., 2023 (ICLR) | Interleaving reasoning traces with tool use for grounded, multi-step problem solving |
| Toolformer: Language Models Can Teach Themselves to Use Tools | Schick et al., 2023 | Self-supervised learning of when and how to call external APIs |