Prompt Engineering Refresher
From zero-shot basics to production agentic pipelines — with real-world examples from voice AI, structured extraction, and multi-stage workflows.
What Is Prompt Engineering?
Prompt engineering is the practice of designing inputs to language models that reliably produce desired outputs. It sits at the intersection of programming and natural language — you are writing instructions for a system that understands intent, not just syntax.
Unlike traditional code, prompts are probabilistic. The same prompt can produce slightly different outputs on each run. Good prompt engineering narrows that variance — it makes the model's behavior predictable enough to build software on top of.
The Spectrum of Complexity
Prompt engineering ranges from single-turn questions to multi-stage production systems:
| Level | Example | Techniques |
|---|---|---|
| Basic | "Summarize this text" | Zero-shot, role assignment |
| Intermediate | Extract structured JSON from job descriptions | Schema-first, few-shot examples, output validation |
| Advanced | Voice AI conducting mock interviews | System prompts, silence handling, session continuity, signal injection |
| Production | Multi-stage pipeline: parse → enrich → evaluate → extract signals | Prompt chaining, bounded concurrency, safety, evaluation rubrics |
This refresher covers the full spectrum. Each section builds on the previous one.
Anatomy of a Prompt
Every effective prompt has the same structural DNA. Whether you are writing a one-line question or a 200-line system prompt, the building blocks are the same.
The Six Components
| Component | Purpose | Example |
|---|---|---|
| Role | Who the model should be | You are a resume parser |
| Context | Background information | The candidate is applying for a backend role at Stripe |
| Instructions | What to do, step by step | Extract skills, seniority, and role category |
| Constraints | Boundaries and rules | One question per turn. Under 30 words. |
| Output format | How to structure the response | Respond ONLY with valid JSON matching this schema: {...} |
| Examples | Concrete input → output pairs | "data engineer in NJ" → {"search_text":"data engineer","location":"NJ"} |
Use Labeled Sections
All three major providers agree: organize prompts with clear section headers. This is the single most impactful structural technique.
| Provider | Preferred format |
|---|---|
| Anthropic (Claude) | XML tags: <role>, <instructions>, <context> |
| Google (Gemini) | XML tags or Markdown headings: ## Role, ## Constraints |
| OpenAI (GPT) | Markdown headings or system/user message separation |
Here is a real production system prompt (abridged) for a voice AI mock interviewer, using Markdown headings — the recommended format for Google Gemini Live:
## Role
- You are a friendly interviewer having a brief introductory chat
with a candidate for the role of Backend Engineer at Stripe
- Goal: make the candidate comfortable and learn about their background
## Critical constraints (voice model)
- One question per turn, always
- Keep responses under 30 words
- The candidate should be talking 80% of the time
## Tone
- Warm and conversational — like a real person, not a voice assistant
- Brief verbal acknowledgments are fine ("Got it") — not long affirmations
- Do not repeat the same phrase twice — vary your wording naturally
## Flow
1. Greet the candidate warmly
2. Ask about their background
3. Ask thoughtful follow-up questions — reference what they said
4. Continue until you receive a [WRAP-UP] signal, then close warmly
## Sample greetings (pick one, never reuse)
- "Hi there, thanks for taking the time to chat."
- "Hey, welcome! I'm excited to learn about your background."
## Silence handling
- If the candidate pauses, wait at least 5 seconds before saying anything
- At 8-10 seconds, a brief "Take your time" is fine
## Rules
- Do NOT repeat what the candidate just said back to them
- Do not give empty praise like "Great answer!"
- Never break character or mention that you are an AI
Notice the hierarchy: Role → Critical constraints → Tone → Flow → Edge cases → Rules. This matches Google's recommended four-part structure for Live API system instructions: Persona, Conversational Rules, Tool Calls, Guardrails.
Placement Matters
Where you put information in a prompt changes how the model processes it:
- Claude: Put long documents/context at the top, query at the bottom (up to 30% quality improvement).
- Gemini: All context first, then instructions, with transitional phrases like "Based on the information above..."
- OpenAI GPT-4.1: For 1M-token contexts, place instructions at both beginning AND end for best results.
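For the Claude-style ordering, a small sketch that puts documents at the top and the query at the bottom (the helper and tag names are illustrative; the XML-tag wrapping follows Anthropic's general guidance):

```python
def long_context_prompt(documents, query):
    """Claude-style placement: long documents first, query last.
    Each document is wrapped in XML tags so data stays clearly
    separated from the question. (Illustrative helper.)"""
    docs = "\n".join(
        f'<document index="{i}">\n{doc}\n</document>'
        for i, doc in enumerate(documents, 1)
    )
    return (
        f"<documents>\n{docs}\n</documents>\n\n"
        f"Based on the documents above: {query}"
    )
```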
Core Techniques
Zero-Shot Prompting
Give the model a task with no examples. Works well for simple, well-defined tasks where the model's training data includes similar patterns.
prompt = """You are a resume parser. Extract a structured profile
from the following resume text.
Respond ONLY with valid JSON matching this exact schema:
{"skills":["string"],"seniority":"string","role_category":"string",
"years_experience":0,"summary":"string"}
Resume:
{resume_text}"""
Zero-shot works here because resume parsing is a well-understood task. The model knows what "skills" and "seniority" mean without examples.
Few-Shot Prompting
Provide concrete input → output examples. This is the single most effective technique for consistent formatting (OpenAI, Anthropic, Google all agree).
prompt := fmt.Sprintf(`You are a query parser for a job search tool.
Split the user query into two parts:
1. search_text: the role, skill, or job description intent
2. location: the geographic location filter, if any
Rules:
- Keep US state abbreviations as-is (NJ stays NJ)
- Convert full state names to abbreviation (New Jersey → NJ)
- Expand city abbreviations (SF → San Francisco)
- "remote" sets location to "Remote"
- If no location, set location to ""
Respond ONLY with valid JSON: {"search_text":"string","location":"string"}
Examples:
- "data engineer in NJ" → {"search_text":"data engineer","location":"NJ"}
- "ML engineer" → {"search_text":"ML engineer","location":""}
- "remote SWE" → {"search_text":"SWE","location":"Remote"}
- "frontend developer SF" → {"search_text":"frontend developer","location":"San Francisco"}
- "backend engineer in New York" → {"search_text":"backend engineer","location":"New York"}
- "devops California" → {"search_text":"devops","location":"CA"}
Query: %s`, query)
Key principles for few-shot examples:
- Diversity: Cover edge cases, not just the happy path (abbreviations, missing location, full names)
- Identical formatting: Same JSON keys, same field order, same indentation across all examples
- Boundary cases: Include at least one empty field, null value, or short input
Anthropic recommends wrapping few-shot examples in <example> tags for Claude. Google notes that examples without instructions often outperform instruction-heavy prompts without examples, so include examples whenever possible.
Chain-of-Thought (CoT)
Ask the model to show its reasoning before giving a final answer. Introduced by Wei et al. (2022) in "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", this technique dramatically improves performance on tasks requiring multi-step reasoning.
# Simple CoT trigger
prompt = """Analyze this candidate's skill gap against the job description.
Think through this step by step:
1. List the skills the job requires
2. List the skills the candidate has
3. Identify which required skills are missing
4. Rate each missing skill's importance
5. Provide actionable recommendations
Job Description: {jd}
Candidate Profile: {profile}"""
Variants of chain-of-thought:
| Variant | How it works | Source |
|---|---|---|
| Zero-shot CoT | Add "Let's think step by step" to any prompt | Kojima et al., 2022 |
| Few-shot CoT | Include reasoning traces in examples | Wei et al., 2022 |
| Self-consistency | Sample multiple reasoning paths, take majority vote | Wang et al., 2023 (ICLR) |
| Tree of Thoughts | Explore multiple reasoning branches, evaluate and prune | Yao et al., 2023 (NeurIPS) |
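Self-consistency is simple to sketch. Here `generate` stands in for any model call sampled at temperature > 0, and the `Answer:` marker is an assumed output convention, not part of the original method:

```python
from collections import Counter

def self_consistency(generate, prompt, n=5):
    """Self-consistency (Wang et al., 2023): sample n reasoning
    paths, extract each final answer, return the majority vote.
    Assumes each response ends with 'Answer: <value>'."""
    answers = []
    for _ in range(n):
        text = generate(prompt)  # temperature > 0 so paths differ
        # Take whatever follows the last 'Answer:' marker
        answers.append(text.rsplit("Answer:", 1)[-1].strip())
    return Counter(answers).most_common(1)[0][0]
```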
Inner Monologue
Have the model reason internally, then extract only the final answer. This is useful when you want the quality benefits of CoT without exposing raw reasoning to users.
prompt = """Evaluate the candidate's interview performance.
First, think through your evaluation internally:
- What were the strongest moments?
- Where did they struggle?
- How does their performance compare to the rubric?
Then output ONLY the final JSON result (no reasoning):
{"overall_score":1,"summary":"string","strengths":["string"]}
OpenAI specifically recommends this: "Structure the reasoning in a parseable format, then extract only the final answer for the user."
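A hedged sketch of the extraction step, assuming the model emits free-form reasoning followed by one final JSON object (braces inside JSON strings would defeat this simple scan):

```python
import json

def extract_final_json(text):
    """Inner-monologue post-processing: the model may emit
    reasoning before the answer, so parse only the last
    balanced {...} block in the response."""
    end = text.rindex("}")
    depth = 0
    for i in range(end, -1, -1):
        if text[i] == "}":
            depth += 1
        elif text[i] == "{":
            depth -= 1
            if depth == 0:
                return json.loads(text[i:end + 1])
    raise ValueError("no JSON object found")
```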
Tell the Model What TO DO
A universal principle from all three providers: frame instructions positively.
| Bad (negative only) | Better (positive + negative) |
|---|---|
| Do not use markdown | Respond in plain prose paragraphs |
| Don't ask multiple questions | Ask one question per turn, always |
| Don't give generic responses | Reference the candidate's actual words: "You mentioned X" |
| Don't be verbose | Keep responses under 30 words unless asked to elaborate |
System Prompts
System prompts (or system instructions) define persistent behavior across an entire conversation. They are the most important prompt you write — they shape every response the model generates.
System vs. User Messages
All major API providers separate messages by role, with different levels of authority:
| Role | Priority | Use for |
|---|---|---|
| System/Developer | Highest | Identity, behavioral rules, output format, safety constraints |
| User | Medium | Task-specific input, context, data to process |
| Assistant | Normal | Prior model responses (for multi-turn context) |
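As a chat-style API request, the role separation looks like this (illustrative payload; the shape follows the common OpenAI-style messages format):

```python
# System message carries persistent rules; user messages carry data;
# prior model turns go back in as role "assistant".
messages = [
    {"role": "system",
     "content": "You are a resume parser. Respond ONLY with valid JSON."},
    {"role": "user",
     "content": "Resume:\nJane Doe, Senior Backend Engineer, 8 years Go"},
    {"role": "assistant",
     "content": '{"skills": ["Go"], "seniority": "senior"}'},
    {"role": "user",
     "content": "Now also extract years_experience."},
]
```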
Architecture of a Good System Prompt
Based on Google's recommended hierarchy for voice/Live API agents and OpenAI's GPT-4.1 prompting guide:
- Identity & Purpose — Who you are, what your goal is
- Critical Constraints — The 3-4 rules that matter most (front-loaded)
- Tone & Style — Communication personality
- Conversational Flow — Step-by-step behavioral sequence (distinguish one-time steps like greetings from loops like follow-ups)
- Edge Case Handling — What to do when things go wrong (confusion, silence, off-topic)
- Guardrails — Hard rules with conditional examples ("if X, do Y")
Role Assignment Works
Even a single sentence of role framing changes model behavior significantly. All three providers confirm this:
# Vague role (too generic to change behavior)
"You are a helpful assistant."
# Specific role (model adopts domain expertise and appropriate tone)
"You are a friendly tech recruiter calling on behalf of Skilark,
having a first introductory call with a candidate."
# Expert role for evaluation
"You are a warm and insightful career coach. Review this brief
introductory career conversation with a candidate."
Emphasis: Placement, Not Volume
A common mistake is using ALL CAPS or aggressive emphasis (CRITICAL, YOU MUST) to enforce rules. This has model-specific effects:
- Claude 4.5/4.6: Aggressive emphasis causes overtriggering — the model overcompensates. Use normal phrasing.
- Gemini: State goals clearly without excessive persuasive language.
- OpenAI: Frame instructions positively wherever possible.
Structured Output
Most production LLM systems parse model output as structured data. The prompt must guide the model to produce valid, predictable JSON that your code can parse reliably.
Schema-First Prompting
All major providers now offer API-level structured output enforcement:
| Provider | Feature | How it works |
|---|---|---|
| OpenAI | `response_format: {type: "json_schema"}` | Constrains token generation to valid schema tokens |
| Google | `response_mime_type: "application/json"` + `response_schema` (REST) / `response_json_schema` (Python SDK) | JSON schema enforcement via generation config |
| Anthropic | Tool use / JSON mode | Structured output via tool definitions; no first-class schema enforcement like OpenAI |
Here is a real production example using Gemini's schema enforcement with Pydantic:
from pydantic import BaseModel, Field
from typing import Optional
class ListingEnrichment(BaseModel):
normalized_title: str = Field(description="Clean, standard job title")
seniority: str = Field(description="intern|junior|mid|senior|staff|principal|director|vp|c_level")
role_category: str = Field(description="swe|ai_ml|data_eng|platform|devops|security|product|design|other")
remote_policy: Optional[str] = Field(description="remote|hybrid|onsite or null")
location: Optional[str] = Field(description="City, ST format for US locations")
summary: str = Field(description="2-3 sentence distinctive summary")
# API call with schema enforcement
response = client.models.generate_content(
model="gemini-2.0-flash",
contents=prompt,
config={
"response_mime_type": "application/json",
"response_json_schema": ListingEnrichment.model_json_schema(),
},
)
# Pydantic validates the response
result = ListingEnrichment.model_validate_json(response.text)
Schema enforcement guarantees structure, not semantics: the seniority field will be valid JSON but might say "senior" when the listing clearly describes an intern role. Validate meaning separately.
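Because schema enforcement stops at structure, it helps to layer cheap semantic checks on top. A heuristic sketch (the function name and rules are illustrative; the field names match the ListingEnrichment schema above):

```python
def check_seniority(enrichment: dict) -> list[str]:
    """Semantic sanity checks on an already schema-valid record.
    Returns warnings; empty means the record looks consistent.
    (Heuristic sketch, deliberately not exhaustive.)"""
    warnings = []
    words = enrichment["normalized_title"].lower().split()
    seniority = enrichment["seniority"]
    if "intern" in words and seniority != "intern":
        warnings.append(f"title says intern, model said {seniority!r}")
    if "principal" in words and seniority in ("intern", "junior"):
        warnings.append(f"title says principal, model said {seniority!r}")
    return warnings
```

Records that trip a warning can be re-enriched or routed for review instead of silently indexed.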
When API Schema Enforcement Is Unavailable
For APIs that don't support schema enforcement (e.g., Gemini Live realtime audio, or text generation without schema config), use prompt-level techniques:
// Feedback generation prompt — no schema enforcement available
prompt := fmt.Sprintf(`You are an expert interview coach.
Analyze this mock interview transcript.
Rate the candidate on a scale of 1-5:
1 = Poor (major gaps, unprepared)
2 = Below Average (some relevant answers but significant weaknesses)
3 = Average (adequate answers, room for improvement)
4 = Good (strong answers with minor areas to improve)
5 = Excellent (exceptional, well-structured, confident)
Respond ONLY with valid JSON matching this schema:
{"overall_score":1,"summary":"string","strengths":["string"],
"improvements":["string"],
"question_scores":[{"question":"string","score":1,"notes":"string"}]}
Transcript:
%s
Remember: respond with ONLY the JSON object, no other text.`,
transcriptJSON)
The pattern for reliable prompt-level JSON:
- Show the exact JSON schema in the prompt (not a description of it)
- Include a few-shot example showing a complete valid response
- Repeat the format constraint at the end ("respond with ONLY the JSON")
- Strip code fences in post-processing — models often wrap JSON in ```json ... ``` blocks
- Validate and handle failures — parse the response and handle errors gracefully
// Post-processing: strip code fences before parsing.
// Models often wrap JSON in ```json ... ``` blocks. This function
// removes the opening fence line (including language tag) and the
// closing fence, leaving only the JSON body.
func stripCodeFences(s string) string {
s = strings.TrimSpace(s)
if strings.HasPrefix(s, "```") {
// Remove the entire opening fence line (e.g. "```json\n")
if idx := strings.Index(s, "\n"); idx != -1 {
s = s[idx+1:]
}
// Remove closing fence
if idx := strings.LastIndex(s, "```"); idx != -1 {
s = s[:idx]
}
s = strings.TrimSpace(s)
}
return s
}
// Usage
text = stripCodeFences(text)
var feedback InterviewFeedback
if err := json.Unmarshal([]byte(text), &feedback); err != nil {
log.Printf("feedback: JSON parse error for %s: %v", interviewID, err)
return
}
Include Example Output
An example output in the prompt anchors the model's response format more reliably than the schema alone:
// In your prompt, after the schema definition:
Example output:
{"summary":"You gave a clear picture of your backend experience and
showed real enthusiasm for distributed systems.",
"highlights":["You gave a specific example of reducing API latency
by 40% — concrete numbers stand out",
"You connected your interest in data pipelines to a real problem"],
"improvements":["When asked about your background, you listed
technologies. Try framing it as a story instead."]}
fmt.Sprintf escaping: inside fmt.Sprintf, a literal % must be escaped as %%. For example, 40% in the rendered output requires 40%% in the Go string literal. If the string passes through a nested Sprintf, you may need %%%%.
Avoid placeholders ("...") or comments in example JSON: the model may copy them into its output and break the parse.
Voice & Realtime Models
Voice models (Gemini Live, OpenAI Realtime) behave fundamentally differently from text models. Prompting for voice requires a distinct set of techniques learned through production experience.
Voice-Specific Constraints
Text prompting habits break down in voice contexts:
| Issue | Why it happens | Mitigation |
|---|---|---|
| Verbose responses | Text models default to thorough answers | Use word count limits ("under 30 words"), not sentence counts — sentences vary wildly in spoken length |
| Question stacking | Model asks 2-3 questions in one turn | "One question per turn, always" as a top-level constraint |
| Filler sounds | Model tries to sound conversational | "Minimize filler — avoid 'uh', 'so', 'absolutely' as openers" |
| Echo/parroting | Model restates what user said | "Do NOT repeat what the candidate just said back to them" |
| Ignoring signals | Model misses injected control tokens | Front-load signal instructions, e.g., [WRAP-UP] |
| Breaking character | "Am I talking to an AI?" | Provide explicit deflection scripts |
Silence Handling
Silence is meaningful in voice conversations. Candidates pause to think. Good prompts teach the model progressive silence handling:
## Silence handling
- If the candidate pauses to think, wait at least 5 seconds
before saying anything
- At 8-10 seconds of silence, a brief "Take your time" is fine
- At 15-20 seconds, offer a gentle scaffold: "Would it help to
think about this in terms of [related concept]?"
- Beyond 20 seconds, offer to reframe: "Want to approach this
differently, or shall we try another question?"
- Never rush them or fill silence with another question
VAD & API-Level Controls
Gemini Live provides API-level knobs that overlap with prompt instructions. Use the API for behavior you need guaranteed; use the prompt for nuance the API can't express.
# Gemini Live API — platform-level voice controls
config = types.LiveConnectConfig(
response_modalities=["AUDIO"],
system_instruction=system_prompt,
# Gemini speaks first without waiting for user audio.
# Fixes cold-start delay — the system prompt says "introduce
# yourself" but Gemini won't act on it until it hears audio.
proactivity=types.ProactivityConfig(proactive_audio=True),
realtime_input_config=types.RealtimeInputConfig(
automatic_activity_detection=types.AutomaticActivityDetection(
# HIGH: catch quieter phone/handset audio as speech
start_of_speech_sensitivity=types.StartSensitivity.START_SENSITIVITY_HIGH,
# LOW: don't cut the user off during thinking pauses
end_of_speech_sensitivity=types.EndSensitivity.END_SENSITIVITY_LOW,
# 2s breathing room before Gemini responds
silence_duration_ms=2_000,
# Include 500ms before speech start (avoids clipping first syllable)
prefix_padding_ms=500,
),
),
# Context window compression for long sessions
context_window_compression=types.ContextWindowCompressionConfig(
trigger_tokens=80_000,
sliding_window=types.SlidingWindow(target_tokens=20_000),
),
)
If VAD silence_duration_ms is set to 2000ms, a prompt rule saying "wait 5 seconds" creates a race condition — the model may respond at 2s anyway. Prompt-level silence guidance should describe behavior ("let pauses breathe"), not timings the API controls.
Similarly, don't write "start speaking immediately" in the prompt — proactive_audio=True handles this. The prompt should describe what to say, not when.
Greeting Design for Voice Agents
A voice agent's first utterance sets the entire conversational tone. Unlike text chat (where users initiate), voice calls start with the agent speaking into silence.
## Sample greetings (pick one at random, never reuse)
- "Hey, this is Skilark — you signed up for a quick career chat.
Is now still a good time?"
- "Hi, Skilark here — thanks for signing up for a chat.
Is this still a good moment to talk?"
Good greeting structure:
- Identify yourself ("Hey, this is Skilark")
- Set context ("You signed up for a quick career chat")
- Check timing ("Is now still a good time?")
Anti-patterns:
- Jumping straight to a question ("Tell me about yourself") — feels like a cold call
- Combining greeting + question in one utterance — overwhelms the listener. The greeting should be its own turn: wait for a response ("yeah" or "sure") before asking anything
Signal Injection for Session Control
In long-running voice sessions, you need to send control signals to the model mid-conversation — like telling it to wrap up. Inject these as user-role messages:
# Send wrap-up signal during a live voice session
await session.send_client_content(
turns=[Content(role="user", parts=[Part(text=(
"[WRAP-UP] About 1 minute remaining. "
"Please wrap up the conversation naturally."
))])],
turn_complete=True,
)
The system prompt must teach the model to recognize these signals:
## Flow
...
5. Continue the conversation naturally until you receive a [WRAP-UP]
signal, then close warmly — thank them and wish them well
- Do NOT wrap up on your own — keep asking until the signal arrives
Context Management
Every LLM has a finite context window. Production systems must manage token budgets deliberately — exceeding the limit causes hard failures or silent truncation.
Input Truncation
When passing user-supplied text into prompts, always cap the length:
# Enrichment prompt — cap job description at 8,000 chars
ENRICHMENT_BODY_LIMIT = 8000
def _build_prompt(title, body, company_name):
return f"""Analyze this job listing and extract structured information.
## Job Listing
**Company:** {company_name}
**Title:** {title}
**Description:**
{body[:ENRICHMENT_BODY_LIMIT]}
## Instructions
1. Normalize the title...
"""
// Resume parsing — cap at 16,000 chars
const maxResumeBytes = 16000
if len(resumeText) > maxResumeBytes {
resumeText = truncateText(resumeText, maxResumeBytes)
}
// Skill gap analysis — each job description capped at 4,000 chars
const maxJDBytes = 4000
for i, jd := range jobDescriptions {
if len(jd) > maxJDBytes {
jd = jd[:maxJDBytes]
}
fmt.Fprintf(&jdSection, "--- %s ---\n%s\n\n", titles[i], jd)
}
Sliding Window Compression
For long-running sessions (like voice interviews), audio accumulates at roughly 25 tokens/second per audio stream, so a 10-minute bidirectional session consumes ~30k tokens. Without compression, you hit the context limit quickly.
# Gemini Live — compress context when nearing limits
context_window_compression=types.ContextWindowCompressionConfig(
trigger_tokens=80_000, # Start compressing at 80k
sliding_window=types.SlidingWindow(
target_tokens=20_000 # Keep most recent 20k tokens
),
)
This is a safety net. Under normal operation, voice sessions reconnect every ~10 minutes (due to Gemini's GoAway mechanism), so individual segments rarely exceed 30k tokens.
Document Format for Long Contexts
OpenAI's GPT-4.1 guide tested different formats for document collections in long contexts:
| Format | Performance | Use case |
|---|---|---|
| XML tags | Best | Multiple documents, structured data |
| Pipe-delimited (`ID: 1 \| TITLE: ...`) | Good | Tabular data, logs |
| JSON | Poor for large collections | Avoid for document collections > 50k tokens |
Multi-Turn & Session Continuity
Long-running conversations (interviews, coaching sessions, multi-step workflows) face a core challenge: maintaining context across session boundaries.
Reconnection with Context Preservation
In a voice interview, the Gemini Live WebSocket connection dies every ~10 minutes (GoAway signal). The bot must reconnect without the model re-introducing itself or re-asking questions. Two strategies:
Strategy 1: Resume Handle (Fast Reconnect)
# Gemini provides an opaque resume handle for session continuity
session_resumption=types.SessionResumptionConfig(
handle=self._resume_handle, # From previous session's update
)
# Store the handle when Gemini sends it
def _handle_resumption_update(self, update):
if update.resumable and update.new_handle:
self._resume_handle = update.new_handle
If the handle is valid, the model picks up exactly where it left off — same conversational state, no re-greeting.
Strategy 2: Transcript Replay (Fallback)
When the resume handle is stale (gap > 10 minutes), fall back to replaying recent transcript in the system prompt:
MAX_TRANSCRIPT_LINES = 20 # Last ~10 exchanges
def _build_reconnect_prompt(self):
if not self._transcript:
return self._base_system_prompt
recent = self._transcript[-MAX_TRANSCRIPT_LINES:]
context = "\n".join(recent)
return (
f"{self._base_system_prompt}\n\n"
f"IMPORTANT: This is a continuation of an ongoing interview. "
f"Do NOT re-introduce yourself or start over. "
f"Continue naturally from where you left off.\n\n"
f"Conversation so far:\n{context}"
)
Session Lifecycle Awareness
Your system prompt should account for session lifecycle constraints. One-time instructions (greetings, introductions) should be in the Flow section, not in Critical Constraints:
- Flow step 1: "Greet the candidate warmly" — executes once on initial connect
- On reconnect with resumed context, the model continues from where it was, skipping the greeting
- If you put the greeting in Critical Constraints, the model may re-greet on every reconnection
Prompt Chaining & Pipelines
Complex tasks should be decomposed into focused stages, each with its own prompt. All three providers recommend this: "Split complex tasks into subtasks" (OpenAI, Anthropic, Google).
Why Chain?
- Reliability: Small, focused prompts produce more consistent output than monolithic ones
- Debuggability: When something goes wrong, you know exactly which stage failed
- Cost: Early stages can use cheaper/faster models; expensive models only for nuanced tasks
- Validation: Each stage's output is parsed and validated before the next stage runs
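A minimal two-stage chain illustrating the validate-between-stages point. `generate` stands in for any text-model call, and both prompts are toy versions of earlier examples:

```python
import json

def run_chain(generate, resume_text):
    """Two-stage chain sketch: parse, then analyze. Stage 1's
    output is parsed and validated before stage 2 runs, so a
    malformed extraction fails loudly at the boundary."""
    # Stage 1: extraction (a cheap/fast model is fine here)
    profile_raw = generate(
        'Extract skills and seniority as JSON '
        '{"skills":["string"],"seniority":"string"}:\n' + resume_text
    )
    profile = json.loads(profile_raw)  # validate before continuing
    assert "skills" in profile and "seniority" in profile

    # Stage 2: analysis, consuming only validated fields
    return generate(
        f"Given skills {profile['skills']} at {profile['seniority']} "
        f"level, list the top 3 skill gaps for a backend role."
    )
```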
Real Pipeline: Interview → Feedback → Signals
Here is a real multi-stage pipeline from a mock interview product:
# Stage 1: Voice Interview (Gemini Live, realtime audio)
# Input: System prompt + candidate audio
# Output: Bidirectional audio conversation
# Model: gemini-2.5-flash-native-audio (realtime)
# Stage 2: Feedback Generation (Gemini Flash, text)
# Input: Interview transcript (JSON)
# Output: Structured feedback JSON
# Model: gemini-2.0-flash (text generation)
# Stage 3: Career Signal Extraction (same response as Stage 2)
# Input: Feedback JSON (parsed from Stage 2 output)
# Output: Career signals (role interests, skills, seniority)
# Model: (no separate call — extracted from Stage 2's JSON)
# Stage 4: Job Matching (Gemini embeddings)
# Input: Career signal query text
# Output: Matching job listings via vector similarity
# Model: gemini-embedding-001
Each stage has a focused prompt, small output schema, and explicit validation. The feedback prompt produces both human-readable feedback and machine-readable career signals in a single JSON response — an intentional design choice to avoid an extra API call.
Query Decomposition Pattern
A common first stage is breaking a user query into structured components that downstream stages can act on:
# Stage 1: Decompose natural language query
# "senior data engineer in California, remote OK"
# ↓
# {"search_text": "senior data engineer", "location": "CA"}
# Stage 2: Embed search_text → vector
# Stage 3: Vector search with location filter (GeoFilterState)
# Stage 4: Re-rank results by relevance
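Stage 1 can be sketched as a parse-with-fallback wrapper. `generate` stands in for the model call behind the few-shot parser prompt shown earlier; the fallback keeps the pipeline alive when the model returns unparseable text:

```python
import json

def decompose_query(generate, query):
    """Stage 1 sketch: turn a free-text job query into the
    {"search_text", "location"} structure downstream stages
    consume. Falls back to treating the whole query as
    search_text if the model output doesn't parse."""
    raw = generate(f"Split into search_text and location as JSON: {query}")
    try:
        parsed = json.loads(raw)
        return {
            "search_text": parsed.get("search_text", query),
            "location": parsed.get("location", ""),
        }
    except json.JSONDecodeError:
        return {"search_text": query, "location": ""}  # safe fallback
```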
Structured Extraction Pipeline
Another pattern: extracting structured data from unstructured text at scale. This runs against 50,000+ job listings:
prompt = f"""Analyze this job listing and extract structured information.
## Job Listing
**Company:** {company_name}
**Title:** {title}
**Description:**
{body[:8000]}
## Instructions
1. Normalize the title to a clean, standard format
2. Classify seniority: intern, junior, mid, senior, staff, principal
3. Classify role category: swe, ai_ml, data_eng, platform, devops...
4. Identify remote policy: remote, hybrid, onsite, or null
5. Extract location in "City, ST" format for US locations
6. Extract salary range if mentioned (annual, local currency)
7. List technical skills using canonical names: {skills_list}
8. Write a 2-3 sentence summary of distinctive aspects
Focus on: product/system/domain, what person will own/build,
concrete signals of scale. Skip filler phrases."""
Key design choices:
- Canonical skill taxonomy passed in the prompt — forces consistent naming across 50k listings
- Body truncated to 8,000 chars — token budget management
- Schema enforcement via API (`response_mime_type: "application/json"`)
- Pydantic validation on the response — catches schema violations
Prompt Safety & Injection Prevention
Any time user-supplied text is embedded in a prompt, you face the risk of prompt injection — where the user's input overrides your instructions.
Input Sanitization
The simplest defense: truncate and strip control characters from user input before embedding it in prompts.
// sanitizePromptField truncates and strips newlines from
// user-supplied text before embedding in a Gemini prompt.
func sanitizePromptField(s string, maxLen int) string {
s = strings.ReplaceAll(s, "\n", " ")
s = strings.ReplaceAll(s, "\r", " ")
if len(s) > maxLen {
s = s[:maxLen]
}
return s
}
// Usage: role_title is free text from the user
roleTitle = sanitizePromptField(roleTitle, 100)
companySlug = sanitizePromptField(companySlug, 100)
Structural Defense: JSON Output
Structured output is itself a defense. When the model's response is parsed as JSON, a confused model causes an unmarshal failure, not data corruption:
// Even if the model is confused by injected text in role_title,
// the worst case is a parse error — not data corruption
var feedback InterviewFeedback
if err := json.Unmarshal([]byte(text), &feedback); err != nil {
log.Printf("JSON parse error: %v", err)
return // Fail safely — no corrupted data reaches the user
}
Combine multiple layers:
- Sanitize inputs (truncate, strip newlines/control chars)
- Separate instructions from data (use XML tags or clear delimiters)
- Validate outputs (JSON parsing catches confusion)
- Limit blast radius (the model can only produce text — it can't access your database)
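Layers 1 and 3 can be sketched together in Python (function names are illustrative; the 100-char cap mirrors the Go example above):

```python
import json
import re

MAX_FIELD_LEN = 100

def sanitize_field(s: str) -> str:
    """Layer 1: replace control characters (including newlines)
    and truncate before interpolating user text into a prompt."""
    s = re.sub(r"[\x00-\x1f]", " ", s)
    return s[:MAX_FIELD_LEN]

def safe_parse(model_output: str, required_keys: set):
    """Layer 3: treat unparseable or incomplete output as a
    hard failure rather than passing it downstream."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    if not required_keys <= set(data):
        return None
    return data
```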
Separating Instructions from Data
Anthropic recommends XML tags to clearly separate user data from instructions, making injection harder:
<instructions>
Analyze the resume below and extract structured data.
Respond ONLY with valid JSON.
</instructions>
<resume>
{user_supplied_resume_text}
</resume>
<output_format>
{"skills": ["string"], "seniority": "string"}
</output_format>
Evaluation & Scoring Prompts
Using LLMs to evaluate human performance (interviews, writing, code) requires careful prompt design to produce consistent, fair assessments.
Internal vs. External Rubrics
| Approach | When to use | How to prompt |
|---|---|---|
| Internal rubric | Coaching tone, qualitative feedback | "Evaluate internally across these dimensions. Do not output scores." |
| External rubric | Scoring, ranking, comparison | Define the scale explicitly with examples for each level. Clamp in post-processing. |
Internal rubric example (coaching feedback, no numeric scores):
Evaluate the conversation across these dimensions
(internally, do not output scores):
1. Story Clarity — Did the candidate tell a coherent narrative
about their background, or just list facts?
2. Career Direction — Did they articulate what they are looking for?
3. Specificity — Did they give concrete examples with outcomes?
4. Self-Awareness — Did they show understanding of strengths
and growth areas?
5. Conciseness — Were answers focused, or did they ramble?
External rubric example (numeric scoring):
Rate the candidate on a scale of 1-5:
1 = Poor (major gaps, unprepared)
2 = Below Average (some relevant answers but significant weaknesses)
3 = Average (adequate answers, room for improvement)
4 = Good (strong answers with minor areas to improve)
5 = Excellent (exceptional, well-structured, confident)
// Clamp scores in post-processing — models occasionally emit out-of-range values
if score < 1 { score = 1 }; if score > 5 { score = 5 }
Coaching Tone for Feedback
When the goal is to help someone improve (not just rate them), the prompt must specify the structure of actionable feedback:
Field descriptions:
- highlights: 2-3 specific things the candidate did well —
quote or reference what they actually said
- improvements: 2-3 constructive coaching tips, each as a single
string containing:
(a) what the candidate said
(b) why a different approach is stronger
(c) a concrete reworded example
Keep the tone warm and constructive — like a coach helping them
tell their story better, not a judge marking them down.
The three-part structure (what they did → why change → concrete example) prevents vague feedback like "be more specific." Instead it produces: "When asked about your background, you listed technologies. Try framing it as a story: 'I started in backend engineering, then moved to data pipelines when I saw our team spending 40% of time on manual ETL.'"
Match Dimensions to Context
Never evaluate against criteria the candidate had no opportunity to demonstrate:
| Context | Good dimensions | Bad dimensions |
|---|---|---|
| Role-specific interview | Role Connection, Enthusiasm | Career Direction (they already know their target) |
| Open career conversation | Career Direction, Self-Awareness | Role Connection (no specific role discussed) |
| Technical interview | Problem decomposition, Code quality | Enthusiasm (irrelevant) |
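In code, this mapping can live next to the prompts themselves, so a prompt builder cannot accidentally pair a context with the wrong rubric. A hypothetical Go helper (the context keys and exact dimension sets are illustrative):

```go
package main

import "fmt"

// rubricFor picks evaluation dimensions by conversation type,
// mirroring the table above. Keys and dimensions are assumptions.
func rubricFor(context string) []string {
	switch context {
	case "role_specific":
		return []string{"Role Connection", "Enthusiasm", "Specificity"}
	case "open_career":
		return []string{"Career Direction", "Self-Awareness", "Specificity"}
	case "technical":
		return []string{"Problem Decomposition", "Code Quality"}
	default:
		return []string{"Story Clarity", "Specificity"}
	}
}

func main() {
	fmt.Println(rubricFor("technical")) // [Problem Decomposition Code Quality]
}
```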
Agentic Prompt Patterns
Agentic systems give LLMs the ability to take actions — calling tools, querying databases, browsing the web — rather than just generating text. The prompt becomes a controller that decides when and how to use each tool.
ReAct: Reason + Act
The ReAct pattern (Yao et al., 2023) interleaves reasoning with action:
You have access to these tools:
- browse_jobs(skill, location, seniority): Search job listings
- get_signals(company): Get market intelligence for a company
- get_skill_trends(limit): Get top skills by demand
For each user question:
1. Think: What information do I need?
2. Act: Call the appropriate tool
3. Observe: Read the tool's output
4. Think: Do I have enough information to answer?
5. If not, go to step 2. If yes, respond to the user.
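The five steps above can be sketched as a loop. In this sketch the model's decision is stubbed out as a `decide` callback, and a step budget bounds runaway loops, a common guardrail; everything here (names, signatures) is illustrative, not a real agent framework.

```go
package main

import "fmt"

// A Tool takes a query string and returns an observation.
type Tool func(query string) string

// reactLoop runs the think -> act -> observe cycle: decide either names
// a tool to call next or signals that it has a final answer.
func reactLoop(decide func(history []string) (tool, arg string, done bool, answer string),
	tools map[string]Tool, maxSteps int) string {
	var history []string
	for i := 0; i < maxSteps; i++ {
		tool, arg, done, answer := decide(history) // Think
		if done {
			return answer
		}
		t, ok := tools[tool]
		if !ok {
			history = append(history, "error: unknown tool "+tool)
			continue
		}
		obs := t(arg) // Act, then Observe
		history = append(history, fmt.Sprintf("%s(%q) -> %s", tool, arg, obs))
	}
	return "step budget exhausted"
}

func main() {
	tools := map[string]Tool{
		"browse_jobs": func(q string) string { return "3 Go roles in Berlin" },
	}
	// Stub policy: call browse_jobs once, then answer from the observation.
	decide := func(h []string) (string, string, bool, string) {
		if len(h) == 0 {
			return "browse_jobs", "golang berlin", false, ""
		}
		return "", "", true, "Found: " + h[len(h)-1]
	}
	fmt.Println(reactLoop(decide, tools, 5))
}
```

In production the `decide` callback is a model call whose prompt includes the accumulated history, which is exactly what the ReAct prompt above sets up.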
Tool Definitions
When defining tools for function calling, the descriptions are prompt engineering too. The model uses these descriptions to decide which tool to call:
@mcp.tool(description="""Structured job search with filters.
Best for specific queries with known parameters like skill,
company, location, or seniority level.""")
def browse_jobs(
skill: str | None = None,
company: str | None = None,
location: str | None = None,
seniority: str | None = None,
) -> dict:
...
@mcp.tool(description="""AI-powered semantic search — describe your
ideal role in natural language. Best for exploratory or nuanced
queries where you can't specify exact filters.""")
def find_my_fit(query: str) -> dict:
...
Notice how the descriptions differentiate when to use each tool: browse_jobs for structured queries, find_my_fit for natural language exploration. Without clear differentiation, the model will pick arbitrarily.
Multi-Tool Orchestration
For richer answers, instruct the model to combine multiple tools:
## When to combine tools
For richer answers, call multiple tools together:
- Job search + signals: After browse_jobs or find_my_fit,
call get_signals to surface relevant market context
- Skill trends + jobs: After get_skill_trends, show matching
job listings that require trending skills
- Company trends + signals: After get_company_trends, get
recent news signals for top hiring companies
Planning Before Action
For complex multi-step tasks, have the model plan before executing. OpenAI recommends: "Have the model solve the problem first, then compare with the input."
When the user asks a complex question:
1. First, create a plan: what tools do you need to call
and in what order?
2. Execute the plan step by step
3. After gathering all information, synthesize a coherent answer
4. Ask yourself: "Did I miss anything?" before responding
Production Hardening
Moving prompts from prototype to production requires additional engineering around error handling, cost management, and observability.
Error Handling & Retries
// Bounded concurrency — don't overwhelm the API
var feedbackSemaphore = make(chan struct{}, 5)
func GenerateFeedback(ctx context.Context, gemini *GeminiClient, ...) {
// Acquire semaphore slot
select {
case feedbackSemaphore <- struct{}{}:
defer func() { <-feedbackSemaphore }()
case <-ctx.Done():
return
}
text, _, err := gemini.generateContent(ctx, prompt)
if err != nil {
log.Printf("Gemini API error: %v", err)
return // Fail gracefully, don't crash the pipeline
}
text = stripCodeFences(text)
var feedback InterviewFeedback
if err := json.Unmarshal([]byte(text), &feedback); err != nil {
log.Printf("JSON parse error: %v", err)
return // Bad output is not a crash
}
// Post-process: clamp scores, strip hallucinated fields
if feedback.OverallScore != nil {
if *feedback.OverallScore < 1 { *feedback.OverallScore = 1 }
if *feedback.OverallScore > 5 { *feedback.OverallScore = 5 }
}
}
Model Selection by Task
| Task | Model choice | Rationale |
|---|---|---|
| Structured extraction (enrichment) | Flash/Haiku (fast, cheap) | Well-defined schema, low ambiguity |
| Qualitative feedback (coaching) | Pro/Sonnet (balanced) | Needs nuance, empathy, specific examples |
| Realtime voice conversation | Flash (native audio) | Low latency is critical for natural conversation |
| Complex reasoning (planning) | Opus/o1/Pro (capable) | Multi-step reasoning, long context |
| Embeddings | Embedding model | Purpose-built, cheapest per token |
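A routing table like this is easy to encode so the choice is made in one place. A sketch with placeholder model tiers (the task keys and model IDs are assumptions, not pinned versions):

```go
package main

import "fmt"

// modelFor routes each task type to a model tier per the table above.
// Model names are illustrative placeholders.
func modelFor(task string) string {
	switch task {
	case "extraction":
		return "flash" // well-defined schema, low ambiguity: go cheap
	case "feedback":
		return "pro" // nuance and empathy matter
	case "voice":
		return "flash-native-audio" // latency-critical
	case "planning":
		return "opus" // multi-step reasoning, long context
	case "embedding":
		return "embedding-model" // purpose-built, cheapest per token
	default:
		return "pro"
	}
}

func main() {
	fmt.Println(modelFor("voice")) // flash-native-audio
}
```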
Token Usage Tracking
type TokenUsage struct {
PromptTokens int
CandidateTokens int
TotalTokens int
}
// Track usage from every API call
text, usage, err := gemini.generateContent(ctx, prompt)
if usage != nil {
log.Printf("tokens: prompt=%d candidate=%d total=%d",
usage.PromptTokens, usage.CandidateTokens, usage.TotalTokens)
}
Timeout & HTTP Client
// Direct HTTP client — no SDK dependency, full control
type GeminiClient struct {
apiKey string
model string
httpClient *http.Client
baseURL string
}
func NewGeminiClient(apiKey, model string) *GeminiClient {
return &GeminiClient{
apiKey: apiKey,
model: model,
httpClient: &http.Client{Timeout: 30 * time.Second},
baseURL: "https://generativelanguage.googleapis.com/v1beta",
}
}
Iteration & Testing
Testing Checklist
Before deploying any prompt, test with:
- Empty/minimal input — one-word transcript, blank resume
- Off-topic input — candidate talks about unrelated things
- Hostile input — prompt injection attempts in user fields
- Edge-case formatting — very long responses, unicode, special characters
- Boundary values — exactly at truncation limits
Iteration Strategies
Google recommends three approaches when a prompt isn't working:
- Rephrase: Different wording often yields different results
- Reformulate: If classification fails, try multiple-choice framing
- Reorder: Move sections around — placement affects quality
Temperature Settings
| Provider | Recommendation |
|---|---|
| Gemini | Keep temperature at 1.0 (lower risks looping behavior) |
| Claude | Default is fine for most tasks; lower for deterministic output |
| OpenAI | Lower for structured output, higher for creative tasks |
Version Your Prompts Like Code
Prompts are code. Store them in version control, review changes, and track which version produced which results:
// Prompts as named constants — easy to review, diff, and test
const quickChatPrompt = `## Role
- You are a friendly interviewer...`
const quickChatOpenPrompt = `## Role
- You are a friendly tech recruiter...`
const standardPrompt = `## Role
- You are a mock %s interviewer...`
// Route to the right prompt based on interview type and context
func buildInterviewPrompt(interviewType, roleTitle, companySlug string) string {
roleTitle = sanitizePromptField(roleTitle, 100)
companyClause := ""
if companySlug != "" {
companyClause = " at " + sanitizePromptField(companySlug, 100)
}
if interviewType == "quick_chat" {
if roleTitle == "General" && companySlug == "" {
return quickChatOpenPrompt // Open-ended, no target role
}
return fmt.Sprintf(quickChatPrompt, roleTitle, companyClause)
}
return fmt.Sprintf(standardPrompt, interviewType, roleTitle, companyClause)
}
Common Pitfalls
| Anti-Pattern | Why It Fails | Fix |
|---|---|---|
| ALL CAPS emphasis | Overtriggers on Claude 4.5+; ignored by Gemini | Normal casing with front-loading |
| "Do NOT do X" without alternative | Model knows what to avoid but not what to do | Add "Instead, do Y" |
| Vague role ("be helpful") | Too generic to change behavior | Specific role ("tech recruiter on an intro call") |
| Sentence-count limits for voice | Sentences vary wildly in spoken length | Word-count limits ("under 30 words") |
| Checklist instructions ("cover 3 topics") | Creates countdown — model winds down after 3 | "Keep exploring, no checklist" |
| Repeating the same rule 3 times | Wastes tokens, doesn't improve compliance | Say it once, place it high |
| JSON example with "..." or comments | Model reproduces the comments/ellipsis | Show complete, valid JSON |
| Prompt-level timers conflicting with API VAD | Creates race condition (model responds at API timing) | API controls timing; prompt describes behavior |
| Greeting + question in one utterance (voice) | Overwhelms the user | Greeting is its own turn; wait for response |
| Not escaping % in Go fmt.Sprintf | Corrupts the rendered prompt | Use %% for literal % in Go string templates |
Quick Reference
Technique Comparison
| Technique | When to Use | Tradeoff |
|---|---|---|
| Zero-shot | Simple, well-defined tasks | Low cost, less consistent formatting |
| Few-shot | Structured output, consistent format | More tokens, much better consistency |
| Chain-of-thought | Multi-step reasoning, math, logic | More tokens, better accuracy |
| Inner monologue | Quality reasoning without exposing it | Moderate cost, clean output |
| Schema enforcement | JSON output in production pipelines | Guarantees syntax; validate semantics separately |
| Prompt chaining | Complex multi-stage workflows | More API calls, better debuggability |
| ReAct (tool use) | Agentic systems that take actions | Powerful but unpredictable; needs guardrails |
Deployment Checklist
- Critical constraints are in the first section (not buried in Rules)
- Positive instructions outnumber negative ones
- JSON schema is shown explicitly (not just described)
- At least one few-shot example is included (for structured output)
- Evaluation dimensions match the actual conversation context
- Voice prompts use word counts, not sentence counts
- User-supplied fields are sanitized (newlines stripped, length capped)
- Tested with empty, minimal, and adversarial inputs
- No ALL CAPS emphasis (use placement instead)
- Template escaping is correct (`%%` in Go, `{{` in Python f-strings)
Sources & Further Reading
Official Documentation
| Provider | Resource |
|---|---|
| Anthropic | Claude Prompt Engineering Guide |
| Google | Gemini Prompting Strategies |
| Google | Gemini Structured Output |
| Google | Gemini System Instructions |
| OpenAI | Prompt Engineering Guide |
| OpenAI | GPT-4.1 Prompting Guide |
| OpenAI | Structured Outputs |
| OpenAI | Reasoning Best Practices |
Research Papers
| Paper | Authors | Key contribution |
|---|---|---|
| Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | Wei et al., 2022 | Showed that including reasoning steps in prompts dramatically improves performance on math and logic tasks |
| Large Language Models are Zero-Shot Reasoners | Kojima et al., 2022 | Demonstrated that simply adding "Let's think step by step" triggers chain-of-thought reasoning without examples |
| Self-Consistency Improves Chain of Thought Reasoning | Wang et al., 2023 | Sample multiple reasoning paths and take majority vote for more reliable answers |
| Tree of Thoughts: Deliberate Problem Solving with Large Language Models | Yao et al., 2023 (NeurIPS) | Explores branching reasoning paths with evaluation and pruning |
| ReAct: Synergizing Reasoning and Acting in Language Models | Yao et al., 2023 (ICLR) | Interleaving reasoning traces with tool use for grounded, multi-step problem solving |
| Toolformer: Language Models Can Teach Themselves to Use Tools | Schick et al., 2023 | Self-supervised learning of when and how to call external APIs |