Prompt engineering is the practice of designing inputs to language models that reliably produce high-quality, accurate, and well-formatted outputs. As LLMs become central to production applications, the quality of your prompts directly affects the quality of your product. A well-engineered prompt can turn unreliable model output into a consistent, structured result, often without the cost or complexity of fine-tuning.
This guide covers the most impactful prompting techniques in 2026: system prompt design, chain-of-thought reasoning, few-shot examples, structured output enforcement, role prompting, self-consistency, and systematic prompt evaluation — all with concrete Python examples against real APIs.
The system prompt is your most powerful tool — it sets the model's persona, constraints, output format, and domain knowledge. A good system prompt eliminates the need to repeat instructions in every user message. For production systems, think of the system prompt as configuration code: version-control it, review it carefully, and test changes before deploying.
Key principles: be specific about role and expertise level, define the output format explicitly, list constraints as positive rules ("always" rather than "never"), and provide examples of ideal responses when possible.
from openai import OpenAI
client = OpenAI()
# Weak system prompt (vague)
WEAK_SYSTEM = "You are a helpful assistant."
# Strong system prompt (specific, structured, constrained)
STRONG_SYSTEM = """You are a senior Python code reviewer with 10+ years of experience.
When reviewing code:
1. Identify bugs and logic errors first (Priority: Critical)
2. Note security vulnerabilities (Priority: High)
3. Suggest performance improvements (Priority: Medium)
4. Comment on style and readability (Priority: Low)
Output format:
- Use the exact structure: CRITICAL / HIGH / MEDIUM / LOW sections
- Each issue: [LINE N] Description. Fix: suggested_code
- End with a SUMMARY score: /10 with 1-sentence justification
If the code has no issues, say "LGTM: No issues found." and explain why it's good."""
code_to_review = '''
def get_user(user_id):
    query = "SELECT * FROM users WHERE id = " + user_id
    return db.execute(query)
'''
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": STRONG_SYSTEM},
        {"role": "user", "content": f"Review this code:\n```python\n{code_to_review}\n```"},
    ],
    temperature=0.1,  # Low temperature for more consistent (not strictly deterministic) output
)
print(response.choices[0].message.content)
Tip: structure complex system prompts with XML-style tags such as <instructions>, <examples>, and <constraints>. This is Anthropic's recommended pattern for complex system prompts.
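One way to apply tag-based structure is a small helper that wraps each part of a system prompt in matching tags. This is only a sketch: `build_tagged_prompt` and the section texts are hypothetical and not part of any SDK, and the tag names are a convention rather than anything the model requires.

```python
def build_tagged_prompt(sections: dict[str, str]) -> str:
    """Wrap each named section in matching XML-style tags."""
    parts = []
    for tag, body in sections.items():
        # Each section becomes <tag>, the body, then </tag> on separate lines
        parts.append(f"<{tag}>\n{body.strip()}\n</{tag}>")
    return "\n\n".join(parts)


# Hypothetical section contents, for illustration only
system_prompt = build_tagged_prompt({
    "instructions": "Review the submitted Python code for bugs and security issues.",
    "constraints": "Report at most five findings. Always include a suggested fix.",
    "examples": "[LINE 3] SQL built by string concatenation. Fix: use a parameterized query.",
})
print(system_prompt)
```

The resulting string can be passed as the `content` of the system message in the same way as `STRONG_SYSTEM`.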
Chain-of-thought (CoT) prompting instructs the model to show its reasoning step by step before giving a final answer. This can substantially improve accuracy on multi-step problems such as math, logic, code debugging, and complex analysis, because writing out intermediate steps gives the model a chance to catch its own mistakes before committing to an answer. Simply adding "Think step by step" to a prompt can increase accuracy by 10–40% on some reasoning benchmarks, though the gain varies with the model and the task.
from openai import OpenAI
client = OpenAI()
# Without CoT: the model jumps straight to an answer and is more likely to make errors
def ask_direct(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        max_tokens=100,
    )
    return response.choices[0].message.content

# With CoT: the model reasons before answering
def ask_with_cot(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"{question}\n\nThink step by step, then give your final answer on a new line starting with 'ANSWER:'",
        }],
        max_tokens=500,
    )
    return response.choices[0].message.content
# Zero-shot CoT
problem = "If a train travels 150km in 2.5 hours, then slows to 60% of that speed for another 1.5 hours, what is the total distance traveled?"
print(ask_with_cot(problem))
# Structured CoT with explicit reasoning format
STRUCTURED_COT_SYSTEM = """Solve problems using this exact format:
GIVEN: [list known facts]
FIND: [what we need to calculate]
STEPS:
1. [step with calculation]
2. [step with calculation]
...
ANSWER: [final answer with units]
VERIFY: [sanity check the answer]"""
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": STRUCTURED_COT_SYSTEM},
        {"role": "user", "content": problem},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
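Because the chain-of-thought prompts ask for a final line starting with `ANSWER:`, downstream code usually needs to pull that line out of the reasoning text. The sketch below shows one way to do that, followed by the train problem worked by hand so there is a reference value to compare against. `extract_final_answer` is a hypothetical helper, and the sample text is made up rather than real model output.

```python
import re
from typing import Optional


def extract_final_answer(text: str, marker: str = "ANSWER:") -> Optional[str]:
    """Return the text after the last line that starts with the marker, or None."""
    pattern = rf"^\s*{re.escape(marker)}\s*(.+)$"
    matches = re.findall(pattern, text, flags=re.MULTILINE)
    return matches[-1].strip() if matches else None


# Worked arithmetic for the train problem
first_leg_km = 150
speed_kmh = first_leg_km / 2.5           # 60 km/h
slow_speed_kmh = 0.6 * speed_kmh         # 36 km/h
second_leg_km = slow_speed_kmh * 1.5     # 54 km
total_km = first_leg_km + second_leg_km  # 204 km

# Made-up response text in the requested format
sample = "Speed is 60 km/h.\nThe slower speed is 36 km/h.\nANSWER: 204 km"
print(extract_final_answer(sample), f"(expected {total_km:.0f} km)")
```

Returning `None` when the marker is missing lets the caller decide whether to retry or fall back to the full text.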
Few-shot prompting provides 2–5 examples of ideal input/output pairs before the actual query. This teaches the model the exact format, tone, and depth you expect — far more effectively than describing it in words. Few-shot examples are especially powerful for extraction tasks, custom classification schemes, and domain-specific formatting.
import json

from openai import OpenAI

client = OpenAI()

# Few-shot extraction: extract product specs from unstructured text
FEW_SHOT_SYSTEM = """Extract product specifications as JSON. Examples:
Input: "The XR-500 camera shoots 4K at 60fps with 24MP stills and 10-hour battery life"
Output: {"model": "XR-500", "video": "4K@60fps", "photo_mp": 24, "battery_hours": 10}
Input: "Dell XPS 15 features Intel i9-13900H, 32GB DDR5 RAM, 1TB NVMe SSD, OLED display"
Output: {"model": "Dell XPS 15", "cpu": "Intel i9-13900H", "ram_gb": 32, "storage_tb": 1, "display": "OLED"}
Input: "Sony WH-1000XM5 headphones: 30hr battery, ANC, Bluetooth 5.2, 3-minute quick charge"
Output: {"model": "Sony WH-1000XM5", "battery_hours": 30, "anc": true, "bluetooth": "5.2", "quick_charge_minutes": 3}
Return only valid JSON, no explanation."""

test_inputs = [
    "Apple iPhone 16 Pro: A18 Pro chip, 48MP main camera, 6.3-inch ProMotion display, all-day battery",
    "Samsung 85-inch QLED TV, 4K 120Hz, HDR10+, 4x HDMI 2.1 ports, built-in Alexa",
]

for text in test_inputs:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": FEW_SHOT_SYSTEM},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},
        temperature=0,
        max_tokens=200,
    )
    # Parse and print each result inside the loop so every input is reported
    data = json.loads(response.choices[0].message.content)
    print(data)
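Few-shot examples do not have to live in the system prompt. A common alternative, sketched below, is to supply each example as a user turn followed by an assistant turn, which keeps the examples as data that can be swapped without editing prompt text. `build_few_shot_messages` is a hypothetical helper; the example pairs are taken from the prompt above.

```python
import json


def build_few_shot_messages(system: str, examples: list[tuple[str, str]], query: str) -> list[dict]:
    """Build a chat message list: system prompt, example turns, then the real query."""
    messages = [{"role": "system", "content": system}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages


EXAMPLES = [
    (
        "The XR-500 camera shoots 4K at 60fps with 24MP stills and 10-hour battery life",
        json.dumps({"model": "XR-500", "video": "4K@60fps", "photo_mp": 24, "battery_hours": 10}),
    ),
    (
        "Sony WH-1000XM5 headphones: 30hr battery, ANC, Bluetooth 5.2, 3-minute quick charge",
        json.dumps({"model": "Sony WH-1000XM5", "battery_hours": 30, "anc": True, "bluetooth": "5.2"}),
    ),
]

messages = build_few_shot_messages(
    "Extract product specifications as JSON. Return only valid JSON, no explanation.",
    EXAMPLES,
    "Samsung 85-inch QLED TV, 4K 120Hz, HDR10+, 4x HDMI 2.1 ports, built-in Alexa",
)
print([m["role"] for m in messages])
```

The returned list can be passed as the `messages` argument of a chat completion call. Building the example outputs with `json.dumps` ensures the examples themselves are valid JSON.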
Reliable structured output is essential for production pipelines. JSON mode (response_format={"type": "json_object"}) constrains the model to emit syntactically valid JSON, but it does not enforce a particular schema, and a response cut off by max_tokens can still fail to parse. For complex schemas, describe the exact structure in your system prompt, or use Pydantic with OpenAI's structured outputs feature so that the schema is enforced automatically.
from openai import OpenAI
from pydantic import BaseModel
from typing import Optional
import json
client = OpenAI()
# Method 1: JSON mode with schema in prompt
def extract_entity(text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Extract entities as JSON with this exact schema:
{
  "people": [{"name": str, "role": str, "company": str | null}],
  "organizations": [{"name": str, "type": str}],
  "locations": [{"name": str, "type": "city|country|region"}],
  "dates": [{"text": str, "iso": str | null}]
}"""},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
# Method 2: Pydantic structured outputs (OpenAI beta feature)
class SentimentResult(BaseModel):
    sentiment: str  # positive | negative | neutral
    confidence: float  # 0.0 - 1.0
    key_phrases: list[str]  # supporting phrases from text
    suggested_action: Optional[str]

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Analyze the sentiment of customer feedback."},
        {"role": "user", "content": "The product arrived late and was damaged. Very disappointed."},
    ],
    response_format=SentimentResult,
)
result = response.choices[0].message.parsed
if result is None:
    # parsed can be None, for example when the model refuses the request
    raise RuntimeError("No structured result returned")
print(f"Sentiment: {result.sentiment} ({result.confidence:.0%})")
print(f"Key phrases: {result.key_phrases}")
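When only JSON mode is used, the parsed object still has to be checked against the shape the pipeline expects. The sketch below does that with the standard library alone, using the field names from `SentimentResult`; `validate_sentiment` and `ALLOWED_SENTIMENTS` are hypothetical names, and in practice a Pydantic model would do the same job with less code.

```python
import json

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def validate_sentiment(raw: str) -> dict:
    """Parse a JSON string and check it against the expected sentiment shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    if data.get("sentiment") not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {data.get('sentiment')!r}")
    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    phrases = data.get("key_phrases")
    if not isinstance(phrases, list) or not all(isinstance(p, str) for p in phrases):
        raise ValueError("key_phrases must be a list of strings")
    return data


# Made-up payload in the expected shape
good = '{"sentiment": "negative", "confidence": 0.93, "key_phrases": ["arrived late", "damaged"]}'
print(validate_sentiment(good)["sentiment"])
```

Raising `ValueError` with a specific message makes it straightforward to log the failure or feed the message back to the model in a retry.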
Assigning a specific expert role to the model ("You are a senior security engineer...", "You are an experienced oncologist reviewing...") often improves output quality for specialized domains. The role steers the model toward the vocabulary, priorities, and level of detail associated with that kind of expert, which tends to produce more domain-appropriate responses. Combine role prompting with an explicit expertise level for best results.
from openai import OpenAI
client = OpenAI()
ROLES = {
"security_reviewer": """You are a senior application security engineer with OWASP expertise.
Review code for: SQL injection, XSS, CSRF, insecure deserialization, broken auth, sensitive data exposure.
Rate each finding: CRITICAL / HIGH / MEDIUM / LOW / INFO.
Always suggest the specific fix, not just the problem.""",
"performance_analyst": """You are a database performance expert specializing in PostgreSQL query optimization.
When analyzing queries: identify full table scans, missing indexes, N+1 patterns, excessive joins.
Always provide the EXPLAIN ANALYZE output interpretation and the optimized query.""",
"code_explainer": """You are a patient programming teacher explaining to a junior developer.
Use simple analogies, avoid jargon. Break complex code into logical steps.
After explaining, give a "Key Takeaway" in one sentence.""",
}
def expert_review(code: str, role: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": ROLES[role]},
            {"role": "user", "content": f"```python\n{code}\n```"},
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content

sample_code = """
def login(username, password):
    sql = f"SELECT * FROM users WHERE username='{username}' AND password='{password}'"
    result = db.execute(sql)
    return result.fetchone() is not None
"""
print(expert_review(sample_code, "security_reviewer"))
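To keep role and expertise level explicit and consistent across many roles, the role text can be generated from a few fields instead of being written by hand each time. This is a minimal sketch; `make_role_prompt` and its parameters are hypothetical and only illustrate the idea.

```python
def make_role_prompt(title: str, years: int, focus: list[str], audience: str) -> str:
    """Compose a role prompt that states the role, expertise level, focus, and audience."""
    focus_text = ", ".join(focus)
    return (
        f"You are a {title} with {years}+ years of experience.\n"
        f"Focus on: {focus_text}.\n"
        f"Write for this audience: {audience}."
    )


role_prompt = make_role_prompt(
    title="senior application security engineer",
    years=10,
    focus=["SQL injection", "XSS", "broken auth"],
    audience="backend developers fixing the code",
)
print(role_prompt)
```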
Self-consistency generates multiple responses at higher temperature, then selects the most common answer by majority vote. This technique significantly improves accuracy on ambiguous or difficult questions where any single model call might be wrong but multiple calls converge on the correct answer. It trades cost (N API calls) for accuracy.
from openai import OpenAI
from collections import Counter
client = OpenAI()
def self_consistent_answer(question: str, n_samples: int = 5, temperature: float = 0.7) -> tuple[str, float]:
    """Generate N answers and return the most common one with its agreement rate."""
    answers = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Answer the question. End your response with 'FINAL ANSWER: [your answer]'"},
                {"role": "user", "content": question},
            ],
            temperature=temperature,
            max_tokens=300,
        )
        text = response.choices[0].message.content
        # Extract the final answer
        if "FINAL ANSWER:" in text:
            answer = text.split("FINAL ANSWER:")[-1].strip()
        else:
            answer = text.strip()
        answers.append(answer)
    # Majority vote
    counter = Counter(answers)
    best_answer, count = counter.most_common(1)[0]
    confidence = count / n_samples
    return best_answer, confidence

N_SAMPLES = 5
answer, confidence = self_consistent_answer(
    "A store sells apples for $0.75 each and oranges for $1.25 each. "
    "If you buy 3 apples and 4 oranges, what is the total cost?",
    n_samples=N_SAMPLES,
)
print(f"Answer: {answer}")
print(f"Confidence: {confidence:.0%} ({round(confidence * N_SAMPLES)}/{N_SAMPLES} samples agreed)")
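The majority vote above compares raw strings, so "$7.25" and "7.25" would count as different answers. The sketch below normalizes answers before voting. `normalize_answer` and `majority_vote` are hypothetical helpers, the sample answers are made up, and the normalization assumes the first number in the answer is the value of interest, which will not hold for every task.

```python
import re
from collections import Counter


def normalize_answer(answer: str) -> str:
    """Reduce an answer to a canonical form so equivalent answers vote together."""
    match = re.search(r"-?\d+(?:,\d{3})*(?:\.\d+)?", answer)
    if match:
        # Numeric answers: drop thousands separators and format to two decimals
        number = float(match.group().replace(",", ""))
        return f"{number:.2f}"
    # Non-numeric answers: compare case-insensitively without a trailing period
    return answer.strip().lower().rstrip(".")


def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common normalized answer and the share of samples that agree."""
    counts = Counter(normalize_answer(a) for a in answers)
    best, count = counts.most_common(1)[0]
    return best, count / len(answers)


# Made-up samples for the apples-and-oranges question (3 * 0.75 + 4 * 1.25 = 7.25)
samples = ["$7.25", "7.25", "The total is $7.25.", "$7.50", "7.25 dollars"]
best, agreement = majority_vote(samples)
print(best, agreement)
```

With these samples, four of the five answers normalize to the same value even though none of the raw strings match exactly.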
Controlling output length and format is critical for production systems. Vague prompts produce variable-length responses that break downstream parsing. Use explicit format instructions, length constraints in words or bullet points, and delimiters to clearly separate sections. For code generation, specify the exact function signature, docstring format, and whether to include tests.
from openai import OpenAI
client = OpenAI()
# Strict format control
FORMAT_SYSTEM = """You write concise technical documentation.
Rules:
- Maximum 3 bullet points per section
- Each bullet point: max 15 words
- Use present tense ("Returns X" not "Will return X")
- No filler phrases ("This function...", "Please note that...")
- Code terms in backticks
Output structure:
## Purpose (1 sentence)
## Parameters (table: Name | Type | Description)
## Returns (1 bullet)
## Example (code block, max 5 lines)"""
def document_function(code: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": FORMAT_SYSTEM},
            {"role": "user", "content": f"Document this function:\n```python\n{code}\n```"},
        ],
        temperature=0,
        max_tokens=400,
    )
    return response.choices[0].message.content

sample = """
def paginate(items: list, page: int, page_size: int = 20) -> dict:
    start = (page - 1) * page_size
    end = start + page_size
    return {"items": items[start:end], "page": page, "total": len(items), "pages": -(-len(items) // page_size)}
"""
print(document_function(sample))
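Format rules in a prompt are requests, not guarantees, so it helps to check the output before passing it downstream. The sketch below verifies three of the rules from `FORMAT_SYSTEM`: the four required sections, at most three bullets per section, and at most 15 words per bullet. `check_format` is a hypothetical helper and the sample documents are made up.

```python
REQUIRED_SECTIONS = ["## Purpose", "## Parameters", "## Returns", "## Example"]


def check_format(doc: str, max_bullets: int = 3, max_words: int = 15) -> list[str]:
    """Return a list of rule violations; an empty list means the output passed."""
    problems = [f"missing section: {h}" for h in REQUIRED_SECTIONS if h not in doc]
    section, bullets = None, 0
    for line in doc.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            # A new section resets the bullet counter
            section, bullets = stripped, 0
        elif stripped.startswith(("- ", "* ")):
            bullets += 1
            if bullets == max_bullets + 1:
                problems.append(f"{section}: more than {max_bullets} bullets")
            words = len(stripped[2:].split())
            if words > max_words:
                problems.append(f"{section}: bullet has {words} words (max {max_words})")
    return problems


good_doc = "\n".join([
    "## Purpose",
    "Splits a list into fixed-size pages.",
    "## Parameters",
    "| Name | Type | Description |",
    "| items | list | Items to split |",
    "## Returns",
    "- Dict with the page slice and page counts",
    "## Example",
    "paginate([1, 2, 3], page=1, page_size=2)",
])
bad_doc = "## Purpose\n- one\n- two\n- three\n- four\n## Parameters\n## Example\n"

print(check_format(good_doc))
print(check_format(bad_doc))
```

A non-empty result can trigger a retry or be logged as a prompt regression.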
Prompt engineering without evaluation is guessing. Build a test suite of representative inputs with expected outputs, then measure your prompt against it. Use an LLM as a judge to score outputs when ground truth is subjective (quality, tone, helpfulness). Track prompt versions like code — small wording changes can have large accuracy impacts.
import json
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()

@dataclass
class TestCase:
    input: str
    expected_output: str
    description: str

def llm_judge(question: str, expected: str, actual: str) -> dict:
    """Use GPT-4o-mini to judge output quality."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Judge this LLM output:
Question: {question}
Expected: {expected}
Actual: {actual}
Rate as JSON: {{"score": 1-5, "correct": true/false, "reason": "brief explanation"}}""",
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

def evaluate_prompt(system_prompt: str, test_cases: list[TestCase], model: str = "gpt-4o-mini") -> dict:
    """Run all test cases and compute aggregate metrics."""
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    results = []
    for tc in test_cases:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": tc.input},
            ],
            temperature=0,
            max_tokens=200,
        )
        actual = response.choices[0].message.content
        judgment = llm_judge(tc.input, tc.expected_output, actual)
        results.append({"case": tc.description, "score": judgment["score"], "correct": judgment["correct"]})
    avg_score = sum(r["score"] for r in results) / len(results)
    accuracy = sum(1 for r in results if r["correct"]) / len(results)
    return {"avg_score": avg_score, "accuracy": accuracy, "results": results}
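When ground truth is objective, an exact-match score is cheaper and more repeatable than an LLM judge, and a short fingerprint of the prompt text makes it easy to tie each score to a prompt version. The sketch below shows both. `prompt_fingerprint`, `exact_match_accuracy`, and `fake_model` are hypothetical names, and `fake_model` is a stand-in for a real API call so that the harness can be exercised offline.

```python
import hashlib
from typing import Callable


def prompt_fingerprint(prompt: str) -> str:
    """Short, stable identifier for a prompt version (first 12 hex chars of SHA-256)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def exact_match_accuracy(generate: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Share of cases where the generated output equals the expected one, ignoring case and outer whitespace."""
    if not cases:
        return 0.0
    hits = sum(
        1 for text, expected in cases
        if generate(text).strip().lower() == expected.strip().lower()
    )
    return hits / len(cases)


def fake_model(text: str) -> str:
    # Stand-in for a model call: a trivial keyword rule, used only for offline testing
    return "positive" if "love" in text.lower() else "negative"


cases = [
    ("I love this phone", "positive"),
    ("Terrible battery life", "negative"),
    ("It is fine, I guess", "neutral"),
]
prompt_v1 = "Classify the sentiment as positive, negative, or neutral."
accuracy = exact_match_accuracy(fake_model, cases)
print(prompt_fingerprint(prompt_v1), f"{accuracy:.2f}")
```

Storing the fingerprint next to each score shows which wording produced which result when prompts change over time.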