Prompt Engineering Guide: Techniques for Better LLM Output

Prompt engineering is the practice of designing inputs to language models that reliably produce high-quality, accurate, and well-formatted outputs. As LLMs become central to production applications, the quality of your prompts directly affects the quality of your product. A well-engineered prompt can turn unreliable model output into consistent, structured results, often without the cost or complexity of fine-tuning.

This guide covers the most impactful prompting techniques in 2026: system prompt design, chain-of-thought reasoning, few-shot examples, structured output enforcement, role prompting, self-consistency, and systematic prompt evaluation — all with concrete Python examples against real APIs.

System Prompt Design
Chain-of-Thought Reasoning
Few-Shot Prompting
Structured Output and JSON Mode
Role and Persona Prompting
Self-Consistency and Majority Voting
Output Length and Format Control
Prompt Evaluation and Testing

System Prompt Design

The system prompt is your most powerful tool — it sets the model's persona, constraints, output format, and domain knowledge. A good system prompt eliminates the need to repeat instructions in every user message. For production systems, think of the system prompt as configuration code: version-control it, review it carefully, and test changes before deploying.

Key principles: be specific about role and expertise level, define the output format explicitly, list constraints as positive rules ("always" rather than "never"), and provide examples of ideal responses when possible.

from openai import OpenAI

client = OpenAI()

# Weak system prompt (vague)
WEAK_SYSTEM = "You are a helpful assistant."

# Strong system prompt (specific, structured, constrained)
STRONG_SYSTEM = """You are a senior Python code reviewer with 10+ years of experience.

When reviewing code:
1. Identify bugs and logic errors first (Priority: Critical)
2. Note security vulnerabilities (Priority: High)
3. Suggest performance improvements (Priority: Medium)
4. Comment on style and readability (Priority: Low)

Output format:
- Use the exact structure: CRITICAL / HIGH / MEDIUM / LOW sections
- Each issue: [LINE N] Description. Fix: suggested_code
- End with a SUMMARY score: /10 with 1-sentence justification

If the code has no issues, say "LGTM: No issues found." and explain why it's good."""

code_to_review = '''
def get_user(user_id):
    query = "SELECT * FROM users WHERE id = " + user_id
    return db.execute(query)
'''

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": STRONG_SYSTEM},
        {"role": "user", "content": f"Review this code:\n```python\n{code_to_review}\n```"}
    ],
    temperature=0.1,  # Low temperature for more consistent output (not strictly deterministic)
)
print(response.choices[0].message.content)

Claude tip: Claude responds especially well to XML-tagged sections in system prompts: <instructions>, <examples>, <constraints>. This is Anthropic's recommended pattern for complex system prompts.
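
As a rough illustration of the XML-tagged layout, the sketch below assembles a system prompt from named sections. The helper name build_tagged_prompt and the section contents are hypothetical, chosen for this example; only the tag names come from the tip above.

```python
def build_tagged_prompt(role: str, sections: dict[str, list[str]]) -> str:
    """Assemble a system prompt with one XML-style tag per section.

    Hypothetical helper for illustration; not part of any SDK.
    """
    parts = [role.strip()]
    for tag, lines in sections.items():
        body = "\n".join(f"- {line}" for line in lines)
        parts.append(f"<{tag}>\n{body}\n</{tag}>")
    return "\n\n".join(parts)


prompt = build_tagged_prompt(
    "You are a senior Python code reviewer.",
    {
        "instructions": [
            "Identify bugs and logic errors first.",
            "Note security vulnerabilities.",
        ],
        "constraints": ["Always cite the line number of each issue."],
        "examples": [
            "[LINE 3] Query built by string concatenation. Fix: use a parameterized query."
        ],
    },
)
print(prompt)
```

The resulting string would be passed as the system prompt. Keeping sections in a dict makes it easy to diff and review individual sections under version control.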

Chain-of-Thought Reasoning

Chain-of-thought (CoT) prompting instructs the model to show its reasoning step by step before giving a final answer. This often improves accuracy on multi-step problems (math, logic, code debugging, and complex analysis) because intermediate results are written out and later steps can build on them. Adding "Think step by step" to a prompt is reported to increase accuracy by 10–40% on some reasoning benchmarks, though gains vary widely by model and task, so measure the effect on your own workload.

from openai import OpenAI

client = OpenAI()

# Without CoT — model jumps to answer, more likely to make errors
def ask_direct(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
        max_tokens=100,
    )
    return response.choices[0].message.content

# With CoT — model reasons before answering
def ask_with_cot(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"{question}\n\nThink step by step, then give your final answer on a new line starting with 'ANSWER:'"
        }],
        max_tokens=500,
    )
    return response.choices[0].message.content

# Zero-shot CoT
problem = "If a train travels 150km in 2.5 hours, then slows to 60% of that speed for another 1.5 hours, what is the total distance traveled?"
print(ask_with_cot(problem))

# Structured CoT with explicit reasoning format
STRUCTURED_COT_SYSTEM = """Solve problems using this exact format:
GIVEN: [list known facts]
FIND: [what we need to calculate]
STEPS:
1. [step with calculation]
2. [step with calculation]
...
ANSWER: [final answer with units]
VERIFY: [sanity check the answer]"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": STRUCTURED_COT_SYSTEM},
        {"role": "user", "content": problem}
    ],
    temperature=0,
)
print(response.choices[0].message.content)
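
Once the model is asked to end with an 'ANSWER:' line, downstream code still has to pull that line out. The sketch below shows one way to do it, along with a direct calculation of the train problem to compare against. The function name extract_answer is hypothetical, and sample_response is a hand-written stand-in rather than real model output.

```python
from typing import Optional


def extract_answer(cot_text: str, marker: str = "ANSWER:") -> Optional[str]:
    """Return the text after the last line starting with the marker, or None."""
    answer = None
    for line in cot_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(marker):
            answer = stripped[len(marker):].strip()
    return answer


# Hand-written stand-in for a model response (assumption, not real output).
sample_response = """Step 1: speed = 150 km / 2.5 h = 60 km/h
Step 2: reduced speed = 0.6 * 60 = 36 km/h
Step 3: second leg = 36 km/h * 1.5 h = 54 km
ANSWER: 204 km"""

# Reference value computed directly from the problem statement.
first_leg_km = 150
speed_kmh = first_leg_km / 2.5             # initial speed in km/h
second_leg_km = 0.6 * speed_kmh * 1.5      # distance at 60% speed for 1.5 h
expected_km = first_leg_km + second_leg_km

print(extract_answer(sample_response), "| expected:", round(expected_km), "km")
```

Returning None when the marker is missing lets the caller decide whether to retry or fall back to the full text.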

Few-Shot Prompting

Few-shot prompting provides 2–5 examples of ideal input/output pairs before the actual query. This teaches the model the exact format, tone, and depth you expect, often more effectively than describing it in words. Few-shot examples are especially useful for extraction tasks, custom classification schemes, and domain-specific formatting.

from openai import OpenAI

client = OpenAI()

# Few-shot extraction: extract product specs from unstructured text
FEW_SHOT_SYSTEM = """Extract product specifications as JSON. Examples:

Input: "The XR-500 camera shoots 4K at 60fps with 24MP stills and 10-hour battery life"
Output: {"model": "XR-500", "video": "4K@60fps", "photo_mp": 24, "battery_hours": 10}

Input: "Dell XPS 15 features Intel i9-13900H, 32GB DDR5 RAM, 1TB NVMe SSD, OLED display"
Output: {"model": "Dell XPS 15", "cpu": "Intel i9-13900H", "ram_gb": 32, "storage_tb": 1, "display": "OLED"}

Input: "Sony WH-1000XM5 headphones: 30hr battery, ANC, Bluetooth 5.2, 3-minute quick charge"
Output: {"model": "Sony WH-1000XM5", "battery_hours": 30, "anc": true, "bluetooth": "5.2", "quick_charge_minutes": 3}

Return only valid JSON, no explanation."""

test_inputs = [
    "Apple iPhone 16 Pro: A18 Pro chip, 48MP main camera, 6.3-inch ProMotion display, all-day battery",
    "Samsung 85-inch QLED TV, 4K 120Hz, HDR10+, 4x HDMI 2.1 ports, built-in Alexa",
]

import json
for text in test_inputs:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": FEW_SHOT_SYSTEM},
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"},
        temperature=0,
        max_tokens=200,
    )
    data = json.loads(response.choices[0].message.content)
    print(data)
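
An alternative layout, sketched below, supplies each example as a user/assistant message pair instead of embedding all examples in the system prompt. The helper name build_few_shot_messages is hypothetical, and whether this layout beats inline examples depends on the model and task, so it is worth measuring both.

```python
import json


def build_few_shot_messages(
    system: str, examples: list[tuple[str, dict]], query: str
) -> list[dict]:
    """Build a chat message list with one user/assistant turn per example."""
    messages = [{"role": "system", "content": system}]
    for text, expected in examples:
        messages.append({"role": "user", "content": text})
        # The assistant turn shows the exact output format we want back.
        messages.append({"role": "assistant", "content": json.dumps(expected)})
    messages.append({"role": "user", "content": query})
    return messages


examples = [
    (
        "The XR-500 camera shoots 4K at 60fps with 24MP stills and 10-hour battery life",
        {"model": "XR-500", "video": "4K@60fps", "photo_mp": 24, "battery_hours": 10},
    ),
    (
        "Sony WH-1000XM5 headphones: 30hr battery, ANC, Bluetooth 5.2, 3-minute quick charge",
        {"model": "Sony WH-1000XM5", "battery_hours": 30, "anc": True,
         "bluetooth": "5.2", "quick_charge_minutes": 3},
    ),
]
query = "Samsung 85-inch QLED TV, 4K 120Hz, HDR10+, 4x HDMI 2.1 ports, built-in Alexa"
messages = build_few_shot_messages(
    "Extract product specifications as JSON. Return only valid JSON.", examples, query
)
for m in messages:
    print(m["role"], "->", m["content"][:60])
```

The returned list has the shape expected by the messages parameter used in the calls above. Serializing examples with json.dumps also guarantees the demonstrations themselves are valid JSON.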

Structured Output and JSON Mode

Reliable structured output is essential for production pipelines. JSON mode (response_format={"type": "json_object"}) makes the model emit syntactically valid JSON, but it does not enforce any particular schema, and the output can still be cut off if the token limit is reached. For complex schemas, describe the exact structure in your system prompt, or use Pydantic with OpenAI's structured outputs feature for automatic schema enforcement.

from openai import OpenAI
from pydantic import BaseModel
from typing import Optional
import json

client = OpenAI()

# Method 1: JSON mode with schema in prompt
def extract_entity(text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """Extract entities as JSON with this exact schema:
{
  "people": [{"name": str, "role": str, "company": str | null}],
  "organizations": [{"name": str, "type": str}],
  "locations": [{"name": str, "type": "city|country|region"}],
  "dates": [{"text": str, "iso": str | null}]
}"""},
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

# Method 2: Pydantic structured outputs (OpenAI beta feature)
class SentimentResult(BaseModel):
    sentiment: str           # positive | negative | neutral
    confidence: float        # 0.0 - 1.0
    key_phrases: list[str]   # supporting phrases from text
    suggested_action: Optional[str]

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Analyze the sentiment of customer feedback."},
        {"role": "user", "content": "The product arrived late and was damaged. Very disappointed."}
    ],
    response_format=SentimentResult,
)
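result = response.choices[0].message.parsed
if result is None:  # parsed is None if the model refused; see message.refusal
    raise RuntimeError(f"No parsed output: {response.choices[0].message.refusal}")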
print(f"Sentiment: {result.sentiment} ({result.confidence:.0%})")
print(f"Key phrases: {result.key_phrases}")
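
When only JSON mode is available, the application still has to check the shape of what comes back. The sketch below validates a raw JSON string against the fields of SentimentResult using only the standard library. The function name validate_sentiment and the sample strings are hypothetical; in practice a schema library such as Pydantic would usually do this work.

```python
import json

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def validate_sentiment(raw: str) -> dict:
    """Parse raw JSON and check it against the SentimentResult field layout.

    Raises ValueError on malformed JSON or on any field that fails a check.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if data.get("sentiment") not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {data.get('sentiment')!r}")
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    phrases = data.get("key_phrases")
    if not isinstance(phrases, list) or not all(isinstance(p, str) for p in phrases):
        raise ValueError("key_phrases must be a list of strings")
    action = data.get("suggested_action")
    if action is not None and not isinstance(action, str):
        raise ValueError("suggested_action must be a string or null")
    return data


# Hand-written sample payloads (assumptions, not real model output).
good = ('{"sentiment": "negative", "confidence": 0.92, '
        '"key_phrases": ["arrived late", "damaged"], "suggested_action": null}')
truncated = '{"sentiment": "negative", "confid'

print(validate_sentiment(good)["sentiment"])
try:
    validate_sentiment(truncated)
except ValueError as exc:
    print("rejected:", type(exc).__name__)
```

Treating truncated and off-schema output as the same error type keeps retry logic simple: any ValueError means the response should not reach downstream code.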

Role and Persona Prompting

Assigning a specific expert role to the model ("You are a senior security engineer...", "You are an experienced oncologist reviewing...") can improve output quality for specialized domains, though the size of the effect varies by model and task. The role steers the model toward the vocabulary, priorities, and level of detail associated with that kind of expert. Combine role prompting with an explicit expertise level and a concrete checklist for best results.

from openai import OpenAI

client = OpenAI()

ROLES = {
    "security_reviewer": """You are a senior application security engineer with OWASP expertise.
Review code for: SQL injection, XSS, CSRF, insecure deserialization, broken auth, sensitive data exposure.
Rate each finding: CRITICAL / HIGH / MEDIUM / LOW / INFO.
Always suggest the specific fix, not just the problem.""",

    "performance_analyst": """You are a database performance expert specializing in PostgreSQL query optimization.
When analyzing queries: identify full table scans, missing indexes, N+1 patterns, excessive joins.
Always provide the EXPLAIN ANALYZE output interpretation and the optimized query.""",

    "code_explainer": """You are a patient programming teacher explaining to a junior developer.
Use simple analogies, avoid jargon. Break complex code into logical steps.
After explaining, give a "Key Takeaway" in one sentence.""",
}

def expert_review(code: str, role: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": ROLES[role]},
            {"role": "user", "content": f"```python\n{code}\n```"}
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content

sample_code = """
def login(username, password):
    sql = f"SELECT * FROM users WHERE username='{username}' AND password='{password}'"
    result = db.execute(sql)
    return result.fetchone() is not None
"""

print(expert_review(sample_code, "security_reviewer"))

Self-Consistency and Majority Voting

Self-consistency samples several responses at a higher temperature, then selects the most common final answer by vote. It tends to help on ambiguous or difficult questions where any single model call might be wrong but most calls converge on the correct answer. It trades cost (N API calls) for accuracy, and it only works when final answers can be compared for equality.

from openai import OpenAI
from collections import Counter

client = OpenAI()

def self_consistent_answer(question: str, n_samples: int = 5, temperature: float = 0.7) -> tuple[str, float]:
    """Generate N answers and return the most common one with its agreement rate."""
    answers = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Answer the question. End your response with 'FINAL ANSWER: [your answer]'"},
                {"role": "user", "content": question}
            ],
            temperature=temperature,
            max_tokens=300,
        )
        text = response.choices[0].message.content
        # Extract the final answer
        if "FINAL ANSWER:" in text:
            answer = text.split("FINAL ANSWER:")[-1].strip()
        else:
            answer = text.strip()
        answers.append(answer)

    # Majority vote
    counter = Counter(answers)
    best_answer, count = counter.most_common(1)[0]
    confidence = count / n_samples
    return best_answer, confidence

n_samples = 5
answer, confidence = self_consistent_answer(
    "A store sells apples for $0.75 each and oranges for $1.25 each. "
    "If you buy 3 apples and 4 oranges, what is the total cost?",
    n_samples=n_samples,
)
print(f"Answer: {answer}")
# round() avoids float truncation when recovering the vote count
print(f"Confidence: {confidence:.0%} ({round(confidence * n_samples)}/{n_samples} samples agreed)")
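
Voting on raw strings can split the vote between answers that differ only in formatting, such as "$7.25" and "7.25 dollars". The sketch below normalizes numeric answers before counting. The function names and the sample answers are hypothetical, and taking the last number in each answer is a simplifying assumption that will not suit every task.

```python
import re
from collections import Counter
from typing import Optional


def normalize_numeric(answer: str) -> Optional[str]:
    """Return the last number in the answer as a 2-decimal string, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    if not numbers:
        return None
    return f"{float(numbers[-1]):.2f}"


def vote(answers: list[str]) -> tuple[Optional[str], float]:
    """Majority vote over normalized answers; agreement is relative to all samples."""
    normalized = [n for n in map(normalize_numeric, answers) if n is not None]
    if not normalized:
        return None, 0.0
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(answers)


# Hand-written sample answers (assumptions, not real model output).
samples = ["$7.25", "7.25", "The total is $7.25.", "$7.50", "7.25 dollars"]

# Reference value: 3 apples at $0.75 plus 4 oranges at $1.25.
expected = 3 * 0.75 + 4 * 1.25

raw_top_count = Counter(samples).most_common(1)[0][1]
winner, agreement = vote(samples)
print("raw top count:", raw_top_count)
print("normalized winner:", winner, "agreement:", agreement, "expected:", expected)
```

Without normalization every sample string here is unique, so the raw vote cannot pick a winner. After normalization, four of the five samples agree.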

Output Length and Format Control

Controlling output length and format is critical for production systems. Vague prompts produce variable-length responses that break downstream parsing. Use explicit format instructions, length constraints in words or bullet points, and delimiters to clearly separate sections. For code generation, specify the exact function signature, docstring format, and whether to include tests.

from openai import OpenAI

client = OpenAI()

# Strict format control
FORMAT_SYSTEM = """You write concise technical documentation.

Rules:
- Maximum 3 bullet points per section
- Each bullet point: max 15 words
- Use present tense ("Returns X" not "Will return X")
- No filler phrases ("This function...", "Please note that...")
- Code terms in backticks

Output structure:
## Purpose (1 sentence)
## Parameters (table: Name | Type | Description)
## Returns (1 bullet)
## Example (code block, max 5 lines)"""

def document_function(code: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": FORMAT_SYSTEM},
            {"role": "user", "content": f"Document this function:\n```python\n{code}\n```"}
        ],
        temperature=0,
        max_tokens=400,
    )
    return response.choices[0].message.content

sample = """
def paginate(items: list, page: int, page_size: int = 20) -> dict:
    start = (page - 1) * page_size
    end = start + page_size
    return {"items": items[start:end], "page": page, "total": len(items), "pages": -(-len(items) // page_size)}
"""
print(document_function(sample))
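
Format rules in a prompt are requests, not guarantees, so it can help to check the output before passing it on. The sketch below checks a response against a few of the rules in FORMAT_SYSTEM: required sections, at most 3 bullets per section, and at most 15 words per bullet. The function name check_format and both sample documents are hypothetical.

```python
REQUIRED_SECTIONS = ["Purpose", "Parameters", "Returns", "Example"]


def check_format(doc: str, max_bullets: int = 3, max_words: int = 15) -> list[str]:
    """Return a list of rule violations; an empty list means the output passed."""
    violations = []
    section = None
    bullet_counts: dict[str, int] = {}
    for line in doc.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            section = stripped[3:]
            bullet_counts[section] = 0
        elif stripped.startswith("- ") and section is not None:
            bullet_counts[section] += 1
            words = len(stripped[2:].split())
            if words > max_words:
                violations.append(f"{section}: bullet has {words} words (max {max_words})")
    for name, count in bullet_counts.items():
        if count > max_bullets:
            violations.append(f"{name}: {count} bullets (max {max_bullets})")
    for name in REQUIRED_SECTIONS:
        if not any(s.startswith(name) for s in bullet_counts):
            violations.append(f"missing section: {name}")
    return violations


# Hand-written sample outputs (assumptions, not real model output).
good_doc = """## Purpose
Returns one page of items with pagination metadata.
## Parameters
| Name | Type | Description |
| items | list | Full collection |
## Returns
- `dict` with items, page, total, and pages keys
## Example
paginate(list(range(50)), page=2)"""

bad_doc = """## Purpose
Paginates.
## Parameters
| Name | Type | Description |
## Returns
- a
- b
- c
- This bullet keeps going well past the fifteen word limit that the system prompt asked the model to respect"""

print(check_format(good_doc))
for problem in check_format(bad_doc):
    print(problem)
```

A non-empty result could trigger a retry or a fallback path instead of letting a malformed response break downstream parsing.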

Prompt Evaluation and Testing

Prompt engineering without evaluation is guessing. Build a test suite of representative inputs with expected outputs, then measure your prompt against it. Use an LLM as a judge to score outputs when ground truth is subjective (quality, tone, helpfulness). Track prompt versions like code — small wording changes can have large accuracy impacts.

import json
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()

@dataclass
class TestCase:
    input: str
    expected_output: str
    description: str

def llm_judge(question: str, expected: str, actual: str) -> dict:
    """Use GPT-4o-mini to judge output quality."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"""Judge this LLM output:
Question: {question}
Expected: {expected}
Actual: {actual}

Rate as JSON: {{"score": 1-5, "correct": true/false, "reason": "brief explanation"}}"""
        }],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

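def evaluate_prompt(system_prompt: str, test_cases: list[TestCase], model: str = "gpt-4o-mini") -> dict:
    """Run all test cases and compute aggregate metrics."""
    if not test_cases:
        # Avoid a ZeroDivisionError when computing the averages below
        raise ValueError("test_cases must contain at least one case")
    results = []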
    for tc in test_cases:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": tc.input}
            ],
            temperature=0, max_tokens=200,
        )
        actual = response.choices[0].message.content
        judgment = llm_judge(tc.input, tc.expected_output, actual)
        results.append({"case": tc.description, "score": judgment["score"], "correct": judgment["correct"]})

    avg_score = sum(r["score"] for r in results) / len(results)
    accuracy = sum(1 for r in results if r["correct"]) / len(results)
    return {"avg_score": avg_score, "accuracy": accuracy, "results": results}

Best practice: Keep a prompt registry with version numbers, evaluation scores, and change notes. A prompt that scores 4.2/5 on your test suite is a baseline — any new version must beat it before deployment.
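
A prompt registry with a "must beat the baseline" rule can be sketched in a few lines. The class and field names below are hypothetical, and the scores are made-up illustration values rather than measured results. A real registry would likely persist to a file or database instead of keeping versions in memory.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PromptVersion:
    version: str
    text: str
    avg_score: float      # e.g. the avg_score from an evaluation run
    note: str = ""        # change note for reviewers


@dataclass
class PromptRegistry:
    versions: list[PromptVersion] = field(default_factory=list)

    def baseline(self) -> Optional[PromptVersion]:
        """The most recently accepted version, or None if the registry is empty."""
        return self.versions[-1] if self.versions else None

    def register(self, candidate: PromptVersion) -> bool:
        """Accept the candidate only if it strictly beats the current baseline."""
        current = self.baseline()
        if current is not None and candidate.avg_score <= current.avg_score:
            return False
        self.versions.append(candidate)
        return True


registry = PromptRegistry()
accepted = [
    registry.register(PromptVersion("v1", "You are a code reviewer.", 4.2, "initial")),
    registry.register(PromptVersion("v2", "You are a reviewer.", 4.1, "shortened role")),
    registry.register(PromptVersion("v3", "You are a senior code reviewer.", 4.5, "added seniority")),
]
print(accepted, "baseline:", registry.baseline().version)
```

Rejected candidates are dropped here for brevity; keeping them with their scores and notes would give a fuller history of what was tried.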