Multimodal AI: Vision, Audio and Document Understanding

Multimodal AI systems process multiple types of input (images, audio, video, and documents) alongside text, unlocking applications that purely text-based models cannot address. In 2026, GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro all accept images alongside text in a single API call, which removes much of the need to chain separate vision and OCR models. Native audio input varies by provider, so transcription may still run as a separate step. This shift enables real-world applications such as automated invoice processing, visual code debugging, meeting transcription with speaker identification, and intelligent document extraction.

This guide covers the practical implementation of vision, audio, and document understanding using the leading multimodal APIs, with production-ready Python code for each modality.

Vision API Basics with GPT-4o
Image Analysis Use Cases
Document Understanding and OCR
Claude Vision for Document Analysis
Audio Transcription and Understanding
Multi-Image Comparison
Production Patterns
Cost and Latency Tradeoffs

Vision API Basics with GPT-4o

GPT-4o accepts images via base64-encoded data or public URLs. Images are embedded directly into the messages array as content blocks alongside text. The detail parameter controls the resolution tier: "low" processes a 512×512 thumbnail for speed, while "high" tiles the image at full resolution for accuracy on text-dense content like receipts and forms.

import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def analyze_image(image_path: str, prompt: str, detail: str = "auto") -> str:
    """Analyze a local image with GPT-4o."""
    image_bytes = Path(image_path).read_bytes()
    b64 = base64.b64encode(image_bytes).decode("utf-8")

    # Detect media type from extension
    ext = Path(image_path).suffix.lower()
    media_types = {".jpg": "image/jpeg", ".jpeg": "image/jpeg",
                   ".png": "image/png", ".gif": "image/gif", ".webp": "image/webp"}
    media_type = media_types.get(ext, "image/jpeg")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{media_type};base64,{b64}", "detail": detail}
                },
                {"type": "text", "text": prompt}
            ]
        }],
        max_tokens=1024,
    )
    return response.choices[0].message.content

# Describe an image
description = analyze_image("product.jpg", "Describe this product image in detail.")

# Extract text from screenshot
text = analyze_image("screenshot.png", "Extract all text from this screenshot.", detail="high")

# Analyze a chart
insights = analyze_image("sales_chart.png",
    "What are the key trends? What month had the highest value? Any anomalies?")

print(insights)

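Token cost: "low" detail costs a flat ~85 tokens per image. "high" detail costs 85 base tokens plus 170 tokens per 512×512 tile, counted after the image is downscaled (first to fit within 2048×2048, then so its shortest side is 768 px). A 1024×1024 image at high detail is therefore resized to 768×768, covers 4 tiles, and costs 85 + 4 × 170 = 765 tokens. Use "low" for thumbnails and reserve "high" for cases where text or fine detail matters.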
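The tile arithmetic above can be sketched as a small estimator. This is an approximation, assuming the resize rules OpenAI has documented for GPT-4o (fit within 2048×2048, shortest side down to 768 px, no upscaling of smaller images); the function name `estimate_image_tokens` is hypothetical, and actual billing may differ by model.

```python
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate GPT-4o image token cost (assumed rules, not an official API)."""
    if detail == "low":
        return 85  # flat cost regardless of resolution

    # Step 1: downscale to fit within a 2048x2048 square (never upscale)
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: downscale so the shortest side is at most 768 px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count the 512x512 tiles needed to cover the resized image
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(estimate_image_tokens(1024, 1024))          # 765
print(estimate_image_tokens(2048, 4096))          # 1105
print(estimate_image_tokens(4000, 3000, "low"))   # 85
```

Running an estimate like this before sending a batch makes it easier to decide whether resizing or switching to "low" detail is worth it.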

Image Analysis Use Cases

Vision models excel at several real-world tasks beyond simple description. Product image classification, UI bug detection from screenshots, chart interpretation, and quality control inspection are all viable production use cases with modern multimodal models. Structuring your prompt around a specific task dramatically improves output quality.

from openai import OpenAI
import json, base64
from pathlib import Path

client = OpenAI()

def extract_structured_data(image_path: str, schema_description: str) -> dict:
    """Extract structured JSON data from an image."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}", "detail": "high"}},
                {"type": "text", "text": f"Extract the following data as JSON: {schema_description}. Return only valid JSON, no markdown."}
            ]
        }],
        response_format={"type": "json_object"},
        max_tokens=512,
    )
    return json.loads(response.choices[0].message.content)

# Extract receipt data
receipt_data = extract_structured_data(
    "receipt.jpg",
    "{vendor, date, total_amount, tax_amount, line_items: [{description, quantity, unit_price, total}]}"
)
print(receipt_data)

# Classify product condition
def classify_product_condition(image_path: str) -> dict:
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # mini is sufficient for classification
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": 'Rate this product condition as JSON: {"condition": "new|good|fair|poor", "defects": [], "confidence": 0.0-1.0}'}
            ]
        }],
        response_format={"type": "json_object"},
        max_tokens=200,
    )
    return json.loads(response.choices[0].message.content)

Document Understanding and OCR

Modern vision LLMs are superior to traditional OCR for documents with complex layouts — they understand tables, form fields, checkboxes, and multi-column text semantically, not just pixel-by-pixel. For multi-page PDFs, convert pages to images and process each page or batch multiple pages per request within the context window.

import base64
import fitz  # PyMuPDF: pip install pymupdf
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def pdf_page_to_base64(pdf_path: str, page_num: int = 0, dpi: int = 200) -> str:
    """Convert a PDF page to base64-encoded PNG."""
    # Open as a context manager so the file handle is released after rendering
    with fitz.open(pdf_path) as doc:
        page = doc[page_num]
        mat = fitz.Matrix(dpi / 72, dpi / 72)  # Scale factor (PDF points are 72 per inch)
        pix = page.get_pixmap(matrix=mat)
        return base64.b64encode(pix.tobytes("png")).decode("utf-8")

def extract_invoice(pdf_path: str) -> dict:
    """Extract structured data from a PDF invoice."""
    import json
    b64_page = pdf_page_to_base64(pdf_path, page_num=0)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64_page}", "detail": "high"}},
                {"type": "text", "text": """Extract this invoice as JSON:
{
  "invoice_number": "",
  "vendor_name": "",
  "vendor_address": "",
  "invoice_date": "",
  "due_date": "",
  "line_items": [{"description": "", "qty": 0, "unit_price": 0.0, "total": 0.0}],
  "subtotal": 0.0,
  "tax": 0.0,
  "total": 0.0
}
Return only valid JSON."""}
            ]
        }],
        response_format={"type": "json_object"},
        max_tokens=1024,
    )
    return json.loads(response.choices[0].message.content)

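def process_multi_page_pdf(pdf_path: str, pages: list[int] | None = None) -> list[str]:
    """Extract text from multiple PDF pages."""
    # Only the page count is needed here, so close the document right away
    with fitz.open(pdf_path) as doc:
        page_count = len(doc)
    page_range = pages if pages is not None else range(page_count)
    results = []
    for page_num in page_range: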
        b64 = pdf_page_to_base64(pdf_path, page_num)
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "high"}},
                    {"type": "text", "text": "Extract all text from this document page, preserving structure and tables. Use markdown for tables."}
                ]
            }],
            max_tokens=2048,
        )
        results.append(response.choices[0].message.content)
    return results
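The function above sends one request per page. The alternative mentioned earlier, batching several pages per request, might look like the following sketch. The helper names `chunk_pages` and `build_batch_content` are hypothetical, and the right batch size depends on page density and the model's context window, so treat the value used here as a placeholder.

```python
from typing import Iterator

def chunk_pages(page_numbers: list[int], batch_size: int) -> Iterator[list[int]]:
    """Yield consecutive groups of at most batch_size page numbers."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for i in range(0, len(page_numbers), batch_size):
        yield page_numbers[i:i + batch_size]

def build_batch_content(pages_b64: dict[int, str], instruction: str) -> list[dict]:
    """Build one Chat Completions content array covering several pages.

    pages_b64 maps a zero-based page number to its base64-encoded PNG.
    Each image is preceded by a text label so the model can refer to pages.
    """
    content: list[dict] = []
    for page_num, b64 in pages_b64.items():
        content.append({"type": "text", "text": f"Page {page_num + 1}:"})
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "high"},
        })
    content.append({"type": "text", "text": instruction})
    return content

# Example: split a 5-page document into batches of 2 pages
batches = list(chunk_pages(list(range(5)), batch_size=2))
print(batches)  # [[0, 1], [2, 3], [4]]
```

Each batch's content array can then be passed as the `content` of a single user message, in place of the single-image array used above.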

Claude Vision for Document Analysis

Claude's vision API accepts images as base64 data or URLs within the content array. Claude excels at understanding complex document layouts, handwritten notes, technical diagrams, and mixed-format content. Its large 200K context window means you can pass many images or pages in a single call.

import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def claude_analyze_image(image_path: str, prompt: str) -> str:
    """Analyze an image with Claude."""
    image_data = base64.standard_b64encode(Path(image_path).read_bytes()).decode("utf-8")
    ext = Path(image_path).suffix.lower()
    media_type_map = {".jpg": "image/jpeg", ".jpeg": "image/jpeg",
                      ".png": "image/png", ".gif": "image/gif", ".webp": "image/webp"}
    media_type = media_type_map.get(ext, "image/jpeg")

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": image_data}},
                {"type": "text", "text": prompt}
            ]
        }]
    )
    return response.content[0].text

# Technical diagram analysis
diagram_analysis = claude_analyze_image(
    "architecture_diagram.png",
    "Describe the system architecture shown. Identify all components, their relationships, and any potential bottlenecks."
)

# Handwriting recognition
handwriting = claude_analyze_image(
    "notes.jpg",
    "Transcribe all handwritten text from this image. Preserve line breaks and formatting."
)

# Medical/scientific image analysis
chart_data = claude_analyze_image(
    "lab_results.png",
    "Extract all values from this lab results form as a structured list. Include test name, value, unit, and reference range."
)
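Since Claude can take several images in one call, a multi-image request only needs a longer content array. The sketch below builds that array from in-memory image bytes; the helper name `build_claude_image_content` is hypothetical, and labeling each image with a short text block is a prompting convention rather than an API requirement.

```python
import base64

def build_claude_image_content(images: list[tuple[bytes, str]], prompt: str) -> list[dict]:
    """Build a Messages API content array from (image_bytes, media_type) pairs."""
    content: list[dict] = []
    for index, (image_bytes, media_type) in enumerate(images, start=1):
        # A text label before each image lets the prompt refer to "Image 2", etc.
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64.standard_b64encode(image_bytes).decode("utf-8"),
            },
        })
    content.append({"type": "text", "text": prompt})
    return content
```

The returned list can be used as the `content` of the user message in `client.messages.create`, exactly where the single-image array appears in `claude_analyze_image`.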

Audio Transcription and Understanding

OpenAI Whisper provides high-accuracy speech-to-text transcription. The newer GPT-4o Audio model goes further — it understands tone, emotion, and context from audio, not just transcribed words. For most production applications, Whisper via the API is the right choice: it handles 99 languages, noisy environments, and technical vocabulary far better than browser-based speech APIs.

from openai import OpenAI
from pathlib import Path

client = OpenAI()

def transcribe_audio(audio_path: str, language: str | None = None) -> dict:
    """Transcribe audio with timestamps using Whisper."""
    # Only send `language` when the caller sets it; omitting it lets Whisper auto-detect
    extra = {"language": language} if language else {}
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",  # Includes timestamps
            timestamp_granularities=["word", "segment"],
            **extra,
        )
    return {
        "text": transcript.text,
        "language": transcript.language,
        "duration": transcript.duration,
        "segments": [
            {"start": s.start, "end": s.end, "text": s.text}
            for s in (transcript.segments or [])
        ],
    }

# Transcribe a meeting recording
result = transcribe_audio("meeting.mp3", language="en")
print(f"Duration: {result['duration']:.1f}s")
print(f"Transcript: {result['text'][:200]}...")

# Translate audio to English
def translate_audio(audio_path: str) -> str:
    """Translate non-English audio to English."""
    with open(audio_path, "rb") as f:
        translation = client.audio.translations.create(
            model="whisper-1",
            file=f,
            response_format="text",
        )
    return translation

# Post-process transcript with GPT-4o
def summarize_meeting(transcript: str) -> dict:
    """Extract action items and summary from meeting transcript."""
    import json
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract structured meeting notes from the transcript."},
            {"role": "user", "content": f"Transcript:\n{transcript}\n\nReturn JSON: {{summary, key_decisions, action_items: [{{owner, task, deadline}}]}}"}
        ],
        response_format={"type": "json_object"},
        max_tokens=1024,
    )
    return json.loads(response.choices[0].message.content)
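The segment timestamps returned by `transcribe_audio` can be turned into a readable, time-coded transcript before summarization. This is a minimal sketch that assumes the segment dictionaries have the `start`, `end`, and `text` keys built above; the helper names are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Format a duration in seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def segments_to_lines(segments: list[dict]) -> list[str]:
    """Render each segment as '[start - end] text'."""
    return [
        f"[{format_timestamp(s['start'])} - {format_timestamp(s['end'])}] {s['text'].strip()}"
        for s in segments
    ]

example = [{"start": 0.0, "end": 4.2, "text": " Welcome everyone."}]
print(segments_to_lines(example))  # ['[00:00:00 - 00:00:04] Welcome everyone.']
```

Passing the time-coded lines to `summarize_meeting` instead of the raw text lets the model cite when a decision was made.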

Multi-Image Comparison

Both GPT-4o and Claude can process multiple images in a single request, enabling before/after comparison, visual diff detection, product variant comparison, and multi-frame video analysis. Passing multiple images is as simple as adding multiple image content blocks to the same message.

import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def compare_images(image_path_a: str, image_path_b: str, comparison_prompt: str) -> str:
    """Compare two images with GPT-4o."""
    def to_b64(path):
        return base64.b64encode(Path(path).read_bytes()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Image A (before):"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{to_b64(image_path_a)}", "detail": "high"}},
                {"type": "text", "text": "Image B (after):"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{to_b64(image_path_b)}", "detail": "high"}},
                {"type": "text", "text": comparison_prompt}
            ]
        }],
        max_tokens=1024,
    )
    return response.choices[0].message.content

# UI regression testing
diff = compare_images(
    "ui_before.png",
    "ui_after.png",
    "List every visual difference between Image A and Image B. Focus on layout changes, color changes, missing or added elements."
)
print(diff)

Production Patterns

Production multimodal pipelines require careful attention to image preprocessing, error handling, and cost control. Images should be resized before sending — most vision tasks don't require full 4K resolution and oversized images waste tokens and slow down responses. Implement retry logic for transient API errors and validate outputs before trusting them downstream.

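from PIL import Image
import io, base64
from openai import OpenAI, APIConnectionError, APITimeoutError, InternalServerError, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

client = OpenAI()

def preprocess_image(image_path: str, max_size: int = 1024) -> str:
    """Resize and compress image before sending to API."""
    with Image.open(image_path) as img:
        # JPEG cannot store alpha or palette data, so normalize every mode to RGB
        if img.mode != "RGB":
            img = img.convert("RGB")
        # Resize in place, maintaining aspect ratio (thumbnail never upscales)
        img.thumbnail((max_size, max_size), Image.LANCZOS)
        # Compress to JPEG
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=85, optimize=True)
    return base64.b64encode(buf.getvalue()).decode("utf-8")

# Retry only transient failures; errors such as invalid requests should fail immediately
TRANSIENT_ERRORS = (APIConnectionError, APITimeoutError, RateLimitError, InternalServerError)

@retry(
    retry=retry_if_exception_type(TRANSIENT_ERRORS),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    reraise=True,
)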
def robust_vision_call(image_path: str, prompt: str) -> str:
    """Vision API call with retry logic."""
    b64 = preprocess_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}", "detail": "auto"}},
                {"type": "text", "text": prompt}
            ]
        }],
        max_tokens=512,
        timeout=30,
    )
    return response.choices[0].message.content
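Output validation can be as simple as checking that the extracted numbers are internally consistent. The sketch below assumes the invoice JSON shape used in `extract_invoice` earlier; the function name `validate_invoice` and the one-cent tolerance are illustrative choices, not part of any API.

```python
def validate_invoice(data: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems found in an extracted invoice (empty means OK)."""
    required = ("invoice_number", "vendor_name", "line_items", "subtotal", "tax", "total")
    problems = [f"missing field: {name}" for name in required if name not in data]
    if problems:
        return problems

    try:
        line_sum = sum(float(item["total"]) for item in data["line_items"])
        subtotal = float(data["subtotal"])
        tax = float(data["tax"])
        total = float(data["total"])
    except (KeyError, TypeError, ValueError) as exc:
        return [f"malformed or non-numeric amount: {exc!r}"]

    # Arithmetic cross-checks catch misread digits and hallucinated totals
    if abs(line_sum - subtotal) > tolerance:
        problems.append(f"line items sum to {line_sum:.2f}, but subtotal is {subtotal:.2f}")
    if abs(subtotal + tax - total) > tolerance:
        problems.append(f"subtotal + tax is {subtotal + tax:.2f}, but total is {total:.2f}")
    return problems
```

Records that come back with a non-empty problem list can be retried at higher detail or routed to human review rather than written straight to a downstream system.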

Cost and Latency Tradeoffs

Multimodal API costs are driven by image tokens plus text tokens. Understanding the tradeoffs helps you choose the right model and detail level for each task. For high-volume pipelines, the choice between cloud APIs and locally-hosted models (LLaVA, Moondream, Phi-3 Vision) becomes economically significant above a few million images per month.

GPT-4o at "low" detail: ~85 image tokens regardless of resolution. Use for thumbnails, product classification, sentiment from photos. Fast (~1–2s) and cheap.

GPT-4o at "high" detail: 85 + 170 per tile. A 1024×1024 image = ~765 tokens. Use for documents, receipts, code screenshots, detailed diagrams. Slower (~3–5s) but accurate.

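GPT-4o-mini: Supports vision at a lower per-token price than GPT-4o. Well suited to classification tasks such as product category, image moderation, and simple Q&A. Image inputs may be billed at a higher token count per image than on GPT-4o, so confirm the per-image cost on the current pricing page before assuming savings. Not reliable for complex OCR or detailed structured extraction.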

Batch processing: Use the OpenAI Batch API for offline image processing (receipts, catalog images). 50% cost discount with 24h turnaround — often the right choice for non-real-time workflows.

Recommendation: Route vision tasks by complexity. Simple classification → GPT-4o-mini low detail. Document extraction → GPT-4o high detail. Bulk offline processing → Batch API. Real-time consumer apps with cost constraints → consider fine-tuned small vision models (Phi-3-vision, LLaVA-1.6).
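For the Batch API route, requests are submitted as a JSONL file with one request object per line. The sketch below builds that file content for a set of images. The helper name `build_batch_jsonl`, the default model, and the `max_tokens` value are illustrative assumptions; the per-line fields (`custom_id`, `method`, `url`, `body`) follow the Batch API input format.

```python
import json

def build_batch_jsonl(images_b64: dict[str, str], prompt: str, model: str = "gpt-4o-mini") -> str:
    """Build Batch API input: one chat completion request per image.

    images_b64 maps a caller-chosen custom_id to a base64-encoded JPEG.
    """
    lines = []
    for custom_id, b64 in images_b64.items():
        request = {
            "custom_id": custom_id,  # used to match results back to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "max_tokens": 200,
                "messages": [{
                    "role": "user",
                    "content": [
                        {"type": "image_url",
                         "image_url": {"url": f"data:image/jpeg;base64,{b64}", "detail": "low"}},
                        {"type": "text", "text": prompt},
                    ],
                }],
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines) + "\n"
```

The resulting text would be saved to a `.jsonl` file, uploaded with `client.files.create(..., purpose="batch")`, and submitted with `client.batches.create(input_file_id=..., endpoint="/v1/chat/completions", completion_window="24h")`. Results arrive as an output file keyed by `custom_id`.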