OpenAI API Guide: GPT-4, Assistants and Batch Processing

The OpenAI API provides access to GPT-4o and the o-series reasoning models through a REST interface with official Python and Node.js SDKs. In 2026, GPT-4o remains a strong general-purpose choice for multimodal tasks, combining text, image, and audio processing in a single model family. The platform has also expanded with the Assistants API for stateful agent experiences, the Batch API for bulk processing at a 50% discount, and function calling for tool-using agents.

This guide covers everything you need: installation, chat completions, streaming, vision, function calling, the Assistants API with threads, batch processing, and embeddings — with production-ready Python code throughout.

Installation and Authentication

Install the official OpenAI Python SDK and configure your API key. The SDK handles retries with exponential backoff, timeout management, and streaming automatically. Always store API keys as environment variables — never in source code.

# Install OpenAI SDK
pip install openai

# Set API key
export OPENAI_API_KEY="sk-proj-..."

# Python: create a client
from openai import OpenAI
import os

# Client auto-reads OPENAI_API_KEY from environment
client = OpenAI()

# Or pass explicitly
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    timeout=30.0,         # Request timeout in seconds
    max_retries=3,        # Auto-retry on rate limits / server errors
)

# Available models (June 2026)
MODELS = {
    "fast":      "gpt-4o-mini",    # Cheapest, great for simple tasks
    "standard":  "gpt-4o",         # Best price/performance for most tasks
    "reasoning": "o3-mini",        # Chain-of-thought for complex reasoning
    "reasoning_full": "o3",        # Full reasoning model
}

# Quick test
response = client.chat.completions.create(
    model=MODELS["standard"],
    messages=[{"role": "user", "content": "Say hello in 3 languages."}],
    max_tokens=100
)
print(response.choices[0].message.content)
print(f"Tokens: {response.usage.prompt_tokens} in, {response.usage.completion_tokens} out")
Note: The v1.x OpenAI SDK provides a synchronous OpenAI() client and an AsyncOpenAI() client for asyncio code. The old module-level openai.ChatCompletion.create() pattern was removed in v1.0 and raises an error if called. Run pip install --upgrade openai to get the latest version.
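
To make the retry behavior mentioned in this section more concrete, here is a minimal sketch of exponential backoff with jitter. It is not the SDK's internal implementation: `call_with_backoff`, its parameters, and the `TransientError` class are hypothetical names chosen for illustration, and the delays are arbitrary.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (rate limit, 5xx)."""


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TransientError with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter spreads out retries coming from many clients
            sleep(delay * random.uniform(0.5, 1.0))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so the helper can be exercised without real waiting; in ordinary use the SDK's own `max_retries` setting is usually enough.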

Chat Completions API

The Chat Completions API is the core interface for all text generation. It accepts a list of messages (system, user, assistant) and returns a completion. You control temperature for creativity, max_tokens for output length, and can request JSON output with response_format.

from openai import OpenAI

client = OpenAI()

# Single-turn with system prompt
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a senior Python developer. Write clean, idiomatic code with type hints."},
        {"role": "user", "content": "Write a function to paginate a SQLAlchemy query."}
    ],
    temperature=0.2,      # Lower = more deterministic
    max_tokens=1024,
)
print(response.choices[0].message.content)

# Force JSON output (structured extraction)
import json

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract data as JSON. Return {name, email, company} from the text."},
        {"role": "user", "content": "Hi, I'm Jane Doe from Acme Corp, reach me at jane@acme.com"}
    ],
    response_format={"type": "json_object"},
    temperature=0,
)
data = json.loads(response.choices[0].message.content)
print(data)  # {'name': 'Jane Doe', 'email': 'jane@acme.com', 'company': 'Acme Corp'}

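# Streaming for real-time output
print("Streaming: ", end="", flush=True)
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain the CAP theorem in 200 words."}],
    max_tokens=300,
    stream=True,          # Yield chunks as tokens are generated
)
for chunk in stream:
    # Some chunks carry no choices or no text (e.g. the final chunk)
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()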
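
JSON mode constrains the model to emit JSON, but it does not enforce a particular set of keys, and a reply cut off by max_tokens can be incomplete. A small validation step before using the data is therefore a reasonable precaution. The sketch below assumes the three keys requested in the extraction example; `parse_contact` and `REQUIRED_KEYS` are hypothetical names, and the reply string is hand-written rather than real model output.

```python
import json

REQUIRED_KEYS = ("name", "email", "company")  # Keys requested in the system prompt


def parse_contact(raw: str) -> dict:
    """Parse a JSON-mode reply and check that the expected keys are present."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"Missing keys in model reply: {missing}")
    # Keep only the expected keys and drop anything extra
    return {key: data[key] for key in REQUIRED_KEYS}


# Hand-written sample reply, for illustration only
reply = '{"name": "Jane Doe", "email": "jane@acme.com", "company": "Acme Corp", "extra": 1}'
print(parse_contact(reply))
```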

Function Calling and Tool Use

Function calling (also called tool use) lets the model request execution of your application's functions to retrieve real-time data or perform actions. The model returns a structured JSON call instead of plain text — you execute the function and return results in the next turn. This is the foundation for building reliable tool-using agents.

import json
from openai import OpenAI

client = OpenAI()

# Define tools available to the model
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name, e.g. 'London'"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "default": "celsius"}
                },
                "required": ["city"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search the product database for items matching a query",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "limit": {"type": "integer", "default": 10}
                },
                "required": ["query"]
            }
        }
    }
]

def get_weather(city: str, unit: str = "celsius") -> dict:
    """Mock weather function — replace with real API call."""
    return {"city": city, "temperature": 22, "unit": unit, "condition": "sunny"}

def search_database(query: str, limit: int = 10) -> list:
    """Mock database search — replace with real DB query."""
    return [{"id": 1, "name": f"Product matching '{query}'"} for _ in range(min(limit, 3))]

FUNCTION_MAP = {"get_weather": get_weather, "search_database": search_database}

def run_agent(user_message: str) -> str:
    """Run a single-step tool-using agent."""
    messages = [{"role": "user", "content": user_message}]

    # First call: model may request a tool
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )

    msg = response.choices[0].message
    messages.append(msg)

    # If the model called a tool, execute it and continue
    if msg.tool_calls:
        for tool_call in msg.tool_calls:
            fn_name = tool_call.function.name
            fn_args = json.loads(tool_call.function.arguments)
            result = FUNCTION_MAP[fn_name](**fn_args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result)
            })

        # Second call: model sees tool result and generates final answer
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
        )
        return response.choices[0].message.content

    return msg.content

print(run_agent("What's the weather in Tokyo right now?"))
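
The dispatch step in the agent above trusts the model to name a known function and to send well-formed arguments. One possible hardening, sketched below, is to catch those failures and return them to the model as the tool result so it can recover. `execute_tool_call` and `TOOLS` are hypothetical names, and the `SimpleNamespace` object only imitates the `id` and `function` attributes of a tool call used in the example above.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Mock tool, mirroring the example above."""
    return {"city": city, "temperature": 22, "unit": unit, "condition": "sunny"}


TOOLS: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def execute_tool_call(tool_call: Any) -> dict:
    """Run one tool call and return a role="tool" message, even on failure."""
    name = tool_call.function.name
    try:
        if name not in TOOLS:
            raise ValueError(f"Unknown tool: {name}")
        args = json.loads(tool_call.function.arguments)
        result = TOOLS[name](**args)
    except (TypeError, ValueError) as exc:
        # JSONDecodeError is a ValueError; TypeError covers unexpected arguments
        result = {"error": str(exc)}
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    }


# Imitation of a tool call object, for illustration only
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Tokyo"}'),
)
print(execute_tool_call(fake_call))
```

Returning the error as tool content, rather than raising, keeps the conversation valid: every tool call still receives a matching tool message.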

Vision and Multimodal Input

GPT-4o is natively multimodal — you can pass images alongside text in the same message. Images can be provided as base64-encoded data or as public URLs. GPT-4o can describe scenes, extract text (OCR), analyze charts and diagrams, compare images, and answer visual questions. This unlocks document intelligence, screenshot debugging, and visual QA workflows.

import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Analyze a local image (base64)
image_bytes = Path("screenshot.png").read_bytes()
image_b64 = base64.b64encode(image_bytes).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{image_b64}",
                    "detail": "high"   # "low" for faster/cheaper, "high" for detailed analysis
                }
            },
            {"type": "text", "text": "What does this screenshot show? List any errors or issues."}
        ]
    }],
    max_tokens=512,
)
print(response.choices[0].message.content)

# Analyze a public URL image
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "Describe the trends in this chart. What month had peak values?"}
        ]
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
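Note: Image tokens count toward your input token usage. For GPT-4o, an image at "low" detail costs a flat 85 tokens regardless of its size. At "high" detail the image is scaled and split into 512×512 tiles, and it costs 170 tokens per tile plus a base of 85 tokens, so a 1024×1024 image costs 765 tokens. Use "low" detail for thumbnails and "high" for documents requiring precise text extraction.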
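
The token cost of an image can be estimated before sending it. The sketch below follows the published GPT-4o tiling rules as I understand them: fit the image within 2048×2048, scale the shortest side down to 768 pixels, then charge 170 tokens per 512×512 tile plus 85 base tokens. The function name `estimate_image_tokens` is hypothetical, and the choice to never upscale small images is an assumption; other models use different constants.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o input tokens for one image (assumed tiling rules)."""
    if detail == "low":
        return 85  # Flat cost, independent of image size

    # Step 1: shrink to fit inside a 2048x2048 square, keeping aspect ratio
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: shrink so the shortest side is at most 768px (assumed: no upscaling)
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count the 512x512 tiles needed to cover the image
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 1024))         # 765
print(estimate_image_tokens(2048, 4096))         # 1105
print(estimate_image_tokens(4000, 3000, "low"))  # 85
```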

Assistants API with Threads

The Assistants API manages conversation state server-side using Threads, eliminating the need to manually maintain message history. An Assistant is a configured agent with tools and instructions; a Thread holds conversation history; a Run executes the assistant against a thread. This is ideal for chatbots, help desks, and applications requiring persistent multi-turn conversations.

import time
from openai import OpenAI

client = OpenAI()

# Create an assistant (do this once; store the ID)
assistant = client.beta.assistants.create(
    name="Python Tutor",
    instructions="""You are an expert Python tutor. Explain concepts clearly with code examples.
    When asked about code, always provide runnable examples with comments.""",
    model="gpt-4o",
    tools=[{"type": "code_interpreter"}],  # Can run Python code
)
print(f"Assistant ID: {assistant.id}")  # Store this: asst_xxxx

# Create a thread (one per user conversation)
thread = client.beta.threads.create()
print(f"Thread ID: {thread.id}")  # Store per-session

def chat(thread_id: str, assistant_id: str, user_message: str) -> str:
    """Send a message and get a response."""
    # Add message to thread
    client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content=user_message
    )

    # Start a run
    run = client.beta.threads.runs.create(
        thread_id=thread_id,
        assistant_id=assistant_id,
    )

    # Poll until complete
    while run.status in ("queued", "in_progress"):
        time.sleep(0.5)
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run.id)

    if run.status != "completed":
        raise RuntimeError(f"Run failed with status: {run.status}")

    # Get the latest assistant message
    messages = client.beta.threads.messages.list(thread_id=thread_id, order="desc", limit=1)
    return messages.data[0].content[0].text.value

# Multi-turn conversation — thread maintains history automatically
reply1 = chat(thread.id, assistant.id, "What is a Python decorator?")
print(reply1)

reply2 = chat(thread.id, assistant.id, "Show me a memoization decorator example.")
print(reply2)
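
The polling loop in `chat` runs until the status changes, which can hang if a run stalls. A possible refinement is a generic polling helper with a timeout and a gradually increasing interval. This is a sketch under stated assumptions: `wait_for_status` and its parameters are hypothetical, and the status source is injected as a callable so the helper does not depend on any particular API.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    fetch_status: Callable[[], str],
    pending: Iterable[str] = ("queued", "in_progress"),
    timeout: float = 120.0,
    interval: float = 0.5,
    max_interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll fetch_status until it leaves the pending set or the timeout expires."""
    pending_set = set(pending)
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status not in pending_set:
            return status  # Terminal or action-required status
        if clock() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout} seconds")
        sleep(interval)
        interval = min(max_interval, interval * 1.5)  # Back off gradually


# Simulated status sequence, for illustration only
statuses = iter(["queued", "in_progress", "completed"])
print(wait_for_status(lambda: next(statuses), sleep=lambda s: None))
```

Inside `chat`, `fetch_status` could wrap the same `client.beta.threads.runs.retrieve(...)` call used above and return its `status` attribute.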

Batch API for Bulk Processing

The OpenAI Batch API processes large volumes of requests asynchronously (within 24 hours) at a 50% cost discount. Submit a JSONL file with up to 50,000 requests, poll for completion, and download results. This is ideal for dataset annotation, document classification, bulk content generation, and evaluation pipelines.

import json
import time
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# 1. Prepare batch requests as JSONL
reviews = [
    "This product is fantastic! Works exactly as described.",
    "Complete waste of money. Broke after one day.",
    "It's okay I guess, nothing special.",
    "Absolutely love it, bought three already!",
]

batch_file = Path("batch_requests.jsonl")
with batch_file.open("w") as f:
    for i, review in enumerate(reviews):
        request = {
            "custom_id": f"review-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system", "content": "Classify as positive, negative, or neutral. Reply with one word."},
                    {"role": "user", "content": review}
                ],
                "max_tokens": 10,
            }
        }
        f.write(json.dumps(request) + "\n")

# 2. Upload the file
with batch_file.open("rb") as f:
    upload = client.files.create(file=f, purpose="batch")
print(f"File uploaded: {upload.id}")

# 3. Create batch
batch = client.batches.create(
    input_file_id=upload.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}, Status: {batch.status}")

# 4. Poll for completion
while batch.status in ("validating", "in_progress", "finalizing"):
    time.sleep(30)
    batch = client.batches.retrieve(batch.id)
    print(f"Status: {batch.status} | Done: {batch.request_counts.completed}/{batch.request_counts.total}")

# 5. Download and parse results
if batch.status == "completed":
    content = client.files.content(batch.output_file_id).text
    for line in content.strip().split("\n"):
        result = json.loads(line)
        sentiment = result["response"]["body"]["choices"][0]["message"]["content"]
        print(f"{result['custom_id']}: {sentiment}")
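
Lines in the batch output file are not guaranteed to appear in the same order as the input, and individual requests can fail, so it is safer to key results by `custom_id` and separate failures. The sketch below assumes the record shape read by the code above (`custom_id`, `response.body`, plus a `status_code` and an optional `error` field); `parse_batch_output` is a hypothetical helper and the sample lines are hand-written.

```python
import json


def parse_batch_output(jsonl_text: str) -> tuple[dict[str, str], dict[str, str]]:
    """Split batch output lines into successes and failures, keyed by custom_id."""
    results: dict[str, str] = {}
    failures: dict[str, str] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # Skip blank lines
        record = json.loads(line)
        custom_id = record["custom_id"]
        response = record.get("response")
        if record.get("error") or not response or response.get("status_code") != 200:
            # Keep the raw error details for later inspection or retry
            failures[custom_id] = json.dumps(record.get("error") or response)
            continue
        results[custom_id] = response["body"]["choices"][0]["message"]["content"]
    return results, failures


# Hand-written sample output (deliberately out of order), for illustration only
sample = "\n".join([
    json.dumps({"custom_id": "review-1", "error": None, "response": {
        "status_code": 200,
        "body": {"choices": [{"message": {"content": "negative"}}]}}}),
    json.dumps({"custom_id": "review-0", "error": None, "response": {
        "status_code": 200,
        "body": {"choices": [{"message": {"content": "positive"}}]}}}),
    json.dumps({"custom_id": "review-2", "response": None,
                "error": {"code": "example_error", "message": "illustrative failure"}}),
])
results, failures = parse_batch_output(sample)
print(results)
print(list(failures))
```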

Embeddings and Semantic Search

OpenAI embeddings convert text into high-dimensional vectors that capture semantic meaning. Similar texts have vectors with high cosine similarity, enabling semantic search, clustering, anomaly detection, and recommendation systems. The text-embedding-3-small model offers excellent quality at very low cost; text-embedding-3-large gives higher accuracy for demanding use cases.

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    """Get embeddings for a list of texts."""
    response = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in response.data])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Build a simple semantic search index
documents = [
    "Python asyncio enables concurrent I/O without threads.",
    "FastAPI is a modern web framework built on Pydantic and Starlette.",
    "Docker containers package applications with their dependencies.",
    "PostgreSQL is a powerful open-source relational database.",
    "Machine learning models are trained on labeled datasets.",
]

# Embed all documents (in production, cache these in a vector DB)
doc_embeddings = embed(documents)

def semantic_search(query: str, top_k: int = 3) -> list[tuple[float, str]]:
    query_embedding = embed([query])[0]
    scores = [cosine_similarity(query_embedding, doc_emb) for doc_emb in doc_embeddings]
    ranked = sorted(zip(scores, documents), reverse=True)
    return ranked[:top_k]

results = semantic_search("How does Python handle concurrent operations?")
for score, doc in results:
    print(f"{score:.3f} | {doc}")
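
To see what cosine similarity measures without calling the API, the ranking step can be reproduced on tiny hand-made vectors. The three-dimensional vectors below are invented for illustration and are not real model embeddings; the function mirrors the formula used above in plain Python.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional vectors, NOT real model embeddings
toy_index = {
    "asyncio": [0.9, 0.1, 0.0],
    "docker": [0.0, 0.2, 0.9],
    "threads": [0.8, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]

# Rank entries by similarity to the query, highest first
ranked = sorted(
    toy_index.items(),
    key=lambda item: cosine_similarity(query, item[1]),
    reverse=True,
)
for name, vector in ranked:
    print(f"{cosine_similarity(query, vector):.3f} | {name}")
```

Vectors pointing in nearly the same direction as the query score close to 1, while orthogonal vectors score 0, which is why "asyncio" and "threads" rank above "docker" here.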

Cost Optimization Strategies

OpenAI API costs can grow quickly at scale. These strategies reduce spend without sacrificing quality. First, route by task complexity: use gpt-4o-mini for classification, extraction, and simple Q&A (roughly 15× cheaper than GPT-4o). Reserve GPT-4o for tasks requiring deep reasoning, code generation, or nuanced understanding.

Second, use the Batch API for any non-real-time workload. The 50% discount applies to the endpoints the Batch API supports, such as /v1/chat/completions and /v1/embeddings. Annotation, evaluation, and bulk generation are natural candidates.
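
The effect of routing and the batch discount on spend is simple arithmetic over token counts. The sketch below is illustrative only: `estimate_cost` is a hypothetical helper, and the per-million-token prices are placeholders rather than actual OpenAI prices, which should be taken from the current pricing page.

```python
def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
    batch: bool = False,
) -> float:
    """Estimate the cost in dollars of a workload from its token usage."""
    cost = (
        prompt_tokens * input_price_per_million
        + completion_tokens * output_price_per_million
    ) / 1_000_000
    return cost * 0.5 if batch else cost  # Batch API: 50% discount


# Placeholder prices for illustration only (dollars per million tokens)
PLACEHOLDER_IN, PLACEHOLDER_OUT = 1.00, 4.00

sync_cost = estimate_cost(200_000, 50_000, PLACEHOLDER_IN, PLACEHOLDER_OUT)
batch_cost = estimate_cost(200_000, 50_000, PLACEHOLDER_IN, PLACEHOLDER_OUT, batch=True)
print(f"sync: ${sync_cost:.2f}, batch: ${batch_cost:.2f}")
```

The token counts can come from `response.usage` on real responses, which makes it possible to compare models on a sample of traffic before committing to a routing rule.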

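Third, cache aggressively. OpenAI automatically discounts repeated prompt prefixes on sufficiently long prompts (Prompt Caching), so keep static content such as system instructions at the start of the prompt. You should also cache at the application layer, for example with Redis, for frequent identical requests.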

import hashlib
import json
import redis
from openai import OpenAI

client = OpenAI()
cache = redis.Redis(decode_responses=True)

def smart_completion(
    system: str,
    user: str,
    complexity: str = "low",   # "low" → mini, "high" → gpt-4o
    ttl: int = 3600,
) -> str:
    """Route to cheapest model and cache results."""
    model = "gpt-4o-mini" if complexity == "low" else "gpt-4o"
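    # Serialize the parts as JSON so "a:b" + "c" and "a" + "b:c" cannot collide
    key_material = json.dumps([model, system, user])
    cache_key = hashlib.sha256(key_material.encode()).hexdigest()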

    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user}
        ],
        max_tokens=512,
        temperature=0 if complexity == "low" else 0.3,
    )
    result = response.choices[0].message.content
    cache.setex(cache_key, ttl, json.dumps(result))
    return result

# Simple classification → mini model
sentiment = smart_completion(
    "Classify as positive/negative/neutral. One word only.",
    "The delivery was super fast and the product is great!",
    complexity="low"
)
print(sentiment)  # positive
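Tip: Set max_tokens deliberately. You are billed only for the tokens actually generated, so the limit does not change the price of a short answer, but it caps the worst-case cost and latency of a runaway response. If a task needs about 100 tokens of output, a limit near that is safer than a default of 4096.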