Text Embeddings: Semantic Search and Similarity Guide

Text embeddings convert words, sentences, and documents into dense numerical vectors that capture semantic meaning. Unlike keyword search, which matches exact strings, embedding-based semantic search matches on meaning: "car" and "automobile" map to nearby vectors, so a search for "vehicle purchase" finds documents about "buying cars" without any keyword overlap. In 2026, embeddings are the foundational technology behind RAG systems, recommendation engines, duplicate detection, and intelligent document search.

This guide covers generating embeddings with OpenAI and open-source models, computing similarity, building semantic search indexes, storing vectors in databases, and applying embeddings to clustering and anomaly detection.

Generating Embeddings with OpenAI

OpenAI's text-embedding-3-small produces 1536-dimensional vectors at very low cost and is the right default for most applications. text-embedding-3-large produces 3072-dimensional vectors with higher accuracy for demanding retrieval tasks. Both models support dimensionality reduction via the dimensions parameter, trading a small accuracy loss for faster search and lower storage.

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    """Embed a batch of texts. Returns (N, D) numpy array."""
    # Clean inputs — remove newlines which can degrade quality
    texts = [t.replace("\n", " ").strip() for t in texts]
    response = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in response.data])

def embed_single(text: str, model: str = "text-embedding-3-small") -> np.ndarray:
    """Embed a single text."""
    return embed_texts([text], model)[0]

# Single embedding
vec = embed_single("Python asyncio enables concurrent I/O")
print(f"Dimensions: {len(vec)}")  # 1536

# Batch embedding (more efficient — one API call for many texts)
documents = [
    "Python asyncio enables concurrent I/O without threads.",
    "FastAPI is a modern web framework built on Starlette.",
    "Docker containers package apps with their dependencies.",
    "PostgreSQL is a powerful open-source relational database.",
    "Machine learning models learn patterns from training data.",
]
doc_embeddings = embed_texts(documents)
print(f"Shape: {doc_embeddings.shape}")  # (5, 1536)

# Reduced dimensions (cheaper storage, slightly less accurate)
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=documents,
    dimensions=512   # Reduce from 1536 to 512
)
small_embeddings = np.array([item.embedding for item in response.data])
print(f"Reduced shape: {small_embeddings.shape}")  # (5, 512)

Cost tip: text-embedding-3-small is roughly 6× cheaper per token than text-embedding-3-large (about $0.02 versus $0.13 per million tokens at OpenAI's published list prices; check current pricing) and performs well for most search tasks. Only upgrade to large if you're indexing highly technical or specialized content where small misses semantically similar results.
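
If full-size vectors are already stored, re-embedding just to shrink them may not be necessary. OpenAI's embeddings guide describes a manual alternative for the text-embedding-3 models: keep the leading components and re-normalize to unit length. The sketch below shows only that arithmetic on a toy vector; the helper name `truncate_and_normalize` is hypothetical, and whether the result matches the `dimensions` parameter exactly should be treated as an assumption to verify on your own data.

```python
import math

def truncate_and_normalize(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

# Toy unit-length vector standing in for a real embedding (assumption: illustrative only)
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_and_normalize(full, 2)
short_norm = math.sqrt(sum(x * x for x in short))
print(len(short), round(short_norm, 6))
```

Without the re-normalization step the truncated vector would no longer have unit length, so dot product and cosine similarity would stop agreeing.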

Cosine Similarity and Distance Metrics

Cosine similarity measures the cosine of the angle between two vectors, ranging from -1 (opposite direction) through 0 (orthogonal) to 1 (same direction). It's the standard metric for embedding similarity because it ignores vector magnitude and compares direction only, so two texts on the same topic score as similar even if their vectors differ in length. Dot product is equivalent to cosine similarity when vectors are normalized to unit length, as OpenAI embeddings are.

import numpy as np
from openai import OpenAI

client = OpenAI()

# embed_texts() and embed_single() are the helpers defined in the previous section

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_similarity_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Compute similarity between all pairs in A and B. Returns (|A|, |B|) matrix."""
    # Normalize
    A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_norm = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_norm @ B_norm.T

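# Semantic similarity examples. Absolute scores depend on the model,
# so compare the pairs against each other rather than against fixed values.
pairs = [
    ("car", "automobile"),           # Near-synonyms: expect the highest score of the "car" pairs
    ("car", "vehicle"),              # Closely related: expect a high score
    ("car", "banana"),               # Unrelated: expect a clearly lower score
    ("Python programming", "coding in Python"),  # Near-identical meaning
]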

for text_a, text_b in pairs:
    vecs = embed_texts([text_a, text_b])
    sim = cosine_similarity(vecs[0], vecs[1])
    print(f"{text_a!r} ↔ {text_b!r}: {sim:.3f}")

# Find most similar document to a query
def semantic_search(query: str, documents: list[str], doc_embeddings: np.ndarray, top_k: int = 3):
    query_vec = embed_single(query)
    similarities = [cosine_similarity(query_vec, doc_emb) for doc_emb in doc_embeddings]
    ranked = sorted(zip(similarities, documents), reverse=True)
    return ranked[:top_k]

docs = [
    "Python asyncio enables concurrent I/O without threads.",
    "FastAPI is a modern web framework built on Starlette.",
    "Docker containers package apps with their dependencies.",
    "PostgreSQL is a powerful open-source relational database.",
]
doc_vecs = embed_texts(docs)
results = semantic_search("How does Python handle concurrent operations?", docs, doc_vecs)
for score, doc in results:
    print(f"{score:.3f} | {doc}")
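
The two properties claimed at the start of this section (magnitude invariance, and dot product matching cosine for unit vectors) can be checked without any API call. This is a minimal sketch on hand-picked toy vectors; the helper names are illustrative and not part of any library.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def norm(a: list[float]) -> float:
    return math.sqrt(dot(a, a))

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (norm(a) * norm(b))

def unit(a: list[float]) -> list[float]:
    """Rescale a vector to length 1."""
    n = norm(a)
    return [x / n for x in a]

a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 0.5]
scaled_a = [10 * x for x in a]          # same direction, 10x the magnitude

cos_ab = cosine(a, b)
cos_scaled = cosine(scaled_a, b)        # unchanged by scaling
opposite = cosine(a, [-x for x in a])   # vectors pointing in opposite directions
raw_dot = dot(a, b)                     # depends on magnitude
unit_dot = dot(unit(a), unit(b))        # equals cosine once both are unit length

print(f"cosine={cos_ab:.4f} scaled={cos_scaled:.4f} unit_dot={unit_dot:.4f} raw_dot={raw_dot}")
```

The raw dot product differs from the cosine value here because `a` and `b` are not unit length; after normalization the two agree.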

Building Semantic Search

A complete semantic search system indexes documents as embeddings, then for each query computes similarity against all indexed embeddings and returns the top-K matches. For small corpora (under 100K documents), an in-memory numpy index is sufficient. For larger datasets, FAISS or a dedicated vector database provides approximate nearest neighbor search at scale.

import numpy as np
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()

class SemanticSearchIndex:
    """In-memory semantic search with numpy. Good for up to ~100K documents."""

    def __init__(self, model: str = "text-embedding-3-small"):
        self.model = model
        self.documents: list[dict] = []
        self.embeddings: np.ndarray | None = None

    def add_documents(self, docs: list[dict], text_field: str = "content"):
        """Add documents to the index. Each doc is a dict with at least a text field."""
        texts = [doc[text_field] for doc in docs]
        batch_size = 100  # OpenAI allows up to 2048 inputs per call
        all_embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            response = client.embeddings.create(model=self.model, input=batch)
            all_embeddings.extend([item.embedding for item in response.data])
        new_embeddings = np.array(all_embeddings)
        self.documents.extend(docs)
        self.embeddings = new_embeddings if self.embeddings is None else np.vstack([self.embeddings, new_embeddings])

    def search(self, query: str, top_k: int = 5, threshold: float = 0.0) -> list[dict]:
        """Search for documents similar to the query."""
        if self.embeddings is None:
            return []  # Nothing indexed yet
        q_vec = np.array(client.embeddings.create(model=self.model, input=[query]).data[0].embedding)
        # Cosine similarity; dividing by the norms keeps this correct even for non-unit vectors
        norms = np.linalg.norm(self.embeddings, axis=1)
        q_norm = np.linalg.norm(q_vec)
        similarities = (self.embeddings @ q_vec) / (norms * q_norm)
        top_indices = np.argsort(similarities)[::-1][:top_k]
        results = []
        for idx in top_indices:
            score = float(similarities[idx])
            if score >= threshold:
                results.append({**self.documents[idx], "similarity": round(score, 4)})
        return results

    def save(self, path: str):
        np.save(f"{path}.npy", self.embeddings)
        Path(f"{path}.json").write_text(json.dumps(self.documents))

    def load(self, path: str):
        self.embeddings = np.load(f"{path}.npy")
        self.documents = json.loads(Path(f"{path}.json").read_text())

# Usage
index = SemanticSearchIndex()
index.add_documents([
    {"id": 1, "title": "Asyncio Guide", "content": "Python asyncio enables concurrent I/O without OS threads."},
    {"id": 2, "title": "FastAPI Tutorial", "content": "FastAPI builds REST APIs with automatic OpenAPI docs and type validation."},
    {"id": 3, "title": "Docker Guide", "content": "Docker containers package applications with their runtime dependencies."},
])
results = index.search("concurrent programming in Python", top_k=2)
for r in results:
    print(f"{r['similarity']:.3f} | {r['title']}")
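
The index above sorts every similarity score to pick the top K. As the corpus grows, one possible refinement is to select the K best with `np.argpartition` and sort only those. The sketch below uses random scores in place of real similarities; `top_k_indices` is a hypothetical helper, not part of the class above.

```python
import numpy as np

def top_k_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest scores, ordered best-first."""
    k = min(k, scores.shape[0])
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition moves the k largest scores into the last k slots, in no particular order
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates by descending score
    return candidates[np.argsort(scores[candidates])[::-1]]

rng = np.random.default_rng(0)
scores = rng.random(10_000)                 # stand-in for a similarity vector
fast = top_k_indices(scores, 5)
slow = np.argsort(scores)[::-1][:5]         # full sort, for comparison
print(np.array_equal(fast, slow))
```

Both approaches return the same indices when scores are distinct; the partition-based version avoids sorting the entries that will be discarded anyway.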

Vector Databases: Chroma and Pinecone

Vector databases manage large embedding collections with persistent storage, metadata filtering, and fast approximate nearest neighbor (ANN) search. ChromaDB is the easiest local option for development and moderate production scale. Pinecone is a fully managed cloud vector database for large-scale production with automatic scaling and 99.9% uptime SLA.

# ChromaDB — pip install chromadb
import chromadb
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    return client.embeddings.create(model="text-embedding-3-small", input=[text]).data[0].embedding

# Persistent local ChromaDB
chroma = chromadb.PersistentClient(path="./chroma_db")
collection = chroma.get_or_create_collection(
    name="articles",
    metadata={"hnsw:space": "cosine"}  # Use cosine distance
)

# Add documents with embeddings + metadata
documents = [
    {"id": "doc1", "text": "Python asyncio concurrent I/O", "category": "python"},
    {"id": "doc2", "text": "FastAPI REST API framework", "category": "python"},
    {"id": "doc3", "text": "Kubernetes container orchestration", "category": "devops"},
]
collection.add(
    ids=[d["id"] for d in documents],
    embeddings=[get_embedding(d["text"]) for d in documents],
    documents=[d["text"] for d in documents],
    metadatas=[{"category": d["category"]} for d in documents],
)

# Search with optional metadata filter
results = collection.query(
    query_embeddings=[get_embedding("Python web framework")],
    n_results=2,
    where={"category": "python"},  # Optional filter
)
for doc, distance in zip(results["documents"][0], results["distances"][0]):
    print(f"{1 - distance:.3f} | {doc}")  # Convert distance to similarity

Embeddings in RAG Pipelines

Retrieval-Augmented Generation (RAG) uses embeddings to find relevant context from a knowledge base, then feeds that context to an LLM for answer generation. The quality of the embedding model and chunking strategy directly determine RAG accuracy. Chunk size matters: too small and chunks lose context; too large and you include irrelevant text that dilutes the retrieved passage.

from openai import OpenAI
import numpy as np

client = OpenAI()

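def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of up to chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # This window reached the end; skip a redundant tail chunk
    return chunks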

def embed_batch(texts: list[str]) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def rag_answer(query: str, knowledge_base: list[str], kb_embeddings: np.ndarray, top_k: int = 3) -> str:
    """Answer a question using RAG: retrieve relevant chunks, then generate."""
    # Retrieve
    q_vec = embed_batch([query])[0]
    norms = np.linalg.norm(kb_embeddings, axis=1) * np.linalg.norm(q_vec)
    scores = kb_embeddings @ q_vec / norms
    top_indices = np.argsort(scores)[::-1][:top_k]
    context = "\n\n".join(knowledge_base[i] for i in top_indices)

    # Generate
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based only on the provided context. If the answer isn't in the context, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
        ],
        max_tokens=512,
        temperature=0,
    )
    return response.choices[0].message.content

Local Embeddings with Sentence Transformers

For applications requiring data privacy, offline operation, or very high volumes where API costs become prohibitive, local embedding models via the Sentence Transformers library are an excellent alternative. all-MiniLM-L6-v2 (about 22M parameters, 384-dimensional output) provides a good balance of speed and quality; all-mpnet-base-v2 gives higher quality at moderate size; bge-large-en-v1.5 rivals OpenAI's small model in many benchmarks.

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

# Load model (downloads once, cached locally)
model = SentenceTransformer("all-MiniLM-L6-v2")  # ~22M parameters, fast

documents = [
    "Python asyncio enables concurrent I/O without threads.",
    "FastAPI is a modern web framework built on Starlette.",
    "Docker containers package apps with their dependencies.",
    "PostgreSQL is a powerful open-source relational database.",
]

# Batch embedding — runs on CPU or GPU
doc_embeddings = model.encode(documents, normalize_embeddings=True)
print(f"Shape: {doc_embeddings.shape}")  # (4, 384)

# Semantic search
query = "How does Python handle async operations?"
q_vec = model.encode([query], normalize_embeddings=True)[0]
scores = doc_embeddings @ q_vec  # Dot product == cosine when normalized
top_idx = np.argsort(scores)[::-1]
for idx in top_idx[:2]:
    print(f"{scores[idx]:.3f} | {documents[idx]}")

# Higher quality model for production
model_large = SentenceTransformer("BAAI/bge-large-en-v1.5")  # Rivals OpenAI text-embedding-3-small

Clustering and Topic Discovery

Embeddings enable unsupervised topic discovery through clustering. K-Means groups similar documents into clusters; UMAP reduces dimensions for visualization; DBSCAN finds clusters of varying density and automatically identifies outliers. These techniques are powerful for understanding large document collections, finding duplicate content, and detecting anomalies.

import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

client = OpenAI()

def cluster_documents(documents: list[str], n_clusters: int = 5) -> dict:
    """Cluster documents by semantic similarity."""
    # Embed
    response = client.embeddings.create(model="text-embedding-3-small", input=documents)
    embeddings = np.array([item.embedding for item in response.data])

    # K-Means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    labels = kmeans.fit_predict(embeddings)

    # Group documents by cluster
    clusters = {}
    for doc, label in zip(documents, labels):
        clusters.setdefault(int(label), []).append(doc)

    return clusters

def find_duplicates(documents: list[str], threshold: float = 0.95) -> list[tuple]:
    """Find near-duplicate documents by embedding similarity."""
    response = client.embeddings.create(model="text-embedding-3-small", input=documents)
    embeddings = normalize(np.array([item.embedding for item in response.data]))
    sim_matrix = embeddings @ embeddings.T
    duplicates = []
    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            if sim_matrix[i, j] >= threshold:
                duplicates.append((i, j, float(sim_matrix[i, j])))
    return duplicates
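
`find_duplicates` returns individual pairs, but clean-up usually needs whole groups of near-identical documents. One way to get there, sketched below under the assumption that pairs arrive as `(i, j, score)` tuples like those above, is to merge them with a small union-find. The helper `group_duplicates` is hypothetical and the sample pairs are made up for illustration.

```python
def group_duplicates(n_docs: int, pairs: list[tuple]) -> list[list[int]]:
    """Merge (i, j, score) duplicate pairs into groups of document indices."""
    parent = list(range(n_docs))

    def find(x: int) -> int:
        # Follow parent links to the root, shortening the path along the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, _score in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    groups: dict[int, list[int]] = {}
    for doc in range(n_docs):
        groups.setdefault(find(doc), []).append(doc)
    # Keep only groups that actually contain more than one document
    return [members for members in groups.values() if len(members) > 1]

pairs = [(0, 1, 0.97), (1, 4, 0.96), (2, 3, 0.99)]
groups = group_duplicates(6, pairs)
print(groups)  # [[0, 1, 4], [2, 3]]
```

Note that grouping is transitive: documents 0 and 4 land in the same group through document 1 even if their direct similarity is below the threshold, which may or may not be what a given deduplication job wants.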

Production Best Practices

Building reliable embedding-based systems requires attention to several operational concerns. Always cache embeddings: recomputing them on every request wastes money and adds latency. Normalize embeddings to unit length before storing them so that dot product equals cosine similarity, which lets the index use a plain inner-product search. Monitor for embedding drift if your content distribution changes significantly over time, and re-embed the whole corpus whenever you switch embedding models, since vectors from different models are not comparable.
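
The caching advice can be sketched as a small wrapper keyed by a hash of the model name and the text. Everything here is illustrative: `EmbeddingCache` is a hypothetical class, the store is an in-memory dict (a real deployment would more likely use disk or a key-value store), and `fake_embed` stands in for an API call so the example runs offline.

```python
import hashlib
from typing import Callable

class EmbeddingCache:
    """In-memory cache keyed by a hash of (model, text)."""

    def __init__(self, embed_fn: Callable[[list[str]], list[list[float]]], model: str):
        self.embed_fn = embed_fn      # assumption: any function that embeds a batch of texts
        self.model = model
        self._store: dict[str, list[float]] = {}

    def _key(self, text: str) -> str:
        # Including the model name prevents mixing vectors from different models
        return hashlib.sha256(f"{self.model}\x00{text}".encode("utf-8")).hexdigest()

    def get_many(self, texts: list[str]) -> list[list[float]]:
        keys = [self._key(t) for t in texts]
        # Collect each uncached text once, preserving first-seen order
        missing: dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key not in self._store and key not in missing:
                missing[key] = text
        if missing:
            vectors = self.embed_fn(list(missing.values()))
            for key, vec in zip(missing, vectors):
                self._store[key] = vec
        return [self._store[k] for k in keys]

# Stand-in embedder that records each batch it receives
calls: list[list[str]] = []
def fake_embed(batch: list[str]) -> list[list[float]]:
    calls.append(list(batch))
    return [[float(len(t)), 1.0] for t in batch]

cache = EmbeddingCache(fake_embed, model="text-embedding-3-small")
first = cache.get_many(["alpha", "beta", "alpha"])
second = cache.get_many(["beta", "gamma"])
print(calls)  # [['alpha', 'beta'], ['gamma']]
```

Only texts that have never been seen reach the embedder, and duplicates within a single batch are embedded once.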

Chunking strategy: Use 256–512 token chunks with 10–20% overlap for general document retrieval. Use smaller chunks (128 tokens) for FAQ-style Q&A. Use sentence-level chunks for precise factual retrieval. Always include document title and section heading in each chunk for better context.
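
A rough sketch of the chunking advice above: overlapping windows, each prefixed with the document title and section heading. It counts whitespace-separated words as a stand-in for tokens, which is an assumption; a real pipeline would count tokens with the embedding model's tokenizer. The function name and the header format are illustrative.

```python
def chunk_with_context(title: str, section: str, text: str,
                       chunk_size: int = 300, overlap_ratio: float = 0.15) -> list[str]:
    """Split text into overlapping word windows, each prefixed with title and section."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("chunk_size must be > 0 and overlap_ratio in [0, 1)")
    words = text.split()
    # Advance by the window size minus the overlap, never by less than one word
    step = max(1, chunk_size - int(chunk_size * overlap_ratio))
    header = f"{title} > {section}"
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(f"{header}\n{' '.join(window)}")
        if start + chunk_size >= len(words):
            break  # This window already reaches the end of the text
    return chunks

body = " ".join(f"w{i}" for i in range(25))
chunks = chunk_with_context("Embeddings Guide", "Chunking", body,
                            chunk_size=10, overlap_ratio=0.2)
for c in chunks:
    print(c.replace("\n", " | "))
```

With 25 words, a window of 10, and 20% overlap, the windows start at words 0, 8, and 16, so consecutive chunks share two words and every chunk carries the same header.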

Hybrid search: Pure semantic search struggles with exact keyword matches (product codes, names, IDs). Use hybrid search combining BM25 keyword search and embedding similarity, fused with Reciprocal Rank Fusion (RRF) for best overall recall.
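
Reciprocal Rank Fusion needs only the rank positions from each retriever, not their raw scores, which is why it works across BM25 and cosine similarity despite their different scales. A minimal sketch follows; the document IDs and the two input rankings are made up, and k = 60 is the constant commonly used for RRF rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs; each list is ordered best-first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical outputs of a keyword retriever and an embedding retriever
bm25_ranking = ["doc_sku42", "doc_a", "doc_b"]
embedding_ranking = ["doc_a", "doc_c", "doc_sku42"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print([doc_id for doc_id, _ in fused])  # ['doc_a', 'doc_sku42', 'doc_c', 'doc_b']
```

Documents that appear in both lists accumulate two contributions, so `doc_a` and `doc_sku42` outrank documents found by only one retriever.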

Re-ranking: After retrieving top-K candidates with fast embedding search, apply a cross-encoder re-ranker for more accurate final ordering. The cross-encoder jointly processes query+document pairs, trading speed for precision. Common pipeline: embed → retrieve top-20 → rerank → return top-5.
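
The retrieve-then-rerank pipeline can be sketched with the scorer left pluggable. In the example below `word_overlap_scorer` is a toy stand-in so the code runs offline; it is not a cross-encoder. With Sentence Transformers, a `CrossEncoder` model's `predict` method over the same (query, document) pairs could plausibly fill that role, though the model choice is left open here.

```python
from typing import Callable

def retrieve_then_rerank(
    query: str,
    candidates: list[str],
    score_pairs: Callable[[list[tuple[str, str]]], list[float]],
    top_n: int = 5,
) -> list[tuple[str, float]]:
    """Re-score (query, document) pairs and return the top_n by the new score."""
    if not candidates:
        return []
    scores = score_pairs([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

def word_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy scorer: fraction of query words that also occur in the document."""
    results = []
    for query, doc in pairs:
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        results.append(len(q_words & d_words) / len(q_words))
    return results

# Pretend these came back from the fast embedding search
candidates = [
    "Docker containers package apps",
    "Python asyncio enables concurrent I/O",
    "Concurrent Python code with asyncio tasks",
]
top = retrieve_then_rerank("python asyncio concurrent tasks", candidates,
                           word_overlap_scorer, top_n=2)
for doc, score in top:
    print(f"{score:.2f} | {doc}")
```

Keeping the scorer as a parameter means the first-stage retrieval and the re-ranker can be swapped or tested independently.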

Indexing pipeline: Pre-compute and cache all document embeddings. Store in a vector DB with metadata. Rebuild the index nightly or on document updates. Never compute embeddings at query time for the knowledge base — only compute the query embedding live.