Vector databases are the backbone of most production RAG systems, semantic search engines, and LLM-based recommendation systems. They store high-dimensional embedding vectors and support fast approximate nearest-neighbor (ANN) queries, finding the most semantically similar items in milliseconds across millions of records.
In 2026, the market has consolidated around a handful of strong contenders: Pinecone (managed, serverless), Weaviate (open-source with rich filtering), Chroma (developer-first, embeddable), Qdrant (Rust-based, high performance), and FAISS (an in-memory library for research and local use rather than a full database). This guide compares them with code examples so you can pick the right one for your use case.
A vector database stores embedding vectors alongside metadata. When you query with a new vector (your question, converted to an embedding), the database uses an ANN index — typically HNSW (Hierarchical Navigable Small World graphs) or IVF (Inverted File Index) — to find the nearest vectors without scanning every record.
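IVF avoids a full scan by partitioning the vectors into nlist buckets around centroids; at query time it scans only the nprobe buckets whose centroids are closest to the query. The toy NumPy sketch below shows that idea. To stay short it samples data points as centroids, whereas real IVF indexes (FAISS's IndexIVFFlat, for example) train them with k-means, and the class name ToyIVF is hypothetical.

```python
import numpy as np

class ToyIVF:
    """Toy inverted-file index: bucket vectors by nearest centroid, scan only a few buckets."""

    def __init__(self, vectors: np.ndarray, nlist: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.vectors = vectors
        # Simplification: sample centroids from the data instead of training k-means
        picks = rng.choice(len(vectors), size=nlist, replace=False)
        self.centroids = vectors[picks]
        # Inverted lists: for each centroid, the row ids of the vectors assigned to it
        assignment = self._nearest_centroids(vectors, 1)[:, 0]
        self.lists = [np.flatnonzero(assignment == c) for c in range(nlist)]

    def _nearest_centroids(self, x: np.ndarray, n: int) -> np.ndarray:
        # Squared L2 distance from every row of x to every centroid
        d = ((x[:, None, :] - self.centroids[None, :, :]) ** 2).sum(axis=-1)
        return np.argsort(d, axis=1)[:, :n]

    def search(self, query: np.ndarray, k: int, nprobe: int) -> np.ndarray:
        # Only the nprobe buckets nearest the query are scanned exhaustively
        probe = self._nearest_centroids(query[None, :], nprobe)[0]
        candidates = np.concatenate([self.lists[c] for c in probe])
        d = ((self.vectors[candidates] - query) ** 2).sum(axis=1)
        return candidates[np.argsort(d)[:k]]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal((2_000, 16))
    ivf = ToyIVF(data, nlist=32)
    # Scans roughly 4/32 of the data; row 7 should come back first
    print(ivf.search(data[7], k=5, nprobe=4))
```

Raising nprobe trades speed for recall; with nprobe equal to nlist the search degenerates into an exact full scan.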
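HNSW is the dominant ANN algorithm in 2026. It builds a layered graph in which each node links to its nearest neighbours: the upper layers are sparse, while the bottom layer contains every vector. A search enters at the top layer, moves greedily towards the query, and descends layer by layer for increasingly fine-grained comparisons. With well-tuned parameters (such as M and efSearch) it can reach very high recall (99% is achievable with generous settings) at high QPS and millisecond-level latency on millions of vectors, though exact figures depend on dimensionality, hardware, and tuning.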
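The "layered" part comes from how each inserted node picks its top layer. This sketch follows the level-assignment rule described in the original HNSW paper: a node reaches layer l or higher with probability M^-l, using the normalisation factor mL = 1/ln(M). It only simulates layer sizes and does not build or search a graph; assign_level and level_histogram are hypothetical names.

```python
import math
import random
from collections import Counter

def assign_level(rng: random.Random, m: int) -> int:
    """Draw a node's top HNSW layer so that P(level >= l) = m ** -l."""
    ml = 1.0 / math.log(m)         # level-normalisation factor mL
    u = 1.0 - rng.random()         # uniform in (0, 1], so log(u) is always defined
    return int(-math.log(u) * ml)  # floor, since the value is non-negative

def level_histogram(n: int, m: int = 16, seed: int = 42) -> Counter:
    """Count how many of n nodes land on each layer."""
    rng = random.Random(seed)
    return Counter(assign_level(rng, m) for _ in range(n))

if __name__ == "__main__":
    hist = level_histogram(100_000, m=16)
    for level in sorted(hist):
        print(f"layer {level}: {hist[level]} nodes")
```

Each layer holds roughly 1/M of the nodes of the layer below, which is why the top layers are small enough to cross in a few hops.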
```python
import numpy as np

# Manual cosine similarity: what vector DBs compute internally
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Simulate embedding vectors (normally 1536 or 3072 dims)
query = np.random.randn(1536)
doc1 = np.random.randn(1536)
doc2 = query + np.random.randn(1536) * 0.1  # Similar to query

sim1 = cosine_similarity(query, doc1)
sim2 = cosine_similarity(query, doc2)
print(f"Random doc similarity: {sim1:.4f}")
print(f"Similar doc similarity: {sim2:.4f}")  # Will be much higher
```
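For comparison, here is the exact search an ANN index approximates: a full scan that scores every stored vector. Its output is also the ground truth for recall@k, the metric behind HNSW recall figures. A minimal NumPy sketch; exact_top_k and recall_at_k are illustrative helper names, not a library API.

```python
import numpy as np

def exact_top_k(vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Brute-force cosine top-k: score every vector (the full scan ANN avoids)."""
    # Normalise rows and query so a dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ (query / np.linalg.norm(query))
    # argpartition finds the k best in O(n); only those k are then sorted
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]

def recall_at_k(approx_ids, exact_ids) -> float:
    """Fraction of the true top-k neighbours that an approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    docs = rng.standard_normal((5_000, 384)).astype(np.float32)
    exact = exact_top_k(docs, docs[42], k=10)
    print(exact[:3])  # index 42 comes first: a vector is most similar to itself
    # Score a hypothetical ANN result that missed one true neighbour
    approx = list(exact[:9]) + [-1]
    print(recall_at_k(approx, exact))  # 0.9
```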
Chroma is one of the fastest ways to get a vector store running locally. It runs embedded (in-process, no server) or as a client-server setup, is a popular default for prototyping RAG applications, and integrates closely with LangChain and LlamaIndex. With PersistentClient, data is persisted to disk automatically; the in-memory client keeps nothing between runs.
Best for: Local development, prototyping, small to medium datasets (<1M vectors), when you want zero infrastructure. Not ideal for horizontal scaling or multi-tenant production.
```python
import chromadb
from chromadb.utils import embedding_functions

# Embedded mode: no server required
client = chromadb.PersistentClient(path="./chroma_storage")

# Use OpenAI embeddings automatically
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-3-small"
)
collection = client.get_or_create_collection(
    name="techoral_docs",
    embedding_function=openai_ef,
    metadata={"hnsw:space": "cosine"}
)

# Add documents: Chroma handles embedding automatically
collection.add(
    documents=[
        "RAG grounds LLM responses in retrieved context",
        "Vector databases store high-dimensional embeddings",
        "LangChain provides RAG pipeline abstractions",
    ],
    metadatas=[{"source": "blog"}, {"source": "wiki"}, {"source": "docs"}],
    ids=["doc1", "doc2", "doc3"]
)

# Query
results = collection.query(
    query_texts=["How do I improve LLM accuracy?"],
    n_results=2,
    include=["documents", "distances", "metadatas"]
)
print(results["documents"])  # Most relevant docs
print(results["distances"])  # Lower = more similar
```
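Because the collection above uses "hnsw:space": "cosine", the distances Chroma returns are cosine distances, i.e. 1 minus cosine similarity. If you would rather threshold on similarity, a small helper such as the hypothetical to_similarities below can convert the nested lists that query() returns (one inner list per query text). It assumes cosine space; the conversion does not apply to the l2 or ip spaces.

```python
def to_similarities(results: dict, min_similarity: float = 0.0) -> list[list[tuple[str, float]]]:
    """Convert Chroma cosine distances to similarities, dropping weak matches."""
    converted = []
    for ids, dists in zip(results["ids"], results["distances"]):
        # Cosine space only: similarity = 1 - distance
        pairs = [(doc_id, 1.0 - dist) for doc_id, dist in zip(ids, dists)]
        converted.append([(i, s) for i, s in pairs if s >= min_similarity])
    return converted

if __name__ == "__main__":
    # Shaped like collection.query(...) output for a single query text
    sample = {"ids": [["doc1", "doc3"]], "distances": [[0.21, 0.58]]}
    print(to_similarities(sample, min_similarity=0.5))
```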
Pinecone is one of the most widely used managed vector databases. In 2026, its serverless tier bills by usage (reads, writes, and storage) rather than provisioned capacity, which suits applications with variable traffic. With no infrastructure to manage, automatic replication, and a simple REST/Python API, it offers a fast path to production.
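Best for: Production applications, teams without DevOps capacity, multi-tenant SaaS (use namespaces per tenant), variable-traffic workloads. A free starter tier is available; its limits are expressed in storage and read/write units and change over time, so check Pinecone's pricing page.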
```python
import time

from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-pinecone-key")
openai_client = OpenAI(api_key="your-openai-key")

# Create serverless index (dimension must match the embedding model)
index_name = "techoral-knowledge"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # text-embedding-3-small output size
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
    # Poll until the index reports ready instead of sleeping a fixed time
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)

# Embed and upsert
def embed(text: str) -> list[float]:
    return openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

vectors = [
    ("v1", embed("Pinecone is a managed vector database"), {"source": "docs"}),
    ("v2", embed("Weaviate supports hybrid search"), {"source": "wiki"}),
]
index.upsert(vectors=vectors)

# Query (serverless indexes are eventually consistent, so freshly
# upserted vectors can take a few seconds to appear in results)
query_vector = embed("Which vector DB is managed?")
results = index.query(vector=query_vector, top_k=2, include_metadata=True)
for match in results["matches"]:
    print(f"Score: {match['score']:.4f} | ID: {match['id']}")
```
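Two production habits are worth sketching on top of the example above: sending upserts in batches instead of one large request, and scoping each tenant's data to its own namespace. The helpers chunked and upsert_in_batches are hypothetical; the only Pinecone call they make is index.upsert with its namespace argument. The default batch size of 100 is an assumption, so check Pinecone's documented per-request limits.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")

def chunked(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def upsert_in_batches(index, vectors, batch_size: int = 100,
                      namespace: Optional[str] = None) -> int:
    """Upsert (id, values, metadata) tuples in chunks; return the request count."""
    requests = 0
    for batch in chunked(vectors, batch_size):
        index.upsert(vectors=batch, namespace=namespace)
        requests += 1
    return requests

# Usage with the index above: keep each tenant's vectors in its own namespace
# and pass the same namespace when querying ("tenant-acme" is a made-up name).
# upsert_in_batches(index, vectors, namespace="tenant-acme")
# index.query(vector=query_vector, top_k=2, namespace="tenant-acme")
```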
Weaviate is one of the most feature-rich open-source vector databases. It supports hybrid search (BM25 + vector), multi-tenancy, automatic vectorization through built-in model integrations, and a GraphQL API alongside its client libraries. You can run it locally with Docker or use the managed Weaviate Cloud (formerly Weaviate Cloud Services, WCS).
Best for: Hybrid search, multi-modal search (images + text), workloads that need rich metadata filtering alongside vector search, and teams that want open source with enterprise features.
```python
import os

import weaviate
import weaviate.classes as wvc

# Connect to a local Weaviate instance (Docker). The text2vec-openai
# vectorizer needs an OpenAI key, sent here as a request header.
client = weaviate.connect_to_local(
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]}
)

# Create the collection (fails if a collection named "Article" already exists)
collection = client.collections.create(
    name="Article",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small"
    ),
    properties=[
        wvc.config.Property(name="title", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="content", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="category", data_type=wvc.config.DataType.TEXT),
    ]
)

# Insert objects: Weaviate vectorizes them automatically
collection.data.insert_many([
    {"title": "RAG Guide", "content": "Retrieval augmented generation...", "category": "AI"},
    {"title": "Vector DBs", "content": "HNSW index for fast ANN search...", "category": "AI"},
])

# Hybrid search: semantic + keyword
results = collection.query.hybrid(
    query="vector similarity search",
    alpha=0.7,  # 0 = BM25 only, 1 = vector only
    limit=5
)
for obj in results.objects:
    print(obj.properties["title"])

client.close()
```
The alpha parameter in hybrid search is a key tuning knob: 0 means pure BM25 keyword search and 1 means pure vector search. Start at 0.7 (favouring semantic matches), then tune it against your evaluation metrics.
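To build intuition for what alpha does, here is a simplified sketch of relative score fusion, which (as we read Weaviate's docs) is the default fusion method in recent versions: each result list's scores are min-max normalised to [0, 1], then blended as alpha * vector_score + (1 - alpha) * keyword_score. This is not Weaviate's code; min_max, fuse, and the sample scores are made up for illustration.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one result list's scores to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(vector_scores: dict[str, float], keyword_scores: dict[str, float],
         alpha: float) -> list[tuple[str, float]]:
    """Blend normalised scores; a document missing from one list scores 0 there."""
    v, k = min_max(vector_scores), min_max(keyword_scores)
    fused = {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * k.get(doc, 0.0)
        for doc in v.keys() | k.keys()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    vector = {"rag-guide": 0.91, "vector-dbs": 0.88, "k8s-intro": 0.12}
    bm25 = {"vector-dbs": 7.4, "k8s-intro": 1.3}
    for a in (0.0, 0.7, 1.0):
        print(a, fuse(vector, bm25, a)[0][0])  # top document at each alpha
```

A document that scores well on both signals (here "vector-dbs") can win at intermediate alpha values even when it is not first in either list alone.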
Qdrant is written in Rust and is designed for high throughput with a small memory footprint. It applies payload filters during the ANN search itself rather than post-filtering the results, supports sparse vectors for hybrid search, and offers quantization: scalar (int8) quantization cuts vector memory by about 4× with minimal accuracy loss, and binary quantization can go considerably further.
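Why filter placement matters: if a system takes the top k hits first and filters afterwards, a selective filter can leave fewer than k results, often none. The brute-force NumPy sketch below contrasts the two orders as an illustration of the problem only. According to its documentation, Qdrant does neither literally; it applies the filter while traversing its HNSW graph (and can fall back to a payload-index scan for very selective filters). The function names are hypothetical.

```python
import numpy as np

def post_filter(scores: np.ndarray, labels: np.ndarray, wanted: str, k: int) -> list[int]:
    """Rank everything, keep the top k, then drop non-matching items (may return < k)."""
    top = np.argsort(-scores)[:k]
    return [int(i) for i in top if labels[i] == wanted]

def pre_filter(scores: np.ndarray, labels: np.ndarray, wanted: str, k: int) -> list[int]:
    """Keep only matching items, then rank them and take the top k."""
    candidates = np.flatnonzero(labels == wanted)
    best = candidates[np.argsort(-scores[candidates])][:k]
    return [int(i) for i in best]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    scores = rng.random(1_000)                         # stand-in similarity scores
    labels = np.array(["AI"] * 50 + ["DevOps"] * 950)  # only 5% match the filter
    print(len(post_filter(scores, labels, "AI", 10)))  # usually well under 10
    print(len(pre_filter(scores, labels, "AI", 10)))   # 10
```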
```python
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams
)

client = QdrantClient(host="localhost", port=6333)

# (Re)create the collection; recreate_collection is deprecated in recent clients
if client.collection_exists("articles"):
    client.delete_collection("articles")
client.create_collection(
    collection_name="articles",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Insert points with payload metadata
points = [
    PointStruct(id=1, vector=np.random.randn(1536).tolist(),
                payload={"category": "AI", "year": 2026}),
    PointStruct(id=2, vector=np.random.randn(1536).tolist(),
                payload={"category": "DevOps", "year": 2026}),
]
client.upsert(collection_name="articles", points=points)

# Filtered search: only AI articles (query_points replaces the older search method)
results = client.query_points(
    collection_name="articles",
    query=np.random.randn(1536).tolist(),
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="AI"))]
    ),
    limit=5
).points
for r in results:
    print(f"ID: {r.id}, Score: {r.score:.4f}, Payload: {r.payload}")
```
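The 4× figure for scalar quantization is simple arithmetic: a float32 takes 4 bytes and an int8 takes 1. The NumPy sketch below shows the idea using one global min/max range. Qdrant's own scalar quantization differs in detail (for example, it can clip outliers via a quantile setting), so treat the helper names and the scheme as illustrative.

```python
import numpy as np

def quantize_int8(vectors: np.ndarray) -> tuple[np.ndarray, float, float]:
    """Linearly map float32 values onto the 256 int8 levels using one global range."""
    lo, hi = float(vectors.min()), float(vectors.max())
    scale = (hi - lo) / 255.0 or 1.0  # guard against constant input
    codes = np.clip(np.round((vectors - lo) / scale) - 128, -128, 127).astype(np.int8)
    return codes, lo, scale

def dequantize_int8(codes: np.ndarray, lo: float, scale: float) -> np.ndarray:
    """Approximate reconstruction; per-value error is at most about scale / 2."""
    return (codes.astype(np.float32) + 128) * scale + lo

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vecs = rng.standard_normal((1_000, 384)).astype(np.float32)
    codes, lo, scale = quantize_int8(vecs)
    print(vecs.nbytes // codes.nbytes)  # 4
    approx = dequantize_int8(codes, lo, scale)
    cos = approx[0] @ vecs[0] / (np.linalg.norm(approx[0]) * np.linalg.norm(vecs[0]))
    print(f"cosine(original, reconstructed) = {cos:.5f}")
```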
FAISS (Facebook AI Similarity Search), originally from Facebook AI Research (now Meta), is a widely used similarity-search library. Some tools build on it, though most dedicated vector databases ship their own index implementations. It runs entirely in memory (with optional persistence to disk) and is extremely fast for local use. It has no server, no REST API, and no built-in metadata filtering: just raw vector search. Use it when you need maximum speed on a small-to-medium dataset and are prepared to handle storage, metadata, and serving yourself.
```python
import faiss
import numpy as np

# Build HNSW index: a good accuracy/speed tradeoff
dimension = 1536
index = faiss.IndexHNSWFlat(dimension, 32)  # 32 = M, the graph's connectivity per node
index.hnsw.efSearch = 128  # Higher = more accurate, slower queries

# Add 10,000 random vectors
vectors = np.random.randn(10000, dimension).astype(np.float32)
faiss.normalize_L2(vectors)  # Unit length makes L2 ranking match cosine ranking
index.add(vectors)

# Query
query = np.random.randn(1, dimension).astype(np.float32)
faiss.normalize_L2(query)
distances, indices = index.search(query, k=5)
print(f"Top 5 neighbours: {indices[0]}")
print(f"Distances: {distances[0]}")  # Squared L2 distances: lower = more similar

# Persist to disk and load it back
faiss.write_index(index, "my_index.faiss")
loaded_index = faiss.read_index("my_index.faiss")
```
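Why normalising first works: for unit vectors, ||a - b||^2 = 2 - 2 cos(a, b), so ranking by smallest squared L2 distance is the same as ranking by largest cosine similarity, and the squared L2 distances FAISS returns can be converted back. A quick NumPy check with a hypothetical helper name:

```python
import numpy as np

def squared_l2_to_cosine(d2):
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b), so cos = 1 - d2 / 2."""
    return 1.0 - np.asarray(d2, dtype=np.float64) / 2.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b = rng.standard_normal((2, 1536))
    a /= np.linalg.norm(a)
    b /= np.linalg.norm(b)
    d2 = np.sum((a - b) ** 2)
    print(squared_l2_to_cosine(d2), a @ b)  # the two values agree
```

Applied to the FAISS example, squared_l2_to_cosine(distances[0]) would give cosine similarities for the top 5 hits (up to float32 rounding).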
| Feature | Chroma | Pinecone | Weaviate | Qdrant | FAISS |
|---|---|---|---|---|---|
| Hosting | Local/Self | Managed | Self/Managed | Self/Cloud | In-process |
| Open Source | Yes | No | Yes | Yes | Yes |
| Metadata Filtering | Basic | Good | Excellent | Excellent | None |
| Hybrid Search | No | Yes | Yes | Yes | No |
| Horizontal Scale | Limited | Automatic | Yes | Yes | No |
| Best Use Case | Prototyping | Production SaaS | Complex queries | High QPS | Research/Local |
Choose Chroma when you're prototyping, building demos, or need a local vector store with zero setup. It's the fastest path from idea to working RAG pipeline.
Choose Pinecone when you need production reliability without managing infrastructure. The serverless tier is cost-effective for variable loads, and multi-tenancy via namespaces is a first-class feature.
Choose Weaviate when your queries involve complex metadata filtering, you need hybrid search out of the box, or you're building a multi-modal system that searches across images and text.
Choose Qdrant when raw throughput matters: it performs strongly on filtered-search QPS in published benchmarks (including Qdrant's own), and its quantization support makes it very memory-efficient for large collections.
Choose FAISS when you need maximum speed on a local dataset, you're doing research/ML experimentation, or you're building a custom vector search solution on top of a lower-level library.