Python Performance: Profiling and Optimization (2026)
Published June 6, 2026 • 15 min read
Python is fast enough for most applications — until it isn't. When it isn't, the mistake is usually guessing where the bottleneck is instead of measuring it. This guide covers the profiling tools that tell you exactly where your program spends time and memory, then the optimization techniques — from trivial wins like list comprehensions to 100× speedups with NumPy vectorization and the right concurrency model — so you fix the actual bottleneck instead of the wrong thing.
cProfile + pstats — Where Is the Time Going?
Always profile before optimizing. cProfile is built into Python's standard library:
import cProfile
import pstats
from pstats import SortKey
import io

def profile(func):
    """Decorator that profiles a function and prints the top 20 hotspots."""
    def wrapper(*args, **kwargs):
        pr = cProfile.Profile()
        pr.enable()
        result = func(*args, **kwargs)
        pr.disable()
        s = io.StringIO()
        ps = pstats.Stats(pr, stream=s).sort_stats(SortKey.CUMULATIVE)
        ps.print_stats(20)
        print(s.getvalue())
        return result
    return wrapper

@profile
def slow_function():
    # ... your code here ...
    pass
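Where decorating a function is inconvenient, `cProfile.Profile` can also be used as a context manager (Python 3.8+) to profile an arbitrary block. The following is a minimal sketch; `build_table` and `profile_block` are hypothetical names standing in for your own code.

```python
import cProfile
import io
import pstats
from pstats import SortKey


def build_table(n: int) -> dict[int, int]:
    # Hypothetical workload: stands in for the code you want to measure.
    return {i: i * i for i in range(n)}


def profile_block(n: int = 50_000) -> str:
    """Profile one call to build_table and return the pstats report as text."""
    with cProfile.Profile() as pr:  # enabled on entry, disabled on exit
        build_table(n)
    out = io.StringIO()
    pstats.Stats(pr, stream=out).sort_stats(SortKey.CUMULATIVE).print_stats(10)
    return out.getvalue()


report = profile_block()
print(report)
```

Returning the report as a string rather than printing inside the helper makes it easy to log or write to a file.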
# Profile from the command line
python -m cProfile -s cumtime my_script.py
# Save to a file and inspect interactively
python -m cProfile -o profile_output.prof my_script.py
python -m pstats profile_output.prof
# In pstats interactive shell:
# sort cumtime
# stats 30
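A saved profile can also be inspected from a script instead of the interactive shell. The sketch below writes a profile to a temporary file and reloads it with `pstats.Stats`; the `work` function and the temporary path are placeholders, not part of the original example.

```python
import cProfile
import io
import os
import pstats
import tempfile


def work() -> int:
    # Placeholder workload.
    return sum(i * i for i in range(100_000))


# Write a profile to disk, similar to what `python -m cProfile -o ...` produces.
prof_path = os.path.join(tempfile.mkdtemp(), "profile_output.prof")
profiler = cProfile.Profile()
profiler.runcall(work)
profiler.dump_stats(prof_path)

# Reload it later and print the ten most expensive entries by cumulative time.
out = io.StringIO()
stats = pstats.Stats(prof_path, stream=out)
stats.strip_dirs().sort_stats("cumulative").print_stats(10)
top_entries = out.getvalue()
print(top_entries)
```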
# Programmatic profiling of a whole run: collect stats from startup and dump them at exit
import cProfile, pstats, io, atexit

_profiler = cProfile.Profile()
_profiler.enable()

def _dump_profile():
    _profiler.disable()
    s = io.StringIO()
    pstats.Stats(_profiler, stream=s).sort_stats('cumulative').print_stats(30)
    with open('/tmp/profile_dump.txt', 'w') as f:
        f.write(s.getvalue())

atexit.register(_dump_profile)
For production, prefer pyinstrument (statistical profiler, ~1% overhead) or py-spy (attaches to a running process without restarting it). cProfile's deterministic profiling adds ~30–100% overhead and is only suitable for development.
line_profiler — Line-by-Line Timing
Once cProfile points you to a slow function, line_profiler shows you exactly which lines within it are slow:
pip install line_profiler
# Decorate the function you want to profile
from line_profiler import profile

@profile
def process_records(records: list[dict]) -> list[dict]:
    result = []
    for record in records:  # line 5
        cleaned = {k: v.strip() for k, v in record.items() if isinstance(v, str)}
        if not cleaned.get('email'):  # line 7
            continue
        cleaned['score'] = compute_score(cleaned)  # line 9 ← likely slow
        result.append(cleaned)
    return result
# Run with kernprof
kernprof -l -v my_script.py
# Sample output:
# Line #    Hits       Time   Per Hit   % Time   Line Contents
# =============================================================
#      9   50000    4.2e+06      84.0    94.3%   cleaned['score'] = compute_score(cleaned)
#      6   50000    2.1e+05       4.2     4.7%   cleaned = {k: v.strip() ...}
memory_profiler — Memory Usage Per Line
pip install memory_profiler
from memory_profiler import profile as mprofile

@mprofile
def load_large_dataset(filepath: str) -> list:
    # Line 3: baseline ~50 MB
    with open(filepath) as f:
        data = f.read()                           # +200 MB: whole file in memory
    lines = data.splitlines()                     # +200 MB: another copy
    parsed = [line.split(',') for line in lines]  # +300 MB
    return parsed

# Better approach: streaming avoids the memory spikes
def load_large_dataset_streaming(filepath: str):
    with open(filepath) as f:
        for line in f:  # reads one line at a time, O(1) memory
            yield line.strip().split(',')
python -m memory_profiler my_script.py
# Track memory over time (for servers)
mprof run my_server.py
mprof plot # generates a matplotlib graph
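If installing memory_profiler is not an option, the standard library's `tracemalloc` can confirm the same effect. This sketch compares peak traced memory for an eager load and for the streaming generator on a small synthetic CSV. The file contents and the helper names (`load_eager`, `peak_bytes`) are made up for illustration.

```python
import os
import tempfile
import tracemalloc


def load_eager(filepath: str) -> list[list[str]]:
    with open(filepath) as f:
        return [line.strip().split(",") for line in f.read().splitlines()]


def load_streaming(filepath: str):
    with open(filepath) as f:
        for line in f:
            yield line.strip().split(",")


def peak_bytes(func) -> int:
    """Run func() and return the peak memory traced by tracemalloc."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Synthetic input: 50,000 rows of three columns.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w") as f:
    for i in range(50_000):
        f.write(f"{i},name{i},{i * 2}\n")

eager_peak = peak_bytes(lambda: load_eager(path))
# Consume the generator row by row without keeping the rows.
stream_peak = peak_bytes(lambda: sum(1 for _ in load_streaming(path)))
print(f"eager: {eager_peak / 1e6:.1f} MB | streaming: {stream_peak / 1e6:.1f} MB")
```

The streaming version only stays small if the caller does not collect the rows into a list; wrapping it in `list(...)` brings the peak back.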
List Comprehensions vs Loops — Benchmark Numbers
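List comprehensions are typically 30–50% faster than the equivalent for-loop with append. Both run as Python bytecode, but a comprehension appends with a dedicated LIST_APPEND instruction instead of looking up and calling result.append on every iteration: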
import timeit

data = list(range(1_000_000))

# Pure for-loop
def loop_square():
    result = []
    for x in data:
        result.append(x ** 2)
    return result

# List comprehension
def comprehension_square():
    return [x ** 2 for x in data]

# Generator expression: memory-efficient, but lazy. Nothing is computed
# until it is consumed, so it is not timed below.
def generator_square():
    return (x ** 2 for x in data)

# NumPy (vectorized, see next section)
import numpy as np
arr = np.arange(1_000_000)

def numpy_square():
    return arr ** 2

results = {
    'loop': timeit.timeit(loop_square, number=10),
    'comprehension': timeit.timeit(comprehension_square, number=10),
    'numpy': timeit.timeit(numpy_square, number=10),
}
# Typical results (seconds for 10 runs of 1M elements):
# loop:          ~0.95s
# comprehension: ~0.65s  (1.5× faster than loop)
# numpy:         ~0.008s (120× faster than loop)
print(results)
| Approach | 1M elements (10 runs) | Relative Speed |
|---|---|---|
| for-loop + append | ~0.95s | 1× |
| list comprehension | ~0.65s | ~1.5× |
| map() with lambda | ~0.80s | ~1.2× |
| NumPy vectorized | ~0.008s | ~120× |
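The `map()` row in the table is not covered by the script above. Here is a smaller, self-contained sketch that times the three pure-Python variants and checks that they agree. `map_square` is a name introduced for this example, and absolute numbers will differ by machine and Python version.

```python
import timeit

data = list(range(100_000))


def loop_square():
    result = []
    for x in data:
        result.append(x ** 2)
    return result


def comprehension_square():
    return [x ** 2 for x in data]


def map_square():
    return list(map(lambda x: x ** 2, data))


# Take the best of three repeats to reduce noise from other processes.
timings = {
    fn.__name__: min(timeit.repeat(fn, number=10, repeat=3))
    for fn in (loop_square, comprehension_square, map_square)
}
for name, seconds in timings.items():
    print(f"{name:22s} {seconds:.4f}s")
```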
NumPy Vectorization — 10x to 100x Speedups
NumPy operations run in C over contiguous memory arrays. Replacing Python loops with NumPy operations is the single biggest performance win available to data-heavy Python code:
import numpy as np
import timeit

N = 1_000_000

# Normalize a list of values: (x - mean) / std
data_py = list(range(N))
data_np = np.arange(N, dtype=np.float64)

def normalize_python(data):
    n = len(data)
    mean = sum(data) / n
    std = (sum((x - mean) ** 2 for x in data) / n) ** 0.5
    return [(x - mean) / std for x in data]

def normalize_numpy(arr):
    return (arr - arr.mean()) / arr.std()

# Benchmark
py_time = timeit.timeit(lambda: normalize_python(data_py), number=5)
np_time = timeit.timeit(lambda: normalize_numpy(data_np), number=5)
print(f"Python: {py_time:.3f}s | NumPy: {np_time:.3f}s | Speedup: {py_time/np_time:.0f}×")
# Python: 1.823s | NumPy: 0.018s | Speedup: 101×

# Broadcasting: element-wise ops across arrays of different shapes
prices = np.array([10.0, 20.0, 30.0, 40.0])  # shape (4,)
qty = np.array([[1], [5], [10]])             # shape (3, 1)
totals = prices * qty                        # shape (3, 4), broadcast automatically

# Boolean mask indexing: filter without a Python loop
scores = np.random.rand(1_000_000)
high_scorers = scores[scores > 0.95]
Concurrency Decision Guide
Python has three concurrency models. Picking the wrong one wastes effort:
| Model | Best For | GIL Impact | Overhead |
|---|---|---|---|
| asyncio | I/O-bound: HTTP, DB, file reads (thousands of concurrent tasks) | Runs in one thread, so the GIL is not an issue | Very low |
| ThreadPoolExecutor | I/O-bound tasks that use blocking libraries (no async support) | GIL released during I/O, so threads work | Low |
| ProcessPoolExecutor | CPU-bound: compression, encryption, numerical computation | Each process has its own GIL | High (process spawn + IPC) |
import asyncio
import concurrent.futures
import time

# asyncio: best for high-concurrency I/O
async def fetch_url(url: str, session) -> dict:
    async with session.get(url) as resp:
        return {'url': url, 'status': resp.status, 'size': len(await resp.read())}

async def fetch_all(urls: list[str]) -> list[dict]:
    import aiohttp  # third-party: pip install aiohttp
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_url(url, session) for url in urls]
        return await asyncio.gather(*tasks)

# ThreadPoolExecutor: for blocking I/O in sync code
def read_file(path: str) -> bytes:
    with open(path, 'rb') as f:
        return f.read()

def read_files_parallel(paths: list[str]) -> list[bytes]:
    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        # pool.map returns results in the same order as paths;
        # as_completed() would yield them in completion order instead
        return list(pool.map(read_file, paths))

# ProcessPoolExecutor: for CPU-bound work
import hashlib

def hash_data(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def hash_many_parallel(items: list[bytes]) -> list[str]:
    with concurrent.futures.ProcessPoolExecutor() as pool:
        return list(pool.map(hash_data, items, chunksize=100))
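To see why the model matters for I/O-bound work, the sketch below replaces real network calls with short sleeps and runs the same eight tasks sequentially, in a thread pool, and under asyncio. It uses only the standard library. `DELAY`, `N_TASKS`, and the function names are assumptions chosen for the demo, not measurements from a real service.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

DELAY = 0.05   # simulated I/O wait per task, in seconds
N_TASKS = 8


def blocking_io(i: int) -> int:
    time.sleep(DELAY)  # stands in for a blocking network or disk call
    return i


async def async_io(i: int) -> int:
    await asyncio.sleep(DELAY)  # stands in for an awaitable network call
    return i


def run_sequential() -> list[int]:
    return [blocking_io(i) for i in range(N_TASKS)]


def run_threaded() -> list[int]:
    with ThreadPoolExecutor(max_workers=N_TASKS) as pool:
        return list(pool.map(blocking_io, range(N_TASKS)))


async def run_async() -> list[int]:
    return await asyncio.gather(*(async_io(i) for i in range(N_TASKS)))


def timed(fn):
    """Call fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


seq_result, seq_time = timed(run_sequential)
thr_result, thr_time = timed(run_threaded)
aio_result, aio_time = timed(lambda: asyncio.run(run_async()))
print(f"sequential {seq_time:.2f}s | threads {thr_time:.2f}s | asyncio {aio_time:.2f}s")
```

Both concurrent versions should finish in roughly one `DELAY` rather than eight, because the waits overlap. The same experiment with a CPU-bound function in place of the sleep would show little or no gain from threads, which is the case where `ProcessPoolExecutor` applies.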
lru_cache and cache
from functools import lru_cache, cache
import time

# cache = lru_cache(maxsize=None): unbounded cache (Python 3.9+)
@cache
def fibonacci(n: int) -> int:
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

# Bounded cache: evicts least-recently-used entries
@lru_cache(maxsize=512)
def expensive_db_lookup(user_id: int) -> dict:
    time.sleep(0.1)  # simulate DB query
    return {'id': user_id, 'name': f'User {user_id}'}

# Check cache stats
print(expensive_db_lookup.cache_info())
# CacheInfo(hits=45, misses=10, maxsize=512, currsize=10)

# Clear the cache (e.g., after a write)
expensive_db_lookup.cache_clear()

# Method caching: requires hashable arguments, including self.
# The cache is shared by all instances and keeps a reference to each self,
# so cached instances stay alive until evicted or cache_clear() is called.
class PricingEngine:
    @lru_cache(maxsize=1000)
    def get_price(self, product_id: int, currency: str) -> float:
        # ... expensive computation ...
        return 0.0
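Because `lru_cache` builds its key by hashing the arguments, a call with a list raises `TypeError`. One common workaround is a thin wrapper that converts the list to a tuple before calling the cached function. This is a sketch only; `total_weight` and `_total_weight_cached` are hypothetical names.

```python
from functools import lru_cache


@lru_cache(maxsize=256)
def _total_weight_cached(weights: tuple[float, ...]) -> float:
    # Tuples are hashable, so they can serve as cache keys.
    return sum(w * w for w in weights)


def total_weight(weights: list[float]) -> float:
    """Public entry point: accepts a list and converts it to a tuple key."""
    return _total_weight_cached(tuple(weights))


try:
    _total_weight_cached([1.0, 2.0])  # a list is unhashable
except TypeError as exc:
    print(f"direct call with a list fails: {exc}")

first = total_weight([1.0, 2.0, 3.0])
second = total_weight([1.0, 2.0, 3.0])  # served from the cache
info = _total_weight_cached.cache_info()
print(first, second, info)
```

Building the tuple costs O(n) on every call, so this only pays off when the cached computation is noticeably more expensive than the conversion.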
lru_cache requires all arguments to be hashable (no lists or dicts). If you need to cache calls with unhashable arguments, convert them to tuples first or use a custom cache key.
__slots__ for Memory Reduction
By default, Python stores instance attributes in a per-object __dict__ (a hash map). For classes with millions of instances, this overhead adds up. __slots__ replaces the dict with a fixed-size array:
import sys

# Without __slots__
class PointNoSlots:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

# With __slots__
class Point:
    __slots__ = ('x', 'y', 'z')

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

p_no_slots = PointNoSlots(1.0, 2.0, 3.0)
p_slots = Point(1.0, 2.0, 3.0)

# sys.getsizeof does not follow references, so the first number
# excludes the instance's __dict__
print(sys.getsizeof(p_no_slots))  # ~48 bytes, plus dict overhead (~200 bytes) not counted here
print(sys.getsizeof(p_slots))     # ~56 bytes, no dict

# For 1 million instances (approximate; varies by Python version):
#   Without slots: ~250 MB
#   With slots:    ~56 MB (4.5× less memory)

# Benchmark: attribute access is also ~10% faster with __slots__
import timeit
t1 = timeit.timeit(lambda: p_no_slots.x + p_no_slots.y, number=10_000_000)
t2 = timeit.timeit(lambda: p_slots.x + p_slots.y, number=10_000_000)
print(f"No slots: {t1:.3f}s | Slots: {t2:.3f}s")
Use __slots__ on any class you'll create in large numbers: data records, graph nodes, simulation particles. Also consider dataclasses.dataclass(slots=True) (Python 3.10+), which gives you the memory benefit plus auto-generated __init__, __repr__, and __eq__.
Cython Type Annotations — Basic Speedups
Cython compiles Python code to C extensions. Adding static type declarations to a hot function can yield 10–100× speedups with minimal code changes:
pip install cython
# math_utils.pyx (Cython source file)
# cython: language_level=3

# Untyped version: compiles, but gains little because every
# operation still works on Python objects
def py_sum_of_squares(numbers):
    total = 0
    for n in numbers:
        total += n * n
    return total

# Cython with static types, compiled to C
def cy_sum_of_squares(list numbers):
    cdef double total = 0.0
    cdef double n
    for n in numbers:
        total += n * n
    return total

# Fully typed: fastest (no Python object overhead in the loop)
def cy_sum_of_squares_typed(double[:] arr):
    cdef Py_ssize_t n = arr.shape[0]  # Py_ssize_t is the portable index type
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in range(n):
        total += arr[i] * arr[i]
    return total
# setup.py — build the extension
from setuptools import setup
from Cython.Build import cythonize
import numpy as np
setup(
    ext_modules=cythonize("math_utils.pyx", language_level=3),
    include_dirs=[np.get_include()],
)
python setup.py build_ext --inplace
# Creates math_utils.cpython-312-x86_64-linux-gnu.so
python -c "
import math_utils, timeit, numpy as np
arr = list(range(1_000_000))
arr_np = np.array(arr, dtype=np.float64)
print('Python:', timeit.timeit(lambda: math_utils.py_sum_of_squares(arr), number=10))
print('Cython:', timeit.timeit(lambda: math_utils.cy_sum_of_squares_typed(arr_np), number=10))
"
# Python: 0.82s
# Cython: 0.004s (205× speedup)
Frequently Asked Questions
- Should I use PyPy instead of CPython for performance?
- PyPy's JIT compiler typically delivers 3–10× speedups on pure Python code with no changes to your source. It's a great option for CPU-bound scripts and server applications that don't depend on C extensions incompatible with PyPy. The main drawbacks: PyPy has a 1–3 second warm-up period before the JIT kicks in (bad for short scripts), and some native C extensions need separate PyPy builds. NumPy and Django both work with PyPy but with caveats.
- When should I use multiprocessing instead of threading?
- Use multiprocessing (ProcessPoolExecutor) when your bottleneck is CPU-bound computation: compression, hashing, ML inference, image processing. The GIL prevents threads from running Python bytecode in parallel, so threading doesn't help for CPU-bound work. Use threading (or asyncio) for I/O-bound tasks where threads spend most of their time waiting on network or disk, which releases the GIL.
- How do I profile an async Python application?
- Use pyinstrument with its async mode enabled (async_mode="enabled" on its Profiler). For web applications, py-spy can attach to a running Uvicorn/Gunicorn process and generate a flamegraph. Standard cProfile works in async code but shows distorted timings because coroutines are interleaved, so use statistical profilers instead.
- What is the fastest way to read a large CSV file in Python?
- The main options: (1) pandas.read_csv() with dtype hints (C parser); (2) polars.read_csv() (Rust-based, often 3× faster than pandas); (3) csv.reader() for streaming with zero memory spike; (4) numpy.loadtxt() for numeric-only data. For files over 1 GB, always prefer a streaming approach over loading the whole file into memory.
- How do I find memory leaks in a Python service?
- Run the service under tracemalloc (call tracemalloc.start() at startup), then call tracemalloc.take_snapshot() periodically and compare snapshots to find growing allocations. The objgraph library shows which object types are accumulating. Common causes: unbounded caches (use lru_cache(maxsize=N)), reference cycles that keep objects alive until the cyclic collector runs, and global collections that grow without bounds.