Python Generators and Iterators: Memory-Efficient Pipelines
Generators and iterators are Python's secret weapon for processing large datasets without loading everything into memory. A list of 10 million records can consume hundreds of megabytes; a generator producing the same records keeps only its current state and one record in memory at a time. This guide covers the iterator protocol, the yield keyword, generator expressions, the send(), throw(), and close() methods, itertools, and how to compose generators into memory-efficient data pipelines.
The Iterator Protocol
Python's iterator protocol requires two methods: __iter__ returns the iterator object itself, and __next__ returns the next value or raises StopIteration. Any object implementing both is an iterator. Any object implementing only __iter__ (returning an iterator) is an iterable. The for loop calls iter(obj) to get an iterator, then repeatedly calls next() until StopIteration.
class CountUp:
    """Custom iterator that counts from start to stop."""

    def __init__(self, start, stop):
        self.current = start
        self.stop = stop

    def __iter__(self):
        return self  # the iterator is itself

    def __next__(self):
        if self.current >= self.stop:
            raise StopIteration
        value = self.current
        self.current += 1
        return value

counter = CountUp(1, 5)
for n in counter:
    print(n)  # 1 2 3 4

# Equivalent using built-in next()
counter2 = CountUp(10, 13)
print(next(counter2))  # 10
print(next(counter2))  # 11
print(next(counter2))  # 12
print(next(counter2, "done"))  # "done": the default is returned instead of raising StopIteration
A list is an iterable (it has __iter__) but not an iterator (it has no __next__). iter(my_list) creates a list_iterator. Generators are both iterables and iterators.
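The iterable/iterator distinction can be checked directly in the interpreter. The following is a minimal sketch; manual_for is a hypothetical helper (not part of the original examples) that spells out roughly what a for loop does with iter() and next().

```python
from collections.abc import Iterable, Iterator

def manual_for(iterable):
    """Roughly what `for item in iterable` does; collects the items in a list."""
    collected = []
    iterator = iter(iterable)          # calls iterable.__iter__()
    while True:
        try:
            item = next(iterator)      # calls iterator.__next__()
        except StopIteration:
            break                      # a for loop ends silently at this point
        collected.append(item)
    return collected

my_list = [1, 2, 3]
list_is_iterable = isinstance(my_list, Iterable)    # True: it has __iter__
list_is_iterator = isinstance(my_list, Iterator)    # False: it has no __next__

list_iter = iter(my_list)
iter_type_name = type(list_iter).__name__           # "list_iterator"
iter_returns_self = iter(list_iter) is list_iter    # True: iterators return themselves

gen = (x for x in my_list)
gen_is_iterator = isinstance(gen, Iterator)         # True: generators are iterators
looped = manual_for(my_list)                        # [1, 2, 3]
```

Because a list hands out a new list_iterator on every iter() call, it can be looped over repeatedly, whereas an iterator can only move forward.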
Generator Functions and yield
A generator function uses yield instead of return. When called, it returns a generator object without executing any body code. Each call to next() runs the function until the next yield, suspends there, and returns the yielded value. The local state (variables, execution position) is preserved between calls.
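To make the suspend-and-resume behaviour visible, here is a small sketch that records when each part of a generator body runs. The traced function and the log list are illustrative names introduced here, not part of the examples that follow.

```python
log = []

def traced():
    """Generator that records when each part of its body runs."""
    log.append("body started")
    yield 1
    log.append("resumed after first yield")
    yield 2
    log.append("body finished")

gen = traced()                 # creates the generator object only
log_after_call = list(log)     # still empty: no body code has run yet

first = next(gen)              # runs up to the first yield
log_after_first = list(log)    # ["body started"]

second = next(gen)             # resumes right after the first yield
remaining = list(gen)          # runs to the end; nothing left to yield
```

Calling traced() alone logs nothing; each next() advances the body exactly as far as the following yield.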
import csv

def read_large_file(filepath):
    """Read a huge file line by line without loading it all into memory."""
    with open(filepath) as f:
        for line in f:
            yield line.rstrip()

def parse_csv_rows(filepath):
    """Generator that yields parsed dicts from a CSV."""
    with open(filepath, newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row

# Process a 1 GB log file using only a few KB of memory
for line in read_large_file("/var/log/nginx/access.log"):
    if "ERROR" in line:
        print(line)

# yield from: delegate to a sub-generator
def chain_files(*paths):
    for path in paths:
        yield from read_large_file(path)

for line in chain_files("part1.log", "part2.log", "part3.log"):
    print(line)
yield from iterable delegates to a sub-iterator. When that sub-iterator is itself a generator, send() and throw() calls are forwarded to it transparently, which makes yield from essential for composing generators.
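The forwarding behaviour is easiest to see with a sub-generator that receives values. The sketch below uses two hypothetical generators, accumulator and delegator, invented for illustration; it also shows that the sub-generator's return value becomes the value of the yield from expression.

```python
def accumulator():
    """Sub-generator: receives numbers via send() and yields the running total."""
    total = 0
    while True:
        value = yield total
        if value is None:
            return total       # becomes the value of the `yield from` expression
        total += value

def delegator(results):
    """Outer generator: hands control to accumulator() via yield from."""
    final = yield from accumulator()
    results.append(final)

results = []
gen = delegator(results)
initial = next(gen)            # primes both generators; yields 0
after_5 = gen.send(5)          # forwarded to accumulator; yields 5
after_7 = gen.send(7)          # yields 12

try:
    gen.send(None)             # accumulator returns, then delegator finishes
except StopIteration:
    pass
```

The caller only ever talks to delegator, yet every send() reaches accumulator without any forwarding code in between.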
Generator Expressions
Generator expressions use the same syntax as list comprehensions but with parentheses instead of brackets. They are lazy — they produce values on demand without building a list in memory. Use them when you only need to iterate once and don't need random access.
import os

# List comprehension: loads all paths into memory
all_files = [
    os.path.join(root, f)
    for root, dirs, files in os.walk("/data")
    for f in files
    if f.endswith(".json")
]

# Generator expression: lazy, uses O(1) memory
file_gen = (
    os.path.join(root, f)
    for root, dirs, files in os.walk("/data")
    for f in files
    if f.endswith(".json")
)

# sum() and max() accept generators directly
total_size = sum(os.path.getsize(p) for p in file_gen)

# Chaining generators without intermediate collections
numbers = range(10_000_000)
squared = (x * x for x in numbers)
filtered = (x for x in squared if x % 3 == 0)
result = sum(filtered)  # only one value in memory at a time
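The memory difference and the single-pass limitation can both be observed with a short experiment. This is a rough sketch: sys.getsizeof reports only the size of the container object itself (for a list, the array of references, not the integers it points to), and exact byte counts vary between Python versions.

```python
import sys

n = 1_000_000
squares_list = [x * x for x in range(n)]     # materializes every value up front
squares_gen = (x * x for x in range(n))      # stores only the iteration state

list_bytes = sys.getsizeof(squares_list)     # grows with n
gen_bytes = sys.getsizeof(squares_gen)       # small and independent of n

first_pass = sum(squares_gen)                # consumes the generator
second_pass = sum(squares_gen)               # already exhausted, so the sum is 0
```

The second sum() returning 0 is the practical cost of laziness: once a generator expression has been consumed, it has to be recreated to iterate again.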
send(), throw(), and close()
Generators support two-way communication. send(value) resumes the generator and injects a value as the result of the current yield expression. throw(exc) raises an exception inside the generator at the suspension point. close() throws GeneratorExit for cleanup. This makes generators useful as coroutines for pipelines that need feedback.
def running_average():
    """Coroutine that accepts numbers and yields the running average."""
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average  # yield current avg, receive next value
        if value is None:
            break
        total += value
        count += 1
        average = total / count

# Must prime the coroutine with next() before send()
avg = running_average()
next(avg)            # prime: advances to the first yield
print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0
print(avg.send(30))  # 20.0
avg.close()          # triggers GeneratorExit

# Generator with cleanup via try/finally
def managed_resource():
    print("Acquiring resource")
    try:
        while True:
            data = yield
            print(f"Processing: {data}")
    finally:
        print("Releasing resource")  # runs on close() or throw()

gen = managed_resource()
next(gen)
gen.send("batch 1")
gen.close()  # prints "Releasing resource"
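The examples above exercise send() and close() but not throw(). As a complement, here is a minimal sketch of throw() with a hypothetical resilient_counter generator that treats an injected ValueError as a reset signal; the exception-as-signal convention is an assumption chosen for illustration.

```python
def resilient_counter():
    """Counts upward; a ValueError thrown in resets the count to zero."""
    count = 0
    while True:
        try:
            yield count
            count += 1
        except ValueError:
            count = 0          # handle the injected exception and keep running

gen = resilient_counter()
a = next(gen)                          # 0
b = next(gen)                          # 1
c = next(gen)                          # 2
d = gen.throw(ValueError("reset"))     # handled inside; the next yield gives 0
e = next(gen)                          # 1

# An exception the generator does not handle propagates back to the caller
try:
    gen.throw(KeyError("unhandled"))
    propagated = False
except KeyError:
    propagated = True
```

throw() returns the next yielded value when the generator catches the exception, and re-raises it in the caller (finishing the generator) when it does not.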
itertools: Batteries for Iterators
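The itertools module provides high-performance, memory-efficient iterator building blocks. Its functions return lazy iterators rather than lists, so most of them compose safely even with infinite inputs. A few, such as product(), read their inputs completely before producing output and therefore need finite inputs.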
import itertools

# chain: iterate over multiple iterables sequentially
for item in itertools.chain([1, 2], [3, 4], [5]):
    print(item)  # 1 2 3 4 5

# islice: slice any iterator (including infinite ones)
first_5 = list(itertools.islice(itertools.count(100), 5))  # [100, 101, 102, 103, 104]

# groupby: group consecutive elements by key (sort by the same key first)
data = [{"type": "A", "v": 1}, {"type": "A", "v": 2}, {"type": "B", "v": 3}]
sorted_data = sorted(data, key=lambda x: x["type"])
for key, group in itertools.groupby(sorted_data, key=lambda x: x["type"]):
    print(key, list(group))

# product: Cartesian product
params = list(itertools.product(["lr", "momentum"], [0.001, 0.01, 0.1]))
# [('lr', 0.001), ('lr', 0.01), ('lr', 0.1), ('momentum', 0.001), ...]

# batched (Python 3.12+): chunk an iterator into fixed-size batches
for batch in itertools.batched(range(10), 3):
    print(list(batch))  # [0,1,2] [3,4,5] [6,7,8] [9]

# takewhile / dropwhile
under_100 = list(itertools.takewhile(lambda x: x < 100, itertools.count()))  # [0..99]
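Since itertools.batched requires Python 3.12, older interpreters need a substitute. The sketch below builds one from islice; batched_compat is a hypothetical name, and the function is only an approximation of the standard library version, not a drop-in copy of it.

```python
import itertools

def batched_compat(iterable, n):
    """Yield tuples of up to n items; the last batch may be shorter."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    while True:
        batch = tuple(itertools.islice(iterator, n))  # take the next n items
        if not batch:
            return                                    # input exhausted
        yield batch

batches = list(batched_compat(range(10), 3))
# [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9,)]

# Works lazily on infinite inputs as long as the consumer stops
first_two = list(itertools.islice(batched_compat(itertools.count(), 2), 2))
# [(0, 1), (2, 3)]
```

Each batch is built by pulling from one shared iterator, so the input is read exactly once and never held in memory as a whole.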
Building Data Pipelines
Generators compose naturally into pipelines where each stage is a generator function. Data flows one record at a time through the entire pipeline, keeping memory usage flat regardless of dataset size. This pattern is particularly powerful for log processing, ETL jobs, and streaming analytics.
import csv
import json
from datetime import datetime

def read_csv(path):
    with open(path, newline='') as f:
        yield from csv.DictReader(f)

def parse_timestamps(rows):
    for row in rows:
        row['ts'] = datetime.fromisoformat(row['timestamp'])
        yield row

def filter_errors(rows):
    for row in rows:
        if row.get('level') == 'ERROR':
            yield row

def enrich(rows, lookup):
    for row in rows:
        row['service'] = lookup.get(row['code'], 'unknown')
        yield row

def to_jsonl(rows, outpath):
    with open(outpath, 'w') as f:
        for row in rows:
            row.pop('timestamp', None)  # remove redundant field
            # default=str is needed because row['ts'] is a datetime,
            # which json.dumps cannot serialize on its own
            f.write(json.dumps(row, default=str) + '\n')

# Compose the pipeline: nothing executes yet
service_map = {"E001": "auth", "E002": "billing", "E003": "api"}
rows = read_csv("events.csv")
rows = parse_timestamps(rows)
rows = filter_errors(rows)
rows = enrich(rows, service_map)

# Execute: pulls one row through all stages at a time
to_jsonl(rows, "errors.jsonl")
# Memory usage: O(1) regardless of input size
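The claim that rows move through all stages one at a time can be verified without touching the file system. The following sketch replaces the file-based stages with hypothetical in-memory ones (source, only_errors, collect) that append to a shared trace list; the record layout is made up for the demonstration.

```python
def source(records, trace):
    for record in records:
        trace.append(f"read {record['id']}")
        yield record

def only_errors(rows, trace):
    for row in rows:
        trace.append(f"filter {row['id']}")
        if row["level"] == "ERROR":
            yield row

def collect(rows, trace):
    """Terminal stage: drives the pipeline and returns the ids it received."""
    out = []
    for row in rows:
        trace.append(f"write {row['id']}")
        out.append(row["id"])
    return out

records = [
    {"id": 1, "level": "ERROR"},
    {"id": 2, "level": "INFO"},
    {"id": 3, "level": "ERROR"},
]
trace = []
rows = only_errors(source(records, trace), trace)
trace_before = list(trace)         # empty: composing the stages runs nothing
written = collect(rows, trace)     # [1, 3]
```

After collect() runs, trace reads "read 1, filter 1, write 1, read 2, filter 2, read 3, filter 3, write 3": each record finishes its trip through the pipeline before the next one is read.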
Infinite Sequences
Generators can produce infinite sequences safely because they only compute the next value when requested. Combined with itertools.islice or takewhile, you can work with infinite streams without looping forever.
import itertools
import math
from collections import deque

def fibonacci():
    """Infinite Fibonacci sequence."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

def primes():
    """Infinite prime number generator using trial division."""
    yield 2
    seen = [2]
    candidate = 3
    while True:
        sqrt_c = math.isqrt(candidate)
        if all(candidate % p != 0 for p in seen if p <= sqrt_c):
            yield candidate
            seen.append(candidate)
        candidate += 2

# First 10 Fibonacci numbers
fib10 = list(itertools.islice(fibonacci(), 10))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

# Primes below 50
primes_50 = list(itertools.takewhile(lambda p: p < 50, primes()))
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]

# Sliding window over an infinite stream
def sliding_window(iterable, n):
    window = deque(maxlen=n)
    for item in iterable:
        window.append(item)
        if len(window) == n:
            yield tuple(window)

for w in itertools.islice(sliding_window(fibonacci(), 3), 8):
    print(w)
# (0,1,1) (1,1,2) (1,2,3) (2,3,5) ...
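One more pattern worth sketching is searching an infinite stream. A filter alone never terminates on an infinite input, so it has to be paired with something that stops: next() for the first match, or takewhile() for a bounded range. The block redefines fibonacci() so it runs on its own.

```python
import itertools

def fibonacci():
    """Infinite Fibonacci sequence (same definition as above)."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# next() pulls values only until the first match is found
first_over_1000 = next(x for x in fibonacci() if x > 1000)   # 1597

# Filter lazily, then bound the result with takewhile
even_fibs = (x for x in fibonacci() if x % 2 == 0)
even_below_100 = list(itertools.takewhile(lambda x: x < 100, even_fibs))
# [0, 2, 8, 34]
```

Note the order of operations in the second case: calling list() on even_fibs directly would loop forever, because the filter has no way to know the stream will never end.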
Frequently Asked Questions
- Can I reuse a generator?
  No. Generators are stateful and exhausted after one pass. If you need to iterate multiple times, either convert to a list or wrap the generator factory in a class with __iter__ that creates a fresh generator each time.
- When should I use a list vs a generator?
  Use a list when you need random access, multiple passes, or len(). Use a generator when you process data once, top to bottom, and memory matters, especially for files, database rows, or API pages.
- How do I debug a generator pipeline?
  Insert a debug stage that prints each row and then yields it unchanged:
      def debug(rows):
          for row in rows:
              print(row)
              yield row
  Insert it between any two stages without affecting the pipeline structure.
- What is the difference between yield and yield from?
  yield value yields a single value. yield from iterable delegates to an iterable, yielding all its values one by one. It is equivalent to for v in iterable: yield v, but it also transparently forwards send() and throw() when the iterable is a generator.
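The first answer mentions wrapping a generator factory in a class with __iter__. One possible shape for such a wrapper is sketched below; Reiterable and countdown are hypothetical names chosen for this example.

```python
class Reiterable:
    """Wraps a generator function so every loop gets a fresh generator."""

    def __init__(self, factory, *args, **kwargs):
        self._factory = factory
        self._args = args
        self._kwargs = kwargs

    def __iter__(self):
        # Call the generator function again for each new iteration
        return self._factory(*self._args, **self._kwargs)

def countdown(n):
    while n > 0:
        yield n
        n -= 1

one_shot = countdown(3)
first = list(one_shot)       # [3, 2, 1]
second = list(one_shot)      # []: the generator is exhausted

reusable = Reiterable(countdown, 3)
pass_one = list(reusable)    # [3, 2, 1]
pass_two = list(reusable)    # [3, 2, 1]: a new generator each time
```

The wrapper stays lazy, since nothing is cached; the trade-off is that the underlying work (reading a file, querying an API) is repeated on every pass.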