Python Generators and Iterators: Memory-Efficient Pipelines

Generators and iterators let Python process large datasets without loading everything into memory. A list of 10 million records can occupy hundreds of megabytes, while a generator producing the same records holds only a small, fixed amount of state plus the record currently in flight. This guide covers the iterator protocol, the yield keyword, generator expressions, the send(), throw(), and close() methods, itertools, and how to compose generators into memory-efficient data pipelines.

The Iterator Protocol
Generator Functions and yield
Generator Expressions
send(), throw(), and close()
itertools: Batteries for Iterators
Building Data Pipelines
Infinite Sequences
Frequently Asked Questions

The Iterator Protocol

Python's iterator protocol requires two methods: __iter__ returns the iterator object itself, and __next__ returns the next value or raises StopIteration when there are no more. Any object implementing both is an iterator. An iterable is any object whose __iter__ returns an iterator, so every iterator is also an iterable, but not the other way around. The for loop calls iter(obj) to get an iterator, then repeatedly calls next() on it until StopIteration is raised.

class CountUp:
    """Custom iterator that counts from start to stop."""
    def __init__(self, start, stop):
        self.current = start
        self.stop = stop

    def __iter__(self):
        return self  # the iterator is itself

    def __next__(self):
        if self.current >= self.stop:
            raise StopIteration
        value = self.current
        self.current += 1
        return value

counter = CountUp(1, 5)
for n in counter:
    print(n)  # 1 2 3 4

# Equivalent using built-in next()
counter2 = CountUp(10, 13)
print(next(counter2))  # 10
print(next(counter2))  # 11
print(next(counter2))  # 12
print(next(counter2, "done"))  # "done" — default on StopIteration

Iterable vs Iterator: A list is an iterable (has __iter__) but not an iterator (no __next__). iter(my_list) creates a list_iterator. Generators are both iterables and iterators.
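
The distinction can be checked with the abstract base classes in collections.abc. The sketch below also includes manual_for, a hypothetical helper written only for illustration, which approximates what a for statement does with iter() and next().

```python
from collections.abc import Iterable, Iterator

numbers = [10, 20, 30]

# A list is iterable, but it is not its own iterator.
list_is_iterable = isinstance(numbers, Iterable)   # True
list_is_iterator = isinstance(numbers, Iterator)   # False

# iter() builds a separate iterator object for the list.
it = iter(numbers)
iterator_is_iterator = isinstance(it, Iterator)    # True
iter_returns_self = iter(it) is it                 # True: an iterator returns itself

# A generator object satisfies both interfaces.
gen = (n for n in numbers)
gen_is_both = isinstance(gen, Iterable) and isinstance(gen, Iterator)  # True


def manual_for(iterable):
    """Roughly what `for item in iterable` does (illustrative helper)."""
    collected = []
    iterator = iter(iterable)
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            break
        collected.append(item)
    return collected


print(manual_for(numbers))  # [10, 20, 30]
```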

Generator Functions and yield

A generator function uses yield instead of return. When called, it returns a generator object without executing any body code. Each call to next() runs the function until the next yield, suspends there, and returns the yielded value. The local state (variables, execution position) is preserved between calls.
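
The suspend-and-resume behaviour can be made visible by recording when each part of the body runs. This is a minimal sketch; two_steps and the events list are illustrative names, not part of any library.

```python
events = []  # records what ran, in order


def two_steps():
    events.append("body started")
    yield "first"
    events.append("resumed after first yield")
    yield "second"
    events.append("body finished")


gen = two_steps()
events.append("generator created")  # no body code has run at this point

first = next(gen)            # runs the body up to the first yield
second = next(gen)           # resumes and runs to the second yield
leftover = next(gen, None)   # body finishes; the default replaces StopIteration

print(events)
```

The "generator created" entry appears before "body started", which shows that calling the function only builds the generator object.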

def read_large_file(filepath):
    """Read a huge file line by line without loading it all into memory."""
    with open(filepath) as f:
        for line in f:
            yield line.rstrip()

def parse_csv_rows(filepath):
    """Generator that yields parsed dicts from a CSV."""
    import csv
    with open(filepath, newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row

# Process 1 GB log file using only a few KB of memory
for line in read_large_file("/var/log/nginx/access.log"):
    if "ERROR" in line:
        print(line)

# yield from — delegate to a sub-generator
def chain_files(*paths):
    for path in paths:
        yield from read_large_file(path)

for line in chain_files("part1.log", "part2.log", "part3.log"):
    print(line)

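yield from: Introduced in Python 3.3, yield from iterable delegates to a sub-iterator. When the sub-iterator is a generator, send() and throw() calls are forwarded to it transparently, and its return value becomes the value of the yield from expression. Essential for composing generators.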
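
A small sketch of that forwarding, assuming two illustrative generators named accumulator and delegator: values sent to the outer generator reach the inner one, and the inner generator's return value comes back through yield from.

```python
def accumulator():
    """Sub-generator: sums values sent in until it receives None."""
    total = 0
    while True:
        value = yield total
        if value is None:
            return total      # becomes the value of `yield from`
        total += value


def delegator(results):
    """Outer generator: send() calls pass straight through to accumulator."""
    final = yield from accumulator()
    results.append(final)


results = []
gen = delegator(results)
next(gen)                 # prime: runs to the first yield inside accumulator
a = gen.send(5)           # 5
b = gen.send(7)           # 12
try:
    gen.send(None)        # accumulator returns, then delegator finishes
except StopIteration:
    pass

print(a, b, results)      # 5 12 [12]
```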

Generator Expressions

Generator expressions use the same syntax as list comprehensions but with parentheses instead of brackets. They are lazy — they produce values on demand without building a list in memory. Use them when you only need to iterate once and don't need random access.

import os

# List comprehension — loads all paths into memory
all_files = [
    os.path.join(root, f)
    for root, dirs, files in os.walk("/data")
    for f in files
    if f.endswith(".json")
]

# Generator expression — lazy, uses O(1) memory
file_gen = (
    os.path.join(root, f)
    for root, dirs, files in os.walk("/data")
    for f in files
    if f.endswith(".json")
)

# sum() and max() accept generators directly
total_size = sum(os.path.getsize(p) for p in file_gen)

# Chaining generators without intermediate collections
numbers = range(10_000_000)
squared = (x * x for x in numbers)
filtered = (x for x in squared if x % 3 == 0)
result = sum(filtered)  # only one value in memory at a time
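
The memory difference and the single-pass behaviour can both be observed directly. This is a rough sketch: sys.getsizeof reports only the size of the container object itself (not the integers it refers to), and the exact byte counts vary between Python versions, so only the comparison is shown.

```python
import sys

n = 100_000

squares_list = [x * x for x in range(n)]   # materializes every element
squares_gen = (x * x for x in range(n))    # stores only the iteration state

list_bytes = sys.getsizeof(squares_list)   # grows with n
gen_bytes = sys.getsizeof(squares_gen)     # small and independent of n

print(list_bytes > gen_bytes)  # True

# A generator expression can be consumed only once.
first_pass = sum(squares_gen)
second_pass = sum(squares_gen)   # already exhausted, so the sum is 0
print(first_pass == sum(squares_list), second_pass)  # True 0
```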

send(), throw(), and close()

Generators support two-way communication. send(value) resumes the generator and makes value the result of the yield expression it was suspended on; a newly created generator must first be advanced to its first yield with next(). throw(exc) raises an exception inside the generator at the suspension point. close() raises GeneratorExit at the suspension point so the generator can clean up. This makes generators useful as coroutines for pipelines that need feedback.

def running_average():
    """Coroutine that accepts numbers and yields the running average."""
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average  # yield current avg, receive next value
        if value is None:
            break
        total += value
        count += 1
        average = total / count

# Must prime the coroutine with next() before send()
avg = running_average()
next(avg)           # prime — advances to first yield
print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0
print(avg.send(30))  # 20.0
avg.close()         # triggers GeneratorExit

# Generator with cleanup via try/finally
def managed_resource():
    print("Acquiring resource")
    try:
        while True:
            data = yield
            print(f"Processing: {data}")
    finally:
        print("Releasing resource")  # runs on close() or throw()

gen = managed_resource()
next(gen)
gen.send("batch 1")
gen.close()  # prints "Releasing resource"
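
The examples above use send() and close(); throw() can be sketched in the same style. Here resilient_worker is an illustrative generator that treats a thrown ValueError as a request to reset its counter, which is an assumption made for the example rather than a standard convention.

```python
def resilient_worker(log):
    """Counts items sent in; a ValueError thrown in resets the count."""
    processed = 0
    try:
        while True:
            try:
                item = yield processed
                log.append(f"processed {item}")
                processed += 1
            except ValueError:
                log.append("reset requested")
                processed = 0
    finally:
        log.append("cleanup")


log = []
worker = resilient_worker(log)
next(worker)                                      # prime
count_a = worker.send("a")                        # 1
count_b = worker.send("b")                        # 2
after_reset = worker.throw(ValueError("reset"))   # handled inside; returns the next yielded value
worker.close()                                    # GeneratorExit is not caught, so finally runs

print(count_a, count_b, after_reset)  # 1 2 0
print(log)
```

Because the generator handles the exception and reaches another yield, throw() returns that yielded value instead of propagating the error to the caller.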

itertools: Batteries for Iterators

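The itertools module provides high-performance, memory-efficient iterator building blocks. Its functions return lazy iterators rather than lists, so most of them compose safely even with infinite inputs. Two caveats: product() reads each input to completion before yielding anything, so it cannot take an infinite input, and groupby() groups only consecutive items, so the data usually needs to be sorted by the same key first.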

import itertools

# chain — iterate over multiple iterables sequentially
for item in itertools.chain([1, 2], [3, 4], [5]):
    print(item)  # 1 2 3 4 5

# islice — slice any iterator (including infinite ones)
first_5 = list(itertools.islice(itertools.count(100), 5))  # [100, 101, 102, 103, 104]

# groupby — group consecutive elements by key
data = [{"type": "A", "v": 1}, {"type": "A", "v": 2}, {"type": "B", "v": 3}]
sorted_data = sorted(data, key=lambda x: x["type"])
for key, group in itertools.groupby(sorted_data, key=lambda x: x["type"]):
    print(key, list(group))

# product — Cartesian product
params = list(itertools.product(["lr", "momentum"], [0.001, 0.01, 0.1]))
# [('lr', 0.001), ('lr', 0.01), ('lr', 0.1), ('momentum', 0.001), ...]

# batched (Python 3.12+) — chunk an iterator into fixed-size batches
for batch in itertools.batched(range(10), 3):
    print(list(batch))  # [0,1,2] [3,4,5] [6,7,8] [9]

# takewhile / dropwhile
under_100 = list(itertools.takewhile(lambda x: x < 100, itertools.count()))  # [0..99]
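
On interpreters older than 3.12, where itertools.batched is unavailable, similar behaviour can be approximated with islice. The helper name batched_compat is hypothetical, and this is a sketch rather than a drop-in replacement for the standard function.

```python
import itertools


def batched_compat(iterable, n):
    """Yield tuples of up to n items from iterable (illustrative fallback)."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    while True:
        batch = tuple(itertools.islice(iterator, n))
        if not batch:
            return
        yield batch


print(list(batched_compat(range(10), 3)))
# [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9,)]

# Works on an infinite source too, as long as the consumer stops.
first_two = list(itertools.islice(batched_compat(itertools.count(), 4), 2))
print(first_two)  # [(0, 1, 2, 3), (4, 5, 6, 7)]
```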

Building Data Pipelines

Generators compose naturally into ETL pipelines where each stage is a generator function. Data flows one record at a time through the entire pipeline, keeping memory usage flat regardless of dataset size. This pattern is particularly powerful for log processing, ETL jobs, and streaming analytics.

import csv
import json

def read_csv(path):
    with open(path, newline='') as f:
        yield from csv.DictReader(f)

def parse_timestamps(rows):
    from datetime import datetime
    for row in rows:
        row['ts'] = datetime.fromisoformat(row['timestamp'])
        yield row

def filter_errors(rows):
    for row in rows:
        if row.get('level') == 'ERROR':
            yield row

def enrich(rows, lookup):
    for row in rows:
        row['service'] = lookup.get(row['code'], 'unknown')
        yield row

def to_jsonl(rows, outpath):
    with open(outpath, 'w') as f:
        for row in rows:
            row.pop('timestamp', None)  # remove redundant field
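            # default=str converts the datetime stored in row['ts'] to text;
            # without it json.dumps raises TypeError on datetime objects
            f.write(json.dumps(row, default=str) + '\n')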

# Compose the pipeline — nothing executes yet
service_map = {"E001": "auth", "E002": "billing", "E003": "api"}
rows = read_csv("events.csv")
rows = parse_timestamps(rows)
rows = filter_errors(rows)
rows = enrich(rows, service_map)

# Execute — pulls one row through all stages at a time
to_jsonl(rows, "errors.jsonl")
# Memory usage: O(1) regardless of input size
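
The one-row-at-a-time behaviour can be verified without any files by recording the order in which stages touch each record. The stage names, sample records, and lookup table below are made up for the illustration.

```python
trace = []  # order in which stages touch each record


def source(records):
    for record in records:
        trace.append(f"read {record['id']}")
        yield record


def only_errors(rows):
    for row in rows:
        if row["level"] == "ERROR":
            trace.append(f"keep {row['id']}")
            yield row


def add_service(rows, lookup):
    for row in rows:
        trace.append(f"enrich {row['id']}")
        yield {**row, "service": lookup.get(row["code"], "unknown")}


sample = [
    {"id": 1, "level": "INFO", "code": "E001"},
    {"id": 2, "level": "ERROR", "code": "E002"},
    {"id": 3, "level": "ERROR", "code": "E999"},
]

pipeline = add_service(only_errors(source(sample)), {"E002": "billing"})
trace_before = list(trace)    # still empty: composing the stages runs nothing

output = list(pipeline)       # consuming the last stage drives all of them
print(trace)
print(output)
```

In the trace, record 2 is read, kept, and enriched before record 3 is read at all, so the stages interleave per record instead of each finishing its whole input first.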

Infinite Sequences

Generators can produce infinite sequences safely because they only compute the next value when requested. Combined with itertools.islice or takewhile, you can work with infinite streams without looping forever.

import itertools
import math

def fibonacci():
    """Infinite Fibonacci sequence."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

def primes():
    """Infinite prime number generator using trial division."""
    yield 2
    seen = [2]
    candidate = 3
    while True:
        sqrt_c = math.isqrt(candidate)
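        # takewhile stops at the first prime above sqrt(candidate)
        # instead of scanning the whole seen list
        small_primes = itertools.takewhile(lambda p: p <= sqrt_c, seen)
        if all(candidate % p != 0 for p in small_primes):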
            yield candidate
            seen.append(candidate)
        candidate += 2

# First 10 Fibonacci numbers
fib10 = list(itertools.islice(fibonacci(), 10))
# [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

# Primes below 50
primes_50 = list(itertools.takewhile(lambda p: p < 50, primes()))
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]

# Sliding window over an infinite stream
def sliding_window(iterable, n):
    from collections import deque
    window = deque(maxlen=n)
    for item in iterable:
        window.append(item)
        if len(window) == n:
            yield tuple(window)

for w in itertools.islice(sliding_window(fibonacci(), 3), 8):
    print(w)
# (0,1,1) (1,1,2) (1,2,3) (2,3,5) ...

Frequently Asked Questions

Can I reuse a generator?: No. Generators are stateful and exhausted after one pass. If you need to iterate multiple times, either convert to a list or wrap the generator factory in a class with __iter__ that creates a fresh generator each time.
When should I use a list vs a generator?: Use a list when you need random access, multiple passes, or len(). Use a generator when you process data once top-to-bottom and memory matters — especially for files, database rows, or API pages.
How do I debug a generator pipeline?: Insert a debug stage: def debug(rows): for row in rows: print(row); yield row. Insert it between any two stages without affecting the pipeline structure.
What is the difference between yield and yield from?: yield value yields a single value. yield from iterable delegates to an iterable, yielding all its values one by one — equivalent to for v in iterable: yield v but also transparently forwarding send() and throw().
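
Two of the answers above, the reusable wrapper and the debug stage, can be sketched together. Reusable, countdown, and the label and stream parameters of debug are illustrative names chosen for this example.

```python
import sys


class Reusable:
    """Iterable wrapper: every iter() call builds a fresh generator."""

    def __init__(self, factory, *args):
        self._factory = factory
        self._args = args

    def __iter__(self):
        return self._factory(*self._args)


def countdown(n):
    while n > 0:
        yield n
        n -= 1


def debug(rows, label="debug", stream=sys.stderr):
    """Pass-through stage: prints each row, then yields it unchanged."""
    for row in rows:
        print(f"[{label}] {row!r}", file=stream)
        yield row


numbers = Reusable(countdown, 3)
first = list(numbers)    # [3, 2, 1]
second = list(numbers)   # [3, 2, 1] again, from a new generator

doubled = [n * 2 for n in debug(numbers, label="before doubling")]
print(first, second, doubled)  # [3, 2, 1] [3, 2, 1] [6, 4, 2]
```

Printing to stderr by default keeps the debug output separate from whatever the pipeline itself writes to stdout.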
