Python Polars: Fast DataFrames for Modern Data Engineering

Polars is a high-performance DataFrame library written in Rust that consistently outperforms Pandas by 5—30x on real-world data engineering tasks. It supports true multi-threaded execution, lazy evaluation for query optimization, and streaming for out-of-core processing. This guide covers Polars' expression API, lazy frames, groupby, joins, I/O with Parquet and CSV, and the key migration patterns from Pandas.

Why Polars? Benchmarks vs Pandas
DataFrame Basics
The Expression API
Lazy Evaluation with LazyFrame
GroupBy and Aggregation
Joins
I/O: Parquet, CSV, JSON
Streaming Large Files
Migrating from Pandas
Frequently Asked Questions

Why Polars? Benchmarks vs Pandas

Polars is built in Rust, uses Apache Arrow as its memory format, and parallelizes all operations across available CPU cores automatically. The result: 5—30x faster than Pandas for groupby, join, and sorting operations on datasets over 1 million rows. It also uses significantly less memory — Arrow's columnar format is more compact than Pandas' object arrays.

When to use Polars: Data engineering pipelines, ETL jobs, processing files > 100MB, anything where Pandas is slow. Keep Pandas for ML feature engineering (Scikit-learn expects Pandas DataFrames), Jupyter notebook exploration, and when Pandas-specific libraries are required.

DataFrame Basics

pip install polars

import polars as pl
import numpy as np

# Create DataFrame
df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "dept": ["Eng", "Eng", "Sales", "Sales"],
    "salary": [95000, 88000, 72000, 67000],
    "years": [5, 3, 7, 2],
})

# Read CSV, Parquet, JSON
df = pl.read_csv("employees.csv")
df = pl.read_parquet("data.parquet")
df = pl.read_json("data.json")

# Inspect
print(df.shape)         # (4, 4)
print(df.dtypes)        # column dtypes
print(df.describe())    # statistics
print(df.head(3))
print(df.schema)        # {name: Utf8, dept: Utf8, salary: Int64, years: Int64}

# Column selection
df.select("salary")
df.select(["name", "salary"])
df.select(pl.col("^sal.*$"))    # regex column selection

# Filter
df.filter(pl.col("salary") > 80000)
df.filter((pl.col("dept") == "Eng") & (pl.col("years") >= 3))

The Expression API

Polars' expression API is its core innovation. Expressions are composable, lazy computations that Polars optimizes and parallelizes. You build expressions using pl.col() and chain operations — they run in parallel across columns automatically.

import polars as pl

df = pl.read_csv("employees.csv")

# with_columns — add or transform columns
df = df.with_columns([
    (pl.col("salary") * 0.1).alias("bonus"),
    (pl.col("salary") + pl.col("salary") * 0.1).alias("total_comp"),
    pl.col("name").str.to_uppercase().alias("name_upper"),
    pl.col("dept").cast(pl.Categorical),
])

# Multiple expressions run in parallel
df = df.with_columns([
    pl.col("salary").mean().over("dept").alias("dept_avg_salary"),
    pl.col("years").rank("dense").over("dept").alias("seniority_rank"),
    (pl.col("salary") > pl.col("salary").mean().over("dept")).alias("above_avg"),
])

# String operations
df.with_columns([
    pl.col("name").str.split(" ").list.first().alias("first_name"),
    pl.col("email").str.extract(r"@(.+)$", 1).alias("domain"),
    pl.col("desc").str.replace_all(r"\s+", " "),
])

# Date/time operations
df.with_columns([
    pl.col("hire_date").str.to_date("%Y-%m-%d"),
    pl.col("hire_date").dt.year().alias("hire_year"),
    pl.col("hire_date").dt.month().alias("hire_month"),
    (pl.col("hire_date") - pl.lit(pl.date(2026, 1, 1))).dt.total_days().alias("days_ago"),
])

Lazy Evaluation with LazyFrame

A LazyFrame represents a deferred computation. When you call collect(), Polars optimizes the entire query — pushing filters early, removing unnecessary columns, fusing operations — before executing. This can be 2—5x faster than eager execution for complex pipelines.

import polars as pl

# Lazy frame — no computation yet
lf = pl.scan_csv("huge_file.csv")

result = (
    lf
    .filter(pl.col("year") == 2026)
    .filter(pl.col("revenue") > 1000)
    .with_columns([
        (pl.col("revenue") - pl.col("cost")).alias("profit"),
    ])
    .group_by("region")
    .agg([
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("profit").mean().alias("avg_profit"),
        pl.col("customer_id").n_unique().alias("unique_customers"),
    ])
    .sort("total_revenue", descending=True)
    .limit(10)
)

# Show the query plan
print(result.explain(optimized=True))

# Execute
df = result.collect()

# Scan Parquet — even faster with predicate pushdown
lf = pl.scan_parquet("data/*.parquet")
df = (
    lf
    .filter(pl.col("date") >= pl.date(2026, 1, 1))
    .select(["date", "product_id", "revenue"])
    .collect()
)

GroupBy and Aggregation

df = pl.read_csv("sales.csv")

# Single aggregation
dept_avg = df.group_by("dept").agg(pl.col("salary").mean())

# Multiple aggregations
summary = df.group_by("dept").agg([
    pl.col("salary").mean().alias("avg_salary"),
    pl.col("salary").max().alias("max_salary"),
    pl.col("name").count().alias("headcount"),
    pl.col("years").sum().alias("total_experience"),
    (pl.col("years") >= 5).sum().alias("senior_count"),
])

# Rolling / window aggregations
df.with_columns([
    pl.col("revenue").rolling_mean(window_size=7).alias("revenue_7d_ma"),
    pl.col("revenue").rolling_sum(window_size=30).alias("revenue_30d_sum"),
])

# Group by then explode (unnest lists)
df.group_by("dept").agg(
    pl.col("name").sort().alias("members")
).explode("members")

# Dynamic groupby (time-based)
df.group_by_dynamic("date", every="1mo").agg(
    pl.col("revenue").sum()
)

Joins

employees = pl.read_csv("employees.csv")
departments = pl.DataFrame({
    "dept": ["Eng", "Sales", "Marketing"],
    "budget": [500000, 300000, 200000],
    "manager": ["Alice", "Dave", "Eve"],
})

# Inner join
merged = employees.join(departments, on="dept", how="inner")

# Left join
merged = employees.join(departments, on="dept", how="left")

# Join on different column names
merged = employees.join(departments,
                        left_on="department", right_on="dept",
                        how="left")

# Anti-join — rows in left NOT in right
new_employees = employees.join(existing_db, on="email", how="anti")

# Cross join (Cartesian product)
combos = employees.join(departments, how="cross")

# Asof join — nearest match on sorted key (time series)
df_trades.join_asof(df_quotes, on="timestamp", by="symbol", strategy="backward")

I/O: Parquet, CSV, JSON

import polars as pl

# CSV
df = pl.read_csv("data.csv",
                  infer_schema_length=10_000,
                  dtypes={"id": pl.Int32, "date": pl.Date},
                  null_values=["", "NA", "N/A"],
                  try_parse_dates=True)
df.write_csv("output.csv")

# Parquet — fastest format, best compression
df = pl.read_parquet("data.parquet")
df.write_parquet("output.parquet", compression="snappy")

# Scan multiple Parquet files with glob
lf = pl.scan_parquet("data/year=2026/**/*.parquet")

# JSON
df = pl.read_json("data.json")
df = pl.read_ndjson("data.ndjson")  # newline-delimited JSON
df.write_ndjson("output.ndjson")

# Arrow IPC (fastest inter-process format)
df.write_ipc("data.arrow")
df = pl.read_ipc("data.arrow")

# To/from Pandas
pandas_df = df.to_pandas()
polars_df = pl.from_pandas(pandas_df)

Streaming Large Files

import polars as pl

# Stream a file larger than RAM
result = (
    pl.scan_csv("100gb_file.csv")
    .filter(pl.col("status") == "active")
    .group_by("region")
    .agg(pl.col("revenue").sum())
    .collect(streaming=True)  # process in batches
)

# Process Parquet files in chunks
for chunk in pl.read_csv_batched("large.csv", batch_size=100_000):
    df = pl.concat(chunk) if isinstance(chunk, list) else chunk
    # process chunk...

Migrating from Pandas

import pandas as pd
import polars as pl

# Common Pandas → Polars translations

# Pandas: df[df["col"] > 5]
# Polars:
df.filter(pl.col("col") > 5)

# Pandas: df["col"].apply(func)
# Polars (prefer expressions):
df.with_columns(pl.col("col").map_elements(func, return_dtype=pl.Utf8))

# Pandas: df.groupby("col").agg({"val": "sum"})
# Polars:
df.group_by("col").agg(pl.col("val").sum())

# Pandas: df.merge(df2, on="key")
# Polars:
df.join(df2, on="key")

# Pandas: df["new"] = df["a"] + df["b"]
# Polars:
df.with_columns((pl.col("a") + pl.col("b")).alias("new"))

# Pandas: df.sort_values("col", ascending=False)
# Polars:
df.sort("col", descending=True)

# Pandas: df.dropna()
# Polars:
df.drop_nulls()

# Pandas: df.fillna(0)
# Polars:
df.fill_null(0)

Frequently Asked Questions

Does Polars replace Pandas completely?: Not yet in practice. Scikit-learn, Matplotlib, Seaborn, and many ML libraries expect Pandas DataFrames. Convert with df.to_pandas() at the integration boundary. Use Polars for all heavy data processing upstream, then convert to Pandas only for the final step that needs it.
How does Polars handle missing values differently than Pandas?: Polars uses a separate null validity bitmap instead of NaN sentinels. This means None is always null regardless of dtype, and operations on null values are explicitly defined. There is no ambiguity between NaN and None — a common source of bugs in Pandas.
Can I use Polars with Spark or Dask?: Polars is a single-machine library; it doesn't distributed. For truly big data (multi-TB), use PySpark or Dask. For data that fits on one machine (up to ~100GB with streaming), Polars is often faster than Spark due to lower overhead and no JVM startup cost.

Python Polars: Fast DataFrames for Modern Data Engineering

Table of Contents