Python Pandas: Data Manipulation and Analysis Guide

Pandas is the cornerstone of data analysis in Python, providing the DataFrame abstraction that makes tabular data manipulation intuitive and fast. Whether you're cleaning raw CSV files, aggregating sales data, merging datasets, or building time series analyses, pandas handles it with expressive, readable code. In 2026, pandas 2.x brings significant performance improvements through the Apache Arrow backend and Copy-on-Write semantics that eliminate an entire class of silent mutation bugs.

This guide covers the essential pandas operations every data engineer and data scientist needs: DataFrame creation and inspection, filtering and selection, groupby aggregations, merging datasets, pivot tables, time series operations, and performance tuning for large datasets.

DataFrame Basics and I/O
Selection, Filtering and loc/iloc
Data Cleaning and Transformation
GroupBy Aggregations
Merging and Joining DataFrames
Pivot Tables and Reshaping
Time Series Analysis
Performance Optimization

DataFrame Basics and I/O

A DataFrame is a 2D labeled data structure with columns of potentially different types — think of it as an in-memory spreadsheet with superpowers. Pandas can read from CSV, Excel, JSON, Parquet, SQL databases, and many other formats. Always inspect a new dataset with head(), info(), and describe() before analysis.

import pandas as pd
import numpy as np

# Create DataFrame from dict
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "department": ["Engineering", "Marketing", "Engineering", "HR", "Marketing"],
    "salary": [95000, 72000, 105000, 68000, 78000],
    "hire_date": pd.to_datetime(["2020-03-15", "2019-07-01", "2021-11-20", "2018-05-10", "2022-01-15"]),
    "active": [True, True, True, False, True],
})

# Read from CSV (most common)
# df = pd.read_csv("employees.csv", parse_dates=["hire_date"])

# Read from Parquet (fast columnar format for large data)
# df = pd.read_parquet("employees.parquet")

# Inspection
print(df.head(3))       # First 3 rows
print(df.shape)         # (5, 5)
print(df.dtypes)        # Column data types
print(df.info())        # Non-null counts and dtypes
print(df.describe())    # Statistics for numeric columns

# Save to various formats
df.to_csv("output.csv", index=False)
df.to_parquet("output.parquet", index=False)
df.to_excel("output.xlsx", index=False, sheet_name="Employees")

Pandas 2.x: Use dtype_backend="pyarrow" in read functions for 3–5× faster I/O and lower memory usage: pd.read_csv("file.csv", dtype_backend="pyarrow").

Selection, Filtering and loc/iloc

Pandas provides two primary indexers: loc (label-based) and (integer position-based). Use boolean masks for filtering rows by condition. For complex multi-condition filters, query() provides a cleaner string-based syntax.

# Column selection
names = df["name"]                    # Series
subset = df[["name", "salary"]]       # DataFrame with 2 columns

# Row selection by index label
row = df.loc[2]                       # Row with label 2
rows = df.loc[0:2, "name":"salary"]   # Rows 0-2, columns name through salary

# Row selection by integer position
first = df.iloc[0]                    # First row
block = df.iloc[1:3, 0:2]            # Rows 1-2, columns 0-1

# Boolean filtering
active = df[df["active"] == True]
high_earners = df[df["salary"] > 80000]
eng_active = df[(df["department"] == "Engineering") & (df["active"] == True)]

# query() — cleaner for complex filters
result = df.query("salary > 70000 and department == 'Marketing'")
result = df.query("salary.between(70000, 100000)", engine="python")

# isin() for membership tests
selected_depts = df[df["department"].isin(["Engineering", "Marketing"])]

# String operations
df["name_lower"] = df["name"].str.lower()
df["name_length"] = df["name"].str.len()
starts_a = df[df["name"].str.startswith("A")]

Data Cleaning and Transformation

Real-world data is messy — missing values, wrong types, duplicates, inconsistent strings. Pandas provides comprehensive tools to handle all of these. The apply() function lets you run arbitrary Python functions row- or column-wise, while vectorized string methods and map() handle common transformations without loops.

# Handling missing values
df_dirty = df.copy()
df_dirty.loc[1, "salary"] = np.nan
df_dirty.loc[3, "department"] = None

print(df_dirty.isnull().sum())         # Count nulls per column
df_clean = df_dirty.dropna()           # Drop rows with any null
df_filled = df_dirty.fillna({"salary": df_dirty["salary"].median(), "department": "Unknown"})

# Duplicates
df_dedup = df.drop_duplicates(subset=["name", "department"])
print(df.duplicated().sum())           # Count duplicate rows

# Type conversion
df["salary"] = df["salary"].astype(int)
df["hire_date"] = pd.to_datetime(df["hire_date"])

# Apply custom transformations
df["salary_band"] = df["salary"].apply(
    lambda x: "Senior" if x > 90000 else "Mid" if x > 70000 else "Junior"
)

# Vectorized operations (much faster than apply for numeric data)
df["salary_k"] = df["salary"] / 1000
df["years_employed"] = (pd.Timestamp.now() - df["hire_date"]).dt.days / 365

# Map values
dept_codes = {"Engineering": "ENG", "Marketing": "MKT", "HR": "HR"}
df["dept_code"] = df["department"].map(dept_codes)

# Rename columns
df = df.rename(columns={"hire_date": "start_date", "active": "is_active"})

GroupBy Aggregations

GroupBy is one of pandas' most powerful features, implementing the split-apply-combine pattern. Group rows by one or more columns, apply aggregation functions, and combine results. Use agg() with a dict to compute multiple statistics per column in a single pass.

# Basic groupby
dept_stats = df.groupby("department")["salary"].mean()
print(dept_stats)

# Multiple aggregations
dept_summary = df.groupby("department").agg(
    headcount=("name", "count"),
    avg_salary=("salary", "mean"),
    max_salary=("salary", "max"),
    min_salary=("salary", "min"),
    total_payroll=("salary", "sum"),
).round(2)
print(dept_summary)

# Multiple group keys
multi = df.groupby(["department", "active"])["salary"].mean()

# Transform: add aggregated values back to original rows
df["dept_avg_salary"] = df.groupby("department")["salary"].transform("mean")
df["salary_vs_dept_avg"] = df["salary"] - df["dept_avg_salary"]

# Filter groups: keep only depts with avg salary > 80k
high_pay_depts = df.groupby("department").filter(lambda g: g["salary"].mean() > 80000)

# Named aggregation with custom functions
result = df.groupby("department").agg(
    salary_range=("salary", lambda x: x.max() - x.min()),
    p75_salary=("salary", lambda x: x.quantile(0.75)),
)
print(result)

Merging and Joining DataFrames

Pandas merge() implements SQL-style joins: inner, left, right, and outer. Use concat() to stack DataFrames vertically (union) or horizontally. Understanding the difference prevents data duplication bugs that are common when joining on non-unique keys.

# Create related DataFrames
employees = pd.DataFrame({
    "emp_id": [1, 2, 3, 4],
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "dept_id": [10, 20, 10, 30],
})

departments = pd.DataFrame({
    "dept_id": [10, 20, 40],
    "dept_name": ["Engineering", "Marketing", "Finance"],
    "budget": [500000, 200000, 300000],
})

projects = pd.DataFrame({
    "emp_id": [1, 1, 2, 3],
    "project": ["Alpha", "Beta", "Alpha", "Gamma"],
})

# Inner join (only matching rows on both sides)
inner = employees.merge(departments, on="dept_id", how="inner")

# Left join (all employees, even without a matching dept)
left = employees.merge(departments, on="dept_id", how="left")

# Join on different column names
left2 = employees.merge(departments, left_on="dept_id", right_on="dept_id", how="left")

# Many-to-many: employee projects
with_projects = employees.merge(projects, on="emp_id", how="left")

# Concatenate (stack vertically)
q1 = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "sales": [100, 120, 95]})
q2 = pd.DataFrame({"month": ["Apr", "May", "Jun"], "sales": [110, 130, 88]})
full_year = pd.concat([q1, q2], ignore_index=True)

# Merge with indicator to find unmatched rows
merged = employees.merge(departments, on="dept_id", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

Pivot Tables and Reshaping

Pivot tables transform long-format data into wide-format summaries, identical to Excel pivot tables but far more powerful. melt() does the reverse — converting wide to long format, which is often required for plotting and machine learning pipelines.

sales_data = pd.DataFrame({
    "region": ["North", "North", "South", "South", "East", "East"],
    "product": ["Widget", "Gadget", "Widget", "Gadget", "Widget", "Gadget"],
    "quarter": ["Q1", "Q1", "Q1", "Q1", "Q1", "Q1"],
    "revenue": [12000, 8500, 9000, 11000, 7500, 13000],
    "units": [120, 85, 90, 110, 75, 130],
})

# Pivot table: regions as rows, products as columns, revenue as values
pivot = pd.pivot_table(
    sales_data,
    values="revenue",
    index="region",
    columns="product",
    aggfunc="sum",
    fill_value=0,
    margins=True,       # Add row/column totals
    margins_name="Total"
)
print(pivot)

# crosstab: frequency counts
ct = pd.crosstab(df["department"], df["active"], margins=True)

# Melt: wide → long format
wide = pd.DataFrame({"name": ["Alice", "Bob"], "jan_sales": [100, 200], "feb_sales": [150, 180]})
long = wide.melt(id_vars=["name"], var_name="month", value_name="sales")
print(long)

# stack / unstack
stacked = pivot.stack()    # Columns → row levels
unstacked = stacked.unstack("product")  # Row levels → columns

Time Series Analysis

Pandas has first-class time series support with DatetimeIndex, period arithmetic, resampling, and rolling windows. This makes it ideal for financial data, IoT telemetry, web analytics, and any data with a temporal dimension.

import pandas as pd
import numpy as np

# Create a time series
dates = pd.date_range(start="2026-01-01", end="2026-06-30", freq="D")
ts = pd.Series(np.random.randn(len(dates)).cumsum() + 100, index=dates, name="price")

# DatetimeIndex operations
print(ts["2026-03"])             # All of March
print(ts["2026-01":"2026-02"])  # Jan through Feb

# Resample: daily → monthly statistics
monthly = ts.resample("ME").agg(["mean", "std", "min", "max"])
weekly_avg = ts.resample("W").mean()

# Rolling windows
ts_df = ts.to_frame()
ts_df["ma_7"] = ts.rolling(window=7).mean()     # 7-day moving average
ts_df["ma_30"] = ts.rolling(window=30).mean()   # 30-day moving average
ts_df["std_7"] = ts.rolling(window=7).std()     # Rolling volatility

# Expanding windows (cumulative)
ts_df["cum_max"] = ts.expanding().max()

# Shift and diff for returns
ts_df["daily_return"] = ts.pct_change()         # Daily % return
ts_df["lag_1"] = ts.shift(1)                    # Previous day value

# Business day frequency
bdays = pd.date_range("2026-01-01", "2026-03-31", freq="B")

# Time zone handling
ts_utc = ts.tz_localize("UTC")
ts_ist = ts_utc.tz_convert("Asia/Kolkata")

Performance Optimization

Pandas is fast for moderate data sizes but requires attention to approach for large datasets. The key rules: avoid row-wise apply() when vectorized operations exist, use appropriate dtypes to reduce memory, and consider chunking or switching to Polars/DuckDB for datasets above ~1 million rows.

import pandas as pd
import numpy as np

# Use appropriate dtypes to save memory
df = pd.read_csv("large_file.csv")
print(df.memory_usage(deep=True).sum() / 1e6, "MB")

# Downcast numeric types
df["int_col"] = pd.to_numeric(df["int_col"], downcast="integer")
df["float_col"] = pd.to_numeric(df["float_col"], downcast="float")

# Use Categorical for low-cardinality string columns
df["department"] = df["department"].astype("category")
print(df.memory_usage(deep=True).sum() / 1e6, "MB")  # Often 50-70% smaller

# Vectorized > apply for numeric operations
# Slow: df["result"] = df["a"].apply(lambda x: x * 2 + 1)
# Fast:
df["result"] = df["a"] * 2 + 1

# Process large files in chunks
chunk_iter = pd.read_csv("huge_file.csv", chunksize=100_000)
results = []
for chunk in chunk_iter:
    # Process each chunk independently
    processed = chunk[chunk["value"] > 0].groupby("category")["value"].sum()
    results.append(processed)
final = pd.concat(results).groupby(level=0).sum()

# Use query() for filtering — often faster than boolean indexing for large DFs
large_df = pd.DataFrame(np.random.randn(1_000_000, 5), columns=list("ABCDE"))
result = large_df.query("A > 0 and B < 0.5")

# For very large data, use DuckDB or Polars
# import duckdb
# result = duckdb.query("SELECT dept, AVG(salary) FROM df GROUP BY dept").df()

Tip: Always use pd.read_parquet() instead of CSV for large datasets you read repeatedly. Parquet files are 5–10× smaller and load 10× faster due to columnar compression.