October 28, 2025
Python Pandas Data Science Data Transformation

Data Transformation with Pandas: Merge, Join, Group, Pivot, and Apply

You've cleaned your data. Columns are consistent, missing values are handled, duplicates are gone. But now what? Your data still might be spread across multiple tables, organized in ways that don't match your analysis questions, or packed into formats that need serious restructuring.

Here's the uncomfortable truth that nobody warns you about when you first start working with data: cleaning is the easy part. The real challenge, the thing that separates a data analyst who gets answers from one who spins their wheels, is transformation. Raw data almost never arrives in the shape your analysis needs. Sales records live in one table. Customer demographics live in another. Product metadata lives in a third. And even when you manage to pull it all together, you still need to reshape, aggregate, and restructure before any meaningful pattern becomes visible.

This is where data transformation comes in. It's the bridge between raw (or cleaned) data and insights. And in pandas, you've got a toolkit that rivals SQL, sometimes surpassing it, because you have the full power of Python behind every operation. Unlike SQL, which is declarative and somewhat rigid, pandas lets you combine transformations in ways that would require multiple nested subqueries in a database. You can chain merges, group them, pivot the result, and apply custom logic in a single readable pipeline.

Let's walk through the core transformation operations: merging, grouping, pivoting, and the catch-all apply(). By the end, you'll understand when to reach for each tool and why. More importantly, you'll understand the mental model behind each operation, because once that clicks, you won't need to memorize syntax. You'll just think in transformations.

Table of Contents
  1. The Problem: Your Data Lives in Multiple Tables
  2. Merge: Combining DataFrames with Keys
  3. Basic Merge (Inner Join)
  4. The Four Join Types (and How They Map to SQL)
  5. Merging on Different Column Names
  6. Multi-Key Merges
  7. Merge vs Join vs Concat: When to Use Which
  8. Concat: Stacking DataFrames
  9. GroupBy: Aggregating and Transforming
  10. Simple Aggregation
  11. Multiple Aggregations
  12. Custom Aggregation Functions
  13. GroupBy Internals: What's Actually Happening
  14. Transform: Group-Relative Calculations
  15. Pivot Tables: Reshaping for Analysis
  16. Basic Pivot
  17. Pivot Table (with Aggregation)
  18. Pivot Table Patterns
  19. Melt: Unpivoting
  20. Apply: Custom Functions Across Data
  21. Apply to Rows
  22. Apply to Columns
  23. Apply with GroupBy
  24. When NOT to Use Apply
  25. Stack and Unstack: Pivoting Index Levels
  26. Performance Tips for Large DataFrames
  27. Putting It Together: A Real Example
  28. Summary

The Problem: Your Data Lives in Multiple Tables

Imagine you're analyzing sales data. You've got:

  • A customers DataFrame with customer IDs, names, and regions
  • An orders DataFrame with order IDs, customer IDs, and amounts
  • A products DataFrame with product IDs and categories

Your analysis needs all three. How do you combine them? That's where merge comes in.

This isn't a contrived scenario; it's the reality of almost every real-world dataset you'll encounter. Databases are designed around normalization principles that deliberately split data across tables to reduce redundancy. That's great for storage efficiency, but it means you'll constantly need to reassemble pieces before you can answer questions. Pandas gives you three main tools for combining DataFrames: merge() for key-based joins, concat() for stacking, and join() for index-based alignment. Let's start with the one you'll use most.

Merge: Combining DataFrames with Keys

The merge() function is pandas' answer to SQL's JOIN. You specify two DataFrames and the columns they share, and pandas sticks them together. Think of it as a lookup: "for each row in the left table, find the matching row in the right table based on this shared key." That's the core mental model, and it'll serve you well whether you're doing inner joins, left joins, or anything else.

Basic Merge (Inner Join)

The default behavior of merge() is an inner join, which means you only get rows where the key exists in both DataFrames. Before you look at the code, picture what that means physically: rows whose key doesn't appear in the other DataFrame get silently dropped.

python
import pandas as pd
 
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Carol'],
    'region': ['North', 'South', 'East']
})
 
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 1, 3],
    'amount': [150, 200, 300]
})
 
# Inner join: only customers with orders
merged = pd.merge(customers, orders, on='customer_id')
print(merged)

Output:

   customer_id     name region  order_id  amount
0            1    Alice  North       101     150
1            1    Alice  North       102     200
2            3    Carol   East       103     300

Notice what happened: Bob (customer_id=2) disappeared. With an inner join, you only get rows where the key exists in both DataFrames. That's useful when you want to filter to "customers with orders." Also notice that Alice appears twice, once for each of her two orders. The merge expands rows on the left to match every corresponding row on the right. This one-to-many relationship is exactly how you'd expect it to work if you were doing this manually in a spreadsheet, and it's one of the behaviors that catches newcomers off guard the first time they see their row count increase after a merge.

The Four Join Types (and How They Map to SQL)

Here's where most people get confused. Let me show you a SQL-to-Pandas translation table:

Join Type   SQL               Pandas        Result
Inner       INNER JOIN        how='inner'   Only matching keys in both tables
Left        LEFT JOIN         how='left'    All rows from left, matching from right (NaN if no match)
Right       RIGHT JOIN        how='right'   All rows from right, matching from left (NaN if no match)
Outer       FULL OUTER JOIN   how='outer'   All rows from both tables (NaN where no match)

The mental model for choosing which join type to use is straightforward once you know it: ask yourself "which table's rows am I allowed to lose?" If you can't lose any rows from either side, use outer. If you can't lose rows from the left table but you don't care about unmatched rows on the right, use left. If only matched rows matter for your analysis, use inner. Let's see these in action:

python
# Left join: keep all customers, even those without orders
left_merged = pd.merge(customers, orders, on='customer_id', how='left')
print(left_merged)

Output:

   customer_id   name region  order_id  amount
0            1  Alice  North     101.0   150.0
1            1  Alice  North     102.0   200.0
2            2    Bob  South       NaN     NaN
3            3  Carol   East     103.0   300.0

Bob's now included, but his order columns are NaN because he has no orders. This is the pattern you use when you want to preserve all records from your "left" table (the first argument). Notice something subtle here: amount changed from integer type to float. That's because NaN can't exist in a regular integer column in pandas, so the entire column gets promoted to float to accommodate the missing value. This matters for downstream calculations; you may need to fill or cast those NaN values first.

python
# Outer join: all customers AND all orders
outer_merged = pd.merge(customers, orders, on='customer_id', how='outer')
print(outer_merged)

Output:

   customer_id   name region  order_id  amount
0            1  Alice  North     101.0   150.0
1            1  Alice  North     102.0   200.0
2            2    Bob  South       NaN     NaN
3            3  Carol   East     103.0   300.0

In this case, outer looks the same as left because both tables happen to share the same customer IDs. But if there were an order for an unknown customer (customer_id=999), it would appear with NaN for name and region. Outer joins are most useful for auditing and data quality work: they let you see everything that matched and everything that didn't, which is exactly what you want when you're trying to figure out why your row counts don't add up the way you expected.
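For that kind of audit, merge's indicator=True option is worth knowing: it adds a _merge column telling you which side each row came from. A short sketch, reusing the article's tables plus one deliberately unmatched order (customer_id=999 is invented for illustration):

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Carol'],
    'region': ['North', 'South', 'East']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 1, 3, 999],  # 999 has no matching customer
    'amount': [150, 200, 300, 50]
})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
audit = pd.merge(customers, orders, on='customer_id', how='outer', indicator=True)
print(audit[['customer_id', 'name', 'order_id', '_merge']])
print(audit['_merge'].value_counts())
```

Filtering on _merge == 'left_only' (or 'right_only') gives you exactly the rows that failed to match, which is usually the first thing you want to inspect.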

Merging on Different Column Names

Sometimes the columns you're joining on have different names. That's what the left_on and right_on parameters are for.

This situation comes up constantly in practice, especially when you're working with data from different systems or teams that didn't coordinate their naming conventions. The left_on and right_on parameters handle this gracefully; you don't need to rename columns before merging.

python
orders_renamed = orders.rename(columns={'customer_id': 'cust_id'})
 
merged = pd.merge(
    customers,
    orders_renamed,
    left_on='customer_id',
    right_on='cust_id',
    how='inner'
)

One thing to watch out for: when you use left_on and right_on with different names, both columns end up in the result. You'll have both customer_id and cust_id in your merged DataFrame, which can create confusion. It's usually worth dropping one of them afterward with merged.drop(columns=['cust_id']).

Multi-Key Merges

Real-world data often needs merges on multiple columns. This is less common than single-key merges, but when you need it, you really need it. The classic case is when your data has a natural composite key, like a combination of date and region that uniquely identifies a record, and you need to match on both dimensions simultaneously.

python
# Imagine both tables have 'customer_id' AND 'region'
merged = pd.merge(
    customers,
    orders,
    on=['customer_id', 'region'],
    how='inner'
)

Passing a list to the on parameter tells pandas that both columns must match for a row to be included. This is more restrictive than a single-key merge and can dramatically reduce your result set if either key has discrepancies between tables. Always check your row counts after a multi-key merge to make sure you haven't accidentally dropped data you needed.
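One way to make that row-count check automatic is merge's validate parameter, which raises a MergeError if the key relationship isn't what you declared. A sketch with the single-key tables from earlier; the idea carries over to multi-key merges unchanged:

```python
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2, 3],
                          'name': ['Alice', 'Bob', 'Carol']})
orders = pd.DataFrame({'order_id': [101, 102, 103],
                       'customer_id': [1, 1, 3],
                       'amount': [150, 200, 300]})

# validate='one_to_many' asserts each left-side key is unique;
# pandas raises MergeError if that expectation is violated
merged = pd.merge(customers, orders, on='customer_id',
                  how='inner', validate='one_to_many')
print(len(orders), '->', len(merged))  # compare row counts after the merge
```

If a "unique" key turns out to have duplicates, you find out at the merge itself rather than three steps later when an aggregate looks suspiciously large.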

Merge vs Join vs Concat: When to Use Which

This is the question that trips up even experienced pandas users, so let's settle it clearly. All three operations combine DataFrames, but they work on fundamentally different principles.

merge() is your go-to for key-based joining. You have a column (or columns) in both DataFrames that represent the same entity, and you want to align rows based on those values. This is the pandas equivalent of SQL's JOIN, and it's the most semantically meaningful of the three operations because you're saying "these two rows belong together because they share this identifier." Use merge when your DataFrames represent related entities, customers and their orders, products and their categories, employees and their departments.

concat() is about physical stacking, not logical alignment. You're not saying "these rows are related", you're saying "these rows are the same kind of thing and I want them in one place." The typical use case is combining DataFrames that have the same schema: monthly sales reports that all have the same columns but cover different time periods, or data exports from different regional offices that follow the same format. When you concat, pandas aligns by column name (for axis=0) or by row index (for axis=1), and fills NaN wherever a column or row doesn't exist on one side.

join() is essentially a shorthand for merge when you want to join on the index rather than on a regular column. It's less commonly needed in practice, but when your DataFrames are already indexed by a meaningful identifier (like customer_id or date), join() can be more concise than merge(). Under the hood, it's calling merge() anyway.

The practical rule of thumb: if you're thinking "match these rows by a shared identifier," use merge. If you're thinking "stack these tables on top of each other," use concat. If your identifier is the index, use join. When in doubt, merge is the most explicit and readable option, so default to that.
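To make the contrast concrete, here's a minimal sketch of join() with the article's customer and order data re-indexed by customer_id:

```python
import pandas as pd

customers = pd.DataFrame(
    {'name': ['Alice', 'Bob', 'Carol'], 'region': ['North', 'South', 'East']},
    index=pd.Index([1, 2, 3], name='customer_id')
)
orders = pd.DataFrame(
    {'amount': [150, 200, 300]},
    index=pd.Index([1, 1, 3], name='customer_id')
)

# join() aligns on the index; how='left' is the default
joined = customers.join(orders)
print(joined)
```

The result is the same as a left merge on customer_id: Alice appears twice, Bob gets NaN for amount, and no key columns clutter the output because the key lives in the index.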

Concat: Stacking DataFrames

While merge() is about joining tables horizontally (side by side), concat() is about stacking them vertically or horizontally by position.

The most common scenario for concat is combining data that was split across multiple files. Maybe you downloaded twelve monthly CSVs from your company's reporting system, and you need to analyze the whole year at once. Each file has the same columns, and you just want them stacked into one tall DataFrame. That's concat at its most natural.

python
df1 = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})
 
df2 = pd.DataFrame({
    'A': [5, 6],
    'B': [7, 8]
})
 
# Stack vertically (default axis=0)
stacked = pd.concat([df1, df2], axis=0)
print(stacked)

Output:

   A  B
0  1  3
1  2  4
0  5  7
1  6  8

Notice the index repeats (0, 1, 0, 1). Use ignore_index=True to reset it.

When you see that repeating index, pandas is telling you something important: each original DataFrame had its own index, and concat preserved both. This can cause subtle bugs if you later try to access rows by index label; you'll get multiple rows instead of one. The ignore_index=True parameter prevents this by generating a brand new sequential index across the combined result.

python
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
print(stacked)

Output:

   A  B
0  1  3
1  2  4
2  5  7
3  6  8

For horizontal stacking (side-by-side):

python
# Stack horizontally (axis=1)
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)

Output:

   A  B  A  B
0  1  3  5  7
1  2  4  6  8

Horizontal concat is less common but useful when you have computed feature sets that need to be combined into a single feature matrix. Just be aware that when column names clash, as they do here with A and B appearing twice, you'll end up with duplicate column names, which is a pandas anti-pattern that can cause unexpected behavior. You'll usually want to rename columns before or after a horizontal concat.

Use concat() when combining similar datasets (like monthly reports). Use merge() when joining on shared keys.

GroupBy: Aggregating and Transforming

Now you've got your data in one place. But it's probably too granular. You need to group it: sum sales by region, count orders by customer, calculate mean values. That's groupby().

GroupBy is one of the most powerful and frequently used operations in the entire pandas library, and it's worth spending some real time understanding how it works. The conceptual model is called "split-apply-combine," a phrase coined by Hadley Wickham in the context of R's dplyr. The idea is simple but profound: you split your DataFrame into groups based on some criterion, you apply a function to each group independently, and then you combine the results back into a single output. Every groupby operation follows this pattern, even when it's not obvious.

Simple Aggregation

python
orders_detailed = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 3],
    'region': ['North', 'North', 'South', 'South', 'East'],
    'amount': [150, 200, 100, 300, 250]
})
 
# Sum by region
by_region = orders_detailed.groupby('region')['amount'].sum()
print(by_region)

Output:

region
East      250
North     350
South     400
Name: amount, dtype: int64

What just happened here? Pandas identified three distinct groups in the region column (East, North, South), extracted the amount values for each group, and computed the sum for each. The result is a Series indexed by the group label. Notice the result is sorted alphabetically; groupby always sorts the group keys by default. If you're working with large datasets and don't need sorted output, passing sort=False to groupby can give you a small performance boost.

Multiple Aggregations

python
# Count orders and sum amounts by region
summary = orders_detailed.groupby('region')['amount'].agg(['sum', 'count', 'mean'])
print(summary)

Output:

        sum  count   mean
region
East    250      1  250.0
North   350      2  175.0
South   400      2  200.0

The agg() method (short for "aggregate") is where groupby really opens up. You can pass a list of function names as strings, and pandas will compute all of them in one pass. The result is a DataFrame with one column per aggregation function. This is far more efficient than running separate groupby operations for sum, count, and mean; pandas only needs to split the data into groups once.
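A related form worth knowing is named aggregation, where keyword arguments to agg() both pick the function and name the output column. A sketch using the same orders_detailed data (the output names total, n_orders, and avg are arbitrary choices):

```python
import pandas as pd

orders_detailed = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 3],
    'region': ['North', 'North', 'South', 'South', 'East'],
    'amount': [150, 200, 100, 300, 250]
})

# Named aggregation: each keyword is (column, function),
# and the keyword itself becomes the output column name
summary = orders_detailed.groupby('region').agg(
    total=('amount', 'sum'),
    n_orders=('amount', 'count'),
    avg=('amount', 'mean'),
)
print(summary)
```

This avoids the generic 'sum'/'count'/'mean' column names and keeps downstream code readable.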

Custom Aggregation Functions

python
# Custom: range (max - min)
def range_val(x):
    return x.max() - x.min()
 
summary = orders_detailed.groupby('region')['amount'].agg(range_val)
print(summary)

Output:

region
East      0
North    50
South    200
Name: amount, dtype: int64

When the built-in aggregation functions don't cover your use case, you can pass a custom function to agg(). That function receives each group's data as a Series (or DataFrame if you're aggregating multiple columns) and should return a scalar. Here, range_val computes the spread between the maximum and minimum order amounts within each region. The East region has only one order, so its range is 0. South has a much wider spread: 300 versus 100.

GroupBy Internals: What's Actually Happening

Understanding what groupby does under the hood will help you use it more effectively and debug it when things go wrong. When you call groupby('region'), pandas doesn't actually compute anything yet. It creates a GroupBy object, essentially a lazy blueprint that says "I know how to split this DataFrame by region when asked." This deferred execution is intentional; it lets pandas optimize the computation when you actually request a result.

When you then call .sum() or .agg() on that GroupBy object, pandas activates the split-apply-combine pipeline. Internally, it builds a hash table that maps each unique group key to the indices of rows belonging to that group. Then it iterates through those groups, passing each subset to the aggregation function, and collects the results. For built-in functions like sum, count, and mean, this is done in compiled C code and is extremely fast. For custom Python functions passed to agg() or apply(), the performance is slower because Python's interpreter overhead applies to each group.

One subtle but important behavior: groupby skips NaN values in the grouping columns by default. If a row has NaN in the column you're grouping by, that row won't appear in any group and will be silently excluded from your results. This can cause mysterious row count discrepancies. You can change this with groupby('region', dropna=False) if you need NaN to form its own group.
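A quick sketch of that NaN behavior, using a small hypothetical table with one missing region:

```python
import pandas as pd

df = pd.DataFrame({'region': ['North', 'South', None, 'North'],
                   'amount': [100, 200, 50, 150]})

# Default: the row with a missing region silently vanishes from the result
s_default = df.groupby('region')['amount'].sum()
print(s_default)   # only North and South appear; 50 is unaccounted for

# dropna=False keeps NaN as its own group, so every row is counted
s_keep = df.groupby('region', dropna=False)['amount'].sum()
print(s_keep)      # North, South, and NaN
```

Comparing the two totals (450 versus 500) is exactly the kind of discrepancy that default NaN-dropping produces.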

Transform: Group-Relative Calculations

Sometimes you don't want to collapse to one row per group. You want to keep the original rows but add a group-level calculation to each. That's transform().

The distinction between agg() and transform() is fundamental and worth internalizing: aggregation reduces many rows to one per group; transform preserves the original row count and broadcasts the group-level result back to every row in that group. This is the operation you need for "what percentage of the regional total does each order represent?" or "how does each observation compare to the group mean?"

python
# Add a column: how much does each order represent as a % of its region's total?
orders_detailed['pct_of_region'] = (
    orders_detailed.groupby('region')['amount']
    .transform('sum')
)
orders_detailed['pct_of_region'] = (
    orders_detailed['amount'] / orders_detailed['pct_of_region'] * 100
)
print(orders_detailed)

Output:

   customer_id region  amount  pct_of_region
0            1  North     150       42.857143
1            1  North     200       57.142857
2            2  South     100       25.000000
3            2  South     300       75.000000
4            3   East     250      100.000000

Each row keeps its original amount but now includes a percentage of its region's total. This is perfect for "ratio to aggregate" calculations. The first Alice order represents about 43% of all North sales, while the second represents 57%. This kind of within-group normalization is extremely common in feature engineering for machine learning; before you train a model on sales data, you often want to express each observation relative to its peer group rather than in raw absolute terms.
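The same pattern handles the "compare to the group mean" question. Here's a sketch computing a per-region z-score; note that a single-row group like East gets NaN, because a sample standard deviation needs at least two values:

```python
import pandas as pd

orders_detailed = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 3],
    'region': ['North', 'North', 'South', 'South', 'East'],
    'amount': [150, 200, 100, 300, 250]
})

# Broadcast the group mean and std back to every row, then normalize
group = orders_detailed.groupby('region')['amount']
orders_detailed['z_in_region'] = (
    (orders_detailed['amount'] - group.transform('mean')) / group.transform('std')
)
print(orders_detailed)
```

Each z-score says how many regional standard deviations an order sits above or below its region's average, which is a group-relative feature you could feed straight into a model.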

Pivot Tables: Reshaping for Analysis

Sometimes you need to see data in a wide format. Pivot tables reshape your data so that unique values from one column become new columns. This operation is so fundamental to data analysis that it's been a core feature of Excel for decades. Pandas brings that same power into your code, with the added benefit that it's programmable and reproducible.

The core mental model: a pivot takes a "long" table (many rows, few columns) and transforms it into a "wide" table (fewer rows, many columns). Long format has each observation as its own row: date 2024-01-01, region North, amount 100. Wide format has one row per date, with separate columns for each region. Neither format is universally better; different analyses require different shapes, and being comfortable pivoting between them is an essential skill.

Basic Pivot

python
sales = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'region': ['North', 'South', 'North', 'South'],
    'amount': [100, 200, 150, 250]
})
 
# Pivot: rows=date, columns=region, values=amount
pivoted = sales.pivot(index='date', columns='region', values='amount')
print(pivoted)

Output:

region      North  South
date
2024-01-01    100    200
2024-01-02    150    250

The pivot() function requires that every combination of index and columns values is unique, one row per date-region pair. If that condition isn't met, you'll get a ValueError. This is actually a useful constraint: it tells you immediately if your data has unexpected duplicates.

Pivot Table (with Aggregation)

When you have duplicate index/column combinations, use pivot_table() and specify an aggregation function:

python
sales_with_duplicates = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-01', '2024-01-02'],
    'region': ['North', 'North', 'South', 'South'],
    'amount': [100, 50, 200, 250]
})
 
# Two sales in North on 2024-01-01, so sum them
pivoted = pd.pivot_table(
    sales_with_duplicates,
    index='date',
    columns='region',
    values='amount',
    aggfunc='sum'
)
print(pivoted)

Output:

region        North  South
date
2024-01-01    150.0  200.0
2024-01-02      NaN  250.0

Notice the NaN in the North column for 2024-01-02: there's no North sale on that date, so the cell is empty. This is the expected and correct behavior. You can fill these with fill_value=0 if zeros make more semantic sense than NaN for your analysis. Also notice that pivot_table() is more forgiving than pivot(); it handles duplicates gracefully by aggregating them, which makes it the safer default choice when you're not certain about the uniqueness of your data.

Pivot Table Patterns

Pivot tables shine in a few specific scenarios worth knowing about. The first is cross-tabulation analysis: you want to see how two categorical variables interact. How many orders did each customer make in each region? How does product category performance vary by month? Pivot tables turn these questions into immediately readable matrices.

The second pattern is feature matrix construction for machine learning. When you have a long-format log of user actions (one row per event), and you want to build a feature vector per user (one row per user, with columns representing different event types and their counts), pivot_table() with aggfunc='count' is your tool. This transformation is called "widening" the data, and it's a common preprocessing step before fitting a model.
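A minimal sketch of that widening step, with a hypothetical event log (the user_id, event, and ts columns are invented for illustration):

```python
import pandas as pd

# Hypothetical event log: one row per user action
events = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'event':   ['view', 'view', 'purchase', 'view', 'cart', 'view'],
    'ts':      range(6),   # any non-null column to count
})

# One row per user, one column per event type, each cell = occurrence count
features = pd.pivot_table(
    events, index='user_id', columns='event',
    values='ts', aggfunc='count', fill_value=0
)
print(features)
```

The result has one feature vector per user, with fill_value=0 turning "never did this" into an explicit zero rather than NaN.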

The third pattern is report generation. Business stakeholders often want to see data in a wide tabular format: sales by month across the top, regions down the side. Pivot tables produce that format directly, and the result can be exported to Excel or formatted as a markdown table with minimal additional work.

Melt: Unpivoting

The inverse of pivot is melt(). It takes wide data and makes it long.

Most visualization libraries and statistical models expect long-format data. Seaborn, for example, works best when each row represents one observation with a category label in a column, not when each category has its own column. Melt is the operation that converts wide data back into the long format that these tools expect.

python
wide = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02'],
    'North': [100, 150],
    'South': [200, 250]
})
 
# Melt: turn region columns into rows
melted = pd.melt(wide, id_vars='date', var_name='region', value_name='amount')
print(melted)

Output:

         date region  amount
0  2024-01-01  North     100
1  2024-01-02  North     150
2  2024-01-01  South     200
3  2024-01-02  South     250

This is the long format, the format pandas (and most statistical libraries) prefer. The id_vars parameter specifies which columns to keep as identifiers (they'll be repeated for each row), while the remaining columns get "melted" into key-value pairs. The var_name and value_name parameters let you name the resulting columns meaningfully. Together, pivot and melt give you complete control over the shape of your data, letting you reshape it to match whatever structure your downstream analysis requires.

Apply: Custom Functions Across Data

Sometimes groupby, transform, and pivot don't cover your use case. You need a custom function applied to every row, column, or group. That's apply().

Before we dive into the syntax, I want to be direct about something: apply() is a double-edged sword. It's extraordinarily flexible; you can make it do almost anything. But that flexibility comes at a cost: apply runs your function in pure Python, one row or group at a time, which means it can be 10x to 100x slower than equivalent vectorized operations. Use it when you need it, but always ask yourself first: "Can I accomplish this with vectorized operations instead?" More on that below.

Apply to Rows

python
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
 
# Apply a function across each row
result = df.apply(lambda row: row['A'] + row['B'], axis=1)
print(result)

Output:

0    5
1    7
2    9
dtype: int64

With axis=1, your function receives each row as a Series, and you can access any column by name. The function runs once per row and its return value becomes an element in the result Series. This is extremely readable and flexible, but notice that this particular example is silly: df['A'] + df['B'] would produce identical results in a single vectorized operation. Apply makes sense here for demonstration purposes, but in production code, always check whether arithmetic or string operations can be done directly on columns first.

Apply to Columns

python
# Apply across columns
result = df.apply(lambda col: col.mean(), axis=0)
print(result)

Output:

A    2.0
B    5.0
dtype: float64

With axis=0, your function receives each column as a Series and runs once per column. Again, this particular example is redundant: df.mean() does the same thing more efficiently. But when you have truly custom per-column logic that can't be expressed as a simple operation, this is the pattern to use.

Apply with GroupBy

python
orders_detailed.groupby('region').apply(
    lambda group: group.nlargest(1, 'amount')
)

This finds the largest order per region. apply() here receives each group as a DataFrame and can return anything. This is where apply genuinely earns its place: when you need to perform complex, multi-column logic on each group independently, and the result for each group is itself a DataFrame or Series (not just a scalar). Finding the top-N rows per group, performing custom normalizations, or computing group-level statistics that don't map cleanly to built-in aggregations; these are the legitimate use cases for apply with groupby.

When NOT to Use Apply

Here's the critical insight: apply() is slow. It's Python-level, not vectorized. Prefer vectorized operations:

python
# Slow (apply)
df['new_col'] = df.apply(lambda row: row['A'] * 2, axis=1)
 
# Fast (vectorized)
df['new_col'] = df['A'] * 2

Use apply only when you truly need custom logic that can't be vectorized. The performance difference isn't academic: on a DataFrame with a million rows, a vectorized multiplication might take 2 milliseconds while the equivalent apply might take 2 seconds. That's a 1000x difference. At small scales it doesn't matter; at production scale with large datasets, it absolutely does.

The general hierarchy to follow: first, try to express your operation as arithmetic or comparison operators directly on columns. If that doesn't work, check if there's a built-in pandas method (str.replace, dt.floor, clip, where, etc.). If no built-in works, use map() or apply() on a single column rather than apply(axis=1) on the whole DataFrame. And if you truly need row-level custom logic, apply(axis=1) is your last resort, not your first instinct.
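Here's that hierarchy in miniature: a row-wise apply() and a vectorized np.where() computing the same label (the 'tier' names and the 200 threshold are invented for the sketch):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'amount': [150, 200, 100, 300, 250]})

# Last resort: a Python function call per row
df['tier_slow'] = df.apply(
    lambda row: 'big' if row['amount'] >= 200 else 'small', axis=1
)

# Better: one vectorized np.where over the whole column at once
df['tier_fast'] = np.where(df['amount'] >= 200, 'big', 'small')

print(df)
```

Both columns come out identical; the difference is that the second version does no per-row Python work, which is exactly what the hierarchy above is steering you toward.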

Stack and Unstack: Pivoting Index Levels

DataFrames can have multi-level indexes and columns. stack() and unstack() pivot these levels.

Multi-level (hierarchical) indexes are a powerful pandas feature that you encounter more often than you might expect, especially after groupby operations that group by multiple columns simultaneously. Stack and unstack let you move data between the row index and the column index, which is a finer-grained form of the pivoting we discussed earlier.

python
df = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
}, index=['X', 'Y'])
 
# Stack: column level becomes index level
stacked = df.stack()
print(stacked)

Output:

X  A    1
   B    3
Y  A    2
   B    4
dtype: int64

python
# Unstack: bring index level back to columns
unstacked = stacked.unstack()
print(unstacked)

Output:

   A  B
X  1  3
Y  2  4

Stack and unstack are particularly useful when you have the result of a multi-column groupby and want to reshape it for a specific analysis or visualization. They're also the underlying operations that make pivot and melt possible, pandas implements those higher-level functions using stack/unstack internally.
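A short sketch of that groupby-then-unstack pattern, using the sales table from the pivot section; unstacking the inner index level produces the same wide table a pivot would:

```python
import pandas as pd

sales = pd.DataFrame({
    'date':   ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
    'region': ['North', 'South', 'North', 'South'],
    'amount': [100, 200, 150, 250],
})

# Grouping by two keys yields a Series with a two-level MultiIndex...
totals = sales.groupby(['date', 'region'])['amount'].sum()

# ...and unstack() moves the inner level (region) into the columns
wide = totals.unstack('region')
print(wide)
```

This is often the quickest route from a multi-key groupby result to something chart-ready, without reaching for pivot_table at all.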

Performance Tips for Large DataFrames

You've got powerful tools now. Here's how to use them responsibly at scale. When your DataFrames get into the millions of rows, the choices you make about which operations to use start to matter a great deal.

The single most impactful performance choice you can make is to minimize the use of Python-level loops and apply operations on large datasets. Every time you write apply(lambda row: ..., axis=1), Python's interpreter executes your function once per row. For a million rows, that's a million function calls with all the overhead that entails. Vectorized operations implemented in C (arithmetic on columns, built-in aggregations, string operations via the .str accessor) sidestep this entirely and operate on whole arrays at once in optimized native code.

Memory is the other major consideration. Pandas DataFrames load entirely into RAM by default. When you merge two large DataFrames, the intermediate result can be much larger than either input, especially with outer joins that generate NaN-filled rows. Being deliberate about which join type you use (inner vs. outer) can mean the difference between a pipeline that runs and one that crashes with a MemoryError. Similarly, reducing column dtypes before a merge (int64 to int32, float64 to float32, object strings to categorical) can cut memory usage in half or more.

1. Avoid apply() on large datasets. Vectorize when possible:

python
# Instead of:
df['new'] = df.apply(lambda row: some_function(row), axis=1)
 
# Try:
df['new'] = df['col1'].apply(some_function)  # Column-only

2. Use merge() with appropriate join types. An inner join on large tables can save memory:

python
# If you only need matched rows:
result = pd.merge(big_df1, big_df2, how='inner')
 
# Not:
result = pd.merge(big_df1, big_df2, how='outer')

3. Groupby is fast; use it. It's vectorized and efficient:

python
# Groupby is better than looping and filtering:
result = df.groupby('region')['amount'].sum()
 
# Not:
result = {}
for region in df['region'].unique():
    result[region] = df[df['region'] == region]['amount'].sum()

4. Consider memory before concat(). Stacking many large DataFrames consumes memory quickly:

python
# If you have 100 CSVs, don't load all then concat:
dfs = [pd.read_csv(f) for f in files]
result = pd.concat(dfs)  # All in memory at once
 
# Instead, process incrementally or use chunks
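One way to process incrementally is the chunksize parameter on read_csv(), which yields the file in pieces. A sketch using an in-memory CSV string to stand in for a large file:

```python
import io
import pandas as pd

# Inline CSV standing in for a file too large to load at once
csv_data = "region,amount\nNorth,100\nSouth,200\nNorth,50\nEast,25\n"

# Aggregate chunk by chunk, combining partial results as we go
totals = pd.Series(dtype='float64')
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=2):
    partial = chunk.groupby('region')['amount'].sum()
    totals = totals.add(partial, fill_value=0)

print(totals.to_dict())
```

Only one chunk lives in memory at a time; the running totals Series stays small no matter how big the file is.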

5. Use inplace=True carefully. Despite the name, it rarely saves memory in practice (pandas typically still makes an internal copy), and it makes long scripts harder to debug; explicit assignment is usually the better choice.

Another tip worth mentioning: when you know in advance that a column will be used as a groupby or merge key repeatedly, consider setting it as the DataFrame index with set_index(). Pandas uses hash-based lookups on the index, which can be significantly faster than scanning a regular column. For merge operations specifically, pre-sorting on the join key before the merge can help with certain join algorithms. And if you're doing repeated groupby operations on the same column, the Categorical dtype can provide dramatic speedups because pandas can represent and compare category labels as integers internally rather than as strings.
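Here's a sketch of the categorical-key idea, using synthetic data; the timing benefit grows with the number of rows, but the results are identical either way:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    'region': rng.choice(['North', 'South', 'East', 'West'], size=n),
    'amount': rng.random(n),
})

# Same aggregation, with the key converted to a categorical dtype
df_cat = df.assign(region=df['region'].astype('category'))

plain = df.groupby('region')['amount'].sum()
fast = df_cat.groupby('region', observed=True)['amount'].sum()

# Identical totals; the categorical version compares integer codes internally
print(np.allclose(plain.sort_index().values, fast.sort_index().values))
```

observed=True restricts the output to categories actually present, which is usually what you want when grouping on a categorical key.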

Putting It Together: A Real Example

Let's say you have customer sales data and you want the total sales per region, but only for regions with more than 2 orders, ranked by total:

This kind of chained query is where pandas really shows its elegance. Rather than creating intermediate variables at each step, you can read the entire transformation as a single narrative: "start with sales data, group by region, compute total and count, keep only the groups with more than 2 orders, sort by total descending." The code maps almost directly to that English description.

python
sales = pd.DataFrame({
    'region': ['North', 'North', 'North', 'South', 'South', 'East'],
    'amount': [100, 150, 200, 300, 200, 50]
})
 
result = (
    sales
    .groupby('region')
    .agg(total=('amount', 'sum'), count=('amount', 'count'))
    .query('count > 2')
    .sort_values('total', ascending=False)
)
print(result)

Output:

       total  count
region
North    450      3

One query. Chained operations. Clean and fast. The named aggregation syntax, total=('amount', 'sum'), is particularly clean; it lets you name the output columns while specifying the aggregation, all in one place. And query() reads almost like English, which makes the intent of the filter immediately clear to anyone reading the code later.

The method-chaining style shown here (using parentheses to wrap a multi-line chain) is a pandas best practice for complex transformations. It's more readable than assigning to intermediate variables, and it makes the data transformation pipeline visible as a single coherent unit. When something goes wrong, you can add a .pipe(print) or temporarily comment out a step to inspect the intermediate state.
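That debugging trick can look like this, with a hypothetical peek() helper that prints an intermediate frame and passes it along unchanged:

```python
import pandas as pd

def peek(df, label=''):
    """Print an intermediate DataFrame and return it, so the chain continues."""
    print(f"--- {label} ---")
    print(df)
    return df

sales = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'amount': [100, 150, 300],
})

result = (
    sales
    .groupby('region')
    .agg(total=('amount', 'sum'))
    .pipe(peek, label='after agg')      # inspect without breaking the chain
    .sort_values('total', ascending=False)
)
```

Because peek() returns the DataFrame it receives, you can drop it into any point in the chain and remove it later without touching the surrounding steps.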

Summary

You've now got the core tools for data transformation:

  • Merge to join tables on shared keys (inner, left, right, outer)
  • Concat to stack DataFrames vertically or horizontally
  • GroupBy to aggregate and summarize by groups
  • Transform to add group-level calculations to original rows
  • Pivot/Pivot Table to reshape long data to wide
  • Melt to reshape wide data to long
  • Apply for custom functions (use sparingly, it's slow)
  • Stack/Unstack for multi-level index manipulation

But the real takeaway from this article isn't a list of function names. It's a set of questions to ask when you're faced with a data transformation problem. Does my data live in multiple tables that need to be combined based on shared identifiers? Merge. Do I have DataFrames that are structurally similar and should be stacked? Concat. Do I need to aggregate from row-level detail to group-level summaries? GroupBy with agg. Do I need group-level statistics attached to every original row? Transform. Do I need to change the shape of my data, from long to wide or wide to long? Pivot and melt. Do I need complex custom logic that doesn't fit any of the above? Apply, but use it as a last resort.

These operations compose. A typical real-world pipeline might merge several tables together, filter and clean the result, groupby with multiple aggregations, then melt the wide result back into long format for visualization. Each step transforms the shape and granularity of the data, moving it steadily from raw complexity toward analytical clarity.

The best analysts aren't the ones who know the most syntax; they're the ones who can look at a messy dataset and immediately see the sequence of transformations needed to make it answer a question. That intuition comes from practice, from making mistakes (merging in the wrong direction and getting 10x more rows than expected), and from building up a mental library of transformation patterns. You're building that library now.

Next up: we'll turn this transformed data into visualizations. Time to make it visible. See you in the next article.
