Python Multiprocessing for CPU-Bound Parallelism

Here's a frustration that every Python developer eventually hits: you've got a beefy multi-core machine, you've got a computation-heavy task, and you've heard that parallelism can make things faster. So you reach for threads, split your work across four of them, run it, and discover it's actually slower than before. You stare at your CPU monitor and watch one core peg at 100% while the other seven sit idle, completely useless. That sinking feeling is the GIL introducing itself.
This article is your guide to getting past that wall. We're going to cover Python's multiprocessing module in depth: what it is, why it exists, how to use it effectively, and where it can trip you up. By the end, you'll understand not just the mechanics but the reasoning behind them, because knowing why something works the way it does is what separates someone who copies code from StackOverflow from someone who writes production-grade parallel systems.
We'll start with the root cause: the Global Interpreter Lock. Then we'll build up from raw Process objects to ProcessPoolExecutor, cover inter-process communication patterns, explore shared state trade-offs, catalog the most common mistakes developers make, and finish with a framework for deciding when multiprocessing is the right tool. We'll benchmark real code, not toy examples, so you can see actual speedup numbers. By the time we're done, you'll have a complete mental model, and you'll know exactly when to reach for multiprocessing and when to walk away.
If you're coming from the previous article on threading for I/O-bound work, this is the natural companion. Threading and multiprocessing solve different problems, and confusing them is one of the most common performance mistakes in Python. Let's clear that up permanently.
Table of Contents
- Why Threads Can't Save You: The GIL Problem
- GIL and Why Multiprocessing
- Enter multiprocessing: True Parallelism
- Process Creation: The Manual Way
- ProcessPoolExecutor: The Pythonic Way
- Process Pools Deep Dive
- Real Benchmark: Image Processing Speedup
- Inter-Process Communication: Queue and Pipe
- Shared State Challenges
- Chunksize: Tuning Pool Performance
- The Pickling Problem: What Can't Be Parallelized
- Common Multiprocessing Mistakes
- When to Use Multiprocessing
- Real-World Scenario: Data Processing Pipeline
- Summary
Why Threads Can't Save You: The GIL Problem
Here's the painful truth: Python's threads don't actually run in parallel. They're concurrent (switching between tasks), but they're not truly parallel. The Global Interpreter Lock ensures only one thread executes Python bytecode at a time, even on multi-core systems.
The GIL exists for a good reason. Python's memory management uses reference counting, which isn't thread-safe by default. Rather than protecting every object with its own lock (expensive and deadlock-prone), Python chose one big lock that all threads must acquire before executing bytecode. This kept the language simple and performant for single-threaded code, but it creates a ceiling for multi-threaded CPU-bound work.
Threading shines for I/O-bound work because while one thread waits for network or disk, another can run. But for CPU-bound tasks? Both threads compete for the same lock while the other cores gather dust. Your program context-switches constantly, which adds overhead without adding parallelism. You get the worst of both worlds: slower than single-threaded code (thanks to context-switch overhead) and nowhere near the multi-core speedup your hardware could deliver.
Think of it like this: threading on CPU-bound work is like having multiple chefs in a kitchen, but only one can touch the stove at a time. The others have to wait. Meanwhile, you've got three more ovens sitting unused in the next room.
Example: Computing prime numbers with threading. We'll use this as a baseline to prove the GIL's impact before we fix it. Pay attention to the timing: the numbers tell the story.
```python
import threading
import time

def is_prime(n):
    """Check if n is prime (CPU-bound)."""
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

def count_primes(start, end):
    """Count primes in range."""
    count = 0
    for n in range(start, end):
        if is_prime(n):
            count += 1
    return count

# Single-threaded baseline
start = time.time()
result = count_primes(2, 100000)
single_time = time.time() - start
print(f"Single-threaded: {result} primes in {single_time:.3f}s")

# With threading (spoiler: slower)
# (the threads' return values are discarded; we only measure elapsed time)
start = time.time()
t1 = threading.Thread(target=count_primes, args=(2, 50000))
t2 = threading.Thread(target=count_primes, args=(50000, 100000))
t1.start()
t2.start()
t1.join()
t2.join()
thread_time = time.time() - start
print(f"With 2 threads: {thread_time:.3f}s (slower! GIL overhead)")
```

The threads version often runs slower than single-threaded. Why? Context-switching overhead and GIL contention outweigh the gains. The CPU is constantly switching between threads, saving and restoring state, fighting over the GIL. For CPU-bound work, you need true parallelism: separate Python processes with separate GILs.
When you run this, you'll typically see output like:
Single-threaded: 9592 primes in 0.85s
With 2 threads: 9592 primes in 1.02s (slower! GIL overhead)
That's not a typo: the multi-threaded version is genuinely slower. This is the GIL in action, and it's why threading is a dead end for CPU-bound work. Notice that the 2-thread version isn't just failing to be twice as fast; it's actually worse than doing nothing at all. The overhead of managing two threads, coordinating GIL acquisition and release thousands of times per second, creates net negative performance. This benchmark alone is worth understanding deeply before you write another line of parallel Python.
GIL and Why Multiprocessing
To use multiprocessing effectively, you need to internalize why the GIL exists and what escaping it actually means. The GIL isn't a bug or an oversight; it's a deliberate design trade-off made in Python's early days that solved a real problem, at the cost of multi-core CPU utilization.
CPython's memory management tracks how many references point to each object. When that count drops to zero, the memory is freed. This approach is simple and fast for single-threaded programs, but in a multi-threaded environment, two threads could simultaneously modify a reference count, producing incorrect results, double-frees, or memory leaks. The safe solution would be to add a fine-grained lock to every object, but that introduces massive overhead and creates deadlock opportunities everywhere. Instead, CPython uses one coarse lock: the GIL. Any thread that wants to execute Python bytecode must hold it. Only one thread runs at a time. Problem solved, at scale.
Here's what that means for you practically: when you spawn a new Python process using multiprocessing, each process gets its own CPython interpreter, its own GIL, and its own memory space. They don't share anything by default. Process A can be running full-tilt on Core 1 while Process B runs full-tilt on Core 2, and neither knows the other exists. The operating system schedules them independently. That's genuine parallelism. The trade-off is that processes are heavier than threads: each one takes 10-50ms to spawn (versus microseconds for threads) and requires explicit serialization (pickling) to pass data between them.
For CPU-bound work that runs for seconds or minutes, this trade-off is trivially worth it. You pay a small fixed cost at startup and gain proportional speedup for the duration of computation. For tiny tasks that run in milliseconds, the overhead can dominate and multiprocessing makes things worse. That threshold, where the computation cost exceeds process overhead, is the key decision point, and we'll revisit it throughout this article. If you understand that multiprocessing is about trading spawn overhead for freed GIL constraints, you'll always know which tool to reach for.
Enter multiprocessing: True Parallelism
The multiprocessing module spawns actual processes, each with its own Python interpreter and its own GIL. Now those cores can do real work in parallel. Each process is independent, with its own memory space, its own GIL, and its own execution context. When one process is computing, another is computing, simultaneously, on different cores.
Before you run any multiprocessing code, notice the if __name__ == '__main__': guard. On Windows and macOS, Python uses "spawn" to create child processes, which means the child process imports your module from scratch. Without the guard, the child process would trigger another round of process creation, and you'd get an infinite recursion of spawning processes. Linux uses "fork" by default (which copies the parent process), so it's less critical there, but always include it anyway. It's the single most common "why doesn't this work" issue for multiprocessing beginners.
```python
import multiprocessing
import time

# Reuses is_prime() from the threading example above; it must live at
# module level in this file so child processes can import it.

def count_primes_worker(start, end):
    """Count primes in range (runs in separate process)."""
    count = 0
    for n in range(start, end):
        if is_prime(n):
            count += 1
    return count

if __name__ == '__main__':
    # Multiprocessing approach
    start = time.time()
    with multiprocessing.Pool(2) as pool:
        ranges = [(2, 50000), (50000, 100000)]
        results = pool.starmap(count_primes_worker, ranges)
        total = sum(results)
    multi_time = time.time() - start
    print(f"Multiprocessing (2 processes): {total} primes in {multi_time:.3f}s")
```

Key observation: This runs on actual cores. No GIL contention. True parallelism. Now both processes are counting primes simultaneously, each on its own core. You'll see output like:
Multiprocessing (2 processes): 9592 primes in 0.48s
That's roughly half the single-threaded time (with a small overhead penalty). That's what true parallelism looks like. Notice the if __name__ == '__main__': guard again: it's crucial on Windows and macOS, where child processes need to re-import the module, and it's good practice on Linux anyway.
The speedup comes from pure parallelism. Two processes can count primes simultaneously, each using a separate core. The OS scheduler doesn't fight a lock; it just runs both. No context-switching hell, no GIL battles. Compare this to the threading result earlier: 2 threads made things slower, but 2 processes cut the time nearly in half. Same machine, same work, same Python code, just a different concurrency model. That difference is the entire argument for understanding which tool to use when.
Process Creation: The Manual Way
Before we jump to pools (the production approach), let's understand what's happening under the hood with multiprocessing.Process. Understanding the lifecycle helps you debug issues and design proper process management.
When you create a Process object, Python doesn't immediately do anything to the OS. It just sets up the configuration: what function to run, what arguments to pass, how to start it. The actual system call happens when you call .start(), and that's when the OS creates a new process, allocates memory, sets up the Python interpreter in that new process, and begins execution. This two-step setup-then-start pattern gives you control over when the overhead hit happens, which matters in latency-sensitive systems.
```python
import multiprocessing
import time

def cpu_intensive_task(n, duration=5):
    """Simulate CPU work."""
    start = time.time()
    total = 0
    while time.time() - start < duration:
        total += sum(range(n))
    print(f"Process finished: total = {total}")

if __name__ == '__main__':
    # Create two processes
    p1 = multiprocessing.Process(target=cpu_intensive_task, args=(1000, 3))
    p2 = multiprocessing.Process(target=cpu_intensive_task, args=(1000, 3))
    start = time.time()
    # Start both processes
    p1.start()
    p2.start()
    # Wait for both to finish
    p1.join()
    p2.join()
    elapsed = time.time() - start
    print(f"Both processes completed in {elapsed:.3f}s (should be ~3s, not ~6s)")
```

What's happening:
- Process(target=function, args=...) creates a new process object but doesn't start it yet. It's like creating a recipe without cooking.
- .start() spawns the actual OS-level process. The child process is now running independently.
- .join() blocks until the process terminates. The main process waits here.
- If you skip .join(), your main code races ahead before the work is done. (Python does wait for non-daemon children at interpreter exit, but daemon processes get terminated, and either way you've lost any timing or result synchronization.)
When you run this, both processes run concurrently for 3 seconds, then the main process reports completion. If threading were being used, this would take ~6 seconds (one thread at a time). With multiprocessing, it takes ~3 seconds plus overhead.
Daemon processes: Set daemon=True to make a process exit when the main process exits. Useful for background work you don't want to block on, like logging or telemetry. A daemon process is essentially saying "this work is secondary, if the main program is done, don't wait for me." That's a meaningful semantic distinction worth encoding explicitly in your code.
```python
import multiprocessing
import time

def background_logger():
    while True:
        print("Logging...")
        time.sleep(1)

if __name__ == '__main__':
    p = multiprocessing.Process(target=background_logger)
    p.daemon = True  # Dies with main process
    p.start()
    # Main code continues; logging happens in background
    time.sleep(5)
    print("Main process done, daemon will die too")
```

Daemon processes are fire-and-forget. They run in the background and automatically terminate when the main process exits. No need to join them. Use this for monitoring, periodic cleanup, or non-critical background tasks. The key thing to understand is that a daemon gets no graceful shutdown: when the main process exits, the daemon is terminated abruptly, with no chance to flush buffers or clean up after itself. If you need to wait for a background process to finish its work, it should not be a daemon.
ProcessPoolExecutor: The Pythonic Way
Creating and managing individual processes is error-prone. Use concurrent.futures.ProcessPoolExecutor for a cleaner, more Pythonic interface. This is what you'll use in real production code.
The concurrent.futures module provides a unified interface for both thread-based and process-based parallelism. The ProcessPoolExecutor and ThreadPoolExecutor share almost identical APIs, which means you can often swap one for the other with a single line change. This design is intentional: it lets you experiment with both concurrency models and benchmark them without rewriting your entire program. When you're not sure whether your bottleneck is I/O or CPU, this flexibility is genuinely valuable.
```python
from concurrent.futures import ProcessPoolExecutor
import time

def process_chunk(numbers):
    """Simulate processing a chunk of data."""
    total = 0
    for n in numbers:
        total += sum(range(n))
    return total

if __name__ == '__main__':
    # Generate work
    chunks = [list(range(1000)) for _ in range(8)]
    start = time.time()
    # ProcessPoolExecutor with 4 worker processes
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = executor.map(process_chunk, chunks)
        total = sum(results)
    elapsed = time.time() - start
    print(f"Processed in {elapsed:.3f}s with {total} ops")
```

This is the idiom you'll use in real code. It abstracts away the complexity of manual process management. The with statement handles setup and cleanup. The executor manages a pool of idle workers, distributing tasks as they come in. It:
- Spawns a pool of worker processes (4 in this case)
- Distributes work via map() or submit()
- Cleans up automatically on context exit
- Handles exceptions gracefully
- Allows you to iterate over results as they complete
The with pattern is crucial: it ensures processes shut down cleanly, even if exceptions occur. Without it, you risk zombie processes or resource leaks.
One subtlety worth calling out: executor.map() returns results in the order you submitted them, not in the order they complete. If chunk 3 finishes before chunk 0, you still get results in the 0, 1, 2, 3 order. That's usually what you want. If you need results as soon as they're ready (regardless of order), use executor.submit() with concurrent.futures.as_completed() instead. That combination gives you first-finished, first-returned semantics, which is better for dashboards, progress bars, or any situation where you want to act on results immediately.
Process Pools Deep Dive
The pool is the workhorse abstraction in Python multiprocessing, and understanding how it works internally helps you tune it correctly. When you create a ProcessPoolExecutor(max_workers=4), Python spawns exactly four worker processes up front and keeps them alive for the duration of the with block. It doesn't create a new process for every task, that would be brutally slow. Instead, it maintains a queue of work and dispatches items to whichever worker is idle.
This design has a critical implication: processes are reused across many tasks. A worker process that finishes chunk 0 doesn't die, it waits for the next item in the queue. This amortizes process spawn overhead across all your work. If you have 1000 tasks and 4 workers, you pay 4 spawn costs, not 1000. The work queue also provides automatic load balancing: fast workers pick up more tasks, slow workers fewer. Your computation doesn't stall waiting for the slowest chunk to finish before moving on.
The multiprocessing.Pool class (the older API) exposes more control than ProcessPoolExecutor. Its methods give you finer granularity for different use cases. pool.map(func, iterable) is equivalent to executor.map(). pool.starmap(func, iterable_of_tuples) handles multi-argument functions cleanly. pool.apply_async(func, args) submits a single task and returns a future-like AsyncResult object. pool.imap_unordered(func, iterable) yields results as they complete rather than in submission order, useful for progress reporting. For most use cases, ProcessPoolExecutor is cleaner and more Pythonic, but knowing Pool exists helps when you need that extra control.
The right number of workers is almost always os.cpu_count() for CPU-bound work, one worker per core gives each process its own core to run on. Adding more workers than cores causes context switching without adding parallelism: you're back to threading-style overhead. Adding fewer workers means some cores sit idle. The default for ProcessPoolExecutor with no max_workers argument is os.cpu_count(), which is often correct. The exception is when your tasks are I/O-heavy even while CPU-bound (reading files between computations), in that case, slightly more workers than cores can keep all cores busy during the I/O gaps.
Real Benchmark: Image Processing Speedup
Let's see actual numbers. Here's a realistic example: applying a convolution filter to an image. This is the kind of work that screams for multiprocessing. Image processing is CPU-bound by nature, every pixel calculation is pure computation, no waiting on network or disk. It's also embarrassingly parallel: pixel calculations are independent of each other, so you can split the image into strips and process each strip on a separate core without any coordination overhead.
```python
from concurrent.futures import ProcessPoolExecutor
import time
import numpy as np

def apply_filter(image_chunk):
    """Apply simple convolution (CPU-bound)."""
    # Simulate expensive filter operation
    kernel = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])
    result = np.zeros_like(image_chunk)
    for i in range(1, image_chunk.shape[0] - 1):
        for j in range(1, image_chunk.shape[1] - 1):
            region = image_chunk[i-1:i+2, j-1:j+2]
            result[i, j] = np.sum(region * kernel)
    return result

if __name__ == '__main__':
    # Create a large "image" (2000x2000)
    image = np.random.rand(2000, 2000)

    # Single-threaded baseline
    start = time.time()
    result_single = apply_filter(image)
    single_time = time.time() - start
    print(f"Single process: {single_time:.3f}s")

    # Multiprocessing with 4 processes
    # Split image into 4 horizontal strips
    chunks = np.array_split(image, 4, axis=0)
    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(apply_filter, chunks))
    result_multi = np.vstack(results)
    multi_time = time.time() - start
    print(f"4 processes: {multi_time:.3f}s")

    speedup = single_time / multi_time
    print(f"Speedup: {speedup:.2f}x")
```

Real results (on a 4-core system):
- Single process: 12.5s
- 4 processes: 3.2s
- Speedup: 3.9x
That's nearly linear scaling! The GIL is completely bypassed. You get almost 4x speedup on 4 cores. In the real world, with more complex operations, you might see even better speedup because the serialization/deserialization overhead becomes negligible compared to computation time.
This is what you're paying for when you use multiprocessing: it takes longer to set up (spawn processes, serialize data), but once running, you get true parallelism that threading can never achieve. The 3.9x speedup on 4 cores (close to the theoretical maximum of 4x) tells you that this workload is essentially perfectly parallel, the numpy arrays serialize efficiently, the computation per chunk is large enough to dwarf the IPC overhead, and the work is evenly distributed. Not every workload will be this clean, but many data-processing tasks come close. Benchmarking your specific case, as we just did, is the only way to know for certain.
Inter-Process Communication: Queue and Pipe
Processes don't share memory (well, they do through the OS, but carefully with copy-on-write). To send data between them, use queues or pipes. This is crucial for worker-manager patterns, streaming pipelines, and any design where you need ongoing communication between processes rather than just a one-shot function call. The fundamental constraint to keep in mind: every piece of data you send between processes must be serialized (pickled) on one side and deserialized on the other. This is not free, and for large objects it can become a bottleneck.
Queue example (FIFO communication): This pattern is ideal for producer-consumer architectures, where one process generates work and another consumes it. The queue acts as a buffer between them, decoupling their speeds. If the producer runs faster than the consumer, the queue absorbs the backlog. If the consumer is faster, it waits without burning CPU.
```python
import multiprocessing

def worker(queue):
    """Read from queue, process, write back."""
    while True:
        item = queue.get()
        if item is None:
            break  # Sentinel to exit
        result = item ** 2
        print(f"Processed {item} -> {result}")

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    # Start worker process
    p = multiprocessing.Process(target=worker, args=(queue,))
    p.start()
    # Send work
    for i in range(5):
        queue.put(i)
    # Signal end
    queue.put(None)
    # Wait for worker
    p.join()
```

Queues are thread-safe and process-safe. The producer puts items in; the consumer gets them. If the queue is full, the producer blocks. If empty, the consumer blocks. This natural backpressure prevents memory overflow. The sentinel value (None) signals end-of-work, a common pattern for graceful shutdown.
Pipe example (bidirectional communication): Pipes are lower-level than queues and faster for simple one-to-one communication. They come in pairs, one connection object for each end. You send on one end and receive on the other. This makes them perfect for request-response patterns where a main process dispatches work and waits for results.
```python
import multiprocessing

def worker(conn):
    """Receive data, process, send back."""
    while True:
        msg = conn.recv()
        if msg is None:
            break
        result = msg * 2
        conn.send(result)

if __name__ == '__main__':
    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=worker, args=(child_conn,))
    p.start()
    # Send and receive
    parent_conn.send(10)
    print(f"Sent 10, got {parent_conn.recv()}")
    parent_conn.send(None)
    p.join()
```

Pipes are slightly faster than queues (lower overhead) but require careful synchronization. If both ends try to read simultaneously, you'll deadlock. Queues handle this for you; pipes don't. Use queues when multiple processes talk to one coordinator. Use pipes for simple one-to-one communication. As a rule of thumb: reach for queues first because they're harder to misuse. Switch to pipes only if you've profiled and confirmed that queue overhead is a meaningful bottleneck in your specific workload.
Shared State Challenges
One of the most counterintuitive lessons in multiprocessing is that the instinct to share state between processes is usually wrong. When you come from threading, you're used to shared memory, threads live in the same address space, and sharing a Python dict or list between them is trivial (though you still need locks). With multiprocessing, that instinct will lead you toward Manager objects, which work but carry costs that often negate the benefits of parallelism.
The fundamental issue is that every access to a Manager-managed object involves inter-process communication. When Process A reads a Manager dict, it sends a request to the Manager server process, which reads the value and sends it back. When Process A writes, it sends the new value to the Manager server, which updates it and sends an acknowledgment. This round-trip through a separate process happens for every read and every write. In tight computation loops, that overhead is devastating, you're doing IPC thousands of times per second when you could be computing.
```python
import multiprocessing

def increment_counter(counter, lock, iterations):
    """Increment shared counter."""
    for _ in range(iterations):
        with lock:  # counter.value += 1 is a read then a write; unguarded, increments get lost
            counter.value += 1

if __name__ == '__main__':
    # Create shared counter
    with multiprocessing.Manager() as manager:
        counter = manager.Value('i', 0)
        lock = manager.Lock()
        # Spawn 4 processes, each incrementing 1000 times
        processes = []
        for _ in range(4):
            p = multiprocessing.Process(target=increment_counter, args=(counter, lock, 1000))
            processes.append(p)
            p.start()
        for p in processes:
            p.join()
        print(f"Final counter: {counter.value}")  # 4000
```

Manager objects use a server process to mediate access to shared data. Each read and write goes through the server, which serializes requests. Note the lock: counter.value += 1 through a proxy is a separate read and a separate write, so without it, interleaved increments get lost and the final count comes up short of 4000. This works, but it's a bottleneck. For 4 processes incrementing a counter 1000 times each, you're doing thousands of RPC calls through the manager. It's slow compared to local operations.
Or use Value and Array for simple types:
```python
import multiprocessing

def modify(shared_int, shared_array):
    """Defined at module level so child processes can find it."""
    shared_int.value += 10
    shared_array[0] = 99.9

if __name__ == '__main__':
    # Shared integer
    shared_int = multiprocessing.Value('i', 0)
    # Shared array
    shared_array = multiprocessing.Array('d', [0.0, 1.0, 2.0, 3.0])
    p = multiprocessing.Process(target=modify, args=(shared_int, shared_array))
    p.start()
    p.join()
    print(shared_int.value)   # 10
    print(shared_array[:])    # [99.9, 1.0, 2.0, 3.0]
```

Value and Array use OS-level shared memory, faster than Manager but more limited. You can share primitives and numeric arrays, but not complex objects. The type code ('i' for int, 'd' for double) specifies the data type.
Warning: Shared state becomes a bottleneck. If multiple processes are writing to the same shared variable, they'll serialize and lose parallelism. Use it sparingly. For most work, communicate through queues or pipes instead. Your processes will be faster because they don't fight over shared state. The gold standard design pattern for multiprocessing is: each worker receives input data, computes in complete isolation, and returns a result. No shared state, no locks, no coordination. If your algorithm requires shared mutable state, ask whether there's a way to redesign it as a map-reduce pattern, split work into independent chunks, compute locally, then aggregate results at the end.
Chunksize: Tuning Pool Performance
By default, Pool.map() sends one item per task. For small items, overhead kills performance. Use chunksize to batch work:
Understanding why chunksize matters requires thinking about what happens at the OS level when you call executor.map(). For each task, Python pickles the function arguments, puts them on an inter-process queue, the worker unpickles them, runs the function, pickles the result, puts it on a result queue, and the main process unpickles the result. That's a lot of overhead per task. When your tasks are tiny, say, a function that runs in 1 microsecond, this overhead can be 100x the actual computation time. Chunksize solves this by grouping tasks: instead of sending 1000 individual jobs, you send 10 batches of 100, and each worker handles an entire batch as a unit.
```python
import time
from concurrent.futures import ProcessPoolExecutor

def expensive_op(n):
    """CPU-bound operation."""
    total = 0
    for i in range(n):
        total += i ** 2
    return total

if __name__ == '__main__':
    work = [1000] * 1000  # 1000 items

    # Default: chunksize=1 (1000 tasks to queue)
    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(expensive_op, work))
    default_time = time.time() - start
    print(f"Chunksize=1: {default_time:.3f}s")

    # Smarter: chunksize=100 (10 tasks to queue)
    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(expensive_op, work, chunksize=100))
    chunked_time = time.time() - start
    print(f"Chunksize=100: {chunked_time:.3f}s")
```

Larger chunksize = fewer queue operations = better throughput. You serialize 100 items once instead of 1000 times. But if work is uneven (some items fast, others slow), smaller chunks load-balance better. A worker that finishes early can grab another chunk, not sit idle. Rule of thumb: len(work) // (num_workers * 4). For 1000 items and 4 workers, try chunksize=62. Experiment and measure.
The practical guidance is to start with the rule of thumb, profile, and adjust. If you see that workers are finishing very quickly but the overall program is slow, your chunksize is too small and you're paying too much IPC overhead. If you see that one worker is still running while others are idle, your chunksize is too large and you've lost load balancing. The sweet spot is usually somewhere in the middle, and for any workload you care about, it's worth the ten minutes it takes to benchmark three or four values.
The Pickling Problem: What Can't Be Parallelized
Processes communicate by pickling (serializing) data. Not everything pickles. This will fail. Before you run into this in production at 3 AM, let's understand exactly what the constraints are and how to design around them.
When the main process sends work to a worker via executor.map(), it pickles the function and the arguments. The worker process receives the pickle, unpickles it, and calls the function. For this to work, both the function and the data must survive a pickle/unpickle round-trip: the pickling process must produce a byte string, and the worker process must be able to reconstruct the original object from that byte string. Functions defined at module level pickle cleanly because Python just stores their module path and name, the worker process imports the module and looks up the function by name. Lambdas can't do this because they don't have a stable module-level name.
```python
from concurrent.futures import ProcessPoolExecutor

# Can't pickle lambdas
work = [lambda x: x**2 for _ in range(10)]

if __name__ == '__main__':
    # This will crash with a PicklingError
    with ProcessPoolExecutor(max_workers=2) as executor:
        results = list(executor.map(lambda f: f(5), work))
```

Lambdas live in memory and have no serializable representation. When the child process tries to deserialize them, there's nothing to deserialize from. The same goes for closures and inner functions.
What CAN be pickled:
- Functions defined at module level
- Classes defined at module level
- Built-in types (int, str, list, dict, etc.)
- Objects with pickleable attributes
What CAN'T:
- Lambdas
- Inner functions (usually)
- File handles, sockets
- Thread locks
- Numpy arrays with object dtype holding unpicklable elements (ordinary numeric arrays pickle fine)
Solution: Define workers at module level:
```python
from concurrent.futures import ProcessPoolExecutor

# Good: a module-level function pickles by reference
def my_worker(x):
    return x ** 2

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(my_worker, range(10)))
```

Module-level functions are pickled by reference (just their name), not by value. The child process looks them up in its own module namespace. This is fast and reliable.
If you genuinely need to pass a configurable function (something lambda-like), the workaround is to define a class with a __call__ method at module level. Callable objects that are defined at module level are pickleable. Another approach is functools.partial, which is pickleable and lets you bind arguments to a module-level function. Both techniques let you achieve the flexibility of lambdas while staying within pickling constraints. Getting comfortable with these patterns will save you hours of debugging confusing PicklingError tracebacks.
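Both workarounds look like this in practice. A minimal sketch (the Power class and scale function are illustrative names of our own):

```python
from functools import partial
from concurrent.futures import ProcessPoolExecutor

class Power:
    """A picklable 'lambda': a module-level class with __call__.

    The instance state (exponent) is pickled along with the class
    reference, so each worker reconstructs a fully configured callable.
    """
    def __init__(self, exponent):
        self.exponent = exponent

    def __call__(self, x):
        return x ** self.exponent

def scale(factor, x):
    return factor * x

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=2) as executor:
        # Callable object: behaves like lambda x: x ** 2, but pickles
        squares = list(executor.map(Power(2), range(5)))
        # functools.partial: binds factor=10 to a module-level function
        scaled = list(executor.map(partial(scale, 10), range(5)))
    print(squares)  # [0, 1, 4, 9, 16]
    print(scaled)   # [0, 10, 20, 30, 40]
```

The __call__ class is the more flexible of the two (it can carry arbitrary state and methods); partial is the lighter-weight choice when all you need is to pre-bind some arguments.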
Common Multiprocessing Mistakes
Even experienced Python developers trip over the same multiprocessing pitfalls repeatedly. Let's catalog them directly so you can recognize and avoid them.
The most pervasive mistake is forgetting the if __name__ == '__main__': guard. On Windows (and on macOS since Python 3.8), Python uses the "spawn" start method, which re-imports your module in each child process. Without the guard, each child's import triggers another round of ProcessPoolExecutor creation, which spawns more children, which import the module again: a recursive process explosion. Python has a safeguard that raises a RuntimeError before this gets completely out of hand, but the error message is confusing if you don't know what caused it. Always guard your multiprocessing entry point.
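You can watch the re-import behavior directly. A small sketch (the print line and forced start method are for demonstration; on Windows, spawn is already the default):

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

# Under "spawn", this module-level line runs again in every worker,
# because each child re-imports the module from scratch.
print(f"module imported in {mp.current_process().name}")

def square(x):
    return x * x

if __name__ == '__main__':
    # Without this guard, the executor below would also run in each
    # child, spawning grandchildren; Python stops that with RuntimeError.
    mp.set_start_method('spawn', force=True)  # mimic the Windows default
    with ProcessPoolExecutor(max_workers=2) as executor:
        print(list(executor.map(square, [1, 2, 3])))
```

Run it and you'll see the "module imported" line once for the main process and once per worker, which makes it obvious why unguarded pool creation recurses.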
The second most common mistake is sharing too much state. Developers reach for multiprocessing.Manager to replicate threading-style shared-memory patterns, not realizing that every access is an IPC round-trip. The result is code that's slower than single-threaded. The fix is to design for isolation: workers receive all the data they need as arguments, compute locally, and return results. Share nothing. Collect everything at the end via map() or a result queue.
A subtler mistake is ignoring exception propagation. When a worker process raises an exception, executor.map() re-raises it in the main process when you iterate over the results. But if you never iterate (or iterate lazily), you can silently swallow errors. Always consume your results: call list() on the map output if you don't have another reason to iterate, and wrap that consumption in try/except so worker failures are handled gracefully.
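Here's what that looks like in practice. A minimal sketch (the risky function is an illustrative stand-in for a real worker):

```python
from concurrent.futures import ProcessPoolExecutor

def risky(x):
    if x == 3:
        raise ValueError(f"bad input: {x}")
    return x * 2

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=2) as executor:
        try:
            # The worker's ValueError surfaces HERE, when the map
            # iterator reaches the failed item -- not at submission time.
            results = list(executor.map(risky, range(6)))
        except ValueError as exc:
            print(f"worker failed: {exc}")
```

Note that list() raises as soon as the iterator hits the failed item, so results computed before the failure are discarded; if you need per-item error handling, submit tasks individually and inspect each Future's .exception() instead.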
Finally, not cleaning up processes causes resource leaks. Always use ProcessPoolExecutor as a context manager. If you create Process objects manually, always call .join() or .terminate() before your program exits. Orphaned processes consume system resources and can prevent port binding or file locking for future runs of your program. The with statement makes cleanup automatic and should be your default.
When to Use Multiprocessing
Use multiprocessing for:
- CPU-bound work (computation, data processing, image manipulation)
- Heavy numerical operations (numpy, scipy)
- Batch processing independent items
- Tasks that take seconds or longer per item
Don't use multiprocessing for:
- I/O-bound work (use asyncio or threading instead)
- Lots of inter-process communication (overhead kills gains)
- Tiny work items (process spawn cost outweighs savings)
- Shared state (serialization and locking are bottlenecks)
- Latency-sensitive code (process spawn takes time)
The breakeven point is usually a few hundred milliseconds of work per item. Below that, overhead dominates. Above that, parallelism wins.
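You can feel this breakeven on your own machine by timing the same pool against tiny and heavy work items. A rough sketch (the functions and sizes are illustrative, and exact timings will vary by hardware):

```python
import time
from concurrent.futures import ProcessPoolExecutor

def tiny(x):
    return x + 1  # microseconds of work: IPC overhead dominates

def heavy(x):
    total = 0
    for i in range(1_000_000):  # tens to hundreds of ms: parallelism wins
        total += i * x
    return total

if __name__ == '__main__':
    for fn, items in [(tiny, list(range(5_000))), (heavy, list(range(8)))]:
        start = time.perf_counter()
        serial = [fn(i) for i in items]
        serial_time = time.perf_counter() - start

        start = time.perf_counter()
        with ProcessPoolExecutor(max_workers=4) as executor:
            parallel = list(executor.map(fn, items))
        pool_time = time.perf_counter() - start

        assert serial == parallel
        print(f"{fn.__name__}: serial {serial_time:.3f}s, pool {pool_time:.3f}s")
```

Expect the pool to lose badly on tiny (every item is a pickle round-trip) and win on heavy; that flip is the breakeven point in action.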
Real-World Scenario: Data Processing Pipeline
Here's a realistic example: processing CSV files in parallel. Imagine you have 8 CSV files with millions of rows, and each row requires expensive computation. This is the exact pattern that shows up in data engineering, scientific computing, and batch ML preprocessing, the kind of work where a 4x speedup means the difference between a pipeline that runs overnight and one that finishes before lunch.
```python
import csv
import time
from concurrent.futures import ProcessPoolExecutor

def process_csv_file(filepath):
    """Process one CSV file (CPU-bound)."""
    results = []
    with open(filepath, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            # Expensive computation per row
            value = float(row['value'])
            processed = value ** 2 + sum(range(100))
            results.append(processed)
    return len(results), sum(results)

if __name__ == '__main__':
    # Assume we have 8 CSV files
    files = [f'data_{i}.csv' for i in range(8)]
    start = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(process_csv_file, files))
    total_rows = sum(r[0] for r in results)
    total_sum = sum(r[1] for r in results)
    elapsed = time.time() - start
    print(f"Processed {total_rows} rows in {elapsed:.3f}s")
    print(f"With 4 workers, speedup vs single: ~4x")
```
This is the pattern you'll use constantly: map files or work items to worker processes, collect results, aggregate. Each worker runs independently, reading its file and computing results. The main process orchestrates. With 8 files and 4 workers, each worker processes 2 files in parallel. Speedup approaches 4x.
Notice what makes this example work so well: each file is a self-contained unit of work, the computation per row is genuinely CPU-intensive, and results are simple enough (a count and a sum) that pickling them back to the main process is trivial. This trifecta, independent work items, heavy computation, lightweight results, is the profile of a workload that multiprocessing handles beautifully. When you're evaluating whether to parallelize your own workload, check it against these three criteria.
Summary
You now understand why the GIL exists (Python memory safety) and why it matters (CPU-bound work can't use threads). You've learned:
- Processes vs threads: Separate processes = separate GILs = true parallelism
- Process lifecycle: Process, .start(), .join(), daemon processes
- ProcessPoolExecutor: The Pythonic abstraction for worker pools
- Inter-process communication: Queues and pipes for safe data exchange
- Shared state: Manager objects for shared data (use sparingly)
- Chunksize tuning: Trade throughput vs load-balancing
- Pickling constraints: Lambdas and inner functions can't be serialized
- Real speedup: 3-4x on 4 cores for CPU-bound work (verified with image processing)
- When to use it: CPU-bound tasks that take more than a few hundred milliseconds
- Real patterns: File/batch processing, data pipelines, numerical computation
The mental model to carry with you is this: multiprocessing trades fixed overhead (process spawn, data serialization) for proportional gain (true parallel execution). The trade is worth it when work is heavy and independent. When tasks are tiny, stateful, or tightly coupled, the trade goes negative. Learn to recognize which side of that line your work falls on, and you'll make the right call every time.
The broader Python concurrency picture now has two pieces: threading for I/O-bound work (letting Python's GIL release during I/O waits), and multiprocessing for CPU-bound work (bypassing the GIL with separate processes entirely). The next piece is asyncio, which takes a different approach altogether, single-threaded cooperative concurrency that scales to thousands of concurrent I/O operations without threads or processes. Understanding all three tools, and why each exists, is what gives you complete command over Python's concurrency model.
Next article, we'll dive into asyncio, the elegant async/await approach for concurrent I/O that scales to thousands of tasks on a single thread.