July 21, 2025
AI/ML Infrastructure · Inference · LLM · Model Serving

Disaggregated Prefill and Decode: Next-Generation LLM Serving

You're running an LLM service, and something feels off. When request volume spikes, your GPU utilization drops. Your latency skyrockets on some requests while others finish instantly. The problem? You're trying to do two fundamentally different jobs with the same hardware.

This is where disaggregated prefill and decode architectures enter the picture. Instead of forcing one GPU cluster to handle both prompt processing and token generation, we split them. This simple organizational change unlocks massive throughput gains and more predictable latencies. We're talking 6-10x improvements in tokens per second.

Let's dig into why this matters, how it works, and how to implement it.

Table of Contents
  1. The Fundamental Problem: Prefill vs. Decode
  2. Prefill: Compute-Bound Parallel Processing
  3. Decode: Memory-Bandwidth-Bound Sequential Processing
  4. The Mismatch That Creates Inefficiency
  5. Disaggregation: Separate Clusters, Separate Optimizations
  6. KV Cache: The Data Transfer Challenge
  7. Understanding KV Cache Size
  8. Transfer Protocols
  9. Transfer Latency Impact on End-to-End Performance
  10. Scheduling in Disaggregated Systems
  11. Prefill Scheduler Design
  12. Decode Scheduler Implementation
  13. Real Implementation: KV Cache Transfer
  14. Real-World Implementations: vLLM and LMDeploy
  15. Performance Benchmarks: Monolithic vs. Disaggregated
  16. Operational Challenges in Production
  17. Failure Modes and Resilience
  18. Capacity Planning
  19. Dynamic Load Balancing Between Prefill and Decode
  20. Fault Tolerance in Disaggregated Systems
  21. Network Architecture for Disaggregation
  22. Understanding KV Cache Management at Scale
  23. The Economics of Disaggregation: When It Makes Sense
  24. Handling Variability in Production
  25. Optimizing Batch Sizes for Disaggregated Systems
  26. Conclusion

The Fundamental Problem: Prefill vs. Decode

Before we talk about disaggregation, you need to understand what makes prefill and decode so different computationally. They're not just "different workloads" - they have fundamentally different hardware requirements.

Prefill: Compute-Bound Parallel Processing

When a user sends you a prompt - say, 2,000 tokens - the model processes every single token in parallel. This is prefill. The attention mechanism computes a full attention matrix: each token attends to every other token in the sequence.

Here's what this looks like mathematically:

For a prompt of N tokens and model dimension d, the computation involves matrix multiplications with O(N^2 · d) operations for the query-key multiplication, O(N^2) operations for softmax and attention weighting, and O(N · d^2) operations for the output projection. The total is compute-intensive and highly parallelizable across all N tokens.

Because all tokens are available upfront, modern GPUs can parallelize this work efficiently. You get high arithmetic intensity - lots of computation per byte of memory accessed. GPUs love this. H100 tensor cores are designed for exactly this workload. This is why prefill scales well - adding more GPUs lets you process bigger batches or longer sequences with proportional throughput improvements.

Key metric for prefill: tokens per second, measured across the entire batch.

Decode: Memory-Bandwidth-Bound Sequential Processing

Now the model generates the response, one token at a time. This is decode. For each generated token, the model reads the KV (key-value) cache from all previous tokens, computes attention with the new query, generates one output token, and appends new K and V to the cache.

The computational structure for each decode step involves O(N · d) memory reads for the KV cache, O(N · d) operations for computing attention, and O(d^2) operations for generating the token. The problem is evident: memory traffic O(N · d) matches the computation O(N · d), so you perform roughly one operation per byte read. The memory-to-compute ratio is terrible.
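A back-of-envelope sketch makes this concrete. The function below (attention reads only, a single sequence, bf16, projections ignored; the name and simplifications are mine, not from any serving framework) computes FLOPs per byte for one decode step:

```python
def arithmetic_intensity_decode(n, d, layers, dtype_bytes=2):
    """FLOPs per byte moved for one decode step (attention only, one sequence)."""
    flops = 4 * layers * n * d                        # QK^T and AV against n cached tokens
    bytes_moved = 2 * layers * n * d * dtype_bytes    # read the K and V caches once
    return flops / bytes_moved

# With bf16 this ratio is 1 FLOP per byte for any n and d -- orders of
# magnitude below what a modern GPU needs to stay compute-bound.
print(arithmetic_intensity_decode(2000, 8192, 80))  # -> 1.0
```

The exact constant depends on what you count, but the conclusion doesn't: decode does on the order of one operation per byte of cache it reads.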

This is the memory-bandwidth wall. You're moving gigabytes of KV cache data for a few kilobytes of useful computation. Modern GPUs sit idle waiting for memory. Even an H100 with its massive bandwidth can't keep up - the compute is too trivial relative to the memory traffic.

Key metric for decode: tokens per second per GPU - and it's bound by memory bandwidth, not compute.

The Mismatch That Creates Inefficiency

Here's the brutal truth: if you put prefill and decode on the same GPU cluster, one workload always suffers. Provision compute-optimized hardware and decode starves - tokens generate slowly because memory bandwidth is the scarce resource. Provision bandwidth-optimized hardware and prefill crawls - prompt processing becomes the bottleneck. You can't win with a one-size-fits-all approach. It's like trying to optimize a single car for both highway cruising and city parking - the requirements conflict fundamentally.

The practical consequence is that monolithic serving systems never achieve optimal hardware utilization. Your GPU cores are waiting for memory during decode phases. Your memory subsystem is underutilized during compute-heavy prefill. Across the cluster, utilization hovers around 60% even under load, wasting expensive hardware.

Disaggregation: Separate Clusters, Separate Optimizations

The solution is radical simplicity: build two clusters. One optimized for compute, one optimized for memory bandwidth.

Prefill cluster: Dense, compute-optimized GPUs (H100s, L40S) in tight configurations. Maximize FLOPs. Process prompts in large batches. Minimize prefill latency. These GPUs are expensive but perfect for compute-heavy work.

Decode cluster: Multiple smaller GPUs or GPUs in bandwidth-optimized configurations (A100 memory bandwidth, consumer GPUs with high bandwidth per dollar). Run many decode instances in parallel. Maximize decode throughput. These might be cheaper per-unit or just better-suited to memory-bound workloads.

Between them: KV cache transfer fabric. When prefill finishes, ship the computed KV cache to a decode instance.

The architecture follows this flow: Incoming requests get routed as prompts to the prefill cluster optimized for compute with H100s and L40S GPUs handling batch sizes of 32-128. The prefill cluster computes the KV cache and first token, then transfers via RDMA or NVLink fabric with 1-10ms latency. The decode cluster, optimized for memory bandwidth, handles continuous batching. Finally, output tokens stream back to client responses.

Why does this work? Prefill throughput improves because there's no decode overhead: you can process massive batches, pack H100s efficiently, and batch requests aggressively without worrying about decode latency variance. Decode latency improves because there's no contention with prefill: decode instances work continuously on their KV caches, token generation is predictable, and you sustain high throughput without latency spikes. Resource efficiency improves because you right-size each cluster - if you get 100 short queries and 1 long generation, decode scales independently. And cost elasticity improves because you can scale decode instances without overprovisioning compute GPUs for decode peaks. This is huge for variable workloads.

KV Cache: The Data Transfer Challenge

Now here's the sticky part: moving KV cache data between clusters. This is where disaggregation becomes technically challenging - the infrastructure must support high-speed, low-latency transfers.

Understanding KV Cache Size

For a 70B parameter model with bfloat16 precision, consider the calculation: with 80 layers, hidden dimension of 8192, and 64 heads, each head has dimension 128. For each layer, you store K and V for the sequence length. The KV cache size for a sequence is layers multiplied by sequence length multiplied by hidden dimension multiplied by 2 bytes (for bfloat16), done twice for both K and V caches.

At a sequence length of 2,000 tokens, this works out to approximately 5.2 GB of KV cache per sequence. This is massive. A 2,000-token prompt generates 5+ GB of KV cache per sequence. At 1 Gbps network transfer, that's 40+ seconds - unacceptable. This constraint drives your infrastructure decisions completely.
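The calculation above can be checked in a few lines (bf16, with K and V each stored per layer):

```python
def kv_cache_bytes(layers, seq_len, hidden_dim, dtype_bytes=2):
    # K and V each hold a (seq_len, hidden_dim) tensor per layer
    return 2 * layers * seq_len * hidden_dim * dtype_bytes

size = kv_cache_bytes(layers=80, seq_len=2000, hidden_dim=8192)
print(size / 1e9)        # ~5.24 GB per sequence
print(size * 8 / 1e9)    # ~42 seconds to move at 1 Gbps
```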

Transfer Protocols

You have three primary options for moving this data:

RDMA (Remote Direct Memory Access) provides high-speed, low-latency transfer between GPUs in different servers. Per-transfer latency overhead is typically 1-3 milliseconds, with throughput exceeding 100 Gbps on modern NICs. This is best for multi-node disaggregation. RDMA is the industry standard because it bypasses the kernel and lets GPUs transfer data directly. The latency overhead is minimal relative to the data volume.

NVLink (Intra-Datacenter) works if prefill and decode clusters are in the same pod. It offers sub-millisecond latency and 600+ Gbps throughput with NVLink 5. This is the holy grail for disaggregation. If you can co-locate clusters, NVLink gives you minimal overhead.

Standard Ethernet is a fallback option. Latency is 10-100 milliseconds accounting for network and serialization overhead. Throughput is 10-40 Gbps depending on NICs. Standard Ethernet is slow for KV cache transfers, but it works if you have no other option. You'll see latency penalties, but it's better than not disaggregating.

Transfer Latency Impact on End-to-End Performance

The transfer latency directly impacts Time-To-First-Token (TTFT). Here's the flow: at t=0, a request arrives. By t≈150 ms, prefill has processed the 2,000-token prompt (the exact time depends on batch size). From t≈150-200 ms, the KV cache transfers to a decode instance (~50 ms on a fast RDMA fabric). From t≈200-210 ms, the first decode step runs. At t≈210 ms, the first token returns to the client.

Total TTFT with disaggregation is approximately 210 milliseconds. Compare this to monolithic serving (all on one cluster) which might achieve 150 milliseconds. The disaggregation overhead is real: approximately 60 milliseconds additional latency.

Is this worth it? Yes, because prefill can batch aggressively reducing variance. You gain massive decode throughput which lowers overall latency at scale. The overhead decreases with longer sequences - 5+ GB cache takes time anyway. At high request rates, the throughput gains dramatically outweigh the TTFT cost because reduced queuing delays offset the transfer overhead.

Scheduling in Disaggregated Systems

You need two different schedulers now. The prefill scheduler maximizes batching. The decode scheduler manages continuous streams. These are fundamentally different scheduling problems requiring different approaches.

Prefill Scheduler Design

Batch tokens for maximum GPU utilization. Key decisions include batch size and token padding. Too small batch sizes mean GPUs starve and utilization is low. Too large batch sizes increase TTFT as requests wait. The optimal strategy targets utilization while maintaining bounded wait time.

The prefill scheduler accumulates incoming requests into a queue. It decides when to start processing a batch based on two conditions: either the estimated batch provides good enough GPU utilization (typically 85%), or requests have waited too long (more than 50ms). Once a batch starts, all requests in it proceed together through prefill.
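A minimal sketch of this policy - queue, utilization target, wait-time cap - might look like the following. The class name, token budget, and thresholds are illustrative, not taken from any particular framework:

```python
from collections import deque

class PrefillScheduler:
    """Sketch of the launch policy above: start a batch when estimated GPU
    utilization is high enough, or the oldest request has waited too long."""

    def __init__(self, batch_token_budget=16384, util_target=0.85, max_wait_s=0.050):
        self.queue = deque()                     # (arrival_time, prompt_len)
        self.batch_token_budget = batch_token_budget
        self.util_target = util_target
        self.max_wait_s = max_wait_s

    def submit(self, prompt_len, now):
        self.queue.append((now, prompt_len))

    def should_launch(self, now):
        if not self.queue:
            return False
        queued_tokens = sum(n for _, n in self.queue)
        utilization = min(1.0, queued_tokens / self.batch_token_budget)
        oldest_wait_s = now - self.queue[0][0]
        return utilization >= self.util_target or oldest_wait_s > self.max_wait_s

    def take_batch(self):
        # Pop requests in arrival order until the token budget is exhausted
        batch, budget = [], self.batch_token_budget
        while self.queue and self.queue[0][1] <= budget:
            _, prompt_len = self.queue.popleft()
            batch.append(prompt_len)
            budget -= prompt_len
        return batch
```

A single 2,000-token request launches only once its 50 ms wait expires; eight of them fill ~98% of the budget and launch immediately.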

Token padding is a tradeoff. If you have requests of varying lengths, you can pad shorter requests to match longer ones. This means one prefill pass processes all requests, but it wastes compute on the padding. If you have five requests of lengths 500, 1000, 1500, 2000, and 2500 tokens, padding all to 2500 processes 12,500 token-slots for only 7,500 useful tokens - 5,000 slots, or 40%, are wasted. Alternatively, processing separately requires five prefill passes. The optimal strategy usually buckets requests by length: process requests of length 512-1024 together, 1025-2048 together, and so on. This balances efficiency and latency.
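The padding arithmetic and the bucketing strategy can be sketched together (the bucket boundaries are illustrative):

```python
def padding_waste(lens):
    # Fraction of token-slots spent on padding when padding to the max length
    padded_slots = max(lens) * len(lens)
    return (padded_slots - sum(lens)) / padded_slots

def bucket_by_length(prompt_lens, boundaries=(512, 1024, 2048, 4096)):
    """Group prompts so each batch pads only to its bucket's upper bound."""
    buckets = {b: [] for b in boundaries}
    for n in prompt_lens:
        for b in boundaries:
            if n <= b:
                buckets[b].append(n)
                break
    return {b: v for b, v in buckets.items() if v}

print(padding_waste([500, 1000, 1500, 2000, 2500]))  # -> 0.4
```

Prompts longer than the largest boundary would need their own batch; that case is elided here.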

Decode Scheduler Implementation

Run continuous batching: streams of tokens, different requests at different stages. The decode scheduler maintains a list of active sequences. Each step, it selects up to max_batch_size active sequences, does a forward pass generating one token for each, updates their KV caches, and removes completed sequences.
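A scheduling skeleton for this loop looks like the following; the model forward pass is elided, and the names are mine:

```python
from dataclasses import dataclass

@dataclass
class Sequence:
    seq_id: int
    tokens_left: int      # tokens still to generate
    kv_len: int           # current KV cache length

def decode_step(active, max_batch_size):
    """One continuous-batching step: pick up to max_batch_size sequences,
    emit one token each, and retire the sequences that finish."""
    batch = active[:max_batch_size]
    finished = []
    for seq in batch:
        seq.tokens_left -= 1      # one forward pass would emit one token...
        seq.kv_len += 1           # ...and append one K/V entry per layer
        if seq.tokens_left == 0:
            finished.append(seq)
    for seq in finished:
        active.remove(seq)
    return finished
```

In a real engine, new sequences arriving from prefill are appended to `active` between steps, which is what keeps the batch full.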

Continuous batching keeps the batch close to its maximum size: new sequences arrive from prefill as old sequences complete. This is where disaggregation shines - you can sustain consistently high throughput because the batch never empties.

Real Implementation: KV Cache Transfer

Let's implement a concrete KV cache transfer protocol that demonstrates the practical details:

A KV cache chunk represents KV cache for a sequence, containing the layer index, K data as a tensor of shape (seq_len, hidden_dim), V data as a tensor of shape (seq_len, hidden_dim), sequence ID, and timestamp.

The KV cache transfer class initializes with a transfer protocol (RDMA, NVLink, or Ethernet), base latency, and bandwidth. It estimates transfer time based on cache size, adding protocol-specific overheads. For RDMA, it adds 3-5ms base latency. For NVLink, 0.5ms. For Ethernet, 20ms.

The transfer simulation calculates total cache size in bytes, computes transfer time based on bandwidth, adds protocol overhead, and returns detailed metadata including cache size in MB, transfer time in milliseconds, target GPU, and protocol used.
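Here is one way to sketch the chunk structure and transfer estimator just described. The protocol base latencies follow the figures in the text; the bandwidths and class API are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class KVCacheChunk:
    layer_idx: int
    k_shape: tuple        # (seq_len, hidden_dim) -- stand-in for the real K tensor
    v_shape: tuple        # (seq_len, hidden_dim) -- stand-in for the real V tensor
    seq_id: int
    timestamp: float

# (base latency in seconds, bandwidth in bytes/s) per protocol
PROTOCOLS = {
    "rdma":     (0.004,  100e9 / 8),   # ~4 ms setup, 100 Gbps
    "nvlink":   (0.0005, 600e9 / 8),   # ~0.5 ms, 600 Gbps
    "ethernet": (0.020,  25e9 / 8),    # ~20 ms, 25 Gbps
}

class KVCacheTransfer:
    def __init__(self, protocol="rdma", dtype_bytes=2):
        self.protocol = protocol
        self.base_latency_s, self.bandwidth = PROTOCOLS[protocol]
        self.dtype_bytes = dtype_bytes

    def transfer(self, chunks, target_gpu):
        """Simulate shipping one sequence's KV cache; returns metadata only."""
        total_bytes = sum(
            (c.k_shape[0] * c.k_shape[1] + c.v_shape[0] * c.v_shape[1]) * self.dtype_bytes
            for c in chunks
        )
        transfer_s = self.base_latency_s + total_bytes / self.bandwidth
        return {
            "cache_size_mb": total_bytes / 1e6,
            "transfer_time_ms": transfer_s * 1e3,
            "target_gpu": target_gpu,
            "protocol": self.protocol,
        }
```

For the 70B example (80 layers, 2,000 tokens, hidden dimension 8192), the RDMA estimate comes out around 420 ms at 100 Gbps - a reminder that headline transfer times depend heavily on fabric bandwidth and on whether transfers overlap with compute.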

Running the transfer analysis shows that RDMA wins for disaggregated setups - overhead on the order of tens to hundreds of milliseconds for multi-GB caches, which is manageable at scale. NVLink is roughly 6x faster if you can co-locate clusters. Ethernet is brutal at 400+ milliseconds for long sequences and should be avoided unless necessary.

Real-World Implementations: vLLM and LMDeploy

Two major open-source projects now support disaggregated serving. vLLM added disaggregated mode in recent versions (v0.5+). Configuration uses YAML to specify a prefill cluster with 8 H100 GPUs, batch size of 128, and max sequences of 256. The decode cluster configuration uses 32 A100 or A10G GPUs, batch size of 512, and max sequences of 512. Cache transfer settings specify RDMA protocol with a target bandwidth of 100 Gbps.

To launch vLLM disaggregated serving, start the prefill cluster with the API server using disaggregated-mode prefill with 8 GPUs and batch size 128. Start the decode cluster separately with disaggregated-mode decode with 32 GPUs and batch size 512, pointing to the prefill cluster address. Finally, start the router that orchestrates between prefill and decode clusters.

LMDeploy's disaggregated serving uses request-level separation with separate engines for prefill and decode. The prefill engine uses 8-way tensor parallelism on H100s. The decode engine uses 4-way tensor parallelism on A100s. They communicate via Redis-backed state management.

Performance Benchmarks: Monolithic vs. Disaggregated

The true power of disaggregation shows up in performance comparisons. Benchmarking requires simulating realistic workloads and comparing metrics.

For monolithic serving with 8 GPUs handling both prefill and decode, prefill throughput is 400K tokens per second per GPU. Decode throughput is 100 tokens per second per GPU. With an average 2000-token prompt and 256-token generation, prefill latency is approximately 5 milliseconds, decode latency is 2.56 seconds, and total latency is 2.565 seconds. TTFT is approximately 5 milliseconds plus 20ms overhead, totaling 25 milliseconds. Due to contention, GPU utilization is only 60%. Throughput is limited to approximately 480 tokens per second.

For disaggregated serving with 8 prefill GPUs and 32 decode GPUs, prefill gets 600K tokens per second per GPU due to optimization. Decode gets 120 tokens per second per GPU due to continuous batching and no contention. Cache transfer adds 50 milliseconds for the RDMA transfer. Prefill latency is approximately 3.3 milliseconds. Decode latency is 2.13 seconds. Cache transfer is 50 milliseconds. TTFT is 3.3 + 50 + 5 = 58.3 milliseconds, with P99 latency around 70 milliseconds due to lower variance. Total latency is approximately 2.183 seconds. GPU utilization reaches 85% due to separation of concerns. Throughput reaches approximately 3,060 tokens per second.
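The decode-side figures are easy to verify from the quoted rates (a sketch using only numbers stated above):

```python
# Decode dominates total latency; prefill and transfer figures taken as given.
gen_tokens = 256

mono_decode_s   = gen_tokens / 100   # 100 tok/s per sequence -> 2.56 s
disagg_decode_s = gen_tokens / 120   # 120 tok/s per sequence -> ~2.13 s

# TTFT: prefill + cache transfer + first decode step, in milliseconds
disagg_ttft_ms = 3.3 + 50 + 5
```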

The comparison is striking. Disaggregated serving achieves 6.37x better throughput, 42% lower latency despite the TTFT overhead, 1.42x better GPU utilization, and dramatically lower latency variance. Numbers in this range show up in production deployments and demonstrate why disaggregation is worth the operational complexity.

Operational Challenges in Production

When you actually run disaggregated systems in production, you encounter challenges that theory doesn't immediately prepare you for. Resource coordination becomes complex. You need to ensure your prefill cluster and decode cluster stay in sync. If one goes down, the other can't function. Your monitoring needs to track two separate systems instead of one. Your deployment process becomes more complicated - you're upgrading two clusters instead of one.

But here's what experienced teams find: these operational complexities are worth it. The alternative - monolithic clusters that don't scale - creates far worse problems. At high request volumes, monolithic setup becomes fundamentally unresponsive. Users wait minutes for first token. Queues grow unbounded. The system enters a state where throwing more GPUs doesn't help because the bottleneck is software architecture.

Failure Modes and Resilience

Building disaggregated systems requires careful attention to failure modes. What happens if your KV cache transfer network fails? Do requests retry? Do they get dropped? You need circuit breakers, exponential backoff, and fallback paths. Some teams build dual-path serving where if disaggregated fails, they fall back to monolithic. This adds complexity but provides a safety net.

The monitoring story also gets intricate. You need to track not just end-to-end latency but understand how much is prefill, transfer, and decode. You need alerts distinguishing between "prefill slow" and "decode slow" so teams respond to the right subsystem. You need dashboards showing KV cache transfer efficiency and network saturation.

Capacity Planning

Traditional capacity planning asks "how many GPUs do I need?" With disaggregation, the question becomes nuanced. You need to answer "how many prefill GPUs?" and separately "how many decode GPUs?" These are nearly independent questions. A thousand concurrent users doing short queries need different ratios than a hundred users doing long generations.

This independence is beautiful from a capacity planning perspective. If decode becomes the bottleneck, you add more decode GPUs. Prefill stays happy. Your costs scale based on actual constraint, not theoretical worst case. Traditional monolithic setups force you to overprovision for the worst case of both simultaneously.

Some teams even dynamically scale these clusters separately based on demand patterns. Evening traffic differs from morning traffic. Prefill demand might spike in afternoon while decode demand peaks at night. By scaling independently, you track demand more precisely and waste less infrastructure.

Dynamic Load Balancing Between Prefill and Decode

One of the most sophisticated aspects of disaggregated serving is managing dynamic load balancing. The prefill cluster and decode cluster don't operate independently - they're deeply connected. The prefill cluster outputs become the decode cluster inputs. If prefill processes requests faster than decode can handle them, you accumulate backlog. If decode finishes faster than prefill generates work, decode GPUs sit idle. Sophisticated systems implement dynamic load balancing that adjusts prefill batch sizes and decode parallelism based on real-time queue depths.

The load balancing algorithm monitors queue depths on both sides. If the decode queue is full, prefill backs off and processes smaller batches, allowing the system to reach equilibrium. If the decode queue is empty, prefill processes larger batches, building up work for decode. This dynamic adjustment keeps both clusters busy without letting either run persistently over- or under-utilized.
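A hypothetical version of this feedback rule might look like the following; the watermarks and batch bounds are illustrative:

```python
def adjust_prefill_batch(batch_size, decode_queue_depth,
                         low_water=8, high_water=64,
                         min_batch=8, max_batch=128):
    """Shrink prefill batches when the decode queue backs up,
    grow them when decode is starved."""
    if decode_queue_depth > high_water:
        return max(min_batch, batch_size // 2)   # back off, let decode drain
    if decode_queue_depth < low_water:
        return min(max_batch, batch_size * 2)    # build up work for decode
    return batch_size                            # inside the equilibrium band
```

Running this once per scheduling tick gives multiplicative backoff under pressure and quick recovery when decode is idle, at the cost of some oscillation near the watermarks.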

Implementing this requires feedback mechanisms where prefill knows about decode queue state and vice versa. You need low-latency signaling so that load balancing adjustments happen quickly. You need monitoring that makes the system state visible so you can debug issues. Teams deploying disaggregated serving at scale spend significant engineering effort getting load balancing right because small imbalances propagate and degrade performance.

Fault Tolerance in Disaggregated Systems

Disaggregation introduces new failure modes that monolithic systems don't face. In a monolithic system, if a GPU fails, you lose that single unit. In a disaggregated system, if a prefill GPU fails, all the KV cache work it was doing is lost and must be recomputed. If a decode GPU fails mid-sequence, that entire sequence generation is lost.

Sophisticated systems implement replication for fault tolerance. You might have three prefill instances processing the same request in parallel. The first to complete wins, others discard their work. This increases cost but provides tolerance for one failure. For decode, you might checkpoint KV cache periodically so if a decode instance fails, another can resume from a recent checkpoint.

The complexity is non-trivial. Checkpointing KV cache adds network overhead. Replication multiplies computation. You need to balance the cost of fault tolerance against the cost of regenerating failed work. For mission-critical applications, the investment makes sense. For experimental systems, monolithic serving with periodic backup might be more pragmatic.

Network Architecture for Disaggregation

The network fabric connecting prefill and decode clusters becomes critical to performance. KV cache transfers require high bandwidth and low latency. The ideal setup uses RDMA or InfiniBand for sub-millisecond latencies and hundred-plus Gbps throughput. But this hardware is expensive and not all datacenters have it.

Teams have success with standard Ethernet using optimized drivers and careful network tuning. The key is ensuring enough bandwidth to avoid becoming a bottleneck. If your prefill cluster processes one hundred requests per second and each generates five gigabytes of KV cache, that's five hundred gigabytes per second of network traffic. Standard Ethernet won't cut it. You need either RDMA or very high-speed Ethernet.

Network architecture also influences placement decisions. If prefill and decode clusters must be in different datacenters for reliability, the latency cost becomes significant. Most teams co-locate them in the same pod or datacenter to minimize transfer latency. The geography of your infrastructure choices directly impacts the feasibility of disaggregation.

Understanding KV Cache Management at Scale

Before we dive deeper into the economics, understanding KV cache management is critical. The KV cache is what makes continuous batching possible - you're storing the key and value vectors for every token that's been generated so far, for every active sequence. This cache grows linearly with sequence length. For a 70B parameter model, each token's KV cache takes roughly 2.6 MB of memory (both K and V, with 80 layers and bfloat16 precision). A sequence of 2000 tokens means 5.2 GB just for that one sequence's cache.

In a disaggregated system, moving this cache between prefill and decode clusters is where the architecture becomes challenging. You're pushing gigabytes of data between clusters potentially many times per second at high request rates. The network fabric becomes a critical constraint. Without sufficient bandwidth, you become network-bound. The prefill cluster finishes quickly but has to wait for the network to transfer cache to decode. Or the decode cluster is ready for more work but has to wait for cache transfers.

This is why infrastructure teams often co-locate prefill and decode clusters in the same datacenter, even if they're logically separate. The network latency and bandwidth characteristics of local interconnects (like InfiniBand or NVLink across PCIe switches) are fundamentally different from datacenter networks. A one-millisecond local transfer might become a fifty-millisecond transfer across a WAN, which completely changes the economics of disaggregation.

Another KV cache consideration is eviction policy. With continuous batching, sequences come and go. Old sequences finish and free their cache. New sequences arrive and need cache. You need an eviction strategy that's fair, predictable, and doesn't cause pathological behavior. Some systems use LRU (least recently used) eviction, others use FIFO (first in, first out). The choice affects both fairness and memory efficiency.
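A minimal LRU sketch for per-sequence caches shows the shape of one such policy (sizes are in abstract blocks; `KVCacheLRU` and its API are illustrative, not from any serving framework):

```python
from collections import OrderedDict

class KVCacheLRU:
    """Least-recently-used eviction over per-sequence KV caches."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.caches = OrderedDict()      # seq_id -> blocks used

    def touch(self, seq_id):
        # A decode step on seq_id makes it most-recently used
        self.caches.move_to_end(seq_id)

    def add(self, seq_id, blocks):
        """Admit a new sequence, evicting LRU sequences until it fits."""
        evicted = []
        while self.caches and sum(self.caches.values()) + blocks > self.capacity:
            victim, _ = self.caches.popitem(last=False)   # least recently used
            evicted.append(victim)
        self.caches[seq_id] = blocks
        return evicted
```

A FIFO variant would simply drop the `touch` call - which is exactly the fairness difference the paragraph above describes.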

The Economics of Disaggregation: When It Makes Sense

Disaggregation introduces operational complexity, so it makes sense to understand when the benefits outweigh the costs. The primary benefit is utilization efficiency. In a monolithic system, even with careful optimization, GPU utilization typically peaks around 60-70% under load. You're either waiting for memory (decode-bound) or waiting for compute (prefill-bound), never using both simultaneously. At scale, these utilization losses translate to wasted money.

Consider a concrete scenario: you're serving 1000 requests per second, each with a 2000-token prompt and 256-token response. In a monolithic system, you might need 32 GPUs to maintain acceptable latency. With disaggregation, you might need 8 prefill GPUs and 32 decode GPUs, totaling 40 GPUs. The extra 8 GPUs seem wasteful until you consider the alternatives. If you tried to reduce the monolithic system to 24 GPUs, latency would spike to unacceptable levels because of the prefill-decode contention. The disaggregated system actually saves money despite needing more GPUs because the resources are being used efficiently rather than wasted on context switching and contention.

The economic crossover point where disaggregation becomes attractive depends on your workload characteristics and infrastructure costs. If you're running fewer than 100 requests per second, the operational overhead probably isn't worth it. If you're running thousands of requests per second, disaggregation becomes economically essential. Between those points, it depends on your specific numbers. Build a capacity model for your workload and do the calculation.

Another economic dimension is hardware flexibility. Disaggregation lets you use different GPU types for different jobs. You might use expensive compute-optimized GPUs for prefill and cheaper bandwidth-optimized GPUs for decode, achieving better cost efficiency than a one-size-fits-all approach. This flexibility becomes more valuable as GPU technology evolves - newer, cheaper GPUs might be great for decode but not optimal for prefill, and disaggregation lets you take advantage of these asymmetries.

Handling Variability in Production

Real-world workloads rarely look like the nice uniform distributions in benchmarks. You'll get bursts of short queries, bursts of long queries, sustained load, and quiet periods. Disaggregated systems need to handle this variability gracefully without collapsing.

The challenge is that prefill and decode have different scaling behaviors under load. A burst of short queries causes a prefill spike but minimal decode pressure (short responses don't need much decode work). A sustained stream of long queries fills up the decode queue. A sudden silence after heavy load leaves decode GPUs idle while prefill sits empty. In a monolithic system, the contention between these different demand patterns is the bottleneck. In a disaggregated system, you need dynamic scaling to handle them efficiently.

Workload characterization becomes critical. You need to understand your typical query patterns: What's the distribution of prompt lengths? What's the distribution of generation lengths? Are there time-of-day patterns? Are there request bursts or is traffic smooth? Different answers drive different disaggregation strategies.

Consider a search application where queries are short (50-500 tokens) and most users just want a yes-no answer (5-20 token responses). The decode cluster barely works while the prefill cluster spikes. You want large prefill capacity and modest decode capacity. Your disaggregation ratio might be 10:1.

Now consider a chatbot where users send long context (2000-4000 tokens) and expect detailed responses (200-500 tokens). Prefill has more work but decode does too. Your ratio might be 2:1 or even 1:1. Without understanding your workload, you'll provision the wrong ratio and waste resources.

Optimizing Batch Sizes for Disaggregated Systems

Batch size selection is more nuanced in disaggregated systems. In monolithic serving, you want large batches to amortize overhead. In disaggregated serving, you want enough batching to keep both clusters busy, but too much batching on either side creates queueing.

For prefill, large batches are still good - they keep GPU utilization high, though every request in a batch waits for the whole batch to finish. A prefill batch size of 64-256 is reasonable. But if you batch too aggressively and make users wait for batch formation, prefill latency balloons.

For decode, small batches might actually be better than you'd think. In continuous batching, you're not waiting for batch formation - you're streaming new sequences in as prefill outputs them. A decode batch size of 16-32 is often sufficient. Larger batches help with throughput but increase latency per token for users already in the batch.

Some disaggregated systems implement adaptive batching: start with small batches and increase them as queue depth grows, then shrink them as queue depth decreases. This keeps latency low under light load and throughput high under heavy load. The infrastructure complexity is non-trivial but the latency-throughput tradeoff is often worth it.
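One possible shape for such an adaptive policy, with illustrative thresholds and step size:

```python
def adaptive_decode_batch(queue_depth, current_batch,
                          min_batch=16, max_batch=256, step=16):
    """Grow the decode batch as the queue deepens, shrink it as the queue
    drains -- trading per-token latency for throughput under load."""
    if queue_depth > current_batch:          # backlog: favor throughput
        return min(max_batch, current_batch + step)
    if queue_depth < current_batch // 2:     # light load: favor latency
        return max(min_batch, current_batch - step)
    return current_batch
```

Called once per scheduling tick, this ramps up during bursts and decays back toward the minimum batch when traffic quiets.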

This is where the infrastructure investment becomes real. You need load predictors that forecast whether the next ten minutes will bring prefill spikes, decode spikes, or balanced load. You need auto-scaler policies that respond to these predictions. For cloud deployments, this means starting new GPU instances (which can take minutes). For on-premises deployments, this might mean having spare capacity that you activate. Either way, you're adding operational complexity.

Some teams handle this with over-provisioning. They size their decode cluster generously (often 2-3x what they need on average) and accept the cost penalty in exchange for simplicity. This works but defeats some of the economic benefits of disaggregation. Others invest heavily in dynamic scaling infrastructure. Both approaches work; they reflect different tradeoffs between cost and complexity.

Conclusion

Disaggregated prefill and decode represent a fundamental shift in how we think about LLM inference infrastructure. Instead of forcing one cluster to optimize for two wildly different workloads, we give each job its own hardware and optimize independently.

The math is clear: prefill is compute-bound, decode is memory-bound. Separation unlocks dramatically better throughput and GPU utilization. The TTFT cost (typically 50-100ms from KV cache transfer) is real but manageable, and at high request rates, the reduced queueing delays from better throughput actually improve end-to-end latency.

vLLM and LMDeploy both support disaggregated serving now. If you're running high-volume LLM workloads, experimenting with disaggregation should be on your roadmap. The future of LLM serving isn't one cluster doing everything. It's specialized clusters doing one thing exceptionally well.

The journey from monolithic to disaggregated serving isn't trivial, but the rewards are proportional to the effort. Teams that successfully implement disaggregation consistently report that the operational complexity is manageable and the performance gains more than justify the investment. Start by benchmarking your current system to understand where the bottlenecks are. If it's prefill-decode contention, disaggregation is your answer. Build incrementally - perhaps start with a small disaggregated cluster for a subset of traffic, validate the gains, then expand. The path from monolithic to disaggregated is well-worn now; you're not pioneering but following a pattern that major LLM serving teams have already validated in production.

