July 7, 2025
AI/ML Infrastructure Inference vLLM LLM

vLLM Production Deployment: Architecture and Configuration Guide

You've built an impressive LLM application. Your prototype works locally. Users love it. Now you need to serve 1,000 concurrent requests without hemorrhaging GPU memory or watching latency spike.

This is where vLLM enters the picture - and where most teams make costly mistakes.

We're not talking about spinning up a FastAPI server and hoping for the best. We're talking about understanding the architectural choices that separate a hobbyist inference setup from a production system that handles Stripe's 50 million daily API calls (they cut inference costs by 73% migrating to vLLM). This guide walks you through the configuration decisions that matter.

Table of Contents
  1. The Problem: Why Standard Inference Stalls
  2. PagedAttention: Managing the KV Cache Like Virtual Memory
  3. Continuous Batching: Scheduling at the Iteration Level
  4. Prefix Caching: Reusing Shared Prompts
  5. Quantization: Trading Model Precision for Speed
  6. Tensor Parallelism: Scaling Across Multiple GPUs
  7. Cost Optimization at Production Scale
  8. When to Use vLLM vs. TensorRT-LLM
  9. Common Production Pitfalls and How to Avoid Them
  10. Moving from Development to Production
  11. Key Takeaways

The Problem: Why Standard Inference Stalls

Before we dive into solutions, let's understand what breaks a naive LLM serving setup.

A baseline FastAPI server serving Llama-3.1-8B on an A100 GPU handles 2-5 concurrent requests before users feel pain. That's not a limitation of the GPU - it's a limitation of how you're managing memory and scheduling.

Here's what happens:

You load a batch of requests. The GPU processes them. You wait for all of them to finish. Then you load the next batch. Dead time emerges between batches. Memory allocates greedily. The KV cache - the key-value tensor pairs that each request accumulates during generation - fragments across GPU memory like a hard drive with no defragmentation.

By the time your fourth or fifth request joins the party, you've exhausted 40GB of an 80GB A100. You hit the OOM wall.

vLLM's architecture fixes this at three levels: memory management (PagedAttention), scheduling (continuous batching), and caching (prefix caching). Let's examine each.

PagedAttention: Managing the KV Cache Like Virtual Memory

Think of PagedAttention as virtual memory for your GPU.

The KV cache is the reason most naive LLM serving systems fail at scale. It's worth understanding in detail because it determines whether your system can handle 10 concurrent users or 1,000.

During inference, each token position in the sequence generates a key vector and a value vector at every layer. These accumulate as the model generates more tokens. For a 70B parameter model, a single 2000-token sequence accumulates gigabytes of KV cache - and with many sequences generating simultaneously, the aggregate can rival the model weights themselves, making the KV cache the dominant memory consumer.

Traditional attention implementation allocates a contiguous block of GPU memory for each sequence's KV cache. Sequence 1 gets a 2GB block. Sequence 2 gets a 2GB block. And so forth. If you have 100 concurrent sequences, you need 200GB of GPU memory just for KV caches. One A100 has 80GB. You're blocked.

What's worse, memory becomes fragmented. When a sequence completes, its 2GB block is freed, but if the memory manager has already placed other sequences' blocks around it, that freed region becomes stranded - too small for a long sequence's allocation, partially wasted on a short one. This is exactly like disk fragmentation on old hard drives. Your GPU memory becomes scattered and inefficient.

PagedAttention fixes this through a level of indirection. Instead of allocating contiguous blocks, the KV cache is split into fixed-size blocks - typically 16 tokens each. Each block is small (about 512KB per layer for a 70B model). When a sequence needs to store KV for tokens 0-15, it gets block 0. Tokens 16-31 get block 1. And so forth. A block table maps logical block numbers to physical GPU memory addresses.

When a sequence completes, its blocks are freed. These small blocks immediately serve new sequences. Fragmentation drops from 60-70% to 10-20%. The same A100 that held 5 concurrent sequences now holds 50-200 sequences because memory is used efficiently.

This level of indirection - one extra lookup per memory access - has negligible performance cost, and the payoff is dramatic: an order-of-magnitude improvement in concurrent user capacity without changing the model or the compute hardware.

The production impact is staggering. With contiguous allocation, you're constrained by KV cache memory. With paged allocation, you're constrained by compute. And compute is where the improvements matter. You can saturate compute cores. You can't saturate arbitrary amounts of contiguous memory.

Understanding virtual memory and how operating systems manage it helps intuition here. Your CPU uses page tables to map logical addresses to physical memory addresses, enabling swapping to disk and efficient memory sharing. PagedAttention applies the same principle to GPU memory. It's one of those ideas that seems obvious in retrospect but represents a major rethinking of how memory management works for inference.

In traditional attention, the KV cache for a sequence grows linearly with generation length. If your model generates 2048 tokens, you store 2048 key and value vectors per layer in contiguous GPU memory. Multiply that by 100 concurrent sequences and you've allocated KV storage for 200,000+ token positions - a fragmented, wasteful mess.

Here's the math. For Llama-3-70B (80 layers, hidden dimension 8192; grouped-query attention would shrink this further, but take the worst case):

  • Each token's KV per layer: 2 × 8192 float16 values = 32KB
  • Across all 80 layers: about 2.5MB per token
  • 100 concurrent requests, 2000 tokens average: 100 × 2000 × 2.5MB ≈ 500GB

That's impossible on an 80GB A100. But this is exactly what contiguous allocation attempts - the allocator tries to reserve it, fragmentation explodes, and the system bottlenecks.
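A quick sanity check on that arithmetic, as a throwaway script (full multi-head attention and 80 layers assumed; the real model's grouped-query attention divides the KV share by 8):

```python
def kv_cache_bytes(hidden_dim, n_layers, n_tokens, bytes_per_value=2):
    # Key + value vector per token per layer, float16 = 2 bytes/value
    per_token_per_layer = 2 * hidden_dim * bytes_per_value
    return per_token_per_layer * n_layers * n_tokens

print(kv_cache_bytes(8192, 80, 1) / 1024**2)           # ~2.5 MB per token
print(kv_cache_bytes(8192, 80, 100 * 2000) / 1024**3)  # ~488 GB for 100 x 2000 tokens
```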

PagedAttention partitions each sequence's KV cache into fixed-size blocks (typically 16 tokens per block). Instead of allocating contiguous memory, these blocks live in arbitrary locations on the GPU. A block table - essentially a page table like your OS uses - maps logical blocks to physical memory addresses.

Each block consumes 16 × 32KB = 512KB per layer - about 40MB across all 80 layers. That same 100 concurrent sequences × 2000 tokens = 12,500 blocks total. The system maintains a free pool of blocks. When a sequence completes, its blocks return to the pool. When a new sequence arrives, the scheduler grabs available blocks. No contiguity requirement. No fragmentation.
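Here's the free-pool mechanic as a toy allocator - pure Python with made-up class and method names, nothing taken from vLLM's internals:

```python
class BlockAllocator:
    """Toy paged KV-cache allocator: fixed-size blocks, a free pool, block tables."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> list of block ids

    def append_token(self, seq_id, token_pos):
        # A new block is needed every block_size tokens
        if token_pos % self.block_size == 0:
            if not self.free:
                raise MemoryError("no free KV blocks - request must wait")
            self.tables.setdefault(seq_id, []).append(self.free.pop())

    def release(self, seq_id):
        # Completed sequence: all its blocks return to the pool instantly
        self.free.extend(self.tables.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=1000)
for pos in range(2000):                # one 2000-token sequence
    alloc.append_token("seq-1", pos)
print(len(alloc.tables["seq-1"]))      # 125 blocks, matching the math above
alloc.release("seq-1")
print(len(alloc.free))                 # 1000 - every block reusable, no holes
```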

Why this matters in production:

When a sequence completes generation and frees memory, you don't have a multi-gigabyte hole that blocks new allocations. You have 125 freed 16-token blocks that immediately serve new sequences. Memory fragmentation drops from 60-70% waste to 10-20%.

Stripe's infrastructure team published benchmarks showing KV cache memory efficiency improvements of 60-90% with PagedAttention enabled. The same A100 that handled 5 concurrent requests with standard attention now handles 50-200+.

The architectural implication is profound: with contiguous allocation, KV cache is the resource bottleneck. With PagedAttention, GPU compute is the bottleneck. You can actually saturate the tensor cores instead of thrashing memory allocators.

Here's how you enable it in vLLM:

python
from vllm import LLM, SamplingParams
 
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,  # Split across 4 GPUs
    gpu_memory_utilization=0.90,  # Use 90% of GPU VRAM
    max_model_len=4096,  # KV cache pages for sequences up to 4096 tokens
)
 
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=2048,
)

The gpu_memory_utilization=0.90 parameter is critical. It tells vLLM what fraction of VRAM it may claim overall - model weights plus KV cache; whatever remains after loading the weights becomes the KV cache block pool. In production, you want this aggressive - 0.85-0.95 depending on your safety margin. Conservative values (0.50) waste GPU dollars.

Continuous Batching: Scheduling at the Iteration Level

PagedAttention solves memory. Continuous batching solves throughput.

The core issue with traditional batching is that sequences complete at different times. In a real production system, users don't all submit requests at the same moment. Requests arrive continuously. Early requests might be short (asking for a summary). Late-arriving requests might be long (asking for a detailed essay). By the time you've collected 32 requests into a batch, request 1 might be done generating and waiting for the batch to finish before its result is returned.

This is fundamentally incompatible with high utilization. You either fix the batch size at 32 and accept that late requests wait, or you wait for the slowest request, accepting inefficiency. Neither option is optimal.

Continuous batching changes the scheduling model entirely. Instead of thinking about batches, think about iterations. Each forward pass through the model is an iteration. The scheduler maintains a queue of active sequences and decides which sequences to process in each iteration.

In iteration 0, you process sequences 1-32. In iteration 1, if sequence 5 has finished generating its tokens, the scheduler pops sequence 5 from the queue, pops sequence 33 from the waiting queue, and processes sequences 1-4, 6-32, 33. The batch size stays 32. Utilization stays high.

The algorithmic complexity is higher than fixed batching, but it's manageable. The scheduler needs to track request state, understand when each sequence completes, and manage a pool of waiting requests. But this complexity is worth it because throughput scales dramatically.

In production systems, continuous batching typically delivers 2-4x throughput improvement compared to fixed batching. The reason is simple arithmetic: fixed batching's effective batch size shrinks as requests finish, while continuous batching maintains it across iterations. Double the effective batch size and you double throughput.

The latency benefit is equally important. With fixed batching, late-arriving requests wait. With continuous batching, requests are processed as soon as there's capacity. This smooths out latency variance. Your P99 latency stops being dictated by batch windows and starts being dictated by actual request processing time.

This architectural insight - batching at the iteration level instead of the request level - propagates throughout the system. Your memory allocator changes. Your scheduler changes. Your profiling tools need to understand iteration-level work distribution. Your monitoring needs to track iteration efficiency, not just request throughput.

Understanding this model is crucial when debugging production problems. If throughput is lower than expected, you need to know if it's because iterations are underutilized (not enough active sequences) or overutilized (batch can't grow larger). Different problems require different solutions. Underutilization suggests you need more concurrent requests (load balancer improvement). Overutilization suggests you've hit memory or compute bottlenecks (scaling or quantization needed).

Traditional static batching groups requests into a fixed-size batch - say 32 sequences. The GPU processes all 32 through the transformer forward pass (iteration 1), then iteration 2, and so on. The batch completes when the slowest sequence finishes generation.

Here's where pain emerges: if requests vary in output length, you get severe load imbalance. One user asks for a 100-token summary (finishes after iteration 10). Another asks for a 2000-token essay (finishes after iteration 200). The GPU idles for 190 iterations waiting for stragglers.

In a real production scenario serving 100 diverse requests, you might see 40% of the time wasted on idle cycles just waiting for sequences to complete.

Continuous batching replaces this model entirely. The scheduler operates at the iteration level - each forward pass through the transformer. When a sequence finishes iteration N (meaning it has generated its max_tokens or hit an end-of-sequence token), the scheduler immediately slots a new request into that position, with no waiting.

The mechanics are elegant. Iteration zero processes your initial 32 sequences. Iteration one starts with 32 active sequences, but if sequence 5 finished during iteration zero, the scheduler loads a new request into position 5. Now you're processing 31 original sequences plus one new one. The batch size stays constant. The GPU stays saturated.

For sequences of varied lengths, this is transformative. Short sequences finish early and immediately get replaced. Long sequences continue processing. The GPU never waits. All 32 GPU compute slots stay filled with productive work.
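A toy simulation makes the difference concrete (illustrative only - a real scheduler also handles prefill, memory pressure, and arrival times):

```python
from collections import deque

def simulate(request_lengths, batch_size):
    """Iterations needed under continuous vs. static batching."""
    # Continuous: a finished slot is refilled on the very next iteration
    waiting, active, cont_iters = deque(request_lengths), [], 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())       # slot in new requests
        active = [r - 1 for r in active if r > 1]  # one decode step each
        cont_iters += 1

    # Static: each batch runs until its slowest request finishes
    static_iters = sum(max(request_lengths[i:i + batch_size])
                       for i in range(0, len(request_lengths), batch_size))
    return cont_iters, static_iters

lengths = [100, 2000] * 32                 # 64 mixed short/long requests
cont, static = simulate(lengths, batch_size=32)
print(cont, static)                        # continuous finishes far sooner
```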

Here's the configuration that matters:

python
from vllm import LLM
 
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
    max_num_batched_tokens=8192,  # Key parameter: tokens per iteration
    enable_prefix_caching=True,  # Discussed next
)

The max_num_batched_tokens parameter controls how many total tokens the model processes per iteration. For a 70B model on 4x A100s, 8192 is aggressive. For smaller 7B models, 4096-6144 is safer.

How to tune this in practice:

  1. Start with (max_model_len × max_batch_size) / 2 as a baseline.
  2. Monitor GPU memory utilization via nvidia-smi.
  3. If GPU memory spikes above 90% during heavy load, reduce it by 20%.
  4. If GPU utilization drops below 85%, increase it by 20%.
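Those four steps as a throwaway helper - the starting formula and the 20% adjustments are this guide's rules of thumb, not vLLM defaults:

```python
def baseline_batched_tokens(max_model_len, max_batch_size):
    # Step 1: start at half the theoretical maximum token load
    return (max_model_len * max_batch_size) // 2

def adjust(current, gpu_mem_pct, gpu_util_pct):
    # Steps 3-4: back off under memory pressure, push up when underutilized
    if gpu_mem_pct > 90:
        return int(current * 0.8)
    if gpu_util_pct < 85:
        return int(current * 1.2)
    return current

start = baseline_batched_tokens(4096, 32)
print(start)                                           # 65536
print(adjust(start, gpu_mem_pct=93, gpu_util_pct=88))  # backs off 20%
```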

You can track batching efficiency via vLLM's metrics API:

python
import requests
 
# Query Prometheus metrics endpoint
response = requests.get("http://localhost:8000/metrics")
metrics = response.text
 
# Look for:
# vllm:num_requests_running
# vllm:batch_size
# vllm:prompt_tokens_total

When num_requests_running stays near your max_batch_size and batch_size stays above 20, continuous batching is working. If num_requests_running drops below 10, you're I/O bound - network latency to your load balancer is the bottleneck, not the GPU.

Prefix Caching: Reusing Shared Prompts

Here's a free efficiency win that most teams leave on the table.

Prefix caching detects when multiple requests share the same system prompt or context prefix and reuses their KV cache blocks. A customer service chatbot with 1000 concurrent users and a 500-token system prompt can cache those tokens once.

Impact: 40-90% time-to-first-token (TTFT) reduction for chatbot workloads.

Enable it:

python
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
    enable_prefix_caching=True,
    max_num_batched_tokens=8192,
)
 
# Track cache hits via metrics
# Look for: vllm:prefix_cache_hit_rate

The cache hit rate metric tells the story. Monitor it:

python
def check_cache_performance():
    response = requests.get("http://localhost:8000/metrics")
    for line in response.text.split('\n'):
        if 'prefix_cache_hit_rate' in line and not line.startswith('#'):
            print(f"Cache hit rate: {line}")

For a typical chatbot with consistent system prompts, you should see 60-80% cache hit rates. Below 30% suggests your prefix isn't shared enough or is too short.
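Why length and sharing matter: reuse happens in whole 16-token blocks and stops at the first divergence. A toy matcher (simplified - vLLM hashes block contents rather than comparing token lists):

```python
def shared_blocks(tokens_a, tokens_b, block_size=16):
    """Whole KV blocks reusable between two tokenized prompts."""
    n = 0
    for i in range(0, min(len(tokens_a), len(tokens_b)), block_size):
        block_a = tokens_a[i:i + block_size]
        if len(block_a) == block_size and block_a == tokens_b[i:i + block_size]:
            n += 1
        else:
            break  # reuse ends at the first partial or divergent block
    return n

system_prompt = list(range(500))        # stand-in for 500 shared system-prompt tokens
req_a = system_prompt + [9001, 9002]    # user question 1
req_b = system_prompt + [7001, 7002]    # user question 2
print(shared_blocks(req_a, req_b))      # 31 full blocks cached once, served twice
```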

Quantization: Trading Model Precision for Speed

Here's a reality that surprises many engineers: your model doesn't need full float32 precision to produce useful outputs.

Quantization reduces model weights from float32 (32 bits per number) to lower precision formats like int8 (8 bits) or float16 (16 bits). For a 70B parameter model, reducing from float32 to int8 cuts memory from 280GB to 70GB and increases throughput by 2-4x.

The catch: quantization introduces numerical error. You lose about 1-2% of accuracy, depending on the quantization method. For many applications - customer service, summarization, code generation - this loss is imperceptible. For others - mathematical reasoning, exact retrieval - you might notice.
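The memory arithmetic behind those trade-offs, as a two-line calculator (decimal gigabytes, weights only - KV cache and activations come on top):

```python
def weight_memory_gb(params_billion, bits_per_weight):
    # parameters x bytes per parameter, in decimal GB
    return params_billion * (bits_per_weight / 8)

for bits, fmt in [(32, "float32"), (16, "float16"), (8, "int8"), (4, "int4")]:
    print(f"70B @ {fmt}: {weight_memory_gb(70, bits):.0f} GB")
```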

vLLM supports multiple quantization schemes. AWQ (Activation-aware Weight Quantization) is a good starting point:

python
llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",  # Pre-quantized model
    quantization="awq",
    tensor_parallel_size=2,  # Can use fewer GPUs now
    gpu_memory_utilization=0.90,
)

If you want to quantize a model yourself:

bash
# Install the GPTQ toolkit
pip install auto-gptq
 
python
# Quantization sketch - config values and the calibration set are illustrative;
# see the AutoGPTQ docs for full options
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
 
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf", quantize_config)
model.quantize(calibration_examples)  # your list of tokenized calibration samples
model.save_quantized("llama-2-70b-gptq")

In production, quantized models save money without sacrificing user experience. A 70B model using 8 A100s might drop to 2 A100s with quantization. At AWS on-demand pricing, that's on the order of 15-20K dollars per month saved.

Tensor Parallelism: Scaling Across Multiple GPUs

A 70B parameter model in FP16 (half precision) consumes 140GB of GPU memory. A single 80GB A100 can't hold it. You need to split the model across multiple GPUs.

Scaling a model across multiple GPUs introduces communication overhead that fundamentally limits throughput. Understanding this constraint is crucial because many teams assume that throwing more GPUs at the problem always helps. The truth is more subtle.

Tensor parallelism divides the weight matrices of each layer across GPUs. This seems straightforward until you realize the implication: every forward pass requires synchronization. Each GPU computes its shard of a layer in parallel, but no GPU can start the next layer until the shards are combined. The parallelism comes from spreading each layer's work across multiple GPUs; the synchronization points are the price.

For a 32-layer model with 64 attention heads, tensor parallelism across 4 GPUs means each GPU computes 16 heads. But every attention output must be combined - an all-reduce operation - before feeding into the next layer. So the execution timeline is: GPU 0-3 compute heads (parallel), all-reduce (synchronized), next layer. Repeat 32 times.

All-reduce latency on NVLink (within-node GPU communication) is about 10 microseconds. Across PCIe (slower), it's 100-1000 microseconds. For a 2000-token generation, that's 32 layers × 2000 tokens × 10 microseconds = 640 milliseconds of pure communication overhead. Compare to the actual compute time (maybe 2000ms), and suddenly communication is significant.

This is why tensor parallelism is primarily a memory optimization, not a throughput optimization. You use it because you have to (model doesn't fit on one GPU), not because you want to (throughput improves). The sweet spot depends on your interconnect and model size.

For a 7B model on a single GPU, tensor parallelism is unnecessary - it wastes communication. For a 70B model on a single GPU, impossible - you have to use tensor parallelism. For a 70B model on 4 GPUs with NVLink, tensor parallelism makes sense - communication overhead is small. For a 70B model on 8 GPUs across two nodes with standard Ethernet, tensor parallelism is suboptimal - communication overhead becomes dominant.

The architectural implication is important: don't use tensor parallelism as a blanket scaling approach. Use it within a node (over NVLink), then use data parallelism across nodes. This hybrid approach lets each layer of your infrastructure specialize: fast, low-latency NVLink for within-node parallelism; higher-latency but sufficient InfiniBand for across-node synchronization.

When you're designing your cluster, understanding this constraint shapes everything. If you're targeting 70B models, you need 4-GPU nodes with NVLink. If you're targeting 13B models, 2-GPU nodes with NVLink suffice. If you tried to use PCIe-connected GPUs, the communication overhead becomes unacceptable. This physical constraint shapes your purchasing decisions, your rack layout, your power delivery, and your network topology.

Understanding the limits of tensor parallelism also prevents overcommitting to distributed inference. If you have 100 concurrent 7B requests, you're better off with multiple independent instances than one big distributed system. Each instance runs on a single GPU (or 2 with NVLink for memory). Communication is eliminated. Throughput is higher. This is counterintuitive to teams used to monolithic systems, but for inference, distributed often means slower.

Tensor parallelism splits weight matrices column-wise across GPUs. This is different from data parallelism (which replicates the model and distributes data). Here's why column-wise matters:

In a linear layer computing Y = X @ W.T:

  • Traditional approach: each GPU holds the full W and processes a chunk of X. Simple, but every GPU must store every weight.
  • Tensor parallel approach: GPU 0 holds W[:, :dim/2], GPU 1 holds W[:, dim/2:] - a split along the input dimension. Each GPU multiplies its shard against the matching slice of X's features, producing a partial sum of the full output. An all-reduce adds Y_0 + Y_1 into the final result.

This matters because every attention head and FFN layer in a transformer can be parallelized this way. Each GPU holds 1/N of the weights and performs 1/N of the computation, with synchronization overhead limited to all-reduce operations.
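You can verify the shard arithmetic with NumPy - two arrays stand in for two GPUs, and the all-reduce becomes a plain addition:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # 4 tokens, hidden dim 8
W = rng.standard_normal((6, 8))   # output dim 6, input dim 8

# Each "GPU" holds half of W's input dimension
Y0 = X[:, :4] @ W[:, :4].T        # GPU 0's partial sum
Y1 = X[:, 4:] @ W[:, 4:].T        # GPU 1's partial sum
Y = Y0 + Y1                       # the all-reduce: sum partial outputs

assert np.allclose(Y, X @ W.T)    # identical to the unsharded result
```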

The forward pass with 4-GPU tensor parallelism on Llama-3-70B:

  1. Input distribution: Tokens arrive at GPU 0 and are broadcast to GPUs 1-3 (microseconds).
  2. Layer N (Attention): Each GPU computes its portion of attention heads. For 64 heads on 4 GPUs, each computes 16 heads in parallel.
  3. All-reduce: Combine attention outputs. All-reduce latency: 10-100µs on NVLink, 1-5ms on PCIe.
  4. Layer N (FFN): Each GPU computes its portion of the up-projection and down-projection weights.
  5. Repeat for all 80 transformer layers.

The all-reduce happens at every layer. For an 80-layer model, that's 160 all-reduces per forward pass. If each takes 10µs, that's 1.6ms of communication overhead per forward pass. If running a 2000-token generation, that's 1.6ms × 2000 = 3.2 seconds of pure communication time.
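That overhead math, parameterized so you can plug in your own interconnect latency (two all-reduces per layer - attention plus FFN - and the microsecond figures are this article's estimates, not measurements):

```python
def comm_overhead_s(n_layers, tokens, all_reduce_us, per_layer=2):
    # per_layer all-reduces per transformer layer, one forward pass per token
    return n_layers * per_layer * all_reduce_us * tokens / 1e6

print(comm_overhead_s(80, 2000, 10))    # NVLink-class: 3.2 s of pure communication
print(comm_overhead_s(80, 2000, 500))   # PCIe-class: 160 s - dominates compute
```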

Contrast with a model served on a single GPU: no communication overhead at all. That's why you never shard more than memory forces you to. The sweet spots depend on your model and interconnect:

For Llama-3-7B on single node:

  • Use tensor_parallel_size=1 on a single A100/H100.
  • You get full model on one GPU, minimal communication overhead.

For Llama-3-13B:

  • Use tensor_parallel_size=2 on 2x GPUs.
  • Model fits in 40GB (A100/H100), all-reduce stays sub-millisecond with NVLink.

For Llama-3-70B:

  • Use tensor_parallel_size=4 or 8 depending on your interconnect.
  • 4x GPUs = 10ms communication per iteration (NVLink).
  • 8x GPUs = 15-20ms communication per iteration (PCIe).

Check your interconnect topology:

bash
nvidia-smi topo -m
 
# Output shows the link between each GPU pair:
# X       = self (the diagonal entries)
# SYS/PHB = traversal over PCIe and the CPU (slow)
# NV#     = NVLink with # links (fast, microsecond-scale latency)

Configuration for Llama-3-70B on 4x A100s with NVLink:

python
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,
    pipeline_parallel_size=1,
    distributed_executor_backend="mp",  # single-node multiprocessing; NCCL handles the all-reduces
    gpu_memory_utilization=0.90,
    max_num_batched_tokens=8192,
    enable_prefix_caching=True,
)

For multi-node deployments (e.g., 16 GPUs across 2 nodes):

python
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=8,
    pipeline_parallel_size=2,  # Split across nodes
    distributed_executor_backend="ray",  # Ray is required for multi-node
    gpu_memory_utilization=0.85,  # Conservative for multi-node
)

Cost Optimization at Production Scale

At large scale, a few percent efficiency improvement translates to thousands of dollars per month. Here are cost reduction strategies that actually work:

Strategy 1: Right-size instance types. Larger GPUs have higher per-unit throughput but also higher per-unit cost. Sometimes multiple smaller GPUs are more cost-efficient than one large GPU. Model this out for your actual workload.

Strategy 2: Use spot instances where applicable. Cloud providers offer 60-80 percent discounts for interruptible VMs. For batch inference, spot instances are ideal (just requeue failed batches). For real-time serving, use spot for non-critical models or as overflow capacity.

Strategy 3: Implement request sampling in development and testing. Not every request needs to hit the full inference pipeline. If your batch processing job involves 10M requests, sample 100k for quality testing rather than validating all 10M.

Strategy 4: Route requests to the right model. A short, simple request doesn't need your largest, most expensive model. Build lightweight request classifiers that send only genuinely complex requests to the heavyweight model and let a smaller model handle the rest.
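Strategy 4 in miniature - a hypothetical router where the thresholds, markers, and model names are placeholders to adapt, not recommendations:

```python
def route(prompt: str) -> str:
    """Send only genuinely complex requests to the expensive model."""
    hard_markers = ("prove", "derive", "step by step", "analyze")
    is_complex = (len(prompt.split()) > 500
                  or any(m in prompt.lower() for m in hard_markers))
    return "llama-3-70b" if is_complex else "llama-3-8b"

print(route("Summarize this paragraph in one sentence."))      # cheap tier
print(route("Derive the closed-form solution step by step."))  # heavy tier
```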

When to Use vLLM vs. TensorRT-LLM

This is a question every team building LLM infrastructure faces. Both are excellent, but they have different strengths.

Use vLLM if: Your primary workload is serving LLMs, you want operational simplicity, you prefer OpenAI API compatibility, your team has Python expertise. vLLM is purpose-built for this use case.

Use TensorRT-LLM with Triton if: You're serving heterogeneous models (LLMs mixed with computer vision or other types), you need production-grade infrastructure you trust with critical systems, you want the ability to tune every detail yourself. Triton is more general-purpose.

In practice, many large teams use both: vLLM for LLM-only workloads (simpler, better performance) and Triton for mixed model serving (more flexible, more control).

Common Production Pitfalls and How to Avoid Them

Real deployments teach hard lessons. Here are the mistakes we see repeatedly:

Pitfall 1: Setting gpu_memory_utilization too conservatively.

Teams set it to 0.50 or 0.60 "for safety." This wastes 40-50% of GPU capacity. A 15,000 dollar per year A100 running at 50% utilization costs effectively 30,000 dollars per year. Start at 0.85. Monitor for OOM crashes. If stable for 24 hours, increase to 0.90.

Pitfall 2: Tuning max_num_batched_tokens once and leaving it.

This parameter needs context. It depends on:

  • Model size (70B needs smaller values than 7B)
  • Request mix (short prompts allow larger batches)
  • GPU generation rate (H100 can handle larger batches than A100)
  • User SLA (high-latency services can batch more aggressively)

Run load tests weekly. Compare latency percentiles. If p99 latency creeps up 10% month-over-month, reduce max_num_batched_tokens by 15%.

Pitfall 3: Not monitoring prefix cache hit rates.

You enable prefix caching and assume it works. Weeks later, you realize hit rates are 5% because users aren't actually sharing system prompts. You've added overhead with zero benefit. Monitor this metric from day one.

Pitfall 4: Mixing tensor parallelism across different GPU types.

If you have 2x A100s and 2x L40s (mixed cluster), tensor parallelism communication latency will be bottlenecked by the slowest GPU-to-GPU link. Always use homogeneous hardware. If you can't, use single-GPU deployment with data parallelism.

Pitfall 5: Deploying without a load test harness.

You can't tune what you can't measure. Build a load test suite before going live:

python
# Minimal load test - simulate_load() is your own harness: it fires
# batch_size concurrent requests of the given token count and returns
# latency percentiles and GPU utilization
def load_test_suite():
    tests = [
        ("short", 32, 100),      # batch_size=32, tokens=100
        ("medium", 32, 1000),    # batch_size=32, tokens=1000
        ("heavy", 32, 4000),     # batch_size=32, tokens=4000
        ("concurrent", 128, 500), # high concurrency, moderate length
    ]
 
    for name, batch_size, token_count in tests:
        results = simulate_load(batch_size, token_count)
        print(f"{name}: p99={results['p99']}ms, GPU={results['gpu_pct']}%")

Run this monthly. Track trends. When p99 latency jumps 20% month-over-month, you've hit a capacity wall.

Pitfall 6: Forgetting about network bottlenecks in multi-node setups.

Tensor parallelism communication requires high-bandwidth, low-latency networking. 10Gbps Ethernet isn't enough for 8+ GPUs. You need InfiniBand (200Gbps+) or high-speed Ethernet (100Gbps+). Test your cluster's bandwidth with:

bash
# Install nccl-tests
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make
 
# Run all-reduce test (simulates vLLM synchronization)
./build/all_reduce_perf -b 1G -e 10G -f 2 -t 1 -g 8

If all-reduce throughput is less than 50% of raw bandwidth, your network is congested. Add more bandwidth or reduce tensor parallelism size.

Moving from Development to Production

The transition from prototype to production is where many teams discover they've misunderstood their workload. In development, you optimize for latency. In production, you optimize for cost and throughput. These are often at odds.

A latency-optimized system minimizes the time from request arrival to result generation. You might use small batch sizes (2-4) to keep requests moving fast. You might disable prefix caching to reduce complexity. You might use float32 precision to maximize accuracy.

A throughput-optimized system maximizes the number of requests processed per GPU-hour. You want larger batch sizes (32-128). You want prefix caching enabled. You want quantization enabled. You accept slightly higher latency to saturate compute.

The production system you need depends on your workload. Customer-facing chat applications need latency optimization - users expect responses in under 2 seconds. Batch processing applications can accept 30-second latency if it increases throughput tenfold.

Know which you're optimizing for before you design your cluster. That decision shapes everything: your instance type selection, your quantization strategy, your batching configuration, even your model selection. A latency-sensitive system might use a smaller, faster model. A throughput-optimized system might use a larger, more capable model.

Key Takeaways

You've now seen the architecture decisions that separate hobby inference from production systems.

PagedAttention eliminates memory fragmentation, letting you handle 10-50x more concurrent requests. Continuous batching ensures your GPU never sits idle waiting for stragglers. Prefix caching turns repeated context into free throughput. Tensor parallelism lets you scale beyond a single GPU without rewriting your inference code. Quantization cuts model size and latency while keeping quality high.

Start conservative: test on small models, validate your metrics matter, then scale. One mis-tuned parameter - like gpu_memory_utilization=0.50 instead of 0.90 - can waste 40% of your GPU budget.

The configuration matrix seems large. It's not. Pick your model size, set tensor parallelism to fit your GPU count, enable prefix caching, dial in max_num_batched_tokens until GPU memory hits 85-90%, and deploy.

Monitor the metrics that matter: latency percentiles, GPU utilization, prefix cache hit rate, and batch occupancy. When p99 latency climbs or cache hit rate drops, you'll know exactly which lever to pull.

This is what separates a system that breaks at scale from one that stands up to Stripe's 50 million daily requests.


Need help implementing this?

We build automation systems like this for clients every day.

Discuss Your Project