Autoscaling ML Inference: KEDA, HPA, and Custom Metrics
You've trained the perfect model. It crushes your test suite. But here's the problem: in production, your inference pods sit idle 90% of the time, then occasional traffic spikes overwhelm them. Your cloud bill is eating into margins, and you're paying for capacity you don't use.
This is where autoscaling comes in - but not the simple CPU-based scaling you might be familiar with. Modern ML inference demands something smarter: scaling based on queue depth, GPU utilization, and business metrics that actually matter.
Let's explore how to build a production-grade autoscaling system that keeps costs low and performance high.
Table of Contents
- The Problem with Standard Kubernetes Autoscaling
- Introducing KEDA: Event-Driven Autoscaling
- HPA vs KEDA: Side-by-Side Comparison
- Why This Matters in Production
- Architecture: GPU-Aware Autoscaling
- Setting Up KEDA with Kafka and GPU Metrics
- The Scale-to-Zero Challenge: Cold Starts
- Tuning Scaling Policies for Smooth Performance
- Combining Metrics: The Multi-Signal Approach
- Measuring Success: Cold Start Analysis
- Deep Dive: Metrics Collection and Prometheus Integration
- Common Pitfalls and How to Avoid Them
- Production Debugging: When Autoscaling Doesn't Work
- Optimization: Multi-Region and Geo-Failover
- Cost Optimization: The Economics of Autoscaling
- Putting It All Together: Complete Production Example
- The Feedback Loop Problem: Scaling Pathologies
- Summary
- Building Autoscaling Culture
- When Autoscaling Isn't the Right Answer
- The Organizational Angle: Autoscaling as a Tool for Different Stakeholder Groups
- The Feedback Loop Problem Revisited
The Problem with Standard Kubernetes Autoscaling
Kubernetes' default Horizontal Pod Autoscaler (HPA) works great for web services. It watches CPU and memory, scales up when utilization climbs, and scales down when things calm down. Simple. Effective.
But ML inference is different.
Here's why HPA alone isn't enough:
Your GPU utilization might be 5%, but you're sitting on a waiting queue of 10,000 inference requests. HPA doesn't know that. It sees low CPU, decides everything's fine, and keeps your pods starved while customers wait.
Or worse: you scale based on CPU threshold, but inference workloads have weird CPU/GPU relationships. Your GPU is maxed, but CPU sits at 40%. Your model performs inference almost entirely on the GPU - CPU is just glue.
Classic HPA is metric-blind. It optimizes for the wrong signals.
The fundamental issue here is that HPA was designed with stateless web services in mind - the kind where every request is independent and similar in cost. ML inference breaks these assumptions. A single inference request might take 100ms with light GPU load or 2 seconds with heavy compute. HPA's linear scaling assumption fails. You end up with scenarios where you're scaling up but still have terrible latency because your metric (CPU%) doesn't reflect what actually matters for your model's performance.
Introducing KEDA: Event-Driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) flips the script. Instead of watching generic system metrics, KEDA scales on events and custom metrics that actually drive your workload.
What KEDA brings to the table:
- Queue depth awareness: Scale based on message queue length (Kafka, RabbitMQ, AWS SQS)
- Custom Prometheus metrics: GPU utilization, inference latency, throughput
- Scale-to-zero: Actually shut down pods when idle, not just minimum replicas
- Business metrics: Requests in queue, pending jobs, inference confidence scores
Think of KEDA as HPA's smarter cousin. It's HPA-compatible (they can work together), but KEDA understands asynchronous workloads and external event sources.
The beauty of KEDA is that it solves the fundamental problem: your scaling signal becomes domain-aware. Instead of "the CPU is at 75%," you're saying "I have 100 inference requests in my queue and each pod processes 10/second, so I need 10 pods." That direct mapping between business reality and infrastructure is what makes KEDA so powerful.
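That queue-to-replica arithmetic is simple enough to sketch. Here's roughly the mapping a Kafka `lagThreshold` encodes (the function below is illustrative, not part of KEDA):

```python
import math

def desired_replicas(queue_depth: int, per_pod_throughput: float,
                     target_drain_seconds: float = 1.0) -> int:
    """Replicas needed so each pod's share of the queue drains in roughly
    target_drain_seconds. With a 1s target, per_pod_throughput plays the
    role of lagThreshold: desired = ceil(lag / lagThreshold)."""
    per_pod_budget = per_pod_throughput * target_drain_seconds
    return max(1, math.ceil(queue_depth / per_pod_budget))

# 100 queued requests, each pod handles 10/second -> 10 pods
print(desired_replicas(100, 10))  # -> 10
```

This direct mapping is exactly why the scaling signal feels domain-aware: the threshold is stated in units your service actually has (messages per pod), not in CPU percentages.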
HPA vs KEDA: Side-by-Side Comparison
| Feature | HPA | KEDA |
|---|---|---|
| Default Metrics | CPU, Memory | Queue depth, custom metrics, events |
| Custom Metrics | Prometheus (manual setup) | Native integrations (40+) |
| Scale to Zero | No (minReplicas ≥ 1) | Yes |
| Event Sources | None | Kafka, SQS, Postgres, HTTP webhooks |
| Complex Logic | Limited | Full ScaledObject configs |
| GPU Awareness | No | Yes (via custom metrics) |
Here's the key insight: HPA and KEDA aren't either/or. They're complementary. You can use KEDA's ScaledObject for queue-based scaling and HPA for secondary scaling on CPU spikes. Best of both worlds.
In practice, most production ML systems use KEDA as the primary scaler because it's optimized for asynchronous workloads (which is how most inference systems operate). HPA acts as a safety valve for unexpected compute patterns that don't correlate with queue depth.
Why This Matters in Production
Let's ground this in real economics. Many teams think autoscaling is optional - something you add later. This is wrong. Autoscaling is how you ship cost-efficient ML systems.
Consider a typical inference service that gets bursty traffic:
- Off-peak: 10 inference requests/hour
- Peak times: 1,000 requests/minute
- Model inference cost: $2.50/GPU/hour
Without autoscaling, you either pay for peak capacity all day ($60/day) or users experience 30+ second latencies during spikes. With proper autoscaling using KEDA, you pay for what you use, cutting costs by 70-90%.
But here's the complexity: getting autoscaling right in ML is hard. You need to understand:
- How long your model takes to load (cold start time)
- How many requests per pod you can sustain
- What latency targets you need to hit
- How sensitive your users are to scaling-related delays
KEDA lets you control these variables. HPA doesn't.
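As a back-of-envelope sketch of how those variables interact, here's one way to size a warm pool from cold-start time, per-pod throughput, and a backlog budget. The model and function name are ours, and deliberately crude (it assumes backlog grows linearly while new pods boot):

```python
def warm_pods_needed(spike_rps: float, per_pod_rps: float,
                     cold_start_s: float, max_backlog: float) -> int:
    """Smallest warm-pool size such that the backlog accumulated while
    new pods cold-start never exceeds max_backlog requests."""
    for warm in range(0, 1000):
        # Requests that queue up during the cold start at this warm-pool size
        backlog = max(0.0, spike_rps - warm * per_pod_rps) * cold_start_s
        if backlog <= max_backlog:
            return warm
    raise ValueError("no feasible warm-pool size under 1000 pods")

# 50 req/s spike, 10 req/s per pod, 90s cold start, tolerate 500 queued:
print(warm_pods_needed(50, 10, 90, 500))  # -> 5
```

Plug in your own measured cold-start time and throughput; the point is that "how many warm pods?" is an answerable question, not a guess.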
Architecture: GPU-Aware Autoscaling
Let's design a system that scales an ML inference service on multiple signals: queue depth, GPU utilization, and request throughput.
graph TB
A["🔄 Request Queue<br/>Kafka Topic"]
B["📊 Prometheus<br/>Metrics"]
C["🔌 KEDA<br/>ScaledObject"]
D["⚙️ Custom Metrics<br/>Adapter"]
E["🚀 Inference Pods<br/>GPU Cluster"]
F["📉 HPA<br/>Fallback"]
A -->|Queue Depth| C
B -->|GPU Util %<br/>Throughput| D
D -->|Scaled Values| C
C -->|Scale Signal| E
F -->|CPU Spike| E
E -->|Processed| A
E -->|Metrics Export| B
style A fill:#ff6b6b
style B fill:#4ecdc4
style C fill:#45b7d1
style E fill:#ffd93d
This is a composite system:
- Request Queue (Kafka) holds inference jobs
- Prometheus scrapes GPU metrics from inference pods
- Custom Metrics Adapter transforms Prometheus data into scaling signals
- KEDA ScaledObject watches both queue depth and custom metrics
- HPA acts as a safety net for unexpected CPU surges
- Inference Pods process jobs and export telemetry
Now let's make it real.
Setting Up KEDA with Kafka and GPU Metrics
First, install KEDA on your cluster:
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
Now, let's configure a ScaledObject that scales on queue depth and GPU utilization:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: ml-inference-scaler
namespace: ml-inference
spec:
scaleTargetRef:
name: inference-deployment
kind: Deployment
minReplicaCount: 0 # Scale to zero!
maxReplicaCount: 50
# Cooldown: be conservative on scale-down
cooldownPeriod: 300
fallback:
failureThreshold: 3
replicas: 2
# Define scaling triggers
triggers:
# Trigger 1: Kafka queue depth
- type: kafka
metadata:
bootstrapServers: kafka-broker.kafka:9092
consumerGroup: ml-inference-group
topic: inference-requests
# Scale up 1 pod per 10 messages in queue
lagThreshold: "10"
offsetResetPolicy: "latest"
authenticationRef:
name: kafka-auth
# Trigger 2: GPU utilization from Prometheus
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: gpu_utilization_percent
query: |
avg(gpu_utilization_percent{job="inference-pods"})
# Scale up when GPU util > 75%
threshold: "75"
authenticationRef:
name: prometheus-auth
# Trigger 3: Custom metric - inference latency
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: inference_latency_p99_ms
query: |
histogram_quantile(0.99,
rate(inference_latency_ms_bucket[30s]))
# If p99 latency exceeds 2 seconds, scale up
threshold: "2000"
This config does several things:
- minReplicaCount: 0 - Actually scales to zero. No pods running = no costs.
- Multiple triggers - Combines queue depth, GPU utilization, and latency. Any trigger can initiate scaling.
- Fallback replicas - If all metrics fail, maintain 2 pods (safety net).
- Conservative cooldown - Waits 5 minutes before scaling down (avoids thrashing).
Now let's define the Deployment that KEDA manages:
apiVersion: apps/v1
kind: Deployment
metadata:
name: inference-deployment
namespace: ml-inference
spec:
selector:
matchLabels:
app: inference
template:
metadata:
labels:
app: inference
spec:
# Schedule on GPU nodes
nodeSelector:
accelerator: nvidia-gpu
tolerations:
- key: nvidia.com/gpu
operator: Equal
value: "true"
effect: NoSchedule
containers:
- name: inference-server
image: my-inference-server:v1
# Resource requests for scheduling
resources:
requests:
cpu: 2
memory: 8Gi
nvidia.com/gpu: 1
limits:
cpu: 4
memory: 16Gi
nvidia.com/gpu: 1
# Expose Prometheus metrics
ports:
- name: metrics
containerPort: 8080
# Liveness/readiness for graceful scale-down
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
env:
- name: KAFKA_BROKERS
value: kafka-broker.kafka:9092
- name: KAFKA_TOPIC
value: inference-requests
- name: GPU_MEMORY_FRACTION
value: "0.95"
Key details:
- GPU scheduling: Node selector ensures pods run on GPU hardware
- Metrics port: Prometheus scrapes from :8080/metrics
- Readiness probes: Let Kubernetes stop routing traffic and drain connections gracefully before scale-down
The Scale-to-Zero Challenge: Cold Starts
Here's where it gets interesting. Scaling to zero saves money - you pay nothing when idle. But ML models take time to load.
A typical NVIDIA GPU inference server startup sequence:
- Container initialization: ~5 seconds
- Model load (weights, quantization): ~30-90 seconds (depends on model size)
- Warmup inference: ~5 seconds
- Ready to serve: Total ~40-120 seconds
That's your cold start latency. If requests arrive every few hours, you're paying this tax each time.
Let's measure it:
# Add to your inference server startup
import time
import os
from prometheus_client import Histogram, Counter
startup_time = Histogram(
'inference_startup_seconds',
'Time to load model and become ready',
buckets=(10, 30, 60, 90, 120)
)
startup_requests = Counter(
'inference_cold_starts_total',
'Total cold starts from zero replicas'
)
class InferenceServer:
    def __init__(self):
        start = time.time()
        startup_requests.inc()  # count this cold start
        self.load_model()
        self.warmup()
        elapsed = time.time() - start
        startup_time.observe(elapsed)
        print(f"Ready in {elapsed:.1f}s")

    def load_model(self):
        # Load the ONNX model into a runnable inference session
        # (onnx.load alone returns a model proto you can't call)
        import onnxruntime as ort
        self.session = ort.InferenceSession(os.getenv('MODEL_PATH'))
        print("Model loaded")

    def warmup(self):
        # Dummy inference to warm up the GPU and compile kernels
        import numpy as np
        dummy_input = np.random.randn(1, 3, 224, 224).astype(np.float32)
        input_name = self.session.get_inputs()[0].name
        self.session.run(None, {input_name: dummy_input})
        print("Warmup complete")
Now, decide: Is cold start acceptable?
If yes: Use scale-to-zero. Save 100% of idle costs. Accept 40-120s latency on first request.
If no: Keep 1-2 "warm pool" replicas running. Trade idle costs (hundreds of dollars a month for a small GPU, up to ~$1,800/month per always-on A100) for sub-second latency on every request.
Many teams use a hybrid: scale-to-zero during off-hours (nights, weekends), maintain warm pool during business hours.
Here's a KEDA config that implements this:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: ml-inference-with-warmpool
spec:
scaleTargetRef:
name: inference-deployment
# During business hours (8am-8pm US/Eastern): keep 2 warm
# During off-hours: scale to zero
minReplicaCount: 0
maxReplicaCount: 50
triggers:
- type: kafka
metadata:
bootstrapServers: kafka-broker.kafka:9092
topic: inference-requests
lagThreshold: "10"
# Add a time-based "always on" trigger for business hours
- type: cron
metadata:
timezone: America/New_York
start: 0 8 * * MON-FRI
end: 0 20 * * MON-FRI
desiredReplicas: "2"
This keeps 2 replicas warm during business hours (8am-8pm ET, M-F), then scales to zero nights and weekends.
Tuning Scaling Policies for Smooth Performance
Aggressive scaling creates chaos. Pod thrashing, unnecessary churn, cold starts, wasted resources. We need smart policies.
KEDA supports scaling policies that control how fast you scale up vs down:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: inference-tuned
spec:
scaleTargetRef:
name: inference-deployment
minReplicaCount: 1
maxReplicaCount: 50
# Stabilization windows prevent thrashing
advanced:
horizontalPodAutoscalerConfig:
behavior:
# Scale UP aggressively when queue builds
scaleUp:
stabilizationWindowSeconds: 60
policies:
# Double current replicas every 15 seconds until max
- type: Percent
value: 100 # Double current replicas
periodSeconds: 15
# OR add 10 pods per minute (whichever is larger)
- type: Pods
value: 10
periodSeconds: 60
selectPolicy: Max # Pick the most aggressive
# Scale DOWN conservatively
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 minutes
policies:
# Remove max 2 pods per minute
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Min # Most conservative
triggers:
- type: kafka
metadata:
bootstrapServers: kafka-broker.kafka:9092
topic: inference-requests
lagThreshold: "5"
What's happening here:
- Scale-up (fast): When queue builds, double your replicas every 15 seconds or add 10 pods per minute - pick the larger number. This handles traffic spikes quickly.
- Scale-down (slow): Remove at most 2 pods per minute, but wait 5 minutes of calm before even trying. This prevents the "flapping" where you scale up, immediately scale down, repeat.
Real-world result: Requests queue builds from 0→100 in seconds, you launch 10 new pods in 60 seconds, queue drains, you wait 5 minutes of silence, then taper down 2 pods every minute.
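A rough simulation of the scale-up side makes the numbers concrete. This ignores stabilization windows and metric evaluation timing; the function is illustrative, not HPA's actual algorithm:

```python
def doubling_path(start=1, steps=4, max_replicas=50):
    """Replica counts allowed by the Percent policy (value=100,
    periodSeconds=15): doubling each 15s period, capped at maxReplicaCount.
    Four periods = one minute."""
    path = [start]
    for _ in range(steps):
        path.append(min(max_replicas, path[-1] * 2))
    return path

print(doubling_path())  # -> [1, 2, 4, 8, 16]
# The Pods policy (value=10, periodSeconds=60) would only allow 1 -> 11
# in the same minute, so selectPolicy: Max lets doubling win while the
# replica count is small; at high counts the Pods policy takes over.
```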
Combining Metrics: The Multi-Signal Approach
Here's a production config that combines queue depth, GPU utilization, and throughput:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: ml-inference-production
namespace: ml-inference
spec:
scaleTargetRef:
name: inference-deployment
minReplicaCount: 1
maxReplicaCount: 100
cooldownPeriod: 300
advanced:
horizontalPodAutoscalerConfig:
behavior:
scaleUp:
stabilizationWindowSeconds: 30
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 20
periodSeconds: 60
selectPolicy: Max
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Pods
value: 3
periodSeconds: 60
selectPolicy: Min
triggers:
# Primary: Queue depth (inference requests waiting)
- type: kafka
metadata:
bootstrapServers: kafka-broker.kafka:9092
consumerGroup: inference-prod
topic: model-inference-requests
lagThreshold: "5" # 1 pod per 5 queued requests
# Secondary: GPU utilization (efficiency signal)
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring:9090
metricName: gpu_utilization
query: |
avg(container_accelerator_memory_used_bytes{pod=~"inference-.*"} /
container_accelerator_memory_total_bytes{pod=~"inference-.*"} * 100)
threshold: "85" # Scale if GPU memory > 85%
# Tertiary: Inference throughput (business metric)
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring:9090
metricName: inference_throughput
query: |
sum(rate(inference_requests_total{job="inference"}[1m]))
threshold: "1000" # Scale if throughput needs > 1000 req/s
# Quaternary: Latency SLO breach (performance signal)
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring:9090
metricName: inference_latency_slo_breach
query: |
(histogram_quantile(0.99,
rate(inference_latency_ms_bucket[5m])) > 2000)
* on() group_left() vector(1)
threshold: "0.5" # Fires whenever p99 latency exceeds the 2s SLO
Each trigger is independent. If any signal says "scale up," we scale. This gives you:
- Responsive: Queue builds? Immediate scale-up.
- Efficient: GPU maxed? Scale up before you're slow.
- Business-aligned: Throughput goals? Built in.
- Safe: Latency breaching SLO? Auto-remedy.
Measuring Success: Cold Start Analysis
After 2 weeks of production, analyze your autoscaling behavior:
# Pull cold start events from metrics
kubectl exec -it prometheus-0 -- promtool query instant \
  http://localhost:9090 \
  'count(increase(inference_cold_starts_total[2w]))'
# Check scaling frequency
kubectl exec -it prometheus-0 -- promtool query range \
  --start=$(date -u -d '2 weeks ago' +%s) \
  --end=$(date -u +%s) \
  --step=1h \
  http://localhost:9090 \
  'changes(keda_scaledobject_replicas[2w])'
# Calculate savings
# Cost per GPU per hour: ~$2.50 (adjust for your region/instance)
# Hours at zero replicas × cost per GPU × avg GPUs per pod
A well-tuned system typically sees:
- 50-70% time at 0 replicas (pure savings)
- <5 cold starts per day (assuming typical batch workloads)
- Fast scale-up (reaching a stable replica count in <2 minutes)
- $15,000-50,000 monthly savings (depending on baseline usage)
Deep Dive: Metrics Collection and Prometheus Integration
Before KEDA can scale intelligently, you need to export the right metrics from your inference pods. This is where most teams struggle - they export everything, or they export nothing.
Here's what you actually need:
from prometheus_client import Counter, Histogram, Gauge
import time
# Counters (monotonically increasing)
inference_requests = Counter(
'inference_requests_total',
'Total inference requests',
['model_name', 'status'] # Track success vs error
)
inference_tokens = Counter(
'inference_tokens_generated_total',
'Total tokens generated',
['model_name']
)
# Histograms (distributions with percentiles)
inference_latency = Histogram(
    'inference_latency_ms',
    'End-to-end inference latency',
    ['model_name'],
    buckets=[10, 50, 100, 250, 500, 1000, 2500, 5000, 10000],
)
queue_processing_time = Histogram(
    'queue_processing_time_ms',
    'Time from dequeue to response sent',
    ['model_name'],
    buckets=[100, 500, 1000, 5000, 10000],
)
# Gauges (point-in-time values)
gpu_memory_used = Gauge(
'gpu_memory_used_bytes',
'GPU memory currently used',
['gpu_id']
)
active_inference_jobs = Gauge(
'active_inference_jobs',
'Number of inflight inference requests'
)
model_load_time = Gauge(
'model_load_time_seconds',
'How long the model took to load on startup'
)
Now, the production pattern is to update these within your inference loop:
class InferenceService:
def process_request(self, request):
start = time.time()
active_inference_jobs.inc()
try:
# Your inference logic
result = self.model.predict(request.data)
inference_latency.labels(model_name="llm-7b").observe(
(time.time() - start) * 1000 # Convert to ms
)
inference_requests.labels(
model_name="llm-7b",
status="success"
).inc()
return result
except Exception as e:
inference_requests.labels(
model_name="llm-7b",
status="error"
).inc()
raise
finally:
active_inference_jobs.dec()
The critical insight: KEDA queries these metrics to make scaling decisions. If you're not exporting the right metrics, KEDA is flying blind. Don't export metrics just for observability - export them for decisioning.
Common Pitfalls and How to Avoid Them
Pitfall 1: Triggering Too Aggressively
Problem: Setting lagThreshold too low (e.g., "1") causes constant scaling. Each message triggers a new pod.
Solution: Match the threshold to pod throughput. If each GPU pod processes 20 messages/second, set lagThreshold to "100" (about 5 seconds of work per pod).
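That rule of thumb is just multiplication; a tiny helper (hypothetical, not a KEDA API) makes it explicit:

```python
import math

def lag_threshold(per_pod_msgs_per_s: float, target_drain_s: float) -> int:
    """lagThreshold ~= messages one pod can clear within the queue-drain
    time you're willing to tolerate."""
    return max(1, math.floor(per_pod_msgs_per_s * target_drain_s))

print(lag_threshold(20, 5))  # -> 100: the value from the example above
print(lag_threshold(2, 1))   # -> 2: slow pods need a much lower threshold
```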
Pitfall 2: Not Accounting for Startup Time
Problem: Scaling signals arrive, but pods take 90s to boot. Queue backs up while you wait.
Solution: Pre-calculate startup latency. If startup = 90s, add 1.5× buffer to your scaling threshold.
Pitfall 3: Leaving minReplicaCount > 0
Problem: Paying for idle pods. minReplicaCount: 1 defeats the cost savings.
Solution: Use scale-to-zero (minReplicaCount: 0) with warm-pool triggers for business hours.
Pitfall 4: Ignoring GPU Memory Constraints
Problem: Scaling up, but GPUs OOM because pods pack too tightly.
Solution: Request whole GPUs (nvidia.com/gpu: 1) so the scheduler places one pod per GPU, and monitor memory with nvidia-smi. Kubernetes won't schedule more pods than a node has GPUs.
Production Debugging: When Autoscaling Doesn't Work
You deployed the KEDA config, and... nothing happens. Queue builds up, but no new pods. Or pods scale endlessly. Here's how to debug.
First, check KEDA logs:
kubectl logs -n keda -l app.kubernetes.io/name=keda-operator -f
Look for errors parsing your ScaledObject or metric failures.
Second, verify the ScaledObject is active:
kubectl describe scaledobject ml-inference-scaler -n ml-inference
Check the status section. It should show each trigger as "Active: true". If any trigger shows "Active: false", that trigger isn't being evaluated.
Third, test your metrics manually:
# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Query your metric directly
curl 'http://localhost:9090/api/v1/query?query=kafka_consumergroup_lag{topic="inference-requests"}'
If the metric returns empty or stale data, your exporters aren't running or are misconfigured. KEDA can't scale on metrics that don't exist.
Fourth, check HPA behavior:
kubectl get hpa -n ml-inference
kubectl describe hpa keda-hpa-ml-inference-scaler -n ml-inference
KEDA creates an HPA under the hood (named keda-hpa-&lt;scaledobject-name&gt;). If the HPA shows "unknown" for metric values, metrics aren't reaching the API.
Fifth, audit scaling decisions:
# Watch KEDA scale events
kubectl get events -n ml-inference --sort-by='.lastTimestamp' | grep keda
# Check deployment scale history
kubectl rollout history deployment/inference-deployment -n ml-inference
If you're seeing scale events but no pod launches, the issue is likely node capacity or pod scheduling constraints.
Optimization: Multi-Region and Geo-Failover
If you have inference clusters across regions, you need KEDA + service mesh coordination. Here's the pattern:
---
# Region A: Primary inference cluster
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: inference-us-east
namespace: ml-inference
spec:
scaleTargetRef:
name: inference-deployment
minReplicaCount: 2 # Always 2 for failover
maxReplicaCount: 50
triggers:
- type: kafka
metadata:
bootstrapServers: kafka-us-east.region-a:9092
topic: inference-requests
lagThreshold: "10"
---
# Region B: Secondary (scales aggressively if primary fails)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: inference-us-west
namespace: ml-inference
spec:
scaleTargetRef:
name: inference-deployment
minReplicaCount: 0 # Scale to zero normally
maxReplicaCount: 100 # But scale *up* if region A lags
triggers:
- type: kafka
metadata:
bootstrapServers: kafka-us-west.region-b:9092
topic: inference-requests
# Smaller lag threshold: more aggressive scaling
lagThreshold: "5"
The logic: Region A carries the load with aggressive scale-down (lower cost). If Region A's queue builds up significantly, that's a signal Region A is struggling. Region B sees its own queue and scales aggressively. Traffic steering (via Istio/Linkerd) sends new requests to Region B. Region A recovers, queue drains, Region B scales down.
This requires traffic steering intelligence, not just KEDA, but KEDA provides the autoscaling foundation.
Cost Optimization: The Economics of Autoscaling
Here's the financial calculation most teams skip but should do religiously.
Baseline costs (8 GPUs, always on):
- 8 × NVIDIA A100 40GB = 8 × $2.48/hour (AWS pricing) = $19.84/hour
- Monthly (24/7): $19.84 × 730 hours = $14,483/month
With naive autoscaling (minReplicaCount: 1, always at least 1 pod):
- 1 GPU always on: $2.48/hour = $1,810/month
- 4-7 additional GPUs (average): $12.32/hour = $8,994/month
- Total: ~$10,804/month (25% savings)
With KEDA scale-to-zero (minReplicaCount: 0, warm pool during business hours):
- Business hours (8am-8pm, M-F = 60 hours/week × 4.33 weeks = 260 hours): 2 GPUs
- Off-hours: 0 GPUs
- Cost: (2 × $2.48 × 260) + (0 × (730 - 260)) = $1,289/month
- Peak load handling (additional 6 GPUs for 40 hours/week average): $14.88 × 40 × 4.33 = $2,577/month
- Total: ~$3,866/month (73% savings)
The difference between naive autoscaling and scale-to-zero? $6,938/month. That's $83,256 per year saved by proper autoscaling tuning.
But here's the gotcha: cold start costs money too. Each 90-second cold start means:
- 90 seconds = 0.025 hours
- Opportunity cost: requests you can't serve while spinning up
- Each cold start pod consumes ~$0.062 in just booting (0.025 × $2.48)
- 10 cold starts/day = $18.65/month
With scale-to-zero, the compute cost of cold starts is pennies ($18.65/month at 10 per day) versus ~$1,810/month for one always-warm A100 pod - so on dollars alone, scale-to-zero nearly always wins. The real break-even question is latency: how many requests per day can you afford to serve 90 seconds late? Most batch workloads tolerate far more than that.
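You can sanity-check the dollar side of that trade-off in a few lines, using the GPU rate assumed elsewhere in this article:

```python
GPU_HOURLY = 2.48        # A100 on-demand rate assumed in this article
HOURS_PER_MONTH = 730

def warm_pod_cost(pods: int) -> float:
    """Monthly cost of keeping `pods` GPU pods always on."""
    return pods * GPU_HOURLY * HOURS_PER_MONTH

def cold_start_cost(starts_per_day: int, boot_seconds: float = 90) -> float:
    """Monthly GPU-time cost of cold-start booting alone (30-day month)."""
    return starts_per_day * 30 * (boot_seconds / 3600) * GPU_HOURLY

# One always-warm pod vs. 10 cold starts/day:
print(round(warm_pod_cost(1), 2))     # -> 1810.4
print(round(cold_start_cost(10), 2))  # -> 18.6
```

The gap is two orders of magnitude, which is why the decision usually hinges on latency tolerance rather than the booting cost itself.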
The real lever: Understanding your traffic pattern. If you have:
- Steady state traffic (consistent requests): Keep 1-2 warm pods. Don't go to zero.
- Bursty traffic (sudden spikes, then quiet): Use KEDA with aggressive scale-up, but scale-to-zero during quiet periods.
- Bimodal traffic (busy 9-5, dead nights): Use cron-based warm pool during business hours, zero at night.
Measure your actual pattern with:
# In your inference server
import json
from datetime import datetime, time
from prometheus_client import Gauge
requests_per_hour = Gauge(
'inference_requests_per_hour',
'Rolling count of requests this hour'
)
class InferenceServer:
def __init__(self):
self.hourly_count = 0
self.last_hour = datetime.now().hour
def on_request(self):
now = datetime.now()
if now.hour != self.last_hour:
requests_per_hour.set(self.hourly_count)
self.hourly_count = 0
self.last_hour = now.hour
self.hourly_count += 1
After 2 weeks, you'll have enough data to build a cost-optimized KEDA config.
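Once you have a couple of weeks of hourly counts, a crude classifier like this can suggest which of the three strategies above fits. The coefficient-of-variation cutoffs are illustrative, not a standard:

```python
import statistics

def classify_pattern(hourly_counts):
    """Heuristic traffic-pattern classifier over hourly request counts."""
    mean = statistics.mean(hourly_counts)
    if mean == 0:
        return "idle"
    cv = statistics.pstdev(hourly_counts) / mean  # coefficient of variation
    if cv < 0.3:
        return "steady"    # keep 1-2 warm pods, skip scale-to-zero
    if cv < 1.0:
        return "bimodal"   # cron-based warm pool during busy hours
    return "bursty"        # scale-to-zero plus aggressive scale-up

print(classify_pattern([100] * 24))              # -> steady
print(classify_pattern([100] * 12 + [10] * 12))  # -> bimodal
print(classify_pattern([0] * 23 + [1000]))       # -> bursty
```

Treat the output as a starting point for tuning, not a verdict; a real analysis would also look at hour-of-day and day-of-week structure.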
Putting It All Together: Complete Production Example
Here's a battle-tested configuration for production inference:
---
# Namespace
apiVersion: v1
kind: Namespace
metadata:
name: ml-inference
---
# ScaledObject (KEDA)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: inference-ml-scaler
namespace: ml-inference
spec:
scaleTargetRef:
name: inference-api
minReplicaCount: 0
maxReplicaCount: 50
cooldownPeriod: 300
advanced:
horizontalPodAutoscalerConfig:
behavior:
scaleUp:
stabilizationWindowSeconds: 30
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 15
periodSeconds: 60
selectPolicy: Max
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Min
triggers:
- type: kafka
metadata:
bootstrapServers: kafka-broker.kafka:9092
consumerGroup: inference-prod
topic: inference-requests
lagThreshold: "10"
---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: inference-api
namespace: ml-inference
spec:
selector:
matchLabels:
app: inference-api
template:
metadata:
labels:
app: inference-api
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
nodeSelector:
accelerator: nvidia-gpu
tolerations:
- key: nvidia.com/gpu
operator: Equal
value: "true"
effect: NoSchedule
terminationGracePeriodSeconds: 120
containers:
- name: api
image: inference-server:v2
ports:
- name: http
containerPort: 8000
- name: metrics
containerPort: 8080
resources:
requests:
cpu: 2
memory: 8Gi
nvidia.com/gpu: 1
limits:
cpu: 4
memory: 16Gi
nvidia.com/gpu: 1
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 30
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
env:
- name: MODEL_PATH
value: /models/inference-model.onnx
- name: KAFKA_BROKERS
value: kafka-broker.kafka:9092
- name: GPU_MEMORY_FRACTION
value: "0.95"
This config:
- Scales to zero (minReplicaCount: 0)
- Uses Kafka queue depth as the primary signal
- Scales up aggressively (double every 15s)
- Scales down conservatively (2 pods/min after 5 min calm)
- Proper health checks for graceful drains
- GPU-aware scheduling and resource limits
The Feedback Loop Problem: Scaling Pathologies
Here's something that tends to bite teams in production: autoscaling can create feedback loops that make your system less stable, not more.

Imagine this scenario. Your queue depth metric spikes because your inference service got slightly slower (maybe due to a code deployment or temporary GPU contention). KEDA sees the queue building and scales up. The new pods take 90 seconds to boot and load the model. During that 90 seconds, the queue keeps building because you're still running at reduced capacity. By the time the new pods are ready, you may have overprovisioned significantly.

Then everything drains. You go from fully loaded to mostly idle in minutes. KEDA sees low queue depth and starts scaling down aggressively. You drop below your required capacity. Queue builds again. Welcome to scaling thrashing - your system oscillates between over- and under-provisioned.
The solution is conservative scaling policies. Scale up quickly (you can always scale down later). Scale down slowly (waiting is cheap compared to thrashing). Use stabilization windows that prevent rapid reversal. Add hysteresis to your metrics so a small blip doesn't trigger a scale decision. These seem like minor tuning details but they're critical for stability.
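Hysteresis is easy to sketch: use separate up and down thresholds with a dead band between them, so a small blip can't flip the decision back and forth. The thresholds below are illustrative:

```python
class HysteresisTrigger:
    """Activate above `high`, deactivate only below `low`; values in the
    dead band between them leave the current state unchanged."""
    def __init__(self, high: float, low: float):
        assert low < high
        self.high, self.low = high, low
        self.active = False  # are we currently in the "scaled up" state?

    def update(self, metric: float) -> bool:
        if metric >= self.high:
            self.active = True
        elif metric <= self.low:
            self.active = False
        return self.active

t = HysteresisTrigger(high=100, low=40)
print([t.update(v) for v in (50, 120, 80, 80, 30)])
# -> [False, True, True, True, False]: the dips to 80 don't flip the state
```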
Another pathology: cold start avalanches. You scale to zero during off-hours to save money. Traffic arrives. You need to launch fifty new pods simultaneously because all your capacity is zero. Kubernetes starts launching pods, pulling container images, allocating GPUs. Your container registry gets hammered. Your cluster's network gets saturated. Some pods timeout during startup. You don't actually launch fifty pods - you launch thirty because twenty hit errors. Now you're under-provisioned from the start and never catch up. Queue backs up. Users wait minutes for responses.
The solution here is warm pools during predictable busy times. If you know you're going to get traffic 9am-5pm, keep 2-5 pods warm during those hours. Don't scale to truly zero during business hours. The cost of keeping a couple warm pods running is much less than the cost of cold start failures and poor user experience.
A third pathology: metric stale-ness. Your scaling decision is based on metrics collected 60 seconds ago. In that 60 seconds, traffic quadrupled. By the time your scaling decision is made and executed, you're responding to yesterday's demand. Real-world solution: use shorter metric evaluation windows (15-30 seconds instead of 60) and accept more reactive behavior. Yes, you might scale one or two pods too many. But you're not constantly chasing your tail trying to catch up to real-time demand.
Summary
Autoscaling ML inference is hard. But with KEDA, Prometheus, and thoughtful scaling policies, you can build systems that:
- Save 50-70% of idle compute costs with scale-to-zero
- Maintain sub-second latency during traffic spikes
- Tolerate cold starts gracefully with warm-pool triggers
- Scale intelligently on queue depth, GPU utilization, and business metrics
Start simple: queue-based scaling with a warm pool. Measure cold starts. Add GPU metrics. Fine-tune stabilization windows. Iterate.
The difference between paying $50k/month and $10k/month for the same throughput? It's here.
Building Autoscaling Culture
The hardest part of implementing KEDA isn't the technology - it's building an organizational practice around it. Many teams deploy KEDA, set some reasonable defaults, and then never tune it. They don't measure whether they're actually saving money. They don't track cold starts. They don't correlate scaling decisions with user experience metrics. KEDA runs invisibly, looking like it's working, but it's probably sub-optimal.
The teams that get real value treat autoscaling as a continuous optimization problem. They measure baseline costs. They deploy KEDA with conservative settings. They monitor the metrics. They find inefficiencies. They tune the triggers. They measure again. They iterate. Over three months of this, they find 30-50% additional savings beyond what naive KEDA configuration gives them.
This requires tooling and discipline. You need dashboards that show you:
- How many replicas you're running over time
- Cost per period (with forecasting)
- Cold start frequency and latency impact
- Scale-up/scale-down frequency (detecting thrashing)
- Average queue depth
- User experience metrics (latency, errors, success rate)
When all these metrics are visible and tracked historically, teams naturally start asking "why is this metric bad?" and fixing it. The visibility drives the behavior change. Without visibility, autoscaling runs invisibly and you never know whether you're getting value.
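The thrashing signal in that list is worth automating. A cheap way to detect it is to count how often the replica-count time series flips direction; the function and threshold below are a sketch, not a standard metric:

```python
# Sketch of a thrashing detector: count direction changes in the
# replica-count time series sampled once per scaling interval.
def direction_changes(replicas: list[int]) -> int:
    """Number of times scaling flips between up and down."""
    deltas = [b - a for a, b in zip(replicas, replicas[1:]) if b != a]
    return sum(1 for a, b in zip(deltas, deltas[1:]) if (a > 0) != (b > 0))

history = [10, 12, 15, 13, 16, 12, 14, 10]  # replica counts per interval
print(f"{direction_changes(history)} direction changes")  # 5 flips in 8 samples
```

Five flips in eight samples is the chasing-the-load pattern; a healthy series ramps up, plateaus, and ramps down with only one or two direction changes. Graphing this count per hour on the dashboard turns a vague "it feels twitchy" into a number you can alert on.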
Also, involve the team that deploys and monitors the service in the tuning process. They have domain knowledge about typical traffic patterns, peak times, and user expectations. They know which metrics matter and which are noise. They're the ones who'll catch when something goes wrong. Making them partners in tuning KEDA, not just users of it, builds ownership and leads to better decisions.
When Autoscaling Isn't the Right Answer
Here is a truth that infrastructure vendors rarely advertise: not every workload benefits from autoscaling. Understanding when autoscaling makes sense and when it does not is critical for making good infrastructure decisions.
Autoscaling shines when you have variable traffic. If your inference service gets light traffic most of the time with occasional spikes, autoscaling can save you seventy to eighty percent of your costs. You are not paying for capacity you are not using. But if your traffic is perfectly steady - maybe you serve internal batch inference jobs that run at the same rate every day - autoscaling brings overhead without benefit. You might as well provision a fixed number of pods and save the operational complexity of maintaining KEDA.
Autoscaling also requires that your workloads tolerate latency variation. When you scale up, there is overhead. New pods spin up. Models load. There is a startup penalty, measured in seconds. If your users are okay with occasional requests taking longer, this is fine. But if you have strict latency SLAs, scaling to zero during off-hours might violate those SLAs. You might end up maintaining a warm pool at all hours to keep latency predictable, which defeats the cost savings of scale-to-zero.
The cold start problem is also real. Scaling from zero is cheap in theory but expensive in practice if your cold start penalty is high. If your model takes ninety seconds to load, and a traffic spike causes you to launch fifty pods simultaneously, you have ninety seconds where all fifty pods are loading their models and providing no service. The queue backs up. You actually launched too many pods because they were not ready when the traffic arrived. You pay for more concurrency than you needed, making the cost calculation worse.
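The arithmetic behind that scenario is worth making explicit. With assumed numbers for arrival rate and per-pod throughput (the 90-second load time comes from the text), the backlog that accumulates while pods load is simply arrival rate times load time:

```python
import math

# Rough queue-backlog estimate during a cold start. The arrival rate and
# per-pod throughput are assumptions; the 90s load time is from the text.
arrival_rps = 200        # assumed incoming request rate
load_s = 90              # model load time per pod
per_pod_rps = 4          # assumed steady-state throughput per pod

backlog = arrival_rps * load_s                        # requests queued while pods load
pods_needed = math.ceil(arrival_rps / per_pod_rps)    # pods needed at steady state

print(f"backlog after cold start: {backlog} requests")  # 18000
print(f"steady-state pods needed: {pods_needed}")       # 50
```

Eighteen thousand queued requests must be drained on top of live traffic, which is why the autoscaler overshoots: it sizes for the backlog, not the steady state, and the extra pods become idle cost minutes later.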
Another case where autoscaling fails is when your workloads are tightly coupled. If you are running a distributed training job that requires exactly eight GPUs talking to each other over high-bandwidth interconnect, autoscaling does not help. The job either runs with all eight GPUs or it does not run at all. You cannot scale gradually.
And here is something that catches many teams: autoscaling adds latency to the request path. Your scaling metrics are checked every fifteen to thirty seconds. If traffic spikes, you might not detect it for thirty seconds. By the time you are spinning up new pods, you already have a queue. Your users experience long latencies during the ramp-up. A system with pre-provisioned capacity would have served those requests instantly. For latency-sensitive applications, the cost of autoscaling overhead might outweigh the cost savings of not running idle capacity.
The Organizational Angle: Autoscaling as a Tool for Different Stakeholder Groups
Autoscaling creates alignment and misalignment between different groups in an organization. For cost-conscious organizations, autoscaling is obviously good. It reduces cloud spending. But for reliability-focused groups, autoscaling can look like unnecessary risk. Every scaling decision is an opportunity for something to go wrong. Pods might fail to start. Metrics might be stale. You might under-provision and create a terrible user experience.
This tension is real and it is not irrational. The solution is transparency and measurement. If you can show that your autoscaling system reliably handles ninety-eight percent of traffic spikes with zero SLA violations, reliability teams will trust it. But this requires months of history and careful monitoring. Early autoscaling deployments should be conservative. Run with a larger-than-necessary baseline. Autoscale only for traffic that significantly exceeds your baseline. As you gain confidence, you can make the policy more aggressive.
This also suggests that autoscaling is not a binary choice but a spectrum. At one extreme is manual scaling - you watch dashboards and manually request more capacity when needed. At the other extreme is aggressive autoscaling that can scale from zero pods to one hundred pods in minutes. Most production systems benefit from something in between. Maintain a small baseline that handles typical load, then autoscale for traffic that exceeds the baseline. This hybrid approach gives you reliability when you need it and cost savings when you can afford the scale-up latency.
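The hybrid approach reduces to a simple replica formula: never go below the baseline, never above the ceiling, and otherwise size for the queue. The parameter values here are illustrative assumptions:

```python
import math

# Hybrid policy sketch: a fixed baseline absorbs typical load; autoscaling
# only covers the excess. All parameter values are illustrative.
def desired_replicas(queue_depth: int, per_pod_capacity: int,
                     baseline: int = 5, max_replicas: int = 100) -> int:
    needed = math.ceil(queue_depth / per_pod_capacity)
    return min(max(baseline, needed), max_replicas)

print(desired_replicas(queue_depth=40, per_pod_capacity=20))   # light load -> 5
print(desired_replicas(queue_depth=600, per_pod_capacity=20))  # spike -> 30
```

Under light load the baseline dominates and latency stays predictable; during a spike the queue term takes over and you pay for extra pods only while the excess traffic lasts.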
The Feedback Loop Problem Revisited
Earlier I mentioned scaling pathologies. Let me expand on that because it is one of the hardest problems in autoscaling and most teams encounter it eventually. Imagine this scenario. Your inference service is running smoothly at ten pods. Traffic increases. KEDA detects the queue building and scales to fifteen pods. The new pods start up and load their models. During model loading, they are not yet processing requests, so the queue keeps building. By the time the new pods are ready, maybe you needed twenty pods all along. You scale to twenty. Traffic spikes again. You scale to thirty. Then everything drains. Traffic was a temporary spike. Now you have thirty pods running with minimal load. KEDA scales down, but conservatively, removing maybe two pods per minute. It takes fifteen minutes to get back to ten pods. You are paying for idle capacity the whole time.
During this entire cycle, your system was constantly chasing the load instead of smoothly handling it. This is called thrashing, and it is the enemy of cost-efficient autoscaling.
The fix is to be smarter about scaling decisions. Use exponential backoff on scale-down. If you just scaled up, do not scale down for several minutes. Detect traffic spikes and pre-scale before the queue gets large. Look at trends in your metrics, not just instantaneous values. If your queue is growing but not yet large, you might want to start scaling before it explodes. Some advanced autoscaling systems use machine learning to predict traffic and scale preemptively. But most teams do not need that. Conservative scaling policies - scale up fast, scale down slow, wait between decisions - solve most thrashing problems.
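Those three rules - scale up fast, scale down slow, pre-scale on a growing trend - fit in a few lines. The class name, thresholds, and two-pods-per-decision drain rate below are all assumptions to tune, not a reference implementation:

```python
# Minimal sketch of "scale up fast, scale down slow" with a trend check.
# Cooldown length and drain rate are assumptions to tune for your workload.
class ScalePolicy:
    def __init__(self, cooldown_s: float = 300.0):
        self.cooldown_s = cooldown_s
        self.last_scale_up = 0.0

    def decide(self, current: int, desired: int, queue_trend: float,
               now: float) -> int:
        if desired > current:
            self.last_scale_up = now
            return desired                    # scale up immediately
        if queue_trend > 0:                   # queue growing: pre-scale, never shrink
            return current + 1
        if now - self.last_scale_up < self.cooldown_s:
            return current                    # hold: just scaled up recently
        return max(desired, current - 2)      # drain slowly, 2 pods per decision

policy = ScalePolicy()
print(policy.decide(current=10, desired=15, queue_trend=0.0, now=100.0))  # -> 15
print(policy.decide(current=15, desired=8, queue_trend=0.0, now=200.0))   # -> 15 (cooldown)
print(policy.decide(current=15, desired=8, queue_trend=0.0, now=500.0))   # -> 13
```

Note how the third call shrinks by two pods rather than jumping straight to the desired count - the slow drain is what stops a brief lull from emptying the pool right before the next spike.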