Autoscaling ML Inference: KEDA, HPA, and Custom Metrics
You've trained the perfect model. It crushes your test suite. But here's the problem: in production, your inference pods sit idle 90% of the time, then occasional traffic spikes overwhelm them. Your cloud bill is eating into margins, and you're paying for capacity you don't use.
This is where autoscaling comes in - but not the simple CPU-based scaling you might be familiar with. Modern ML inference demands something smarter: scaling based on queue depth, GPU utilization, and business metrics that actually matter.
Let's explore how to build a production-grade autoscaling system that keeps costs low and performance high.
Table of Contents
- The Problem with Standard Kubernetes Autoscaling
- Introducing KEDA: Event-Driven Autoscaling
- HPA vs KEDA: Side-by-Side Comparison
- Why This Matters in Production
- Architecture: GPU-Aware Autoscaling
- Setting Up KEDA with Kafka and GPU Metrics
- The Scale-to-Zero Challenge: Cold Starts
- Tuning Scaling Policies for Smooth Performance
- Combining Metrics: The Multi-Signal Approach
- Measuring Success: Cold Start Analysis
- Deep Dive: Metrics Collection and Prometheus Integration
- Common Pitfalls and How to Avoid Them
- Production Debugging: When Autoscaling Doesn't Work
- Optimization: Multi-Region and Geo-Failover
- Cost Optimization: The Economics of Autoscaling
- Putting It All Together: Complete Production Example
- The Feedback Loop Problem: Scaling Pathologies
- Summary
- Building Autoscaling Culture
- When Autoscaling Isn't the Right Answer
- The Organizational Angle: Autoscaling as a Tool for Different Stakeholder Groups
- The Feedback Loop Problem Revisited
The Problem with Standard Kubernetes Autoscaling
Kubernetes' default Horizontal Pod Autoscaler (HPA) works great for web services. It watches CPU and memory, scales up when utilization climbs, and scales down when things calm down. Simple. Effective.
But ML inference is different.
Here's why HPA alone isn't enough:
Your GPU utilization might be 5%, but you're sitting on a waiting queue of 10,000 inference requests. HPA doesn't know that. It sees low CPU, decides everything's fine, and keeps your pods starved while customers wait.
Or worse: you scale based on CPU threshold, but inference workloads have weird CPU/GPU relationships. Your GPU is maxed, but CPU sits at 40%. Your model performs inference almost entirely on the GPU - CPU is just glue.
Classic HPA is metric-blind. It optimizes for the wrong signals.
The fundamental issue here is that HPA was designed with stateless web services in mind - the kind where every request is independent and similar in cost. ML inference breaks these assumptions. A single inference request might take 100ms with light GPU load or 2 seconds with heavy compute. HPA's linear scaling assumption fails. You end up with scenarios where you're scaling up but still have terrible latency because your metric (CPU%) doesn't reflect what actually matters for your model's performance.
Introducing KEDA: Event-Driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) flips the script. Instead of watching generic system metrics, KEDA scales on events and custom metrics that actually drive your workload.
What KEDA brings to the table:
- Queue depth awareness: Scale based on message queue length (Kafka, RabbitMQ, AWS SQS)
- Custom Prometheus metrics: GPU utilization, inference latency, throughput
- Scale-to-zero: Actually shut down pods when idle, not just minimum replicas
- Business metrics: Requests in queue, pending jobs, inference confidence scores
Think of KEDA as HPA's smarter cousin. It's HPA-compatible (they can work together), but KEDA understands asynchronous workloads and external event sources.
The beauty of KEDA is that it solves the fundamental problem: your scaling signal becomes domain-aware. Instead of "the CPU is at 75%," you're saying "I have 100 inference requests in my queue and each pod processes 10/second, so I need 10 pods." That direct mapping between business reality and infrastructure is what makes KEDA so powerful.
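That queue-to-replica arithmetic is simple enough to sketch. Here's roughly the mapping a Kafka `lagThreshold` encodes (the function below is illustrative, not part of KEDA):

```python
import math

def desired_replicas(queue_depth: int, per_pod_throughput: float,
                     target_drain_seconds: float = 1.0) -> int:
    """Replicas needed so each pod's share of the queue drains in roughly
    target_drain_seconds. With a 1s target, per_pod_throughput plays the
    role of lagThreshold: desired = ceil(lag / lagThreshold)."""
    per_pod_budget = per_pod_throughput * target_drain_seconds
    return max(1, math.ceil(queue_depth / per_pod_budget))

# 100 queued requests, each pod handles 10/second -> 10 pods
print(desired_replicas(100, 10))  # -> 10
```

This direct mapping is exactly why the scaling signal feels domain-aware: the threshold is stated in units your service actually has (messages per pod), not in CPU percentages.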
HPA vs KEDA: Side-by-Side Comparison
| Feature | HPA | KEDA |
|---|---|---|
| Default Metrics | CPU, Memory | Queue depth, custom metrics, events |
| Custom Metrics | Prometheus (manual setup) | Native integrations (40+) |
| Scale to Zero | No (minReplicas ≥ 1) | Yes |
| Event Sources | None | Kafka, SQS, Postgres, HTTP webhooks |
| Complex Logic | Limited | Full ScaledObject configs |
| GPU Awareness | No | Yes (via custom metrics) |
Here's the key insight: HPA and KEDA aren't either/or. They're complementary. You can use KEDA's ScaledObject for queue-based scaling and HPA for secondary scaling on CPU spikes. Best of both worlds.
In practice, most production ML systems use KEDA as the primary scaler because it's optimized for asynchronous workloads (which is how most inference systems operate). HPA acts as a safety valve for unexpected compute patterns that don't correlate with queue depth.
Why This Matters in Production
Let's ground this in real economics. Many teams think autoscaling is optional - something you add later. This is wrong. Autoscaling is how you ship cost-efficient ML systems.
Consider a typical inference service that gets bursty traffic:
- Off-peak: 10 inference requests/hour
- Peak times: 1,000 requests/minute
- Model inference cost: $2.50/GPU/hour
Without autoscaling, you either pay for peak capacity all day ($60/day) or users experience 30+ second latencies during spikes. With proper autoscaling using KEDA, you pay for what you use, cutting costs by 70-90%.
But here's the complexity: getting autoscaling right in ML is hard. You need to understand:
- How long your model takes to load (cold start time)
- How many requests per pod you can sustain
- What latency targets you need to hit
- How sensitive your users are to scaling-related delays
KEDA lets you control these variables. HPA doesn't.
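As a back-of-envelope sketch of how those variables interact, here's one way to size a warm pool from cold-start time, per-pod throughput, and a backlog budget. The model and function name are ours, and deliberately crude (it assumes backlog grows linearly while new pods boot):

```python
def warm_pods_needed(spike_rps: float, per_pod_rps: float,
                     cold_start_s: float, max_backlog: float) -> int:
    """Smallest warm-pool size such that the backlog accumulated while
    new pods cold-start never exceeds max_backlog requests."""
    for warm in range(0, 1000):
        # Requests that queue up during the cold start at this warm-pool size
        backlog = max(0.0, spike_rps - warm * per_pod_rps) * cold_start_s
        if backlog <= max_backlog:
            return warm
    raise ValueError("no feasible warm-pool size under 1000 pods")

# 50 req/s spike, 10 req/s per pod, 90s cold start, tolerate 500 queued:
print(warm_pods_needed(50, 10, 90, 500))  # -> 5
```

Plug in your own measured cold-start time and throughput; the point is that "how many warm pods?" is an answerable question, not a guess.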
Architecture: GPU-Aware Autoscaling
Let's design a system that scales an ML inference service on multiple signals: queue depth, GPU utilization, and request throughput.
graph TB
A["🔄 Request Queue<br/>Kafka Topic"]
B["📊 Prometheus<br/>Metrics"]
C["🔌 KEDA<br/>ScaledObject"]
D["⚙️ Custom Metrics<br/>Adapter"]
E["🚀 Inference Pods<br/>GPU Cluster"]
F["📉 HPA<br/>Fallback"]
A -->|Queue Depth| C
B -->|GPU Util %<br/>Throughput| D
D -->|Scaled Values| C
C -->|Scale Signal| E
F -->|CPU Spike| E
E -->|Processed| A
E -->|Metrics Export| B
style A fill:#ff6b6b
style B fill:#4ecdc4
style C fill:#45b7d1
style E fill:#ffd93d
This is a composite system:
- Request Queue (Kafka) holds inference jobs
- Prometheus scrapes GPU metrics from inference pods
- Custom Metrics Adapter transforms Prometheus data into scaling signals
- KEDA ScaledObject watches both queue depth and custom metrics
- HPA acts as a safety net for unexpected CPU surges
- Inference Pods process jobs and export telemetry
Now let's make it real.
Setting Up KEDA with Kafka and GPU Metrics
First, install KEDA on your cluster:
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
Now, let's configure a ScaledObject that scales on queue depth and GPU utilization:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: ml-inference-scaler
namespace: ml-inference
spec:
scaleTargetRef:
name: inference-deployment
kind: Deployment
minReplicaCount: 0 # Scale to zero!
maxReplicaCount: 50
# Cooldown: be conservative on scale-down
cooldownPeriod: 300
fallback:
failureThreshold: 3
replicas: 2
# Define scaling triggers
triggers:
# Trigger 1: Kafka queue depth
- type: kafka
metadata:
bootstrapServers: kafka-broker.kafka:9092
consumerGroup: ml-inference-group
topic: inference-requests
# Scale up 1 pod per 10 messages in queue
lagThreshold: "10"
offsetResetPolicy: "latest"
authenticationRef:
name: kafka-auth
# Trigger 2: GPU utilization from Prometheus
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: gpu_utilization_percent
query: |
avg(gpu_utilization_percent{job="inference-pods"})
# Scale up when GPU util > 75%
threshold: "75"
authenticationRef:
name: prometheus-auth
# Trigger 3: Custom metric - inference latency
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: inference_latency_p99_ms
query: |
histogram_quantile(0.99,
rate(inference_latency_ms_bucket[30s]))
# If p99 latency exceeds 2 seconds, scale up
threshold: "2000"
This config does several things:
- minReplicaCount: 0 - Actually scales to zero. No pods running = no costs.
- Multiple triggers - Combines queue depth, GPU utilization, and latency. Any trigger can initiate scaling.
- Fallback replicas - If all metrics fail, maintain 2 pods (safety net).
- Conservative cooldown - Waits 5 minutes before scaling down (avoids thrashing).
Now let's define the Deployment that KEDA manages:
apiVersion: apps/v1
kind: Deployment
metadata:
name: inference-deployment
namespace: ml-inference
spec:
selector:
matchLabels:
app: inference
template:
metadata:
labels:
app: inference
spec:
# Schedule on GPU nodes
nodeSelector:
accelerator: nvidia-gpu
tolerations:
- key: nvidia.com/gpu
operator: Equal
value: "true"
effect: NoSchedule
containers:
- name: inference-server
image: my-inference-server:v1
# Resource requests for scheduling
resources:
requests:
cpu: 2
memory: 8Gi
nvidia.com/gpu: 1
limits:
cpu: 4
memory: 16Gi
nvidia.com/gpu: 1
# Expose Prometheus metrics
ports:
- name: metrics
containerPort: 8080
# Liveness/readiness for graceful scale-down
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
env:
- name: KAFKA_BROKERS
value: kafka-broker.kafka:9092
- name: KAFKA_TOPIC
value: inference-requests
- name: GPU_MEMORY_FRACTION
value: "0.95"
Key details:
- GPU scheduling: Node selector ensures pods run on GPU hardware
- Metrics port: Prometheus scrapes from :8080/metrics
- Readiness probes: Let Kubernetes stop routing traffic and drain connections gracefully before scale-down
The Scale-to-Zero Challenge: Cold Starts
Here's where it gets interesting. Scaling to zero saves money - you pay nothing when idle. But ML models take time to load.
A typical NVIDIA GPU inference server startup sequence:
- Container initialization: ~5 seconds
- Model load (weights, quantization): ~30-90 seconds (depends on model size)
- Warmup inference: ~5 seconds
- Ready to serve: Total ~40-120 seconds
That's your cold start latency. If requests arrive every few hours, you're paying this tax each time.
Let's measure it:
# Add to your inference server startup
import time
import os
from prometheus_client import Histogram, Counter
startup_time = Histogram(
'inference_startup_seconds',
'Time to load model and become ready',
buckets=(10, 30, 60, 90, 120)
)
startup_requests = Counter(
'inference_cold_starts_total',
'Total cold starts from zero replicas'
)
class InferenceServer:
    def __init__(self):
        start = time.time()
        startup_requests.inc()  # count this cold start
        self.load_model()
        self.warmup()
        elapsed = time.time() - start
        startup_time.observe(elapsed)
        print(f"Ready in {elapsed:.1f}s")

    def load_model(self):
        # Load the ONNX model into a runnable inference session
        # (onnx.load alone returns a model proto you can't call)
        import onnxruntime as ort
        self.session = ort.InferenceSession(os.getenv('MODEL_PATH'))
        print("Model loaded")

    def warmup(self):
        # Dummy inference to warm up the GPU and compile kernels
        import numpy as np
        dummy_input = np.random.randn(1, 3, 224, 224).astype(np.float32)
        input_name = self.session.get_inputs()[0].name
        self.session.run(None, {input_name: dummy_input})
        print("Warmup complete")
Now, decide: Is cold start acceptable?
If yes: Use scale-to-zero. Save 100% of idle costs. Accept 40-120s latency on first request.
If no: Keep 1-2 "warm pool" replicas running. Trade idle costs (hundreds of dollars a month for a small GPU, up to ~$1,800/month per always-on A100) for sub-second latency on every request.
Many teams use a hybrid: scale-to-zero during off-hours (nights, weekends), maintain warm pool during business hours.
Here's a KEDA config that implements this:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: ml-inference-with-warmpool
spec:
scaleTargetRef:
name: inference-deployment
# During business hours (8am-8pm US/Eastern): keep 2 warm
# During off-hours: scale to zero
minReplicaCount: 0
maxReplicaCount: 50
triggers:
- type: kafka
metadata:
bootstrapServers: kafka-broker.kafka:9092
topic: inference-requests
lagThreshold: "10"
# Add a time-based "always on" trigger for business hours
- type: cron
metadata:
timezone: America/New_York
start: 0 8 * * MON-FRI
end: 0 20 * * MON-FRI
desiredReplicas: "2"
This keeps 2 replicas warm during business hours (8am-8pm ET, M-F), then scales to zero nights and weekends.
Tuning Scaling Policies for Smooth Performance
Aggressive scaling creates chaos. Pod thrashing, unnecessary churn, cold starts, wasted resources. We need smart policies.
KEDA supports scaling policies that control how fast you scale up vs down:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: inference-tuned
spec:
scaleTargetRef:
name: inference-deployment
minReplicaCount: 1
maxReplicaCount: 50
# Stabilization windows prevent thrashing
advanced:
horizontalPodAutoscalerConfig:
behavior:
# Scale UP aggressively when queue builds
scaleUp:
stabilizationWindowSeconds: 60
policies:
# Double current replicas every 15 seconds until max
- type: Percent
value: 100 # Double current replicas
periodSeconds: 15
# OR add 10 pods per minute (whichever is larger)
- type: Pods
value: 10
periodSeconds: 60
selectPolicy: Max # Pick the most aggressive
# Scale DOWN conservatively
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 minutes
policies:
# Remove max 2 pods per minute
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Min # Most conservative
triggers:
- type: kafka
metadata:
bootstrapServers: kafka-broker.kafka:9092
topic: inference-requests
lagThreshold: "5"
What's happening here:
- Scale-up (fast): When queue builds, double your replicas every 15 seconds or add 10 pods per minute - pick the larger number. This handles traffic spikes quickly.
- Scale-down (slow): Remove at most 2 pods per minute, but wait 5 minutes of calm before even trying. This prevents the "flapping" where you scale up, immediately scale down, repeat.
Real-world result: Requests queue builds from 0→100 in seconds, you launch 10 new pods in 60 seconds, queue drains, you wait 5 minutes of silence, then taper down 2 pods every minute.
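A rough simulation of the scale-up side makes the numbers concrete. This ignores stabilization windows and metric evaluation timing; the function is illustrative, not HPA's actual algorithm:

```python
def doubling_path(start=1, steps=4, max_replicas=50):
    """Replica counts allowed by the Percent policy (value=100,
    periodSeconds=15): doubling each 15s period, capped at maxReplicaCount.
    Four periods = one minute."""
    path = [start]
    for _ in range(steps):
        path.append(min(max_replicas, path[-1] * 2))
    return path

print(doubling_path())  # -> [1, 2, 4, 8, 16]
# The Pods policy (value=10, periodSeconds=60) would only allow 1 -> 11
# in the same minute, so selectPolicy: Max lets doubling win while the
# replica count is small; at high counts the Pods policy takes over.
```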
Combining Metrics: The Multi-Signal Approach
Here's a production config that combines queue depth, GPU utilization, and throughput:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: ml-inference-production
namespace: ml-inference
spec:
scaleTargetRef:
name: inference-deployment
minReplicaCount: 1
maxReplicaCount: 100
cooldownPeriod: 300
advanced:
horizontalPodAutoscalerConfig:
behavior:
scaleUp:
stabilizationWindowSeconds: 30
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 20
periodSeconds: 60
selectPolicy: Max
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Pods
value: 3
periodSeconds: 60
selectPolicy: Min
triggers:
# Primary: Queue depth (inference requests waiting)
- type: kafka
metadata:
bootstrapServers: kafka-broker.kafka:9092
consumerGroup: inference-prod
topic: model-inference-requests
lagThreshold: "5" # 1 pod per 5 queued requests
# Secondary: GPU utilization (efficiency signal)
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring:9090
metricName: gpu_utilization
query: |
avg(container_accelerator_memory_used_bytes{pod=~"inference-.*"} /
container_accelerator_memory_total_bytes{pod=~"inference-.*"} * 100)
threshold: "85" # Scale if GPU memory > 85%
# Tertiary: Inference throughput (business metric)
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring:9090
metricName: inference_throughput
query: |
sum(rate(inference_requests_total{job="inference"}[1m]))
threshold: "1000" # Scale if throughput needs > 1000 req/s
# Quaternary: Latency SLO breach (performance signal)
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring:9090
metricName: inference_latency_slo_breach
query: |
(histogram_quantile(0.99,
rate(inference_latency_ms_bucket[5m])) > 2000)
* on() group_left() vector(1)
threshold: "0.5" # Fires whenever p99 latency exceeds the 2s SLO
Each trigger is independent. If any signal says "scale up," we scale. This gives you:
- Responsive: Queue builds? Immediate scale-up.
- Efficient: GPU maxed? Scale up before you're slow.
- Business-aligned: Throughput goals? Built in.
- Safe: Latency breaching SLO? Auto-remedy.
Measuring Success: Cold Start Analysis
After 2 weeks of production, analyze your autoscaling behavior:
# Pull cold start events from metrics
kubectl exec -it prometheus-0 -- promtool query instant \
  http://localhost:9090 \
  'count(increase(inference_cold_starts_total[2w]))'
# Check scaling frequency
kubectl exec -it prometheus-0 -- promtool query range \
  --start=$(date -u -d '2 weeks ago' +%s) \
  --end=$(date -u +%s) \
  --step=1h \
  http://localhost:9090 \
  'changes(keda_scaledobject_replicas[2w])'
# Calculate savings
# Cost per GPU per hour: ~$2.50 (adjust for your region/instance)
# Hours at zero replicas × cost per GPU × avg GPUs per pod
A well-tuned system typically sees:
- 50-70% time at 0 replicas (pure savings)
- <5 cold starts per day (assuming typical batch workloads)
- Fast scale-up (reaching a stable replica count in <2 minutes)
- $15,000-50,000 monthly savings (depending on baseline usage)
Deep Dive: Metrics Collection and Prometheus Integration
Before KEDA can scale intelligently, you need to export the right metrics from your inference pods. This is where most teams struggle - they export everything, or they export nothing.
Here's what you actually need:
from prometheus_client import Counter, Histogram, Gauge
import time
# Counters (monotonically increasing)
inference_requests = Counter(
'inference_requests_total',
'Total inference requests',
['model_name', 'status'] # Track success vs error
)
inference_tokens = Counter(
'inference_tokens_generated_total',
'Total tokens generated',
['model_name']
)
# Histograms (distributions with percentiles)
inference_latency = Histogram(
    'inference_latency_ms',
    'End-to-end inference latency',
    ['model_name'],
    buckets=[10, 50, 100, 250, 500, 1000, 2500, 5000, 10000],
)
queue_processing_time = Histogram(
    'queue_processing_time_ms',
    'Time from dequeue to response sent',
    ['model_name'],
    buckets=[100, 500, 1000, 5000, 10000],
)
# Gauges (point-in-time values)
gpu_memory_used = Gauge(
'gpu_memory_used_bytes',
'GPU memory currently used',
['gpu_id']
)
active_inference_jobs = Gauge(
'active_inference_jobs',
'Number of inflight inference requests'
)
model_load_time = Gauge(
'model_load_time_seconds',
'How long the model took to load on startup'
)
Now, the production pattern is to update these within your inference loop:
class InferenceService:
def process_request(self, request):
start = time.time()
active_inference_jobs.inc()
try:
# Your inference logic
result = self.model.predict(request.data)
inference_latency.labels(model_name="llm-7b").observe(
(time.time() - start) * 1000 # Convert to ms
)
inference_requests.labels(
model_name="llm-7b",
status="success"
).inc()
return result
except Exception as e:
inference_requests.labels(
model_name="llm-7b",
status="error"
).inc()
raise
finally:
active_inference_jobs.dec()
The critical insight: KEDA queries these metrics to make scaling decisions. If you're not exporting the right metrics, KEDA is flying blind. Don't export metrics just for observability - export them for decisioning.
Common Pitfalls and How to Avoid Them
Pitfall 1: Triggering Too Aggressively
Problem: Setting lagThreshold too low (e.g., "1") causes constant scaling. Each message triggers a new pod.
Solution: Match the threshold to pod throughput. If each GPU pod processes 20 messages/second, set lagThreshold to "100" (about 5 seconds of work per pod).
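That rule of thumb is just multiplication; a tiny helper (hypothetical, not a KEDA API) makes it explicit:

```python
import math

def lag_threshold(per_pod_msgs_per_s: float, target_drain_s: float) -> int:
    """lagThreshold ~= messages one pod can clear within the queue-drain
    time you're willing to tolerate."""
    return max(1, math.floor(per_pod_msgs_per_s * target_drain_s))

print(lag_threshold(20, 5))  # -> 100: the value from the example above
print(lag_threshold(2, 1))   # -> 2: slow pods need a much lower threshold
```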
Pitfall 2: Not Accounting for Startup Time
Problem: Scaling signals arrive, but pods take 90s to boot. Queue backs up while you wait.
Solution: Pre-calculate startup latency. If startup = 90s, add 1.5× buffer to your scaling threshold.
Pitfall 3: Leaving minReplicaCount > 0
Problem: Paying for idle pods. minReplicaCount: 1 defeats the cost savings.
Solution: Use scale-to-zero (minReplicaCount: 0) with warm-pool triggers for business hours.
Pitfall 4: Ignoring GPU Memory Constraints
Problem: Scaling up, but GPUs OOM because pods pack too tightly.
Solution: Request whole GPUs (nvidia.com/gpu: 1) so the scheduler places one pod per GPU, and monitor memory with nvidia-smi. Kubernetes won't schedule more pods than a node has GPUs.
Production Debugging: When Autoscaling Doesn't Work
You deployed the KEDA config, and... nothing happens. Queue builds up, but no new pods. Or pods scale endlessly. Here's how to debug.
First, check KEDA logs:
kubectl logs -n keda -l app.kubernetes.io/name=keda-operator -f
Look for errors parsing your ScaledObject or metric failures.
Second, verify the ScaledObject is active:
kubectl describe scaledobject ml-inference-scaler -n ml-inference
Check the status section. It should show each trigger as "Active: true". If any trigger shows "Active: false", that trigger isn't being evaluated.
Third, test your metrics manually:
# Port-forward to Prometheus
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Query your metric directly
curl 'http://localhost:9090/api/v1/query?query=kafka_consumergroup_lag{topic="inference-requests"}'
If the metric returns empty or stale data, your exporters aren't running or are misconfigured. KEDA can't scale on metrics that don't exist.
Fourth, check HPA behavior:
kubectl get hpa -n ml-inference
kubectl describe hpa keda-hpa-ml-inference-scaler -n ml-inference
KEDA creates an HPA under the hood (named keda-hpa-&lt;scaledobject-name&gt;). If the HPA shows "unknown" for metric values, metrics aren't reaching the API.
Fifth, audit scaling decisions:
# Watch KEDA scale events
kubectl get events -n ml-inference --sort-by='.lastTimestamp' | grep keda
# Check deployment scale history
kubectl rollout history deployment/inference-deployment -n ml-inference
If you're seeing scale events but no pod launches, the issue is likely node capacity or pod scheduling constraints.
Optimization: Multi-Region and Geo-Failover
If you have inference clusters across regions, you need KEDA + service mesh coordination. Here's the pattern:
---
# Region A: Primary inference cluster
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: inference-us-east
namespace: ml-inference
spec:
scaleTargetRef:
name: inference-deployment
minReplicaCount: 2 # Always 2 for failover
maxReplicaCount: 50
triggers:
- type: kafka
metadata:
bootstrapServers: kafka-us-east.region-a:9092
topic: inference-requests
lagThreshold: "10"
---
# Region B: Secondary (scales aggressively if primary fails)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: inference-us-west
namespace: ml-inference
spec:
scaleTargetRef:
name: inference-deployment
minReplicaCount: 0 # Scale to zero normally
maxReplicaCount: 100 # But scale *up* if region A lags
triggers:
- type: kafka
metadata:
bootstrapServers: kafka-us-west.region-b:9092
topic: inference-requests
# Smaller lag threshold: more aggressive scaling
lagThreshold: "5"
The logic: Region A carries the load with aggressive scale-down (lower cost). If Region A's queue builds up significantly, that's a signal Region A is struggling. Region B sees its own queue and scales aggressively. Traffic steering (via Istio/Linkerd) sends new requests to Region B. Region A recovers, queue drains, Region B scales down.
This requires traffic steering intelligence, not just KEDA, but KEDA provides the autoscaling foundation.
Cost Optimization: The Economics of Autoscaling
Here's the financial calculation most teams skip but should do religiously.
Baseline costs (8 GPUs, always on):
- 8 × NVIDIA A100 40GB = 8 × $2.48/hour (AWS pricing) = $19.84/hour
- Monthly (24/7): $19.84 × 730 hours = $14,483/month
With naive autoscaling (minReplicaCount: 1, always at least 1 pod):
- 1 GPU always on: $2.48/hour = $1,810/month
- 4-7 additional GPUs (average): $12.32/hour = $8,994/month
- Total: ~$10,804/month (25% savings)
With KEDA scale-to-zero (minReplicaCount: 0, warm pool during business hours):
- Business hours (8am-8pm, M-F = 60 hours/week × 4.33 weeks = 260 hours): 2 GPUs
- Off-hours: 0 GPUs
- Cost: (2 × $2.48 × 260) + (0 × (730 - 260)) = $1,289/month
- Peak load handling (additional 6 GPUs for 40 hours/week average): $14.88 × 40 × 4.33 = $2,577/month
- Total: ~$3,866/month (73% savings)
The difference between naive autoscaling and scale-to-zero? $6,938/month. That's $83,256 per year saved by proper autoscaling tuning.
But here's the gotcha: cold start costs money too. Each 90-second cold start means:
- 90 seconds = 0.025 hours
- Opportunity cost: requests you can't serve while spinning up
- Each cold start pod consumes ~$0.062 in just booting (0.025 × $2.48)
- 10 cold starts/day = $18.65/month
With scale-to-zero, the compute cost of cold starts is pennies ($18.65/month at 10 per day) versus ~$1,810/month for one always-warm A100 pod - so on dollars alone, scale-to-zero nearly always wins. The real break-even question is latency: how many requests per day can you afford to serve 90 seconds late? Most batch workloads tolerate far more than that.
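You can sanity-check the dollar side of that trade-off in a few lines, using the GPU rate assumed elsewhere in this article:

```python
GPU_HOURLY = 2.48        # A100 on-demand rate assumed in this article
HOURS_PER_MONTH = 730

def warm_pod_cost(pods: int) -> float:
    """Monthly cost of keeping `pods` GPU pods always on."""
    return pods * GPU_HOURLY * HOURS_PER_MONTH

def cold_start_cost(starts_per_day: int, boot_seconds: float = 90) -> float:
    """Monthly GPU-time cost of cold-start booting alone (30-day month)."""
    return starts_per_day * 30 * (boot_seconds / 3600) * GPU_HOURLY

# One always-warm pod vs. 10 cold starts/day:
print(round(warm_pod_cost(1), 2))     # -> 1810.4
print(round(cold_start_cost(10), 2))  # -> 18.6
```

The gap is two orders of magnitude, which is why the decision usually hinges on latency tolerance rather than the booting cost itself.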
The real lever: Understanding your traffic pattern. If you have:
- Steady state traffic (consistent requests): Keep 1-2 warm pods. Don't go to zero.
- Bursty traffic (sudden spikes, then quiet): Use KEDA with aggressive scale-up, but scale-to-zero during quiet periods.
- Bimodal traffic (busy 9-5, dead nights): Use cron-based warm pool during business hours, zero at night.
Measure your actual pattern with:
# In your inference server
import json
from datetime import datetime, time
from prometheus_client import Gauge
requests_per_hour = Gauge(
'inference_requests_per_hour',
'Rolling count of requests this hour'
)
class InferenceServer:
def __init__(self):
self.hourly_count = 0
self.last_hour = datetime.now().hour
def on_request(self):
now = datetime.now()
if now.hour != self.last_hour:
requests_per_hour.set(self.hourly_count)
self.hourly_count = 0
self.last_hour = now.hour
self.hourly_count += 1
After 2 weeks, you'll have enough data to build a cost-optimized KEDA config.
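Once you have a couple of weeks of hourly counts, a crude classifier like this can suggest which of the three strategies above fits. The coefficient-of-variation cutoffs are illustrative, not a standard:

```python
import statistics

def classify_pattern(hourly_counts):
    """Heuristic traffic-pattern classifier over hourly request counts."""
    mean = statistics.mean(hourly_counts)
    if mean == 0:
        return "idle"
    cv = statistics.pstdev(hourly_counts) / mean  # coefficient of variation
    if cv < 0.3:
        return "steady"    # keep 1-2 warm pods, skip scale-to-zero
    if cv < 1.0:
        return "bimodal"   # cron-based warm pool during busy hours
    return "bursty"        # scale-to-zero plus aggressive scale-up

print(classify_pattern([100] * 24))              # -> steady
print(classify_pattern([100] * 12 + [10] * 12))  # -> bimodal
print(classify_pattern([0] * 23 + [1000]))       # -> bursty
```

Treat the output as a starting point for tuning, not a verdict; a real analysis would also look at hour-of-day and day-of-week structure.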
Putting It All Together: Complete Production Example
Here's a battle-tested configuration for production inference:
---
# Namespace
apiVersion: v1
kind: Namespace
metadata:
name: ml-inference
---
# ScaledObject (KEDA)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: inference-ml-scaler
namespace: ml-inference
spec:
scaleTargetRef:
name: inference-api
minReplicaCount: 0
maxReplicaCount: 50
cooldownPeriod: 300
advanced:
horizontalPodAutoscalerConfig:
behavior:
scaleUp:
stabilizationWindowSeconds: 30
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 15
periodSeconds: 60
selectPolicy: Max
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Pods
value: 2
periodSeconds: 60
selectPolicy: Min
triggers:
- type: kafka
metadata:
bootstrapServers: kafka-broker.kafka:9092
consumerGroup: inference-prod
topic: inference-requests
lagThreshold: "10"
---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: inference-api
namespace: ml-inference
spec:
selector:
matchLabels:
app: inference-api
template:
metadata:
labels:
app: inference-api
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
nodeSelector:
accelerator: nvidia-gpu
tolerations:
- key: nvidia.com/gpu
operator: Equal
value: "true"
effect: NoSchedule
terminationGracePeriodSeconds: 120
containers:
- name: api
image: inference-server:v2
ports:
- name: http
containerPort: 8000
- name: metrics
containerPort: 8080
resources:
requests:
cpu: 2
memory: 8Gi
nvidia.com/gpu: 1
limits:
cpu: 4
memory: 16Gi
nvidia.com/gpu: 1
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 60
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 30
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
env:
- name: MODEL_PATH
value: /models/inference-model.onnx
- name: KAFKA_BROKERS
value: kafka-broker.kafka:9092
- name: GPU_MEMORY_FRACTION
value: "0.95"
This config:
- Scales to zero (minReplicaCount: 0)
- Uses Kafka queue depth as the primary signal
- Scales up aggressively (double every 15s)
- Scales down conservatively (2 pods/min after 5 min calm)
- Proper health checks for graceful drains
- GPU-aware scheduling and resource limits
The Feedback Loop Problem: Scaling Pathologies
Here's something that tends to bite teams in production: autoscaling can create feedback loops that make your system less stable, not more.

Imagine this scenario. Your queue depth metric spikes because your inference service got slightly slower (maybe due to a code deployment or temporary GPU contention). KEDA sees the queue building and scales up. The new pods take 90 seconds to boot and load the model. During that 90 seconds, the queue keeps building because you're still running at reduced capacity. By the time the new pods are ready, you may have overprovisioned significantly.

Then everything drains. You go from fully loaded to mostly idle in minutes. KEDA sees low queue depth and starts scaling down aggressively. You drop below your required capacity. Queue builds again. Welcome to scaling thrashing - your system oscillates between over- and under-provisioned.
The solution is conservative scaling policies. Scale up quickly (you can always scale down later). Scale down slowly (waiting is cheap compared to thrashing). Use stabilization windows that prevent rapid reversal. Add hysteresis to your metrics so a small blip doesn't trigger a scale decision. These seem like minor tuning details but they're critical for stability.
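Hysteresis is easy to sketch: use separate up and down thresholds with a dead band between them, so a small blip can't flip the decision back and forth. The thresholds below are illustrative:

```python
class HysteresisTrigger:
    """Activate above `high`, deactivate only below `low`; values in the
    dead band between them leave the current state unchanged."""
    def __init__(self, high: float, low: float):
        assert low < high
        self.high, self.low = high, low
        self.active = False  # are we currently in the "scaled up" state?

    def update(self, metric: float) -> bool:
        if metric >= self.high:
            self.active = True
        elif metric <= self.low:
            self.active = False
        return self.active

t = HysteresisTrigger(high=100, low=40)
print([t.update(v) for v in (50, 120, 80, 80, 30)])
# -> [False, True, True, True, False]: the dips to 80 don't flip the state
```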
Another pathology: cold start avalanches. You scale to zero during off-hours to save money. Traffic arrives. You need to launch fifty new pods simultaneously because all your capacity is zero. Kubernetes starts launching pods, pulling container images, allocating GPUs. Your container registry gets hammered. Your cluster's network gets saturated. Some pods timeout during startup. You don't actually launch fifty pods - you launch thirty because twenty hit errors. Now you're under-provisioned from the start and never catch up. Queue backs up. Users wait minutes for responses.
The solution here is warm pools during predictable busy times. If you know you're going to get traffic 9am-5pm, keep 2-5 pods warm during those hours. Don't scale to truly zero during business hours. The cost of keeping a couple warm pods running is much less than the cost of cold start failures and poor user experience.
A third pathology: metric stale-ness. Your scaling decision is based on metrics collected 60 seconds ago. In that 60 seconds, traffic quadrupled. By the time your scaling decision is made and executed, you're responding to yesterday's demand. Real-world solution: use shorter metric evaluation windows (15-30 seconds instead of 60) and accept more reactive behavior. Yes, you might scale one or two pods too many. But you're not constantly chasing your tail trying to catch up to real-time demand.
Summary
Autoscaling ML inference is hard. But with KEDA, Prometheus, and thoughtful scaling policies, you can build systems that:
- Save 50-70% of idle compute costs with scale-to-zero
- Maintain sub-second latency during traffic spikes
- Tolerate cold starts gracefully with warm-pool triggers
- Scale intelligently on queue depth, GPU utilization, and business metrics
Start simple: queue-based scaling with a warm pool. Measure cold starts. Add GPU metrics. Fine-tune stabilization windows. Iterate.
The difference between paying $50k/month and $10k/month for the same throughput? It's here.
Building Autoscaling Culture
The hardest part of implementing KEDA isn't the technology - it's building an organizational practice around it. Many teams deploy KEDA, set some reasonable defaults, and then never tune it. They don't measure whether they're actually saving money. They don't track cold starts. They don't correlate scaling decisions with user experience metrics. KEDA runs invisibly, looking like it's working, but it's probably sub-optimal.
The teams that get real value treat autoscaling as a continuous optimization problem. They measure baseline costs. They deploy KEDA with conservative settings. They monitor the metrics. They find inefficiencies. They tune the triggers. They measure again. They iterate. Over three months of this, they find 30-50% additional savings beyond what naive KEDA configuration gives them.
This requires tooling and discipline. You need dashboards that show you:
- How many replicas you're running over time
- Cost per period (with forecasting)
- Cold start frequency and latency impact
- Scale-up/scale-down frequency (detecting thrashing)
- Average queue depth
- User experience metrics (latency, errors, success rate)
When all these metrics are visible and tracked historically, teams naturally start asking "why is this metric bad?" and fixing it. The visibility drives the behavior change. Without visibility, autoscaling runs invisibly and you never know whether you're getting value.
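The thrashing signal in that list is worth automating. A cheap way to detect it is to count how often the replica-count time series flips direction; the function and threshold below are a sketch, not a standard metric:

```python
# Sketch of a thrashing detector: count direction changes in the
# replica-count time series sampled once per scaling interval.
def direction_changes(replicas: list[int]) -> int:
    """Number of times scaling flips between up and down."""
    deltas = [b - a for a, b in zip(replicas, replicas[1:]) if b != a]
    return sum(1 for a, b in zip(deltas, deltas[1:]) if (a > 0) != (b > 0))

history = [10, 12, 15, 13, 16, 12, 14, 10]  # replica counts per interval
print(f"{direction_changes(history)} direction changes")  # 5 flips in 8 samples
```

Five flips in eight samples is the chasing-the-load pattern; a healthy series ramps up, plateaus, and ramps down with only one or two direction changes. Graphing this count per hour on the dashboard turns a vague "it feels twitchy" into a number you can alert on.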
Also, involve the team that deploys and monitors the service in the tuning process. They have domain knowledge about typical traffic patterns, peak times, and user expectations. They know which metrics matter and which are noise. They're the ones who'll catch when something goes wrong. Making them partners in tuning KEDA, not just users of it, builds ownership and leads to better decisions.
When Autoscaling Isn't the Right Answer
Here is a truth that infrastructure vendors rarely advertise: not every workload benefits from autoscaling. Understanding when autoscaling makes sense and when it does not is critical for making good infrastructure decisions.
Autoscaling shines when you have variable traffic. If your inference service gets light traffic most of the time with occasional spikes, autoscaling can save you seventy to eighty percent of your costs. You are not paying for capacity you are not using. But if your traffic is perfectly steady - maybe you serve internal batch inference jobs that run at the same rate every day - autoscaling brings overhead without benefit. You might as well provision a fixed number of pods and save the operational complexity of maintaining KEDA.
Autoscaling also requires that your workloads tolerate latency variation. When you scale up, there is overhead. New pods spin up. Models load. There is a startup penalty, measured in seconds. If your users are okay with occasional requests taking longer, this is fine. But if you have strict latency SLAs, scaling to zero during off-hours might violate those SLAs. You might end up maintaining a warm pool at all hours to keep latency predictable, which defeats the cost savings of scale-to-zero.
The cold start problem is also real. Scaling from zero is cheap in theory but expensive in practice if your cold start penalty is high. If your model takes ninety seconds to load, and a traffic spike causes you to launch fifty pods simultaneously, you have ninety seconds where all fifty pods are loading their models and providing no service. The queue backs up. You actually launched too many pods because they were not ready when the traffic arrived. You pay for more concurrency than you needed, making the cost calculation worse.
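The arithmetic behind that scenario is worth making explicit. With assumed numbers for arrival rate and per-pod throughput (the 90-second load time comes from the text), the backlog that accumulates while pods load is simply arrival rate times load time:

```python
import math

# Rough queue-backlog estimate during a cold start. The arrival rate and
# per-pod throughput are assumptions; the 90s load time is from the text.
arrival_rps = 200        # assumed incoming request rate
load_s = 90              # model load time per pod
per_pod_rps = 4          # assumed steady-state throughput per pod

backlog = arrival_rps * load_s                        # requests queued while pods load
pods_needed = math.ceil(arrival_rps / per_pod_rps)    # pods needed at steady state

print(f"backlog after cold start: {backlog} requests")  # 18000
print(f"steady-state pods needed: {pods_needed}")       # 50
```

Eighteen thousand queued requests must be drained on top of live traffic, which is why the autoscaler overshoots: it sizes for the backlog, not the steady state, and the extra pods become idle cost minutes later.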
Another case where autoscaling fails is when your workloads are tightly coupled. If you are running a distributed training job that requires exactly eight GPUs talking to each other over high-bandwidth interconnect, autoscaling does not help. The job either runs with all eight GPUs or it does not run at all. You cannot scale gradually.
And here is something that catches many teams: autoscaling adds latency to the request path. Your scaling metrics are checked every fifteen to thirty seconds. If traffic spikes, you might not detect it for thirty seconds. By the time you are spinning up new pods, you already have a queue. Your users experience long latencies during the ramp-up. A system with pre-provisioned capacity would have served those requests instantly. For latency-sensitive applications, the cost of autoscaling overhead might outweigh the cost savings of not running idle capacity.
The Organizational Angle: Autoscaling as a Tool for Different Stakeholder Groups
Autoscaling creates alignment and misalignment between different groups in an organization. For cost-conscious organizations, autoscaling is obviously good. It reduces cloud spending. But for reliability-focused groups, autoscaling can look like unnecessary risk. Every scaling decision is an opportunity for something to go wrong. Pods might fail to start. Metrics might be stale. You might under-provision and create a terrible user experience.
This tension is real and it is not irrational. The solution is transparency and measurement. If you can show that your autoscaling system reliably handles ninety-eight percent of traffic spikes with zero SLA violations, reliability teams will trust it. But this requires months of history and careful monitoring. Early autoscaling deployments should be conservative. Run with a larger-than-necessary baseline. Autoscale only for traffic that significantly exceeds your baseline. As you gain confidence, you can make the policy more aggressive.
This also suggests that autoscaling is not a binary choice but a spectrum. At one extreme is manual scaling - you watch dashboards and manually request more capacity when needed. At the other extreme is aggressive autoscaling that can scale from zero pods to one hundred pods in minutes. Most production systems benefit from something in between. Maintain a small baseline that handles typical load, then autoscale for traffic that exceeds the baseline. This hybrid approach gives you reliability when you need it and cost savings when you can afford the scale-up latency.
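The hybrid approach reduces to a simple replica formula: never go below the baseline, never above the ceiling, and otherwise size for the queue. The parameter values here are illustrative assumptions:

```python
import math

# Hybrid policy sketch: a fixed baseline absorbs typical load; autoscaling
# only covers the excess. All parameter values are illustrative.
def desired_replicas(queue_depth: int, per_pod_capacity: int,
                     baseline: int = 5, max_replicas: int = 100) -> int:
    needed = math.ceil(queue_depth / per_pod_capacity)
    return min(max(baseline, needed), max_replicas)

print(desired_replicas(queue_depth=40, per_pod_capacity=20))   # light load -> 5
print(desired_replicas(queue_depth=600, per_pod_capacity=20))  # spike -> 30
```

Under light load the baseline dominates and latency stays predictable; during a spike the queue term takes over and you pay for extra pods only while the excess traffic lasts.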
The Feedback Loop Problem Revisited
Earlier I mentioned scaling pathologies. Let me expand on that because it is one of the hardest problems in autoscaling and most teams encounter it eventually. Imagine this scenario. Your inference service is running smoothly at ten pods. Traffic increases. KEDA detects the queue building and scales to fifteen pods. The new pods start up and load their models. During model loading, they are not yet processing requests, so the queue keeps building. By the time the new pods are ready, maybe you needed twenty pods all along. You scale to twenty. Traffic spikes again. You scale to thirty. Then everything drains. Traffic was a temporary spike. Now you have thirty pods running with minimal load. KEDA scales down, but conservatively, removing maybe two pods per minute. It takes fifteen minutes to get back to ten pods. You are paying for idle capacity the whole time.
During this entire cycle, your system was constantly chasing the load instead of smoothly handling it. This is called thrashing, and it is the enemy of cost-efficient autoscaling.
The fix is to be smarter about scaling decisions. Use exponential backoff on scale-down. If you just scaled up, do not scale down for several minutes. Detect traffic spikes and pre-scale before the queue gets large. Look at trends in your metrics, not just instantaneous values. If your queue is growing but not yet large, you might want to start scaling before it explodes. Some advanced autoscaling systems use machine learning to predict traffic and scale preemptively. But most teams do not need that. Conservative scaling policies - scale up fast, scale down slow, wait between decisions - solve most thrashing problems.
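Those three rules - scale up fast, scale down slow, pre-scale on a growing trend - fit in a few lines. The class name, thresholds, and two-pods-per-decision drain rate below are all assumptions to tune, not a reference implementation:

```python
# Minimal sketch of "scale up fast, scale down slow" with a trend check.
# Cooldown length and drain rate are assumptions to tune for your workload.
class ScalePolicy:
    def __init__(self, cooldown_s: float = 300.0):
        self.cooldown_s = cooldown_s
        self.last_scale_up = 0.0

    def decide(self, current: int, desired: int, queue_trend: float,
               now: float) -> int:
        if desired > current:
            self.last_scale_up = now
            return desired                    # scale up immediately
        if queue_trend > 0:                   # queue growing: pre-scale, never shrink
            return current + 1
        if now - self.last_scale_up < self.cooldown_s:
            return current                    # hold: just scaled up recently
        return max(desired, current - 2)      # drain slowly, 2 pods per decision

policy = ScalePolicy()
print(policy.decide(current=10, desired=15, queue_trend=0.0, now=100.0))  # -> 15
print(policy.decide(current=15, desired=8, queue_trend=0.0, now=200.0))   # -> 15 (cooldown)
print(policy.decide(current=15, desired=8, queue_trend=0.0, now=500.0))   # -> 13
```

Note how the third call shrinks by two pods rather than jumping straight to the desired count - the slow drain is what stops a brief lull from emptying the pool right before the next spike.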