Canary Deployments for ML Models
You're about to push a new ML model to production. It has better accuracy on your test set, but what if it fails on real data? What if it crashes under load? What if it has silent quality degradation that tanks your business metrics? Canary deployments are your safety net - they let you roll out models progressively, catching problems before they affect all your users.
This isn't just deploying code. ML models are different. You need to watch not just error rates and latency, but also embedding drift, prediction accuracy, and business outcomes. Argo Rollouts gives you the Kubernetes machinery to do this automatically, with Prometheus queries deciding whether to push forward or roll back in seconds.
Let's build a canary deployment system that treats your ML models with the rigor they deserve.
Table of Contents
- Why Canary Deployments Matter for ML Models
- Why Confidence Matters When Deploying Models
- The Hidden Complexity of ML Deployments
- Why Business Context Changes Everything
- The Canary Progression: From 1% to 100%
- The Hidden Layer: Why These Gates Matter
- Argo Rollouts: The Kubernetes Primitive for Progressive Delivery
- Key Components
- Understanding the Rollout Lifecycle
- Why Kubernetes Primitives Matter for ML
- Building a Complete Canary Rollout Manifest
- Analysis Templates: Where the ML Logic Happens
- Expected Output: Watching the Rollout Happen
- Automated Rollback: Getting Back to Safety in Under 2 Minutes
- What Happens Next: The Runbook
- Detecting Quality Degradation: The ML-Specific Metrics
- The Metrics Hierarchy
- Why Embeddings Matter as a Leading Indicator
- 1. Embedding Drift via Cosine Similarity
- 2. Shadow Logging and Ground Truth Validation
- 3. Business Metric Monitoring
- Putting It All Together: The Production Workflow
- Advanced Patterns: When Canary Deployments Get Real
- The Real World: Where Deployments Get Messy
- Handling Model-Specific Latency
- Multi-Model Canaries
- Rollout with Feature Flags
- Monitoring the Rollout: Dashboards and Alerts
- The Psychology of Dashboards
- Alert Aggregation and Signal-to-Noise Ratio
- Key Takeaways
- References
Why Canary Deployments Matter for ML Models
Traditional deployments ask a simple question: "Does the code run?" ML deployments ask much harder ones: "Does the model still work on real data?" That's the fundamental difference.
When you deploy a new model, you're shipping a black box trained on historical data. Concept drift, data distribution shifts, and subtle performance degradation happen in production, not in your test environment. A canary strategy catches these problems before they spiral.
Here's what you get:
- Progressive traffic shifting: You don't bet the company on a single deployment. You start with 1% of traffic, then 5%, then 20%, watching for problems at each gate.
- Automated quality gates: Prometheus queries monitor accuracy, latency, embedding drift, and business metrics. If any metric breaches a threshold, the rollout pauses or rolls back automatically.
- Fast rollback SLA: If something goes wrong, you're back to the stable model in under 2 minutes - not hours of debugging in production.
- Human approval loops: For major traffic increases (like going from 20% to 50%), you can require explicit approval, giving your ML team a final sanity check.
The payoff is simple: you deploy more frequently, with higher confidence, and fail faster when you do.
Why Confidence Matters When Deploying Models
Consider the emotional and business reality of model deployments. Your data science team spent weeks optimizing a new model. They validated it on holdout test sets. The accuracy numbers look great. They're confident. But they're also cautious, because they've been burned before. They've seen models with great test-set performance that flopped in production. They know that test sets don't reflect the full distribution of real-world data. They know that edge cases exist that weren't represented in training.
Canary deployments let you transform this nervousness into confidence. Instead of demanding that the team stake their reputation on a single binary decision - deploy or don't deploy - canary gives them a structured path to confidence. Start with 1% of traffic. Watch for 5 minutes. If nothing breaks, move to 5%. If embedding distributions look normal, proceed to 20%. Each stage is a question: "Is this stage okay?" If the answer is yes, move forward. If it's no, you've caught a problem early.
This progressive validation transforms deployments from nerve-wracking binary decisions into calm, data-driven progressions. The team's confidence grows with each gate that passes. And if a gate fails, you have evidence. You know exactly which metric broke and at what traffic level. This beats the alternative - deploying fully, getting burned by real users, then frantically rolling back while customers complain.
The Hidden Complexity of ML Deployments
Most infrastructure teams come from traditional software deployment backgrounds. They understand rolling deployments, blue-green swaps, and gradual traffic shifts. But ML adds layers of complexity that standard deployment patterns don't address. When you deploy a new version of your API, you're mostly shipping a different algorithm for the same logical problem. When you deploy a new ML model, you're fundamentally changing how the system understands the world.
The trained weights encode learned patterns from historical data. If that data distribution was different from what your model now sees in production, or if the relationships between features have shifted, your new model might confidently make completely wrong predictions. This is why we can't just watch error rates and latency. A model can have perfect latency and zero errors from an infrastructure perspective while silently degrading in quality. It returns predictions fast, it doesn't crash, everything looks great on the ops dashboard. But those predictions are wrong.
Your users don't notice immediately, but over time, they stop trusting your system. Your recommendation engine recommends irrelevant products. Your fraud detector misses obvious fraud. Your pricing model makes terrible decisions. The failure mode is insidious because it's invisible to traditional monitoring. Canary deployments solve this by making you measure what actually matters: not whether your infrastructure is healthy, but whether your predictions are still good. This shift in perspective is fundamental. You stop asking "is the service up?" and start asking "is the service still right?" These are orthogonal concerns, and you need both.
Why Business Context Changes Everything
Many teams build canary deployments focused purely on technical metrics - accuracy, latency, error rates. This is necessary but not sufficient. The truth is that technical correctness doesn't guarantee business success. A model might be technically accurate but economically harmful.
Imagine a recommendation engine. Version 1 has an NDCG score of 0.75. Version 2 has an NDCG score of 0.78 - clearly better from a ranking perspective. But version 2 focuses recommendations on higher-margin products that generate less engagement. Users skip them. Conversion rate drops by 3%. Revenue per user drops by 8%. The model is technically more accurate, but the business gets worse.
Without business metric gates, you'd deploy this model thinking you've made progress. With them, you catch the problem in the canary phase. That's the discipline canary deployments bring: you measure what actually matters to your business, embed those measurements into your gates, and refuse to proceed if those metrics degrade. This transforms deployment from a technical exercise into a business decision made with real data.
The Canary Progression: From 1% to 100%
A typical ML canary rollout follows this progression:
| Stage | Traffic % | Duration | Key Metrics | Gate |
|---|---|---|---|---|
| Canary 1 | 1% | 5 min | Error rate <0.1%, P99 latency <500ms | Automated |
| Canary 2 | 5% | 10 min | Accuracy >95% baseline, no embedding drift | Automated |
| Canary 3 | 20% | 15 min | Business metrics stable, no anomalies | Automated |
| Canary 4 | 50% | 20 min | All metrics nominal, manual approval | Human |
| Canary 5 | 100% | 5 min | Final promotion to stable | Automated |
Each stage has explicit health checks. If the canary model's accuracy drops below 95% of the baseline, the rollout doesn't proceed to the next stage - it either pauses for investigation or rolls back automatically.
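If you want to keep the table and your manifest in sync, the progression is just data. Here's a hedged sketch that expands (weight, pause) pairs into Argo-style canary steps - the helper name and dict shapes are illustrative, not part of Argo's API, and a None pause stands for an indefinite pause awaiting manual approval:

```python
# Hypothetical stage table mirroring the progression above.
# None for the pause duration means "pause until a human approves".
STAGES = [(1, "5m"), (5, "10m"), (20, "15m"), (50, None), (100, "5m")]

def to_argo_steps(stages):
    """Expand (weight, pause) pairs into Argo-style canary step dicts."""
    steps = []
    for weight, pause in stages:
        steps.append({"setWeight": weight})
        # An empty pause dict pauses indefinitely (manual approval gate).
        steps.append({"pause": {"duration": pause}} if pause else {"pause": {}})
    return steps
```

Generating the steps from one table keeps your documentation, dashboards, and manifests from drifting apart.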
The Hidden Layer: Why These Gates Matter
You might wonder why we care about embedding drift or why business metrics matter. Here's the logic:
- Error rate and latency catch infrastructure problems (crashed pods, OOM, network timeouts).
- Accuracy catches model quality issues (the model learned something bad, or the production data differs from training).
- Embedding drift (measured via cosine similarity of embeddings) catches silent failures - the model might still classify correctly on obvious cases, but fail on edge cases that are creeping into your data.
- Business metrics (conversions, revenue, user engagement) catch problems your test metrics miss. A model might be technically accurate but hurt your business if it recommends the wrong products.
Without these gates, you're flying blind. This is why so many ML deployments go wrong - teams fixate on accuracy numbers and ignore the broader context. A model that increases accuracy by 2% but decreases user engagement by 5% is a failure, no matter what the metrics say. Canary deployments force you to measure what actually matters.
Argo Rollouts: The Kubernetes Primitive for Progressive Delivery
Argo Rollouts is a Kubernetes controller that extends the native Deployment resource with advanced rollout strategies. Instead of swapping all pods at once (the old way), Argo gives you fine-grained control over traffic shifting and automated analysis.
Key Components
Rollout CRD: Replaces Deployment. Defines the desired state, canary steps, and analysis templates.
Analysis Templates: Prometheus queries that run at each stage to determine if the canary passes.
Rollout Controller: Watches the cluster, manages pod lifecycle, and makes promotion decisions based on analysis results.
Service meshes (optional): Istio or SMI integrations let you shift traffic at the network layer instead of the pod layer, giving you fractional percentages (e.g., 1.5% of traffic).
For ML models, the most powerful feature is the ability to query arbitrary metrics from Prometheus and make decisions based on results. You can check if your model's inference accuracy is still above 95%, or if token generation latency spiked.
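One useful mental model (an assumption about how PromQL comparison queries behave, not a copy of Argo's provider code): an expression like error_rate < 0.1 drops every series where the condition is false, so an empty result vector means the check failed. A minimal sketch of that decision, operating on the JSON shape Prometheus's query API returns:

```python
def gate_passed(prom_response: dict) -> bool:
    """Judge a PromQL comparison result as a gate.

    Comparison queries return samples only for series where the
    condition holds, so an empty result vector means the check failed.
    A query error also counts as a gate failure (fail closed).
    """
    if prom_response.get("status") != "success":
        return False
    return bool(prom_response.get("data", {}).get("result"))
```

Failing closed on query errors matters: a broken Prometheus should pause your rollout, not silently wave it through.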
Understanding the Rollout Lifecycle
The Argo Rollout goes through predictable states: Progressing, Paused (if a manual approval is required), and either Succeeded or Aborted. Understanding these states is crucial for troubleshooting. When a rollout gets stuck in Paused, check the AnalysisRun status - usually a gate failed and needs investigation. When a rollout moves to Aborted, the gates caught a real problem, and your automatic safety net worked as intended. This is success, even if it feels like failure in the moment.
The genius of Argo is that it makes the invisible visible. Traditional deployments hide most of the state behind kubectl commands. Canary deployments with Argo force transparency: every gate pass or failure is logged, every metric is recorded, every decision is auditable. Six months later when someone asks "why did we roll back that model version?", you have the answer in your analysis runs.
Why Kubernetes Primitives Matter for ML
You might wonder why we're layering an Argo Rollout abstraction on top of Kubernetes Deployments. Isn't that over-engineering? The answer comes down to what Kubernetes natively gives you versus what ML systems actually need. Kubernetes Deployments handle pod lifecycle: scheduling, resource management, restart policies. They're excellent at ensuring your infrastructure is healthy. But they have no concept of model quality.
A native Deployment can perform a rolling update: delete an old pod, start a new one, repeat. All pods move to the new version eventually. If something goes wrong, you manually kubectl rollout undo. This works for infrastructure code but fails for models. You can't manually decide whether to roll back when you're on-call at 3 AM and don't have the context to understand whether the accuracy drop is expected or catastrophic.
Argo changes this by introducing the concept of gates and automated decision-making. Gates are Prometheus queries that run automatically. If a query returns false, Argo knows "this stage is not safe, pause or roll back." Your on-call engineer doesn't need to make a judgment call. The system made it based on data. This removes the cognitive load from humans and replaces it with objective, auditable decisions.
This is especially valuable in scale-out scenarios. If you're managing 100 ML models across your organization, you can't have humans manually reviewing each deployment. You need automation. You need gates. You need Argo.
Building a Complete Canary Rollout Manifest
Let's build a real example: deploying a new LLM inference service alongside an old one, with canary stages that check generation quality and latency.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: llm-inference-service
namespace: ml-models
spec:
replicas: 10
selector:
matchLabels:
app: llm-inference
template:
metadata:
labels:
app: llm-inference
version: v2
spec:
containers:
- name: model-server
image: us.gcr.io/myproject/llm-inference:v2.1.0
ports:
- containerPort: 8000
resources:
requests:
memory: "4Gi"
cpu: "2"
limits:
memory: "6Gi"
cpu: "4"
env:
- name: MODEL_PATH
value: "/models/llama-7b-v2"
- name: NUM_GPUS
value: "1"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- llm-inference
topologyKey: kubernetes.io/hostname
strategy:
canary:
canaryService: llm-inference-canary
stableService: llm-inference-stable
trafficWeight:
canary: 0
stable: 100
steps:
# Stage 1: 1% traffic, 5 minutes
- setWeight: 1
- pause:
duration: 5m
- analysis:
templates:
- name: error-rate-check
- name: latency-p99-check
# Stage 2: 5% traffic, 10 minutes
- setWeight: 5
- pause:
duration: 10m
- analysis:
templates:
- name: embedding-drift-check
- name: accuracy-check
# Stage 3: 20% traffic, 15 minutes
- setWeight: 20
- pause:
duration: 15m
- analysis:
templates:
- name: business-metrics-check
# Stage 4: 50% traffic with manual approval
- setWeight: 50
- pause:
duration: 0
# Stage 5: 100% traffic
- setWeight: 100

This manifest defines the rollout strategy, but the real intelligence lives in the analysis templates.
Analysis Templates: Where the ML Logic Happens
Analysis templates are where you plug in Prometheus queries that validate your model's health. Here's a complete example:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: llm-inference-canary-analysis
namespace: ml-models
spec:
metrics:
# Gate 1: Error rate must stay below 0.1%
- name: error-rate-check
interval: 1m
count: 5
threshold: 1
failureLimit: 1
provider:
prometheus:
address: http://prometheus-operated.monitoring.svc.cluster.local:9090
query: |
100 * (
rate(model_inference_errors_total{model="llm-inference"}[5m]) /
rate(model_inference_requests_total{model="llm-inference"}[5m])
) < 0.1
# Gate 2: P99 latency must stay under 500ms
- name: latency-p99-check
interval: 1m
count: 5
threshold: 1
provider:
prometheus:
address: http://prometheus-operated.monitoring.svc.cluster.local:9090
query: |
histogram_quantile(0.99,
rate(model_inference_duration_seconds_bucket{model="llm-inference"}[5m])
) < 0.5
# Gate 3: Embedding drift (cosine similarity to baseline)
- name: embedding-drift-check
interval: 2m
count: 5
threshold: 1
failureLimit: 1
provider:
prometheus:
address: http://prometheus-operated.monitoring.svc.cluster.local:9090
query: |
model_embedding_drift_cosine_similarity{model="llm-inference",baseline="v1"} > 0.92
# Gate 4: Accuracy > 95% of baseline
- name: accuracy-check
interval: 5m
count: 3
threshold: 1
failureLimit: 1
provider:
prometheus:
address: http://prometheus-operated.monitoring.svc.cluster.local:9090
query: |
(model_inference_accuracy{model="llm-inference",version="v2"} /
model_inference_accuracy{model="llm-inference",version="v1"}) > 0.95
# Gate 5: Business metrics (e.g., user engagement, revenue)
- name: business-metrics-check
interval: 5m
count: 2
threshold: 1
provider:
prometheus:
address: http://prometheus-operated.monitoring.svc.cluster.local:9090
query: |
(business_metric_conversion_rate{model="llm-inference",version="v2"} /
business_metric_conversion_rate{model="llm-inference",version="v1"}) > 0.98

Each metric runs at a defined interval (e.g., every 1 minute) and counts successes. Once the count threshold is met (e.g., 5 successful checks in a row), the gate passes and the rollout proceeds to the next stage.
If a query ever returns false (or errors), it's a gate failure. The rollout pauses or rolls back based on your configuration.
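To build intuition for how interval, count, and failureLimit interact, here's a simplified simulation. The function name and its return strings are illustrative, not Argo's actual controller logic - it just replays a stream of boolean measurements under the same rules the templates above describe:

```python
def judge_metric(measurements, count=5, failure_limit=1):
    """Replay boolean measurements the way a metric with `count` and
    `failureLimit` would judge them (simplified sketch):
    - Failed as soon as failures exceed failure_limit.
    - Successful once `count` measurements have run within the limit.
    - Inconclusive if there aren't enough measurements yet.
    """
    failures = 0
    for taken, ok in enumerate(measurements, start=1):
        if not ok:
            failures += 1
            if failures > failure_limit:
                return "Failed"
        if taken >= count:
            return "Successful"
    return "Inconclusive"
```

Seeing the rules as code makes the failure mode obvious: with failure_limit: 1, a single bad scrape is forgiven, but two is a rollback.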
Expected Output: Watching the Rollout Happen
When you apply this Rollout, here's what you see:
$ kubectl get rollouts -n ml-models -w
NAME DESIRED CURRENT UPDATED READY AVAILABLE
llm-inference-service 10 10 3 3 3
$ kubectl describe rollout llm-inference-service -n ml-models
Name: llm-inference-service
Namespace: ml-models
Status: Progressing
Current Step: 2/10
Desired Replicas: 10
Current Replicas: 10
Canary:
Weight: 1
Desired: 1
Current: 1
Updated: 1
Ready: 1
Available: 1
Analysis Runs:
error-rate-check: Successful
latency-p99-check: Successful
Message: Waiting 4m30s until next step

Argo is running your Prometheus queries every minute. If error rates stay below 0.1%, the check passes. Once you've had 5 successful checks, Argo moves to the next step.
If a check fails, you'll see:
Status: Paused
Current Step: 2/10
Paused Reason: Paused before the next step
Analysis Runs:
accuracy-check: Failed (accuracy dropped to 0.91x baseline)
Message: AnalysisRun failed

At this point, you investigate the model, fix the issue, and either retry the rollout or roll back.
Automated Rollback: Getting Back to Safety in Under 2 Minutes
Here's the scary moment: an analysis gate fails. Your new model's accuracy is lower than expected. What happens?
With proper Argo configuration, rollback is automatic:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: llm-inference-service
spec:
strategy:
canary:
analysis:
templates:
- name: accuracy-check
failed:
updateStepIndex: 0 # Rollback to step 0 (stable)
stopProgressing: true

When an analysis run fails, Argo:
- Halts the rollout immediately (stops shifting traffic).
- Initiates rollback by shifting traffic back to the stable version.
- Terminates canary pods (the new version).
- Completes in <2 minutes (SLA).
Here's what this looks like in practice:
$ kubectl get rollouts -n ml-models -w
NAME STATUS CANARY_WEIGHT
llm-inference-service Progressing 20
# [accuracy-check fails]
$ kubectl get rollouts -n ml-models -w
NAME STATUS CANARY_WEIGHT
llm-inference-service Progressing 15
llm-inference-service Progressing 10
llm-inference-service Progressing 5
llm-inference-service Progressing 1
llm-inference-service Progressing 0
llm-inference-service Rollback 0

Within 2 minutes, all traffic has shifted back to the stable model. The bad canary pods are gone. Your users never saw the degraded model.
The beauty of this is that it's all automated and auditable. You don't need an on-call engineer frantically rolling back. The system makes the decision based on objective metrics. Later, when you review what happened, you have a clear audit trail: at T+22min, the accuracy-check failed with a ratio of 0.91x baseline. The rollout halted. The rollback completed at T+24min. This transparency is invaluable for post-mortems and learning.
What Happens Next: The Runbook
You don't just move on. Automate the incident response:
- Immediate Alert: PagerDuty triggers with rollout failure details (which gate failed, metric values, step number).
- Slack Notification: Your ML team gets pinged with a link to the failed analysis run.
- Automatic Investigation: Log aggregation pulls error traces and model inference logs from the 20-minute canary window.
- Post-Mortem Template: An automated task appears in your incident management system, pre-filled with:
- Rollout manifest (what we tried to deploy)
- Failed metrics (accuracy: 0.91x baseline vs 0.95x required)
- Canary duration (20 minutes)
- Data distribution samples (what did the canary see that training didn't?)
- Suggested next steps (retraining? data investigation? model adjustment?)
This keeps your team focused on root cause, not firefighting.
Detecting Quality Degradation: The ML-Specific Metrics
Standard deployment metrics (error rate, latency) aren't enough for ML models. You need model-specific signals. Understanding what can go wrong helps you build better gates. Models don't fail catastrophically - they degrade subtly. Your metrics should catch that subtlety.
The Metrics Hierarchy
Think of metrics as a pyramid. At the base are infrastructure metrics: is the service up, can it respond, does it crash? These are table stakes. Without them, nothing else matters. But they're not sufficient. The next layer is service metrics: latency and throughput. Is the system responsive? Can it handle load? These tell you if your infrastructure is configured correctly.
The top of the pyramid is model metrics: accuracy, precision, recall, and business outcomes. These tell you if your model still makes good predictions. Many teams focus only on the bottom two layers and wonder why their models degrade in production. They're measuring the wrong things. Infrastructure can be perfectly healthy while model quality crumbles. You need all three layers working together.
The canary deployment gates you'll build will span all three layers. Early stages focus on infrastructure (error rate, latency). Later stages focus on model metrics (accuracy, embedding drift). Final stages focus on business outcomes (conversion, revenue). This pyramid of gates catches problems at every level.
Why Embeddings Matter as a Leading Indicator
Here's a subtle but powerful insight: embedding similarity is a leading indicator of model degradation. It happens before accuracy measurably declines. Your embeddings are the model's internal representation of the input. If those representations start to drift - systematically differ from the baseline - the model's learned patterns are changing.
This could be because the training distribution was different from the production distribution. It could be because the new model learned slightly different patterns due to randomness in training or subtle hyperparameter differences. It could be because the production data has genuinely shifted. Whatever the cause, embedding drift is a red flag. It's worth pausing the deployment to investigate.
The beautiful part is that embedding similarity requires no ground truth labels. You only need to run both the baseline and canary models on the same inputs, compute their embeddings, and measure cosine similarity. This can be done in production in real-time. It's fast, cheap, and surprisingly predictive of downstream quality issues.
1. Embedding Drift via Cosine Similarity
When your model's embeddings start looking different from the baseline, something's wrong. This could indicate:
- The model learned differently during training.
- The production data shifted (different language, different user behavior).
- Numerical instabilities in the new model version.
Measure this by computing cosine similarity between a sample of embeddings from the canary and baseline models on the same inputs:
# In your model inference monitoring sidecar
import numpy as np
from prometheus_client import Gauge

embedding_drift_gauge = Gauge(
    "model_embedding_drift_cosine_similarity",
    "Mean cosine similarity between baseline and canary embeddings",
)

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def compute_embedding_drift(baseline_embeddings, canary_embeddings):
    """
    Mean cosine similarity between baseline and canary embeddings.
    Values close to 1.0 = similar. Values < 0.9 = drift detected.
    """
    similarities = [
        cosine_similarity(b, c)
        for b, c in zip(baseline_embeddings, canary_embeddings)
    ]
    mean_similarity = float(np.mean(similarities))
    # Export to Prometheus so the analysis gate can query it
    embedding_drift_gauge.set(mean_similarity)
    return mean_similarity

Your Prometheus query then checks if this stays above 0.92:
model_embedding_drift_cosine_similarity{model="llm-inference"} > 0.92
If the new model's embeddings diverge, the gate fails and the rollout pauses. This catches cases where the model's internal representations changed significantly, which often precedes quality degradation.
2. Shadow Logging and Ground Truth Validation
Deploy the canary model in shadow mode first: run it on all production requests but don't use its output. Instead, log predictions from both the stable and canary models. Later, when ground truth labels arrive (user feedback, manual review), compare accuracy offline.
# In your inference service deployment
containers:
- name: model-server
env:
- name: ENABLE_SHADOW_LOGGING
value: "true"
volumeMounts:
- name: shadow-logs
mountPath: /logs/shadow

When ground truth arrives (a user clicks on a recommendation, or a QA team labels data), compute accuracy:
def compute_accuracy_from_shadow_logs(shadow_logs_path, ground_truth_labels):
    """
    Load shadow logs (canary predictions), match them against ground
    truth labels, and compute accuracy.
    """
    # load_shadow_logs is your own helper; assume it yields (pred_id, pred_value) pairs
    canary_predictions = load_shadow_logs(shadow_logs_path)
    correct = 0
    matched = 0
    for pred_id, pred_value in canary_predictions:
        if pred_id not in ground_truth_labels:
            continue  # label hasn't arrived yet; skip for now
        matched += 1
        if ground_truth_labels[pred_id] == pred_value:
            correct += 1
    accuracy = correct / matched if matched else 0.0
    # Export to Prometheus
    model_accuracy_gauge.labels(version="v2").set(accuracy)
    return accuracy

This gives you ground truth accuracy, not just test-set accuracy. It's more trustworthy but slower (labels arrive with a delay).
3. Business Metric Monitoring
Finally, the most important metric: is the model actually helping your business? Monitor:
- Conversion rate: % of users who complete an action (purchase, signup, etc.).
- Revenue per user: Average money per user session.
- Engagement: Time spent, clicks per session, return rate.
- Support tickets: Did this model cause more complaints?
These are less sensitive to short-term noise, so you check them at the 15-minute mark (not 1 minute). If the canary model's conversion rate is 2% lower than the baseline, that's a dealbreaker, even if the accuracy looks fine.
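The conversion-rate gate from the analysis template can be mirrored offline when you're tuning thresholds. A minimal sketch - the counts and the 0.98 floor are illustrative, matching the 2%-degradation dealbreaker above:

```python
def business_gate(canary_conversions, canary_sessions,
                  stable_conversions, stable_sessions,
                  min_ratio=0.98):
    """Fail the gate if the canary's conversion rate falls below
    min_ratio (e.g., 0.98) of the stable model's conversion rate."""
    canary_rate = canary_conversions / canary_sessions
    stable_rate = stable_conversions / stable_sessions
    return (canary_rate / stable_rate) >= min_ratio
```

Replaying historical canary windows through a function like this is a cheap way to sanity-check a threshold before encoding it in PromQL.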
Putting It All Together: The Production Workflow
Here's what a real deployment looks like, from commit to full production:
- You commit a new model to models/llm-inference:v2.1.0 in Git.
- CI/CD builds the image and runs unit tests. It's good to go.
- You apply the Rollout manifest with kubectl apply -f rollout.yaml.
- Argo detects the change and starts the canary progression.
- T=0m: 1% traffic, running error-rate and latency checks
- T=5m: Checks pass, shift to 5% traffic
- T=15m: Accuracy and embedding drift checks pass, shift to 20% traffic
- T=30m: Business metrics are stable, shift to 50% traffic
- T=50m: Manual approval required (your team reviews the numbers)
- T=55m: Approved, shift to 100% traffic
- T=60m: Promotion complete, old pods terminated
- You're done. The new model is live. Total time: ~1 hour, with multiple gates catching problems.
If any gate fails:
- Argo rolls back automatically in <2 minutes.
- Your team is alerted with full context (which gate, what metric, what value).
- You investigate (Is the model bad? Is the data different? Is there a bug in the inference code?).
- You fix and retry, or revert to the previous version.
This workflow replaces the terror of deployments with structured risk management. Instead of hoping nothing breaks, you actively probe for breakage and stop it if you find it.
Advanced Patterns: When Canary Deployments Get Real
The Real World: Where Deployments Get Messy
The patterns we've shown so far assume a relatively clean scenario: one model, straightforward metrics, static thresholds. Production is messier. You might have multiple models in the same system, each with different performance characteristics. You might have traffic patterns that vary by time of day or user segment. You might have cost trade-offs that make sense for some users but not others. Real canary deployments handle this complexity.
The trick is not to make your canary deployment system so complex that nobody understands it. Instead, you build layers. Start with the simple case: one model, basic gates. Once that's stable, add complexity incrementally. Add model-specific gates. Add business metric gates. Add feature flag integration. Each addition should be motivated by a real problem you've encountered.
This approach keeps your system maintainable. A canary deployment that nobody understands is worse than no canary deployment at all. Someone will skip the gates. Someone will force-deploy without approval. Someone will lose confidence in the system because it seems arbitrary. The best canary deployments are the ones that earn trust through consistent, understandable behavior.
Handling Model-Specific Latency
LLM token generation latency varies wildly. A model might generate 100 tokens (2 seconds) or 1000 tokens (20 seconds), depending on the prompt. Your P99 latency gate needs to account for this:
- name: latency-p99-check
provider:
prometheus:
query: |
histogram_quantile(0.99,
rate(model_inference_duration_seconds_bucket{
model="llm-inference",
tokens_generated_bucket="200-500"
}[5m])
) < 2.0Bucket by token count, not just raw latency. A 20-second response for 1000 tokens is fast; 20 seconds for 100 tokens is a disaster.
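One way to implement that bucketing idea on the application side is to normalize latency by output length before exporting the metric. A sketch, with the 0.02 seconds-per-token budget as an assumed threshold:

```python
def latency_gate(latency_s, tokens_generated, max_s_per_token=0.02):
    """Judge latency per generated token rather than raw latency:
    20 s for 1000 tokens is 0.02 s/token (fine), while 20 s for
    100 tokens is 0.2 s/token (a disaster)."""
    return (latency_s / tokens_generated) <= max_s_per_token
```

Exporting seconds-per-token as its own metric also gives you one threshold that holds across short and long prompts.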
Multi-Model Canaries
What if you're deploying multiple models in the same rollout? A retrieval model and a ranking model, for example?
canaryService: recommendation-canary
stableService: recommendation-stable
trafficWeight:
canary: 0
stable: 100
analysis:
templates:
- name: retrieval-model-check
- name: ranking-model-check
- name: end-to-end-latency-check

Each model has its own gates, but they all must pass for the rollout to proceed.
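The all-gates-must-pass rule is easy to mirror in your own tooling. The gate names here match the templates above, but the function is a hypothetical helper for dashboards or scripts, not Argo code:

```python
GATES = [
    "retrieval-model-check",
    "ranking-model-check",
    "end-to-end-latency-check",
]

def rollout_may_proceed(results: dict) -> bool:
    """Every listed gate must report success; a missing or failed
    gate blocks the combined rollout."""
    return all(results.get(gate) == "Successful" for gate in GATES)
```

Treating a missing result as a failure keeps the combined rollout conservative: no gate, no promotion.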
Rollout with Feature Flags
Sometimes you want to deploy the model code but gate the model swap with a feature flag:
# Step 1: Deploy canary code (but flag directs 0% traffic to it)
- setWeight: 0
pause:
duration: 5m
# Step 2: Flip feature flag to 1%
- setWeight: 1
pause:
duration: 5m
# Step 3: Analyze, then gradually increase flag value

This decouples code deployment from model activation, giving you extra control.
Monitoring the Rollout: Dashboards and Alerts
You need visibility into three things:
- Rollout progress: Is it moving through stages? Is it stuck? Grafana dashboard showing weight over time.
- Analysis gate status: Which checks passed? Which failed? What were the metric values?
- Alert fatigue: If you're checking every 1 minute, you'll get 100 alerts in a bad rollout. Use alert aggregation (PagerDuty) to group them.
A good dashboard shows:
- Canary weight over time (should follow the step curve)
- Stable vs. canary error rates (two lines on the same graph)
- Stable vs. canary accuracy (should stay close)
- Gate status (green, yellow, red for each check)
- Rollback button (big red button that immediately reverts)
The dashboard isn't just for monitoring - it's for documentation. When your team looks at a rollout weeks later, the dashboard tells the story of what happened and why.
The Psychology of Dashboards
Here's something that's rarely discussed but incredibly important: dashboards shape how people behave. A dashboard that shows one number - "is the rollout progressing?" - is not as good as a dashboard that shows the story. Why is the rollout paused? What metrics are being checked? How long until the next stage?
A good rollout dashboard answers these questions at a glance. Someone on your team should be able to look at it for 10 seconds and understand the current state, the recent history, and what's expected to happen next. This transparency builds trust. People stop asking "why is it still in canary?" and start understanding "it's still in canary because we're waiting for accuracy checks to pass, and they've only been stable for 3 minutes of a 5-minute requirement."
Transparency also helps with post-mortems. When a rollout fails - and eventually one will - you want to understand not just that it failed, but why. A good dashboard with historical data lets you see exactly what happened. You can see when the metric started degrading, at what traffic level, and what happened next. This turns a mysterious failure into a teachable moment.
Alert Aggregation and Signal-to-Noise Ratio
One of the biggest mistakes teams make with canary deployments is creating too many alerts. If your canary gates run every minute, and you have 10 gates, that's 10 alerts per minute in the worst case. Over a 1-hour deployment, that's 600 alerts. Nobody reads 600 alerts. They turn off notifications.
The solution is alert aggregation. Instead of alerting on every gate failure, aggregate alerts by severity and type. Group all "accuracy drift" alerts. Group all "embedding drift" alerts. Alert your on-call engineer once with a summary: "Accuracy gate has failed 3 times in the last 5 minutes. Rollout is paused." This gives them the signal without the noise.
Better yet, automate the response. If a gate fails and the rollout is paused, don't alert a human immediately. Wait 2 minutes. If it recovers, no alert. If it's still paused after 2 minutes, then alert. This catches real problems while filtering out transient blips.
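That wait-then-alert rule is a few lines of logic in whatever notifier you run. A hedged sketch, with the 2-minute grace period as a parameter and timestamps as plain epoch seconds:

```python
def debounce_alert(paused_since, recovered, now, grace_seconds=120):
    """Page a human only if the rollout is still paused `grace_seconds`
    after it first paused. A transient blip that recovers within the
    grace window never generates an alert."""
    if recovered or paused_since is None:
        return False
    return (now - paused_since) >= grace_seconds
```

The same pattern generalizes: key the grace period by gate type, so noisy gates get longer windows than critical ones.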
Key Takeaways
-
Canary deployments are risk management, not just deployment strategy: They turn binary choices into progressive, measured rollouts with automated gates.
-
Quality gates must be model-specific: Error rate and latency aren't enough. Measure accuracy, embedding drift, and business metrics.
-
Automation is essential: Without it, gates become bottlenecks. Your gates must run automatically at each stage.
-
Rollback must be fast: Sub-2-minute rollbacks let you recover from bad deployments before they affect too many users.
-
Manual approvals at scale gates: Use them when jumping from 20% to 50% traffic. For smaller increments, let automation decide.
-
Shadow logging enables ground truth validation: Don't just check test metrics. Validate against real user feedback.
-
Argo Rollouts integrates with Prometheus: Prometheus queries are your gates. Build queries that check what matters to your business.
Canary deployments transformed how we think about ML model releases. What was once terrifying (deploying new models to production) becomes routine. What was once opaque (did the deployment succeed?) becomes transparent. Your metrics tell the story, and Argo makes the right decision automatically.
The next time you're about to push a new model to production, remember: you don't have to bet the company. You can start with 1%, watch the metrics, and let the system decide when it's safe to go further.