February 24, 2026
AI/ML Infrastructure Operations

AI Infrastructure Runbooks: On-Call Guide for ML Engineers

You just got paged at 2 AM. Your production ML pipeline is degrading. Is it a GPU issue? A data problem? An infra meltdown? You have minutes to figure it out.

Being on-call for ML infrastructure is different from traditional software engineering. You're not just handling deployment failures or database crashes - you're debugging models, GPU memory exhaustion, metric drift, and the weird intersection where ML and infrastructure collide. This guide gives you the actual diagnostic commands, decision trees, and rollback procedures to own your incidents.

The challenge with ML on-call is that problems have two layers. Sometimes the GPU runs out of memory (infrastructure problem). Sometimes the model's accuracy drops 5% (ML problem). Sometimes you can't tell which without digging. Traditional SREs have it easier - they can usually isolate problems to software stacks or hardware. ML engineers have to understand both, plus the model itself. This creates higher cognitive load and longer debugging times if you're unprepared.

Table of Contents
  1. The ML On-Call Landscape
  2. The On-Call Mindset
  3. Incident Decision Tree
  4. Runbook 1: GPU Out-of-Memory (OOM) Crash
  5. Quick Diagnosis
  6. Deep Dive: Python Memory Profiling
  7. Common Causes & Fixes
  8. Resolution & Rollback
  9. Runbook 2: Inference Latency Spike
  10. Quick Diagnosis
  11. Diagnosis Path
  12. Mitigation
  13. Resolution & Rollback
  14. Runbook 3: Model Quality Degradation
  15. Quick Diagnosis
  16. Diagnosis Decision Tree
  17. Code Regression Fix
  18. Data Drift Investigation
  19. Model Corruption Check
  20. Resolution & Rollback
  21. Runbook 4: Kubernetes GPU Scheduling Failure
  22. Quick Diagnosis
  23. Diagnosis Path
  24. Quick Fixes
  25. Resolution & Verification
  26. Runbook 5: Training Checkpoint Corruption / Recovery Failure
  27. Quick Diagnosis
  28. Diagnosis Decision Tree
  29. Recovery Options
  30. Resolution & Verification
  31. Why These Five Runbooks Matter
  32. The Culture of On-Call Excellence
  33. Building Your On-Call Practice
  34. Scaling Your Runbook Practice as You Grow
  35. The Measurement That Matters
  36. Training and Proficiency: Building Your On-Call Team
  37. The Human Side of Being On-Call
  38. Automation and Prevention: The Long-Term View
  39. Final Thoughts on Incident Response Excellence

The ML On-Call Landscape

Let's clarify what you're responsible for. On-call ML engineers handle two main incident categories:

Infrastructure Incidents hit your systems hard and fast. GPU memory exhaustion crashes your inference server. Kubernetes can't schedule your training pods because you've hit quota. A checkpoint file corrupts and your training loop won't restart. These are typically binary - service is up or it's down - and they require immediate action.

ML Quality Incidents are sneakier. Your inference latency p99 creeps up from 50ms to 200ms over an hour. Your model accuracy drifts 3% below baseline. Your online metrics disagree with your offline validation set. These incidents require detective work: is it a code change? A data distribution shift? An infra change nobody told you about?

Here's a quick severity classification framework you'll use:

| Severity | Impact | SLA | Example |
|---|---|---|---|
| P1 - Critical | Users can't access service; revenue impact | 15 min | All inference nodes down; training won't start |
| P2 - High | Degraded service; significant impact | 1 hour | 50% latency increase; model quality down 5% |
| P3 - Medium | Noticeable issues; workarounds exist | 4 hours | Single GPU node failed; moderate latency spike |
| P4 - Low | Minor issues; no user impact | 24 hours | Verbose logs; alerting delay |

When you get paged, immediately assess severity. If you're unsure whether something is P1 or P2, treat it as P1 and escalate. Your on-call runbook is your lifeline - use it.

The On-Call Mindset

Good on-call practice starts with the right mental model. Your job isn't to fix everything perfectly - it's to stabilize the system and create space for proper investigation. When you're paged at 2 AM, the priorities are: (1) stop the bleeding, (2) stabilize, (3) document what happened. Root cause analysis and permanent fixes can wait for daylight hours.

This is why runbooks exist. They encode the best practices discovered in past incidents so you don't have to rediscover them under time pressure. A runbook isn't a step-by-step recipe (those rarely work because every incident is slightly different). Instead, it's a decision framework: given symptom X, check Y, and if true then do Z. If not, try the next path.

The runbooks in this guide are battle-tested. They've been refined through dozens of actual incidents at production scale. Some lessons are painful (we learned the hard way that batch sizing needs to account for different GPUs). Others are obvious in retrospect but nonobvious when you're debugging at 2 AM. The key is trusting the framework enough to follow it methodically rather than jumping to conclusions.

Incident Decision Tree

Before jumping into diagnostics, ask three questions:

  1. Is the service responding? If no → P1, start with infrastructure runbooks.
  2. Are metrics degraded but service responsive? If yes → P2-P3, start with latency/quality runbooks.
  3. When did this start? Correlate with deployments, data changes, or infra events.

Now let's get specific. Here are five production-tested runbooks you'll use over and over.


Runbook 1: GPU Out-of-Memory (OOM) Crash

Severity: P1 | Duration: 5–15 minutes | Root Causes: Batch size too large, gradient accumulation misconfigured, model parameter growth, memory leak

Quick Diagnosis

Start with GPU state:

bash
nvidia-smi

Expected output (healthy):

+-----------------------------------------------------------------------------+
| GPU  Name    Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|===+==============================================================+==|
|   0  A100-SXM4-40GB  Off  | 00:1F.0     Off |             Off |
| 31%   42C    P0    45W / 250W |  38821MiB / 40960MiB |     95%      Default |
|   1  A100-SXM4-40GB  Off  | 00:20.0     Off |             Off |
| 25%   38C    P0    42W / 250W |  39156MiB / 40960MiB |     92%      Default |
+-----------------------------------------------------------------------------+

If you see Memory-Usage near the cap (38GB+ on 40GB GPU), you're close to OOM. If the process crashed, nvidia-smi shows no GPU memory used - the kernel killed it.

Deep Dive: Python Memory Profiling

SSH into the affected pod and profile the running process:

python
import torch
 
# If service is still running:
print(torch.cuda.memory_summary(device=0, abbreviated=False))

Output breakdown:

GPU 0 memory summary (device_type=cuda, device_index=0):
Reserved: 39,500 MB (allocated 38,000 MB, cached 1,500 MB)
  Large blocks: 38,000 MB
  Medium blocks: 1,200 MB
  Small blocks: 300 MB
  Inactive: 150 MB

This tells you: Allocated is what your model and data actually use. Reserved is PyTorch's memory pool (it pre-allocates more than needed). Inactive is memory you might reclaim with torch.cuda.empty_cache().

Common Causes & Fixes

Cause 1: Batch size too large. Your batch size of 256 fits on your dev machine (80GB GPU) but not production (40GB GPU). Fix:

python
# Before (crashes on smaller GPU)
batch_size = 256
 
# After (dynamic batch sizing)
def get_batch_size():
    """Adjust batch size based on GPU memory."""
    device_memory_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if device_memory_gb >= 80:
        return 256
    elif device_memory_gb >= 40:
        return 128
    else:
        return 64
 
batch_size = get_batch_size()

Cause 2: Gradient accumulation misconfigured. You meant to accumulate 4 steps but are actually accumulating 16, which multiplies your effective batch size by 4. Check your config and verify it's being used correctly in your training loop.
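A cheap guard is to compute the effective batch size at startup and fail fast, instead of OOMing mid-run. A minimal sketch - the config keys and the 512 limit are illustrative assumptions, not values from this article:

```python
def effective_batch_size(per_step_batch: int, accumulation_steps: int) -> int:
    """Batch size the optimizer effectively sees after accumulation."""
    return per_step_batch * accumulation_steps

# Hypothetical config; in practice, read these from your training config file.
config = {"batch_size": 64, "accumulation_steps": 4}
MAX_EFFECTIVE_BATCH = 512  # assumed safe limit for your GPU fleet

eff = effective_batch_size(config["batch_size"], config["accumulation_steps"])
if eff > MAX_EFFECTIVE_BATCH:
    raise ValueError(
        f"Effective batch size {eff} exceeds limit {MAX_EFFECTIVE_BATCH}; "
        "check accumulation_steps in the config."
    )
```

Run this check at job startup so a misconfigured accumulation setting kills the job in seconds, not hours in.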

Cause 3: Memory leak (tensors stuck in VRAM). A reference to a batch or its loss isn't released, so the autograd graph stays on the GPU. Look for:

python
# Antipattern: holding GPU tensors across iterations
losses = []
for batch in dataloader:
    loss = model(batch).loss
    loss.backward()
    losses.append(loss)  # ❌ keeps loss AND its whole autograd graph in VRAM!
 
# Better:
losses = []
for batch in dataloader:
    loss = model(batch).loss
    loss.backward()
    losses.append(loss.item())  # detach to a plain Python float
torch.cuda.empty_cache()  # reclaim cached blocks once, after the loop

Resolution & Rollback

  1. Immediate: Reduce batch size 50% and restart the service.

    bash
    kubectl set env deployment/inference-api BATCH_SIZE=64
    kubectl rollout restart deployment/inference-api
  2. Verify: Wait 2 minutes, check metrics return to baseline.

    bash
    kubectl logs -f deployment/inference-api --tail=50
  3. Rollback (if latency gets worse): Revert batch size.

    bash
    kubectl rollout undo deployment/inference-api
  4. Fix: Update your default config and test on a smaller GPU locally before deploying.


Runbook 2: Inference Latency Spike

Severity: P2 | Duration: 10–30 minutes | Root Causes: Request flood, GPU warm-up, KV cache eviction, model loading, inefficient batching

Quick Diagnosis

Check your monitoring dashboard (assuming you have p50/p99 latency metrics):

promql
# p99 latency via Prometheus (adjust metric names to your setup)
histogram_quantile(0.99, rate(inference_latency_seconds_bucket[5m]))

Expected: p99 latency approximately 50–100ms for inference. If it jumps to 200–500ms, investigate.

Next: Check GPU utilization:

bash
nvidia-smi dmon -s pucm -c 10
# p = power/temp, u = utilization, c = clocks, m = memory

Output:

    power clock mem util
    90W  1800M 85%  100%  <- GPU pinned, running hot
    45W  1000M 30%  15%   <- GPU underutilized, something else is bottleneck

Then: Check queue depth and request rate:

bash
# If you're using Kubernetes
kubectl logs deployment/inference-api --tail=100 | grep "queue_depth\|requests_pending"
 
# Look for lines like:
# queue_depth=45 requests_pending=120  <- Bad! Requests backing up
# queue_depth=2 requests_pending=3     <- Good! Flowing smoothly

Diagnosis Path

If GPU utilization is high (80%+) and latency is high, the model is being asked to process more data than it can handle. Check request rate to see if you have a flood.

If GPU utilization is low (<30%) and latency is high, the GPU isn't the bottleneck. Check model loading time, KV cache eviction, or synchronous preprocessing that's blocking the main inference thread.
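The two branches above can be encoded as a small triage helper you call with the numbers from your dashboard. The thresholds are the rough ones used in this section, not universal constants:

```python
def triage_latency(gpu_util_pct: float, p99_ms: float,
                   baseline_p99_ms: float = 100.0) -> str:
    """Map GPU utilization + p99 latency to the likely bottleneck class."""
    if p99_ms <= baseline_p99_ms:
        return "healthy"
    if gpu_util_pct >= 80:
        return "gpu-saturated"  # check request rate; scale replicas
    if gpu_util_pct < 30:
        return "gpu-idle"       # check model loading, cache eviction, preprocessing
    return "ambiguous"          # profile the request path end to end

print(triage_latency(95, 400))  # gpu-saturated
print(triage_latency(15, 400))  # gpu-idle
```

Encoding the decision tree like this also makes it easy to unit-test your runbook logic and wire it into an alert annotation.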

Mitigation

If it's a traffic flood, scale up by adding more inference replicas:

bash
kubectl scale deployment inference-api --replicas=10

If it's warm-up after idle, implement proactive warm-up by sending dummy requests through the full pipeline on startup before accepting real traffic.
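One way to implement proactive warm-up is to loop dummy requests until a request finishes under your readiness threshold, and only then mark the pod ready. A sketch assuming a `predict_fn` entry point and a representative `dummy_input` - both are placeholders for your real inference call:

```python
import time

def warm_up(predict_fn, dummy_input, target_latency_s=0.2, max_requests=20):
    """Run dummy requests through the full pipeline; return how many it took
    before a single request finished under target_latency_s."""
    for i in range(1, max_requests + 1):
        start = time.perf_counter()
        predict_fn(dummy_input)  # exercises model load, caches, preprocessing
        if time.perf_counter() - start < target_latency_s:
            return i
    return max_requests  # still slow after max_requests; flag in readiness probe

# Stand-in for a real inference call:
def fake_predict(payload):
    return {"ok": True}

n = warm_up(fake_predict, {"text": "ping"})
```

Wire the return value into your Kubernetes readiness probe so traffic only arrives after warm-up completes.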

Resolution & Rollback

  1. Scale up (if traffic flood).
  2. Monitor for 5 minutes: Verify p99 latency returns to <100ms.
  3. Rollback if worse: kubectl rollout undo deployment/inference-api.
  4. Follow-up: If this is a sustained traffic increase, you need permanent scaling.

Runbook 3: Model Quality Degradation

Severity: P2–P3 | Duration: 30–60 minutes | Root Causes: Data drift, code regression, infra change, model corruption

Quick Diagnosis

Check your metrics dashboard (assuming you log online metrics):

sql
-- Your online monitoring query (adjust to your platform)
-- Look for: accuracy, precision, recall, F1, custom metrics
SELECT metric_value, timestamp
FROM ml_metrics
WHERE metric_name = 'model_accuracy'
  AND timestamp > now() - interval '1 hour'
ORDER BY timestamp DESC;
 
-- If accuracy was 92% yesterday and is 89% today → investigate

Compare online vs. offline metrics:

python
import pandas as pd
 
# Online metrics (what production sees)
online_metrics = pd.read_csv("s3://metrics/online_latest.csv")
print(online_metrics.describe())
 
# Offline validation set (what you tested with)
offline_metrics = pd.read_csv("s3://validation/baseline.csv")
print(offline_metrics.describe())
 
# Calculate delta
delta = online_metrics['accuracy'].mean() - offline_metrics['accuracy'].mean()
print(f"Online vs Offline gap: {delta:.2%}")  # Should be <1%

If the gap is large (>2%), you have a distribution mismatch.

Diagnosis Decision Tree

Did a code change deploy in the last 4 hours? Check deployment history with kubectl rollout history. If yes, jump to code regression.

Did the training data change? Check data freshness with aws s3 ls. If data was updated <4 hours ago, investigate data drift.

Did anything change in infrastructure? Check cluster events with kubectl get events. If GPU nodes restarted or model was reloaded, check model corruption.

Code Regression Fix

bash
# What changed?
git log --oneline -n 20 --all
 
# Rollback immediately
kubectl rollout undo deployment/inference-api
 
# Verify metrics recover (should take <5 min)
# Then debug the offending commit

Data Drift Investigation

python
import matplotlib.pyplot as plt
 
# Load recent data vs. baseline
recent_data = pd.read_csv("s3://training-data/latest/sample.csv")
baseline_data = pd.read_csv("s3://training-data/2026-02-01/sample.csv")
 
# Visualize distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
 
# Feature 1 distribution
axes[0].hist(baseline_data['feature_1'], alpha=0.5, label='baseline')
axes[0].hist(recent_data['feature_1'], alpha=0.5, label='recent')
axes[0].legend()
axes[0].set_title("Feature 1 Distribution Shift")
 
# Feature 2 distribution
axes[1].hist(baseline_data['feature_2'], alpha=0.5, label='baseline')
axes[1].hist(recent_data['feature_2'], alpha=0.5, label='recent')
axes[1].legend()
axes[1].set_title("Feature 2 Distribution Shift")
 
# Target distribution
axes[2].hist(baseline_data['target'], alpha=0.5, label='baseline')
axes[2].hist(recent_data['target'], alpha=0.5, label='recent')
axes[2].legend()
axes[2].set_title("Target Distribution Shift")
 
plt.tight_layout()
plt.savefig("distribution_shift.png")

If you see significant drift, your options are: Fast path (retrain on recent data; 2–12 hours), Faster path (deploy a domain adapter or re-weighting; minutes to hours), or Slowest path (investigate the root cause and fix it upstream).

Model Corruption Check

bash
# Verify model checksum
sha256sum /models/inference_v2.pt
# Compare to your deployment manifest
 
# Or with S3:
aws s3api head-object --bucket ml-models --key inference_v2.pt \
  --query 'Metadata.sha256'
 
# If checksum doesn't match, re-download
aws s3 cp s3://ml-models/inference_v2.pt /models/ --no-progress

Resolution & Rollback

  1. Rollback first if it's a recent code change.
  2. Monitor for 10 minutes: Verify metrics recover.
  3. If drift-related: Schedule a retraining job.
  4. RCA (Root Cause Analysis): Document what happened and how to prevent it.

Runbook 4: Kubernetes GPU Scheduling Failure

Severity: P1 | Duration: 10–20 minutes | Root Causes: GPU quota exhausted, node affinity mismatch, cluster oversubscribed

Quick Diagnosis

A pod won't schedule. First, check:

bash
# What's the pod state?
kubectl get pods -l app=training -o wide
# STATUS should be "Running", not "Pending"
 
# Why is it pending?
kubectl describe pod training-job-abc123

Look for "Events" section describing scheduling failures.

Diagnosis Path

If the message says "Insufficient nvidia.com/gpu", you've run out of GPUs. Check allocation and which pods are using them.

If the message mentions taints or affinity, your pod's node selector doesn't match available nodes. Check your pod spec for nodeAffinity requirements.

Quick Fixes

Option 1: Scale the cluster (add more GPU nodes).

Option 2: Relax affinity requirements in your job spec (prefer A100 but allow A40 as fallback).

Option 3: Lower resource requests (start smaller, scale if needed).

Resolution & Verification

  1. Scale or adjust affinity (choose one).
  2. Re-apply the pod and verify scheduling completes in <1 min.
  3. Wait for STATUS to change from Pending to Running.

Runbook 5: Training Checkpoint Corruption / Recovery Failure

Severity: P1–P2 | Duration: 15–45 minutes | Root Causes: Corrupt checkpoint file, wrong storage path, insufficient disk space, stale checkpoint metadata

Quick Diagnosis

Training job crashed during restart. Check the logs for patterns like "RuntimeError: Unable to load checkpoint" or "FileNotFoundError: /checkpoints/model_epoch_5.pt".

Check the checkpoint file:

bash
# SSH into the pod or storage
ls -lh /checkpoints/
 
# Output:
# -rw-r--r-- 45G model_epoch_4.pt
# -rw-r--r-- 0 model_epoch_5.pt  <- ⚠️ ZERO BYTES! Corrupted!
 
# If using S3:
aws s3 ls s3://training-checkpoints/ --human-readable --summarize
# Look for files with unexpected sizes (0 bytes usually means corruption)

Diagnosis Decision Tree

Does the checkpoint file exist? If "No such file or directory", wrong path or storage failure. If file exists but 0 bytes, incomplete write or corruption. If file exists with normal size, try to load it.

Can you load the checkpoint?

python
import torch
 
checkpoint_path = "/checkpoints/model_epoch_5.pt"
 
try:
    checkpoint = torch.load(checkpoint_path)
    print("✓ Checkpoint loads successfully")
    print(f"Keys in checkpoint: {checkpoint.keys()}")
except Exception as e:
    print(f"✗ Failed to load checkpoint: {e}")
    # File is corrupted, fall back to previous checkpoint

Recovery Options

Option 1: Rollback to last valid checkpoint. List all checkpoints, pick the most recent valid one, edit your training config, and restart training.
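Picking the most recent valid checkpoint can be scripted: walk checkpoints newest-first, skip zero-byte files, and take the first one that actually loads. A sketch with a pluggable `load_fn` - in practice you'd pass `torch.load`:

```python
import os

def latest_valid_checkpoint(ckpt_dir, load_fn, suffix=".pt"):
    """Newest checkpoint in ckpt_dir that has nonzero size and loads cleanly."""
    candidates = sorted(
        (os.path.join(ckpt_dir, f) for f in os.listdir(ckpt_dir)
         if f.endswith(suffix)),
        key=os.path.getmtime,
        reverse=True,  # newest first
    )
    for path in candidates:
        if os.path.getsize(path) == 0:
            continue  # zero bytes: incomplete write
        try:
            load_fn(path)
            return path
        except Exception:
            continue  # corrupted: fall back to the next-oldest
    return None  # nothing valid: resume from scratch (Option 3)
```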

Option 2: Automatic checkpoint validation on save. Add validation that loads the checkpoint immediately after saving to catch corruption early.
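Validate-on-save pairs naturally with an atomic rename: write to a temp path, load it back, and only then move it into place, so a crashed writer can never leave a half-written file under the real name. A sketch with pluggable `save_fn`/`load_fn` - stand-ins for `torch.save`/`torch.load`:

```python
import os

def save_checkpoint_validated(state, path, save_fn, load_fn):
    """Save, immediately reload to verify, then atomically publish."""
    tmp = path + ".tmp"
    save_fn(state, tmp)
    load_fn(tmp)           # raises if the write was corrupted
    os.replace(tmp, path)  # atomic rename: readers never see a partial file
```

The reload costs a few seconds per checkpoint, which is cheap insurance against discovering corruption only when training needs to restart.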

Option 3: Resume from scratch if all checkpoints are corrupted. Clear corrupted checkpoints and restart training from step 0.

Resolution & Verification

  1. Choose recovery option (usually Option 1).
  2. Update training config and restart.
  3. Monitor training logs for "Resuming from checkpoint" and normal loss progression.
  4. Verify checkpoint is being saved correctly by watching file sizes - they should increase, not stay at 0 bytes.

Why These Five Runbooks Matter

We've chosen these five scenarios because they represent the bulk of actual ML infrastructure incidents. In data from incident tracking across production ML systems, these account for roughly 70-80% of pages. The remaining 20-30% are edge cases or domain-specific issues that deserve investigation but don't warrant runbooks.

GPU OOM is the most common category. It's high-impact (service goes down immediately), usually fixable in minutes (adjust batch size), but also revealing (tells you about memory assumptions). Latency spikes are frequent but more subtle - they happen when systems operate near capacity and small changes tip you over. Quality degradation is heartbreaking (your model is quietly getting worse) but often fixable (rollback code, retrain model, investigate data). Scheduling failures are rare but devastating (jobs can't start, pipeline backs up). Checkpoint corruption is the nightmare scenario (your training restarts from scratch) but preventable with proper validation.

Together, these five runbooks give you a toolkit to handle most incidents confidently. They won't cover everything, but they'll handle the scenarios that happen regularly enough to warrant preparation.

The Culture of On-Call Excellence

Being effective on-call requires more than technical competence. It requires an organizational culture that values reliability and invests in the systems that prevent incidents. In immature organizations, on-call is seen as a burden - a rotation where someone gets paged and heroically fixes things at midnight. In mature organizations, on-call is a measure of system quality. If you're getting paged frequently, that's not a sign you need a better on-call person - it's a sign you need better systems.

The best on-call practices come from organizations that treat incidents as learning opportunities. They run blameless postmortems. They ask what could we have caught earlier instead of who made a mistake. They implement monitoring based on lessons learned. They automate remediation where possible. Over time, incidents become rarer not because you have smarter people on-call, but because the systems get better at preventing problems.

This requires investment. A good on-call infrastructure might spend 20% of your engineering capacity on monitoring, alerting, runbooks, testing, and postmortems. It sounds expensive until you realize that fighting fires through heroic on-call response also requires 20% of your capacity - but it's reactive instead of proactive. You're paying either way. Mature organizations just choose to pay upfront with prevention instead of paying later with chaos.

Building Your On-Call Practice

Runbooks are necessary but not sufficient. The best on-call organizations also invest in prevention through monitoring and alerting that catches problems early, resilience patterns with graceful degradation so failures don't cascade, documentation and runbooks (like these) that encode institutional knowledge, incident response practice through game days to build muscle memory, and postmortem discipline to extract lessons from incidents.

The culture you build around on-call determines its effectiveness. If on-call is seen as punishment - something you endure one week per quarter before going back to normal work - it breeds resentment. Engineers become cynical about runbooks, skip steps to get back to bed, and miss opportunities to improve the system. If on-call is seen as a responsibility that comes with ownership - something you're proud to excel at - engineers take runbooks seriously, follow them methodically, and contribute improvements afterward.

The transition from punishment culture to ownership culture requires investment. You need on-call rotations that are sized so people aren't constantly paged. You need compensation (extra pay, comp time, something) that acknowledges the burden. You need executive support that treats incident response as critical work. And you need to celebrate and learn from incidents rather than blame individuals.

With these foundations in place, on-call becomes less about surviving the night and more about systematically improving your systems. Each incident teaches you something. Each runbook prevents repeats. Over months and years, your systems become more resilient, your incidents become rarer, and your on-call rotations transform from constant firefighting to routine problem-solving.

Scaling Your Runbook Practice as You Grow

When you're a startup with five engineers, one comprehensive runbook might cover your needs. As you grow to 50 engineers, you'll need a runbook repository organized by service. At 200 engineers with multiple on-call teams, you'll need centralized runbook management, regular reviews, and mechanisms to share learnings across teams.

Create a runbook review process. Every 90 days, review your most-used runbooks. Have on-call engineers who executed them recently provide feedback: Was the runbook accurate? Did it help or hurt? What would they change? Incorporate that feedback. A runbook that was perfect six months ago might be outdated now because your infrastructure has evolved.

Share runbooks and lessons across your organization. When one team discovers a new incident pattern and develops a runbook for it, that becomes institutional knowledge. A Slack channel or email list dedicated to incident learnings helps spread knowledge. A quarterly meeting where on-call teams present their key incidents and how they were handled keeps everyone aligned.

The Measurement That Matters

Track metrics that reflect the health of your on-call practice. The most important is Mean Time to Resolution (MTTR) - how long from incident start to incident end. A well-executed runbook should drop your MTTR significantly. If you're using these runbooks and your MTTR is still 2+ hours, that's a signal either that the runbooks need improvement or that the incidents are more complex than you realized.

Also track Incident Frequency (how many pages per week per person). A healthy on-call rotation might see 1-2 pages per week per person. If someone is getting paged 10+ times per week, either your system has serious reliability issues or your alerting is too sensitive and firing too often. Either way, something needs to change.

Finally, track Postmortem Velocity - how quickly you move from incident to action items to implemented fixes. If you're running postmortems but never implementing the fixes, you're missing the whole point. Track how many postmortem action items are completed in 30 days. Aim for 80%+. The remaining 20% are probably longer-term architectural changes that need planning.

Training and Proficiency: Building Your On-Call Team

Runbooks are only useful if your team knows how to use them. Organizations that excel at on-call implement regular training. New engineers get trained on runbooks before their first on-call shift. Experienced engineers stay sharp through monthly reviews of what changed in the infrastructure. Game days simulate incidents to build muscle memory in a safe environment. When the real incident happens, your team executes with confidence rather than discovering gaps during crisis.

Game days are particularly valuable. Set a scenario: your inference service is experiencing latency spikes. Have engineers run through the latency spike runbook. Time them. See how long it takes them to diagnose the issue and execute mitigation. Afterwards, debrief: What went well? What was confusing? What would you change? Incorporate feedback into the runbook. The next game day, run the improved runbook. Over months, your runbooks become increasingly effective and your team becomes increasingly proficient.

Another critical element is maintaining runbooks as your infrastructure evolves. A runbook that was perfect six months ago might be outdated if you've upgraded Kubernetes or changed your GPU provisioning. Stale runbooks are worse than no runbooks because they give false confidence. You follow them, they don't work, and now you're debugging a changed infrastructure while under time pressure.

Establish a runbook maintenance schedule. Quarterly, review your most-used runbooks. Have engineers who've executed them recently provide feedback. Update them based on how the infrastructure has changed. Mark them with a review date so you know when they were last validated.

The Human Side of Being On-Call

Being on-call carries emotional and physical toll that often gets minimized. You're paged at 2 AM. Your sleep is disrupted. Your stress response activates. You have minutes to diagnose and fix a problem. The cognitive load is high. This matters for your physical health, your mental health, and your relationships. Organizations that take on-call seriously address these human factors.

First, on-call rotations should be sized so people aren't constantly paged. A rotation where each person is on-call one week per month and averages a couple of pages that week is manageable; even five pages in an on-call week is survivable. Twenty pages in a week is not - that's burnout territory.

Second, provide compensation. If someone is on-call and gets paged at 2 AM, they should be paid for their time. Some organizations pay a flat on-call stipend regardless of pages. Others pay per page plus a stipend. The exact mechanism matters less than the fact that you're acknowledging the burden and compensating people fairly.

Third, build a culture that makes on-call a point of pride rather than a punishment. Some organizations treat on-call as something you endure until you're "senior enough" to avoid. That's backwards. Senior engineers should want to be on-call because they're best equipped to handle incidents. Celebrate engineers who excel at on-call. Feature their work in all-hands meetings. Give them interesting projects based on their on-call learnings.

Automation and Prevention: The Long-Term View

The best way to reduce on-call burden is to prevent incidents. This means investing in monitoring that catches problems early, resilience patterns that prevent problems from cascading, testing that catches issues in development. It means investing in the infrastructure that makes systems reliable.

Automated remediation deserves particular mention. Some incidents can be fixed automatically. A pod with a memory leak crashes and immediately restarts. A replica set loses a pod due to node failure and automatically spins up a replacement. A circuit breaker detects a failing provider and automatically switches to backup. These self-healing patterns mean incidents happen but recover automatically. You find out about them the next morning when reviewing logs, not through a page at 2 AM.

Implementing self-healing patterns requires care. You need to ensure the automation doesn't cover up real problems. A pod that keeps restarting is self-healing until you ask why it keeps crashing. If the automation hides the issue, you don't notice until something worse happens. The balance is: automate recovery but not prevention of investigation. Alert even if you auto-remediate, so humans know an incident occurred.

Final Thoughts on Incident Response Excellence

Being effective on-call is a skill that improves with practice and deliberate reflection. Every incident is a chance to test your runbooks, identify gaps, and improve your systems. The engineers who excel at on-call aren't special - they're disciplined. They follow runbooks methodically rather than improvising. They document what they did. They show up to postmortems ready to learn.

The organizations that excel at on-call aren't special either. They're mature. They invest in prevention. They accept that incidents will happen and design systems accordingly. They celebrate incident response as a core competency rather than treating it as something to minimize.

The key mindset shift is recognizing that incident response excellence is not separate from operational excellence. They're deeply intertwined. Organizations with reliable systems have fewer incidents. Organizations with good incident response practices learn from incidents and improve systems. The cycle reinforces itself. Over time, an organization that takes incident response seriously becomes increasingly reliable because every incident becomes an opportunity to strengthen systems.

If you're on-call for ML infrastructure and feeling overwhelmed, take a step back. Use the runbooks in this guide as a starting point. Adapt them to your stack. Test them in safe environments before you're paged at 2 AM. Teach your team. Build confidence. Over time, what feels overwhelming becomes routine, and on-call becomes a badge of responsibility rather than a source of dread.

Your production systems depend on good on-call practice. Your users depend on it. Your teammates depend on it. The effort you put into mastering these runbooks and building incident response excellence pays back tenfold through more reliable systems, lower costs, and less 3 AM chaos. The organizations that treat incident response as a core competency don't just have more reliable systems - they're more attractive places to work because people aren't burning out on firefighting.


