Spot Instance Strategies for ML Training
Spot instances are cheap - sometimes 70–90% cheaper than on-demand. But they come with a catch: AWS, Google Cloud, or Azure can yank them away with minimal notice. For machine learning teams, that's either a showstopper or a game-changer. The difference? Knowing how to architect for interruption.
We're going to walk through the strategies that let you train billion-parameter models on a shoestring budget without losing weeks of compute. By the end, you'll have concrete patterns for fault tolerance, cost arbitrage, and elastic scaling that actually work in production.
Table of Contents
- Why Spot Instances Matter for ML
- The Economics of Interruption and Recovery
- 1. Checkpoint Fault Tolerance: Your Safety Net
- Why Checkpointing Fundamentally Changes the Equation
- PyTorch Lightning ModelCheckpoint
- S3 Versioned Saves for Durability
- 2. Interruption Handling: The 2-Minute Warning
- SIGTERM Handler in PyTorch
- 3. Multi-Cloud Arbitrage: Where to Buy
- Current Pricing Reality and Provider Differences (Q1 2026)
- Provider-Specific Deep Dive
- Multi-Cloud Submission Pattern
- 4. Elastic Scaling: Spot + On-Demand Hybrid
- AWS SageMaker Managed Spot
- 5. Distributed Training with Spot: torch.distributed.elastic
- Production Considerations: Making Spot Reliable at Scale
- Cost Predictability
- Compliance and Data Residency
- Monitoring and Alerting
- Checkpoint Management
- Worked Cost Analysis Example
- The Organizational Reality of Spot Training
- Summary
- Scaling Your Spot Strategy
- Why Executives Care About Spot Instances
- Building Institutional Knowledge
- The Psychological Factor
- Preparing for the Future
Why Spot Instances Matter for ML
Let's do the math. Training a ResNet-50 on 8 V100 GPUs costs about $30/hour on AWS on-demand. Spot pricing? $9–12/hour. For a two-week training run, that's the difference between $10,000 and $3,000.
Scale that to teams training dozens of models weekly, and you're talking about six figures in annual savings. Cloud providers know this. They've made spot instances increasingly reliable, especially for ML workloads where checkpointing is built in. The trick is building your infrastructure to expect failure and recover from it gracefully.
But here's the deeper question: why can cloud providers afford to discount spot instances so aggressively? The answer matters for your strategy. Cloud providers have excess capacity they'd rather monetize than leave idle. When demand surges, they reclaim that capacity from spot customers. For you, this means:
- Spot prices are fundamentally tied to supply/demand curves in each region
- Interruption rates spike during business hours in dense regions
- Off-peak and less-popular regions have vastly more stable spot instances
- The discount is essentially payment for taking on capacity risk
This is why spot isn't "for training that doesn't matter." It's for training where you've designed recovery, and you're comfortable trading slightly longer wall-clock time for dramatically lower costs.
The Economics of Interruption and Recovery
Understanding the real-world cost-benefit of spot requires thinking through what happens when interruptions occur. Let's model a realistic scenario:
A training job takes 100 hours on a single instance. With spot instances:
- Expected cost: $9/hour × 100 hours = $900
- Interruption rate: assume 5% per day
But when interrupted, what happens? If you've designed for failure with proper checkpointing:
- You lose the last checkpoint interval (maybe 30 minutes of work)
- You restart on a fresh instance from the last checkpoint
- Total training time increases slightly, but you're still dramatically cheaper
Empirically, teams report that spot instance training costs 25-35% of on-demand costs when you account for interruption overhead. That's still massive savings. But it requires planning.
Without checkpointing? Each interruption costs you the entire training run so far. One interruption at hour 95, and you've just wasted $855. That changes the calculus entirely.
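The break-even intuition above is easy to sanity-check with a few lines of arithmetic. This sketch (illustrative rates, not a billing model) compares the expected cost with checkpointing against the cost of a single uncheckpointed interruption:

```python
def spot_cost_with_checkpoints(run_hours, spot_rate, n_interruptions,
                               ckpt_interval_h, restart_overhead_h=0.1):
    """Expected spot cost when checkpointing is in place.

    Each interruption loses, on average, half a checkpoint interval of
    work, plus a small restart overhead (boot + checkpoint download).
    """
    rework = n_interruptions * (ckpt_interval_h / 2 + restart_overhead_h)
    return (run_hours + rework) * spot_rate


def wasted_without_checkpoints(interruption_hour, spot_rate):
    """With no checkpoints, one interruption throws away everything so far."""
    return interruption_hour * spot_rate


# The $900 run from above: five interruptions with 30-min checkpoints
# add under $20 of rework, while a single hour-95 failure without
# checkpoints wastes $855 outright.
with_ckpt = spot_cost_with_checkpoints(100, 9, 5, 0.5)
wasted = wasted_without_checkpoints(95, 9)
```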
1. Checkpoint Fault Tolerance: Your Safety Net
The foundation of safe spot instance usage is checkpointing. If your instance dies at hour 47 of a 100-hour training run, you need to resume from hour 46 - not hour zero.
Without checkpoints, spot instances are just expensive betting machines. With them, they're a rational cost optimization.
Why Checkpointing Fundamentally Changes the Equation
When you lose an instance without a checkpoint, you lose not just compute time but optimization state. The optimizer has learned a trajectory through loss space; the learning rate schedule is calibrated for a specific point in training. Starting from zero destroys that learning and forces the optimizer to re-discover it. This means:
- A 100-hour training run from scratch takes 100 hours
- A run interrupted at hour 47 and restarted from the hour-46 checkpoint needs ~54 more hours on a fresh instance
- Total: ~101 hours of GPU time spread across multiple instances, with only ~1 hour of re-work
For distributed training, the math gets even better. If you're training with 64 GPUs and lose one instance, you don't restart from epoch zero - the job resumes from the last checkpoint once a replacement node joins, and the overhead is roughly one checkpoint interval. That's negligible against the full run.
PyTorch Lightning ModelCheckpoint
PyTorch Lightning's ModelCheckpoint callback makes this ridiculously easy:
```python
import pytorch_lightning as pl
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(
    dirpath="checkpoints/",
    filename="model-{epoch:02d}-{val_loss:.2f}",
    save_top_k=3,
    monitor="val_loss",
    mode="min",
    save_last=True,  # Always keep the last checkpoint
    every_n_epochs=1,
)

trainer = Trainer(
    callbacks=[checkpoint],
    max_epochs=100,
    enable_checkpointing=True,
)

trainer.fit(model, train_loader, val_loader)
```

Let's break down what's happening here. save_top_k=3 means we keep the three best checkpoints by validation loss. This protects you against the scenario where your latest checkpoint happens to be a poor one and you'd rather resume from an earlier, better state. save_last=True ensures we always have a checkpoint from the most recent epoch, regardless of whether it was the "best" one. monitor="val_loss" tells Lightning to rank checkpoints by validation loss, so as new checkpoints arrive, it deletes the worst-performing ones beyond the top three.
This saves the best 3 checkpoints plus the last epoch. When your spot instance gets terminated, you restart training like this:
```python
trainer = Trainer(callbacks=[checkpoint])
trainer.fit(model, train_loader, val_loader, ckpt_path="last")
```

Lightning automatically resumes from the checkpoint - optimizer state, learning rate schedules, everything.
Why this matters: You're not losing gradient accumulation or learning rate warmup. The model picks up exactly where it left off. If you were using a cosine annealing schedule or warmup, that state is preserved. If you were in the middle of a gradient accumulation cycle, Lightning handles it.
S3 Versioned Saves for Durability
Local checkpoints are fine for short runs, but for multi-day training, push to S3 with versioning enabled. Local disk on spot instances is ephemeral - even if your instance survives interruption, a hardware failure means the checkpoint is gone forever. By syncing to S3, you get:
- Durability: 99.999999999% durability (11 nines) on S3, vs. single-device risk on local disk
- Versioning: If a checkpoint writes partially and corrupts, you can roll back
- Cross-region recovery: If an entire AWS region goes down, your checkpoints are still accessible
- Cost optimization: Versioned checkpoints are billed at normal S3 rates (~$0.023 per GB-month on S3 Standard), negligible compared to the compute savings
Here's a production callback that pushes each checkpoint to S3:
```python
import os

import boto3
import pytorch_lightning as pl

class S3CheckpointCallback(pl.Callback):
    def __init__(self, bucket: str, prefix: str = "checkpoints/"):
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.prefix = prefix

    def on_train_epoch_end(self, trainer, pl_module):
        ckpt_path = trainer.checkpoint_callback.last_model_path
        if not ckpt_path or not os.path.exists(ckpt_path):
            return
        # Upload to S3, keyed by epoch
        s3_key = f"{self.prefix}epoch-{trainer.current_epoch}.pt"
        self.s3.upload_file(ckpt_path, self.bucket, s3_key)
        print(f"Checkpoint uploaded to s3://{self.bucket}/{s3_key}")

trainer = Trainer(
    callbacks=[
        checkpoint,
        S3CheckpointCallback(bucket="ml-training-checkpoints"),
    ]
)
```

Notice we're uploading at on_train_epoch_end, not after every batch. This balances safety (one durable save per epoch) with network cost. For long training runs where epochs take hours, consider uploading every N steps instead.
S3 versioning is critical because if a checkpoint is corrupted mid-write, you can roll back to a previous version. Enable it with:

```python
s3 = boto3.client("s3")
s3.put_bucket_versioning(
    Bucket="ml-training-checkpoints",
    VersioningConfiguration={"Status": "Enabled"},
)
```

2. Interruption Handling: The 2-Minute Warning
AWS Spot instances don't just die - they send a 2-minute termination notice via the EC2 Spot Instance Interruption Notices endpoint. The same applies to GCP Preemptible instances (around 30 seconds) and Azure Spot VMs. This is a gift: you have time to save state before the instance vanishes.
When that notice arrives, your training script has 120 seconds to:
- Save the current checkpoint
- Flush to persistent storage
- Clean up gracefully (close file handles, disconnect databases)
The 2-minute window is tight but workable. A checkpoint save of a multi-billion parameter model typically takes 30–60 seconds. S3 upload takes another 30–90 seconds depending on checkpoint size and network. You have enough time.
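On a raw EC2 instance, the warning is published on the instance metadata service rather than delivered as a signal, so a small watcher has to poll for it. Here's a minimal sketch (IMDSv1-style request for brevity; IMDSv2 additionally requires a session token) that converts the notice into an in-process SIGTERM. The fetch function is injectable so the detection logic can be tested off-EC2:

```python
import os
import signal
import threading
import time
import urllib.error
import urllib.request

# Real AWS metadata path; returns 404 until an interruption is scheduled.
IMDS_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_scheduled(fetch=None):
    """Return True if EC2 has scheduled a stop/terminate for this instance.

    `fetch` returns an HTTP status code (or None); by default it hits IMDS.
    """
    if fetch is None:
        def fetch():
            try:
                with urllib.request.urlopen(IMDS_ACTION_URL, timeout=2) as resp:
                    return resp.status
            except urllib.error.HTTPError as e:
                return e.code          # 404 = no interruption scheduled
            except OSError:
                return None            # not on EC2, or transient network issue
    return fetch() == 200

def start_watcher(poll_seconds=5):
    """Poll IMDS in a daemon thread; raise SIGTERM in-process on notice,
    so one SIGTERM handler covers both EC2 and container platforms."""
    def loop():
        while True:
            if interruption_scheduled():
                os.kill(os.getpid(), signal.SIGTERM)
                return
            time.sleep(poll_seconds)
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```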
SIGTERM Handler in PyTorch
Here's a production-grade pattern built around SIGTERM. On managed platforms (ECS, Batch, Kubernetes with a node termination handler), the 2-minute warning reaches your process as a SIGTERM; on a raw EC2 instance the notice appears in the instance metadata, and you forward it to your own process as a SIGTERM. Either way, handling that signal lets us intercept shutdown and save state:
```python
import os
import signal
import sys
import time

import boto3
from pytorch_lightning import Trainer

class SpotInterruptionHandler:
    def __init__(self, trainer: Trainer, checkpoint_dir: str):
        self.trainer = trainer
        self.checkpoint_dir = checkpoint_dir
        self.interrupted = False
        signal.signal(signal.SIGTERM, self._handle_sigterm)

    def _handle_sigterm(self, signum, frame):
        """Called on the spot interruption warning (2-min window)."""
        print("⚠️ Spot interruption notice received. Saving checkpoint...")
        self.interrupted = True
        # Save immediately
        checkpoint_path = os.path.join(
            self.checkpoint_dir,
            f"interruption-{int(time.time())}.pt",
        )
        self.trainer.save_checkpoint(checkpoint_path)
        # Upload to S3 before we die; upload_file blocks until the
        # transfer completes, so it's safe to exit right after
        self._upload_to_s3(checkpoint_path)
        print(f"Checkpoint saved to {checkpoint_path}")
        sys.exit(0)

    def _upload_to_s3(self, checkpoint_path: str):
        s3 = boto3.client("s3")
        try:
            s3.upload_file(
                checkpoint_path,
                "ml-training-checkpoints",
                f"emergency/{os.path.basename(checkpoint_path)}",
            )
        except Exception as e:
            print(f"S3 upload failed: {e}")

# In your training script:
handler = SpotInterruptionHandler(trainer, "checkpoints/")
trainer.fit(model, train_loader, val_loader)
```

Critical detail: The SIGTERM handler needs to be lightweight. Don't perform heavy GPU operations - just serialize and upload. If you spend 30 seconds computing something, you've wasted half your recovery window.
3. Multi-Cloud Arbitrage: Where to Buy
Not all spot instances cost the same. AWS, GCP, and Azure have different pricing, availability, and reliability patterns. Smart teams play them off each other.
Current Pricing Reality and Provider Differences (Q1 2026)
| Provider | Instance | Spot Price | On-Demand | Savings | Availability | Notes |
|---|---|---|---|---|---|---|
| AWS | V100 | $0.48 | $3.06 | 84% | 97% | Most popular, best tooling |
| AWS | A100 | $1.93 | $12.48 | 85% | 96% | Premium pricing in most regions |
| GCP | A100 | $1.07 | $12.48 | 91% | 94% | Best A100 pricing, slightly lower reliability |
| Azure | V100 | $0.36 | $2.80 | 87% | 95% | Cheapest V100, less common |
For A100 training, GCP Preemptible beats AWS Spot by 44%. For V100, Azure wins. The catch? Availability differs by region and time of day.
But there's more to the story than headline pricing. AWS Spot instances in us-east-1 (the most popular region) have lower availability during US business hours because demand is highest. GCP Preemptible instances are more stable off-peak. Azure's pricing advantage comes with a trade-off: their tooling for ML is less mature than AWS or GCP.
Provider-Specific Deep Dive
AWS Spot Instances:
- Warning mechanism: 2-minute SIGTERM + EC2 metadata endpoint polling
- Interruption rate: 2–5% annual in premium regions, <1% off-peak
- Checkpointing support: Native CloudWatch integration, EC2 lifecycle hooks
- Scaling: Integration with Auto Scaling Groups, seamless replacement
- Gotchas: Spot pricing can spike 10x during demand surges (rare, but documented). If you set a maximum price too low, AWS never allocates capacity.
GCP Preemptible Instances:
- Warning mechanism: 30-second notification (shorter than AWS!)
- Interruption rate: Slightly higher, 3–7% annually, but more predictable curves
- Checkpointing support: Deep integration with TensorFlow/PyTorch frameworks
- Scaling: Via Instance Groups with automatic replacement
- Gotchas: 30 seconds is tight. You need aggressive checkpointing (every 2 minutes). GCP's preemptible quota is shared with on-demand, so you can't easily do 100% preemptible fleets.
Azure Spot VMs:
- Warning mechanism: 30-second eviction notice (varies by region)
- Interruption rate: Highly variable by region, 1–20% in unpopular regions
- Checkpointing support: Basic integration, less mature than AWS/GCP
- Scaling: Via Virtual Machine Scale Sets (VMSS)
- Gotchas: Pricing model is "you set a max price, Azure fills capacity up to that price." If you set max_price=$0.30/hour but current price is $0.40, you get no instance. Some teams undersell themselves.
Practical implications: If you're running an urgent training job at 9 AM on a Tuesday in us-east-1, spot availability might drop to 85%. The same job at 2 AM might see 99% availability. Your cost arbitrage strategy should account for time-of-day pricing and availability curves, not just headline prices. For safe production, favor GCP for consistent preemptibility and AWS for best tooling/scaling. Use Azure for cost-sensitive non-critical workloads.
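GCP's 30-second window in particular pushes you toward time-based rather than epoch-based checkpointing (PyTorch Lightning's ModelCheckpoint accepts a train_time_interval argument for exactly this). Here's a framework-agnostic sketch of the same idea, with an injectable clock so the gating logic can be unit-tested:

```python
import time

class IntervalCheckpointer:
    """Trigger a checkpoint save at most every `interval_s` seconds of
    wall-clock time, wherever your training loop calls maybe_save().

    `save_fn` receives the current step; `clock` is injectable for tests.
    """
    def __init__(self, interval_s, save_fn, clock=time.monotonic):
        self.interval_s = interval_s
        self.save_fn = save_fn
        self.clock = clock
        self.last_save = clock()

    def maybe_save(self, step):
        now = self.clock()
        if now - self.last_save >= self.interval_s:
            self.save_fn(step)
            self.last_save = now
            return True
        return False

# Usage inside a training loop (2-minute cadence for GCP's short window):
# ckpt = IntervalCheckpointer(120, lambda step: save_checkpoint(step))
# for step, batch in enumerate(loader):
#     train_step(batch)
#     ckpt.maybe_save(step)
```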
Multi-Cloud Submission Pattern
Here's how smart teams handle this. Instead of picking one cloud and hoping for capacity, they submit to the cheapest available option first, with fallback logic:
```python
import random

SPOT_OPTIONS = [
    {"provider": "aws", "zone": "us-west-2a", "gpu": "V100", "price": 0.48},
    {"provider": "gcp", "zone": "us-central1", "gpu": "A100", "price": 1.07},
    {"provider": "azure", "zone": "eastus", "gpu": "V100", "price": 0.36},
]

def submit_training_job(model_name: str):
    # Sort by price, with small random jitter to avoid thundering herd
    options = sorted(SPOT_OPTIONS, key=lambda x: x["price"] + random.gauss(0, 0.05))
    for option in options:
        try:
            launch_spot_instance(**option)
            return option
        except InsufficientCapacityError:
            continue
    # Fallback to on-demand if all spots fail
    launch_on_demand_instance()
```

(Here launch_spot_instance, InsufficientCapacityError, and launch_on_demand_instance stand in for your own provider-specific wrappers.)

The random jitter is important. If 100 teams all try to launch spot instances at the exact same price, they create a thundering herd effect. Tiny variations in price preference spread load across instances, improving aggregate success rates.
Real impact: A team running 20 A100 training jobs/month saves $1,000+ just by preferring GCP over AWS for that workload. Over a year, that's $12,000 in saved costs with zero change to model quality.
4. Elastic Scaling: Spot + On-Demand Hybrid
Many cloud platforms now offer managed spot with automatic fallback to on-demand. This gives you the best of both worlds: 85% savings on most runs, but 100% reliability when spot capacity exhausts.
AWS SageMaker Managed Spot
SageMaker abstracts away interruption handling entirely:
```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="246618743249.dkr.ecr.us-west-2.amazonaws.com/pytorch:latest",
    role="arn:aws:iam::123456789:role/SageMakerRole",
    instance_count=8,
    instance_type="ml.p3.8xlarge",  # V100s
    use_spot_instances=True,  # Enable spot
    max_run=86400,   # Max training time, in seconds
    max_wait=90000,  # Total budget (waiting for capacity + running); must be >= max_run
)

estimator.fit(training_data)
```

When you set use_spot_instances=True, SageMaker handles:
- Interruption detection and graceful shutdown
- Automatic checkpoint save (it hooks into your training code)
- Rescheduling to a fresh instance with checkpoint resume
- Waiting up to the max_wait budget for spot capacity (if capacity never arrives in time, the job stops, and you can resubmit it on-demand yourself)
Cost: Managed spot training is billed at the spot rate, and SageMaker reports the percentage saved per job. If you run your own hybrid and spend 80% of the time on spot and 20% on on-demand, your average hourly rate is 0.8 × spot_price + 0.2 × on_demand_price.
Why this matters: You've outsourced interruption detection, checkpoint orchestration, and rescheduling to AWS. Your job keeps making progress without custom recovery code. That's worth the small SageMaker management overhead.
5. Distributed Training with Spot: torch.distributed.elastic
When training across multiple GPU instances (8-way or 64-way), one spot interruption can cascade. Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP) need special handling.
PyTorch's Elastic Distributed Training automatically detects node failures and rebalances. This is the key to safe large-scale spot training:
```python
# elastic_launch.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # MyModel, train_one_epoch, and save_checkpoint are your own code
    model = MyModel().to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    for epoch in range(100):
        train_one_epoch(ddp_model)
        if dist.get_rank() == 0:
            save_checkpoint(ddp_model, epoch)

if __name__ == "__main__":
    train()
```

Launch with torchrun:

```bash
torchrun \
    --nproc_per_node=8 \
    --nnodes=1:2 \
    --max-restarts=3 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_IP:29500 \
    elastic_launch.py
```

The --nnodes=1:2 range tells torchrun the job may run elastically with between one and two nodes. If one node dies mid-training:

- torchrun detects the failure (the worker process exits)
- Remaining nodes pause and re-rendezvous
- New nodes join the training loop
- Workers restart and training resumes from the latest checkpoint
Key metric: Recovery overhead is typically under 2 minutes per interruption, plus whatever work happened since the last checkpoint. For a 72-hour training run with a handful of interruptions, that's on the order of 1% slowdown. Compare this to the 80%+ cost savings from using spot instances, and the math is compelling.
Production Considerations: Making Spot Reliable at Scale
Real-world spot instance deployments are more complex than the examples above. Here are the production realities:
Cost Predictability
Spot prices fluctuate. For budget planning, you can't assume $1.93/hour for A100s if demand spikes. Strategy: Reserve 10–20% of your budget for on-demand fallback. Track actual spot vs. on-demand costs by job type and time of day.
Compliance and Data Residency
Some workloads require data to stay in a specific region. Spot availability varies by region. If your primary region (us-east-1) has poor spot availability, you might be forced into on-demand for compliance reasons. Strategy: Negotiate with compliance teams around availability curves. "Spot in secondary regions with <1% latency impact" might be acceptable.
Monitoring and Alerting
You need visibility into:
- Spot vs. on-demand breakdown (which instances are running where)
- Interruption rates by region/instance type/time of day
- Checkpoint save/resume success rates
- Total wall-clock time vs. GPU time (overhead from rebalancing)
Checkpoint Management
After 100 training jobs, you have thousands of checkpoints. Storage costs add up. Strategy: Implement a checkpoint retention policy. Keep the top 5 by validation metric. Delete after 90 days.
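The retention policy above is easy to automate. A hedged sketch - the checkpoint-record shape and the "lower val_loss is better" convention are assumptions, and the function only plans deletions so you can dry-run it before wiring it to your storage backend:

```python
def prune_plan(checkpoints, keep_top=5, max_age_days=90):
    """Return the checkpoint keys that are safe to delete.

    `checkpoints` is a list of dicts with "key", "val_loss" (lower is
    better), and "age_days". The top-k by metric are always kept;
    everything else is deleted once it exceeds max_age_days.
    """
    ranked = sorted(checkpoints, key=lambda c: c["val_loss"])
    keep = {c["key"] for c in ranked[:keep_top]}
    return [
        c["key"]
        for c in checkpoints
        if c["key"] not in keep and c["age_days"] > max_age_days
    ]

# Example: the best checkpoint survives regardless of age; an old,
# mediocre one is scheduled for deletion; a recent one is left alone.
plan = prune_plan(
    [
        {"key": "ckpt-a", "val_loss": 0.10, "age_days": 120},
        {"key": "ckpt-b", "val_loss": 0.90, "age_days": 120},
        {"key": "ckpt-c", "val_loss": 0.50, "age_days": 10},
    ],
    keep_top=1,
)
```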
Worked Cost Analysis Example
Let's model a realistic scenario: Training LLaMA-13B on 8 A100 GPUs.
On-Demand Baseline (8× A100, using the per-GPU on-demand rate from the table above):
- Instance cost: $12.48/hour × 8 GPUs = $99.84/hour
- Training time: 200 hours (typical for 13B model)
- Total cost: $19,968
Spot Only (assume 2 interruptions, with 18 hours of lost work redone on on-demand):
- Spot cost: $1.93/hour × 8 = $15.44/hour
- Time on spot: 182 hours
- Recovery: 18 hours on on-demand at $99.84/hour = $1,797
- Total cost: (182 × $15.44) + (18 × $99.84) = $2,810 + $1,797 = $4,607
Spot with Managed Fallback (SageMaker):
- Spot cost: 85% of time on spot = 170 hours × $15.44 = $2,625
- On-demand cost: 15% of time (capacity misses) = 30 hours × $99.84 = $2,995
- SageMaker overhead: 3% = ~$170
- Total cost: ~$5,790
In this scenario, pure spot saves ~77% but adds risk (two complete restart episodes). The hybrid fallback approach gives ~71% savings with near-guaranteed completion.
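These totals are plain arithmetic, so it's worth encoding them in a few lines you can re-run as rates change (8-GPU hourly rates from the table above):

```python
def scenario_costs(spot_rate=15.44, od_rate=99.84):
    """Reproduce the three worked scenarios above.

    Rates are 8-GPU hourly costs; the hour splits (182/18, 170/30)
    and the ~3% management overhead follow the text.
    """
    on_demand = 200 * od_rate                      # full run on-demand
    pure_spot = 182 * spot_rate + 18 * od_rate     # 2 restarts redone on-demand
    hybrid = 170 * spot_rate + 30 * od_rate + 170  # 85/15 split + overhead
    return on_demand, pure_spot, hybrid

od, spot, hybrid = scenario_costs()
savings = 1 - spot / od  # ~0.77 for the pure-spot scenario
```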
The Organizational Reality of Spot Training
Here's what most tutorials skip: convincing your organization that spot instances are worth the complexity. Many ML teams, especially those led by researchers, view infrastructure investment as overhead. "Just buy more on-demand capacity," they say. "Our models are too important to risk on spot instances." This thinking persists until someone runs the math.
That same team training fifty models a year at $10,000 each is spending $500,000 on compute. Spot instances could cut that to $150,000 if they invested two weeks building checkpointing and interruption handling. That's $350,000 saved. Suddenly it's not overhead - it's a business problem. You need organizational buy-in to make spot work, and that buy-in comes from demonstrating the financial case.
But there's a human side too. ML engineers care about model quality and training time. They don't innately care about cloud costs. You need to frame spot instances in terms they care about. Instead of "save 80% on compute costs," say "train five models instead of one for the same budget, iterate faster, find better architectures." Suddenly the value proposition shifts from infrastructure optimization to enabling better research.
The technical challenges are real but solvable. The organizational challenges are harder. You need training, documentation, and patience as your team learns to build for fault tolerance. The teams that succeed are the ones that invest in both the technical and organizational sides - they show the ROI, they train their engineers, they make checkpointing and recovery the default pattern, not the exception.
One more thing: your cloud provider probably has better spot reliability than you assume. AWS has been running spot for 15+ years. The horror stories - instances terminating constantly, capacity exhaustion - mostly apply to older regions during peak times. Talk to your provider's account team. Many offer SLAs on spot availability. Negotiate rates and terms that match your risk profile. The published prices are often a starting point for negotiation, especially for steady multi-month commitments.
Summary
Spot instances aren't a gamble - they're a business decision. With proper checkpointing, interruption handling, and cloud arbitrage, you get 70–90% cost savings on GPU training while maintaining 99%+ job success rates.
The key is building for failure from day one. Treat interruptions as a feature, not a bug. Save checkpoints frequently, upload to persistent storage, and wire up SIGTERM handlers. Use distributed elastic training for multi-node workloads. And when pricing varies wildly across clouds, play them against each other.
Do this right, and your ML infrastructure costs plummet. Your teams can experiment faster. You can train bigger models on the same budget. That's the spot instance advantage.
Scaling Your Spot Strategy
As your organization grows and runs more training jobs, spot instance management becomes increasingly valuable. A startup running five training jobs a month might not bother with spot. A team running fifty jobs a month can see a six-figure annual gap between on-demand and spot. At that scale, you need systematic approaches.
Build a wrapper around your training submission that automatically tries spot first, implements retry logic with exponential backoff, and falls back to on-demand if necessary. This removes the human decision point - spot becomes the default path for all jobs, but with guaranteed success. Your engineers don't have to think about it.
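A minimal sketch of that wrapper - the launcher callables are placeholders for your own provider APIs, and the sleep function is injectable so the backoff logic itself can be tested:

```python
import random
import time

def submit_with_fallback(launch_spot, launch_on_demand,
                         max_attempts=4, base_delay_s=30, sleep=time.sleep):
    """Try spot with jittered exponential backoff, then fall back to on-demand.

    `launch_spot` should raise on capacity errors and return a job handle
    on success; `launch_on_demand` is the guaranteed-success path.
    """
    for attempt in range(max_attempts):
        try:
            return launch_spot()
        except Exception:
            if attempt < max_attempts - 1:
                # Backoff: ~30s, ~60s, ~120s, ... with +/-20% jitter
                sleep(base_delay_s * (2 ** attempt) * random.uniform(0.8, 1.2))
    return launch_on_demand()
```

Because spot is attempted first and on-demand is the last resort, engineers get the cheap path by default and the reliable path automatically.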
Monitor spot prices historically and build alerts. If A100 pricing in us-west-2 drops to $1.50/hour (significantly below typical $1.93), that's a signal to accelerate batch training jobs. Conversely, if prices spike to $4/hour, hold off unless urgent. Many organizations have built internal tools that watch spot prices and automatically launch queued training jobs when conditions are favorable.
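For AWS specifically, historical prices are available through the EC2 describe_spot_price_history API. A sketch (instance type and region are illustrative; the fetch function needs AWS credentials and is imported lazily, while the summarizer is pure and testable offline):

```python
import datetime

def fetch_spot_history(instance_type="p4d.24xlarge", hours=24, region="us-west-2"):
    """Pull recent spot price points for one instance type (needs AWS creds)."""
    import boto3  # lazy import: the offline summarizer below works without it
    ec2 = boto3.client("ec2", region_name=region)
    start = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=hours)
    resp = ec2.describe_spot_price_history(
        InstanceTypes=[instance_type],
        ProductDescriptions=["Linux/UNIX"],
        StartTime=start,
    )
    return resp["SpotPriceHistory"]

def cheapest_zone(history):
    """Latest price per AZ, then the cheapest AZ.

    `history` is a list of dicts shaped like describe_spot_price_history
    items: AvailabilityZone, SpotPrice (a string), Timestamp.
    """
    latest = {}
    for point in sorted(history, key=lambda p: p["Timestamp"]):
        latest[point["AvailabilityZone"]] = float(point["SpotPrice"])
    zone = min(latest, key=latest.get)
    return zone, latest[zone]
```

Feed the summarizer into an alerting rule (e.g. "launch queued jobs when the cheapest AZ drops below your threshold") and you have the core of the internal tooling described above.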
Also, don't ignore the psychological factor. Your data science team might avoid spot because they've heard horror stories. Prove them wrong. Show them a training job that completed successfully on spot. Let them see checkpoints save and recover automatically. Let them feel the speed of iteration when you can run ten experiments instead of one for the same budget. Once they've experienced it, they'll never go back to purely on-demand.
Why Executives Care About Spot Instances
The business case for spot instances is compelling. Most ML teams have limited training budgets. A startup might allocate $100K per year for training. A research group might get half a million. Every dollar spent on compute is a dollar not spent on hiring engineers, acquiring better data, or building product. Spot instances double or triple how much training you can do with your budget.
But executives also worry about reliability. "Will our training fail due to interruptions?" "Will we lose weeks of work?" "Is this actually worth the engineering effort?" The answers are yes, no, and yes. Yes, training might get interrupted. No, you won't lose weeks if you implement checkpointing. Yes, it's worth it if you're training regularly.
The conversation often goes like this: Your organization trains fifty models a year at $10K each. Total: $500K. Spot instances cost 25-30% of on-demand. That's $350K saved. Your engineering team (two people at $200K each) spends two weeks building spot infrastructure - roughly $15,000 of engineering time. The ROI is over 20x in the first year. By year two, you're saving $350K with no additional engineering investment. This is a business no-brainer, yet many organizations don't pursue it because they don't do the math or they're intimidated by the complexity.
Building Institutional Knowledge
As your organization grows, spot instance knowledge becomes institutional. Your junior engineers learn how to check out a training job with spot enabled. Your senior engineers understand the failure modes and can diagnose issues quickly. Your on-call team knows how to respond when something goes wrong. This institutional knowledge compounds in value.
Teams that have run thousands of training jobs on spot have war stories. They've seen every failure mode. They know which regions have stable preemptible capacity. They know which instance types are safe and which are risky. They understand the exact interplay between checkpoint frequency, overhead, and failure rates. This deep knowledge is invisible to outsiders but invaluable internally.
Documentation and training become critical at this stage. You need runbooks that explain how spot works and what to do when training fails. You need training that new hires receive before they're allowed to submit spot jobs. You need monitoring dashboards that tell you immediately if something is wrong. You need a culture where spot failures are learning opportunities, not disasters.
The Psychological Factor
Here's something most documentation skips: the psychology of spot instances. Many engineers are hesitant about spot because they've heard horror stories. "I lost a whole training run to interruptions." "Our compute got cut off with no warning." "We couldn't hit deadlines because of spot failures." These stories are real but survivorship-biased. They represent organizations that didn't implement proper checkpointing. For organizations that did, spot is reliable and cheap.
Once an engineer experiences their first successful multi-day training run on spot, their perspective shifts. They see the checkpoint saved at 6 AM, the instance got interrupted at 2 PM, and the training automatically resumed at 3 PM from the checkpoint. They notice the cost is half of what it would have been on-demand. They realize spot is both reliable and economical. That engineer becomes an evangelist internally. Word spreads.
This psychological shift is often the bottleneck for adoption. The technical challenges are solved. The financial case is clear. But people need to experience it working to really believe in it. Smart organizations recognize this and make sure their engineers see successful spot training runs early in their journey.
Preparing for the Future
Spot instances will likely become even more important as cloud infrastructure evolves. As cloud providers add more capacity and become more efficient, they'll have more excess capacity to monetize via spot pricing. Competition will increase, pushing spot prices lower. Interruption rates might increase, but your checkpointing and recovery infrastructure will adapt. The fundamental dynamics won't change - you're trading interruption risk for cost savings.
The most forward-thinking organizations are already thinking beyond spot instances. Reserved instances lock in discounts but require commitment. Spot instances are flexible but interruptible. Some organizations are experimenting with custom agreements with cloud providers for long-term capacity commitments at fixed prices. Others are building heterogeneous fleets that use whatever capacity is cheapest that day. The principle remains: optimize cost by trading convenience or flexibility.
The evolution will favor organizations that have already built the infrastructure to handle interruptions and cost optimization. You can't suddenly start using spot instances at scale without months of groundwork. You can't switch cloud providers or negotiate custom deals without experience managing cost optimization. By building this muscle now, while stakes are lower, you position yourself for success as both technology and business dynamics evolve.