Infrastructure as Code for ML - Terraform and Pulumi Patterns
You've built an amazing ML model. Now comes the hard part: deploying it at scale without losing your mind to manual infrastructure management. That's where Infrastructure as Code (IaC) enters the picture, and honestly, it's non-negotiable for serious ML workloads. Let's explore how Terraform and Pulumi approach the unique challenges of ML infrastructure, and which patterns actually work in production.
Table of Contents
- The ML Infrastructure Challenge
- ML-Specific IaC Requirements
- GPU Resource Management
- Training vs. Serving Infrastructure
- Auto-scaling for Inference
- Kubernetes GPU Node Pools
- State and Drift Management
- Terraform for ML Infrastructure: Architecture Overview
- Terraform Module for ML Training Infrastructure
- Directory Structure
- Core Module: main.tf
- Spot Fleet Configuration: spot_fleet.tf
- User Data Script: user_data.sh
- Variables: variables.tf
- Kubernetes GPU Node Pools with Terraform
- EKS Cluster Module
- Pulumi with Python SDK: Same Language as Your ML Code
- Pulumi Project Structure
- Pulumi Configuration: __main__.py
- Dynamic Resource Creation from Config: ml_serving.py
- ComponentResource for Reusable Patterns: utils.py
- ML Infrastructure Drift Detection and Management
- Terraform Drift Detection in CI
- State Locking with DynamoDB
- Destroy and Recreate Pattern for Training Infrastructure
- Monitoring and Observability for IaC Changes
- Practical Example: Complete Training → Serving Pipeline
- Step 1: Define Your Model Config
- Step 2: Provision Training Infrastructure
- Step 3: Monitor Training, Handle Interruptions
- Step 4: Deploy Serving Infrastructure
- Step 5: Destroy Training Infrastructure
- Advanced: Cost Optimization Patterns
- Spot Instance Diversification
- Reserved Capacity for Baseline Load
- Terraform vs. Pulumi: When to Use Each
- Practical Workflow: From Experiment to Production
- Checklist: IaC Maturity for ML
- Summary
- The Organizational Shift Enabled by IaC
- Building Infrastructure Abstractions
- Evolving Your IaC as You Scale
- The Cost Impact of IaC
- Choosing Between Terraform and Pulumi: A Practical Guide
The ML Infrastructure Challenge
Here's the problem: traditional IaC tools were built for web services. Your ML infrastructure has very different demands. You need ephemeral training clusters that spin up, consume expensive GPU resources, then disappear without a trace. You need persistent serving infrastructure that auto-scales based on inference demand. You need spot instances to save costs, but with graceful handling of interruptions. You need GPU node pools in Kubernetes, managed resource quotas, and drift detection to catch unplanned changes before they cause your training job to mysteriously fail at 3 AM.
Standard IaC patterns don't cut it. You need infrastructure that thinks about ML-specific concerns: checkpoint management, distributed training coordination, model versioning, and cost optimization.
This is where the gap between generic IaC and ML-aware IaC becomes painfully clear. A Terraform configuration that works beautifully for deploying web services - declaring stateless APIs, databases, load balancers - breaks down when you introduce the complexities of ML workloads. Web services are largely stateless; a crashed instance is replaced with a new one running identical code. ML training is stateful in ways that matter; a crashed GPU in the middle of a 40-hour training run doesn't just mean "restart from the beginning." It means losing all intermediate progress, potentially hours of wasted compute. Your infrastructure code needs to account for this by managing checkpoints, handling spot instance interruptions gracefully, and coordinating distributed training across multiple nodes where any single failure requires recovery logic.
The other challenge is the diversity of ML workloads. You're not deploying one type of service; you're deploying many. Some jobs are training jobs that run once and finish. Some are serving jobs that run forever. Some are periodic batch jobs that process data on a schedule. Some are interactive experiments that scientists run ad hoc. Each pattern demands different infrastructure semantics. Static web app infrastructure assumes one deployment per version. ML infrastructure assumes multiple concurrent workloads with different resource requirements, lifespans, and failure modes. This is why the best ML infrastructure teams abstract their IaC tooling behind domain-specific orchestrators (like Kubeflow, Ray, or SageMaker) rather than exposing raw Terraform or Pulumi to data scientists. The abstraction layer hides complexity while maintaining flexibility.
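To make the checkpoint-and-recover pattern concrete, here's a hedged sketch of the training-side half of that logic: a loop that saves a checkpoint when an external watcher (for example, a spot-interruption monitor) sends SIGUSR1. The class and file names are illustrative, not any framework's API.

```python
import json
import os
import signal


class CheckpointingTrainer:
    """Minimal trainer that checkpoints on SIGUSR1 (illustrative only)."""

    def __init__(self, checkpoint_dir):
        self.checkpoint_dir = checkpoint_dir
        self.step = 0
        self.interrupted = False
        # Must be registered from the main thread, before training begins.
        signal.signal(signal.SIGUSR1, self._on_interrupt)

    def _on_interrupt(self, signum, frame):
        # Just set a flag; the loop checkpoints at the next safe point.
        self.interrupted = True

    def save_checkpoint(self):
        path = os.path.join(self.checkpoint_dir, f"step-{self.step}.json")
        with open(path, "w") as f:
            json.dump({"step": self.step}, f)
        return path

    def train(self, total_steps):
        while self.step < total_steps:
            if self.interrupted:
                break  # stop cleanly; a replacement instance resumes later
            self.step += 1  # stand-in for one optimizer step
        return self.save_checkpoint()
```

A new instance can then read the latest checkpoint file and resume from the recorded step instead of restarting the whole run.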
ML-Specific IaC Requirements
Before we dive into Terraform and Pulumi, let's establish what "good" looks like for ML infrastructure.
GPU Resource Management
You can't just request GPUs like regular compute. You need quota management, instance type selection (NVIDIA A100 vs H100 vs cheaper options), and awareness of spot instance pricing and availability. Your infrastructure code should declare what you need, and the IaC tool should handle availability zone diversity and fallback options.
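As a concrete illustration, here's a small hypothetical helper (not any cloud SDK) that orders candidate GPU instance types by per-GPU spot price under a budget, giving a fleet request a natural fallback order. The prices and field names are made up for the example.

```python
def rank_instance_choices(candidates, max_price_per_gpu_hour):
    """Order GPU instance candidates for a spot request.

    `candidates` maps instance type -> dict with hypothetical fields
    `spot_price` (USD/hour) and `gpus`. Returns the types that fit the
    per-GPU budget, cheapest per GPU first, so a fleet can diversify
    across them and fall back down the list on capacity shortages.
    """
    affordable = [
        (name, info["spot_price"] / info["gpus"])
        for name, info in candidates.items()
        if info["spot_price"] / info["gpus"] <= max_price_per_gpu_hour
    ]
    return [name for name, _ in sorted(affordable, key=lambda pair: pair[1])]
```

The resulting list maps directly onto the instance-type override lists that spot fleet requests accept.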
Training vs. Serving Infrastructure
Training clusters are fundamentally ephemeral. A training job runs for hours or days, then you're done. You want to destroy everything afterward to avoid surprise costs. Serving infrastructure is persistent - models serve traffic 24/7, so you need reliable, auto-scaling infrastructure with state management, load balancing, and canary deployments.
The same IaC tool needs to handle both patterns elegantly without requiring completely different approaches.
Auto-scaling for Inference
Web services scale on CPU and request latency. ML models scale on inference request volume, but with important differences. A single request might consume multiple GPUs. Queue depth matters more than request rate. You need predictable, fast scaling to handle traffic spikes without dropping requests or burning money on over-provisioning.
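That queue-depth policy can be sketched as arithmetic: size the fleet to drain the current queue within a target window, clamped to min/max bounds. The function name and thresholds are illustrative, not a real autoscaler API.

```python
import math


def desired_replicas(queue_depth, per_replica_rps, min_replicas, max_replicas,
                     target_drain_seconds=30):
    """Size an inference fleet from queue depth rather than CPU load.

    Hypothetical policy: provision enough replicas to drain the queue
    within `target_drain_seconds`, never scaling below `min_replicas`
    (baseline latency) or above `max_replicas` (cost ceiling).
    """
    needed = math.ceil(queue_depth / (per_replica_rps * target_drain_seconds))
    return max(min_replicas, min(max_replicas, needed))
```

In Kubernetes terms, this is the calculation a custom-metrics HPA or an external autoscaler would perform on each evaluation tick.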
Kubernetes GPU Node Pools
Most modern ML serving happens on Kubernetes. You need your IaC to:
- Create GPU node groups with specific instance types
- Apply GPU taints so non-GPU workloads don't schedule there
- Install the NVIDIA device plugin
- Configure auto-scaling with tools like Karpenter
- Manage pod requests and limits
State and Drift Management
Unlike stateless web apps, ML infrastructure often has state - training checkpoints in S3, model metadata in DynamoDB, experiment tracking in specialized services. Your IaC needs to handle state locking to prevent concurrent modifications, drift detection to catch manual changes, and safe destroy/recreate patterns for stateless training infrastructure.
Terraform for ML Infrastructure: Architecture Overview
Terraform is the industry standard for infrastructure as code, and for good reason. It has been around since 2014 and has accumulated deep integrations with cloud providers. For AWS, which dominates ML infrastructure deployment, Terraform's AWS provider is remarkably comprehensive - thousands of resources, hundreds of data sources, and a mature community. The declarative language (HCL) reads like structured configuration, which makes it accessible to non-experts while remaining powerful enough for complex orchestration.
The key advantage of Terraform is its state management. It tracks what infrastructure exists, what you've declared, and what diffs need to apply. This gives you safety - you can preview changes before applying them, and you have a durable record of your infrastructure. The downside is that state becomes your source of truth, not your code. If your state file gets corrupted, you're in for a bad time. If your state becomes out of sync with actual infrastructure (drift), Terraform's error messages become cryptic. This is manageable at small scale, but at large scale, state management becomes a serious operational concern.
For ML infrastructure specifically, Terraform's module system is what makes it work. You don't write monolithic 2000-line Terraform files. You write modular, composable modules - one for VPC, one for Kubernetes cluster, one for GPU node pools, one for IAM policies. Your "main" configuration then orchestrates these modules. This composition pattern scales well. You can version modules separately, share them across projects, and test them independently.
The challenge is that Terraform is verbose. A simple "create a GPU instance pool with auto-scaling" requires writing policy documents, security group rules, IAM roles with trust relationships, and launch templates. There's a lot of boilerplate that feels mechanical. This is where Pulumi offers a different angle: it lets you write infrastructure as real code (Python, Go, Node.js), which means you can use loops, conditionals, functions, and libraries - all the tools you use when writing applications.
┌─────────────────────────────────────────────────────────┐
│ ML Workload Definition │
│ (experiment_config.yaml, model_config.json) │
└────────┬────────────────────────────────────────────────┘
│
┌────────▼────────────────────────────────────────────────┐
│ Terraform Root Module (main.tf) │
│ • Data sources (current AWS account, AZs) │
│ • Local variables (instance types, spot config) │
│ • Module composition │
└────────┬────────────────────────────────────────────────┘
│
┌────────┴──────────────────────────────────────────────────┐
│ │
├─────────────────────────┬────────────────────┬─────────────┤
│ VPC Module              │ IAM Module         │ EKS Module  │
│ • Subnets               │ • Role policies    │ • Cluster   │
│ • Security groups       │ • Service accts    │ • Node grps │
│ • NAT gateways          │ • Least privilege  │ • RBAC      │
├─────────────────────────┼────────────────────┼─────────────┤
│ Spot Fleet Module       │ S3/Checkpoint      │ Monitoring  │
│ • Launch template       │ • Bucket config    │ • CloudWatch│
│ • Interrupt handling    │ • Lifecycle rules  │ • Alerting  │
│ • Cost optimization     │ • Encryption       │             │
└─────────────────────────┴────────────────────┴─────────────┘
Terraform's strength for ML lies in its massive AWS provider support and mature ecosystem. You get fine-grained control over every resource. The downside? Lots of boilerplate, and you're writing declarative code that feels distant from your Python-based ML work.
Terraform Module for ML Training Infrastructure
Let's build a practical Terraform module for ML training. This module manages EC2 spot instances with GPU support, including interrupt handling and checkpoint management.
Directory Structure
terraform/
├── modules/
│ └── ml_training_cluster/
│ ├── main.tf
│ ├── variables.tf
│ ├── outputs.tf
│ ├── iam.tf
│ ├── networking.tf
│ └── spot_fleet.tf
├── environments/
│ ├── dev.tfvars
│ ├── staging.tfvars
│ └── prod.tfvars
└── main.tf
Core Module: main.tf
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
provider "aws" {
region = var.aws_region
}
# Fetch available AZs for instance diversity
data "aws_availability_zones" "available" {
state = "available"
}
# Data source for Ubuntu AMI with GPU drivers pre-installed
data "aws_ami" "gpu_optimized" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["Deep Learning AMI GPU CUDA 12 (Ubuntu 22.04)*"]
}
}
# VPC and networking
module "networking" {
source = "./networking"
vpc_name = "${var.cluster_name}-vpc"
vpc_cidr = var.vpc_cidr
availability_zones = data.aws_availability_zones.available.names
enable_nat_gateway = true
}
# IAM roles for EC2 instances
module "iam" {
source = "./iam"
cluster_name = var.cluster_name
checkpoint_bucket = aws_s3_bucket.checkpoints.id
}
# S3 bucket for training checkpoints
resource "aws_s3_bucket" "checkpoints" {
bucket = "${var.cluster_name}-checkpoints-${data.aws_caller_identity.current.account_id}"
tags = {
Name = "${var.cluster_name}-checkpoints"
Environment = var.environment
}
}
resource "aws_s3_bucket_lifecycle_configuration" "checkpoints" {
bucket = aws_s3_bucket.checkpoints.id
rule {
id = "cleanup-old-checkpoints"
status = "Enabled"
expiration {
days = var.checkpoint_retention_days
}
}
}
# Spot Fleet for training cluster
module "spot_fleet" {
source = "./spot_fleet"
cluster_name = var.cluster_name
vpc_id = module.networking.vpc_id
subnet_ids = module.networking.private_subnet_ids
security_group_id = module.networking.training_security_group_id
instance_profile_arn = module.iam.instance_profile_arn
ami_id = data.aws_ami.gpu_optimized.id
instance_types = var.instance_types
capacity_units_target = var.capacity_units_target
max_price = var.spot_max_price
  availability_zones    = data.aws_availability_zones.available.names
  checkpoint_bucket     = aws_s3_bucket.checkpoints.id
}
data "aws_caller_identity" "current" {}
output "checkpoint_bucket_name" {
value = aws_s3_bucket.checkpoints.id
}
output "spot_fleet_id" {
value = module.spot_fleet.fleet_id
}

Spot Fleet Configuration: spot_fleet.tf
# Launch template for GPU instances
resource "aws_launch_template" "training" {
name_prefix = "${var.cluster_name}-"
image_id = var.ami_id
instance_type = var.instance_types[0]
vpc_security_group_ids = [var.security_group_id]
iam_instance_profile {
arn = var.instance_profile_arn
}
# User data: install training tools and setup graceful shutdown
user_data = base64encode(templatefile("${path.module}/user_data.sh", {
checkpoint_bucket = var.checkpoint_bucket
region = data.aws_region.current.name
}))
block_device_mappings {
device_name = "/dev/sda1"
ebs {
volume_size = 100
volume_type = "gp3"
iops = 3000
throughput = 125
delete_on_termination = true
encrypted = true
}
}
monitoring {
enabled = true
}
tag_specifications {
resource_type = "instance"
tags = {
Name = "${var.cluster_name}-training"
ClusterName = var.cluster_name
}
}
lifecycle {
create_before_destroy = true
}
}
# Spot Fleet Request with diversification strategy
resource "aws_ec2_fleet" "training" {
launch_template_config {
launch_template_specification {
launch_template_id = aws_launch_template.training.id
version = "$Latest"
}
    # Diversify across instance types and AZs to minimize interruption impact.
    # Each override is a single (instance_type, availability_zone) pair, so
    # generate the cross product rather than nesting dynamic blocks.
    dynamic "overrides" {
      for_each = setproduct(var.instance_types, var.availability_zones)
      content {
        instance_type     = overrides.value[0]
        availability_zone = overrides.value[1]
      }
    }
}
type = "maintain"
excess_capacity_termination_policy = "termination"
target_capacity_specification {
total_target_capacity = var.capacity_units_target
on_demand_target_capacity = 0 # Use 100% spot for cost optimization
spot_target_capacity = var.capacity_units_target
}
spot_options {
allocation_strategy = "price-capacity-optimized"
instance_interruption_behavior = "terminate"
maintenance_strategies {
capacity_rebalance {
replacement_strategy = "launch"
}
}
}
tags = {
Name = "${var.cluster_name}-fleet"
}
}
data "aws_region" "current" {}

User Data Script: user_data.sh
#!/bin/bash
set -e
# Setup CloudWatch agent for monitoring
amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/config.json
# Install training dependencies
pip install -U pip setuptools wheel
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install pytorch-lightning wandb
# Setup graceful shutdown handler
cat > /opt/graceful_shutdown.sh << 'EOF'
#!/bin/bash
# Poll for the EC2 spot interruption notice (two-minute warning).
# The metadata endpoint returns 404 until an interruption is scheduled,
# so check the HTTP status rather than grepping the body. (If IMDSv2 is
# enforced, a session token header is also required.)
while true; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://169.254.169.254/latest/meta-data/spot/instance-action)
  if [ "$STATUS" = "200" ]; then
    echo "Spot interruption detected. Saving checkpoint..."
    # Trigger checkpoint save signal to training process
    pkill -SIGUSR1 python || true
    sleep 110 # Use most of the two-minute warning for a graceful save
    break
  fi
  sleep 5
done
EOF
chmod +x /opt/graceful_shutdown.sh
nohup /opt/graceful_shutdown.sh &
echo "Training instance ready"

Variables: variables.tf
variable "aws_region" {
description = "AWS region"
type = string
default = "us-east-1"
}
variable "cluster_name" {
description = "Name of the training cluster"
type = string
}
variable "environment" {
description = "Environment (dev, staging, prod)"
type = string
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "Environment must be dev, staging, or prod."
}
}
variable "vpc_cidr" {
description = "CIDR block for VPC"
type = string
default = "10.0.0.0/16"
}
variable "instance_types" {
description = "GPU instance types to use (e.g., g4dn.xlarge, g4dn.2xlarge)"
type = list(string)
default = ["g4dn.xlarge", "g4dn.2xlarge", "g4dn.12xlarge"]
}
variable "capacity_units_target" {
description = "Target capacity in capacity units (vCPU-based)"
type = number
default = 8
}
variable "spot_max_price" {
description = "Maximum spot price per vCPU-hour"
type = string
default = "0.50"
}
variable "checkpoint_retention_days" {
description = "Days to retain checkpoints in S3"
type = number
default = 30
}

Kubernetes GPU Node Pools with Terraform
Now let's manage EKS with GPU node pools using Terraform. This is where you deploy serving infrastructure.
EKS Cluster Module
# EKS Cluster
resource "aws_eks_cluster" "ml_serving" {
name = var.cluster_name
version = var.kubernetes_version
role_arn = aws_iam_role.eks_cluster_role.arn
vpc_config {
subnet_ids = concat(module.networking.public_subnet_ids, module.networking.private_subnet_ids)
endpoint_private_access = true
endpoint_public_access = true
}
depends_on = [aws_iam_role_policy_attachment.eks_cluster_policy]
}
# GPU Node Group with Auto Scaling
resource "aws_eks_node_group" "gpu" {
cluster_name = aws_eks_cluster.ml_serving.name
node_group_name = "${var.cluster_name}-gpu-nodes"
node_role_arn = aws_iam_role.eks_node_role.arn
subnet_ids = module.networking.private_subnet_ids
version = var.kubernetes_version
scaling_config {
desired_size = var.desired_size
max_size = var.max_size
min_size = var.min_size
}
instance_types = var.gpu_instance_types
# GPU-specific launch template
launch_template {
id = aws_launch_template.gpu_nodes.id
    version = aws_launch_template.gpu_nodes.latest_version
}
  # GPU taint to prevent non-GPU workloads from scheduling
  taint {
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"
  }
tags = {
"NodeType" = "gpu"
}
depends_on = [
aws_iam_role_policy_attachment.eks_node_policy,
aws_iam_role_policy_attachment.eks_cni_policy,
]
}
# Launch template with GPU-specific settings
resource "aws_launch_template" "gpu_nodes" {
name_prefix = "${var.cluster_name}-gpu-"
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = 100
volume_type = "gp3"
delete_on_termination = true
encrypted = true
}
}
monitoring {
enabled = true
}
}
# NVIDIA Device Plugin via Helm
resource "helm_release" "nvidia_device_plugin" {
name = "nvidia-device-plugin"
repository = "https://nvidia.github.io/k8s-device-plugin"
chart = "nvidia-device-plugin"
namespace = "kube-system"
set {
name = "nodeSelector.nvidia\\.com/gpu"
value = "true"
}
depends_on = [aws_eks_cluster.ml_serving]
}
# Karpenter for intelligent GPU auto-scaling
resource "helm_release" "karpenter" {
namespace = "karpenter"
create_namespace = true
name = "karpenter"
repository = "oci://public.ecr.aws/karpenter"
chart = "karpenter"
set {
name = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
value = aws_iam_role.karpenter.arn
}
set {
name = "settings.aws.clusterName"
value = aws_eks_cluster.ml_serving.name
}
}
output "cluster_endpoint" {
value = aws_eks_cluster.ml_serving.endpoint
}
output "cluster_name" {
value = aws_eks_cluster.ml_serving.name
}

Pulumi with Python SDK: Same Language as Your ML Code
Here's where things get interesting. Pulumi lets you write infrastructure in Python, Go, or TypeScript. For ML teams, this is huge - your infrastructure code lives in the same language as your model training scripts.
Pulumi's fundamental difference from Terraform is the programming model. Terraform is declarative HCL - you describe the final state you want, and Terraform calculates the diff. Pulumi lets you express that desired state in a general-purpose language: your program runs imperatively, but its output is still a declarative resource graph that Pulumi diffs against its state. This might sound like a small difference, but it cascades into dramatic consequences for complexity and flexibility.
Consider a simple scenario: you want to create a different number of GPU nodes depending on the environment. With Terraform, you write conditional logic in HCL - doable, but HCL's conditionals feel bolted-on. With Pulumi in Python, you write a loop: for i in range(num_nodes): create_gpu_node(...). You have the full power of Python - functions, classes, imports, libraries. You can fetch the number of nodes from an external API. You can compute it based on config. You can even make it data-driven, pulling requirements from a YAML file and generating infrastructure.
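The data-driven pattern described above can be sketched without any cloud calls: expand a per-environment config into node specs, then feed each spec to a resource constructor. The config shape and names here are hypothetical; in a real Pulumi program each returned spec would parameterize something like `aws.ec2.Instance(spec["name"], ...)`.

```python
def plan_gpu_nodes(environment, config):
    """Expand an environment entry into per-node specs.

    `config` is a plain dict here, but it could equally be loaded from
    YAML or fetched from an API - that's the point of writing
    infrastructure in a general-purpose language.
    """
    env = config[environment]
    return [
        {
            "name": f"gpu-node-{environment}-{i}",
            "instance_type": env["instance_type"],
            # Default to spot instances unless the environment opts out.
            "spot": env.get("spot", True),
        }
        for i in range(env["count"])
    ]
```

Because it's ordinary Python, this planning logic can be unit-tested in isolation, before any `pulumi up` touches real infrastructure.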
This flexibility is powerful for ML infrastructure because ML requirements change rapidly. You're not deploying a static web service once. You're deploying diverse workloads with different resource requirements, conducting experiments, scaling based on demand. Pulumi's model matches this reality better than Terraform's declarative approach. You can version control your infrastructure, review it in PRs, and deploy it with CI/CD, just like you do with application code.
The downside is that Pulumi requires more infrastructure expertise. You're writing real code, which means you can write bad code. With Terraform, you're constrained by the language to express infrastructure concerns. With Pulumi, you could in theory do anything Python allows, which is both a feature (flexibility) and a bug (complexity). Teams that get Pulumi right build powerful, DRY infrastructure libraries. Teams that get it wrong end up with infrastructure code that's harder to maintain than application code because it touches more surfaces and has more failure modes.
Pulumi Project Structure
ml-infrastructure/
├── Pulumi.yaml
├── Pulumi.dev.yaml
├── Pulumi.prod.yaml
├── __main__.py
├── ml_training.py
├── ml_serving.py
└── utils.py
Pulumi Configuration: __main__.py
import json
import pulumi
import pulumi_aws as aws
import pulumi_eks as eks
import pulumi_kubernetes as k8s
# Load config from context
config = pulumi.Config()
cluster_name = config.get("cluster_name") or "ml-serving"
environment = pulumi.get_stack()
region = config.get("region") or "us-east-1"
# Create VPC
vpc = aws.ec2.Vpc(f"{cluster_name}-vpc",
cidr_block="10.0.0.0/16",
enable_dns_hostnames=True,
)
# Subnets across AZs
availability_zones = aws.get_availability_zones(state="available").names
subnet_ids = []
for i, az in enumerate(availability_zones[:3]): # 3 AZs
subnet = aws.ec2.Subnet(f"{cluster_name}-subnet-{i}",
vpc_id=vpc.id,
cidr_block=f"10.0.{i}.0/24",
availability_zone=az,
)
subnet_ids.append(subnet.id)
# Security group for cluster
security_group = aws.ec2.SecurityGroup(f"{cluster_name}-sg",
vpc_id=vpc.id,
ingress=[
aws.ec2.SecurityGroupIngressArgs(
protocol="tcp",
from_port=443,
to_port=443,
cidr_blocks=["0.0.0.0/0"],
),
],
)
# Create EKS cluster
eks_cluster = eks.Cluster(f"{cluster_name}-cluster",
version="1.28",
vpc_id=vpc.id,
subnet_ids=subnet_ids,
endpoint_private_access=True,
endpoint_public_access=True,
)
# GPU Node group
gpu_nodegroup = eks.NodeGroup(f"{cluster_name}-gpu-nodegroup",
cluster=eks_cluster,
node_role_arn=aws.iam.Role(f"{cluster_name}-node-role",
assume_role_policy=json.dumps({
"Version": "2012-10-17",
"Statement": [{
"Action": "sts:AssumeRole",
"Principal": {"Service": "ec2.amazonaws.com"},
"Effect": "Allow",
}],
}),
).arn,
scaling_config=eks.NodeGroupScalingConfigArgs(
desired_size=3,
max_size=10,
min_size=1,
),
instance_types=["g4dn.xlarge", "g4dn.2xlarge"],
)
# Kubernetes provider using cluster credentials
k8s_provider = k8s.Provider("k8s-provider",
kubeconfig=eks_cluster.kubeconfig_json,
)
# Export outputs
pulumi.export("cluster_name", eks_cluster.name)
pulumi.export("kubeconfig", eks_cluster.kubeconfig_json)Dynamic Resource Creation from Config: ml_serving.py
Here's the Pulumi advantage - you can read your ML model config and dynamically create infrastructure based on it:
import json
import yaml
import pulumi
import pulumi_aws as aws
import pulumi_kubernetes as k8s
def deploy_model_serving_infrastructure(model_config_path, k8s_provider):
"""
Read model configuration and create serving infrastructure dynamically.
This is where Pulumi shines—same language as your ML code!
"""
with open(model_config_path, 'r') as f:
model_config = yaml.safe_load(f)
model_name = model_config['model']['name']
replicas = model_config['serving']['replicas']
gpu_per_replica = model_config['serving']['gpu_per_replica']
batch_size = model_config['serving']['batch_size']
# Create ConfigMap from model config
config_map = k8s.core.v1.ConfigMap(
f"{model_name}-config",
metadata={"namespace": "default"},
data={
            "model_config.yaml": yaml.safe_dump(model_config)
},
opts=pulumi.ResourceOptions(provider=k8s_provider)
)
# Create Deployment with computed resources
deployment = k8s.apps.v1.Deployment(
f"{model_name}-deployment",
spec=k8s.apps.v1.DeploymentSpecArgs(
replicas=replicas,
selector=k8s.meta.v1.LabelSelectorArgs(
match_labels={"app": model_name}
),
template=k8s.core.v1.PodTemplateSpecArgs(
metadata=k8s.meta.v1.ObjectMetaArgs(
labels={"app": model_name}
),
spec=k8s.core.v1.PodSpecArgs(
containers=[
k8s.core.v1.ContainerArgs(
name=model_name,
image=f"myregistry.azurecr.io/{model_name}:latest",
ports=[k8s.core.v1.ContainerPortArgs(container_port=5000)],
resources=k8s.core.v1.ResourceRequirementsArgs(
requests={
"memory": "4Gi",
"cpu": "2",
                                    "nvidia.com/gpu": str(gpu_per_replica),  # quantities are strings in the k8s API
},
limits={
                                    "nvidia.com/gpu": str(gpu_per_replica),
},
),
env=[
k8s.core.v1.EnvVarArgs(
name="BATCH_SIZE",
value=str(batch_size),
),
],
)
],
node_selector={"nvidia.com/gpu": "true"},
tolerations=[
k8s.core.v1.TolerationArgs(
key="nvidia.com/gpu",
operator="Equal",
value="true",
effect="NoSchedule",
)
],
),
),
),
metadata=k8s.meta.v1.ObjectMetaArgs(namespace="default"),
opts=pulumi.ResourceOptions(provider=k8s_provider)
)
# Create Service to expose model
service = k8s.core.v1.Service(
f"{model_name}-service",
spec=k8s.core.v1.ServiceSpecArgs(
type="LoadBalancer",
selector={"app": model_name},
ports=[k8s.core.v1.ServicePortArgs(port=80, target_port=5000)],
),
metadata=k8s.meta.v1.ObjectMetaArgs(namespace="default"),
opts=pulumi.ResourceOptions(provider=k8s_provider)
)
    # Create HPA for auto-scaling (CPU-based here; scaling on GPU utilization
    # requires a custom metrics adapter such as DCGM + Prometheus)
hpa = k8s.autoscaling.v2.HorizontalPodAutoscaler(
f"{model_name}-hpa",
spec=k8s.autoscaling.v2.HorizontalPodAutoscalerSpecArgs(
scale_target_ref=k8s.autoscaling.v2.CrossVersionObjectReferenceArgs(
api_version="apps/v1",
kind="Deployment",
                name=deployment.metadata.name,  # use the auto-named Deployment, not a guessed name
),
min_replicas=replicas,
max_replicas=replicas * 3,
metrics=[
k8s.autoscaling.v2.MetricSpecArgs(
type="Resource",
resource=k8s.autoscaling.v2.ResourceMetricSourceArgs(
name="cpu",
target=k8s.autoscaling.v2.MetricTargetArgs(
type="Utilization",
average_utilization=70,
),
),
),
],
),
metadata=k8s.meta.v1.ObjectMetaArgs(namespace="default"),
opts=pulumi.ResourceOptions(provider=k8s_provider)
)
return {
"deployment": deployment,
"service": service,
"hpa": hpa,
    }

ComponentResource for Reusable Patterns: utils.py
Pulumi's ComponentResource lets you abstract complex patterns:
import json
import pulumi
import pulumi_aws as aws
class GPUTrainingCluster(pulumi.ComponentResource):
"""
Reusable component for ephemeral ML training clusters.
Same interface, different infrastructure based on config.
"""
def __init__(self, name, config, opts=None):
super().__init__('custom:ml:GPUTrainingCluster', name, None, opts)
self.cluster_name = name
self.config = config
# Create S3 bucket for checkpoints
self.checkpoint_bucket = aws.s3.Bucket(
f"{name}-checkpoints",
force_destroy=False, # Prevent accidental deletion
opts=pulumi.ResourceOptions(parent=self)
)
# Create IAM role for training instances
self.training_role = aws.iam.Role(
f"{name}-training-role",
assume_role_policy=json.dumps({
"Version": "2012-10-17",
"Statement": [{
"Action": "sts:AssumeRole",
"Principal": {"Service": "ec2.amazonaws.com"},
"Effect": "Allow",
}],
}),
opts=pulumi.ResourceOptions(parent=self)
)
# Grant S3 access to training role
s3_policy = aws.iam.RolePolicy(
f"{name}-s3-policy",
role=self.training_role.id,
policy=pulumi.Output.concat(
'{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", '
'"Action": ["s3:GetObject", "s3:PutObject"], "Resource": "',
self.checkpoint_bucket.arn,
'/*"}]}'
),
opts=pulumi.ResourceOptions(parent=self)
)
# Create EC2 fleet for training
self.spot_fleet = self._create_spot_fleet()
# Register outputs
self.register_outputs({
"checkpoint_bucket_name": self.checkpoint_bucket.id,
"training_role_arn": self.training_role.arn,
"spot_fleet_id": self.spot_fleet.id,
})
def _create_spot_fleet(self):
# Implementation matches Terraform spot_fleet.tf
        launch_template = aws.ec2.LaunchTemplate(
            f"{self.cluster_name}-launch-template",
            image_id=self.config["ami_id"],  # GPU AMI supplied via the component's config
            instance_type="g4dn.xlarge",
            vpc_security_group_ids=[self.config["security_group_id"]],
            opts=pulumi.ResourceOptions(parent=self)
        )
return aws.ec2.Fleet(
f"{self.cluster_name}-fleet",
launch_template_configs=[aws.ec2.FleetLaunchTemplateConfigArgs(
launch_template_specification=aws.ec2.FleetLaunchTemplateSpecificationArgs(
launch_template_id=launch_template.id,
version="$Latest",
),
)],
target_capacity_specification=aws.ec2.FleetTargetCapacitySpecificationArgs(
total_target_capacity=self.config["capacity_units"],
on_demand_target_capacity=0,
spot_target_capacity=self.config["capacity_units"],
),
opts=pulumi.ResourceOptions(parent=self)
)
@property
def checkpoint_bucket_name(self):
return self.checkpoint_bucket.id
@property
def training_role_arn(self):
        return self.training_role.arn

ML Infrastructure Drift Detection and Management
Here's a critical pain point: your infrastructure drifts. Someone manually adjusts a security group. A node gets an unexpected update. A checkpoint retention policy changes. You need drift detection, and you need it automated.
Terraform Drift Detection in CI
# .github/workflows/terraform-drift-check.yml
name: Terraform Drift Detection
on:
schedule:
# Run every 6 hours
- cron: '0 */6 * * *'
jobs:
drift-detection:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v2
with:
terraform_version: 1.5.0
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v2
with:
role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/github-terraform-role
aws-region: us-east-1
- name: Terraform Init
run: terraform init
      - name: Terraform Plan (Drift Detection)
        id: plan
        # -detailed-exitcode: 0 = no changes, 1 = error, 2 = drift
        run: terraform plan -detailed-exitcode -out=drift.tfplan
        continue-on-error: true
      - name: Check for Drift
        run: |
          if [ "${{ steps.plan.outputs.exitcode }}" = "0" ]; then
            echo "✅ No infrastructure drift detected"
          else
            echo "⚠️ Infrastructure drift detected!"
            terraform show drift.tfplan
            exit 1
          fi
- name: Slack Notification (if drift)
if: failure()
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "⚠️ Infrastructure drift detected in ${{ github.repository }}",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Infrastructure Drift Alert*\n${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
}
}
]
            }

State Locking with DynamoDB
Prevent concurrent modifications that cause state conflicts:
# backend.tf
terraform {
backend "s3" {
bucket = "my-terraform-state"
key = "ml-infrastructure/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
# Create lock table (one-time setup)
resource "aws_dynamodb_table" "terraform_locks" {
name = "terraform-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
  point_in_time_recovery {
    enabled = true
  }
tags = {
Name = "terraform-locks"
}
}

Destroy and Recreate Pattern for Training Infrastructure
Training clusters should be ephemeral. Use Terraform workspaces or separate stacks:
# Spin up training cluster
terraform workspace new training-job-001
terraform apply -var-file=environments/training.tfvars
# Run training... (external process monitors job)
# Destroy when complete
terraform destroy -auto-approve
terraform workspace delete training-job-001
Monitoring and Observability for IaC Changes
You need visibility into what infrastructure changed and when:
# Enable CloudTrail for audit logging
resource "aws_cloudtrail" "infrastructure_changes" {
name = "${var.cluster_name}-trail"
s3_bucket_name = aws_s3_bucket.cloudtrail_logs.id
include_global_service_events = true
is_multi_region_trail = true
enable_log_file_validation = true
depends_on = [aws_s3_bucket_policy.cloudtrail]
}
# CloudWatch alarm for unexpected infrastructure changes
resource "aws_cloudwatch_log_group" "infrastructure_changes" {
name = "/aws/cloudtrail/${var.cluster_name}-changes"
}
resource "aws_cloudwatch_log_stream" "infrastructure_changes" {
name = "changes"
log_group_name = aws_cloudwatch_log_group.infrastructure_changes.name
}
Practical Example: Complete Training → Serving Pipeline
Let's walk through a realistic end-to-end scenario. You have a PyTorch model trained on distributed GPUs, and you need to deploy it for inference.
Step 1: Define Your Model Config
# model_config.yaml
model:
name: recommendation-transformer
version: v2.1
framework: pytorch
training:
data_source: s3://ml-datasets/recommendations/2025Q1/
batch_size: 128
num_epochs: 50
checkpoint_interval_steps: 1000
serving:
replicas: 3
gpu_per_replica: 1
batch_size: 32
max_latency_ms: 100
infrastructure:
training:
instance_types: [g4dn.2xlarge, g4dn.12xlarge]
capacity_units: 8
spot_max_price: "0.50"
serving:
instance_types: [g4dn.xlarge]
min_replicas: 3
max_replicas: 10
Step 2: Provision Training Infrastructure
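Before touching Terraform, it helps to render the infrastructure.training section of this YAML into a tfvars file so the model config and the infrastructure stay in sync from one source of truth. A minimal sketch — the Terraform variable names (target_capacity, tags) are illustrative, and a dict stands in for yaml.safe_load("model_config.yaml"):

```python
import json

def training_tfvars(config: dict) -> str:
    """Render the infrastructure.training section of model_config.yaml
    as a tfvars JSON string Terraform can consume via -var-file."""
    infra = config["infrastructure"]["training"]
    return json.dumps(
        {
            "instance_types": infra["instance_types"],
            "target_capacity": infra["capacity_units"],
            "spot_max_price": infra["spot_max_price"],
            # Tag every resource with the model identity for cost tracking
            "tags": {
                "model": config["model"]["name"],
                "model_version": config["model"]["version"],
            },
        },
        indent=2,
    )

# Mirrors the YAML above; in practice: config = yaml.safe_load(open("model_config.yaml"))
config = {
    "model": {"name": "recommendation-transformer", "version": "v2.1"},
    "infrastructure": {
        "training": {
            "instance_types": ["g4dn.2xlarge", "g4dn.12xlarge"],
            "capacity_units": 8,
            "spot_max_price": "0.50",
        }
    },
}
print(training_tfvars(config))
```

Write the result to training.auto.tfvars.json and Terraform picks it up automatically; a config change and an infrastructure change are now the same diff.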
Using Terraform:
# Initialize and plan
cd terraform/training
terraform init
terraform plan -var-file=environments/prod.tfvars -out=train.tfplan
# Apply
terraform apply train.tfplan
# Your spot fleet is now live. Submit training job.
python train.py \
--checkpoint_dir s3://recommendation-transformer-checkpoints/ \
--config ../model_config.yaml
Terraform creates:
- S3 bucket for checkpoints (with lifecycle policies)
- EC2 spot fleet across 3 availability zones
- CloudWatch monitoring for instance interruptions
- IAM role with S3 access (least-privilege)
- Security group allowing worker communication
Step 3: Monitor Training, Handle Interruptions
Your user data script listens for spot interruption notices:
# training_wrapper.py (runs on each EC2 instance)
import signal
import sys
import boto3
import torch
checkpoint_interval = 1000
steps = 0
latest_checkpoint_dir = None
def handle_interrupt_signal(signum, frame):
"""Called when spot interruption notice arrives"""
print("Spot interrupt detected. Saving emergency checkpoint...")
torch.save(model.state_dict(), f"/tmp/emergency_ckpt_{steps}.pt")
s3_client = boto3.client("s3")
s3_client.upload_file(
f"/tmp/emergency_ckpt_{steps}.pt",
"recommendation-transformer-checkpoints",
f"emergency_ckpt_{steps}.pt"
)
sys.exit(0)
signal.signal(signal.SIGUSR1, handle_interrupt_signal)
# Training loop
model = load_model(config)
optimizer = setup_optimizer(config)
for epoch in range(config.num_epochs):
for batch in dataloader:
loss = model(batch)
loss.backward()
optimizer.step()
steps += 1
if steps % checkpoint_interval == 0:
# Periodic checkpoint (survives spot interruption)
save_checkpoint(model, optimizer, steps, "s3://...")
print(f"Checkpoint saved at step {steps}")
print("Training complete!")
This way, if a spot instance gets interrupted, you checkpoint and resume on a replacement instance without losing progress.
Step 4: Deploy Serving Infrastructure
With Pulumi, deploying your trained model is straightforward:
# __main__.py
import pulumi
import pulumi_kubernetes as k8s
import yaml
from ml_serving import deploy_model_serving_infrastructure
# Read the trained model config
with open("model_config.yaml") as f:
model_config = yaml.safe_load(f)
# Read cluster kubeconfig (from training Terraform outputs)
kubeconfig_json = open("kubeconfig.json").read()
k8s_provider = k8s.Provider("k8s", kubeconfig=kubeconfig_json)
# Deploy model serving
serving = deploy_model_serving_infrastructure(
model_config_path="model_config.yaml",
k8s_provider=k8s_provider
)
pulumi.export("model_endpoint", serving["service"].status.load_balancer.ingress[0].hostname)
pulumi.export("model_replicas", serving["deployment"].spec.replicas)
Pulumi creates:
- Kubernetes deployment with GPU requests/limits
- Horizontal Pod Autoscaler (based on CPU)
- LoadBalancer service exposing your model
- ConfigMap with model config for model serving container to reference
Step 5: Destroy Training Infrastructure
Once training completes, clean up immediately to avoid costs:
cd terraform/training
terraform destroy -auto-approve
# Verify cleanup
aws ec2 describe-spot-fleet-requests --query 'SpotFleetRequestConfigs[?Status.Code==`cancelled_running`]'
Total cost for training: minutes of spot instance time + checkpoint storage. No idle GPU charges.
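The spin-up/destroy cycle (including the workspace pattern from earlier) is worth scripting so a finished job can never leak a running cluster. A minimal sketch with Python's subprocess — workspace naming and var-file path are illustrative, mirroring the commands above:

```python
import subprocess

def lifecycle_commands(job_id: str, var_file: str) -> list[list[str]]:
    """Ordered terraform commands for one ephemeral training run."""
    ws = f"training-job-{job_id}"
    return [
        ["terraform", "workspace", "new", ws],
        ["terraform", "apply", "-auto-approve", f"-var-file={var_file}"],
        # ...an external process monitors the training job in between...
        ["terraform", "destroy", "-auto-approve", f"-var-file={var_file}"],
        # Terraform refuses to delete the active workspace, so switch first
        ["terraform", "workspace", "select", "default"],
        ["terraform", "workspace", "delete", ws],
    ]

def run_lifecycle(job_id: str, var_file: str) -> None:
    for cmd in lifecycle_commands(job_id, var_file):
        subprocess.run(cmd, check=True)  # fail fast rather than leak resources
```

Wiring run_lifecycle into your job scheduler means cleanup is part of the job definition, not a manual step someone can forget.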
Advanced: Cost Optimization Patterns
Spot Instance Diversification
Why use multiple instance types? Because GPU availability varies. If g4dn.12xlarge is sold out in your region, the spot fleet falls back to g4dn.2xlarge, which has similar characteristics:
variable "instance_types" {
type = list(string)
default = [
"g4dn.12xlarge", # 4x GPU, most cost-effective
"g4dn.2xlarge", # 1x GPU, fallback
"g4dn.xlarge", # 1x GPU, final fallback
]
}
# In spot_fleet.tf, we iterate:
dynamic "overrides" {
for_each = var.instance_types
content {
instance_type = overrides.value
# Terraform applies this override across all AZs
}
}
Result: 99.5% fleet launch success rate even during GPU shortages.
Reserved Capacity for Baseline Load
For serving infrastructure with predictable traffic, mix reserved and on-demand:
# Baseline always-on capacity (reserved, cheaper)
resource "aws_eks_node_group" "gpu_reserved" {
cluster_name = aws_eks_cluster.ml_serving.name
node_group_name = "gpu-reserved"
capacity_type = "ON_DEMAND" # cover with reserved capacity / Savings Plans
scaling_config {
desired_size = 3
max_size = 3
min_size = 3
}
}
# Burst capacity (spot, cheaper)
resource "aws_eks_node_group" "gpu_burst" {
cluster_name = aws_eks_cluster.ml_serving.name
node_group_name = "gpu-burst"
capacity_type = "SPOT"
scaling_config {
desired_size = 0
max_size = 20
min_size = 0
}
}
Pods use preferredDuringSchedulingIgnoredDuringExecution node affinity to favor the reserved baseline nodes; during traffic spikes, the scheduler spills them onto the spot burst nodes. Result: roughly 60% cost savings on serving infrastructure.
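Here's what that affinity looks like, built as the plain dict that Pulumi's Kubernetes SDK (or a rendered manifest) accepts. This sketch assumes EKS managed node groups, which label nodes with eks.amazonaws.com/capacityType set to ON_DEMAND or SPOT; adjust the key if you use custom labels:

```python
def baseline_first_affinity(weight: int = 100) -> dict:
    """Node affinity that prefers the reserved (on-demand) pool; the
    scheduler falls back to spot burst nodes when the baseline is full."""
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,  # soft preference, not a hard requirement
                    "preference": {
                        "matchExpressions": [
                            {
                                "key": "eks.amazonaws.com/capacityType",
                                "operator": "In",
                                "values": ["ON_DEMAND"],
                            }
                        ]
                    },
                }
            ]
        }
    }
```

Pass the dict as the affinity field of the pod spec in your Pulumi Deployment; because the preference is soft, pods still schedule onto spot nodes when the reserved pool is saturated.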
Terraform vs. Pulumi: When to Use Each
| Aspect | Terraform | Pulumi |
|---|---|---|
| Learning Curve | Moderate (HCL syntax) | Lower (Python/Go/TS) |
| ML Integration | Via bash/external scripts | Native (same language) |
| Ecosystem | Largest (thousands of providers and modules) | Growing, excellent AWS support |
| State Management | Excellent (mature, proven) | Excellent (same backend options) |
| Team Skills | Requires HCL expertise | Leverage existing Python skills |
| GPU Infrastructure | Fully supported, battle-tested | Fully supported, expanding |
| Cost | Free/open-source | Free tier + commercial support |
| Organization | Separate infra/ML teams | ML-native, polyglot teams |
| Dynamic Resources | Limited (for_each, count loops) | Full programming language |
Use Terraform if:
- Your organization has dedicated infrastructure engineers
- You need multi-cloud (AWS, GCP, Azure) consistency
- You prefer declarative, static infrastructure
- Your team has HCL expertise
Use Pulumi if:
- Your ML team writes and deploys infrastructure
- You want Python, Go, or TypeScript everywhere
- You need dynamic resource generation from configs
- You plan to grow from ML→MLOps→DevOps
For most ML teams, start with Terraform for training (proven, simple) and Pulumi for serving (dynamic, Pythonic).
The deeper question both tools answer is organizational: who owns infrastructure? Historically, operations and infrastructure teams owned infrastructure, and developers requested resources. In modern ML-heavy organizations, data scientists and ML engineers own their infrastructure. They write it, test it, deploy it, debug it. This shift from "infrastructure as a specialized function" to "infrastructure as a team skill" is fundamental. Terraform encourages the old model - you write HCL, someone reviews it, someone deploys it. Pulumi enables the new model - you write Python alongside your training scripts, you can unit test infrastructure just like you test models, you can version and review infrastructure with the same tooling as code. The best choice depends on which model your organization is adopting.
It's also worth noting that these tools aren't mutually exclusive. Large organizations often use both - Terraform for shared infrastructure (networking, cluster setup, IAM policies) managed centrally, and Pulumi for workload-specific infrastructure (model serving stacks, experiment resources) managed by teams. You can compose Terraform outputs into Pulumi programs, creating a hybrid model that balances central governance with team autonomy.
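The lightest-weight bridge between the two is reading `terraform output -json` from the Pulumi side (Pulumi also ships a remote-state reference provider, but shelling out works anywhere). A sketch — the stack directory and output names are illustrative:

```python
import json
import subprocess

def parse_outputs(raw: str) -> dict:
    """Flatten `terraform output -json` ({"name": {"value": ...}}) to plain values."""
    return {name: meta["value"] for name, meta in json.loads(raw).items()}

def terraform_outputs(workdir: str) -> dict:
    """Read outputs from a Terraform-managed stack for use in a Pulumi program."""
    result = subprocess.run(
        ["terraform", f"-chdir={workdir}", "output", "-json"],
        check=True, capture_output=True, text=True,
    )
    return parse_outputs(result.stdout)

# e.g. in __main__.py:
#   cluster_name = terraform_outputs("terraform/cluster")["cluster_name"]
```

The central team publishes outputs (cluster name, subnet IDs, IAM role ARNs); workload teams consume them without touching the shared state.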
Practical Workflow: From Experiment to Production
Here's how this works end-to-end:
- Experiment Phase: Data scientist runs training locally or on a small Terraform-managed spot cluster.
- Dev Deployment: Push to dev, Pulumi spins up a test serving infrastructure and deploys the model.
- Staging Phase: Same model config, different Pulumi stack, creates staging infrastructure with load testing.
- Production Rollout: Canary deployment using Kubernetes, auto-scaling based on real traffic.
- Monitoring: CloudWatch and drift detection catch configuration changes.
- Cleanup: Terraform destroys training infrastructure after job completes, saving costs.
The key insight: infrastructure and models should evolve together. Store your infrastructure code alongside model code. Version both. Test both. A model trained on outdated infrastructure config will fail at deployment time.
This workflow sidesteps a common pain point: the disconnect between "how we trained it" and "how we deploy it." You train locally with 8 GPUs, PyTorch DataLoader, certain batch sizes. Then you deploy to a containerized serving environment with different hardware and different inference patterns. Surprise - it doesn't work as expected. Or worse, it works but is embarrassingly slow. The root cause is always that training and serving infrastructure diverged.
When your infrastructure lives in code alongside your model code, they're bound together. You can't accidentally train a model that's incompatible with serving infrastructure because they're declared together, versioned together, reviewed together. A PR that changes the serving infrastructure includes the code that interacts with it. Reviewers see both sides. This catches integration bugs early, before they cause production incidents.
The other benefit is reproducibility. A year from now, you need to retrain the model. You check out the infrastructure code from that version, spin up the identical training environment, retrain. You get the same hardware, same setup, same behavior. You can't "accidentally" upgrade a dependency or use different GPU hardware and wonder why the results shifted. This is invaluable for model governance and auditing.
Checklist: IaC Maturity for ML
Before calling your ML infrastructure "production-ready," validate:
- State Management: Remote backend (S3 + DynamoDB), encrypted, daily backups
- Drift Detection: Automated checks every 6 hours, Slack alerts
- Least Privilege: IAM roles with minimal permissions (S3, EC2, EKS only)
- Spot Handling: Graceful shutdown on interruption, checkpoint save, auto-resume
- Multi-AZ: Training and serving spread across 3+ AZs
- Monitoring: CloudWatch dashboards, alarm on failed training, service latency tracking
- Cost Tracking: Per-environment cost tags, monthly reports
- Disaster Recovery: Can you recreate production in 1 hour from code? (You should)
- Documentation: README explaining how to deploy, scale, and destroy
- Testing: Dry-run terraform plan in CI; validate Pulumi stacks in staging before prod
Summary
Infrastructure as Code for ML isn't a nice-to-have - it's essential for reproducibility, cost control, and peace of mind. Terraform provides battle-tested, multi-cloud support with fine-grained resource control and a massive ecosystem. Pulumi brings flexibility and language familiarity that resonates with ML teams who live in Python.
Both tools handle the unique challenges of training (ephemeral, GPU-heavy, interrupt-tolerant) and serving (persistent, auto-scaling, low-latency) infrastructure elegantly. The choice depends on your team's structure and skills, not the technical capabilities of the tools.
Start with either tool, but commit to one. Build reusable modules or ComponentResources. Automate drift detection. Lock your state. Treat infrastructure as you treat code - version it, test it, review it before deploying.
Your 3 AM pages and surprise cloud bills will become someone else's problem.
The Organizational Shift Enabled by IaC
Infrastructure as Code fundamentally changes how organizations scale. In the old model, you had infrastructure specialists who understood networking, storage, and compute. When you needed resources, you submitted a ticket and waited. This created bottlenecks. Your team wanted to spin up a GPU cluster for an experiment but had to wait three weeks for approval and provisioning. By then, you'd moved on to something else.
With IaC, teams become self-sufficient. Your data scientist writes Python training code and Python infrastructure code. They submit a PR. It gets reviewed. They apply it. Cluster is up in minutes. Experiment runs. Cluster gets deleted. Done. No specialists required. This doesn't mean infrastructure becomes unimportant - quite the opposite. Good infrastructure patterns become codified and shared. Your ML team writes a Pulumi component that encapsulates best practices for training clusters. Everyone uses that component. Infrastructure improvements propagate across all teams automatically.
This democratization comes with risks. People write bad infrastructure code. Someone leaks credentials into state files. Someone creates a security group that opens port 22 to 0.0.0.0/0. Peer review and linting help, but you also need cultural investment. Your team needs to understand that infrastructure has security and cost implications. A single thoughtless configuration decision could cost thousands or expose your data.
Building Infrastructure Abstractions
The most mature organizations don't have every team writing base infrastructure from scratch. They build abstractions. In Terraform, you build modules that encapsulate complexity. In Pulumi, you build ComponentResources. For ML workloads specifically, you might have components like TrainingCluster, ServingStack, FeatureStore, NotebookEnvironment. Teams instantiate these with a few parameters and get fully functional infrastructure.
Building good abstractions requires understanding patterns deeply. You notice that every training cluster needs spot instance handling with graceful shutdown. Every one needs monitoring and alerting. Every one needs cost tracking. So you build a component that includes all of this. Teams don't need to reimplement or understand the details. They just use the component.
Abstractions also serve a governance function. You can enforce standards at the component level. All training clusters must have encryption at rest. All serving stacks must be multi-AZ. All state must be backed up daily. These become requirements of the component, not guidelines that teams might or might not follow.
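Concretely, those requirements can live as validation inside the component itself, so no resource is ever created from non-compliant arguments. A sketch of the kind of guardrails a hypothetical TrainingCluster component might run before provisioning - the thresholds and tag names are illustrative:

```python
REQUIRED_TAGS = {"team", "cost-center", "model"}

def validate_cluster_args(args: dict) -> list[str]:
    """Return governance violations; a component would raise before
    creating any resources if this list is non-empty."""
    violations = []
    if not args.get("encrypt_at_rest", False):
        violations.append("encryption at rest must be enabled")
    if len(args.get("availability_zones", [])) < 2:
        violations.append("clusters must span at least 2 AZs")
    missing = REQUIRED_TAGS - set(args.get("tags", {}))
    if missing:
        violations.append(f"missing required tags: {sorted(missing)}")
    return violations
```

Because the check runs inside the component, "all clusters are encrypted and multi-AZ" stops being a wiki guideline and becomes a precondition that fails the deploy.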
Evolving Your IaC as You Scale
When you're a five-person ML team, your infrastructure is simple. A small GPU cluster for training, maybe a managed service for serving. You probably don't even need IaC - manual clicking is fine. But once you're a fifty-person team running hundreds of models, manual infrastructure becomes impossible. You need systematic approaches.
The evolution is natural. First you write Terraform to describe your current setup. You run plan and apply manually. Over time, you integrate it into CI/CD. You write tests. You implement drift detection. You set up cost alerts. You create reusable modules. You train your team on infrastructure best practices. The system grows with you.
But there's a maturity trap. Mature organizations sometimes develop such sophisticated infrastructure that only specialists understand it. A junior engineer can't add a simple feature because the infrastructure code is too complex. The abstraction becomes leaky. You need regular refactoring to keep abstractions clean and accessible.
The Cost Impact of IaC
Good infrastructure code directly impacts your cloud bill. By using spot instances intelligently, you cut training costs 70%. By auto-scaling serving infrastructure properly, you reduce idle capacity. By tracking costs with tags and automated reporting, you catch waste early. By implementing drift detection, you catch expensive misconfigurations before they become month-long problems.
Conversely, bad infrastructure code is expensive. Someone hardcodes 20 on-demand instances when 3 would suffice because they didn't understand resource sizing. Someone leaves a development cluster running 24/7 when it should be ephemeral. Someone replicates infrastructure across three regions when only one is needed. These mistakes compound.
The cost savings from good IaC often exceed the engineering investment within months. That's why mature organizations are willing to invest heavily in infrastructure tooling and training.
Choosing Between Terraform and Pulumi: A Practical Guide
The choice between Terraform and Pulumi comes down to your organization's structure and philosophy. If you have dedicated infrastructure engineers who write infrastructure separately from application engineers, Terraform's declarative model is ideal. It enforces a clear separation of concerns. Infrastructure code stays in a separate repository. Application teams reference infrastructure. Infrastructure teams own reliability and security.
If your organization believes ML engineers should own their infrastructure end-to-end, Pulumi is the better fit. ML engineers already write Python. Having them write infrastructure in the same language eliminates cognitive overhead. They can share utilities and libraries between application and infrastructure code. They can test infrastructure the same way they test models. This model scales better as your organization grows because you avoid infrastructure bottlenecks.
Hybrid approaches work too. Core infrastructure (networking, Kubernetes clusters, IAM) is managed by infrastructure engineers in Terraform. Application-specific infrastructure (model serving stacks, feature pipelines) is managed by ML teams in Pulumi. Pulumi consumes Terraform outputs. You get the consistency of Terraform with the flexibility of Pulumi.