Infrastructure as Code for ML - Terraform and Pulumi Patterns
You've built an amazing ML model. Now comes the hard part: deploying it at scale without losing your mind to manual infrastructure management. That's where Infrastructure as Code (IaC) enters the picture, and honestly, it's non-negotiable for serious ML workloads. Let's explore how Terraform and Pulumi approach the unique challenges of ML infrastructure, and which patterns actually work in production.
Table of Contents
- The ML Infrastructure Challenge
- ML-Specific IaC Requirements
- GPU Resource Management
- Training vs. Serving Infrastructure
- Auto-scaling for Inference
- Kubernetes GPU Node Pools
- State and Drift Management
- Terraform for ML Infrastructure: Architecture Overview
- Terraform Module for ML Training Infrastructure
- Directory Structure
- Core Module: main.tf
- Spot Fleet Configuration: spot_fleet.tf
- User Data Script: user_data.sh
- Variables: variables.tf
- Kubernetes GPU Node Pools with Terraform
- EKS Cluster Module
- Pulumi with Python SDK: Same Language as Your ML Code
- Pulumi Project Structure
- Pulumi Configuration: __main__.py
- Dynamic Resource Creation from Config: ml_serving.py
- ComponentResource for Reusable Patterns: utils.py
- ML Infrastructure Drift Detection and Management
- Terraform Drift Detection in CI
- State Locking with DynamoDB
- Destroy and Recreate Pattern for Training Infrastructure
- Monitoring and Observability for IaC Changes
- Practical Example: Complete Training → Serving Pipeline
- Step 1: Define Your Model Config
- Step 2: Provision Training Infrastructure
- Step 3: Monitor Training, Handle Interruptions
- Step 4: Deploy Serving Infrastructure
- Step 5: Destroy Training Infrastructure
- Advanced: Cost Optimization Patterns
- Spot Instance Diversification
- Reserved Capacity for Baseline Load
- Terraform vs. Pulumi: When to Use Each
- Practical Workflow: From Experiment to Production
- Checklist: IaC Maturity for ML
- Summary
- The Organizational Shift Enabled by IaC
- Building Infrastructure Abstractions
- Evolving Your IaC as You Scale
- The Cost Impact of IaC
- Choosing Between Terraform and Pulumi: A Practical Guide
The ML Infrastructure Challenge
Here's the problem: traditional IaC tools were built for web services. Your ML infrastructure has very different demands. You need ephemeral training clusters that spin up, consume expensive GPU resources, then disappear without a trace. You need persistent serving infrastructure that auto-scales based on inference demand. You need spot instances to save costs, but with graceful handling of interruptions. You need GPU node pools in Kubernetes, managed resource quotas, and drift detection to catch unplanned changes before they cause your training job to mysteriously fail at 3 AM.
Standard IaC patterns don't cut it. You need infrastructure that thinks about ML-specific concerns: checkpoint management, distributed training coordination, model versioning, and cost optimization.
This is where the gap between generic IaC and ML-aware IaC becomes painfully clear. A Terraform configuration that works beautifully for deploying web services - declaring stateless APIs, databases, load balancers - breaks down when you introduce the complexities of ML workloads. Web services are largely stateless; a crashed instance is replaced with a new one running identical code. ML training is stateful in ways that matter; a crashed GPU in the middle of a 40-hour training run doesn't just mean "restart from the beginning." It means losing all intermediate progress, potentially hours of wasted compute. Your infrastructure code needs to account for this by managing checkpoints, handling spot instance interruptions gracefully, and coordinating distributed training across multiple nodes where any single failure requires recovery logic.
The other challenge is the diversity of ML workloads. You're not deploying one type of service; you're deploying many. Some jobs are training jobs that run once and finish. Some are serving jobs that run forever. Some are periodic batch jobs that process data on a schedule. Some are interactive experiments that scientists run ad hoc. Each pattern demands different infrastructure semantics. Static web app infrastructure assumes one deployment per version. ML infrastructure assumes multiple concurrent workloads with different resource requirements, lifespans, and failure modes. This is why the best ML infrastructure teams abstract their IaC tooling behind domain-specific orchestrators (like Kubeflow, Ray, or SageMaker) rather than exposing raw Terraform or Pulumi to data scientists. The abstraction layer hides complexity while maintaining flexibility.
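To make the checkpoint-and-recover pattern concrete, here's a hedged sketch of the training-side half of that logic: a loop that saves a checkpoint when an external watcher (for example, a spot-interruption monitor) sends SIGUSR1. The class and file names are illustrative, not any framework's API.

```python
import json
import os
import signal


class CheckpointingTrainer:
    """Minimal trainer that checkpoints on SIGUSR1 (illustrative only)."""

    def __init__(self, checkpoint_dir):
        self.checkpoint_dir = checkpoint_dir
        self.step = 0
        self.interrupted = False
        # Must be registered from the main thread, before training begins.
        signal.signal(signal.SIGUSR1, self._on_interrupt)

    def _on_interrupt(self, signum, frame):
        # Just set a flag; the loop checkpoints at the next safe point.
        self.interrupted = True

    def save_checkpoint(self):
        path = os.path.join(self.checkpoint_dir, f"step-{self.step}.json")
        with open(path, "w") as f:
            json.dump({"step": self.step}, f)
        return path

    def train(self, total_steps):
        while self.step < total_steps:
            if self.interrupted:
                break  # stop cleanly; a replacement instance resumes later
            self.step += 1  # stand-in for one optimizer step
        return self.save_checkpoint()
```

A new instance can then read the latest checkpoint file and resume from the recorded step instead of restarting the whole run.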
ML-Specific IaC Requirements
Before we dive into Terraform and Pulumi, let's establish what "good" looks like for ML infrastructure.
GPU Resource Management
You can't just request GPUs like regular compute. You need quota management, instance type selection (NVIDIA A100 vs H100 vs cheaper options), and awareness of spot instance pricing and availability. Your infrastructure code should declare what you need, and the IaC tool should handle availability zone diversity and fallback options.
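As a concrete illustration, here's a small hypothetical helper (not any cloud SDK) that orders candidate GPU instance types by per-GPU spot price under a budget, giving a fleet request a natural fallback order. The prices and field names are made up for the example.

```python
def rank_instance_choices(candidates, max_price_per_gpu_hour):
    """Order GPU instance candidates for a spot request.

    `candidates` maps instance type -> dict with hypothetical fields
    `spot_price` (USD/hour) and `gpus`. Returns the types that fit the
    per-GPU budget, cheapest per GPU first, so a fleet can diversify
    across them and fall back down the list on capacity shortages.
    """
    affordable = [
        (name, info["spot_price"] / info["gpus"])
        for name, info in candidates.items()
        if info["spot_price"] / info["gpus"] <= max_price_per_gpu_hour
    ]
    return [name for name, _ in sorted(affordable, key=lambda pair: pair[1])]
```

The resulting list maps directly onto the instance-type override lists that spot fleet requests accept.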
Training vs. Serving Infrastructure
Training clusters are fundamentally ephemeral. A training job runs for hours or days, then you're done. You want to destroy everything afterward to avoid surprise costs. Serving infrastructure is persistent - models serve traffic 24/7, so you need reliable, auto-scaling infrastructure with state management, load balancing, and canary deployments.
The same IaC tool needs to handle both patterns elegantly without requiring completely different approaches.
Auto-scaling for Inference
Web services scale on CPU and request latency. ML models scale on inference request volume, but with important differences. A single request might consume multiple GPUs. Queue depth matters more than request rate. You need predictable, fast scaling to handle traffic spikes without dropping requests or burning money on over-provisioning.
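That queue-depth policy can be sketched as arithmetic: size the fleet to drain the current queue within a target window, clamped to min/max bounds. The function name and thresholds are illustrative, not a real autoscaler API.

```python
import math


def desired_replicas(queue_depth, per_replica_rps, min_replicas, max_replicas,
                     target_drain_seconds=30):
    """Size an inference fleet from queue depth rather than CPU load.

    Hypothetical policy: provision enough replicas to drain the queue
    within `target_drain_seconds`, never scaling below `min_replicas`
    (baseline latency) or above `max_replicas` (cost ceiling).
    """
    needed = math.ceil(queue_depth / (per_replica_rps * target_drain_seconds))
    return max(min_replicas, min(max_replicas, needed))
```

In Kubernetes terms, this is the calculation a custom-metrics HPA or an external autoscaler would perform on each evaluation tick.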
Kubernetes GPU Node Pools
Most modern ML serving happens on Kubernetes. You need your IaC to:
- Create GPU node groups with specific instance types
- Apply GPU taints so non-GPU workloads don't schedule there
- Install the NVIDIA device plugin
- Configure auto-scaling with tools like Karpenter
- Manage pod requests and limits
State and Drift Management
Unlike stateless web apps, ML infrastructure often has state - training checkpoints in S3, model metadata in DynamoDB, experiment tracking in specialized services. Your IaC needs to handle state locking to prevent concurrent modifications, drift detection to catch manual changes, and safe destroy/recreate patterns for stateless training infrastructure.
Terraform for ML Infrastructure: Architecture Overview
Terraform is the industry standard for infrastructure as code, and for good reason. It has been around since 2014 and has accumulated deep integrations with cloud providers. For AWS, which dominates ML infrastructure deployment, Terraform's AWS provider is remarkably comprehensive - thousands of resources, hundreds of data sources, and a mature community. The declarative language (HCL) reads like structured configuration, which makes it accessible to non-experts while remaining powerful enough for complex orchestration.
The key advantage of Terraform is its state management. It tracks what infrastructure exists, what you've declared, and what diffs need to apply. This gives you safety - you can preview changes before applying them, and you have a durable record of your infrastructure. The downside is that state becomes your source of truth, not your code. If your state file gets corrupted, you're in for a bad time. If your state becomes out of sync with actual infrastructure (drift), Terraform's error messages become cryptic. This is manageable at small scale, but at large scale, state management becomes a serious operational concern.
For ML infrastructure specifically, Terraform's module system is what makes it work. You don't write monolithic 2000-line Terraform files. You write modular, composable modules - one for VPC, one for Kubernetes cluster, one for GPU node pools, one for IAM policies. Your "main" configuration then orchestrates these modules. This composition pattern scales well. You can version modules separately, share them across projects, and test them independently.
The challenge is that Terraform is verbose. A simple "create a GPU instance pool with auto-scaling" requires writing policy documents, security group rules, IAM roles with trust relationships, and launch templates. There's a lot of boilerplate that feels mechanical. This is where Pulumi offers a different angle: it lets you write infrastructure as real code (Python, Go, Node.js), which means you can use loops, conditionals, functions, and libraries - all the tools you use when writing applications.
┌─────────────────────────────────────────────────────────┐
│ ML Workload Definition │
│ (experiment_config.yaml, model_config.json) │
└────────┬────────────────────────────────────────────────┘
│
┌────────▼────────────────────────────────────────────────┐
│ Terraform Root Module (main.tf) │
│ • Data sources (current AWS account, AZs) │
│ • Local variables (instance types, spot config) │
│ • Module composition │
└────────┬────────────────────────────────────────────────┘
│
┌────────┴──────────────────────────────────────────────────┐
│ │
├─────────────────────────┬────────────────────┬─────────────┤
│ VPC Module              │ IAM Module         │ EKS Module  │
│ • Subnets               │ • Role policies    │ • Cluster   │
│ • Security groups       │ • Service accts    │ • Node grps │
│ • NAT gateways          │ • Least privilege  │ • RBAC      │
├─────────────────────────┼────────────────────┼─────────────┤
│ Spot Fleet Module       │ S3/Checkpoint      │ Monitoring  │
│ • Launch template       │ • Bucket config    │ • CloudWatch│
│ • Interrupt handling    │ • Lifecycle rules  │ • Alerting  │
│ • Cost optimization     │ • Encryption       │             │
└─────────────────────────┴────────────────────┴─────────────┘
Terraform's strength for ML lies in its massive AWS provider support and mature ecosystem. You get fine-grained control over every resource. The downside? Lots of boilerplate, and you're writing declarative code that feels distant from your Python-based ML work.
Terraform Module for ML Training Infrastructure
Let's build a practical Terraform module for ML training. This module manages EC2 spot instances with GPU support, including interrupt handling and checkpoint management.
Directory Structure
terraform/
├── modules/
│ └── ml_training_cluster/
│ ├── main.tf
│ ├── variables.tf
│ ├── outputs.tf
│ ├── iam.tf
│ ├── networking.tf
│ └── spot_fleet.tf
├── environments/
│ ├── dev.tfvars
│ ├── staging.tfvars
│ └── prod.tfvars
└── main.tf
Core Module: main.tf
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
provider "aws" {
region = var.aws_region
}
# Fetch available AZs for instance diversity
data "aws_availability_zones" "available" {
state = "available"
}
# Data source for Ubuntu AMI with GPU drivers pre-installed
data "aws_ami" "gpu_optimized" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["Deep Learning AMI GPU CUDA 12 (Ubuntu 22.04)*"]
}
}
# VPC and networking
module "networking" {
source = "./networking"
vpc_name = "${var.cluster_name}-vpc"
vpc_cidr = var.vpc_cidr
availability_zones = data.aws_availability_zones.available.names
enable_nat_gateway = true
}
# IAM roles for EC2 instances
module "iam" {
source = "./iam"
cluster_name = var.cluster_name
checkpoint_bucket = aws_s3_bucket.checkpoints.id
}
# S3 bucket for training checkpoints
resource "aws_s3_bucket" "checkpoints" {
bucket = "${var.cluster_name}-checkpoints-${data.aws_caller_identity.current.account_id}"
tags = {
Name = "${var.cluster_name}-checkpoints"
Environment = var.environment
}
}
resource "aws_s3_bucket_lifecycle_configuration" "checkpoints" {
bucket = aws_s3_bucket.checkpoints.id
rule {
id = "cleanup-old-checkpoints"
status = "Enabled"
expiration {
days = var.checkpoint_retention_days
}
}
}
# Spot Fleet for training cluster
module "spot_fleet" {
source = "./spot_fleet"
cluster_name = var.cluster_name
vpc_id = module.networking.vpc_id
subnet_ids = module.networking.private_subnet_ids
security_group_id = module.networking.training_security_group_id
instance_profile_arn = module.iam.instance_profile_arn
ami_id = data.aws_ami.gpu_optimized.id
instance_types = var.instance_types
capacity_units_target = var.capacity_units_target
max_price = var.spot_max_price
  availability_zones    = data.aws_availability_zones.available.names
  checkpoint_bucket     = aws_s3_bucket.checkpoints.id
}
data "aws_caller_identity" "current" {}
output "checkpoint_bucket_name" {
value = aws_s3_bucket.checkpoints.id
}
output "spot_fleet_id" {
value = module.spot_fleet.fleet_id
}

Spot Fleet Configuration: spot_fleet.tf
# Launch template for GPU instances
resource "aws_launch_template" "training" {
name_prefix = "${var.cluster_name}-"
image_id = var.ami_id
instance_type = var.instance_types[0]
vpc_security_group_ids = [var.security_group_id]
iam_instance_profile {
arn = var.instance_profile_arn
}
# User data: install training tools and setup graceful shutdown
user_data = base64encode(templatefile("${path.module}/user_data.sh", {
checkpoint_bucket = var.checkpoint_bucket
region = data.aws_region.current.name
}))
block_device_mappings {
device_name = "/dev/sda1"
ebs {
volume_size = 100
volume_type = "gp3"
iops = 3000
throughput = 125
delete_on_termination = true
encrypted = true
}
}
monitoring {
enabled = true
}
tag_specifications {
resource_type = "instance"
tags = {
Name = "${var.cluster_name}-training"
ClusterName = var.cluster_name
}
}
lifecycle {
create_before_destroy = true
}
}
# Spot Fleet Request with diversification strategy
resource "aws_ec2_fleet" "training" {
launch_template_config {
launch_template_specification {
launch_template_id = aws_launch_template.training.id
version = "$Latest"
}
    # Diversify across instance types and AZs to minimize interruption impact.
    # Each override is a single (instance_type, availability_zone) pair, so
    # generate the cross product rather than nesting dynamic blocks.
    dynamic "overrides" {
      for_each = setproduct(var.instance_types, var.availability_zones)
      content {
        instance_type     = overrides.value[0]
        availability_zone = overrides.value[1]
      }
    }
}
type = "maintain"
excess_capacity_termination_policy = "termination"
target_capacity_specification {
total_target_capacity = var.capacity_units_target
on_demand_target_capacity = 0 # Use 100% spot for cost optimization
spot_target_capacity = var.capacity_units_target
}
spot_options {
allocation_strategy = "price-capacity-optimized"
instance_interruption_behavior = "terminate"
maintenance_strategies {
capacity_rebalance {
replacement_strategy = "launch"
}
}
}
tags = {
Name = "${var.cluster_name}-fleet"
}
}
data "aws_region" "current" {}

User Data Script: user_data.sh
#!/bin/bash
set -e
# Setup CloudWatch agent for monitoring
amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/config.json
# Install training dependencies
pip install -U pip setuptools wheel
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install pytorch-lightning wandb
# Setup graceful shutdown handler
cat > /opt/graceful_shutdown.sh << 'EOF'
#!/bin/bash
# Poll for the EC2 spot interruption notice (two-minute warning).
# The metadata endpoint returns 404 until an interruption is scheduled,
# so check the HTTP status rather than grepping the body. (If IMDSv2 is
# enforced, a session token header is also required.)
while true; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://169.254.169.254/latest/meta-data/spot/instance-action)
  if [ "$STATUS" = "200" ]; then
    echo "Spot interruption detected. Saving checkpoint..."
    # Trigger checkpoint save signal to training process
    pkill -SIGUSR1 python || true
    sleep 110 # Use most of the two-minute warning for a graceful save
    break
  fi
  sleep 5
done
EOF
chmod +x /opt/graceful_shutdown.sh
nohup /opt/graceful_shutdown.sh &
echo "Training instance ready"

Variables: variables.tf
variable "aws_region" {
description = "AWS region"
type = string
default = "us-east-1"
}
variable "cluster_name" {
description = "Name of the training cluster"
type = string
}
variable "environment" {
description = "Environment (dev, staging, prod)"
type = string
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "Environment must be dev, staging, or prod."
}
}
variable "vpc_cidr" {
description = "CIDR block for VPC"
type = string
default = "10.0.0.0/16"
}
variable "instance_types" {
description = "GPU instance types to use (e.g., g4dn.xlarge, g4dn.2xlarge)"
type = list(string)
default = ["g4dn.xlarge", "g4dn.2xlarge", "g4dn.12xlarge"]
}
variable "capacity_units_target" {
description = "Target capacity in capacity units (vCPU-based)"
type = number
default = 8
}
variable "spot_max_price" {
description = "Maximum spot price per vCPU-hour"
type = string
default = "0.50"
}
variable "checkpoint_retention_days" {
description = "Days to retain checkpoints in S3"
type = number
default = 30
}

Kubernetes GPU Node Pools with Terraform
Now let's manage EKS with GPU node pools using Terraform. This is where you deploy serving infrastructure.
EKS Cluster Module
# EKS Cluster
resource "aws_eks_cluster" "ml_serving" {
name = var.cluster_name
version = var.kubernetes_version
role_arn = aws_iam_role.eks_cluster_role.arn
vpc_config {
subnet_ids = concat(module.networking.public_subnet_ids, module.networking.private_subnet_ids)
endpoint_private_access = true
endpoint_public_access = true
}
depends_on = [aws_iam_role_policy_attachment.eks_cluster_policy]
}
# GPU Node Group with Auto Scaling
resource "aws_eks_node_group" "gpu" {
cluster_name = aws_eks_cluster.ml_serving.name
node_group_name = "${var.cluster_name}-gpu-nodes"
node_role_arn = aws_iam_role.eks_node_role.arn
subnet_ids = module.networking.private_subnet_ids
version = var.kubernetes_version
scaling_config {
desired_size = var.desired_size
max_size = var.max_size
min_size = var.min_size
}
instance_types = var.gpu_instance_types
# GPU-specific launch template
launch_template {
id = aws_launch_template.gpu_nodes.id
    version = aws_launch_template.gpu_nodes.latest_version
}
  # GPU taint to prevent non-GPU workloads from scheduling
  taint {
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"
  }
tags = {
"NodeType" = "gpu"
}
depends_on = [
aws_iam_role_policy_attachment.eks_node_policy,
aws_iam_role_policy_attachment.eks_cni_policy,
]
}
# Launch template with GPU-specific settings
resource "aws_launch_template" "gpu_nodes" {
name_prefix = "${var.cluster_name}-gpu-"
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = 100
volume_type = "gp3"
delete_on_termination = true
encrypted = true
}
}
monitoring {
enabled = true
}
}
# NVIDIA Device Plugin via Helm
resource "helm_release" "nvidia_device_plugin" {
name = "nvidia-device-plugin"
repository = "https://nvidia.github.io/k8s-device-plugin"
chart = "nvidia-device-plugin"
namespace = "kube-system"
set {
name = "nodeSelector.nvidia\\.com/gpu"
value = "true"
}
depends_on = [aws_eks_cluster.ml_serving]
}
# Karpenter for intelligent GPU auto-scaling
resource "helm_release" "karpenter" {
namespace = "karpenter"
create_namespace = true
name = "karpenter"
repository = "oci://public.ecr.aws/karpenter"
chart = "karpenter"
set {
name = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
value = aws_iam_role.karpenter.arn
}
set {
name = "settings.aws.clusterName"
value = aws_eks_cluster.ml_serving.name
}
}
output "cluster_endpoint" {
value = aws_eks_cluster.ml_serving.endpoint
}
output "cluster_name" {
value = aws_eks_cluster.ml_serving.name
}

Pulumi with Python SDK: Same Language as Your ML Code
Here's where things get interesting. Pulumi lets you write infrastructure in Python, Go, or TypeScript. For ML teams, this is huge - your infrastructure code lives in the same language as your model training scripts.
Pulumi's fundamental difference from Terraform is the programming model. Terraform is declarative HCL - you describe the final state you want, and Terraform calculates the diff. Pulumi lets you express that desired state in a general-purpose language: your program runs imperatively, but its output is still a declarative resource graph that Pulumi diffs against its state. This might sound like a small difference, but it cascades into dramatic consequences for complexity and flexibility.
Consider a simple scenario: you want to create a different number of GPU nodes depending on the environment. With Terraform, you write conditional logic in HCL - doable, but HCL's conditionals feel bolted-on. With Pulumi in Python, you write a loop: for i in range(num_nodes): create_gpu_node(...). You have the full power of Python - functions, classes, imports, libraries. You can fetch the number of nodes from an external API. You can compute it based on config. You can even make it data-driven, pulling requirements from a YAML file and generating infrastructure.
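The data-driven pattern described above can be sketched without any cloud calls: expand a per-environment config into node specs, then feed each spec to a resource constructor. The config shape and names here are hypothetical; in a real Pulumi program each returned spec would parameterize something like `aws.ec2.Instance(spec["name"], ...)`.

```python
def plan_gpu_nodes(environment, config):
    """Expand an environment entry into per-node specs.

    `config` is a plain dict here, but it could equally be loaded from
    YAML or fetched from an API - that's the point of writing
    infrastructure in a general-purpose language.
    """
    env = config[environment]
    return [
        {
            "name": f"gpu-node-{environment}-{i}",
            "instance_type": env["instance_type"],
            # Default to spot instances unless the environment opts out.
            "spot": env.get("spot", True),
        }
        for i in range(env["count"])
    ]
```

Because it's ordinary Python, this planning logic can be unit-tested in isolation, before any `pulumi up` touches real infrastructure.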
This flexibility is powerful for ML infrastructure because ML requirements change rapidly. You're not deploying a static web service once. You're deploying diverse workloads with different resource requirements, conducting experiments, scaling based on demand. Pulumi's model matches this reality better than Terraform's declarative approach. You can version control your infrastructure, review it in PRs, and deploy it with CI/CD, just like you do with application code.
The downside is that Pulumi requires more infrastructure expertise. You're writing real code, which means you can write bad code. With Terraform, you're constrained by the language to express infrastructure concerns. With Pulumi, you could in theory do anything Python allows, which is both a feature (flexibility) and a bug (complexity). Teams that get Pulumi right build powerful, DRY infrastructure libraries. Teams that get it wrong end up with infrastructure code that's harder to maintain than application code because it touches more surfaces and has more failure modes.
Pulumi Project Structure
ml-infrastructure/
├── Pulumi.yaml
├── Pulumi.dev.yaml
├── Pulumi.prod.yaml
├── __main__.py
├── ml_training.py
├── ml_serving.py
└── utils.py
Pulumi Configuration: __main__.py
import json
import pulumi
import pulumi_aws as aws
import pulumi_eks as eks
import pulumi_kubernetes as k8s
# Load config from context
config = pulumi.Config()
cluster_name = config.get("cluster_name") or "ml-serving"
environment = pulumi.get_stack()
region = config.get("region") or "us-east-1"
# Create VPC
vpc = aws.ec2.Vpc(f"{cluster_name}-vpc",
cidr_block="10.0.0.0/16",
enable_dns_hostnames=True,
)
# Subnets across AZs
availability_zones = aws.get_availability_zones(state="available").names
subnet_ids = []
for i, az in enumerate(availability_zones[:3]): # 3 AZs
subnet = aws.ec2.Subnet(f"{cluster_name}-subnet-{i}",
vpc_id=vpc.id,
cidr_block=f"10.0.{i}.0/24",
availability_zone=az,
)
subnet_ids.append(subnet.id)
# Security group for cluster
security_group = aws.ec2.SecurityGroup(f"{cluster_name}-sg",
vpc_id=vpc.id,
ingress=[
aws.ec2.SecurityGroupIngressArgs(
protocol="tcp",
from_port=443,
to_port=443,
cidr_blocks=["0.0.0.0/0"],
),
],
)
# Create EKS cluster
eks_cluster = eks.Cluster(f"{cluster_name}-cluster",
version="1.28",
vpc_id=vpc.id,
subnet_ids=subnet_ids,
endpoint_private_access=True,
endpoint_public_access=True,
)
# GPU Node group
gpu_nodegroup = eks.NodeGroup(f"{cluster_name}-gpu-nodegroup",
cluster=eks_cluster,
node_role_arn=aws.iam.Role(f"{cluster_name}-node-role",
assume_role_policy=json.dumps({
"Version": "2012-10-17",
"Statement": [{
"Action": "sts:AssumeRole",
"Principal": {"Service": "ec2.amazonaws.com"},
"Effect": "Allow",
}],
}),
).arn,
scaling_config=eks.NodeGroupScalingConfigArgs(
desired_size=3,
max_size=10,
min_size=1,
),
instance_types=["g4dn.xlarge", "g4dn.2xlarge"],
)
# Kubernetes provider using cluster credentials
k8s_provider = k8s.Provider("k8s-provider",
kubeconfig=eks_cluster.kubeconfig_json,
)
# Export outputs
pulumi.export("cluster_name", eks_cluster.name)
pulumi.export("kubeconfig", eks_cluster.kubeconfig_json)Dynamic Resource Creation from Config: ml_serving.py
Here's the Pulumi advantage - you can read your ML model config and dynamically create infrastructure based on it:
import json
import yaml
import pulumi
import pulumi_aws as aws
import pulumi_kubernetes as k8s
def deploy_model_serving_infrastructure(model_config_path, k8s_provider):
"""
Read model configuration and create serving infrastructure dynamically.
This is where Pulumi shines—same language as your ML code!
"""
with open(model_config_path, 'r') as f:
model_config = yaml.safe_load(f)
model_name = model_config['model']['name']
replicas = model_config['serving']['replicas']
gpu_per_replica = model_config['serving']['gpu_per_replica']
batch_size = model_config['serving']['batch_size']
# Create ConfigMap from model config
config_map = k8s.core.v1.ConfigMap(
f"{model_name}-config",
metadata={"namespace": "default"},
data={
            "model_config.yaml": yaml.safe_dump(model_config)
},
opts=pulumi.ResourceOptions(provider=k8s_provider)
)
# Create Deployment with computed resources
deployment = k8s.apps.v1.Deployment(
f"{model_name}-deployment",
spec=k8s.apps.v1.DeploymentSpecArgs(
replicas=replicas,
selector=k8s.meta.v1.LabelSelectorArgs(
match_labels={"app": model_name}
),
template=k8s.core.v1.PodTemplateSpecArgs(
metadata=k8s.meta.v1.ObjectMetaArgs(
labels={"app": model_name}
),
spec=k8s.core.v1.PodSpecArgs(
containers=[
k8s.core.v1.ContainerArgs(
name=model_name,
image=f"myregistry.azurecr.io/{model_name}:latest",
ports=[k8s.core.v1.ContainerPortArgs(container_port=5000)],
resources=k8s.core.v1.ResourceRequirementsArgs(
requests={
"memory": "4Gi",
"cpu": "2",
                                    "nvidia.com/gpu": str(gpu_per_replica),  # quantities are strings in the k8s API
},
limits={
                                    "nvidia.com/gpu": str(gpu_per_replica),
},
),
env=[
k8s.core.v1.EnvVarArgs(
name="BATCH_SIZE",
value=str(batch_size),
),
],
)
],
node_selector={"nvidia.com/gpu": "true"},
tolerations=[
k8s.core.v1.TolerationArgs(
key="nvidia.com/gpu",
operator="Equal",
value="true",
effect="NoSchedule",
)
],
),
),
),
metadata=k8s.meta.v1.ObjectMetaArgs(namespace="default"),
opts=pulumi.ResourceOptions(provider=k8s_provider)
)
# Create Service to expose model
service = k8s.core.v1.Service(
f"{model_name}-service",
spec=k8s.core.v1.ServiceSpecArgs(
type="LoadBalancer",
selector={"app": model_name},
ports=[k8s.core.v1.ServicePortArgs(port=80, target_port=5000)],
),
metadata=k8s.meta.v1.ObjectMetaArgs(namespace="default"),
opts=pulumi.ResourceOptions(provider=k8s_provider)
)
    # Create HPA for auto-scaling (CPU-based here; scaling on GPU utilization
    # requires a custom metrics adapter such as DCGM + Prometheus)
hpa = k8s.autoscaling.v2.HorizontalPodAutoscaler(
f"{model_name}-hpa",
spec=k8s.autoscaling.v2.HorizontalPodAutoscalerSpecArgs(
scale_target_ref=k8s.autoscaling.v2.CrossVersionObjectReferenceArgs(
api_version="apps/v1",
kind="Deployment",
                name=deployment.metadata.name,  # use the auto-named Deployment, not a guessed name
),
min_replicas=replicas,
max_replicas=replicas * 3,
metrics=[
k8s.autoscaling.v2.MetricSpecArgs(
type="Resource",
resource=k8s.autoscaling.v2.ResourceMetricSourceArgs(
name="cpu",
target=k8s.autoscaling.v2.MetricTargetArgs(
type="Utilization",
average_utilization=70,
),
),
),
],
),
metadata=k8s.meta.v1.ObjectMetaArgs(namespace="default"),
opts=pulumi.ResourceOptions(provider=k8s_provider)
)
return {
"deployment": deployment,
"service": service,
"hpa": hpa,
    }

ComponentResource for Reusable Patterns: utils.py
Pulumi's ComponentResource lets you abstract complex patterns:
import json
import pulumi
import pulumi_aws as aws
class GPUTrainingCluster(pulumi.ComponentResource):
"""
Reusable component for ephemeral ML training clusters.
Same interface, different infrastructure based on config.
"""
def __init__(self, name, config, opts=None):
super().__init__('custom:ml:GPUTrainingCluster', name, None, opts)
self.cluster_name = name
self.config = config
# Create S3 bucket for checkpoints
self.checkpoint_bucket = aws.s3.Bucket(
f"{name}-checkpoints",
force_destroy=False, # Prevent accidental deletion
opts=pulumi.ResourceOptions(parent=self)
)
# Create IAM role for training instances
self.training_role = aws.iam.Role(
f"{name}-training-role",
assume_role_policy=json.dumps({
"Version": "2012-10-17",
"Statement": [{
"Action": "sts:AssumeRole",
"Principal": {"Service": "ec2.amazonaws.com"},
"Effect": "Allow",
}],
}),
opts=pulumi.ResourceOptions(parent=self)
)
# Grant S3 access to training role
s3_policy = aws.iam.RolePolicy(
f"{name}-s3-policy",
role=self.training_role.id,
policy=pulumi.Output.concat(
'{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", '
'"Action": ["s3:GetObject", "s3:PutObject"], "Resource": "',
self.checkpoint_bucket.arn,
'/*"}]}'
),
opts=pulumi.ResourceOptions(parent=self)
)
# Create EC2 fleet for training
self.spot_fleet = self._create_spot_fleet()
# Register outputs
self.register_outputs({
"checkpoint_bucket_name": self.checkpoint_bucket.id,
"training_role_arn": self.training_role.arn,
"spot_fleet_id": self.spot_fleet.id,
})
def _create_spot_fleet(self):
# Implementation matches Terraform spot_fleet.tf
        launch_template = aws.ec2.LaunchTemplate(
            f"{self.cluster_name}-launch-template",
            image_id=self.config["ami_id"],  # GPU AMI supplied via the component's config
            instance_type="g4dn.xlarge",
            vpc_security_group_ids=[self.config["security_group_id"]],
            opts=pulumi.ResourceOptions(parent=self)
        )
return aws.ec2.Fleet(
f"{self.cluster_name}-fleet",
launch_template_configs=[aws.ec2.FleetLaunchTemplateConfigArgs(
launch_template_specification=aws.ec2.FleetLaunchTemplateSpecificationArgs(
launch_template_id=launch_template.id,
version="$Latest",
),
)],
target_capacity_specification=aws.ec2.FleetTargetCapacitySpecificationArgs(
total_target_capacity=self.config["capacity_units"],
on_demand_target_capacity=0,
spot_target_capacity=self.config["capacity_units"],
),
opts=pulumi.ResourceOptions(parent=self)
)
@property
def checkpoint_bucket_name(self):
return self.checkpoint_bucket.id
@property
def training_role_arn(self):
        return self.training_role.arn

ML Infrastructure Drift Detection and Management
Here's a critical pain point: your infrastructure drifts. Someone manually adjusts a security group. A node gets an unexpected update. A checkpoint retention policy changes. You need drift detection, and you need it automated.
Terraform Drift Detection in CI
# .github/workflows/terraform-drift-check.yml
name: Terraform Drift Detection
on:
schedule:
# Run every 6 hours
- cron: '0 */6 * * *'
jobs:
drift-detection:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v2
with:
terraform_version: 1.5.0
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v2
with:
role-to-assume: arn:aws:iam::${{ secrets.AWS_ACCOUNT_ID }}:role/github-terraform-role
aws-region: us-east-1
- name: Terraform Init
run: terraform init
      - name: Terraform Plan (Drift Detection)
        id: plan
        # -detailed-exitcode: 0 = no changes, 1 = error, 2 = drift
        run: terraform plan -detailed-exitcode -out=drift.tfplan
        continue-on-error: true
      - name: Check for Drift
        run: |
          if [ "${{ steps.plan.outputs.exitcode }}" = "0" ]; then
            echo "✅ No infrastructure drift detected"
          else
            echo "⚠️ Infrastructure drift detected!"
            terraform show drift.tfplan
            exit 1
          fi
- name: Slack Notification (if drift)
if: failure()
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "⚠️ Infrastructure drift detected in ${{ github.repository }}",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Infrastructure Drift Alert*\n${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
}
}
]
            }

State Locking with DynamoDB
Prevent concurrent modifications that cause state conflicts:
# backend.tf
terraform {
backend "s3" {
bucket = "my-terraform-state"
key = "ml-infrastructure/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
# Create lock table (one-time setup)
resource "aws_dynamodb_table" "terraform_locks" {
name = "terraform-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
  point_in_time_recovery {
    enabled = true
  }
tags = {
Name = "terraform-locks"
}
}

Destroy and Recreate Pattern for Training Infrastructure
Training clusters should be ephemeral. Use Terraform workspaces or separate stacks:
# Spin up training cluster
terraform workspace new training-job-001
terraform apply -var-file=environments/training.tfvars
# Run training... (external process monitors job)
# Destroy when complete
terraform destroy -auto-approve
terraform workspace delete training-job-001
Monitoring and Observability for IaC Changes
You need visibility into what infrastructure changed and when:
# Enable CloudTrail for audit logging
resource "aws_cloudtrail" "infrastructure_changes" {
name = "${var.cluster_name}-trail"
s3_bucket_name = aws_s3_bucket.cloudtrail_logs.id
include_global_service_events = true
is_multi_region_trail = true
enable_log_file_validation = true
depends_on = [aws_s3_bucket_policy.cloudtrail]
}
# CloudWatch alarm for unexpected infrastructure changes
resource "aws_cloudwatch_log_group" "infrastructure_changes" {
name = "/aws/cloudtrail/${var.cluster_name}-changes"
}
resource "aws_cloudwatch_log_stream" "infrastructure_changes" {
name = "changes"
log_group_name = aws_cloudwatch_log_group.infrastructure_changes.name
}
Practical Example: Complete Training → Serving Pipeline
Let's walk through a realistic end-to-end scenario. You have a PyTorch model trained on distributed GPUs, and you need to deploy it for inference.
Step 1: Define Your Model Config
# model_config.yaml
model:
name: recommendation-transformer
version: v2.1
framework: pytorch
training:
data_source: s3://ml-datasets/recommendations/2025Q1/
batch_size: 128
num_epochs: 50
checkpoint_interval_steps: 1000
serving:
replicas: 3
gpu_per_replica: 1
batch_size: 32
max_latency_ms: 100
infrastructure:
training:
instance_types: [g4dn.2xlarge, g4dn.12xlarge]
capacity_units: 8
spot_max_price: "0.50"
serving:
instance_types: [g4dn.xlarge]
min_replicas: 3
max_replicas: 10
Step 2: Provision Training Infrastructure
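Before touching Terraform, it helps to render the infrastructure.training section of this YAML into a tfvars file so the model config and the infrastructure stay in sync from one source of truth. A minimal sketch — the Terraform variable names (target_capacity, tags) are illustrative, and a dict stands in for yaml.safe_load("model_config.yaml"):

```python
import json

def training_tfvars(config: dict) -> str:
    """Render the infrastructure.training section of model_config.yaml
    as a tfvars JSON string Terraform can consume via -var-file."""
    infra = config["infrastructure"]["training"]
    return json.dumps(
        {
            "instance_types": infra["instance_types"],
            "target_capacity": infra["capacity_units"],
            "spot_max_price": infra["spot_max_price"],
            # Tag every resource with the model identity for cost tracking
            "tags": {
                "model": config["model"]["name"],
                "model_version": config["model"]["version"],
            },
        },
        indent=2,
    )

# Mirrors the YAML above; in practice: config = yaml.safe_load(open("model_config.yaml"))
config = {
    "model": {"name": "recommendation-transformer", "version": "v2.1"},
    "infrastructure": {
        "training": {
            "instance_types": ["g4dn.2xlarge", "g4dn.12xlarge"],
            "capacity_units": 8,
            "spot_max_price": "0.50",
        }
    },
}
print(training_tfvars(config))
```

Write the result to training.auto.tfvars.json and Terraform picks it up automatically; a config change and an infrastructure change are now the same diff.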
Using Terraform:
# Initialize and plan
cd terraform/training
terraform init
terraform plan -var-file=environments/prod.tfvars -out=train.tfplan
# Apply
terraform apply train.tfplan
# Your spot fleet is now live. Submit training job.
python train.py \
--checkpoint_dir s3://recommendation-transformer-checkpoints/ \
--config ../model_config.yaml
Terraform creates:
- S3 bucket for checkpoints (with lifecycle policies)
- EC2 spot fleet across 3 availability zones
- CloudWatch monitoring for instance interruptions
- IAM role with S3 access (least-privilege)
- Security group allowing worker communication
Step 3: Monitor Training, Handle Interruptions
Your user data script listens for spot interruption notices:
# training_wrapper.py (runs on each EC2 instance)
import signal
import sys
import boto3
import torch
checkpoint_interval = 1000
steps = 0
latest_checkpoint_dir = None
def handle_interrupt_signal(signum, frame):
"""Called when spot interruption notice arrives"""
print("Spot interrupt detected. Saving emergency checkpoint...")
torch.save(model.state_dict(), f"/tmp/emergency_ckpt_{steps}.pt")
s3_client = boto3.client("s3")
s3_client.upload_file(
f"/tmp/emergency_ckpt_{steps}.pt",
"recommendation-transformer-checkpoints",
f"emergency_ckpt_{steps}.pt"
)
sys.exit(0)
signal.signal(signal.SIGUSR1, handle_interrupt_signal)
# Training loop
model = load_model(config)
optimizer = setup_optimizer(config)
for epoch in range(config.num_epochs):
for batch in dataloader:
loss = model(batch)
loss.backward()
optimizer.step()
steps += 1
if steps % checkpoint_interval == 0:
# Periodic checkpoint (survives spot interruption)
save_checkpoint(model, optimizer, steps, "s3://...")
print(f"Checkpoint saved at step {steps}")
print("Training complete!")
This way, if a spot instance gets interrupted, you checkpoint and resume on a replacement instance without losing progress.
Step 4: Deploy Serving Infrastructure
With Pulumi, deploying your trained model is straightforward:
# __main__.py
import pulumi
import pulumi_kubernetes as k8s
import yaml
from ml_serving import deploy_model_serving_infrastructure
# Read the trained model config
with open("model_config.yaml") as f:
model_config = yaml.safe_load(f)
# Read cluster kubeconfig (from training Terraform outputs)
kubeconfig_json = open("kubeconfig.json").read()
k8s_provider = k8s.Provider("k8s", kubeconfig=kubeconfig_json)
# Deploy model serving
serving = deploy_model_serving_infrastructure(
model_config_path="model_config.yaml",
k8s_provider=k8s_provider
)
pulumi.export("model_endpoint", serving["service"].status.load_balancer.ingress[0].hostname)
pulumi.export("model_replicas", serving["deployment"].spec.replicas)
Pulumi creates:
- Kubernetes deployment with GPU requests/limits
- Horizontal Pod Autoscaler (based on CPU)
- LoadBalancer service exposing your model
- ConfigMap with model config for model serving container to reference
Step 5: Destroy Training Infrastructure
Once training completes, clean up immediately to avoid costs:
cd terraform/training
terraform destroy -auto-approve
# Verify cleanup
aws ec2 describe-spot-fleet-requests --query 'SpotFleetRequestConfigs[?Status.Code==`cancelled_running`]'
Total cost for training: minutes of spot instance time + checkpoint storage. No idle GPU charges.
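The spin-up/destroy cycle (including the workspace pattern from earlier) is worth scripting so a finished job can never leak a running cluster. A minimal sketch with Python's subprocess — workspace naming and var-file path are illustrative, mirroring the commands above:

```python
import subprocess

def lifecycle_commands(job_id: str, var_file: str) -> list[list[str]]:
    """Ordered terraform commands for one ephemeral training run."""
    ws = f"training-job-{job_id}"
    return [
        ["terraform", "workspace", "new", ws],
        ["terraform", "apply", "-auto-approve", f"-var-file={var_file}"],
        # ...an external process monitors the training job in between...
        ["terraform", "destroy", "-auto-approve", f"-var-file={var_file}"],
        # Terraform refuses to delete the active workspace, so switch first
        ["terraform", "workspace", "select", "default"],
        ["terraform", "workspace", "delete", ws],
    ]

def run_lifecycle(job_id: str, var_file: str) -> None:
    for cmd in lifecycle_commands(job_id, var_file):
        subprocess.run(cmd, check=True)  # fail fast rather than leak resources
```

Wiring run_lifecycle into your job scheduler means cleanup is part of the job definition, not a manual step someone can forget.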
Advanced: Cost Optimization Patterns
Spot Instance Diversification
Why use multiple instance types? Because GPU availability varies. If g4dn.12xlarge is sold out in your region, the spot fleet falls back to g4dn.2xlarge, which has similar characteristics:
variable "instance_types" {
type = list(string)
default = [
"g4dn.12xlarge", # 4x GPU, most cost-effective
"g4dn.2xlarge", # 1x GPU, fallback
"g4dn.xlarge", # 1x GPU, final fallback
]
}
# In spot_fleet.tf, we iterate:
dynamic "overrides" {
for_each = var.instance_types
content {
instance_type = overrides.value
# Terraform applies this override across all AZs
}
}
Result: 99.5% fleet launch success rate even during GPU shortages.
Reserved Capacity for Baseline Load
For serving infrastructure with predictable traffic, mix reserved and on-demand:
# Baseline always-on capacity (reserved, cheaper)
resource "aws_eks_node_group" "gpu_reserved" {
cluster_name = aws_eks_cluster.ml_serving.name
node_group_name = "gpu-reserved"
capacity_type = "ON_DEMAND" # cover with reserved capacity / Savings Plans
scaling_config {
desired_size = 3
max_size = 3
min_size = 3
}
}
# Burst capacity (spot, cheaper)
resource "aws_eks_node_group" "gpu_burst" {
cluster_name = aws_eks_cluster.ml_serving.name
node_group_name = "gpu-burst"
capacity_type = "SPOT"
scaling_config {
desired_size = 0
max_size = 20
min_size = 0
}
}
Pods use preferredDuringSchedulingIgnoredDuringExecution node affinity to favor the reserved baseline nodes; during traffic spikes, the scheduler spills them onto the spot burst nodes. Result: roughly 60% cost savings on serving infrastructure.
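Here's what that affinity looks like, built as the plain dict that Pulumi's Kubernetes SDK (or a rendered manifest) accepts. This sketch assumes EKS managed node groups, which label nodes with eks.amazonaws.com/capacityType set to ON_DEMAND or SPOT; adjust the key if you use custom labels:

```python
def baseline_first_affinity(weight: int = 100) -> dict:
    """Node affinity that prefers the reserved (on-demand) pool; the
    scheduler falls back to spot burst nodes when the baseline is full."""
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,  # soft preference, not a hard requirement
                    "preference": {
                        "matchExpressions": [
                            {
                                "key": "eks.amazonaws.com/capacityType",
                                "operator": "In",
                                "values": ["ON_DEMAND"],
                            }
                        ]
                    },
                }
            ]
        }
    }
```

Pass the dict as the affinity field of the pod spec in your Pulumi Deployment; because the preference is soft, pods still schedule onto spot nodes when the reserved pool is saturated.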
Terraform vs. Pulumi: When to Use Each
| Aspect | Terraform | Pulumi |
|---|---|---|
| Learning Curve | Moderate (HCL syntax) | Lower (Python/Go/TS) |
| ML Integration | Via bash/external scripts | Native (same language) |
| Ecosystem | Largest (thousands of providers and modules) | Growing, excellent AWS support |
| State Management | Excellent (mature, proven) | Excellent (same backend options) |
| Team Skills | Requires HCL expertise | Leverage existing Python skills |
| GPU Infrastructure | Fully supported, battle-tested | Fully supported, expanding |
| Cost | Free/open-source | Free tier + commercial support |
| Organization | Separate infra/ML teams | ML-native, polyglot teams |
| Dynamic Resources | Limited (for_each, count loops) | Full programming language |
Use Terraform if:
- Your organization has dedicated infrastructure engineers
- You need multi-cloud (AWS, GCP, Azure) consistency
- You prefer declarative, static infrastructure
- Your team has HCL expertise
Use Pulumi if:
- Your ML team writes and deploys infrastructure
- You want Python, Go, or TypeScript everywhere
- You need dynamic resource generation from configs
- You plan to grow from ML→MLOps→DevOps
For most ML teams, start with Terraform for training (proven, simple) and Pulumi for serving (dynamic, Pythonic).
The deeper question both tools answer is organizational: who owns infrastructure? Historically, operations and infrastructure teams owned infrastructure, and developers requested resources. In modern ML-heavy organizations, data scientists and ML engineers own their infrastructure. They write it, test it, deploy it, debug it. This shift from "infrastructure as a specialized function" to "infrastructure as a team skill" is fundamental. Terraform encourages the old model - you write HCL, someone reviews it, someone deploys it. Pulumi enables the new model - you write Python alongside your training scripts, you can unit test infrastructure just like you test models, you can version and review infrastructure with the same tooling as code. The best choice depends on which model your organization is adopting.
It's also worth noting that these tools aren't mutually exclusive. Large organizations often use both - Terraform for shared infrastructure (networking, cluster setup, IAM policies) managed centrally, and Pulumi for workload-specific infrastructure (model serving stacks, experiment resources) managed by teams. You can compose Terraform outputs into Pulumi programs, creating a hybrid model that balances central governance with team autonomy.
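The lightest-weight bridge between the two is reading `terraform output -json` from the Pulumi side (Pulumi also ships a remote-state reference provider, but shelling out works anywhere). A sketch — the stack directory and output names are illustrative:

```python
import json
import subprocess

def parse_outputs(raw: str) -> dict:
    """Flatten `terraform output -json` ({"name": {"value": ...}}) to plain values."""
    return {name: meta["value"] for name, meta in json.loads(raw).items()}

def terraform_outputs(workdir: str) -> dict:
    """Read outputs from a Terraform-managed stack for use in a Pulumi program."""
    result = subprocess.run(
        ["terraform", f"-chdir={workdir}", "output", "-json"],
        check=True, capture_output=True, text=True,
    )
    return parse_outputs(result.stdout)

# e.g. in __main__.py:
#   cluster_name = terraform_outputs("terraform/cluster")["cluster_name"]
```

The central team publishes outputs (cluster name, subnet IDs, IAM role ARNs); workload teams consume them without touching the shared state.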
Practical Workflow: From Experiment to Production
Here's how this works end-to-end:
- Experiment Phase: Data scientist runs training locally or on a small Terraform-managed spot cluster.
- Dev Deployment: Push to dev, Pulumi spins up a test serving infrastructure and deploys the model.
- Staging Phase: Same model config, different Pulumi stack, creates staging infrastructure with load testing.
- Production Rollout: Canary deployment using Kubernetes, auto-scaling based on real traffic.
- Monitoring: CloudWatch and drift detection catch configuration changes.
- Cleanup: Terraform destroys training infrastructure after job completes, saving costs.
The key insight: infrastructure and models should evolve together. Store your infrastructure code alongside model code. Version both. Test both. A model trained on outdated infrastructure config will fail at deployment time.
This workflow sidesteps a common pain point: the disconnect between "how we trained it" and "how we deploy it." You train locally with 8 GPUs, PyTorch DataLoader, certain batch sizes. Then you deploy to a containerized serving environment with different hardware and different inference patterns. Surprise - it doesn't work as expected. Or worse, it works but is embarrassingly slow. The root cause is always that training and serving infrastructure diverged.
When your infrastructure lives in code alongside your model code, they're bound together. You can't accidentally train a model that's incompatible with serving infrastructure because they're declared together, versioned together, reviewed together. A PR that changes the serving infrastructure includes the code that interacts with it. Reviewers see both sides. This catches integration bugs early, before they cause production incidents.
The other benefit is reproducibility. A year from now, you need to retrain the model. You check out the infrastructure code from that version, spin up the identical training environment, retrain. You get the same hardware, same setup, same behavior. You can't "accidentally" upgrade a dependency or use different GPU hardware and wonder why the results shifted. This is invaluable for model governance and auditing.
Checklist: IaC Maturity for ML
Before calling your ML infrastructure "production-ready," validate:
- State Management: Remote backend (S3 + DynamoDB), encrypted, daily backups
- Drift Detection: Automated checks every 6 hours, Slack alerts
- Least Privilege: IAM roles with minimal permissions (S3, EC2, EKS only)
- Spot Handling: Graceful shutdown on interruption, checkpoint save, auto-resume
- Multi-AZ: Training and serving spread across 3+ AZs
- Monitoring: CloudWatch dashboards, alarm on failed training, service latency tracking
- Cost Tracking: Per-environment cost tags, monthly reports
- Disaster Recovery: Can you recreate production in 1 hour from code? (You should)
- Documentation: README explaining how to deploy, scale, and destroy
- Testing: Dry-run terraform plan in CI; validate Pulumi stacks in staging before prod
Summary
Infrastructure as Code for ML isn't a nice-to-have - it's essential for reproducibility, cost control, and peace of mind. Terraform provides battle-tested, multi-cloud support with fine-grained resource control and a massive ecosystem. Pulumi brings flexibility and language familiarity that resonates with ML teams who live in Python.
Both tools handle the unique challenges of training (ephemeral, GPU-heavy, interrupt-tolerant) and serving (persistent, auto-scaling, low-latency) infrastructure elegantly. The choice depends on your team's structure and skills, not the technical capabilities of the tools.
Start with either tool, but commit to one. Build reusable modules or ComponentResources. Automate drift detection. Lock your state. Treat infrastructure as you treat code - version it, test it, review it before deploying.
Your 3 AM pages and surprise cloud bills will become someone else's problem.
The Organizational Shift Enabled by IaC
Infrastructure as Code fundamentally changes how organizations scale. In the old model, you had infrastructure specialists who understood networking, storage, and compute. When you needed resources, you submitted a ticket and waited. This created bottlenecks. Your team wanted to spin up a GPU cluster for an experiment but had to wait three weeks for approval and provisioning. By then, you'd moved on to something else.
With IaC, teams become self-sufficient. Your data scientist writes Python training code and Python infrastructure code. They submit a PR. It gets reviewed. They apply it. Cluster is up in minutes. Experiment runs. Cluster gets deleted. Done. No specialists required. This doesn't mean infrastructure becomes unimportant - quite the opposite. Good infrastructure patterns become codified and shared. Your ML team writes a Pulumi component that encapsulates best practices for training clusters. Everyone uses that component. Infrastructure improvements propagate across all teams automatically.
This democratization comes with risks. People write bad infrastructure code. Someone leaks credentials into state files. Someone creates a security group that opens port 22 to 0.0.0.0/0. Peer review and linting help, but you also need cultural investment. Your team needs to understand that infrastructure has security and cost implications. A single thoughtless configuration decision could cost thousands or expose your data.
Building Infrastructure Abstractions
The most mature organizations don't have every team writing base infrastructure from scratch. They build abstractions. In Terraform, you build modules that encapsulate complexity. In Pulumi, you build ComponentResources. For ML workloads specifically, you might have components like TrainingCluster, ServingStack, FeatureStore, NotebookEnvironment. Teams instantiate these with a few parameters and get fully functional infrastructure.
Building good abstractions requires understanding patterns deeply. You notice that every training cluster needs spot instance handling with graceful shutdown. Every one needs monitoring and alerting. Every one needs cost tracking. So you build a component that includes all of this. Teams don't need to reimplement or understand the details. They just use the component.
Abstractions also serve a governance function. You can enforce standards at the component level. All training clusters must have encryption at rest. All serving stacks must be multi-AZ. All state must be backed up daily. These become requirements of the component, not guidelines that teams might or might not follow.
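Concretely, those requirements can live as validation inside the component itself, so no resource is ever created from non-compliant arguments. A sketch of the kind of guardrails a hypothetical TrainingCluster component might run before provisioning - the thresholds and tag names are illustrative:

```python
REQUIRED_TAGS = {"team", "cost-center", "model"}

def validate_cluster_args(args: dict) -> list[str]:
    """Return governance violations; a component would raise before
    creating any resources if this list is non-empty."""
    violations = []
    if not args.get("encrypt_at_rest", False):
        violations.append("encryption at rest must be enabled")
    if len(args.get("availability_zones", [])) < 2:
        violations.append("clusters must span at least 2 AZs")
    missing = REQUIRED_TAGS - set(args.get("tags", {}))
    if missing:
        violations.append(f"missing required tags: {sorted(missing)}")
    return violations
```

Because the check runs inside the component, "all clusters are encrypted and multi-AZ" stops being a wiki guideline and becomes a precondition that fails the deploy.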
Evolving Your IaC as You Scale
When you're a five-person ML team, your infrastructure is simple. A small GPU cluster for training, maybe a managed service for serving. You probably don't even need IaC - manual clicking is fine. But once you're a fifty-person team running hundreds of models, manual infrastructure becomes impossible. You need systematic approaches.
The evolution is natural. First you write Terraform to describe your current setup. You run plan and apply manually. Over time, you integrate it into CI/CD. You write tests. You implement drift detection. You set up cost alerts. You create reusable modules. You train your team on infrastructure best practices. The system grows with you.
But there's a maturity trap. Mature organizations sometimes develop such sophisticated infrastructure that only specialists understand it. A junior engineer can't add a simple feature because the infrastructure code is too complex. The abstraction becomes leaky. You need regular refactoring to keep abstractions clean and accessible.
The Cost Impact of IaC
Good infrastructure code directly impacts your cloud bill. By using spot instances intelligently, you cut training costs 70%. By auto-scaling serving infrastructure properly, you reduce idle capacity. By tracking costs with tags and automated reporting, you catch waste early. By implementing drift detection, you catch expensive misconfigurations before they become month-long problems.
Conversely, bad infrastructure code is expensive. Someone hardcodes 20 on-demand instances when 3 would suffice because they didn't understand resource sizing. Someone leaves a development cluster running 24/7 when it should be ephemeral. Someone replicates infrastructure across three regions when only one is needed. These mistakes compound.
The cost savings from good IaC often exceed the engineering investment within months. That's why mature organizations are willing to invest heavily in infrastructure tooling and training.
Choosing Between Terraform and Pulumi: A Practical Guide
The choice between Terraform and Pulumi comes down to your organization's structure and philosophy. If you have dedicated infrastructure engineers who write infrastructure separately from application engineers, Terraform's declarative model is ideal. It enforces a clear separation of concerns. Infrastructure code stays in a separate repository. Application teams reference infrastructure. Infrastructure teams own reliability and security.
If your organization believes ML engineers should own their infrastructure end-to-end, Pulumi is the better fit. ML engineers already write Python. Having them write infrastructure in the same language eliminates cognitive overhead. They can share utilities and libraries between application and infrastructure code. They can test infrastructure the same way they test models. This model scales better as your organization grows because you avoid infrastructure bottlenecks.
Hybrid approaches work too. Core infrastructure (networking, Kubernetes clusters, IAM) is managed by infrastructure engineers in Terraform. Application-specific infrastructure (model serving stacks, feature pipelines) is managed by ML teams in Pulumi. Pulumi consumes Terraform outputs. You get the consistency of Terraform with the flexibility of Pulumi.