January 1, 2026
AI/ML Infrastructure Platform GPU Kubernetes

Kubernetes NVIDIA KAI Scheduler: Advanced GPU Job Scheduling

Here's the problem: you've got a Kubernetes cluster with $500K worth of NVIDIA GPUs sitting idle while jobs wait in the queue for the perfect moment to run. Other jobs complete but waste resources because their distributed parts don't start together. Your data science teams are frustrated. Your DevOps team can't explain why GPU utilization is 40% despite the cluster appearing fully booked.

Welcome to the GPU scheduling nightmare. But there's a solution, and it's been hiding in plain sight: the NVIDIA KAI Scheduler. This isn't your father's Kubernetes scheduler - it's purpose-built to handle the complexity of AI/ML workloads where GPU time isn't just expensive, it's critical.


Table of Contents
  1. What Is the NVIDIA KAI Scheduler?
  2. The Core Challenge: Gang Scheduling
  3. The Problem in Action
  4. How KAI Handles Gang Scheduling
  5. Why Gang Scheduling Matters
  6. Fair-Share Scheduling: Preventing Team Starvation
  7. How Fair-Share Works
  8. Configuring Fair-Share Quotas
  9. Time-Weighted Fairness: Preventing Starvation
  10. Bin-Packing vs Spread Scheduling
  11. Bin-Packing: Maximize Utilization
  12. Spread Scheduling: High Availability
  13. Preemption Policies: Priority-Based Resource Reclamation
  14. Priority Classes and Preemption Triggers
  15. Graceful Preemption with Checkpoint Hooks
  16. Queue Management and Quotas
  17. Guaranteed vs Burst Quotas
  18. Queue Depth Monitoring
  19. Priority-Based Queue Draining
  20. Putting It All Together: A Complete Example
  21. Cluster Setup
  22. Job Submission
  23. KAI's Decision Flow
  24. Cluster Architecture Diagram
  25. Monitoring and Observability
  26. Best Practices
  27. The Bottom Line
  28. Advanced Topics: Multi-Cluster and On-Prem Scaling
  29. Understanding the True Cost of GPU Waste
  30. Case Study: Real-World Impact of KAI Deployment
  31. Why Default Kubernetes Scheduling Fails at ML Workloads
  32. Integration Patterns: Connecting KAI with Your ML Stack
  33. Operations and Monitoring: Making KAI Reliable
  34. Troubleshooting Common KAI Scheduler Issues
  35. Scaling KAI Beyond Single Clusters
  36. The Cost-Benefit Analysis of Sophisticated Scheduling
  37. The Path Forward: Making Your GPU Investment Pay

What Is the NVIDIA KAI Scheduler?

The KAI (Kubernetes AI) Scheduler is NVIDIA's answer to gang scheduling, fair-share management, and intelligent bin-packing for GPU workloads. Unlike the default Kubernetes scheduler (which treats GPUs as generic resources), KAI understands that distributed training jobs fail silently when pods start asynchronously, that fair-share queuing prevents team starvation, and that sometimes spreading workloads matters more than packing them tight.

KAI runs alongside the standard scheduler as a mutating webhook and custom controller. It intercepts pod scheduling decisions, applies AI-aware policies, and either approves, modifies, or queues jobs based on your cluster's resource availability and organizational priorities.

Think of it as adding a smart dispatcher to your GPU cluster - one that understands your business rules.


The Core Challenge: Gang Scheduling

Let's talk about distributed training. You're running a 4-node distributed training job across your cluster. Each node needs a GPU, and they need to start at the same time. If three pods launch but one waits in the queue, you've got three expensive GPUs doing nothing but waiting. In money terms? You're burning $3,000/day on a job that's only partially running. And it's not just money - it's frustration. Your data scientist submitted a job. The job is "running" (the UI shows pods in Running state). But it's not training, because one pod is stuck waiting. The experiment stalls. They can't iterate. Productivity tanks.

The core issue is that distributed training is all-or-nothing. If you have 4 workers and 3 start, they sit idle waiting for the 4th. They don't start training because the distributed training framework (PyTorch, TensorFlow, Horovod, whatever) requires all workers to be present. Without the 4th worker, the training framework can't initialize.

Contrast this with, say, batch processing. You submit a MapReduce job that processes a million files. If some mappers start and others queue, that's fine - the job progresses. You pay for what you use. But distributed training doesn't work that way. You either have all workers or none. Partial jobs are wasted resources.

This is why gang scheduling is essential for training clusters. You need the scheduler to understand that these 4 pods are a gang and they succeed or fail together.

This is the gang scheduling problem, and it's endemic to AI workloads.

The Problem in Action

Imagine this scenario with the default scheduler:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training-worker-0
spec:
  containers:
  - name: trainer
    image: nvidia/pytorch:24.01
    resources:
      limits:
        nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training-worker-1
spec:
  containers:
  - name: trainer
    image: nvidia/pytorch:24.01
    resources:
      limits:
        nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training-worker-2
spec:
  containers:
  - name: trainer
    image: nvidia/pytorch:24.01
    resources:
      limits:
        nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training-worker-3
spec:
  containers:
  - name: trainer
    image: nvidia/pytorch:24.01
    resources:
      limits:
        nvidia.com/gpu: "1"

The default scheduler sees four independent pods. It schedules them one by one: pod-0 lands on node-A at T=0ms, pod-1 lands on node-B at T=100ms, pod-2 lands on node-C at T=200ms. Pod-3? No GPU available. It sits in Pending.

Now pod-0 is waiting for pod-3. The training job doesn't start. Three GPUs are allocated but idle. Everyone loses.

How KAI Handles Gang Scheduling

KAI groups pods together using labels and gang-scheduling hints:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training-worker-0
  labels:
    kai.nvidia.com/job-id: "training-job-20250227-001"
    kai.nvidia.com/job-size: "4"
    kai.nvidia.com/job-index: "0"
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            kai.nvidia.com/job-id: "training-job-20250227-001"
        topologyKey: "kubernetes.io/hostname"
  containers:
  - name: trainer
    image: nvidia/pytorch:24.01
    resources:
      limits:
        nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training-worker-1
  labels:
    kai.nvidia.com/job-id: "training-job-20250227-001"
    kai.nvidia.com/job-size: "4"
    kai.nvidia.com/job-index: "1"
spec:
  containers:
  - name: trainer
    image: nvidia/pytorch:24.01
    resources:
      limits:
        nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training-worker-2
  labels:
    kai.nvidia.com/job-id: "training-job-20250227-001"
    kai.nvidia.com/job-size: "4"
    kai.nvidia.com/job-index: "2"
spec:
  containers:
  - name: trainer
    image: nvidia/pytorch:24.01
    resources:
      limits:
        nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training-worker-3
  labels:
    kai.nvidia.com/job-id: "training-job-20250227-001"
    kai.nvidia.com/job-size: "4"
    kai.nvidia.com/job-index: "3"
spec:
  containers:
  - name: trainer
    image: nvidia/pytorch:24.01
    resources:
      limits:
        nvidia.com/gpu: "1"

KAI's controller sees these four pods and recognizes them as a single gang. It performs an all-or-nothing scheduling decision: either all four pods launch together, or none of them do (until resources become available). This changes everything.

Expected behavior:

  • T=0ms: KAI checks if 4 GPUs are available
  • If yes: all four pods are scheduled simultaneously across different nodes
  • If no: all four pods remain Pending - nothing starts
  • When 4 GPUs become free: all four launch together within milliseconds

The training job either runs at full capacity or doesn't run at all. No wasted GPU cycles on partial jobs.
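The all-or-nothing decision can be sketched in a few lines of Python. This is an illustration of the idea, not KAI's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    gpus: int = 1

def schedule_gang(gang, free_gpus_per_node):
    """All-or-nothing placement: return a pod -> node assignment only if
    every pod in the gang fits; otherwise place nothing at all."""
    free = dict(free_gpus_per_node)   # work on a copy; commit only on success
    placement = {}
    for pod in gang:
        node = next((n for n, g in free.items() if g >= pod.gpus), None)
        if node is None:
            return None               # one pod can't be placed -> whole gang waits
        free[node] -= pod.gpus
        placement[pod.name] = node
    return placement

gang = [Pod("worker-%d" % i) for i in range(4)]
print(schedule_gang(gang, {"node-a": 2, "node-b": 1}))  # None: only 3 GPUs free
print(schedule_gang(gang, {"node-a": 2, "node-b": 2}))  # all four placed
```

The key detail is that the tentative placement is built on a copy of the free-GPU map: either the whole gang commits, or the cluster state is untouched and every pod stays Pending.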

Why Gang Scheduling Matters

In the partial job scenario, three GPUs sit allocated but idle while training is blocked. With gang scheduling, those GPUs are never tied up in the first place: when the full gang can't be placed, nothing is scheduled, so other jobs can use the capacity. This seems counterintuitive (we're refusing to start jobs even when partial resources are available), but it dramatically improves overall cluster efficiency. The alternative is a cluster where GPUs are "allocated" but "not running," which confuses everyone and wastes capacity.


Fair-Share Scheduling: Preventing Team Starvation

Here's another real-world problem: your research team submits a high-priority production job. Then your data science team submits 20 exploratory training runs. The default scheduler will eventually get to the data science jobs, but only after exhausting the research team's work. This creates a cascade where lower-priority teams never get cluster time during peak hours. You think you're running a shared cluster, but in practice, it's controlled by whoever shows up first with a big job.

This is especially painful if you're trying to be egalitarian. Your company might have three research teams with equal importance. You want to give each team equal access to the cluster. But if team A submits 50 jobs and team B submits 5, team A gets to use the cluster 90% of the time. Is that fair? To some, yes (they're doing more work). To others, no (they can't get cluster time because they're blocked by team A's jobs).

Fair-share scheduling solves this with an explicit hierarchy and quota allocation. You decide upfront: "Research team gets 40 GPUs, Data Science gets 35, Infrastructure gets 25." That's their guaranteed allocation. They can always use those GPUs. But it's also temporary. If Research only needs 30 of their 40 GPUs, the other 10 become available for Data Science to burst into. As soon as Research needs them back, they're reclaimed. This is genuinely fair - every team gets equal access to spare capacity, and you don't have the pathological case where one team hogs everything.

The implementation is elegant because it's not about rejecting jobs. It's about prioritizing them. If the cluster is full, the scheduler picks which job gets the next available GPU based on fairness metrics. This means no team faces indefinite starvation, but also no team gets exclusive priority.

KAI solves this with hierarchical fair-share scheduling - think of it as YARN (Apache's resource manager) but for Kubernetes GPUs.

How Fair-Share Works

Fair-share divides cluster capacity proportionally based on team/organizational hierarchies. Here's the conceptual model:

Total Cluster GPUs: 100

Hierarchy:
├── Research (40% = 40 GPUs)
│   ├── LLM Team (60% of 40 = 24 GPUs)
│   └── Vision Team (40% of 40 = 16 GPUs)
├── Data Science (35% = 35 GPUs)
│   ├── Analytics (70% of 35 = 24.5 GPUs)
│   └── Inference (30% of 35 = 10.5 GPUs)
└── Infrastructure (25% = 25 GPUs)

Each team gets a guaranteed quota - these GPUs are always available. But here's the smart part: when Research uses only 20 of its 40 guaranteed GPUs, the remaining 20 become available for Data Science to burst into. As soon as Research needs those GPUs back, they're reclaimed gracefully (after the burst job completes or gets preempted).
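The proportional split above can be computed mechanically from the weight tree. Here's a small Python sketch of the model (not KAI's code; it just normalizes sibling weights at each level):

```python
def allocate(total_gpus, tree):
    """Split capacity by weight at each level of the hierarchy.
    Weights are normalized per sibling group, so they need not sum to 100."""
    shares = {}
    total_weight = sum(node["weight"] for node in tree)
    for node in tree:
        share = total_gpus * node["weight"] / total_weight
        shares[node["name"]] = share
        if "children" in node:
            shares.update(allocate(share, node["children"]))
    return shares

hierarchy = [
    {"name": "research", "weight": 40, "children": [
        {"name": "llm-team", "weight": 60},
        {"name": "vision-team", "weight": 40}]},
    {"name": "data-science", "weight": 35, "children": [
        {"name": "analytics", "weight": 70},
        {"name": "inference", "weight": 30}]},
    {"name": "infrastructure", "weight": 25},
]
shares = allocate(100, hierarchy)
print(shares["llm-team"])   # 24.0  (60% of research's 40)
print(shares["analytics"])  # 24.5  (70% of data-science's 35)
```

Because weights are relative rather than absolute counts, resizing the cluster rescales every team's share automatically.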

Configuring Fair-Share Quotas

Create a KAI configuration that defines your organizational hierarchy:

yaml
apiVersion: kai.nvidia.com/v1
kind: ResourceQuota
metadata:
  name: fair-share-config
spec:
  hierarchy:
  - name: research
    weight: 40
    children:
    - name: llm-team
      weight: 60
    - name: vision-team
      weight: 40
  - name: data-science
    weight: 35
    children:
    - name: analytics
      weight: 70
    - name: inference
      weight: 30
  - name: infrastructure
    weight: 25
 
---
apiVersion: kai.nvidia.com/v1
kind: QueuePolicy
metadata:
  name: fair-share-policy
spec:
  scheduling:
    algorithmName: "fair-share"
    fairShareConfig:
      updateIntervalSeconds: 30
      weightedFairnessEnabled: true
      timeWeightedFairness: true
      preemptionEnabled: true

When a job arrives from the llm-team namespace:

  1. Check Guarantee - KAI asks: "Does this team have guaranteed capacity available?" If yes, schedule immediately.
  2. Check Burst - If no guarantee remains, ask: "Is there burst capacity (idle GPUs not used by other teams)?" If yes, schedule with preemption markers.
  3. Queue - If both are exhausted, queue the job.
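The three steps reduce to a short decision function. A sketch in Python, with field names of my own choosing rather than KAI's API:

```python
def admit(job_gpus, team, quotas, usage, cluster_idle_gpus):
    """Guarantee -> burst -> queue admission check (illustrative)."""
    q = quotas[team]
    used = usage.get(team, 0)
    # Step 1: guaranteed capacity left for this team?
    if used + job_gpus <= q["guaranteed"]:
        return "schedule"
    # Step 2: burst capacity, i.e. idle GPUs not claimed by other teams?
    if job_gpus <= cluster_idle_gpus and used + job_gpus <= q["guaranteed"] + q["burst"]:
        return "schedule-preemptible"   # marked so it can be reclaimed later
    # Step 3: nothing left -> queue the job
    return "queue"

quotas = {"llm-team": {"guaranteed": 24, "burst": 12}}
print(admit(4, "llm-team", quotas, {"llm-team": 18}, cluster_idle_gpus=10))  # schedule
print(admit(4, "llm-team", quotas, {"llm-team": 22}, cluster_idle_gpus=10))  # schedule-preemptible
print(admit(4, "llm-team", quotas, {"llm-team": 22}, cluster_idle_gpus=2))   # queue
```

Note the burst branch returns a distinct status: that's the "preemption marker" from step 2, which later tells the preemption manager this job is fair game when the guaranteed owner wants its GPUs back.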

Time-Weighted Fairness: Preventing Starvation

But there's a wrinkle. Under basic fair-share, a team could hog burst capacity indefinitely, preventing lower-priority teams from ever getting cluster time. KAI uses time-weighted fairness to prevent this:

Fair Share Score = (Guaranteed Share - Resource Used) + Time Factor

Time Factor = (Current Time - Last Scheduling Time) / Total Cluster Time

As a team's wait time increases, their priority automatically increases. After 24 hours of waiting, even a low-priority team gets priority scheduling.

Here's what you'd observe with time-weighted fairness enabled:

Hour 0:
  Research LLM: using 30 GPUs (allowed 24), burst 6 GPUs from idle pool
  Data Science Analytics: using 20 GPUs (allowed 24.5), perfectly within quota
  Data Science Inference: using 8 GPUs (allowed 10.5), has 2.5 GPUs free

Hour 6:
  Infrastructure submits a large job (40 GPUs requested)
  Under basic fair-share: would wait for Research to free resources
  Under time-weighted fairness: Infrastructure's wait time increases their priority

Hour 12:
  Infrastructure's time-weighted priority has grown to 0.45 (high)
  New job arrives from Data Science Analytics
  Scheduler compares: Infrastructure (priority 0.45, waiting 12h) vs Data Science (priority 0.30, waiting 2h)
  Infrastructure gets scheduled first (if burst capacity available)

This prevents the invisible penalty where background jobs simply never get cluster time.
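Plugging the two formulas above into code makes the starvation fix concrete. In this sketch, shares are expressed as fractions of the cluster and the numbers are illustrative, not the exact values KAI would compute:

```python
def fair_share_score(guaranteed_share, used_share, now_h, last_scheduled_h, total_cluster_time_h):
    """Fair Share Score = (Guaranteed Share - Resource Used) + Time Factor,
    where Time Factor = wait time / total cluster time."""
    time_factor = (now_h - last_scheduled_h) / total_cluster_time_h
    return (guaranteed_share - used_share) + time_factor

# Hour 12: Infrastructure has waited since hour 0; Data Science ran at hour 10
infra = fair_share_score(0.25, 0.25, now_h=12, last_scheduled_h=0, total_cluster_time_h=24)
ds = fair_share_score(0.245, 0.20, now_h=12, last_scheduled_h=10, total_cluster_time_h=24)
print(infra > ds)  # True: the long-waiting team now outranks the recent one
```

Even though Infrastructure has fully consumed its guaranteed share, twelve hours of accumulated wait time pushes its score past a team that was scheduled recently.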


Bin-Packing vs Spread Scheduling

Now we shift strategies based on workload type. Not all jobs are created equal, and neither is their resource placement.

Bin-Packing: Maximize Utilization

For training jobs, you want bin-packing - pack GPUs onto fewer nodes to maximize utilization and minimize stranded single-GPU nodes:

yaml
apiVersion: kai.nvidia.com/v1
kind: SchedulingPolicy
metadata:
  name: training-bin-pack
spec:
  strategyName: "bin-pack"
  placementStrategy:
    preferFewNodes: true
    nodeAffinityPreference:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: node-type
            operator: In
            values: ["high-memory", "nvlink-enabled"]
    gpuDensityTarget: 0.95  # Aim for 95% GPU utilization per node
    spreadExclusiveGPUAcrossNodes: false

With bin-packing enabled, the scheduler attempts to fill nodes sequentially. If you have a job requiring 4 GPUs and a node has 8 available:

Before:
  Node-A: [GPU0: free, GPU1: free, GPU2: free, GPU3: free, GPU4: free, GPU5: free, GPU6: free, GPU7: free]
  Node-B: [GPU0: free, GPU1: free, GPU2: free, GPU3: free, GPU4: free, GPU5: free, GPU6: free, GPU7: free]

Bin-Pack Decision:
  Place all 4 GPUs on Node-A: [GPU0: job, GPU1: job, GPU2: job, GPU3: job, GPU4-7: free]

After:
  Node-A: [4 GPUs used, 4 GPUs free] -> available for single-GPU inference jobs
  Node-B: [8 GPUs free] -> can be cordoned off or used for other workloads

This clustering reduces fragmentation. You're left with "whole available nodes" you can dedicate to specific workload classes.

Spread Scheduling: High Availability

For inference services, spread scheduling distributes pods across many nodes for fault tolerance:

yaml
apiVersion: kai.nvidia.com/v1
kind: SchedulingPolicy
metadata:
  name: inference-spread
spec:
  strategyName: "spread"
  placementStrategy:
    preferManyNodes: true
    antiAffinityPreference:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: vllm-inference
            topologyKey: "kubernetes.io/hostname"
    maxGPUsPerNode: 2
    minNodeCount: 5

Spread strategy tries to place each inference pod on a different node:

Inference Deployment: 10 replicas of vLLM server

Spread Decision:
  Pod 1 -> Node-A (GPU 0)
  Pod 2 -> Node-B (GPU 0)
  Pod 3 -> Node-C (GPU 0)
  Pod 4 -> Node-D (GPU 0)
  Pod 5 -> Node-E (GPU 0)
  Pod 6 -> Node-F (GPU 0)
  Pod 7 -> Node-G (GPU 0)
  Pod 8 -> Node-H (GPU 0)
  Pod 9 -> Node-I (GPU 0)
  Pod 10 -> Node-J (GPU 0)

If Node-C fails, you lose one inference instance but the other nine keep running. With bin-packing, you might have packed 4 instances on Node-C - sudden loss.
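The two strategies differ only in how candidate nodes are scored. A minimal sketch of that difference (not KAI's real scoring function):

```python
def score_node(free_gpus, total_gpus, strategy):
    """bin-pack prefers the most-loaded node that still fits the job;
    spread prefers the emptiest node."""
    load = (total_gpus - free_gpus) / total_gpus
    return load if strategy == "bin-pack" else 1.0 - load

def place(job_gpus, nodes, strategy):
    """nodes: {name: (free, total)}; pick the feasible node with the best score."""
    feasible = {n: ft for n, ft in nodes.items() if ft[0] >= job_gpus}
    if not feasible:
        return None
    return max(feasible, key=lambda n: score_node(feasible[n][0], feasible[n][1], strategy))

nodes = {"node-a": (4, 8), "node-b": (8, 8)}
print(place(4, nodes, "bin-pack"))  # node-a: top up the partially used node
print(place(4, nodes, "spread"))    # node-b: keep workloads apart
```

Same inputs, opposite answers: bin-packing fills node-a and leaves node-b whole, while spread reaches for the empty node to minimize blast radius.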


Preemption Policies: Priority-Based Resource Reclamation

Here's the hard truth: sometimes you need to kick a low-priority job off the GPU to make room for something urgent. But you don't want to just kill it - you want to give it a chance to save its state. This is critical for production reliability while still keeping GPU utilization high.

Without preemption, you're conservative with resource allocation. You see 100 GPUs available, but you hold back 20 as headroom for high-priority jobs that might arrive. That's 20% of your cluster idle at all times, just in case. With preemption, you can allocate all 100 GPUs aggressively. When a high-priority job arrives and the cluster is full, the scheduler preempts a low-priority job, gives it 2 minutes to save its state and exit gracefully, then launches the high-priority job. The low-priority job resumes later when resources are available.

This transforms utilization. Instead of idle headroom, you have "opportunistic" utilization. Your cluster is nearly always at 100%, and high-priority work still gets SLA guarantees.

But it only works if preemption is graceful. A hard kill wastes the work the low-priority job has done. Say you preempt a training job that's been running for 6 hours and lost all progress. That's 6 GPU-hours wasted. Better to give it 2 minutes to checkpoint, then kill it cleanly. The job resumes from the checkpoint and only loses the last 2 minutes of progress.

This requires cooperation from the application (the training job needs to implement checkpointing) and from the scheduler (the scheduler needs to send SIGTERM and wait before force-killing). KAI handles both.

KAI's preemption policies handle priority-based resource reclamation with grace periods and checkpoint hooks.

Priority Classes and Preemption Triggers

Define priority classes with preemption rules:

yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-inference
value: 1000
globalDefault: false
description: "Production inference services - will preempt training jobs"
preemptionPolicy: PreemptLowerPriority
 
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: interactive-notebooks
value: 500
globalDefault: false
description: "Interactive research notebooks - can be preempted by production"
preemptionPolicy: PreemptLowerPriority
 
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: background-training
value: 100
globalDefault: false
description: "Background training runs - preempted by anything"
preemptionPolicy: PreemptLowerPriority
 
---
apiVersion: kai.nvidia.com/v1
kind: PreemptionPolicy
metadata:
  name: graceful-preemption
spec:
  enabled: true
  gracePeriodSeconds: 120
  targetPriorityDifference: 300
  preemptionOrder:
  - name: background-training
    priority: 100
    maxPreemptionsPerCycle: 3
  - name: interactive-notebooks
    priority: 500
    maxPreemptionsPerCycle: 1
  - name: production-inference
    priority: 1000
    maxPreemptionsPerCycle: 0
  checkpointHooksEnabled: true

When production-inference pod-1 needs a GPU:

  1. Preemption Check: KAI asks, "Are lower-priority pods running that could be preempted?"
  2. Selection: Finds background-training-job-1 running on GPU 0
  3. Grace Period: Sends SIGTERM to background-training-job-1, gives it 120 seconds to checkpoint and exit gracefully
  4. Checkpoint Hook: If defined, triggers a webhook that saves the job's training state

Meanwhile, production-inference pod-1 is already scheduled to launch in 120 seconds, after the training job releases its GPU.
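Victim selection can be sketched as a greedy pass over running pods. The priority-gap check mirrors `targetPriorityDifference` above; the tie-break by shortest runtime (so the least work is lost) is this sketch's own choice, not documented KAI behavior:

```python
def pick_victims(need_gpus, running, incoming_priority, min_priority_gap=300):
    """Choose lower-priority pods to preempt until enough GPUs are reclaimed."""
    candidates = [p for p in running
                  if incoming_priority - p["priority"] >= min_priority_gap]
    # Cheapest victims first: lowest priority, then shortest runtime
    candidates.sort(key=lambda p: (p["priority"], p["runtime_h"]))
    victims, freed = [], 0
    for pod in candidates:
        if freed >= need_gpus:
            break
        victims.append(pod["name"])
        freed += pod["gpus"]
    return victims if freed >= need_gpus else []   # all-or-nothing reclaim

running = [
    {"name": "bg-train-1", "priority": 100, "gpus": 1, "runtime_h": 6},
    {"name": "bg-train-2", "priority": 100, "gpus": 1, "runtime_h": 1},
    {"name": "notebook-1", "priority": 500, "gpus": 1, "runtime_h": 2},
]
print(pick_victims(1, running, incoming_priority=1000))  # ['bg-train-2']
```

An incoming job whose priority is within 300 points of everything running gets no victims at all, which is exactly what stops same-tier jobs from preempting each other.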

Graceful Preemption with Checkpoint Hooks

In your training job manifests, define a preemption hook that saves state:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: background-training-job-1
  labels:
    priority-class: background-training
spec:
  terminationGracePeriodSeconds: 120
  containers:
  - name: trainer
    image: pytorch-distributed:latest
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/bash
          - -c
          - |
            echo "Received preemption signal at $(date)"
            # Remove any stale completion marker, then ask the training
            # loop to save a checkpoint (the trainer writes the marker)
            rm -f /checkpoints/SAVE_COMPLETE
            curl -X POST http://localhost:5000/save-checkpoint
            # Wait for the completion marker; polling latest.ckpt alone
            # would pass immediately if an older checkpoint already exists
            while [ ! -f /checkpoints/SAVE_COMPLETE ]; do
              sleep 1
            done
            echo "Checkpoint saved, exiting gracefully"
    ports:
    - containerPort: 5000
      name: checkpoint-api
    volumeMounts:
    - name: checkpoints
      mountPath: /checkpoints
    resources:
      limits:
        nvidia.com/gpu: "1"
  volumes:
  - name: checkpoints
    persistentVolumeClaim:
      claimName: training-checkpoints-pvc

The training job can now detect the preemption signal, save its state in 120 seconds, and exit cleanly. When scheduled again later, it resumes from the checkpoint without losing progress.

Expected output in logs:

[09:15:32] Starting epoch 5, step 1024
[09:15:45] Step 1025 loss: 0.234
[09:16:00] Step 1026 loss: 0.221
[09:16:15] *** PREEMPTION SIGNAL RECEIVED ***
[09:16:16] Saving checkpoint to /checkpoints/latest.ckpt
[09:16:45] Checkpoint saved successfully
[09:16:46] Process exiting gracefully
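On the application side, the training loop has to notice the termination signal and stop at a step boundary. A minimal Python sketch, with the actual checkpoint write elided - `train` just returns the step to resume from:

```python
import signal

stop_requested = False

def on_sigterm(signum, frame):
    # Kubernetes delivers SIGTERM first, then waits
    # terminationGracePeriodSeconds (120s above) before SIGKILL
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, on_sigterm)

def train(total_steps, start_step=0):
    """Run until finished or preempted; return the step to resume from
    (in a real job this is where /checkpoints/latest.ckpt gets written)."""
    for step in range(start_step, total_steps):
        if stop_requested:
            return step        # save checkpoint here, then exit cleanly
        # ... one forward/backward/optimizer step would run here ...
    return total_steps
```

Checking the flag once per step (rather than handling the signal mid-step) guarantees the checkpoint always lands on a consistent optimizer state.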

Queue Management and Quotas

Now let's talk about the queue - where jobs wait their turn - and how quotas prevent one team from consuming all cluster capacity.

Guaranteed vs Burst Quotas

Every team gets two tiers of quota:

yaml
apiVersion: kai.nvidia.com/v1
kind: Quota
metadata:
  name: research-llm-quota
spec:
  namespace: research-llm
  guaranteedQuota:
    gpus: 24
    memory: "960Gi"
    maxPodsPerDay: 50
    expirationSeconds: 86400  # Reset daily
  burstQuota:
    gpus: 12
    memory: "480Gi"
    maxPodsPerDay: 20
    conditionRequired: "cluster-underutilized"
  preemptionAllowed: false  # Research jobs can't be preempted
 
---
apiVersion: kai.nvidia.com/v1
kind: Quota
metadata:
  name: data-science-analytics-quota
spec:
  namespace: data-science-analytics
  guaranteedQuota:
    gpus: 24
    memory: "960Gi"
    maxPodsPerDay: 100
  burstQuota:
    gpus: 15
    memory: "600Gi"
    maxPodsPerDay: 50
    conditionRequired: "cluster-underutilized"
  preemptionAllowed: true  # Analytics jobs can be preempted

Guaranteed quota: Always available, reserved for the team. These GPUs are "theirs" - they can count on them for production workloads.

Burst quota: Available when cluster is underutilized (>20% idle). Great for exploratory jobs that benefit from extra resources but aren't critical.
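The `cluster-underutilized` condition above reduces to a simple idle-fraction check. A sketch, with the 20% threshold taken from the text:

```python
def burst_allowed(total_gpus, used_gpus, idle_threshold=0.20):
    """Burst quota unlocks only while more than 20% of the cluster's GPUs sit idle."""
    idle_fraction = (total_gpus - used_gpus) / total_gpus
    return idle_fraction > idle_threshold

print(burst_allowed(100, 75))  # True: 25% idle, burst jobs may launch
print(burst_allowed(100, 89))  # False: 11% idle, burst requests queue instead
```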

Queue Depth Monitoring

As jobs arrive, they fill the queue. KAI provides metrics about queue depth and estimated wait times:

yaml
apiVersion: kai.nvidia.com/v1
kind: QueueMonitor
metadata:
  name: queue-insights
spec:
  recordMetricsIntervalSeconds: 30
  metricsEndpoint: /metrics

You'll see Prometheus metrics like:

# Queue depth by priority
kai_queue_depth{priority="production-inference"} 2
kai_queue_depth{priority="interactive-notebooks"} 15
kai_queue_depth{priority="background-training"} 47

# Estimated wait time (in seconds)
kai_queue_wait_time_estimate{priority="production-inference"} 30
kai_queue_wait_time_estimate{priority="interactive-notebooks"} 480
kai_queue_wait_time_estimate{priority="background-training"} 3600

# Quota consumption
kai_quota_usage_percent{team="research-llm"} 65
kai_quota_usage_percent{team="data-science-analytics"} 42
kai_quota_usage_percent{team="infrastructure"} 28

A dashboard might look like:

┌─ KAI Scheduler Dashboard ─────────────────┐
│                                            │
│ Queue Status:                              │
│   Production Inference: 2 jobs, wait ~30s  │
│   Interactive Notebooks: 15 jobs, wait ~8m │
│   Background Training: 47 jobs, wait ~60m  │
│                                            │
│ Quota Usage (Guaranteed/Burst):            │
│   Research LLM: 65% / 35% | Est. 24/12 GPU│
│   Data Science: 42% / 20% | Est. 24/15 GPU│
│   Infrastructure: 28% / 8% | Est. 25/10 GPU│
│                                            │
│ Cluster Health:                            │
│   Total GPUs: 100 | Used: 89 | Idle: 11   │
│   Peak Utilization: 94% (last 24h)        │
│   Preemptions (24h): 3 | Avg Duration: 2h │
│                                            │
└────────────────────────────────────────────┘

Priority-Based Queue Draining

When GPUs become available, KAI doesn't just schedule the first job in the queue. It intelligently selects the best job based on multiple factors:

yaml
apiVersion: kai.nvidia.com/v1
kind: QueueDrainingPolicy
metadata:
  name: smart-queue-drain
spec:
  selectionCriteria:
  - criterion: priority-class
    weight: 0.4
  - criterion: wait-time
    weight: 0.3
  - criterion: resource-efficiency
    weight: 0.2
  - criterion: team-fairness
    weight: 0.1
  maxJobsPerDrainCycle: 5
  minimumResourcesPerJob:
    gpus: 1
    memory: "32Gi"

The scoring algorithm:

Selection Score =
  0.4 × priority_percentile +           # Higher priority class = higher score
  0.3 × (wait_time / max_wait_time) +   # Longer wait = higher score
  0.2 × resource_efficiency_score +     # Jobs that use resources well = higher score
  0.1 × team_fairness_score             # Teams that haven't run recently = higher score

Example queue state:

Queue (ordered by selection score):
1. training-job-42 [Score: 0.82]
   - Priority: background-training (100)
   - Wait Time: 6 hours
   - Fairness: Data Science team hasn't run in 90 min
   → SCHEDULE THIS ONE FIRST

2. inference-deployment-5 [Score: 0.71]
   - Priority: interactive-notebooks (500)
   - Wait Time: 2 hours
   - Fairness: Research team ran 5 min ago
   → Schedule second

3. analysis-batch-18 [Score: 0.58]
   - Priority: background-training (100)
   - Wait Time: 1.5 hours
   - Fairness: Analytics team has never run today
   → Schedule third (if GPUs available)
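The scoring formula is easy to reproduce as a function, with `priority_percentile` oriented so that higher-priority classes rank higher. The inputs below are illustrative, not the exact jobs from the queue above:

```python
def selection_score(priority_percentile, wait_h, max_wait_h, efficiency, fairness):
    """Weighted sum of the four smart-queue-drain criteria; every input
    except the wait times is normalized to the 0..1 range."""
    return (0.4 * priority_percentile +
            0.3 * (wait_h / max_wait_h) +
            0.2 * efficiency +
            0.1 * fairness)

# A low-priority job that has waited long, from a starved team...
job_a = selection_score(0.10, wait_h=6.0, max_wait_h=6.0, efficiency=0.9, fairness=0.9)
# ...can outscore a higher-priority job that just arrived
job_b = selection_score(0.80, wait_h=2.0, max_wait_h=6.0, efficiency=0.5, fairness=0.2)
print(job_a > job_b)  # True
```

This is the behavior the queue example illustrates: priority is the heaviest single factor, but wait time, efficiency, and fairness together outweigh it, so background jobs eventually win a drain cycle.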

Putting It All Together: A Complete Example

Let's architect a realistic scenario: a team submitting a distributed training job into a fair-share cluster managed by KAI.

Cluster Setup

yaml
apiVersion: kai.nvidia.com/v1
kind: KAISchedulerConfig
metadata:
  name: production-cluster
spec:
  enableGangScheduling: true
  enableFairShare: true
  enablePreemption: true
  enableQueueManagement: true
 
  clusterResources:
    totalGpus: 100
    totalMemory: "4000Gi"
 
  schedulingPolicy: fair-share
  placementStrategy: bin-pack
  preemptionEnabled: true
  queueDrainingPolicy: smart-queue-drain
 
---
apiVersion: v1
kind: Namespace
metadata:
  name: research-llm
 
---
apiVersion: kai.nvidia.com/v1
kind: ResourceQuota
metadata:
  name: research-llm-quota
  namespace: research-llm
spec:
  guaranteedQuota:
    gpus: 40
    memory: "1600Gi"
  burstQuota:
    gpus: 20
    memory: "800Gi"

Job Submission

The research team submits a 4-GPU distributed training job:

yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llama-finetuning-v2
  namespace: research-llm
  labels:
    kai.nvidia.com/job-type: distributed-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        metadata:
          labels:
            kai.nvidia.com/job-id: "llama-finetuning-v2"
            kai.nvidia.com/job-size: "4"
            kai.nvidia.com/job-index: "0"
        spec:
          priorityClassName: background-training
          restartPolicy: OnFailure
          containers:
          - name: pytorch
            image: pytorch-distributed:24.01
            resources:
              limits:
                nvidia.com/gpu: "1"
                memory: "40Gi"
            volumeMounts:
            - name: model-cache
              mountPath: /models
            - name: checkpoints
              mountPath: /checkpoints
          volumes:
          - name: model-cache
            emptyDir: {}
          - name: checkpoints
            persistentVolumeClaim:
              claimName: training-checkpoints-pvc
    Worker:
      replicas: 3
      template:
        metadata:
          labels:
            kai.nvidia.com/job-id: "llama-finetuning-v2"
            kai.nvidia.com/job-size: "4"
        spec:
          priorityClassName: background-training
          restartPolicy: OnFailure
          containers:
          - name: pytorch
            image: pytorch-distributed:24.01
            resources:
              limits:
                nvidia.com/gpu: "1"
                memory: "40Gi"
            volumeMounts:
            - name: model-cache
              mountPath: /models
            - name: checkpoints
              mountPath: /checkpoints
          volumes:
          - name: model-cache
            emptyDir: {}
          - name: checkpoints
            persistentVolumeClaim:
              claimName: training-checkpoints-pvc

KAI's Decision Flow

Now watch what happens:

T=0: Job submitted
  └─ KAI receives PyTorchJob with 4 pods (1 master + 3 workers)

T=5ms: Gang Scheduling Check
  └─ Label detection: job-id "llama-finetuning-v2", job-size "4"
  └─ Recognition: This is a gang job

T=10ms: Quota Check
  └─ research-llm namespace quota:
       • Guaranteed: 40 GPUs available
       • Currently using: 32 GPUs (80% of guaranteed)
       • Available guaranteed: 8 GPUs ✓ (need 4)
       • Decision: APPROVE

T=15ms: Fairness Check
  └─ Fair-share algorithm:
       • Research team has used 32/40 guaranteed GPUs
       • Time-weighted fairness: Research last scheduled 2 min ago
       • Data Science last scheduled 45 min ago
       • Decision: Schedule now (research has capacity, fairness okay)

T=20ms: Resource Availability Check
  └─ Gang scheduling: Can we place all 4 pods simultaneously?
       • Node-A: 2 GPUs free
       • Node-B: 2 GPUs free
       • Node-C: 1 GPU free
       • Node-D: 2 GPUs free
       • Required: 4 GPUs (7 free cluster-wide)
       • Decision: YES, capacity exists for the whole gang

T=25ms: Placement Strategy
  └─ Bin-packing strategy selected
       • Fill nodes sequentially instead of spreading one pod per node
       • Master pod → Node-A GPU-0
       • Worker-1 → Node-A GPU-1
       • Worker-2 → Node-B GPU-0
       • Worker-3 → Node-B GPU-1
       • Rationale: Node-A and Node-B are filled completely, leaving
         Node-C and Node-D's free GPUs contiguous for other jobs

T=30ms: All 4 Pods Scheduled Simultaneously
  └─ Master and all 3 workers transition to Running state
  └─ Distributed training job can now synchronize and begin

Expected kubectl output:
  $ kubectl get pods -n research-llm -l kai.nvidia.com/job-id=llama-finetuning-v2

  NAME                           READY   STATUS    RESTARTS   AGE
  llama-finetuning-v2-master-0   1/1     Running   0          35s
  llama-finetuning-v2-worker-0   1/1     Running   0          35s
  llama-finetuning-v2-worker-1   1/1     Running   0          35s
  llama-finetuning-v2-worker-2   1/1     Running   0          35s
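The fairness check at T=15ms weighs both quota usage and how long ago a team was last scheduled. Here's a minimal sketch of that idea; this illustrates time-weighted fairness in general, not KAI's actual algorithm, and the data science team's usage (10 of 40 GPUs) is an assumed figure:

```python
# Illustrative time-weighted fairness score (NOT KAI's real algorithm):
# teams with lower quota usage and longer waits score higher when
# slots are contested.
def fairness_score(used_gpus: int, guaranteed_gpus: int,
                   minutes_since_last_schedule: float,
                   wait_weight: float = 0.01) -> float:
    usage = used_gpus / guaranteed_gpus
    return (1.0 - usage) + wait_weight * minutes_since_last_schedule

# Numbers from the trace above; data science usage (10/40) is assumed.
research = fairness_score(32, 40, 2)       # heavy use, scheduled 2 min ago
data_science = fairness_score(10, 40, 45)  # light use, waiting 45 min
print(research < data_science)  # True: data science wins a contested slot
```

In an uncontested case like the trace above, research schedules immediately because it still has guaranteed capacity; the score only matters when teams compete for the same GPUs.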

Cluster Architecture Diagram

graph TB
    subgraph KAI["KAI Scheduler Components"]
        GS["Gang Scheduler"]
        FS["Fair-Share Engine"]
        PM["Preemption Manager"]
        QM["Queue Manager"]
        VS["Validation & Webhooks"]
    end
 
    subgraph K8S["Kubernetes Core"]
        DS["Default Scheduler"]
        CRD["CRD Controllers"]
        ETCD["etcd State"]
    end
 
    subgraph Workloads["GPU Workloads"]
        DT["Distributed Training"]
        INF["Inference Services"]
        INT["Interactive Jobs"]
    end
 
    subgraph Cluster["GPU Nodes"]
        N1["Node-A<br/>GPU0-7"]
        N2["Node-B<br/>GPU0-7"]
        N3["Node-C<br/>GPU0-7"]
        N4["Node-D<br/>GPU0-7"]
    end
 
    DT -->|Submit| VS
    INF -->|Submit| VS
    INT -->|Submit| VS
 
    VS -->|Validate| GS
    GS -->|Gang OK?| FS
    FS -->|Fairness OK?| PM
    PM -->|Can Preempt?| QM
    QM -->|Queue or Schedule| DS
 
    DS -->|Place Pods| Cluster
    CRD -->|Watch| ETCD
    PM -.->|Read State| ETCD
 
    N1 -.->|Metrics| KAI
    N2 -.->|Metrics| KAI
    N3 -.->|Metrics| KAI
    N4 -.->|Metrics| KAI

Monitoring and Observability

You can't manage what you don't measure. KAI exposes comprehensive metrics:

apiVersion: v1
kind: Service
metadata:
  name: kai-metrics
  namespace: kai-system
spec:
  selector:
    app: kai-scheduler
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080
 
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kai-scheduler-monitor
spec:
  selector:
    matchLabels:
      app: kai-scheduler
  endpoints:
  - port: metrics
    interval: 30s

Key metrics to monitor:

# Scheduling decisions
kai_scheduling_decisions_total{decision="accept"} 1024
kai_scheduling_decisions_total{decision="queue"} 156
kai_scheduling_decisions_total{decision="reject"} 3

# Gang scheduling
kai_gang_job_completion_time_seconds{team="research-llm"} 3600
kai_gang_scheduling_success_rate{priority="background-training"} 0.98

# Fair-share distribution
kai_fairshare_quota_usage{team="research-llm"} 0.65
kai_fairshare_quota_usage{team="data-science-analytics"} 0.42

# Preemption events
kai_preemptions_total{priority="background-training"} 12
kai_preemption_grace_period_seconds 120

# Queue health
kai_queue_depth_total 67
kai_queue_wait_time_p99_seconds 1800

Best Practices

Here's what separates successful KAI deployments from struggling ones:

1. Gang Scheduling Discipline: Always label distributed training jobs with consistent gang-scheduling metadata. Inconsistent labeling defeats the entire purpose.

2. Fair-Share Tuning: Start conservative with quotas (don't allocate more than 70% of cluster capacity as guaranteed). Let burst capacity grow organically. Adjust weights quarterly based on team load patterns.

3. Priority Class Hygiene: Don't create 10 priority classes. Keep it simple: production (1000), interactive (500), background (100). Overcomplicating creates confusion.

4. Preemption Hooks: Always implement graceful shutdown for training jobs. Don't rely on hard kills - they waste GPU hours on lost progress.

5. Monitoring Coverage: Monitor queue depth, wait times, and preemption frequency. A six-hour p99 queue wait time is a sign you need more GPUs or better team prioritization.

6. Documentation: Document your fair-share hierarchy, priority classes, and quota decisions. Future-you (and your ops team) will thank you.
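
Best practice 4 above can be sketched as a training loop that traps SIGTERM (the signal Kubernetes sends during graceful termination), finishes its in-flight step, and checkpoints before exiting. This is a hedged illustration; the save_checkpoint hook is a placeholder you'd wire to your own framework:

```python
import signal

# Minimal sketch of a preemption-aware training loop (illustrative).
# On SIGTERM we set a flag, finish the current step, then checkpoint
# and exit within KAI's grace period instead of being hard-killed.
stop_requested = False

def handle_sigterm(signum, frame):
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, handle_sigterm)

def train(total_steps: int) -> int:
    completed = 0
    for _ in range(total_steps):
        if stop_requested:
            # save_checkpoint() would go here (placeholder hook)
            break
        completed += 1  # one optimizer step
    return completed

print(train(3))  # 3 completed steps when no preemption occurs
```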


The Bottom Line

The NVIDIA KAI Scheduler transforms GPU clusters from first-come-first-served chaos into intelligent, priority-aware resource management. Gang scheduling prevents partial job launches, fair-share prevents team starvation, preemption enables graceful priority execution, and queue management keeps everyone moving.

This isn't just about utilization metrics (though those improve). It's about making your GPU cluster behave like it was designed for AI/ML workloads, not patched onto commodity infrastructure.

If you're running distributed training on Kubernetes and not using intelligent gang scheduling, you're leaving money on the table every single day. KAI fixes that.


Advanced Topics: Multi-Cluster and On-Prem Scaling

As your GPU infrastructure grows beyond a single cluster, KAI's capabilities extend to manage resources across multiple sites. Organizations running both on-prem hardware and cloud instances need a unified scheduling layer that treats all resources as a single pool. KAI's multi-cluster features handle this by maintaining a global queue, routing jobs to the cluster with available capacity, and respecting fair-share quotas across the entire infrastructure.

This becomes especially important when you have heterogeneous hardware. Perhaps your on-prem cluster has eight hundred A100 GPUs optimized for training, while your cloud cluster has H100s optimized for inference. Different jobs have different preferences. Research teams want the A100s. Inference teams want the H100s. KAI lets you express these preferences through node affinity rules while maintaining global fairness. It's a sophisticated optimization problem that KAI solves transparently. Your teams submit jobs to a unified interface. The scheduler finds the best placement globally.

Understanding the True Cost of GPU Waste

The financial impact of poor GPU scheduling deserves real attention because it compounds quickly. Imagine you're running a 100-node GPU cluster with eight H100 GPUs per node. That's 800 GPUs total. At current cloud pricing, an H100 costs roughly five dollars per hour. If your cluster runs at just 40 percent utilization because of scheduling inefficiency, the idle capacity is worth well over a million dollars per month. That's not pocket change. That's an entire engineering team's payroll. That's funding that could go toward research, infrastructure improvements, or simply better margins.
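
The arithmetic is worth making explicit. A quick back-of-the-envelope sketch, assuming roughly $5 per H100-hour and 730 hours in a month (both illustrative figures):

```python
# Back-of-the-envelope cost of idle GPU capacity (illustrative pricing).
GPUS = 100 * 8          # 100 nodes x 8 H100s
RATE = 5.0              # approximate $ per GPU-hour
HOURS_PER_MONTH = 730

def monthly_waste(utilization: float) -> float:
    """Dollar value of GPU-hours left idle each month."""
    return GPUS * RATE * HOURS_PER_MONTH * (1 - utilization)

print(f"${monthly_waste(0.40):,.0f}")  # $1,752,000 idle at 40% utilization
```

The exact figure moves with your pricing and cluster size, but the order of magnitude is the point: scheduling inefficiency at this scale is a seven-figure annual problem.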

The issue gets worse when you consider what inefficient scheduling does to team productivity. Your researchers submit distributed training jobs and watch them sit in queues partially launched. They can't iterate quickly. They can't run experiments. They lose momentum. The business impact of slow iteration compounds faster than the cost of extra GPUs. A team that can run fifty experiments per week generates better models than a team running five experiments per week. The scheduling infrastructure directly enables your team's research velocity.

Case Study: Real-World Impact of KAI Deployment

Consider a real scenario from a company we worked with that had been running Kubernetes without intelligent GPU scheduling for eighteen months. They had invested three million dollars in GPU infrastructure. Utilization metrics showed sixty-five percent. They thought that was acceptable. In reality, they were wasting nearly one million dollars annually.

Here's what was happening: distributed training jobs frequently launched partially. Two out of four worker pods would start. The missing pods blocked progress. Meanwhile, the two running pods sat idle, burning cost without producing value. This happened hundreds of times per month across multiple teams. When they deployed KAI, gang scheduling eliminated this pattern immediately: either all workers launched or none did. Dead capacity that appeared allocated but wasn't actually training vanished, and effective utilization rose to eighty-four percent. That translated to six hundred thousand dollars in annual value from infrastructure they'd already paid for. KAI paid for itself in the first month.

Why Default Kubernetes Scheduling Fails at ML Workloads

Standard Kubernetes was designed for stateless web applications. Deploy a web service, replicate it across nodes, let the scheduler spread replicas for availability. If one replica is slow, another picks up traffic. This design breaks for machine learning. Distributed training isn't resilient to partial launch. Feature computation has complex dependencies between nodes. Models fail silently when pieces go missing. The default scheduler has no notion of these constraints. It treats GPUs like generic resources instead of expensive specialized equipment that powers your entire business.

This is why KAI exists. It understands the semantics of AI workloads. It knows that certain jobs must launch together or not at all. It knows that fairness prevents team starvation and keeps velocity high across your organization. It knows that preemption must be graceful because lost progress is lost money. These aren't abstract concerns. They're concrete optimizations that change whether your GPU infrastructure is an asset or a liability.

Integration Patterns: Connecting KAI with Your ML Stack

KAI doesn't exist in isolation. It integrates with your broader ML infrastructure. Your training platform (Kubeflow, PyTorch Distributed) needs to understand KAI's gang scheduling labels. Your monitoring system needs to surface KAI metrics alongside your application metrics. Your team's mental model needs to shift from "submit a job and hope" to "submit a job and trust that the scheduler will handle it intelligently."

This integration requires thought about API design. When a data scientist submits a training job, do they need to know about gang scheduling? Ideally no. They shouldn't need to manually add labels. Instead, your training abstraction layer should handle this automatically. They say "I want to train a model across eight GPUs" and the system automatically generates the gang scheduling labels, calculates fair-share quotas, and submits the job to KAI. Complexity is hidden behind a simple interface.
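
A sketch of what that abstraction layer might do, using the kai.nvidia.com/job-id and job-size label convention shown earlier in this article. The helper names and the returned manifest skeleton are hypothetical:

```python
# Hypothetical abstraction layer: the user asks for N GPUs, and the
# gang-scheduling labels are generated automatically.
def gang_labels(job_id: str, num_pods: int) -> dict:
    return {
        "kai.nvidia.com/job-id": job_id,
        "kai.nvidia.com/job-size": str(num_pods),
    }

def submit_training(job_id: str, gpus: int, gpus_per_pod: int = 1) -> dict:
    num_pods = gpus // gpus_per_pod
    # A real implementation would render a full PyTorchJob manifest
    # and submit it via the Kubernetes API; we return just the skeleton.
    return {"metadata": {"labels": gang_labels(job_id, num_pods)},
            "replicas": num_pods}

spec = submit_training("llama-finetuning-v2", gpus=8)
print(spec["metadata"]["labels"]["kai.nvidia.com/job-size"])  # 8
```

The data scientist's interface is just "8 GPUs"; the gang size, labels, and replica count fall out automatically and can never drift out of sync.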

Similarly, your monitoring dashboards need to show KAI metrics alongside training job metrics. A dashboard might show "ten training jobs in queue, estimated wait three hours" alongside "data queue contains 47 samples waiting for annotation." Teams need unified visibility into their entire pipeline. KAI's metrics are just one part of this.

Operations and Monitoring: Making KAI Reliable

Deploying KAI to production requires more than installation. You need visibility into what the scheduler is actually doing. Is fair-share working correctly? Are teams getting their guaranteed quotas? Is preemption happening gracefully or forcing hard-kills that waste work? Without monitoring, you won't know if KAI is truly improving things. You'll have hopes and promises. You'll lack data.

The key metrics to monitor are: fair-share quota utilization by team (are teams getting what they expect?), preemption frequency and grace period success rate (is preemption graceful or forceful?), gang scheduling success rate (how often do all pods in a gang launch together?), and queue depth and wait time estimates (is anyone waiting unreasonably long?). These metrics surface in Prometheus if you wire up KAI's metrics export. Build dashboards showing these metrics. Alert when queue depth exceeds thresholds or preemption grace periods are timing out.
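
A minimal sketch of such an alert check, using the metric names from the earlier metrics listing. The thresholds are illustrative and should be tuned to your cluster:

```python
# Illustrative alert thresholds over KAI's exported metrics.
def check_alerts(metrics: dict) -> list:
    fired = []
    if metrics.get("kai_queue_wait_time_p99_seconds", 0) > 3600:
        fired.append("p99 queue wait exceeds 1 hour")
    if metrics.get("kai_gang_scheduling_success_rate", 1.0) < 0.95:
        fired.append("gang scheduling success below 95%")
    if metrics.get("kai_queue_depth_total", 0) > 100:
        fired.append("queue depth over 100 jobs")
    return fired

sample = {"kai_queue_wait_time_p99_seconds": 1800,
          "kai_gang_scheduling_success_rate": 0.98,
          "kai_queue_depth_total": 67}
print(check_alerts(sample))  # [] -- healthy at these sample values
```

In practice you'd express these as Prometheus alerting rules rather than application code, but the thresholds and the metrics involved are the same.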

Operational discipline is important here. When something goes wrong, you need runbooks. "Gang scheduling success rate dropped to 80 percent at 3 AM" needs a response. Are nodes failing? Is fragmentation increasing? Did a bad job configuration create resource conflicts? Your team should know how to investigate. Document investigation procedures. Share post-mortems when things fail. Build institutional knowledge about what can go wrong and how to fix it.

Troubleshooting Common KAI Scheduler Issues

When KAI deployments encounter problems, understanding what's happening requires digging into multiple layers. A training job stuck in Pending state might be waiting for gang scheduling to assemble all pods, or it might be stuck due to resource fragmentation where no single node has enough free GPUs for the job.

The first diagnostic step is checking the job's events. Run kubectl describe pod on each pending pod to see why it's waiting. "Insufficient nvidia.com/gpu" means the cluster truly lacks GPUs. "Gang waiting for members" means other gang members haven't launched yet. These are different problems requiring different solutions.

Node fragmentation is a subtle issue that emerges with gang scheduling. Imagine an 8-GPU node where pods have allocated GPUs in fragments: pod A uses GPUs 0-1, pod B uses GPUs 4-5. A new job requesting four contiguous GPUs can't fit, even though four GPUs are free. KAI includes bin-packing strategies to minimize fragmentation, but fragmentation can still occur with heterogeneous pod sizes. Monitoring node fragmentation helps identify when you need to add capacity or consolidate workloads.
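
The fragmentation check itself can be modeled in a few lines. This toy model, using an 8-GPU node with pods holding GPUs 0-1 and 4-5 for brevity, is purely illustrative; KAI's real placement logic is far more involved:

```python
# Toy model of GPU fragmentation on a single node (illustrative only).
def fits_contiguous(allocated: set, total: int, need: int) -> bool:
    """Return True if `need` contiguous GPU indices are free."""
    run = 0
    for gpu in range(total):
        run = 0 if gpu in allocated else run + 1
        if run >= need:
            return True
    return False

# 8-GPU node: pods hold GPUs 0-1 and 4-5, leaving 2-3 and 6-7 free.
used = {0, 1, 4, 5}
print(fits_contiguous(used, 8, 2))  # True  (2 contiguous GPUs fit)
print(fits_contiguous(used, 8, 4))  # False (4 free, but fragmented)
```

The second call is the pathology: total free capacity is sufficient, but no single contiguous run is, so a topology-sensitive job stays pending.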

Fair-share enforcement sometimes creates surprising situations. A team might request more resources than their fair-share quota, expecting queuing. Instead, jobs that exceed the hard quota limit get rejected outright. Understanding your fair-share configuration and limits prevents surprises. Teams should know their guaranteed quota, their burst quota, and what happens when they exceed both.

Preemption timeout failures are worth investigating. A job marked for preemption receives a grace period to shutdown gracefully. If a pod doesn't shutdown within that time, KAI forcefully kills it, wasting any in-progress work. Long grace periods delay other jobs from starting. Short grace periods cause more work to be lost. Finding the right grace period requires understanding your models' checkpointing behavior and recovery time.
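
One way to reason about that tradeoff: if a job checkpoints every N minutes, a hard kill loses on average half an interval of work, while a graceful shutdown that checkpoints on SIGTERM loses almost nothing. A hedged sketch of that expected-loss arithmetic (a simplification that ignores checkpoint write time):

```python
# Expected GPU-minutes lost per preemption under a simple model:
# progress since the last checkpoint is lost on a hard kill.
def expected_loss_minutes(checkpoint_interval_min: float,
                          graceful: bool) -> float:
    if graceful:
        return 0.0  # job checkpoints on SIGTERM before exiting
    return checkpoint_interval_min / 2  # on average, half an interval

print(expected_loss_minutes(30, graceful=False))  # 15.0 minutes lost
print(expected_loss_minutes(30, graceful=True))   # 0.0
```

This is why the grace period should comfortably exceed your checkpoint write time: it converts the 15-minute average loss into roughly zero.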

Scaling KAI Beyond Single Clusters

As your GPU infrastructure grows to multiple clusters, KAI's scheduling capabilities extend across sites. This is where the system gets genuinely sophisticated. A unified queue receives jobs from any cluster. The meta-scheduler determines which physical cluster should run each job based on current availability, job requirements, and team priorities.

This multi-cluster setup is valuable when you have heterogeneous hardware. Your on-premise cluster might be optimized for training - many A100s, high-bandwidth networking. Your cloud cluster might be optimized for inference serving - H100s, lower latency requirements. The meta-scheduler understands these characteristics and routes jobs appropriately.

However, multi-cluster introduces consistency challenges. Job IDs must be globally unique. State must be synchronized across schedulers. Quota tracking must be consistent globally. Failures in inter-cluster communication might leave jobs stuck or double-scheduled. Mature multi-cluster deployments implement careful monitoring and reconciliation logic to catch and fix these issues.
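
The globally unique job ID requirement is the easiest of these to sketch: prefix every ID with its origin cluster and add a random component so two clusters can never collide. A minimal illustration; the naming scheme here is an assumption, not KAI's actual format:

```python
import uuid

# Sketch: cluster-prefixed job IDs for multi-cluster scheduling,
# combining the origin cluster, the job name, and a random suffix.
def make_job_id(cluster: str, job_name: str) -> str:
    return f"{cluster}-{job_name}-{uuid.uuid4().hex[:8]}"

a = make_job_id("onprem", "llama-finetuning-v2")
b = make_job_id("cloud", "llama-finetuning-v2")
print(a != b)  # True: the cluster prefix alone guarantees uniqueness
```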

The Cost-Benefit Analysis of Sophisticated Scheduling

Implementing KAI requires engineering effort. You need cluster infrastructure expertise, scheduler understanding, and operational discipline. For smaller organizations with straightforward workload patterns, simpler scheduling might suffice. But the ROI calculation is compelling: if KAI increases utilization from 40% to 75%, that's roughly doubling your effective GPU capacity without buying more hardware. That's millions of dollars in value from dozens of engineering hours. The math is hard to argue with.
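
That "roughly doubling" claim checks out arithmetically. A quick sketch, with the same illustrative assumptions used earlier (800 GPUs, about $5 per GPU-hour):

```python
# Effective capacity gained by raising utilization, with illustrative
# pricing assumptions (not a quote of real cloud rates).
GPUS, RATE, HOURS_PER_YEAR = 800, 5.0, 8760

def effective_gpu_hours(utilization: float) -> float:
    return GPUS * HOURS_PER_YEAR * utilization

multiplier = effective_gpu_hours(0.75) / effective_gpu_hours(0.40)
annual_gain = (effective_gpu_hours(0.75) - effective_gpu_hours(0.40)) * RATE
print(f"{multiplier:.3f}x")      # 1.875x -- close to doubling
print(f"${annual_gain:,.0f}/year")
```

Going from 40% to 75% utilization multiplies effective capacity by 1.875 with zero new hardware, which is why even a substantial integration effort tends to pay for itself quickly.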

But implementation success requires commitment. You can't deploy KAI and expect miracles without adjusting how teams submit jobs. You need to educate teams about gang scheduling, about fair-share quotas, about preemption. You need operational processes for handling queue buildup and capacity planning. You need robust monitoring to catch scheduling anomalies. Without this supporting infrastructure, KAI becomes a tool gathering dust instead of an asset generating value.

The Path Forward: Making Your GPU Investment Pay

Building a sophisticated scheduler like KAI isn't about technical elegance. It's about making your hardware investments actually work. Every GPU you buy is a commitment. The commitment only pays off if you use them well. Poor scheduling means you're paying for GPUs that sit idle despite appearing busy. Great scheduling means you're truly utilizing what you've paid for.

The path forward is clear: implement intelligent gang scheduling to eliminate partial job waste, implement fair-share quotas to prevent team starvation, implement preemption to maintain SLAs while running full utilization, and implement monitoring to catch problems before they become disasters. These aren't nice-to-have optimizations. They're the foundation of a working GPU infrastructure.


Need help implementing this?

We build automation systems like this for clients every day.

Discuss Your Project