Multi-Tenant GPU Sharing: MIG, MPS, and Time-Slicing
You've got expensive GPUs sitting in your cluster, and they're only being used 30% of the time. Yeah, that hurts to think about. The good news? NVIDIA's given us three solid strategies to squeeze more value out of that hardware: Multi-Instance GPU (MIG), CUDA Multi-Process Service (MPS), and Kubernetes time-slicing. Each approach trades off isolation, performance, and complexity in different ways. Let's dig into when and how to use each one.
Table of Contents
- The GPU Sharing Dilemma
- MIG: Hardware Isolation at the Instance Level
- What MIG Actually Does
- Why MIG Matters
- Hardware Guarantees and Isolation
- Kubernetes MIG Device Plugin Setup
- MIG Scheduling in Kubernetes
- MIG's Sweet Spot
- MPS: Software-Level Context Sharing
- How MPS Works
- Setting Up MPS in Kubernetes
- MPS Limitations and Risks
- Time-Slicing: Fair Scheduling Without Isolation
- The Time-Slicing Strategy
- NVIDIA Device Plugin Time-Slicing Configuration
- Oversubscription Ratios and Fairness
- The Trade-Offs: Context Switching Overhead
- Workload-to-Strategy Matching
- Real-World Examples
- When to Choose Each Approach: A Decision Matrix
- MIG: Best for SaaS and Multi-Tenant Inference
- MPS: Best for Trusted Workloads in Controlled Environments
- Time-Slicing: Best for Development and Maximizing Utilization
- Scheduling and Bin-Packing Strategies
- Gang Scheduling for MIG Workloads
- Resource Quotas for Fair Multi-Team Sharing
- Chargeback: Billing Based on Actual GPU Time Used
- Why This Matters in Production
- Comparative Architecture Diagram
- Putting It All Together: Complete Example Setup
- Monitoring and Troubleshooting
- Lessons from Large-Scale GPU Deployments
- Key Takeaways
- Implementation Challenges That Teams Face
- Cost Attribution and Billing in Shared Environments
- Observability and Monitoring at Scale
- Evolution of Your GPU Sharing Strategy
- Scaling Beyond Single Clusters
- Future of GPU Sharing Technology
The GPU Sharing Dilemma
Here's the fundamental problem: modern GPUs like the A100 and H100 are monsters - too much compute power for most individual workloads. A single inference job might use 5% of available GPU memory and less than 10% of compute. Meanwhile, another team's notebook is blocked waiting for GPU availability.
Traditional GPU scheduling is binary: either you own the whole GPU, or you don't get one. This leads to massive waste. Running three small inference models on separate GPUs costs 3x the silicon, 3x the power, 3x the cooling - but you're getting maybe 35% utilization across all of them.
The economics are brutal when you think about it. An A100 80GB GPU costs roughly $12,000 to purchase and another $3,000-5,000 annually to operate (power, cooling, networking, depreciation). In the cloud, that translates to $2-3 per GPU-hour on-demand. If your GPU is idle 70% of the time, you're hemorrhaging money. Even large enterprises with hundreds of GPUs find themselves with expensive paperweights. For a 100-GPU cluster running at 30% utilization, that's $1.8 million per year in pure waste.
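The back-of-envelope math is easy to script. Here's a minimal sketch; the rate and utilization figures are illustrative assumptions in line with the numbers above, not vendor quotes:

```python
# Estimate annual spend wasted on idle GPU capacity.
# All rates are illustrative assumptions, not vendor quotes.

HOURS_PER_YEAR = 8760

def idle_cost(num_gpus: int, cost_per_gpu_hour: float, utilization: float) -> float:
    """Dollars per year spent on GPU-hours that do no useful work."""
    total = num_gpus * cost_per_gpu_hour * HOURS_PER_YEAR
    return total * (1.0 - utilization)

# 100 GPUs at ~$2.50/GPU-hour, running at 30% utilization
waste = idle_cost(num_gpus=100, cost_per_gpu_hour=2.50, utilization=0.30)
print(f"${waste:,.0f} per year of idle capacity")  # ~$1.5M, the same order as above
```

Plug in your own amortized on-prem cost per GPU-hour instead of a cloud rate and the shape of the answer doesn't change: idle fraction times total spend is money gone.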
NVIDIA's answer comes in three flavors, each with different guarantees and trade-offs:
- MIG: Hardware-level partitioning. True isolation at the silicon level.
- MPS: Software-level sharing. Single GPU context serving multiple processes.
- Time-slicing: Pure scheduling fairness. Everyone gets turns at the full GPU.
Pick the wrong one, and you'll hit performance cliffs or reliability issues. Pick the right one, and you multiply your effective GPU capacity without buying more hardware. This isn't just about being thrifty with resources. It's about making your engineering team happy. If your data scientists are waiting for GPU availability while there are GPUs in the rack at 30% utilization, you've got a problem bigger than money. You've got a cultural problem where the infrastructure is fighting the team instead of enabling them.
The decision isn't abstract either. Get this right in a 100-GPU cluster, and you've effectively purchased 300 GPUs' worth of throughput without spending a dime on new silicon. Get it wrong, and your data science team is waiting for GPU availability while your infrastructure team explains why sharing doesn't work. Understanding the trade-offs between these three approaches will determine whether you're running an enabling infrastructure or a constraining one.
MIG: Hardware Isolation at the Instance Level
What MIG Actually Does
Multi-Instance GPU carves an A100 or H100 into truly independent compute units at the hardware level. It's not virtualization - it's physical partitioning. Each MIG instance gets dedicated memory, dedicated compute cores, and dedicated interconnects. If your job crashes on MIG instance 1, MIG instance 2 keeps running. Period. This is the critical difference from the other approaches: MIG is failure-isolated at the silicon level.
Think of it like dividing a physical GPU into separate mini-GPUs. That's closer to the truth than it sounds. NVIDIA actually carved up the silicon during manufacturing. Each instance has its own L2 cache partitions, its own memory controllers, its own execution units. When you enable MIG mode, you're not doing software magic - you're unlocking hardware that was designed from the ground up to be partitionable.
Why MIG Matters
The reason MIG is valuable isn't just technical - it's organizational. In a multi-tenant environment, you might have multiple customer workloads running on the same hardware. With MIG, you have a guarantee: if one customer's code crashes the GPU, other customers aren't affected. This is literally your SLA. You promised 99.9% availability, and one customer's bug shouldn't violate that promise to another customer.
Compare this to time-slicing, where all jobs take turns on the same GPU. If one job crashes the GPU during its turn, all waiting jobs are delayed. Or MPS, where a crash in one client process resets the entire shared context. With MIG, crashes are contained. This isolation is especially valuable in SaaS scenarios where you're serving multiple customers and one customer's problems can't cascade to others.
NVIDIA's A100 and H100 support up to 7 MIG instances per GPU. On an A100 40GB, the profiles look like this (the leading number is the count of compute slices, out of 7):
- 1g.5gb - 1/7 of the SMs, 5GB memory (lightweight inference)
- 2g.10gb - 2/7 of the SMs, 10GB memory (small models)
- 3g.20gb - 3/7 of the SMs, 20GB memory (medium workloads; at most 2 per GPU)
- 4g.20gb - 4/7 of the SMs, 20GB memory (compute-heavy workloads)
- 7g.40gb - all SMs, 40GB memory (the full GPU as a single instance, which defeats MIG's purpose)
(The A100 80GB and H100 80GB double the memory per profile: 1g.10gb, 2g.20gb, 3g.40gb, 7g.80gb, and so on.)
The practical constraint: the hardware does allow mixed profiles on one GPU (subject to NVIDIA's placement rules), but mixed geometries are painful to schedule, and the Kubernetes device plugin's "single" MIG strategy assumes every GPU on a node is partitioned identically. In practice you standardize on one profile per node pool, which sometimes forces you to leave capacity on the table.
Understanding this is critical for planning. Partition an A100 40GB into two 3g.20gb instances and you've allocated all 40GB of memory but only 6 of the 7 compute slices. Partition it into three 2g.10gb instances and 10GB of memory plus a compute slice sit idle. Whatever profile you pick, some capacity is stranded, and changing your mind later means draining the GPU and reconfiguring it. This forces a binary decision: optimize for a single workload size, or accept capacity waste.
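You can sanity-check a planned partition before touching nvidia-smi. This is a hypothetical helper that encodes only the slice and memory budgets of the A100 40GB profiles listed above; NVIDIA's real placement rules are stricter than this simple accounting:

```python
# Hypothetical MIG layout checker: validate a partition plan against an
# A100 40GB's budget of 7 compute slices and 40 GB of memory.
# NVIDIA's actual placement rules are stricter than this accounting.

PROFILES = {  # name: (compute_slices, memory_gb)
    "1g.5gb": (1, 5),
    "2g.10gb": (2, 10),
    "3g.20gb": (3, 20),
    "4g.20gb": (4, 20),
    "7g.40gb": (7, 40),
}

def fits(layout: list[str], max_slices: int = 7, max_mem_gb: int = 40) -> bool:
    """True if the layout stays within the slice and memory budgets."""
    slices = sum(PROFILES[p][0] for p in layout)
    mem = sum(PROFILES[p][1] for p in layout)
    return slices <= max_slices and mem <= max_mem_gb

print(fits(["3g.20gb", "3g.20gb"]))            # True: 6 slices, 40 GB
print(fits(["3g.20gb", "3g.20gb", "1g.5gb"]))  # False: memory over budget
```

A check like this makes the stranded-capacity trade-off visible at planning time instead of at `nvidia-smi` time.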
Hardware Guarantees and Isolation
Here's where MIG shines: the isolation is enforced at the hardware level. Each MIG instance has its own:
- Execution units (SMs - Streaming Multiprocessors)
- L2 cache partition
- Memory controllers
- Interconnect bandwidth
If one MIG instance goes rogue and starts hammering memory, it can't steal bandwidth from another instance. If a kernel on instance 1 deadlocks, instance 2 keeps running unaffected. This is true fault isolation.
The implications are profound. In multi-tenant environments - think a SaaS company running customer models on shared GPUs - this guarantee is everything. One customer's buggy training code can't crash another customer's inference. One team's experimental research can't poison another team's production service. This isolation is legally and operationally valuable in ways that are hard to quantify until you've been burned by a neighbor process.
Compare that to MPS (which we'll cover next), where a single GPU context serves all processes. One misbehaving process can bring down the entire context.
Kubernetes MIG Device Plugin Setup
Getting MIG working in Kubernetes means:
- Enable MIG on the GPU (driver configuration)
- Partition it into your desired instance sizes
- Tell Kubernetes about the partitions (NVIDIA device plugin)
- Schedule workloads to MIG instances (resource requests)
This is where your infrastructure decisions get locked in. Once you've configured a GPU for MIG with two 3g.20gb instances, you're committed. Changing that partition later means draining workloads and reconfiguring the GPU - effectively downtime for that card.
Here's what that looks like in practice:
# SSH into the node with GPUs
# List GPUs and check current MIG mode
nvidia-smi -L

# Enable MIG mode on GPU 0 (requires a GPU reset to take effect)
sudo nvidia-smi -i 0 -mig 1

# Create two 3g.20gb GPU instances on GPU 0,
# each with its default compute instance (-C)
sudo nvidia-smi mig -i 0 -cgi 3g.20gb,3g.20gb -C

# Verify
nvidia-smi -L
# Output should include something like:
# GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-...)
#   MIG 3g.20gb Device 0: (UUID: MIG-...)
#   MIG 3g.20gb Device 1: (UUID: MIG-...)

Now deploy the NVIDIA device plugin to make Kubernetes aware of these instances:
---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
---
# HelmChart is the k3s/RKE2 CRD; on other distributions, run `helm install`
# against the same repo and values instead
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: nvidia-device-plugin
  namespace: gpu-operator
spec:
  repo: https://nvidia.github.io/k8s-device-plugin
  chart: nvidia-device-plugin
  targetNamespace: gpu-operator
  valuesContent: |-
    failOnInitError: true
    migStrategy: "single"
    deviceListStrategy: "envvar"
    deviceIDStrategy: "uuid"

After the device plugin starts, check what Kubernetes sees:
kubectl get nodes -o json | jq '.items[].status.allocatable' | grep nvidia
# Output (with the "single" MIG strategy, each MIG instance appears as one GPU):
# "nvidia.com/gpu": "2"

MIG Scheduling in Kubernetes
Now request MIG instances in your pod specs:
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
  namespace: default
spec:
  containers:
  - name: inference-server
    image: myregistry/inference:latest
    resources:
      requests:
        nvidia.com/gpu: "1"   # Request 1 MIG instance
      limits:
        nvidia.com/gpu: "1"
    # No need to set NVIDIA_VISIBLE_DEVICES or CUDA_VISIBLE_DEVICES yourself -
    # the device plugin injects them for the allocated instance

The kubelet's device plugin will bind this pod to one of the two 3g.20gb instances we created. Each instance looks like a fully independent GPU from the container's perspective. The pod can't see the other instance, can't starve it, can't crash it.
MIG's Sweet Spot
MIG is your answer when:
- You run inference workloads that don't need the full GPU
- You need hard isolation guarantees (multi-tenant SaaS, compliance-heavy)
- You want to maximize utilization across many small workloads
- Your workloads can fit in the smaller instance sizes (memory is usually the binding constraint, not compute)
MIG's weakness: inflexibility. You're locked into fixed partitions. If you partition for small inference (1g.5gb) but then need to run a larger model, you're stuck with smaller instances than you actually need. This isn't just an inconvenience - it affects your total cluster cost and capacity planning for months ahead.
MPS: Software-Level Context Sharing
How MPS Works
CUDA Multi-Process Service takes a different approach: instead of hardware partitioning, MPS creates a single GPU context and lets multiple user processes share it. Think of it like a context manager that schedules competing kernel launches onto the same GPU.
MPS architecture looks like this:
┌─────────────────────────────────────────┐
│ NVIDIA CUDA MPS Service (Daemon) │
│ ┌──────────────────────────────────────┐│
│ │ GPU Context (A100 or H100) ││
│ │ - Shared Memory Space ││
│ │ - Single Kernel Queue ││
│ └──────────────────────────────────────┘│
└─────────────────────────────────────────┘
↑ ↑ ↑ ↑
Client Client Client Client
Process 1 Process 2 Process 3 Process 4
Each client process connects to the MPS daemon, submits kernels to the shared context, and the daemon schedules them. From the application's perspective, it looks like it has exclusive GPU access.
The magic: MPS enforces memory isolation between clients through memory protection tables. Client 1's memory is fenced off from Client 2's memory even though they're in the same GPU context. If Client 1 tries to write to Client 2's heap, the hardware blocks it.
But here's the catch: fault isolation is not guaranteed. If Client 1 submits a kernel with a bug that causes a GPU reset, Client 2's kernels in-flight will be dropped. The GPU context recovers, but work is lost. This is the critical difference from MIG, and why MPS is for trusted workloads only.
Setting Up MPS in Kubernetes
First, start the MPS daemon on each node with GPUs:
# On the GPU node (outside Kubernetes)
# Or as a daemonset if you want it container-based

# Put the GPU in EXCLUSIVE_PROCESS compute mode (recommended for MPS)
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# Start the MPS control daemon
nvidia-cuda-mps-control -d

# Verify it's running
ps aux | grep mps

# Optional but recommended: cap per-client memory via the
# CUDA_MPS_PINNED_DEVICE_MEM_LIMIT environment variable on each client

Then deploy a daemonset to ensure MPS is always running:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-mps-daemon
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-mps
  template:
    metadata:
      labels:
        app: nvidia-mps
    spec:
      hostPID: true
      hostNetwork: true
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: mps-daemon
        image: nvidia/cuda:12.0.1-base-ubuntu22.04
        command:
        - sh
        - -c
        - |
          nvidia-cuda-mps-control -d
          echo "start_server -uid 0" | nvidia-cuda-mps-control
          sleep infinity
        securityContext:
          privileged: true
          runAsUser: 0
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: nvidia-install-dir
      volumes:
      - name: nvidia-install-dir
        hostPath:
          path: /usr/local/nvidia

Now your pods requesting GPUs will connect to the MPS daemon:
apiVersion: v1
kind: Pod
metadata:
  name: mps-client-1
spec:
  containers:
  - name: training-worker
    image: myregistry/pytorch:latest
    resources:
      requests:
        nvidia.com/gpu: "1"   # Extended resources must be whole integers
      limits:
        nvidia.com/gpu: "1"
    env:
    - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
      value: "25"   # Soft cap: this client uses at most ~25% of the SMs

Note that Kubernetes can't express fractional GPU requests like nvidia.com/gpu: "0.25" - extended resources are integers. To land several MPS clients on one physical GPU, combine MPS with the device plugin's sharing configuration (covered in the time-slicing section) so each physical GPU advertises multiple allocatable units, and use CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to cap each client's share of the SMs.

MPS Limitations and Risks
MPS is clever, but it has real limitations:
- No guaranteed fault isolation: A kernel crash resets the entire GPU context. All clients lose in-flight work. In a production inference scenario where you have four models running on one GPU, a bug in one model crashes all four simultaneously. That's catastrophic for availability.
- Memory isolation is application-dependent: MPS enforces memory bounds, but only if the CUDA application respects them. A buggy app writing out of bounds can still corrupt other clients' memory. MPS's memory protection is a speed bump, not a wall.
- Performance variability: Multiple processes sharing the same GPU context compete for kernel queue slots. Latency becomes unpredictable. Your inference endpoint might respond in 10ms one moment and 50ms the next, depending on what else is running.
- Debugging is harder: When things go wrong, you're debugging multiple processes' interactions in a shared context instead of isolated hardware. That 11ms latency spike - was it your model, someone else's, or contention between both?
- Compute isolation is limited: Unlike MIG, compute resources aren't partitioned. One process running a heavy kernel can starve others. The active-thread percentage isn't a hard resource limit; it's a scheduling fairness mechanism that breaks under adversarial conditions.
MPS works best for:
- Development environments where isolation isn't critical
- Inference workloads with similar latency requirements (batch inference, LLM serving)
- Trusted workloads (internal teams, not multi-tenant SaaS)
- Cost-sensitive scenarios where you accept the isolation trade-offs
Time-Slicing: Fair Scheduling Without Isolation
The Time-Slicing Strategy
Time-slicing is the simplest approach: configure the NVIDIA device plugin to let multiple pods request the same physical GPU. The Kubernetes scheduler oversubscribes the resource, and the plugin enforces fairness through time-slicing.
Here's the idea:
- Pod A gets 100ms of GPU time
- Pod B gets 100ms of GPU time
- Pod C gets 100ms of GPU time
- Repeat until one pod finishes
From each pod's perspective, it has exclusive GPU access for brief windows. In reality, they're rotating through the GPU.
Time-slicing is entirely transparent to your applications. Your models have no idea they're sharing. The GPU driver handles context switching automatically. This transparency is powerful for adoption - you don't need to modify code or handle special cases.
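The rotation is easy to picture with a toy simulation. The quantum and job lengths below are made-up numbers, and real drivers interleave at kernel granularity rather than fixed wall-clock slices, but the fairness property is the same:

```python
from collections import deque

def round_robin(jobs: dict[str, int], quantum: int = 100) -> dict[str, int]:
    """Simulate time-slicing: each job needs jobs[name] ms of GPU time.
    Returns the wall-clock completion time (ms) of each job."""
    queue = deque(jobs.items())
    clock = 0
    finished = {}
    while queue:
        name, remaining = queue.popleft()
        used = min(quantum, remaining)
        clock += used
        if remaining - used == 0:
            finished[name] = clock
        else:
            queue.append((name, remaining - used))
    return finished

# Three pods, each needing 300 ms of GPU time
print(round_robin({"A": 300, "B": 300, "C": 300}))
# → {'A': 700, 'B': 800, 'C': 900}
```

Each job alone would finish in 300ms; shared, they finish in 700-900ms. Individual latency stretches, but all three make progress on one GPU instead of one running and two queueing.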
NVIDIA Device Plugin Time-Slicing Configuration
Enable time-slicing in the device plugin config:
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-configs
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-device-plugin
        image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
        args:
        - "--config-file=/etc/nvidia-device-plugin/any"
        volumeMounts:
        - name: device-metrics
          mountPath: /run/prometheus
        - name: device-plugin-config
          mountPath: /etc/nvidia-device-plugin
      volumes:
      - name: device-plugin-config
        configMap:
          name: nvidia-device-plugin-configs
      - name: device-metrics
        emptyDir: {}

The replicas: 4 setting means Kubernetes will report each GPU as having 4 allocatable units. If you schedule 4 pods requesting nvidia.com/gpu: "1" each, all 4 can land on the same physical GPU, and time-slicing handles the rotation.
Oversubscription Ratios and Fairness
The oversubscription ratio (how many pods you let share one GPU) depends on your workload:
- Inference workloads: 4-8x oversubscription is safe. Inference is latency-tolerant, and multiple models benefit from batching.
- Development/notebook workloads: 2-4x is safer. These are interactive, and users get annoyed by slowdowns.
- Training workloads: 1x (don't time-slice). Training is compute-hungry and time-sensitive.
Here's a more tuned config:
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-configs
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      failOnInitError: false
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 8

The renameByDefault option advertises shared GPUs under the name nvidia.com/gpu.shared so workloads opt in to sharing explicitly, and failRequestsGreaterThanOne rejects pods that ask for more than one share. Both are useful guardrails when teams with different expectations land on the same nodes.
The Trade-Offs: Context Switching Overhead
Time-slicing isn't free. Each time you switch contexts from one pod to another, you pay overhead:
- GPU memory flush: GPU caches clear.
- Context state swap: The new pod's context loads into GPU memory.
- Synchronization: Kernels must complete before switching.
For a 4x oversubscription ratio, expect:
- 5-15% latency increase per pod compared to exclusive GPU access
- Throughput still improves overall because 4 pods are making progress instead of 1
- Variability increases: One pod's performance depends on what else is running
These aren't theoretical numbers - they're from real deployments. A single inference call that takes 10ms exclusively might take 11.2ms under 4x time-slicing. That matters if your SLA is 10ms p99. But if your workload is batch inference, where you're processing 10,000 images and latency variance doesn't matter, time-slicing is almost free.
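The trade-off reduces to simple arithmetic. A sketch using the illustrative 12% overhead figure from the example above:

```python
# Latency vs. throughput under time-slicing, using illustrative numbers.

def shared_latency(exclusive_ms: float, overhead: float) -> float:
    """Per-request latency once context-switch overhead is added."""
    return exclusive_ms * (1 + overhead)

def relative_throughput(n_pods: int, overhead: float) -> float:
    """Cluster-wide throughput vs. one pod on an exclusive GPU."""
    return n_pods / (1 + overhead)

print(f"{shared_latency(10.0, 0.12):.1f} ms")      # 11.2 ms - blows a 10 ms p99 SLA
print(f"{relative_throughput(4, 0.12):.2f}x")      # ~3.57x more work per GPU overall
```

Same numbers, opposite conclusions: a latency-bound service sees a 12% regression, while a throughput-bound batch job sees nearly 4x the work per GPU.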
Time-slicing shines when:
- You have many small, batch workloads (inference)
- Fairness across teams is critical (you want everyone's notebooks to work)
- You accept latency trade-offs for better overall throughput
- You don't need hard isolation guarantees
Workload-to-Strategy Matching
Here's the decision tree:
Does your workload need hard isolation?
├─ YES (multi-tenant, compliance, safety-critical)
│ └─ Use MIG if instances fit your memory needs
│ └─ If not, use separate GPUs
│
├─ NO (internal teams, trusted workloads)
│ ├─ Is the workload compute-intensive (training)?
│ │ ├─ YES → Use dedicated GPUs (no sharing)
│ │ └─ NO → Is latency critical (real-time serving)?
│ │ ├─ YES → MPS with caution, or MIG
│ │ └─ NO → Time-slicing (batch inference, notebooks)
│ │
│ └─ Is memory the constraint?
│ ├─ YES → MIG (if it fits) or time-slicing
│ └─ NO → MPS or time-slicing
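The tree above can be encoded directly. This is a sketch, not policy - the predicates simply mirror the branches:

```python
# Encode the decision tree above as a function. The inputs mirror the
# questions in the tree; the returned strings are the recommended strategies.

def gpu_strategy(needs_isolation: bool, fits_mig_memory: bool,
                 is_training: bool, latency_critical: bool) -> str:
    if needs_isolation:
        return "MIG" if fits_mig_memory else "dedicated GPUs"
    if is_training:
        return "dedicated GPUs"
    if latency_critical:
        return "MPS (with caution) or MIG"
    return "time-slicing"

# Multi-tenant inference that fits in a 3g.20gb instance:
print(gpu_strategy(True, True, False, True))    # MIG
# Internal batch inference:
print(gpu_strategy(False, True, False, False))  # time-slicing
```

Encoding the policy in code has a side benefit: you can enforce it in an admission webhook instead of a wiki page.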
Real-World Examples
Scenario 1: SaaS LLM Inference Provider
You run a managed inference service where customers deploy their own models. You absolutely need isolation - one customer's model crashing shouldn't affect another's.
→ Use MIG exclusively. Partition each A100 into 3g.20gb or 2g.10gb instances depending on model sizes. Schedule each customer's inference pod to a dedicated MIG instance.
apiVersion: v1
kind: Pod
metadata:
  name: customer-acme-inference
spec:
  containers:
  - name: inference
    image: customer-acme-registry/llm:latest
    resources:
      requests:
        nvidia.com/gpu: "1"   # One 3g.20gb MIG instance
      limits:
        nvidia.com/gpu: "1"

Scenario 2: Internal Development Cluster
Your data science team needs GPU access for notebooks, small experiments, and model development. You have 4 A100s and 20 people wanting GPU access simultaneously.
→ Use time-slicing at 4x oversubscription. Everyone gets turns at the GPU. Latency isn't critical; the important thing is that everyone's notebooks stay responsive.
apiVersion: v1
kind: Pod
metadata:
  name: notebook-alice
spec:
  containers:
  - name: jupyter
    image: jupyter/pytorch-notebook:latest
    resources:
      requests:
        nvidia.com/gpu: "1"   # One of 4 time-sliced shares (1/4 of a physical GPU)
      limits:
        nvidia.com/gpu: "1"

Scenario 3: Mixed Training and Inference Workload
You have both model training (needs exclusive GPU) and batch inference (can share). You have 8 GPUs.
→ Partition the cluster: Reserve 4 GPUs for training (no sharing), partition the other 4 with MIG into 3g.20gb instances for inference.
---
# Training pod - reserves an entire (non-shared) GPU
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
  - name: trainer
    image: myregistry/training:latest
    resources:
      requests:
        nvidia.com/gpu: "1"   # Full GPU on a non-MIG, non-shared node
      limits:
        nvidia.com/gpu: "1"
---
# Inference pod - requests a MIG instance
apiVersion: v1
kind: Pod
metadata:
  name: inference-job
spec:
  containers:
  - name: inference
    image: myregistry/inference:latest
    resources:
      requests:
        nvidia.com/gpu: "1"   # One MIG 3g.20gb instance
      limits:
        nvidia.com/gpu: "1"

When to Choose Each Approach: A Decision Matrix
At this point you know the mechanics of all three approaches. The question is: which one for your cluster? The answer depends on your workloads, your tolerance for risk, and your utilization goals.
MIG: Best for SaaS and Multi-Tenant Inference
Use MIG when you need strong isolation guarantees and your workloads are roughly homogeneous in size. MIG is expensive (in terms of the capacity stranded by fixed partitions), but it's the only approach with hardware-level fault isolation.
When to use MIG:
- Multi-tenant SaaS where customer isolation is critical
- Production inference serving multiple models
- Scenarios where one workload crashing is unacceptable
- You value reliability over maximum utilization
When NOT to use MIG:
- Your workloads are wildly different sizes (8GB jobs and 24GB jobs)
- You're running research code and crashes are expected
- Utilization is already good (MIG wastes capacity)
MPS: Best for Trusted Workloads in Controlled Environments
Use MPS when you trust all workloads on the GPU, you want better utilization than MIG, and you can accept occasional context resets.
When to use MPS:
- Internal research clusters where team members know each other's code quality
- Batch inference from known, tested models
- Model serving with models you trained yourself (not customer code)
- You want finer control over resource sharing than time-slicing
When NOT to use MPS:
- Multi-tenant SaaS (untrusted code)
- Scenarios where GPU crashes are unacceptable
- You're running experimental code from multiple teams
Time-Slicing: Best for Development and Maximizing Utilization
Use time-slicing when you want simplicity, fair sharing among many workloads, and maximum utilization. It's the easiest to implement and understand.
When to use time-slicing:
- Development clusters with notebooks and experiments
- You have many jobs (20+) waiting for GPU
- Academic settings with student code
- You want simple scheduling fairness
- Cost is more important than latency
When NOT to use time-slicing:
- You have strict latency SLOs (context switching adds delay)
- You have long-running jobs (they'll be preempted frequently)
- You need deterministic performance (time-slicing is inherently variable)
Scheduling and Bin-Packing Strategies
Gang Scheduling for MIG Workloads
If your job needs multiple MIG instances (e.g., a distributed training job using 4 MIG instances), you need gang scheduling. Without it, the scheduler might allocate 2 instances to your job, then get stuck because other pods claimed the remaining 2.
Use a gang-aware scheduler such as Volcano or NVIDIA's KAI Scheduler to coordinate multi-instance allocations. Even with the default scheduler, pod affinity can at least keep the pods co-located:
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training-job
spec:
  parallelism: 4
  completions: 4
  template:
    metadata:
      labels:
        job-id: distributed-training-job
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: job-id
                operator: In
                values:
                - distributed-training-job
            topologyKey: kubernetes.io/hostname
      containers:
      - name: trainer
        image: myregistry/distributed-training:latest
        resources:
          requests:
            nvidia.com/gpu: "1"   # 4 pods × 1 MIG instance = 4 instances needed
          limits:
            nvidia.com/gpu: "1"

The podAffinity rule ensures all 4 pods land on the same node. Then the device plugin allocates them to MIG instances.
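Pod affinity only co-locates the pods; it doesn't make the scheduling all-or-nothing. With a gang scheduler like Volcano, the gang is expressed as a minimum member count. A sketch, assuming Volcano is installed (the resource follows its PodGroup CRD):

```yaml
# Volcano PodGroup: don't start any pod of the gang until all 4 can run
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: distributed-training-gang
spec:
  minMember: 4            # all-or-nothing: schedule 4 pods or none
  minResources:
    nvidia.com/gpu: "4"   # 4 MIG instances across the gang
```

The job's pods join the gang by setting schedulerName: volcano and referencing the PodGroup via Volcano's group-name annotation; until 4 instances are free, none of the pods start, so the job can never deadlock holding half its GPUs.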
Resource Quotas for Fair Multi-Team Sharing
When multiple teams share a cluster, enforce quotas at the namespace level:
---
apiVersion: v1
kind: Namespace
metadata:
  name: team-data-science
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-data-science
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # This team gets at most 8 GPU slots
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: protect-gpu-workloads
  namespace: team-data-science
spec:
  minAvailable: 1
  selector:
    matchLabels:
      workload-type: gpu-intensive

Now if team-data-science tries to schedule pods totaling more than 8 GPU slots, the quota controller rejects them. (For extended resources like nvidia.com/gpu, only the requests.* quota form is supported.) Fair sharing enforced.
Chargeback: Billing Based on Actual GPU Time Used
If you're doing chargeback (billing internal teams for GPU usage), you need to track actual GPU time consumed, not just requested.
MIG and MPS make this cleaner than time-slicing:
- MIG: Pod running on a MIG instance = GPU time consumed. Track by instance.
- MPS: Monitor nvidia-smi output; the process list tells you which clients are using each GPU.
- Time-slicing: Harder. You need metrics on how much time each pod actually held the GPU.
Use NVIDIA DCGM (Data Center GPU Manager) to export metrics:
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    app: dcgm-exporter
  ports:
  - port: 9400
    protocol: TCP
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.0
        securityContext:
          privileged: true
        env:
        - name: DCGM_EXPORTER_INTERVAL
          value: "30000"   # collection interval in ms (30s)
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-resources
      volumes:
      - name: pod-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources

This exports Prometheus metrics like:
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-xxx"} 85
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-xxx"} 1410
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-xxx"} 7001
Scrape these into Prometheus, join with pod metadata, and you can bill teams based on actual utilization.
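Once utilization samples are labeled by pod or team, the billing math is a grouped sum. A minimal sketch with hypothetical in-memory samples; a real pipeline would query Prometheus, and the internal rate is an assumption:

```python
# Hypothetical chargeback: turn per-team GPU utilization samples into dollars.
# Each sample is (team, gpu_utilization_percent), captured every 30 seconds.

from collections import defaultdict

SAMPLE_INTERVAL_S = 30
RATE_PER_GPU_HOUR = 2.50   # assumed internal rate

def bill(samples: list[tuple[str, float]]) -> dict[str, float]:
    """Dollars owed per team: utilization-weighted GPU-hours times the rate."""
    gpu_seconds: dict[str, float] = defaultdict(float)
    for team, util_pct in samples:
        gpu_seconds[team] += (util_pct / 100.0) * SAMPLE_INTERVAL_S
    return {t: round(s / 3600 * RATE_PER_GPU_HOUR, 4) for t, s in gpu_seconds.items()}

# One hour of samples per team: 120 samples x 30 s
samples = [("data-science", 85.0)] * 120 + [("platform", 20.0)] * 120
print(bill(samples))
```

Billing on utilization-weighted GPU-hours, rather than requested capacity, is what makes time-sliced sharing feel fair: a notebook that sat idle all afternoon costs its team almost nothing.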
Why This Matters in Production
In a real production environment, GPU sharing becomes an operational necessity around 50-100 GPUs. At that scale, you have:
- Multiple teams competing for resources
- Different workload characteristics (some need real-time response, others batch)
- Cost pressures to maximize utilization
- Availability requirements that demand fault tolerance
Getting this wrong doesn't just waste money - it causes friction that kills adoption. If your data scientists can never get GPU access because training jobs hog the entire machine, they switch to smaller models that don't need GPUs, defeating the purpose of buying them. If your inference system crashes when one customer's job breaks, you lose customer trust.
The right strategy, implemented thoughtfully, compounds benefits: better team satisfaction, lower cost per compute unit, higher cluster utilization, and ultimately faster research and product iteration.
Comparative Architecture Diagram
Here's how the three strategies differ at a high level:
MIG ISOLATION
┌──────────┐
│ Physical │
│ GPU │
├──────────┤
│ MIG 0 ├─→ Pod A (isolated compute + memory)
│ MIG 1 ├─→ Pod B (isolated compute + memory)
│ MIG 2 ├─→ Pod C (isolated compute + memory)
└──────────┘
Guarantee: Hardware-level isolation. Pod A crash ≠ Pod B crash.
MPS SHARING
┌──────────────────────────┐
│ GPU Context (shared) │
├──────────────────────────┤
│ Kernel Queue │
├──┬──┬──┬──────────────────┤
│ K│ K│ K│ from Pod A, B, C │
├──┴──┴──┴──────────────────┤
│ Memory: Pod A | Pod B | C │ (fenced)
└──────────────────────────┘
Guarantee: Memory isolation. No compute or fault isolation.
TIME-SLICING
┌──────────────────────────┐
│ GPU Context │
├──────┬──────┬──────┬─────┤
│Time 0│Time 1│Time 2│Time 3
├──────┼──────┼──────┼─────┤
│Pod A │Pod B │Pod C │Pod A
└──────┴──────┴──────┴─────┘
Guarantee: Fair scheduling only. Coarse context switching overhead.
Putting It All Together: Complete Example Setup
Here's a production-ready cluster setup using all three strategies:
---
# 1. Install NVIDIA GPU Operator (prerequisite)
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
---
# 2. Configure device plugin for MIG + time-slicing
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-configs
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: "mixed"
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
---
# 3. Deploy device plugin daemonset
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-device-plugin
  template:
    metadata:
      labels:
        app: nvidia-device-plugin
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-device-plugin
        image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
        args:
        - "--config-file=/etc/nvidia-device-plugin/any"
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin-config
          mountPath: /etc/nvidia-device-plugin
      volumes:
      - name: device-plugin-config
        configMap:
          name: nvidia-device-plugin-configs
---
# 4. MIG-only namespace (multi-tenant inference)
apiVersion: v1
kind: Namespace
metadata:
  name: saas-inference
  labels:
    gpu-strategy: "mig-only"
---
# 5. SaaS inference pod requesting a MIG instance
# (with migStrategy "mixed", each MIG profile is a named resource)
apiVersion: v1
kind: Pod
metadata:
  name: customer-acme-llm
  namespace: saas-inference
spec:
  containers:
  - name: llm-inference
    image: myregistry/llm-server:latest
    resources:
      requests:
        nvidia.com/mig-3g.20gb: "1"   # One 3g.20gb MIG instance
      limits:
        nvidia.com/mig-3g.20gb: "1"
    ports:
    - containerPort: 8000
---
# 6. Time-slicing namespace (development)
apiVersion: v1
kind: Namespace
metadata:
  name: development
  labels:
    gpu-strategy: "time-slicing"
---
# 7. Development notebook pod (shared GPU)
apiVersion: v1
kind: Pod
metadata:
  name: notebook-alice
  namespace: development
spec:
  containers:
  - name: jupyter
    image: jupyter/pytorch-notebook:latest
    resources:
      requests:
        nvidia.com/gpu: "1"   # One of 4 time-sliced shares (1/4 of a GPU)
      limits:
        nvidia.com/gpu: "1"
    ports:
    - containerPort: 8888
    env:
    - name: JUPYTER_ENABLE_LAB
      value: "yes"
---
# 8. Resource quota for development namespace
apiVersion: v1
kind: ResourceQuota
metadata:
name: dev-gpu-quota
namespace: development
spec:
hard:
requests.nvidia.com/gpu: "2" # Max 2 full GPUs worth
limits.nvidia.com/gpu: "2"
---
# 9. DCGM metrics exporter for chargeback
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: dcgm-exporter
namespace: monitoring
spec:
selector:
matchLabels:
app: dcgm-exporter
template:
metadata:
labels:
app: dcgm-exporter
spec:
nodeSelector:
nvidia.com/gpu: "true"
containers:
- name: dcgm-exporter
image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.0
securityContext:
privileged: true
env:
- name: DCGM_EXPORTER_KUBERNETES
value: "true"
volumeMounts:
- mountPath: /run/prometheus
name: pod-resources
volumes:
- name: pod-resources
hostPath:
path: /var/lib/kubelet/pod-resourcesDeploy this, verify each layer:
```bash
# Verify device plugin is running
kubectl get daemonset -n gpu-operator nvidia-device-plugin

# Check available GPU resources
kubectl describe nodes | grep nvidia.com/gpu
# Expected output (for 4x time-slicing):
#   nvidia.com/gpu: 16   (4 physical GPUs × 4 shares each)

# Submit MIG workload
kubectl apply -f saas-inference.yaml

# Submit time-sliced workload
kubectl apply -f development-notebook.yaml

# Verify pod scheduling
kubectl get pods -n saas-inference
kubectl get pods -n development

# Check GPU utilization
nvidia-smi dmon  # Real-time monitoring
```

Monitoring and Troubleshooting
When something goes wrong, start here:
Pod stuck in Pending?

```bash
kubectl describe pod <pod-name> -n <namespace>
# Look for "Insufficient nvidia.com/gpu" in events

# Check device plugin logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin
```

GPU performance degraded?

```bash
# Check active processes on GPU
nvidia-smi

# If using MPS, query the control daemon for the active thread percentage
echo get_default_active_thread_percentage | nvidia-cuda-mps-control

# If using time-slicing, check scheduling pressure
kubectl top pods -n <namespace> --containers
```

MIG instance not appearing in Kubernetes?

```bash
# Verify MIG is enabled at the node level
nvidia-smi -L  # Should list MIG devices alongside the parent GPU

# Restart device plugin to pick up changes
kubectl rollout restart daemonset nvidia-device-plugin -n gpu-operator

# Check device plugin logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin | grep -i mig
```

Lessons from Large-Scale GPU Deployments
Real deployments at scale teach lessons that aren't obvious from smaller experiments. A team running ten GPUs might not see the problems that emerge at one hundred GPUs; problems scale differently than you'd expect. One hundred percent GPU utilization at ten GPUs is probably fine - you're just using all your resources efficiently. One hundred percent utilization at one hundred GPUs is a problem: you have no slack for spikes in demand, no room for node failures, and no buffer before on-call gets paged. Best practices suggest aiming for seventy to eighty percent utilization as a sustainable operating point. Below that, you're wasting money. Above that, you're constantly fighting capacity constraints.
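The rule of thumb above can be expressed as a simple policy check. This is an illustrative sketch: the thresholds and the advice strings are assumptions drawn from the paragraph, not official recommendations.

```python
# Map average fleet GPU utilization (0..1) to a rough capacity signal.
# Thresholds follow the 50% / 70-80% bands discussed in the text.

def capacity_advice(avg_utilization: float) -> str:
    """Return a capacity-planning hint for the given average utilization."""
    if avg_utilization < 0.5:
        return "underutilized: consolidate workloads or share GPUs"
    if avg_utilization <= 0.8:
        return "healthy: headroom for spikes and node failures"
    # Above ~80% there is no slack for demand spikes or failed nodes
    return "overcommitted: add capacity or shed load"

print(capacity_advice(0.95))
```

A real version would feed this from a rolling average of DCGM utilization metrics rather than a single number.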
The coordination problem grows at scale. At ten GPUs, one person can manage things. At one hundred GPUs, you need dedicated infrastructure. Who manages the device plugin? Who handles GPU node failures? Who monitors utilization and decides when to add capacity? Who debugs when scheduling goes wrong? These questions become organizational, not just technical. Mature large-scale GPU operations have clear ownership and runbooks for every scenario.
Another lesson: heterogeneous GPU types complicate strategy decisions. Ideally, your entire cluster is the same GPU (all A100s, all H100s). In practice, you end up with mixed generations. You have some older V100s, some newer A100s, and some latest-generation H100s. MIG support varies across generations. MPS is more consistent but latency characteristics differ. Time-slicing works everywhere but performs differently. Accommodating this heterogeneity requires sophisticated bin-packing logic in your scheduler. You need to understand which workloads can run on which GPUs, which are sensitive to GPU generation, and make those constraints clear to your orchestration system.
Key Takeaways
You now have three tools in your arsenal:
- MIG for hard isolation: Use it when workload independence is non-negotiable. Trade flexibility for guarantees.
- MPS for shared contexts: Use it for inference workloads that trust each other. Moderate risk, better utilization than MIG alone.
- Time-slicing for fairness: Use it when you want simple scheduling with minimal isolation concerns. Highest utilization, highest contention.
The real win comes from matching the right strategy to the right workload. A SaaS platform runs MIG. A development cluster runs time-slicing. A batch inference pipeline might use all three on different GPUs.
Start by auditing your current GPU utilization. If it's below 50%, you're leaving money on the table. Pick one strategy, pilot it with a subset of workloads, measure the results, then scale.
GPU sharing isn't magic - it's thoughtful scheduling. Get it right, and you'll double or triple your effective GPU capacity without buying more hardware.
Implementation Challenges That Teams Face
The theory of GPU sharing sounds clean until you hit production reality. The first challenge is handling heterogeneous workloads on the same physical GPU. You might partition a GPU for inference, thinking all workloads will have similar characteristics. Then someone submits a training job that requires significantly more memory. Your carefully tuned MIG partitions become misaligned with actual workload demands. Real teams maintain multiple partition configurations and switch between them based on observed workload patterns. Some use predictive scaling based on historical demand, automatically reconfiguring GPUs before the demand arrives.
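One concrete piece of that partition-switching logic is choosing a MIG profile from a workload's memory demand. The sketch below uses real A100-40GB profile names, but the selection policy itself is an illustrative assumption.

```python
# (profile name, memory in GB) from smallest to largest, A100-40GB
A100_PROFILES = [
    ("1g.5gb", 5),
    ("2g.10gb", 10),
    ("3g.20gb", 20),
    ("7g.40gb", 40),
]

def pick_profile(required_gb: float) -> str:
    """Return the smallest MIG profile that fits the requested memory."""
    for name, mem_gb in A100_PROFILES:
        if required_gb <= mem_gb:
            return name
    raise ValueError(f"{required_gb} GB exceeds the largest profile")

print(pick_profile(8))  # -> 2g.10gb
```

A controller built on this idea would compare observed workload demand against the current partition layout and trigger reconfiguration when they drift apart.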
The second challenge is debugging performance issues in shared environments. When a pod is slow, is it slow because the model is inherently slow, or because it's competing for GPU time? Time-slicing makes this difficult to determine without detailed instrumentation. Teams often maintain shadow pods running the same workload in isolation to establish baseline performance. If the isolated version is faster, you have contention. But determining what caused the contention requires detailed GPU tracing, which is time-consuming. Investing in automated performance regression detection helps catch degradation before users complain.
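The shadow-pod comparison can be reduced to a simple statistical check. This is a minimal sketch; the 20% tolerance is an assumed example value, not a recommended threshold.

```python
from statistics import median

def is_contended(baseline_ms: list[float], shared_ms: list[float],
                 tolerance: float = 0.20) -> bool:
    """Compare median latencies of the isolated shadow pod (baseline)
    and the shared-GPU pod; True suggests GPU contention."""
    return median(shared_ms) > median(baseline_ms) * (1 + tolerance)

print(is_contended([10.0, 11.0, 10.5], [14.0, 15.0, 13.5]))  # -> True
```

Medians are used instead of means so a single slow outlier request doesn't trigger a false contention alarm.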
The third challenge is fair scheduling across many workloads. Kubernetes doesn't inherently understand GPU fairness. It knows about compute and memory but treats GPU time as opaque. If you have strict fairness requirements, you might need additional scheduling layers or custom policies. Some teams use Karpenter or NVIDIA's KAI Scheduler for more nuanced control. Others implement custom webhook logic that intercepts scheduling decisions and applies fairness rules.
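The kind of fairness rule such a webhook might apply can be sketched in a few lines. The team quotas and the request shape here are assumptions for illustration, not any real admission API.

```python
def admit(team: str, requested: int,
          in_use: dict[str, int], quota: dict[str, int]) -> bool:
    """Allow the pod only if its team stays within its GPU quota."""
    return in_use.get(team, 0) + requested <= quota.get(team, 0)

quota = {"research": 4, "prod": 8}
in_use = {"research": 3, "prod": 5}
print(admit("research", 2, in_use, quota))  # -> False (3 + 2 > 4)
```

In a real validating webhook, `in_use` would be computed from the current pod inventory and the decision returned as an AdmissionReview response.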
The fourth challenge is managing stateful GPU operations. GPU memory is limited, and with sharing it's even more constrained. Applications that allocate large GPU buffers for the session lifetime become problematic. If pod A allocates thirty percent of GPU memory for its session, then pod B tries to start, pod B gets less memory than expected. Some applications handle this gracefully; others crash. The gateway pattern helps here - applications should request GPU memory dynamically and handle resize situations. This requires application changes, which adds engineering burden.
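The "request memory dynamically and degrade gracefully" pattern looks roughly like this. `fake_alloc` is a stand-in for a real framework allocation call; the halving strategy and sizes are illustrative assumptions.

```python
def allocate_with_fallback(allocate, desired_mb: int, min_mb: int) -> int:
    """Halve the requested GPU buffer until it fits or drops below min_mb."""
    size = desired_mb
    while size >= min_mb:
        try:
            allocate(size)
            return size
        except MemoryError:
            size //= 2
    raise MemoryError(f"could not allocate even {min_mb} MB")

# Fake allocator simulating a shared GPU with only 6000 MB free
def fake_alloc(mb: int) -> None:
    if mb > 6000:
        raise MemoryError

print(allocate_with_fallback(fake_alloc, desired_mb=16000, min_mb=1000))
```

An application written this way survives a neighbor grabbing memory first, at the cost of running with a smaller working set.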
Cost Attribution and Billing in Shared Environments
Billing for shared GPUs is conceptually simple but operationally complex. If you're doing cost allocation, you need to track actual GPU time consumed per workload, not just requested. A pod might request one MIG instance for one hour, but if it only uses it for five minutes, should you bill for the full hour? Fair billing might charge by actual consumption, but that requires accurate measurement. With time-slicing, measurement is particularly difficult because the same physical GPU might run multiple workloads and you need to attribute time fairly.
Some teams implement billing based on requested resources as a simplification. It's easier to implement but feels unfair to efficient workloads - a model that trains efficiently might consume less compute in the same time window as an inefficient model, but they're charged equally. Other teams implement billing based on actual utilization measured through DCGM metrics, which is fairer but requires sophisticated accounting infrastructure.
A third approach is fixed pricing per pod per month, removing the complexity of usage-based billing entirely. This works well for development clusters where fairness is more important than precise cost allocation. It breaks down for production workloads where cost optimization is critical.
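Usage-based chargeback from utilization samples can be sketched as a small accounting loop. The sample format and the price are assumptions; in practice the samples would come from DCGM metrics (per-pod GPU utilization) scraped at a fixed interval.

```python
def bill(samples: list[tuple[str, float]], interval_s: float,
         price_per_gpu_hour: float) -> dict[str, float]:
    """samples: (pod, utilization 0..1), one entry per scrape interval.
    Charges each pod for the GPU-hours it actually consumed."""
    costs: dict[str, float] = {}
    for pod, util in samples:
        gpu_hours = util * interval_s / 3600
        costs[pod] = costs.get(pod, 0.0) + gpu_hours * price_per_gpu_hour
    return costs

samples = [("acme-llm", 0.9), ("notebook-alice", 0.1), ("acme-llm", 0.8)]
print(bill(samples, interval_s=3600, price_per_gpu_hour=2.0))
```

Note how this differs from request-based billing: the lightly loaded notebook pays a fraction of what the busy inference pod pays, even if both requested the same share.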
Observability and Monitoring at Scale
Proper observability is the difference between GPU sharing working smoothly and teams blaming the scheduling system for their own problems. You need visibility into resource contention, pod interference, and individual workload performance. NVIDIA's DCGM provides low-level metrics but lacks context about which pod is causing which behavior.
Sophisticated observability requires joining DCGM metrics with pod metadata and application logs. One team deployed GPU sharing without investing in observability. When pods started running slower, they assumed the scheduling strategy was wrong. Weeks of troubleshooting later, they discovered the real issue was a memory leak in one pod, which consumed more memory over time and degraded performance for the other pods sharing the GPU. With proper memory tracking and anomaly detection, they would have caught this in minutes.
Latency tracing becomes critical with sharing. When a request takes longer than expected, you need to understand whether the time is spent computing or waiting for GPU allocation. Distributed tracing systems integrated with GPU metrics provide this visibility. Some teams build custom monitoring that correlates pod scheduling timestamps with latency metrics to identify scheduling-induced delays.
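The core of that compute-versus-wait analysis is a simple decomposition: subtract measured GPU compute time from end-to-end latency to expose queueing and allocation delay. The field names are illustrative assumptions.

```python
def wait_fraction(total_ms: float, gpu_compute_ms: float) -> float:
    """Fraction of end-to-end latency spent waiting rather than computing."""
    if total_ms <= 0:
        raise ValueError("total latency must be positive")
    return max(0.0, (total_ms - gpu_compute_ms) / total_ms)

# A request that took 120 ms end to end but only 30 ms on the GPU
print(wait_fraction(total_ms=120.0, gpu_compute_ms=30.0))  # -> 0.75
```

A consistently high wait fraction under time-slicing points at oversubscription rather than a slow model.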
Evolution of Your GPU Sharing Strategy
Most teams don't pick the optimal strategy immediately and stick with it forever. Instead, they evolve based on operational experience. A common pattern is starting with time-slicing for simplicity, discovering that performance variability becomes problematic, then moving to MIG for better isolation. Another pattern is starting with simple resource requests and evolving to more sophisticated bin-packing as utilization demands increase.
The evolution requires planning. If you start with pure time-slicing and later want to migrate to MIG, you need a transition period where both coexist. Some clusters have MIG nodes and time-slicing nodes, with workloads explicitly assigned to the appropriate tier. This hybrid approach adds complexity but provides flexibility during transitions.
Understanding your evolution path in advance prevents rework. Teams that anticipate growth design their clusters with MIG-capable GPUs from the start, even if they're only using time-slicing initially. When demands change, they reconfigure rather than upgrade hardware.
Scaling Beyond Single Clusters
GPU sharing patterns change when you operate at multi-cluster scale. Different clusters might have different sharing strategies. Maybe your research cluster uses time-slicing for fairness, while your production cluster uses MIG for isolation. Your scheduling system needs to understand these different capabilities and route workloads to appropriate clusters.
Federation of multiple clusters adds complexity. You need admission control that understands overall capacity across clusters. You need to handle cases where one cluster is at capacity and workloads need to queue or spill over to other clusters. You need to maintain fairness across cluster boundaries. This typically requires a global scheduler or a cluster federation framework like Karmada that understands these constraints.
Some teams maintain per-team GPU quotas across clusters, ensuring fair resource distribution even as workloads move between clusters. This requires a global accounting system that knows how much GPU time each team has consumed across all clusters.
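That global accounting can be sketched as a cross-cluster aggregation. The data shapes (GPU-seconds per team per cluster) are illustrative assumptions about what such a system would track.

```python
def over_quota(usage_by_cluster: dict[str, dict[str, float]],
               quotas: dict[str, float]) -> list[str]:
    """Return teams whose total GPU-seconds across all clusters exceed quota."""
    totals: dict[str, float] = {}
    for cluster_usage in usage_by_cluster.values():
        for team, gpu_seconds in cluster_usage.items():
            totals[team] = totals.get(team, 0.0) + gpu_seconds
    return [t for t, used in sorted(totals.items()) if used > quotas.get(t, 0.0)]

usage = {"us-east": {"research": 5000, "prod": 2000},
         "eu-west": {"research": 4000, "prod": 1000}}
print(over_quota(usage, {"research": 8000, "prod": 8000}))  # -> ['research']
```

The key property is that a team can't dodge its quota by spreading workloads across clusters, because enforcement happens on the fleet-wide total.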
Future of GPU Sharing Technology
GPU sharing technology is rapidly evolving. Newer NVIDIA GPUs may have better built-in support for fine-grained sharing that moves beyond MIG, MPS, and time-slicing. Architectural support for mixing workload sizes on the same GPU might eliminate the constraint that all MIG instances on a GPU must be the same size. Improved scheduler support in Kubernetes for GPU-specific scheduling logic might reduce the need for custom controllers.
The broader trend is toward more sophisticated resource scheduling for heterogeneous workloads. GPUs are getting more powerful and more specialized. Sharing becomes increasingly necessary but also increasingly complex. Teams investing in solid foundations now will be better positioned to adopt new technologies as they emerge.
Practical infrastructure for AI systems that scale.