Multi-Tenant GPU Sharing: MIG, MPS, and Time-Slicing
You've got expensive GPUs sitting in your cluster, and they're only being used 30% of the time. Yeah, that hurts to think about. The good news? NVIDIA's given us three solid strategies to squeeze more value out of that hardware: Multi-Instance GPU (MIG), CUDA Multi-Process Service (MPS), and Kubernetes time-slicing. Each approach trades off isolation, performance, and complexity in different ways. Let's dig into when and how to use each one.
Table of Contents
- The GPU Sharing Dilemma
- MIG: Hardware Isolation at the Instance Level
- What MIG Actually Does
- Why MIG Matters
- Hardware Guarantees and Isolation
- Kubernetes MIG Device Plugin Setup
- MIG Scheduling in Kubernetes
- MIG's Sweet Spot
- MPS: Software-Level Context Sharing
- How MPS Works
- Setting Up MPS in Kubernetes
- MPS Limitations and Risks
- Time-Slicing: Fair Scheduling Without Isolation
- The Time-Slicing Strategy
- NVIDIA Device Plugin Time-Slicing Configuration
- Oversubscription Ratios and Fairness
- The Trade-Offs: Context Switching Overhead
- Workload-to-Strategy Matching
- Real-World Examples
- When to Choose Each Approach: A Decision Matrix
- MIG: Best for SaaS and Multi-Tenant Inference
- MPS: Best for Trusted Workloads in Controlled Environments
- Time-Slicing: Best for Development and Maximizing Utilization
- Scheduling and Bin-Packing Strategies
- Gang Scheduling for MIG Workloads
- Resource Quotas for Fair Multi-Team Sharing
- Chargeback: Billing Based on Actual GPU Time Used
- Why This Matters in Production
- Comparative Architecture Diagram
- Putting It All Together: Complete Example Setup
- Monitoring and Troubleshooting
- Lessons from Large-Scale GPU Deployments
- Key Takeaways
- Implementation Challenges That Teams Face
- Cost Attribution and Billing in Shared Environments
- Observability and Monitoring at Scale
- Evolution of Your GPU Sharing Strategy
- Scaling Beyond Single Clusters
- Future of GPU Sharing Technology
The GPU Sharing Dilemma
Here's the fundamental problem: modern GPUs like the A100 and H100 are monsters - too much compute power for most individual workloads. A single inference job might use 5% of available GPU memory and less than 10% of compute. Meanwhile, another team's notebook is blocked waiting for GPU availability.
Traditional GPU scheduling is binary: either you own the whole GPU, or you don't get one. This leads to massive waste. Running three small inference models on separate GPUs costs 3x the silicon, 3x the power, 3x the cooling - but you're getting maybe 35% utilization across all of them.
The economics are brutal when you think about it. An A100 80GB GPU costs roughly $12,000 to purchase and another $3,000-5,000 annually to operate (power, cooling, networking, depreciation). In the cloud, that translates to $2-3 per GPU-hour on-demand. If your GPU is idle 70% of the time, you're hemorrhaging money. Even large enterprises with hundreds of GPUs find themselves with expensive paperweights. For a 100-GPU cluster running at 30% utilization, that's $1.8 million per year in pure waste.
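The back-of-envelope math is easy to script. Here's a minimal sketch; the rate and utilization figures are illustrative assumptions in line with the numbers above, not vendor quotes:

```python
# Estimate annual spend wasted on idle GPU capacity.
# All rates are illustrative assumptions, not vendor quotes.

HOURS_PER_YEAR = 8760

def idle_cost(num_gpus: int, cost_per_gpu_hour: float, utilization: float) -> float:
    """Dollars per year spent on GPU-hours that do no useful work."""
    total = num_gpus * cost_per_gpu_hour * HOURS_PER_YEAR
    return total * (1.0 - utilization)

# 100 GPUs at ~$2.50/GPU-hour, running at 30% utilization
waste = idle_cost(num_gpus=100, cost_per_gpu_hour=2.50, utilization=0.30)
print(f"${waste:,.0f} per year of idle capacity")  # ~$1.5M, the same order as above
```

Plug in your own amortized on-prem cost per GPU-hour instead of a cloud rate and the shape of the answer doesn't change: idle fraction times total spend is money gone.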
NVIDIA's answer comes in three flavors, each with different guarantees and trade-offs:
- MIG: Hardware-level partitioning. True isolation at the silicon level.
- MPS: Software-level sharing. Single GPU context serving multiple processes.
- Time-slicing: Pure scheduling fairness. Everyone gets turns at the full GPU.
Pick the wrong one, and you'll hit performance cliffs or reliability issues. Pick the right one, and you multiply your effective GPU capacity without buying more hardware. This isn't just about being thrifty with resources. It's about making your engineering team happy. If your data scientists are waiting for GPU availability while there are GPUs in the rack at 30% utilization, you've got a problem bigger than money. You've got a cultural problem where the infrastructure is fighting the team instead of enabling them.
The decision isn't abstract either. Get this right in a 100-GPU cluster, and you've effectively purchased 300 GPUs' worth of throughput without spending a dime on new silicon. Get it wrong, and your data science team is waiting for GPU availability while your infrastructure team explains why sharing doesn't work. Understanding the trade-offs between these three approaches will determine whether you're running an enabling infrastructure or a constraining one.
MIG: Hardware Isolation at the Instance Level
What MIG Actually Does
Multi-Instance GPU carves an A100 or H100 into truly independent compute units at the hardware level. It's not virtualization - it's physical partitioning. Each MIG instance gets dedicated memory, dedicated compute cores, and dedicated interconnects. If your job crashes on MIG instance 1, MIG instance 2 keeps running. Period. This is the critical difference from the other approaches: MIG is failure-isolated at the silicon level.
Think of it like dividing a physical GPU into separate mini-GPUs. That's closer to the truth than it sounds. NVIDIA actually carved up the silicon during manufacturing. Each instance has its own L2 cache partitions, its own memory controllers, its own execution units. When you enable MIG mode, you're not doing software magic - you're unlocking hardware that was designed from the ground up to be partitionable.
Why MIG Matters
The reason MIG is valuable isn't just technical - it's organizational. In a multi-tenant environment, you might have multiple customer workloads running on the same hardware. With MIG, you have a guarantee: if one customer's code crashes the GPU, other customers aren't affected. This is literally your SLA. You promised 99.9% availability, and one customer's bug shouldn't violate that promise to another customer.
Compare this to time-slicing, where all jobs take turns on the same GPU. If one job crashes the GPU during its turn, all waiting jobs are delayed. Or MPS, where a crash in one client process resets the entire shared context. With MIG, crashes are contained. This isolation is especially valuable in SaaS scenarios where you're serving multiple customers and one customer's problems can't cascade to others.
NVIDIA's A100 and H100 support up to 7 MIG instances per GPU. On an A100 40GB, the profiles look like this (the leading number is the count of compute slices, out of 7):
- 1g.5gb - 1/7 of the SMs, 5GB memory (lightweight inference)
- 2g.10gb - 2/7 of the SMs, 10GB memory (small models)
- 3g.20gb - 3/7 of the SMs, 20GB memory (medium workloads; at most 2 per GPU)
- 4g.20gb - 4/7 of the SMs, 20GB memory (compute-heavy workloads)
- 7g.40gb - all SMs, 40GB memory (the full GPU as a single instance, which defeats MIG's purpose)
(The A100 80GB and H100 80GB double the memory per profile: 1g.10gb, 2g.20gb, 3g.40gb, 7g.80gb, and so on.)
The practical constraint: the hardware does allow mixed profiles on one GPU (subject to NVIDIA's placement rules), but mixed geometries are painful to schedule, and the Kubernetes device plugin's "single" MIG strategy assumes every GPU on a node is partitioned identically. In practice you standardize on one profile per node pool, which sometimes forces you to leave capacity on the table.
Understanding this is critical for planning. Partition an A100 40GB into two 3g.20gb instances and you've allocated all 40GB of memory but only 6 of the 7 compute slices. Partition it into three 2g.10gb instances and 10GB of memory plus a compute slice sit idle. Whatever profile you pick, some capacity is stranded, and changing your mind later means draining the GPU and reconfiguring it. This forces a binary decision: optimize for a single workload size, or accept capacity waste.
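You can sanity-check a planned partition before touching nvidia-smi. This is a hypothetical helper that encodes only the slice and memory budgets of the A100 40GB profiles listed above; NVIDIA's real placement rules are stricter than this simple accounting:

```python
# Hypothetical MIG layout checker: validate a partition plan against an
# A100 40GB's budget of 7 compute slices and 40 GB of memory.
# NVIDIA's actual placement rules are stricter than this accounting.

PROFILES = {  # name: (compute_slices, memory_gb)
    "1g.5gb": (1, 5),
    "2g.10gb": (2, 10),
    "3g.20gb": (3, 20),
    "4g.20gb": (4, 20),
    "7g.40gb": (7, 40),
}

def fits(layout: list[str], max_slices: int = 7, max_mem_gb: int = 40) -> bool:
    """True if the layout stays within the slice and memory budgets."""
    slices = sum(PROFILES[p][0] for p in layout)
    mem = sum(PROFILES[p][1] for p in layout)
    return slices <= max_slices and mem <= max_mem_gb

print(fits(["3g.20gb", "3g.20gb"]))            # True: 6 slices, 40 GB
print(fits(["3g.20gb", "3g.20gb", "1g.5gb"]))  # False: memory over budget
```

A check like this makes the stranded-capacity trade-off visible at planning time instead of at `nvidia-smi` time.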
Hardware Guarantees and Isolation
Here's where MIG shines: the isolation is enforced at the hardware level. Each MIG instance has its own:
- Execution units (SMs - Streaming Multiprocessors)
- L2 cache partition
- Memory controllers
- Interconnect bandwidth
If one MIG instance goes rogue and starts hammering memory, it can't steal bandwidth from another instance. If a kernel on instance 1 deadlocks, instance 2 keeps running unaffected. This is true fault isolation.
The implications are profound. In multi-tenant environments - think a SaaS company running customer models on shared GPUs - this guarantee is everything. One customer's buggy training code can't crash another customer's inference. One team's experimental research can't poison another team's production service. This isolation is legally and operationally valuable in ways that are hard to quantify until you've been burned by a neighbor process.
Compare that to MPS (which we'll cover next), where a single GPU context serves all processes. One misbehaving process can bring down the entire context.
Kubernetes MIG Device Plugin Setup
Getting MIG working in Kubernetes means:
- Enable MIG on the GPU (driver configuration)
- Partition it into your desired instance sizes
- Tell Kubernetes about the partitions (NVIDIA device plugin)
- Schedule workloads to MIG instances (resource requests)
This is where your infrastructure decisions get locked in. Once you've configured a GPU for MIG with two 3g.20gb instances, you're committed. Changing that partition later means draining workloads and reconfiguring the GPU - effectively downtime for that card.
Here's what that looks like in practice:
# SSH into the node with GPUs
# List GPUs and check current MIG mode
nvidia-smi -L

# Enable MIG mode on GPU 0 (requires a GPU reset to take effect)
sudo nvidia-smi -i 0 -mig 1

# Create two 3g.20gb GPU instances on GPU 0,
# each with its default compute instance (-C)
sudo nvidia-smi mig -i 0 -cgi 3g.20gb,3g.20gb -C

# Verify
nvidia-smi -L
# Output should include something like:
# GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-...)
#   MIG 3g.20gb Device 0: (UUID: MIG-...)
#   MIG 3g.20gb Device 1: (UUID: MIG-...)

Now deploy the NVIDIA device plugin to make Kubernetes aware of these instances:
---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
---
# HelmChart is the k3s/RKE2 CRD; on other distributions, run `helm install`
# against the same repo and values instead
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: nvidia-device-plugin
  namespace: gpu-operator
spec:
  repo: https://nvidia.github.io/k8s-device-plugin
  chart: nvidia-device-plugin
  targetNamespace: gpu-operator
  valuesContent: |-
    failOnInitError: true
    migStrategy: "single"
    deviceListStrategy: "envvar"
    deviceIDStrategy: "uuid"

After the device plugin starts, check what Kubernetes sees:
kubectl get nodes -o json | jq '.items[].status.allocatable' | grep nvidia
# Output (with the "single" MIG strategy, each MIG instance appears as one GPU):
# "nvidia.com/gpu": "2"

MIG Scheduling in Kubernetes
Now request MIG instances in your pod specs:
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
  namespace: default
spec:
  containers:
  - name: inference-server
    image: myregistry/inference:latest
    resources:
      requests:
        nvidia.com/gpu: "1"   # Request 1 MIG instance
      limits:
        nvidia.com/gpu: "1"
    # No need to set NVIDIA_VISIBLE_DEVICES or CUDA_VISIBLE_DEVICES yourself -
    # the device plugin injects them for the allocated instance

The kubelet's device plugin will bind this pod to one of the two 3g.20gb instances we created. Each instance looks like a fully independent GPU from the container's perspective. The pod can't see the other instance, can't starve it, can't crash it.
MIG's Sweet Spot
MIG is your answer when:
- You run inference workloads that don't need the full GPU
- You need hard isolation guarantees (multi-tenant SaaS, compliance-heavy)
- You want to maximize utilization across many small workloads
- Your workloads can fit in the smaller instance sizes (memory is usually the binding constraint, not compute)
MIG's weakness: inflexibility. You're locked into fixed partitions. If you partition for small inference (1g.5gb) but then need to run a larger model, you're stuck with smaller instances than you actually need. This isn't just an inconvenience - it affects your total cluster cost and capacity planning for months ahead.
MPS: Software-Level Context Sharing
How MPS Works
CUDA Multi-Process Service takes a different approach: instead of hardware partitioning, MPS creates a single GPU context and lets multiple user processes share it. Think of it like a context manager that schedules competing kernel launches onto the same GPU.
MPS architecture looks like this:
┌─────────────────────────────────────────┐
│ NVIDIA CUDA MPS Service (Daemon) │
│ ┌──────────────────────────────────────┐│
│ │ GPU Context (A100 or H100) ││
│ │ - Shared Memory Space ││
│ │ - Single Kernel Queue ││
│ └──────────────────────────────────────┘│
└─────────────────────────────────────────┘
↑ ↑ ↑ ↑
Client Client Client Client
Process 1 Process 2 Process 3 Process 4
Each client process connects to the MPS daemon, submits kernels to the shared context, and the daemon schedules them. From the application's perspective, it looks like it has exclusive GPU access.
The magic: MPS enforces memory isolation between clients through memory protection tables. Client 1's memory is fenced off from Client 2's memory even though they're in the same GPU context. If Client 1 tries to write to Client 2's heap, the hardware blocks it.
But here's the catch: fault isolation is not guaranteed. If Client 1 submits a kernel with a bug that causes a GPU reset, Client 2's kernels in-flight will be dropped. The GPU context recovers, but work is lost. This is the critical difference from MIG, and why MPS is for trusted workloads only.
Setting Up MPS in Kubernetes
First, start the MPS daemon on each node with GPUs:
# On the GPU node (outside Kubernetes)
# Or as a daemonset if you want it container-based

# Put the GPU in EXCLUSIVE_PROCESS compute mode (recommended for MPS)
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS

# Start the MPS control daemon
nvidia-cuda-mps-control -d

# Verify it's running
ps aux | grep mps

# Optional but recommended: cap per-client memory via the
# CUDA_MPS_PINNED_DEVICE_MEM_LIMIT environment variable on each client

Then deploy a daemonset to ensure MPS is always running:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-mps-daemon
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-mps
  template:
    metadata:
      labels:
        app: nvidia-mps
    spec:
      hostPID: true
      hostNetwork: true
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: mps-daemon
        image: nvidia/cuda:12.0.1-base-ubuntu22.04
        command:
        - sh
        - -c
        - |
          nvidia-cuda-mps-control -d
          echo "start_server -uid 0" | nvidia-cuda-mps-control
          sleep infinity
        securityContext:
          privileged: true
          runAsUser: 0
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: nvidia-install-dir
      volumes:
      - name: nvidia-install-dir
        hostPath:
          path: /usr/local/nvidia

Now your pods requesting GPUs will connect to the MPS daemon:
apiVersion: v1
kind: Pod
metadata:
  name: mps-client-1
spec:
  containers:
  - name: training-worker
    image: myregistry/pytorch:latest
    resources:
      requests:
        nvidia.com/gpu: "1"   # Extended resources must be whole integers
      limits:
        nvidia.com/gpu: "1"
    env:
    - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
      value: "25"   # Soft cap: this client uses at most ~25% of the SMs

Note that Kubernetes can't express fractional GPU requests like nvidia.com/gpu: "0.25" - extended resources are integers. To land several MPS clients on one physical GPU, combine MPS with the device plugin's sharing configuration (covered in the time-slicing section) so each physical GPU advertises multiple allocatable units, and use CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to cap each client's share of the SMs.

MPS Limitations and Risks
MPS is clever, but it has real limitations:
- No guaranteed fault isolation: A kernel crash resets the entire GPU context. All clients lose in-flight work. In a production inference scenario where you have four models running on one GPU, a bug in one model crashes all four simultaneously. That's catastrophic for availability.
- Memory isolation is application-dependent: MPS enforces memory bounds, but only if the CUDA application respects them. A buggy app writing out of bounds can still corrupt other clients' memory. MPS's memory protection is a speed bump, not a wall.
- Performance variability: Multiple processes sharing the same GPU context compete for kernel queue slots. Latency becomes unpredictable. Your inference endpoint might respond in 10ms one moment and 50ms the next, depending on what else is running.
- Debugging is harder: When things go wrong, you're debugging multiple processes' interactions in a shared context instead of isolated hardware. That 11ms latency spike - was it your model, someone else's, or contention between both?
- Compute isolation is limited: Unlike MIG, compute resources aren't partitioned. One process running a heavy kernel can starve others. The active-thread percentage isn't a hard resource limit; it's a scheduling fairness mechanism that breaks under adversarial conditions.
MPS works best for:
- Development environments where isolation isn't critical
- Inference workloads with similar latency requirements (batch inference, LLM serving)
- Trusted workloads (internal teams, not multi-tenant SaaS)
- Cost-sensitive scenarios where you accept the isolation trade-offs
Time-Slicing: Fair Scheduling Without Isolation
The Time-Slicing Strategy
Time-slicing is the simplest approach: configure the NVIDIA device plugin to let multiple pods request the same physical GPU. The Kubernetes scheduler oversubscribes the resource, and the plugin enforces fairness through time-slicing.
Here's the idea:
- Pod A gets 100ms of GPU time
- Pod B gets 100ms of GPU time
- Pod C gets 100ms of GPU time
- Repeat until one pod finishes
From each pod's perspective, it has exclusive GPU access for brief windows. In reality, they're rotating through the GPU.
Time-slicing is entirely transparent to your applications. Your models have no idea they're sharing. The GPU driver handles context switching automatically. This transparency is powerful for adoption - you don't need to modify code or handle special cases.
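The rotation is easy to picture with a toy simulation. The quantum and job lengths below are made-up numbers, and real drivers interleave at kernel granularity rather than fixed wall-clock slices, but the fairness property is the same:

```python
from collections import deque

def round_robin(jobs: dict[str, int], quantum: int = 100) -> dict[str, int]:
    """Simulate time-slicing: each job needs jobs[name] ms of GPU time.
    Returns the wall-clock completion time (ms) of each job."""
    queue = deque(jobs.items())
    clock = 0
    finished = {}
    while queue:
        name, remaining = queue.popleft()
        used = min(quantum, remaining)
        clock += used
        if remaining - used == 0:
            finished[name] = clock
        else:
            queue.append((name, remaining - used))
    return finished

# Three pods, each needing 300 ms of GPU time
print(round_robin({"A": 300, "B": 300, "C": 300}))
# → {'A': 700, 'B': 800, 'C': 900}
```

Each job alone would finish in 300ms; shared, they finish in 700-900ms. Individual latency stretches, but all three make progress on one GPU instead of one running and two queueing.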
NVIDIA Device Plugin Time-Slicing Configuration
Enable time-slicing in the device plugin config:
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-configs
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-device-plugin
        image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
        args:
        - "--config-file=/etc/nvidia-device-plugin/any"
        volumeMounts:
        - name: device-metrics
          mountPath: /run/prometheus
        - name: device-plugin-config
          mountPath: /etc/nvidia-device-plugin
      volumes:
      - name: device-plugin-config
        configMap:
          name: nvidia-device-plugin-configs
      - name: device-metrics
        emptyDir: {}

The replicas: 4 setting means Kubernetes will report each GPU as having 4 allocatable units. If you schedule 4 pods requesting nvidia.com/gpu: "1" each, all 4 can land on the same physical GPU, and time-slicing handles the rotation.
Oversubscription Ratios and Fairness
The oversubscription ratio (how many pods you let share one GPU) depends on your workload:
- Inference workloads: 4-8x oversubscription is safe. Inference is latency-tolerant, and multiple models benefit from batching.
- Development/notebook workloads: 2-4x is safer. These are interactive, and users get annoyed by slowdowns.
- Training workloads: 1x (don't time-slice). Training is compute-hungry and time-sensitive.
Here's a more tuned config:
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-configs
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      failOnInitError: false
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
        - name: nvidia.com/gpu
          replicas: 8

The renameByDefault option advertises shared GPUs under the name nvidia.com/gpu.shared so workloads opt in to sharing explicitly, and failRequestsGreaterThanOne rejects pods that ask for more than one share. Both are useful guardrails when teams with different expectations land on the same nodes.
The Trade-Offs: Context Switching Overhead
Time-slicing isn't free. Each time you switch contexts from one pod to another, you pay overhead:
- GPU memory flush: GPU caches clear.
- Context state swap: The new pod's context loads into GPU memory.
- Synchronization: Kernels must complete before switching.
For a 4x oversubscription ratio, expect:
- 5-15% latency increase per pod compared to exclusive GPU access
- Throughput still improves overall because 4 pods are making progress instead of 1
- Variability increases: One pod's performance depends on what else is running
These aren't theoretical numbers - they're from real deployments. A single inference call that takes 10ms exclusively might take 11.2ms under 4x time-slicing. That matters if your SLA is 10ms p99. But if your workload is batch inference, where you're processing 10,000 images and latency variance doesn't matter, time-slicing is almost free.
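The trade-off reduces to simple arithmetic. A sketch using the illustrative 12% overhead figure from the example above:

```python
# Latency vs. throughput under time-slicing, using illustrative numbers.

def shared_latency(exclusive_ms: float, overhead: float) -> float:
    """Per-request latency once context-switch overhead is added."""
    return exclusive_ms * (1 + overhead)

def relative_throughput(n_pods: int, overhead: float) -> float:
    """Cluster-wide throughput vs. one pod on an exclusive GPU."""
    return n_pods / (1 + overhead)

print(f"{shared_latency(10.0, 0.12):.1f} ms")      # 11.2 ms - blows a 10 ms p99 SLA
print(f"{relative_throughput(4, 0.12):.2f}x")      # ~3.57x more work per GPU overall
```

Same numbers, opposite conclusions: a latency-bound service sees a 12% regression, while a throughput-bound batch job sees nearly 4x the work per GPU.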
Time-slicing shines when:
- You have many small, batch workloads (inference)
- Fairness across teams is critical (you want everyone's notebooks to work)
- You accept latency trade-offs for better overall throughput
- You don't need hard isolation guarantees
Workload-to-Strategy Matching
Here's the decision tree:
Does your workload need hard isolation?
├─ YES (multi-tenant, compliance, safety-critical)
│ └─ Use MIG if instances fit your memory needs
│ └─ If not, use separate GPUs
│
├─ NO (internal teams, trusted workloads)
│ ├─ Is the workload compute-intensive (training)?
│ │ ├─ YES → Use dedicated GPUs (no sharing)
│ │ └─ NO → Is latency critical (real-time serving)?
│ │ ├─ YES → MPS with caution, or MIG
│ │ └─ NO → Time-slicing (batch inference, notebooks)
│ │
│ └─ Is memory the constraint?
│ ├─ YES → MIG (if it fits) or time-slicing
│ └─ NO → MPS or time-slicing
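The tree above can be encoded directly. This is a sketch, not policy - the predicates simply mirror the branches:

```python
# Encode the decision tree above as a function. The inputs mirror the
# questions in the tree; the returned strings are the recommended strategies.

def gpu_strategy(needs_isolation: bool, fits_mig_memory: bool,
                 is_training: bool, latency_critical: bool) -> str:
    if needs_isolation:
        return "MIG" if fits_mig_memory else "dedicated GPUs"
    if is_training:
        return "dedicated GPUs"
    if latency_critical:
        return "MPS (with caution) or MIG"
    return "time-slicing"

# Multi-tenant inference that fits in a 3g.20gb instance:
print(gpu_strategy(True, True, False, True))    # MIG
# Internal batch inference:
print(gpu_strategy(False, True, False, False))  # time-slicing
```

Encoding the policy in code has a side benefit: you can enforce it in an admission webhook instead of a wiki page.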
Real-World Examples
Scenario 1: SaaS LLM Inference Provider
You run a managed inference service where customers deploy their own models. You absolutely need isolation - one customer's model crashing shouldn't affect another's.
→ Use MIG exclusively. Partition each A100 into 3g.20gb or 2g.10gb instances depending on model sizes. Schedule each customer's inference pod to a dedicated MIG instance.
apiVersion: v1
kind: Pod
metadata:
  name: customer-acme-inference
spec:
  containers:
  - name: inference
    image: customer-acme-registry/llm:latest
    resources:
      requests:
        nvidia.com/gpu: "1"   # One 3g.20gb MIG instance
      limits:
        nvidia.com/gpu: "1"

Scenario 2: Internal Development Cluster
Your data science team needs GPU access for notebooks, small experiments, and model development. You have 4 A100s and 20 people wanting GPU access simultaneously.
→ Use time-slicing at 4x oversubscription. Everyone gets turns at the GPU. Latency isn't critical; the important thing is that everyone's notebooks stay responsive.
apiVersion: v1
kind: Pod
metadata:
  name: notebook-alice
spec:
  containers:
  - name: jupyter
    image: jupyter/pytorch-notebook:latest
    resources:
      requests:
        nvidia.com/gpu: "1"   # One of 4 time-sliced shares (1/4 of a physical GPU)
      limits:
        nvidia.com/gpu: "1"

Scenario 3: Mixed Training and Inference Workload
You have both model training (needs exclusive GPU) and batch inference (can share). You have 8 GPUs.
→ Partition the cluster: Reserve 4 GPUs for training (no sharing), partition the other 4 with MIG into 3g.20gb instances for inference.
---
# Training pod - reserves an entire (non-shared) GPU
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  containers:
  - name: trainer
    image: myregistry/training:latest
    resources:
      requests:
        nvidia.com/gpu: "1"   # Full GPU on a non-MIG, non-shared node
      limits:
        nvidia.com/gpu: "1"
---
# Inference pod - requests a MIG instance
apiVersion: v1
kind: Pod
metadata:
  name: inference-job
spec:
  containers:
  - name: inference
    image: myregistry/inference:latest
    resources:
      requests:
        nvidia.com/gpu: "1"   # One MIG 3g.20gb instance
      limits:
        nvidia.com/gpu: "1"

When to Choose Each Approach: A Decision Matrix
At this point you know the mechanics of all three approaches. The question is: which one for your cluster? The answer depends on your workloads, your tolerance for risk, and your utilization goals.
MIG: Best for SaaS and Multi-Tenant Inference
Use MIG when you need strong isolation guarantees and your workloads are roughly homogeneous in size. MIG is expensive (in terms of the capacity stranded by fixed partitions), but it's the only approach with hardware-level fault isolation.
When to use MIG:
- Multi-tenant SaaS where customer isolation is critical
- Production inference serving multiple models
- Scenarios where one workload crashing is unacceptable
- You value reliability over maximum utilization
When NOT to use MIG:
- Your workloads are wildly different sizes (8GB jobs and 24GB jobs)
- You're running research code and crashes are expected
- Utilization is already good (MIG wastes capacity)
MPS: Best for Trusted Workloads in Controlled Environments
Use MPS when you trust all workloads on the GPU, you want better utilization than MIG, and you can accept occasional context resets.
When to use MPS:
- Internal research clusters where team members know each other's code quality
- Batch inference from known, tested models
- Model serving with models you trained yourself (not customer code)
- You want finer control over resource sharing than time-slicing
When NOT to use MPS:
- Multi-tenant SaaS (untrusted code)
- Scenarios where GPU crashes are unacceptable
- You're running experimental code from multiple teams
Time-Slicing: Best for Development and Maximizing Utilization
Use time-slicing when you want simplicity, fair sharing among many workloads, and maximum utilization. It's the easiest to implement and understand.
When to use time-slicing:
- Development clusters with notebooks and experiments
- You have many jobs (20+) waiting for GPU
- Academic settings with student code
- You want simple scheduling fairness
- Cost is more important than latency
When NOT to use time-slicing:
- You have strict latency SLOs (context switching adds delay)
- You have long-running jobs (they'll be preempted frequently)
- You need deterministic performance (time-slicing is inherently variable)
Scheduling and Bin-Packing Strategies
Gang Scheduling for MIG Workloads
If your job needs multiple MIG instances (e.g., a distributed training job using 4 MIG instances), you need gang scheduling. Without it, the scheduler might allocate 2 instances to your job, then get stuck because other pods claimed the remaining 2.
Use a gang-aware scheduler such as Volcano or NVIDIA's KAI Scheduler to coordinate multi-instance allocations. Even with the default scheduler, pod affinity can at least keep the pods co-located:
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training-job
spec:
  parallelism: 4
  completions: 4
  template:
    metadata:
      labels:
        job-id: distributed-training-job
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: job-id
                operator: In
                values:
                - distributed-training-job
            topologyKey: kubernetes.io/hostname
      containers:
      - name: trainer
        image: myregistry/distributed-training:latest
        resources:
          requests:
            nvidia.com/gpu: "1"   # 4 pods × 1 MIG instance = 4 instances needed
          limits:
            nvidia.com/gpu: "1"

The podAffinity rule ensures all 4 pods land on the same node. Then the device plugin allocates them to MIG instances.
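Pod affinity only co-locates the pods; it doesn't make the scheduling all-or-nothing. With a gang scheduler like Volcano, the gang is expressed as a minimum member count. A sketch, assuming Volcano is installed (the resource follows its PodGroup CRD):

```yaml
# Volcano PodGroup: don't start any pod of the gang until all 4 can run
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: distributed-training-gang
spec:
  minMember: 4            # all-or-nothing: schedule 4 pods or none
  minResources:
    nvidia.com/gpu: "4"   # 4 MIG instances across the gang
```

The job's pods join the gang by setting schedulerName: volcano and referencing the PodGroup via Volcano's group-name annotation; until 4 instances are free, none of the pods start, so the job can never deadlock holding half its GPUs.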
Resource Quotas for Fair Multi-Team Sharing
When multiple teams share a cluster, enforce quotas at the namespace level:
---
apiVersion: v1
kind: Namespace
metadata:
  name: team-data-science
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-data-science
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # This team gets at most 8 GPU slots
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: protect-gpu-workloads
  namespace: team-data-science
spec:
  minAvailable: 1
  selector:
    matchLabels:
      workload-type: gpu-intensive

Now if team-data-science tries to schedule pods totaling more than 8 GPU slots, the quota controller rejects them. (For extended resources like nvidia.com/gpu, only the requests.* quota form is supported.) Fair sharing enforced.
Chargeback: Billing Based on Actual GPU Time Used
If you're doing chargeback (billing internal teams for GPU usage), you need to track actual GPU time consumed, not just requested.
MIG and MPS make this cleaner than time-slicing:
- MIG: Pod running on a MIG instance = GPU time consumed. Track by instance.
- MPS: Monitor nvidia-smi output; the process list tells you which clients are using each GPU.
- Time-slicing: Harder. You need metrics on how much time each pod actually held the GPU.
Use NVIDIA DCGM (Data Center GPU Manager) to export metrics:
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    app: dcgm-exporter
  ports:
  - port: 9400
    protocol: TCP
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.0
        securityContext:
          privileged: true
        env:
        - name: DCGM_EXPORTER_INTERVAL
          value: "30000"   # collection interval in ms (30s)
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-resources
      volumes:
      - name: pod-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources

This exports Prometheus metrics like:
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-xxx"} 85
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-xxx"} 1410
DCGM_FI_DEV_MEM_CLOCK{gpu="0",UUID="GPU-xxx"} 7001
Scrape these into Prometheus, join with pod metadata, and you can bill teams based on actual utilization.
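Once utilization samples are labeled by pod or team, the billing math is a grouped sum. A minimal sketch with hypothetical in-memory samples; a real pipeline would query Prometheus, and the internal rate is an assumption:

```python
# Hypothetical chargeback: turn per-team GPU utilization samples into dollars.
# Each sample is (team, gpu_utilization_percent), captured every 30 seconds.

from collections import defaultdict

SAMPLE_INTERVAL_S = 30
RATE_PER_GPU_HOUR = 2.50   # assumed internal rate

def bill(samples: list[tuple[str, float]]) -> dict[str, float]:
    """Dollars owed per team: utilization-weighted GPU-hours times the rate."""
    gpu_seconds: dict[str, float] = defaultdict(float)
    for team, util_pct in samples:
        gpu_seconds[team] += (util_pct / 100.0) * SAMPLE_INTERVAL_S
    return {t: round(s / 3600 * RATE_PER_GPU_HOUR, 4) for t, s in gpu_seconds.items()}

# One hour of samples per team: 120 samples x 30 s
samples = [("data-science", 85.0)] * 120 + [("platform", 20.0)] * 120
print(bill(samples))
```

Billing on utilization-weighted GPU-hours, rather than requested capacity, is what makes time-sliced sharing feel fair: a notebook that sat idle all afternoon costs its team almost nothing.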
Why This Matters in Production
In a real production environment, GPU sharing becomes an operational necessity around 50-100 GPUs. At that scale, you have:
- Multiple teams competing for resources
- Different workload characteristics (some need real-time response, others batch)
- Cost pressures to maximize utilization
- Availability requirements that demand fault tolerance
Getting this wrong doesn't just waste money - it causes friction that kills adoption. If your data scientists can never get GPU access because training jobs hog the entire machine, they switch to smaller models that don't need GPUs, defeating the purpose of buying them. If your inference system crashes when one customer's job breaks, you lose customer trust.
The right strategy, implemented thoughtfully, compounds benefits: better team satisfaction, lower cost per compute unit, higher cluster utilization, and ultimately faster research and product iteration.
Comparative Architecture Diagram
Here's how the three strategies differ at a high level:
MIG ISOLATION
┌──────────┐
│ Physical │
│ GPU │
├──────────┤
│ MIG 0 ├─→ Pod A (isolated compute + memory)
│ MIG 1 ├─→ Pod B (isolated compute + memory)
│ MIG 2 ├─→ Pod C (isolated compute + memory)
└──────────┘
Guarantee: Hardware-level isolation. Pod A crash ≠ Pod B crash.
MPS SHARING
┌──────────────────────────┐
│ GPU Context (shared) │
├──────────────────────────┤
│ Kernel Queue │
├──┬──┬──┬──────────────────┤
│ K│ K│ K│ from Pod A, B, C │
├──┴──┴──┴──────────────────┤
│ Memory: Pod A | Pod B | C │ (fenced)
└──────────────────────────┘
Guarantee: Memory isolation. No compute or fault isolation.
TIME-SLICING
┌──────────────────────────┐
│ GPU Context │
├──────┬──────┬──────┬─────┤
│Time 0│Time 1│Time 2│Time 3
├──────┼──────┼──────┼─────┤
│Pod A │Pod B │Pod C │Pod A
└──────┴──────┴──────┴─────┘
Guarantee: Fair scheduling only. Coarse context switching overhead.
Putting It All Together: Complete Example Setup
Here's a production-ready cluster setup using all three strategies:
---
# 1. Install NVIDIA GPU Operator (prerequisite)
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-operator
---
# 2. Configure device plugin for MIG + time-slicing
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-configs
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: "mixed"
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
---
# 3. Deploy device plugin daemonset
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-device-plugin
  template:
    metadata:
      labels:
        app: nvidia-device-plugin
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: nvidia-device-plugin
        image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
        args:
        - "--config-file=/etc/nvidia-device-plugin/any"
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin-config
          mountPath: /etc/nvidia-device-plugin
      volumes:
      - name: device-plugin-config
        configMap:
          name: nvidia-device-plugin-configs
---
# 4. MIG-only namespace (multi-tenant inference)
apiVersion: v1
kind: Namespace
metadata:
  name: saas-inference
  labels:
    gpu-strategy: "mig-only"
---
# 5. SaaS inference pod requesting a MIG instance
# (with migStrategy "mixed", each MIG profile is a named resource)
apiVersion: v1
kind: Pod
metadata:
  name: customer-acme-llm
  namespace: saas-inference
spec:
  containers:
  - name: llm-inference
    image: myregistry/llm-server:latest
    resources:
      requests:
        nvidia.com/mig-3g.20gb: "1"   # One 3g.20gb MIG instance
      limits:
        nvidia.com/mig-3g.20gb: "1"
    ports:
    - containerPort: 8000
---
# 6. Time-slicing namespace (development)
apiVersion: v1
kind: Namespace
metadata:
  name: development
  labels:
    gpu-strategy: "time-slicing"
---
# 7. Development notebook pod (shared GPU)
apiVersion: v1
kind: Pod
metadata:
  name: notebook-alice
  namespace: development
spec:
  containers:
  - name: jupyter
    image: jupyter/pytorch-notebook:latest
    resources:
      requests:
        nvidia.com/gpu: "1"   # One of 4 time-sliced shares (1/4 of a GPU)
      limits:
        nvidia.com/gpu: "1"
    ports:
    - containerPort: 8888
    env:
    - name: JUPYTER_ENABLE_LAB
      value: "yes"
---
# 8. Resource quota for development namespace
apiVersion: v1
kind: ResourceQuota
metadata:
name: dev-gpu-quota
namespace: development
spec:
hard:
requests.nvidia.com/gpu: "2" # Max 2 full GPUs worth
limits.nvidia.com/gpu: "2"
---
# 9. DCGM metrics exporter for chargeback
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: dcgm-exporter
namespace: monitoring
spec:
selector:
matchLabels:
app: dcgm-exporter
template:
metadata:
labels:
app: dcgm-exporter
spec:
nodeSelector:
nvidia.com/gpu: "true"
containers:
- name: dcgm-exporter
image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.0
securityContext:
privileged: true
env:
- name: DCGM_EXPORTER_KUBERNETES
value: "true"
volumeMounts:
- mountPath: /run/prometheus
name: pod-resources
volumes:
- name: pod-resources
hostPath:
path: /var/lib/kubelet/pod-resourcesDeploy this, verify each layer:
```bash
# Verify device plugin is running
kubectl get daemonset -n gpu-operator nvidia-device-plugin

# Check available GPU resources
kubectl describe nodes | grep nvidia.com/gpu
# Expected output (for 4x time-slicing):
#   nvidia.com/gpu: 16   (4 physical GPUs × 4 shares each)

# Submit MIG workload
kubectl apply -f saas-inference.yaml

# Submit time-sliced workload
kubectl apply -f development-notebook.yaml

# Verify pod scheduling
kubectl get pods -n saas-inference
kubectl get pods -n development

# Check GPU utilization
nvidia-smi dmon  # Real-time monitoring
```

Monitoring and Troubleshooting
When something goes wrong, start here:
Pod stuck in Pending?

```bash
kubectl describe pod <pod-name> -n <namespace>
# Look for "Insufficient nvidia.com/gpu" in events

# Check device plugin logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin
```

GPU performance degraded?

```bash
# Check active processes on GPU
nvidia-smi

# If using MPS, query the control daemon for the active thread percentage
echo get_default_active_thread_percentage | nvidia-cuda-mps-control

# If using time-slicing, check scheduling pressure
kubectl top pods -n <namespace> --containers
```

MIG instance not appearing in Kubernetes?

```bash
# Verify MIG is enabled at the node level
nvidia-smi -L  # Should list MIG devices alongside the parent GPU

# Restart device plugin to pick up changes
kubectl rollout restart daemonset nvidia-device-plugin -n gpu-operator

# Check device plugin logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin | grep -i mig
```

Lessons from Large-Scale GPU Deployments
Real deployments at scale teach lessons that aren't obvious from smaller experiments. A team running ten GPUs might not see the problems that emerge at one hundred GPUs; problems scale differently than you'd expect. One hundred percent GPU utilization at ten GPUs is probably fine - you're just using all your resources efficiently. One hundred percent utilization at one hundred GPUs is a problem: you have no slack for spikes in demand, no room for node failures, and no buffer before on-call gets paged. Best practices suggest aiming for seventy to eighty percent utilization as a sustainable operating point. Below that, you're wasting money. Above that, you're constantly fighting capacity constraints.
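The rule of thumb above can be expressed as a simple policy check. This is an illustrative sketch: the thresholds and the advice strings are assumptions drawn from the paragraph, not official recommendations.

```python
# Map average fleet GPU utilization (0..1) to a rough capacity signal.
# Thresholds follow the 50% / 70-80% bands discussed in the text.

def capacity_advice(avg_utilization: float) -> str:
    """Return a capacity-planning hint for the given average utilization."""
    if avg_utilization < 0.5:
        return "underutilized: consolidate workloads or share GPUs"
    if avg_utilization <= 0.8:
        return "healthy: headroom for spikes and node failures"
    # Above ~80% there is no slack for demand spikes or failed nodes
    return "overcommitted: add capacity or shed load"

print(capacity_advice(0.95))
```

A real version would feed this from a rolling average of DCGM utilization metrics rather than a single number.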
The coordination problem grows at scale. At ten GPUs, one person can manage things. At one hundred GPUs, you need dedicated infrastructure. Who manages the device plugin? Who handles GPU node failures? Who monitors utilization and decides when to add capacity? Who debugs when scheduling goes wrong? These questions become organizational, not just technical. Mature large-scale GPU operations have clear ownership and runbooks for every scenario.
Another lesson: heterogeneous GPU types complicate strategy decisions. Ideally, your entire cluster is the same GPU (all A100s, all H100s). In practice, you end up with mixed generations. You have some older V100s, some newer A100s, and some latest-generation H100s. MIG support varies across generations. MPS is more consistent but latency characteristics differ. Time-slicing works everywhere but performs differently. Accommodating this heterogeneity requires sophisticated bin-packing logic in your scheduler. You need to understand which workloads can run on which GPUs, which are sensitive to GPU generation, and make those constraints clear to your orchestration system.
Key Takeaways
You now have three tools in your arsenal:
- MIG for hard isolation: Use it when workload independence is non-negotiable. Trade flexibility for guarantees.
- MPS for shared contexts: Use it for inference workloads that trust each other. Moderate risk, better utilization than MIG alone.
- Time-slicing for fairness: Use it when you want simple scheduling with minimal isolation concerns. Highest utilization, highest contention.
The real win comes from matching the right strategy to the right workload. A SaaS platform runs MIG. A development cluster runs time-slicing. A batch inference pipeline might use all three on different GPUs.
Start by auditing your current GPU utilization. If it's below 50%, you're leaving money on the table. Pick one strategy, pilot it with a subset of workloads, measure the results, then scale.
GPU sharing isn't magic - it's thoughtful scheduling. Get it right, and you'll double or triple your effective GPU capacity without buying more hardware.
Implementation Challenges That Teams Face
The theory of GPU sharing sounds clean until you hit production reality. The first challenge is handling heterogeneous workloads on the same physical GPU. You might partition a GPU for inference, thinking all workloads will have similar characteristics. Then someone submits a training job that requires significantly more memory. Your carefully tuned MIG partitions become misaligned with actual workload demands. Real teams maintain multiple partition configurations and switch between them based on observed workload patterns. Some use predictive scaling based on historical demand, automatically reconfiguring GPUs before the demand arrives.
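One concrete piece of that partition-switching logic is choosing a MIG profile from a workload's memory demand. The sketch below uses real A100-40GB profile names, but the selection policy itself is an illustrative assumption.

```python
# (profile name, memory in GB) from smallest to largest, A100-40GB
A100_PROFILES = [
    ("1g.5gb", 5),
    ("2g.10gb", 10),
    ("3g.20gb", 20),
    ("7g.40gb", 40),
]

def pick_profile(required_gb: float) -> str:
    """Return the smallest MIG profile that fits the requested memory."""
    for name, mem_gb in A100_PROFILES:
        if required_gb <= mem_gb:
            return name
    raise ValueError(f"{required_gb} GB exceeds the largest profile")

print(pick_profile(8))  # -> 2g.10gb
```

A controller built on this idea would compare observed workload demand against the current partition layout and trigger reconfiguration when they drift apart.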
The second challenge is debugging performance issues in shared environments. When a pod is slow, is it slow because the model is inherently slow, or because it's competing for GPU time? Time-slicing makes this difficult to determine without detailed instrumentation. Teams often maintain shadow pods running the same workload in isolation to establish baseline performance. If the isolated version is faster, you have contention. But determining what caused the contention requires detailed GPU tracing, which is time-consuming. Investing in automated performance regression detection helps catch degradation before users complain.
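The shadow-pod comparison can be reduced to a simple statistical check. This is a minimal sketch; the 20% tolerance is an assumed example value, not a recommended threshold.

```python
from statistics import median

def is_contended(baseline_ms: list[float], shared_ms: list[float],
                 tolerance: float = 0.20) -> bool:
    """Compare median latencies of the isolated shadow pod (baseline)
    and the shared-GPU pod; True suggests GPU contention."""
    return median(shared_ms) > median(baseline_ms) * (1 + tolerance)

print(is_contended([10.0, 11.0, 10.5], [14.0, 15.0, 13.5]))  # -> True
```

Medians are used instead of means so a single slow outlier request doesn't trigger a false contention alarm.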
The third challenge is fair scheduling across many workloads. Kubernetes doesn't inherently understand GPU fairness. It knows about compute and memory but treats GPU time as opaque. If you have strict fairness requirements, you might need additional scheduling layers or custom policies. Some teams use Karpenter or NVIDIA's KAI Scheduler for more nuanced control. Others implement custom webhook logic that intercepts scheduling decisions and applies fairness rules.
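The kind of fairness rule such a webhook might apply can be sketched in a few lines. The team quotas and the request shape here are assumptions for illustration, not any real admission API.

```python
def admit(team: str, requested: int,
          in_use: dict[str, int], quota: dict[str, int]) -> bool:
    """Allow the pod only if its team stays within its GPU quota."""
    return in_use.get(team, 0) + requested <= quota.get(team, 0)

quota = {"research": 4, "prod": 8}
in_use = {"research": 3, "prod": 5}
print(admit("research", 2, in_use, quota))  # -> False (3 + 2 > 4)
```

In a real validating webhook, `in_use` would be computed from the current pod inventory and the decision returned as an AdmissionReview response.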
The fourth challenge is managing stateful GPU operations. GPU memory is limited, and with sharing it's even more constrained. Applications that allocate large GPU buffers for the session lifetime become problematic. If pod A allocates thirty percent of GPU memory for its session, then pod B tries to start, pod B gets less memory than expected. Some applications handle this gracefully; others crash. The gateway pattern helps here - applications should request GPU memory dynamically and handle resize situations. This requires application changes, which adds engineering burden.
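The "request memory dynamically and degrade gracefully" pattern looks roughly like this. `fake_alloc` is a stand-in for a real framework allocation call; the halving strategy and sizes are illustrative assumptions.

```python
def allocate_with_fallback(allocate, desired_mb: int, min_mb: int) -> int:
    """Halve the requested GPU buffer until it fits or drops below min_mb."""
    size = desired_mb
    while size >= min_mb:
        try:
            allocate(size)
            return size
        except MemoryError:
            size //= 2
    raise MemoryError(f"could not allocate even {min_mb} MB")

# Fake allocator simulating a shared GPU with only 6000 MB free
def fake_alloc(mb: int) -> None:
    if mb > 6000:
        raise MemoryError

print(allocate_with_fallback(fake_alloc, desired_mb=16000, min_mb=1000))
```

An application written this way survives a neighbor grabbing memory first, at the cost of running with a smaller working set.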
Cost Attribution and Billing in Shared Environments
Billing for shared GPUs is conceptually simple but operationally complex. If you're doing cost allocation, you need to track actual GPU time consumed per workload, not just requested. A pod might request one MIG instance for one hour, but if it only uses it for five minutes, should you bill for the full hour? Fair billing might charge by actual consumption, but that requires accurate measurement. With time-slicing, measurement is particularly difficult because the same physical GPU might run multiple workloads and you need to attribute time fairly.
Some teams implement billing based on requested resources as a simplification. It's easier to implement but feels unfair to efficient workloads - a model that trains efficiently might consume less compute in the same time window as an inefficient model, but they're charged equally. Other teams implement billing based on actual utilization measured through DCGM metrics, which is fairer but requires sophisticated accounting infrastructure.
A third approach is fixed pricing per pod per month, removing the complexity of usage-based billing entirely. This works well for development clusters where fairness is more important than precise cost allocation. It breaks down for production workloads where cost optimization is critical.
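Usage-based chargeback from utilization samples can be sketched as a small accounting loop. The sample format and the price are assumptions; in practice the samples would come from DCGM metrics (per-pod GPU utilization) scraped at a fixed interval.

```python
def bill(samples: list[tuple[str, float]], interval_s: float,
         price_per_gpu_hour: float) -> dict[str, float]:
    """samples: (pod, utilization 0..1), one entry per scrape interval.
    Charges each pod for the GPU-hours it actually consumed."""
    costs: dict[str, float] = {}
    for pod, util in samples:
        gpu_hours = util * interval_s / 3600
        costs[pod] = costs.get(pod, 0.0) + gpu_hours * price_per_gpu_hour
    return costs

samples = [("acme-llm", 0.9), ("notebook-alice", 0.1), ("acme-llm", 0.8)]
print(bill(samples, interval_s=3600, price_per_gpu_hour=2.0))
```

Note how this differs from request-based billing: the lightly loaded notebook pays a fraction of what the busy inference pod pays, even if both requested the same share.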
Observability and Monitoring at Scale
Proper observability is the difference between GPU sharing working smoothly and teams blaming the scheduling system for their own problems. You need visibility into resource contention, pod interference, and individual workload performance. NVIDIA's DCGM provides low-level metrics but lacks context about which pod is causing which behavior.
Sophisticated observability requires joining DCGM metrics with pod metadata and application logs. One team deployed GPU sharing without investing in observability. When pods started running slower, they assumed the scheduling strategy was wrong. Weeks of troubleshooting later, they discovered the real issue was a memory leak in one pod, which consumed more memory over time and degraded performance for the other pods sharing the GPU. With proper memory tracking and anomaly detection, they would have caught this in minutes.
Latency tracing becomes critical with sharing. When a request takes longer than expected, you need to understand whether the time is spent computing or waiting for GPU allocation. Distributed tracing systems integrated with GPU metrics provide this visibility. Some teams build custom monitoring that correlates pod scheduling timestamps with latency metrics to identify scheduling-induced delays.
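The core of that compute-versus-wait analysis is a simple decomposition: subtract measured GPU compute time from end-to-end latency to expose queueing and allocation delay. The field names are illustrative assumptions.

```python
def wait_fraction(total_ms: float, gpu_compute_ms: float) -> float:
    """Fraction of end-to-end latency spent waiting rather than computing."""
    if total_ms <= 0:
        raise ValueError("total latency must be positive")
    return max(0.0, (total_ms - gpu_compute_ms) / total_ms)

# A request that took 120 ms end to end but only 30 ms on the GPU
print(wait_fraction(total_ms=120.0, gpu_compute_ms=30.0))  # -> 0.75
```

A consistently high wait fraction under time-slicing points at oversubscription rather than a slow model.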
Evolution of Your GPU Sharing Strategy
Most teams don't pick the optimal strategy immediately and stick with it forever. Instead, they evolve based on operational experience. A common pattern is starting with time-slicing for simplicity, discovering that performance variability becomes problematic, then moving to MIG for better isolation. Another pattern is starting with simple resource requests and evolving to more sophisticated bin-packing as utilization demands increase.
The evolution requires planning. If you start with pure time-slicing and later want to migrate to MIG, you need a transition period where both coexist. Some clusters have MIG nodes and time-slicing nodes, with workloads explicitly assigned to the appropriate tier. This hybrid approach adds complexity but provides flexibility during transitions.
Understanding your evolution path in advance prevents rework. Teams that anticipate growth design their clusters with MIG-capable GPUs from the start, even if they're only using time-slicing initially. When demands change, they reconfigure rather than upgrade hardware.
Scaling Beyond Single Clusters
GPU sharing patterns change when you operate at multi-cluster scale. Different clusters might have different sharing strategies. Maybe your research cluster uses time-slicing for fairness, while your production cluster uses MIG for isolation. Your scheduling system needs to understand these different capabilities and route workloads to appropriate clusters.
Federation of multiple clusters adds complexity. You need admission control that understands overall capacity across clusters. You need to handle cases where one cluster is at capacity and workloads need to queue or spill over to other clusters. You need to maintain fairness across cluster boundaries. This typically requires a global scheduler or a cluster federation framework like Karmada that understands these constraints.
Some teams maintain per-team GPU quotas across clusters, ensuring fair resource distribution even as workloads move between clusters. This requires a global accounting system that knows how much GPU time each team has consumed across all clusters.
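That global accounting can be sketched as a cross-cluster aggregation. The data shapes (GPU-seconds per team per cluster) are illustrative assumptions about what such a system would track.

```python
def over_quota(usage_by_cluster: dict[str, dict[str, float]],
               quotas: dict[str, float]) -> list[str]:
    """Return teams whose total GPU-seconds across all clusters exceed quota."""
    totals: dict[str, float] = {}
    for cluster_usage in usage_by_cluster.values():
        for team, gpu_seconds in cluster_usage.items():
            totals[team] = totals.get(team, 0.0) + gpu_seconds
    return [t for t, used in sorted(totals.items()) if used > quotas.get(t, 0.0)]

usage = {"us-east": {"research": 5000, "prod": 2000},
         "eu-west": {"research": 4000, "prod": 1000}}
print(over_quota(usage, {"research": 8000, "prod": 8000}))  # -> ['research']
```

The key property is that a team can't dodge its quota by spreading workloads across clusters, because enforcement happens on the fleet-wide total.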
Future of GPU Sharing Technology
GPU sharing technology is rapidly evolving. Newer NVIDIA GPUs may have better built-in support for fine-grained sharing that moves beyond MIG, MPS, and time-slicing. Architectural support for mixing workload sizes on the same GPU might eliminate the constraint that all MIG instances on a GPU must be the same size. Improved scheduler support in Kubernetes for GPU-specific scheduling logic might reduce the need for custom controllers.
The broader trend is toward more sophisticated resource scheduling for heterogeneous workloads. GPUs are getting more powerful and more specialized. Sharing becomes increasingly necessary but also increasingly complex. Teams investing in solid foundations now will be better positioned to adopt new technologies as they emerge.
Practical infrastructure for AI systems that scale.