May 1, 2025
AI/ML Infrastructure Training GPU

Building a GPU Cluster for ML Training

You're training large language models or vision transformers, and your single GPU just isn't cutting it anymore. Maybe training runs that once took days now need to happen in hours. Maybe you're exploring distributed training frameworks but have no idea how to architect the underlying infrastructure to support them. Here's the truth: the difference between a bottlenecked cluster and a well-designed one often comes down to understanding the interconnect, storage, and cluster management layers - not just throwing GPUs at the problem.

In this guide, we'll walk through designing a production-grade GPU cluster from the ground up, including architecture decisions, hardware selection, network validation, and the operational tooling that keeps everything running smoothly.

Table of Contents
  1. Understanding GPU Cluster Architecture
  2. Compute Node Architecture: Inside the Black Box
  3. Interconnect Technologies: Choosing Your Fabric
  4. InfiniBand HDR vs NDR
  5. RoCEv2: The Ethernet Alternative
  6. NVLink Across Nodes?
  7. Parallel Storage: The Chokepoint Nobody Expects
  8. Why Storage Is Your Real Bottleneck
  9. Lustre
  10. WekaFS
  11. GPFS / IBM Spectrum Scale
  12. Storage Architecture for Our 8-Node Cluster
  13. Thermal & Power: The Unsexy Engineering
  14. Why Power Matters More Than People Think
  15. Power Delivery
  16. Cooling Strategy
  17. Data Parallelism vs Model Parallelism: An Architecture Perspective
  18. Cluster Management: Slurm + DCGM
  19. Slurm: Resource Allocation
  20. DCGM: Health Monitoring
  21. Network Validation: Ensuring IB Fabric Health
  22. Check Fabric Topology
  23. Test RDMA Connectivity
  24. Validate All-Reduce (Multi-GPU Communication)
  25. Procurement Checklist: 8-Node, 64-GPU H100 Cluster
  26. Configuration Reference: Real Slurm Setup
  27. Validation: Proving Your Cluster Works
  28. Advanced Topics: Tuning for Peak Performance
  29. NCCL Optimization
  30. Lustre Tuning for ML Workloads
  31. GPU Clock and Power Management
  32. Memory and CPU Pinning
  33. Operations & Troubleshooting
  34. Common Failure Modes
  35. Cost-Performance Trade-offs
  36. Option 1: Budget Cluster ($1.2M)
  37. Option 2: Mid-Range ($2.0M)
  38. Option 3: Full Production ($2.8M)
  39. Maintenance and Long-Term Operations
  40. Year 1: Optimization
  41. Year 2-3: Scaling
  42. Year 4-5: Refresh
  43. Why Production Reliability Matters
  44. Summary

Understanding GPU Cluster Architecture

A modern GPU cluster isn't just a rack of GPU servers connected to a switch. It's a carefully orchestrated system where compute nodes, high-speed fabric, parallel storage, and management networks all work in concert.

Here's what a complete architecture looks like:

┌─────────────────────────────────────────────────────────────┐
│                    GPU Cluster Architecture                  │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Compute Nodes (8x)                  Parallel Storage       │
│  ┌──────────────────────────┐       ┌──────────────────┐   │
│  │ GPU Servers (H100 SXM5)  │       │  Lustre/WekaFS  │   │
│  │ ├─ 8x GPUs per node      │───┐   │  ├─ 100+ TB      │   │
│  │ ├─ NVLink Topology       │   │   │  └─ 200+ GB/s IO │   │
│  │ ├─ InfiniBand HCA        │   │   └──────────────────┘   │
│  │ └─ 1.5TB DRAM            │   │                          │
│  └──────────────────────────┘   │   ┌──────────────────┐   │
│  (Repeated 8 times)             └───│ IB Fabric (HDR)  │   │
│                                     │ 200 Gbps full-dup│   │
│                                     └──────────────────┘   │
│                                                              │
│  Management Network (1GbE)                                  │
│  ├─ Slurm head node / DCGM aggregator                      │
│  ├─ Out-of-band BMC access                                 │
│  └─ Monitoring / logging backend                           │
│                                                              │
└─────────────────────────────────────────────────────────────┘

The key insight here: separate your networks by traffic class. Training data flows over the InfiniBand fabric. Management and monitoring go over a dedicated 1GbE network. This isolation prevents a logging spike from killing your training job's all-reduce operation.

When you start building your first cluster, this separation feels like overkill. You think "why can't I just connect everything to one big Ethernet switch?" But the moment you start running distributed training on 8+ GPUs, you'll understand. A logging service accidentally hammering a shared network with verbose telemetry steals bandwidth from gradient synchronization, and a single monitoring query that pulls metrics from all nodes at once can trigger packet loss. Either one can turn a sub-second all-reduce into a multi-second stall - a 2-3x slowdown for the whole training job. Separate networks solve this by guaranteeing each traffic class gets dedicated bandwidth.
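In software, this isolation shows up as pinning NCCL to the training fabric explicitly. A minimal sketch (NCCL_IB_HCA and NCCL_SOCKET_IFNAME are real NCCL environment variables; the device names mlx5_0 and ib0 are assumptions - check yours with `ibstat` and `ip link`):

```python
import os

# Keep NCCL's collective traffic on the InfiniBand fabric and its
# bootstrap/socket traffic off the 1GbE management network.
# "mlx5_0" and "ib0" are assumed device names for this cluster.
nccl_env = {
    "NCCL_IB_HCA": "mlx5_0",      # RDMA traffic uses the IB adapter
    "NCCL_SOCKET_IFNAME": "ib0",  # bootstrap TCP also stays on IB
    "NCCL_IB_DISABLE": "0",       # make sure RDMA is not disabled
}
os.environ.update(nccl_env)
```

Set these in your job's launch script so a misrouted default never silently sends gradients over the management network.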

Compute Node Architecture: Inside the Black Box

Let's zoom into a single compute node. Each node in our 8-node cluster will have this topology:

┌──────────────────────────────────────────────────────────┐
│               Per-Node Architecture (H100 SXM5)          │
├──────────────────────────────────────────────────────────┤
│                                                           │
│  GPU-GPU Connections (NVLink)                            │
│  ┌─────────────────────────────────────────────────────┐ │
│  │  GPU0 ──────NVLink──────── GPU1                   │ │
│  │   ├────── GPU2 ──────────── GPU3                   │ │
│  │   │  ┌────────────────────────┐                    │ │
│  │   │  │ (900 GB/s per GPU agg) │                    │ │
│  │   │  └────────────────────────┘                    │ │
│  │  GPU4 ──── GPU5 ──── GPU6 ──── GPU7               │ │
│  └─────────────────────────────────────────────────────┘ │
│                       ↓                                   │
│              NVIDIA NVSwitch (on-node)                    │
│              Bandwidth: 7.2 TB/s aggregate               │
│                                                           │
│  CPU & Memory                                             │
│  ├─ Dual-socket Intel/AMD (18-128 cores)                │
│  ├─ 1.5TB DRAM per node                                 │
│  └─ PCIe 5.0 x16 per GPU (~64 GB/s per direction)      │
│                                                           │
│  Storage per Node                                         │
│  ├─ ~1-2TB NVMe SSD (model cache, temp data)            │
│  └─ Mounted via network (Lustre/WekaFS)                 │
│                                                           │
│  Network                                                  │
│  ├─ InfiniBand HDR HCA (200 Gbps)                       │
│  ├─ 1GbE management port (BMC, monitoring)              │
│  └─ 1GbE OOB management switch                           │
│                                                           │
│  Power & Cooling                                          │
│  ├─ 4-6x 3000W PSUs per node (700W TDP × 8 GPUs)        │
│  ├─ Direct liquid cooling (DLC) inlet: ~40°C            │
│  └─ Node-level thermal sensors                          │
│                                                           │
└──────────────────────────────────────────────────────────┘

Here's what matters: NVLink is your intra-node lifeline. When two GPUs on the same node communicate, they use NVLink - 900 GB/s of aggregate bidirectional bandwidth per GPU. Cross-node communication drops to ~200 Gbps (25 GB/s) over InfiniBand HDR, which is why data parallelism works better than model parallelism on loosely coupled clusters.

The CPU sits in a supporting role. You're not doing heavy compute on it; it's handling collective communications, I/O coordination, and running the training framework runtime.

The NVLink topology deserves its own moment of explanation, because it's not trivial. In a modern H100 node with 8 GPUs, they're arranged in a specific pattern to maximize bandwidth between all pairs. The NVSwitch acts as a crossbar that can route traffic between any two GPUs at full speed simultaneously. This is different from older topologies where you had a bottleneck between GPU groups. With the NVSwitch, GPU0 can talk to GPU7 at 900 GB/s while GPU2 talks to GPU5 at 900 GB/s - no contention. This matters when you're doing all-reduce operations where every GPU is communicating with every other GPU.

Interconnect Technologies: Choosing Your Fabric

This is where many teams get it wrong. They pick a network based on "fast enough" rather than "right for this workload." Let's compare the options.

The fabric choice determines whether your cluster scales gracefully or hits a communication wall at 16 GPUs. Most teams don't realize this until they've already spent $2M on hardware and discovered that their gradient synchronization is bottlenecked. The truth is less obvious than you'd expect: raw bandwidth isn't everything. Latency, switch design, ecosystem maturity, and operational complexity matter just as much.

When you're training large models, every training step involves a backward pass that computes gradients, then a synchronization phase where all GPUs share those gradients via an all-reduce operation. If that all-reduce takes 10 seconds instead of 100 milliseconds, your GPUs spend most of every step waiting on the network instead of computing. Most teams don't measure this directly, so they don't understand why their training runs are unexpectedly slow. They blame the model, the data loader, or their implementation when the real culprit is a network that can't keep up with the compute.

The problem compounds with scale. With two GPUs, communication overhead is negligible - maybe 5 percent of total time. With eight GPUs, it's 20-30 percent. With sixteen GPUs, suddenly communication is eating half your training time if you've chosen the wrong fabric. This is why the fabric choice is a foundational architecture decision that you can't revisit without rewiring your entire cluster.
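The scaling cliff described above can be made concrete with a ring all-reduce cost model: each rank moves 2(p-1)/p of the gradient buffer, so what changes at scale is mostly which link the bytes cross. A toy sketch, assuming a 7B-parameter BF16 gradient buffer, ~450 GB/s of usable per-direction NVLink bandwidth inside a node, 25 GB/s (HDR) across nodes, and 2 s of compute per step (all assumed numbers):

```python
def ring_allreduce_seconds(grad_bytes, gpus, bytes_per_sec):
    # In a ring all-reduce, each rank sends and receives
    # 2*(p-1)/p of the buffer.
    return 2 * (gpus - 1) / gpus * grad_bytes / bytes_per_sec

GRAD = 7e9 * 2     # 7B params * 2 bytes (BF16 gradients)
COMPUTE = 2.0      # assumed seconds of fwd+bwd compute per step

for p in (2, 8, 16):
    bw = 450e9 if p <= 8 else 25e9   # NVLink inside a node, HDR across
    sync = ring_allreduce_seconds(GRAD, p, bw)
    print(f"{p:>2} GPUs: sync {sync*1e3:6.0f} ms, "
          f"overhead {sync / (sync + COMPUTE):.0%}")
```

Under these assumptions the 2- and 8-GPU cases stay in the low single-digit percent range, while the 16-GPU case - the first configuration that must cross the inter-node fabric - jumps to roughly a third of the step time. That is the communication wall.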

Consider also that you're building infrastructure that will operate for five years. The fabric you choose needs to support future growth. If you pick a technology that maxes out at 16 GPUs with acceptable overhead, adding GPUs 17-32 suddenly multiplies your communication costs. You either accept degraded throughput or you buy new network hardware. Planning for this from day one prevents a painful rebuild.

The operational complexity of network administration is often underestimated. A misconfigured InfiniBand switch port can silently drop packets, making training jobs hang indefinitely. Your team needs operational expertise to debug these issues. RoCEv2 requires even more tuning because Ethernet wasn't originally designed for the requirements of GPU communication. The "cheaper" option often becomes expensive in operational burden.

Beyond raw performance, there's the matter of ecosystem. When something breaks, can you find support? When you need to troubleshoot, are the debugging tools mature and well-documented? InfiniBand has decades of HPC deployment history. The entire supercomputing industry uses it. Tool support, documentation, and vendor expertise are mature. RoCEv2 is newer. The tools exist but are less battle-tested. For a production cluster, this risk premium is real.

InfiniBand HDR vs NDR

InfiniBand HDR (200 Gbps)

  • Bandwidth: 200 Gbps per link
  • Latency: ~0.6 microseconds (end-to-end)
  • Switch cost: $10-15K per 40-port switch
  • HCA cost: $3-4K per adapter
  • Ecosystem: Mature. Supported by every ML framework.

InfiniBand NDR (400 Gbps)

  • Bandwidth: 400 Gbps per link
  • Latency: ~0.5 microseconds
  • Switch cost: $25-35K per 40-port switch
  • HCA cost: $6-8K per adapter
  • Ecosystem: Newer. Good NCCL support, but less mature.

For an 8-node cluster with H100s, HDR is the pragmatic choice. Your aggregate fabric bandwidth (8 × 200 Gbps = 1.6 Tbps, or 200 GB/s) is well matched to data-parallel all-reduce traffic across nodes. NDR buys you future headroom, but the cost premium isn't justified until you're at 16+ nodes.

Why does this matter so much? Because the bottleneck in distributed training isn't usually computation - it's communication. During a typical training step, you forward pass, backward pass, then synchronize gradients. That synchronization takes time. Over TCP on commodity Ethernet, it might take 5 seconds. Over InfiniBand HDR with RDMA, it might take 0.5 seconds. That's 10x faster. Now multiply that by thousands of training steps, and you see why the network choice can make or break your cluster efficiency.

RoCEv2: The Ethernet Alternative

RoCEv2 (RDMA over Converged Ethernet) runs RDMA over standard Ethernet switches. Here's the appeal: Ethernet is everywhere, and you probably have it already.

Trade-offs:

  • Bandwidth: 100-400 Gbps (matches Ethernet speed)
  • Latency: 1-2 microseconds (worse than IB)
  • Switch cost: $2-5K per 48-port 400G switch (cheaper!)
  • HCA cost: $800-2K per 100G NIC (much cheaper)
  • Downside: Requires PFC (Priority Flow Control) and ECN (Explicit Congestion Notification) configuration. One misconfigured switch = packet loss = training hangs.

If you're on a tight budget and comfortable with network tuning, RoCEv2 works. For a first cluster, stick with InfiniBand. The operational headache savings are worth it.

The real-world story with RoCEv2 is that it requires meticulous tuning. Your switch ports need to support Priority Flow Control, which pauses low-priority traffic when congestion is detected. ECN needs to be enabled so endpoints know to slow down when the network is congested. One port that doesn't support these features, one misconfigured buffer, and suddenly your training jobs are hanging because packets are being dropped silently. With InfiniBand, you get these guarantees by default. The extra cost is insurance against weeks of debugging.

You might ask: "Can I use NVLink between nodes?" Not in a standard cluster like this one. NVIDIA's rack-scale NVLink Switch systems do extend NVLink beyond a single chassis, but that's specialty hardware. For a standard cluster, InfiniBand or RoCEv2 connects the nodes, and NVLink stays within-node.

This is an architectural constraint that matters. You can do tensor parallelism (splitting a model across GPUs) within a node using NVLink, then use data parallelism (different data on each node) with InfiniBand. This hybrid approach saturates both networks.

Parallel Storage: The Chokepoint Nobody Expects

You've got GPUs that each deliver on the order of a thousand teraflops of dense BF16 compute. Now you need to feed them data fast enough that they're not idle. This is where parallel storage comes in.

Why Storage Is Your Real Bottleneck

Let's do some math. Eight H100 GPUs per node, training on 4096-token sequences at batch size 8 per GPU. The token stream itself is cheap: 8 GPUs × 4096 tokens × 8 batch × 2 bytes per token is only ~0.5 MB per step per node, or ~50 MB/sec at 100 steps per second. What actually stresses storage is everything around the token stream: reading and shuffling raw shards, materializing preprocessed samples, and above all writing multi-gigabyte checkpoints. Budget several GB/sec of sustained aggregate bandwidth, with much higher burst capacity, and your storage needs to deliver it consistently, all day long.
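Checkpoint traffic is usually the number that surprises people. A rough sizing sketch for a 7B-parameter model trained with Adam in mixed precision (the bytes-per-parameter breakdown is the standard BF16-weights + FP32-master + optimizer-state accounting; the 60-second stall budget is an assumption):

```python
PARAMS = 7e9

# Mixed-precision checkpoint contents, bytes per parameter:
# BF16 weights (2) + FP32 master copy (4) + Adam m (4) + Adam v (4)
ckpt_bytes = PARAMS * (2 + 4 + 4 + 4)    # ~98 GB per checkpoint

stall_budget_s = 60                       # assumed tolerable write pause
burst = ckpt_bytes / stall_budget_s
print(f"checkpoint size: {ckpt_bytes/1e9:.0f} GB")
print(f"burst write bandwidth to finish in {stall_budget_s}s: "
      f"{burst/1e9:.1f} GB/s")
```

A ~98 GB checkpoint that must land in a minute needs ~1.6 GB/s of write bandwidth from a single writer - and that's before you scale the model up or checkpoint more often.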

If you're storing training data on a slow NFS mount, your GPUs will wait for data. They'll sit at 20% utilization instead of 95%. Your expensive hardware becomes expensive waiting room.

Lustre

What it is: A distributed filesystem optimized for HPC. It separates metadata (MDS) from object storage (OSS). Multiple clients hammer multiple OSS nodes in parallel.

Per-GPU bandwidth target: a few GB/s for data loading, plus burst headroom for checkpoints. Across 64 GPUs, plan for tens to a few hundred GB/s aggregate. Lustre with 8-16 OSS nodes can hit this comfortably.

Setup complexity: Moderate. Configuration is non-trivial, but battle-tested.

Cost: Open source (software), but hardware for 8 OSS nodes + MDS + management = $200-400K.

Example Lustre config:

bash
# On MDS node
mkfs.lustre --fsname=training --mgs /dev/sda1
mount -t lustre /dev/sda1 /mnt/lustre
 
# On OSS nodes (1 through 8)
mkfs.lustre --fsname=training --ost --index=0 /dev/sdb1
mount -t lustre /dev/sdb1 /mnt/lustre
 
# On compute nodes: mount the filesystem
mount -t lustre mds@tcp:/training /mnt/lustre

Expected performance: 200-300 GB/s sustained read throughput with 8-16 OSS nodes.

The beauty of Lustre is that it's battle-tested at massive scale. Supercomputers use it. You're not experimenting with bleeding-edge stuff; you're using the same filesystem that trains the largest models on Earth.

WekaFS

What it is: Modern distributed filesystem built for NVMe and GPUs. Much simpler than Lustre.

Per-client bandwidth: tens of GB/s per node, from far fewer backend servers than Lustre needs.

Setup complexity: Simple. API-driven, most config via REST calls.

Cost: $5-8K per TB/year for a managed offering, or ~$1.5-2M for an on-prem 100TB instance.

Standout feature: True NVMe backend. No spinning disk bottlenecks.

GPFS / IBM Spectrum Scale

What it is: Enterprise-grade parallel filesystem. Very stable, heavy operational overhead.

Where it shines: Existing IBM shop with Spectrum Scale skills.

Cost: 5-10x Lustre licensing.

Verdict: Overkill for a new cluster unless you already own it.


Storage Architecture for Our 8-Node Cluster

Here's what I'd build:

  • Metadata: 1x dedicated server (dual-socket CPU, 256GB DRAM, 1TB NVMe for MDS journal)
  • Object Storage: 4x storage servers, each with 48 × 10TB SSD in RAID-6 groups (~1.9 PB raw)
  • Aggregate throughput: ~250 GB/s sustained
  • Cost: ~$300K (hardware + Lustre support)

If you can stretch the budget, add one more OSS node and hit 300+ GB/s.

Why RAID-6 instead of RAID-5? Failure tolerance. If one drive fails and you're running a degraded RAID-5 during the rebuild, a second failure loses data. RAID-6 survives two simultaneous failures. With 10TB drives, a rebuild takes days, so a second failure inside that window is plausible. RAID-6 costs you one extra disk per group but saves your data.
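The capacity cost of that second parity disk is easy to quantify. A sketch, assuming each OSS's 48 drives are split into RAID groups of 12 (the group size is an assumption):

```python
DRIVES, DRIVE_TB, GROUP = 48, 10, 12   # per-OSS layout; group size assumed

def usable_tb(parity_per_group):
    """Usable capacity per OSS for a given parity-drive count per group."""
    groups = DRIVES // GROUP
    return groups * (GROUP - parity_per_group) * DRIVE_TB

raid5, raid6 = usable_tb(1), usable_tb(2)
print(f"RAID-5 usable: {raid5} TB, RAID-6 usable: {raid6} TB")
print(f"second-parity cost: {raid5 - raid6} TB per OSS "
      f"({(raid5 - raid6) / (DRIVES * DRIVE_TB):.0%} of raw)")
```

Giving up roughly 8% of raw capacity per OSS to survive a second drive failure during a multi-day rebuild is a cheap trade.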

Thermal & Power: The Unsexy Engineering

H100 SXM5 GPUs dissipate 700W each. With 8 per node, that's 5.6 kW of GPU power per compute node - before CPUs, DRAM, and fans. Across 8 nodes: 44.8 kW of GPU load alone. Your datacenter better be ready.

This is where infrastructure projects encounter reality. You've designed the perfect cluster topology. You've chosen your network fabric. You've planned your storage architecture. Then you call your datacenter and ask if they can support 45 kilowatts of sustained power draw, and they tell you the answer is no. This conversation destroys many projects before they even begin.

Power planning is unsexy work. It doesn't contribute to model convergence or training speed. But it determines whether your cluster even exists. Underestimate power by 20 percent and you'll hit a breaker that takes the entire cluster offline. Exceed your facility's cooling capacity and GPUs will thermally throttle, reducing your training speed by half. Overspecify power to be "safe" and you're paying for infrastructure you don't use.

The practical reality is that GPU clusters expose two truths about infrastructure that most software engineers never encounter. First, your hardware constraints are absolute. A network that can handle 100 Gbps can't suddenly handle 200 Gbps. A power budget of 60 kW can't accommodate a 70 kW cluster no matter how cleverly you optimize. Second, these constraints create non-negotiable design boundaries. Software scalability means something completely different when your hardware is bounded by fundamental physical limits.

Many teams discover this the hard way. They provision a cluster assuming average power draw, not peak power draw. Then someone runs a benchmark that stresses all GPUs at full power simultaneously. Breakers trip. The cluster goes dark. The team spends the next week negotiating with facilities, adding power infrastructure, and extending timelines by months.

The lesson is to be conservative when planning power and cooling infrastructure. Your models won't always train at peak power. Sometimes you'll do validation at reduced power. Sometimes you'll be preprocessing data while GPUs idle. But peak power must be budgeted for, and it must be guaranteed available. Otherwise you're gambling that worst-case scenarios won't happen simultaneously.

Cooling is equally important but often treated as an afterthought. H100 GPUs reach 85°C under full load. If your cooling system can't maintain that temperature ceiling, clocks throttle automatically, reducing throughput by 10-30 percent. Many teams install inadequate cooling, then spend months tuning software inefficiently when the real problem is thermal. You can't optimize your way out of a thermal bottleneck. You need better cooling hardware, which requires capital investment and planning. Again, this constraint is absolute.

The intersection of power and cooling creates a reliability challenge. If your cooling fails, GPUs throttle, then exceed temperature limits and shut down. If your power delivery fails, the entire cluster goes dark. Both are critical systems that require redundancy and monitoring. Dual pump groups for cooling. Dual power supplies per node. Automated health checks that detect thermal creep or power anomalies before they become failures. These aren't nice-to-have features; they're the difference between a cluster that runs reliably and one that requires heroic intervention every month.

Why Power Matters More Than People Think

Most teams budget for peak power, then discover their facility can't handle it. An H100 at full throttle draws 700W. Eight of them draw 5.6 kW. Your facility has 60 kW available for the cluster. You've budgeted 8 nodes × 5.6 kW = 44.8 kW of GPU load, leaving 15.2 kW for everything else. Sounds okay, until you add it up: each node draws another 1-2 kW for CPUs, DRAM, and fans (8-16 kW across the cluster), your chilled water pump needs 2 kW, and your storage servers need 3 kW. Suddenly you're at or past the limit. You either don't run at full power (your GPUs are throttled), or you blow a breaker (your cluster goes dark).
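The arithmetic in that scenario is worth writing down explicitly - it's the spreadsheet you should build before signing a colocation contract. A sketch using the numbers from this article (the 1 kW/node non-GPU overhead and 1 kW management figure are assumptions on the optimistic end):

```python
GPU_W, GPUS_PER_NODE, NODES = 700, 8, 8

gpu_load = GPU_W * GPUS_PER_NODE * NODES   # 44.8 kW of GPU silicon
node_overhead = 1000 * NODES               # assumed: CPUs/DRAM/fans, 1 kW/node
storage, mgmt, pumps = 3000, 1000, 2000    # watts, from the scenario above

total = gpu_load + node_overhead + storage + mgmt + pumps
facility = 60_000                           # available watts
print(f"peak draw: {total/1000:.1f} kW of {facility/1000:.0f} kW "
      f"({total/facility:.0%} utilized)")
```

Even with optimistic overheads, you land near 59 kW of a 60 kW budget - there is no slack for a second storage shelf or a tenant next door.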

Power Delivery

Each compute node needs redundant hot-swap power supplies sized well above its 5.6 kW of GPU load - think 4-6× 3000W in an N+1 configuration, so a single PSU failure doesn't take the node down. You'll want:

  • Per-node PDU assignment: 2x independent feeds, each sized for the node's full ~7-8 kW peak
  • Per-rack: 60 kW capacity (headroom for future expansion)
  • UPS backup: 10-minute minimum for graceful shutdown

Example rack layout:

Rack PDU Config:
├─ Breaker A: 32A three-phase (two nodes × ~5.6 kW GPU load)
├─ Breaker B: 32A three-phase (two nodes × ~5.6 kW GPU load)
├─ Breaker C: 32A three-phase (two nodes × ~5.6 kW GPU load)
└─ Breaker D: 32A three-phase (two nodes × ~5.6 kW GPU load)
   Total: ~45 kW cluster + 5 kW management
   Require: 60 kW PDU minimum

Cooling Strategy

H100 SXM5 cards come with liquid cooling. Direct liquid cooling (DLC) is non-negotiable at this scale.

Flow: Facility chilled water (12-15°C) → Node inlet loop (40°C) → GPU dies → Outlet loop (45°C) → Facility return.

Redundancy:

  • Two pump groups (active/standby)
  • Dual facility chiller connections
  • On-node thermal monitoring (BMC sensors + DCGM agent)

PUE target: 1.2-1.4. With efficient cooling, you're looking at ~55 kW total facility power (45 kW compute + 10 kW cooling/aux).

bash
# Check node thermal status via DCGM
dcgmi dmon -c 1
 
# Expected output for each GPU:
# GPU  SM   Memory   Power  Temp
# 0   79%   92%     680W    48°C
# 1   81%   95%     685W    49°C
# ... (GPUs 2-7)

If any GPU exceeds 85°C, its clocks will throttle automatically. If you see that in steady state, the problem is your thermal design, not your software.

Data Parallelism vs Model Parallelism: An Architecture Perspective

Before we dive into cluster management, let's address the elephant in the room: which training strategy should your cluster support?

This architectural choice ripples through every decision you've made so far. Your network bandwidth decisions, your storage architecture, even your Slurm configuration all depend on whether you're primarily doing data parallelism or model parallelism. Getting this wrong means building infrastructure that's fundamentally misaligned with how you'll actually train models.

Understanding the difference is crucial, but even more crucial is understanding when each makes sense. Data parallelism and model parallelism are not equally viable at all scales. At small scales, one dominates. At large scales, you need both, orchestrated carefully.

Let's think about what happens when you train a 7-billion parameter model across eight GPUs using data parallelism. Each GPU holds the full model - 7 billion parameters worth of weights, activations, and gradient buffers. You duplicate the data: GPU 0 trains on batch 1, GPU 1 trains on batch 2, and so on. After the backward pass, all eight GPUs synchronize their gradients via an all-reduce operation. The synchronization is the communication bottleneck, but it's a single operation per training step. The computation-to-communication ratio is favorable. You're doing billions of floating-point operations for every all-reduce synchronization. This is why data parallelism scales so well - the communication overhead is amortized across massive amounts of computation.

Model parallelism, by contrast, splits the model itself across GPUs. GPU 0 holds layers 1-4, GPU 1 holds layers 5-8, and so forth. During the forward pass, the activation from GPU 0 must move to GPU 1, then from GPU 1 to GPU 2. This introduces communication at every layer boundary, 32 times per forward pass for a 32-layer model. Then the backward pass has the same problem in reverse. Suddenly the computation-to-communication ratio is much worse. You're moving data between GPUs constantly. Unless your inter-GPU bandwidth is exceptional - NVLink speeds, not Ethernet - model parallelism becomes a communication bottleneck.

This is why architectural decisions matter so much. If you design for data parallelism, standard Ethernet is sufficient for 8-16 GPUs. If you design for model parallelism at scale, you need NVLink or InfiniBand. But equally important, if you design your software for one and later realize you need the other, you're rewriting everything. The synchronization patterns are completely different. The load balancing is different. The fault tolerance mechanisms are different.

The most successful large-scale training clusters use a hybrid approach: tensor parallelism (a type of model parallelism) within nodes via NVLink, and data parallelism across nodes via InfiniBand. This leverages each network's strengths. Within a node, NVLink provides 900 GB/s of bandwidth, allowing you to keep model parallelism efficient. Across nodes, InfiniBand handles the all-reduce operations for data parallelism, where communication is less frequent and bandwidth is the limiting factor rather than latency.

Your cluster architecture should make this hybrid approach easy. If you force global tensor parallelism across all eight nodes, you've locked yourself into a low-throughput configuration. If you make within-node tensor parallelism the default and data parallelism the outer loop, you've built something that scales gracefully.

Data Parallelism duplicates the model on every GPU and broadcasts gradients during backprop. All-reduce operations happen frequently, but the model stays on one GPU per node. For transformer models at 7B-70B scale, this is your default choice. Overhead: ~5-10% per additional node.

Model Parallelism splits the model across GPUs. Tensor parallelism (within-node, over NVLink) stays fast. Pipeline parallelism (across nodes, over IB) introduces bubble time - GPUs wait for pipeline stages to complete. Model parallelism is not a cluster-level problem; it's a within-node optimization.

Our cluster architecture assumes data parallelism with optional tensor parallelism. Here's why:

  • Each node has 8 GPUs with 7.2 TB/s NVLink bandwidth. Run tensor parallelism (2-4 way split) within the node.
  • Replicate that tensor-parallel model across all 8 nodes. Use data parallelism (IB all-reduce) across nodes.

This hybrid approach saturates both the intra-node and inter-node fabrics.
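A quick way to sanity-check why within-node tensor parallelism matters: per-GPU weight memory shrinks with the TP degree. A sketch for a 70B-parameter BF16 model on 80 GB H100s (weights only - activations and optimizer state make the real margins tighter):

```python
PARAMS, BYTES, HBM_GB = 70e9, 2, 80

def weights_gb(tp):
    """Per-GPU weight memory at a given tensor-parallel degree."""
    return PARAMS * BYTES / tp / 1e9

for tp in (1, 2, 4, 8):
    fits = "fits" if weights_gb(tp) < HBM_GB else "does NOT fit"
    print(f"TP={tp}: {weights_gb(tp):5.1f} GB of weights per GPU -> {fits}")
```

At TP=1 the weights alone exceed HBM; at TP=8 (one full node) they drop to 17.5 GB, leaving room for activations and optimizer shards - which is exactly why the node boundary is the natural tensor-parallel boundary.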

Cluster Management: Slurm + DCGM

You've got the hardware. Now you need software that allocates resources, monitors health, and isolates failures.

Slurm: Resource Allocation

Slurm (Simple Linux Utility for Resource Management) does three critical things:

  1. GPU accounting: Track which job uses which GPU
  2. Gang scheduling: Suspend/resume jobs fairly
  3. Constraint enforcement: Route jobs to nodes with specific hardware

Here's a Slurm config for our 8-node, 64-GPU cluster:

bash
# /etc/slurm/slurm.conf
 
ClusterName=ml-training-cluster
ControlMachine=slurm-head
 
NodeName=compute[01-08] \
    CPUs=128 \
    RealMemory=1500000 \
    Gres=gpu:h100:8 \
    Features=nvlink,infiniband,dlc-cooled
 
PartitionName=gpu-training \
    Nodes=compute[01-08] \
    MaxCPUsPerNode=128 \
    DefMemPerNode=1500000 \
    Priority=10 \
    State=UP \
    MaxTime=infinite
 
GresTypes=gpu
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurm-head
AccountingStoragePort=6819
 
# Each node also needs /etc/slurm/gres.conf describing its GPUs, e.g.:
# NodeName=compute[01-08] Name=gpu Type=h100 File=/dev/nvidia[0-7]

Submit a training job:

bash
sbatch --partition=gpu-training \
       --nodes=2 \
       --gpus-per-node=8 \
       --cpus-per-task=16 \
       --mem-per-gpu=180G \
       train.sh

Slurm now reserves 16 GPUs, 32 CPUs, and ~2.9 TB of RAM (180 GB × 16 GPUs) across two nodes. No other job can touch those resources.

DCGM: Health Monitoring

Data Center GPU Manager (DCGM) is NVIDIA's monitoring daemon. It runs on each node and exposes GPU health metrics via HTTP.

bash
# Install DCGM on each compute node
apt-get install datacenter-gpu-manager
 
# Start the daemon
systemctl start nvidia-dcgm
systemctl enable nvidia-dcgm
 
# Query health from login node
dcgmi dmon --host compute01 -c 3

DCGM tracks:

  • Temperature per GPU
  • Power draw
  • ECC errors (single-bit and multi-bit)
  • Throttling events
  • Clock speeds

Set up automated health checks:

python
#!/usr/bin/env python3
import requests
import subprocess
import sys
 
def check_node_health(hostname):
    """Query the per-node GPU health endpoint and report unhealthy GPUs."""
    try:
        # Assumes each node runs an agent exposing DCGM metrics as JSON
        # on port 5555. (nv-hostengine's native protocol on 5555 is
        # binary, so in practice this is a thin exporter you deploy
        # alongside it.)
        response = requests.get(
            f"http://{hostname}:5555/api/v2/health/status",
            timeout=5
        )
        health = response.json()
 
        # Check for critical issues
        for gpu_id, metrics in health.get('gpu_health', {}).items():
            if metrics['ecc_multi_bit_errors'] > 0:
                print(f"[CRITICAL] {hostname}:GPU{gpu_id} has ECC multi-bit error")
                # Isolate the node
                subprocess.run(['scontrol', 'update', f'NodeName={hostname}', 'State=DOWN',
                              'Reason=ECC_Error'])
                return False
 
            if metrics['temperature_c'] > 85:
                print(f"[WARNING] {hostname}:GPU{gpu_id} thermal throttling at {metrics['temperature_c']}°C")
                # Reduce job allocation
                subprocess.run(['scontrol', 'update', f'NodeName={hostname}',
                              'Features=throttled'])
                return False
 
        return True
    except Exception as e:
        print(f"[ERROR] Failed to query {hostname}: {e}")
        return False
 
if __name__ == '__main__':
    nodes = [f'compute{i:02d}' for i in range(1, 9)]
    healthy = all(check_node_health(node) for node in nodes)
    sys.exit(0 if healthy else 1)

Run this health check every minute via cron (cron's minimum granularity; use a systemd timer if you need 30-second intervals):

bash
*/1 * * * * /usr/local/bin/cluster_health_check.py >> /var/log/cluster_health.log 2>&1

When DCGM detects an ECC multi-bit error or critical thermal event, that node is automatically marked DOWN. No more training jobs are scheduled on it until you investigate.

Network Validation: Ensuring IB Fabric Health

InfiniBand is fast but finicky. Before you run your first training job, validate the fabric.

Check Fabric Topology

bash
# On any IB-connected node
ibnetdiscover > /tmp/fabric.topo
 
# Trace the route between two HCAs (ibtracert takes LIDs; find each
# node's LID with ibstat)
ibtracert <compute01-lid> <compute02-lid>
 
# Expect a single switch hop between any pair of compute nodes.

Test RDMA Connectivity

bash
# On compute01, start a simple RDMA responder
ib_recv_bw -d mlx5_0 --report_gbits
 
# On compute02, send data
ib_send_bw -d mlx5_0 compute01 --report_gbits --size 1m
 
# Expected: ~190 Gbps (~95% of the 200 Gbps HDR line rate).
# For latency, run ib_send_lat separately (expect ~1-2µs).

Validate All-Reduce (Multi-GPU Communication)

bash
# On slurm head node
srun --nodes=8 --gpus-per-node=8 \
     nccl-tests/build/all_reduce_perf -b 100M -e 1G -n 100 -w 100
 
# Expected output (64 H100 GPUs, one HDR HCA per node): the busbw column
# should settle around ~25 GB/s, bounded by the single 200 Gbps (= 25 GB/s)
# InfiniBand link per node:
#
#        size      count   type  redop     time   algbw   busbw
#   104857600   26214400  float    sum      ...     ...     ~25
 
# If busbw comes in well below that (e.g., <15 GB/s), suspect:
# - IB fabric misconfiguration (MTU, flow control, wrong HCA selected)
# - Storage contention (background I/O)
# - Thermal throttling

This all-reduce test is your canary. If it hangs or drops performance, the cluster isn't ready.
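If you gate cluster readiness in CI or a pre-flight script, you'll want to parse that busbw column rather than eyeball it. Here's a minimal sketch; the sample line mimics the nccl-tests output columns (size, count, type, redop, time, algbw, busbw), which can vary slightly between versions.

```python
import re

# Extract the busbw column (last number per result row) from nccl-tests
# output, so a wrapper can fail fast when the fabric underperforms.
THRESHOLD_GBPS = 15.0  # below this, investigate fabric/storage/thermals

def min_busbw(output: str) -> float:
    """Return the worst busbw (GB/s) across all result rows."""
    rows = re.findall(r"^\s*\d+\s+\d+\s+float\s+sum\s+\S+\s+[\d.]+\s+([\d.]+)",
                      output, re.MULTILINE)
    return min(float(v) for v in rows)

sample = """
   104857600      26214400   float   sum   4100.2   25.57   25.17
  1073741824     268435456   float   sum  41900.8   25.63   25.23
"""
print(min_busbw(sample) >= THRESHOLD_GBPS)  # True
```

Wire `min_busbw` into the epilog or a cron job and you have an automated canary instead of a manual one.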

Procurement Checklist: 8-Node, 64-GPU H100 Cluster

Use this checklist when ordering hardware:

Compute Nodes (8×)

  • Server: HPE Apollo 6500 (XL675d) or Dell PowerEdge XE9680 (dual-socket, 1.5TB+ DRAM)
  • GPU: 8× NVIDIA H100 SXM5 80GB per node
  • CPU: 2× Intel Xeon Platinum 8592+ (64-core) or AMD EPYC 9754
  • Memory: 24× 64GB DDR5 RDIMM @ 4800 MT/s (1.5TB total)
  • NVMe: 2× 3.84TB U.2 SSD (model cache)
  • PSU: redundant supplies totaling ≥12 kW (e.g., 4× 3000W 80+ Titanium); 8× 700W GPUs plus host draw ~7.6 kW per node
  • GPU Power: SXM5 modules are fed through the HGX baseboard; 12VHPWR cabling applies only to PCIe-variant H100s
  • IB HCA: 1× NVIDIA/Mellanox ConnectX-6 (200 Gbps HDR)
  • Out-of-band: 1× 10GbE or 1GbE NIC (baseboard management)
  • Cooling: DLC-equipped or retrofit DLC + liquid coolant

Interconnect (InfiniBand HDR)

  • 1× 40-port IB HDR switch (e.g., NVIDIA/Mellanox QM8700)
  • 8× IB HDR QSFP56 AOC cables (100m max)
  • 1× secondary 1GbE management switch (48-port)
  • 1× spare IB HCA (hot spare / future expansion)

Storage

  • 1× Metadata Server: dual-socket, 256GB DRAM, 1TB NVMe
  • 8× Object Storage Servers (OSS): 24× 8TB NVMe SSD each (192 TB per server, ~1.5 PB raw)
  • Dual-port NICs on every storage node (bonded pairs)
  • Lustre or WekaFS licenses

Facility

  • PDU capacity of ~80 kW dedicated to the cluster (the 64 GPUs alone draw ~45 kW at 700W each)
  • Dual-pump DLC circulation unit (60 LPM minimum)
  • Chilled water to 12-15°C facility capability
  • 1× UPS covering head, management, and storage nodes (5 kVA minimum, ~10-minute runtime)

Management

  • 1× Slurm head node (modest hardware, high redundancy)
  • 1× Monitoring/logging server (Prometheus + Grafana)

Total cost estimate: $2.5M - $3.2M hardware + installation.
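Before signing a PDU order, sanity-check the power math. A back-of-envelope sketch, with assumed figures (700 W per H100 SXM5, ~2 kW of host overhead per node, ~10% conversion loss):

```python
# Back-of-envelope power budget for the 8-node build above.
# Assumptions: 700 W per GPU, ~2 kW host overhead, ~10% PSU/conversion loss.
GPUS_PER_NODE, NODES = 8, 8
GPU_W, HOST_W, LOSS = 700, 2000, 0.10

node_w = GPUS_PER_NODE * GPU_W + HOST_W      # watts per node at the busbar
cluster_w = NODES * node_w * (1 + LOSS)      # watts at the wall
print(f"node: {node_w/1000:.1f} kW  cluster: {cluster_w/1000:.1f} kW")
```

Under these assumptions the cluster lands near 67 kW at the wall, which is why the facility line items need comfortable headroom over the GPUs' 45 kW nameplate.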

Configuration Reference: Real Slurm Setup

Here's a production-ready Slurm configuration you can adapt:

ini
# /etc/slurm/slurm.conf
#
# GPU Cluster ML Training Configuration
# Cluster: 8 nodes × 8 GPUs = 64 H100 SXM5
 
ClusterName=ml-training-cluster
SlurmctldHost=slurm-head.local(10.0.1.1)
 
# Database accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurm-head.local
AccountingStoragePort=6819
 
# State retention
StateSaveLocation=/var/lib/slurm
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldLogFile=/var/log/slurm/slurmctld.log
 
# Timeouts
SlurmdTimeout=300
MessageTimeout=30
 
# GPU support (cons_tres is required for per-GPU scheduling)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
GresTypes=gpu
 
# Node definitions: 8-node cluster
NodeName=compute[01-08] \
    CPUs=256 \
    RealMemory=1500000 \
    Gres=gpu:h100_sxm5_80gb:8 \
    Features=nvlink,infiniband-hdr,dlc-cooled
 
# Partition: GPU training queue
PartitionName=gpu-training \
    Nodes=compute[01-08] \
    MaxCPUsPerNode=256 \
    Default=YES \
    Priority=10 \
    State=UP \
    MaxTime=infinite \
    MinNodes=1
 
# Limits to prevent runaway allocations
MaxJobCount=64
# Per-user concurrency caps (e.g., 2 jobs per user) live in QOS, not slurm.conf:
#   sacctmgr modify qos normal set MaxJobsPerUser=2
 
# Advanced scheduling
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
 
# Fair share: usage half-life of 7 days
PriorityDecayHalfLife=7-0
PriorityCalcPeriod=5
PriorityUsageResetPeriod=WEEKLY
 
# Preemption: allow fair-share preemption
PreemptMode=CANCEL
PreemptType=preempt/qos
 
# Epilog: health check after job finishes
Epilog=/usr/local/sbin/slurm_epilog.sh
EpilogMsgTime=10

DCGM health check script:

bash
#!/bin/bash
# /usr/local/sbin/slurm_epilog.sh
# Run after each job to check node health
 
HOSTNAME=$(hostname -s)
DCGM_PORT=5555
DCGM_URL="http://localhost:${DCGM_PORT}/api/v2/health/status"
 
# Query DCGM
HEALTH=$(curl -s "${DCGM_URL}" 2>/dev/null)
 
if [ -z "$HEALTH" ]; then
    echo "ERROR: DCGM unreachable on ${HOSTNAME}"
    scontrol update NodeName=${HOSTNAME} State=DOWN Reason="DCGM_Unreachable"
    exit 1
fi
 
# Check for critical GPU errors (collect per-GPU counts into an array so add sums them)
ECC_ERRORS=$(echo "$HEALTH" | jq '[.gpu_health[].ecc_multi_bit_errors] | add // 0' 2>/dev/null)
 
if [ "${ECC_ERRORS:-0}" -gt 0 ]; then
    echo "ERROR: ${HOSTNAME} ECC errors detected: ${ECC_ERRORS}"
    scontrol update NodeName=${HOSTNAME} State=DOWN Reason="ECC_Errors"
    exit 1
fi
 
# If we made it here, node is healthy
exit 0
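If you'd rather unit-test the ECC logic than debug jq in production, here's the same check in Python, against the response shape assumed throughout this article (gpu_health keyed by GPU id — an assumption of this guide, not an official DCGM schema):

```python
import json

# Sum ECC multi-bit errors across all GPUs in a DCGM-style health payload.
def total_mb_ecc(health_json: str) -> int:
    health = json.loads(health_json)
    return sum(g.get("ecc_multi_bit_errors", 0)
               for g in health.get("gpu_health", {}).values())

sample = json.dumps({"gpu_health": {
    "0": {"ecc_multi_bit_errors": 0, "temperature_c": 62},
    "1": {"ecc_multi_bit_errors": 2, "temperature_c": 64},
}})
print(total_mb_ecc(sample))  # 2 -> this node should be drained
```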

Validation: Proving Your Cluster Works

Before declaring victory, run these benchmarks:

1. Bandwidth Test (NCCL)

bash
# All-reduce on 64 GPUs
srun --nodes=8 --gpus-per-node=8 \
     nccl-tests/build/all_reduce_perf -b 100M -e 1G -n 100
 
# Target: busbw ≥ 25 GB/s (bounded by the single 200 Gbps HDR HCA per node)

2. End-to-End Training (ResNet50 synthetic)

bash
# Using Horovod + PyTorch; horovodrun needs the host list explicitly
horovodrun -np 64 \
    -H compute01:8,compute02:8,compute03:8,compute04:8,compute05:8,compute06:8,compute07:8,compute08:8 \
    python train_resnet50.py
 
# Should achieve 80-85% GPU utilization
# Throughput: on the order of 100K+ images/sec across 64 H100s (synthetic data)

3. Storage I/O Test

bash
# Parallel read from Lustre
srun --nodes=8 --cpus-per-task=16 \
     dd if=/mnt/lustre/testfile bs=1M count=10000 iflag=direct of=/dev/null
 
# Target: >200 GB/s aggregate (iflag=direct bypasses the page cache,
# which would otherwise inflate the numbers)

If you hit these targets, your cluster is correctly integrated. Proceed to production training.
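Where does the 200 GB/s target come from? A rough feed-rate calculation — the per-GPU ingest rate here is an assumption; yours depends on sample size and how much augmentation happens on the CPU versus the GPU:

```python
# Rough storage-bandwidth sizing for the cluster.
# Assumption: each GPU consumes ~3 GB/s of raw training samples.
GPUS = 64
GB_PER_GPU_PER_S = 3.0

required = GPUS * GB_PER_GPU_PER_S
print(f"Required aggregate read bandwidth: {required:.0f} GB/s")  # 192 GB/s
```

At ~192 GB/s of demand, a 200 GB/s storage target leaves almost no margin — which is exactly why storage becomes the chokepoint nobody expects.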

Advanced Topics: Tuning for Peak Performance

Now that your cluster is running, let's optimize it. The difference between 60% and 80% GPU utilization is often just configuration tuning.

NCCL Optimization

NCCL (NVIDIA Collective Communications Library) handles all-reduce, all-gather, and broadcast operations. Out of the box, it's conservative. Be aggressive.

bash
# On compute01, export tuning parameters
export NCCL_DEBUG=INFO
export NCCL_IB_HCA=mlx5_0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_TIMEOUT=20
export NCCL_SOCKET_IFNAME=eth0
 
# Run NCCL test with an explicit topology file
export NCCL_TOPO_FILE=/path/to/topo.xml
nccl-tests/build/all_reduce_perf -b 100M -e 1G -n 100 -w 100

The key variables:

  • NCCL_IB_HCA: Force use of a specific IB adapter (prevents fallback to a slower NIC).
  • NCCL_IB_GID_INDEX: Select the RoCEv2 GID (index 3, typically). Only relevant on RoCE fabrics; native InfiniBand ignores it.
  • NCCL_TOPO_FILE: Provide an explicit topology file for ring and tree selection.

Expected gain: 10-15% improvement if defaults were conservative.
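If you launch training from a Python wrapper rather than an interactive shell, build the child environment explicitly so the tuning survives scheduler hops. A small sketch using the same variables and values as above:

```python
import os

# NCCL tuning overrides, matching the shell exports in the text.
NCCL_ENV = {
    "NCCL_DEBUG": "INFO",
    "NCCL_IB_HCA": "mlx5_0",
    "NCCL_IB_GID_INDEX": "3",
    "NCCL_IB_TIMEOUT": "20",
    "NCCL_SOCKET_IFNAME": "eth0",
}

def child_env() -> dict:
    """Inherit the current environment, then layer the NCCL overrides on top."""
    env = dict(os.environ)
    env.update(NCCL_ENV)
    return env

env = child_env()
print(env["NCCL_IB_HCA"])  # mlx5_0
# subprocess.run(["nccl-tests/build/all_reduce_perf", "-b", "100M", "-e", "1G"], env=env)
```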

Lustre Tuning for ML Workloads

Lustre's default stripe count is 1, meaning each file lives on a single OST. For 64 concurrent GPU readers, you want 8, spreading every file across all 8 OSS nodes.

bash
# On compute01, set stripe parameters on the directory BEFORE writing —
# striping applies to files created afterward, not retroactively
lfs setstripe -c 8 /mnt/lustre/training-data/
 
# Verify
lfs getstripe /mnt/lustre/training-data/
# stripe_count: 8
# stripe_size: 1048576
# pattern: raid0

This distributes reads across all 8 OSS nodes in parallel.
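The mechanics are simple round-robin: with a 1 MiB stripe size, consecutive 1 MiB chunks of a file land on consecutive objects. A toy model of which OSS object serves a given file offset (this sketches the striping layout, not the actual Lustre client code):

```python
# Map a file offset to its stripe object under stripe_count=8, stripe_size=1MiB.
STRIPE_COUNT, STRIPE_SIZE = 8, 1 << 20

def oss_object_for(offset: int) -> int:
    """Round-robin: chunk index modulo the stripe count."""
    return (offset // STRIPE_SIZE) % STRIPE_COUNT

# A 16 MiB sequential read touches every object twice -> all 8 OSSs in parallel
touched = {oss_object_for(off) for off in range(0, 16 << 20, STRIPE_SIZE)}
print(sorted(touched))  # [0, 1, 2, 3, 4, 5, 6, 7]
```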

Read-ahead tuning:

bash
# Raise the RPC size to 4MB (1024 pages × 4KB; larger RPCs amortize round-trips)
lctl set_param osc.*.max_pages_per_rpc=1024
 
# Widen the client read-ahead window (per-file, in MB)
lctl set_param llite.*.max_read_ahead_mb=256

Expected throughput gain: an additional 50-100 GB/s on top of the ~200 GB/s baseline.

GPU Clock and Power Management

By default, GPUs clock down when not under full load. Lock clocks at max for training to reduce variance.

bash
# Set GPU to max-performance mode (requires root on host)
nvidia-smi -pm 1
nvidia-smi -lgc 1980  # Lock clock to 1980 MHz (H100 SXM5 max boost)
 
# Verify
nvidia-smi -q -d CLOCK | grep Graphics
# Graphics                    : 1980 MHz

Power capping: H100 allows power limiting for thermal control.

bash
nvidia-smi -pl 600  # Cap at 600W (vs 700W default)

Caution: Power limiting reduces throughput. Only use if thermal budget is tight.

Memory and CPU Pinning

Prevent OS scheduler from migrating processes. Pin compute threads to CPU sockets.

bash
# Query NUMA layout
numactl -H
 
# Run training with CPU pinning
numactl --cpunodebind=0 --membind=0 python train.py  # Pin to NUMA node 0 CPUs and memory

This reduces memory latency for CPU-side collective operations.
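You can also apply the CPU half of this pinning from inside the training process itself (Linux-only). This sketch assumes NUMA node 0 covers logical CPUs 0-63 on this dual-socket box — check `numactl -H` for your real layout:

```python
import os

# Pin this process to the (assumed) NUMA-0 CPU range from inside Python.
avail = os.sched_getaffinity(0)          # CPUs we're currently allowed on
numa0 = {c for c in avail if c < 64}     # assumed NUMA node 0 range

os.sched_setaffinity(0, numa0)           # restrict this PID to NUMA 0
print(os.sched_getaffinity(0) == numa0)  # True
```

Unlike `numactl --membind`, this only pins CPUs; memory placement still needs numactl or libnuma.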

Operations & Troubleshooting

Your cluster is live. Now what breaks?

Common Failure Modes

Symptom: Training jobs hang on all-reduce.

Diagnosis:

bash
# SSH to compute01
dcgmi dmon -c 1  # Check GPU temps and power
ibnodes  # Verify IB fabric discovery
scontrol show node compute01  # Check Slurm state

Root causes:

  1. Thermal throttling (GPU >85°C): Cooling malfunction. Check DLC flow.
  2. IB fabric misconfiguration: Run ibdiagnet to scan for link errors.
  3. Packet loss on RoCEv2: PFC/ECN not enabled on switch ports.

Remedy:

  • Throttling: Check coolant temperature (should be 12-15°C at input). Verify pump is running.
  • IB issues: Reload the HCA driver (rmmod mlx5_ib mlx5_core && modprobe mlx5_core) and check link speed with ibstat.
  • RoCEv2 packet loss: Enable PFC for the RoCE traffic class on every switch port via the switch's QoS CLI, and on hosts with mlnx_qos (e.g., mlnx_qos -i eth0 --pfc 0,0,0,1,0,0,0,0).

Symptom: Job gets preempted even though cluster is idle.

Diagnosis:

bash
scontrol show job <job_id>
# Look for Priority, QOS, and PreemptedBy fields

Root cause: Fair-share priority is inverted. High-priority user jobs are preempting lower-priority runs.

Remedy:

bash
# Check current priority weights
scontrol show config | grep Priority
scontrol show config | grep Preempt
 
# If fair-share is too aggressive, reduce weight
scontrol reconfigure
# Edit /etc/slurm/slurm.conf
# PriorityWeightFairshare=10000 → 5000
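To see why that weight matters, note that Slurm's multifactor priority is a weighted sum of factors it normalizes to [0, 1]. A simplified sketch using the weights from the slurm.conf above (the factor values are illustrative, not measured; real Slurm also includes TRES and QOS terms):

```python
# Simplified Slurm multifactor priority: weighted sum of normalized factors.
WEIGHTS = {"age": 1000, "fairshare": 10000, "jobsize": 1000, "partition": 1000}

def job_priority(factors: dict) -> float:
    """Each factor is already normalized to [0, 1] by Slurm before weighting."""
    return sum(WEIGHTS[k] * v for k, v in factors.items())

# Light user (has used little of their share) vs heavy user with an older job
light = job_priority({"age": 0.2, "fairshare": 0.9, "jobsize": 0.5, "partition": 1.0})
heavy = job_priority({"age": 0.8, "fairshare": 0.1, "jobsize": 0.5, "partition": 1.0})
print(int(light), int(heavy))  # 10700 3300 — fairshare dominates, as intended
```

Halving PriorityWeightFairshare narrows that gap, letting age and job size matter more.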

Symptom: GPU memory fragmentation. Training job OOMs despite free memory showing available.

Diagnosis:

bash
# Check reported free/used memory per GPU
nvidia-smi --query-gpu=memory.free,memory.used --format=csv
 
# Inside PyTorch, release cached-but-unused allocator blocks:
python -c "import torch; torch.cuda.empty_cache()"

Root cause: Previous job didn't release memory cleanly (driver bug or process crash).

Remedy:

bash
# Hard reset GPU memory
nvidia-smi --gpu-reset -i 0  # Requires root and no processes using the GPU
 
# Or reboot node
scontrol update NodeName=compute01 State=DOWN
# After reboot
scontrol update NodeName=compute01 State=UP

Cost-Performance Trade-offs

Let's be real about budget. What can you skip?

Option 1: Budget Cluster ($1.2M)

  • 4 nodes × 8 H100 GPUs = 32 GPUs
  • Single InfiniBand HDR switch
  • Lustre with 2 OSS nodes (60 GB/s)
  • No redundancy on power/cooling

Trade-off: Slower training (32 vs 64 GPUs). Higher cost per GPU (economies of scale lost). No HA.

Training time: 2x slower than full cluster.

Option 2: Mid-Range ($2.0M)

  • 8 nodes × 4 H100 GPUs = 32 GPUs
  • Same fabric as full cluster
  • WekaFS instead of Lustre (simpler ops)
  • Shared cooling (not DLC per node)

Trade-off: Fewer GPUs per node. Reduced tensor parallelism options.

Training time: 2x slower, but more flexible.

Option 3: Full Production ($2.8M)

  • 8 nodes × 8 H100 GPUs = 64 GPUs
  • Redundant fabric (dual IB switches)
  • Lustre with 8 OSS nodes (300+ GB/s)
  • Per-node DLC, redundant cooling loops

Trade-off: Highest upfront cost. Best long-term economics.

Training time: Baseline.

For a research team doing weekly training runs, Option 1 is pragmatic. For production ML services, Option 3 is essential.
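The per-GPU economics are worth computing from the article's own figures (hardware only; installation and facility costs excluded):

```python
# Hardware cost per GPU for the three options above.
options = {
    "budget":   (1_200_000, 32),
    "midrange": (2_000_000, 32),
    "full":     (2_800_000, 64),
}
per_gpu = {name: cost / gpus for name, (cost, gpus) in options.items()}
for name, dollars in per_gpu.items():
    print(f"{name:9s} ${dollars:,.0f} per GPU")
```

Note that the full build comes out cheaper per GPU than the mid-range option — the economies of scale mentioned above, made concrete.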

Maintenance and Long-Term Operations

Your cluster is built. Now it's a 5-year commitment.

Year 1: Optimization

  • Profile bottlenecks (network, storage, compute).
  • Tune NCCL, Lustre, and kernel parameters.
  • Document configuration (version control your slurm.conf!).

Year 2-3: Scaling

  • Add 2-4 more compute nodes (fabric supports it).
  • Expand storage if training datasets grow.
  • Monitor GPU health trends (ECC errors, thermal drift).

Year 4-5: Refresh

  • H100s will age. New GPUs (likely 2-3x more performance) will be available.
  • Plan GPU upgrade on oldest nodes first.
  • Plan Ethernet/IB switch refresh (200G HDR → 400G/800G fabrics).

Pro tip: Keep spare parts on hand. One failed NVLink bridge or IB HCA can idle the cluster for days if you have to order it.

Why Production Reliability Matters

This might seem like overkill for a new cluster. You think, "I'll just rebuild if something breaks." But in practice, a training job that dies at day 10 of 14 is catastrophic. You restart, you're back at day 0. Multiply that by hundreds of researchers running jobs, and suddenly reliability isn't nice-to-have, it's foundational.
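The arithmetic of checkpointing makes the point starkly. For the 14-day run above, compare worst-case lost work with and without periodic checkpoints (the 4-hour interval is an assumed example):

```python
# Worst-case lost work on failure, with and without checkpointing.
RUN_DAYS, CHECKPOINT_HOURS = 14, 4

worst_without = RUN_DAYS * 24        # die near the end, restart from day 0
worst_with = CHECKPOINT_HOURS        # lose at most one checkpoint interval
print(worst_without, worst_with)     # 336 4
```

336 hours of lost compute versus 4 — and that's per failure, per job. Reliability engineering pays for itself quickly at cluster scale.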

Dual power supplies mean one PSU failure doesn't take the cluster down. Redundant cooling means one pump failure is detected before GPUs overheat. Automated health checks mean bad GPUs are offline before they corrupt your training data. RAID-6 on storage means two drive failures don't wipe out your training datasets.

These might cost 10-15% more upfront. Over a 5-year cluster lifetime, they save millions in lost training time and lost data.

Summary

Building a GPU cluster requires discipline across multiple domains:

  • Architecture: Separate networks by traffic class (training fabric vs. management).
  • Interconnect: InfiniBand HDR is the reliable choice for 8-node clusters; validate with NCCL.
  • Storage: Lustre or WekaFS at 200+ GB/s aggregate to feed your 64 GPUs without starvation.
  • Thermal: Direct liquid cooling fed by 12-15°C facility water, with automated health checks tripping at 85°C.
  • Cluster management: Slurm + DCGM isolate failures and prevent cascading issues.

The configuration reference and checklist above are starting points. Every cluster is different, but these fundamentals apply universally.

Start with validation. Before you train a model, prove the network, storage, and thermal systems work under load. A 10-minute NCCL test saves you hours of debugging training job timeouts.


Need help implementing this?

We build automation systems like this for clients every day.

Discuss Your Project