Weights & Biases for Experiment Tracking at Scale
You're running ten different training experiments across multiple GPUs. One uses a different learning rate schedule. Another tries a new regularization technique. A third is distributed across eight machines. By tomorrow, you won't remember which config produced that promising 94.2% accuracy. Welcome to the problem that every ML team faces: experiment chaos at scale.
Without proper tracking, you're flying blind. Your models improve, but you can't reproduce results. Your team runs redundant experiments. Your deployments use weights that came from... which run exactly? Weights & Biases (W&B) solves this by giving you a unified, production-grade experiment tracking platform that scales with your ambitions. Let's build a real system together.
Table of Contents
- The Core Problem: Why Experiment Tracking Matters
- Setting Up W&B for Production
- Initialization with Project and Entity Naming
- Environment Variables for CI/CD Integration
- Group and Job Type for Distributed Training
- Logging Rich Metrics and Artifacts
- Logging Custom Visualizations
- Logging Sample Predictions with Images
- Logging 3D Point Clouds and Spectrograms
- Hyperparameter Sweeps at Scale
- Defining a Sweep Configuration
- Launching Sweep Agents
- Analyzing Results with Parallel Coordinates
- Artifact Management and Versioning
- Creating and Logging Artifacts
- Artifact Aliases for Model Selection
- Dataset Versioning and Lineage
- Team Collaboration and Sharing
- Creating Collaborative Reports
- Model Benchmarking Tables
- Access Control and Secrets
- Building a Reusable W&B Training Wrapper
- Data Model Visualization
- Distributed Training Tracking
- Putting It All Together
- Advanced Configuration for Multi-Team Workflows
- Entity Structure for Large Organizations
- Custom Metrics and Derived Values
- Integration with Model Registries
- Production Lessons: Making Tracking Actually Stick
- Do: Log Often, Log Everything
- Don't: Log Sensitive Data
- Do: Use Sweeps for Hyperparameter Tuning
- Don't: Ignore Artifact Cleanup
- Do: Version Your Data
- Measuring Success: How to Know If Your Tracking Is Working
- Summary and Next Steps
The Core Problem: Why Experiment Tracking Matters
Machine learning is fundamentally empirical. You change a hyperparameter, train a model, measure accuracy, and iterate. The challenge isn't running experiments - it's managing them intelligently when you're shipping dozens per week across distributed infrastructure.
Think about what happens in practice at scale. A typical ML team might run three models in parallel on Monday, diverge from those to try five variants by Wednesday, then realize two of the variants from last week had some parameter configuration that produced interesting intermediate results worth investigating further. Without tracking, you're left scrolling through command history, hoping to reconstruct exactly which config produced that 94.2% accuracy score. With proper tracking, you open a dashboard and immediately see it was run_id 847392 with learning rate 0.0003 and batch size 128, trained on commit hash abc123def456. That reproducibility is how teams scale from single-person projects to enterprise ML operations.
The real cost of missing experiment tracking isn't just in lost productivity, though that's real. It's in the compounding technical debt. Each experiment you can't reproduce means each team member essentially starts from scratch on the next iteration. You lose organizational knowledge. You repeat failed experiments because you forgot you already tried that approach. You ship models to production without being able to explain exactly which hyperparameters they used. You have no audit trail if something goes wrong in production and you need to trace back to the training decision that caused it. These aren't edge cases - they're what happens when organizations skip infrastructure investment early and pay for it constantly later.
This is where most ML teams fail, not from technical incompetence but from organizational chaos. A team of five engineers running experiments without proper tracking inevitably produces redundant work. Engineer A runs learning_rate=1e-4 and gets 92% accuracy. Engineer B doesn't know this and runs the same experiment again. Engineer C modifies the config slightly and gets 92.3% but doesn't document why. By the end of the month, nobody remembers which configuration is actually in production. Is it from run_23, run_47, or run_51? They all have similar accuracy. You shrug and pick run_51 because it's most recent. Later, you realize run_23 was actually best, but you can't reproduce it because the data pipeline has changed.
This scenario repeats in teams every day. And it's incredibly expensive. You're wasting GPU hours on duplicate experiments. You're making decisions with incomplete information. You're losing institutional knowledge when engineers leave. You're unable to confidently explain why production model v3 is better than v2.
Think about what this costs. Each wasted GPU hour is real money - maybe fifty dollars on cloud infrastructure. If your team runs ten wasted duplicate experiments per month due to poor tracking, that's thousands of dollars in compute waste. More insidiously, you lose time. Engineers spend hours digging through Slack histories and Google Drive folders trying to remember which config produced good results. That's time stolen from actual progress. And when model performance degrades in production and you need to debug, you can't reproduce the original training conditions because the logs are gone and the data has drifted.
Traditional approaches fail fast. Spreadsheets get out of sync. Cloud storage bucket names become cryptic. Git commits containing model weights bloat your repository. Slack messages saying "this run was good" aren't reproducible. You need a system that:
- Tracks every experiment and its complete configuration
- Captures metrics, artifacts, and system telemetry automatically
- Enables collaboration so your team shares findings, not confusion
- Scales to hundreds or thousands of runs without manual overhead
- Integrates with your existing CI/CD and training infrastructure
W&B is designed specifically for this problem. It's not just logging - it's a complete experiment management platform built for teams doing serious ML at scale.
The fundamental insight is that experiment tracking isn't overhead - it's core infrastructure. If you can't reproduce an experiment, you haven't learned anything. You're just running code. Real learning requires reproducibility: the ability to say "that run was good because of X and Y," and then prove it by running it again and getting the same result.
W&B makes reproducibility automatic. Every run records its hyperparameters, the exact code commit that produced it, all metrics, all artifacts, and full system telemetry. Want to know why run 47 had 2GB more memory usage than run 46? Check the metadata. Want to reproduce run 47 exactly? Check out the git commit, load the artifact, and run with the config. The platform handles all the bookkeeping.
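The bookkeeping W&B does can be queried back out through its public API. Here is a sketch of pulling a run's full "recipe" — the entity, project, and run id are placeholders drawn from the example above, and it assumes the wandb package is installed with an API key configured:

```python
def run_path(entity: str, project: str, run_id: str) -> str:
    """Build the 'entity/project/run_id' path the public API expects."""
    return f"{entity}/{project}/{run_id}"

def fetch_run_recipe(entity: str, project: str, run_id: str) -> dict:
    """Pull everything needed to reproduce a run: config, commit, final metrics."""
    import wandb  # deferred so run_path stays importable without wandb installed

    api = wandb.Api()
    run = api.run(run_path(entity, project, run_id))
    return {
        "config": dict(run.config),    # exact hyperparameters
        "commit": run.commit,          # git commit that produced the run
        "summary": dict(run.summary),  # final logged metrics
    }

# Usage (requires network access and a configured WANDB_API_KEY):
#   recipe = fetch_run_recipe("my-team", "computer-vision-2024", "847392")
```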
This reproducibility has profound implications for team dynamics. When you have a system where every experiment is fully captured and reproducible, team members can build on each other's work confidently. Alice can say "I got 93% accuracy with this config" and hand off her checkpoint to Bob. Bob can load that checkpoint, understand exactly what Alice did, and iterate from there. They're not duplicating effort. They're compounding knowledge. This is how high-velocity ML teams operate.
But W&B goes deeper than just recording history. It provides intelligence on top of that history. Hyperparameter sweep agents suggest which configurations to try next based on what worked before. Custom metrics let you compute and track derived values (tokens per GPU hour, accuracy per training dollar spent). Lineage tracking shows you which dataset version trained which model. This is where W&B shifts from "nice logging tool" to "strategic infrastructure." You can actually understand what works and why.
Consider what you'd need to build if W&B didn't exist. You'd need a database to store experiment metadata. You'd need a web interface to browse results. You'd need APIs for logging. You'd need a hyperparameter optimization service. You'd need artifact versioning and lineage tracking. You'd need to integrate with every framework you use. You'd need to maintain all of this as your team grows. That's easily six months of engineering work, and it's a distraction from actual model development. W&B is a productivity multiplier because it eliminates all this infrastructure work and lets you focus on the actual science.
Setting Up W&B for Production
Before you can track experiments effectively, you need to initialize W&B correctly. This means establishing naming conventions, configuration patterns, and environment-based setup that work across your entire organization.
The difference between a W&B setup that works and one that becomes a nightmare is organizational discipline. If every engineer initializes W&B differently, you end up with a dashboard that's unusable - thousands of runs with inconsistent naming, missing configs, and unclear purpose. Six months in, nobody can find anything. The platform that was supposed to solve chaos becomes a source of chaos.
The solution is simple: establish patterns early and enforce them. Create a standard wrapper that every team member uses. Define naming conventions. Specify which fields are required. Make it so easy to do it right that it's actually harder to do it wrong. Think of this as building guardrails, not requirements. You're not asking engineers to remember a bunch of rules; you're building the rules into the code they use.
Initialization with Project and Entity Naming
Every W&B run lives in a project within an entity. Your entity is your workspace or team. Your project groups related runs together. Getting this right prevents chaos as your team grows.
Here's the pattern we recommend:
import wandb
import os
import subprocess
from datetime import datetime

def setup_wandb_run(
    project_name: str,
    run_name: str = None,
    job_type: str = "train",
    config_dict: dict = None,
    entity: str = None,
):
    """Initialize a W&B run with production naming conventions."""
    entity = entity or os.getenv("WANDB_ENTITY", "my-team")
    # Generate a consistent run name: timestamp + git hash
    if run_name is None:
        timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        git_hash = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"]
        ).decode().strip()
        run_name = f"{timestamp}-{git_hash}"
    run = wandb.init(
        entity=entity,
        project=project_name,
        name=run_name,
        job_type=job_type,
        config=config_dict or {},
        tags=["production"] if os.getenv("CI") else ["local"],
    )
    return run

# Usage
config = {
    "learning_rate": 1e-3,
    "batch_size": 32,
    "epochs": 50,
    "model": "resnet50",
}
run = setup_wandb_run(
    project_name="computer-vision-2024",
    job_type="train",
    config_dict=config,
)
This naming convention makes it trivial to find any run. The timestamp + git hash ensures reproducibility - you can always check out that exact commit and retrain.
Environment Variables for CI/CD Integration
W&B respects several environment variables that make CI/CD integration seamless. You don't want to hardcode credentials or project names in your training scripts.
# .env.prod (not in git!)
export WANDB_API_KEY="your-api-key-here"
export WANDB_ENTITY="my-team"
export WANDB_PROJECT="model-training"
export WANDB_MODE="online"
# For local development
export WANDB_MODE="offline" # Syncs later
# Disable W&B in testing
export WANDB_DISABLED="true"
Your training script reads these automatically:
import os
# W&B respects these environment variables
# WANDB_API_KEY - authentication
# WANDB_ENTITY - team/workspace
# WANDB_PROJECT - project name
# WANDB_MODE - online/offline/disabled
run = wandb.init()  # Uses env vars automatically
Group and Job Type for Distributed Training
When you're running distributed training across multiple machines, you need to group them together so you can analyze aggregate metrics alongside per-rank metrics.
import wandb
import os

rank = int(os.getenv("RANK", 0))
world_size = int(os.getenv("WORLD_SIZE", 1))

run = wandb.init(
    # All ranks in this distributed training belong to the same group
    group=f"distributed-training-{os.getenv('TIMESTAMP')}",
    # Each rank is a separate "job" within the group
    job_type=f"train-rank-{rank}",
    name=f"rank-{rank}",
)

# Rank 0 logs shared metrics (loss, accuracy)
# All ranks log per-rank metrics (per-GPU memory, throughput)
if rank == 0:
    wandb.log({"global_loss": loss})
wandb.log({"rank_memory": get_memory_usage()})
When you view this in W&B, you see one logical experiment with multiple child jobs. You can zoom in to rank 0's loss or zoom out to see per-GPU memory usage across all ranks.
Logging Rich Metrics and Artifacts
Scalar metrics (loss, accuracy) are just the beginning. W&B shines when you log rich media: confusion matrices, sample predictions with images, 3D point clouds, spectrograms. This is what transforms W&B from "nice logging tool" to "experiment understanding system."
Here's the insight: numbers don't tell the whole story. Your model has 94% accuracy. Is that good? It depends on which classes you're getting right and which you're getting wrong. Maybe you're getting cats and dogs right but completely failing on horses. That 94% is misleading - your model is actually good at 66% of your task (2 out of 3 classes). Logging a confusion matrix shows this immediately. You see the failure pattern and know exactly what to fix.
Similarly, logging sample predictions lets you spot systematic errors that aggregate metrics hide. Your model might have 95% accuracy overall, but all five errors are edge cases (images at odd angles, poor lighting, occlusions). That tells you something different than if the errors are distributed randomly. You might conclude "I need more edge case data" versus "I need better data labeling." The metrics are identical; the correct next step is completely different.
This is why rich logging matters for actual learning. You're not just recording what happened; you're recording the evidence that lets you understand why it happened. This is the difference between "I ran an experiment" and "I learned something from an experiment."
Logging Custom Visualizations
The W&B media logging API goes far beyond scalars. Let's log a confusion matrix that updates during training:
import wandb

def log_confusion_matrix(y_true, y_pred, class_names, step):
    """Log an interactive confusion matrix as a W&B chart."""
    # wandb.plot.confusion_matrix builds the matrix from the raw
    # labels and predictions, so no sklearn preprocessing is needed
    wandb.log({
        "confusion_matrix": wandb.plot.confusion_matrix(
            y_true=y_true,
            preds=y_pred,
            class_names=class_names,
        ),
        "step": step,
    })

# During evaluation
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 1, 1, 0, 1, 2]
class_names = ["cat", "dog", "bird"]
log_confusion_matrix(y_true, y_pred, class_names, step=1000)
In the W&B dashboard, you get an interactive confusion matrix. Hover to see exact counts. Filter by class. Watch it improve over epochs.
Logging Sample Predictions with Images
For computer vision, logging actual predictions alongside ground truth is invaluable for debugging:
import wandb

def log_prediction_samples(images, predictions, ground_truth, num_samples=8):
    """Log sample predictions with images for visual inspection."""
    samples = []
    for i in range(min(num_samples, len(images))):
        samples.append(
            wandb.Image(
                images[i],
                caption=f"Pred: {predictions[i]}, True: {ground_truth[i]}",
            )
        )
    wandb.log({"prediction_samples": samples})

# Usage: pull one validation batch and log the model's predictions on it
images, ground_truth = next(iter(validation_loader))
logits = model(images)
predictions = logits.argmax(dim=1)
log_prediction_samples(
    images.cpu().numpy(),
    predictions.cpu().numpy(),
    ground_truth.cpu().numpy(),
    num_samples=16,
)
Now you can visually inspect failures, spot systematic errors, and identify edge cases - without downloading datasets.
Logging 3D Point Clouds and Spectrograms
W&B handles other modalities too. For 3D vision:
import wandb
import numpy as np

# 3D point cloud (Nx3 array of points)
point_cloud = np.random.rand(1000, 3)
wandb.log({
    "point_cloud": wandb.Object3D(point_cloud),
})
For audio:
import wandb
import numpy as np

# Raw audio; W&B renders a playable widget with a waveform view
audio_array = np.random.rand(16000)  # 1 second at 16kHz
wandb.log({
    "audio_spectrogram": wandb.Audio(
        audio_array,
        sample_rate=16000,
        caption="Model output spectrogram",
    ),
})
The point: W&B's media logging API is extensible. You're not limited to numbers. You're logging the raw evidence of what your model learned.
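The same idea extends to text. Here is a small sketch that uses wandb.Table to log prompt/output pairs for side-by-side review; the column names and score field are illustrative, not a W&B convention:

```python
def as_rows(prompts, outputs, scores):
    """Zip parallel lists into table rows (pure helper, easy to unit test)."""
    return [list(row) for row in zip(prompts, outputs, scores)]

def log_text_samples(run, prompts, outputs, scores):
    """Log text samples as a wandb.Table for side-by-side review."""
    import wandb  # deferred so as_rows stays importable without wandb installed

    table = wandb.Table(columns=["prompt", "model_output", "score"])
    for row in as_rows(prompts, outputs, scores):
        table.add_data(*row)
    run.log({"text_samples": table})

# Usage inside an active run:
#   log_text_samples(run, ["translate: hello"], ["bonjour"], [0.97])
```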
Hyperparameter Sweeps at Scale
Searching for the right hyperparameters is one of ML's most expensive operations. Run a thousand different learning rates, and you'll waste GPU hours on bad configurations. Sweep agents + early termination + Bayesian optimization make this efficient.
Defining a Sweep Configuration
A sweep YAML declares your search space and strategy:
# sweep_config.yaml
program: train.py
method: bayes  # Can be: bayes, grid, random
metric:
  name: validation/accuracy
  goal: maximize
parameters:
  learning_rate:
    min: 0.00001
    max: 0.1
  batch_size:
    values: [16, 32, 64, 128]
  dropout:
    min: 0.0
    max: 0.5
  weight_decay:
    min: 0.0
    max: 0.01
early_terminate:
  type: hyperband
  max_iter: 10
  min_iter: 5
  eta: 3
This says: "Search for the best learning rate, batch size, dropout, and weight decay using Bayesian optimization. Stop runs that don't improve by iteration 5."
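The config points the agents at train.py, so that script has to read its hyperparameters from wandb.config and log the exact metric the sweep optimizes. A minimal sweep-compatible sketch might look like this — the scoring function is a toy stand-in for real training, invented here for illustration:

```python
import math

def simulated_accuracy(learning_rate: float, dropout: float) -> float:
    """Toy stand-in for real training; peaks near lr=1e-3, dropout=0.2."""
    lr_term = -abs(math.log10(learning_rate) + 3.0) * 0.02
    drop_term = -abs(dropout - 0.2) * 0.1
    return 0.95 + lr_term + drop_term

def train():
    """Entry point each sweep agent executes."""
    import wandb  # deferred so simulated_accuracy stays testable without wandb

    run = wandb.init()  # the agent injects this trial's hyperparameters
    cfg = wandb.config  # exposes learning_rate, dropout, etc. from the sweep
    for step in range(10):
        acc = simulated_accuracy(cfg.learning_rate, cfg.dropout)
        # Log the exact metric name the sweep's `metric` block optimizes
        wandb.log({"validation/accuracy": acc, "step": step})
    run.finish()

# train.py would end with:
#   if __name__ == "__main__":
#       train()
```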
Launching Sweep Agents
Now you launch agents that run trials:
# Create the sweep (returns sweep_id)
wandb sweep sweep_config.yaml
# Output: Creating sweep with ID: xyz123
# Launch 4 agents on this machine, each running up to 100 trials
wandb agent xyz123 --count=100 &
wandb agent xyz123 --count=100 &
wandb agent xyz123 --count=100 &
wandb agent xyz123 --count=100 &
# Or on a different machine
wandb agent xyz123 --count=100 &
Each agent pulls the next trial from the central W&B service, runs train.py with those hyperparameters, logs metrics, and asks for the next trial. Bayes optimization analyzes results and suggests promising regions. Early termination stops unpromising runs.
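If you'd rather stay in Python (say, inside a notebook or a CI job), sweeps can also be created and driven programmatically. A sketch, where train_fn is any function that calls wandb.init() and logs validation/accuracy:

```python
SWEEP_CONFIG = {
    "method": "bayes",
    "metric": {"name": "validation/accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"min": 0.00001, "max": 0.1},
        "batch_size": {"values": [16, 32, 64, 128]},
    },
}

def launch_sweep(entity: str, project: str, train_fn, trials: int = 20) -> str:
    """Create the sweep, then run `trials` trials in this process."""
    import wandb  # deferred so SWEEP_CONFIG stays importable without wandb

    sweep_id = wandb.sweep(SWEEP_CONFIG, entity=entity, project=project)
    wandb.agent(sweep_id, function=train_fn, count=trials)
    return sweep_id
```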
Analyzing Results with Parallel Coordinates
Once your sweep completes, W&B's parallel coordinates visualization becomes invaluable:
# Query sweep results via API
import wandb

api = wandb.Api()
sweep = api.sweep("entity/project/sweep_id")

# Access all runs in the sweep
for run in sweep.runs:
    print(f"Run: {run.name}")
    print(f"  LR: {run.config['learning_rate']}")
    print(f"  Accuracy: {run.summary['validation/accuracy']}")
In the dashboard, you see a parallel coordinates plot: each line is a run, colored by accuracy. Drag axes to filter - "show me all runs with learning rate > 1e-3 and accuracy > 92%." This reveals which hyperparameter combinations worked and why.
Artifact Management and Versioning
Artifacts are W&B's way of versioning datasets, model weights, and any other large files. Unlike raw files, artifacts track lineage: "this model was trained on this dataset version."
Creating and Logging Artifacts
import wandb
import torch

def save_model_artifact(model, epoch, accuracy):
    """Save model weights as a versioned W&B artifact."""
    # Save locally
    model_path = f"model_epoch_{epoch}.pt"
    torch.save(model.state_dict(), model_path)
    # Create the W&B artifact
    artifact = wandb.Artifact(
        name="model",
        type="model",
        description=f"Model at epoch {epoch}, accuracy={accuracy:.4f}",
    )
    # Add the file
    artifact.add_file(model_path)
    # Log it (creates a new version); assumes an active run from wandb.init()
    run.log_artifact(artifact)
    return artifact

# Usage during training
artifact = save_model_artifact(model, epoch=50, accuracy=0.942)
print(f"Model saved as {artifact.name}:{artifact.version}")
# Output: Model saved as model:v0
Each call creates a new version (v0, v1, v2, ...). You're building a complete history.
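The consuming side is symmetric: a later run declares the artifact as an input with use_artifact (which is also what records lineage) and downloads its files. A sketch, with the project name taken from the earlier examples:

```python
def parse_artifact_ref(ref: str):
    """Split 'model:v7' into (name, version-or-alias); default alias is latest."""
    name, _, version = ref.partition(":")
    return name, version or "latest"

def download_checkpoint(ref: str) -> str:
    """Declare the artifact as a run input (recording lineage), then download."""
    import wandb  # deferred so parse_artifact_ref stays testable without wandb

    run = wandb.init(project="computer-vision-2024", job_type="evaluate")
    artifact = run.use_artifact(ref)  # e.g. "model:v7" or "model:best"
    return artifact.download()        # local directory containing the files
```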
Artifact Aliases for Model Selection
Instead of remembering "v7 was the best model," use aliases:
import wandb

run = wandb.init(project="vision-models")

# Log the new checkpoint and tag it as "best" in one step; an alias
# points to exactly one version, so "best" moves off the old version
run.log_artifact(model_artifact, aliases=["best"])

# To edit aliases on an older version later, go through the API
# (the v3 reference here is illustrative)
api = wandb.Api()
previous_best = api.artifact("entity/project/model:v3")
previous_best.aliases.append("archived")
previous_best.save()

# Later, in deployment
best_model = api.artifact("entity/project/model:best")
# Always fetches the version currently tagged "best"
Now your deployment pipeline always uses model:best, and this alias automatically points to the latest production-ready weights.
Dataset Versioning and Lineage
Track which dataset version trained which model:
import wandb
# Create dataset artifact
dataset_artifact = wandb.Artifact(
name="imagenet-subset",
type="dataset",
description="ImageNet subset, 100K images, normalized",
)
dataset_artifact.add_dir("./data/imagenet-100k")
run.log_artifact(dataset_artifact)
# The run now has lineage: it consumed this exact dataset version
# Later, if you discover a problem with the data, you can trace all models
# that used it and retrain with the corrected version
In W&B's lineage view, you see: Dataset v2 → Model v5 → Production Deployment. If dataset v2 had a bug, you know exactly which models need retraining.
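That lineage can also be walked programmatically. A sketch using the public API's used_artifacts/logged_artifacts accessors — entity, project, and run id are placeholders:

```python
def lineage_lines(run_name, consumed, produced):
    """Render 'input -> run -> output' edges as text (pure helper)."""
    lines = [f"{name} -> {run_name}" for name in consumed]
    lines += [f"{run_name} -> {name}" for name in produced]
    return lines

def print_lineage(entity: str, project: str, run_id: str):
    """Print which artifacts a run consumed and produced."""
    import wandb  # deferred so lineage_lines stays testable without wandb

    api = wandb.Api()
    run = api.run(f"{entity}/{project}/{run_id}")
    consumed = [a.name for a in run.used_artifacts()]    # e.g. dataset versions
    produced = [a.name for a in run.logged_artifacts()]  # e.g. model checkpoints
    for line in lineage_lines(run.name, consumed, produced):
        print(line)
```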
Team Collaboration and Sharing
Individual experiments are useful. Shared findings are invaluable. W&B Reports let your team document and compare experiments.
Creating Collaborative Reports
import wandb

# Reports are created in the W&B dashboard, but you can generate
# summary data programmatically
api = wandb.Api()
runs = api.runs("entity/project")

# Collect data for comparison
results = []
for run in runs:
    if "final_accuracy" in run.summary:
        results.append({
            "name": run.name,
            "accuracy": run.summary["final_accuracy"],
            "config": run.config,
            "tags": run.tags,
        })

# Sort and print the top 10
results_sorted = sorted(results, key=lambda x: x["accuracy"], reverse=True)
for i, result in enumerate(results_sorted[:10]):
    print(f"{i+1}. {result['name']}: {result['accuracy']:.4f}")
Use the W&B dashboard to create a Report: drag in charts, add notes, compare runs side-by-side. Share the link with your team. Reports are self-updating - as new runs complete, charts update automatically.
Model Benchmarking Tables
Compare models across multiple metrics:
import wandb
import pandas as pd

api = wandb.Api()
benchmark_data = []
for model_name in ["resnet50", "efficientnet", "vit-b16"]:
    best_run = api.runs(
        "entity/project",
        filters={"config.model": model_name},
        order="-summary_metrics.validation/accuracy",
    )[0]
    benchmark_data.append({
        "model": model_name,
        "accuracy": best_run.summary["validation/accuracy"],
        "f1": best_run.summary["validation/f1"],
        "latency_ms": best_run.summary["inference_latency_ms"],
        "parameters_m": best_run.summary["num_parameters"] / 1e6,
    })

# Log as a table
table = wandb.Table(dataframe=pd.DataFrame(benchmark_data))
run.log({"model_benchmarks": table})
Now your team has a living, shared document of model performance. Everyone sees the same data.
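A benchmark table is most useful when it drives a decision. One pattern is to collapse the columns into a single deployment score before ranking; the weighting and latency budget below are purely illustrative assumptions:

```python
def deployment_score(accuracy: float, latency_ms: float,
                     latency_budget_ms: float = 50.0) -> float:
    """Reward accuracy, penalize latency over budget (weights are made up)."""
    over_budget = max(0.0, latency_ms - latency_budget_ms) / latency_budget_ms
    return accuracy - 0.1 * over_budget

def rank_models(benchmark_data):
    """Sort benchmark rows (dicts with 'accuracy' and 'latency_ms') by score."""
    return sorted(
        benchmark_data,
        key=lambda row: deployment_score(row["accuracy"], row["latency_ms"]),
        reverse=True,
    )
```

The ranked list can be logged as another wandb.Table so the decision itself is recorded alongside the raw numbers.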
Access Control and Secrets
W&B projects have visibility settings. Restricted projects are team-only. Public projects are shareable. For credentials, keep secrets out of wandb.config entirely - config values are visible to anyone who can view the run:

import os
import wandb

# Read credentials from the environment (or a secrets manager),
# never from wandb.config - config values appear in the dashboard
hf_api_key = os.environ["HF_API_KEY"]

run = wandb.init(project="model-training")
# Log hyperparameters, not secrets
run.config.update({"model": "resnet50"})
Your team members see results but never see credentials.
Building a Reusable W&B Training Wrapper
Let's consolidate everything into a production-grade wrapper that handles the complex parts automatically:
from dataclasses import dataclass
from typing import Optional
import os
import subprocess
from datetime import datetime

import wandb
import torch

@dataclass
class TrainingConfig:
    """Training configuration dataclass."""
    learning_rate: float = 1e-3
    batch_size: int = 32
    epochs: int = 50
    model_name: str = "resnet50"
    dropout: float = 0.2
    weight_decay: float = 0.0
    # ... other hyperparameters

class DistributedWandBWrapper:
    """Handles W&B logging for distributed training."""

    def __init__(
        self,
        config: TrainingConfig,
        project: str,
        entity: Optional[str] = None,
        run_name: Optional[str] = None,
    ):
        self.rank = int(os.environ.get("RANK", 0))
        self.world_size = int(os.environ.get("WORLD_SIZE", 1))
        # Only rank 0 logs to W&B
        if self.rank == 0:
            if run_name is None:
                timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
                git_hash = subprocess.check_output(
                    ["git", "rev-parse", "--short", "HEAD"]
                ).decode().strip()
                run_name = f"{timestamp}-{git_hash}"
            self.run = wandb.init(
                entity=entity or os.getenv("WANDB_ENTITY"),
                project=project,
                name=run_name,
                config=config.__dict__,
                group=os.getenv("TRAINING_GROUP"),
                job_type=f"train-rank-{self.rank}",
            )
        else:
            self.run = None
        self.config = config

    def log_metrics(self, metrics: dict, step: int):
        """Log metrics (rank 0 only)."""
        if self.rank == 0 and self.run is not None:
            wandb.log(metrics, step=step)

    def log_model_artifact(self, model_path: str, epoch: int):
        """Create and version a model artifact."""
        if self.rank == 0 and self.run is not None:
            artifact = wandb.Artifact(
                name=f"model-{self.config.model_name}",
                type="model",
                description=f"Epoch {epoch}",
            )
            artifact.add_file(model_path)
            self.run.log_artifact(artifact)

    def finish(self):
        """Finalize the run."""
        if self.rank == 0 and self.run is not None:
            self.run.finish()

# Usage
if __name__ == "__main__":
    config = TrainingConfig(learning_rate=1e-3, batch_size=64)
    # Initialize W&B
    wb = DistributedWandBWrapper(
        config=config,
        project="distributed-training",
    )
    # Training loop
    for epoch in range(config.epochs):
        # ... training code ...
        loss = train_one_epoch(model, train_loader)
        val_acc = evaluate(model, val_loader)
        # Log (only rank 0 logs)
        wb.log_metrics({
            "train/loss": loss,
            "val/accuracy": val_acc,
        }, step=epoch)
        # Save and version a checkpoint every 10 epochs
        if epoch % 10 == 0:
            torch.save(model.state_dict(), f"model_epoch_{epoch}.pt")
            wb.log_model_artifact(f"model_epoch_{epoch}.pt", epoch)
    wb.finish()
This wrapper eliminates the need to sprinkle W&B code everywhere. You initialize once, call log_metrics(), and everything else (rank handling, artifact versioning, grouping) is automatic.
Data Model Visualization
Understanding W&B's structure clarifies how everything connects:
graph TD
Entity["🏢 Entity<br/>(Team/Workspace)"]
Entity -->|contains| Project["📊 Project<br/>(computer-vision)"]
Project -->|contains| Run1["▶️ Run<br/>(2024-01-15-abc123)"]
Project -->|contains| Run2["▶️ Run<br/>(2024-01-16-def456)"]
Project -->|contains| Sweep["🔄 Sweep<br/>(Hyperparameter Search)"]
Run1 -->|logs| Metrics["📈 Metrics<br/>(loss, accuracy, etc)"]
Run1 -->|logs| Artifacts["📦 Artifacts<br/>(v0, v1, v2...)"]
Run1 -->|produces| Config["⚙️ Config<br/>(hyperparameters)"]
Artifacts -->|tags| BestAlias["🏆 best<br/>(v7)"]
Artifacts -->|tags| LatestAlias["⭐ latest<br/>(v9)"]
Sweep -->|contains| ChildRun1["▶️ Trial Run<br/>(sweep/1)"]
Sweep -->|contains| ChildRun2["▶️ Trial Run<br/>(sweep/2)"]
Project -->|contains| Report["📋 Report<br/>(Shared Analysis)"]
Report -->|references| Run1
Report -->|references| Run2
style Entity fill:#e1f5ff
style Project fill:#f3e5f5
style Run1 fill:#fff3e0
style Artifacts fill:#e8f5e9
Key relationships:
- Each Entity can have multiple Projects
- Each Project contains many Runs
- Each Run logs Metrics (scalars, charts, media) and Artifacts (versioned files)
- Artifacts can have Aliases (best, latest, production)
- Sweeps create child Runs for each trial
- Reports reference Runs and communicate findings to the team
Distributed Training Tracking
When you scale to multiple machines, tracking becomes complex. W&B simplifies this:
graph LR
Machine1["🖥️ Machine 1<br/>(GPU 0-3)"]
Machine2["🖥️ Machine 2<br/>(GPU 4-7)"]
Machine1 -->|Rank 0| Rank0["Rank 0<br/>(Primary Logs)"]
Machine1 -->|Rank 1| Rank1["Rank 1<br/>(Per-Rank Logs)"]
Machine1 -->|Rank 2| Rank2["Rank 2<br/>(Per-Rank Logs)"]
Machine1 -->|Rank 3| Rank3["Rank 3<br/>(Per-Rank Logs)"]
Machine2 -->|Rank 4| Rank4["Rank 4<br/>(Per-Rank Logs)"]
Machine2 -->|Rank 5| Rank5["Rank 5<br/>(Per-Rank Logs)"]
Machine2 -->|Rank 6| Rank6["Rank 6<br/>(Per-Rank Logs)"]
Machine2 -->|Rank 7| Rank7["Rank 7<br/>(Per-Rank Logs)"]
Rank0 -->|logs global_loss| WandB["☁️ W&B<br/>Single Run"]
Rank1 -->|logs rank_memory| WandB
Rank2 -->|logs throughput| WandB
Rank3 -->|logs rank_memory| WandB
Rank4 -->|logs throughput| WandB
Rank5 -->|logs rank_memory| WandB
Rank6 -->|logs throughput| WandB
Rank7 -->|logs rank_memory| WandB
WandB -->|aggregates| Dashboard["📊 Unified Dashboard"]
Dashboard -->|shows| GlobalLoss["Loss over time<br/>(all ranks)"]
Dashboard -->|shows| MemoryDistribution["Memory per rank<br/>(per-GPU view)"]
style Machine1 fill:#bbdefb
style Machine2 fill:#bbdefb
style WandB fill:#c8e6c9
style Dashboard fill:#fff9c4
Distributed tracking strategy:
- Rank 0 logs global metrics (loss, accuracy)
- All ranks log per-rank metrics (memory, throughput)
- W&B aggregates into a single run
- Dashboard shows both aggregate trends and per-rank details
- Identify bottlenecks: "Rank 5 uses 2x memory" or "Rank 3 is slower"
Putting It All Together
You now have a system that scales from single-GPU experiments to distributed training across hundreds of GPUs. Here's a complete example:
import os

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.distributed import init_process_group

# TrainingConfig and DistributedWandBWrapper come from the wrapper section above

def main():
    # Initialize distributed if needed
    if "RANK" in os.environ:
        init_process_group(backend="nccl")

    # Load config
    config = TrainingConfig(
        learning_rate=1e-3,
        batch_size=64,
        epochs=50,
    )

    # Initialize W&B (only rank 0)
    wb = DistributedWandBWrapper(
        config=config,
        project="production-training",
        entity="my-team",
    )

    # Setup training
    model = load_model(config.model_name)
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=config.learning_rate,
        weight_decay=config.weight_decay,
    )
    train_loader = DataLoader(dataset, batch_size=config.batch_size)
    val_loader = DataLoader(val_dataset, batch_size=config.batch_size)

    # Training loop
    best_accuracy = 0
    for epoch in range(config.epochs):
        model.train()
        total_loss = 0
        for batch_idx, (images, labels) in enumerate(train_loader):
            images, labels = images.cuda(), labels.cuda()
            optimizer.zero_grad()
            logits = model(images)
            loss = nn.functional.cross_entropy(logits, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        # Evaluate
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.cuda(), labels.cuda()
                logits = model(images)
                preds = logits.argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        accuracy = correct / total

        # Log to W&B
        wb.log_metrics({
            "epoch": epoch,
            "train/loss": total_loss / len(train_loader),
            "val/accuracy": accuracy,
        }, step=epoch)

        # Save best model
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            model_path = "model_best.pt"
            torch.save(model.state_dict(), model_path)
            wb.log_model_artifact(model_path, epoch)

    wb.finish()

if __name__ == "__main__":
    main()
Run this on a single GPU, eight GPUs with DDP, or across multiple machines - W&B handles the tracking automatically.
Advanced Configuration for Multi-Team Workflows
As your organization grows, you need multi-team W&B setups. Different teams work on different projects, but all need visibility into what's happening.
Entity Structure for Large Organizations
Large organizations typically have one entity (workspace) with multiple projects, each with different access levels:
import wandb

def setup_multi_team_logging(team_name: str, project_name: str):
    """Set up W&B for a multi-team organization."""
    # All runs go to a single entity
    entity = "my-organization"
    # Namespace projects by team (W&B project names can't contain "/",
    # so use a team prefix instead)
    full_project = f"{team_name}-{project_name}"
    run = wandb.init(
        entity=entity,
        project=full_project,
        tags=[team_name],  # Tag for filtering
    )
    return run

# Vision team
vision_run = setup_multi_team_logging("vision", "object-detection")

# NLP team
nlp_run = setup_multi_team_logging("nlp", "language-models")

# In the W&B dashboard, you see an organized structure:
# - my-organization/
#   - vision-object-detection
#   - vision-segmentation
#   - nlp-language-models
#   - nlp-translation
#   - infra-benchmarks

You can set project-level permissions so each team sees only its projects. Admins see everything.
Custom Metrics and Derived Values
Beyond the raw metrics your training loop produces, you can log derived values alongside them:
import time

import torch
import wandb

# Log raw metrics
wandb.log({
    "train/loss": loss,
    "val/accuracy": accuracy,
    "val/f1": f1_score,
})

# Compute and log derived metrics
efficiency = accuracy / training_time_seconds
wandb.log({
    "computed/accuracy_per_second": efficiency,
    "computed/loss_reduction": (initial_loss - loss) / initial_loss,
})

# Log system metrics explicitly (W&B also collects many of these automatically)
wandb.log({
    "system/gpu_memory_mb": torch.cuda.memory_allocated() / 1e6,
    "system/wall_time": time.time(),
})

In the W&B dashboard, you can create custom charts combining these metrics, making it easy to spot correlations between, say, GPU memory and inference latency.
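To keep derived values consistent across training scripts, it helps to compute them in one place. A small helper - ours, purely illustrative, with the metric names from the snippet above:

```python
def with_derived_metrics(metrics: dict, initial_loss: float, elapsed_s: float) -> dict:
    """Return a copy of `metrics` augmented with the derived values above."""
    out = dict(metrics)
    out["computed/loss_reduction"] = (initial_loss - metrics["train/loss"]) / initial_loss
    out["computed/accuracy_per_second"] = metrics["val/accuracy"] / elapsed_s
    return out

# Usage: wandb.log(with_derived_metrics(metrics, initial_loss=1.0, elapsed_s=120.0))
```

Centralizing the formulas means every run computes `computed/loss_reduction` the same way, so the custom charts stay comparable across experiments.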
Integration with Model Registries
For production workflows, W&B integrates with model registries:
import wandb

run = wandb.init(project="model-training")

# Train model...
model = train()

# Save to W&B
artifact = wandb.Artifact("my-model", type="model")
artifact.add_file("model.safetensors")
run.log_artifact(artifact)

# Also push to the Hugging Face Hub for broader access
# (push_to_hub is a method on Hugging Face model classes)
model.push_to_hub("my-org/my-model")

# Log the Hugging Face URL
wandb.log({"huggingface_url": "https://huggingface.co/my-org/my-model"})

Now your model is versioned both in W&B (for experiment tracking) and on the Hugging Face Hub (for open source access).
Production Lessons: Making Tracking Actually Stick
In practice, adoption of experiment tracking often fails for reasons that have nothing to do with technical capability. You can have the most beautiful W&B workspace with perfect organization, but if your team isn't actually using it, you've built something that looks good but provides no value. The truth is that the best tracking system is the one your team actually uses consistently. This means understanding the friction points that prevent adoption and actively removing them.
The first friction point is setup complexity. Every new experiment template, and every new engineer joining the team, represents a moment where you can lose them to "I'll just run it manually this once and track it in a spreadsheet." The solution is making W&B integration completely automatic. Your default training template should have it configured and enabled by default, requiring only an environment variable to disable it. This inverts the psychology: instead of asking teams to opt into tracking, you make tracking the default and let teams opt out if they absolutely must. In my experience, this increases adoption by an order of magnitude because the barrier to entry is effectively zero.
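The opt-out switch can piggyback on `WANDB_MODE`, a real W&B environment variable that `wandb.init` honors; the validation wrapper below is our own sketch of how a training template might surface it:

```python
import os

VALID_MODES = {"online", "offline", "disabled"}

def resolve_tracking_mode(env=None) -> str:
    """Tracking defaults to on; engineers opt out with WANDB_MODE=disabled."""
    env = os.environ if env is None else env
    mode = env.get("WANDB_MODE", "online")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown WANDB_MODE: {mode!r}")
    return mode

# In the training template:
#   run = wandb.init(project="production-training", mode=resolve_tracking_mode())
```

Failing loudly on a typo like `WANDB_MODE=disable` matters more than it looks: a silently misconfigured opt-out is how runs quietly stop being tracked.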
The second friction point is feedback latency. If a data scientist runs an experiment and then waits five minutes for the W&B dashboard to update with the results, they'll skip W&B and check the local output files instead. W&B's real-time logging eliminates this: metrics appear in the dashboard as they're logged, not in a batch update later. But you need to actually stream metrics during training, not just dump them at the end. Train your team to log metrics every N batches, not just at the end of each epoch. This creates a feedback loop that keeps people engaged with the tracking system.
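One way to make "every N batches" a habit rather than a judgment call is to bake the throttle into a tiny wrapper around the log function - a sketch of ours, not a W&B feature:

```python
class ThrottledLogger:
    """Forward every N-th metrics dict to a log function (e.g. wandb.log)."""

    def __init__(self, log_fn, every_n: int = 10):
        self.log_fn = log_fn
        self.every_n = every_n
        self._count = 0

    def log(self, metrics: dict) -> None:
        # Log batch 0, N, 2N, ... so the dashboard updates throughout training
        if self._count % self.every_n == 0:
            self.log_fn(metrics)
        self._count += 1

# Usage: logger = ThrottledLogger(wandb.log, every_n=50)
```

Putting the cadence in the template (rather than in each person's loop) keeps logging frequency consistent across the team.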
The third point is accountability. In many organizations, "experiment tracking is everyone's job," which means it's nobody's job. Results get tracked inconsistently. Some people log every metric; others log almost nothing. Runs have inconsistent naming schemes. Some include the commit hash, others don't. Some include the dataset version, others leave you guessing. Without someone responsible for maintaining standards, entropy wins and your W&B workspace becomes as disorganized as an unmaintained codebase. The solution is simple: designate one person (rotating quarterly) as the W&B workspace owner. This person doesn't run all the experiments - they oversee quality and organization. They catch experiments that failed silently, clean up runs with bad names, and document standards and enforce them gently. This one person prevents chaos that would otherwise cost weeks of debugging.
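Standards stick best when they're executable. A toy checker for one hypothetical run-naming convention (`<model>-<7-char commit>-v<N>` - a convention we're inventing for illustration, not a W&B requirement):

```python
import re

# e.g. "resnet50-a1b2c3d-v2": model name, short commit hash, version
RUN_NAME = re.compile(r"^[a-z0-9-]+-[0-9a-f]{7}-v\d+$")

def is_valid_run_name(name: str) -> bool:
    """True if the run name follows the <model>-<commit>-v<N> convention."""
    return RUN_NAME.match(name) is not None
```

The workspace owner can run a check like this in CI or a nightly sweep of recent runs and nudge authors of non-conforming names, instead of policing by hand.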
Do: Log Often, Log Everything
Log at multiple granularities, and log frequently during training so you can spot when things go wrong:
# Good: Log frequently
for batch_idx, (images, labels) in enumerate(train_loader):
    # ... training step ...
    if batch_idx % 10 == 0:
        wandb.log({"batch_loss": loss}, step=batch_idx)

# Better: Namespace metrics by granularity
wandb.log({
    "batch/loss": loss,            # Per-batch
    "epoch/loss": epoch_loss,      # Per-epoch
    "system/gpu_memory": gpu_mem,  # System health, every batch
})

Don't: Log Sensitive Data
Never log API keys, passwords, or personal information:
# Bad: this writes the secret into your run history
wandb.log({"api_key": os.getenv("SECRET_KEY")})

# Good: keep secrets out of W&B entirely.
# Read them from the environment (or a secret manager) at runtime,
# and never pass them to wandb.log or wandb.config.
run = wandb.init()
api_key = os.getenv("SECRET_KEY")  # Stays local, never logged

Do: Use Sweeps for Hyperparameter Tuning
Don't manually try ten learning rates. Let Bayesian optimization explore efficiently:
# Bad: Manual grid search
for lr in [1e-5, 1e-4, 1e-3, 1e-2]:
    train(lr)  # 4 sequential runs, no intelligence

# Good: A W&B sweep with method "bayes" runs an adaptive search,
# and early termination (e.g. Hyperband) stops bad runs early

Don't: Ignore Artifact Cleanup
Artifacts accumulate. W&B doesn't auto-delete old versions. Set a retention policy:
import wandb

api = wandb.Api()

# Enumerate versions of a model artifact
versions = sorted(
    api.artifact_versions("model", "entity/project/model"),
    key=lambda a: int(a.version.lstrip("v")),
    reverse=True,  # newest first
)

# Keep only the five most recent versions
for old in versions[5:]:
    old.delete()

Or record a retention policy when you create the artifact:
artifact = wandb.Artifact(
    "model",
    type="model",
    # Metadata is informational only - enforcement is up to your cleanup job
    metadata={"retention_days": 30},
)

Do: Version Your Data
Datasets change. Track which dataset version trained which model:
import hashlib
import os

import wandb

def hash_dataset(data_dir):
    """Compute a deterministic hash of a dataset directory."""
    hasher = hashlib.sha256()
    for file in sorted(os.listdir(data_dir)):
        with open(os.path.join(data_dir, file), "rb") as f:
            hasher.update(f.read())
    return hasher.hexdigest()

dataset_hash = hash_dataset("./data")
artifact = wandb.Artifact(
    "dataset",
    type="dataset",
    metadata={"hash": dataset_hash},
)
run.log_artifact(artifact)

Later, if you discover a data quality issue, you can query: "Show me all models trained on dataset hash X."
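If you also store the hash in each run's config (an assumption on top of the artifact metadata above), the query is a simple filter. Locally that's a one-liner; against the live service, W&B's public API accepts MongoDB-style filters:

```python
def runs_with_dataset_hash(runs, target_hash):
    """Filter run records (dicts with a 'config' key) by logged dataset hash."""
    return [r for r in runs if r.get("config", {}).get("dataset_hash") == target_hash]

# Server-side equivalent, roughly:
#   api = wandb.Api()
#   api.runs("entity/project", filters={"config.dataset_hash": target_hash})
```

The server-side version avoids downloading every run, which matters once the project has thousands of them.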
Measuring Success: How to Know If Your Tracking Is Working
Most teams implement experiment tracking and then check back in six months wondering if it's actually helping. The problem is they're not measuring the right things. Raw counts like "how many experiments were tracked" are useless. What matters is whether tracking is accelerating your development cycle and preventing mistakes.
The metric that actually matters is decision speed. How long does it take your team to go from "I want to try a new approach" to "I understand whether it works and whether to pursue it further"? With good tracking, that time should shrink by 30-50% compared to manual tracking. You're not searching through dozens of runs trying to remember which config you used. You're clicking the dashboard and seeing clearly that run 483920 tried exactly this and here's what happened. Decision speed compounds over time. Teams that can iterate three times while competitors iterate twice ship better models faster. Tracking is infrastructure that enables velocity.
The second important metric is consistency. Are all your team members using W&B, or only some? Are all your models tracked, or only some? If you have 80% adoption and 20% of experiments are still tracked manually in spreadsheets, you've succeeded in creating two parallel tracking systems, which is worse than having one. Push for full adoption. Make it easier to use W&B than to track manually. Once you cross the threshold to 95%+ adoption, the system becomes genuinely useful because you can trust that the data is complete.
The third metric is artifact recovery. Can you reproduce any model deployed in the last year without searching for hours? If someone asks "what's the exact config for the model in production," can you answer in thirty seconds or does it take a day of detective work? This is the real test of whether your tracking system is actually functional. It's not about pretty dashboards - it's about whether you can actually use the data you've collected. Tracking is infrastructure that should make your life easier, not create more work. If it's harder to find information in W&B than to recreate the experiment from scratch, you haven't built a system that actually works.
Let's see W&B in action for a realistic multi-run fine-tuning workflow:
import wandb
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

def finetune_with_tracking(model_name, datasets, hyperparams):
    """Fine-tune on each dataset/hyperparameter combination, tracking every run."""
    for dataset_name, train_data, val_data in datasets:
        for lr in hyperparams["learning_rates"]:
            for batch_size in hyperparams["batch_sizes"]:
                # Initialize W&B
                run = wandb.init(
                    project="llm-finetuning",
                    name=f"{model_name}-{dataset_name}-lr{lr}-bs{batch_size}",
                    group=dataset_name,
                    config={
                        "model": model_name,
                        "learning_rate": lr,
                        "batch_size": batch_size,
                    },
                )

                # Load model
                model = AutoModelForSequenceClassification.from_pretrained(model_name)

                # Train
                args = TrainingArguments(
                    output_dir="./output",
                    learning_rate=lr,
                    per_device_train_batch_size=batch_size,
                    num_train_epochs=3,
                    evaluation_strategy="epoch",
                    save_strategy="epoch",
                    load_best_model_at_end=True,  # Required for best_model_checkpoint
                    logging_steps=100,
                    report_to="wandb",  # Trainer logs to W&B automatically
                )
                trainer = Trainer(
                    model=model,
                    args=args,
                    train_dataset=train_data,
                    eval_dataset=val_data,
                )
                trainer.train()

                # Save best model as artifact
                best_model_path = trainer.state.best_model_checkpoint
                artifact = wandb.Artifact("finetuned-model", type="model")
                artifact.add_dir(best_model_path)
                run.log_artifact(artifact)
                run.finish()

# Run the grid across datasets and hyperparameters
datasets = [
    ("sst2", train_sst2, val_sst2),
    ("agnews", train_agnews, val_agnews),
]
hyperparams = {
    "learning_rates": [1e-5, 2e-5, 5e-5],
    "batch_sizes": [8, 16, 32],
}
finetune_with_tracking("bert-base-uncased", datasets, hyperparams)

You're running 2 datasets × 3 learning rates × 3 batch sizes = 18 runs. Each logs automatically to W&B. The dashboard shows which combination worked best, and your team can compare all runs instantly.
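The 18-run arithmetic generalizes: the grid is just a Cartesian product, which you can enumerate up front to sanity-check the cost before launching anything. A small sketch (the helper is ours):

```python
from itertools import product

def enumerate_grid(grid: dict) -> list:
    """Expand {param: [values]} into one config dict per combination."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]

grid = {
    "dataset": ["sst2", "agnews"],
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "batch_size": [8, 16, 32],
}
configs = enumerate_grid(grid)  # 2 * 3 * 3 = 18 configs
```

Printing `len(configs)` (and a rough minutes-per-run estimate) before kicking off a grid is a cheap guard against accidentally launching a thousand-run search.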
This fine-tuning pattern scales to much larger search spaces. You could run a hundred runs across ten different models, five datasets, and different augmentation strategies. Every run is tracked. Every result is comparable. You can ask questions like "which model performed best on dataset X?" or "what augmentation strategy consistently improved performance?" and get instant answers. That's the power of comprehensive tracking - you're not just running experiments, you're building a knowledge base that grows with each run.
Summary and Next Steps
Weights & Biases transforms experiment tracking from a chore into a superpower. You get:
- Reproducibility: Every run is fully captured - config, code, metrics, artifacts
- Scalability: Track thousands of experiments without overhead
- Collaboration: Share findings with your team instantly
- Intelligence: Hyperparameter sweeps, early termination, and analysis tools
- Lineage: Understand which datasets trained which models
- Production Ready: Artifacts with versioning and aliases for deployment
The patterns here - environment-based configuration, rank-aware distributed logging, artifact versioning, sweep-driven hyperparameter search - aren't just good practices. They're necessary when running ML at scale. W&B makes them accessible.
Start small. Log metrics from your next training run. Add artifacts. Share a report with your team. As your system grows, you'll appreciate having a platform that scales with you.
The real power of W&B emerges over time. In month one, it's just a logging tool. In month three, you're running hyperparameter sweeps and catching your best models automatically. By month six, you have historical data showing how your models have evolved. Which dataset versions worked best? Which augmentation strategies consistently improved results? What's the relationship between batch size and final accuracy? You can answer these questions instantly because your data is all there. This institutional knowledge becomes a competitive moat. New engineers can study past successful runs and learn from real data instead of tribal knowledge.
The adoption curve varies by team. Some teams integrate W&B into their CI/CD pipeline immediately and get comprehensive tracking from day one. Others start manually and gradually shift toward automation. The key is establishing a minimal habit early. Log your metrics. Log your best models. Share one report. Once team members see the value, adoption accelerates naturally. You're not convincing people to do something new - you're making their existing workflow better.
Thinking about W&B also forces conversations about standards and discipline. When you decide to track something in W&B, you're making a commitment to reproducibility. You're saying "this experiment matters enough to document." This implicit pressure, surprisingly, increases team quality. People think more carefully about their experiments because they're being tracked. They avoid haphazard changes. They maintain cleaner code. They write better hyperparameter names. The tracking system becomes a forcing function for good practice.
The key insight: experiment tracking isn't overhead if it's the backbone of your ML workflow. With W&B, tracking becomes your competitive advantage. You iterate faster because you understand what worked before. Your team collaborates more effectively because results are shared instantly. And your models improve more reliably because every experiment is reproducible.
For teams considering W&B, the question isn't whether to track experiments - you must, or you're leaving money on the table. The question is whether to use W&B specifically or roll your own. Most teams that attempt to roll their own end up with fragmented solutions that break. W&B solves this problem comprehensively, and the cost is trivial compared to the cost of experiment waste. For most organizations, especially those with multiple data scientists, W&B pays for itself on day one in reduced duplicate work alone.