PyTorch Tensors and Automatic Differentiation

If you've been following along in this series, you've built end-to-end ML pipelines and explored what it takes to get real models working. Now we're diving deeper, into the foundation that makes deep learning actually work: PyTorch's tensor ecosystem and the magic of automatic differentiation.
Here's the thing: tensors are just multi-dimensional arrays. But in PyTorch, they're so much more. They're the bridge between your code and the computational graph that powers backpropagation. Understanding how they work, and how gradients flow through them, is the difference between cargo-cult deep learning and really understanding what your model is doing.
By the end of this article, you'll know how to create tensors, perform operations on them, move them between devices (CPU/GPU), and, most importantly, understand exactly how PyTorch tracks operations and computes gradients. We'll build intuition by actually walking through a computational graph together.
Table of Contents
- Deep Learning Foundations: Why This All Matters
- What Is a Tensor, Really?
- Data Types (dtype)
- Shape and Reshaping
- Tensors vs NumPy Arrays
- Tensor Operations: The Computational Foundation
- Arithmetic and Broadcasting
- Matrix Operations
- The NumPy Bridge
- Devices: CPU vs. GPU
- Autograd: The Magic Behind Training
- Automatic Differentiation (Autograd)
- The requires_grad Flag
- Building a Computational Graph
- Computing Gradients with backward()
- Computational Graph Intuition
- The Computational Graph Walkthrough
- Gradient Accumulation and zero_grad()
- Detach and Stopping Gradient Flow
- retain_graph: Running Multiple Backwards
- Common Tensor Mistakes
- GPU Training: Putting It Together
- Common Pitfalls
- In-Place Operations and Gradients
- Graph Breaks and Non-Differentiable Operations
- Memory Leaks with Large Graphs
- Conclusion
Deep Learning Foundations: Why This All Matters
Before we get into the mechanics, let's step back and ask the bigger question: what is deep learning actually doing, and why do tensors sit at the center of it?
At its core, every neural network is a function. You feed in some input, an image, a sentence, a row of data, and the network transforms it through a sequence of mathematical operations to produce an output. Those operations involve multiplying by matrices, adding bias vectors, applying nonlinearities. The network "learns" by adjusting the numbers in those matrices and vectors. But how does it know which direction to adjust them?
This is where calculus enters the picture. We define a loss function that measures how wrong the network's output is. A perfect prediction gives a loss of zero; a terrible prediction gives a large loss. Training means minimizing this loss. To minimize a function, you need its gradient, the direction of steepest ascent, and you step in the opposite direction. Do that repeatedly across thousands or millions of examples and the network parameters slowly converge toward values that produce correct outputs.
The catch is that modern neural networks have millions of parameters, and the loss is a function of all of them, composed through dozens of layers of operations. Computing the gradient of the loss with respect to every single parameter by hand is absurd. This is the problem that automatic differentiation solves, and it's the reason PyTorch exists in the form it does.
PyTorch's approach is called "define-by-run" or dynamic computation graphs. You write ordinary Python code to do forward computations, and PyTorch silently records every operation in a graph. When you call backward(), PyTorch traverses that graph in reverse, applying the chain rule at every node to compute gradients. It sounds magical, and honestly, it kind of is, but the mechanics are completely understandable once you see them clearly. That's what we're going to do in this article.
Understanding this pipeline from first principles, tensors as data containers, operations as graph nodes, autograd as the differentiation engine, means you'll be able to debug training failures, implement custom operations, and reason about why your model is or isn't learning. This is the foundation everything else in deep learning is built on.
What Is a Tensor, Really?
A tensor is a generalization of vectors and matrices to arbitrary dimensions. You already know this on paper. But in PyTorch? It's an object that knows about itself: its data type, its shape, what device it's on, and whether it should track gradients.
Let's start simple:
import torch
# Create tensors in different ways
a = torch.tensor([1.0, 2.0, 3.0]) # From a list
b = torch.zeros(3, 4) # Zeros of shape (3, 4)
c = torch.randn(2, 3, 4) # Random normal, 3D tensor
d = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
print(a.shape) # torch.Size([3])
print(b.shape) # torch.Size([3, 4])
print(c.dtype) # torch.float32
print(d.device) # cpu
Notice that every tensor carries metadata about itself: shape, dtype, and device are always available as attributes. This isn't just bookkeeping; these properties determine how tensors interact with each other and what operations are valid. Two tensors with incompatible shapes can't be added naively. A tensor on CPU can't be multiplied directly with a tensor on GPU. PyTorch enforces these rules at runtime, which means you get clear errors rather than silent garbage.
Three things matter here: shape (dimensions), dtype (data type), and device (where it lives).
By default, PyTorch creates float32 tensors on CPU. That's fine for learning, but later you'll want to move to GPU for speed. We'll talk about that.
Data Types (dtype)
Tensors can hold different data types:
# Float types: torch.float32 (default), torch.float64, torch.float16
# Integer types: torch.int32, torch.int64, torch.int16
# Boolean: torch.bool
x_float = torch.tensor([1.0, 2.0], dtype=torch.float32)
y_int = torch.tensor([1, 2], dtype=torch.int64)
z_bool = torch.tensor([True, False])
print(x_float.dtype) # torch.float32
print(y_int.dtype) # torch.int64
The dtype choice has real performance implications that compound as your models grow larger. float32 is the industry standard for neural network training because it offers a good balance between numerical precision and memory usage: a 1000x1000 matrix of float32 values takes 4 MB, while float64 takes 8 MB. In production with billion-parameter models, these differences become enormous. float16 halves memory again and speeds up matrix operations on modern GPUs, but introduces the risk of numerical underflow: gradients can literally become zero when they shouldn't, silently breaking training.
The takeaway: stick with float32 for now. Reach for float64 only when you need the extra precision, and for float16 only once you understand mixed-precision training.
Shape and Reshaping
The shape is critical:
x = torch.randn(2, 3, 4) # Shape: [2, 3, 4]
# View (reshape) to a different shape
y = x.view(2, 12) # Shape: [2, 12]
z = x.reshape(6, 4) # Shape: [6, 4]
flat = x.view(-1) # -1 means "infer this dimension": [24]
# Squeeze (remove dims of size 1)
a = torch.randn(1, 5, 1)
b = a.squeeze() # Shape: [5]
# Unsqueeze (add a dimension)
c = torch.randn(5)
d = c.unsqueeze(0) # Shape: [1, 5]
Shape manipulation is one of the most common operations you'll do when building neural networks, and getting it wrong is one of the most common sources of bugs. The difference between view and reshape is subtle but important: view requires the tensor to be contiguous in memory and returns a view of the same storage, while reshape will make a copy if necessary. In practice, prefer reshape unless you specifically need the zero-copy guarantee of view. The squeeze and unsqueeze operations exist because many PyTorch functions expect batch dimensions: a single image might need to become a batch of one before being passed to a model.
The -1 is a lifesaver, it tells PyTorch to infer that dimension. If you have shape [2, 3, 4] (24 elements) and reshape to [6, -1], PyTorch figures out the second dimension must be 4.
Tensors vs NumPy Arrays
If you're coming from a NumPy background, PyTorch tensors will feel immediately familiar. The API is deliberately similar, torch.zeros, torch.ones, torch.arange, broadcasting rules, fancy indexing, all of it mirrors NumPy closely. But the resemblance is surface level. Under the hood, tensors and arrays serve different masters.
NumPy arrays are optimized for CPU computation. They're excellent at numerical work and integrate with the entire scientific Python ecosystem. But they have no concept of GPU execution, no awareness of gradients, and no mechanism for automatic differentiation. They're static data containers.
PyTorch tensors are designed to be computation graph nodes. Every tensor knows whether it should participate in gradient tracking. Every operation on a gradient-enabled tensor gets recorded in a graph that can be replayed in reverse. Tensors can live on GPU, and the same code that runs on CPU runs unchanged on GPU just by moving the tensors. These features make tensors the right abstraction for deep learning in a way that NumPy arrays fundamentally cannot be.
The practical distinction shows up in inference. When you're evaluating a trained model, not training it, you don't need gradients. Using torch.no_grad() context or converting to NumPy for downstream processing is not just acceptable, it's the right thing to do. It frees memory and speeds up computation because PyTorch doesn't build the graph it will never use.
One more key difference: PyTorch tensors support automatic broadcasting and device-aware operations natively, but NumPy has richer support for masked arrays, structured arrays, and certain statistical operations. In practice, most ML workflows use both: NumPy for data loading and preprocessing, tensors for model computations.
Tensor Operations: The Computational Foundation
Now we get to the good stuff. When you operate on tensors, PyTorch is silently building a computation graph. Every operation is recorded, and later, that graph will be used to compute gradients.
Arithmetic and Broadcasting
x = torch.randn(3, 4)
y = torch.randn(3, 4)
z = torch.randn(4)
# Element-wise operations
result1 = x + y # Shape: [3, 4]
result2 = x * y # Element-wise multiplication
result3 = x / y # Division
result4 = torch.sqrt(x) # Square root
# Broadcasting: z has shape [4], x has shape [3, 4]
# z is automatically broadcast to [3, 4]
result5 = x + z # Shape: [3, 4]
# In-place operations (modify in place, don't create new tensor)
x += y # x.add_(y) under the hood
y *= 2 # y.mul_(2) under the hood
Broadcasting is what lets you add a bias vector of shape [4] to every row of a weight matrix of shape [3, 4] without writing a loop. PyTorch aligns dimensions from the right and expands smaller tensors along missing dimensions. The rule is: dimensions of size 1 are stretched to match, and dimensions that don't exist are treated as size 1. Once you internalize this, you'll find yourself using it constantly, and you'll also understand a whole class of shape mismatch errors that arise when broadcasting goes wrong.
Shape [4] broadcasts against [3, 4] by being reused for every row, and PyTorch never actually materializes the expanded tensor, so broadcasting saves memory as well as keystrokes.
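To make the right-alignment rule concrete, here's a minimal sketch (the shapes are chosen purely for illustration):

```python
import torch

a = torch.randn(3, 1)   # shape [3, 1]
b = torch.randn(4)      # shape [4], treated as [1, 4]
c = a + b               # size-1 dims stretch: result shape [3, 4]

row = torch.randn(5)    # shape [5]
mat = torch.randn(2, 5) # shape [2, 5]
d = mat + row           # row is reused for each of the 2 rows: [2, 5]

# Incompatible: [3] vs [4] -- neither dimension is 1, so this raises
try:
    torch.randn(3) + torch.randn(4)
except RuntimeError:
    print("shapes [3] and [4] cannot broadcast")
```

Note the first case: a [3, 1] tensor and a [4] tensor combine into [3, 4], because each tensor supplies the size-1 (or missing) dimension the other one stretches along.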
Matrix Operations
A = torch.randn(3, 4)
B = torch.randn(4, 5)
# Matrix multiplication
C = torch.matmul(A, B) # Shape: [3, 5]
D = A @ B # Same thing, @ operator
# Batched matmul (batch of matrices)
A_batch = torch.randn(2, 3, 4) # 2 batches, each 3x4
B_batch = torch.randn(2, 4, 5) # 2 batches, each 4x5
C_batch = A_batch @ B_batch # Shape: [2, 3, 5]
# Transpose
E = A.T # Same as A.transpose(0, 1)
# Trace, determinant, inverse (for 2D)
square = torch.randn(5, 5)
tr = torch.trace(square)
det = torch.det(square)
inv = torch.inverse(square)
Matrix multiplication is the workhorse of neural networks, every linear layer is fundamentally a matrix multiply. The batched @ operator is crucial for training efficiency: instead of processing examples one at a time, you stack them into a batch dimension and process them all simultaneously. On a GPU, this parallelism is what makes training a 100-example batch take almost the same time as training on a single example.
The @ operator is your friend, it's the Pythonic way to do matmul. And batched operations are where PyTorch shines: you can process entire batches of matrices in a single operation.
The NumPy Bridge
One of PyTorch's superpowers is seamless NumPy interoperability. You can convert back and forth without copying memory (sometimes).
import numpy as np
# NumPy -> PyTorch
np_array = np.array([1.0, 2.0, 3.0])
torch_tensor = torch.from_numpy(np_array)
# PyTorch -> NumPy
tensor = torch.randn(3, 4)
np_from_torch = tensor.numpy()
# Important: If tensor is on CPU and not using requires_grad,
# this is zero-copy (they share memory)
tensor[0, 0] = 999
print(np_from_torch[0, 0]) # 999.0, same memory!
The shared-memory behavior is worth pausing on. When you call tensor.numpy() on a CPU tensor without gradients, you don't get a copy, you get a view into the same underlying memory. This means modifications to one are immediately visible in the other. In a data loading pipeline, this is wonderful: you can convert between NumPy and PyTorch without paying a memory copy tax. In a debugging session, it's a potential source of subtle bugs if you're not expecting it.
This is beautiful when you're working with existing NumPy codebases. No data duplication, just a view into the same memory.
But be careful: if your tensor requires gradients, calling .numpy() on it raises a RuntimeError, because PyTorch refuses to hand out a view that silently bypasses the graph. You have to detach first. We'll cover why when we talk about autograd.
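A quick sketch of the failure mode and the standard fix, detach before converting:

```python
import torch

t = torch.randn(3, requires_grad=True)

# Direct conversion fails because t is part of a computational graph
try:
    t.numpy()
except RuntimeError:
    print("can't call .numpy() on a tensor that requires grad")

# The fix: detach first (add .cpu() as well if the tensor lives on GPU)
arr = t.detach().cpu().numpy()
print(arr.shape)  # (3,)
```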
Devices: CPU vs. GPU
This is where PyTorch gets practical:
# Check available devices
print(torch.cuda.is_available()) # True if CUDA installed
print(torch.cuda.get_device_name(0)) # "NVIDIA RTX 3090" or similar
# Create tensors on specific device
x_cpu = torch.randn(100, 100, device='cpu')
x_gpu = torch.randn(100, 100, device='cuda') # On GPU 0
x_gpu_1 = torch.randn(100, 100, device='cuda:1') # On GPU 1
# Move tensors between devices
x = torch.randn(3, 4)
x_gpu = x.to('cuda') # Move to GPU
x_back = x_gpu.to('cpu') # Back to CPU
# Specify device and dtype together
x = torch.randn(3, 4, device='cuda', dtype=torch.float32)
# Common pattern: dynamic device handling
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.randn(1000, 1000, device=device)
The device = 'cuda' if torch.cuda.is_available() else 'cpu' pattern is something you'll write so many times it becomes muscle memory. It lets your code run on any machine, your laptop without a GPU, a cloud VM with one, a workstation with several, without modification. Get into the habit of defining device at the top of every training script and passing it everywhere, rather than hardcoding 'cuda' and then wondering why your code breaks on someone else's machine.
Here's the gotcha: you can't do math between tensors on different devices. Every operand in an operation has to live on the same device:
x_cpu = torch.randn(3, 4)
y_gpu = torch.randn(3, 4, device='cuda')
# This will crash
# z = x_cpu + y_gpu # RuntimeError!
# You must move one to match the other
z = x_cpu.to('cuda') + y_gpu # Works
This is a common source of bugs. Get used to checking device placement.
Autograd: The Magic Behind Training
Automatic differentiation is the engine that makes gradient-based learning possible at scale. The concept sounds exotic, but the underlying idea is beautifully simple: if you build a computational graph of operations, you can mechanically apply the chain rule backwards through every node to compute derivatives. PyTorch automates this entirely.
The reason this matters so profoundly is that neural networks have layers upon layers of function composition. Calculating "how much does the loss change if I nudge this weight in layer 3 by a tiny amount" requires the chain rule applied through every layer from 3 to the output. With 50 layers and millions of weights, you can't do this by hand. Autograd does it automatically, correctly, and efficiently for every parameter simultaneously.
PyTorch implements what's called reverse-mode automatic differentiation. In the forward pass, it records operations in a graph. In the backward pass, it starts at the loss and propagates gradient information backward through the graph, accumulating contributions at each leaf node (your model's parameters). This is fundamentally different from numerical differentiation (which approximates gradients by finite differences and is too slow for large models) and from symbolic differentiation (which manipulates mathematical expressions and doesn't scale to dynamic computation).
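You can verify reverse-mode autograd against a finite-difference approximation yourself. This sketch uses a made-up function f purely for illustration; the point is that the two methods agree:

```python
import torch

def f(x):
    # Illustrative scalar function: sum(x^3 + 2x), gradient is 3x^2 + 2
    return (x ** 3 + 2 * x).sum()

x = torch.tensor([1.0, 2.0], requires_grad=True)
loss = f(x)
loss.backward()  # reverse-mode autograd fills x.grad

# Numerical differentiation: perturb each element and re-evaluate f
eps = 1e-4
numeric = torch.zeros_like(x)
with torch.no_grad():
    for i in range(x.numel()):
        x_plus = x.clone();  x_plus[i] += eps
        x_minus = x.clone(); x_minus[i] -= eps
        numeric[i] = (f(x_plus) - f(x_minus)) / (2 * eps)

print(x.grad)   # 3x^2 + 2 at [1, 2] -> [5., 14.]
print(numeric)  # approximately the same values
```

Finite differences need two extra forward passes per parameter, which is exactly why they're a debugging tool and not a training algorithm.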
The dynamic graph is PyTorch's key architectural choice. Unlike TensorFlow 1.x, which required you to build a static graph before running it, PyTorch builds the graph as your code executes. This means you can use ordinary Python control flow, if statements, for loops, recursion, and the graph adapts to the actual execution path. Debugging a PyTorch model is like debugging regular Python code, because it is regular Python code.
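Here's a small sketch of that dynamism. The forward function and its arguments are invented for illustration, but the recorded graph really does follow whichever branch executes:

```python
import torch

def forward(x, n_steps):
    # The number and type of recorded operations depend on runtime values
    for _ in range(n_steps):
        if x.sum() > 0:
            x = x * 2   # this branch is recorded when taken
        else:
            x = x + 1   # this one otherwise
    return x.sum()

x = torch.tensor([1.0], requires_grad=True)
loss = forward(x, n_steps=3)  # positive input: "* 2" branch runs 3 times
loss.backward()
print(x.grad)  # d(8x)/dx = 8, because the graph is 3 doublings deep
```

A static-graph framework would need special graph-level control-flow operators for this; in PyTorch it's just Python.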
Automatic Differentiation (Autograd)
Now we reach the heart of PyTorch: autograd. This is how neural networks learn.
The idea is simple: if you do math operations on a tensor, PyTorch records them. Later, you can ask "what's the gradient of my loss with respect to this tensor?" PyTorch walks backward through all those operations and computes it.
The requires_grad Flag
# Regular tensor, won't track gradients
x = torch.randn(3, 4)
print(x.requires_grad) # False
# Tensor that tracks gradients
x = torch.randn(3, 4, requires_grad=True)
print(x.requires_grad) # True
# You can flip it later
x.requires_grad_(True) # In-place, note the underscore
x.requires_grad_(False)
# Or use torch.no_grad() to temporarily disable
with torch.no_grad():
    y = x * 2 # This won't be tracked
The requires_grad flag is the on/off switch for gradient tracking. When you create model parameters using torch.nn.Parameter or torch.nn.Linear, PyTorch automatically sets requires_grad=True on all those tensors. Your input data and labels, on the other hand, should have requires_grad=False: you don't want to compute gradients with respect to your training data, only with respect to the model weights. This distinction is fundamental.
Why not always track gradients? Performance. If you're not training (just evaluating), you don't need gradients, and disabling them saves memory and computation.
Building a Computational Graph
Let's see this in action:
x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)
# These operations are recorded
z = x * y # z = 6
w = z + x # w = 8
loss = w ** 2 # loss = 64
print(loss) # tensor([64.], grad_fn=<PowBackward0>)
The grad_fn attribute is how tensors announce their history. Every tensor produced by a differentiable operation carries a reference to the function that created it and the inputs to that function. When you see <PowBackward0>, PyTorch is saying "I created this tensor by raising something to a power, and I know how to differentiate through that operation." This chain of grad_fn references is the computational graph: follow them recursively from loss back to x and y, and you have the complete record of every operation in the forward pass.
See that grad_fn=<PowBackward0>? That's PyTorch telling you: "I know how you computed this. I can reverse it."
Computing Gradients with backward()
Now the magic:
x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)
z = x * y # z = 6
w = z + x # w = 8
loss = w ** 2 # loss = 64
# Backward pass: compute gradients
loss.backward()
# Now x.grad and y.grad contain the gradients
print(x.grad) # tensor([64.])
print(y.grad) # tensor([32.])
Let's verify this makes sense. Working backward:
- dloss/dw = 2*w = 2*8 = 16
- dw/dz = 1, and dw/dx = 1 (x also feeds the Add directly)
- dz/dx = y = 3, and dz/dy = x = 2
- By the chain rule, x contributes along two paths: dloss/dx = dloss/dw * (dw/dz * dz/dx + dw/dx) = 16 * (1 * 3 + 1) = 64
- y contributes along one path: dloss/dy = dloss/dw * dw/dz * dz/dy = 16 * 1 * 2 = 32
Run the code and you'll see exactly these numbers. The point stands: these are real, computed gradients, with x's two paths through the graph automatically summed.
Computational Graph Intuition
To build deep intuition for what's happening, it helps to visualize the computational graph as a directed acyclic graph (DAG) with two types of nodes: leaf nodes (your input tensors and parameters, which have no grad_fn) and operation nodes (every intermediate result). Data flows forward through edges; gradients flow backward.
Each operation node knows two things: how to compute its output from its inputs (the forward function), and how to propagate gradients from its output back to its inputs (the backward function). These backward functions are called vector-Jacobian products, and they implement the chain rule for that specific operation. Multiplication's backward function multiplies the incoming gradient by the other operand. Addition's backward function simply passes the incoming gradient through unchanged. Exponentiation's backward function scales the incoming gradient by the derivative of the power function.
When you call loss.backward(), PyTorch starts at the loss node with a gradient of 1.0 (the loss is the scalar we're differentiating). It calls the backward function of each operation node it encounters, handing each node's gradient back to the nodes that produced its inputs, and accumulating contributions at each leaf node. By the time the traversal is complete, every leaf tensor with requires_grad=True has its .grad attribute populated with the total gradient of the loss with respect to that tensor.
The elegance here is compositionality. No matter how complex your network, 50 layers, skip connections, attention mechanisms, custom activation functions, as long as every operation has a defined backward function, the entire gradient computation is handled automatically. Adding a new layer to your network doesn't require you to rederive any gradients. This compositionality is why deep learning became tractable as an engineering discipline.
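You can poke at these per-operation backward rules directly with torch.autograd.grad, which computes gradients without populating .grad attributes. Here's a sketch using the same x, y, z, w, loss construction from earlier:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)

z = x * y       # Mul: backward multiplies incoming grad by the other operand
w = z + x       # Add: backward passes the incoming grad through unchanged
loss = w ** 2   # Pow: backward scales the incoming grad by d(w^2)/dw = 2w

# Walk the graph once, returning gradients instead of storing them
gx, gy = torch.autograd.grad(loss, (x, y))
print(gx)  # tensor([64.])  = 2w * (y + 1) = 16 * 4
print(gy)  # tensor([32.])  = 2w * x       = 16 * 2
```

torch.autograd.grad is handy when you want gradients of intermediate quantities without mutating .grad state, for example when computing gradient penalties.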
The Computational Graph Walkthrough
This is crucial. Let me show you what's happening under the hood:
x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)
# Step 1: z = x * y
# Graph node: z = Mul(x, y)
z = x * y
# Step 2: w = z + x
# Graph node: w = Add(z, x)
w = z + x
# Step 3: loss = w ** 2
# Graph node: loss = Pow(w, 2)
loss = w ** 2
# At this point, the graph looks like:
#
#   x ----+-------------------------+
#         |                         |
#         v                         v
#   y --> Mul --> z ------------> Add --> w --> Pow --> loss
#
# (x feeds both Mul and Add, which is why its gradient flows along two paths)
#
# When we call loss.backward(), PyTorch traverses this graph backward,
# computing gradients at each node using the chain rule.
loss.backward()
# Gradients are now populated
print(f"dloss/dx = {x.grad}")
print(f"dloss/dy = {y.grad}")
After calling backward(), the graph is freed from memory by default. PyTorch does this because the graph was only needed to compute gradients, and storing it indefinitely would waste memory proportional to the number of forward operations. This is why you can't call backward() twice on the same computation without passing retain_graph=True: the graph is gone after the first backward pass. Keep this in mind when you encounter the "Trying to backward through the graph a second time" error, which is a very common PyTorch gotcha.
This is the computational graph. Every operation creates a node, and edges represent data flow. When you call backward(), PyTorch starts at the loss node and works backward, multiplying gradients as it goes.
Gradient Accumulation and zero_grad()
Here's a critical detail: gradients accumulate. If you call backward() twice, gradients add up:
x = torch.tensor([2.0], requires_grad=True)
# First backward pass
y = x ** 2
y.backward()
print(x.grad) # 4.0
# Second backward pass (without zeroing)
y = x ** 2
y.backward()
print(x.grad) # 8.0, accumulated!
Gradient accumulation is sometimes deliberately used, for example, when your GPU can only fit small batches and you want to simulate a larger effective batch size by accumulating gradients over several forward passes before updating weights. But in standard training, you almost always want to zero gradients before each batch. Forgetting zero_grad() is one of the most common bugs in PyTorch training loops, and it's particularly insidious because it doesn't cause an error, it just makes your model update with gradient information from previous batches mixed in, producing inconsistent and often worsening performance.
This is intentional, sometimes you want to accumulate. But in training, you usually don't. You compute gradients for a batch, update the model, then zero out the gradients:
optimizer = torch.optim.SGD([x], lr=0.01)
for epoch in range(10):
    loss = (x ** 2).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad() # Clear gradients for next iteration
If you forget zero_grad(), gradients keep piling up and your learning goes haywire.
Detach and Stopping Gradient Flow
Sometimes you want to compute something without gradients, or you want to break the graph:
x = torch.randn(3, 4, requires_grad=True)
# Detach: stop tracking gradients from this point
y = x.detach()
print(y.requires_grad) # False
# Operations on y build no graph at all
z = y ** 2
# z.sum().backward() would raise here: nothing requires grad, so there is
# no graph to traverse, and x.grad is never touched
# Or use torch.no_grad() context
with torch.no_grad():
    z = x ** 2
# z has no grad_fn, won't contribute to backward
The detach pattern appears constantly in more advanced PyTorch code. In reinforcement learning, you detach target network outputs so gradients don't flow through them. In generative adversarial networks, you detach the generator output when training the discriminator. In contrastive learning, you detach one branch of the network to implement a "stop-gradient" operation. Understanding detach at a deep level opens up a whole class of training techniques that would be impossible otherwise.
Why do this? Sometimes you have a part of your computation that you don't want to update (a frozen network, a target network, etc.). Detach lets you reuse tensors without creating gradients.
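Here's a minimal stop-gradient sketch in that spirit. The target is a made-up stand-in for a frozen network's output:

```python
import torch

x = torch.randn(4, requires_grad=True)

# Pretend this is a frozen/target network's output: treat it as a constant
target = (x * 3).detach()   # no gradient will flow through this branch

pred = x * 2
loss = ((pred - target) ** 2).sum()
loss.backward()

# x.grad reflects only the pred branch: d/dx [(2x - t)^2] = 4 * (2x - t)
print(x.grad)
```

Without the detach, gradients would also flow through the target branch and the two contributions would partially cancel, a completely different (and usually unintended) optimization problem.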
retain_graph: Running Multiple Backwards
By default, backward() frees the computational graph after running. If you need to call backward() multiple times:
x = torch.tensor([2.0], requires_grad=True)
loss = x ** 2
loss.backward(retain_graph=True) # Keep the graph
print(x.grad) # 4.0
# Can call backward again
loss.backward()
print(x.grad) # 8.0 (accumulated)
This is less common but useful for multi-task learning or when you have multiple losses.
Common Tensor Mistakes
After working with PyTorch for a while, you'll accumulate a catalog of mistakes you've made and learned from. Here's a head start on the most common ones, so you can learn from them without the pain.
The first and most frequent mistake is device mismatches. You load your model on GPU, but your input batch stays on CPU. The error message is clear, "Expected all tensors to be on the same device", but the fix requires discipline: always move every tensor to the same device before any operation. The defensive pattern is to check tensor.device when something goes wrong, and to standardize on the device variable pattern shown earlier.
The second common mistake is calling .item() inside a training loop. When you write loss = loss.item(), you convert the scalar tensor to a Python float. This detaches it from the graph, which is fine if you're logging, but catastrophic if you accidentally do this to a value you still need to differentiate. Use .item() only when you're extracting values for logging or comparison, never mid-computation.
The third mistake is confusing in-place operations with their out-of-place counterparts. In PyTorch, functions ending with an underscore (like add_, mul_, zero_) modify the tensor in place. This is efficient but can break the computational graph if the modified tensor is needed for backward. The safe rule: don't use in-place operations on tensors with requires_grad=True, and use optimizer.zero_grad() rather than manually zeroing gradient tensors with in-place ops inside the computation.
The fourth mistake is forgetting that .backward() accumulates gradients. Even experienced practitioners occasionally forget to call zero_grad() and spend hours debugging mysteriously diverging training loss. Make optimizer.zero_grad() the first line of your training loop body, not an afterthought at the end.
Finally, there's the memory leak trap: building graphs inside evaluation loops. If you're running validation after each epoch without torch.no_grad(), PyTorch is building and retaining computational graphs for every batch, graphs you'll never use for backprop. Over a long validation set, this can consume all your GPU memory. Wrap all evaluation code in with torch.no_grad(): and wrap all inference code the same way.
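Putting that evaluation advice into code, a sketch in which evaluate, W, and the random batches are stand-ins for a real model and dataloader:

```python
import torch

W = torch.randn(10, 1, requires_grad=True)   # stand-in for model weights

def evaluate(batches):
    # All graph construction disabled: no backward bookkeeping, less memory
    total = 0.0
    with torch.no_grad():
        for X, y in batches:
            pred = X @ W
            total += ((pred - y) ** 2).mean().item()  # .item() is safe here
    return total / len(batches)

batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(5)]
val_loss = evaluate(batches)
print(f"validation loss: {val_loss:.4f}")
```

Note that .item() inside the no_grad block is exactly the logging-only use described above: the value is being extracted for reporting, never differentiated.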
GPU Training: Putting It Together
Now, a realistic example: moving everything to GPU and training a tiny model:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Create data on device
X = torch.randn(100, 10, device=device, requires_grad=False)
y_true = torch.randn(100, 1, device=device)
# Initialize model weights on device
W = torch.randn(10, 1, device=device, requires_grad=True)
b = torch.zeros(1, device=device, requires_grad=True)
# Training loop
learning_rate = 0.01
for epoch in range(100):
    # Forward pass
    y_pred = X @ W + b
    loss = ((y_pred - y_true) ** 2).mean()
    # Backward pass
    loss.backward()
    # Manual gradient update (SGD)
    with torch.no_grad():
        W -= learning_rate * W.grad
        b -= learning_rate * b.grad
    # Zero gradients
    W.grad.zero_()
    b.grad.zero_()
    if epoch % 20 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
This example deliberately uses manual gradient updates instead of an optimizer to make the mechanics visible. In production code, you'd replace the with torch.no_grad() update block with optimizer.step() followed by optimizer.zero_grad(). But seeing the manual version once is valuable, it makes clear that "training a neural network" is ultimately just: compute loss, compute gradients, nudge weights in the direction that reduces loss, repeat. The abstractions in torch.nn and torch.optim are conveniences built on exactly this foundation.
Notice the device=device everywhere? That's discipline. And note with torch.no_grad() around the update, we don't want PyTorch tracking the SGD step itself.
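For comparison, here's what the same loop looks like with torch.optim handling the update, a sketch rather than a complete training script:

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
X = torch.randn(100, 10, device=device)
y_true = torch.randn(100, 1, device=device)

W = torch.randn(10, 1, device=device, requires_grad=True)
b = torch.zeros(1, device=device, requires_grad=True)
optimizer = torch.optim.SGD([W, b], lr=0.01)

for epoch in range(100):
    optimizer.zero_grad()               # clear stale gradients first
    y_pred = X @ W + b
    loss = ((y_pred - y_true) ** 2).mean()
    loss.backward()
    optimizer.step()                    # replaces the manual no_grad update

final_loss = loss.item()
print(f"final loss: {final_loss:.4f}")
```

The optimizer owns the no_grad bookkeeping internally; the forward/backward structure is identical to the manual version.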
Common Pitfalls
In-Place Operations and Gradients
x = torch.randn(3, requires_grad=True)
a = x * 2
b = a ** 2 # Pow saves its input a for the backward pass
# DANGER: modifying a saved tensor in place invalidates the graph
a.add_(1)
b.sum().backward() # RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
In-place ops rewrite a tensor's storage directly. If autograd saved that tensor for the backward pass, its version counter no longer matches and backward() refuses to run. (In-place ops on leaf tensors that require grad, like x.mul_(2), fail immediately with a different error.) The rule: avoid in-place ops on tensors that participate in a graph, or at least understand when it's safe.
Safe in-place ops:
x = torch.randn(3, 4, requires_grad=True)
z = x + 5
# Safe: no backward function saved z, so modifying it invalidates nothing
z.add_(1)
# But never modify x in-place, it's a leaf that requires grad
# x.mul_(2) # RuntimeError!
Graph Breaks and Non-Differentiable Operations
Some operations aren't differentiable:
x = torch.randn(3, requires_grad=True)
# Differentiable operations
y = x ** 2
# Non-differentiable operations (among others)
z = x.argmax() # Returns integer indices, has no grad_fn: gradient flow stops
w = torch.round(x) # Differentiable in name only: the gradient is zero almost everywhere
# Neither line crashes, but no useful gradient flows back through them
Memory Leaks with Large Graphs
Each forward pass builds a graph that stays in memory until every reference to it is dropped. The classic leak is holding on to tensors that still carry their graphs, for example by accumulating losses for logging:
losses = []
for i in range(10000):
    x = torch.randn(1000, requires_grad=True)
    loss = (x ** 2).sum()
    # Each stored loss keeps its entire graph alive
    losses.append(loss)
    # Memory usage grows with every iteration
Solution: store plain numbers instead (losses.append(loss.item())), or use torch.no_grad() if you don't need gradients at all:
for i in range(10000):
    with torch.no_grad():
        x = torch.randn(1000)
        loss = (x ** 2).sum()
    # No graph is built, so nothing accumulates
Conclusion
What we've covered in this article is the bedrock of everything else in deep learning. Tensors are not just arrays, they're computation graph nodes, device-aware data containers, and the primary abstraction through which PyTorch expresses all mathematical operations. Autograd is not just a convenience, it's the mechanism that makes training deep networks computationally feasible, automating a differentiation process that would otherwise require maintaining thousands of hand-derived gradient expressions.
The skills you've built here, creating and manipulating tensors, understanding dtype and device, reasoning about the computational graph, working with requires_grad and backward(), are foundational in a way that higher-level frameworks obscure. When your training loss doesn't decrease, when you hit memory errors, when your model gives nonsensical outputs, you'll return to this level of abstraction to diagnose the problem. Knowing how PyTorch works under the hood makes you a better debugger and a more thoughtful model architect.
The path forward is to practice until these patterns feel instinctive. Write training loops from scratch. Compute a gradient manually with the chain rule and verify it matches what PyTorch computes. Break a graph on purpose with an in-place operation and read the error message. This kind of hands-on exploration builds the intuition that documentation alone cannot give you.
In the next article, we'll layer torch.nn on top of everything you've learned here. You'll see how PyTorch's module system abstracts away manual weight management, makes model composition elegant, and integrates seamlessly with the autograd engine. The tensors and computational graphs will still be there, you'll just be working with them at a higher level of abstraction.
Keep your zero_grad() calls close and your torch.no_grad() contexts closer.