Chapter 12 — Gradients & Automatic Differentiation
Gradients are fundamental to training AI/ML models. This chapter explains derivatives, gradients, and how automatic differentiation enables efficient gradient computation in deep learning frameworks.
12.1 What is a Gradient?
A gradient is the vector of partial derivatives of a function with respect to its inputs; it points in the direction of steepest increase of the function. AI/ML Context: In neural networks, gradients tell us how to adjust weights to minimize the loss function.
# Example: f(x, y) = x^2 + y^2
# Gradient ∇f = [∂f/∂x, ∂f/∂y] = [2x, 2y]
x, y = 3, 4
gradient = [2*x, 2*y] # [6, 8]
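As a quick sanity check, the analytic gradient can be compared with a finite-difference approximation. The following is a minimal sketch; the helper numerical_gradient and the step size h are introduced here purely for illustration.

def f(x, y):
    return x**2 + y**2

def numerical_gradient(x, y, h=1e-6):
    # central-difference approximation of each partial derivative
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return [dfdx, dfdy]

print(numerical_gradient(3.0, 4.0))  # approximately [6.0, 8.0], matching [2x, 2y]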
12.2 Partial Derivatives
A partial derivative measures how a function changes with respect to one variable while keeping others constant. AI/ML Context: In multivariable loss functions, we calculate partial derivatives to update each weight independently.
f(x, y) = x*y + y^2
∂f/∂x = y
∂f/∂y = x + 2*y
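These partial derivatives can be verified with PyTorch's AutoGrad. This is a minimal sketch, assuming PyTorch is installed; the sample point (2, 3) is chosen only for illustration.

import torch
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
f = x * y + y**2     # f(x, y) = x*y + y^2
f.backward()         # populates x.grad and y.grad
print(x.grad)        # tensor(3.)  -> ∂f/∂x = y = 3
print(y.grad)        # tensor(8.)  -> ∂f/∂y = x + 2y = 2 + 6 = 8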
12.3 Chain Rule in ML
Neural networks are compositions of functions. The chain rule allows us to compute gradients through multiple layers. AI/ML Context: This is the basis of backpropagation.
# f(x) = (3x + 2)^2
# df/dx = 2*(3x+2) * 3 = 6*(3x+2)
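A minimal sketch checking this chain-rule result against a numerical derivative; the sample point x = 1 and the step size h are illustrative.

def f(x):
    return (3 * x + 2)**2

def df_dx(x):
    # chain rule: outer derivative 2*(3x + 2) times inner derivative 3
    return 6 * (3 * x + 2)

x, h = 1.0, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)  # central-difference approximation
print(df_dx(x), numeric)                   # both approximately 30.0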
12.4 Automatic Differentiation (AutoGrad)
Frameworks like PyTorch and TensorFlow compute gradients automatically without manual derivative calculations. AI/ML Context: AutoGrad is critical for deep learning, enabling efficient optimization of millions of parameters.
import torch
x = torch.tensor([3.0], requires_grad=True)
y = x**2 + 2*x + 1 # y = x^2 + 2x + 1
y.backward() # computes dy/dx
print(x.grad) # tensor([8.]) because dy/dx = 2x + 2 = 8 at x = 3
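The same mechanism works for functions of several inputs. Here is a minimal sketch recomputing the gradient of f(x, y) = x^2 + y^2 from Section 12.1 with AutoGrad:

import torch
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
f = x**2 + y**2        # f(x, y) = x^2 + y^2
f.backward()           # fills in .grad for every tensor that requires grad
print(x.grad, y.grad)  # tensor(6.) tensor(8.), matching [2x, 2y]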
12.5 Gradient Descent
Gradient descent updates parameters in the direction opposite to the gradient to minimize the loss function:
# theta = theta - learning_rate * gradient
learning_rate = 0.1
theta = 3.0
gradient = 8.0
theta_new = theta - learning_rate * gradient # 3 - 0.1*8 = 2.2
AI/ML Context: This is the core optimization method used to train neural networks.
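Repeating the update drives the parameter toward a minimum. Below is a minimal sketch that minimizes f(θ) = θ^2; the starting point and number of steps are chosen purely for illustration.

learning_rate = 0.1
theta = 3.0
for step in range(20):
    gradient = 2 * theta                       # derivative of f(theta) = theta^2
    theta = theta - learning_rate * gradient   # gradient descent update
print(theta)  # close to 0.0, the minimizer of theta^2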
12.6 Stochastic, Mini-Batch, and Full Gradient Descent
- Full (batch) gradient descent: uses all training samples to compute the gradient; accurate per step but expensive on large datasets.
- Stochastic gradient descent (SGD): uses one sample at a time; fast but noisy.
- Mini-batch gradient descent: uses small batches, balancing speed and stability (see the sketch after this list).
- AI/ML Context: Mini-batch gradient descent is the most commonly used variant in modern deep learning.
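The sketch below fits a one-parameter linear model with mini-batch gradient descent; the synthetic data, batch size, and learning rate are invented for illustration, and the full-batch and stochastic variants differ only in how the batch is chosen (assuming NumPy is installed).

import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))                          # toy inputs
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)    # toy targets, true slope = 2

def gradient(w, X_batch, y_batch):
    # gradient of the mean squared error for the model y ≈ w * x
    pred = w * X_batch[:, 0]
    return 2.0 * np.mean((pred - y_batch) * X_batch[:, 0])

w = 0.0
learning_rate = 0.1
batch_size = 16
for step in range(100):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # draw a mini-batch
    w -= learning_rate * gradient(w, X[idx], y[idx])
    # full-batch: gradient(w, X, y); stochastic: a batch of size 1
print(w)  # close to the true slope of 2.0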
12.7 Why Gradients Matter in AI/ML
Gradients allow us to understand how each parameter affects the output. Without gradients, we could not train neural networks efficiently. Automatic differentiation handles complex networks with thousands or millions of parameters, ensuring fast and accurate updates.
12.8 Exercises
- Compute the gradient of f(x, y) = x^2 + xy + y^2 manually.
- Use PyTorch to compute the derivative of f(x) = x^3 + 2x at x = 2.
- Implement a simple linear regression using gradient descent manually (without a library).
- Experiment with different learning rates and observe how gradient descent behaves.
Answers / Hints
- ∂f/∂x = 2x + y, ∂f/∂y = x + 2y
- PyTorch: x = torch.tensor([2.0], requires_grad=True); y = x**3 + 2*x; y.backward(); print(x.grad) → df/dx = 3x^2 + 2 = 3*2^2 + 2 = 14
- θ_new = θ - learning_rate * gradient for each parameter.
- Too high a learning rate may overshoot; too low a learning rate may converge slowly.
12.9 Practice Projects / Mini Tasks
- Implement gradient descent to fit a line to synthetic data.
- Use PyTorch AutoGrad to compute gradients for a small neural network.
- Visualize the loss landscape of a simple quadratic function and show gradient directions (a possible starting point is sketched after this list).
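For the visualization task, one possible starting point is sketched below, assuming NumPy and Matplotlib are available; the quadratic f(x, y) = x^2 + y^2 and the grid sizes are illustrative.

import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(-3, 3, 50)
X, Y = np.meshgrid(xs, xs)
plt.contour(X, Y, X**2 + Y**2, levels=15)   # contour lines of the loss landscape

xq, yq = np.meshgrid(np.linspace(-3, 3, 10), np.linspace(-3, 3, 10))
plt.quiver(xq, yq, 2 * xq, 2 * yq)          # gradient arrows point uphill; descent moves opposite
plt.xlabel("x")
plt.ylabel("y")
plt.title("Loss landscape of f(x, y) = x^2 + y^2")
plt.show()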
12.10 Further Reading & Videos
- Deep Learning Book — Chapters on optimization and backpropagation
- PyTorch AutoGrad documentation
- 3Blue1Brown — Gradient Descent Visualizations (YouTube)
- Stanford CS231n — Lecture notes on backpropagation
Next chapter: Jacobians & Hessians — understanding higher-order derivatives for advanced optimization and stability in neural networks.