Chapter 9 — Gradient Descent & Variants
Gradient descent and its variants are the most widely used optimization algorithms in machine learning, particularly for training deep learning models. This chapter explains the basic algorithm, its main variants, and practical applications with examples.
9.1 What is Gradient Descent?
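Gradient descent is an iterative optimization algorithm that minimizes a function (in ML, usually a loss function) by repeatedly taking a small step in the direction of the negative gradient, which is the direction in which the function locally decreases fastest.
Importance in ML: Neural networks, logistic regression, and many other models are trained by gradient descent, which adjusts their parameters (weights) to reduce the error. Linear regression also has a closed-form solution, but gradient descent is often preferred when the dataset is large.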
9.2 Basic Gradient Descent Algorithm
Update rule for a parameter vector θ:
θ_new = θ_old - η * ∇L(θ_old)
where η is the learning rate and ∇L(θ) is the gradient of the loss function L with respect to θ.
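To make the update rule concrete, here is a single step worked in NumPy. The loss L(θ) = θ₁² + θ₂², the starting point, and the learning rate are illustrative choices, not part of the rule itself.

```python
import numpy as np

def grad_L(theta):
    # Gradient of the illustrative loss L(theta) = theta_1^2 + theta_2^2
    return 2.0 * theta

theta_old = np.array([3.0, 4.0])  # assumed starting point
eta = 0.1                         # assumed learning rate

# One application of: theta_new = theta_old - eta * grad L(theta_old)
theta_new = theta_old - eta * grad_L(theta_old)
print(theta_new)  # approximately [2.4, 3.2]
```

The gradient at (3, 4) is (6, 8), so the step subtracts 0.1 * (6, 8) = (0.6, 0.8) and moves the point closer to the minimum at the origin.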
9.3 Learning Rate
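The learning rate η sets the size of each step.
- Too small: convergence is slow, because each step makes little progress.
- Too large: steps may overshoot the minimum, oscillate around it, or diverge.
Adaptive methods (Section 9.5) adjust the effective learning rate during training.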
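A small experiment can illustrate both failure modes. The sketch below runs gradient descent on the one-dimensional function f(x) = x² (gradient 2x) with three learning rates. The function, the helper name run_gd, and the specific rates are assumptions chosen for illustration.

```python
def run_gd(eta, x0=1.0, steps=20):
    # Plain gradient descent on f(x) = x^2, whose gradient is 2x
    x = x0
    for _ in range(steps):
        x = x - eta * 2.0 * x
    return x

# Each update multiplies x by (1 - 2 * eta), so:
#   eta = 0.01 -> factor 0.98 (slow shrinkage)
#   eta = 0.4  -> factor 0.2  (fast convergence)
#   eta = 1.1  -> factor -1.2 (sign flips and |x| grows: divergence)
final = {eta: run_gd(eta) for eta in (0.01, 0.4, 1.1)}
for eta, x in final.items():
    print(f"eta = {eta}: x after 20 steps = {x:.3e}")
```

For this particular function any η above 1.0 diverges, because the factor (1 - 2η) then has magnitude greater than 1. The threshold depends on the curvature of the function being minimized.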
9.4 Momentum
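Momentum keeps a running (exponentially weighted) average v of past gradients and steps along it instead of along the current gradient alone, which accelerates convergence:
v = β * v_old + (1 - β) * ∇L(θ_old)
θ_new = θ_old - η * v
Here β (commonly around 0.9) controls how much of the previous velocity is retained. Some texts omit the (1 - β) factor, which only rescales the effective learning rate.
Momentum damps oscillations across steep directions and can carry the iterate through flat regions and shallow local minima.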
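The two momentum equations translate almost line for line into NumPy. The sketch below is a minimal version under stated assumptions: the test function f(x, y) = x² + 10y² (a narrow valley), the helper name momentum_gd, and the values η = 0.1, β = 0.9 are all illustrative.

```python
import numpy as np

def f(theta):
    # Illustrative elongated bowl: much steeper in y than in x
    x, y = theta
    return x**2 + 10.0 * y**2

def grad_f(theta):
    x, y = theta
    return np.array([2.0 * x, 20.0 * y])

def momentum_gd(grad, theta0, eta=0.1, beta=0.9, steps=300):
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)  # velocity starts at zero
    for _ in range(steps):
        # v = beta * v_old + (1 - beta) * gradient
        v = beta * v + (1.0 - beta) * grad(theta)
        # theta_new = theta_old - eta * v
        theta = theta - eta * v
    return theta

theta = momentum_gd(grad_f, [3.0, 4.0])
print(theta, f(theta))
```

Setting beta=0.0 in this sketch recovers plain gradient descent, which makes it convenient for side-by-side comparisons.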
9.5 Adaptive Methods
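- RMSProp: Scales each parameter's step by the inverse square root of a moving average of its squared gradients.
- Adam: Combines momentum (a moving average of gradients) with RMSProp-style scaling and adds a bias correction for the early steps. It is among the most widely used optimizers in deep learning.
- Adagrad: Scales each parameter's step by the inverse square root of the accumulated sum of its squared gradients, so the effective learning rate only decreases over time.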
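As an illustration of how these pieces fit together, here is a minimal NumPy sketch of the Adam update as it is commonly stated. The helper name adam, the test function, and the learning rate are assumptions for this example. The defaults beta1 = 0.9, beta2 = 0.999, and eps = 1e-8 are the commonly quoted ones.

```python
import numpy as np

def adam(grad, theta0, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)  # moving average of gradients (momentum part)
    s = np.zeros_like(theta)  # moving average of squared gradients (RMSProp part)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        s = beta2 * s + (1.0 - beta2) * g**2
        # Bias correction: m and s start at zero and are too small early on
        m_hat = m / (1.0 - beta1**t)
        s_hat = s / (1.0 - beta2**t)
        # Per-parameter step: large past gradients shrink the step
        theta = theta - eta * m_hat / (np.sqrt(s_hat) + eps)
    return theta

def f(theta):
    # Illustrative test function with different curvature per coordinate
    x, y = theta
    return x**2 + 10.0 * y**2

def grad_f(theta):
    x, y = theta
    return np.array([2.0 * x, 20.0 * y])

theta = adam(grad_f, [3.0, 4.0])
print(theta, f(theta))
```

Dropping the m average (using g directly) gives an RMSProp-style update, and replacing the moving average s with a running sum of g**2 gives an Adagrad-style update.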
9.6 Convergence Criteria
Stop gradient descent when:
- Gradient magnitude is very small (close to zero).
- Change in loss function between steps is below a threshold.
- Maximum number of iterations reached.
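The three stopping rules can be combined in one loop. The sketch below is one possible arrangement. The function name gradient_descent, the tolerance values, and the returned reason strings are assumptions for illustration.

```python
import numpy as np

def gradient_descent(f, grad_f, theta0, eta=0.1,
                     grad_tol=1e-6, loss_tol=1e-12, max_iters=10_000):
    theta = np.asarray(theta0, dtype=float)
    loss = f(theta)
    for it in range(1, max_iters + 1):
        g = grad_f(theta)
        # Criterion 1: gradient magnitude is close to zero
        if np.linalg.norm(g) < grad_tol:
            return theta, it - 1, "small gradient"
        theta = theta - eta * g
        new_loss = f(theta)
        # Criterion 2: loss changed by less than a threshold
        if abs(loss - new_loss) < loss_tol:
            return theta, it, "small loss change"
        loss = new_loss
    # Criterion 3: iteration budget exhausted
    return theta, max_iters, "max iterations"

f = lambda th: float(np.sum(th**2))  # illustrative loss
grad_f = lambda th: 2.0 * th

theta, iters, reason = gradient_descent(f, grad_f, [3.0, 4.0])
print(f"stopped after {iters} updates ({reason}), theta = {theta}")
```

Which criterion fires first depends on the tolerances and on the problem, so it is usually worth reporting the reason along with the result.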
9.7 Visualization
For a simple 2D function f(x, y), gradient descent moves across the surface in the direction of steepest descent until it reaches a (local) minimum. On a contour plot, each step is perpendicular to the contour line through the current point.
9.8 Quick Python Example
import numpy as np

# Example: f(x, y) = x^2 + y^2, minimized at (0, 0)
def f(theta):
    x, y = theta
    return x**2 + y**2

# Gradient of f: (2x, 2y)
def grad_f(theta):
    x, y = theta
    return np.array([2*x, 2*y])

theta = np.array([3.0, 4.0])  # starting point
eta = 0.1                     # learning rate

for i in range(20):
    # Gradient descent update: step against the gradient
    theta = theta - eta * grad_f(theta)
    print(f"Step {i+1}, theta = {theta}, f(theta) = {f(theta)}")
9.9 ML Applications
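- Training Neural Networks: Backpropagation computes the gradient of the loss with respect to every weight, and gradient descent (or a variant) uses that gradient to update the weights.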
- Regression: Minimize mean squared error using gradient descent.
- Logistic Regression: Minimize cross-entropy loss for classification tasks.
- Deep Learning: Adam is a common default optimizer because it often converges quickly with little tuning.
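As one concrete instance of the regression case, the sketch below fits a line by minimizing mean squared error with plain gradient descent. The synthetic data (true slope 3.0, intercept 0.5, Gaussian noise), the seed, and the learning rate are made up for illustration.

```python
import numpy as np

# Synthetic data: y = 3.0 * x + 0.5 + noise (all values are illustrative)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + rng.normal(0.0, 0.1, size=200)

# Append a column of ones so the intercept is learned as a weight
Xb = np.hstack([X, np.ones((200, 1))])

w = np.zeros(2)  # [slope, intercept]
eta = 0.1
for _ in range(500):
    residual = Xb @ w - y
    # Gradient of MSE = mean(residual^2) with respect to w
    grad = 2.0 * Xb.T @ residual / len(y)
    w = w - eta * grad

print("slope, intercept:", w)
print("MSE:", np.mean((Xb @ w - y) ** 2))
```

The learned weights should land close to the slope and intercept used to generate the data, and they can be checked against the closed-form least-squares solution from np.linalg.lstsq.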
9.10 Exercises
- Implement vanilla gradient descent for f(x, y) = x² + y² and plot convergence.
- Experiment with different learning rates and observe convergence speed.
- Implement momentum-based gradient descent and compare with vanilla gradient descent.
- Try Adam optimizer on a small regression dataset and compare loss reduction.
Hints / Answers
- Convergence depends heavily on learning rate choice; visualize to see overshooting or slow convergence.
- Momentum accelerates convergence along shallow valleys.
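- Adam often converges faster in the early iterations and is usually less sensitive to the learning rate than plain gradient descent, though it can still benefit from tuning.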
9.11 Further Reading & Videos
- Deep Learning Book (Goodfellow et al.) — Chapters on optimization algorithms.
- 3Blue1Brown — Gradient descent visualization (YouTube).
- Hands-on Python tutorials: implement various optimizers using NumPy and PyTorch.
Next chapter: Jacobian & Hessian Matrices — understanding derivatives in multivariate settings and their applications in backpropagation for neural networks.