Chapter 8 — Optimization Techniques
Optimization techniques are at the heart of machine learning. Training models involves finding the best parameters (weights) that minimize a loss function. This chapter explains critical points, extrema, and Lagrange multipliers with practical examples and ML applications.
8.1 What is Optimization?
Optimization is the process of finding the maximum or minimum of a function. In machine learning, we often minimize a loss function to improve model performance.
Importance in ML: Gradient-based optimization algorithms like gradient descent rely on these concepts to efficiently update weights during training. Understanding critical points helps explain why training can stall at poor local minima or saddle points.
8.2 Critical Points & Extrema
A critical point occurs where the gradient of a function is zero (∇f = 0). These points can be:
- Local minimum: the function value is lower than at all nearby points.
- Local maximum: the function value is higher than at all nearby points.
- Saddle point: neither a minimum nor a maximum; the function increases in some directions and decreases in others (common in high-dimensional ML loss landscapes).
Example: f(x) = x² → derivative f'(x) = 2x → critical point at x = 0 (global minimum).
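To check this programmatically, here is a minimal sketch using sympy (a symbolic-math library not otherwise used in this chapter) that finds the critical point and applies the second-derivative test:
import sympy as sp

x = sp.symbols('x')
f = x**2

f_prime = sp.diff(f, x)                 # f'(x) = 2x
critical_points = sp.solve(f_prime, x)  # -> [0]

f_double_prime = sp.diff(f, x, 2)       # f''(x) = 2
for cp in critical_points:
    curvature = float(f_double_prime.subs(x, cp))
    kind = "local minimum" if curvature > 0 else "local maximum" if curvature < 0 else "inconclusive"
    print(f"x = {cp}: f''(x) = {curvature} -> {kind}")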
8.3 Lagrange Multipliers
Used for constrained optimization: maximize or minimize a function f(x, y, …) subject to a constraint g(x, y, …) = 0.
Method: Introduce a multiplier λ and solve:
∇f(x, y) = λ ∇g(x, y)
g(x, y) = 0
Example: Maximize f(x, y) = xy subject to x² + y² = 1.
∇f = (y, x)
∇g = (2x, 2y)
Set ∇f = λ ∇g → y = 2λx, x = 2λy
Constraint: x² + y² = 1
Solution: the candidates are x = ±1/√2, y = ±1/√2. The maximum f = 1/2 is attained when x and y have the same sign (x = y = ±1/√2); opposite signs give the minimum f = -1/2.
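The same system can also be solved symbolically; a sketch using sympy (an assumption on tooling, not part of the chapter's examples):
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x*y
g = x**2 + y**2 - 1   # constraint written as g(x, y) = 0

# Lagrange conditions: grad f = lambda * grad g, together with g = 0
equations = [sp.diff(f, x) - lam * sp.diff(g, x),
             sp.diff(f, y) - lam * sp.diff(g, y),
             g]
for sol in sp.solve(equations, [x, y, lam], dict=True):
    print(sol, " f =", f.subs(sol))
# the maximum f = 1/2 occurs where x and y share the same sign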
8.4 Gradient Descent
Gradient descent is the backbone of ML optimization:
x_new = x_old - η * ∇f(x_old)
where η is the learning rate. Repeating this update moves the parameters towards a minimum of the loss function.
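The update applies unchanged to functions of several variables. A minimal numpy sketch minimizing f(x, y) = x² + y² (the 1D case is coded in Section 8.7); the starting point, learning rate, and iteration count are arbitrary choices:
import numpy as np

def grad_f(v):                  # gradient of f(x, y) = x^2 + y^2 is (2x, 2y)
    return 2 * v

v = np.array([4.0, -3.0])       # initial guess
eta = 0.1                       # learning rate
for _ in range(25):
    v = v - eta * grad_f(v)     # x_new = x_old - eta * grad f(x_old)
print(v)                        # close to the minimum at (0, 0)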
8.5 Hessian Matrix
The Hessian contains second-order partial derivatives and helps determine the nature of critical points:
- If the Hessian is positive definite (all eigenvalues > 0) → local minimum
- If the Hessian is negative definite (all eigenvalues < 0) → local maximum
- If the Hessian is indefinite (both positive and negative eigenvalues) → saddle point
ML context: Second-order optimization methods like Newton's method use the Hessian to converge faster than first-order methods.
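A small numpy sketch of the eigenvalue test above, applied to f(x, y) = x² - y², whose critical point at (0, 0) is a saddle (the example function is an illustrative choice):
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 (constant, since f is quadratic)
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)   # eigenvalues of the symmetric Hessian
if np.all(eigvals > 0):
    print("positive definite -> local minimum")
elif np.all(eigvals < 0):
    print("negative definite -> local maximum")
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    print("indefinite -> saddle point")
else:
    print("inconclusive (some eigenvalues are zero)")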
8.6 Applications in Machine Learning
- Training Neural Networks: Gradient descent & variants minimize loss functions.
- Constrained Optimization: Lagrange multipliers are used in SVMs (maximize margin with constraints).
- Regularization: Adding penalties (like L1/L2) or explicit constraints is a form of constrained optimization (a small sketch follows this list).
- Newton & Quasi-Newton methods: use exact or approximate second-order derivatives for faster convergence.
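To make the regularization point concrete: with an L2 penalty λ·w², gradient descent simply adds 2λw to the gradient of the loss. A minimal sketch on an illustrative one-parameter quadratic loss (the loss, λ, and learning rate are assumptions, not from any specific model):
def grad_loss(w):              # gradient of the illustrative loss (w - 3)^2
    return 2 * (w - 3.0)

lam = 0.5                      # L2 penalty strength (illustrative)
eta = 0.1                      # learning rate
w = 0.0
for _ in range(100):
    w = w - eta * (grad_loss(w) + 2 * lam * w)   # penalty contributes 2*lam*w
print(w)   # ends near 2.0: pulled from the unpenalized minimum at 3.0 towards 0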
8.7 Quick Python Example (Gradient Descent)
import numpy as np
# f(x) = x^2, derivative f'(x) = 2x
def f(x): return x**2
def grad_f(x): return 2*x
x = 5 # initial guess
eta = 0.1 # learning rate
for i in range(10):
    x = x - eta * grad_f(x)
    print(f"Step {i+1}, x = {x}, f(x) = {f(x)}")
8.8 Exercises
- Find the critical points of f(x, y) = x² + xy + y² and determine their nature using the Hessian.
- Use Lagrange multipliers to maximize f(x, y) = x + y subject to x² + y² = 1.
- Implement gradient descent for f(x) = x³ - 3x² + 2 and visualize convergence.
Hints / Answers
- Critical points: solve ∇f = 0 → 2x + y = 0 and x + 2y = 0, so the only critical point is (0, 0). Its Hessian [[2, 1], [1, 2]] is positive definite, so it is a (global) minimum.
- Lagrange: ∇f = λ∇g gives 1 = 2λx and 1 = 2λy, so x = y; the constraint then gives x = y = ±1/√2. The maximum is f = √2 at x = y = 1/√2.
- Gradient descent: iterate x_new = x_old - η * f'(x_old) with f'(x) = 3x² - 6x; starting to the right of the local maximum at x = 0 (e.g. x = 3), the iterates converge to the local minimum at x = 2.
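For the third exercise, a minimal sketch (starting point, learning rate, and iteration count are illustrative choices):
def f(x): return x**3 - 3*x**2 + 2
def grad_f(x): return 3*x**2 - 6*x

x = 3.0        # start to the right of the local maximum at x = 0
eta = 0.05     # learning rate
history = [x]
for _ in range(50):
    x = x - eta * grad_f(x)
    history.append(x)
print(x, f(x))   # approaches the local minimum at x = 2, where f(2) = -2
# plotting `history` (e.g. with matplotlib) visualizes the convergence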
8.9 Practice Projects / Mini Tasks
- Implement gradient descent with momentum and compare it with vanilla gradient descent (a starter sketch follows this list).
- Train a small neural network on a dataset and visualize loss convergence.
- Use Lagrange multipliers to solve a toy SVM problem with 2D points.
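A possible starting point for the momentum project; the test function f(x) = x², the momentum coefficient, and the learning rate are illustrative assumptions:
def grad_f(x): return 2*x           # gradient of the test function f(x) = x^2

eta, beta = 0.1, 0.9                # learning rate and momentum coefficient
x, v = 5.0, 0.0                     # parameter and velocity
for i in range(30):
    v = beta * v - eta * grad_f(x)  # velocity accumulates a decaying sum of gradients
    x = x + v                       # momentum update
    print(f"Step {i+1}, x = {x:.4f}")
# compare the trajectory with the vanilla update from Section 8.7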
8.10 Further Reading & Videos
- Deep Learning Book (Goodfellow et al.) — Chapters on Optimization
- Khan Academy — Multivariable Calculus: Critical Points & Lagrange Multipliers
- MIT OCW — Optimization in Machine Learning
Next chapter: Gradient, Jacobian & Hessian in Neural Networks — practical applications in backpropagation and advanced ML optimization.