Chapter 8 — Entropy, Information, and KL Divergence
Measuring uncertainty and information is fundamental in AI/ML for decision making, loss computation, and probabilistic modeling.
8.1 Entropy
Entropy measures the uncertainty or unpredictability in a probability distribution. High entropy means more uncertainty; low entropy means more predictability.
Mathematical Definition: For a discrete random variable X with probability mass function P(X):
H(X) = - Σ P(x) log₂ P(x)
Example: A fair coin flip has entropy H(X) = 1 bit. A biased coin (90% heads, 10% tails) has lower entropy, ~0.47 bits. AI/ML context: Entropy helps in constructing decision trees by measuring the impurity of a split (ID3, C4.5 algorithms).
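To make the coin example concrete, here is a minimal NumPy sketch (the helper name entropy_bits is ours, not a library function) that computes the two coin entropies above and the information gain of a toy decision-tree split, the quantity ID3-style algorithms maximize:
import numpy as np

def entropy_bits(p):
    # Shannon entropy in bits; zero-probability outcomes contribute nothing
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))   # fair coin   -> 1.0 bit
print(entropy_bits([0.9, 0.1]))   # biased coin -> ~0.469 bits

# Toy decision-tree split: information gain = parent entropy - weighted child entropies
parent = [3/6, 3/6]                     # 3 positives, 3 negatives
left, right = [3/4, 1/4], [0.0, 1.0]    # split into a 4-sample node and a 2-sample node
gain = entropy_bits(parent) - (4/6) * entropy_bits(left) - (2/6) * entropy_bits(right)
print("Information gain:", gain)        # ~0.459 bits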
8.2 Cross-Entropy
Cross-entropy measures how well a predicted distribution Q matches a true distribution P: it is the average number of bits (or nats) needed to encode outcomes drawn from P using a code optimized for Q. In classification, P is the true label distribution and Q is the model's predicted probabilities.
Mathematical Definition:
H(P,Q) = - Σ P(x) log Q(x)
Example: In a 3-class classification problem, if the true labels are P = [1, 0, 0] and the predictions are Q = [0.8, 0.1, 0.1], the cross-entropy is -(1·ln 0.8 + 0 + 0) ≈ 0.223 nats (the natural log is standard for this loss). AI/ML context: Cross-entropy loss is widely used in training neural networks for classification tasks.
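In practice the loss is averaged over a batch of examples, and predicted probabilities are clipped to avoid log(0). A minimal sketch, assuming one-hot labels and already-normalized prediction rows (the helper name cross_entropy_loss and the eps value are our choices):
import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    # Mean cross-entropy (in nats) over a batch of one-hot labels and predicted probabilities
    y_pred = np.clip(y_pred, eps, 1.0)              # guard against log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_pred = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1]])
print(cross_entropy_loss(y_true, y_pred))           # (0.223 + 0.357) / 2 ≈ 0.29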
8.3 Kullback-Leibler (KL) Divergence
KL Divergence measures how one probability distribution P diverges from a reference distribution Q. It is asymmetric (KL(P || Q) ≠ KL(Q || P) in general), always non-negative, and zero only when P = Q.
Mathematical Definition:
KL(P || Q) = Σ P(x) log (P(x) / Q(x))
Example: Comparing a true distribution P = [0.5, 0.5] with an estimate Q = [0.8, 0.2] gives KL(P || Q) = 0.5·ln(0.5/0.8) + 0.5·ln(0.5/0.2) ≈ 0.223 nats (≈ 0.322 bits). AI/ML context: KL divergence is used in variational autoencoders (VAEs) to regularize learned latent distributions and in reinforcement learning for policy updates.
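The asymmetry is easy to check numerically. A minimal sketch (the helper name kl_divergence is ours; results are in nats because of the natural log):
import numpy as np

def kl_divergence(p, q):
    # KL(P || Q) in nats; terms with P(x) = 0 contribute nothing
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

P = np.array([0.5, 0.5])
Q = np.array([0.8, 0.2])
print(kl_divergence(P, Q))   # ~0.223 nats
print(kl_divergence(Q, P))   # ~0.193 nats -- note KL(P||Q) != KL(Q||P)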
8.4 Practical Examples in Python
import numpy as np
from scipy.stats import entropy
# Entropy of a distribution
p = np.array([0.5, 0.5])
H = entropy(p, base=2)   # base=2 gives the result in bits
print("Entropy H(X):", H)   # 1.0 bit for a fair coin
# Cross-entropy loss
true = np.array([1, 0, 0])
pred = np.array([0.8, 0.1, 0.1])
cross_entropy = -np.sum(true * np.log(pred))   # natural log, so the result is in nats
print("Cross-Entropy:", cross_entropy)   # ~0.223
# KL Divergence
P = np.array([0.5, 0.5])
Q = np.array([0.8, 0.2])
kl_div = entropy(P, qk=Q, base=2)   # passing qk makes scipy compute KL divergence; base=2 gives bits
print("KL Divergence KL(P||Q):", kl_div)   # ~0.322
8.5 Key Takeaways
- Entropy quantifies uncertainty in data or predictions.
- Cross-entropy measures the difference between predicted and true distributions — key for classification losses.
- KL Divergence measures how one distribution diverges from another, essential for probabilistic modeling, VAEs, and RL.
Next chapter: Markov Chains & Stochastic Processes — modeling sequences and transitions in AI/ML.