What is Gradient Descent?
Gradient descent is an optimization algorithm that iteratively adjusts model parameters to minimize a loss function. It's how neural networks learn from data.
The Intuition
Imagine standing in a hilly landscape blindfolded, trying to reach the lowest point. You feel the slope beneath your feet and step downhill. Repeat until you reach a valley.
How It Works
The algorithm computes how much each parameter contributes to the error (this collection of sensitivities is the gradient), then nudges each parameter in the opposite direction.
Compute Loss
Measure how wrong the current predictions are.
Calculate Gradients
Use backpropagation to find how each weight affects the loss.
Update Weights
Adjust weights in the direction that reduces loss.
Repeat
Iterate until the loss stops decreasing.
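To make the loop concrete, here is a minimal sketch of these four steps on a toy least-squares problem. The synthetic data, learning rate, and stopping threshold are illustrative assumptions, not values from this article:

```python
import numpy as np

# Toy least-squares problem: find w that minimizes mean((X @ w - y) ** 2).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)   # initial parameters
lr = 0.1          # learning rate (step size)

for step in range(200):
    error = X @ w - y
    loss = (error ** 2).mean()          # 1. compute loss
    grad = 2 * X.T @ error / len(y)     # 2. gradient of the loss w.r.t. w
    w -= lr * grad                      # 3. step opposite the gradient
    if np.linalg.norm(grad) < 1e-6:     # 4. repeat until progress stalls
        break

print(f"loss={loss:.5f}, w={w}")
```

In a neural network, step 2 is what backpropagation does; here the gradient has a closed form.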
Learning Rate
Controls how big each step is. Too high, and updates overshoot the minimum and can diverge; too low, and training crawls.
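You can see both failure modes on a one-dimensional bowl f(w) = w^2, whose gradient is 2w. The three step sizes below are illustrative picks, not recommendations:

```python
def descend(lr, steps=20, w0=1.0):
    """Gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w        # each update multiplies w by (1 - 2*lr)
    return abs(w)

for lr in (0.01, 0.4, 1.1):
    print(f"lr={lr}: |w| after 20 steps = {descend(lr):.4g}")
# lr=0.01 creeps toward 0, lr=0.4 converges fast, lr=1.1 overshoots and blows up.
```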
Variants
Stochastic Gradient Descent
Estimates the gradient from small random mini-batches instead of the full dataset, making each update far cheaper at the cost of some noise.
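A sketch of the idea, reusing the toy least-squares setup from above; the batch size and epoch count are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5])

w = np.zeros(3)
lr, batch_size = 0.1, 32

for epoch in range(5):
    order = rng.permutation(len(y))                 # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]                     # a small random slice of the data
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)   # noisy estimate of the full gradient
        w -= lr * grad

print(w)   # approaches [2, -1, 0.5]
```

Each update is cheaper and noisier than a full-batch step; in practice the noise is often a feature, nudging the optimizer out of sharp regions.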
Momentum
Accumulates a velocity from past gradients, smoothing updates and helping the optimizer roll through shallow local minima and plateaus.
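A minimal sketch of the update, using a toy quadratic in place of a real network's gradient; beta = 0.9 is a common default rather than a rule:

```python
import numpy as np

def grad(w):
    return 2 * w          # gradient of a toy quadratic bowl; stands in for backprop

w = np.array([5.0, -3.0])
velocity = np.zeros_like(w)
beta, lr = 0.9, 0.05

for _ in range(100):
    velocity = beta * velocity - lr * grad(w)   # decaying average of past steps
    w += velocity                               # move along the velocity, not the raw gradient

print(w)   # ends up near the minimum at the origin
```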
Adam
Adaptive learning rates per parameter. Combines momentum with RMSprop.
AdamW
Adam with decoupled weight decay. Now preferred over standard Adam for most applications, especially in training large language models.
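The update below sketches both ideas at once: the Adam moment estimates plus AdamW's decoupled weight decay, which is applied straight to the weights instead of being mixed into the gradient. The hyperparameter defaults are common choices, not prescriptions:

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * grad        # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSprop: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias-correct the zero-initialized averages
    v_hat = v / (1 - beta2 ** t)
    step = m_hat / (np.sqrt(v_hat) + eps)     # per-parameter adaptive step
    w = w - lr * (step + weight_decay * w)    # decay acts on w directly (the "decoupled" part)
    return w, m, v

# Toy usage: minimize f(w) = w**2, whose gradient is 2*w.
w = np.array([5.0, -3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    w, m, v = adamw_step(w, 2 * w, m, v, t, lr=0.01)
print(w)   # close to the origin
```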
Learning Rate Scheduling
Instead of using a fixed learning rate, schedules adjust it during training for better convergence.
Step Decay
Reduce learning rate by a factor at specific epochs (e.g., halve every 30 epochs).
Exponential Decay
Continuously decrease learning rate: lr = lr_0 * e^(-kt). Smooth but can decay too fast.
Cosine Annealing
Follows a cosine curve from the initial LR down to a minimum. Popular in modern training; allows a gentle cooldown.
Warmup
Start with a very low LR, gradually increase to the target, then decay. Stabilizes early training; essential for transformers.
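These schedules are often combined. Here is a sketch of linear warmup followed by cosine decay; the step counts and learning-rate values are placeholders, not recommendations:

```python
import math

def lr_at(step, total_steps, warmup_steps=500, base_lr=3e-4, min_lr=3e-5):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps           # linear ramp from ~0 to base_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))        # goes 1 -> 0 over training
    return min_lr + (base_lr - min_lr) * cosine              # gentle cooldown toward min_lr

for step in (0, 499, 5_000, 9_999):
    print(step, f"{lr_at(step, total_steps=10_000):.2e}")
```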
Convergence Challenges
Understanding obstacles that can prevent gradient descent from finding the global optimum.
Local Minima
Points where the loss is lower than nearby areas but not the global minimum. Momentum and adaptive methods help escape.
Saddle Points
Points where the gradient is zero but which are neither minima nor maxima. Common in high dimensions, where they slow convergence.
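A concrete saddle is f(x, y) = x^2 - y^2: the origin has zero gradient, but it is a minimum along x and a maximum along y. The starting point and step size below are arbitrary:

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])   # gradient of f(x, y) = x**2 - y**2

p = np.array([0.5, 1e-6])              # starts almost exactly on the saddle's axis
lr = 0.1
for _ in range(50):
    p -= lr * grad(p)

print(p, np.linalg.norm(grad(p)))
# x collapses toward 0 quickly, but y is still tiny after 50 steps, so the iterate
# lingers near (0, 0) where the gradient is almost zero and progress looks stalled.
```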
Plateaus
Flat regions where gradients are very small. Progress stalls until the optimizer escapes. Adaptive LR helps navigate these.
Gradient Descent Visualizer
Interactive demo: watch gradient descent optimize a 2D loss function.
The red ball follows the gradient (slope) downhill. A higher learning rate takes bigger steps but may overshoot. Starting position determines whether you reach the global or local minimum.
Key Takeaways
- Gradient descent minimizes loss by following the slope
- Learning rate is the most important hyperparameter
- AdamW is now the preferred optimizer for most deep learning applications
- Backpropagation computes gradients efficiently