Gradient Descent

The optimization algorithm that enables neural networks to learn.

What is Gradient Descent?

Gradient descent is an optimization algorithm that iteratively adjusts model parameters to minimize a loss function. It's how neural networks learn from data.

The Intuition

Imagine standing in a hilly landscape blindfolded, trying to reach the lowest point. You feel the slope beneath your feet and step downhill. Repeat until you reach a valley.

How It Works

The algorithm computes how much each parameter contributes to the error (the gradient), then adjusts each parameter a small step in the opposite direction.

1. Compute Loss: measure how wrong the current predictions are.

2. Calculate Gradients: use backpropagation to find how each weight affects the loss.

3. Update Weights: adjust weights in the direction that reduces the loss.

4. Repeat: iterate until the loss stops decreasing.
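As a minimal sketch of these four steps, here is plain gradient descent on a toy one-weight linear model (NumPy stands in for a real network; the data and learning rate are illustrative):

```python
import numpy as np

# Toy data: y = 2x, so the ideal weight is w = 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = 0.0     # initial weight
lr = 0.05   # learning rate

for step in range(100):
    pred = w * x
    loss = np.mean((pred - y) ** 2)      # 1. compute loss (mean squared error)
    grad = np.mean(2 * (pred - y) * x)   # 2. calculate the gradient of the loss w.r.t. w
    w -= lr * grad                       # 3. update the weight, opposite the gradient
    # 4. repeat until the loss stops decreasing

print(round(w, 4))  # close to 2.0
```

In a real network there are millions of weights, and step 2 is handled by backpropagation rather than a hand-derived formula.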

Learning Rate

Controls how big each step is. Too high and the optimizer overshoots the minimum or even diverges; too low and progress is painfully slow.
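To get a feel for the trade-off, here is a toy one-parameter example (loss(w) = w^2, gradient 2w; the specific rates are just illustrative):

```python
def descend(lr, w=1.0, steps=5):
    # Gradient descent on loss(w) = w**2, whose gradient is 2*w.
    path = [w]
    for _ in range(steps):
        w -= lr * 2 * w
        path.append(round(w, 3))
    return path

print(descend(lr=0.01))  # too low: barely moves toward 0
print(descend(lr=0.4))   # reasonable: converges quickly
print(descend(lr=1.1))   # too high: overshoots and diverges
```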

Variants

Stochastic Gradient Descent

Computes each update from a small random mini-batch instead of the full dataset, making updates much cheaper at the cost of some gradient noise.
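A rough sketch of the mini-batch idea on a noisy version of the same toy regression (the batch size and step count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
y = 2.0 * x + rng.normal(scale=0.1, size=1_000)   # noisy data, true slope = 2

w, lr, batch_size = 0.0, 0.05, 32
for step in range(500):
    idx = rng.integers(0, len(x), size=batch_size)   # sample a random mini-batch
    xb, yb = x[idx], y[idx]
    grad = np.mean(2 * (w * xb - yb) * xb)           # gradient computed on the batch only
    w -= lr * grad                                   # cheap but noisy update

print(round(w, 2))  # close to 2.0
```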

Momentum

Accumulates a running velocity from past gradients, smoothing updates and helping push through shallow local minima and flat regions.
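Sketched as the classical (heavy-ball) update rule on the toy quadratic loss; the coefficient 0.9 is a common but not universal choice:

```python
# Classical momentum on loss(w) = w**2.
w, velocity = 1.0, 0.0
lr, beta = 0.01, 0.9   # beta is the momentum coefficient

for _ in range(200):
    grad = 2 * w                            # gradient of w**2
    velocity = beta * velocity - lr * grad  # accumulate velocity from past gradients
    w += velocity                           # step along the velocity, not the raw gradient

print(round(w, 4))  # driven toward 0
```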

Adam

Adapts the learning rate for each parameter individually by combining momentum (a running mean of gradients) with RMSprop-style scaling (a running mean of squared gradients).
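A simplified single-parameter sketch of the Adam update rule, using the commonly cited default hyperparameters (the learning rate and step count are illustrative):

```python
import math

# Single-parameter sketch of Adam on loss(w) = w**2.
w = 1.0
m, v = 0.0, 0.0                                 # first and second moment estimates
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8  # common defaults (lr varies by task)

for t in range(1, 301):
    grad = 2 * w                                # gradient of w**2
    m = beta1 * m + (1 - beta1) * grad          # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # RMSprop: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter adaptive step

print(round(w, 3))  # driven toward 0
```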

AdamW

Adam with decoupled weight decay. Now preferred over standard Adam for most applications, especially in training large language models.
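In practice you would normally reach for a library implementation. A minimal PyTorch-style sketch (the linear model and random data are placeholders, not a real training setup):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
loss_fn = nn.MSELoss()

x = torch.randn(32, 10)    # placeholder data
y = torch.randn(32, 1)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()    # backpropagation fills in the gradients
    optimizer.step()   # AdamW update: weight decay is applied directly to the weights,
                       # not folded into the gradient as in Adam with L2 regularization
```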

Learning Rate Scheduling

Instead of using a fixed learning rate, schedules adjust it during training for better convergence.

Step Decay

Reduce learning rate by a factor at specific epochs (e.g., halve every 30 epochs).
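As a small sketch (the function name and default values are illustrative):

```python
def step_decay(epoch, lr0=0.1, factor=0.5, step_size=30):
    # Halve the learning rate every 30 epochs.
    return lr0 * factor ** (epoch // step_size)

print(step_decay(0), step_decay(30), step_decay(60))  # 0.1, 0.05, 0.025
```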

Exponential Decay

Continuously decrease learning rate: lr = lr_0 * e^(-kt). Smooth but can decay too fast.
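The same formula as a sketch (the decay constant k is an illustrative choice):

```python
import math

def exponential_decay(t, lr0=0.1, k=0.05):
    # lr = lr_0 * e^(-k t); larger k means faster decay.
    return lr0 * math.exp(-k * t)

print(exponential_decay(0), round(exponential_decay(50), 4))  # 0.1, ~0.0082
```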

Cosine Annealing

Follows a cosine curve from the initial LR down to a minimum. Popular in modern training because it gives a gentle cooldown at the end.
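A sketch of the schedule (lr_max and lr_min are illustrative):

```python
import math

def cosine_annealing(t, total_steps, lr_max=0.1, lr_min=0.0):
    # Cosine curve from lr_max at t = 0 down to lr_min at t = total_steps.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / total_steps))

print(cosine_annealing(0, 100), cosine_annealing(50, 100), cosine_annealing(100, 100))
# 0.1, 0.05, 0.0 (up to floating-point error)
```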

Warmup

Start with very low LR, gradually increase to target, then decay. Stabilizes early training, essential for transformers.
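A sketch combining linear warmup with cosine decay (the step counts and peak rate are illustrative, roughly in the range used for transformer training):

```python
import math

def warmup_cosine(step, warmup_steps=1_000, total_steps=10_000, lr_max=3e-4, lr_min=0.0):
    if step < warmup_steps:
        # Linear warmup from 0 up to lr_max.
        return lr_max * step / warmup_steps
    # After warmup, cosine decay from lr_max down to lr_min.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```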

Convergence Challenges

Understanding obstacles that can prevent gradient descent from finding the global optimum.

Local Minima

Points where the loss is lower than nearby areas but not the global minimum. Momentum and adaptive methods help escape.

Saddle Points

Points where the gradient is zero but it's neither a minimum nor maximum. Common in high dimensions, slowing convergence.

Plateaus

Flat regions where gradients are very small. Progress stalls until the optimizer escapes. Adaptive LR helps navigate these.
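To see why a zero gradient is not enough, consider the toy saddle-shaped function f(x, y) = x^2 - y^2: plain gradient descent started near the saddle barely moves for many steps before escaping (a sketch, not a realistic loss surface):

```python
# f(x, y) = x^2 - y^2 has zero gradient at (0, 0),
# but that point is a saddle, not a minimum.
def gradient(x, y):
    return 2 * x, -2 * y

x, y, lr = 1.0, 1e-6, 0.1   # y starts almost exactly on the saddle's ridge
for step in range(60):
    gx, gy = gradient(x, y)
    x -= lr * gx   # x shrinks quickly toward 0
    y -= lr * gy   # y grows, but starting from 1e-6 it takes many steps to matter,
                   # so the loss barely changes while the optimizer sits near the saddle
```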

📉 Gradient Descent Visualizer

[Interactive demo: watch gradient descent optimize a 2D loss function. Controls include the learning rate (from 0.01, slow, to 0.3, fast); readouts show the iteration count, current loss, and position (x); markers indicate the global and local minimum.]

The red ball follows the gradient (slope) downhill. A higher learning rate takes bigger steps but may overshoot. Starting position determines whether you reach the global or local minimum.

Key Takeaways

  • Gradient descent minimizes loss by following the slope
  • Learning rate is the most important hyperparameter
  • AdamW is now the preferred optimizer for most deep learning applications
  • Backpropagation computes gradients efficiently