06. Optimizers#
FOUNDATION TIER | Difficulty: ⭐⭐⭐⭐ (4/4) | Time: 6-8 hours
Overview#
Welcome to the Optimizers module! You’ll implement the learning algorithms that power every neural network—transforming gradients into intelligent parameter updates that enable models to learn from data. This module builds the optimization foundation used across all modern deep learning frameworks.
Learning Objectives#
By the end of this module, you will be able to:
Understand optimization dynamics: Master convergence behavior, learning rate sensitivity, and how gradients guide parameter updates in high-dimensional loss landscapes
Implement core optimization algorithms: Build SGD, momentum, Adam, and AdamW optimizers from mathematical first principles
Analyze memory-convergence trade-offs: Understand why Adam uses 3x memory but converges faster than SGD on many problems
Master adaptive learning rates: See how Adam’s per-parameter learning rates handle different gradient scales automatically
Connect to production frameworks: Understand how your implementations mirror PyTorch’s torch.optim.SGD and torch.optim.Adam design patterns
Build → Use → Reflect#
This module follows TinyTorch’s Build → Use → Reflect framework:
Build: Implement SGD with momentum, Adam optimizer with adaptive learning rates, and AdamW with decoupled weight decay from mathematical foundations
Use: Apply optimization algorithms to train neural networks on real classification and regression tasks
Reflect: Why does Adam converge faster initially but SGD often achieves better final test accuracy? What’s the memory cost of adaptive learning rates?
Implementation Guide#
Core Optimization Algorithms#
# Base optimizer class with parameter management
class Optimizer:
"""Base class defining optimizer interface."""
def __init__(self, params: List[Tensor]):
self.params = list(params)
self.step_count = 0
def zero_grad(self):
"""Clear gradients from all parameters."""
for param in self.params:
param.grad = None
def step(self):
"""Update parameters - implemented by subclasses."""
raise NotImplementedError
# SGD with momentum for accelerated convergence
sgd = SGD(parameters=[w1, w2, bias], lr=0.01, momentum=0.9)
sgd.zero_grad() # Clear previous gradients
loss.backward() # Compute new gradients via autograd
sgd.step() # Update parameters with momentum
# Adam optimizer with adaptive learning rates
adam = Adam(parameters=[w1, w2, bias], lr=0.001, betas=(0.9, 0.999))
adam.zero_grad()
loss.backward()
adam.step() # Adaptive updates per parameter
# AdamW with decoupled weight decay
adamw = AdamW(parameters=[w1, w2, bias], lr=0.001, weight_decay=0.01)
adamw.zero_grad()
loss.backward()
adamw.step() # Adam + proper regularization
SGD with Momentum Implementation#
class SGD(Optimizer):
"""Stochastic Gradient Descent with momentum.
Momentum physics: velocity accumulates gradients over time,
smoothing noisy updates and accelerating in consistent directions.
"""
def __init__(self, params: List[Tensor], lr: float = 0.01,
momentum: float = 0.0, weight_decay: float = 0.0):
super().__init__(params)
self.lr = lr
self.momentum = momentum
self.weight_decay = weight_decay
# Initialize momentum buffers (created lazily)
self.momentum_buffers = [None for _ in self.params]
def step(self):
"""Update parameters using momentum: v = βv + ∇L, θ = θ - αv"""
for i, param in enumerate(self.params):
if param.grad is None:
continue
grad = param.grad
# Apply weight decay
if self.weight_decay != 0:
grad = grad + self.weight_decay * param.data
# Update momentum buffer
if self.momentum != 0:
if self.momentum_buffers[i] is None:
self.momentum_buffers[i] = np.zeros_like(param.data)
# Update velocity: v_t = β*v_{t-1} + grad
self.momentum_buffers[i] = (self.momentum * self.momentum_buffers[i]
+ grad)
grad = self.momentum_buffers[i]
# Update parameter: θ_t = θ_{t-1} - α*v_t
param.data = param.data - self.lr * grad
self.step_count += 1
Adam Optimizer Implementation#
class Adam(Optimizer):
"""Adam optimizer with adaptive learning rates.
Combines momentum (first moment) with RMSprop-style adaptive rates
(second moment) for robust optimization across different scales.
"""
def __init__(self, params: List[Tensor], lr: float = 0.001,
betas: tuple = (0.9, 0.999), eps: float = 1e-8,
weight_decay: float = 0.0):
super().__init__(params)
self.lr = lr
self.beta1, self.beta2 = betas
self.eps = eps
self.weight_decay = weight_decay
# Initialize moment estimates (3x memory vs SGD)
self.m_buffers = [None for _ in self.params] # First moment
self.v_buffers = [None for _ in self.params] # Second moment
def step(self):
"""Update parameters with adaptive learning rates"""
self.step_count += 1
for i, param in enumerate(self.params):
if param.grad is None:
continue
grad = param.grad
# Apply weight decay (Adam's approach - has issues)
if self.weight_decay != 0:
grad = grad + self.weight_decay * param.data
# Initialize buffers if needed
if self.m_buffers[i] is None:
self.m_buffers[i] = np.zeros_like(param.data)
self.v_buffers[i] = np.zeros_like(param.data)
# Update biased first moment: m_t = β1*m_{t-1} + (1-β1)*grad
self.m_buffers[i] = (self.beta1 * self.m_buffers[i]
+ (1 - self.beta1) * grad)
# Update biased second moment: v_t = β2*v_{t-1} + (1-β2)*grad²
self.v_buffers[i] = (self.beta2 * self.v_buffers[i]
+ (1 - self.beta2) * (grad ** 2))
# Bias correction (critical for early training steps)
bias_correction1 = 1 - self.beta1 ** self.step_count
bias_correction2 = 1 - self.beta2 ** self.step_count
m_hat = self.m_buffers[i] / bias_correction1
v_hat = self.v_buffers[i] / bias_correction2
# Adaptive parameter update: θ = θ - α*m_hat/(√v_hat + ε)
param.data = (param.data - self.lr * m_hat
/ (np.sqrt(v_hat) + self.eps))
AdamW Implementation (Decoupled Weight Decay)#
class AdamW(Optimizer):
"""AdamW optimizer with decoupled weight decay.
AdamW fixes Adam's weight decay bug by applying regularization
directly to parameters, separate from gradient-based updates.
"""
def __init__(self, params: List[Tensor], lr: float = 0.001,
betas: tuple = (0.9, 0.999), eps: float = 1e-8,
weight_decay: float = 0.01):
super().__init__(params)
self.lr = lr
self.beta1, self.beta2 = betas
self.eps = eps
self.weight_decay = weight_decay
# Initialize moment buffers (same as Adam)
self.m_buffers = [None for _ in self.params]
self.v_buffers = [None for _ in self.params]
def step(self):
"""Perform AdamW update with decoupled weight decay"""
self.step_count += 1
for i, param in enumerate(self.params):
if param.grad is None:
continue
# Get gradient (NOT modified by weight decay - key difference!)
grad = param.grad
# Initialize buffers if needed
if self.m_buffers[i] is None:
self.m_buffers[i] = np.zeros_like(param.data)
self.v_buffers[i] = np.zeros_like(param.data)
# Update moments using pure gradients
self.m_buffers[i] = (self.beta1 * self.m_buffers[i]
+ (1 - self.beta1) * grad)
self.v_buffers[i] = (self.beta2 * self.v_buffers[i]
+ (1 - self.beta2) * (grad ** 2))
# Compute bias correction
bias_correction1 = 1 - self.beta1 ** self.step_count
bias_correction2 = 1 - self.beta2 ** self.step_count
m_hat = self.m_buffers[i] / bias_correction1
v_hat = self.v_buffers[i] / bias_correction2
# Apply gradient-based update
param.data = (param.data - self.lr * m_hat
/ (np.sqrt(v_hat) + self.eps))
# Apply decoupled weight decay (after gradient update!)
if self.weight_decay != 0:
param.data = param.data * (1 - self.lr * self.weight_decay)
Complete Training Integration#
# Modern training workflow combining all components
from tinytorch.core.tensor import Tensor
from tinytorch.core.optimizers import SGD, Adam, AdamW
# Model setup (from previous modules)
model = Sequential([
Linear(784, 128), ReLU(),
Linear(128, 64), ReLU(),
Linear(64, 10)
])
# Optimization setup
optimizer = AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
criterion = CrossEntropyLoss()
# Training loop
for epoch in range(num_epochs):
epoch_loss = 0.0
for batch_inputs, batch_targets in dataloader:
# Forward pass
predictions = model(batch_inputs)
loss = criterion(predictions, batch_targets)
# Backward pass and optimization
optimizer.zero_grad() # Clear old gradients
loss.backward() # Compute new gradients
optimizer.step() # Update parameters
epoch_loss += loss.data
print(f"Epoch {epoch}: Loss = {epoch_loss:.4f}")
Getting Started#
Prerequisites#
Ensure you understand the mathematical foundations:
# Activate TinyTorch environment
source scripts/activate-tinytorch
# Verify prerequisite modules
tito test tensor
tito test autograd
Required Background:
Tensor Operations: Understanding parameter storage and update mechanics
Automatic Differentiation: Gradients computed via backpropagation
Calculus: Derivatives, gradient descent, chain rule
Linear Algebra: Vector operations, element-wise operations
Development Workflow#
Open the development file:
modules/06_optimizers/optimizers_dev.ipynbImplement Optimizer base class: Start with parameter management and zero_grad interface
Build SGD with momentum: Add velocity accumulation for smoother convergence
Create Adam optimizer: Implement adaptive learning rates with moment estimation and bias correction
Add AdamW optimizer: Build decoupled weight decay for proper regularization
Export and verify:
tito module complete 06 && tito test optimizers
Development Tips:
Test each optimizer on simple quadratic functions (f(x) = x²) where you can verify analytical convergence
Compare convergence speed between SGD and Adam on the same problem
Visualize loss curves to understand optimization dynamics
Check momentum/moment buffers are properly initialized and updated
Compare Adam vs AdamW to see the effect of decoupled weight decay
Testing#
Comprehensive Test Suite#
Run the full test suite to verify optimization algorithm correctness:
# TinyTorch CLI (recommended)
tito test optimizers
# Direct pytest execution
python -m pytest tests/ -k optimizers -v
# Test specific optimizer
python -m pytest tests/test_optimizers.py::test_adam_convergence -v
Test Coverage Areas#
Algorithm Implementation: Verify Optimizer base, SGD, Adam, and AdamW compute mathematically correct parameter updates
Mathematical Correctness: Test against analytical solutions for convex optimization problems (quadratic functions)
State Management: Ensure proper momentum and moment estimation tracking across training steps
Memory Efficiency: Verify buffer initialization and memory usage patterns
Training Integration: Test optimizers in complete neural network training workflows with real data
Inline Testing & Convergence Analysis#
The module includes comprehensive mathematical validation and convergence visualization:
# Example inline test output
🔬 Unit Test: Base Optimizer...
✅ Parameter validation working correctly
✅ zero_grad clears all gradients properly
✅ Error handling for non-gradient parameters
📈 Progress: Base Optimizer ✓
# SGD with momentum validation
🔬 Unit Test: SGD with momentum...
✅ Parameter updates follow momentum equation v_t = βv_{t-1} + ∇L
✅ Velocity accumulation working correctly
✅ Weight decay applied properly
✅ Momentum accelerates convergence vs vanilla SGD
📈 Progress: SGD with Momentum ✓
# Adam optimizer validation
🔬 Unit Test: Adam optimizer...
✅ First moment estimation (m_t) computed correctly
✅ Second moment estimation (v_t) computed correctly
✅ Bias correction applied properly (critical for early steps)
✅ Adaptive learning rates working per parameter
✅ Convergence faster than SGD on ill-conditioned problem
📈 Progress: Adam Optimizer ✓
# AdamW decoupled weight decay validation
🔬 Unit Test: AdamW optimizer...
✅ Weight decay decoupled from gradient updates
✅ Results differ from Adam (proving proper implementation)
✅ Regularization consistent across gradient scales
✅ With zero weight decay, matches Adam behavior
📈 Progress: AdamW Optimizer ✓
Manual Testing Examples#
from tinytorch.core.optimizers import SGD, Adam, AdamW
from tinytorch.core.tensor import Tensor
# Test 1: SGD convergence on simple quadratic
print("Test 1: SGD on f(x) = x²")
x = Tensor([10.0], requires_grad=True)
sgd = SGD([x], lr=0.1, momentum=0.9)
for step in range(100):
sgd.zero_grad()
loss = (x ** 2).sum() # Minimize f(x) = x², minimum at x=0
loss.backward()
sgd.step()
if step % 10 == 0:
print(f"Step {step}: x = {x.data[0]:.6f}, loss = {loss.data:.6f}")
# Expected: x should converge to 0
# Test 2: Adam on multidimensional optimization
print("\nTest 2: Adam on f(x,y) = x² + y²")
params = Tensor([5.0, -3.0], requires_grad=True)
adam = Adam([params], lr=0.1)
for step in range(50):
adam.zero_grad()
loss = (params ** 2).sum() # Minimize ||x||²
loss.backward()
adam.step()
if step % 10 == 0:
print(f"Step {step}: params = {params.data}, loss = {loss.data:.6f}")
# Expected: Both parameters converge to 0
# Test 3: Compare SGD vs Adam vs AdamW convergence
print("\nTest 3: Optimizer comparison")
x_sgd = Tensor([10.0], requires_grad=True)
x_adam = Tensor([10.0], requires_grad=True)
x_adamw = Tensor([10.0], requires_grad=True)
sgd = SGD([x_sgd], lr=0.01, momentum=0.9)
adam = Adam([x_adam], lr=0.01)
adamw = AdamW([x_adamw], lr=0.01, weight_decay=0.01)
for step in range(20):
# SGD update
sgd.zero_grad()
loss_sgd = (x_sgd ** 2).sum()
loss_sgd.backward()
sgd.step()
# Adam update
adam.zero_grad()
loss_adam = (x_adam ** 2).sum()
loss_adam.backward()
adam.step()
# AdamW update
adamw.zero_grad()
loss_adamw = (x_adamw ** 2).sum()
loss_adamw.backward()
adamw.step()
if step % 5 == 0:
print(f"Step {step}: SGD={x_sgd.data[0]:.6f}, Adam={x_adam.data[0]:.6f}, AdamW={x_adamw.data[0]:.6f}")
# Expected: Adam/AdamW converge faster initially
Systems Thinking Questions#
Real-World Applications#
Large Language Models: GPT and BERT training relies on AdamW optimizer for stable convergence across billions of parameters with varying gradient scales and proper regularization
Computer Vision: ResNet and Vision Transformer training typically uses SGD with momentum for best final test accuracy despite slower initial convergence
Recommendation Systems: Online learning systems use adaptive optimizers like Adam for continuous model updates with non-stationary data distributions
Reinforcement Learning: Policy gradient methods depend heavily on careful optimizer choice and learning rate tuning due to high variance gradients
Optimization Theory Foundations#
Gradient Descent: Update rule θ_{t+1} = θ_t - α∇L(θ_t) where α is learning rate controlling step size in steepest descent direction
Momentum: Velocity accumulation v_{t+1} = βv_t + ∇L(θ_t), then θ_{t+1} = θ_t - αv_{t+1} smooths noisy gradients and accelerates convergence
Adam: Combines momentum (first moment m_t) with adaptive learning rates (second moment v_t), includes bias correction for early training steps
AdamW: Decouples weight decay from gradient updates: applies gradient update first, then weight decay, fixing Adam’s regularization bug
Performance Characteristics#
SGD Memory: O(2n) memory for n parameters (params + momentum buffers), most memory-efficient optimizer with momentum
Adam Memory: O(3n) memory due to first and second moment buffers (params + m_buffers + v_buffers), 1.5x SGD cost
Convergence Speed: Adam often converges faster initially due to adaptive rates, especially with sparse gradients or varying scales
Final Performance: SGD with momentum often achieves better test accuracy on computer vision tasks despite slower convergence
Learning Rate Sensitivity: Adam/AdamW are more robust to learning rate choice than vanilla SGD, making them popular for transformer training
Computational Cost: Adam requires ~1.5x more computation per step (moment updates + bias correction + sqrt operations) than SGD
Critical Thinking: Memory vs Convergence Trade-offs#
Reflection Question: Why does Adam use 3x the memory of parameter-only storage (and 1.5x SGD), and when is this trade-off worth it?
Key Insights:
Memory Cost: Adam stores parameter data + first moment (momentum) + second moment (variance) for every parameter
Adaptive Benefit: Per-parameter learning rates handle different gradient scales automatically
Use Case: Transformers benefit from Adam (varying embedding vs attention scales), CNNs often prefer SGD (more uniform scales)
Production Decision: Memory-constrained systems (mobile, edge devices) may prefer SGD despite slower convergence
Training Time: Faster convergence can save GPU hours, offsetting memory cost in cloud training scenarios
Reflection Question: Why does SGD with momentum often achieve better test accuracy than Adam on vision tasks, despite slower training?
Key Insights:
Generalization: SGD explores flatter minima that generalize better to test data
Overfitting: Adam’s fast convergence may lead to sharper minima with worse generalization
Learning Rate Schedule: Careful learning rate decay with SGD achieves better final performance
Task Dependency: Effect is strongest on CNNs, less pronounced on transformers
Modern Practice: AdamW with proper weight decay often bridges this gap
Reflection Question: How does AdamW’s decoupled weight decay fix Adam’s regularization bug?
Key Insights:
Adam Bug: Adds weight decay to gradients, so adaptive learning rates affect regularization strength inconsistently
AdamW Fix: Applies weight decay directly to parameters after gradient update, decoupling optimization from regularization
Consistency: Weight decay effect is now uniform across parameters regardless of gradient magnitudes
Production Impact: AdamW is now preferred over Adam in most modern training pipelines (BERT, GPT-3, etc.)
Ready to Build?#
You’re about to implement the algorithms that enable all of modern deep learning! Every neural network—from the image classifiers in your phone to GPT-4—depends on the optimization algorithms you’re building in this module.
Understanding these algorithms from first principles will transform how you think about training. When you implement momentum physics and see how velocity accumulation smooths noisy gradients, when you build Adam’s adaptive learning rates and understand why they help with varying parameter scales, when you create AdamW and see how decoupled weight decay fixes Adam’s bug—you’ll develop deep intuition for why some training configurations work and others fail.
Take your time with the mathematics. Test your optimizers on simple quadratic functions where you can verify convergence analytically. Compare SGD vs Adam vs AdamW on the same problem to see their different behaviors. Visualize loss curves to understand optimization dynamics. Monitor memory usage to see the trade-offs. This hands-on experience will make you a better practitioner who can debug training failures, tune hyperparameters effectively, and make informed decisions about optimizer choice in production systems. Enjoy building the intelligence behind intelligent systems!
Choose your preferred way to engage with this module:
Run this module interactively in your browser. No installation required!
Use Google Colab for GPU access and cloud compute power.
Browse the Jupyter notebook and understand the implementation.
💾 Save Your Progress
Binder sessions are temporary! Download your completed notebook when done, or switch to local development for persistent work.