19. Benchmarking - Fair Performance Comparison#
OPTIMIZATION TIER | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours
Overview#
You’ll build a rigorous performance measurement system that enables fair comparison of all your optimizations. This module implements educational benchmarking with statistical testing, normalized metrics, and reproducible protocols. Your benchmarking framework provides the measurement methodology used in Module 20’s competition workflow, where you’ll apply these tools to validate optimizations systematically.
Learning Objectives#
By the end of this module, you will be able to:
Understand benchmark design principles: Reproducibility requirements; representative workload selection; measurement methodology; controlling for confounding variables; fair comparison protocols
Implement statistical rigor: Multiple runs with warmup periods; confidence interval calculation; variance reporting not just means; understanding measurement uncertainty; detecting outliers
Master fair comparison protocols: Hardware normalization strategies; environmental controls (thermal, OS noise); baseline selection criteria; same workload/data/environment enforcement; apples-to-apples measurement
Build normalized metrics systems: Speedup ratios (baseline_time / optimized_time); compression factors (original_size / compressed_size); accuracy preservation tracking; efficiency scores combining multiple objectives; hardware-independent reporting
Analyze measurement trade-offs: Benchmark coverage vs runtime cost; statistical power vs sample size requirements; reproducibility vs realism; instrumentation overhead (observer effect); when 5% speedup is significant vs noise
Build → Use → Analyze#
This module follows TinyTorch’s Build → Use → Analyze framework:
Build: Implement benchmarking framework with statistical testing (confidence intervals, t-tests), normalized metrics (speedup, compression, efficiency), warmup protocols, and automated report generation
Use: Benchmark all your Optimization Tier implementations (profiling, quantization, compression, memoization, acceleration) against baselines on real tasks; compare fairly with statistical rigor
Analyze: Why do benchmark results vary across runs? How does hardware affect comparison fairness? When is 5% speedup statistically significant vs noise? What makes benchmarks representative vs over-fitted?
Implementation Guide#
Core Benchmarking Components#
Your benchmarking framework implements four key systems:
1. Statistical Measurement Infrastructure#
Why Multiple Runs Matter
Single measurements are meaningless in ML systems. Performance varies 10-30% across runs due to:
Thermal throttling: CPU frequency drops when hot
OS background tasks: Interrupts, garbage collection, other processes
Memory state: Cache coldness, fragmentation, swap pressure
CPU frequency scaling: Dynamic frequency adjustment
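This run-to-run variance is easy to observe directly. A minimal sketch (the workload and run count are illustrative, not part of the module's API):

```python
# Time the same workload repeatedly to expose run-to-run variance.
import time
import statistics

def workload():
    # Stand-in computation; any deterministic function works here.
    return sum(i * i for i in range(10_000))

samples = []
for _ in range(20):
    start = time.perf_counter()
    workload()
    samples.append((time.perf_counter() - start) * 1000)  # ms

mean = statistics.mean(samples)
spread = (max(samples) - min(samples)) / mean * 100
print(f"mean {mean:.3f} ms, min-to-max spread {spread:.0f}% of mean")
```

Even on an idle machine, the min-to-max spread is typically a double-digit percentage of the mean, which is why a single measurement cannot support a "5% faster" claim.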
Statistical Solution
```python
import statistics
from typing import List

import numpy as np

class BenchmarkResult:
    """Container for measurements with statistical analysis."""

    def __init__(self, metric_name: str, values: List[float]):
        self.metric_name = metric_name
        self.values = values
        self.mean = statistics.mean(values)
        self.std = statistics.stdev(values)
        self.median = statistics.median(values)
        # 95% confidence interval for the mean
        t_score = 1.96  # Normal approximation
        margin = t_score * (self.std / np.sqrt(len(values)))
        self.ci_lower = self.mean - margin
        self.ci_upper = self.mean + margin
```
What This Reveals: If confidence intervals overlap between baseline and optimized, the difference might be noise. Statistical rigor prevents false claims.
2. Warmup and Measurement Protocol#
The Warmup Problem
First run: 120ms. Second run: 100ms. Third run: 98ms. What happened?
Cold cache: First run pays cache miss penalties
JIT compilation: NumPy and frameworks compile code paths on first use
Memory allocation: Initial runs establish memory patterns
Warmup Solution
```python
import time

class Benchmark:
    def __init__(self, warmup_runs=5, measurement_runs=10):
        self.warmup_runs = warmup_runs
        self.measurement_runs = measurement_runs

    def run_latency_benchmark(self, model, input_data):
        # Warmup: stabilize performance (discard cold-start effects)
        for _ in range(self.warmup_runs):
            model.forward(input_data)
        # Measurement: collect statistics
        latencies = []
        for _ in range(self.measurement_runs):
            start = time.perf_counter()
            model.forward(input_data)
            latencies.append((time.perf_counter() - start) * 1000)  # ms
        return BenchmarkResult("latency_ms", latencies)
```
Why This Matters: Warmup runs discard cold-start effects. Measurement runs capture true steady-state performance.
3. Normalized Metrics for Fair Comparison#
Hardware-Independent Speedup
```python
# Speedup ratio: baseline_time / optimized_time
speedup = baseline_result.mean / optimized_result.mean
# Example: 100ms / 80ms = 1.25x speedup (25% faster)
# Speedup > 1.0 means the optimization helped
# Speedup < 1.0 means the optimization regressed
```
Compression Ratio
```python
# Model size reduction
compression_ratio = original_size_mb / compressed_size_mb
# Example: 100MB / 25MB = 4x compression
```
Efficiency Score (Multi-Objective)
```python
# Combine speed + size + accuracy
efficiency = (speedup * compression) / (1 + abs(accuracy_delta))
# Penalizes accuracy loss
# Rewards speed AND compression
# Single metric for ranking
```
Why Normalized Metrics: Speedup ratios work on any hardware. “2x faster” is meaningful whether you have an M1 Mac or an Intel i9. Absolute times (100ms → 50ms) are hardware-specific.
4. Comprehensive Benchmark Suite#
Multiple Benchmark Types
Your BenchmarkSuite runs:
Latency Benchmark: How fast is inference? (milliseconds)
Accuracy Benchmark: How correct are predictions? (0.0-1.0)
Memory Benchmark: How much RAM is used? (megabytes)
Energy Benchmark: How efficient is compute? (estimated joules)
Pareto Frontier Analysis
```
Accuracy
   ↑
   |                A ●   ← Model A: high accuracy, high latency
   |
   |        B ●           ← Model B: balanced (Pareto optimal)
   |
   |   C ●                ← Model C: low accuracy, low latency
   |______________________→ Latency (lower is better)
```
Models on the Pareto frontier aren’t strictly dominated—each represents a valid optimization trade-off. Your suite automatically identifies these optimal points.
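The dominance check behind this can be sketched in a few lines (a hypothetical `pareto_frontier` helper for illustration, not the suite's actual API):

```python
# A model is Pareto-optimal if no other model is at least as good on
# both axes and strictly better on at least one.
def pareto_frontier(models):
    """models: list of (name, accuracy, latency_ms) tuples."""
    frontier = []
    for name, acc, lat in models:
        dominated = any(
            a >= acc and l <= lat and (a > acc or l < lat)
            for _, a, l in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

models = [
    ("A", 0.95, 120.0),  # high accuracy, high latency
    ("B", 0.90, 60.0),   # balanced
    ("C", 0.80, 20.0),   # low accuracy, low latency
    ("D", 0.85, 90.0),   # dominated by B (worse on both axes)
]
print(pareto_frontier(models))  # → ['A', 'B', 'C']
```

Model D is excluded because B beats it on both accuracy and latency; A, B, and C each win on at least one axis.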
Real-World Benchmarking Principles#
Your implementation teaches industry-standard methodology:
Reproducibility Requirements#
Every benchmark run documents:
```python
system_info = {
    'platform': 'macOS-14.2-arm64',   # OS version
    'processor': 'Apple M1 Max',      # CPU type
    'python_version': '3.11.6',       # Runtime
    'memory_gb': 64,                  # RAM
    'cpu_count': 10                   # Cores
}
```
Why: A colleague should be able to reproduce your results given the same environment. Missing details make verification impossible.
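Most of this metadata can be gathered with the standard library alone (field names mirror the example above; `memory_gb` typically needs a third-party library such as psutil, so it is omitted here):

```python
# Capture environment metadata for reproducible benchmark reports.
import os
import platform

system_info = {
    "platform": platform.platform(),          # e.g. 'macOS-14.2-arm64-...'
    "processor": platform.processor() or platform.machine(),
    "python_version": platform.python_version(),
    "cpu_count": os.cpu_count(),
}
for key, value in system_info.items():
    print(f"{key}: {value}")
```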
Fair Comparison Protocol#
Don’t Compare:
GPU-optimized code vs CPU baseline (unfair hardware)
Quantized INT8 vs FP32 baseline (unfair precision)
Batch size 32 vs batch size 1 (unfair workload)
Cold start vs warmed up (unfair cache state)
Do Compare:
Same hardware, same workload, same environment
Baseline vs optimized on identical conditions
Report speedup with confidence intervals
Test statistical significance (t-test, p < 0.05)
Statistical Significance Testing#
```python
from scipy import stats

baseline_times = [100, 102, 98, 101, 99]  # ms
optimized_times = [95, 97, 93, 96, 94]

# Is the difference real or noise?
t_stat, p_value = stats.ttest_ind(baseline_times, optimized_times)
if p_value < 0.05:
    print("Statistically significant (p < 0.05)")
else:
    print("Not significant—could be noise")
```
Why This Matters: 5% speedup with p=0.08 isn’t significant. Could be measurement variance. Production teams don’t merge optimizations without statistical confidence.
Connection to Competition Workflow (Module 20)#
This benchmarking infrastructure provides the measurement harness used in Module 20’s competition workflow:
How Module 20 Uses This Framework
Module 20 uses your `Benchmark` class to measure baseline and optimized performance
Statistical rigor from this module ensures fair comparison across submissions
Normalized metrics enable hardware-independent ranking
Reproducible protocols ensure all competitors use the same measurement methodology
The Workflow
Module 19: Learn benchmarking methodology (statistical rigor, fair comparison)
Module 20: Apply benchmarking tools in competition workflow (submission generation, validation)
Competition: Use Benchmark harness to measure and validate optimizations
Your benchmarking framework provides the foundation for fair competition—same measurement methodology, same statistical analysis, same reporting format. Module 20 teaches how to use these tools in a competition context.
Getting Started#
Prerequisites#
Ensure you understand the optimization foundations:
```bash
# Activate TinyTorch environment
source scripts/activate-tinytorch

# Verify prerequisite modules
tito test profiling
tito test quantization
tito test compression
```
Development Workflow#
Open the development file: `modules/19_benchmarking/benchmarking_dev.py`
Implement BenchmarkResult: Container for measurements with statistical analysis
Build Benchmark class: Runner with warmup, multiple runs, metrics collection
Create BenchmarkSuite: Full evaluation with latency/accuracy/memory/energy
Add reporting: Automated report generation with visualizations
Export and verify:
```bash
tito module complete 19 && tito test benchmarking
```
Testing#
Comprehensive Test Suite#
Run the full test suite to verify benchmarking functionality:
```bash
# TinyTorch CLI (recommended)
tito test benchmarking

# Direct pytest execution
python -m pytest tests/ -k benchmarking -v
```
Test Coverage Areas#
✅ Statistical Calculations: Mean, std, median, confidence intervals computed correctly
✅ Multiple Runs: Warmup and measurement phases work properly
✅ Normalized Metrics: Speedup, compression, efficiency calculated accurately
✅ Fair Comparison: Same workload enforcement, baseline vs optimized
✅ Result Serialization: BenchmarkResult converts to dict for storage
✅ Visualization: Plots generate with proper formatting and error bars
✅ System Info: Metadata captured for reproducibility
✅ Pareto Analysis: Optimal trade-off points identified correctly
Inline Testing & Validation#
The module includes comprehensive unit tests:
```
🔬 Unit Test: BenchmarkResult...
✅ Mean calculation correct: 3.0
✅ Std calculation matches statistics module
✅ Confidence intervals bound mean
✅ Serialization preserves data
📈 Progress: BenchmarkResult ✓

🔬 Unit Test: Benchmark latency...
✅ Warmup runs executed before measurement
✅ Multiple measurement runs collected
✅ Results include mean ± CI
📈 Progress: Benchmark ✓

🔬 Unit Test: BenchmarkSuite...
✅ All benchmark types run (latency, accuracy, memory, energy)
✅ Results organized by metric type
✅ Visualizations generated
📈 Progress: BenchmarkSuite ✓
```
Manual Testing Examples#
```python
import time

import numpy as np

from tinytorch.benchmarking.benchmark import Benchmark, BenchmarkSuite
from tinytorch.core.tensor import Tensor

# Create simple models for testing
class FastModel:
    name = "fast_model"

    def forward(self, x):
        return x * 2

class SlowModel:
    name = "slow_model"

    def forward(self, x):
        time.sleep(0.01)  # Simulate 10ms latency
        return x * 2

# Benchmark comparison
models = [FastModel(), SlowModel()]
benchmark = Benchmark(models, datasets=[None])

# Run latency benchmark
results = benchmark.run_latency_benchmark()
for model_name, result in results.items():
    print(f"{model_name}: {result.mean:.2f} ± {result.std:.2f}ms")
    print(f"  95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")

# Speedup calculation
fast_time = results['fast_model'].mean
slow_time = results['slow_model'].mean
speedup = slow_time / fast_time
print(f"\nSpeedup: {speedup:.2f}x")
```
Systems Thinking Questions#
Real-World Applications#
Production ML Deployment: PyTorch runs continuous benchmarking before merging optimizations—statistical rigor prevents performance regressions
Hardware Evaluation: Google’s TPU teams benchmark every architecture iteration—measurements justify billion-dollar hardware investments
Model Optimization: Meta benchmarks training efficiency (samples/sec, memory, convergence)—10% speedup saves hundreds of thousands in compute costs
Research Validation: Papers require reproducible benchmarks with statistical significance—ablation studies need fair comparison protocols
Statistical Foundations#
Central Limit Theorem: Multiple measurements → normal distribution → confidence intervals and significance testing
Measurement Uncertainty: Every measurement has variance—systematic errors (timer overhead) and random errors (thermal noise)
Statistical Power: How many runs needed for significance? Depends on effect size and variance—5% speedup requires more runs than 50%
Type I/II Errors: False positive (claiming speedup when it’s noise) vs false negative (missing real speedup due to insufficient samples)
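The "smaller effects need more runs" relationship can be made concrete with a back-of-envelope estimate (an illustrative `runs_needed` helper; it assumes roughly normal noise, ~95% confidence, and ~80% power via the standard normal approximation):

```python
# Estimate runs needed per group to detect a relative speedup `effect`
# given run-to-run relative standard deviation `cv` (noise level).
import math

def runs_needed(effect, cv, z_alpha=1.96, z_beta=0.84):
    # Two-sample normal approximation:
    # n ≈ 2 * ((z_alpha + z_beta) * cv / effect)^2 per group
    return math.ceil(2 * ((z_alpha + z_beta) * cv / effect) ** 2)

print(runs_needed(effect=0.05, cv=0.10))  # 5% speedup, 10% noise → 63
print(runs_needed(effect=0.50, cv=0.10))  # 50% speedup, 10% noise → 1
```

With 10% measurement noise, a 5% speedup needs on the order of 60+ runs per configuration to reach significance, while a 50% speedup is obvious almost immediately.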
Performance Characteristics#
Warmup Effects: First run 20% slower than steady-state—cold cache, JIT compilation, memory allocation
System Noise Sources: Thermal throttling (CPU frequency drops), OS interrupts (background tasks), memory pressure (GC pauses), network interference
Observer Effect: Instrumentation changes behavior—profiling overhead 5%, cache effects from measurement code, branch prediction altered
Hardware Variability: Optimization 3x faster on GPU but 1.1x on CPU—memory bandwidth helps GPU, CPU cache doesn’t fit data
Ready to Build?#
You’ve reached the penultimate module of the Optimization Tier. This benchmarking framework validates all your previous work from Modules 14-18, transforming subjective claims (“feels faster”) into objective data (“1.8x speedup, p < 0.01, 95% CI [1.6x, 2.0x]”).
Your benchmarking infrastructure provides the measurement foundation for Module 20’s competition workflow, where you’ll use these tools to validate optimizations systematically. Fair measurement methodology ensures your innovation is recognized—not who got lucky with thermal throttling.
Module 20 teaches how to use your benchmarking framework in a competition context—generating submissions, validating constraints, and packaging results. Your benchmarking framework measures cumulative impact with statistical rigor. This is how production ML teams validate optimizations before deployment—rigorous measurement prevents regressions and quantifies improvements.
Statistical rigor isn’t just academic formality—it’s engineering discipline. When Meta claims 10% training speedup saves hundreds of thousands in compute costs, that claim requires measurements with confidence intervals and significance testing. Your framework implements this methodology from first principles.
Choose your preferred way to engage with this module:
Run this module interactively in your browser. No installation required.
Use Google Colab for GPU access and cloud compute power.
Browse the Python source code and understand the implementation.
Save Your Progress
Binder sessions are temporary. Download your completed notebook when done, or switch to local development for persistent work.