19. Benchmarking - Fair Performance Comparison#

OPTIMIZATION TIER | Difficulty: ⭐⭐⭐ (3/4) | Time: 5-6 hours

Overview#

You’ll build a rigorous performance measurement system that enables fair comparison of all your optimizations. This module implements educational benchmarking with statistical testing, normalized metrics, and reproducible protocols. Your benchmarking framework provides the measurement methodology used in Module 20’s competition workflow, where you’ll apply these tools to validate optimizations systematically.

Learning Objectives#

By the end of this module, you will be able to:

  • Understand benchmark design principles: Reproducibility requirements; representative workload selection; measurement methodology; controlling for confounding variables; fair comparison protocols

  • Implement statistical rigor: Multiple runs with warmup periods; confidence interval calculation; variance reporting not just means; understanding measurement uncertainty; detecting outliers

  • Master fair comparison protocols: Hardware normalization strategies; environmental controls (thermal, OS noise); baseline selection criteria; same workload/data/environment enforcement; apples-to-apples measurement

  • Build normalized metrics systems: Speedup ratios (baseline_time / optimized_time); compression factors (original_size / compressed_size); accuracy preservation tracking; efficiency scores combining multiple objectives; hardware-independent reporting

  • Analyze measurement trade-offs: Benchmark coverage vs runtime cost; statistical power vs sample size requirements; reproducibility vs realism; instrumentation overhead (observer effect); when 5% speedup is significant vs noise

Build → Use → Analyze#

This module follows TinyTorch’s Build → Use → Analyze framework:

  1. Build: Implement benchmarking framework with statistical testing (confidence intervals, t-tests), normalized metrics (speedup, compression, efficiency), warmup protocols, and automated report generation

  2. Use: Benchmark all your Optimization Tier implementations (profiling, quantization, compression, memoization, acceleration) against baselines on real tasks; compare fairly with statistical rigor

  3. Analyze: Why do benchmark results vary across runs? How does hardware affect comparison fairness? When is 5% speedup statistically significant vs noise? What makes benchmarks representative vs over-fitted?

Implementation Guide#

Core Benchmarking Components#

Your benchmarking framework implements four key systems:

1. Statistical Measurement Infrastructure#

Why Multiple Runs Matter

Single measurements are meaningless in ML systems. Performance varies 10-30% across runs due to:

  • Thermal throttling: CPU frequency drops when hot

  • OS background tasks: Interrupts, garbage collection, other processes

  • Memory state: Cache coldness, fragmentation, swap pressure

  • CPU frequency scaling: Dynamic frequency adjustment

Statistical Solution

class BenchmarkResult:
    """Container for measurements with statistical analysis."""

    def __init__(self, metric_name: str, values: List[float]):
        self.mean = statistics.mean(values)
        self.std = statistics.stdev(values)
        self.median = statistics.median(values)

        # 95% confidence interval for the mean
        t_score = 1.96  # Normal approximation
        margin = t_score * (self.std / np.sqrt(len(values)))
        self.ci_lower = self.mean - margin
        self.ci_upper = self.mean + margin

What This Reveals: If confidence intervals overlap between baseline and optimized, the difference might be noise. Statistical rigor prevents false claims.

2. Warmup and Measurement Protocol#

The Warmup Problem

First run: 120ms. Second run: 100ms. Third run: 98ms. What happened?

  • Cold cache: First run pays cache miss penalties

  • JIT compilation: NumPy and frameworks compile code paths on first use

  • Memory allocation: Initial runs establish memory patterns

Warmup Solution

class Benchmark:
    def __init__(self, warmup_runs=5, measurement_runs=10):
        self.warmup_runs = warmup_runs
        self.measurement_runs = measurement_runs

    def run_latency_benchmark(self, model, input_data):
        # Warmup: stabilize performance
        for _ in range(self.warmup_runs):
            model.forward(input_data)

        # Measurement: collect statistics
        latencies = []
        for _ in range(self.measurement_runs):
            start = time.perf_counter()
            model.forward(input_data)
            latencies.append(time.perf_counter() - start)

        return BenchmarkResult("latency_ms", latencies)

Why This Matters: Warmup runs discard cold-start effects. Measurement runs capture true steady-state performance.

3. Normalized Metrics for Fair Comparison#

Hardware-Independent Speedup

# Speedup ratio: baseline_time / optimized_time
speedup = baseline_result.mean / optimized_result.mean

# Example: 100ms / 80ms = 1.25x speedup (25% faster)
# Speedup > 1.0 means optimization helped
# Speedup < 1.0 means optimization regressed

Compression Ratio

# Model size reduction
compression_ratio = original_size_mb / compressed_size_mb

# Example: 100MB / 25MB = 4x compression

Efficiency Score (Multi-Objective)

# Combine speed + size + accuracy
efficiency = (speedup * compression) / (1 + abs(accuracy_delta))

# Penalizes accuracy loss
# Rewards speed AND compression
# Single metric for ranking

Why Normalized Metrics: Speedup ratios work on any hardware. “2x faster” is meaningful whether you have M1 Mac or Intel i9. Absolute times (100ms → 50ms) are hardware-specific.

4. Comprehensive Benchmark Suite#

Multiple Benchmark Types

Your BenchmarkSuite runs:

  1. Latency Benchmark: How fast is inference? (milliseconds)

  2. Accuracy Benchmark: How correct are predictions? (0.0-1.0)

  3. Memory Benchmark: How much RAM is used? (megabytes)

  4. Energy Benchmark: How efficient is compute? (estimated joules)

Pareto Frontier Analysis

Accuracy
    ↑
    |  A ●     ← Model A: High accuracy, high latency
    |
    |    B ●  ← Model B: Balanced (Pareto optimal)
    |
    |      C ●← Model C: Low accuracy, low latency
    |__________→ Latency (lower is better)

Models on the Pareto frontier aren’t strictly dominated—each represents a valid optimization trade-off. Your suite automatically identifies these optimal points.

Real-World Benchmarking Principles#

Your implementation teaches industry-standard methodology:

Reproducibility Requirements#

Every benchmark run documents:

system_info = {
    'platform': 'macOS-14.2-arm64',  # OS version
    'processor': 'Apple M1 Max',      # CPU type
    'python_version': '3.11.6',       # Runtime
    'memory_gb': 64,                  # RAM
    'cpu_count': 10                   # Cores
}

Why: Colleague should reproduce your results given same environment. Missing details make verification impossible.

Fair Comparison Protocol#

Don’t Compare:

  • GPU-optimized code vs CPU baseline (unfair hardware)

  • Quantized INT8 vs FP32 baseline (unfair precision)

  • Batch size 32 vs batch size 1 (unfair workload)

  • Cold start vs warmed up (unfair cache state)

Do Compare:

  • Same hardware, same workload, same environment

  • Baseline vs optimized on identical conditions

  • Report speedup with confidence intervals

  • Test statistical significance (t-test, p < 0.05)

Statistical Significance Testing#

from scipy import stats

baseline_times = [100, 102, 98, 101, 99]  # ms
optimized_times = [95, 97, 93, 96, 94]

# Is the difference real or noise?
t_stat, p_value = stats.ttest_ind(baseline_times, optimized_times)

if p_value < 0.05:
    print("Statistically significant (p < 0.05)")
else:
    print("Not significant—could be noise")

Why This Matters: 5% speedup with p=0.08 isn’t significant. Could be measurement variance. Production teams don’t merge optimizations without statistical confidence.

Connection to Competition Workflow (Module 20)#

This benchmarking infrastructure provides the measurement harness used in Module 20’s competition workflow:

How Module 20 Uses This Framework

  1. Module 20 uses your Benchmark class to measure baseline and optimized performance

  2. Statistical rigor from this module ensures fair comparison across submissions

  3. Normalized metrics enable hardware-independent ranking

  4. Reproducible protocols ensure all competitors use the same measurement methodology

The Workflow

  1. Module 19: Learn benchmarking methodology (statistical rigor, fair comparison)

  2. Module 20: Apply benchmarking tools in competition workflow (submission generation, validation)

  3. Competition: Use Benchmark harness to measure and validate optimizations

Your benchmarking framework provides the foundation for fair competition—same measurement methodology, same statistical analysis, same reporting format. Module 20 teaches how to use these tools in a competition context.

Getting Started#

Prerequisites#

Ensure you understand the optimization foundations:

# Activate TinyTorch environment
source scripts/activate-tinytorch

# Verify prerequisite modules
tito test profiling
tito test quantization
tito test compression

Development Workflow#

  1. Open the development file: modules/19_benchmarking/benchmarking_dev.py

  2. Implement BenchmarkResult: Container for measurements with statistical analysis

  3. Build Benchmark class: Runner with warmup, multiple runs, metrics collection

  4. Create BenchmarkSuite: Full evaluation with latency/accuracy/memory/energy

  5. Add reporting: Automated report generation with visualizations

  6. Export and verify: tito module complete 19 && tito test benchmarking

Testing#

Comprehensive Test Suite#

Run the full test suite to verify benchmarking functionality:

# TinyTorch CLI (recommended)
tito test benchmarking

# Direct pytest execution
python -m pytest tests/ -k benchmarking -v

Test Coverage Areas#

  • Statistical Calculations: Mean, std, median, confidence intervals computed correctly

  • Multiple Runs: Warmup and measurement phases work properly

  • Normalized Metrics: Speedup, compression, efficiency calculated accurately

  • Fair Comparison: Same workload enforcement, baseline vs optimized

  • Result Serialization: BenchmarkResult converts to dict for storage

  • Visualization: Plots generate with proper formatting and error bars

  • System Info: Metadata captured for reproducibility

  • Pareto Analysis: Optimal trade-off points identified correctly

Inline Testing & Validation#

The module includes comprehensive unit tests:

🔬 Unit Test: BenchmarkResult...
 Mean calculation correct: 3.0
 Std calculation matches statistics module
 Confidence intervals bound mean
 Serialization preserves data
📈 Progress: BenchmarkResult 

🔬 Unit Test: Benchmark latency...
 Warmup runs executed before measurement
 Multiple measurement runs collected
 Results include mean ± CI
📈 Progress: Benchmark 

🔬 Unit Test: BenchmarkSuite...
 All benchmark types run (latency, accuracy, memory, energy)
 Results organized by metric type
 Visualizations generated
📈 Progress: BenchmarkSuite 

Manual Testing Examples#

from tinytorch.benchmarking.benchmark import Benchmark, BenchmarkSuite
from tinytorch.core.tensor import Tensor
import numpy as np

# Create simple models for testing
class FastModel:
    name = "fast_model"
    def forward(self, x):
        return x * 2

class SlowModel:
    name = "slow_model"
    def forward(self, x):
        import time
        time.sleep(0.01)  # Simulate 10ms latency
        return x * 2

# Benchmark comparison
models = [FastModel(), SlowModel()]
benchmark = Benchmark(models, datasets=[None])

# Run latency benchmark
results = benchmark.run_latency_benchmark()

for model_name, result in results.items():
    print(f"{model_name}: {result.mean:.2f} ± {result.std:.2f}ms")
    print(f"  95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")

# Speedup calculation
fast_time = results['fast_model'].mean
slow_time = results['slow_model'].mean
speedup = slow_time / fast_time
print(f"\nSpeedup: {speedup:.2f}x")

Systems Thinking Questions#

Real-World Applications#

  • Production ML Deployment: PyTorch runs continuous benchmarking before merging optimizations—statistical rigor prevents performance regressions

  • Hardware Evaluation: Google’s TPU teams benchmark every architecture iteration—measurements justify billion-dollar hardware investments

  • Model Optimization: Meta benchmarks training efficiency (samples/sec, memory, convergence)—10% speedup saves hundreds of thousands in compute costs

  • Research Validation: Papers require reproducible benchmarks with statistical significance—ablation studies need fair comparison protocols

Statistical Foundations#

  • Central Limit Theorem: Multiple measurements → normal distribution → confidence intervals and significance testing

  • Measurement Uncertainty: Every measurement has variance—systematic errors (timer overhead) and random errors (thermal noise)

  • Statistical Power: How many runs needed for significance? Depends on effect size and variance—5% speedup requires more runs than 50%

  • Type I/II Errors: False positive (claiming speedup when it’s noise) vs false negative (missing real speedup due to insufficient samples)

Performance Characteristics#

  • Warmup Effects: First run 20% slower than steady-state—cold cache, JIT compilation, memory allocation

  • System Noise Sources: Thermal throttling (CPU frequency drops), OS interrupts (background tasks), memory pressure (GC pauses), network interference

  • Observer Effect: Instrumentation changes behavior—profiling overhead 5%, cache effects from measurement code, branch prediction altered

  • Hardware Variability: Optimization 3x faster on GPU but 1.1x on CPU—memory bandwidth helps GPU, CPU cache doesn’t fit data

Ready to Build?#

You’ve reached the penultimate module of the Optimization Tier. This benchmarking framework validates all your previous work from Modules 14-18, transforming subjective claims (“feels faster”) into objective data (“1.8x speedup, p < 0.01, 95% CI [1.6x, 2.0x]”).

Your benchmarking infrastructure provides the measurement foundation for Module 20’s competition workflow, where you’ll use these tools to validate optimizations systematically. Fair measurement methodology ensures your innovation is recognized—not who got lucky with thermal throttling.

Module 20 teaches how to use your benchmarking framework in a competition context—generating submissions, validating constraints, and packaging results. Your benchmarking framework measures cumulative impact with statistical rigor. This is how production ML teams validate optimizations before deployment—rigorous measurement prevents regressions and quantifies improvements.

Statistical rigor isn’t just academic formality—it’s engineering discipline. When Meta claims 10% training speedup saves hundreds of thousands in compute costs, that claim requires measurements with confidence intervals and significance testing. Your framework implements this methodology from first principles.

Choose your preferred way to engage with this module:

Launch Binder

Run this module interactively in your browser. No installation required.

https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/19_benchmarking/benchmarking_dev.ipynb
View Source

Browse the Python source code and understand the implementation.

https://github.com/mlsysbook/TinyTorch/blob/main/modules/19_benchmarking/benchmarking_dev.py

Save Your Progress

Binder sessions are temporary. Download your completed notebook when done, or switch to local development for persistent work.