# Journey Through ML History

Experience the evolution of AI by rebuilding history's most important breakthroughs with YOUR TinyTorch implementations.
## What Are Milestones?

Milestones are proof-of-mastery demonstrations that showcase what you can build after completing specific modules. Each milestone recreates a historically significant ML achievement using YOUR implementations.
## Why This Approach?

- **Deep Understanding**: Experience the actual challenges researchers faced
- **Progressive Learning**: Each milestone builds on previous foundations
- **Real Achievements**: Not toy examples - these are historically significant breakthroughs
- **Systems Thinking**: Understand WHY each innovation mattered for ML systems
## Two Dimensions of Your Progress

As you build TinyTorch, you're progressing along TWO dimensions simultaneously:
### Pedagogical Dimension (Acts): What You're LEARNING

- **Act I (01-04)**: Building atomic components - mathematical foundations
- **Act II (05-07)**: The gradient revolution - systems that learn
- **Act III (08-09)**: Real-world complexity - data and scale
- **Act IV (10-13)**: Sequential intelligence - language understanding
- **Act V (14-19)**: Production systems - optimization and deployment
- **Act VI (20)**: Complete integration - unified AI systems
See The Learning Journey for the complete pedagogical narrative explaining WHY modules flow this way.
### Historical Dimension (Milestones): What You CAN Build

- **1957: Perceptron** - Binary classification
- **1969: XOR** - Non-linear learning
- **1986: MLP** - Multi-class vision
- **1998: CNN** - Spatial intelligence
- **2017: Transformers** - Language generation
- **2018: Torch Olympics** - Production optimization
### How They Connect

```mermaid
graph TB
    subgraph "Pedagogical Acts (What You're Learning)"
        A1["Act I: Foundation<br/>Modules 01-04<br/>Atomic Components"]
        A2["Act II: Learning<br/>Modules 05-07<br/>Gradient Revolution"]
        A3["Act III: Data & Scale<br/>Modules 08-09<br/>Real-World Complexity"]
        A4["Act IV: Language<br/>Modules 10-13<br/>Sequential Intelligence"]
        A5["Act V: Production<br/>Modules 14-19<br/>Optimization"]
        A6["Act VI: Integration<br/>Module 20<br/>Complete Systems"]
    end
    subgraph "Historical Milestones (What You Can Build)"
        M1["1957: Perceptron<br/>Binary Classification"]
        M2["1969: XOR Crisis<br/>Non-linear Learning"]
        M3["1986: MLP<br/>Multi-class Vision<br/>95%+ MNIST"]
        M4["1998: CNN<br/>Spatial Intelligence<br/>75%+ CIFAR-10"]
        M5["2017: Transformers<br/>Language Generation"]
        M6["2018: Torch Olympics<br/>Production Speed"]
    end
    A1 --> M1
    A2 --> M2
    A2 --> M3
    A3 --> M4
    A4 --> M5
    A5 --> M6
    style A1 fill:#e3f2fd
    style A2 fill:#fff8e1
    style A3 fill:#e8f5e9
    style A4 fill:#f3e5f5
    style A5 fill:#fce4ec
    style A6 fill:#fff3e0
    style M1 fill:#ffcdd2
    style M2 fill:#f8bbd0
    style M3 fill:#e1bee7
    style M4 fill:#d1c4e9
    style M5 fill:#c5cae9
    style M6 fill:#bbdefb
```
| Learning Act | Unlocked Milestone | Proof of Mastery |
|---|---|---|
| Act I: Foundation (01-04) | 1957 Perceptron | Your Linear layer recreates history |
| Act II: Learning (05-07) | 1969 XOR + 1986 MLP | Your autograd enables training (95%+ MNIST) |
| Act III: Data & Scale (08-09) | 1998 CNN | Your Conv2d achieves 75%+ on CIFAR-10 |
| Act IV: Language (10-13) | 2017 Transformers | Your attention generates coherent text |
| Act V: Production (14-18) | 2018 Torch Olympics | Your optimizations achieve production speed |
| Act VI: Integration (19-20) | Benchmarking + Capstone | Your complete framework competes |
**Understanding Both Dimensions**: The Acts explain WHY you're building each component (pedagogical progression). The Milestones prove WHAT you've built works (historical validation). Together, they show you're not just completing exercises - you're building something real.
## The Timeline

```mermaid
timeline
    title Journey Through ML History
    1957 : Perceptron : Binary classification with gradient descent
    1969 : XOR Crisis : Hidden layers solve non-linear problems
    1986 : MLP Revival : Backpropagation enables deep learning
    1998 : CNN Era : Spatial intelligence for computer vision
    2017 : Transformers : Attention revolutionizes language AI
    2018 : Torch Olympics : Production benchmarking and optimization
```
## 01. Perceptron (1957) - Rosenblatt

*After Modules 02-04*

`Input → Linear → Sigmoid → Output`

**The Beginning**: The first trainable neural network. Frank Rosenblatt proved machines could learn from data.

**What You'll Build**:

- Binary classification with gradient descent
- Simple but revolutionary architecture
- YOUR Linear layer recreates history

**Systems Insights**:

- Memory: O(n) parameters
- Compute: O(n) operations
- Limitation: Only linearly separable problems

```bash
cd milestones/01_1957_perceptron
python 01_rosenblatt_forward.py   # See the problem (random weights)
python 02_rosenblatt_trained.py   # See the solution (trained)
```

**Expected Results**: ~50% (untrained) → 95%+ (trained) accuracy
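The whole Linear → Sigmoid → gradient-descent pipeline fits in a few lines. Here is a standalone NumPy sketch (illustrative only, not the TinyTorch API; the toy dataset and learning rate are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data: label is 1 when x0 + x1 > 0
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)   # the perceptron's only parameters:
b = 0.0           # one weight per input, plus a bias -> O(n) memory

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for _ in range(200):
    p = sigmoid(X @ w + b)            # forward: Linear -> Sigmoid
    grad = p - y                      # dL/dz for binary cross-entropy
    w -= lr * (X.T @ grad) / len(X)   # gradient descent on weights
    b -= lr * grad.mean()             # ... and on the bias

acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

Because the data is linearly separable, this converges to near-perfect accuracy; on XOR (the next milestone) the identical loop stalls at chance level.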
## 02. XOR Crisis (1969) - Minsky & Papert

*After Modules 02-06*

`Input → Linear → ReLU → Linear → Output`

**The Challenge**: Minsky and Papert proved single-layer perceptrons couldn't solve XOR. This crisis nearly ended neural network research.

**What You'll Build**:

- Hidden layers enable non-linear solutions
- Multi-layer networks break through limitations
- YOUR autograd makes it possible

**Systems Insights**:

- Memory: O(n²) with hidden layers
- Compute: O(n²) operations
- Breakthrough: Hidden representations

```bash
cd milestones/02_1969_xor
python 01_xor_crisis.py   # Watch it fail (loss stuck at 0.69)
python 02_xor_solved.py   # Hidden layers solve it!
```

**Expected Results**: 50% (single layer) → 100% (multi-layer) on XOR
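The two-layer breakthrough can be sketched in plain NumPy with manual backpropagation (a standalone illustration, not the TinyTorch autograd API; the hidden width, learning rate, and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# The four XOR points: no single line separates the 1s from the 0s
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# A hidden layer of 8 ReLU units lets the network bend its boundary
W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)
lr = 0.5

for _ in range(2000):
    h = np.maximum(0, X @ W1 + b1)          # Linear -> ReLU
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))    # Linear -> Sigmoid
    dz2 = p - y                             # grad of BCE loss w.r.t. logits
    dh = dz2 @ W2.T                         # backprop into the hidden layer
    dz1 = dh * (h > 0)                      # ReLU gradient mask
    W2 -= lr * (h.T @ dz2) / 4; b2 -= lr * dz2.mean(0)
    W1 -= lr * (X.T @ dz1) / 4; b1 -= lr * dz1.mean(0)

preds = (p > 0.5).astype(int).ravel()
print(preds)  # should approach the XOR truth table 0, 1, 1, 0
```

Note the memory jump: the single perceptron needed 3 parameters; this tiny hidden layer already needs 33, the O(n²) growth called out above.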
## 03. MLP Revival (1986) - Backpropagation Era

*After Modules 02-08*

`Images → Flatten → Linear → ReLU → Linear → ReLU → Linear → Classes`

**The Revolution**: Backpropagation enabled training deep networks on real datasets like MNIST.

**What You'll Build**:

- Multi-class digit recognition
- Complete training pipelines
- YOUR optimizers achieve 95%+ accuracy

**Systems Insights**:

- Memory: ~100K parameters for MNIST
- Compute: Dense matrix operations
- Architecture: Multi-layer feature learning

```bash
cd milestones/03_1986_mlp
python 01_rumelhart_tinydigits.py   # 8x8 digits (quick)
python 02_rumelhart_mnist.py        # Full MNIST
```

**Expected Results**: 95%+ accuracy on MNIST
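The ~100K-parameter figure is easy to verify with a back-of-the-envelope count for a hypothetical 784 → 128 → 64 → 10 stack (these layer sizes are illustrative; the milestone's actual architecture may differ):

```python
# Each Linear layer holds in_features x out_features weights plus out_features biases
layers = [(784, 128), (128, 64), (64, 10)]   # hypothetical MNIST MLP
params = sum(i * o + o for i, o in layers)
print(params)  # 109386 - the first layer alone accounts for ~100K of it
```

Notice that almost all the memory sits in the first layer touching the flattened 28×28 image, which is exactly the cost the CNN milestone's weight sharing attacks next.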
## 04. CNN Revolution (1998) - LeCun's Breakthrough

*After Modules 02-09* • 🎯 **North Star Achievement**

`Images → Conv → ReLU → Pool → Conv → ReLU → Pool → Flatten → Linear → Classes`

**The Game-Changer**: CNNs exploit spatial structure for computer vision. This enabled modern AI.

**What You'll Build**:

- Convolutional feature extraction
- Natural image classification (CIFAR-10)
- YOUR Conv2d + MaxPool2d unlock spatial intelligence

**Systems Insights**:

- Memory: ~1M parameters (weight sharing reduces vs dense)
- Compute: Convolution is intensive but parallelizable
- Architecture: Local connectivity + translation invariance

```bash
cd milestones/04_1998_cnn
python 01_lecun_tinydigits.py   # Spatial features on digits
python 02_lecun_cifar10.py      # CIFAR-10 @ 75%+ accuracy
```

**Expected Results**: 75%+ accuracy on CIFAR-10 ✨
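The standard output-size formula makes the Conv → Pool → Conv → Pool shapes concrete, and a quick count shows why weight sharing matters (the filter counts and kernel sizes below are illustrative, not the milestone's exact configuration):

```python
def conv2d_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution/pooling window (standard formula)."""
    return (size + 2 * pad - kernel) // stride + 1

# CIFAR-10 input is 32x32: Conv(3x3) -> Pool(2x2) -> Conv(3x3) -> Pool(2x2)
s = conv2d_out(32, 3)            # 30 after the first conv
s = conv2d_out(s, 2, stride=2)   # 15 after max pooling
s = conv2d_out(s, 3)             # 13 after the second conv
s = conv2d_out(s, 2, stride=2)   # 6: the feature maps reaching Flatten
print(s)

# Weight sharing: 32 filters of 3x3x3 (plus biases) vs a dense layer
# producing the same number of first-layer activations
conv_params = 32 * (3 * 3 * 3 + 1)
dense_params = (32 * 32 * 3) * (30 * 30 * 32)
print(conv_params, dense_params)  # 896 vs 88,473,600
```

The same 896 shared weights slide over every image location, which is both the memory savings and the translation invariance called out above.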
## 05. Transformer Era (2017) - Attention Revolution

*After Modules 02-13*

`Tokens → Embeddings → Attention → FFN → ... → Attention → Output`

**The Modern Era**: Transformers + attention launched the LLM revolution (GPT, BERT, ChatGPT).

**What You'll Build**:

- Self-attention mechanisms
- Autoregressive text generation
- YOUR attention implementation generates language

**Systems Insights**:

- Memory: O(n²) attention requires careful management
- Compute: Highly parallelizable
- Architecture: Long-range dependencies

```bash
cd milestones/05_2017_transformer
python 01_vaswani_generation.py   # Q&A generation with TinyTalks
python 02_vaswani_dialogue.py     # Multi-turn dialogue
```

**Expected Results**: Loss < 1.5, coherent responses to questions
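The core of the milestone, scaled dot-product attention, is compact enough to sketch in NumPy (a single head, no masking or projections; shapes and sizes are illustrative and this is not the TinyTorch API):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (n, n): the O(n^2) memory cost
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                            # mix values by relevance

rng = np.random.default_rng(0)
n, d = 5, 16                                      # 5 tokens, 16-dim embeddings
Q, K, V = rng.normal(size=(3, n, d))
out = attention(Q, K, V)
print(out.shape)  # (5, 16)
```

The (n, n) `scores` matrix is the O(n²) term in the Systems Insights above: every token attends to every other token, which is what the KV-cache optimizations in the next milestone exist to tame.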
## 06. Torch Olympics Era (2018) - The Optimization Revolution

*After Modules 14-18*

`Profile → Compress → Accelerate`

**The Turning Point**: As models grew larger, MLCommons' Torch Olympics (2018) established systematic optimization as a discipline - profiling, compression, and acceleration became essential for deployment.

**What You'll Build**:

- Performance profiling and bottleneck analysis
- Model compression (quantization + pruning)
- Inference acceleration (KV-cache + batching)

**Systems Insights**:

- Memory: 4-16× compression through quantization/pruning
- Speed: 12-40× faster generation with KV-cache + batching
- Workflow: Systematic "measure → optimize → validate" methodology

```bash
cd milestones/06_2018_mlperf
python 01_baseline_profile.py   # Find bottlenecks
python 02_compression.py        # Reduce size (quantize + prune)
python 03_generation_opts.py    # Speed up inference (cache + batch)
```

**Expected Results**: 8-16× smaller models, 12-40× faster inference
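Where the 4× baseline of the compression range comes from is easy to see with symmetric per-tensor int8 quantization, one common scheme (a minimal sketch under that assumption; the milestone scripts may use a different recipe):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: float32 -> int8 plus one float scale."""
    scale = np.abs(w).max() / 127.0          # map the largest weight to 127
    q = np.round(w / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in weight matrix

q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale         # dequantize before matmuls

ratio = w.nbytes / q.nbytes                  # 4 bytes -> 1 byte per weight
err = np.abs(w - w_hat).max()                # worst-case rounding error
print(f"compression: {ratio:.0f}x, max abs error: {err:.4f}")
```

This is the "measure → optimize → validate" loop in miniature: measure the size, apply the optimization, then validate that the reconstruction error is small enough not to hurt accuracy. Pruning stacks on top of this to reach the higher end of the 4-16× range.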
## Learning Philosophy

### Progressive Capability Building

| Stage | Era | Capability | Your Tools |
|---|---|---|---|
| 1957 | Foundation | Binary classification | Linear + Sigmoid |
| 1969 | Depth | Non-linear problems | Hidden layers + Autograd |
| 1986 | Scale | Multi-class vision | Optimizers + Training |
| 1998 | Structure | Spatial understanding | Conv2d + Pooling |
| 2017 | Attention | Sequence modeling | Transformers + Attention |
| 2018 | Optimization | Production deployment | Profiling + Compression + Acceleration |
### Systems Engineering Progression

Each milestone teaches critical systems thinking:

- **Memory Management**: From O(n) → O(n²) → O(n²) with optimizations
- **Computational Trade-offs**: Accuracy vs efficiency
- **Architectural Patterns**: How structure enables capability
- **Production Deployment**: What it takes to scale
## How to Use Milestones

### 1. Complete Prerequisites

```bash
# Check which modules you've completed
tito checkpoint status

# Complete required modules
tito module complete 02_tensor
tito module complete 03_activations
# ... and so on
```

### 2. Run the Milestone

```bash
cd milestones/01_1957_perceptron
python 02_rosenblatt_trained.py
```

### 3. Understand the Systems

Each milestone includes:

- 📊 **Memory profiling**: See actual memory usage
- ⚡ **Performance metrics**: FLOPs, parameters, timing
- 🧠 **Architectural analysis**: Why this design matters
- 📈 **Scaling insights**: How performance changes with size

### 4. Reflect and Compare

Questions to ask:

- How does this compare to modern architectures?
- What were the computational constraints in that era?
- How would you optimize this for production?
- What patterns appear in PyTorch/TensorFlow?
## Quick Reference

### Milestone Prerequisites

| Milestone | After Module | Key Requirements |
|---|---|---|
| 01. Perceptron (1957) | 04 | Tensor, Activations, Layers |
| 02. XOR (1969) | 06 | + Losses, Autograd |
| 03. MLP (1986) | 08 | + Optimizers, Training |
| 04. CNN (1998) | 09 | + Spatial, DataLoader |
| 05. Transformer (2017) | 13 | + Tokenization, Embeddings, Attention |
| 06. Torch Olympics (2018) | 18 | + Profiling, Quantization, Compression, Memoization, Acceleration |
### What Each Milestone Proves

- **Your implementations work** - Not just toy code
- **Historical significance** - These breakthroughs shaped modern AI
- **Systems understanding** - You know memory, compute, scaling
- **Production relevance** - Patterns used in real ML frameworks
## Further Learning

After completing milestones, explore:

- **Torch Olympics Competition**: Optimize your implementations
- **Leaderboard**: Compare with other students
- **Capstone Projects**: Build your own ML applications
- **Research Papers**: Read the original papers for each milestone
## Why This Matters

Most courses teach you to USE frameworks. TinyTorch teaches you to UNDERSTAND them.

By rebuilding ML history, you gain:

- 🧠 Deep intuition for how neural networks work
- 🔧 Systems thinking for production ML
- 🏆 Portfolio projects demonstrating mastery
- 💼 Preparation for ML systems engineering roles

Ready to start your journey through ML history?

```bash
cd milestones/01_1957_perceptron
python 02_rosenblatt_trained.py
```

Build the future by understanding the past. 🚀