Links
Day1
- MLPerf Training and Moore's Law

Competition Blog Post
Intro Talk: The AlgoPerf Benchmark / Frank Schneider / University of Tübingen

- Shampoo was 28% faster than the baseline
- Schedule-Free AdamW was 10% faster than the baseline
- On the ResNet workload, except for Generalized Adam, all submissions failed to beat the baseline
- The ResNet workload is a well-studied, well-established benchmark, so the baseline is hard to beat
- No single algorithm is the best for all workloads.
Q&A
- How was the benchmark created?
    - Pick several popular training algorithms and tune them with extensive hyperparameter tuning
Next TODO:

- correcting scaling across blocks
- correcting preconditioner staleness
Eigenvalue-corrected Shampoo
- Technical contribution
    - Enables a warm-started QR algorithm to significantly decrease the preconditioning frequency (see the sketch below)
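As a concrete illustration of the warm-started QR idea, here is a minimal NumPy sketch of warm-started orthogonal (subspace) iteration for refreshing a Shampoo-style preconditioner eigenbasis; the function name and the single-factor setup are illustrative assumptions, not the submission's actual implementation.

```python
import numpy as np

def warm_started_eigenbasis(stats, q_prev=None, num_iters=1):
    """Refresh the eigenbasis of a Shampoo-style statistics matrix.

    stats:  symmetric PSD accumulator, e.g. a running sum of G @ G.T.
    q_prev: orthogonal eigenbasis from the previous refresh (warm start).
    Returns an orthonormal basis Q approximating the eigenvectors of stats.
    """
    n = stats.shape[0]
    q = np.eye(n) if q_prev is None else q_prev
    # Orthogonal iteration: because stats drifts slowly between refreshes,
    # one or two QR steps from the previous basis are usually enough.
    for _ in range(num_iters):
        q, _ = np.linalg.qr(stats @ q)
    return q

# Illustrative usage: eigenvalue correction in the rotated basis.
rng = np.random.default_rng(0)
g = rng.standard_normal((8, 4))
stats = g @ g.T                               # left Shampoo statistic
q = warm_started_eigenbasis(stats)
corrected_eigvals = np.diag(q.T @ stats @ q)  # diagonal (eigenvalue) correction
```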
Two theory-practice mismatches
[1] Folklore: the sqrt(t) schedule is bad, and a flat schedule is worse (see the schedule sketch after this list)

    - Quoted from the linked paper's abstract: "When considering only worst-case analysis, our theory predicts that the optimal choice is the linear decay schedule where the step-size is set proportional to 1 - t/T, where t is the current iteration and T is the total number of steps. To go beyond this worst-case analysis, we use the observed gradient norms to derive schedules refined for any particular task. These refined schedules exhibit learning rate warm-up and rapid learning rate annealing near the end of training. Ours is the first systematic approach to automatically yield both of these properties. We perform the most comprehensive evaluation of learning rate schedules to date, evaluating across 10 diverse deep learning problems, a series of LLMs, and a suite of logistic regression problems. We validate that overall, the linear-decay schedule outperforms all commonly used default schedules including cosine annealing. Our adaptive schedule refinement method gives further improvements."
[2] Why doesn't Polyak averaging work well in practice?
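A worked example of the linear decay rule quoted in [1], with the step size proportional to 1 - t/T; the optional warmup is an illustrative addition, not part of the quoted result.

```python
def linear_decay_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate proportional to 1 - t/T, with optional linear warmup."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = step - warmup_steps
    T = max(total_steps - warmup_steps, 1)
    return base_lr * max(1.0 - t / T, 0.0)

# e.g. with total_steps=1000 and base_lr=0.1, the LR falls linearly from 0.1 to 0.
```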
Lightning Talks / AlgoPerf Submissions & their Follow-Ups
Niccolò Ajroldi: Weight Averaging Techniques on AlgoPerf
- Large-scale evaluation of weight averaging on AlgoPerf
- Speeds up training
- Works well, but with diminishing returns
- Cannot beat the baseline on the ResNet workload, but can beat it on other workloads
- Shampoo + LAWA works well
- Drawback
    - CPU-GPU communication is slow
- Improves generalization
- Can replace the LR schedule
- Averaging methods: LAWA, EMA (see the sketch after this list)
- Baseline: NAdamW + linear warmup + cosine decay
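A minimal sketch of the two averaging schemes listed above, EMA and LAWA (here taken as a uniform average of the last k checkpoints); the flat NumPy parameter representation and the hyperparameter values are illustrative assumptions, not the exact setup from the talk.

```python
from collections import deque
import numpy as np

class EMA:
    """Exponential moving average of model parameters."""
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.avg = np.array(params, dtype=float)

    def update(self, params):
        self.avg = self.decay * self.avg + (1.0 - self.decay) * np.asarray(params)
        return self.avg

class LAWA:
    """Latest weight averaging: uniform average of the last k checkpoints."""
    def __init__(self, k=5):
        self.buffer = deque(maxlen=k)

    def update(self, params):
        self.buffer.append(np.array(params, dtype=float))
        return np.mean(list(self.buffer), axis=0)

# Usage: after each optimizer step (or every few steps for LAWA), feed the
# current parameter vector in and evaluate with the returned averaged weights.
```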
David Tweedle: Applying Randomized Singular Value Decomposition During Training
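The talk title refers to randomized SVD; as background, here is a minimal NumPy sketch of the standard randomized SVD (Gaussian range finder, power iterations, small exact SVD), not necessarily the procedure used in the submission.

```python
import numpy as np

def randomized_svd(a, rank, oversample=10, power_iters=2, seed=0):
    """Approximate rank-`rank` SVD of `a` via a randomized range finder."""
    rng = np.random.default_rng(seed)
    m, n = a.shape
    # Sample the range of `a` with a Gaussian test matrix.
    omega = rng.standard_normal((n, rank + oversample))
    y = a @ omega
    # Power iterations sharpen the estimate when singular values decay slowly.
    for _ in range(power_iters):
        y = a @ (a.T @ y)
    q, _ = np.linalg.qr(y)
    # Project to the small subspace and take an exact SVD there.
    b = q.T @ a
    u_small, s, vt = np.linalg.svd(b, full_matrices=False)
    u = q @ u_small
    return u[:, :rank], s[:rank], vt[:rank, :]
```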
Sourabh Medapati: Lessons from competing in AlgoPerf v0.5
Roundtable Discussion
A moderated open audience discussion on "The Future of Training Algorithms"
Day2
Invited Talk I / How does Gradient Descent Work? / Jeremy Cohen - Flatiron Institute
Invited Talk II / What is the best O(n) Hessian query? / Madeleine Udell - Stanford University
Challenges in Training PINNs: A Loss Landscape Perspective
Low-rank approximation of curvature (see the sketch below)
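One common way to realize a low-rank approximation of curvature is to probe the Hessian only through Hessian-vector products (each an O(n) query) and diagonalize in the captured subspace; the sketch below assumes such an `hvp` callable and is a generic illustration, not the speaker's specific method.

```python
import numpy as np

def low_rank_hessian(hvp, n, rank, oversample=10, seed=0):
    """Low-rank eigen-approximation of a symmetric Hessian.

    hvp:  callable v -> H @ v (an O(n) Hessian query, e.g. via autodiff).
    n:    parameter dimension.
    Returns (eigvals, eigvecs) of a rank-`rank` approximation of H.
    """
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((n, rank + oversample))
    # Probe the Hessian with random vectors using only matrix-vector products.
    y = np.column_stack([hvp(omega[:, j]) for j in range(omega.shape[1])])
    q, _ = np.linalg.qr(y)
    # Project H into the captured subspace (again via HVPs) and diagonalize.
    hq = np.column_stack([hvp(q[:, j]) for j in range(q.shape[1])])
    t = q.T @ hq
    eigvals, eigvecs_small = np.linalg.eigh(t)
    idx = np.argsort(np.abs(eigvals))[::-1][:rank]
    return eigvals[idx], q @ eigvecs_small[:, idx]

# Illustrative usage with an explicit matrix standing in for the Hessian:
h = np.diag(np.arange(1.0, 101.0))
vals, vecs = low_rank_hessian(lambda v: h @ v, n=100, rank=5)
```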
Invited Talk III / Stochastic-Gradient-based Algorithms for Nonconvex Constrained Optimization and Learning / Frank E. Curtis - Lehigh University
- Michael Rabbat (Meta)
- Panel
    - Rohan Anil (GDM -> Meta)
    - Zachary Nado (GDM)
    - Hao-Jun Michael Shi (Meta)
    - Guna Lakshminarayanan (Meta -> LinkedIn)
    - …..
Panel Discussion / The Future of AlgoPerf / George Dahl - Google DeepMind
- George Dahl (Google DeepMind)
- Panel
    - Runa Eschenhagen (Meta, Cambridge)
    - Priya Kasimbeg (Google DeepMind)
    - Niccolò Ajroldi (MPI-IS)
    - Michael Rabbat (Meta)
Topics:
Other References
Acknowledgements
- Thanks to the GDM team and the Meta AI team for providing the opportunity to attend the workshop.