Links
Day1
- MLPerf Training and Moore's Law

Competition Blog Post
Intro Talk: The AlgoPerf Benchmark / Frank Schneider / University of Tübingen

- Shampoo was 28% faster than the baseline
- Schedule-Free AdamW was 10% faster than the baseline
- On the ResNet workload, except for Generalized Adam, all submissions failed to beat the baseline
- The ResNet workload is a well-studied, well-established benchmark, so the baseline is hard to beat
- No single algorithm is the best for all workloads.
Q&A
- How was the benchmark created?
    - Pick several popular training algorithms and tune them with extensive hyperparameter tuning
Next TODO:

- correcting scaling across blocks
- correcting preconditioner staleness
Eigenvalue-corrected Shampoo
- Technical contribution
    - Enables a warm-started QR algorithm to significantly decrease the preconditioning frequency (see the sketch below)
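As a concrete illustration of the warm-started QR idea, here is a minimal NumPy sketch of warm-started orthogonal (subspace) iteration for refreshing a Shampoo-style preconditioner eigenbasis; the function name and the single-factor setup are illustrative assumptions, not the submission's actual implementation.

```python
import numpy as np

def warm_started_eigenbasis(stats, q_prev=None, num_iters=1):
    """Refresh the eigenbasis of a Shampoo-style statistics matrix.

    stats:  symmetric PSD accumulator, e.g. a running sum of G @ G.T.
    q_prev: orthogonal eigenbasis from the previous refresh (warm start).
    Returns an orthonormal basis Q approximating the eigenvectors of stats.
    """
    n = stats.shape[0]
    q = np.eye(n) if q_prev is None else q_prev
    # Orthogonal iteration: because stats drifts slowly between refreshes,
    # one or two QR steps from the previous basis are usually enough.
    for _ in range(num_iters):
        q, _ = np.linalg.qr(stats @ q)
    return q

# Illustrative usage: eigenvalue correction in the rotated basis.
rng = np.random.default_rng(0)
g = rng.standard_normal((8, 4))
stats = g @ g.T                               # left Shampoo statistic
q = warm_started_eigenbasis(stats)
corrected_eigvals = np.diag(q.T @ stats @ q)  # diagonal (eigenvalue) correction
```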
Two theory-practice mismatches
[1] Folklore: the sqrt(t) schedule is bad, and a flat schedule is worse (see the schedule sketch after this list)

    - Quoted from the linked paper's abstract: "When considering only worst-case analysis, our theory predicts that the optimal choice is the linear decay schedule where the step-size is set proportional to 1 - t/T, where t is the current iteration and T is the total number of steps. To go beyond this worst-case analysis, we use the observed gradient norms to derive schedules refined for any particular task. These refined schedules exhibit learning rate warm-up and rapid learning rate annealing near the end of training. Ours is the first systematic approach to automatically yield both of these properties. We perform the most comprehensive evaluation of learning rate schedules to date, evaluating across 10 diverse deep learning problems, a series of LLMs, and a suite of logistic regression problems. We validate that overall, the linear-decay schedule outperforms all commonly used default schedules including cosine annealing. Our adaptive schedule refinement method gives further improvements."
[2] Why doesn't Polyak averaging work well in practice?
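A worked example of the linear decay rule quoted in [1], with the step size proportional to 1 - t/T; the optional warmup is an illustrative addition, not part of the quoted result.

```python
def linear_decay_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate proportional to 1 - t/T, with optional linear warmup."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = step - warmup_steps
    T = max(total_steps - warmup_steps, 1)
    return base_lr * max(1.0 - t / T, 0.0)

# e.g. with total_steps=1000 and base_lr=0.1, the LR falls linearly from 0.1 to 0.
```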
Lightning Talks / AlgoPerf Submissions & their Follow-Ups
Niccolò Ajroldi: Weight Averaging Techniques on AlgoPerf
- Large-scale evaluation of weight averaging on AlgoPerf
- Speeds up training
- Works well, but with diminishing returns
- Cannot beat the baseline on the ResNet workload, but can beat it on other workloads
- Shampoo + LAWA works well
- Drawback
    - CPU-GPU communication is slow
- Improves generalization
- Can replace the LR schedule
- Averaging methods: LAWA, EMA (see the sketch after this list)
- Baseline: NAdamW + linear warmup + cosine decay
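A minimal sketch of the two averaging schemes listed above, EMA and LAWA (here taken as a uniform average of the last k checkpoints); the flat NumPy parameter representation and the hyperparameter values are illustrative assumptions, not the exact setup from the talk.

```python
from collections import deque
import numpy as np

class EMA:
    """Exponential moving average of model parameters."""
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.avg = np.array(params, dtype=float)

    def update(self, params):
        self.avg = self.decay * self.avg + (1.0 - self.decay) * np.asarray(params)
        return self.avg

class LAWA:
    """Latest weight averaging: uniform average of the last k checkpoints."""
    def __init__(self, k=5):
        self.buffer = deque(maxlen=k)

    def update(self, params):
        self.buffer.append(np.array(params, dtype=float))
        return np.mean(list(self.buffer), axis=0)

# Usage: after each optimizer step (or every few steps for LAWA), feed the
# current parameter vector in and evaluate with the returned averaged weights.
```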
David Tweedle: Applying Randomized Singular Value Decomposition During Training
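The talk title refers to randomized SVD; as background, here is a minimal NumPy sketch of the standard randomized SVD (Gaussian range finder, power iterations, small exact SVD), not necessarily the procedure used in the submission.

```python
import numpy as np

def randomized_svd(a, rank, oversample=10, power_iters=2, seed=0):
    """Approximate rank-`rank` SVD of `a` via a randomized range finder."""
    rng = np.random.default_rng(seed)
    m, n = a.shape
    # Sample the range of `a` with a Gaussian test matrix.
    omega = rng.standard_normal((n, rank + oversample))
    y = a @ omega
    # Power iterations sharpen the estimate when singular values decay slowly.
    for _ in range(power_iters):
        y = a @ (a.T @ y)
    q, _ = np.linalg.qr(y)
    # Project to the small subspace and take an exact SVD there.
    b = q.T @ a
    u_small, s, vt = np.linalg.svd(b, full_matrices=False)
    u = q @ u_small
    return u[:, :rank], s[:rank], vt[:rank, :]
```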
Sourabh Medapati: Lessons from competing in AlgoPerf v0.5
Roundtable Discussion
A moderated open audience discussion on "The Future of Training Algorithms"
Day2
Invited Talk I / How does Gradient Descent Work? / Jeremy Cohen - Flatiron Institute
Invited Talk II / What is the best O(n) Hessian query? / Madeleine Udell - Stanford University
Challenges in Training PINNs: A Loss Landscape Perspective
Low-rank approximation of curvature (see the sketch below)
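One common way to realize a low-rank approximation of curvature is to probe the Hessian only through Hessian-vector products (each an O(n) query) and diagonalize in the captured subspace; the sketch below assumes such an `hvp` callable and is a generic illustration, not the speaker's specific method.

```python
import numpy as np

def low_rank_hessian(hvp, n, rank, oversample=10, seed=0):
    """Low-rank eigen-approximation of a symmetric Hessian.

    hvp:  callable v -> H @ v (an O(n) Hessian query, e.g. via autodiff).
    n:    parameter dimension.
    Returns (eigvals, eigvecs) of a rank-`rank` approximation of H.
    """
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((n, rank + oversample))
    # Probe the Hessian with random vectors using only matrix-vector products.
    y = np.column_stack([hvp(omega[:, j]) for j in range(omega.shape[1])])
    q, _ = np.linalg.qr(y)
    # Project H into the captured subspace (again via HVPs) and diagonalize.
    hq = np.column_stack([hvp(q[:, j]) for j in range(q.shape[1])])
    t = q.T @ hq
    eigvals, eigvecs_small = np.linalg.eigh(t)
    idx = np.argsort(np.abs(eigvals))[::-1][:rank]
    return eigvals[idx], q @ eigvecs_small[:, idx]

# Illustrative usage with an explicit matrix standing in for the Hessian:
h = np.diag(np.arange(1.0, 101.0))
vals, vecs = low_rank_hessian(lambda v: h @ v, n=100, rank=5)
```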
Invited Talk III / Stochastic-Gradient-based Algorithms for Nonconvex Constrained Optimization and Learning / Frank E. Curtis - Lehigh University
- Michael Rabbat (Meta)
- Panel
    - Rohan Anil (GDM -> Meta)
    - Zachary Nado (GDM)
    - Hao-Jun Michael Shi (Meta)
    - Guna Lakshminarayanan (Meta -> LinkedIn)
    - …..
Panel Discussion / The Future of AlgoPerf / George Dahl - Google DeepMind
- George Dahl (Google DeepMind)
- Panel
    - Runa Eschenhagen (Meta, Cambridge)
    - Priya Kasimbeg (Google DeepMind)
    - Niccolò Ajroldi (MPI-IS)
    - Michael Rabbat (Meta)
Topics:
Other References
Acknowledgements
- Thanks to the GDM team and the Meta AI team for providing the opportunity to attend the workshop.