Hiroki Naganuma

List of Papers

Algorithms

  1. LAMB: https://arxiv.org/abs/1904.00962 (Layer-wise adaptation applied to AdamW; the shared trust-ratio idea is sketched after this list)
  2. LARS: https://arxiv.org/abs/1708.03888 (Layer-wise adaptation applied to SGD with momentum)
  3. LANS: https://arxiv.org/abs/2006.13484 (Incorporates Nesterov momentum into LAMB)
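
The common ingredient of LARS and LAMB is rescaling each layer's update by a "trust ratio" before applying it. Below is a minimal sketch of that idea; it omits momentum, weight decay, and the clipping function used in the papers, and all names are illustrative rather than taken from a reference implementation.

```python
# Minimal sketch of the layer-wise adaptation ("trust ratio") shared by LARS and
# LAMB. LARS applies it to the SGD-with-momentum direction; LAMB applies it to
# the AdamW direction. Momentum, weight decay, and the clipping function from
# the papers are omitted here for brevity.
import numpy as np

def layerwise_adapted_step(weights, raw_updates, lr=0.01, eps=1e-8):
    """Apply one step, rescaling each layer's update by ||w_l|| / ||u_l||.

    weights, raw_updates: lists of per-layer arrays of matching shapes.
    """
    new_weights = []
    for w, u in zip(weights, raw_updates):
        w_norm = np.linalg.norm(w)
        u_norm = np.linalg.norm(u)
        # Layers whose raw update is large relative to their weights get scaled
        # down; this per-layer normalization is what stabilizes very large batches.
        trust_ratio = w_norm / (u_norm + eps) if w_norm > 0 and u_norm > 0 else 1.0
        new_weights.append(w - lr * trust_ratio * u)
    return new_weights
```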

Adaptive Batch Sizes

  1. Adaptive Sampling Strategies for Stochastic Optimization: https://arxiv.org/abs/1710.11258 (Statistical tests that help determine when to increase the batch size)
  2. An Empirical Model of Large-Batch Training: https://arxiv.org/abs/1812.06162 (Discusses the gradient noise scale and the critical batch size, and how to compute them in detail; see the sketch after this list)
  3. AdaAdaGrad (Adaptive Batch Size Schemes for Adaptive Gradient Methods): https://arxiv.org/pdf/2402.11215 (Similar in spirit to #1 above, focusing more on an empirical study using vision and language models)
  4. Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters: https://arxiv.org/pdf/2108.03645
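
"An Empirical Model of Large-Batch Training" gives a concrete recipe for the simple gradient noise scale B_simple = tr(Sigma) / |G|^2, which approximates the critical batch size. The sketch below follows the paper's two-batch-size estimator; grad_fn is a hypothetical callable returning a flattened mini-batch gradient, and a practical implementation would average these estimates over many batches, as the paper does.

```python
# Sketch of the "simple" gradient noise scale B_simple = tr(Sigma) / |G|^2 from
# McCandlish et al., using their two-batch-size estimator. Single-batch estimates
# are noisy (and can even be negative); the paper smooths them with moving averages.
import numpy as np

def estimate_noise_scale(grad_fn, b_small, b_big):
    """grad_fn(batch_size) -> flattened gradient vector (hypothetical helper)."""
    sq_small = np.sum(grad_fn(b_small) ** 2)   # |G_{B_small}|^2
    sq_big = np.sum(grad_fn(b_big) ** 2)       # |G_{B_big}|^2

    # Unbiased estimates of |G|^2 (true gradient norm squared) and tr(Sigma)
    # (trace of the per-example gradient covariance), using
    # E[|G_B|^2] = |G|^2 + tr(Sigma) / B evaluated at two batch sizes.
    g_sq = (b_big * sq_big - b_small * sq_small) / (b_big - b_small)
    trace_sigma = (sq_small - sq_big) / (1.0 / b_small - 1.0 / b_big)

    # Batch sizes far above this noise scale yield diminishing returns in the
    # number of optimization steps saved.
    return trace_sigma / g_sq
```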

LBT and Hyper-Parameter Tuning

  1. A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes: https://arxiv.org/abs/2102.06356 (Argues against specialized LBT optimizers, showing that careful hyper-parameter tuning of standard optimizers is enough to make LBT work)
  2. Don't Decay the Learning Rate, Increase the Batch Size: https://arxiv.org/abs/1711.00489 (Replaces learning-rate decay with batch-size growth; see the sketch after this list)
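
The rule behind #2 above is that for SGD the gradient noise scale is roughly proportional to lr / B, so multiplying the batch size by k has approximately the same effect on the noise as dividing the learning rate by k. The schedule below is an illustrative sketch of that equivalence; the specific values are not from the paper.

```python
# Sketch of the equivalence from "Don't Decay the Learning Rate, Increase the
# Batch Size": a step-wise LR decay schedule and a step-wise batch-size growth
# schedule keep lr / B on the same decreasing trajectory. Values are illustrative.
def equivalent_schedules(base_lr=0.1, base_batch=256, factor=5, n_stages=3):
    lr_decay, batch_growth = [], []
    lr, batch = base_lr, base_batch
    for _ in range(n_stages):
        lr_decay.append({"lr": lr, "batch": base_batch})       # conventional: shrink the LR
        batch_growth.append({"lr": base_lr, "batch": batch})   # alternative: grow the batch
        lr /= factor
        batch *= factor
    return lr_decay, batch_growth

if __name__ == "__main__":
    for a, b in zip(*equivalent_schedules()):
        # lr / batch matches across the two schedules at every stage.
        print(a, b, a["lr"] / a["batch"], b["lr"] / b["batch"])
```

The practical payoff noted in the paper is that the batch-size schedule performs far fewer parameter updates over the same number of training epochs.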

Survey

  1. Large-Scale Deep Learning Optimizations: A Comprehensive Survey: https://arxiv.org/abs/2111.00856 (Recommend reading Sections 4 and 5 for a summary of LBT approaches and challenges)
  2. Mustafa et al., large-batch training slides: https://indico.physics.lbl.gov/event/805/contributions/2901/attachments/1696/2051/2018-11-07-Large-Batch-Training-Mustafa.pdf

Revisiting Norm Choices for LBT Metrics

  1. The Geometry of Sign Gradient Descent: https://arxiv.org/pdf/2002.08056 (Does viewing the gradient under norms other than the l2 norm explain LBT behavior better? See the sketch after this item)
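
The paper's framing is that sign-based updates are steepest descent with respect to the maximum norm rather than the l2 norm. A minimal toy contrast of the two update rules, assuming a simple ill-conditioned quadratic that is not from the paper:

```python
# Steepest descent under the l2 norm (ordinary gradient descent) vs. under the
# l-infinity norm (sign gradient descent), the geometric framing used in
# "The Geometry of Sign Gradient Descent". The quadratic below is a toy example.
import numpy as np

A = np.diag([100.0, 1.0])                 # ill-conditioned quadratic f(x) = 0.5 x^T A x
grad = lambda x: A @ x

def gd_step(x, lr=0.009):                 # steepest descent w.r.t. the l2 norm
    return x - lr * grad(x)

def sign_gd_step(x, lr=0.05):             # steepest descent w.r.t. the l-infinity norm
    return x - lr * np.sign(grad(x))

x_gd = x_sign = np.array([1.0, 1.0])
for _ in range(50):
    x_gd, x_sign = gd_step(x_gd), sign_gd_step(x_sign)
# GD is slow along the flat direction; sign GD shrinks both coordinates at the
# same rate and then oscillates near the optimum.
print("GD:", x_gd, "sign GD:", x_sign)
```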