Hiroki Naganuma

List of Papers

Adaptive Sampling Strategies for Stochastic Optimization: https://arxiv.org/abs/1710.11258 (Statistical tests that can help determine when to increase batch size)
An Empirical Model of Large-Batch Training: https://arxiv.org/abs/1812.06162 (Discusses the gradient noise scale and the critical batch size in detail, including how to compute them)
AdAdaGrad (Adaptive Batch Size Schemes for Adaptive Gradient Methods): https://arxiv.org/pdf/2402.11215 (same as #4, focusing more on an empirical study using vision and language models)

Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters: https://arxiv.org/pdf/2108.03645

A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes: https://arxiv.org/abs/2102.06356 (Argues against customized LBT optimizers, viewing HP tuning as the means to make LBT work)
Don’t Decay the Learning Rate, Increase the Batch Size: https://arxiv.org/abs/1711.00489

Large-Scale Deep Learning Optimizations: A Comprehensive Survey: https://arxiv.org/abs/2111.00856 (Recommend reading Sections 4 and 5 for a summary of LBT approaches and challenges)
Mustafa et al.: https://indico.physics.lbl.gov/event/805/contributions/2901/attachments/1696/2051/2018-11-07-Large-Batch-Training-Mustafa.pdf

The Geometry of Sign Gradient Descent: https://arxiv.org/pdf/2002.08056 (Does viewing the gradient in terms of other norms bring out LBT behavior better?)