Don't Use Large Mini-Batches, Use Local SGD

Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi
2018

Abstract

Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e. they do not show good accuracy on new data. As a remedy, we propose post-local SGD and show that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of local SGD variants.
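To illustrate the two phases described in the abstract, here is a minimal single-process sketch that simulates K workers as K model replicas; all names and hyperparameters (make_model, get_batch, T_SWITCH, H, etc.) are illustrative assumptions, not code from the paper. In the first phase the replicas are averaged after every step, which for plain SGD without momentum behaves like synchronous mini-batch SGD; after the switch point, models are averaged only every H local steps.

```python
# A minimal, single-process sketch of post-local SGD, simulating K "workers" as K
# model replicas. All names and hyperparameters are illustrative, not from the paper.
import torch
import torch.nn as nn

K, H, T, T_SWITCH, LR = 4, 8, 200, 100, 0.1  # workers, local steps, total steps, switch point, step size

def make_model():
    return nn.Linear(10, 1)

def get_batch():
    # Toy regression data standing in for each worker's local mini-batch.
    x = torch.randn(32, 10)
    return x, x.sum(dim=1, keepdim=True)

def average_models(models):
    # Parameter (model) averaging across all workers.
    with torch.no_grad():
        for params in zip(*(m.parameters() for m in models)):
            mean = torch.stack(list(params)).mean(dim=0)
            for p in params:
                p.copy_(mean)

models = [make_model() for _ in range(K)]
for m in models[1:]:                      # start every worker from identical parameters
    m.load_state_dict(models[0].state_dict())
opts = [torch.optim.SGD(m.parameters(), lr=LR) for m in models]
loss_fn = nn.MSELoss()

for t in range(T):
    for m, opt in zip(models, opts):      # each worker takes one local SGD step
        x, y = get_batch()
        opt.zero_grad()
        loss_fn(m(x), y).backward()
        opt.step()
    if t < T_SWITCH:
        # Phase 1: average after every step; for plain SGD (no momentum) this matches
        # synchronous mini-batch SGD with gradients averaged across workers.
        average_models(models)
    elif (t - T_SWITCH + 1) % H == 0:
        # Phase 2 (local SGD): average only every H steps; in between, workers run
        # independent local updates on their own batches.
        average_models(models)
```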

Code References

1 repository, 3 references

pytorch/pytorch (3 files)

torch/distributed/algorithms/model_averaging/averagers.py (1 match)
  L41: This can be used for running `post-local SGD <https://arxiv.org/abs/1808.07217>`_,

torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py (1 match)
  L25: that supports `post-local SGD <https://arxiv.org/abs/1808.07217>`_, which essentially only supports

torch/distributed/optim/post_localSGD_optimizer.py (1 match)
  L10: Wraps an arbitrary :class:`torch.optim.Optimizer` and runs `post-local SGD <https://arxiv.org/abs/1808.07217>`_,
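The files above expose post-local SGD in PyTorch as a DDP communication hook plus an optimizer wrapper. The sketch below follows PyTorch's documented usage of PostLocalSGDOptimizer and PeriodicModelAverager; it assumes an already-initialized process group, and the names `module`, `rank`, `loss_fn`, and `loader` are placeholders that would come from the surrounding training script.

```python
# Hedged usage sketch of PyTorch's post-local SGD components, modeled on the
# documented PostLocalSGDOptimizer example. Assumes torch.distributed is initialized;
# `module`, `rank`, `loss_fn`, and `loader` are placeholders.
import torch
import torch.nn as nn
from torch.distributed.algorithms.ddp_comm_hooks.post_localSGD_hook import (
    PostLocalSGDState,
    post_localSGD_hook,
)
from torch.distributed.algorithms.model_averaging.averagers import PeriodicModelAverager
from torch.distributed.optim import PostLocalSGDOptimizer

model = nn.parallel.DistributedDataParallel(module, device_ids=[rank], output_device=rank)

# Phase switch: DDP runs global gradient averaging for the first 100 steps, then only
# subgroup-level (intra-node by default) gradient averaging afterwards.
state = PostLocalSGDState(process_group=None, subgroup=None, start_localSGD_iter=100)
model.register_comm_hook(state, post_localSGD_hook)

# Wrap a local optimizer; `warmup_steps` should match `start_localSGD_iter` so that
# global model averaging (every `period` steps) only starts with the local-SGD phase.
local_optim = torch.optim.SGD(params=model.parameters(), lr=0.01)
opt = PostLocalSGDOptimizer(
    optim=local_optim,
    averager=PeriodicModelAverager(period=4, warmup_steps=100),
)

for inputs, labels in loader:  # `loader` is a placeholder DataLoader
    opt.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    opt.step()  # applies the local step, then (periodically) global model averaging
```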