gradient descent - Selection of Mini-batch Size for Neural Network Regression
I am doing neural network regression with 4 features. How do I determine the size of the mini-batch for my problem? I see people use batch sizes of 100 ~ 1000 for computer vision with 32*32*3 features for each image; does that mean I should use a batch size of 1 million? I have billions of data points and tens of GB of memory, so there is no hard requirement for me not to do that.
I also observed that using a mini-batch size of ~ 1000 makes convergence much faster than a batch size of 1 million. I thought it should be the other way around, since the gradient calculated with a larger batch size is more representative of the gradient of the whole sample. Why does using a mini-batch make convergence faster?
From Tradeoff batch size vs. number of iterations to train a neural network:
From Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. https://arxiv.org/abs/1609.04836 :
The stochastic gradient descent method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, usually 32--512 data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize. There have been some attempts to investigate the cause for this generalization drop in the large-batch regime, however the precise answer for this phenomenon is, hitherto unknown. In this paper, we present ample numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions -- and that sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We also discuss several empirical strategies that help large-batch methods eliminate the generalization gap and conclude with a set of future research ideas and open questions.
[…]
The lack of generalization ability is due to the fact that large-batch methods tend to converge to sharp minimizers of the training function. These minimizers are characterized by large positive eigenvalues in $\nabla^2 f(x)$ and tend to generalize less well. In contrast, small-batch methods converge to flat minimizers characterized by small positive eigenvalues of $\nabla^2 f(x)$. We have observed that the loss function landscape of deep neural networks is such that large-batch methods are almost invariably attracted to regions with sharp minima and that, unlike small batch methods, are unable to escape basins of these minimizers.
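To make the "sharp vs. flat" distinction concrete, here is a minimal sketch that estimates the Hessian of a two-parameter loss by finite differences and reports its eigenvalues. The two quadratic losses (`sharp_loss`, `flat_loss`) and the helper names are hypothetical toys chosen for illustration; they are not taken from the paper, and a real network's Hessian is far too large to treat this way.

```python
import math

def hessian_2d(f, x, y, h=1e-3):
    """Central finite-difference Hessian of f at (x, y), returned as (fxx, fxy, fyy)."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return fxx, fxy, fyy

def eigenvalues_sym_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return mean - radius, mean + radius

# Hypothetical toy losses, both minimized at (0, 0).
def sharp_loss(x, y):
    return 50 * x**2 + 20 * y**2      # steep walls: large curvature

def flat_loss(x, y):
    return 0.5 * x**2 + 0.2 * y**2    # wide basin: small curvature

if __name__ == "__main__":
    for name, loss in (("sharp", sharp_loss), ("flat", flat_loss)):
        lo, hi = eigenvalues_sym_2x2(*hessian_2d(loss, 0.0, 0.0))
        print(f"{name}: Hessian eigenvalues ~ {lo:.2f}, {hi:.2f}")
```

Both minima have the same loss value of zero; only the curvature differs, which is the property the quoted passage ties to generalization.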
[…]
Also, insights from Ian Goodfellow answering "why not use the whole training set to compute the gradient?" on Quora:
The size of the learning rate is limited mostly by factors like how curved the cost function is. You can think of gradient descent as making a linear approximation to the cost function, then moving downhill along that approximate cost. If the cost function is highly non-linear (highly curved) then the approximation will not be very good for very far, so only small step sizes are safe. You can read more about this in Chapter 4 of the deep learning textbook, on numerical computation: http://www.deeplearningbook.org/contents/numerical.html
When you put m examples in a minibatch, you need to do O(m) computation and use O(m) memory, but you reduce the amount of uncertainty in the gradient by a factor of only O(sqrt(m)). In other words, there are diminishing marginal returns to putting more examples in the minibatch. You can read more about this in Chapter 8 of the deep learning textbook, on optimization algorithms for deep learning: http://www.deeplearningbook.org/contents/optimization.html
Also, if you think about it, even using the entire training set doesn’t really give you the true gradient. The true gradient would be the expected gradient with the expectation taken over all possible examples, weighted by the data generating distribution. Using the entire training set is just using a very large minibatch size, where the size of your minibatch is limited by the amount you spend on data collection, rather than the amount you spend on computation.
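The curvature limit on the step size can be seen in a one-dimensional sketch. Assuming the toy cost f(x) = 0.5 * c * x**2 (my choice, not from the answer), each gradient step multiplies x by (1 - learning_rate * c), so the iteration shrinks only when learning_rate < 2 / c. The function name and defaults below are illustrative.

```python
def gradient_descent_1d(curvature, learning_rate, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2; return the final |x|."""
    x = x0
    for _ in range(steps):
        grad = curvature * x              # f'(x) for the toy quadratic
        x = x - learning_rate * grad      # plain gradient step
    return abs(x)

if __name__ == "__main__":
    # Same learning rate, two curvatures: fine on the gentle bowl, divergent on the steep one.
    print("c=1,   lr=0.5   ->", gradient_descent_1d(1.0, 0.5))
    print("c=100, lr=0.5   ->", gradient_descent_1d(100.0, 0.5))
    # Shrinking the step in proportion to the curvature restores convergence.
    print("c=100, lr=0.005 ->", gradient_descent_1d(100.0, 0.005))
```

A larger batch gives a less noisy gradient, but it does not change this curvature bound, so it does not by itself permit proportionally larger steps.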
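The sqrt(m) scaling can be checked with a small simulation. This is only a sketch under a toy assumption: each per-example gradient is a scalar drawn from a normal distribution with mean 1.0 and standard deviation 1.0, and the minibatch gradient is the mean of m such draws. The function name and parameters are made up for the example.

```python
import random
import statistics

def minibatch_gradient_std(m, trials=2000, seed=0):
    """Empirical standard deviation of a minibatch-mean gradient estimate.

    Toy assumption: per-example gradients are i.i.d. N(1.0, 1.0) scalars.
    """
    rng = random.Random(seed)
    estimates = [
        statistics.fmean(rng.gauss(1.0, 1.0) for _ in range(m))
        for _ in range(trials)
    ]
    return statistics.pstdev(estimates)

if __name__ == "__main__":
    for m in (1, 4, 16, 64, 256):
        print(f"m={m:4d}  std of gradient estimate ~ {minibatch_gradient_std(m):.3f}")
```

Under this assumption, going from m = 16 to m = 256 costs 16 times more computation per step but only cuts the noise by a factor of about 4, which is the diminishing return described above.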
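Written out, with notation that is assumed here rather than taken from the answer (parameters $\theta$, per-example loss $L$, data generating distribution $p_{\text{data}}$, training set of $N$ examples), the two quantities being contrasted are:

$$
g(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}\left[\nabla_\theta L(\theta; x, y)\right],
\qquad
\hat{g}_N(\theta) = \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta L(\theta; x_i, y_i).
$$

If the training examples are i.i.d. draws from $p_{\text{data}}$, then $\mathbb{E}[\hat{g}_N(\theta)] = g(\theta)$, so the full-batch gradient is itself a sample estimate of $g$, of the same form as a minibatch gradient with $N$ in place of $m$.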
