Convergence of stochastic gradient descent under a local Łojasiewicz condition for deep neural networks
We extend the global convergence result of Chatterjee <cit.> by considering stochastic gradient descent (SGD) for non-convex objective functions. With minimal additional assumptions that can be realized by finitely wide neural networks, we prove that if the iterates are initialized inside a local region where the Łojasiewicz condition holds, then with positive probability the stochastic gradient iterates converge to a global minimum inside this region. A key component of our proof is to ensure that the whole trajectory of SGD stays inside the local region with positive probability. To this end, we assume the SGD noise scales with the objective function, a property often referred to as machine learning noise that is achievable in many real examples. Furthermore, we provide a negative argument showing why boundedness of the noise together with Robbins-Monro type step sizes is not enough to keep the whole trajectory inside the local region.
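As a minimal illustrative sketch (not the paper's construction or proof), the snippet below runs SGD on a toy one-dimensional non-convex objective whose stochastic-gradient noise has variance proportional to the current loss value, mimicking the machine-learning-noise assumption described in the abstract: the noise vanishes as the iterate approaches a global minimum. All names and constants here (the toy loss, the noise scale, the step size, the initialization) are hypothetical choices for illustration only.

```python
# Sketch under assumed, illustrative choices: SGD with loss-proportional noise.
import numpy as np

rng = np.random.default_rng(0)

def loss(x):
    # Non-convex, non-negative toy objective with global minimum 0 at x = 0.
    return x**2 * (1.0 + 0.5 * np.sin(x))

def grad(x):
    # Exact derivative of the toy objective above.
    return 2.0 * x * (1.0 + 0.5 * np.sin(x)) + 0.5 * x**2 * np.cos(x)

def stochastic_grad(x, noise_scale=1.0):
    # Gradient corrupted by noise whose variance scales with the loss itself,
    # so the noise vanishes as the iterate approaches the global minimum.
    return grad(x) + noise_scale * np.sqrt(loss(x)) * rng.standard_normal()

x = 1.5          # initialize inside a local neighborhood of the minimizer
eta = 0.05       # constant step size, chosen for illustration
for _ in range(2000):
    x -= eta * stochastic_grad(x)

print(f"final iterate {x:.4f}, final loss {loss(x):.2e}")
```

In this toy setting the noise magnitude shrinks with the loss, which is what allows the iterates to remain near the minimizer; with noise that is merely bounded, the abstract's negative argument indicates that Robbins-Monro step sizes alone do not guarantee the trajectory stays inside the local region.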