From: Tao Jiang
Title: Statistically Efficient Stochastic Gradient Methods
Abstract: The presence of heavy-tailed noise has become a major obstacle to obtaining good performance with stochastic gradient methods in modern deep learning applications. In the 1980s, Polyak and Tsypkin developed a framework of nonlinear stochastic gradient methods to address such problems, in which the stochastic gradients first pass through a nonlinear mapping, determined by the probability distribution of the noise, before being used to update the optimization variable. These methods are statistically efficient, or optimal, in the sense that their asymptotic performance achieves the Cramér-Rao lower bound. However, beyond a few special cases that admit simple analytical forms, the method is hard to apply because the noise distribution is difficult to characterize in practice.

We make statistically efficient stochastic gradient methods practical with the following contributions. Theoretically, we extend the framework of Polyak and Tsypkin to the non-asymptotic regime, showing that the same nonlinear mapping attains the optimal statistical performance; in addition, we relax the requirement that the nonlinear mapping be monotone, thus allowing noise with heavier tails than the Laplacian. Empirically, we discover that the family of Student-t distributions gives good approximations of the heavy-tailed noise in many deep learning applications. We derive a simple nonlinear mapping that is optimal for the Student-t distribution and has only a couple of parameters, which can be estimated efficiently online. Numerical experiments demonstrate that this algorithm matches state-of-the-art performance on certain deep learning tasks.
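
The abstract does not include code; the following is a minimal sketch of the kind of nonlinear update it describes, assuming the nonlinearity is the score function of a Student-t density with degrees of freedom nu and scale s. The specific parameter values, step size, and toy objective below are illustrative assumptions, not the authors' method or their online estimation procedure.

```python
# Hypothetical sketch (not the authors' implementation) of a nonlinear SGD step.
# Assumption: for Student-t noise with nu degrees of freedom and scale s, the
# score function of the noise density gives the elementwise nonlinearity
#   phi(g) = (nu + 1) * g / (nu * s**2 + g**2),
# which is non-monotone (redescending) and damps heavy-tailed gradient noise.

import numpy as np

def student_t_score(g, nu=4.0, s=1.0):
    """Elementwise nonlinearity derived from the Student-t log-density."""
    return (nu + 1.0) * g / (nu * s**2 + g**2)

def nonlinear_sgd_step(w, stochastic_grad, lr=0.1, nu=4.0, s=1.0):
    """One update: pass the stochastic gradient through the nonlinearity first."""
    return w - lr * student_t_score(stochastic_grad, nu, s)

# Toy usage: minimize f(w) = 0.5 * ||w||^2 under heavy-tailed gradient noise.
rng = np.random.default_rng(0)
w = np.ones(5)
for t in range(1000):
    noise = rng.standard_t(df=2.0, size=w.shape)  # heavy-tailed gradient noise
    g = w + noise                                  # noisy gradient of 0.5*||w||^2
    w = nonlinear_sgd_step(w, g, lr=0.05)
print("final iterate norm:", np.linalg.norm(w))
```

In this sketch the nonlinearity caps the influence of any single heavy-tailed gradient sample, whereas plain SGD would take occasional huge steps; the abstract's contribution is choosing and tuning such a mapping so that the resulting estimator is statistically efficient.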