From: Tao Jiang 

Title: Statistically Efficient Stochastic Gradient Methods

Abstract: Heavy-tailed gradient noise has become a major obstacle for
stochastic gradient methods to obtain good performance in modern deep learning
applications. In the 1980s, Polyak and Tsypkin developed a framework of
nonlinear stochastic gradient methods to address such problems: the stochastic
gradients first pass through a nonlinear mapping, determined by the probability
distribution of the noise, before being used to update the
optimization variable. These methods are statistically efficient or optimal, in
the sense that their asymptotic performance achieves the Cramér-Rao lower bound.
However, beyond a few special cases that admit simple analytical forms, the
method is hard to apply in practice because the distribution of the noise is
difficult to characterize.
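
For concreteness, the kind of update the framework prescribes can be sketched
in a few lines. The code below is my own illustration, not code from the talk;
it uses the sign nonlinearity, the classical analytically simple special case
that is optimal for Laplacian noise.

    import numpy as np

    def nonlinear_sgd_step(w, g, lr, phi):
        # One nonlinear stochastic gradient step: the noisy gradient g is
        # passed elementwise through a nonlinearity phi before it is applied
        # to the iterate.
        return w - lr * phi(g)

    # Toy example: quadratic objective 0.5*||w - 1||^2 with Laplacian
    # gradient noise, using sign(.) as the nonlinear mapping.
    rng = np.random.default_rng(0)
    w = np.zeros(10)
    for _ in range(2000):
        g = (w - 1.0) + rng.laplace(scale=1.0, size=w.shape)
        w = nonlinear_sgd_step(w, g, lr=0.01, phi=np.sign)
    # w ends up close to the minimizer (all ones).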

We make statistically efficient stochastic gradient methods practical with the
following contributions. Theoretically, we extend the framework of Polyak and
Tsypkin to the non-asymptotic regime, showing that the same nonlinear map
achieves the optimal statistical performance; in addition, we relax the
requirement that the nonlinear mapping be monotone, thus allowing noise with
tails heavier than the Laplace distribution. Empirically, we discover that the
family of Student-t distributions gives good approximations of the heavy-tailed
noise in many deep learning applications. We derive a simple nonlinear mapping
that is optimal for the Student-t distribution and has only a couple of
parameters that can
be estimated efficiently online. Numerical experiments demonstrate that this
algorithm matches the state-of-the-art performance on certain deep learning
tasks.
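
The abstract does not spell out this mapping explicitly, but the score function
of a Student-t density has exactly this flavor: it is bounded, redescending
(non-monotone), and determined by just a degrees-of-freedom parameter and a
scale. The sketch below is an assumption of mine along those lines, with the
online estimation of the two parameters left out.

    import numpy as np

    def student_t_score(g, nu, sigma):
        # Elementwise score of a Student-t density with nu degrees of freedom
        # and scale sigma:
        #   -(d/dg) log p(g) = (nu + 1) * g / (nu * sigma**2 + g**2).
        # The map is bounded and redescending (non-monotone), so extreme,
        # heavy-tailed gradient components are damped rather than amplified.
        return (nu + 1.0) * g / (nu * sigma**2 + g**2)

    def nonlinear_sgd_step(w, g, lr, nu=3.0, sigma=1.0):
        # nu and sigma would be estimated online from the observed gradients
        # in practice; the estimation scheme is omitted here.
        return w - lr * student_t_score(g, nu, sigma)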