Reproducing the example from the paper [1] (tribute).
In addition, we compare Normalized SGD to Pop-Art SGD; while the former uses gradient rescaling and the latter is based on rescaling weights, the two are equivalent in case of squared loss.
To build the Docker image and run the example, use
make run
[1] Hasselt et al. Learning values across many orders of magnitude. NIPS 2016.