Difference between RMSProp with momentum and Adam Optimizers

According to this scintillating blog post, Adam is very similar to RMSProp with momentum.

From the TensorFlow documentation we see that tf.train.RMSPropOptimizer has the following parameters:

__init__(
    learning_rate,
    decay=0.9,
    momentum=0.0,
    epsilon=1e-10,
    use_locking=False,
    centered=False,
    name='RMSProp'
)

while tf.train.AdamOptimizer has:

__init__(
    learning_rate=0.001,
    beta1=0.9,
    beta2=0.999,
    epsilon=1e-08,
    use_locking=False,
    name='Adam'
)

What is the conceptual difference if we set beta1 = decay and beta2 = momentum?
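For concreteness, the pairing I have in mind looks like this (just a sketch against the TF 1.x API above; the values are illustrative):

import tensorflow as tf  # TF 1.x, matching the signatures above

# beta1 = decay and beta2 = momentum, as asked:
rmsprop = tf.train.RMSPropOptimizer(learning_rate=0.001, decay=0.9, momentum=0.999)
adam = tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999)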
tensorflow optimization

asked Jan 18 '18 at 15:21 by hans

  • Did you mean beta1=momentum and beta2=decay? – Oren Milman Sep 20 '18 at 18:33

  • Also, I agree that the blog post An overview of gradient descent optimization algorithms by Sebastian Ruder is great, but note that (as far as I can see) Sebastian doesn't say explicitly that Adam and rmsprop with momentum are very similar. – Oren Milman Oct 4 '18 at 15:43
2 Answers

(My answer is based mostly on Adam: A Method for Stochastic Optimization (the original Adam paper) and on the implementation of rmsprop with momentum in TensorFlow (which is operator() of struct ApplyRMSProp), as rmsprop is unpublished; it was described in a lecture by Geoffrey Hinton.)





Some Background



Adam and rmsprop with momentum are both methods (used by a gradient descent algorithm) to determine the step.



Let $\Delta x^{(t)}_j$ be the $j^{\text{th}}$ component of the $t^{\text{th}}$ step. Then (a short code sketch of both update rules follows this list):




  • In Adam: $$\Delta x_{j}^{(t)}=-\frac{\text{learning\_rate}}{\sqrt{\text{BCMA}\left(g_{j}^{2}\right)}}\cdot\text{BCMA}\left(g_{j}\right)$$
    where:

    • $\text{learning\_rate}$ is a hyperparameter.

    • $\text{BCMA}$ is short for "bias-corrected (exponential) moving average" (I made up the acronym for brevity).

      • All of the moving averages I am going to talk about are exponential moving averages, so I will just refer to them as "moving averages".

    • $g_j$ is the $j^{\text{th}}$ component of the gradient, and so $\text{BCMA}\left(g_{j}\right)$ is a bias-corrected moving average of the $j^{\text{th}}$ components of the gradients that were calculated. Similarly, $\text{BCMA}\left(g_{j}^{2}\right)$ is a bias-corrected moving average of the squares of the $j^{\text{th}}$ components of the gradients that were calculated.

    • For each moving average, the decay factor (aka smoothing factor) is a hyperparameter.

      Both the Adam paper and TensorFlow use the following notation:

      • $\beta_1$ is the decay factor for $\text{BCMA}\left(g_{j}\right)$

      • $\beta_2$ is the decay factor for $\text{BCMA}\left(g^2_{j}\right)$

    • The denominator is actually $\sqrt{\text{BCMA}\left(g_{j}^{2}\right)}+\epsilon$, where $\epsilon$ is a small hyperparameter, but I will ignore it for simplicity.

  • In rmsprop with momentum: $$\Delta x_{j}^{\left(t\right)}=\text{momentum\_decay\_factor}\cdot\Delta x_{j}^{\left(t-1\right)}-\frac{\text{learning\_rate}}{\sqrt{\text{MA}\left(g_{j}^{2}\right)}}\cdot g_{j}^{\left(t\right)}$$
    where:

    • $\text{momentum\_decay\_factor}$ is a hyperparameter, and I will assume it is in $(0,1)$ (as it usually is).

      In TensorFlow, this is the momentum argument of RMSPropOptimizer.

    • $g^{(t)}_j$ is the $j^{\text{th}}$ component of the gradient in the $t^{\text{th}}$ step.

    • $\text{MA}\left(g_{j}^{2}\right)$ is a moving average of the squares of the $j^{\text{th}}$ components of the gradients that were calculated.

      The decay factor of this moving average is a hyperparameter, and in TensorFlow, this is the decay argument of RMSPropOptimizer.
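To make the two rules concrete, here is a minimal NumPy sketch of both updates (my own illustration, not TensorFlow's actual implementation; the function names and the state dict are made up):

import numpy as np

def adam_step(g, state, learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # state holds the zero-initialized moving averages and the step counter
    state['t'] += 1
    state['m'] = beta1 * state['m'] + (1 - beta1) * g       # moving average of g
    state['v'] = beta2 * state['v'] + (1 - beta2) * g ** 2  # moving average of g^2
    m_hat = state['m'] / (1 - beta1 ** state['t'])          # BCMA(g)
    v_hat = state['v'] / (1 - beta2 ** state['t'])          # BCMA(g^2)
    return -learning_rate / (np.sqrt(v_hat) + eps) * m_hat

def rmsprop_momentum_step(g, state, learning_rate=0.001, decay=0.9,
                          momentum=0.9, eps=1e-10):
    # state holds the (biased) moving average of g^2 and the previous step
    state['ms'] = decay * state['ms'] + (1 - decay) * g ** 2
    step = (momentum * state['prev_step']
            - learning_rate / np.sqrt(state['ms'] + eps) * g)
    state['prev_step'] = step
    return step

adam_state = {'m': 0.0, 'v': 0.0, 't': 0}
rms_state = {'ms': 0.0, 'prev_step': 0.0}
x_adam = x_rms = 2.0
for _ in range(100):
    x_adam += adam_step(np.sign(x_adam), adam_state)         # f(x) = |x|, so g = sign(x)
    x_rms += rmsprop_momentum_step(np.sign(x_rms), rms_state)

Note where the momentum-like memory lives: in Adam it is the moving average of the gradients themselves, while in rmsprop with momentum it is the previous step.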




High-Level Comparison



Now we are finally ready to talk about the differences between the two.

The denominator is quite similar (except for the bias-correction, which I explain later).

However, the momentum-like behavior that both share (Adam due to $\text{BCMA}\left(g_{j}\right)$, and rmsprop with momentum due to explicitly taking a fraction of the previous step) is somewhat different.

E.g. this is how Sebastian Ruder describes this difference in his blog post An overview of gradient descent optimization algorithms:

Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface [...]

This description is based on the paper GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, so check it out if you want to dive deeper.

Next, I will describe two simple scenarios to demonstrate the difference in the momentum-like behaviors of the methods.

Lastly, I will describe the difference with regard to bias-correction.





Accumulating Momentum



Consider the following scenario: the gradient was constant in every step in the recent past, and $\Delta x_{j}^{(t-1)}=0$. Also, to keep it simple, $g_{j}>0$.

I.e. we can imagine our algorithm as a stationary ball on a linear slope.

What would happen when we use each of the methods?

[figure: scenario 1]

In Adam

The gradient was constant in the recent past, so $\text{BCMA}\left(g_{j}^{2}\right)\approx g_{j}^{2}$ and $\text{BCMA}\left(g_{j}\right)\approx g_j$.

Thus we get:
$$\begin{gathered}
\Delta x_{j}^{(t)}=-\frac{\text{learning\_rate}}{\sqrt{g_{j}^{2}}}\cdot g_{j}=-\frac{\text{learning\_rate}}{|g_{j}|}\cdot g_{j}\\
\downarrow\\
\Delta x_{j}^{\left(t\right)}=-\text{learning\_rate}
\end{gathered}$$

I.e. the "ball" immediately starts moving downhill at a constant speed.



In rmsprop with momentum

Similarly, we get:
$$\Delta x_{j}^{\left(t\right)}=\text{momentum\_decay\_factor}\cdot\Delta x_{j}^{\left(t-1\right)}-\text{learning\_rate}$$
This case is a little more complicated, but we can see that:
$$\begin{gathered}
\Delta x_{j}^{\left(t\right)}=-\text{learning\_rate}\\
\Delta x_{j}^{\left(t+1\right)}=-\text{learning\_rate}\cdot(1+\text{momentum\_decay\_factor})
\end{gathered}$$

So the "ball" starts accelerating downhill.

Given that the gradient stays constant, you can prove that if:
$$-\frac{\text{learning\_rate}}{1-\text{momentum\_decay\_factor}}<\Delta x_{j}^{\left(k\right)}$$
then: $$-\frac{\text{learning\_rate}}{1-\text{momentum\_decay\_factor}}<\Delta x_{j}^{\left(k+1\right)}<\Delta x_{j}^{\left(k\right)}$$
Therefore, we conclude that the step converges, i.e. $\Delta x_{j}^{\left(k\right)}\approx\Delta x_{j}^{\left(k-1\right)}$ for some $k>t$, and then:
$$\begin{gathered}
\Delta x_{j}^{\left(k\right)}\approx\text{momentum\_decay\_factor}\cdot\Delta x_{j}^{\left(k\right)}-\text{learning\_rate}\\
\downarrow\\
\Delta x_{j}^{\left(k\right)}\approx-\frac{\text{learning\_rate}}{1-\text{momentum\_decay\_factor}}
\end{gathered}$$

Thus, the "ball" accelerates downhill and approaches a speed $\frac{1}{1-\text{momentum\_decay\_factor}}$ times as large as the constant speed of Adam's "ball". (E.g. for a typical $\text{momentum\_decay\_factor}=0.9$, it can approach $10\times$ the speed!)
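As a quick numeric sanity check of this limit (a standalone snippet; it assumes the gradient is constant, so that $\text{MA}\left(g_{j}^{2}\right)\approx g_{j}^{2}$ and the second term of the update is just $-\text{learning\_rate}$):

# Iterate the constant-gradient recurrence of rmsprop with momentum.
learning_rate, momentum_decay_factor = 0.001, 0.9
step = 0.0
for _ in range(200):
    step = momentum_decay_factor * step - learning_rate
print(step)                                          # ~ -0.00999..., i.e. ~10x Adam's -0.001
print(-learning_rate / (1 - momentum_decay_factor))  # the limit: -0.01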



Changing Direction

Now, consider a scenario following the previous one:

After going down the slope (in the previous scenario) for quite some time (i.e. enough time for rmsprop with momentum to reach a nearly constant step size), suddenly a slope with an opposite and smaller constant gradient is reached.

What would happen when we use each of the methods?

[figure: scenario 2]

This time I will just describe the results of my simulation of the scenario (my Python code is at the end of the answer).

Note that I have chosen for Adam's $\text{BCMA}\left(g_{j}\right)$ a decay factor equal to $\text{momentum\_decay\_factor}$. Choosing differently would have changed the following results:

  • Adam is slower to change its direction, and then much slower to get back to the minimum.

  • However, rmsprop with momentum reaches much further before it changes direction (when both use the same $\text{learning\_rate}$).

    Note that this further reach is because rmsprop with momentum first reaches the opposite slope with much higher speed than Adam. If both reached the opposite slope with the same speed (which would happen if Adam's $\text{learning\_rate}$ were $\frac{1}{1-\text{momentum\_decay\_factor}}$ times as large as that of rmsprop with momentum), then Adam would reach further before changing direction.


Bias-Correction



What do we mean by a biased/bias-corrected moving average? (Or at least, what does the Adam paper mean by that?)



Generally speaking, a moving average is a weighted average of:




  • The moving average of all of the previous terms

  • The current term


Then what is the moving average in the first step?




  • A natural choice for a programmer would be to initialize the "moving average of all of the previous terms" to $0$. We say that in this case the moving average is biased towards $0$.

  • When you only have one term, by definition the average should be equal to that term.

    Thus, we say that the moving average is bias-corrected if the moving average in the first step equals the first term (and the moving average works as usual for the rest of the steps).


So here is another difference: The moving averages in Adam are bias-corrected, while the moving average in rmsprop with momentum is biased towards $0$.



For more about the bias-correction in Adam, see section 3 in the paper and also this answer.
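To see the bias and its correction numerically, here is a tiny standalone snippet (my own illustration) that feeds a constant sequence of ones into a zero-initialized moving average:

# A zero-initialized EMA of the constant sequence 1, 1, 1, ... versus its
# bias-corrected version (divide by 1 - beta^t, as in the Adam paper).
beta = 0.9
ma = 0.0
for t in range(1, 6):
    ma = beta * ma + (1 - beta) * 1.0
    print(t, round(ma, 4), round(ma / (1 - beta ** t), 4))
# The biased average starts at 0.1 and only crawls toward 1.0;
# the bias-corrected average equals 1.0 (the first term) from step 1.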



Simulation Python Code





import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

###########################################
# script parameters
def f(x):
    if x > 0:
        return x
    else:
        return -0.1 * x

def f_grad(x):
    if x > 0:
        return 1
    else:
        return -0.1

METHOD_TO_LEARNING_RATE = {
    'Adam': 0.01,
    'GD': 0.00008,
    'rmsprop_with_Nesterov_momentum': 0.008,
    'rmsprop_with_momentum': 0.001,
    'rmsprop': 0.02,
    'momentum': 0.00008,
    'Nesterov': 0.008,
    'Adadelta': None,
}
X0 = 2
# the last assignment wins; reorder to simulate a different method
METHOD = 'rmsprop'
METHOD = 'momentum'
METHOD = 'GD'
METHOD = 'rmsprop_with_Nesterov_momentum'
METHOD = 'Nesterov'
METHOD = 'Adadelta'
METHOD = 'rmsprop_with_momentum'
METHOD = 'Adam'
LEARNING_RATE = METHOD_TO_LEARNING_RATE[METHOD]

MOMENTUM_DECAY_FACTOR = 0.9
RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.9
ADADELTA_DECAY_FACTOR = 0.9
RMSPROP_EPSILON = 1e-10
ADADELTA_EPSILON = 1e-6
ADAM_EPSILON = 1e-10
ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.999
ADAM_GRADS_AVG_DECAY_FACTOR = 0.9

# milliseconds between animation frames (again, the last assignment wins)
INTERVAL = 9e2
INTERVAL = 1
INTERVAL = 3e2
INTERVAL = 3e1
###########################################

def plot_func(axe, f):
    xs = np.arange(-X0 * 0.5, X0 * 1.05, abs(X0) / 100)
    vf = np.vectorize(f)
    ys = vf(xs)
    return axe.plot(xs, ys, color='grey')

def next_color(color, f):
    color[1] -= 0.01
    if color[1] < 0:
        color[1] = 1
    return color[:]

def update(frame):
    global k, x, prev_step, squared_grads_decaying_avg, \
        squared_prev_steps_decaying_avg, grads_decaying_avg

    if METHOD in ('momentum', 'Nesterov', 'rmsprop_with_momentum',
                  'rmsprop_with_Nesterov_momentum'):
        step_momentum_portion = MOMENTUM_DECAY_FACTOR * prev_step
    if METHOD in ('Nesterov', 'rmsprop_with_Nesterov_momentum'):
        gradient = f_grad(x + step_momentum_portion)
    else:
        gradient = f_grad(x)

    if METHOD == 'GD':
        step = -LEARNING_RATE * gradient
    elif METHOD in ('momentum', 'Nesterov'):
        step = step_momentum_portion - LEARNING_RATE * gradient
    elif METHOD in ('rmsprop', 'rmsprop_with_momentum',
                    'rmsprop_with_Nesterov_momentum'):
        squared_grads_decaying_avg = (
            RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
        grads_rms = np.sqrt(squared_grads_decaying_avg + RMSPROP_EPSILON)
        if METHOD == 'rmsprop':
            step = -LEARNING_RATE / grads_rms * gradient
        else:
            assert METHOD in ('rmsprop_with_momentum',
                              'rmsprop_with_Nesterov_momentum')
            print(f'LEARNING_RATE / grads_rms * gradient: '
                  f'{LEARNING_RATE / grads_rms * gradient}')
            step = step_momentum_portion - LEARNING_RATE / grads_rms * gradient
    elif METHOD == 'Adadelta':
        gradient = f_grad(x)
        squared_grads_decaying_avg = (
            ADADELTA_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - ADADELTA_DECAY_FACTOR) * gradient ** 2)
        grads_rms = np.sqrt(squared_grads_decaying_avg + ADADELTA_EPSILON)
        squared_prev_steps_decaying_avg = (
            ADADELTA_DECAY_FACTOR * squared_prev_steps_decaying_avg +
            (1 - ADADELTA_DECAY_FACTOR) * prev_step ** 2)
        prev_steps_rms = np.sqrt(squared_prev_steps_decaying_avg + ADADELTA_EPSILON)
        step = -prev_steps_rms / grads_rms * gradient
    elif METHOD == 'Adam':
        squared_grads_decaying_avg = (
            ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
        unbiased_squared_grads_decaying_avg = (
            squared_grads_decaying_avg /
            (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
        grads_decaying_avg = (
            ADAM_GRADS_AVG_DECAY_FACTOR * grads_decaying_avg +
            (1 - ADAM_GRADS_AVG_DECAY_FACTOR) * gradient)
        unbiased_grads_decaying_avg = (
            grads_decaying_avg /
            (1 - ADAM_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
        step = -(LEARNING_RATE /
                 (np.sqrt(unbiased_squared_grads_decaying_avg) + ADAM_EPSILON) *
                 unbiased_grads_decaying_avg)

    x += step
    prev_step = step
    k += 1

    color = next_color(cur_color, f)

    print(f'k: {k}\n'
          f'x: {x}\n'
          f'step: {step}\n'
          f'gradient: {gradient}\n')

    k_x_marker, = k_and_x.plot(k, x, '.', color=color)
    x_y_marker, = x_and_y.plot(x, f(x), '.', color=color)

    return k_x_marker, x_y_marker

k = 0
x = X0
cur_color = [0, 1, 1]
prev_step = 0
squared_grads_decaying_avg = 0
squared_prev_steps_decaying_avg = 0
grads_decaying_avg = 0

fig, (k_and_x, x_and_y) = plt.subplots(1, 2, figsize=(9, 5))
k_and_x.set_xlabel('k')
k_and_x.set_ylabel('x', rotation=0)
x_and_y.set_xlabel('x')
x_and_y.set_ylabel('y', rotation=0)
plot_func(x_and_y, f)
x_and_y.plot(x, f(x), '.', color=cur_color[:])
k_and_x.plot(k, x, '.', color=cur_color[:])
plt.tight_layout()

ani = FuncAnimation(fig, update, blit=False, repeat=False, interval=INTERVAL)
plt.show()





The code has some problems, like there is an extra f in print(f'LEARNING_RATE..'). Could you please check it?

– Xiaogang Wang (new contributor)

          def plot_func(axe, f):
          xs = np.arange(-X0 * 0.5, X0 * 1.05, abs(X0) / 100)
          vf = np.vectorize(f)
          ys = vf(xs)
          return axe.plot(xs, ys, color='grey')

          def next_color(color, f):
          color[1] -= 0.01
          if color[1] < 0:
          color[1] = 1
          return color[:]

          def update(frame):
          global k, x, prev_step, squared_grads_decaying_avg,
          squared_prev_steps_decaying_avg, grads_decaying_avg

          if METHOD in ('momentum', 'Nesterov', 'rmsprop_with_momentum',
          'rmsprop_with_Nesterov_momentum'):
          step_momentum_portion = MOMENTUM_DECAY_FACTOR * prev_step
          if METHOD in ('Nesterov', 'rmsprop_with_Nesterov_momentum'):
          gradient = f_grad(x + step_momentum_portion)
          else:
          gradient = f_grad(x)

          if METHOD == 'GD':
          step = -LEARNING_RATE * gradient
          elif METHOD in ('momentum', 'Nesterov'):
          step = step_momentum_portion - LEARNING_RATE * gradient
          elif METHOD in ('rmsprop', 'rmsprop_with_momentum',
          'rmsprop_with_Nesterov_momentum'):
          squared_grads_decaying_avg = (
          RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
          (1 - RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
          grads_rms = np.sqrt(squared_grads_decaying_avg + RMSPROP_EPSILON)
          if METHOD == 'rmsprop':
          step = -LEARNING_RATE / grads_rms * gradient
          else:
          assert(METHOD in ('rmsprop_with_momentum',
          'rmsprop_with_Nesterov_momentum'))
          print(f'LEARNING_RATE / grads_rms * gradient: {LEARNING_RATE / grads_rms * gradient}')
          step = step_momentum_portion - LEARNING_RATE / grads_rms * gradient
          elif METHOD == 'Adadelta':
          gradient = f_grad(x)
          squared_grads_decaying_avg = (
          ADADELTA_DECAY_FACTOR * squared_grads_decaying_avg +
          (1 - ADADELTA_DECAY_FACTOR) * gradient ** 2)
          grads_rms = np.sqrt(squared_grads_decaying_avg + ADADELTA_EPSILON)
          squared_prev_steps_decaying_avg = (
          ADADELTA_DECAY_FACTOR * squared_prev_steps_decaying_avg +
          (1 - ADADELTA_DECAY_FACTOR) * prev_step ** 2)
          prev_steps_rms = np.sqrt(squared_prev_steps_decaying_avg + ADADELTA_EPSILON)
          step = - prev_steps_rms / grads_rms * gradient
          elif METHOD == 'Adam':
          squared_grads_decaying_avg = (
          ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
          (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
          unbiased_squared_grads_decaying_avg = (
          squared_grads_decaying_avg /
          (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
          grads_decaying_avg = (
          ADAM_GRADS_AVG_DECAY_FACTOR * grads_decaying_avg +
          (1 - ADAM_GRADS_AVG_DECAY_FACTOR) * gradient)
          unbiased_grads_decaying_avg = (
          grads_decaying_avg /
          (1 - ADAM_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
          step = - (LEARNING_RATE /
          (np.sqrt(unbiased_squared_grads_decaying_avg) + ADAM_EPSILON) *
          unbiased_grads_decaying_avg)

          x += step
          prev_step = step
          k += 1

          color = next_color(cur_color, f)

          print(f'k: {k}n'
          f'x: {x}n'
          f'step: {step}n'
          f'gradient: {gradient}n')

          k_x_marker, = k_and_x.plot(k, x, '.', color=color)
          x_y_marker, = x_and_y.plot(x, f(x), '.', color=color)

          return k_x_marker, x_y_marker

          k = 0
          x = X0
          cur_color = [0, 1, 1]
          prev_step = 0
          squared_grads_decaying_avg = 0
          squared_prev_steps_decaying_avg = 0
          grads_decaying_avg = 0

          fig, (k_and_x, x_and_y) = plt.subplots(1, 2, figsize=(9,5))
          k_and_x.set_xlabel('k')
          k_and_x.set_ylabel('x', rotation=0)
          x_and_y.set_xlabel('x')
          x_and_y.set_ylabel('y', rotation=0)
          plot_func(x_and_y, f)
          x_and_y.plot(x, f(x), '.', color=cur_color[:])
          k_and_x.plot(k, x, '.', color=cur_color[:])
          plt.tight_layout()

          ani = FuncAnimation(fig, update, blit=False, repeat=False, interval=INTERVAL)
          plt.show()





          share|improve this answer











          $endgroup$



(My answer is based mostly on Adam: A Method for Stochastic Optimization (the original Adam paper) and on the implementation of rmsprop with momentum in TensorFlow (which is operator() of struct ApplyRMSProp), as rmsprop is unpublished - it was described in a lecture by Geoffrey Hinton.)





          Some Background



          Adam and rmsprop with momentum are both methods (used by a gradient descent algorithm) to determine the step.



Let $\Delta x^{(t)}_j$ be the $j^{\text{th}}$ component of the $t^{\text{th}}$ step. Then:




• In Adam: $$\Delta x_{j}^{(t)}=-\frac{\text{learning\_rate}}{\sqrt{\text{BCMA}\left(g_{j}^{2}\right)}}\cdot\text{BCMA}\left(g_{j}\right)$$
  where:

  • $\text{learning\_rate}$ is a hyperparameter.

  • $\text{BCMA}$ is short for "bias-corrected (exponential) moving average" (I made up the acronym for brevity).

    • All of the moving averages I am going to talk about are exponential moving averages, so I will just refer to them as "moving averages".

  • $g_j$ is the $j^{\text{th}}$ component of the gradient, and so $\text{BCMA}\left(g_{j}\right)$ is a bias-corrected moving average of the $j^{\text{th}}$ components of the gradients that were calculated. Similarly, $\text{BCMA}\left(g_{j}^{2}\right)$ is a bias-corrected moving average of the squares of the $j^{\text{th}}$ components of the gradients that were calculated.

  • For each moving average, the decay factor (aka smoothing factor) is a hyperparameter.

    Both the Adam paper and TensorFlow use the following notation:

    • $\beta_1$ is the decay factor for $\text{BCMA}\left(g_{j}\right)$

    • $\beta_2$ is the decay factor for $\text{BCMA}\left(g^2_{j}\right)$

  • The denominator is actually $\sqrt{\text{BCMA}\left(g_{j}^{2}\right)}+\epsilon$, where $\epsilon$ is a small hyperparameter, but I will ignore it for simplicity.



• In rmsprop with momentum: $$\Delta x_{j}^{\left(t\right)}=\text{momentum\_decay\_factor}\cdot\Delta x_{j}^{\left(t-1\right)}-\frac{\text{learning\_rate}}{\sqrt{\text{MA}\left(g_{j}^{2}\right)}}\cdot g_{j}^{\left(t\right)}$$
  where:

  • $\text{momentum\_decay\_factor}$ is a hyperparameter, and I will assume it is in $(0,1)$ (as it usually is).

    In TensorFlow, this is the momentum argument of RMSPropOptimizer.

  • $g^{(t)}_j$ is the $j^{\text{th}}$ component of the gradient in the $t^{\text{th}}$ step.

  • $\text{MA}\left(g_{j}^{2}\right)$ is a moving average of the squares of the $j^{\text{th}}$ components of the gradients that were calculated.

    The decay factor of this moving average is a hyperparameter, and in TensorFlow, this is the decay argument of RMSPropOptimizer.



          High-Level Comparison



          Now we are finally ready to talk about the differences between the two.



The denominator is quite similar (except for the bias-correction, which I explain later).

However, the momentum-like behavior that both share (Adam due to $\text{BCMA}\left(g_{j}\right)$, and rmsprop with momentum due to explicitly taking a fraction of the previous step) is somewhat different.

          E.g. this is how Sebastian Ruder describes this difference in his blog post An overview of gradient descent optimization algorithms:




          Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface [...]




          This description is based on the paper GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, so check it out if you want to dive deeper.



Next, I will describe two simple scenarios to demonstrate the difference in the momentum-like behaviors of the methods.



Lastly, I will describe the difference with regard to bias-correction.





          Accumulating Momentum



Consider the following scenario: The gradient was constant in every step in the recent past, and $\Delta x_{j}^{(t-1)}=0$. Also, to keep it simple, $g_{j}>0$.

          I.e. we can imagine our algorithm as a stationary ball on a linear slope.

          What would happen when we use each of the methods?



[figure: scenario 1]



          In Adam



The gradient was constant in the recent past, so $\text{BCMA}\left(g_{j}^{2}\right)\approx g_{j}^{2}$ and $\text{BCMA}\left(g_{j}\right)\approx g_j$.

Thus we get:
$$\begin{gathered}
\Delta x_{j}^{(t)}=-\frac{\text{learning\_rate}}{\sqrt{g_{j}^{2}}}\cdot g_{j}=-\frac{\text{learning\_rate}}{|g_{j}|}\cdot g_{j}\\
\downarrow\\
\Delta x_{j}^{\left(t\right)}=-\text{learning\_rate}
\end{gathered}$$



          I.e. the "ball" immediately starts moving downhill in a constant speed.



          In rmsprop with momentum



Similarly, we get:
$$\Delta x_{j}^{\left(t\right)}=\text{momentum\_decay\_factor}\cdot\Delta x_{j}^{\left(t-1\right)}-\text{learning\_rate}$$
This case is a little more complicated, but we can see that:
$$\begin{gathered}
\Delta x_{j}^{\left(t\right)}=-\text{learning\_rate}\\
\Delta x_{j}^{\left(t+1\right)}=-\text{learning\_rate}\cdot(1+\text{momentum\_decay\_factor})
\end{gathered}$$

          So the "ball" starts accelerating downhill.

Given that the gradient stays constant, you can prove that if:
$$-\frac{\text{learning\_rate}}{1-\text{momentum\_decay\_factor}}<\Delta x_{j}^{\left(k\right)}$$
then: $$-\frac{\text{learning\_rate}}{1-\text{momentum\_decay\_factor}}<\Delta x_{j}^{\left(k+1\right)}<\Delta x_{j}^{\left(k\right)}$$
Therefore, we conclude that the step converges, i.e. $\Delta x_{j}^{\left(k\right)}\approx \Delta x_{j}^{\left(k-1\right)}$ for some $k>t$, and then:
$$\begin{gathered}
\Delta x_{j}^{\left(k\right)}\approx \text{momentum\_decay\_factor}\cdot\Delta x_{j}^{\left(k\right)}-\text{learning\_rate}\\
\downarrow\\
\Delta x_{j}^{\left(k\right)}\approx -\frac{\text{learning\_rate}}{1-\text{momentum\_decay\_factor}}
\end{gathered}$$

          Thus, the "ball" accelerates downhill and approaches a speed $frac{1}{1-text{momentum_decay_factor}}$ times as large as the constant speed of Adam's "ball". (E.g. for a typical $text{momentum_decay_factor}=0.9$, it can approach $10times$ speed!)



          Changing Direction



          Now, consider a scenario following the previous one:

After going down the slope (in the previous scenario) for quite some time (i.e. long enough for rmsprop with momentum to reach a nearly constant step size), the ball suddenly reaches a slope with an opposite and smaller constant gradient.

          What would happen when we use each of the methods?



[figure: scenario 2]



This time I will just describe the results of my simulation of the scenario (my Python code is at the end of the answer).

Note that I have chosen for Adam's $\text{BCMA}\left(g_{j}\right)$ a decay factor equal to $\text{momentum\_decay\_factor}$. Choosing differently would have changed the following results:




• Adam is slower to change its direction, and then much slower to get back to the minimum.

• However, rmsprop with momentum reaches much further before it changes direction (when both use the same $\text{learning\_rate}$); see the sketch after this list.

  Note that this further reach is because rmsprop with momentum first reaches the opposite slope with much higher speed than Adam. If both reached the opposite slope with the same speed (which would happen if Adam's $\text{learning\_rate}$ were $\frac{1}{1-\text{momentum\_decay\_factor}}$ times as large as that of rmsprop with momentum), then Adam would reach further before changing direction.
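
Here is a stripped-down, non-animated sketch of this scenario (my own condensation of the full script at the end of the answer, using the same learning_rate for both methods), which records how far each method overshoots past the minimum at $x=0$:

import numpy as np

def grad(x):
    # slope of f(x) = x for x > 0, and of f(x) = -0.1 * x for x <= 0 (minimum at x = 0)
    return 1.0 if x > 0 else -0.1

lr, m_decay, sq_decay, b1, b2, eps = 0.001, 0.9, 0.9, 0.9, 0.999, 1e-10
x_rms = x_adam = 2.0                    # both start on the steep slope
v_rms = prev_step = m_adam = v_adam = 0.0
min_x_rms = min_x_adam = 2.0

for k in range(1, 5001):
    g = grad(x_rms)                     # rmsprop with momentum
    v_rms = sq_decay * v_rms + (1 - sq_decay) * g ** 2
    prev_step = m_decay * prev_step - lr / np.sqrt(v_rms + eps) * g
    x_rms += prev_step
    min_x_rms = min(min_x_rms, x_rms)

    g = grad(x_adam)                    # Adam (with bias-correction)
    m_adam = b1 * m_adam + (1 - b1) * g
    v_adam = b2 * v_adam + (1 - b2) * g ** 2
    x_adam += -lr / (np.sqrt(v_adam / (1 - b2 ** k)) + eps) * (m_adam / (1 - b1 ** k))
    min_x_adam = min(min_x_adam, x_adam)

print(f'furthest overshoot past x=0: rmsprop with momentum {min_x_rms:.3f}, Adam {min_x_adam:.3f}')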


          Bias-Correction



          What do we mean by a biased/bias-corrected moving average? (Or at least, what does the Adam paper mean by that?)



          Generally speaking, a moving average is a weighted average of:




          • The moving average of all of the previous terms

          • The current term


          Then what is the moving average in the first step?




          • A natural choice for a programmer would be to initialize the "moving average of all of the previous terms" to $0$. We say that in this case the moving average is biased towards $0$.

• When you only have one term, by definition the average should be equal to that term.

  Thus, we say that the moving average is bias-corrected if the moving average in the first step equals the first term (and the moving average works as usual for the rest of the steps).


          So here is another difference: The moving averages in Adam are bias-corrected, while the moving average in rmsprop with momentum is biased towards $0$.



          For more about the bias-correction in Adam, see section 3 in the paper and also this answer.
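
A minimal numerical illustration (my own, with an arbitrary first gradient of $5$ and a decay factor of $0.9$): initializing the average at $0$ makes the first "average" $0.5$, while dividing by $1-\beta^t$ recovers the first term exactly:

beta = 0.9          # decay factor of the moving average
g1 = 5.0            # the first term

biased = beta * 0.0 + (1 - beta) * g1      # initialized at 0 -> 0.5, biased towards 0
bias_corrected = biased / (1 - beta ** 1)  # 0.5 / 0.1 = 5.0, equal to the first term

print(biased, bias_corrected)              # 0.5 5.0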



          Simulation Python Code





import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

###########################################
# script parameters
def f(x):
    if x > 0:
        return x
    else:
        return -0.1 * x

def f_grad(x):
    if x > 0:
        return 1
    else:
        return -0.1

METHOD_TO_LEARNING_RATE = {
    'Adam': 0.01,
    'GD': 0.00008,
    'rmsprop_with_Nesterov_momentum': 0.008,
    'rmsprop_with_momentum': 0.001,
    'rmsprop': 0.02,
    'momentum': 0.00008,
    'Nesterov': 0.008,
    'Adadelta': None,
}
X0 = 2
# the last assignment wins - reorder these lines to pick a method
METHOD = 'rmsprop'
METHOD = 'momentum'
METHOD = 'GD'
METHOD = 'rmsprop_with_Nesterov_momentum'
METHOD = 'Nesterov'
METHOD = 'Adadelta'
METHOD = 'rmsprop_with_momentum'
METHOD = 'Adam'
LEARNING_RATE = METHOD_TO_LEARNING_RATE[METHOD]

MOMENTUM_DECAY_FACTOR = 0.9
RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.9
ADADELTA_DECAY_FACTOR = 0.9
RMSPROP_EPSILON = 1e-10
ADADELTA_EPSILON = 1e-6
ADAM_EPSILON = 1e-10
ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.999
ADAM_GRADS_AVG_DECAY_FACTOR = 0.9

# milliseconds between animation frames; the last assignment wins
INTERVAL = 9e2
INTERVAL = 1
INTERVAL = 3e2
INTERVAL = 3e1
###########################################

def plot_func(axe, f):
    xs = np.arange(-X0 * 0.5, X0 * 1.05, abs(X0) / 100)
    vf = np.vectorize(f)
    ys = vf(xs)
    return axe.plot(xs, ys, color='grey')

def next_color(color, f):
    # slowly cycle the green channel so newer markers are distinguishable
    color[1] -= 0.01
    if color[1] < 0:
        color[1] = 1
    return color[:]

def update(frame):
    global k, x, prev_step, squared_grads_decaying_avg, \
        squared_prev_steps_decaying_avg, grads_decaying_avg

    if METHOD in ('momentum', 'Nesterov', 'rmsprop_with_momentum',
                  'rmsprop_with_Nesterov_momentum'):
        step_momentum_portion = MOMENTUM_DECAY_FACTOR * prev_step
    if METHOD in ('Nesterov', 'rmsprop_with_Nesterov_momentum'):
        gradient = f_grad(x + step_momentum_portion)
    else:
        gradient = f_grad(x)

    if METHOD == 'GD':
        step = -LEARNING_RATE * gradient
    elif METHOD in ('momentum', 'Nesterov'):
        step = step_momentum_portion - LEARNING_RATE * gradient
    elif METHOD in ('rmsprop', 'rmsprop_with_momentum',
                    'rmsprop_with_Nesterov_momentum'):
        squared_grads_decaying_avg = (
            RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
        grads_rms = np.sqrt(squared_grads_decaying_avg + RMSPROP_EPSILON)
        if METHOD == 'rmsprop':
            step = -LEARNING_RATE / grads_rms * gradient
        else:
            assert METHOD in ('rmsprop_with_momentum',
                              'rmsprop_with_Nesterov_momentum')
            print(f'LEARNING_RATE / grads_rms * gradient: '
                  f'{LEARNING_RATE / grads_rms * gradient}')
            step = step_momentum_portion - LEARNING_RATE / grads_rms * gradient
    elif METHOD == 'Adadelta':
        gradient = f_grad(x)
        squared_grads_decaying_avg = (
            ADADELTA_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - ADADELTA_DECAY_FACTOR) * gradient ** 2)
        grads_rms = np.sqrt(squared_grads_decaying_avg + ADADELTA_EPSILON)
        squared_prev_steps_decaying_avg = (
            ADADELTA_DECAY_FACTOR * squared_prev_steps_decaying_avg +
            (1 - ADADELTA_DECAY_FACTOR) * prev_step ** 2)
        prev_steps_rms = np.sqrt(squared_prev_steps_decaying_avg + ADADELTA_EPSILON)
        step = -prev_steps_rms / grads_rms * gradient
    elif METHOD == 'Adam':
        # biased moving averages, then bias-correction by 1 - decay_factor ** (k + 1)
        squared_grads_decaying_avg = (
            ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
        unbiased_squared_grads_decaying_avg = (
            squared_grads_decaying_avg /
            (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
        grads_decaying_avg = (
            ADAM_GRADS_AVG_DECAY_FACTOR * grads_decaying_avg +
            (1 - ADAM_GRADS_AVG_DECAY_FACTOR) * gradient)
        unbiased_grads_decaying_avg = (
            grads_decaying_avg /
            (1 - ADAM_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
        step = -(LEARNING_RATE /
                 (np.sqrt(unbiased_squared_grads_decaying_avg) + ADAM_EPSILON) *
                 unbiased_grads_decaying_avg)

    x += step
    prev_step = step
    k += 1

    color = next_color(cur_color, f)

    print(f'k: {k}\n'
          f'x: {x}\n'
          f'step: {step}\n'
          f'gradient: {gradient}\n')

    k_x_marker, = k_and_x.plot(k, x, '.', color=color)
    x_y_marker, = x_and_y.plot(x, f(x), '.', color=color)

    return k_x_marker, x_y_marker

k = 0
x = X0
cur_color = [0, 1, 1]
prev_step = 0
squared_grads_decaying_avg = 0
squared_prev_steps_decaying_avg = 0
grads_decaying_avg = 0

fig, (k_and_x, x_and_y) = plt.subplots(1, 2, figsize=(9, 5))
k_and_x.set_xlabel('k')
k_and_x.set_ylabel('x', rotation=0)
x_and_y.set_xlabel('x')
x_and_y.set_ylabel('y', rotation=0)
plot_func(x_and_y, f)
x_and_y.plot(x, f(x), '.', color=cur_color[:])
k_and_x.plot(k, x, '.', color=cur_color[:])
plt.tight_layout()

ani = FuncAnimation(fig, update, blit=False, repeat=False, interval=INTERVAL)
plt.show()






answered Oct 5 '18 at 9:38 (edited Oct 5 '18 at 12:40)
– Oren Milman
$begingroup$

the code has some problems, like there is an extra f in print(f'LEARNING_RATE..'). Could you please check it?

$endgroup$
– Xiaogang Wang (new contributor), answered 16 mins ago















