Difference between RMSProp with momentum and Adam Optimizers

According to this scintillating blog post, Adam is very similar to RMSProp with momentum.

From the TensorFlow documentation we see that tf.train.RMSPropOptimizer has the following parameters:

__init__(
    learning_rate,
    decay=0.9,
    momentum=0.0,
    epsilon=1e-10,
    use_locking=False,
    centered=False,
    name='RMSProp'
)

while tf.train.AdamOptimizer has:

__init__(
    learning_rate=0.001,
    beta1=0.9,
    beta2=0.999,
    epsilon=1e-08,
    use_locking=False,
    name='Adam'
)

What is the conceptual difference if we set beta1 = decay and beta2 = momentum?
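For concreteness, the pairing I have in mind looks like this (just a sketch against the TF 1.x API above; the values are illustrative):

import tensorflow as tf  # TF 1.x, matching the signatures above

# beta1 = decay and beta2 = momentum, as asked:
rmsprop = tf.train.RMSPropOptimizer(learning_rate=0.001, decay=0.9, momentum=0.999)
adam = tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999)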
tensorflow optimization

asked Jan 18 '18 at 15:21 by hans

  • Did you mean beta1=momentum and beta2=decay? – Oren Milman Sep 20 '18 at 18:33

  • Also, I agree that the blog post An overview of gradient descent optimization algorithms by Sebastian Ruder is great, but note that (as far as I can see) Sebastian doesn't say explicitly that Adam and rmsprop with momentum are very similar. – Oren Milman Oct 4 '18 at 15:43
2 Answers

(My answer is based mostly on Adam: A Method for Stochastic Optimization (the original Adam paper) and on the implementation of rmsprop with momentum in TensorFlow (which is operator() of struct ApplyRMSProp), as rmsprop is unpublished; it was described in a lecture by Geoffrey Hinton.)





Some Background



Adam and rmsprop with momentum are both methods (used by a gradient descent algorithm) to determine the step.



Let $\Delta x^{(t)}_j$ be the $j^{\text{th}}$ component of the $t^{\text{th}}$ step. Then (a short code sketch of both update rules follows this list):




  • In Adam: $$\Delta x_{j}^{(t)}=-\frac{\text{learning\_rate}}{\sqrt{\text{BCMA}\left(g_{j}^{2}\right)}}\cdot\text{BCMA}\left(g_{j}\right)$$
    where:

    • $\text{learning\_rate}$ is a hyperparameter.

    • $\text{BCMA}$ is short for "bias-corrected (exponential) moving average" (I made up the acronym for brevity).

      • All of the moving averages I am going to talk about are exponential moving averages, so I will just refer to them as "moving averages".

    • $g_j$ is the $j^{\text{th}}$ component of the gradient, and so $\text{BCMA}\left(g_{j}\right)$ is a bias-corrected moving average of the $j^{\text{th}}$ components of the gradients that were calculated. Similarly, $\text{BCMA}\left(g_{j}^{2}\right)$ is a bias-corrected moving average of the squares of the $j^{\text{th}}$ components of the gradients that were calculated.

    • For each moving average, the decay factor (aka smoothing factor) is a hyperparameter.

      Both the Adam paper and TensorFlow use the following notation:

      • $\beta_1$ is the decay factor for $\text{BCMA}\left(g_{j}\right)$

      • $\beta_2$ is the decay factor for $\text{BCMA}\left(g^2_{j}\right)$

    • The denominator is actually $\sqrt{\text{BCMA}\left(g_{j}^{2}\right)}+\epsilon$, where $\epsilon$ is a small hyperparameter, but I will ignore it for simplicity.

  • In rmsprop with momentum: $$\Delta x_{j}^{\left(t\right)}=\text{momentum\_decay\_factor}\cdot\Delta x_{j}^{\left(t-1\right)}-\frac{\text{learning\_rate}}{\sqrt{\text{MA}\left(g_{j}^{2}\right)}}\cdot g_{j}^{\left(t\right)}$$
    where:

    • $\text{momentum\_decay\_factor}$ is a hyperparameter, and I will assume it is in $(0,1)$ (as it usually is).

      In TensorFlow, this is the momentum argument of RMSPropOptimizer.

    • $g^{(t)}_j$ is the $j^{\text{th}}$ component of the gradient in the $t^{\text{th}}$ step.

    • $\text{MA}\left(g_{j}^{2}\right)$ is a moving average of the squares of the $j^{\text{th}}$ components of the gradients that were calculated.

      The decay factor of this moving average is a hyperparameter, and in TensorFlow, this is the decay argument of RMSPropOptimizer.
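To make the two rules concrete, here is a minimal NumPy sketch of both updates (my own illustration, not TensorFlow's actual implementation; the function names and the state dict are made up):

import numpy as np

def adam_step(g, state, learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # state holds the zero-initialized moving averages and the step counter
    state['t'] += 1
    state['m'] = beta1 * state['m'] + (1 - beta1) * g       # moving average of g
    state['v'] = beta2 * state['v'] + (1 - beta2) * g ** 2  # moving average of g^2
    m_hat = state['m'] / (1 - beta1 ** state['t'])          # BCMA(g)
    v_hat = state['v'] / (1 - beta2 ** state['t'])          # BCMA(g^2)
    return -learning_rate / (np.sqrt(v_hat) + eps) * m_hat

def rmsprop_momentum_step(g, state, learning_rate=0.001, decay=0.9,
                          momentum=0.9, eps=1e-10):
    # state holds the (biased) moving average of g^2 and the previous step
    state['ms'] = decay * state['ms'] + (1 - decay) * g ** 2
    step = (momentum * state['prev_step']
            - learning_rate / np.sqrt(state['ms'] + eps) * g)
    state['prev_step'] = step
    return step

adam_state = {'m': 0.0, 'v': 0.0, 't': 0}
rms_state = {'ms': 0.0, 'prev_step': 0.0}
x_adam = x_rms = 2.0
for _ in range(100):
    x_adam += adam_step(np.sign(x_adam), adam_state)         # f(x) = |x|, so g = sign(x)
    x_rms += rmsprop_momentum_step(np.sign(x_rms), rms_state)

Note where the momentum-like memory lives: in Adam it is the moving average of the gradients themselves, while in rmsprop with momentum it is the previous step.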




High-Level Comparison



Now we are finally ready to talk about the differences between the two.

The denominator is quite similar (except for the bias-correction, which I explain later).

However, the momentum-like behavior that both share (Adam due to $\text{BCMA}\left(g_{j}\right)$, and rmsprop with momentum due to explicitly taking a fraction of the previous step) is somewhat different.

E.g. this is how Sebastian Ruder describes this difference in his blog post An overview of gradient descent optimization algorithms:

Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface [...]

This description is based on the paper GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, so check it out if you want to dive deeper.

Next, I will describe two simple scenarios to demonstrate the difference in the momentum-like behaviors of the methods.

Lastly, I will describe the difference with regard to bias-correction.





Accumulating Momentum



Consider the following scenario: the gradient was constant in every step in the recent past, and $\Delta x_{j}^{(t-1)}=0$. Also, to keep it simple, $g_{j}>0$.

I.e. we can imagine our algorithm as a stationary ball on a linear slope.

What would happen when we use each of the methods?

[figure: scenario 1]

In Adam

The gradient was constant in the recent past, so $\text{BCMA}\left(g_{j}^{2}\right)\approx g_{j}^{2}$ and $\text{BCMA}\left(g_{j}\right)\approx g_j$.

Thus we get:
$$\begin{gathered}
\Delta x_{j}^{(t)}=-\frac{\text{learning\_rate}}{\sqrt{g_{j}^{2}}}\cdot g_{j}=-\frac{\text{learning\_rate}}{|g_{j}|}\cdot g_{j}\\
\downarrow\\
\Delta x_{j}^{\left(t\right)}=-\text{learning\_rate}
\end{gathered}$$

I.e. the "ball" immediately starts moving downhill at a constant speed.



In rmsprop with momentum

Similarly, we get:
$$\Delta x_{j}^{\left(t\right)}=\text{momentum\_decay\_factor}\cdot\Delta x_{j}^{\left(t-1\right)}-\text{learning\_rate}$$
This case is a little more complicated, but we can see that:
$$\begin{gathered}
\Delta x_{j}^{\left(t\right)}=-\text{learning\_rate}\\
\Delta x_{j}^{\left(t+1\right)}=-\text{learning\_rate}\cdot(1+\text{momentum\_decay\_factor})
\end{gathered}$$

So the "ball" starts accelerating downhill.

Given that the gradient stays constant, you can prove that if:
$$-\frac{\text{learning\_rate}}{1-\text{momentum\_decay\_factor}}<\Delta x_{j}^{\left(k\right)}$$
then: $$-\frac{\text{learning\_rate}}{1-\text{momentum\_decay\_factor}}<\Delta x_{j}^{\left(k+1\right)}<\Delta x_{j}^{\left(k\right)}$$
Therefore, we conclude that the step converges, i.e. $\Delta x_{j}^{\left(k\right)}\approx\Delta x_{j}^{\left(k-1\right)}$ for some $k>t$, and then:
$$\begin{gathered}
\Delta x_{j}^{\left(k\right)}\approx\text{momentum\_decay\_factor}\cdot\Delta x_{j}^{\left(k\right)}-\text{learning\_rate}\\
\downarrow\\
\Delta x_{j}^{\left(k\right)}\approx-\frac{\text{learning\_rate}}{1-\text{momentum\_decay\_factor}}
\end{gathered}$$

Thus, the "ball" accelerates downhill and approaches a speed $\frac{1}{1-\text{momentum\_decay\_factor}}$ times as large as the constant speed of Adam's "ball". (E.g. for a typical $\text{momentum\_decay\_factor}=0.9$, it can approach $10\times$ the speed!)
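As a quick numeric sanity check of this limit (a standalone snippet; it assumes the gradient is constant, so that $\text{MA}\left(g_{j}^{2}\right)\approx g_{j}^{2}$ and the second term of the update is just $-\text{learning\_rate}$):

# Iterate the constant-gradient recurrence of rmsprop with momentum.
learning_rate, momentum_decay_factor = 0.001, 0.9
step = 0.0
for _ in range(200):
    step = momentum_decay_factor * step - learning_rate
print(step)                                          # ~ -0.00999..., i.e. ~10x Adam's -0.001
print(-learning_rate / (1 - momentum_decay_factor))  # the limit: -0.01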



Changing Direction

Now, consider a scenario following the previous one:

After going down the slope (in the previous scenario) for quite some time (i.e. enough time for rmsprop with momentum to reach a nearly constant step size), suddenly a slope with an opposite and smaller constant gradient is reached.

What would happen when we use each of the methods?

[figure: scenario 2]

This time I will just describe the results of my simulation of the scenario (my Python code is at the end of the answer).

Note that I have chosen for Adam's $\text{BCMA}\left(g_{j}\right)$ a decay factor equal to $\text{momentum\_decay\_factor}$. Choosing differently would have changed the following results:

  • Adam is slower to change its direction, and then much slower to get back to the minimum.

  • However, rmsprop with momentum reaches much further before it changes direction (when both use the same $\text{learning\_rate}$).

    Note that this further reach is because rmsprop with momentum first reaches the opposite slope with much higher speed than Adam. If both reached the opposite slope with the same speed (which would happen if Adam's $\text{learning\_rate}$ were $\frac{1}{1-\text{momentum\_decay\_factor}}$ times as large as that of rmsprop with momentum), then Adam would reach further before changing direction.


Bias-Correction



What do we mean by a biased/bias-corrected moving average? (Or at least, what does the Adam paper mean by that?)



Generally speaking, a moving average is a weighted average of:




  • The moving average of all of the previous terms

  • The current term


Then what is the moving average in the first step?




  • A natural choice for a programmer would be to initialize the "moving average of all of the previous terms" to $0$. We say that in this case the moving average is biased towards $0$.

  • When you only have one term, by definition the average should be equal to that term.

    Thus, we say that the moving average is bias-corrected if the moving average in the first step equals the first term (and the moving average works as usual for the rest of the steps).


So here is another difference: The moving averages in Adam are bias-corrected, while the moving average in rmsprop with momentum is biased towards $0$.



For more about the bias-correction in Adam, see section 3 in the paper and also this answer.
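To see the bias and its correction numerically, here is a tiny standalone snippet (my own illustration) that feeds a constant sequence of ones into a zero-initialized moving average:

# A zero-initialized EMA of the constant sequence 1, 1, 1, ... versus its
# bias-corrected version (divide by 1 - beta^t, as in the Adam paper).
beta = 0.9
ma = 0.0
for t in range(1, 6):
    ma = beta * ma + (1 - beta) * 1.0
    print(t, round(ma, 4), round(ma / (1 - beta ** t), 4))
# The biased average starts at 0.1 and only crawls toward 1.0;
# the bias-corrected average equals 1.0 (the first term) from step 1.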



Simulation Python Code





import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

###########################################
# script parameters
def f(x):
    if x > 0:
        return x
    else:
        return -0.1 * x

def f_grad(x):
    if x > 0:
        return 1
    else:
        return -0.1

METHOD_TO_LEARNING_RATE = {
    'Adam': 0.01,
    'GD': 0.00008,
    'rmsprop_with_Nesterov_momentum': 0.008,
    'rmsprop_with_momentum': 0.001,
    'rmsprop': 0.02,
    'momentum': 0.00008,
    'Nesterov': 0.008,
    'Adadelta': None,
}
X0 = 2
# the last assignment wins; reorder to simulate a different method
METHOD = 'rmsprop'
METHOD = 'momentum'
METHOD = 'GD'
METHOD = 'rmsprop_with_Nesterov_momentum'
METHOD = 'Nesterov'
METHOD = 'Adadelta'
METHOD = 'rmsprop_with_momentum'
METHOD = 'Adam'
LEARNING_RATE = METHOD_TO_LEARNING_RATE[METHOD]

MOMENTUM_DECAY_FACTOR = 0.9
RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.9
ADADELTA_DECAY_FACTOR = 0.9
RMSPROP_EPSILON = 1e-10
ADADELTA_EPSILON = 1e-6
ADAM_EPSILON = 1e-10
ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.999
ADAM_GRADS_AVG_DECAY_FACTOR = 0.9

# milliseconds between animation frames (again, the last assignment wins)
INTERVAL = 9e2
INTERVAL = 1
INTERVAL = 3e2
INTERVAL = 3e1
###########################################

def plot_func(axe, f):
    xs = np.arange(-X0 * 0.5, X0 * 1.05, abs(X0) / 100)
    vf = np.vectorize(f)
    ys = vf(xs)
    return axe.plot(xs, ys, color='grey')

def next_color(color, f):
    color[1] -= 0.01
    if color[1] < 0:
        color[1] = 1
    return color[:]

def update(frame):
    global k, x, prev_step, squared_grads_decaying_avg, \
        squared_prev_steps_decaying_avg, grads_decaying_avg

    if METHOD in ('momentum', 'Nesterov', 'rmsprop_with_momentum',
                  'rmsprop_with_Nesterov_momentum'):
        step_momentum_portion = MOMENTUM_DECAY_FACTOR * prev_step
    if METHOD in ('Nesterov', 'rmsprop_with_Nesterov_momentum'):
        gradient = f_grad(x + step_momentum_portion)
    else:
        gradient = f_grad(x)

    if METHOD == 'GD':
        step = -LEARNING_RATE * gradient
    elif METHOD in ('momentum', 'Nesterov'):
        step = step_momentum_portion - LEARNING_RATE * gradient
    elif METHOD in ('rmsprop', 'rmsprop_with_momentum',
                    'rmsprop_with_Nesterov_momentum'):
        squared_grads_decaying_avg = (
            RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
        grads_rms = np.sqrt(squared_grads_decaying_avg + RMSPROP_EPSILON)
        if METHOD == 'rmsprop':
            step = -LEARNING_RATE / grads_rms * gradient
        else:
            assert METHOD in ('rmsprop_with_momentum',
                              'rmsprop_with_Nesterov_momentum')
            print(f'LEARNING_RATE / grads_rms * gradient: '
                  f'{LEARNING_RATE / grads_rms * gradient}')
            step = step_momentum_portion - LEARNING_RATE / grads_rms * gradient
    elif METHOD == 'Adadelta':
        gradient = f_grad(x)
        squared_grads_decaying_avg = (
            ADADELTA_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - ADADELTA_DECAY_FACTOR) * gradient ** 2)
        grads_rms = np.sqrt(squared_grads_decaying_avg + ADADELTA_EPSILON)
        squared_prev_steps_decaying_avg = (
            ADADELTA_DECAY_FACTOR * squared_prev_steps_decaying_avg +
            (1 - ADADELTA_DECAY_FACTOR) * prev_step ** 2)
        prev_steps_rms = np.sqrt(squared_prev_steps_decaying_avg + ADADELTA_EPSILON)
        step = -prev_steps_rms / grads_rms * gradient
    elif METHOD == 'Adam':
        squared_grads_decaying_avg = (
            ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
        unbiased_squared_grads_decaying_avg = (
            squared_grads_decaying_avg /
            (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
        grads_decaying_avg = (
            ADAM_GRADS_AVG_DECAY_FACTOR * grads_decaying_avg +
            (1 - ADAM_GRADS_AVG_DECAY_FACTOR) * gradient)
        unbiased_grads_decaying_avg = (
            grads_decaying_avg /
            (1 - ADAM_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
        step = -(LEARNING_RATE /
                 (np.sqrt(unbiased_squared_grads_decaying_avg) + ADAM_EPSILON) *
                 unbiased_grads_decaying_avg)

    x += step
    prev_step = step
    k += 1

    color = next_color(cur_color, f)

    print(f'k: {k}\n'
          f'x: {x}\n'
          f'step: {step}\n'
          f'gradient: {gradient}\n')

    k_x_marker, = k_and_x.plot(k, x, '.', color=color)
    x_y_marker, = x_and_y.plot(x, f(x), '.', color=color)

    return k_x_marker, x_y_marker

k = 0
x = X0
cur_color = [0, 1, 1]
prev_step = 0
squared_grads_decaying_avg = 0
squared_prev_steps_decaying_avg = 0
grads_decaying_avg = 0

fig, (k_and_x, x_and_y) = plt.subplots(1, 2, figsize=(9, 5))
k_and_x.set_xlabel('k')
k_and_x.set_ylabel('x', rotation=0)
x_and_y.set_xlabel('x')
x_and_y.set_ylabel('y', rotation=0)
plot_func(x_and_y, f)
x_and_y.plot(x, f(x), '.', color=cur_color[:])
k_and_x.plot(k, x, '.', color=cur_color[:])
plt.tight_layout()

ani = FuncAnimation(fig, update, blit=False, repeat=False, interval=INTERVAL)
plt.show()





The code has some problems, like there is an extra f in print(f'LEARNING_RATE..'). Could you please check it?

– Xiaogang Wang (new contributor)

          def plot_func(axe, f):
          xs = np.arange(-X0 * 0.5, X0 * 1.05, abs(X0) / 100)
          vf = np.vectorize(f)
          ys = vf(xs)
          return axe.plot(xs, ys, color='grey')

          def next_color(color, f):
          color[1] -= 0.01
          if color[1] < 0:
          color[1] = 1
          return color[:]

          def update(frame):
          global k, x, prev_step, squared_grads_decaying_avg,
          squared_prev_steps_decaying_avg, grads_decaying_avg

          if METHOD in ('momentum', 'Nesterov', 'rmsprop_with_momentum',
          'rmsprop_with_Nesterov_momentum'):
          step_momentum_portion = MOMENTUM_DECAY_FACTOR * prev_step
          if METHOD in ('Nesterov', 'rmsprop_with_Nesterov_momentum'):
          gradient = f_grad(x + step_momentum_portion)
          else:
          gradient = f_grad(x)

          if METHOD == 'GD':
          step = -LEARNING_RATE * gradient
          elif METHOD in ('momentum', 'Nesterov'):
          step = step_momentum_portion - LEARNING_RATE * gradient
          elif METHOD in ('rmsprop', 'rmsprop_with_momentum',
          'rmsprop_with_Nesterov_momentum'):
          squared_grads_decaying_avg = (
          RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
          (1 - RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
          grads_rms = np.sqrt(squared_grads_decaying_avg + RMSPROP_EPSILON)
          if METHOD == 'rmsprop':
          step = -LEARNING_RATE / grads_rms * gradient
          else:
          assert(METHOD in ('rmsprop_with_momentum',
          'rmsprop_with_Nesterov_momentum'))
          print(f'LEARNING_RATE / grads_rms * gradient: {LEARNING_RATE / grads_rms * gradient}')
          step = step_momentum_portion - LEARNING_RATE / grads_rms * gradient
          elif METHOD == 'Adadelta':
          gradient = f_grad(x)
          squared_grads_decaying_avg = (
          ADADELTA_DECAY_FACTOR * squared_grads_decaying_avg +
          (1 - ADADELTA_DECAY_FACTOR) * gradient ** 2)
          grads_rms = np.sqrt(squared_grads_decaying_avg + ADADELTA_EPSILON)
          squared_prev_steps_decaying_avg = (
          ADADELTA_DECAY_FACTOR * squared_prev_steps_decaying_avg +
          (1 - ADADELTA_DECAY_FACTOR) * prev_step ** 2)
          prev_steps_rms = np.sqrt(squared_prev_steps_decaying_avg + ADADELTA_EPSILON)
          step = - prev_steps_rms / grads_rms * gradient
          elif METHOD == 'Adam':
          squared_grads_decaying_avg = (
          ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
          (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
          unbiased_squared_grads_decaying_avg = (
          squared_grads_decaying_avg /
          (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
          grads_decaying_avg = (
          ADAM_GRADS_AVG_DECAY_FACTOR * grads_decaying_avg +
          (1 - ADAM_GRADS_AVG_DECAY_FACTOR) * gradient)
          unbiased_grads_decaying_avg = (
          grads_decaying_avg /
          (1 - ADAM_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
          step = - (LEARNING_RATE /
          (np.sqrt(unbiased_squared_grads_decaying_avg) + ADAM_EPSILON) *
          unbiased_grads_decaying_avg)

          x += step
          prev_step = step
          k += 1

          color = next_color(cur_color, f)

          print(f'k: {k}n'
          f'x: {x}n'
          f'step: {step}n'
          f'gradient: {gradient}n')

          k_x_marker, = k_and_x.plot(k, x, '.', color=color)
          x_y_marker, = x_and_y.plot(x, f(x), '.', color=color)

          return k_x_marker, x_y_marker

          k = 0
          x = X0
          cur_color = [0, 1, 1]
          prev_step = 0
          squared_grads_decaying_avg = 0
          squared_prev_steps_decaying_avg = 0
          grads_decaying_avg = 0

          fig, (k_and_x, x_and_y) = plt.subplots(1, 2, figsize=(9,5))
          k_and_x.set_xlabel('k')
          k_and_x.set_ylabel('x', rotation=0)
          x_and_y.set_xlabel('x')
          x_and_y.set_ylabel('y', rotation=0)
          plot_func(x_and_y, f)
          x_and_y.plot(x, f(x), '.', color=cur_color[:])
          k_and_x.plot(k, x, '.', color=cur_color[:])
          plt.tight_layout()

          ani = FuncAnimation(fig, update, blit=False, repeat=False, interval=INTERVAL)
          plt.show()





          share|improve this answer











          $endgroup$



(My answer is based mostly on Adam: A Method for Stochastic Optimization (the original Adam paper) and on the implementation of rmsprop with momentum in TensorFlow (which is operator() of struct ApplyRMSProp), as rmsprop is unpublished - it was described in a lecture by Geoffrey Hinton.)





          Some Background



          Adam and rmsprop with momentum are both methods (used by a gradient descent algorithm) to determine the step.



Let $\Delta x^{(t)}_j$ be the $j^{\text{th}}$ component of the $t^{\text{th}}$ step. Then:




• In Adam: $$\Delta x_{j}^{(t)}=-\frac{\text{learning\_rate}}{\sqrt{\text{BCMA}\left(g_{j}^{2}\right)}}\cdot\text{BCMA}\left(g_{j}\right)$$
  where:

  • $\text{learning\_rate}$ is a hyperparameter.

  • $\text{BCMA}$ is short for "bias-corrected (exponential) moving average" (I made up the acronym for brevity).

    • All of the moving averages I am going to talk about are exponential moving averages, so I will just refer to them as "moving averages".

  • $g_j$ is the $j^{\text{th}}$ component of the gradient, and so $\text{BCMA}\left(g_{j}\right)$ is a bias-corrected moving average of the $j^{\text{th}}$ components of the gradients that were calculated. Similarly, $\text{BCMA}\left(g_{j}^{2}\right)$ is a bias-corrected moving average of the squares of the $j^{\text{th}}$ components of the gradients that were calculated.

  • For each moving average, the decay factor (aka smoothing factor) is a hyperparameter.

    Both the Adam paper and TensorFlow use the following notation:

    • $\beta_1$ is the decay factor for $\text{BCMA}\left(g_{j}\right)$

    • $\beta_2$ is the decay factor for $\text{BCMA}\left(g^2_{j}\right)$

  • The denominator is actually $\sqrt{\text{BCMA}\left(g_{j}^{2}\right)}+\epsilon$, where $\epsilon$ is a small hyperparameter, but I will ignore it for simplicity.



• In rmsprop with momentum: $$\Delta x_{j}^{\left(t\right)}=\text{momentum\_decay\_factor}\cdot\Delta x_{j}^{\left(t-1\right)}-\frac{\text{learning\_rate}}{\sqrt{\text{MA}\left(g_{j}^{2}\right)}}\cdot g_{j}^{\left(t\right)}$$
  where:

  • $\text{momentum\_decay\_factor}$ is a hyperparameter, and I will assume it is in $(0,1)$ (as it usually is).

    In TensorFlow, this is the momentum argument of RMSPropOptimizer.

  • $g^{(t)}_j$ is the $j^{\text{th}}$ component of the gradient in the $t^{\text{th}}$ step.

  • $\text{MA}\left(g_{j}^{2}\right)$ is a moving average of the squares of the $j^{\text{th}}$ components of the gradients that were calculated.

    The decay factor of this moving average is a hyperparameter, and in TensorFlow, this is the decay argument of RMSPropOptimizer.



          High-Level Comparison



          Now we are finally ready to talk about the differences between the two.



The denominator is quite similar (except for the bias-correction, which I explain later).

However, the momentum-like behavior that both share (Adam due to $\text{BCMA}\left(g_{j}\right)$, and rmsprop with momentum due to explicitly taking a fraction of the previous step) is somewhat different.

          E.g. this is how Sebastian Ruder describes this difference in his blog post An overview of gradient descent optimization algorithms:




          Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface [...]




          This description is based on the paper GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, so check it out if you want to dive deeper.



Next, I will describe two simple scenarios to demonstrate the difference in the momentum-like behaviors of the methods.



Lastly, I will describe the difference with regard to bias-correction.





          Accumulating Momentum



Consider the following scenario: The gradient was constant in every step in the recent past, and $\Delta x_{j}^{(t-1)}=0$. Also, to keep it simple, $g_{j}>0$.

          I.e. we can imagine our algorithm as a stationary ball on a linear slope.

          What would happen when we use each of the methods?



[figure: scenario 1]



          In Adam



The gradient was constant in the recent past, so $\text{BCMA}\left(g_{j}^{2}\right)\approx g_{j}^{2}$ and $\text{BCMA}\left(g_{j}\right)\approx g_j$.

Thus we get:
$$\begin{gathered}
\Delta x_{j}^{(t)}=-\frac{\text{learning\_rate}}{\sqrt{g_{j}^{2}}}\cdot g_{j}=-\frac{\text{learning\_rate}}{|g_{j}|}\cdot g_{j}\\
\downarrow\\
\Delta x_{j}^{\left(t\right)}=-\text{learning\_rate}
\end{gathered}$$



          I.e. the "ball" immediately starts moving downhill in a constant speed.



          In rmsprop with momentum



Similarly, we get:
$$\Delta x_{j}^{\left(t\right)}=\text{momentum\_decay\_factor}\cdot\Delta x_{j}^{\left(t-1\right)}-\text{learning\_rate}$$
This case is a little more complicated, but we can see that:
$$\begin{gathered}
\Delta x_{j}^{\left(t\right)}=-\text{learning\_rate}\\
\Delta x_{j}^{\left(t+1\right)}=-\text{learning\_rate}\cdot(1+\text{momentum\_decay\_factor})
\end{gathered}$$

          So the "ball" starts accelerating downhill.

Given that the gradient stays constant, you can prove that if:
$$-\frac{\text{learning\_rate}}{1-\text{momentum\_decay\_factor}}<\Delta x_{j}^{\left(k\right)}$$
then: $$-\frac{\text{learning\_rate}}{1-\text{momentum\_decay\_factor}}<\Delta x_{j}^{\left(k+1\right)}<\Delta x_{j}^{\left(k\right)}$$
Therefore, we conclude that the step converges, i.e. $\Delta x_{j}^{\left(k\right)}\approx \Delta x_{j}^{\left(k-1\right)}$ for some $k>t$, and then:
$$\begin{gathered}
\Delta x_{j}^{\left(k\right)}\approx \text{momentum\_decay\_factor}\cdot\Delta x_{j}^{\left(k\right)}-\text{learning\_rate}\\
\downarrow\\
\Delta x_{j}^{\left(k\right)}\approx -\frac{\text{learning\_rate}}{1-\text{momentum\_decay\_factor}}
\end{gathered}$$

          Thus, the "ball" accelerates downhill and approaches a speed $frac{1}{1-text{momentum_decay_factor}}$ times as large as the constant speed of Adam's "ball". (E.g. for a typical $text{momentum_decay_factor}=0.9$, it can approach $10times$ speed!)



          Changing Direction



          Now, consider a scenario following the previous one:

After going down the slope (in the previous scenario) for quite some time (i.e. long enough for rmsprop with momentum to reach a nearly constant step size), the ball suddenly reaches a slope with an opposite and smaller constant gradient.

          What would happen when we use each of the methods?



[figure: scenario 2]



This time I will just describe the results of my simulation of the scenario (my Python code is at the end of the answer).

Note that I have chosen for Adam's $\text{BCMA}\left(g_{j}\right)$ a decay factor equal to $\text{momentum\_decay\_factor}$. Choosing differently would have changed the following results:




• Adam is slower to change its direction, and then much slower to get back to the minimum.

• However, rmsprop with momentum reaches much further before it changes direction (when both use the same $\text{learning\_rate}$); see the sketch after this list.

  Note that this further reach is because rmsprop with momentum first reaches the opposite slope with much higher speed than Adam. If both reached the opposite slope with the same speed (which would happen if Adam's $\text{learning\_rate}$ were $\frac{1}{1-\text{momentum\_decay\_factor}}$ times as large as that of rmsprop with momentum), then Adam would reach further before changing direction.
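
Here is a stripped-down, non-animated sketch of this scenario (my own condensation of the full script at the end of the answer, using the same learning_rate for both methods), which records how far each method overshoots past the minimum at $x=0$:

import numpy as np

def grad(x):
    # slope of f(x) = x for x > 0, and of f(x) = -0.1 * x for x <= 0 (minimum at x = 0)
    return 1.0 if x > 0 else -0.1

lr, m_decay, sq_decay, b1, b2, eps = 0.001, 0.9, 0.9, 0.9, 0.999, 1e-10
x_rms = x_adam = 2.0                    # both start on the steep slope
v_rms = prev_step = m_adam = v_adam = 0.0
min_x_rms = min_x_adam = 2.0

for k in range(1, 5001):
    g = grad(x_rms)                     # rmsprop with momentum
    v_rms = sq_decay * v_rms + (1 - sq_decay) * g ** 2
    prev_step = m_decay * prev_step - lr / np.sqrt(v_rms + eps) * g
    x_rms += prev_step
    min_x_rms = min(min_x_rms, x_rms)

    g = grad(x_adam)                    # Adam (with bias-correction)
    m_adam = b1 * m_adam + (1 - b1) * g
    v_adam = b2 * v_adam + (1 - b2) * g ** 2
    x_adam += -lr / (np.sqrt(v_adam / (1 - b2 ** k)) + eps) * (m_adam / (1 - b1 ** k))
    min_x_adam = min(min_x_adam, x_adam)

print(f'furthest overshoot past x=0: rmsprop with momentum {min_x_rms:.3f}, Adam {min_x_adam:.3f}')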


          Bias-Correction



          What do we mean by a biased/bias-corrected moving average? (Or at least, what does the Adam paper mean by that?)



          Generally speaking, a moving average is a weighted average of:




          • The moving average of all of the previous terms

          • The current term


          Then what is the moving average in the first step?




          • A natural choice for a programmer would be to initialize the "moving average of all of the previous terms" to $0$. We say that in this case the moving average is biased towards $0$.

• When you only have one term, by definition the average should be equal to that term.

  Thus, we say that the moving average is bias-corrected if the moving average in the first step equals the first term (and the moving average works as usual for the rest of the steps).


          So here is another difference: The moving averages in Adam are bias-corrected, while the moving average in rmsprop with momentum is biased towards $0$.



          For more about the bias-correction in Adam, see section 3 in the paper and also this answer.
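
A minimal numerical illustration (my own, with an arbitrary first gradient of $5$ and a decay factor of $0.9$): initializing the average at $0$ makes the first "average" $0.5$, while dividing by $1-\beta^t$ recovers the first term exactly:

beta = 0.9          # decay factor of the moving average
g1 = 5.0            # the first term

biased = beta * 0.0 + (1 - beta) * g1      # initialized at 0 -> 0.5, biased towards 0
bias_corrected = biased / (1 - beta ** 1)  # 0.5 / 0.1 = 5.0, equal to the first term

print(biased, bias_corrected)              # 0.5 5.0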



          Simulation Python Code





import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

###########################################
# script parameters
def f(x):
    if x > 0:
        return x
    else:
        return -0.1 * x

def f_grad(x):
    if x > 0:
        return 1
    else:
        return -0.1

METHOD_TO_LEARNING_RATE = {
    'Adam': 0.01,
    'GD': 0.00008,
    'rmsprop_with_Nesterov_momentum': 0.008,
    'rmsprop_with_momentum': 0.001,
    'rmsprop': 0.02,
    'momentum': 0.00008,
    'Nesterov': 0.008,
    'Adadelta': None,
}
X0 = 2
# the last assignment wins - reorder these lines to pick a method
METHOD = 'rmsprop'
METHOD = 'momentum'
METHOD = 'GD'
METHOD = 'rmsprop_with_Nesterov_momentum'
METHOD = 'Nesterov'
METHOD = 'Adadelta'
METHOD = 'rmsprop_with_momentum'
METHOD = 'Adam'
LEARNING_RATE = METHOD_TO_LEARNING_RATE[METHOD]

MOMENTUM_DECAY_FACTOR = 0.9
RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.9
ADADELTA_DECAY_FACTOR = 0.9
RMSPROP_EPSILON = 1e-10
ADADELTA_EPSILON = 1e-6
ADAM_EPSILON = 1e-10
ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.999
ADAM_GRADS_AVG_DECAY_FACTOR = 0.9

# milliseconds between animation frames; the last assignment wins
INTERVAL = 9e2
INTERVAL = 1
INTERVAL = 3e2
INTERVAL = 3e1
###########################################

def plot_func(axe, f):
    xs = np.arange(-X0 * 0.5, X0 * 1.05, abs(X0) / 100)
    vf = np.vectorize(f)
    ys = vf(xs)
    return axe.plot(xs, ys, color='grey')

def next_color(color, f):
    # slowly cycle the green channel so newer markers are distinguishable
    color[1] -= 0.01
    if color[1] < 0:
        color[1] = 1
    return color[:]

def update(frame):
    global k, x, prev_step, squared_grads_decaying_avg, \
        squared_prev_steps_decaying_avg, grads_decaying_avg

    if METHOD in ('momentum', 'Nesterov', 'rmsprop_with_momentum',
                  'rmsprop_with_Nesterov_momentum'):
        step_momentum_portion = MOMENTUM_DECAY_FACTOR * prev_step
    if METHOD in ('Nesterov', 'rmsprop_with_Nesterov_momentum'):
        gradient = f_grad(x + step_momentum_portion)
    else:
        gradient = f_grad(x)

    if METHOD == 'GD':
        step = -LEARNING_RATE * gradient
    elif METHOD in ('momentum', 'Nesterov'):
        step = step_momentum_portion - LEARNING_RATE * gradient
    elif METHOD in ('rmsprop', 'rmsprop_with_momentum',
                    'rmsprop_with_Nesterov_momentum'):
        squared_grads_decaying_avg = (
            RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
        grads_rms = np.sqrt(squared_grads_decaying_avg + RMSPROP_EPSILON)
        if METHOD == 'rmsprop':
            step = -LEARNING_RATE / grads_rms * gradient
        else:
            assert METHOD in ('rmsprop_with_momentum',
                              'rmsprop_with_Nesterov_momentum')
            print(f'LEARNING_RATE / grads_rms * gradient: '
                  f'{LEARNING_RATE / grads_rms * gradient}')
            step = step_momentum_portion - LEARNING_RATE / grads_rms * gradient
    elif METHOD == 'Adadelta':
        gradient = f_grad(x)
        squared_grads_decaying_avg = (
            ADADELTA_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - ADADELTA_DECAY_FACTOR) * gradient ** 2)
        grads_rms = np.sqrt(squared_grads_decaying_avg + ADADELTA_EPSILON)
        squared_prev_steps_decaying_avg = (
            ADADELTA_DECAY_FACTOR * squared_prev_steps_decaying_avg +
            (1 - ADADELTA_DECAY_FACTOR) * prev_step ** 2)
        prev_steps_rms = np.sqrt(squared_prev_steps_decaying_avg + ADADELTA_EPSILON)
        step = -prev_steps_rms / grads_rms * gradient
    elif METHOD == 'Adam':
        # biased moving averages, then bias-correction by 1 - decay_factor ** (k + 1)
        squared_grads_decaying_avg = (
            ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
        unbiased_squared_grads_decaying_avg = (
            squared_grads_decaying_avg /
            (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
        grads_decaying_avg = (
            ADAM_GRADS_AVG_DECAY_FACTOR * grads_decaying_avg +
            (1 - ADAM_GRADS_AVG_DECAY_FACTOR) * gradient)
        unbiased_grads_decaying_avg = (
            grads_decaying_avg /
            (1 - ADAM_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
        step = -(LEARNING_RATE /
                 (np.sqrt(unbiased_squared_grads_decaying_avg) + ADAM_EPSILON) *
                 unbiased_grads_decaying_avg)

    x += step
    prev_step = step
    k += 1

    color = next_color(cur_color, f)

    print(f'k: {k}\n'
          f'x: {x}\n'
          f'step: {step}\n'
          f'gradient: {gradient}\n')

    k_x_marker, = k_and_x.plot(k, x, '.', color=color)
    x_y_marker, = x_and_y.plot(x, f(x), '.', color=color)

    return k_x_marker, x_y_marker

k = 0
x = X0
cur_color = [0, 1, 1]
prev_step = 0
squared_grads_decaying_avg = 0
squared_prev_steps_decaying_avg = 0
grads_decaying_avg = 0

fig, (k_and_x, x_and_y) = plt.subplots(1, 2, figsize=(9, 5))
k_and_x.set_xlabel('k')
k_and_x.set_ylabel('x', rotation=0)
x_and_y.set_xlabel('x')
x_and_y.set_ylabel('y', rotation=0)
plot_func(x_and_y, f)
x_and_y.plot(x, f(x), '.', color=cur_color[:])
k_and_x.plot(k, x, '.', color=cur_color[:])
plt.tight_layout()

ani = FuncAnimation(fig, update, blit=False, repeat=False, interval=INTERVAL)
plt.show()






answered Oct 5 '18 at 9:38 (edited Oct 5 '18 at 12:40)
– Oren Milman
$begingroup$

the code has some problems, like there is an extra f in print(f'LEARNING_RATE..'). Could you please check it?

$endgroup$
– Xiaogang Wang (new contributor), answered 16 mins ago















