Difference between RMSProp with momentum and Adam Optimizers
According to this scintillating blog post, Adam is very similar to RMSProp with momentum.
From the TensorFlow documentation we see that tf.train.RMSPropOptimizer has the following parameters:
__init__(
learning_rate,
decay=0.9,
momentum=0.0,
epsilon=1e-10,
use_locking=False,
centered=False,
name='RMSProp'
)
while tf.train.AdamOptimizer has:
__init__(
learning_rate=0.001,
beta1=0.9,
beta2=0.999,
epsilon=1e-08,
use_locking=False,
name='Adam'
)
What is the conceptual difference if we put beta1 = decay and beta2 = momentum?
Tags: tensorflow, optimization
Did you mean beta1=momentum and beta2=decay?
– Oren Milman, Sep 20 '18 at 18:33
Also, I agree that the blog post An overview of gradient descent optimization algorithms by Sebastian Ruder is great, but note that (as far as I can see) Sebastian doesn't say explicitly that Adam and rmsprop with momentum are very similar.
– Oren Milman, Oct 4 '18 at 15:43
2 Answers
(My answer is based mostly on Adam: A Method for Stochastic Optimization (the original Adam paper) and on the implementation of rmsprop with momentum in TensorFlow (which is operator() of struct ApplyRMSProp), as rmsprop is unpublished - it was described in a lecture by Geoffrey Hinton.)
Some Background
Adam and rmsprop with momentum are both methods (used by a gradient descent algorithm) to determine the step.
Let $\Delta x^{(t)}_j$ be the $j^{\text{th}}$ component of the $t^{\text{th}}$ step. Then:
- In Adam: $$\Delta x_{j}^{(t)}=-\frac{\text{learning\_rate}}{\sqrt{\text{BCMA}\left(g_{j}^{2}\right)}}\cdot\text{BCMA}\left(g_{j}\right)$$
  where:
  - $\text{learning\_rate}$ is a hyperparameter.
  - $\text{BCMA}$ is short for "bias-corrected (exponential) moving average" (I made up the acronym for brevity). All of the moving averages I am going to talk about are exponential moving averages, so I will just refer to them as "moving averages".
  - $g_j$ is the $j^{\text{th}}$ component of the gradient, and so $\text{BCMA}\left(g_{j}\right)$ is a bias-corrected moving average of the $j^{\text{th}}$ components of the gradients that were calculated. Similarly, $\text{BCMA}\left(g_{j}^{2}\right)$ is a bias-corrected moving average of the squares of the $j^{\text{th}}$ components of the gradients that were calculated.
  - For each moving average, the decay factor (aka smoothing factor) is a hyperparameter. Both the Adam paper and TensorFlow use the following notation: $\beta_1$ is the decay factor for $\text{BCMA}\left(g_{j}\right)$, and $\beta_2$ is the decay factor for $\text{BCMA}\left(g^2_{j}\right)$.
  - The denominator is actually $\sqrt{\text{BCMA}\left(g_{j}^{2}\right)}+\epsilon$, where $\epsilon$ is a small hyperparameter, but I will ignore it for simplicity.
- In rmsprop with momentum: $$\Delta x_{j}^{\left(t\right)}=\text{momentum\_decay\_factor}\cdot\Delta x_{j}^{\left(t-1\right)}-\frac{\text{learning\_rate}}{\sqrt{\text{MA}\left(g_{j}^{2}\right)}}\cdot g_{j}^{\left(t\right)}$$
  where:
  - $\text{momentum\_decay\_factor}$ is a hyperparameter, and I will assume it is in $(0,1)$ (as it usually is). In TensorFlow, this is the momentum argument of RMSPropOptimizer.
  - $g^{(t)}_j$ is the $j^{\text{th}}$ component of the gradient in the $t^{\text{th}}$ step.
  - $\text{MA}\left(g_{j}^{2}\right)$ is a moving average of the squares of the $j^{\text{th}}$ components of the gradients that were calculated. The decay factor of this moving average is a hyperparameter, and in TensorFlow, this is the decay argument of RMSPropOptimizer.
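To make the two update rules above concrete, here is a minimal per-component sketch in plain Python (an illustrative sketch under the notation above, not TensorFlow's implementation; the function and variable names are made up):

def adam_step(g, m, v, t, learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam step for a single parameter component.
    # g: current gradient; m, v: moving averages of g and g**2; t: 1-based step index.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected moving average of gradients
    v_hat = v / (1 - beta2 ** t)   # bias-corrected moving average of squared gradients
    step = -learning_rate * m_hat / (v_hat ** 0.5 + eps)
    return step, m, v

def rmsprop_momentum_step(g, ma, prev_step, learning_rate=0.001, decay=0.9,
                          momentum=0.9, eps=1e-10):
    # One rmsprop-with-momentum step for a single parameter component.
    # ma: (biased) moving average of g**2; prev_step: the previous step.
    ma = decay * ma + (1 - decay) * g ** 2
    step = momentum * prev_step - learning_rate * g / (ma + eps) ** 0.5
    return step, ma

Note how beta1 plays the momentum-like role (a moving average of gradients), while beta2 plays the role that decay plays in rmsprop (a moving average of squared gradients), which is what the first comment under the question points out.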
High-Level Comparison
Now we are finally ready to talk about the differences between the two.
The denominators are quite similar (except for the bias-correction, which I explain later).
However, the momentum-like behavior that both share (Adam due to $\text{BCMA}\left(g_{j}\right)$, and rmsprop with momentum due to explicitly taking a fraction of the previous step) is somewhat different.
E.g. this is how Sebastian Ruder describes this difference in his blog post An overview of gradient descent optimization algorithms:
Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface [...]
This description is based on the paper GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, so check it out if you want to dive deeper.
Next, I will describe two simple scenarios to demonstrate the difference in the momentum-like behaviors of the methods.
Lastly, I will describe the difference with regard to bias-correction.
Accumulating Momentum
Consider the following scenario: The gradient was constant in every step in the recent past, and $\Delta x_{j}^{(t-1)}=0$. Also, to keep it simple, $g_{j}>0$.
I.e. we can imagine our algorithm as a stationary ball on a linear slope.
What would happen when we use each of the methods?
In Adam
The gradient was constant in the recent past, so $\text{BCMA}\left(g_{j}^{2}\right)\approx g_{j}^{2}$ and $\text{BCMA}\left(g_{j}\right)\approx g_j$.
Thus we get:
$$\begin{gathered}
\Delta x_{j}^{(t)}=-\frac{\text{learning\_rate}}{\sqrt{g_{j}^{2}}}\cdot g_{j}=-\frac{\text{learning\_rate}}{|g_{j}|}\cdot g_{j}\\
\downarrow\\
\Delta x_{j}^{\left(t\right)}=-\text{learning\_rate}
\end{gathered}$$
I.e. the "ball" immediately starts moving downhill at a constant speed.
In rmsprop with momentum
Similarly, we get:
$$\Delta x_{j}^{\left(t\right)}=\text{momentum\_decay\_factor}\cdot\Delta x_{j}^{\left(t-1\right)}-\text{learning\_rate}$$
This case is a little more complicated, but we can see that:
$$\begin{gathered}
\Delta x_{j}^{\left(t\right)}=-\text{learning\_rate}\\
\Delta x_{j}^{\left(t+1\right)}=-\text{learning\_rate}\cdot(1+\text{momentum\_decay\_factor})
\end{gathered}$$
So the "ball" starts accelerating downhill.
Given that the gradient stays constant, you can prove that if:
$$-\frac{\text{learning\_rate}}{1-\text{momentum\_decay\_factor}}<\Delta x_{j}^{\left(k\right)}$$
then: $$-\frac{\text{learning\_rate}}{1-\text{momentum\_decay\_factor}}<\Delta x_{j}^{\left(k+1\right)}<\Delta x_{j}^{\left(k\right)}$$
Therefore, we conclude that the step converges, i.e. $\Delta x_{j}^{\left(k\right)}\approx \Delta x_{j}^{\left(k-1\right)}$ for some $k>t$, and then:
$$\begin{gathered}
\Delta x_{j}^{\left(k\right)}\approx \text{momentum\_decay\_factor}\cdot\Delta x_{j}^{\left(k\right)}-\text{learning\_rate}\\
\downarrow\\
\Delta x_{j}^{\left(k\right)}\approx -\frac{\text{learning\_rate}}{1-\text{momentum\_decay\_factor}}
\end{gathered}$$
Thus, the "ball" accelerates downhill and approaches a speed $\frac{1}{1-\text{momentum\_decay\_factor}}$ times as large as the constant speed of Adam's "ball". (E.g. for a typical $\text{momentum\_decay\_factor}=0.9$, it can approach $10\times$ the speed!)
Changing Direction
Now, consider a scenario following the previous one:
After going down the slope (in the previous scenario) for quite some time (i.e. enough time for rmsprop with momentum to reach a nearly constant step size), suddenly a slope with an opposite and smaller constant gradient is reached.
What would happen when we use each of the methods?
This time I will just describe the results of my simulation of the scenario (my Python code is at the end of the answer).
Note that I have chosen for Adam's $\text{BCMA}\left(g_{j}\right)$ a decay factor equal to $\text{momentum\_decay\_factor}$. Choosing differently would have changed the following results:
- Adam is slower to change its direction, and then much slower to get back to the minimum.
- However, rmsprop with momentum reaches much further before it changes direction (when both use the same $\text{learning\_rate}$).
Note that this further reach is because rmsprop with momentum first reaches the opposite slope with much higher speed than Adam. If both reached the opposite slope with the same speed (which would happen if Adam's $\text{learning\_rate}$ were $\frac{1}{1-\text{momentum\_decay\_factor}}$ times as large as that of rmsprop with momentum), then Adam would reach further before changing direction.
Bias-Correction
What do we mean by a biased/bias-corrected moving average? (Or at least, what does the Adam paper mean by that?)
Generally speaking, a moving average is a weighted average of:
- The moving average of all of the previous terms
- The current term
Then what is the moving average in the first step?
- A natural choice for a programmer would be to initialize the "moving average of all of the previous terms" to $0$. We say that in this case the moving average is biased towards $0$.
- When you only have one term, by definition the average should be equal to that term.
Thus, we say that the moving average is bias-corrected if the moving average in the first step equals the first term (and the moving average works as usual for the rest of the steps).
So here is another difference: The moving averages in Adam are bias-corrected, while the moving average in rmsprop with momentum is biased towards $0$.
For more about the bias-correction in Adam, see section 3 in the paper and also this answer.
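To see what the bias-correction does in the first few steps, here is a small sketch (my own, for a constant gradient) comparing a moving average initialized to $0$ with its bias-corrected version:

# Biased vs. bias-corrected exponential moving average of a constant gradient g = 1.0.
beta = 0.9
g = 1.0
biased = 0.0   # initialized to 0, hence biased towards 0 in the early steps
for t in range(1, 6):
    biased = beta * biased + (1 - beta) * g
    corrected = biased / (1 - beta ** t)
    print(t, round(biased, 4), round(corrected, 4))
# The biased average starts at 0.1 and only slowly approaches 1.0,
# while the bias-corrected average equals 1.0 from the very first step.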
Simulation Python Code
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

###########################################
# script parameters

def f(x):
    if x > 0:
        return x
    else:
        return -0.1 * x

def f_grad(x):
    if x > 0:
        return 1
    else:
        return -0.1

METHOD_TO_LEARNING_RATE = {
    'Adam': 0.01,
    'GD': 0.00008,
    'rmsprop_with_Nesterov_momentum': 0.008,
    'rmsprop_with_momentum': 0.001,
    'rmsprop': 0.02,
    'momentum': 0.00008,
    'Nesterov': 0.008,
    'Adadelta': None,
}
X0 = 2

# The last assignment wins - keep the method you want to simulate last.
METHOD = 'rmsprop'
METHOD = 'momentum'
METHOD = 'GD'
METHOD = 'rmsprop_with_Nesterov_momentum'
METHOD = 'Nesterov'
METHOD = 'Adadelta'
METHOD = 'rmsprop_with_momentum'
METHOD = 'Adam'

LEARNING_RATE = METHOD_TO_LEARNING_RATE[METHOD]
MOMENTUM_DECAY_FACTOR = 0.9
RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.9
ADADELTA_DECAY_FACTOR = 0.9
RMSPROP_EPSILON = 1e-10
ADADELTA_EPSILON = 1e-6
ADAM_EPSILON = 1e-10
ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.999
ADAM_GRADS_AVG_DECAY_FACTOR = 0.9

# Animation frame interval in milliseconds (again, the last assignment wins).
INTERVAL = 9e2
INTERVAL = 1
INTERVAL = 3e2
INTERVAL = 3e1

###########################################

def plot_func(axe, f):
    xs = np.arange(-X0 * 0.5, X0 * 1.05, abs(X0) / 100)
    vf = np.vectorize(f)
    ys = vf(xs)
    return axe.plot(xs, ys, color='grey')

def next_color(color, f):
    color[1] -= 0.01
    if color[1] < 0:
        color[1] = 1
    return color[:]

def update(frame):
    global k, x, prev_step, squared_grads_decaying_avg, \
        squared_prev_steps_decaying_avg, grads_decaying_avg

    # The momentum-based methods reuse a fraction of the previous step.
    if METHOD in ('momentum', 'Nesterov', 'rmsprop_with_momentum',
                  'rmsprop_with_Nesterov_momentum'):
        step_momentum_portion = MOMENTUM_DECAY_FACTOR * prev_step
    # Nesterov variants evaluate the gradient at the "looked-ahead" position.
    if METHOD in ('Nesterov', 'rmsprop_with_Nesterov_momentum'):
        gradient = f_grad(x + step_momentum_portion)
    else:
        gradient = f_grad(x)

    if METHOD == 'GD':
        step = -LEARNING_RATE * gradient
    elif METHOD in ('momentum', 'Nesterov'):
        step = step_momentum_portion - LEARNING_RATE * gradient
    elif METHOD in ('rmsprop', 'rmsprop_with_momentum',
                    'rmsprop_with_Nesterov_momentum'):
        squared_grads_decaying_avg = (
            RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
        grads_rms = np.sqrt(squared_grads_decaying_avg + RMSPROP_EPSILON)
        if METHOD == 'rmsprop':
            step = -LEARNING_RATE / grads_rms * gradient
        else:
            assert(METHOD in ('rmsprop_with_momentum',
                              'rmsprop_with_Nesterov_momentum'))
            print(f'LEARNING_RATE / grads_rms * gradient: {LEARNING_RATE / grads_rms * gradient}')
            step = step_momentum_portion - LEARNING_RATE / grads_rms * gradient
    elif METHOD == 'Adadelta':
        gradient = f_grad(x)
        squared_grads_decaying_avg = (
            ADADELTA_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - ADADELTA_DECAY_FACTOR) * gradient ** 2)
        grads_rms = np.sqrt(squared_grads_decaying_avg + ADADELTA_EPSILON)
        squared_prev_steps_decaying_avg = (
            ADADELTA_DECAY_FACTOR * squared_prev_steps_decaying_avg +
            (1 - ADADELTA_DECAY_FACTOR) * prev_step ** 2)
        prev_steps_rms = np.sqrt(squared_prev_steps_decaying_avg + ADADELTA_EPSILON)
        step = -prev_steps_rms / grads_rms * gradient
    elif METHOD == 'Adam':
        squared_grads_decaying_avg = (
            ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
        unbiased_squared_grads_decaying_avg = (
            squared_grads_decaying_avg /
            (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
        grads_decaying_avg = (
            ADAM_GRADS_AVG_DECAY_FACTOR * grads_decaying_avg +
            (1 - ADAM_GRADS_AVG_DECAY_FACTOR) * gradient)
        unbiased_grads_decaying_avg = (
            grads_decaying_avg /
            (1 - ADAM_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
        step = -(LEARNING_RATE /
                 (np.sqrt(unbiased_squared_grads_decaying_avg) + ADAM_EPSILON) *
                 unbiased_grads_decaying_avg)

    x += step
    prev_step = step
    k += 1
    color = next_color(cur_color, f)
    print(f'k: {k}\n'
          f'x: {x}\n'
          f'step: {step}\n'
          f'gradient: {gradient}\n')
    k_x_marker, = k_and_x.plot(k, x, '.', color=color)
    x_y_marker, = x_and_y.plot(x, f(x), '.', color=color)
    return k_x_marker, x_y_marker

k = 0
x = X0
cur_color = [0, 1, 1]
prev_step = 0
squared_grads_decaying_avg = 0
squared_prev_steps_decaying_avg = 0
grads_decaying_avg = 0

fig, (k_and_x, x_and_y) = plt.subplots(1, 2, figsize=(9, 5))
k_and_x.set_xlabel('k')
k_and_x.set_ylabel('x', rotation=0)
x_and_y.set_xlabel('x')
x_and_y.set_ylabel('y', rotation=0)
plot_func(x_and_y, f)
x_and_y.plot(x, f(x), '.', color=cur_color[:])
k_and_x.plot(k, x, '.', color=cur_color[:])
plt.tight_layout()
ani = FuncAnimation(fig, update, blit=False, repeat=False, interval=INTERVAL)
plt.show()
The code has some problems, like there is an extra f in print(f'LEARNING_RATE..'). Could you please check it?
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f26792%2fdifference-between-rmsprop-with-momentum-and-adam-optimizers%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
(My answer is based mostly on Adam: A Method for Stochastic Optimization (the original Adam paper) and on the implementation of rmsprop with momentum in Tensorflow (which is operator() of struct ApplyRMSProp), as rmsprop is unpublished - it was described in a lecture by Geoffrey Hinton .)
Some Background
Adam and rmsprop with momentum are both methods (used by a gradient descent algorithm) to determine the step.
Let $Delta x^{(t)}_j$ be the $j^{text{th}}$ component of the $t^{text{th}}$ step. Then:
- In Adam: $$Delta x_{j}^{(t)}=-frac{text{learning_rate}}{sqrt{text{BCMA}left(g_{j}^{2}right)}}cdottext{BCMA}left(g_{j}right)$$
while:
$text{learning_rate}$ is a hyperparameter.
$text{BCMA}$ is short for "bias-corrected (exponential) moving average" (I made up the acronym for brevity).
- All of the moving averages I am going to talk about are exponential moving averages, so I would just refer to them as "moving averages".
$g_j$ is the $j^{text{th}}$ component of the gradient, and so $text{BCMA}left(g_{j}right)$ is a bias-corrected moving average of the $j^{text{th}}$ components of the gradients that were calculated. Similarly, $text{BCMA}left(g_{j}^{2}right)$ is a bias-corrected moving average of the squares of the $j^{text{th}}$ components of the gradients that were calculated.- For each moving average, the decay factor (aka smoothing factor) is a hyperparameter.
Both the Adam paper and TensorFlow use the following notation:
$beta_1$ is the decay factor for $text{BCMA}left(g_{j}right)$
$beta_2$ is the decay factor for $text{BCMA}left(g^2_{j}right)$
- The denominator is actually $sqrt{text{BCMA}left(g_{j}^{2}right)}+epsilon$, while $epsilon$ is a small hyperparameter, but I would ignore it for simplicity.
- In rmsprop with momentum: $$Delta x_{j}^{left(tright)}=text{momentum_decay_factor}cdotDelta x_{j}^{left(t-1right)}-frac{text{learning_rate}}{sqrt{text{MA}left(g_{j}^{2}right)}}cdot g_{j}^{left(tright)}$$
while:
$text{momentum_decay_factor}$ is a hyperparameter, and I would assume it is in $(0,1)$ (as it usually is).
In TensorFlow, this is themomentum
argument ofRMSPropOptimizer
.
$g^{(t)}_j$ is the $j^{text{th}}$ component of the gradient in the $t^{text{th}}$ step.
$text{MA}left(g_{j}^{2}right)$ is a moving average of the squares of the $j^{text{th}}$ components of the gradients that were calculated.
The decay factor of this moving average is a hyperparameter, and in TensorFlow, this is thedecay
argument ofRMSPropOptimizer
.
High-Level Comparison
Now we are finally ready to talk about the differences between the two.
The denominator is quite similar (except for the bias-correction, which I explain about later).
However, the momentum-like behavior that both share (Adam due to $text{BCMA}left(g_{j}right)$, and rmsprop with momentum due to explicitly taking a fraction of the previous step) is somewhat different.
E.g. this is how Sebastian Ruder describes this difference in his blog post An overview of gradient descent optimization algorithms:
Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface [...]
This description is based on the paper GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, so check it out if you want to dive deeper.
Next, I would describe 2 simple scenarios to demonstrate the difference in the momentum-like behaviors of the methods.
Lastly, I would describe the difference with regard to bias-correction.
Accumulating Momentum
Consider the following scenario: The gradient was constant in every step in the recent past, and $Delta x_{j}^{(t-1)}=0$. Also, to keep it simple, $g_{j}>0$.
I.e. we can imagine our algorithm as a stationary ball on a linear slope.
What would happen when we use each of the methods?
In Adam
The gradient was constant in the recent past, so $text{BCMA}left(g_{j}^{2}right)approx g_{j}^{2}$ and $text{BCMA}left(g_{j}right)approx g_j$.
Thus we get:
$$begin{gathered}\
Delta x_{j}^{(t)}=-frac{text{learning_rate}}{sqrt{g_{j}^{2}}}cdot g_{j}=-frac{text{learning_rate}}{|g_{j}|}cdot g_{j}\
downarrow\
Delta x_{j}^{left(tright)}=-text{learning_rate}
end{gathered}
$$
I.e. the "ball" immediately starts moving downhill in a constant speed.
In rmsprop with momentum
Similarly, we get:
$$Delta x_{j}^{left(tright)}=text{momentum_decay_factor}cdotDelta x_{j}^{left(t-1right)}-text{learning_rate}$$
This case is a little more complicated, but we can see that:
$$begin{gathered}\
Delta x_{j}^{left(tright)}=-text{learning_rate}\
Delta x_{j}^{left(t+1right)}=-text{learning_rate}cdot(1+text{momentum_decay_factor})
end{gathered}
$$
So the "ball" starts accelerating downhill.
Given that the gradient stays constant, you can prove that if:
$$-frac{text{learning_rate}}{1-text{momentum_decay_factor}}<Delta x_{j}^{left(kright)}$$
then: $$-frac{text{learning_rate}}{1-text{momentum_decay_factor}}<Delta x_{j}^{left(k+1right)}<Delta x_{j}^{left(kright)}$$
Therefore, we conclude that the step converges, i.e. $Delta x_{j}^{left(kright)}approx Delta x_{j}^{left(k-1right)}$ for some $k>t$, and then:
$$begin{gathered}\
Delta x_{j}^{left(kright)}approx text{momentum_decay_factor}cdotDelta x_{j}^{left(kright)}-text{learning_rate}\
downarrow\
Delta x_{j}^{left(kright)}approx -frac{text{learning_rate}}{1-text{momentum_decay_factor}}
end{gathered}
$$
Thus, the "ball" accelerates downhill and approaches a speed $frac{1}{1-text{momentum_decay_factor}}$ times as large as the constant speed of Adam's "ball". (E.g. for a typical $text{momentum_decay_factor}=0.9$, it can approach $10times$ speed!)
Changing Direction
Now, consider a scenario following the previous one:
After going down the slope (in the previous scenario) for quite some time (i.e. enough time for rmsprop with momentum to reach a nearly constant step size), suddenly a slope with an opposite and smaller constant gradient is reached.
What would happen when we use each of the methods?
This time I would just describe the results of my simulation of the scenario (my Python code is at the end of the answer).
Note that I have chosen for Adam's $text{BCMA}left(g_{j}right)$ a decay factor equal to $text{momentum_decay_factor}$. Choosing differently would have changed the following results:
- Adam is slower to change its direction, and then much slower to get back to the minimum.
- However, rmsprop with momentum reaches much further before it changes direction (when both use the same $text{learning_rate}$).
Note that this further reach is because rmsprop with momentum first reaches the opposite slope with much higher speed than Adam. If both reached the opposite slope with the same speed (which would happen if Adam's $text{learning_rate}$ were $frac{1}{1-text{momentum_decay_factor}}$ times as large as that of rmsprop with momentum), then Adam would reach further before changing direction.
Bias-Correction
What do we mean by a biased/bias-corrected moving average? (Or at least, what does the Adam paper mean by that?)
Generally speaking, a moving average is a weighted average of:
- The moving average of all of the previous terms
- The current term
Then what is the moving average in the first step?
- A natural choice for a programmer would be to initialize the "moving average of all of the previous terms" to $0$. We say that in this case the moving average is biased towards $0$.
- When you only have one term, by definition the average should be equal to that term.
Thus, we say that the moving average is bias-corrected in case the moving average in the first step is the first term (and the moving average works as usual for the rest of the steps).
So here is another difference: The moving averages in Adam are bias-corrected, while the moving average in rmsprop with momentum is biased towards $0$.
For more about the bias-correction in Adam, see section 3 in the paper and also this answer.
Simulation Python Code
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
###########################################
# script parameters
def f(x):
if x > 0:
return x
else:
return -0.1 * x
def f_grad(x):
if x > 0:
return 1
else:
return -0.1
METHOD_TO_LEARNING_RATE = {
'Adam': 0.01,
'GD': 0.00008,
'rmsprop_with_Nesterov_momentum': 0.008,
'rmsprop_with_momentum': 0.001,
'rmsprop': 0.02,
'momentum': 0.00008,
'Nesterov': 0.008,
'Adadelta': None,
}
X0 = 2
METHOD = 'rmsprop'
METHOD = 'momentum'
METHOD = 'GD'
METHOD = 'rmsprop_with_Nesterov_momentum'
METHOD = 'Nesterov'
METHOD = 'Adadelta'
METHOD = 'rmsprop_with_momentum'
METHOD = 'Adam'
LEARNING_RATE = METHOD_TO_LEARNING_RATE[METHOD]
MOMENTUM_DECAY_FACTOR = 0.9
RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.9
ADADELTA_DECAY_FACTOR = 0.9
RMSPROP_EPSILON = 1e-10
ADADELTA_EPSILON = 1e-6
ADAM_EPSILON = 1e-10
ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.999
ADAM_GRADS_AVG_DECAY_FACTOR = 0.9
INTERVAL = 9e2
INTERVAL = 1
INTERVAL = 3e2
INTERVAL = 3e1
###########################################
def plot_func(axe, f):
xs = np.arange(-X0 * 0.5, X0 * 1.05, abs(X0) / 100)
vf = np.vectorize(f)
ys = vf(xs)
return axe.plot(xs, ys, color='grey')
def next_color(color, f):
color[1] -= 0.01
if color[1] < 0:
color[1] = 1
return color[:]
def update(frame):
global k, x, prev_step, squared_grads_decaying_avg,
squared_prev_steps_decaying_avg, grads_decaying_avg
if METHOD in ('momentum', 'Nesterov', 'rmsprop_with_momentum',
'rmsprop_with_Nesterov_momentum'):
step_momentum_portion = MOMENTUM_DECAY_FACTOR * prev_step
if METHOD in ('Nesterov', 'rmsprop_with_Nesterov_momentum'):
gradient = f_grad(x + step_momentum_portion)
else:
gradient = f_grad(x)
if METHOD == 'GD':
step = -LEARNING_RATE * gradient
elif METHOD in ('momentum', 'Nesterov'):
step = step_momentum_portion - LEARNING_RATE * gradient
elif METHOD in ('rmsprop', 'rmsprop_with_momentum',
'rmsprop_with_Nesterov_momentum'):
squared_grads_decaying_avg = (
RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
(1 - RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
grads_rms = np.sqrt(squared_grads_decaying_avg + RMSPROP_EPSILON)
if METHOD == 'rmsprop':
step = -LEARNING_RATE / grads_rms * gradient
else:
assert(METHOD in ('rmsprop_with_momentum',
'rmsprop_with_Nesterov_momentum'))
print(f'LEARNING_RATE / grads_rms * gradient: {LEARNING_RATE / grads_rms * gradient}')
step = step_momentum_portion - LEARNING_RATE / grads_rms * gradient
elif METHOD == 'Adadelta':
gradient = f_grad(x)
squared_grads_decaying_avg = (
ADADELTA_DECAY_FACTOR * squared_grads_decaying_avg +
(1 - ADADELTA_DECAY_FACTOR) * gradient ** 2)
grads_rms = np.sqrt(squared_grads_decaying_avg + ADADELTA_EPSILON)
squared_prev_steps_decaying_avg = (
ADADELTA_DECAY_FACTOR * squared_prev_steps_decaying_avg +
(1 - ADADELTA_DECAY_FACTOR) * prev_step ** 2)
prev_steps_rms = np.sqrt(squared_prev_steps_decaying_avg + ADADELTA_EPSILON)
step = - prev_steps_rms / grads_rms * gradient
elif METHOD == 'Adam':
squared_grads_decaying_avg = (
ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
(1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
unbiased_squared_grads_decaying_avg = (
squared_grads_decaying_avg /
(1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
grads_decaying_avg = (
ADAM_GRADS_AVG_DECAY_FACTOR * grads_decaying_avg +
(1 - ADAM_GRADS_AVG_DECAY_FACTOR) * gradient)
unbiased_grads_decaying_avg = (
grads_decaying_avg /
(1 - ADAM_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
step = - (LEARNING_RATE /
(np.sqrt(unbiased_squared_grads_decaying_avg) + ADAM_EPSILON) *
unbiased_grads_decaying_avg)
x += step
prev_step = step
k += 1
color = next_color(cur_color, f)
print(f'k: {k}n'
f'x: {x}n'
f'step: {step}n'
f'gradient: {gradient}n')
k_x_marker, = k_and_x.plot(k, x, '.', color=color)
x_y_marker, = x_and_y.plot(x, f(x), '.', color=color)
return k_x_marker, x_y_marker
k = 0
x = X0
cur_color = [0, 1, 1]
prev_step = 0
squared_grads_decaying_avg = 0
squared_prev_steps_decaying_avg = 0
grads_decaying_avg = 0
fig, (k_and_x, x_and_y) = plt.subplots(1, 2, figsize=(9,5))
k_and_x.set_xlabel('k')
k_and_x.set_ylabel('x', rotation=0)
x_and_y.set_xlabel('x')
x_and_y.set_ylabel('y', rotation=0)
plot_func(x_and_y, f)
x_and_y.plot(x, f(x), '.', color=cur_color[:])
k_and_x.plot(k, x, '.', color=cur_color[:])
plt.tight_layout()
ani = FuncAnimation(fig, update, blit=False, repeat=False, interval=INTERVAL)
plt.show()
$endgroup$
add a comment |
$begingroup$
(My answer is based mostly on Adam: A Method for Stochastic Optimization (the original Adam paper) and on the implementation of rmsprop with momentum in Tensorflow (which is operator() of struct ApplyRMSProp), as rmsprop is unpublished - it was described in a lecture by Geoffrey Hinton .)
Some Background
Adam and rmsprop with momentum are both methods (used by a gradient descent algorithm) to determine the step.
Let $Delta x^{(t)}_j$ be the $j^{text{th}}$ component of the $t^{text{th}}$ step. Then:
- In Adam: $$Delta x_{j}^{(t)}=-frac{text{learning_rate}}{sqrt{text{BCMA}left(g_{j}^{2}right)}}cdottext{BCMA}left(g_{j}right)$$
while:
$text{learning_rate}$ is a hyperparameter.
$text{BCMA}$ is short for "bias-corrected (exponential) moving average" (I made up the acronym for brevity).
- All of the moving averages I am going to talk about are exponential moving averages, so I would just refer to them as "moving averages".
$g_j$ is the $j^{text{th}}$ component of the gradient, and so $text{BCMA}left(g_{j}right)$ is a bias-corrected moving average of the $j^{text{th}}$ components of the gradients that were calculated. Similarly, $text{BCMA}left(g_{j}^{2}right)$ is a bias-corrected moving average of the squares of the $j^{text{th}}$ components of the gradients that were calculated.- For each moving average, the decay factor (aka smoothing factor) is a hyperparameter.
Both the Adam paper and TensorFlow use the following notation:
$beta_1$ is the decay factor for $text{BCMA}left(g_{j}right)$
$beta_2$ is the decay factor for $text{BCMA}left(g^2_{j}right)$
- The denominator is actually $sqrt{text{BCMA}left(g_{j}^{2}right)}+epsilon$, while $epsilon$ is a small hyperparameter, but I would ignore it for simplicity.
- In rmsprop with momentum: $$Delta x_{j}^{left(tright)}=text{momentum_decay_factor}cdotDelta x_{j}^{left(t-1right)}-frac{text{learning_rate}}{sqrt{text{MA}left(g_{j}^{2}right)}}cdot g_{j}^{left(tright)}$$
while:
$text{momentum_decay_factor}$ is a hyperparameter, and I would assume it is in $(0,1)$ (as it usually is).
In TensorFlow, this is themomentum
argument ofRMSPropOptimizer
.
$g^{(t)}_j$ is the $j^{text{th}}$ component of the gradient in the $t^{text{th}}$ step.
$text{MA}left(g_{j}^{2}right)$ is a moving average of the squares of the $j^{text{th}}$ components of the gradients that were calculated.
The decay factor of this moving average is a hyperparameter, and in TensorFlow, this is thedecay
argument ofRMSPropOptimizer
.
High-Level Comparison
Now we are finally ready to talk about the differences between the two.
The denominator is quite similar (except for the bias-correction, which I explain about later).
However, the momentum-like behavior that both share (Adam due to $text{BCMA}left(g_{j}right)$, and rmsprop with momentum due to explicitly taking a fraction of the previous step) is somewhat different.
E.g. this is how Sebastian Ruder describes this difference in his blog post An overview of gradient descent optimization algorithms:
Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface [...]
This description is based on the paper GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, so check it out if you want to dive deeper.
Next, I would describe 2 simple scenarios to demonstrate the difference in the momentum-like behaviors of the methods.
Lastly, I would describe the difference with regard to bias-correction.
Accumulating Momentum
Consider the following scenario: The gradient was constant in every step in the recent past, and $Delta x_{j}^{(t-1)}=0$. Also, to keep it simple, $g_{j}>0$.
I.e. we can imagine our algorithm as a stationary ball on a linear slope.
What would happen when we use each of the methods?
In Adam
The gradient was constant in the recent past, so $text{BCMA}left(g_{j}^{2}right)approx g_{j}^{2}$ and $text{BCMA}left(g_{j}right)approx g_j$.
Thus we get:
$$begin{gathered}\
Delta x_{j}^{(t)}=-frac{text{learning_rate}}{sqrt{g_{j}^{2}}}cdot g_{j}=-frac{text{learning_rate}}{|g_{j}|}cdot g_{j}\
downarrow\
Delta x_{j}^{left(tright)}=-text{learning_rate}
end{gathered}
$$
I.e. the "ball" immediately starts moving downhill in a constant speed.
In rmsprop with momentum
Similarly, we get:
$$Delta x_{j}^{left(tright)}=text{momentum_decay_factor}cdotDelta x_{j}^{left(t-1right)}-text{learning_rate}$$
This case is a little more complicated, but we can see that:
$$begin{gathered}\
Delta x_{j}^{left(tright)}=-text{learning_rate}\
Delta x_{j}^{left(t+1right)}=-text{learning_rate}cdot(1+text{momentum_decay_factor})
end{gathered}
$$
So the "ball" starts accelerating downhill.
Given that the gradient stays constant, you can prove that if:
$$-frac{text{learning_rate}}{1-text{momentum_decay_factor}}<Delta x_{j}^{left(kright)}$$
then: $$-frac{text{learning_rate}}{1-text{momentum_decay_factor}}<Delta x_{j}^{left(k+1right)}<Delta x_{j}^{left(kright)}$$
Therefore, we conclude that the step converges, i.e. $Delta x_{j}^{left(kright)}approx Delta x_{j}^{left(k-1right)}$ for some $k>t$, and then:
$$begin{gathered}\
Delta x_{j}^{left(kright)}approx text{momentum_decay_factor}cdotDelta x_{j}^{left(kright)}-text{learning_rate}\
downarrow\
Delta x_{j}^{left(kright)}approx -frac{text{learning_rate}}{1-text{momentum_decay_factor}}
end{gathered}
$$
Thus, the "ball" accelerates downhill and approaches a speed $frac{1}{1-text{momentum_decay_factor}}$ times as large as the constant speed of Adam's "ball". (E.g. for a typical $text{momentum_decay_factor}=0.9$, it can approach $10times$ speed!)
Changing Direction
Now, consider a scenario following the previous one:
After going down the slope (in the previous scenario) for quite some time (i.e. enough time for rmsprop with momentum to reach a nearly constant step size), suddenly a slope with an opposite and smaller constant gradient is reached.
What would happen when we use each of the methods?
This time I would just describe the results of my simulation of the scenario (my Python code is at the end of the answer).
Note that I have chosen for Adam's $text{BCMA}left(g_{j}right)$ a decay factor equal to $text{momentum_decay_factor}$. Choosing differently would have changed the following results:
- Adam is slower to change its direction, and then much slower to get back to the minimum.
- However, rmsprop with momentum reaches much further before it changes direction (when both use the same $text{learning_rate}$).
Note that this further reach is because rmsprop with momentum first reaches the opposite slope with much higher speed than Adam. If both reached the opposite slope with the same speed (which would happen if Adam's $text{learning_rate}$ were $frac{1}{1-text{momentum_decay_factor}}$ times as large as that of rmsprop with momentum), then Adam would reach further before changing direction.
Bias-Correction
What do we mean by a biased/bias-corrected moving average? (Or at least, what does the Adam paper mean by that?)
Generally speaking, a moving average is a weighted average of:
- The moving average of all of the previous terms
- The current term
Then what is the moving average in the first step?
- A natural choice for a programmer would be to initialize the "moving average of all of the previous terms" to $0$. We say that in this case the moving average is biased towards $0$.
- When you only have one term, by definition the average should be equal to that term.
Thus, we say that the moving average is bias-corrected in case the moving average in the first step is the first term (and the moving average works as usual for the rest of the steps).
So here is another difference: The moving averages in Adam are bias-corrected, while the moving average in rmsprop with momentum is biased towards $0$.
For more about the bias-correction in Adam, see section 3 in the paper and also this answer.
Simulation Python Code
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
###########################################
# script parameters
def f(x):
if x > 0:
return x
else:
return -0.1 * x
def f_grad(x):
if x > 0:
return 1
else:
return -0.1
METHOD_TO_LEARNING_RATE = {
'Adam': 0.01,
'GD': 0.00008,
'rmsprop_with_Nesterov_momentum': 0.008,
'rmsprop_with_momentum': 0.001,
'rmsprop': 0.02,
'momentum': 0.00008,
'Nesterov': 0.008,
'Adadelta': None,
}
X0 = 2
METHOD = 'rmsprop'
METHOD = 'momentum'
METHOD = 'GD'
METHOD = 'rmsprop_with_Nesterov_momentum'
METHOD = 'Nesterov'
METHOD = 'Adadelta'
METHOD = 'rmsprop_with_momentum'
METHOD = 'Adam'
LEARNING_RATE = METHOD_TO_LEARNING_RATE[METHOD]
MOMENTUM_DECAY_FACTOR = 0.9
RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.9
ADADELTA_DECAY_FACTOR = 0.9
RMSPROP_EPSILON = 1e-10
ADADELTA_EPSILON = 1e-6
ADAM_EPSILON = 1e-10
ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.999
ADAM_GRADS_AVG_DECAY_FACTOR = 0.9
INTERVAL = 9e2
INTERVAL = 1
INTERVAL = 3e2
INTERVAL = 3e1
###########################################
def plot_func(axe, f):
xs = np.arange(-X0 * 0.5, X0 * 1.05, abs(X0) / 100)
vf = np.vectorize(f)
ys = vf(xs)
return axe.plot(xs, ys, color='grey')
def next_color(color, f):
color[1] -= 0.01
if color[1] < 0:
color[1] = 1
return color[:]
def update(frame):
global k, x, prev_step, squared_grads_decaying_avg,
squared_prev_steps_decaying_avg, grads_decaying_avg
if METHOD in ('momentum', 'Nesterov', 'rmsprop_with_momentum',
'rmsprop_with_Nesterov_momentum'):
step_momentum_portion = MOMENTUM_DECAY_FACTOR * prev_step
if METHOD in ('Nesterov', 'rmsprop_with_Nesterov_momentum'):
gradient = f_grad(x + step_momentum_portion)
else:
gradient = f_grad(x)
if METHOD == 'GD':
step = -LEARNING_RATE * gradient
elif METHOD in ('momentum', 'Nesterov'):
step = step_momentum_portion - LEARNING_RATE * gradient
elif METHOD in ('rmsprop', 'rmsprop_with_momentum',
'rmsprop_with_Nesterov_momentum'):
squared_grads_decaying_avg = (
RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
(1 - RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
grads_rms = np.sqrt(squared_grads_decaying_avg + RMSPROP_EPSILON)
if METHOD == 'rmsprop':
step = -LEARNING_RATE / grads_rms * gradient
else:
assert(METHOD in ('rmsprop_with_momentum',
'rmsprop_with_Nesterov_momentum'))
print(f'LEARNING_RATE / grads_rms * gradient: {LEARNING_RATE / grads_rms * gradient}')
step = step_momentum_portion - LEARNING_RATE / grads_rms * gradient
elif METHOD == 'Adadelta':
gradient = f_grad(x)
squared_grads_decaying_avg = (
ADADELTA_DECAY_FACTOR * squared_grads_decaying_avg +
(1 - ADADELTA_DECAY_FACTOR) * gradient ** 2)
grads_rms = np.sqrt(squared_grads_decaying_avg + ADADELTA_EPSILON)
squared_prev_steps_decaying_avg = (
ADADELTA_DECAY_FACTOR * squared_prev_steps_decaying_avg +
(1 - ADADELTA_DECAY_FACTOR) * prev_step ** 2)
prev_steps_rms = np.sqrt(squared_prev_steps_decaying_avg + ADADELTA_EPSILON)
step = - prev_steps_rms / grads_rms * gradient
elif METHOD == 'Adam':
squared_grads_decaying_avg = (
ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
(1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
unbiased_squared_grads_decaying_avg = (
squared_grads_decaying_avg /
(1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
grads_decaying_avg = (
ADAM_GRADS_AVG_DECAY_FACTOR * grads_decaying_avg +
(1 - ADAM_GRADS_AVG_DECAY_FACTOR) * gradient)
unbiased_grads_decaying_avg = (
grads_decaying_avg /
(1 - ADAM_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
step = - (LEARNING_RATE /
(np.sqrt(unbiased_squared_grads_decaying_avg) + ADAM_EPSILON) *
unbiased_grads_decaying_avg)
x += step
prev_step = step
k += 1
color = next_color(cur_color, f)
print(f'k: {k}n'
f'x: {x}n'
f'step: {step}n'
f'gradient: {gradient}n')
k_x_marker, = k_and_x.plot(k, x, '.', color=color)
x_y_marker, = x_and_y.plot(x, f(x), '.', color=color)
return k_x_marker, x_y_marker
k = 0
x = X0
cur_color = [0, 1, 1]
prev_step = 0
squared_grads_decaying_avg = 0
squared_prev_steps_decaying_avg = 0
grads_decaying_avg = 0
fig, (k_and_x, x_and_y) = plt.subplots(1, 2, figsize=(9,5))
k_and_x.set_xlabel('k')
k_and_x.set_ylabel('x', rotation=0)
x_and_y.set_xlabel('x')
x_and_y.set_ylabel('y', rotation=0)
plot_func(x_and_y, f)
x_and_y.plot(x, f(x), '.', color=cur_color[:])
k_and_x.plot(k, x, '.', color=cur_color[:])
plt.tight_layout()
ani = FuncAnimation(fig, update, blit=False, repeat=False, interval=INTERVAL)
plt.show()
$endgroup$
add a comment |
$begingroup$
(My answer is based mostly on Adam: A Method for Stochastic Optimization (the original Adam paper) and on the implementation of rmsprop with momentum in Tensorflow (which is operator() of struct ApplyRMSProp), as rmsprop is unpublished - it was described in a lecture by Geoffrey Hinton .)
Some Background
Adam and rmsprop with momentum are both methods (used by a gradient descent algorithm) to determine the step.
Let $Delta x^{(t)}_j$ be the $j^{text{th}}$ component of the $t^{text{th}}$ step. Then:
- In Adam: $$Delta x_{j}^{(t)}=-frac{text{learning_rate}}{sqrt{text{BCMA}left(g_{j}^{2}right)}}cdottext{BCMA}left(g_{j}right)$$
while:
$text{learning_rate}$ is a hyperparameter.
$text{BCMA}$ is short for "bias-corrected (exponential) moving average" (I made up the acronym for brevity).
- All of the moving averages I am going to talk about are exponential moving averages, so I would just refer to them as "moving averages".
$g_j$ is the $j^{text{th}}$ component of the gradient, and so $text{BCMA}left(g_{j}right)$ is a bias-corrected moving average of the $j^{text{th}}$ components of the gradients that were calculated. Similarly, $text{BCMA}left(g_{j}^{2}right)$ is a bias-corrected moving average of the squares of the $j^{text{th}}$ components of the gradients that were calculated.- For each moving average, the decay factor (aka smoothing factor) is a hyperparameter.
Both the Adam paper and TensorFlow use the following notation:
$beta_1$ is the decay factor for $text{BCMA}left(g_{j}right)$
$beta_2$ is the decay factor for $text{BCMA}left(g^2_{j}right)$
- The denominator is actually $sqrt{text{BCMA}left(g_{j}^{2}right)}+epsilon$, while $epsilon$ is a small hyperparameter, but I would ignore it for simplicity.
- In rmsprop with momentum: $$Delta x_{j}^{left(tright)}=text{momentum_decay_factor}cdotDelta x_{j}^{left(t-1right)}-frac{text{learning_rate}}{sqrt{text{MA}left(g_{j}^{2}right)}}cdot g_{j}^{left(tright)}$$
while:
$text{momentum_decay_factor}$ is a hyperparameter, and I would assume it is in $(0,1)$ (as it usually is).
In TensorFlow, this is themomentum
argument ofRMSPropOptimizer
.
$g^{(t)}_j$ is the $j^{text{th}}$ component of the gradient in the $t^{text{th}}$ step.
$text{MA}left(g_{j}^{2}right)$ is a moving average of the squares of the $j^{text{th}}$ components of the gradients that were calculated.
The decay factor of this moving average is a hyperparameter, and in TensorFlow, this is thedecay
argument ofRMSPropOptimizer
.
High-Level Comparison
Now we are finally ready to talk about the differences between the two.
The denominator is quite similar (except for the bias-correction, which I explain about later).
However, the momentum-like behavior that both share (Adam due to $text{BCMA}left(g_{j}right)$, and rmsprop with momentum due to explicitly taking a fraction of the previous step) is somewhat different.
E.g. this is how Sebastian Ruder describes this difference in his blog post An overview of gradient descent optimization algorithms:
Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface [...]
This description is based on the paper GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, so check it out if you want to dive deeper.
Next, I would describe 2 simple scenarios to demonstrate the difference in the momentum-like behaviors of the methods.
Lastly, I would describe the difference with regard to bias-correction.
Accumulating Momentum
Consider the following scenario: The gradient was constant in every step in the recent past, and $Delta x_{j}^{(t-1)}=0$. Also, to keep it simple, $g_{j}>0$.
I.e. we can imagine our algorithm as a stationary ball on a linear slope.
What would happen when we use each of the methods?
In Adam
The gradient was constant in the recent past, so $text{BCMA}left(g_{j}^{2}right)approx g_{j}^{2}$ and $text{BCMA}left(g_{j}right)approx g_j$.
Thus we get:
$$begin{gathered}\
Delta x_{j}^{(t)}=-frac{text{learning_rate}}{sqrt{g_{j}^{2}}}cdot g_{j}=-frac{text{learning_rate}}{|g_{j}|}cdot g_{j}\
downarrow\
Delta x_{j}^{left(tright)}=-text{learning_rate}
end{gathered}
$$
I.e. the "ball" immediately starts moving downhill in a constant speed.
In rmsprop with momentum
Similarly, we get:
$$Delta x_{j}^{left(tright)}=text{momentum_decay_factor}cdotDelta x_{j}^{left(t-1right)}-text{learning_rate}$$
This case is a little more complicated, but we can see that:
$$begin{gathered}\
Delta x_{j}^{left(tright)}=-text{learning_rate}\
Delta x_{j}^{left(t+1right)}=-text{learning_rate}cdot(1+text{momentum_decay_factor})
end{gathered}
$$
So the "ball" starts accelerating downhill.
Given that the gradient stays constant, you can prove that if:
$$-frac{text{learning_rate}}{1-text{momentum_decay_factor}}<Delta x_{j}^{left(kright)}$$
then: $$-frac{text{learning_rate}}{1-text{momentum_decay_factor}}<Delta x_{j}^{left(k+1right)}<Delta x_{j}^{left(kright)}$$
Therefore, we conclude that the step converges, i.e. $Delta x_{j}^{left(kright)}approx Delta x_{j}^{left(k-1right)}$ for some $k>t$, and then:
$$begin{gathered}\
Delta x_{j}^{left(kright)}approx text{momentum_decay_factor}cdotDelta x_{j}^{left(kright)}-text{learning_rate}\
downarrow\
Delta x_{j}^{left(kright)}approx -frac{text{learning_rate}}{1-text{momentum_decay_factor}}
end{gathered}
$$
Thus, the "ball" accelerates downhill and approaches a speed $frac{1}{1-text{momentum_decay_factor}}$ times as large as the constant speed of Adam's "ball". (E.g. for a typical $text{momentum_decay_factor}=0.9$, it can approach $10times$ speed!)
Changing Direction
Now, consider a scenario following the previous one:
After going down the slope (in the previous scenario) for quite some time (i.e. enough time for rmsprop with momentum to reach a nearly constant step size), suddenly a slope with an opposite and smaller constant gradient is reached.
What would happen when we use each of the methods?
This time I would just describe the results of my simulation of the scenario (my Python code is at the end of the answer).
Note that I have chosen for Adam's $text{BCMA}left(g_{j}right)$ a decay factor equal to $text{momentum_decay_factor}$. Choosing differently would have changed the following results:
- Adam is slower to change its direction, and then much slower to get back to the minimum.
- However, rmsprop with momentum reaches much further before it changes direction (when both use the same $text{learning_rate}$).
Note that this further reach is because rmsprop with momentum first reaches the opposite slope with much higher speed than Adam. If both reached the opposite slope with the same speed (which would happen if Adam's $text{learning_rate}$ were $frac{1}{1-text{momentum_decay_factor}}$ times as large as that of rmsprop with momentum), then Adam would reach further before changing direction.
Bias-Correction
What do we mean by a biased/bias-corrected moving average? (Or at least, what does the Adam paper mean by that?)
Generally speaking, a moving average is a weighted average of:
- The moving average of all of the previous terms
- The current term
Then what is the moving average in the first step?
- A natural choice for a programmer would be to initialize the "moving average of all of the previous terms" to $0$. We say that in this case the moving average is biased towards $0$.
- When you only have one term, by definition the average should be equal to that term.
Thus, we say that the moving average is bias-corrected in case the moving average in the first step is the first term (and the moving average works as usual for the rest of the steps).
So here is another difference: The moving averages in Adam are bias-corrected, while the moving average in rmsprop with momentum is biased towards $0$.
For more about the bias-correction in Adam, see section 3 in the paper and also this answer.
Simulation Python Code
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
###########################################
# script parameters
def f(x):
if x > 0:
return x
else:
return -0.1 * x
def f_grad(x):
if x > 0:
return 1
else:
return -0.1
METHOD_TO_LEARNING_RATE = {
'Adam': 0.01,
'GD': 0.00008,
'rmsprop_with_Nesterov_momentum': 0.008,
'rmsprop_with_momentum': 0.001,
'rmsprop': 0.02,
'momentum': 0.00008,
'Nesterov': 0.008,
'Adadelta': None,
}
X0 = 2
METHOD = 'rmsprop'
METHOD = 'momentum'
METHOD = 'GD'
METHOD = 'rmsprop_with_Nesterov_momentum'
METHOD = 'Nesterov'
METHOD = 'Adadelta'
METHOD = 'rmsprop_with_momentum'
METHOD = 'Adam'
LEARNING_RATE = METHOD_TO_LEARNING_RATE[METHOD]
MOMENTUM_DECAY_FACTOR = 0.9
RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.9
ADADELTA_DECAY_FACTOR = 0.9
RMSPROP_EPSILON = 1e-10
ADADELTA_EPSILON = 1e-6
ADAM_EPSILON = 1e-10
ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.999
ADAM_GRADS_AVG_DECAY_FACTOR = 0.9
INTERVAL = 9e2
INTERVAL = 1
INTERVAL = 3e2
INTERVAL = 3e1
###########################################
def plot_func(axe, f):
xs = np.arange(-X0 * 0.5, X0 * 1.05, abs(X0) / 100)
vf = np.vectorize(f)
ys = vf(xs)
return axe.plot(xs, ys, color='grey')
def next_color(color, f):
color[1] -= 0.01
if color[1] < 0:
color[1] = 1
return color[:]
def update(frame):
global k, x, prev_step, squared_grads_decaying_avg,
squared_prev_steps_decaying_avg, grads_decaying_avg
if METHOD in ('momentum', 'Nesterov', 'rmsprop_with_momentum',
'rmsprop_with_Nesterov_momentum'):
step_momentum_portion = MOMENTUM_DECAY_FACTOR * prev_step
if METHOD in ('Nesterov', 'rmsprop_with_Nesterov_momentum'):
gradient = f_grad(x + step_momentum_portion)
else:
gradient = f_grad(x)
if METHOD == 'GD':
step = -LEARNING_RATE * gradient
elif METHOD in ('momentum', 'Nesterov'):
step = step_momentum_portion - LEARNING_RATE * gradient
elif METHOD in ('rmsprop', 'rmsprop_with_momentum',
'rmsprop_with_Nesterov_momentum'):
squared_grads_decaying_avg = (
RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
(1 - RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
grads_rms = np.sqrt(squared_grads_decaying_avg + RMSPROP_EPSILON)
if METHOD == 'rmsprop':
step = -LEARNING_RATE / grads_rms * gradient
else:
assert(METHOD in ('rmsprop_with_momentum',
'rmsprop_with_Nesterov_momentum'))
print(f'LEARNING_RATE / grads_rms * gradient: {LEARNING_RATE / grads_rms * gradient}')
step = step_momentum_portion - LEARNING_RATE / grads_rms * gradient
elif METHOD == 'Adadelta':
gradient = f_grad(x)
squared_grads_decaying_avg = (
ADADELTA_DECAY_FACTOR * squared_grads_decaying_avg +
(1 - ADADELTA_DECAY_FACTOR) * gradient ** 2)
grads_rms = np.sqrt(squared_grads_decaying_avg + ADADELTA_EPSILON)
squared_prev_steps_decaying_avg = (
ADADELTA_DECAY_FACTOR * squared_prev_steps_decaying_avg +
(1 - ADADELTA_DECAY_FACTOR) * prev_step ** 2)
prev_steps_rms = np.sqrt(squared_prev_steps_decaying_avg + ADADELTA_EPSILON)
step = - prev_steps_rms / grads_rms * gradient
elif METHOD == 'Adam':
squared_grads_decaying_avg = (
ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
(1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
unbiased_squared_grads_decaying_avg = (
squared_grads_decaying_avg /
(1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
grads_decaying_avg = (
ADAM_GRADS_AVG_DECAY_FACTOR * grads_decaying_avg +
(1 - ADAM_GRADS_AVG_DECAY_FACTOR) * gradient)
unbiased_grads_decaying_avg = (
grads_decaying_avg /
(1 - ADAM_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
step = - (LEARNING_RATE /
(np.sqrt(unbiased_squared_grads_decaying_avg) + ADAM_EPSILON) *
unbiased_grads_decaying_avg)
x += step
prev_step = step
k += 1
color = next_color(cur_color, f)
print(f'k: {k}n'
f'x: {x}n'
f'step: {step}n'
f'gradient: {gradient}n')
k_x_marker, = k_and_x.plot(k, x, '.', color=color)
x_y_marker, = x_and_y.plot(x, f(x), '.', color=color)
return k_x_marker, x_y_marker
k = 0
x = X0
cur_color = [0, 1, 1]
prev_step = 0
squared_grads_decaying_avg = 0
squared_prev_steps_decaying_avg = 0
grads_decaying_avg = 0
fig, (k_and_x, x_and_y) = plt.subplots(1, 2, figsize=(9,5))
k_and_x.set_xlabel('k')
k_and_x.set_ylabel('x', rotation=0)
x_and_y.set_xlabel('x')
x_and_y.set_ylabel('y', rotation=0)
plot_func(x_and_y, f)
x_and_y.plot(x, f(x), '.', color=cur_color[:])
k_and_x.plot(k, x, '.', color=cur_color[:])
plt.tight_layout()
ani = FuncAnimation(fig, update, blit=False, repeat=False, interval=INTERVAL)
plt.show()
$endgroup$
(My answer is based mostly on Adam: A Method for Stochastic Optimization (the original Adam paper) and on the implementation of rmsprop with momentum in Tensorflow (which is operator() of struct ApplyRMSProp), as rmsprop is unpublished - it was described in a lecture by Geoffrey Hinton .)
Some Background
Adam and rmsprop with momentum are both methods (used by a gradient descent algorithm) to determine the step.
Let $Delta x^{(t)}_j$ be the $j^{text{th}}$ component of the $t^{text{th}}$ step. Then:
- In Adam: $$Delta x_{j}^{(t)}=-frac{text{learning_rate}}{sqrt{text{BCMA}left(g_{j}^{2}right)}}cdottext{BCMA}left(g_{j}right)$$
where:
$\text{learning_rate}$ is a hyperparameter.
$\text{BCMA}$ is short for "bias-corrected (exponential) moving average" (I made up the acronym for brevity).
- All of the moving averages I am going to talk about are exponential moving averages, so I will just refer to them as "moving averages".
$g_j$ is the $j^{\text{th}}$ component of the gradient, so $\text{BCMA}\left(g_{j}\right)$ is a bias-corrected moving average of the $j^{\text{th}}$ components of the gradients calculated so far. Similarly, $\text{BCMA}\left(g_{j}^{2}\right)$ is a bias-corrected moving average of the squares of those components.
- For each moving average, the decay factor (aka smoothing factor) is a hyperparameter.
Both the Adam paper and TensorFlow use the following notation:
$\beta_1$ is the decay factor for $\text{BCMA}\left(g_{j}\right)$
$\beta_2$ is the decay factor for $\text{BCMA}\left(g^2_{j}\right)$
- The denominator is actually $\sqrt{\text{BCMA}\left(g_{j}^{2}\right)}+\epsilon$, where $\epsilon$ is a small hyperparameter, but I will ignore it for simplicity.
- In rmsprop with momentum: $$\Delta x_{j}^{\left(t\right)}=\text{momentum_decay_factor}\cdot\Delta x_{j}^{\left(t-1\right)}-\frac{\text{learning_rate}}{\sqrt{\text{MA}\left(g_{j}^{2}\right)}}\cdot g_{j}^{\left(t\right)}$$
where:
$\text{momentum_decay_factor}$ is a hyperparameter, and I will assume it is in $(0,1)$ (as it usually is). In TensorFlow, this is the momentum argument of RMSPropOptimizer.
$g^{(t)}_j$ is the $j^{\text{th}}$ component of the gradient in the $t^{\text{th}}$ step.
$\text{MA}\left(g_{j}^{2}\right)$ is a moving average of the squares of the $j^{\text{th}}$ components of the gradients that were calculated. The decay factor of this moving average is a hyperparameter; in TensorFlow, this is the decay argument of RMSPropOptimizer.
(A minimal code sketch of both update rules appears right after this list.)
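To make the two definitions above concrete, here is a minimal per-component sketch of both update rules (my own illustration, not TensorFlow's actual implementation; the function and parameter names are mine, the state variables are assumed to persist between calls, $t$ starts at $1$, and $\epsilon$ is ignored as above):
import numpy as np

def adam_step(g, grads_ma, sq_grads_ma, t,
              learning_rate=0.001, beta1=0.9, beta2=0.999):
    # update the two (biased) moving averages
    grads_ma = beta1 * grads_ma + (1 - beta1) * g
    sq_grads_ma = beta2 * sq_grads_ma + (1 - beta2) * g ** 2
    # bias-correct them (this is the BCMA from the formula above)
    bcma_g = grads_ma / (1 - beta1 ** t)
    bcma_g2 = sq_grads_ma / (1 - beta2 ** t)
    step = -learning_rate / np.sqrt(bcma_g2) * bcma_g
    return step, grads_ma, sq_grads_ma

def rmsprop_with_momentum_step(g, prev_step, sq_grads_ma, learning_rate=0.001,
                               momentum_decay_factor=0.9, decay=0.9):
    # update the (biased) moving average of squared gradients
    sq_grads_ma = decay * sq_grads_ma + (1 - decay) * g ** 2
    # momentum portion plus the rmsprop-scaled gradient portion
    step = (momentum_decay_factor * prev_step
            - learning_rate / np.sqrt(sq_grads_ma) * g)
    return step, sq_grads_ma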
High-Level Comparison
Now we are finally ready to talk about the differences between the two.
The denominators are quite similar (except for the bias-correction, which I explain later).
However, the momentum-like behavior that both share (Adam due to $\text{BCMA}\left(g_{j}\right)$, and rmsprop with momentum due to explicitly taking a fraction of the previous step) is somewhat different.
E.g. this is how Sebastian Ruder describes this difference in his blog post An overview of gradient descent optimization algorithms:
Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface [...]
This description is based on the paper GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, so check it out if you want to dive deeper.
Next, I will describe two simple scenarios to demonstrate the difference in the momentum-like behavior of the two methods.
Lastly, I will describe the difference with regard to bias-correction.
Accumulating Momentum
Consider the following scenario: The gradient was constant in every step in the recent past, and $\Delta x_{j}^{(t-1)}=0$. Also, to keep it simple, $g_{j}>0$.
I.e. we can imagine our algorithm as a stationary ball on a linear slope.
What would happen when we use each of the methods?
In Adam
The gradient was constant in the recent past, so $\text{BCMA}\left(g_{j}^{2}\right)\approx g_{j}^{2}$ and $\text{BCMA}\left(g_{j}\right)\approx g_j$.
Thus we get:
$$\begin{gathered}
\Delta x_{j}^{(t)}=-\frac{\text{learning_rate}}{\sqrt{g_{j}^{2}}}\cdot g_{j}=-\frac{\text{learning_rate}}{|g_{j}|}\cdot g_{j}\\
\downarrow\\
\Delta x_{j}^{\left(t\right)}=-\text{learning_rate}
\end{gathered}$$
I.e. the "ball" immediately starts moving downhill in a constant speed.
In rmsprop with momentum
Similarly, we get:
$$\Delta x_{j}^{\left(t\right)}=\text{momentum_decay_factor}\cdot\Delta x_{j}^{\left(t-1\right)}-\text{learning_rate}$$
This case is a little more complicated, but we can see that:
$$\begin{gathered}
\Delta x_{j}^{\left(t\right)}=-\text{learning_rate}\\
\Delta x_{j}^{\left(t+1\right)}=-\text{learning_rate}\cdot(1+\text{momentum_decay_factor})
\end{gathered}$$
So the "ball" starts accelerating downhill.
Given that the gradient stays constant, you can prove that if:
$$-\frac{\text{learning_rate}}{1-\text{momentum_decay_factor}}<\Delta x_{j}^{\left(k\right)}$$
then: $$-\frac{\text{learning_rate}}{1-\text{momentum_decay_factor}}<\Delta x_{j}^{\left(k+1\right)}<\Delta x_{j}^{\left(k\right)}$$
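For completeness, here is the short algebraic check of that claim (using the constant-gradient simplification above, under which $\Delta x_{j}^{\left(k+1\right)}=\text{momentum_decay_factor}\cdot\Delta x_{j}^{\left(k\right)}-\text{learning_rate}$):
$$\begin{aligned}
\Delta x_{j}^{\left(k+1\right)}&=\text{momentum_decay_factor}\cdot\Delta x_{j}^{\left(k\right)}-\text{learning_rate}>\text{momentum_decay_factor}\cdot\left(-\frac{\text{learning_rate}}{1-\text{momentum_decay_factor}}\right)-\text{learning_rate}=-\frac{\text{learning_rate}}{1-\text{momentum_decay_factor}},\\
\Delta x_{j}^{\left(k+1\right)}-\Delta x_{j}^{\left(k\right)}&=\left(\text{momentum_decay_factor}-1\right)\cdot\Delta x_{j}^{\left(k\right)}-\text{learning_rate}<\left(\text{momentum_decay_factor}-1\right)\cdot\left(-\frac{\text{learning_rate}}{1-\text{momentum_decay_factor}}\right)-\text{learning_rate}=0.
\end{aligned}$$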
Therefore, we conclude that the step converges, i.e. $\Delta x_{j}^{\left(k\right)}\approx\Delta x_{j}^{\left(k-1\right)}$ for some $k>t$, and then:
$$\begin{gathered}
\Delta x_{j}^{\left(k\right)}\approx\text{momentum_decay_factor}\cdot\Delta x_{j}^{\left(k\right)}-\text{learning_rate}\\
\downarrow\\
\Delta x_{j}^{\left(k\right)}\approx-\frac{\text{learning_rate}}{1-\text{momentum_decay_factor}}
\end{gathered}$$
Thus, the "ball" accelerates downhill and approaches a speed $frac{1}{1-text{momentum_decay_factor}}$ times as large as the constant speed of Adam's "ball". (E.g. for a typical $text{momentum_decay_factor}=0.9$, it can approach $10times$ speed!)
Changing Direction
Now, consider a scenario following the previous one:
After going down the slope (in the previous scenario) for quite some time (i.e. enough time for rmsprop with momentum to reach a nearly constant step size), suddenly a slope with an opposite and smaller constant gradient is reached.
What would happen when we use each of the methods?
This time I will just describe the results of my simulation of the scenario (my Python code is at the end of the answer).
Note that I chose the decay factor of Adam's $\text{BCMA}\left(g_{j}\right)$ to be equal to $\text{momentum_decay_factor}$. Choosing differently would have changed the following results:
- Adam is slower to change its direction, and then much slower to get back to the minimum.
- However, rmsprop with momentum reaches much further before it changes direction (when both use the same $\text{learning_rate}$).
Note that this further reach is because rmsprop with momentum first reaches the opposite slope with much higher speed than Adam. If both reached the opposite slope with the same speed (which would happen if Adam's $\text{learning_rate}$ were $\frac{1}{1-\text{momentum_decay_factor}}$ times as large as that of rmsprop with momentum), then Adam would reach further before changing direction. (A stripped-down sketch of this scenario appears right below.)
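If you want to check the overshoot without running the full animation, here is a stripped-down, animation-free sketch of the scenario (my own illustration, assuming the same piecewise-linear f as in the script below and the same learning_rate of 0.001 for both methods); it prints how far past the minimum at x = 0 each method overshoots:
import numpy as np

def f_grad(x):
    # slope of 1 to the right of the minimum, -0.1 to the left
    return 1.0 if x > 0 else -0.1

def min_x_reached(method, learning_rate=0.001, n_steps=5000, x0=2.0,
                  momentum=0.9, beta1=0.9, beta2=0.999, decay=0.9, eps=1e-10):
    x, prev_step = x0, 0.0
    sq_avg, grad_avg = 0.0, 0.0
    min_x = x0
    for k in range(n_steps):
        g = f_grad(x)
        if method == 'rmsprop_with_momentum':
            sq_avg = decay * sq_avg + (1 - decay) * g ** 2
            step = momentum * prev_step - learning_rate / np.sqrt(sq_avg + eps) * g
        else:  # Adam
            sq_avg = beta2 * sq_avg + (1 - beta2) * g ** 2
            grad_avg = beta1 * grad_avg + (1 - beta1) * g
            sq_hat = sq_avg / (1 - beta2 ** (k + 1))
            grad_hat = grad_avg / (1 - beta1 ** (k + 1))
            step = -learning_rate / (np.sqrt(sq_hat) + eps) * grad_hat
        x += step
        prev_step = step
        min_x = min(min_x, x)
    return min_x

print('rmsprop_with_momentum overshoot:', min_x_reached('rmsprop_with_momentum'))
print('Adam overshoot:', min_x_reached('Adam'))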
Bias-Correction
What do we mean by a biased/bias-corrected moving average? (Or at least, what does the Adam paper mean by that?)
Generally speaking, a moving average is a weighted average of:
- The moving average of all of the previous terms
- The current term
Then what is the moving average in the first step?
- A natural choice for a programmer would be to initialize the "moving average of all of the previous terms" to $0$. We say that in this case the moving average is biased towards $0$.
- When you only have one term, by definition the average should be equal to that term.
Thus, we say that the moving average is bias-corrected if the moving average in the first step is equal to the first term (while the moving average behaves as usual in the rest of the steps).
So here is another difference: The moving averages in Adam are bias-corrected, while the moving average in rmsprop with momentum is biased towards $0$.
For more about the bias-correction in Adam, see section 3 in the paper and also this answer.
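Here is a minimal sketch (my own, not from the Adam paper) that contrasts a moving average which is biased towards $0$ with its bias-corrected version, for a constant series of terms all equal to $1$:
decay_factor = 0.9

biased_avg = 0.0
for t in range(1, 6):
    biased_avg = decay_factor * biased_avg + (1 - decay_factor) * 1.0
    bias_corrected_avg = biased_avg / (1 - decay_factor ** t)
    print(t, round(biased_avg, 4), round(bias_corrected_avg, 4))

# The biased average starts near 0 (0.1, 0.19, ...) even though every term is 1,
# while the bias-corrected average is exactly 1 at every step.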
Simulation Python Code
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

###########################################
# script parameters

def f(x):
    # the piecewise-linear function used in the "changing direction" scenario
    if x > 0:
        return x
    else:
        return -0.1 * x

def f_grad(x):
    if x > 0:
        return 1
    else:
        return -0.1

METHOD_TO_LEARNING_RATE = {
    'Adam': 0.01,
    'GD': 0.00008,
    'rmsprop_with_Nesterov_momentum': 0.008,
    'rmsprop_with_momentum': 0.001,
    'rmsprop': 0.02,
    'momentum': 0.00008,
    'Nesterov': 0.008,
    'Adadelta': None,
}

X0 = 2

# the last assignment wins - reorder these lines to pick a method
METHOD = 'rmsprop'
METHOD = 'momentum'
METHOD = 'GD'
METHOD = 'rmsprop_with_Nesterov_momentum'
METHOD = 'Nesterov'
METHOD = 'Adadelta'
METHOD = 'rmsprop_with_momentum'
METHOD = 'Adam'

LEARNING_RATE = METHOD_TO_LEARNING_RATE[METHOD]
MOMENTUM_DECAY_FACTOR = 0.9
RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.9
ADADELTA_DECAY_FACTOR = 0.9
RMSPROP_EPSILON = 1e-10
ADADELTA_EPSILON = 1e-6
ADAM_EPSILON = 1e-10
ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR = 0.999
ADAM_GRADS_AVG_DECAY_FACTOR = 0.9

# milliseconds between animation frames - again, the last assignment wins
INTERVAL = 9e2
INTERVAL = 1
INTERVAL = 3e2
INTERVAL = 3e1
###########################################

def plot_func(axe, f):
    xs = np.arange(-X0 * 0.5, X0 * 1.05, abs(X0) / 100)
    vf = np.vectorize(f)
    ys = vf(xs)
    return axe.plot(xs, ys, color='grey')

def next_color(color, f):
    # slowly cycle the green channel so consecutive markers are distinguishable
    color[1] -= 0.01
    if color[1] < 0:
        color[1] = 1
    return color[:]

def update(frame):
    global k, x, prev_step, squared_grads_decaying_avg, \
        squared_prev_steps_decaying_avg, grads_decaying_avg

    if METHOD in ('momentum', 'Nesterov', 'rmsprop_with_momentum',
                  'rmsprop_with_Nesterov_momentum'):
        step_momentum_portion = MOMENTUM_DECAY_FACTOR * prev_step
    if METHOD in ('Nesterov', 'rmsprop_with_Nesterov_momentum'):
        # Nesterov: evaluate the gradient after taking the momentum portion of the step
        gradient = f_grad(x + step_momentum_portion)
    else:
        gradient = f_grad(x)

    if METHOD == 'GD':
        step = -LEARNING_RATE * gradient
    elif METHOD in ('momentum', 'Nesterov'):
        step = step_momentum_portion - LEARNING_RATE * gradient
    elif METHOD in ('rmsprop', 'rmsprop_with_momentum',
                    'rmsprop_with_Nesterov_momentum'):
        squared_grads_decaying_avg = (
            RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - RMSPROP_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
        grads_rms = np.sqrt(squared_grads_decaying_avg + RMSPROP_EPSILON)
        if METHOD == 'rmsprop':
            step = -LEARNING_RATE / grads_rms * gradient
        else:
            assert METHOD in ('rmsprop_with_momentum',
                              'rmsprop_with_Nesterov_momentum')
            print(f'LEARNING_RATE / grads_rms * gradient: '
                  f'{LEARNING_RATE / grads_rms * gradient}')
            step = step_momentum_portion - LEARNING_RATE / grads_rms * gradient
    elif METHOD == 'Adadelta':
        squared_grads_decaying_avg = (
            ADADELTA_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - ADADELTA_DECAY_FACTOR) * gradient ** 2)
        grads_rms = np.sqrt(squared_grads_decaying_avg + ADADELTA_EPSILON)
        squared_prev_steps_decaying_avg = (
            ADADELTA_DECAY_FACTOR * squared_prev_steps_decaying_avg +
            (1 - ADADELTA_DECAY_FACTOR) * prev_step ** 2)
        prev_steps_rms = np.sqrt(squared_prev_steps_decaying_avg + ADADELTA_EPSILON)
        step = -prev_steps_rms / grads_rms * gradient
    elif METHOD == 'Adam':
        # biased moving average of squared gradients, and its bias-corrected version
        squared_grads_decaying_avg = (
            ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR * squared_grads_decaying_avg +
            (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR) * gradient ** 2)
        unbiased_squared_grads_decaying_avg = (
            squared_grads_decaying_avg /
            (1 - ADAM_SQUARED_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
        # biased moving average of gradients, and its bias-corrected version
        grads_decaying_avg = (
            ADAM_GRADS_AVG_DECAY_FACTOR * grads_decaying_avg +
            (1 - ADAM_GRADS_AVG_DECAY_FACTOR) * gradient)
        unbiased_grads_decaying_avg = (
            grads_decaying_avg /
            (1 - ADAM_GRADS_AVG_DECAY_FACTOR ** (k + 1)))
        step = -(LEARNING_RATE /
                 (np.sqrt(unbiased_squared_grads_decaying_avg) + ADAM_EPSILON) *
                 unbiased_grads_decaying_avg)

    x += step
    prev_step = step
    k += 1
    color = next_color(cur_color, f)
    print(f'k: {k}\n'
          f'x: {x}\n'
          f'step: {step}\n'
          f'gradient: {gradient}\n')
    k_x_marker, = k_and_x.plot(k, x, '.', color=color)
    x_y_marker, = x_and_y.plot(x, f(x), '.', color=color)
    return k_x_marker, x_y_marker

k = 0
x = X0
cur_color = [0, 1, 1]
prev_step = 0
squared_grads_decaying_avg = 0
squared_prev_steps_decaying_avg = 0
grads_decaying_avg = 0

fig, (k_and_x, x_and_y) = plt.subplots(1, 2, figsize=(9, 5))
k_and_x.set_xlabel('k')
k_and_x.set_ylabel('x', rotation=0)
x_and_y.set_xlabel('x')
x_and_y.set_ylabel('y', rotation=0)
plot_func(x_and_y, f)
x_and_y.plot(x, f(x), '.', color=cur_color[:])
k_and_x.plot(k, x, '.', color=cur_color[:])
plt.tight_layout()
ani = FuncAnimation(fig, update, blit=False, repeat=False, interval=INTERVAL)
plt.show()
edited Oct 5 '18 at 12:40
answered Oct 5 '18 at 9:38
Oren Milman
2549
$begingroup$
the code has some problems, like there is an extra f in print(f'LEARNING_RATE..'). Could you please check it?
New contributor
$endgroup$
answered 16 mins ago
Xiaogang Wang