Gradient Checking LSTM - how to get change in Cost across timesteps?
I am performing a gradient check for my LSTM, which is unrolled over 4 timesteps. The network looks as follows:
    {0,1}    {0,1}    {0,1}    {0,1}
      ^        ^        ^        ^
    LSTM --> LSTM --> LSTM --> LSTM
      ^        ^        ^        ^
    {1,1}    {1,1}    {1,1}    {1,1}
So, at every timestep we feed in the vector {1,1} and expect {0,1} at the output.
Assume I perturb a weight inside the LSTM and then perform 4 forward props, one for each timestep. How do I now get the delta of the cost function that this single perturbation has caused?
Am I allowed to simply add up the changes in cost from all 4 timesteps and treat the sum as the derivative estimate?
Also, should I perform the check as follows:
- perturb a single weight upwards
- forward prop 4 timesteps
- perturb the weight downwards
- forward prop 4 timesteps
- get 4 deltas (one per timestep)
- sum the 4 deltas to get the total change in cost
or
- set N = 0
- perturb the weight upwards
- forward prop at timestep N
- perturb the weight downwards
- forward prop at timestep N
- get a single delta, store it away
- increment N
- while N is not equal to 4, return to step 2
- sum the 4 deltas to get the total change in cost
The second approach somehow seems more correct, because the LSTM carries a hidden state between timesteps. Is this intuition correct, or does it not matter?
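For concreteness, here is a minimal sketch of what I mean by the first approach, in Python with toy numbers. The `forward_step` and `total_cost` functions are just stand-ins for my real LSTM code, not an actual implementation:

```python
import numpy as np

# Toy stand-in for one LSTM timestep: any function of (weight, input, previous
# hidden state) works to illustrate the finite-difference mechanics.
def forward_step(w, x, h_prev):
    h = np.tanh(w * x + 0.5 * h_prev)   # pretend hidden-state update
    y = h                               # pretend output
    return y, h

def total_cost(w, xs, targets):
    """Run all timesteps with the same weight and sum the per-timestep errors."""
    h, cost = 0.0, 0.0
    for x, t in zip(xs, targets):
        y, h = forward_step(w, x, h)
        cost += (y - t) ** 2            # squared error at this timestep
    return cost

xs      = [1.0, 1.0, 1.0, 1.0]          # the {1,1} inputs, collapsed to scalars for the toy
targets = [1.0, 1.0, 1.0, 1.0]          # the {0,1} targets, likewise collapsed
w, eps  = 0.3, 1e-5

# Approach 1: perturb once, run the whole 4-step sequence each way, take the difference.
delta = (total_cost(w + eps, xs, targets) - total_cost(w - eps, xs, targets)) / (2 * eps)
print("numerical dC/dw =", delta)
```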
Tag: gradient-descent
How is your cost function defined? As I understand, it is $\sum_{\text{outputs}} \| (0,1) - \text{network\_output} \|^2$. Is it like that?
– David Masip, Apr 30 '18 at 7:47

What does "perturb the weight downwards" mean?
– David Masip, Apr 30 '18 at 7:48

$\theta_i := \theta_i - \epsilon$, in other words, pulling one of the weights down a little.
– Kari, Apr 30 '18 at 9:31

Ok, what about the first question?
– David Masip, Apr 30 '18 at 9:33

I can use any function: cross entropy, mean squared error, etc.
– Kari, Apr 30 '18 at 9:33
2 Answers
Interesting question.
> Like standard backpropagation, [backpropagation through time] consists of a repeated application of the chain rule. The subtlety is that, for recurrent networks, the loss function depends on the activation of the hidden layer not only through its influence on the output layer, but also through its influence on the hidden layer at the next timestep.
It looks like both approaches would give similar results, just at different granularity (adding the perturbation at different levels). This is because backprop is not really disturbed by this addition; it is still chained from the last timestep back to the first. Hence it boils down to why you actually want to add noise in the first place, as mentioned in this paper, sections III-C and IV.
– Sanjay Krishna, answered Apr 29 '18 at 21:13
Which of the two approaches do you reckon will give a more reliable result (less noise)? I am inclined towards #2, because both "up" and "down" perturbations can occur during the same timestep, which probably means less time holding things in RAM.
– Kari, Oct 3 '18 at 15:15
Answering my own question several months later (after reading the answer by @SanjayKrishna).
My 'approach 1' seems more correct, but it causes more hassle than is actually needed.
Don't forget that the cost is the Mean Squared Error; in my specific case it is the average of the errors from all timesteps. It is this MSE that lets us see the "delta".
$$C = \text{MSE} = \frac{1}{T}\sum_{t=0}^{T}(\text{actual}_t - \text{wanted}_t)^2$$
Thus, we should do the following:
- perturb a single weight upwards
- perform a full forward prop (for example, 15 timesteps) and get `cost_a` from your MSE; it should be just a single scalar value
- perturb the weight downwards
- redo the full forward prop to obtain `cost_b` from your MSE, which is another scalar value
- compute the estimated derivative (again a single scalar) as (`cost_a` - `cost_b`) / (2 * epsilon)
- compare this estimate to the gradient that was computed during backprop through time (the gradient for that particular weight, accumulated over all the timesteps)
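A minimal sketch of this check in Python. Here `run_all_timesteps` and `bptt_gradient` are hypothetical placeholders for your own forward pass (returning the scalar cost over all timesteps) and your backprop-through-time gradient; they are not from any particular library:

```python
import numpy as np

def gradient_check(weights, run_all_timesteps, bptt_gradient, i, eps=1e-5):
    """Central-difference check of weight i against the BPTT gradient."""
    w_up = weights.copy()
    w_up[i] += eps                       # perturb a single weight upwards
    cost_a = run_all_timesteps(w_up)     # full forward prop over all timesteps -> scalar cost

    w_down = weights.copy()
    w_down[i] -= eps                     # perturb the same weight downwards
    cost_b = run_all_timesteps(w_down)   # another full forward prop -> scalar cost

    numeric  = (cost_a - cost_b) / (2 * eps)   # estimated dC/dw_i
    analytic = bptt_gradient[i]                # dC/dw_i from backprop through time
    rel_err  = abs(numeric - analytic) / max(abs(numeric) + abs(analytic), 1e-12)
    return numeric, analytic, rel_err
```

As a rough rule of thumb, a relative error below about 1e-5 suggests the BPTT gradient is correct, while values near 1 point to a bug.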
Edit: I am actually using something like a "mean softmaxed cross-entropy" rather than MSE, but the idea is the same: sum the errors from each timestep, divide by $T$, and that is your cost.
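For reference, a sketch of such a per-sequence cost (a hypothetical helper, assuming one integer class target per timestep; not my exact code):

```python
import numpy as np

def mean_softmax_cross_entropy(logits_per_step, targets_per_step):
    """Average the softmax cross-entropy over the T timesteps, analogous to the MSE above."""
    T = len(logits_per_step)
    total = 0.0
    for logits, target in zip(logits_per_step, targets_per_step):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                        # softmax over the output vector
        total += -np.log(probs[target] + 1e-12)     # cross-entropy for the true class
    return total / T                                # single scalar cost C
```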
– Kari, answered Nov 22 '18 at 15:38