Gradient Checking LSTM - how to get change in Cost across timesteps?
I am performing a gradient check for my LSTM, which is unrolled over 4 timesteps. The network looks as follows:
    {0,1}    {0,1}    {0,1}    {0,1}
      ^        ^        ^        ^
    LSTM --> LSTM --> LSTM --> LSTM
      ^        ^        ^        ^
    {1,1}    {1,1}    {1,1}    {1,1}
So, at every timestep we feed in the vector {1,1} and expect {0,1} at the output.
Assume I perturb a weight inside the LSTM and then perform 4 forward props, one for each timestep. How do I now get the delta of the cost function that this single perturbation has caused?
Am I allowed to simply add up the changes in cost from all 4 timesteps and treat the sum as the derivative estimate?
Also, should I perform the check as follows:
- perturb a single weight upwards
- forward prop 4 timesteps
- perturb the weight downwards
- forward prop 4 timesteps
- get 4 deltas (one per timestep)
- sum the 4 deltas to get the total change in cost
or
- set N = 0
- perturb the weight upwards
- forward prop at timestep N
- perturb the weight downwards
- forward prop at timestep N
- get a single delta, store it away
- increment N
- while N is not equal to 4, return to step 2
- sum the 4 deltas to get the total change in cost
The second approach somehow seems more correct, because the LSTM carries a hidden state between timesteps. Is this intuition correct, or does it not matter?
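For concreteness, here is a minimal sketch of what I mean by the first approach, in Python with toy numbers. The `forward_step` and `total_cost` functions are just stand-ins for my real LSTM code, not an actual implementation:

```python
import numpy as np

# Toy stand-in for one LSTM timestep: any function of (weight, input, previous
# hidden state) works to illustrate the finite-difference mechanics.
def forward_step(w, x, h_prev):
    h = np.tanh(w * x + 0.5 * h_prev)   # pretend hidden-state update
    y = h                               # pretend output
    return y, h

def total_cost(w, xs, targets):
    """Run all timesteps with the same weight and sum the per-timestep errors."""
    h, cost = 0.0, 0.0
    for x, t in zip(xs, targets):
        y, h = forward_step(w, x, h)
        cost += (y - t) ** 2            # squared error at this timestep
    return cost

xs      = [1.0, 1.0, 1.0, 1.0]          # the {1,1} inputs, collapsed to scalars for the toy
targets = [1.0, 1.0, 1.0, 1.0]          # the {0,1} targets, likewise collapsed
w, eps  = 0.3, 1e-5

# Approach 1: perturb once, run the whole 4-step sequence each way, take the difference.
delta = (total_cost(w + eps, xs, targets) - total_cost(w - eps, xs, targets)) / (2 * eps)
print("numerical dC/dw =", delta)
```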
Tag: gradient-descent
How is your cost function defined? As I understand, it is $\sum_{\text{outputs}} \| (0,1) - \text{network\_output} \|^2$. Is it like that?
– David Masip, Apr 30 '18 at 7:47

What does "perturb the weight downwards" mean?
– David Masip, Apr 30 '18 at 7:48

$\theta_i := \theta_i - \epsilon$, in other words, pulling one of the weights down a little.
– Kari, Apr 30 '18 at 9:31

Ok, what about the first question?
– David Masip, Apr 30 '18 at 9:33

I can use any function: cross entropy, mean squared error, etc.
– Kari, Apr 30 '18 at 9:33
2 Answers
Interesting question.
> Like standard backpropagation, [backpropagation through time] consists of a repeated application of the chain rule. The subtlety is that, for recurrent networks, the loss function depends on the activation of the hidden layer not only through its influence on the output layer, but also through its influence on the hidden layer at the next timestep.
It looks like both approaches would give similar results, just at different granularity (adding the perturbation at different levels). This is because backprop is not really disturbed by this addition; it is still chained from the last timestep back to the first. Hence it boils down to why you actually want to add noise in the first place, as mentioned in this paper, sections III-C and IV.
– Sanjay Krishna, answered Apr 29 '18 at 21:13
Which of the two approaches do you reckon will give a more reliable result (less noise)? I am inclined towards #2, because both "up" and "down" perturbations can occur during the same timestep, which probably means less time holding things in RAM.
– Kari, Oct 3 '18 at 15:15
Answering my own question several months later (after reading the answer by @SanjayKrishna).
My 'approach 1' seems more correct, but it causes more hassle than is actually needed.
Don't forget that the cost is the Mean Squared Error; in my specific case it is the average of the errors from all timesteps. It is this MSE that lets us see the "delta".
$$C = \text{MSE} = \frac{1}{T}\sum_{t=0}^{T}(\text{actual}_t - \text{wanted}_t)^2$$
Thus, we should do the following:
- perturb a single weight upwards
- perform a full forward prop (for example, 15 timesteps) and get `cost_a` from your MSE; it should be just a single scalar value
- perturb the weight downwards
- redo the full forward prop to obtain `cost_b` from your MSE, which is another scalar value
- compute the estimated derivative (again a single scalar) as (`cost_a` - `cost_b`) / (2 * epsilon)
- compare this estimate to the gradient that was computed during backprop through time (the gradient for that particular weight, accumulated over all the timesteps)
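A minimal sketch of this check in Python. Here `run_all_timesteps` and `bptt_gradient` are hypothetical placeholders for your own forward pass (returning the scalar cost over all timesteps) and your backprop-through-time gradient; they are not from any particular library:

```python
import numpy as np

def gradient_check(weights, run_all_timesteps, bptt_gradient, i, eps=1e-5):
    """Central-difference check of weight i against the BPTT gradient."""
    w_up = weights.copy()
    w_up[i] += eps                       # perturb a single weight upwards
    cost_a = run_all_timesteps(w_up)     # full forward prop over all timesteps -> scalar cost

    w_down = weights.copy()
    w_down[i] -= eps                     # perturb the same weight downwards
    cost_b = run_all_timesteps(w_down)   # another full forward prop -> scalar cost

    numeric  = (cost_a - cost_b) / (2 * eps)   # estimated dC/dw_i
    analytic = bptt_gradient[i]                # dC/dw_i from backprop through time
    rel_err  = abs(numeric - analytic) / max(abs(numeric) + abs(analytic), 1e-12)
    return numeric, analytic, rel_err
```

As a rough rule of thumb, a relative error below about 1e-5 suggests the BPTT gradient is correct, while values near 1 point to a bug.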
Edit: I am actually using something like a "mean softmaxed cross-entropy" rather than MSE, but the idea is the same: sum the errors from each timestep, divide by $T$, and that is your cost.
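For reference, a sketch of such a per-sequence cost (a hypothetical helper, assuming one integer class target per timestep; not my exact code):

```python
import numpy as np

def mean_softmax_cross_entropy(logits_per_step, targets_per_step):
    """Average the softmax cross-entropy over the T timesteps, analogous to the MSE above."""
    T = len(logits_per_step)
    total = 0.0
    for logits, target in zip(logits_per_step, targets_per_step):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                        # softmax over the output vector
        total += -np.log(probs[target] + 1e-12)     # cross-entropy for the true class
    return total / T                                # single scalar cost C
```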
– Kari, answered Nov 22 '18 at 15:38