Gradient Checking LSTM - how to get change in Cost across timesteps?

I am performing a gradient check for my LSTM, which has 4 timesteps. The LSTM looks as follows:



   01       01       01       01
    ^        ^        ^        ^
  LSTM --> LSTM --> LSTM --> LSTM
    ^        ^        ^        ^
   11       11       11       11


So, at every timestep we feed in the vector {1, 1} and expect {0, 1} at the output.



Assume I perturb a weight inside the LSTM and then perform 4 forward props, one for each timestep. How do I now get the delta of the cost function that this single perturbation has caused?



Am I allowed to simply add up the changes in cost from all 4 timesteps and treat the sum as the derivative estimate?





Also, should I perform the check as follows for the LSTM:




  1. perturb a single weight upwards

  2. forward prop 4 timesteps

  3. perturb the weight downwards

  4. forward prop 4 timesteps

  5. get 4 deltas

  6. sum the 4 deltas to get a total change in Cost


or




  1. Set N=0

  2. perturb the weight upwards

  3. forward prop at a particular timestep N

  4. perturb the weight downwards

  5. forward prop at a particular timestep N

  6. get single delta, store it away

  7. increment N

  8. if N is not equal to 4, return to step 2

  9. sum the 4 deltas to get a total change in Cost


The second approach somehow seems more correct, because the LSTM carries a hidden state between timesteps. Is this intuition correct, or does it not matter? (A rough sketch of what I mean by the two approaches is shown below.)
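
To make the two candidate procedures concrete, here is a rough sketch of what I mean. step(weights, x, h, c) is a hypothetical helper that runs a single LSTM timestep and returns that timestep's cost together with the new hidden and cell states, weights is a 1-D NumPy-style array of the parameters, and xs is the list of 4 input vectors; none of these names refer to real code:

    def approach_1(weights, xs, step, i, eps=1e-5):
        # Perturb weight i once, forward prop all 4 timesteps, then repeat downwards.
        runs = []
        for sign in (+1.0, -1.0):
            w = weights.copy()
            w[i] += sign * eps
            h = c = 0.0
            costs = []
            for x in xs:                              # forward prop 4 timesteps
                cost_t, h, c = step(w, x, h, c)
                costs.append(cost_t)
            runs.append(costs)
        up, down = runs
        return sum(u - d for u, d in zip(up, down))   # sum the 4 deltas

    def approach_2(weights, xs, step, i, eps=1e-5):
        # Perturb up and down within each timestep N, keeping two separate state chains.
        total = 0.0
        h_up = c_up = h_dn = c_dn = 0.0
        for x in xs:                                  # N = 0, 1, 2, 3
            w_up = weights.copy(); w_up[i] += eps
            cost_up, h_up, c_up = step(w_up, x, h_up, c_up)
            w_dn = weights.copy(); w_dn[i] -= eps
            cost_dn, h_dn, c_dn = step(w_dn, x, h_dn, c_dn)
            total += cost_up - cost_dn                # single delta for this timestep, stored away
        return total                                  # sum of the 4 deltas

In the second sketch I keep two separate hidden-state chains (one for the "+eps" run and one for the "-eps" run); that is one possible reading of approach 2, and exactly the ambiguity about the hidden state I am unsure about.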










gradient-descent


asked Apr 27 '18 at 4:42, edited Apr 28 '18 at 20:01 – Kari

  • How is your cost function defined? As I understand it, it is $\sum_{\text{outputs}} \| (0, 1) - \text{network output} \|^2$. Is it like that? – David Masip, Apr 30 '18 at 7:47

  • What does "perturb the weight downwards" mean? – David Masip, Apr 30 '18 at 7:48

  • $\theta_i := \theta_i - \epsilon$, in other words, pulling one of the weights down a little. – Kari, Apr 30 '18 at 9:31

  • Ok, what about the first question? – David Masip, Apr 30 '18 at 9:33

  • I can use any function: cross entropy, mean squared error, etc. – Kari, Apr 30 '18 at 9:33
















2 Answers

Interesting question.




Like standard backpropagation, [backpropagation through time] consists of a repeated application of the chain rule. The subtlety is that, for recurrent networks, the loss function depends on the activation of the hidden layer not only through its influence on the output layer, but also through its influence on the hidden layer at the next timestep.




It looks like both approaches would give similar results, just at a different granularity (adding the perturbation at different levels). This is because backprop is not really disturbed by this addition, as it is still chained from the last timestep back to the first. Hence it boils down to why you actually want to perturb the weights in the first place, as discussed in this paper, sections III-C and IV.
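
One way to make the granularity point explicit (assuming, for the sake of illustration, that the total cost is simply the sum of the per-timestep costs $C_t$): summing the per-timestep changes caused by a single $\pm\epsilon$ perturbation of a weight $\theta_i$ is exactly the central-difference estimate of the total cost's derivative,

$$\frac{\partial C}{\partial \theta_i} \approx \frac{\sum_{t=1}^{4}\big(C_t(\theta_i+\epsilon) - C_t(\theta_i-\epsilon)\big)}{2\epsilon} = \frac{C(\theta_i+\epsilon) - C(\theta_i-\epsilon)}{2\epsilon}.$$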






– Sanjay Krishna, answered Apr 29 '18 at 21:13

  • Which of the two approaches do you reckon will give a more reliable result (less noise)? I am inclined towards #2, because both "up" and "down" perturbations can occur during the same timestep, which probably means less time holding things in random-access memory. – Kari, Oct 3 '18 at 15:15




















Answering my own question several months later (after reading the answer by @SanjayKrishna).



My 'approach 1' is the more correct one, but as described above it causes more hassle than is actually needed.



Don't forget that the cost is the mean squared error; in my specific case it is the average of the errors over all timesteps. It is this single scalar MSE that lets us see the "delta".



$$C = \mathrm{MSE} = \frac{1}{T}\sum_{t=0}^{T}(\mathrm{actual}_t - \mathrm{wanted}_t)^2$$



Thus, we should do the following (a minimal code sketch of these steps is shown after the list):



  1. perturb a single weight upwards

  2. perform a full forward prop (for example, 15 timesteps) and get cost_a from your MSE; it should be just a single scalar value

  3. perturb the weight downwards

  4. redo the full forward prop to obtain cost_b from your MSE, which is another scalar value

  5. compute the derivative estimate (again just a single scalar value) as (cost_a - cost_b) / (2 * epsilon)

  6. compare this estimate to the gradient for that particular weight that was computed during backprop through time (the gradient accumulated over all the timesteps)
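
Here is a minimal sketch of that procedure. lstm_forward(weights, inputs, targets) is a hypothetical function that runs all timesteps and returns the scalar cost, and bptt_gradients is the gradient vector from the backprop-through-time pass; the names are placeholders, not a specific library API:

    def check_weight(weights, inputs, targets, i, lstm_forward, bptt_gradients, eps=1e-5):
        w = weights.copy()

        w[i] += eps                                    # 1. perturb the weight upwards
        cost_a = lstm_forward(w, inputs, targets)      # 2. full forward prop -> single scalar cost

        w[i] -= 2 * eps                                # 3. perturb downwards (original value minus eps)
        cost_b = lstm_forward(w, inputs, targets)      # 4. full forward prop -> another scalar

        numeric = (cost_a - cost_b) / (2 * eps)        # 5. central-difference derivative estimate
        analytic = bptt_gradients[i]                   # gradient from backprop through time

        # 6. relative error between numerical and analytic gradients; a value around
        #    1e-7 or smaller usually indicates the analytic gradient is correct.
        return abs(numeric - analytic) / max(abs(numeric) + abs(analytic), 1e-12)

In practice it is enough to run this for a handful of randomly chosen weight indices rather than all of them, since each check costs two full forward passes.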


Edit



I am actually using something like a "mean softmaxed cross-entropy", not MSE, but the idea is the same: sum up the errors at each timestep, divide by $T$, and that's your cost. A rough sketch of such a cost is shown below.
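
For illustration, a hedged sketch of that kind of cost (the function and argument names are just placeholders):

    import numpy as np

    def mean_softmax_cross_entropy(logits_per_step, targets_per_step):
        # Average the softmax cross-entropy over the T timesteps of the sequence.
        total = 0.0
        for logits, target in zip(logits_per_step, targets_per_step):
            z = logits - np.max(logits)                # shift for numerical stability
            probs = np.exp(z) / np.sum(np.exp(z))      # softmax over the output vector
            total += -np.sum(target * np.log(probs))   # cross-entropy at this timestep
        return total / len(logits_per_step)            # divide by T -> the cost C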






– Kari, answered Nov 22 '18 at 15:38