Backprop Through Max-Pooling Layers?

This is a small conceptual question that's been nagging me for a while: how can we back-propagate through a max-pooling layer in a neural network?

I came across max-pooling layers while going through this tutorial for Torch 7's nn library. The library abstracts the gradient calculation and forward passes for each layer of a deep network, and I don't understand how the gradient calculation is done for a max-pooling layer.

I know that if you have an input ${z_i}^l$ going into neuron $i$ of layer $l$, then ${\delta_i}^l$ (defined as ${\delta_i}^l = \frac{\partial E}{\partial {z_i}^l}$) is given by:
$$
{\delta_i}^l = \theta'({z_i}^l) \sum_{j} {\delta_j}^{l+1} w_{i,j}^{l,l+1}
$$

So a max-pooling layer would receive the ${\delta_j}^{l+1}$'s of the next layer as usual; but since the activation function for the max-pooling neurons takes in a vector of values (over which it maxes) as input, ${\delta_i}^{l}$ isn't a single number anymore, but a vector ($\theta'({z_j}^l)$ would have to be replaced by $\nabla \theta\left(\left\{{z_j}^l\right\}\right)$). Furthermore, $\theta$, being the max function, isn't differentiable with respect to its inputs.

So... how should it work out exactly?

neural-network backpropagation

asked May 12 '16 at 8:38 by shinvu, edited May 13 '16 at 5:09 by shinvu

3 Answers

Answer by abora (46 votes), answered May 12 '16 at 11:52, edited May 14 '16 at 9:48

There is no gradient with respect to non-maximum values, since changing them slightly does not affect the output. Further, the max is locally linear with slope 1 with respect to the input that actually achieves the max. Thus, the gradient from the next layer is passed back only to that neuron which achieved the max; all other neurons get zero gradient.

So in your example, $\delta_i^l$ would be a vector of all zeros, except that the $i^*$-th location gets the value $\delta_j^{l+1}$, where $i^* = \operatorname{argmax}_{i} (z_i^l)$.
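
To make that concrete, here is a minimal NumPy sketch of the idea (this is not Torch 7's actual implementation; the 1-D, non-overlapping window setup and the function names are my own): the forward pass records the argmax of each window, and the backward pass routes each upstream gradient to exactly that position.

```python
import numpy as np

def maxpool_forward(z, k):
    """1-D max pooling with window size k and stride k (non-overlapping).
    Returns the pooled values and the within-window argmax indices."""
    n_windows = len(z) // k
    windows = z[:n_windows * k].reshape(n_windows, k)
    argmax = windows.argmax(axis=1)                 # winner inside each window
    pooled = windows[np.arange(n_windows), argmax]
    return pooled, argmax

def maxpool_backward(delta_next, argmax, k, input_len):
    """Route each upstream gradient delta_next[i] to the input position that
    achieved the max in window i; every other position gets zero gradient."""
    grad = np.zeros(input_len)
    winners = np.arange(len(delta_next)) * k + argmax
    # np.add.at accumulates, so if windows overlapped (stride < k), the
    # contributions from the different windows would simply sum.
    np.add.at(grad, winners, delta_next)
    return grad

z = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.5])
pooled, argmax = maxpool_forward(z, k=2)            # pooled = [3., 5., 4.]
grad_in = maxpool_backward(np.array([0.1, 0.2, 0.3]), argmax, k=2, input_len=len(z))
print(grad_in)                                      # [0.  0.1 0.  0.2 0.3 0. ]
```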






• Oh right, there is no point back-propagating through the non-maximum neurons; that was a crucial insight. So if I now understand this correctly, back-propagating through the max-pooling layer simply selects the max neuron from the previous layer (on which the max-pooling was done) and continues back-propagation only through that. – shinvu, May 13 '16 at 5:35

• But don't you need to multiply with the derivative of the activation function? – Jason, Mar 19 '18 at 4:02

• @Jason: The max function is locally linear for the activation that got the max, so its derivative is constant 1. For the activations that didn't make it through, it's 0. That's conceptually very similar to differentiating the ReLU(x) = max(0, x) activation function. – Chrigi, Feb 5 at 12:48

• What if the stride is less than the kernel width for max pooling? – Vatsal, Mar 4 at 5:52

Answer by patapouf_ai (3 votes), answered Sep 27 '16 at 9:08

Max Pooling

So suppose you have a layer $P$ which comes on top of a layer $PR$. Then the forward pass will be something like this:

$P_i = f\big(\sum_j W_{ij}\, PR_j\big)$,

where $P_i$ is the activation of the $i$th neuron of layer $P$, $f$ is the activation function and $W$ are the weights. If you differentiate that, by the chain rule you get that the gradients flow as follows:

$\mathrm{grad}(PR_j) = \sum_i \mathrm{grad}(P_i)\, f'\, W_{ij}$.

But now, if you have max pooling, $f = \mathrm{id}$ for the max neuron and $f = 0$ for all other neurons, so $f' = 1$ for the max neuron in the previous layer and $f' = 0$ for all other neurons. So:

$\mathrm{grad}(PR_{\text{max neuron}}) = \sum_i \mathrm{grad}(P_i)\, W_{i\,\text{max neuron}}$,

$\mathrm{grad}(PR_{\text{others}}) = 0$.
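
As a tiny sanity check in this notation (the numbers here are my own, not from the answer): suppose a single pooling unit $P_1$ pools over two inputs with $PR_1 = 2$ and $PR_2 = 7$, so $P_1 = 7$, and the upstream gradient is $\mathrm{grad}(P_1) = 0.5$. Then

$$
\mathrm{grad}(PR_2) = \mathrm{grad}(P_1) \cdot 1 = 0.5,
\qquad
\mathrm{grad}(PR_1) = 0,
$$

i.e. the entire upstream gradient flows to the input that won the max, and nothing flows to the other input.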







Answer by anu (0 votes)

@Shinvu's answer is well written. I would just like to point to a video that explains the gradient of the max() operation within a computational graph, which is quick to grasp.




