What is the difference between DDQN and DQN?

I think I have not understood the difference between DQN and DDQN at the implementation level. I understand that the target network is updated while DDQN is running, but I do not see how that is done in this code.

In the DDQN implementation at https://github.com/keon/deep-q-learning, the line self.target_model.set_weights(self.model.get_weights()) is what is supposedly added to DQN in order to turn it into DDQN, and it is called when an episode finishes. But that call only happens at the point where we leave the loop with break, so it seems to me there is no difference between the two!

What is wrong in my reasoning? (Or does the difference only appear at test time? Is this code only for training, with testing done by setting the exploration rate to 0 and running one more episode with the learned weights? Is that right?)

So, what is the difference between the presented DQN (https://github.com/keon/deep-q-learning/blob/master/dqn.py) and DDQN (https://github.com/keon/deep-q-learning/blob/master/ddqn.py) in this repository?
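
To be concrete, the loop structure I am referring to looks roughly like this. It is a toy, runnable reconstruction; the agent and environment are trivial stand-ins, and the method names are only assumed to match the repository, so treat it as a sketch rather than the actual code:

```python
import random

class ToyAgent:
    """Trivial stand-in for the repository's agent, for illustration only."""
    def __init__(self):
        self.memory = []
        self.target_updates = 0

    def act(self, state):
        return random.choice([0, 1])                 # pretend epsilon-greedy choice

    def remember(self, *transition):
        self.memory.append(transition)               # store the transition

    def update_target_model(self):
        self.target_updates += 1                     # real code copies weights here

    def replay(self, batch_size):
        random.sample(self.memory, batch_size)       # real code trains on a minibatch

agent, batch_size = ToyAgent(), 4
for episode in range(3):
    state, done, steps = 0, False, 0
    while not done:
        action = agent.act(state)
        next_state, reward, steps = state + 1, 1.0, steps + 1
        done = steps >= 10                           # toy episode termination
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            agent.update_target_model()              # the line in question: it runs once
            break                                    # per episode, just before the break
    if len(agent.memory) > batch_size:
        agent.replay(batch_size)
```
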
Tags: deep-learning, reinforcement-learning, dqn, deep-network, weight-initialization

2 Answers

Answer by emilyfy (answered Oct 15 '18):

From what I understand, the difference between DQN and DDQN lies in how the target Q-values for the next states are calculated. In DQN we simply take the maximum of the Q-values over all possible actions, which tends to pick over-estimated values; Double DQN (DDQN) therefore proposes to estimate the value of the chosen action instead, where the chosen action is the one selected by our policy model.
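
To make the contrast concrete, here is a minimal sketch of the two target computations (my own illustration, not the repository's code; model and target_model are assumed to be Keras-style models exposing predict, and gamma is the discount factor):

```python
import numpy as np

def dqn_target(reward, next_state, done, target_model, gamma=0.95):
    # Vanilla DQN: the same network both selects and evaluates the next action
    # by taking the max, which is prone to over-estimation.
    if done:
        return reward
    q_next = target_model.predict(next_state)[0]           # Q(s', .) from the target net
    return reward + gamma * np.amax(q_next)

def ddqn_target(reward, next_state, done, model, target_model, gamma=0.95):
    # Double DQN: the online (policy) model selects the action, the target
    # model evaluates it, decoupling selection from evaluation.
    if done:
        return reward
    best_action = np.argmax(model.predict(next_state)[0])  # selection: online net
    q_next = target_model.predict(next_state)[0]           # evaluation: target net
    return reward + gamma * q_next[best_action]
```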



I looked through the code and got confused too, because this bit did not seem to be implemented; then I realized the relevant lines are commented out. Those commented lines would have selected the action for the next state using the current model and then used the target model to get the Q-value of that selected action. They were changed in a commit some time ago, I have no idea why.



As for the line self.target_model.set_weights(self.model.get_weights()), that is the update of the target model. The target model is supposed to compute the same function as the policy model, but the DQN algorithm deliberately keeps them separate and only copies the weights across once in a while to stabilize training. This can be done every fixed number of steps; in this case they seem to do it once per episode.
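
For completeness, here is a small runnable sketch of that hard update (the tiny network and the schedules mentioned in the comments are my own assumptions for illustration, not the repository's architecture):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_q_net(state_size=4, action_size=2):
    # Tiny Q-network purely for illustration.
    net = keras.Sequential([
        keras.Input(shape=(state_size,)),
        layers.Dense(24, activation="relu"),
        layers.Dense(action_size, activation="linear"),
    ])
    net.compile(loss="mse", optimizer="adam")
    return net

model = build_q_net()          # online / policy network, trained every learning step
target_model = build_q_net()   # target network, only refreshed occasionally

def update_target_model():
    # Hard update: copy the online weights into the target network.
    target_model.set_weights(model.get_weights())

# Typical schedules: once per episode (what this repository appears to do),
# or every C environment steps, e.g.  if step % C == 0: update_target_model()
update_target_model()
```
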
Answer by Daniel Chepenko (answered Jan 17):

DQN is essentially just Q-learning that uses a neural network to approximate the Q-function, together with "hacks" such as experience replay, a target network, and reward clipping.
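
As a hedged illustration of one of those "hacks", here is a minimal experience-replay buffer (the capacity and names are arbitrary choices of mine, not taken from any particular implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=2000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions are discarded

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive
        # transitions, which helps stabilize the supervised-style updates.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```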



In the original paper the authors use a convolutional network that takes the raw image pixels and feeds them through a stack of convolutional layers. However, plain DQN has a couple of statistical problems:




1. DQN approximates a set of values that are highly interrelated; the dueling architecture (the decomposition shown below) addresses this.

2. DQN tends to be over-optimistic: it will over-value being in a state even when the high estimate arose purely from statistical error. Double DQN addresses this.


$$Q(s,a) = V(s) + A(s,a)$$



By decoupling the two estimates, the dueling network can intuitively learn which states are (or are not) valuable without having to learn the effect of each action in each state, since it is also computing V(s).



Being able to compute V(s) on its own is particularly useful for states whose actions do not affect the environment in any relevant way; in that case it is unnecessary to estimate the value of each action. For instance, moving right or left only matters if there is a risk of collision.
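
A minimal sketch of such a dueling head in Keras, assuming a small fully connected network (layer sizes are arbitrary; the mean-advantage subtraction follows the dueling-network paper and is not shown in the equation above):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_dueling_q_net(state_size=4, action_size=2):
    inputs = layers.Input(shape=(state_size,))
    x = layers.Dense(24, activation="relu")(inputs)

    value = layers.Dense(1)(x)                # V(s): how good the state is
    advantage = layers.Dense(action_size)(x)  # A(s,a): how good each action is

    # Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)); the mean is subtracted so that
    # V and A remain identifiable.
    q_values = layers.Lambda(
        lambda va: va[0] + (va[1] - tf.reduce_mean(va[1], axis=1, keepdims=True))
    )([value, advantage])

    model = Model(inputs, q_values)
    model.compile(loss="mse", optimizer="adam")
    return model
```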



As @emilyfy said, self.target_model.set_weights(self.model.get_weights()) is the update of the target model.