Q-learning neural network experience replay problem














I am currently trying to create a tic-tac-toe Q-learning neural network to introduce myself to reinforcement learning. It didn't work, so I decided to try a simpler project in which the network trains against static data rather than against another neural network.
This led me to follow the guidelines from this website - http://outlace.com/rlpart3.html



However, after programming this, the simple version (the one without experience replay) only works about half the time. On some runs of the program the game is played correctly; on others the agent just moves back and forth during test runs.
When I try to implement experience replay to complete the harder version, the program constantly gets itself into a loop of moving back and forth when testing.



I have a limit of 100 batches, where a batch is what the neural network is trained on. I am wondering whether this is an appropriate amount, or whether there are common mistakes in implementing experience replay that I may have made.



My current understanding of experience replay is (roughly sketched in code below):
1. Run the program.
2. After each turn, the data used to train the network is saved into a batch.
3. Once you have reached x (100) batches, pick one out and train on it.
4. Overwrite the oldest batch with the new batches that come in.
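
Roughly, this is how I picture that loop in code (a minimal Python sketch; the function names and the train_step placeholder are mine, not from the tutorial):

    import random
    from collections import deque

    CAPACITY = 100                    # the "x (100)" limit mentioned above
    replay = deque(maxlen=CAPACITY)   # step 4: the oldest entry is dropped automatically

    def remember(state, action, reward, next_state, done):
        # step 2: after each turn, store the transition that was used for training
        replay.append((state, action, reward, next_state, done))

    def maybe_train(train_step):
        # step 3: once enough experience has accumulated, sample a stored entry and train on it
        if len(replay) >= CAPACITY:
            sample = random.sample(replay, 1)   # one entry (or several at once)
            train_step(sample)                  # placeholder for computing targets + backprop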



If anyone could tell me where I have gone wrong, or give feedback on the experience replay or on the quality of the question, I would be very grateful.



EDIT: Another question I have about training a neural network against a neural network: do you train it against a completely separate network that trains itself, or do you train it against a previous version of itself? And when training it against the other neural network, do you turn epsilon down so that the opposing network does not make any random moves?



























Tags: machine-learning neural-network q-learning

asked Feb 1 '18 at 15:58 by Peter Jamieson, edited Feb 1 '18 at 17:06

1 Answer



















I am pretty sure you have to:

Wait until there is enough data in the memory (100 entries, say), then take out a minibatch (for example, 10 random elements) and run backprop 10 times, once for each selected element.

You then average the resulting gradients and finally correct your network with that averaged gradient.



Afterwards, either:

• put the elements back in the memory (where you took them from)

  Samples that have high priority are likely to be used in training many times. Reducing the weights on these often-seen samples basically tells the network: "train on these samples, but without much emphasis; they'll be seen again soon."

• or throw them out.

Then keep playing the game, adding, say, another 30 examples before doing another weight-correction session using a minibatch of 10 elements (a sketch of this schedule follows below).

By a "session" I mean a sequence of backprops whose result is the averaged gradient that is finally used to correct the network.
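
A rough sketch of that schedule in Python (the env, agent and correction_session objects are assumed interfaces here, not any particular library):

    import random
    from collections import deque

    def train_loop(env, agent, correction_session, num_steps,
                   memory_capacity=1000, warmup=100,
                   steps_between_sessions=30, minibatch_size=10):
        # Keep playing; every ~30 new transitions, run one correction "session"
        # on a minibatch of 10 stored elements (backprop each, average the
        # gradients, apply the averaged gradient once).
        memory = deque(maxlen=memory_capacity)
        state = env.reset()                            # assumed environment interface
        for step in range(1, num_steps + 1):
            action = agent.epsilon_greedy(state)       # assumed helper on the agent
            next_state, reward, done = env.step(action)
            memory.append((state, action, reward, next_state, done))
            state = env.reset() if done else next_state

            if step % steps_between_sessions == 0 and len(memory) >= warmup:
                minibatch = random.sample(list(memory), minibatch_size)
                correction_session(minibatch)          # averaged-gradient update, as described above
        return memory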




          EDIT: Another question I have in terms of training a neural network against a neural network, is that do you train it against a completely separate network that trains itself, or do you train it against a previous version of itself. And when training it against the other neural network, do you turn the epsilon greedy down to make the opposing neural network not use any random moves.




Consider using just one network. Let's say our memory bank contains several elements:

    ...
    {...}
    {stateFrom, takenAction, takenActionQval, immediateReward, stateNext } <-- a single element
    {...}
    {...}
    {...}
    ...

When using each element in your memory during the correction session (one element after the other), you need to:




1. Pick an element. As shown above, it contains stateFrom, the taken action (action X), the Q-value of action X, the reward you received, and the state it led to.

2. Run forward prop as if you were in the "next state" mentioned in that element, and get its best [action Y, Q-value]. In fact, action Y doesn't have to be the action with the highest Q-value; it could instead be the epsilon-greedy action chosen from the "next state" - you would then have SARSA instead of Q-learning.

3. Obtain the delta by which the "Q-value of action X" differs from the immediate reward plus the (discounted) "Q-value of action Y".

4. Get the deltas of the other elements you've chosen from the bank.

5. Sum up all the deltas and correct your network with them.


I've intentionally skipped over one thing: you don't actually store takenActionQval, because it might be obsolete by the time you fetch the element from the memory bank. You have to re-compute these scores during backprop (as in the sketch below).
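
As a concrete illustration of steps 1-5, with the Q-values re-computed rather than read from the memory bank, here is a hedged numpy sketch; q_values(state) stands for a forward pass of your network returning the Q-value vector, and the done flag marking terminal states is an extra assumption of mine:

    import numpy as np

    def minibatch_deltas(q_values, minibatch, gamma=0.9):
        # q_values(state) -> numpy array with one Q-value per action (assumed interface)
        deltas = []
        for state_from, action_x, reward, state_next, done in minibatch:
            q_from = q_values(state_from)            # re-computed now, never stored
            if done:
                target = reward                      # terminal state: no successor value
            else:
                q_next = q_values(state_next)        # step 2: forward prop from "next state"
                # Q-learning: bootstrap from the best next action. For SARSA you would
                # instead use the Q-value of the epsilon-greedy action actually chosen.
                target = reward + gamma * np.max(q_next)
            deltas.append(target - q_from[action_x]) # step 3: the delta for action X
        return deltas                                # steps 4-5: collect them, then correct the network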





Therefore, you are training against the version of your network from the previous (but similar) correction session.



Notice that you don't store ["next state", action Y], because by the time you select that element for training (perhaps it isn't selected for several minibatches), the network might assign a different Q-value to action Y.



You could also copy your network to a second network (a target network), but only, say, every 200 timesteps. In the meantime, you would still correct your network towards targets computed with the target network, after every 30 timesteps.
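
A short sketch of that variant, assuming Keras-style get_weights/set_weights accessors (illustrative, not tied to a specific framework):

    import random

    TARGET_SYNC_EVERY = 200   # copy online weights into the target network this often
    SESSION_EVERY = 30        # run a correction session this often

    def step_with_target_network(step, online_net, target_net, memory, correction_session):
        if step % TARGET_SYNC_EVERY == 0:
            target_net.set_weights(online_net.get_weights())   # assumed Keras-style accessors
        if step % SESSION_EVERY == 0 and len(memory) >= 100:
            minibatch = random.sample(list(memory), 10)
            correction_session(minibatch, target_net)           # targets come from the frozen copy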



Notice the intuition for why this works: the Q-values sort of "flow" from the finish back to the beginning, a little with every new episode. You are always updating the current state towards its successor state, and training the successor state against its own (even further) successor.






answered Feb 3 '18 at 2:54 by Kari













• Thanks so much for the answer, this is really helpful. I would just like to make sure I understand: you said that I would be using SARSA instead of Q-learning. Are you suggesting that I change the algorithm to SARSA, or is that specific to training against multiple neural networks? I also don't think I worded the EDIT part correctly. What I meant is that when running tic-tac-toe, the opposing player would be another neural network - like DeepMind's AlphaGo. I was wondering whether the opposing neural network plays without random actions, and whether or not it uses previous weights of --
  – Peter Jamieson, Feb 4 '18 at 10:40










• -- the main NN, or whether the opposing neural network learns by itself.
  – Peter Jamieson, Feb 4 '18 at 10:41










• I have not yet coded two competing or self-improving networks, so you'll need to post a separate question on that specifically. I would assume SARSA would indeed be better in this case, since the agent would learn not to trust the environment too much. That's because the other agent would be part of the environment, and thus the environment would constantly change. I would also advise checking page 25 of users.isr.ist.utl.pt/~mtjspaan/readingGroup/learningNeto05.pdf, although I still need to read it myself.
  – Kari, Feb 4 '18 at 15:58












