Q-learning neural network experience replay problem














I am currently trying to create a tic-tac-toe Q-learning neural network to introduce myself to reinforcement learning. It didn't work, so I decided to try a simpler project in which the network trains against static data rather than against another neural network.
This led me to follow the guidelines from this website - http://outlace.com/rlpart3.html



However, after programming this, the simple version (the one without experience replay) only works about half the time. On some runs of the program the game is played correctly; on others the agent just moves back and forth during test runs.
When I try to implement experience replay to complete the harder version, the program constantly gets itself into a loop of moving back and forth when testing.



I have a limit of 100 batches, where a batch is what the neural network is trained on. I am wondering whether this is an appropriate amount, or whether there are common mistakes in implementing experience replay that I may have made.



My current understanding of experience replay is (roughly sketched in code below):
1. Run the program.
2. After each turn, the data used to train the network is saved into a batch.
3. Once you have reached x (100) batches, pick one out and train on it.
4. Overwrite the oldest batch with the new batches that come in.
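
Roughly, this is how I picture that loop in code (a minimal Python sketch; the function names and the train_step placeholder are mine, not from the tutorial):

    import random
    from collections import deque

    CAPACITY = 100                    # the "x (100)" limit mentioned above
    replay = deque(maxlen=CAPACITY)   # step 4: the oldest entry is dropped automatically

    def remember(state, action, reward, next_state, done):
        # step 2: after each turn, store the transition that was used for training
        replay.append((state, action, reward, next_state, done))

    def maybe_train(train_step):
        # step 3: once enough experience has accumulated, sample a stored entry and train on it
        if len(replay) >= CAPACITY:
            sample = random.sample(replay, 1)   # one entry (or several at once)
            train_step(sample)                  # placeholder for computing targets + backprop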



If anyone could tell me where I have gone wrong, or give feedback on the experience replay or on the quality of the question, I would be very grateful.



EDIT: Another question I have about training a neural network against a neural network: do you train it against a completely separate network that trains itself, or do you train it against a previous version of itself? And when training it against the other neural network, do you turn epsilon down so that the opposing network does not make any random moves?



























Tags: machine-learning neural-network q-learning

asked Feb 1 '18 at 15:58 by Peter Jamieson, edited Feb 1 '18 at 17:06

1 Answer



















I am pretty sure you have to:

Wait until there is enough data in the memory (100 entries, say), then take out a minibatch (for example, 10 random elements) and run backprop 10 times, once for each selected element.

You then average the resulting gradients and finally correct your network with that averaged gradient.



Afterwards, either:

• put the elements back in the memory (where you took them from)

  Samples that have high priority are likely to be used in training many times. Reducing the weights on these often-seen samples basically tells the network: "train on these samples, but without much emphasis; they'll be seen again soon."

• or throw them out.

Then keep playing the game, adding, say, another 30 examples before doing another weight-correction session using a minibatch of 10 elements (a sketch of this schedule follows below).

By a "session" I mean a sequence of backprops whose result is the averaged gradient that is finally used to correct the network.
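
A rough sketch of that schedule in Python (the env, agent and correction_session objects are assumed interfaces here, not any particular library):

    import random
    from collections import deque

    def train_loop(env, agent, correction_session, num_steps,
                   memory_capacity=1000, warmup=100,
                   steps_between_sessions=30, minibatch_size=10):
        # Keep playing; every ~30 new transitions, run one correction "session"
        # on a minibatch of 10 stored elements (backprop each, average the
        # gradients, apply the averaged gradient once).
        memory = deque(maxlen=memory_capacity)
        state = env.reset()                            # assumed environment interface
        for step in range(1, num_steps + 1):
            action = agent.epsilon_greedy(state)       # assumed helper on the agent
            next_state, reward, done = env.step(action)
            memory.append((state, action, reward, next_state, done))
            state = env.reset() if done else next_state

            if step % steps_between_sessions == 0 and len(memory) >= warmup:
                minibatch = random.sample(list(memory), minibatch_size)
                correction_session(minibatch)          # averaged-gradient update, as described above
        return memory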




          EDIT: Another question I have in terms of training a neural network against a neural network, is that do you train it against a completely separate network that trains itself, or do you train it against a previous version of itself. And when training it against the other neural network, do you turn the epsilon greedy down to make the opposing neural network not use any random moves.




Consider using just one network. Let's say our memory bank contains several elements:

    ...
    {...}
    {stateFrom, takenAction, takenActionQval, immediateReward, stateNext } <-- a single element
    {...}
    {...}
    {...}
    ...

When using each element in your memory during the correction session (one element after the other), you need to:




1. Pick an element. As shown above, it contains stateFrom, the taken action (action X), the Q-value of action X, the reward you received, and the state it led to.

2. Run forward prop as if you were in the "next state" mentioned in that element, and get its best [action Y, Q-value]. In fact, action Y doesn't have to be the action with the highest Q-value; it could instead be the epsilon-greedy action chosen from the "next state" - you would then have SARSA instead of Q-learning.

3. Obtain the delta by which the "Q-value of action X" differs from the immediate reward plus the (discounted) "Q-value of action Y".

4. Get the deltas of the other elements you've chosen from the bank.

5. Sum up all the deltas and correct your network with them.


I've intentionally skipped over one thing: you don't actually store takenActionQval, because it might be obsolete by the time you fetch the element from the memory bank. You have to re-compute these scores during backprop (as in the sketch below).
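
As a concrete illustration of steps 1-5, with the Q-values re-computed rather than read from the memory bank, here is a hedged numpy sketch; q_values(state) stands for a forward pass of your network returning the Q-value vector, and the done flag marking terminal states is an extra assumption of mine:

    import numpy as np

    def minibatch_deltas(q_values, minibatch, gamma=0.9):
        # q_values(state) -> numpy array with one Q-value per action (assumed interface)
        deltas = []
        for state_from, action_x, reward, state_next, done in minibatch:
            q_from = q_values(state_from)            # re-computed now, never stored
            if done:
                target = reward                      # terminal state: no successor value
            else:
                q_next = q_values(state_next)        # step 2: forward prop from "next state"
                # Q-learning: bootstrap from the best next action. For SARSA you would
                # instead use the Q-value of the epsilon-greedy action actually chosen.
                target = reward + gamma * np.max(q_next)
            deltas.append(target - q_from[action_x]) # step 3: the delta for action X
        return deltas                                # steps 4-5: collect them, then correct the network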





Therefore, you are training against the version of your network from the previous (but similar) correction session.



Notice that you don't store ["next state", action Y], because by the time you select that element for training (perhaps it isn't selected for several minibatches), the network might assign a different Q-value to action Y.



You could also copy your network to a second network (a target network), but only, say, every 200 timesteps. In the meantime, you would still correct your network towards targets computed with the target network, after every 30 timesteps.
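
A short sketch of that variant, assuming Keras-style get_weights/set_weights accessors (illustrative, not tied to a specific framework):

    import random

    TARGET_SYNC_EVERY = 200   # copy online weights into the target network this often
    SESSION_EVERY = 30        # run a correction session this often

    def step_with_target_network(step, online_net, target_net, memory, correction_session):
        if step % TARGET_SYNC_EVERY == 0:
            target_net.set_weights(online_net.get_weights())   # assumed Keras-style accessors
        if step % SESSION_EVERY == 0 and len(memory) >= 100:
            minibatch = random.sample(list(memory), 10)
            correction_session(minibatch, target_net)           # targets come from the frozen copy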



Notice the intuition for why this works: the Q-values sort of "flow" from the finish back to the beginning, a little with every new episode. You are always updating the current state towards its successor state, and training the successor state against its own (even further) successor.






answered Feb 3 '18 at 2:54 by Kari













• Thanks so much for the answer, this is really helpful. I would just like to make sure I understand: you said that I would be using SARSA instead of Q-learning. Are you suggesting that I change the algorithm to SARSA, or is that specific to training against multiple neural networks? I also don't think I worded the EDIT part correctly. What I meant is that when running tic-tac-toe, the opposing player would be another neural network - like DeepMind's AlphaGo. I was wondering whether the opposing neural network plays without random actions, and whether or not it uses previous weights of --
  – Peter Jamieson, Feb 4 '18 at 10:40










• -- the main NN, or whether the opposing neural network learns by itself.
  – Peter Jamieson, Feb 4 '18 at 10:41










• I have not yet coded two competing or self-improving networks, so you'll need to post a separate question on that specifically. I would assume SARSA would indeed be better in this case, since the agent would learn not to trust the environment too much. That's because the other agent would be part of the environment, and thus the environment would constantly change. I would also advise checking page 25 of users.isr.ist.utl.pt/~mtjspaan/readingGroup/learningNeto05.pdf, although I still need to read it myself.
  – Kari, Feb 4 '18 at 15:58












