AlphaGo (and other game programs using reinforcement learning) without a human database
I am not a specialist in this subject, and my question is probably very naive. It stems from an attempt to understand the power and limitations of reinforcement learning as used in the AlphaGo program.
The AlphaGo program was built using, among other things (Monte Carlo tree search, etc.), neural networks that are trained on a huge database of human-played Go games and then reinforced by having versions of the program play against each other many times.
Now I wonder what would happen if we tried to build such a program without a human database, i.e. starting with a basic Go program that knows only the rules and some method of exploring trees, and letting it play against itself to improve its neural network. Would we, after many games against itself, arrive at a program able to compete with or beat the best human players? And if so, how many games (in order of magnitude) would be needed? Or, on the contrary, would such a program converge toward a much weaker player?
I assume the experiment has not been done, since AlphaGo is so recent. But the answer may nevertheless be obvious to a specialist. Otherwise any educated guess will interest me.
One can also ask the same question for "simpler" games. If we used roughly the same reinforcement-learning techniques as AlphaGo, but with no human database, for a chess program, would we eventually get a program able to beat the best human? And if so, how fast? Has this been tried? Or, if not for chess, what about checkers, or even simpler games?
Thanks a lot.
reinforcement-learning
asked Apr 10 '16 at 5:06
Joël
4 Answers
I'm no expert, but it looks like AlphaGo Zero answers your question.
https://deepmind.com/blog/alphago-zero-learning-scratch/
Previous versions of AlphaGo initially trained on thousands of human amateur and professional games to learn how to play Go. AlphaGo Zero skips this step and learns to play simply by playing games against itself, starting from completely random play. In doing so, it quickly surpassed human level of play and defeated the previously published champion-defeating version of AlphaGo by 100 games to 0.
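For intuition, here is a toy-scale sketch of what "learning purely from self-play, starting from random play" can mean, scaled down to tic-tac-toe with a tabular value function in place of Go and deep networks. It is a didactic illustration only, not DeepMind's method; all names and the simple learning rule are my own.

```python
# Toy illustration of the "no human data" idea: tic-tac-toe learned purely by
# self-play, with a tabular value function instead of a neural network.
import random
from collections import defaultdict

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return 'draw' if ' ' not in board else None

# V[state] = learned value of a position, from the point of view of the player
# who has just moved into it. Unseen positions start at 0 (i.e. random play).
V = defaultdict(float)

def choose(board, player, eps=0.1):
    moves = [i for i, s in enumerate(board) if s == ' ']
    if random.random() < eps:                      # keep exploring
        return random.choice(moves)
    def score(m):                                  # greedy on the learned values
        nxt = board[:m] + player + board[m + 1:]
        w = winner(nxt)
        if w == player:
            return 1.0
        if w == 'draw':
            return 0.0
        return V[nxt]
    return max(moves, key=score)

def self_play_game(alpha=0.3):
    board, player, trajectory = ' ' * 9, 'X', []
    while winner(board) is None:
        m = choose(board, player)
        board = board[:m] + player + board[m + 1:]
        trajectory.append((board, player))
        player = 'O' if player == 'X' else 'X'
    w = winner(board)
    for state, p in trajectory:                    # learn from the final outcome only
        target = 0.0 if w == 'draw' else (1.0 if w == p else -1.0)
        V[state] += alpha * (target - V[state])

for _ in range(20000):                             # pure self-play, no human games
    self_play_game()
```

AlphaGo Zero replaces the table with a deep network and the greedy lookahead with Monte Carlo tree search, but the principle is the same: the only training signal is the outcome of games the program plays against itself.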
answered Oct 19 '17 at 2:10
Gabe
Is this more recent?
– kosmos
Oct 19 '17 at 2:24
This was published on October 18th, 2017.
– ncasas
Oct 19 '17 at 11:06
It would be interesting to know the results against humans, because one reason for the pre-trained human database is to refine the MCTS algorithm against human opponents. The original AlphaGo was optimised to play against humans, not against other machine learners. As such, it is harder to say whether AlphaGo Zero is strictly "better" than the original AlphaGo, or just dominates it in a game-theory sense - e.g. AlphaGo Zero beats AlphaGo beats Lee Sedol beats AlphaGo Zero...
– Neil Slater
Oct 19 '17 at 11:58
Neil, yes, this would be interesting. But I would not bet a cent on the human's chances against AlphaGo Zero.
– Joël
Oct 19 '17 at 16:28
The networks that were pre-trained with human data were used to make the search more efficient, not more human-like. After the supervised learning phase finished, the network played against itself. In addition to that (you can refer to my answer below), the $Q$ function in the tree was affected by the prior probability from the policy network, which then decayed over the number of visits to encourage exploration.
– Constantinos
Nov 2 '17 at 4:55
The same question was put to the authors of the AlphaGo paper, and their answer was that we don't know what would happen if AlphaGo learned from scratch (they haven't tested it).
However, given the complexity of the game, it would be a difficult task to train an algorithm from scratch without prior knowledge. So it is reasonable, at the beginning, to build such a system up to master level using knowledge acquired from humans.
It is worth noting that, although the human moves bias the action selection at the tree nodes (states), this prior has a decay factor: the more often a specific state is visited, the weaker the prior becomes, which encourages the algorithm to explore.
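Concretely, the selection rule in the tree is roughly of the following form (a sketch in the notation of the AlphaGo paper):

$$a_t = \arg\max_a \bigl( Q(s_t, a) + u(s_t, a) \bigr), \qquad u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)},$$

where $P(s,a)$ is the prior probability given by the (human-trained) policy network and $N(s,a)$ is the visit count of that edge, so the weight of the prior shrinks as a node is visited more often.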
It is unknown how close AlphaGo's current level of mastery is to a human's way of playing (in the tournament it made one move that a human would have had almost zero probability of playing! - but it also made some really bad moves). Possibly all these questions will only be answered by actually implementing the corresponding testing algorithms.
Edit: I ought to update my answer, as DeepMind's recent paper answers your question. There were lots of advancements that came out of all the previous experience with the first version of AlphaGo, and it is really worth reading.
answered May 12 '16 at 15:01
Constantinos
You are welcome :)
– Constantinos
May 12 '16 at 19:04
As far as I understand the AlphaGo algorithm, it is based on a simple reinforcement learning (RL) framework, using Monte Carlo tree search to select the best actions. On top of that, the states and actions covered by the RL algorithm are not simply all possible configurations of the game (Go has a huge complexity); they are based on a policy network and a value network, learned from real games and then improved by playing AlphaGo against AlphaGo.
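As a rough sketch of how the two networks plug into the tree search (the names `policy_net`, `value_net` and `Node` are hypothetical placeholders; this illustrates the idea, not the actual AlphaGo implementation):

```python
# Hypothetical sketch: a policy network narrows the branching of the search
# and a value network replaces exhaustive rollouts with a learned estimate.
import math

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a): move probability from the policy network
        self.visits = 0           # N(s, a)
        self.value_sum = 0.0      # sum of leaf evaluations backed up through this edge
        self.children = {}        # move -> Node

    def q(self):
        # Mean evaluation Q(s, a); 0 for an unvisited edge.
        return self.value_sum / self.visits if self.visits else 0.0

def expand(node, state, policy_net):
    # The policy network proposes a distribution over legal moves, so the tree
    # only branches seriously on moves a strong player might consider.
    for move, p in policy_net(state).items():
        node.children[move] = Node(prior=p)

def evaluate(state, value_net):
    # The value network gives a learned estimate of the chance of winning
    # from this position, instead of playing the game out to the end.
    return value_net(state)

def select(node, c_puct=1.0):
    # PUCT-style choice: exploit high mean value Q, but let the policy prior
    # steer exploration; the prior's influence fades as visits accumulate.
    total = sum(child.visits for child in node.children.values())
    def score(item):
        move, child = item
        u = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
        return child.q() + u
    return max(node.children.items(), key=score)
```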
Then we might wonder whether training on real games is just a shortcut to save time or a necessary ingredient for this level of performance. I guess no one really knows the answer, but we can state some assumptions. First, the human ability to promote good moves comes from intelligence much more complex than a simple neural net; for board games it is a mix of memory, experience, logic and feeling. In that direction, I'm not sure the AlphaGo algorithm could build such a model without explicitly exploring a huge fraction of all configurations of the game of Go (which is practically impossible). Current research focuses on building richer representations of such games, for example relational RL or inductive logic programming. Then for simpler games (it might be the case for chess, but nothing is certain), I would say that AlphaGo could rediscover techniques similar to humans' by playing against itself, especially for openings (in chess there are only 20 possible first moves).
Still, this is only an opinion. But I'm quite sure that the key to answering your question lies in the RL approach, which is nowadays still quite simple in terms of knowledge representation. We are not really able to identify what makes us able to handle these games, and the best way we have found so far to defeat humans is roughly to learn from them and improve the learned model (a bit) with massive computation.
answered Apr 10 '16 at 13:19
debzsud
Competitive self-play without a human database is possible even for complicated, partially observed environments. OpenAI is focusing on this direction.
According to this article:
Self-play ensures that the environment is always the right difficulty for an AI to improve.
That's an important reason for the success of self-play.
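One way to picture why the difficulty stays "right": the opponent is the learning agent itself, or a recent snapshot of it, so both sides improve in lockstep. A minimal sketch of that arrangement (the `play_match` callable and the `agent.update(...)` interface are hypothetical placeholders, not OpenAI's actual training code):

```python
# Competitive self-play against a pool of past snapshots, so the opponent's
# strength always tracks the learner's.
import copy
import random

def self_play_training(agent, play_match, iterations, snapshot_every=100):
    """agent: object with .update(episode); play_match(a, b) -> episode data."""
    opponent_pool = [copy.deepcopy(agent)]           # start against a copy of itself
    for step in range(1, iterations + 1):
        # Usually play the current self; sometimes an older snapshot, which
        # guards against forgetting how to beat earlier strategies.
        opponent = agent if random.random() < 0.8 else random.choice(opponent_pool)
        episode = play_match(agent, opponent)
        agent.update(episode)                         # one reinforcement-learning step
        if step % snapshot_every == 0:
            opponent_pool.append(copy.deepcopy(agent))
    return agent
```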
OpenAI achieved superhuman results for Dota 2 1v1: on August 11th, 2017, their bot beat Dendi 2-0 under standard tournament rules.
The bot learned the game from scratch by self-play, and does not use imitation learning or tree search. This is a step towards building AI systems which accomplish well-defined goals in messy, complicated situations involving real humans.
Beyond games, this direction is also promising for robotics tasks.
We’ve found that self-play allows simulated AIs to discover physical skills like tackling, ducking, faking, kicking, catching, and diving for the ball, without explicitly designing an environment with these skills in mind.
As a next step, they are extending the method to learn how to cooperate, compete and communicate, rather than limiting it to pure self-play.
answered Feb 15 '18 at 12:07
TQA