Why a Random Reward in One-step Dynamics MDP?
I am reading the 2018 book by Sutton & Barto on Reinforcement Learning, and I am wondering about the benefit of defining the one-step dynamics of an MDP as
$$
p(s',r|s,a) = Pr(S_{t+1}=s',R_{t+1}=r|S_t=s, A_t=a)
$$
where $S_t$ is the state and $A_t$ the action at time $t$, and $R_{t+1}$ is the reward received on the transition.
This formulation would be useful if we were to allow different rewards when transitioning from $s$ to $s'$ by taking an action $a$, but that does not make sense to me. I am used to the definition based on $p(s'|s,a)$ and $r(s,a,s')$, which of course can be derived from the one-step dynamics above.
Clearly, I am missing something. Any enlightenment would be really helpful. Thanks!
machine-learning reinforcement-learning
edited 6 mins ago by Esmailian
asked Mar 16 at 21:59 by RLSelfStudy
Could you explain why "allow different rewards when transitioning from 𝑠 to 𝑠′ by taking an action 𝑎" does not make sense to you? It makes sense to me, but I cannot explain it unless you give more details about what you think is wrong with the idea.
– Neil Slater
Mar 16 at 22:39
My understanding is that, given a starting state and a target state reachable by applying action $a$, there is only a single reward. If we have multiple rewards, then we are allowing the Markov chain model (thought of as a graph) to be a multigraph, where we can go from $s$ to $s'$ (with $a$) over one edge with reward $r$ and over another with reward $r'$. I thought this was not the right model, but again, I might be wrong.
– RLSelfStudy
Mar 16 at 22:46
2 Answers
In general, $R_{t+1}$ is a random variable with conditional probability distribution $Pr(R_{t+1}=r|S_t=s,A_t=a)$. So it can potentially take on a different value each time action $a$ is taken in state $s$.
Some problems don't require any randomness in their reward function. Using the expected reward $r(s,a,s')$ is simpler in this case, since we don't have to worry about the reward's distribution. However, some problems do require randomness in their reward function. Consider the classic multi-armed bandit problem, for example: the payoff from a machine is generally not deterministic.
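For reference, here is how the two-function description falls out of the joint one-step dynamics. This is a sketch that assumes a finite set of possible rewards, so that sums can be used; with continuous rewards the sums would become integrals.
$$
p(s'|s,a) = \sum_{r} p(s',r|s,a), \qquad
r(s,a,s') = E[R_{t+1}|S_t=s, A_t=a, S_{t+1}=s'] = \sum_{r} r \, \frac{p(s',r|s,a)}{p(s'|s,a)}
$$
So $r(s,a,s')$ is only the mean of the reward for a given transition. The joint form $p(s',r|s,a)$ also keeps the spread around that mean.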
As the basis for RL, we want the MDP to be as general as possible. We model the reward in MDPs as a random variable because doing so gives us that generality, and because it is useful in practice.
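A minimal sketch of this in Python, assuming a tiny tabular MDP. The state names, rewards, and probabilities are made up for illustration, and the helper names `transition_prob` and `expected_reward` are hypothetical. It stores the joint dynamics $p(s',r|s,a)$ as a table in which the same transition can carry two different rewards, and then recovers $p(s'|s,a)$ and the expected reward $r(s,a,s')$ from it.

```python
from collections import defaultdict

# Hypothetical joint dynamics p(s', r | s, a) for one state-action pair.
# The transition s0 -> s1 under action "a" can yield reward 0.0 or 10.0.
dynamics = {
    ("s0", "a"): {
        ("s1", 0.0): 0.3,
        ("s1", 10.0): 0.2,
        ("s2", 1.0): 0.5,
    }
}


def transition_prob(dynamics, s, a):
    """Marginalise out the reward: p(s'|s,a) = sum_r p(s',r|s,a)."""
    probs = defaultdict(float)
    for (s_next, _reward), p in dynamics[(s, a)].items():
        probs[s_next] += p
    return dict(probs)


def expected_reward(dynamics, s, a):
    """Expected reward per transition: sum_r r * p(s',r|s,a) / p(s'|s,a)."""
    p_next = transition_prob(dynamics, s, a)
    weighted = defaultdict(float)
    for (s_next, reward), p in dynamics[(s, a)].items():
        weighted[s_next] += reward * p
    return {s_next: weighted[s_next] / p_next[s_next] for s_next in p_next}


p_next = transition_prob(dynamics, "s0", "a")
r_mean = expected_reward(dynamics, "s0", "a")
```

Here the table is the more general object: the two derived quantities can always be computed from it, but the table cannot be rebuilt from them, because the mean reward for `s0 -> s1` hides the fact that the reward is sometimes 0 and sometimes 10.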
answered Mar 17 at 0:39 by Philip Raeisghasem
The state is just an observation of the environment. In many cases we cannot capture all the variables needed to fully describe the environment (or it would be too costly in time or space to cover everything). Imagine you are designing a robot: you cannot, and do not need to, define a state that covers the direction of the wind, the density of the atmosphere, and so on.
So although you are in the same state (which only means that the variables you care about have the same values, not that all the dynamics of the environment are the same), you are not in exactly the same environment.
So we can say that the reward for going from one particular state to another may differ, because the state is not the environment, and the environment can never be exactly the same, because time keeps flowing.
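The robot example can be sketched as a small simulation. Everything here is an assumption made for illustration: the `step` function, the wind values, and the reward formula are hypothetical and do not come from any particular library or from the book. The agent's state is only its position, while the wind affects the reward without being part of the state.

```python
import random


def step(position, action, rng):
    """One step of a hypothetical robot environment.

    The agent observes only `position`. The wind is part of the
    environment but is deliberately left out of the state.
    """
    wind = rng.choice([0.0, 0.5, 1.0])  # hidden from the agent
    next_position = position + action   # deterministic transition
    reward = -1.0 - wind                # energy cost depends on the wind
    return next_position, reward


rng = random.Random(0)
samples = [step(0, 1, rng) for _ in range(1000)]

# Same state, same action, same next state, yet several distinct rewards.
next_states = {s_next for s_next, _ in samples}
rewards = {reward for _, reward in samples}
```

From the agent's point of view, every sample is the same transition from position 0 to position 1, so the only way to describe what it sees is a distribution over rewards for that transition.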
answered 2 hours ago by 苏东远
Very good explanation!
– Esmailian
11 mins ago