Q-learning vs temporal-difference vs model-based reinforcement learning
I'm taking a university course called "Intelligent Machines". We were introduced to three reinforcement learning methods, along with some intuition about when to use each. I quote:
- Q-Learning - Best when MDP can't be solved.
- Temporal Difference Learning - best when MDP is known or can be learned but can't be solved.
- Model-based - best when MDP can't be learned.
Are there any good examples explaining when to choose one method over the other?
machine-learning reinforcement-learning q-learning temporal-difference
5
Q-learning is a temporal difference algorithm.
– Don Reba
Dec 9 '15 at 17:50
Isn't Q-learning used to calculate the Q-value, while temporal-difference learning is used to calculate the value function? (They are related, but not exactly the same, I guess.) Or am I mistaken?
– StationaryTraveller
Dec 9 '15 at 17:53
4
V is the state value function, Q is the action value function, and Q-learning is a specific off-policy temporal-difference learning algorithm. You can learn either Q or V using different TD or non-TD methods, both of which could be model-based or not.
– Don Reba
Dec 10 '15 at 2:42
1
Thanks for the semantics, but it still doesn't help me find an example of when to use which one. When is it better to choose the Q-value over the V-function?
– StationaryTraveller
Dec 11 '15 at 15:49
2
You need the action-value function in order to form a policy. You can learn it directly or you can retrieve it from the state-value function if you know the state transition probability function.
– Don Reba
Dec 11 '15 at 16:06
edited Nov 22 '18 at 15:39 by nbro
asked Dec 9 '15 at 14:17 by StationaryTraveller
1 Answer
Temporal difference (TD) learning is an approach to predicting a quantity that depends on future values of a given signal. It can be used to learn both the V-function and the Q-function, whereas Q-learning is a specific TD algorithm used to learn the Q-function. As stated by Don Reba, you need the Q-function to choose an action (e.g., following an epsilon-greedy policy). If you have only the V-function, you can still derive the Q-function, provided you know the state-transition function and the rewards: for each action, iterate over all possible next states and take the expected reward plus the discounted V-value of the next state, then choose the action with the highest result. For examples and more insights, I recommend the classic book by Sutton and Barto.
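To make the distinction concrete, here is a minimal tabular sketch in Python. It is only an illustration, not code from the book: the function names, the dictionary layout of `P` and `R`, and the values of `ALPHA` and `GAMMA` are all my own assumptions.

```python
from collections import defaultdict

GAMMA = 0.9  # discount factor (assumed value)
ALPHA = 0.1  # learning rate (assumed value)

def td0_update(V, s, r, s_next, done):
    """TD(0): move V[s] toward the one-step target r + GAMMA * V[s_next]."""
    target = r if done else r + GAMMA * V[s_next]
    V[s] += ALPHA * (target - V[s])

def q_learning_update(Q, s, a, r, s_next, done, actions):
    """Q-learning: off-policy TD update that bootstraps from the greedy next action."""
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def q_from_v(V, s, a, P, R):
    """Recover Q(s, a) from V when the model is known.

    P[(s, a)] is a dict {s_next: probability} (hypothetical layout),
    R[(s, a, s_next)] is the reward for that transition.
    """
    return sum(p * (R[(s, a, s2)] + GAMMA * V[s2]) for s2, p in P[(s, a)].items())

# Usage: both tables start at zero and are updated from single transitions.
V = defaultdict(float)
Q = defaultdict(float)
td0_update(V, s=0, r=1.0, s_next=1, done=False)
q_learning_update(Q, s=0, a="right", r=1.0, s_next=1, done=False,
                  actions=["left", "right"])
```

Note that `td0_update` and `q_learning_update` need only sampled transitions, while `q_from_v` needs the transition probabilities and rewards, which is exactly why having only V is not enough to act without a model.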
In model-free RL you don't learn the state-transition function (the model) and rely only on samples. However, you might also want to learn it, for example because you cannot collect many samples and want to generate some virtual ones. In that case we talk about model-based RL.
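As a rough illustration of "learning the model to generate virtual samples", here is a small tabular sketch in the spirit of Dyna-Q (described in Sutton and Barto). The class `LearnedModel`, the function `planning_step`, and their parameters are hypothetical names chosen for this example, and terminal states are ignored for brevity.

```python
import random
from collections import defaultdict

class LearnedModel:
    """Tabular model estimated from observed transitions (counts and mean rewards)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s_next: n}
        self.reward_sum = defaultdict(float)                 # (s, a, s_next) -> total reward

    def update(self, s, a, r, s_next):
        """Record one real transition."""
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a, s_next)] += r

    def sample(self, rng):
        """Draw a virtual transition (s, a, r, s_next) from the learned model."""
        s, a = rng.choice(list(self.counts.keys()))
        nexts = self.counts[(s, a)]
        s_next = rng.choices(list(nexts.keys()), weights=list(nexts.values()))[0]
        r = self.reward_sum[(s, a, s_next)] / nexts[s_next]  # mean observed reward
        return s, a, r, s_next

def planning_step(Q, model, actions, rng, n=10, alpha=0.1, gamma=0.9):
    """Apply Q-learning updates to n virtual transitions drawn from the model."""
    for _ in range(n):
        s, a, r, s_next = model.sample(rng)
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Usage: one real transition is stored, then reused for several virtual updates.
rng = random.Random(0)
model = LearnedModel()
model.update(0, "right", 1.0, 1)
Q = defaultdict(float)
planning_step(Q, model, ["left", "right"], rng, n=5)
```

Each real interaction is reused many times through the model, which is the point when real samples are expensive.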
Model-based RL is quite common in robotics, where you cannot run many real-world trials or the robot will break. This is a good survey with many examples (but it only covers policy search algorithms). For another example, have a look at this paper. There the authors learn, along with a policy, a Gaussian process that approximates the forward model of the robot, in order to simulate trajectories and reduce the number of real robot interactions.
edited Nov 22 '18 at 15:47 by nbro
answered Dec 14 '15 at 9:20 by Simon