Consider the GridWorld example from the notes.

Using the inverse Manhattan distance as a potential reward function, calculate Q(s, West) for state s = (1,2) and state s' = (0,2), receiving no immediate reward.

Assume α=0.5 and γ=0.9 and Q(s,a)=0 for all states and actions.

BlackTom AI · Accepted Answer

First, the setup: we want Q(s, West) for s = (1,2), where taking West leads to s' = (0,2). The immediate reward is r = 0, the learning parameters are α = 0.5 and γ = 0.9, and Q(s,a) = 0 initially for every state s and every action a.

Key step: apply the Q-learning update rule. The standard update is:
Q(s,a) ← Q(s,a) + α [ r + γ * max_a' Q(s', a') − Q(s,a) ]
Here, for the action West at s, we have:
- r = 0 (no immediate reward)
- max_a' Q(s', a') is the maximum over actions in state s' of Q(s', a')
- Q(s,a) is the current value to be updated, which is 0 by initialization.

Because Q(s', a') is 0 for all actions a' in state s' = (0,2) at initialization, we have max_a' Q(s', a') = 0.
Plugging in the numbers:
Q(s, West) ← 0 + 0.5 [ 0 + 0.9 * 0 − 0 ] = 0.5 * 0 = 0.
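
The substitution above can be checked with a few lines of Python. This is only a minimal sketch of the standard tabular update for a single (s, a) pair; the function name `q_learning_update` and its parameter names are made up for illustration and do not come from the notes.

```python
def q_learning_update(q_sa, reward, max_q_next, alpha, gamma):
    """One tabular Q-learning update for a single (s, a) pair.

    Hypothetical helper: q_sa is the current Q(s, a) and max_q_next is
    max over a' of Q(s', a').
    """
    td_target = reward + gamma * max_q_next
    return q_sa + alpha * (td_target - q_sa)


# All Q-values start at 0, so max over a' of Q(s', a') is also 0.
q_west = q_learning_update(q_sa=0.0, reward=0.0, max_q_next=0.0,
                           alpha=0.5, gamma=0.9)
print(q_west)  # 0.0
```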

Conclusion from the standard Q-learning update with the given initial conditions: Q(s, West) remains 0 after this update.

Now, regarding the inverse Manhattan distance as a potential function: potential-based reward shaping adds a term F = γ Φ(s') − Φ(s) to the reward, where Φ is the potential (here, the inverse distance), so the update uses r + F in place of r. Depending on the exact definition of Φ, this can produce a nonzero update. However, with the information as stated (all initial Q-values zero and no explicit shaping term in the plain update), the calculation above gives Q(s, West) = 0.
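
To illustrate how the shaping term could change the result, here is a rough sketch of the shaped update. The question does not state the goal location or the exact form of Φ, so everything specific here is an assumption made for illustration: the goal is placed at (3,2), the grid is taken to be 4x3, and the potential is taken to be Φ(s) = 1 − d(s, goal) / 5, where d is the Manhattan distance. The helper names are hypothetical as well.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def potential(state, goal, max_distance):
    # ASSUMPTION: "inverse" Manhattan distance normalized to [0, 1];
    # the question does not give the exact definition of the potential.
    return 1.0 - manhattan(state, goal) / max_distance


def shaped_q_update(q_sa, reward, max_q_next, phi_s, phi_next, alpha, gamma):
    """Q-learning update with potential-based shaping, using r + F for r."""
    shaping = gamma * phi_next - phi_s  # F = gamma * Phi(s') - Phi(s)
    return q_sa + alpha * (reward + shaping + gamma * max_q_next - q_sa)


GOAL = (3, 2)      # ASSUMPTION: goal cell, not stated in the question
MAX_DISTANCE = 5   # ASSUMPTION: width + height - 2 for a 4x3 grid

phi_s = potential((1, 2), GOAL, MAX_DISTANCE)     # 1 - 2/5
phi_next = potential((0, 2), GOAL, MAX_DISTANCE)  # 1 - 3/5

q_west = shaped_q_update(q_sa=0.0, reward=0.0, max_q_next=0.0,
                         phi_s=phi_s, phi_next=phi_next,
                         alpha=0.5, gamma=0.9)
print(round(q_west, 2))  # -0.12
```

Under these assumed values, moving West lowers the potential, so F is negative and the update is about -0.12. A different goal location or a different definition of Φ would give a different number.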

In short, under the standard Q-learning update with α = 0.5, γ = 0.9, r = 0, and all Q-values initialized to 0, the value of Q(s, West) for s = (1,2) after the transition to s' = (0,2) is 0, not -0.1. With these inputs, a nonzero value can only arise if a shaping term is added to the reward.
