Question

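Table: Gridworld MDP

|   | A | B  | C |
|---|---|----|---|
| 2 |   | +5 |   |
| 1 | S | -5 |   |

Figure: Transition Function

The intended direction is taken with probability 0.8; each of the two perpendicular directions is taken with probability 0.1.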


Review Table: Gridworld MDP and Figure: Transition Function. The gridworld MDP operates like the one discussed in lecture. The states are grid squares, identified by their column (A, B, or C) and row (1 or 2) values, as presented in the table.

The agent always starts in state (A,1), marked with the letter S. There are two terminal goal states: (B,1) with reward -5, and (B,2) with reward +5. Rewards are 0 in non-terminal states. (The reward for a state is received before the agent applies the next action.)

The transition function in Figure: Transition Function is such that the intended agent movement (Up, Down, Left, or Right) happens with probability 0.8. The probability that the agent ends up in one of the states perpendicular to the intended direction is 0.1 each. If a collision with a wall happens, the agent stays in the same state, and the drift probability is added to the probability of remaining in the same state.

Assume that $V_1(A,1) = 0$, $V_1(C,1) = 0$, $V_1(C,2) = 4$, $V_1(A,2) = 4$, $V_1(B,1) = -5$, and $V_1(B,2) = +5$. Given this information, what is the second round of value iteration ($V_2$) update for state (A,1) with a discount of 0.5?

BlackTom AI · Accepted Answer

We need to perform a Bellman update for value iteration with discount gamma = 0.5. The update for state s is V2(s) = max_a [ R(s) + gamma * sum_{s'} P(s'|s,a) * V1(s') ]. Since non-terminal states have reward R(s) = 0, we only rely on the discounted expected value of successor states.
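For reference, the same update can be written in standard notation and specialized to the state asked about. This only restates the formula above with R(A,1) = 0 and gamma = 0.5 taken from the problem statement; the positive discount can be pulled outside the max.

```latex
V_2(s) = \max_{a} \Big[ R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, V_1(s') \Big]
\quad\Longrightarrow\quad
V_2(A,1) = 0.5 \, \max_{a} \sum_{s'} P\big(s' \mid (A,1), a\big)\, V_1(s')
```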

From (A,1), consider the four possible actions (Up, Down, Left, Right). The transition function gives 0.8 probability for the intended direction, and 0.1 probability for each of the perpendicular directions. If a move would hit a wall, the agent stays in the same state and the drift probability adds to the stay probability. We also use the provided V1 values: V1(A,1)=0, V1(C,1)=0, V1(C,2)=4, V1(A,2)=4, V1(B,1)=-5, V1(B,2)=+5.
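The transition rule described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the names `step` and `transition` are hypothetical, and the sketch assumes that row 2 lies above row 1 (as drawn in the table), so that Up increases the row number.

```python
# Sketch of the noisy transition model for the 3x2 gridworld.
# Assumption: row 2 is drawn above row 1, so "Up" increases the row number.
COLS = "ABC"
ROWS = (1, 2)

# Unit moves as (column offset, row offset).
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}

# The two directions perpendicular to each intended direction.
PERPENDICULAR = {
    "Up": ("Left", "Right"),
    "Down": ("Left", "Right"),
    "Left": ("Up", "Down"),
    "Right": ("Up", "Down"),
}


def step(state, direction):
    """Deterministic move; bumping into a wall leaves the state unchanged."""
    col, row = state
    dc, dr = MOVES[direction]
    new_col, new_row = COLS.index(col) + dc, row + dr
    if 0 <= new_col < len(COLS) and new_row in ROWS:
        return (COLS[new_col], new_row)
    return state


def transition(state, action):
    """Return {successor: probability} for a non-terminal state."""
    dist = {}
    outcomes = [(action, 0.8)] + [(d, 0.1) for d in PERPENDICULAR[action]]
    for direction, prob in outcomes:
        nxt = step(state, direction)
        # Probability mass from wall collisions accumulates on the current state.
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist


print(transition(("A", 1), "Left"))
```

For Left from (A,1), both the intended move and the Down drift hit a wall, so roughly 0.9 of the probability mass stays on (A,1) and 0.1 goes to (A,2).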

- Action Right (from (A,1) toward (B,1)):
  - Intended move to (B,1) with V1 = -5, weight 0.8.
  - The perpendicular directions are Up and Down: Up leads to (A,2) with V1 = 4; Down hits the wall below row 1 and keeps the agent at (A,1) with V1 = 0; each has probability 0.1.
  - Expected value before discount: 0.8 * (-5) + 0.1 * 4 + 0.1 * 0 = -4 + 0.4 + 0 = -3.6.
  - Apply discount: gamma * expected = 0.5 * (-3.6) = -1.8.

- Action Up (from (A,1) toward (A,2); row 2 sits above row 1 in the table):
  - Intended move to (A,2) with V1 = 4, weight 0.8.
  - The perpendicular directions are Left and Right: Left hits the wall, so the agent stays at (A,1) with V1 = 0; Right moves to (B,1) with V1 = -5; each has probability 0.1.
  - Expected value before discount: 0.8 * 4 + 0.1 * 0 + 0.1 * (-5) = 3.2 + 0 - 0.5 = 2.7.
  - Apply discount: gamma * expected = 0.5 * 2.7 = 1.35.

- Action Left (from (A,1) toward the wall):
  - Intended move hits the wall, so the agent stays at (A,1) with V1 = 0, weight 0.8.
  - The perpendicular directions are Up and Down: Up leads to (A,2) with V1 = 4; Down hits the wall and keeps the agent at (A,1); each has probability 0.1, so the total probability of staying is 0.9.
  - Expected value before discount: 0.9 * 0 + 0.1 * 4 = 0.4.
  - Apply discount: gamma * expected = 0.5 * 0.4 = 0.2.

- Action Down (from (A,1) toward the wall):
  - Intended move hits the wall, so the agent stays at (A,1) with V1 = 0, weight 0.8.
  - The perpendicular directions are Left and Right: Left hits the wall and keeps the agent at (A,1); Right moves to (B,1) with V1 = -5; each has probability 0.1, so the total probability of staying is 0.9.
  - Expected value before discount: 0.9 * 0 + 0.1 * (-5) = -0.5.
  - Apply discount: gamma * expected = 0.5 * (-0.5) = -0.25.

Summarizing the comparisons: Up gives a discounted expected value of 1.35, Left gives 0.2, Down gives -0.25, and Right gives -1.8. Therefore, the maximal value is V2(A,1) = 1.35.

Now, evaluating the provided options:
- 1.35: Matches the computed maximum, obtained with the Up action; this option is consistent with the Bellman update using the given V1 values and transition dynamics.
- 0: No action yields exactly zero discounted return, and Up yields a positive 1.35, so the maximum cannot be 0.
- 2.7: This is the raw expected value of Up before applying the discount. The update multiplies it by gamma and adds the reward (which is zero), so the result is 1.35, not 2.7.
- -0.4: This is a negative return, which cannot be the maximum when Up yields 1.35 under the given V1 values.

In short, the Up action provides the highest discounted return of 1.35, while the other actions yield lower or negative values under the provided value estimates.
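The arithmetic can be double-checked with a short script. This is a sketch under stated assumptions rather than the course's reference solution: the successor distributions from (A,1) are written out by hand, direction names follow the table layout (row 2 drawn above row 1, so Up leads to (A,2) and Down hits a wall), and the names `P_FROM_A1` and `backup` are hypothetical.

```python
GAMMA = 0.5        # discount factor from the problem statement
REWARD_A1 = 0.0    # (A,1) is non-terminal, so its reward is 0

# V1 values given in the question.
V1 = {
    ("A", 1): 0.0, ("A", 2): 4.0,
    ("B", 1): -5.0, ("B", 2): 5.0,
    ("C", 1): 0.0, ("C", 2): 4.0,
}

# Successor distributions from (A,1), written out by hand.
# Assumption: row 2 lies above row 1, so Up leads to (A,2) and Down hits a wall.
P_FROM_A1 = {
    "Up":    {("A", 2): 0.8, ("A", 1): 0.1, ("B", 1): 0.1},
    "Down":  {("A", 1): 0.9, ("B", 1): 0.1},
    "Left":  {("A", 1): 0.9, ("A", 2): 0.1},
    "Right": {("B", 1): 0.8, ("A", 2): 0.1, ("A", 1): 0.1},
}


def backup(dist):
    """One-step Bellman backup for a single action from (A,1)."""
    return REWARD_A1 + GAMMA * sum(p * V1[s] for s, p in dist.items())


q_values = {action: backup(dist) for action, dist in P_FROM_A1.items()}
best_action = max(q_values, key=q_values.get)

for action, q in q_values.items():
    print(f"{action:<5} {q:+.2f}")
print(f"V2(A,1) = {q_values[best_action]:.2f} via {best_action}")
```

Running it should report a value of 1.35 for (A,1), with the move toward (A,2) as the maximizing action.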
