Welcome to discuss this book here! #1
-
In slides 9, page 22: I guess this is `exercise`.
-
Great book and course! It helped me reorganize my understanding of several points in RL. While reading this book (ver. 2022.8), I ran into a few small points of confusion, probably clerical errors. Thanks!
-
Chapter 6, page 20: this does not seem to converge.
```python
import random

g = lambda w: w**3 - 5
w = 0
for i in range(100):
    print(i, w)
    w = w - 1/(i+10) * (g(w) + random.gauss(0, 1))
```
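For what it's worth (my own reading, not from the book): g(w) = w**3 - 5 has an unbounded slope, so the Robbins-Monro condition 0 < c1 <= ∇_w g(w) <= c2 does not hold and convergence is not guaranteed. With a function whose slope is bounded, the same step-size schedule does settle near the root; a minimal sketch:
```python
import random

# Minimal sketch (my own check, not from the book): same step sizes and noise,
# but g(w) = w - 5 has a bounded slope, so the usual Robbins-Monro conditions
# hold and the iterate settles near the root w = 5.
g = lambda w: w - 5

w = 0.0
for i in range(10000):
    w = w - 1 / (i + 10) * (g(w) + random.gauss(0, 1))

print(w)  # should be close to 5
```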
-
The proof of Dvoretzky's theorem only covers ...
-
Section 6.2, page 107.
-
On page 85, in the second paragraph of the subsection "A comprehensive example: Episode length and sparse reward", "See, for example, Figure 5.3(h)" should be "Figure 5.3(a)", because the episode length mentioned there is 1.
-
Page 163. Thank you for your book.
-
Thank you so much for writing such a helpful book!
-
Page 172, Chapter 8, Algorithm 8.1.
-
I think a \gamma symbol is missing here. The same typo appears in the slides.
-
Page 207, Chapter 9. I think π(a|s,θ) should take values in the open interval (0,1) due to the softmax function, not [0,1].
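A quick numerical check of this (my own sketch, not from the book): every exp(·) term in the softmax is strictly positive, and the denominator contains at least one other positive term whenever there is more than one action, so each probability stays strictly between 0 and 1.
```python
import numpy as np

# Minimal check: softmax over arbitrary action preferences for one state.
def softmax(h):
    z = np.exp(h - np.max(h))   # shift by the max for numerical stability
    return z / z.sum()

pi = softmax(np.array([10.0, -10.0, 0.0]))
print(pi)                                   # all entries strictly in (0, 1)
print((pi > 0).all() and (pi < 1).all())    # True
```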
-
Prof. Zhao, when will this book be published?
-
In Section 3.6, page 64, book ver. March 2024, it says:
-
Prof. Zhao, could you add more algorithms such as TRPO, PPO, and SAC to this book? You have already covered part of the basic mathematical background in some chapters. Many thanks.
-
Great book and video! I'm truly grateful that we have such an excellent professor and such learning materials in China. I'd like to comment on the setup of the grid world used in this book. Focusing on the bottom-left corner: if we take this state value and do one step of policy improvement, we find that the optimal action is to move right, stepping directly into the forbidden cell to reach the target as soon as possible. Related video: https://www.bilibili.com/video/BV1Le411K7qY?t=652.0
The reason is that the task is a continuing one: after reaching the target, the agent stays there and keeps collecting the positive reward, so the punishment for stepping into the forbidden cell is heavily compensated if the agent can get to the target quickly. Everything is still correct if the reader understands the setup, but I feel this might mislead some readers' intuition, because readers may subconsciously expect the optimal policy to be the shortest path that avoids forbidden cells. I guess the Professor may have done this on purpose, as a compromise to illustrate the core idea more simply, but it would still be great to hear that confirmed by the Professor.
PS: In the episodic setup, still focusing on the bottom-left corner, the agent no longer chooses to move right; see the code below.
```python
import numpy as np

# Constants
gamma = 0.9             # Discount factor
reward_target = 1       # Reward for reaching the target
reward_boundary = -1    # Reward for hitting the boundary
reward_forbidden = -1   # Reward for forbidden cells
n_rows = 5              # Number of rows in the grid
n_cols = 5              # Number of columns in the grid
actions = ['^', '>', 'v', '<', 'o']  # List of possible actions ('o' = stay)

# Grid world
grid_world = np.array([
    ['S', 'S', 'S', 'S', 'S'],
    ['S', 'F', 'F', 'S', 'S'],
    ['S', 'S', 'F', 'S', 'S'],
    ['S', 'F', 'T', 'F', 'S'],
    ['S', 'F', 'S', 'S', 'S'],
])

policy = np.array([
    ['>', '>', '>', 'v', 'v'],
    ['^', '^', '>', 'v', 'v'],
    ['^', '<', 'v', '>', 'v'],
    ['^', '>', 'o', '<', 'v'],
    ['^', '>', '^', '<', '<'],
])

# Returns next_state, reward, done
def P(state, action, is_episodic=False):
    row, col = state
    target_row, target_col = row, col
    if action == '^':
        if row == 0:
            return (row, col), reward_boundary, False
        else:
            target_row = row - 1
    if action == '>':
        if col == n_cols - 1:
            return (row, col), reward_boundary, False
        else:
            target_col = col + 1
    if action == 'v':
        if row == n_rows - 1:
            return (row, col), reward_boundary, False
        else:
            target_row = row + 1
    if action == '<':
        if col == 0:
            return (row, col), reward_boundary, False
        else:
            target_col = col - 1
    # 'o' (stay) falls through with target_row, target_col unchanged
    if grid_world[target_row, target_col] == 'F':
        return (target_row, target_col), reward_forbidden, False
    if grid_world[target_row, target_col] == 'T':
        if is_episodic:
            return (target_row, target_col), reward_target, True
        else:
            return (target_row, target_col), reward_target, False
    if grid_world[target_row, target_col] == 'S':
        return (target_row, target_col), 0, False

# Calculate the state value of a policy
def calc_state_value_of_policy(policy, is_episodic=False):
    state_value = np.zeros((n_rows, n_cols))
    while True:
        new_state_value = np.zeros((n_rows, n_cols))
        for row in range(n_rows):
            for col in range(n_cols):
                action = policy[row, col]
                (next_row, next_col), reward, done = P((row, col), action, is_episodic)
                new_state_value[row, col] = reward + gamma * state_value[next_row, next_col] * (1 - done)
        if np.sum(np.abs(new_state_value - state_value)) < 1e-4:
            break
        state_value = new_state_value
    return state_value

if __name__ == '__main__':
    print("State value of the continuing setup:")
    state_value = calc_state_value_of_policy(policy, is_episodic=False)
    print(state_value)
    print("State value of the episodic setup:")
    state_value = calc_state_value_of_policy(policy, is_episodic=True)
    print(state_value)
```
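As a sanity check of the claim above (my own sketch, reusing the `P`, `calc_state_value_of_policy`, and grid definitions from the code), one step of greedy policy improvement at the bottom-left corner under the continuing setup indeed picks '>', stepping into the forbidden cell:
```python
# Sketch: one-step greedy policy improvement at the bottom-left corner (4, 0),
# using the state values of the continuing setup computed above.
state_value = calc_state_value_of_policy(policy, is_episodic=False)
row, col = 4, 0
for action in actions:
    (next_row, next_col), reward, _ = P((row, col), action, is_episodic=False)
    q = reward + gamma * state_value[next_row, next_col]
    print(action, round(q, 3))
# '>' (into the forbidden cell, toward the target) has the largest q-value here,
# which is the behaviour described in the comment above.
```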
-
Hi there,
If you have any feedback about the book, you can leave a comment here. Thanks!