Frozen-Lake-v1 Environment Description
The Frozen Lake environment requires the player to cross from the start tile to the goal tile without falling into any holes, by walking only over the frozen surface of the lake. Because the ice is slippery, the player may not always move in the intended direction. The game starts with the player at location [0, 0] of the frozen lake grid world, with the goal located at the far end of the world, e.g. at [3, 3] for the 4x4 environment. Holes in the ice are distributed at fixed locations when a pre-defined map is used, or at random locations when a random map is generated. The player keeps moving until they reach the goal or fall into a hole. Because the lake is slippery (unless that behaviour is disabled), the player may occasionally move perpendicular to the intended direction (see the is_slippery argument). Randomly generated worlds always contain a path to the goal.
Action Space
The action has shape (1,) and takes values in the range {0, 3}, indicating which direction to move the player:

- 0: Move left
- 1: Move down
- 2: Move right
- 3: Move up
Observation Space
The observation is an integer encoding the player's current position, computed as current_row * ncols + current_col (both the row and the column start at 0). For example, the goal position on the 4x4 map can be calculated as 3 * 4 + 3 = 15. The number of possible observations depends on the size of the map. The observation is returned as an int().
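As a hedged illustration of this encoding, here is a minimal sketch; the helpers coords_to_state and state_to_coords are hypothetical, not part of the Gymnasium API:

```python
# Hypothetical helpers illustrating observation = current_row * ncols + current_col
def coords_to_state(row, col, ncols=4):
    return row * ncols + col

def state_to_coords(state, ncols=4):
    return divmod(state, ncols)  # (row, col)

assert coords_to_state(3, 3, ncols=4) == 15   # goal tile on the 4x4 map
assert state_to_coords(15, ncols=4) == (3, 3)
```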
Starting State
The episode starts with the player in state [0] (location [0, 0]).
Rewards
The reward schedule is:

- Reach goal: +1
- Reach hole: 0
- Reach frozen tile (neither goal nor hole): 0
Episode End
The episode ends when any of the following happens:

Termination:
- The player falls into a hole.
- The player reaches the goal at max(nrow) * max(ncol) - 1 (location [max(nrow)-1, max(ncol)-1]).

Truncation (when using the time_limit wrapper):
- The length of the episode is 100 for the 4x4 environment.
- The length of the episode is 200 for the FrozenLake8x8-v1 environment.

A minimal interaction loop that surfaces these two flags is sketched below.
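The sketch below is hedged: it uses a random agent and an arbitrary seed, and is only meant to show where terminated and truncated come from:

```python
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
state, info = env.reset(seed=0)           # seed chosen arbitrarily

terminated, truncated = False, False
while not (terminated or truncated):
    action = env.action_space.sample()    # random action
    state, reward, terminated, truncated, info = env.step(action)

print("terminated:", terminated, "| truncated:", truncated, "| last reward:", reward)
env.close()
```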
Information
step() and reset() return a dict with the following keys:

- p - transition probability for the state. See the is_slippery argument for details on transition probabilities.
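A quick way to inspect this is simply to print the info dict. Depending on the Gymnasium version the key may be spelled "p" or "prob", so this hedged sketch avoids committing to a name:

```python
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=True)
state, info = env.reset(seed=0)
print(info)   # contains the transition probability of the reset state

state, reward, terminated, truncated, info = env.step(env.action_space.sample())
print(info)   # e.g. a probability of 1/3 when is_slippery=True
env.close()
```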
Arguments
```python
import gymnasium as gym

gym.make("FrozenLake-v1", desc=None, map_name="4x4", is_slippery=True)
```

The map tiles are denoted by the following letters (a custom map built from them is sketched below):

- "S" for Start tile
- "G" for Goal tile
- "F" for frozen tile
- "H" for a tile with a hole
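As a hedged illustration, a custom layout can be passed through the desc argument as a list of strings; the 4x4 layout below reproduces the default map:

```python
import gymnasium as gym

# Custom 4x4 layout built from the tile letters above
custom_map = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]
env = gym.make("FrozenLake-v1", desc=custom_map, is_slippery=False)
```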
Random map generation
```python
from gymnasium.envs.toy_text.frozen_lake import generate_random_map

gym.make("FrozenLake-v1", desc=generate_random_map(size=8))
```

is_slippery=True: If true, the player will move in the intended direction with probability 1/3; otherwise it will move in one of the two perpendicular directions, each with probability 1/3.
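These probabilities can be checked directly. The toy-text environments keep a transition table on the unwrapped environment (the P attribute in current Gymnasium versions; treat this as an implementation detail rather than a stable API), as in this hedged sketch:

```python
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
# P[state][action] -> list of (probability, next_state, reward, terminated)
for prob, next_state, reward, terminated in env.unwrapped.P[0][2]:   # state 0, action 2 (right)
    print(f"prob={prob:.3f} next_state={next_state} reward={reward} terminated={terminated}")
env.close()
```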
Frozen Lake ⛄ (non-slippery version)
We train a Q-Learning agent to navigate from the start state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H).
The environment comes in two sizes:

- map_name="4x4": a 4x4 grid version
- map_name="8x8": an 8x8 grid version
The environment has two modes:

- is_slippery=False: The agent always moves in the intended direction because the lake is not slippery (deterministic environment).
- is_slippery=True: The agent may not always move in the intended direction because the lake is slippery (stochastic environment).
We start with the non-slippery mode (the code below uses the 8x8 map). We add the argument render_mode="rgb_array" to specify how the environment should be visualized: "rgb_array" returns a single frame representing the current state of the environment. The frame is an np.ndarray of shape (x, y, 3), i.e. the RGB values of an x-by-y pixel image.
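As a hedged sketch of what render_mode="rgb_array" gives back (the file name frame.png and the use of imageio.imwrite are illustrative choices, not part of the original code):

```python
import imageio
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="rgb_array")
env.reset(seed=0)

frame = env.render()                   # np.ndarray of shape (x, y, 3), dtype uint8
print(frame.shape, frame.dtype)
imageio.imwrite("frame.png", frame)    # save the single frame as an image
env.close()
```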
```python
import os
import random                    # To generate random numbers
import imageio                   # To generate a replay video
import numpy as np
import gymnasium as gym
import pickle5 as pickle         # Save/Load model
from tqdm.notebook import tqdm

# Create the environment
env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=False, render_mode="rgb_array")
```
```python
print("_____OBSERVATION SPACE_____ \n")
print("Observation Space", env.observation_space)
print("Sample observation", env.observation_space.sample())   # Get a random observation
```
```
_____OBSERVATION SPACE_____ 

Observation Space Discrete(64)
Sample observation 35
```
```python
print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample())   # Take a random action
```
```
 _____ACTION SPACE_____ 

Action Space Shape 4
Action Space Sample 1
```

Create and Initialize the Q-Table
```python
state_space = env.observation_space.n
action_space = env.action_space.n

def initialize_q_table(state_space, action_space):
    Qtable = np.zeros((state_space, action_space))
    return Qtable

Qtable_frozenlake = initialize_q_table(state_space, action_space)
print("Q-Table :\n", Qtable_frozenlake)
```
```
Q-Table :
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 ...
 [0. 0. 0. 0.]]
```
(The full printout is a 64 x 4 array of zeros; it is truncated here.)

Define the Greedy Policy (updating policy)
When the Q-learning agent has finished training, the policy we ultimately deploy is also the greedy policy. The greedy policy uses the Q-table to select actions.
```python
def greedy_policy(Qtable, state):
    # Exploitation: take the action with the highest state, action value
    action = np.argmax(Qtable[state][:])
    return action
```

Define the ε-Greedy Policy (acting policy)
The epsilon-greedy policy is a training policy that handles the exploration/exploitation trade-off during training. The idea is:

- With probability 1 - ε: we exploit, i.e. the agent selects the action with the highest state-action value.
- With probability ε: we explore, i.e. the agent tries a random action.

As training continues, we progressively reduce the value of ε, since we need less and less exploration and more and more exploitation.
```python
def epsilon_greedy_policy(Qtable, state, epsilon):
    # Randomly generate a number between 0 and 1
    random_num = random.uniform(0, 1)
    # If random_num is greater than epsilon --> exploitation
    if random_num > epsilon:
        # Take the action with the highest value given the state
        action = greedy_policy(Qtable, state)
    # else --> exploration
    else:
        action = env.action_space.sample()
    return action
```

Define the Hyperparameters
The exploration-related hyperparameters are among the most important: we need to make sure the agent explores enough of the state space to learn a good value approximation. To do that, we gradually decay epsilon (the exploration rate). If you decay epsilon too quickly (i.e. set the decay rate too high), your agent can get stuck, because it will not have explored enough of the state space to solve the problem.
```python
# Training parameters
n_training_episodes = 10000   # Total training episodes
learning_rate = 0.7           # Learning rate

# Evaluation parameters
n_eval_episodes = 100         # Total number of test episodes

# Environment parameters
env_id = "FrozenLake-v1"      # Name of the environment
max_steps = 200               # Max steps per episode
gamma = 0.95                  # Discounting rate
eval_seed = []                # The evaluation seed of the environment

# Exploration parameters
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.05            # Minimum exploration probability
decay_rate = 0.0001           # Exponential decay rate for exploration prob
```

Train the Agent
Cosine annealing is a schedule commonly used in deep learning and reinforcement learning for the learning rate or the exploration rate. It decays slowly at the start, faster in the middle of training, and slowly again near the end, which makes it well suited for decaying the exploration rate. The schedule is based on the cosine function:

$$\epsilon_t = \epsilon_{min} + \frac{1}{2}(\epsilon_{max} - \epsilon_{min})\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)$$

where $t$ is the current episode and $T$ is the total number of training episodes.
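As a hedged side-by-side sketch of the two schedules discussed here (the exponential decay driven by decay_rate, which the train() function below keeps commented out, and the cosine annealing it actually uses), reusing the hyperparameters defined above:

```python
T = n_training_episodes   # defined in the hyperparameter block above

for episode in [0, 2500, 5000, 7500, 9999]:
    eps_exp = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
    eps_cos = min_epsilon + (max_epsilon - min_epsilon) * 0.5 * (1 + np.cos(episode * np.pi / T))
    print(f"episode {episode:5d}: exponential={eps_exp:.3f}  cosine={eps_cos:.3f}")
```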
```python
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
    for episode in tqdm(range(n_training_episodes)):
        # Reduce epsilon (because we need less and less exploration)
        # epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * 0.5 * (1 + np.cos(episode * np.pi / n_training_episodes))

        # Reset the environment
        state, info = env.reset()
        step = 0
        terminated = False
        truncated = False

        # repeat
        for step in range(max_steps):
            # Choose the action At using the epsilon-greedy policy
            action = epsilon_greedy_policy(Qtable, state, epsilon)

            # Take action At and observe Rt+1 and St+1
            # Take the action (a) and observe the outcome state (s') and reward (r)
            new_state, reward, terminated, truncated, info = env.step(action)

            # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
            Qtable[state][action] = Qtable[state][action] + learning_rate * (
                reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]
            )

            # If terminated or truncated, finish the episode
            if terminated or truncated:
                break

            # Our next state is the new state
            state = new_state
    return Qtable

Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_frozenlake)
print("Q table:\n", Qtable_frozenlake)
```
```
Q table:
 [[0.48767498 0.51334208 0.51334208 0.48767498]
 [0.48767498 0.54036009 0.54036009 0.51334208]
 [0.51334208 0.56880009 0.56880009 0.54036009]
 ...
 [0.         0.52391649 0.         0.69833488]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]]
```
(64 x 4 array; the printout is truncated here.)

Evaluate the Agent
```python
def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
    """
    Evaluate the agent for n_eval_episodes episodes and return the average reward and the std of the reward.

    :param env: The evaluation environment
    :param max_steps: Maximum number of steps per episode
    :param n_eval_episodes: Number of episodes to evaluate the agent
    :param Q: The Q-table
    :param seed: The evaluation seed array (for taxi-v3)
    """
    episode_rewards = []
    for episode in tqdm(range(n_eval_episodes)):
        if seed:
            state, info = env.reset(seed=seed[episode])
        else:
            state, info = env.reset()
        step = 0
        truncated = False
        terminated = False
        total_rewards_ep = 0

        for step in range(max_steps):
            # Take the action (index) that has the maximum expected future reward given that state
            action = greedy_policy(Q, state)
            new_state, reward, terminated, truncated, info = env.step(action)
            total_rewards_ep += reward

            if terminated or truncated:
                break
            state = new_state
        episode_rewards.append(total_rewards_ep)

    mean_reward = np.mean(episode_rewards)
    std_reward = np.std(episode_rewards)
    return mean_reward, std_reward

# Evaluate our Agent
mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
```
```
Mean_reward=1.00 +/- 0.00
```

Visualize the Q-Learning Agent
```python
env = gym.wrappers.RecordVideo(env, video_folder="./FrozenLake-v1-QL", disable_logger=True, fps=30)
state, info = env.reset()

for step in range(max_steps):
    action = greedy_policy(Qtable_frozenlake, state)
    state, reward, terminated, truncated, info = env.step(action)
    if terminated:
        break

env.close()
```

Frozen Lake ⛄ (slippery version)
```python
# Create the environment
slippery_env = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=True, render_mode="rgb_array")

# Initialize the Q-Table
SQtable_frozenlake = initialize_q_table(slippery_env.observation_space.n, slippery_env.action_space.n)
print("Q-Table :\n", SQtable_frozenlake)
```
```
Q-Table :
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 ...
 [0. 0. 0. 0.]]
```
(Again a 64 x 4 array of zeros, truncated here.)
```python
# Training parameters
n_training_episodes = 30000   # Total training episodes

SQtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, slippery_env, max_steps, SQtable_frozenlake)
print("Slippery Q table:\n", SQtable_frozenlake)
```
```
Slippery Q table:
 [[2.95092538e-02 2.98058559e-02 3.55425091e-02 2.92342470e-02]
 [3.42323951e-02 4.00210946e-02 2.16868036e-02 2.13141622e-02]
 [3.14652001e-02 4.41281793e-02 3.17972584e-02 6.43655693e-02]
 ...
 [5.20579883e-01 9.53990423e-01 8.54776446e-02 1.89000000e-02]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]
```
(64 x 4 array; the printout is truncated here.)

```python
# Evaluate our Agent
mean_reward, std_reward = evaluate_agent(slippery_env, max_steps, n_eval_episodes, SQtable_frozenlake, eval_seed)
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
```
```
Mean_reward=0.37 +/- 0.48
```

Visualize the Slippery Agent:
```python
slippery_env = gym.wrappers.RecordVideo(slippery_env, video_folder="./Slippery-FrozenLake-v1-QL", disable_logger=True, fps=30)
state, info = slippery_env.reset()

for step in range(max_steps):
    action = greedy_policy(SQtable_frozenlake, state)
    state, reward, terminated, truncated, info = slippery_env.step(action)
    if terminated:
        break

slippery_env.close()
```

Agent vs Slippery Agent
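To compare the two agents, here is a hedged sketch (not part of the original code; the dictionary names are illustrative) that cross-evaluates both Q-tables with the evaluate_agent helper defined above, in fresh copies of both environment modes:

```python
# Hedged comparison sketch: evaluate each Q-table in both environment modes
eval_envs = {
    "non-slippery": gym.make("FrozenLake-v1", map_name="8x8", is_slippery=False),
    "slippery": gym.make("FrozenLake-v1", map_name="8x8", is_slippery=True),
}
agents = {
    "Agent (trained non-slippery)": Qtable_frozenlake,
    "Slippery Agent (trained slippery)": SQtable_frozenlake,
}

for agent_name, qtable in agents.items():
    for env_name, eval_env in eval_envs.items():
        mean_r, std_r = evaluate_agent(eval_env, max_steps, n_eval_episodes, qtable, eval_seed)
        print(f"{agent_name} on the {env_name} env: {mean_r:.2f} +/- {std_r:.2f}")

for eval_env in eval_envs.values():
    eval_env.close()
```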