RL

David Silver Course

08_Model-based RL

  • "model" is estimating the env: modeling $\eta$: $P(s_{t+1}|s_t,a_t)$, $R(s_t, a_t)$
  • Model-based RL: Learn model from experience, and plan value function and/or policy from simulated experience
  • Dyna: learn and plan ...
    • Dyna-Q algorithm: in a simulated "parallel world" (re-visiting a previously seen state s), use the learned model in place of the real env to apply extra updates to Q(s,a) (see the sketch after this list)
  • MC tree search
    • from current root
    • only search sub-paths of the tree: search and evaluate dynamically, only go deeper to current best (greedy search)
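
For concreteness, a minimal tabular Dyna-Q sketch (my own illustration, assuming a classic Gym-style discrete env with the 4-tuple `step` API; `n_planning`, `alpha`, `gamma`, `eps` are made-up hyperparameters): each real step updates Q once, then the learned model replays extra simulated updates.

```python
import random
from collections import defaultdict

# Minimal tabular Dyna-Q sketch (illustrative; assumes a discrete env with
# the classic Gym API: env.reset() -> s, env.step(a) -> (s', r, done, info)).
def dyna_q(env, n_actions, episodes=100, n_planning=10,
           alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)      # Q[(s, a)] -> value
    model = {}                  # model[(s, a)] -> (r, s', done), learned from real experience

    def act(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = act(s)
            s2, r, done, _ = env.step(a)

            # (1) direct RL update from the real transition
            target = r + (0 if done else gamma * max(Q[(s2, b)] for b in range(n_actions)))
            Q[(s, a)] += alpha * (target - Q[(s, a)])

            # (2) learn a (deterministic) model of the env
            model[(s, a)] = (r, s2, done)

            # (3) planning: extra Q updates from simulated transitions of previously seen (s, a)
            for _ in range(n_planning):
                ps, pa = random.choice(list(model.keys()))
                pr, ps2, pdone = model[(ps, pa)]
                ptarget = pr + (0 if pdone else gamma * max(Q[(ps2, b)] for b in range(n_actions)))
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])

            s = s2
    return Q
```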

awesome

  • DQN. Playing Atari with Deep Reinforcement Learning
  • DQN + target Q net. Human-level control through deep reinforcement learning
  • Policy Gradient. Policy Gradient Methods for RL with Function Approximation. Mainly derives the Policy Gradient Theorem
  • A3C. Asynchronous Methods for Deep Reinforcement Learning
  • Duel Q network. Dueling Network Architectures for Deep Reinforcement Learning
  • GAE. High-Dimensional Continuous Control Using Generalized Advantage Estimation

not so awesome

Action Branching Architectures for Deep Reinforcement Learning: tackles multi-dimensional action spaces; the method is something anyone could come up with. Based on the dueling Q network.

interesting applications

Mahjong AI Suphx: Mastering Mahjong with Deep Reinforcement Learning

by MSRA. Watching the talk video is enough.

  • Encode tiles as a 4 * 34 matrix (see the encoding sketch after this list)
  • model: 5 models combined by a decision flow. The models decide whether to chow/pong/kong (吃/碰/杠), plus a discard model.
    • model input: D * 34 * 1, where the D channels hold private tiles, open hands, history, manually crafted features, etc.
    • model output: 34 * 1 for the discard model; a scalar for the chow/pong/kong models.
    • architecture: 3 * 1 Conv with 256 channels, repeated 50x, with skip connections.
  • training: supervised learning with human players' actions as labels, then self-play using the trained models as the policy.
  • trick 1: oracle guiding. Use the full (perfect) information as a teacher; a common trick.
  • trick 2: a global reward predictor as the critic. Accounts for reward under risk preferences?
  • trick 3: online "finetune" to the private tiles at hand
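
A sketch of the 4 * 34 tile encoding mentioned above; this is my reconstruction of the idea (the function name `encode_tiles` and the thresholding scheme are assumptions, not Suphx's actual feature code). Mahjong has 34 tile types with at most 4 copies of each, so row k marks whether more than k copies are held.

```python
import numpy as np

# Hypothetical 4 x 34 tile-encoding sketch (reconstruction of the idea, not
# the official Suphx feature code). 34 tile types, at most 4 copies of each;
# channel k is 1 if we hold more than k copies of that tile.
def encode_tiles(tile_counts):
    """tile_counts: length-34 list of ints in [0, 4]."""
    plane = np.zeros((4, 34), dtype=np.float32)
    for tile_id, count in enumerate(tile_counts):
        plane[:count, tile_id] = 1.0
    return plane

# Example: holding three copies of tile 0 and one copy of tile 5.
counts = [0] * 34
counts[0], counts[5] = 3, 1
print(encode_tiles(counts)[:, :6])
```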

Dou Dizhu AI DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning

  • Encode cards as a 4 * 15 matrix
  • state: history moves encoded as a (T, 4, 15) tensor
  • action: each legal move encoded as a (4, 15) matrix
  • model: for each legal move, i.e. action, Concat(a, LSTM(s)) | MLP(6, 512) | scalar; the output is the state-action value Q(s,a) (see the sketch after this list)
  • uses MC (Monte Carlo returns) with a DNN
    • why MC? long horizon and sparse reward
    • why not DQN? large and variable action space; max_a Q(s,a) is computationally expensive
    • why not policy gradient? effectively infinite action space, whereas feeding the action in as a feature lets the value network generalize, e.g. from 3KKK to 3JJJ
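
A minimal PyTorch sketch of the Q(s, a) model described above, following the note's shapes ((T, 4, 15) history, (4, 15) action, an LSTM over the history, then a 6-layer 512-unit MLP down to a scalar); the hidden size and the class name `DouZeroQ` are illustrative assumptions, not DouZero's exact code.

```python
import torch
import torch.nn as nn

class DouZeroQ(nn.Module):
    """Score every legal move with Q(s, a) and pick the argmax (sketch, not the official code)."""
    def __init__(self, lstm_hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4 * 15, hidden_size=lstm_hidden, batch_first=True)
        layers, in_dim = [], lstm_hidden + 4 * 15
        for _ in range(6):                           # the MLP(6, 512) from the note
            layers += [nn.Linear(in_dim, 512), nn.ReLU()]
            in_dim = 512
        layers.append(nn.Linear(in_dim, 1))          # scalar Q(s, a)
        self.mlp = nn.Sequential(*layers)

    def forward(self, history, action):
        # history: (B, T, 4, 15) past moves; action: (B, 4, 15) one candidate legal move
        B, T = history.shape[:2]
        _, (h, _) = self.lstm(history.reshape(B, T, -1))
        x = torch.cat([h[-1], action.reshape(B, -1)], dim=-1)
        return self.mlp(x).squeeze(-1)

# Usage: evaluate all legal moves for one state and take the best one.
q_net = DouZeroQ()
hist = torch.zeros(1, 12, 4, 15)                     # T = 12 historical moves
legal = torch.zeros(7, 4, 15)                        # 7 candidate legal moves
q = q_net(hist.expand(7, -1, -1, -1), legal)         # (7,) Q values
best_move = legal[q.argmax()]
```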

DiDi order-dispatch algorithm: Large-Scale Order Dispatch in On-Demand Ride-Hailing Platforms: A Learning and Planning Approach

  • define one day as one episode
  • state defined as s = (grid, time); offline, learn V(s), so that Q(s,a) = r + V(s')
  • online dispatch: solve a = argmax_a Q(s, a) with the KM (Kuhn-Munkres) algorithm, a deterministic algorithm that takes a weighted bipartite graph and returns the matching with maximum total weight; here the edge weights are the offline-estimated Q(s,a) (see the sketch below)
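
A sketch of the online matching step. It swaps in scipy's `linear_sum_assignment` (Hungarian algorithm) for KM, which solves the same max-weight bipartite matching problem; the Q matrix is toy data and `dispatch` is a hypothetical helper, not the paper's code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Online dispatch sketch: each (driver, order) edge is weighted by the
# offline-estimated Q(s, a) = r + V(s'); we solve the max-weight bipartite
# matching. scipy's Hungarian-algorithm implementation stands in for KM here.
def dispatch(q_matrix):
    """q_matrix[i, j] = Q value of assigning driver i to order j."""
    drivers, orders = linear_sum_assignment(q_matrix, maximize=True)
    return list(zip(drivers, orders))

Q = np.array([[3.0, 1.0, 2.0],
              [2.0, 4.0, 1.0]])   # 2 drivers x 3 orders (toy numbers)
print(dispatch(Q))                # [(0, 0), (1, 1)]
```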

AlphaGo

Mastering the game of Go with deep neural networks and tree search

overview:

  • Supervised Learning (SL), learn policy net $p_\sigma$ from human expert
  • RL policy net: init by $p_\sigma$, self-play training, got $p_\rho$
  • train Value net $V_\theta$ with RL self-play data and MSE loss
  • infer: combine fast policy net $p_\pi$ (for rollout), value net $V$ and MCTS

model arch:

  • input: 19 x 19 x C: current board state, history, and domain-knowledge features (scalars broadcast to 19 x 19)
  • model: 13 convolutional layers (see the sketch below)
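
A hedged PyTorch sketch of such a 13-layer conv policy net over 19 x 19 x C input planes; the 5x5 first layer, 3x3 middle layers, and 192 filters follow the paper's general shape, but treat the exact sizes here as illustrative.

```python
import torch
import torch.nn as nn

# Sketch of an AlphaGo-style 13-layer conv policy net (illustrative sizes:
# 5x5 first layer, 3x3 middle layers, 192 filters, 1x1 final layer, then a
# softmax over the 19x19 board).
class PolicyNet(nn.Module):
    def __init__(self, in_planes=48, filters=192):
        super().__init__()
        layers = [nn.Conv2d(in_planes, filters, 5, padding=2), nn.ReLU()]
        for _ in range(11):                           # layers 2-12
            layers += [nn.Conv2d(filters, filters, 3, padding=1), nn.ReLU()]
        layers.append(nn.Conv2d(filters, 1, 1))       # layer 13: one plane of move logits
        self.net = nn.Sequential(*layers)

    def forward(self, boards):                        # boards: (B, C, 19, 19)
        logits = self.net(boards).flatten(1)          # (B, 361)
        return torch.softmax(logits, dim=-1)

probs = PolicyNet()(torch.zeros(2, 48, 19, 19))       # (2, 361) move probabilities
```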

About MCTS:

  • Go's game-tree complexity is roughly $b^d$, where $b$ is the breadth (number of available actions, = 19*19) and $d$ is the depth (game length, = 100~200)
  • Shrinking the search tree comes down to two things: 1. sample actions (reduce $b$) using the policy net; 2. truncate $d$ using the value net
  • Details: refer to the paper or David Silver's course slides; a sketch of the selection rule is given below
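
For concreteness, a sketch of the PUCT-style selection rule that implements both reductions: the policy net's prior $P(s,a)$ biases which branches get visited (less breadth), while Q comes from value-net/rollout evaluations so the tree need not be expanded to the end of the game (less depth). The `node`/`child` fields are assumed data structures, not the paper's code.

```python
import math

# PUCT-style selection sketch: prior P(s, a) from the policy net focuses the
# search (breadth reduction); Q is backed up from value-net / rollout
# evaluations (depth reduction). c_puct is an illustrative constant.
def select_action(node, c_puct=5.0):
    """node.children: dict action -> child with fields N (visits), Q (mean value), P (prior)."""
    total_n = sum(child.N for child in node.children.values())
    def score(child):
        u = c_puct * child.P * math.sqrt(total_n) / (1 + child.N)
        return child.Q + u
    return max(node.children, key=lambda a: score(node.children[a]))
```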

AlphaGo Zero

  • trained solely by self-play RL
  • only raw board features
  • single model outputting both policy and value: $(p, v) = f_\theta(s)$
  • simpler tree search (its visit counts serve as the target policy $\pi$)
  • training: push the policy $p$ toward the tree-search policy $\pi$ and the value $v$ toward the self-play game outcome $z$ (a sketch of the loss follows the equation below)

$$ loss = (z - v)^2 - \pi^T \log p + c||\theta||^2 $$
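
A PyTorch sketch of this loss, assuming $f_\theta$ returns raw move logits plus a tanh value head; in practice the $c||\theta||^2$ term is applied via the optimizer's `weight_decay`, so it is omitted from the function. The shapes and the 362-way move space (361 points + pass) are illustrative.

```python
import torch
import torch.nn.functional as F

# AlphaGo Zero loss sketch: (z - v)^2 - pi^T log p (+ c*||theta||^2 via weight_decay).
# logits, v come from f_theta(s); pi is the MCTS visit distribution, z the game outcome.
def alphago_zero_loss(logits, v, pi, z):
    value_loss = F.mse_loss(v, z)                                        # (z - v)^2
    policy_loss = -(pi * F.log_softmax(logits, dim=-1)).sum(-1).mean()   # -pi^T log p
    return value_loss + policy_loss

# Example shapes: batch of 8 positions, 362 moves (361 points + pass).
logits, v = torch.randn(8, 362), torch.tanh(torch.randn(8))
pi = torch.softmax(torch.randn(8, 362), dim=-1)
z = torch.randint(0, 2, (8,)).float() * 2 - 1                            # game outcomes in {-1, +1}
print(alphago_zero_loss(logits, v, pi, z))
```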