Please complete this homework as a team, and mention in your report who contributed which parts.
In this assignment, we will solve a classic control problem: CartPole.
CartPole is an environment containing a pendulum attached by an un-actuated joint to a cart; the goal is to prevent the pendulum from falling over. You can apply a force of +1 or -1 to the cart, and a reward of +1 is provided for every timestep that the pendulum remains upright.
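For concreteness, a minimal random-agent loop against this environment might look as follows. This is an illustration, not part of the assignment code; it assumes the classic `CartPole-v0` Gym id and the old 4-tuple `step` API:

```python
import gym

env = gym.make("CartPole-v0")
observation = env.reset()
total_reward = 0
done = False
while not done:
    action = env.action_space.sample()  # random action: 0 (push left) or 1 (push right)
    observation, reward, done, info = env.step(action)
    total_reward += reward  # +1 for every timestep the pendulum stays upright
print("Episode return:", total_reward)
```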
- OpenAI gym
- TensorFlow
- NumPy
- SciPy
- IPython Notebook
If you already have some of the above libraries installed, please manage the remaining dependencies yourself.
If you are setting up a new (possibly virtual) environment, the preferred way to install the above dependencies is Anaconda, a Python distribution that includes many of the most popular Python packages for science, math, engineering, and data analysis.
- Install Anaconda: Follow the instructions on the Anaconda download site.
- Install TensorFlow: See the Anaconda section of the TensorFlow installation page.
- Install OpenAI gym: Follow the official installation documents here.
If you are unfamiliar with NumPy or IPython, you should read the tutorial materials from CS231n.
Also, knowing the basics of TensorFlow is required to complete this assignment.
For introductory material on TensorFlow, see:
- MNIST For ML Beginners from the official site
- Tutorial Video from Stanford CS224D
Feel free to skip these materials if you are already familiar with these libraries.
- Start IPython: After you clone this repository and install all the dependencies, start the IPython notebook server from the home directory (e.g., with `ipython notebook` or `jupyter notebook`).
- Open the assignment: Open `HW2_Policy_Graident.ipynb`, and it will walk you through completing the assignment.
- [+20] Construct a 2-layer neural network to represent the policy (see the first sketch after this list).
- [+30] Compute the surrogate loss (also covered in the first sketch after this list).
- [+20] Compute the accumulated discounted rewards at each timestep (see the second sketch after this list).
- [+10] Use a baseline to reduce the variance (see the third sketch after this list).
- [+10] Modify the code and write a report comparing the variance and performance before and after adding the baseline (figures are encouraged).
- [+10] In the function `process_paths` of the class `PolicyOptimizer`, why do we need to normalize the advantages? That is, what is the purpose of this line: `p["advantages"] = (a - a.mean()) / (a.std() + 1e-8)`? Include the answer in your report.
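For orientation, here is one way the policy network and the surrogate loss could fit together. This is a minimal sketch, not the assignment's reference code: it assumes TensorFlow 1.x-style graph construction, CartPole's 4-dimensional observations, and 2 discrete actions; the variable names and the learning rate are illustrative.

```python
import tensorflow as tf

observation_dim, hidden_dim, num_actions = 4, 16, 2

observations = tf.placeholder(tf.float32, [None, observation_dim])
actions = tf.placeholder(tf.int32, [None])       # actions actually taken
advantages = tf.placeholder(tf.float32, [None])  # discounted returns (minus baseline)

# 2-layer policy network: tanh hidden layer, then softmax over actions.
w1 = tf.Variable(tf.random_normal([observation_dim, hidden_dim], stddev=0.1))
b1 = tf.Variable(tf.zeros([hidden_dim]))
hidden = tf.tanh(tf.matmul(observations, w1) + b1)
w2 = tf.Variable(tf.random_normal([hidden_dim, num_actions], stddev=0.1))
b2 = tf.Variable(tf.zeros([num_actions]))
action_probs = tf.nn.softmax(tf.matmul(hidden, w2) + b2)

# Surrogate loss: negative mean of log pi(a_t | s_t) weighted by the advantages.
# Flattening lets tf.gather pick out the probability of each taken action.
flat_indices = tf.range(tf.shape(actions)[0]) * num_actions + actions
log_probs = tf.gather(tf.reshape(tf.log(action_probs + 1e-8), [-1]), flat_indices)
surrogate_loss = -tf.reduce_mean(log_probs * advantages)
train_op = tf.train.AdamOptimizer(1e-2).minimize(surrogate_loss)
```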
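The accumulated discounted reward at timestep t is R_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ..., which can be computed with one backward pass over a trajectory. A minimal NumPy sketch (the helper name `discount_cumsum` is ours, not from the notebook):

```python
import numpy as np

def discount_cumsum(rewards, gamma):
    """Return R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: three rewards of 1 with gamma = 0.99
# -> [1 + 0.99 + 0.99**2, 1 + 0.99, 1] = [2.9701, 1.99, 1.0]
print(discount_cumsum(np.ones(3), 0.99))
```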
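A baseline subtracts a state-dependent estimate b(s_t) from R_t, so the advantage measures how much better an action did than expected; this reduces the variance of the gradient estimate without biasing it. As one possibility (the assignment's baseline may differ), a linear least-squares baseline can be fit to the observed returns; the function names here are ours:

```python
import numpy as np

def fit_linear_baseline(observations, returns):
    """Fit b(s) = [s, 1] . w by least squares: a simple stand-in for a
    learned value function."""
    features = np.concatenate([observations, np.ones((len(observations), 1))], axis=1)
    w = np.linalg.lstsq(features, returns, rcond=None)[0]
    return w

def compute_advantages(observations, returns, w):
    """A_t = R_t - b(s_t): center returns around the expected return from s_t."""
    features = np.concatenate([observations, np.ones((len(observations), 1))], axis=1)
    return returns - features.dot(w)
```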
- Office hour: 2-3 pm in 資電館 (the EECS building) with YenChen Lin.
- Due on Oct. 17 before class.