Smart_Snake

This is a Reinforcement Learning project in which the agent (the snake) learns to play the snake game [1]. The game board is 12×12; the snake moves inside the inner 10×10 area and eats the food. Eating food increases the snake's length. The snake must learn to reach the food without running into the screen border or its own body.

The learning algorithm is DQN.

Average test score: 20
Best achieved score: 49

Preview    Algorithm    Network    State    Hyperparameters    Results    References    Useful Resources

Preview

[Gameplay GIFs: two runs scoring 49 and 48, and three runs scoring 46, 46, and 43.]

Algorithm

DQN pseudocode (from the DQN Nature paper: https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf)
Q = Q_θ = action-value function = policy network
Q̂ = Q_θ⁻ = target function = target network

Note:

The implementation differs from the above algorithm in a few ways (a condensed sketch of the resulting training loop follows this list):

  1. Training (computing the loss and updating the weights) is skipped for the first 2000 steps [2], because the replay memory does not yet contain enough samples [2].
  2. The target network is updated every C episodes (not every C steps) [3].
  3. I assumed that s_t = x_t, i.e. the state is the current frame only, with no frame history.
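
A condensed sketch of the training loop implied by these notes, under stated assumptions: the environment interface (env.reset() / env.step()), the select_action helper (sketched in the Hyperparameters section), and the argument defaults are illustrative, not the repository's actual code. The two noted deviations from the Nature-paper pseudocode are marked in the comments.

```python
# Condensed DQN training loop reflecting the notes above (a sketch, not the repo's code).
# Assumed: env.reset() returns the 8-feature state, env.step(a) returns (state, reward, done),
# and select_action() is an epsilon-greedy helper (see the Hyperparameters section).
import random
from collections import deque

import torch
import torch.nn as nn

def train(env, policy_net, target_net, optimizer,
          episodes=30000, batch_size=128, gamma=0.99,
          memory_size=50000, warmup_steps=2000, C=10):
    memory = deque(maxlen=memory_size)      # replay memory D with capacity N
    loss_fn = nn.MSELoss()
    target_net.load_state_dict(policy_net.state_dict())
    total_steps = 0

    for episode in range(episodes):
        state = env.reset()                 # Phi(s_t): 8-dimensional feature vector
        done = False
        while not done:
            action = select_action(policy_net, state, total_steps)
            next_state, reward, done = env.step(action)
            memory.append((state, action, reward, next_state, done))
            state = next_state
            total_steps += 1

            # Difference 1: no loss/update during the first 2000 steps,
            # because the replay memory does not yet hold enough samples.
            if total_steps < warmup_steps:
                continue

            batch = random.sample(memory, batch_size)
            s  = torch.tensor([b[0] for b in batch], dtype=torch.float32)
            a  = torch.tensor([b[1] for b in batch], dtype=torch.int64)
            r  = torch.tensor([b[2] for b in batch], dtype=torch.float32)
            s2 = torch.tensor([b[3] for b in batch], dtype=torch.float32)
            d  = torch.tensor([b[4] for b in batch], dtype=torch.float32)

            # y_j = r_j for terminal transitions, r_j + gamma * max_a' Q_hat(s2, a') otherwise
            q = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                y = r + gamma * target_net(s2).max(dim=1).values * (1.0 - d)

            loss = loss_fn(q, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Difference 2: sync the target network every C episodes (not every C steps).
        if episode % C == 0:
            target_net.load_state_dict(policy_net.state_dict())
```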

Network

Input data:

  (batch_size, 8)

Layers:

  FC(1024) → ReLU → FC(1024) → ReLU → FC(512) → ReLU → FC(4)
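
A minimal PyTorch sketch of this layer stack (the class name and defaults are mine, not necessarily the repository's code):

```python
# Fully connected Q-network: 8 input features -> Q-values for the 4 actions.
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_features=8, n_actions=4):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_features, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # one Q-value per action (Left, Right, Up, Down)
        )

    def forward(self, x):
        # x: (batch_size, 8) feature batch -> (batch_size, 4) Q-values
        return self.layers(x)
```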

State

s_t :

  The frame of the game after t transitions. It is converted to a 12×12 NumPy array.

Example:

  [A game frame and its corresponding 12×12 NumPy array, shown as images in the original README.]

Φ(s_t) :

  Following Assignment 4 of the Artificial Intelligence course (CS 440/ECE 448) at the University of Illinois at Urbana-Champaign, 8 features are extracted from the frame (a hedged sketch of this feature extraction appears at the end of this section):

  [adjoining_wall_x, adjoining_wall_y, food_dir_x, food_dir_y, adjoining_body_top, adjoining_body_bottom, adjoining_body_left, adjoining_body_right] [4]

      [The defining equation for each feature is shown as an image in the original README: the adjoining_wall features encode whether a wall is adjacent to the head in the x/y direction, the food_dir features encode the food's direction relative to the head, and the adjoining_body features encode whether a body segment occupies the cell above, below, left of, or right of the head.]

Φ(s_t) for the previous example:

   [Shown as an image in the original README.]
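
A hedged sketch of this feature extraction. The exact value conventions come from the assignment's equations (shown as images in the original README); the encodings below (0/1/2 for wall position and food direction, 0/1 for body adjacency), the cell labels, and the function signature are assumptions for illustration only.

```python
# Hypothetical Phi(s_t): map the 12x12 board, head position, and food position to 8 features.
import numpy as np

def phi(board: np.ndarray, head: tuple[int, int], food: tuple[int, int]) -> np.ndarray:
    hx, hy = head                     # head coordinates; board is indexed as board[y, x]
    fx, fy = food
    wall, body = 1, 2                 # assumed cell labels inside the 12x12 array

    adjoining_wall_x = 1 if board[hy, hx - 1] == wall else 2 if board[hy, hx + 1] == wall else 0
    adjoining_wall_y = 1 if board[hy - 1, hx] == wall else 2 if board[hy + 1, hx] == wall else 0
    food_dir_x = 1 if fx < hx else 2 if fx > hx else 0    # food left / right / same column
    food_dir_y = 1 if fy < hy else 2 if fy > hy else 0    # food above / below / same row
    adjoining_body_top    = 1 if board[hy - 1, hx] == body else 0
    adjoining_body_bottom = 1 if board[hy + 1, hx] == body else 0
    adjoining_body_left   = 1 if board[hy, hx - 1] == body else 0
    adjoining_body_right  = 1 if board[hy, hx + 1] == body else 0

    return np.array([adjoining_wall_x, adjoining_wall_y, food_dir_x, food_dir_y,
                     adjoining_body_top, adjoining_body_bottom,
                     adjoining_body_left, adjoining_body_right], dtype=np.float32)
```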

Hyperparameters

Some of these values were adopted from this paper and this site.

  • C (target network update period, in episodes): 10
  • γ (discount factor): 0.99
  • Batch size: 128
  • Actions: (Left, Right, Up, Down) ~ (0, 1, 2, 3)
  • Rewards: (Reward_Food, Reward_Lose, Reward_Move) ~ (100, -100, -0.1)
  • N (replay memory size): 50000
  • M (number of episodes): 30000
  • Learning rate: 0.001
  • Optimizer: RMSprop
  • Loss: MSELoss
  • Epsilon greedy: ε decreases linearly from 1 (ε_max) to 0.0001 (ε_min) in steps of 0.00001 (∆ε). In other words, after about 100,000 steps ε stays at 0.0001 for the rest of training [2] (see the sketch below).
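
A small sketch of the linear ε schedule and ε-greedy action selection described above (the function names and the state-tensor handling are mine, not the repository's):

```python
# Linear epsilon decay and epsilon-greedy action selection (illustrative helpers).
import random
import torch

def epsilon(step, eps_max=1.0, eps_min=0.0001, delta=0.00001):
    # Decrease linearly by delta per step; after ~100,000 steps epsilon stays at eps_min.
    return max(eps_min, eps_max - delta * step)

def select_action(policy_net, state, step, n_actions=4):
    # With probability epsilon pick a random action, otherwise the greedy one.
    if random.random() < epsilon(step):
        return random.randrange(n_actions)          # (Left, Right, Up, Down) ~ (0, 1, 2, 3)
    with torch.no_grad():
        q = policy_net(torch.tensor(state, dtype=torch.float32).unsqueeze(0))
        return int(q.argmax(dim=1).item())
```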

Results

Plots:

   [Training and test score plots, shown as images in the original README.]

Notes:

   Training finished in ~209 minutes on a Tesla V100-SXM2-16GB (using Google Colab Pro).
   Test result:  Mean(scores): 20.257  |  Std(scores): 6.50

References

[1] Wikipedia - Snake (video game genre)

[2] https://www.diva-portal.org/smash/get/diva2:1342302/FULLTEXT01.pdf

[3] PyTorch - Reinforcement Learning (DQN) Tutorial

[4] CS 440/ECE 448 Spring 2019, Assignment 4: Reinforcement Learning and Deep Learning (University of Illinois Urbana-Champaign)

Useful Resources

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

AI learns to play SNAKE using Reinforcement Learning (Square Robots)

How to automate Snake using Reinforcement Learning (DeKay Arts)

https://github.com/YuriyGuts/snake-ai-reinforcement/

https://github.com/benjamin-dupuis/DQN-snake
