This project explores the application of deep Reinforcement Learning (RL) to image reconstruction, using the popular Proximal Policy Optimization (PPO) algorithm.
The task involves selecting a simple, binarized image (akin to pixel art or MNIST digits), which the agent must reconstruct within an environment represented by a grid, where each cell corresponds to a pixel. The goal is to maximize the similarity between the recreated image and the target image.
The environment, denoted as $E$, is an $N \times N$ grid, with each cell ($C$) representing a pixel. The cells within this environment can be in one of two states, $C = 1$ or $C = 0$.
The agent ($A$) operating within this environment has an action space $A = \{0, 1\}$, corresponding to changing a cell's state to 0 (black pixel) or 1 (white pixel). The agent initiates its trajectory at the top-left cell ($e_{1,1}$) and terminates at the bottom-right cell ($e_{N,N}$).
Here, $P$ is the path of the agent, representing the sequence of cells it visits, altering their states with actions from $A$.
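As a concrete illustration of this setup, the sketch below lays out such an environment with the Gymnasium API. The class name `GridPaintEnv`, the raster-scan traversal order, and the dictionary observation layout are assumptions made for illustration, not necessarily the project's exact implementation; the reward is left as a placeholder and filled in by the rule described later.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class GridPaintEnv(gym.Env):
    """Hypothetical N x N binary-painting environment (illustrative sketch)."""

    def __init__(self, target: np.ndarray):
        super().__init__()
        self.target = target.astype(np.int8)            # T: binary target image
        self.n = target.shape[0]                        # grid side length N
        self.action_space = spaces.Discrete(2)          # A = {0, 1}
        self.observation_space = spaces.Dict({
            "canvas": spaces.MultiBinary([self.n, self.n]),      # E_{s_t}
            "position": spaces.MultiDiscrete([self.n, self.n]),  # P_{s_t}
            "target": spaces.MultiBinary([self.n, self.n]),      # T
        })

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # Assumption: the canvas starts all black (zeros).
        self.canvas = np.zeros((self.n, self.n), dtype=np.int8)
        self.row, self.col = 0, 0                       # start at e_{1,1}
        return self._obs(), {}

    def _obs(self):
        return {
            "canvas": self.canvas.copy(),
            "position": np.array([self.row, self.col]),
            "target": self.target.copy(),
        }

    def step(self, action):
        self.canvas[self.row, self.col] = action        # paint the current cell
        reward = 0.0   # placeholder: per-cell reward and terminal bonus are defined below
        # Advance the cursor in raster-scan order (left to right, top to bottom);
        # the episode ends after the bottom-right cell e_{N,N} has been painted.
        done = (self.row == self.n - 1) and (self.col == self.n - 1)
        if not done:
            self.col += 1
            if self.col == self.n:
                self.col = 0
                self.row += 1
        return self._obs(), reward, done, False, {}
```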
The agent's observation at any time step $t$, denoted as $O$, is a tuple consisting of the current environment state $E_{s_t}$, the agent's current position $P_{s_t}$, and the target image $T$.
In this definition, $fps$ denotes the frames-per-second parameter. If $fps = f$, the agent's observation at time $t$ also includes the states of the environment and the agent's positions for the $f - 1$ previous time steps.
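A minimal sketch of this kind of frame stacking, assuming flattened observation vectors as produced above; the class and its buffer layout are illustrative:

```python
from collections import deque

import numpy as np


class FrameStacker:
    """Keep the current observation plus the f-1 previous ones (a sketch)."""

    def __init__(self, f: int, obs_dim: int):
        # Buffer of f frames; missing history at episode start is zero-padded.
        self.frames = deque(
            [np.zeros(obs_dim, dtype=np.float32) for _ in range(f)], maxlen=f
        )

    def push(self, obs: np.ndarray) -> np.ndarray:
        self.frames.append(obs)
        # Oldest frame first, most recent frame last.
        return np.concatenate(list(self.frames))
```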
The reward signal was designed to encourage an equilibrium of potential rewards for both black and white cells, facilitating the accurate replication of the target image. The total possible reward $R$ is split equally between white cells ($R_w$) and black cells ($R_b$), where $C_w$ and $C_b$ denote the number of white and black cells in the target image.
For an agent at position $e_{i,j}$, corresponding to target image location $t_{i,j}$, the reward is calculated as follows:
- For $t_{i,j} = 1$: $r = R_w / C_w$ if $e_{i,j} = t_{i,j}$, otherwise $r = -(R_w / C_w)$;
- For $t_{i,j} = 0$: $r = R_b / C_b$ if $e_{i,j} = t_{i,j}$, otherwise $r = -(R_b / C_b)$.
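Put together, the per-cell rule could be implemented as follows; the function signature and its pairing with the `reward_budget` helper above are illustrative assumptions:

```python
def cell_reward(painted: int, target_pixel: int,
                per_white: float, per_black: float) -> float:
    """Per-cell reward: +R_w/C_w or +R_b/C_b on a match, the negative otherwise."""
    if target_pixel == 1:
        return per_white if painted == target_pixel else -per_white
    return per_black if painted == target_pixel else -per_black


# Example: a white target pixel painted black yields -(R_w / C_w).
# per_white, per_black = reward_budget(target)
# r = cell_reward(painted=0, target_pixel=1,
#                 per_white=per_white, per_black=per_black)
```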
If the final step is reached with $S = T$, i.e. the reconstructed image exactly matches the target, an extra reward of $0.1R$ is given.
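Once the environment's `step()` applies the per-cell rule and the $0.1R$ terminal bonus above, training could look like the following sketch. Stable-Baselines3 is assumed here as one widely used PPO implementation, not necessarily the one used in this project, and `GridPaintEnv` refers to the hypothetical environment sketched earlier.

```python
import numpy as np
from stable_baselines3 import PPO

# Stand-in 8 x 8 binary target; GridPaintEnv is the class sketched earlier,
# assumed to be completed with the per-cell reward and the 0.1 * R bonus.
target = (np.random.rand(8, 8) > 0.5).astype(np.int8)
env = GridPaintEnv(target)

# "MultiInputPolicy" handles the Dict observation (canvas, position, target).
model = PPO("MultiInputPolicy", env, verbose=1)
model.learn(total_timesteps=100_000)

# Roll out the trained policy once to reconstruct the image.
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, truncated, _ = env.step(int(action))
```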