Comparing Non-stationary Multi-armed Bandits in Single-Agent and Multi-Agent Scenarios - Distributed Optimization and Learning (DOL) Course Project
In this project, we implemented bandit learning algorithms for single-agent and multi-agent scenarios. To this end, we used a non-stationary environment in which some of the arms' reward distributions change abruptly over time.
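As an illustration, a minimal sketch of such a non-stationary environment could look like the following. The class name `NonStationaryBandit`, the fixed change schedule, and the Gaussian reward model are assumptions made for this example, not the exact implementation used in the project:

```python
import numpy as np

class NonStationaryBandit:
    """k-armed bandit whose reward means change abruptly at fixed intervals."""

    def __init__(self, n_arms=10, change_every=1000, reward_std=1.0, seed=None):
        self.n_arms = n_arms
        self.change_every = change_every   # steps between disruptive changes
        self.reward_std = reward_std
        self.rng = np.random.default_rng(seed)
        self.t = 0
        self._reset_means()

    def _reset_means(self):
        # Draw a fresh set of reward means, producing a disruptive change.
        self.means = self.rng.normal(0.0, 1.0, size=self.n_arms)

    def pull(self, arm):
        self.t += 1
        if self.t % self.change_every == 0:
            self._reset_means()
        return self.rng.normal(self.means[arm], self.reward_std)
```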
In the first part, we considered single-agent multi-armed bandits with 2 and 10 arms. In both cases, we designed environments with different difficulty levels, i.e., different degrees of discriminability between the arms' rewards. For this part, we used the Epsilon-greedy, Upper Confidence Bound (UCB), Policy Gradient, Thompson Sampling, and Actor-Critic algorithms.
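As a concrete example of one of these methods, here is a minimal Epsilon-greedy sketch using a constant step size, a common choice for tracking non-stationary rewards. It assumes an environment object exposing `n_arms` and a `pull(arm)` method like the sketch above; the hyperparameter values are illustrative only:

```python
import numpy as np

def epsilon_greedy(env, n_steps=10000, epsilon=0.1, alpha=0.1, seed=None):
    """Epsilon-greedy with a constant step size, so estimates keep tracking
    the arms' means after abrupt changes instead of averaging all history."""
    rng = np.random.default_rng(seed)
    q = np.zeros(env.n_arms)                    # running action-value estimates
    rewards = np.zeros(n_steps)
    for t in range(n_steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(env.n_arms)) # explore a random arm
        else:
            arm = int(np.argmax(q))             # exploit the current best estimate
        r = env.pull(arm)
        q[arm] += alpha * (r - q[arm])          # constant-step-size update
        rewards[t] = r
    return q, rewards

# Example usage with the environment sketched above:
# q, rewards = epsilon_greedy(NonStationaryBandit(n_arms=10, seed=0), seed=0)
```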
In the second part, we considered multi-agent multi-armed bandit scenarios. Similar to the first part, we considered different numbers of arms with different reward probability distributions. Here, we used Joint Action Learners (JAL), Frequency Maximum Q-value (FMQ), Distributed Q-learning, and a Multi-agent Actor-Critic algorithm.
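To illustrate one of these methods, below is a minimal sketch of Distributed Q-learning for a stateless cooperative game: independent learners that only revise their value estimates upward, so low rewards caused by teammates' exploration are ignored. The function signature and the example payoff matrix (the classic climbing game from the cooperative-games literature) are assumptions for illustration, not the project's exact setup:

```python
import numpy as np

def distributed_q_learning(payoff, n_agents, n_actions, n_steps=5000,
                           epsilon=0.1, seed=None):
    """Optimistic independent learners for a stateless cooperative game.

    Each agent keeps value estimates over its own actions only; `payoff(actions)`
    is assumed to return one shared reward for the chosen joint action.
    """
    rng = np.random.default_rng(seed)
    # Start low so the first observed reward for each own-action is recorded.
    q = np.full((n_agents, n_actions), -np.inf)
    for _ in range(n_steps):
        actions = [
            int(rng.integers(n_actions)) if rng.random() < epsilon
            else int(np.argmax(q[i]))
            for i in range(n_agents)
        ]
        r = payoff(actions)                               # shared team reward
        for i in range(n_agents):
            q[i, actions[i]] = max(q[i, actions[i]], r)   # optimistic update only
    return q

# Example: a 2-agent, 3-action coordination game with a deterministic payoff matrix.
climbing_game = np.array([[ 11, -30,   0],
                          [-30,   7,   6],
                          [  0,   0,   5]])
q = distributed_q_learning(lambda a: climbing_game[a[0], a[1]],
                           n_agents=2, n_actions=3)
print(q)
```

Note that this purely optimistic rule suits deterministic, stationary rewards; handling the non-stationary case would require additional machinery, which is beyond this sketch.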