This repository contains a TensorFlow 2 implementation of the paper "A Deep Reinforcement Learning Algorithm Using Dynamic Attention Model for Vehicle Routing Problems".
This work was done as a final project for the DeepPavlov course Advanced Topics in Deep Reinforcement Learning. A non-dynamic version of this approach ("Attention, Learn to Solve Routing Problems!"), implemented as part of the same project, can be found at https://github.com/alexeypustynnikov/AM-VRP.
An important application of combinatorial optimization is the vehicle routing problem (VRP), in which the goal is to find the best routes for a fleet of vehicles visiting a set of locations. Usually, "best" means routes with the least total distance or cost.
We consider only a particular case of the general VRP: the Capacitated Vehicle Routing Problem (CVRP), where each vehicle has a limited carrying capacity for the goods that must be delivered.
VRP is an NP-hard problem (Lenstra and Rinnooy Kan, 1981).
Exact algorithms are only efficient for small problem instances, and a number of near-optimal heuristics have been introduced in the academic literature. There are also several professional tools for solving various VRP variants (e.g. Google OR-Tools).
The structural features of the input graph instance are extracted by the encoder. Then the solution is constructed incrementally by the decoder.
Specifically, at each construction step, the decoder predicts a distribution over nodes, then one node is selected and appended to the end of the partial solution.
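As a toy illustration of this incremental construction loop (not the paper's actual decoder; `logits_fn` is a hypothetical stand-in for the attention decoder), consider:

```python
import numpy as np

def greedy_decode(logits_fn, n_nodes, start=0):
    """Illustrative greedy construction loop.

    logits_fn(partial, visited) stands in for the decoder: it returns
    unnormalized scores over all nodes given the partial solution.
    """
    visited = np.zeros(n_nodes, dtype=bool)
    partial = [start]
    visited[start] = True
    while not visited.all():
        logits = logits_fn(partial, visited)
        logits = np.where(visited, -np.inf, logits)  # mask visited nodes
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                         # distribution over nodes
        nxt = int(probs.argmax())                    # greedy; sampling also works
        partial.append(nxt)
        visited[nxt] = True
    return partial
```

At training time one would sample from `probs` instead of taking the argmax, so the policy can explore.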
- Use RL to train an agent that can learn heuristics and provide suboptimal solutions.
- Make use of Graph Attention Networks (GAT) to create appropriate graph embeddings for the agent.
- The policy of the RL agent is governed by the decoder.
- After a vehicle returns to the depot, the remaining nodes can be considered as a new (smaller) instance (graph) to be solved.
- Idea: update the embeddings of the remaining nodes with the encoder after the agent arrives back at the depot.
- Implementation:
  - Force the RL agent to wait for the others once it arrives at $x_0$.
  - When every agent is in the depot, apply the encoder with a mask to the whole batch.
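A minimal sketch of this waiting-and-re-encoding step (assuming a batched `encoder` callable; the names here are illustrative, not the repo's exact API):

```python
import numpy as np

def maybe_reencode(encoder, inputs, visited_mask, at_depot):
    """Refresh node embeddings only once every agent in the batch is back
    at the depot; agents that finish their route early simply wait."""
    if np.all(at_depot):
        # Re-run the encoder on the whole batch, masking nodes that have
        # already been served, so the remaining graph looks like a new,
        # smaller instance.
        return encoder(inputs, mask=visited_mask)
    return None  # otherwise keep the previous embeddings
```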
The current environment implementation is located in the enviroment.py file (the AgentVRP class).
The class contains information about the current state and the actions taken by the agent.
Main methods:
- step(action): transit to a new state according to the action.
- get_costs(dataset, pi): returns the cost of each graph in the batch according to the paths in action-state space.
- get_mask(): returns a mask of available actions (allowed nodes).
- all_finished(): checks if all games in the batch are finished (all graphs are solved).
- partial_finished(): checks if partial solutions for all graphs have been built, i.e. all agents have returned to the depot.
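To illustrate how these methods fit together, here is a toy environment with the same method names and a rollout loop (a hypothetical simplification; the real AgentVRP also tracks demands, capacities and batching):

```python
class ToyVRPEnv:
    """Toy environment exposing the same interface as AgentVRP."""
    def __init__(self, n_nodes):
        self.visited = [False] * n_nodes
        self.visited[0] = True  # node 0 is the depot

    def get_mask(self):
        # Allowed actions: every node not yet visited.
        return [not v for v in self.visited]

    def step(self, action):
        self.visited[action] = True

    def all_finished(self):
        return all(self.visited)

def rollout(env, select_fn):
    """Roll a policy out until every node is served.
    select_fn(mask) stands in for the decoder's node choice."""
    actions = []
    while not env.all_finished():
        mask = env.get_mask()
        action = select_fn(mask)
        env.step(action)
        actions.append(action)
    return actions
```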
Connection with RL language:
- State: $X$, the graph instance (coordinates, demands, etc.) together with the information about which node the agent is currently located in.
- Action: $\pi_t$, the decision about which node the agent should go to next.
- Reward: the (negative) tour length.
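For example, the reward for a single route can be computed as the negative of the tour length (a plain-NumPy sketch for one graph; the repo's get_costs works on whole batches):

```python
import numpy as np

def tour_length(coords, pi, depot=0):
    """Length of the route depot -> pi[0] -> ... -> pi[-1] -> depot.
    The REINFORCE reward is the negative of this value."""
    route = [depot] + list(pi) + [depot]
    pts = coords[route]
    # Sum of Euclidean distances between consecutive stops.
    return float(np.linalg.norm(np.diff(pts, axis=0), axis=1).sum())
```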
AM-D is trained by policy gradient using the REINFORCE algorithm with a baseline.
Baseline
- The baseline is a copy of the model with frozen weights from one of the preceding epochs.
- Warm-up is used for early epochs: an exponential moving average of the model cost over past epochs is mixed with the baseline model.
- The baseline is updated at the end of an epoch if the difference in costs between the candidate model and the baseline is statistically significant (t-test).
- The baseline uses a separate dataset for this validation; this dataset is regenerated after each baseline update.
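The gradient estimator and the baseline-update test can be sketched in plain Python as follows (illustrative only; `alpha_t` is an assumed critical value, not necessarily the threshold used in the repo):

```python
import math
import statistics

def reinforce_loss_terms(costs, baseline_costs, log_probs):
    """Sum of (cost - baseline) * log_prob: the REINFORCE surrogate loss
    whose gradient is the policy-gradient estimator with a baseline."""
    return sum((c - b) * lp for c, b, lp in zip(costs, baseline_costs, log_probs))

def should_update_baseline(cand_costs, base_costs, alpha_t=1.7):
    """One-sided paired t-test sketch: replace the baseline only when the
    candidate model's costs are significantly lower on the validation set."""
    diffs = [c - b for c, b in zip(cand_costs, base_costs)]
    mean = statistics.fmean(diffs)
    if mean >= 0:
        return False          # candidate is not better on average
    sd = statistics.stdev(diffs)
    if sd == 0:
        return True           # uniformly better on every instance
    t = mean / (sd / math.sqrt(len(diffs)))
    return t < -alpha_t
```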
Example
Implementation in TensorFlow 2
- AM-D for VRP Report.ipynb - demo report notebook
- enviroment.py - environment for the VRP RL agent
- layers.py - MHA layers for encoder
- attention_graph_encoder.py - Graph Attention Encoder
- reinforce_baseline.py - class for REINFORCE baseline
- attention_dynamic_model.py - main model and decoder
- train.py - defines the training loop used in train_model.ipynb
- train_model.ipynb - from this notebook one can start training or continue training from a checkpoint
- utils.py and utils_demo.py - various auxiliary functions for data creation, saving and visualisation
- lkh3_baseline folder - everything for running the LKH algorithm, plus logs
- results folder: each subfolder is named ADM_VRP_{graph_size}_{batch_size} and contains training logs, learning curves and saved models
- Open train_model.ipynb and choose training parameters.
- All outputs will be saved in the current directory.