This repository contains our work for the data challenge of the ALTEGRAD course at ENS Paris-Saclay, where @s89ne and I finished 6th out of roughly 50 teams with a score of 0.9422.
The goal of the challenge was to retrieve molecules from a database given natural-language queries. The challenge handout can be found here; the competition itself was hosted privately on Kaggle.
This is a pared-down version of our original competition code: for clarity, we removed the approaches that did not work out.
The main idea is to use contrastive learning to encode the text queries and the molecular graphs in the same vector space. A similarity function between embeddings can then rank the molecules for each query. Our approach can be summarized in the following steps:
- Design a good graph neural network; in our case we used the GAT architecture.
- Use this GNN inside the DiffPool architecture to hierarchically aggregate the nodes.
- Ensemble several such models to get a good final score.
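The contrastive objective behind the first steps can be sketched in a few lines. Below is a minimal NumPy sketch of a symmetric InfoNCE-style loss over paired text/graph embeddings, plus the similarity ranking used at retrieval time. The function names and the temperature value are illustrative, not our exact implementation:

```python
import numpy as np

def info_nce(text_emb, graph_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched (text, graph) pairs sit on
    the diagonal of the cosine-similarity matrix and act as positives;
    all other pairs in the batch are negatives."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    logits = t @ g.T / temperature  # (batch, batch) similarity matrix

    def xent_diag(l):
        # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

def rank_molecules(query_emb, graph_embs):
    """Rank database molecules by cosine similarity to one text query."""
    q = query_emb / np.linalg.norm(query_emb)
    g = graph_embs / np.linalg.norm(graph_embs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))  # most similar molecule first
```

Minimizing this loss pulls each text embedding toward its paired graph embedding and pushes it away from the other graphs in the batch, which is exactly what makes the cosine ranking meaningful afterwards.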
You can find our detailed report here.
In the report, we present the following table (only the diffpool-old model has been removed, as it did not bring much to the table):
Model name | DiffPool layers | MPL dimension | MPL | Linear layers | Attention heads | Final linear layer | Graph parameters | Validation score
---|---|---|---|---|---|---|---|---
GAT | - | - | | | | | |
Diffpool-deep | | | | | | | |
DiffPool-big | 10, 5, 3, 1 | | | | | | |
Diffpool-shallow | | | | | | | |
DiffPool-base | | | | | | | |
Diffpool-medium | | | | | | | |
Diffpool-linear | | | | | | | |
Diffpool-large | | | | | | | |
To reproduce our results, train the models with the train.py script; for each model in the table above, the correct hyper-parameters are listed in the ensembling.ipynb notebook. We reused some of the models several times in the final ensemble, but training each model once should be enough to get a good idea of its performance. Once the models are trained, run the ensembling.ipynb notebook to ensemble them and produce the final solution.