COMET (Chemically Omnipotent Molecular Encoder from Transformer)
- Data Loader : the data loader samples the atoms to mask according to their inverse occurrence probability
- Logging : records macro F1 score, confusion matrix, and weight histograms
- Data : the dataset consists of more balanced molecule samples, with rare symbols more abundant.
- Data Set : normalizes molecular properties with their mean & std values. This is very fast with pandas operations.
- Data Set : preprocesses each molecule and holds its Adjacency Matrix and Feature Matrix. Each molecule is also parsed into a fixed-size vector.
- Inside Iteration : masking indices are selected based on the symbol distribution, and the iteration returns A, X, masked_A, masked_X, masked_idx, P
- Ground Truth : the previous ground truth matrix is indexed inside the training iteration.
- Put the masking task in the Batch_process fn
- Normalize the A matrix in the dataloader or in the iteration (Note: masked A should be generated from the normalized A)
- Crop the A matrix to the max-atom length in order to increase speed
- Auxiliary regression is done at once
- Masking rate is adjusted with the given masking radius. The average num_masking gets close to the average number of connected atoms
- Data Loader : first, it samples a center atom from the occurrence distribution. Second, it finds the adjacent atoms by multiplying the A matrix r (radius) times. It then constructs the index set and truncates it to num_masking
- Data Loader : Masked A would provide
- Model : A should be calculated from the previous A.
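The radius-based masking described in the notes above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual data loader: the function names (`sample_center`, `radius_neighborhood`, `sample_mask_indices`), the toy molecule, and the symbol frequencies are all hypothetical, and ordering the masked atoms by hop distance before truncating is one possible choice that the notes do not specify.

```python
import numpy as np


def sample_center(symbols, symbol_freq, rng):
    """Pick a center atom; rarer symbols get proportionally higher probability."""
    # Inverse-occurrence weights (symbol_freq is a hypothetical lookup table).
    weights = np.array([1.0 / symbol_freq[s] for s in symbols])
    return int(rng.choice(len(symbols), p=weights / weights.sum()))


def radius_neighborhood(A, center, radius):
    """Atoms within `radius` bonds of `center`, ordered by hop distance."""
    n = A.shape[0]
    step = A + np.eye(n, dtype=A.dtype)  # self-loops keep already-reached atoms
    reach = np.zeros(n, dtype=A.dtype)
    reach[center] = 1
    order = [int(center)]
    for _ in range(radius):
        # One multiplication by (A + I) extends the reach by one bond.
        reach = np.minimum(reach @ step, 1)
        for j in np.flatnonzero(reach):
            if j not in order:
                order.append(int(j))
    return order


def sample_mask_indices(symbols, A, symbol_freq, radius, num_masking, rng):
    """Sample a center atom, collect its neighborhood, truncate to num_masking."""
    center = sample_center(symbols, symbol_freq, rng)
    return radius_neighborhood(A, center, radius)[:num_masking]


# Toy chain molecule C-C-O-C-Br (hypothetical data for illustration only).
symbols = ["C", "C", "O", "C", "Br"]
A = np.zeros((5, 5), dtype=int)
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1
symbol_freq = {"C": 0.70, "O": 0.25, "Br": 0.05}

rng = np.random.default_rng(0)
masked_idx = sample_mask_indices(symbols, A, symbol_freq, radius=1,
                                 num_masking=3, rng=rng)
print(masked_idx)
```

Because the center atom is always first in the returned list, truncating to `num_masking` never drops it; a larger radius only changes how many candidates are available before truncation.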
Total Number of Molecules in Raw Zinc Dataset : 531,354,040
Name | Train Size | Train Coverage | Valid Size | Valid Coverage | Sampling Rate |
COMET_L | 19.9M (19,919,005) | | 4.9M (4,980,881) | | 4.7% |
COMET_M | 5.9M (5,975,109) | | 1.4M (1,494,480) | | 1.4% |
COMET_S | 1.9M (1,979,256) | | 0.5M (495,380) | | 0.47% |
COMET_XXS | 197K (197,189) | | 49K (49,514) | | 0.047% |
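The Sampling Rate column can be cross-checked with a little arithmetic. The sketch below assumes the rate is defined as (train size + valid size) divided by the total number of molecules in the raw dataset; that definition is not stated in the table, but it reproduces the listed values to within rounding (the XXS split comes out slightly lower, at roughly 0.046%). The helper name `sampling_rate` is hypothetical.

```python
# Total number of molecules in the raw ZINC dataset, as listed above.
TOTAL_MOLECULES = 531_354_040

# (train size, valid size) for each split, copied from the table.
SPLITS = {
    "COMET_L": (19_919_005, 4_980_881),
    "COMET_M": (5_975_109, 1_494_480),
    "COMET_S": (1_979_256, 495_380),
    "COMET_XXS": (197_189, 49_514),
}


def sampling_rate(train, valid, total=TOTAL_MOLECULES):
    """Percentage of the raw dataset covered by train + valid (assumed definition)."""
    return 100.0 * (train + valid) / total


for name, (train, valid) in SPLITS.items():
    print(f"{name}: {sampling_rate(train, valid):.4f}%")
```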
Register conda environment to jupyter notebook :
Install RDKit :
Handling Large Dataset :
Neat Tutorial to use HDF5 with python :
Convert String into HDF5 encoding :
Loading List of HDF5 files with pytorch Dataset :
Installing TensorboardX :
git clone && cd tensorboardX && python setup.py install
Compress and extract the dataset file :
compress : tar -zcvf dataset.tar.gz dataset
extract : tar -zxvf dataset.tar.gz