Machine learning models for 16S rRNA sequence classification
This repository contains the code and comparative analyses of five machine learning models across different classification tasks and preprocessing methods. The following models are used to classify bacterial taxonomy from curated 16S rRNA gene sequences:
- Ribosomal Database Project (RDP) Classifier with k-mer frequency classification
  This model was developed by Wang, Q. et al. (2007). Access the GitHub repository and the paper.
- Convolutional Neural Network (CNN) with k-mer frequency classification
  This model is based on an architecture developed by Fiannaca, A. et al. (2018). Access the GitHub repository and the paper.
- Bidirectional Long Short-Term Memory neural network (BiLSTM) with one-hot-encoded sequence classification
  This model is based on an architecture developed by Philipp Münch. Access the GitHub repository.
- Combined Convolutional BiLSTM (ConvBiLSTM) with one-hot-encoded sequence classification
  This model is based on an architecture developed by Desai, P. et al. (2020). Access the paper.
- Attention-based ConvBiLSTM (Read2Pheno) with one-hot-encoded sequence classification
  This model is based on an architecture developed by Zhao, Z. et al. (2021). Access the GitHub repository and the paper.
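The two input representations named in the list above, k-mer frequencies and one-hot encoding, can be illustrated with a minimal sketch in plain Python. The function names `kmer_frequencies` and `one_hot_encode`, the default `k=3`, the fixed `length` argument, and the treatment of ambiguous bases are assumptions made for illustration; they are not taken from the repository's scripts and the settings used there may differ.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed nucleotide alphabet; other symbols are treated as ambiguous


def kmer_frequencies(sequence, k=3):
    """Return the relative frequency of every possible k-mer, in lexicographic order."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    # Slide a window of width k over the sequence and count each k-mer.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]


def one_hot_encode(sequence, length=None):
    """Return one 4-element row per position; ambiguous bases and padding become all-zero rows."""
    seq = sequence.upper()
    if length is not None:
        # Truncate or right-pad with N so every sequence has the same number of rows.
        seq = seq[:length].ljust(length, "N")
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


if __name__ == "__main__":
    example = "ACGTACGN"  # hypothetical sequence fragment
    print(len(kmer_frequencies(example, k=3)))   # 4**3 = 64 features
    print(one_hot_encode(example, length=10)[:2])
```

A k-mer vector has a fixed size of 4^k regardless of sequence length, which suits the frequency-based models, while the one-hot matrix keeps the positional order of bases that the recurrent and convolutional sequence models rely on.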
These models are combined in a single Jupyter notebook (models_notebook.ipynb). The notebook also contains the code required for preprocessing the data and labels, compiling and running the models, and saving and visualising the results.
The separate data-preprocessing and model-training scripts can be used instead of the full notebook when its memory requirements are too high for the user's system.