A machine learning exercise for predicting the taxon (taxonomy group) of a species from an input sequence, written in Python with the sklearn package.
The ML part is broken down into the following segments:
Multiple gb files of the 5.8S rRNA locus from different species of the order Asparagales were selected from the GenBank site, which hosts an annotated sequence database.
Because the sequences vary in length, a normalization was performed:
- The sequences were broken into tokens.
- The tokens were converted into fixed-length (N) count vectors using the CountVectorizer class from the sklearn package
Label data (the taxonomy group) of each species was taken from the gb files and converted to integer values with the sklearn LabelEncoder
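As a small illustration of the label step, the sketch below encodes a few family names and maps them back. The names are example Asparagales families chosen for the demonstration, not the actual labels used in training.

```python
from sklearn.preprocessing import LabelEncoder

# Example taxonomy labels (illustrative only).
labels = ["Orchidaceae", "Iridaceae", "Orchidaceae", "Asparagaceae"]

encoder = LabelEncoder()
y = encoder.fit_transform(labels)  # classes are numbered in sorted order

print(list(encoder.classes_))
print(list(y))
print(list(encoder.inverse_transform(y)))  # back to the original names
```

The same fitted encoder is needed at prediction time to turn the predicted integers back into taxonomy names.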
The predictor that was chosen is an SVM (sklearn's SVC)
(https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn-svm-svc)
A pipeline was created that includes the following steps, in this order:
- Sequence normalisation (as described above)
- Dimensionality reduction using PCA (due to the great length of the sequences)
- SVM
Only one pipeline parameter, the number of PCA output dimensions, was optimised, using GridSearchCV
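A pipeline of this shape might look like the sketch below. It is not the code from train.py: the toy data, the k-mer size, the step names, the grid values and the `to_dense` helper are all assumptions. The dense conversion is included because CountVectorizer returns a sparse matrix, which PCA does not accept in every sklearn version.

```python
import random

from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC


def to_dense(X):
    """Hypothetical helper: turn the sparse count matrix into a dense array."""
    return X.toarray()


# Toy data: two artificial groups with different base composition.
rng = random.Random(0)


def random_seq(alphabet):
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(20, 30)))


sequences = [random_seq("AC") for _ in range(6)] + [random_seq("GT") for _ in range(6)]
labels = [0] * 6 + [1] * 6  # already integer-encoded

pipeline = Pipeline([
    ("vectorize", CountVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=False)),
    ("dense", FunctionTransformer(to_dense)),
    ("pca", PCA()),
    ("svm", SVC()),
])

# Only the number of PCA output dimensions is searched.
search = GridSearchCV(pipeline, {"pca__n_components": [2, 3]}, cv=3)
search.fit(sequences, labels)

print(search.best_params_)
print(search.predict(sequences[:1]))
```

Putting the vectorizer inside the pipeline means raw sequence strings can be passed directly to `fit` and `predict`, and the cross-validation folds never see each other's vocabulary.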
The model and the label encoder were saved to binary files using pickle
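Saving and reloading the two objects could be sketched as below. Because the file names end in .joblib, this example assumes joblib for the serialization; `pickle.dump` and `pickle.load` would follow the same pattern. The tiny SVC and its one-dimensional data are stand-ins for the real fitted pipeline, and a temporary directory is used so that nothing is overwritten.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Stand-in encoder and model (illustrative data only).
encoder = LabelEncoder()
y = encoder.fit_transform(["Orchidaceae", "Iridaceae", "Orchidaceae", "Iridaceae"])
model = SVC().fit([[0.0], [1.0], [0.1], [0.9]], y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = str(Path(tmp) / "model_transformation.joblib")
    encoder_path = str(Path(tmp) / "encoder.joblib")

    # Write both objects to disk, then read them back.
    joblib.dump(model, model_path)
    joblib.dump(encoder, encoder_path)
    loaded_model = joblib.load(model_path)
    loaded_encoder = joblib.load(encoder_path)

# Predict with the reloaded model and decode the integer back to a name.
predicted = loaded_encoder.inverse_transform(loaded_model.predict([[0.05]]))
print(predicted)
```

Files like these should only be loaded from a trusted source, since unpickling can execute arbitrary code.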
- train.py - Python file that has the code for generating the actual model
- predict.py - Python file for using the model and making predictions
  - input : gb file (the file that contains the sequences)
  - output : file with predictions
- functions.py - general functions
- model_transformation.joblib - pickled model
- encoder.joblib - pickled encoder
Run in cmd: predict.py "filename"
filename must be a .gb file
Outputs a file named "pred.txt" with the predictions