ML_DNA_seq

Description

A machine learning exercise for predicting the taxa (taxonomy groups) of species from input DNA sequences, written in Python using the sklearn package.

Machine Learning:

The machine learning part is broken down into the following segments:

Getting the data

Multiple GenBank (.gb) files of the 5.8S rRNA locus from different species of the order Asparagales were selected from the GenBank site, which hosts a database of annotated sequences.
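As an illustration of what a .gb file provides, the sketch below pulls the sequence and the taxonomic lineage out of GenBank flat-file text using only the standard library. The sample record is made up, `parse_genbank` is a hypothetical helper, and the project itself may read its files differently (for example with a dedicated parsing library).

```python
import re

# A made-up record in GenBank flat-file layout (not real data).
SAMPLE_GB = """\
LOCUS       EXAMPLE01                 30 bp    DNA     linear   PLN 01-JAN-2000
DEFINITION  Hypothetical 5.8S ribosomal RNA gene, partial sequence.
SOURCE      Examplea plantus
  ORGANISM  Examplea plantus
            Eukaryota; Viridiplantae; Streptophyta; Asparagales;
            Orchidaceae.
FEATURES             Location/Qualifiers
ORIGIN
        1 acgtacgtac gtacgtacgt acgtacgtac
//
"""


def parse_genbank(text):
    """Return a list of (sequence, lineage) pairs, one per record.

    Hypothetical helper: handles only the ORGANISM and ORIGIN sections.
    """
    records = []
    # Records are terminated by a line containing only "//".
    for chunk in text.split("\n//"):
        origin = re.search(r"^ORIGIN.*\n", chunk, flags=re.MULTILINE)
        if origin is None:
            continue
        # Sequence lines mix position numbers, spaces and bases; keep letters only.
        sequence = re.sub(r"[^A-Za-z]", "", chunk[origin.end():]).upper()

        # The lineage sits on the indented lines that follow the ORGANISM line.
        lineage = []
        match = re.search(
            r"^  ORGANISM  .*\n((?: {12}.*\n)+)", chunk, flags=re.MULTILINE
        )
        if match:
            joined = " ".join(line.strip() for line in match.group(1).splitlines())
            lineage = [t.strip() for t in joined.rstrip(".").split(";") if t.strip()]
        records.append((sequence, lineage))
    return records


records = parse_genbank(SAMPLE_GB)
sequence, lineage = records[0]
print(sequence)
print(lineage)
```

Which level of the lineage serves as the label is not stated here, so any choice (such as the last entry) is an assumption.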

Normalisation

Because the sequences vary in length, they were normalised to fixed-length vectors:

The sequences were broken into tokens.
The tokens were converted into vectors of length N using the CountVectorizer class from the sklearn package.

The label data (taxonomy group) of each species was taken from the gb files and converted to ordinal values with sklearn's label encoder.
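A minimal sketch of this normalisation step follows. It assumes the tokens are overlapping 3-letter k-mers and that the label encoder is `LabelEncoder`; neither detail is specified above, and the sequences and taxon names are toy values.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Toy sequences of different lengths and made-up taxon labels.
sequences = ["ACGTACGTAC", "ACGTTTGACGTTAGC", "TTGACA"]
taxa = ["Orchidaceae", "Iridaceae", "Orchidaceae"]

# Assumption: tokens are overlapping 3-mers, produced by the character analyzer.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=False)
X = vectorizer.fit_transform(sequences)

# Every sequence now maps to a count vector of the same length N,
# where N is the number of distinct 3-mers seen during fitting.
print(X.shape)

# Assumption: LabelEncoder maps each taxon name to an integer code.
encoder = LabelEncoder()
y = encoder.fit_transform(taxa)
print(y)
print(encoder.inverse_transform(y))
```

Whatever the input lengths, each row of `X` has one column per token in the fitted vocabulary, which is what makes the data usable by a fixed-width model.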

Building the model

The chosen predictor is an SVM (support vector machine), implemented by sklearn's SVC class:
(https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn-svm-svc)

A pipeline was created that includes the following steps, in this order:

Sequence normalisation (as described above)
Dimensionality reduction using PCA (because of the large number of features the sequences produce)
SVM
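The three steps above can be sketched as a single sklearn `Pipeline`. This is an illustration under assumptions, not the code in train.py: the 3-mer tokenisation, the step names, `build_pipeline` and the toy data are all hypothetical. The extra `densify` step is also an assumption; it is there because `CountVectorizer` returns a sparse matrix, which `PCA` does not accept in many sklearn versions.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC


def to_dense(matrix):
    """Convert the sparse count matrix to a dense array for PCA."""
    return matrix.toarray()


def build_pipeline(n_components=2, kmer=3):
    """Hypothetical helper: vectorise, reduce dimensions, then classify."""
    return Pipeline([
        ("vectorize", CountVectorizer(analyzer="char",
                                      ngram_range=(kmer, kmer),
                                      lowercase=False)),
        ("densify", FunctionTransformer(to_dense, accept_sparse=True)),
        ("pca", PCA(n_components=n_components)),
        ("svm", SVC()),
    ])


# Toy data: AT-rich sequences as class 0, GC-rich sequences as class 1.
sequences = ["ATATATATTA", "ATTATATAAT", "TATATAATAT",
             "GCGCGGCCGC", "GGCGCGCCGG", "CGCGGCGCGC"]
labels = [0, 0, 0, 1, 1, 1]

pipeline = build_pipeline(n_components=2)
pipeline.fit(sequences, labels)
predictions = pipeline.predict(sequences)
print(predictions)
```

Bundling the steps means raw sequence strings go in and class codes come out, so the same transformations are applied at training and prediction time.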

Hyper-parameter optimisation

The pipeline parameters (only the number of PCA output dimensions) were optimised using GridSearchCV.
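A sketch of such a search is shown below. `GridSearchCV` addresses a parameter inside a pipeline as `<step name>__<parameter>`, so the PCA dimension becomes `pca__n_components`. The step names, the candidate values, the 2-fold split and the toy data are assumptions chosen to keep the example small.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC


def to_dense(matrix):
    """Convert the sparse count matrix to a dense array for PCA."""
    return matrix.toarray()


# Assumed pipeline layout: 3-mer counts -> dense -> PCA -> SVM.
pipeline = Pipeline([
    ("vectorize", CountVectorizer(analyzer="char", ngram_range=(3, 3),
                                  lowercase=False)),
    ("densify", FunctionTransformer(to_dense, accept_sparse=True)),
    ("pca", PCA()),
    ("svm", SVC()),
])

# Toy data: four AT-rich (class 0) and four GC-rich (class 1) sequences.
sequences = ["ATATATATTA", "ATTATATAAT", "TATATAATAT", "AATATTATAT",
             "GCGCGGCCGC", "GGCGCGCCGG", "CGCGGCGCGC", "CCGCGGCGCG"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Only the PCA output dimension is searched, as described above.
# Candidates stay below the 4 samples available in each training fold.
search = GridSearchCV(pipeline,
                      param_grid={"pca__n_components": [1, 2]},
                      cv=2)
search.fit(sequences, labels)
print(search.best_params_)
```

After fitting, `search.best_estimator_` holds the pipeline refitted on all the data with the selected dimension.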

Model persistence

The model and the ordinal data encoder were saved to binary files using pickle.
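The round trip can be sketched as follows, using the standard `pickle` module and the file names listed in this project, written to a temporary directory. The model here is a simplified stand-in (count vectors fed straight into an SVM, without PCA), and the data and taxon names are toy values.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Toy training data with made-up taxon names.
sequences = ["ATATATATTA", "ATTATATAAT", "GCGCGGCCGC", "GGCGCGCCGG"]
taxa = ["Iridaceae", "Iridaceae", "Orchidaceae", "Orchidaceae"]

encoder = LabelEncoder()
y = encoder.fit_transform(taxa)

# Simplified stand-in for the real pipeline (no PCA step).
model = Pipeline([
    ("vectorize", CountVectorizer(analyzer="char", ngram_range=(3, 3),
                                  lowercase=False)),
    ("svm", SVC()),
])
model.fit(sequences, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model_transformation.joblib"
    encoder_path = Path(tmp) / "encoder.joblib"

    # Save both objects: the encoder is needed to turn codes back into names.
    with open(model_path, "wb") as fh:
        pickle.dump(model, fh)
    with open(encoder_path, "wb") as fh:
        pickle.dump(encoder, fh)

    # Load them back, as a prediction script would.
    with open(model_path, "rb") as fh:
        restored_model = pickle.load(fh)
    with open(encoder_path, "rb") as fh:
        restored_encoder = pickle.load(fh)

codes = restored_model.predict(sequences)
names = restored_encoder.inverse_transform(codes)
print(list(names))
```

Pickled files execute code when loaded, so they should only be loaded from a trusted source.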

Important files in project

train.py - Python file with the code that generates the actual model
predict.py - Python file that uses the model to make predictions
1. input: a gb file (the file that contains the sequences)
2. output: a file with the predictions
functions.py - general functions
model_transformation.joblib - pickled model
encoder.joblib - pickled encoder

How to use

Run from the command line: predict.py "filename"

filename must be a .gb file.

The script outputs a file named "pred.txt" with the predictions.

royassis/dnaSeqML
