Skip to content

Supervised classification of various species DNA sequences using FFT and Machine Learning.

License

Notifications You must be signed in to change notification settings

DrStef/Machine-Learning-and-Digital-Signal-Processing-for-Genome-Classification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

91 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Machine Learning and Digital Signal Processing for Genome Classification

Project: Applying Digital Signal Processing: FFT, Spectrograms, Wavelets and Machine Learning/Deep Learning to genome classification.

In this project we develop effective methods for classifying genomes (DNA sequences) based on Digital Signal Processing, Machine Learning, Deep Learning. This is on-going research and results will be published on a regular basis.
As a starting point we analyzed the following paper:

"ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels" by Gurjit S. Randhawa , Kathleen A. Hill and Lila Kari. https://doi.org/10.1186/s12864-019-5571-y

Department of Computer Science, University of Western Ontario, London, ON, Canada

Their DNA sequence classification method: ML-DSP is very effective and matches or outperforms the acuuracy of best existing methods with DNA sequence alignment.
They collected a large number of DNA sequences, and built many datasets for validation purposes: Vertebrates, Fungi, Insects...
They thourughly report many classification results. That we will use as reference for our research.

Our objective is to develop methods that outperform the accuracy of ML-DSP with the most challenging datasets: Fungi, Protists, Insects.

Datasets

All datasets are available from Dr. Gurjit S. Randhawa Github repository:

https://github.com/grandhawa/MLDSP/tree/master/DataBase/

Data was extracted from the National Library of Medicine, National Center for Biotechnology Information (NCBI) website. The database can be searched and mitochondrial genome of various live species: plants, fungi... can be downloaded.
Illustration with : https://www.ncbi.nlm.nih.gov/labs/gquery/all/?term=NC_001224.1

ML-DSP approach

The authors propose (we cite) a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. They test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of > 97%.

Their original ML-DSP approach consists of: "The main idea behind ML-DSP is to combine supervised machine learning techniques with digital signal processing, for the purpose of DNA sequence classification. More precisely, for a given set $S={S_1,S_2,…,S_n}$ of n DNA sequences, ML-DSP uses:

      DNA numerical representations.
      Discrete Fourier Transform (DFT) of DNA numerical representations and extraction of spectrum magintudes $M_{i}$
      Pearson Correlation Coefficient (PCC) to compute the distance matrix of all pairwise distances for each pair of magnitude spectra $(M_i,M_j)$, where 1≤i,j≤n
      Supervised Machine Learning classifiers which take the pairwise distance matrix for a set of sequences.

Their results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy than method and is overall faster.

Our initial approach: ML-FFT

Methodology:

In the initial ML-FFT implementation we achieved 100% accuracy with the vertebrate dataset "birds-fish-mammals" by:

  • selecting the first NFFT=1024 points of each DNA sequence,
  • applying window and a very low frequency high-pass filter
  • feeding one-sided spectrum (frequency response) to Machine Learning classification algorithms: Logistic Regression, SVM.

This simple method did not work with more challenging datasets like the Fungi dataset.

Part I: ML-FFT - Birds - Fishes - Mammals DNA sequences classification

dataset

The Birds - Fishes - Mammals DNA sequences dataset is available here:

https://github.com/grandhawa/MLDSP/tree/master/DataBase/Birds-Fish-Mammals

Class Genomes
(count)
Birds 553
Fish 874
Mammals 2313

With this dataset, the authors achieve 100% accuracy with the ML-DSP method ! Results with our classification method are presented below.

Results

We selected the first NFFT=256, 512, 1024, 2048 in each DNA sequence and then computed the FFT spectrum.
We tested the ML-FFT approach with and without the spectrum phase. Some results are reported in the table below. It looks like the phase add some value with very short sequences NFFT=256, 512.
Optimal results were achieved with NFFT=1024 and 2048. The phase was not instrumental.
Without any particular pre-processing we achieve an accuracy close to 100% with Logistic Regression and SVM.
Like the authors, to measure the performance of such a classifier, we optimized hyperparameters and used the 10-fold cross-validation technique.

Approach Accuracy ML Technique
First NFFT= 512 points 97-98% SVM rbf kernel
First NFFT= 1024 Magnitude+Phase (1024 features) ~100% SVM linear kernel
First NFFT= 1024 Magnitude only (512 features) ~100% SVM rbf kernel

We display the best result below.

Drawing

Classification report

Confusion Matrix

  • The DNA sequence classification of vertebrates, with three different classes is not really a challenge.
  • Classification within a same class i.e datasets: Fungi, Insects, is much more challenging.
  • We tested other ML models: K Nearest Neighbor(KNN), Random Forest Classifier, Decision Tree with accuracy in the 96-98% range. Not as efficient as SVM. They are left as an exercise.

Jupyter Notebook:


Part II: ML-FFT + "Soft Align" - Fungi DNA sequences classification

Methodology

The methodology is described in the Jupyter Notebook Part IIa.

Fungi dataset

It is available here:

https://github.com/grandhawa/MLDSP/tree/master/DataBase/Fungi

Phylum Genomes
(count)
DNA sequence
(min-max length)
Basidiomycota 30 9745 / 235849
Pezizomycotina 104 1364 / 203051
Saccharomycotina 90 18844 / 107123

We collected additional DNA sequences on the NCBI website and we created extra datasets for running the soft-align procedure and identifying reference DNA frames.
The additional datasets are saved in folders:

  • A_Basidiomycota
  • A_Pezizomycotina
  • A_Saccharomycotina

A zip file of the amended fungi dataset folder: Fungi.zip can be dowloaded in the current repository.
This folder is required for running the following Jupyter notebook: "Part II a: Matching Fungi DNA sub-sequences with cross-correlation".

Results

For the challenging Fungi dataset, the simple ML-FFT method does not work. We introduce a soft alignment method ("Soft Align") where:

  • all frames length NFFT= 1024 points
  • we select a NFFT reference frame in each Fungi phylum (sub-phylum). Three reference frames are identified.
  • we select an optimal NFFT frame for each DNA sequence in each Fungi phylum. By comparison (cross-correlation) with the three reference frames.
  • then we applied the simple ML-FFT method on optimal NFFT frames .

The ML-FFT + Soft Align outperforms ML-DSP with accuracy between 96 to 98% in a reasonable time.

ML-FFT + Soft Align Approach Accuracy ML Technique
NFFT= 1024 points Magnitude only (512 features) 96% Logistic Regression newton-cg solver
SVM "rbf" solver
NFFT= 1024 points Magnitude + Phase (1024 features) 98% Logistic Regression newton-cg solver
Drawing

Classification report

Confusion Matrix


Jupyter Notebooks:

Soft Align method with cross-correlation. Determination of reference frames for each class.

Full classification after importing reference frames for each class.


Part III: Insects DNA sequences classification

Excellent results were achieved using the method described in the previous section. This part is left as an exercise.
We used a subset of the Insect DNA sequence dataset for the soft-align process. Which may introduce a bias. Collecting additional sequences on the website of the National Center for Biotechnology Information (NCBI) was tedious.
We may post the Jupyter notebook in the future.

Results

ML-FFT + Soft Alignement applied to insect dataset.

Drawing

Classification report

Confusion Matrix


Part IV: Protists DNA sequences classification

We did not test the ML-FFT + Soft Alignement method with the small protists dataset. More data is needed for the initial stage: DNA sequence alignement and definition of the reference DNA frames.