WISP : binning of bacterial long-read signatures

This Python program determines which taxa a bacterium belongs to from long reads (>10,000 bp), based solely on alignment-free methods. For now, the main focus is on k-mer proportions. It aims to bin a collection of samples, assigning a possible class to each read. It is a reworking of a Master 1 internship project, carried out in free time. The concept has not evolved since, but the code was redesigned to be easier to understand.

Currently, five taxonomic levels are implemented: domain, phylum, group, order and family. Once a model has finished at a given level, the program runs another iteration from the previous results, excluding non-matching reference genomes.
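The level-by-level iteration can be sketched as follows. This is a minimal illustration, not WISP's actual code: the `descend` function, the dictionary layout of the references and the `predict_level` callable (which stands in for a trained model) are all assumptions made for the example.

```python
# Taxonomic levels, from the most general to the most specific.
LEVELS = ["domain", "phylum", "group", "order", "family"]


def descend(references, predict_level):
    """Walk the taxonomy top-down, narrowing the reference set at each level.

    references: list of dicts mapping a level name to a taxon name
                (hypothetical layout, chosen for this sketch).
    predict_level: callable(level, candidates) -> predicted taxon name;
                   a placeholder for a trained model.
    """
    path = {}
    candidates = list(references)
    for level in LEVELS:
        taxon = predict_level(level, candidates)
        path[level] = taxon
        # Exclude reference genomes that do not match the prediction.
        candidates = [ref for ref in candidates if ref[level] == taxon]
        if not candidates:
            break  # nothing left to refine against
    return path, candidates
```

With such a loop, each lower-level model only has to separate the genomes that survived the previous level, which keeps the number of classes per model small.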

The core functionality relies on assigning class probabilities to reads, in order to discard those that might not be good indicators of the species to be determined. As with many other options, you can choose the ratio and the selection function that best suit your biological context.
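To make the idea of probability-based read selection concrete, here is a small sketch. It assumes, purely for illustration, that the selection function scores a read by the gap between its two highest class probabilities and that the ratio is a threshold on that score; the names `top_two_gap`, `select_reads` and `majority_class` are hypothetical and are not taken from WISP.

```python
from collections import Counter


def top_two_gap(probabilities):
    """Score a read by the gap between its best and second-best class.

    Requires at least two classes.
    """
    ordered = sorted(probabilities, reverse=True)
    return ordered[0] - ordered[1]


def select_reads(read_probabilities, ratio=0.5, selection=top_two_gap):
    """Keep only the reads whose selection score reaches the ratio."""
    return [p for p in read_probabilities if selection(p) >= ratio]


def majority_class(read_probabilities):
    """Return the index of the class that wins the most reads, or None."""
    votes = Counter(
        max(range(len(p)), key=p.__getitem__) for p in read_probabilities
    )
    return votes.most_common(1)[0][0] if votes else None


# One probability vector per read, one column per class.
reads = [
    [0.90, 0.05, 0.05],  # confident
    [0.40, 0.35, 0.25],  # ambiguous, filtered out
    [0.10, 0.80, 0.10],  # confident
    [0.85, 0.10, 0.05],  # confident
]
kept = select_reads(reads, ratio=0.5)
```

Passing a different callable as `selection` changes how reads are judged without touching the rest of the pipeline, which is one way a user-selectable function could be wired in.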

WISP is research software. If you use it, please cite the source code.

Installing software

git clone -b v0.1.0 --single-branch git@github.com:Tharos-ux/wisp.git
cd wisp/
python -m pip install . --quiet

Usage

usage: wisp [-h] [-l] {build,predict} ...

Bacteria family identification tool.

Subcommands:
  {build,predict}  Available subcommands
    build          Creates the database from the specified set of files.
    predict        Creates the samples and evaluates them.

Global Arguments:
  -h, --help       show this help message and exit
  -l, --locals     Display locals on error.

The build command creates models from a set of reference genomes.

usage: wisp build [-h] [-p PARAMETERS] database_name input_folder

positional arguments:
  database_name         Name for database
  input_folder          Input folder containing reference genomes

options:
  -h, --help            show this help message and exit
  -p PARAMETERS, --parameters PARAMETERS
                        Specifies a parameter file

The predict command predicts the taxonomy of samples from the computed models.

usage: wisp predict [-h] [-p PARAMETERS] database_name input_folder output_folder

positional arguments:
  database_name         Name for database
  input_folder          Input folder containing unknown genomes
  output_folder         Output folder for the prediction results

options:
  -h, --help            show this help message and exit
  -p PARAMETERS, --parameters PARAMETERS
                        Specifies a parameter file

Project architecture

All the binning code is in the workspace folder.

main.py is the main loop and argument parser
create_database.py contains the functions to index the reference genomes
create_model.py contains the functions to create XGBoost models from the index
create_sample.py contains the functions to create the dataset for the reads we want to predict
create_prediction.py contains the functions to make predictions on the sample dataset with the models

Two companion scripts are provided in the scripts folder.

download_refseq.py downloads the representative genomes listed in a RefSeq assembly file and annotates them with their classification (NCBI taxonomy)
visualize_output.py renders an HTML file with graphs from a .json file produced by the wisp predict command

URL to refseq assembly file for bacteria

Core idea

Given a set of reference genomes annotated with their taxonomy, this program samples a set of 10 kb reads from each reference genome. Each sample is split into k-mers, which are counted: these counts are the features for the model. We train the model on many samples in order to extract a k-mer signature for each species in our reference genomes. At prediction time, we apply the same treatment to the unknown reads and compute a list of classes that can be associated with the unknown sample. Once the upper classification level is determined, we move on to the next lower one, until we reach family. If the confidence score is high enough, we may explore multiple branches of the taxonomy.
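The sampling and feature-extraction steps can be illustrated with a short, standard-library-only sketch. It is not WISP's implementation: the function names, the default k, the random sampling scheme and the choice to ignore k-mers containing characters other than A, C, G and T are all assumptions made for the example.

```python
import random
from collections import Counter
from itertools import product


def sample_reads(genome, read_length=10_000, n_reads=5, seed=0):
    """Draw fixed-length substrings at random positions in a genome string."""
    if len(genome) < read_length:
        raise ValueError("genome is shorter than the requested read length")
    rng = random.Random(seed)
    last_start = len(genome) - read_length
    starts = [rng.randrange(last_start + 1) for _ in range(n_reads)]
    return [genome[s:s + read_length] for s in starts]


def kmer_proportions(read, k=4):
    """Return the proportion of each A/C/G/T k-mer, in lexicographic order.

    K-mers containing other characters (for example N) are ignored.
    """
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in vocabulary)
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


# Toy genome standing in for a reference sequence.
rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(50_000))
reads = sample_reads(genome)
features = [kmer_proportions(read, k=4) for read in reads]
```

Each read becomes a fixed-length vector (4^k values), so reads of slightly different lengths remain comparable and can be fed to the same classifier.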
