a signal-level demultiplexer for Oxford Nanopore reads

rrwick/Deepbinner

Deepbinner

Deepbinner is a tool for demultiplexing barcoded Oxford Nanopore sequencing reads. It does this with a deep convolutional neural network classifier, using many of the architectural advances that have proven successful in image classification. Unlike other demultiplexers (e.g. Albacore and Porechop), Deepbinner identifies barcodes from the raw signal (a.k.a. squiggle), which gives it greater sensitivity and fewer unclassified reads.

  • Reasons to use Deepbinner:
    • To minimise the number of unclassified reads (use Deepbinner by itself).
    • To minimise the number of misclassified reads (use Deepbinner in conjunction with Albacore demultiplexing).
    • You plan on running signal-level downstream analyses, like Nanopolish. Deepbinner can demultiplex the fast5 files which makes this easier.
  • Reasons to not use Deepbinner:
    • You only have basecalled reads, not the raw fast5 files (which Deepbinner requires).
    • You have a small/slow computer. Deepbinner is more computationally intensive than Porechop.
    • You used a sequencing/barcoding kit other than the ones Deepbinner was trained on.

You can read more about Deepbinner in this preprint:
Wick RR, Judd LM, Holt KE. Deepbinner: Demultiplexing barcoded Oxford Nanopore reads with deep convolutional neural networks. bioRxiv. 2018; doi:10.1101/366526.

2021 update

I developed Deepbinner almost three years ago, which is a very long time in the fast-moving space of Nanopore sequencing! Since then, a lot has changed, and for most users, Deepbinner is probably no longer the best choice for demultiplexing your Nanopore reads.

When Deepbinner was published, it had a clear advantage over sequence-based demultiplexing: demultiplexing from the raw signal gave better accuracy than demultiplexing from a basecalled sequence. But the last few years have seen substantial increases in Oxford Nanopore basecalling accuracy, which has made sequence-based demultiplexing more accurate as well, so Deepbinner's advantage has narrowed considerably. Guppy (Oxford Nanopore's production basecalling tool) has integrated sequence-based demultiplexing, which makes it very convenient to use. Also, Deepbinner's models are out of date: they cover only 12 barcodes, but up to 96 native barcodes are now available.

The short version is this: I think most users should demultiplex with Guppy, not Deepbinner. Guppy is easier to run and will do nearly as well as Deepbinner (probably, I haven't tested this quantitatively).

On a final note, I don't think that the concept of raw-signal-based demultiplexing with a neural network is obsolete. Raw signals always contain more information than basecalled sequences, and neural networks can make very good classifiers. In a perfect world, I'd like to see raw-signal neural-network demultiplexing integrated into Guppy – a feature request in case any Guppy developers are reading this 😄 So I will leave Deepbinner's repo in place for any intrepid users that might want to modify it, train custom models, etc. But consider it deprecated.

Requirements

Deepbinner runs on macOS and Linux and requires Python 3.5+.

Its most complex requirement is TensorFlow, which powers the neural network. TensorFlow can run on CPUs (easy to install, supported on many machines) or on NVIDIA GPUs (better performance). If you're only going to use Deepbinner to classify reads, you may not need GPU-level performance (read more here). But if you want to train your own Deepbinner neural network, then using a GPU is a necessity.

The simplest way to install TensorFlow for your CPU is with pip3 install tensorflow. Building TensorFlow from source may give slightly better performance (because it will use all instruction sets supported by your CPU) but the installation is more complex. If you are using Ubuntu and have an NVIDIA GPU, check out these instructions for installing TensorFlow with GPU support.

Deepbinner uses some other Python packages (Keras, NumPy and h5py) but these should be taken care of by pip when installing Deepbinner. It also assumes that you have gzip available on your command line. If you are going to train your own Deepbinner network, then you'll need a few more Python packages as well (see the training instructions).

If you are using multi-read fast5 files (new in 2019), then you'll also need to have the multi_to_single_fast5 tool installed on your path. You can get it here: github.com/nanoporetech/ont_fast5_api.

Installation

Install from source

You can install Deepbinner using pip, either from a local copy:

git clone https://github.com/rrwick/Deepbinner.git
pip3 install ./Deepbinner
deepbinner --help

Or directly from GitHub:

pip3 install git+https://github.com/rrwick/Deepbinner.git
deepbinner --help

Run without installation

Deepbinner can be run directly from its repository by using the deepbinner-runner.py script, no installation required:

git clone https://github.com/rrwick/Deepbinner.git
Deepbinner/deepbinner-runner.py -h

If you run Deepbinner this way, it's up to you to make sure that all necessary Python packages are installed.

Quick usage

Demultiplex native barcoding reads that are already basecalled:

deepbinner classify --native fast5_dir > classifications
deepbinner bin --classes classifications --reads basecalled_reads.fastq.gz --out_dir demultiplexed_reads

Demultiplex rapid barcoding reads that are already basecalled:

deepbinner classify --rapid fast5_dir > classifications
deepbinner bin --classes classifications --reads basecalled_reads.fastq.gz --out_dir demultiplexed_reads

Demultiplex native barcoding raw fast5 reads (potentially in real-time during a sequencing run):

deepbinner realtime --in_dir fast5_dir --out_dir demultiplexed_fast5s --native

Demultiplex rapid barcoding raw fast5 reads (potentially in real-time during a sequencing run):

deepbinner realtime --in_dir fast5_dir --out_dir demultiplexed_fast5s --rapid

The sample_reads.tar.gz file in this repository contains a small test set: six fast5 files and a FASTQ of their basecalled sequences. When classified with Deepbinner, you should get two reads each from barcodes 1, 2 and 3.

Available trained models

Deepbinner currently only provides pre-trained models for the EXP-NBD103 native barcoding expansion and the SQK-RBK004 rapid barcoding kit. See more details here.

If you have different data, then pre-trained models aren't available. If you have lots of existing data, you can train your own network. Alternatively, if you can share your data with me, I could train a model and make it available as part of Deepbinner. Let me know!

Using Deepbinner after basecalling

If your reads are already basecalled, then running Deepbinner is a two-step process:

  1. Classify reads using the fast5 files
  2. Organise the basecalled FASTQ reads into bins using the classifications

Step 1: classifying fast5 reads

This is accomplished using the deepbinner classify command, e.g.:

deepbinner classify --native fast5_dir > classifications

Since the native barcoding kit puts barcodes on both the start and end of reads, Deepbinner will look for both. Most reads should have a barcode at the start, but barcodes at the end are less common. If a read has conflicting barcodes at the start and end, it will be put in the unclassified bin. The --require_both option makes Deepbinner only bin reads with a matching start and end barcode, but this is very stringent and will result in far more unclassified reads. See more on the wiki: Combining start and end barcodes. None of this applies if you are using rapid barcoding reads (--rapid), as they only have a barcode at the start.
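
The start/end rules described above can be sketched as a small decision function. This only illustrates the stated behaviour and is not Deepbinner's actual code: the name `combine_calls`, its arguments, the use of `None` for "no barcode found" and the bin name `unclassified` are assumptions made for the example. It also assumes that, in the default mode, a barcode found at one end alone is enough to bin a read.

```python
from typing import Optional


def combine_calls(start: Optional[str], end: Optional[str],
                  require_both: bool = False) -> str:
    """Combine start and end barcode calls into a single bin name.

    Hypothetical helper: `None` means no barcode was found at that end.
    """
    if require_both:
        # Stringent mode: bin only when both ends carry the same barcode.
        if start is not None and start == end:
            return start
        return "unclassified"
    if start is not None and end is not None:
        # Conflicting barcodes send the read to the unclassified bin.
        return start if start == end else "unclassified"
    # Otherwise use whichever end has a barcode, if any (assumed behaviour).
    return start or end or "unclassified"


print(combine_calls("1", None))                     # start barcode only
print(combine_calls("1", "2"))                      # conflicting barcodes
print(combine_calls("1", None, require_both=True))  # stringent mode
```

The stringent branch shows why `--require_both` leaves more reads unclassified: any read without an end barcode is rejected, and end barcodes are the less common of the two.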

Here is the full usage for deepbinner classify.

Step 2: binning basecalled reads

This is accomplished using the deepbinner bin command, e.g.:

deepbinner bin --classes classifications --reads basecalled_reads.fastq.gz --out_dir demultiplexed_reads

This will leave your original basecalled reads in place, copying the sequences out to new files in your specified output directory. Both FASTA and FASTQ inputs are accepted, gzipped or not. Deepbinner will gzip the binned reads at the end of the process.
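
To make the binning step concrete, here is a minimal sketch of the idea: look up each read's barcode and group the FASTQ records accordingly. It is not Deepbinner's implementation. The two-column, tab-separated layout assumed for the classifications file, the function names and the `unclassified` fallback are all assumptions for illustration, and the sketch handles only uncompressed four-line FASTQ records held in memory.

```python
import io
from collections import defaultdict


def load_classes(handle):
    """Read a two-column, tab-separated table of read ID and barcode.

    The column layout is an assumption made for this sketch.
    """
    classes = {}
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            classes[parts[0]] = parts[1]
    return classes


def bin_fastq(handle, classes):
    """Group four-line FASTQ records by the barcode assigned to each read."""
    bins = defaultdict(list)
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0].strip():
            break  # end of file (or a trailing blank line)
        read_id = record[0][1:].split()[0]  # drop '@', keep the ID only
        bins[classes.get(read_id, "unclassified")].append("".join(record))
    return dict(bins)


classes = load_classes(io.StringIO("read_a\t1\nread_b\t2\n"))
fastq = io.StringIO("@read_a runid=x\nACGT\n+\n!!!!\n"
                    "@read_b\nGGCC\n+\n####\n"
                    "@read_c\nTTAA\n+\n$$$$\n")
for barcode, records in sorted(bin_fastq(fastq, classes).items()):
    print(barcode, len(records))
```

Because records are copied into the bins and the input is never modified, this mirrors the behaviour described above: the original reads stay where they are.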

Here is the full usage for deepbinner bin.

Using Deepbinner before basecalling

If you haven't yet basecalled your reads, you can use deepbinner realtime to bin the fast5 files, e.g.:

deepbinner realtime --in_dir fast5s --out_dir demultiplexed_fast5s --native

This command will move (not copy) fast5 files from the --in_dir directory to the --out_dir directory. As the command name suggests, this can be run in real-time – Deepbinner will watch the input directory and wait for new reads. Just set --in_dir to where MinKNOW deposits its reads. Or if you sequence on a laptop and copy the reads to a server, you can run Deepbinner on the server, watching the directory where the reads are deposited. Use Ctrl-C to stop it.
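
The move-into-bins behaviour can be pictured with the sketch below, which performs a single pass over an input directory. It is an illustration under stated assumptions rather than Deepbinner's code: `move_new_reads` and the `classify` callable are hypothetical, and the dummy classifier stands in for the neural network.

```python
import shutil
import tempfile
from pathlib import Path


def move_new_reads(in_dir, out_dir, classify):
    """Move each fast5 file in `in_dir` into `out_dir/<bin>/`.

    `classify` is a hypothetical callable mapping a file path to a bin name;
    it stands in for the neural network classifier.
    """
    moved = 0
    for path in sorted(Path(in_dir).glob("*.fast5")):
        bin_dir = Path(out_dir) / classify(path)
        bin_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(bin_dir / path.name))  # move, not copy
        moved += 1
    return moved


# Demonstration on empty placeholder files with a dummy classifier.
with tempfile.TemporaryDirectory() as tmp:
    in_dir = Path(tmp) / "fast5s"
    out_dir = Path(tmp) / "demultiplexed_fast5s"
    in_dir.mkdir()
    for name in ("a.fast5", "b.fast5"):
        (in_dir / name).touch()
    print(move_new_reads(in_dir, out_dir, lambda p: "barcode01"))
```

A real-time watcher would call a function like this repeatedly, sleeping between passes, until interrupted with Ctrl-C. Files that have been moved no longer appear in the input directory, so later passes see only new reads.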

This command doesn't have to be run in real-time – it works just as well on a directory of fast5 files from a finished sequencing run.

Here is the full usage for deepbinner realtime (many of the same options as the classify command).

Using Deepbinner with Albacore demultiplexing

If you use both Deepbinner and Albacore to demultiplex reads, only keeping reads for which both tools agree on the barcode, you can achieve very low rates of misclassified reads (high precision, positive predictive value) but a larger proportion of reads will not be classified (put into the 'none' bin). This is what I usually do with my sequencing runs!

The easiest way to achieve this is to follow the Using Deepbinner before basecalling instructions above. Then run Albacore separately on each of Deepbinner's output directories, with its --barcoding option on. You should find that for each bin, Albacore puts most of the reads in the same bin (the reads we want to keep), some in the unclassified bin (slightly suspect reads, likely with lower quality basecalls) and a small number in a different bin (very suspect reads).
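
The agreement rule itself is simple, and the sketch below shows it on in-memory data. The function `consensus_bins` and the dictionaries mapping read ID to barcode name are hypothetical. In practice the two sets of calls come from Deepbinner's output directories and Albacore's barcoding output, as described above.

```python
def consensus_bins(deepbinner_calls, albacore_calls):
    """Keep a barcode only when both tools report the same one.

    Both arguments are hypothetical dicts mapping read ID to barcode name.
    Reads on which the tools disagree, or that either tool left without a
    barcode, go to the 'none' bin.
    """
    result = {}
    for read_id, barcode in deepbinner_calls.items():
        agreed = barcode != "none" and albacore_calls.get(read_id) == barcode
        result[read_id] = barcode if agreed else "none"
    return result


deepbinner_calls = {"read_a": "1", "read_b": "2", "read_c": "3"}
albacore_calls = {"read_a": "1", "read_b": "none", "read_c": "4"}
print(consensus_bins(deepbinner_calls, albacore_calls))
```

Here only `read_a` keeps its barcode. `read_b` (unclassified by the second tool) and `read_c` (conflicting calls) both end up in the 'none' bin, which is the trade of recall for precision that the paragraph above describes.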

Here are some instructions and Bash code to carry this out automatically.

Using Deepbinner with multi-read fast5s

Multi-read fast5s complicate the matter for Deepbinner: if one fast5 file contains reads from more than one barcode, then it cannot simply be moved into a bin. The simplest solution is to first run the multi_to_single_fast5 tool available in the ont_fast5_api before running Deepbinner. This is necessary if you are running the deepbinner classify command.

If you are running the deepbinner realtime command, then Deepbinner can handle multi-read fast5 files. It will run the multi_to_single_fast5 tool putting the single-read fast5s into a temporary directory, and then move the single-read fast5s into bins in the output directory. However, unlike running deepbinner realtime on single-read fast5s, where the fast5s are moved into the destination directory, running it on multi-read fast5s will leave the original input files in place (because it's the unpacked single-read fast5s which are moved). So you might want to delete the multi-read fast5s after Deepbinner finishes to save disk space.
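
For the deepbinner classify case, the unpacking step could be scripted along these lines. This is a sketch, not part of Deepbinner: the helper names and the directory names `multi_fast5s` and `single_fast5s` are hypothetical, and the `--input_path` and `--save_path` options are those documented by ont_fast5_api, so check them against your installed version.

```python
import shutil
import subprocess


def multi_to_single_command(multi_dir, single_dir):
    """Build the multi_to_single_fast5 command line for a directory."""
    return ["multi_to_single_fast5",
            "--input_path", str(multi_dir),
            "--save_path", str(single_dir)]


def unpack_multi_fast5s(multi_dir, single_dir):
    """Run multi_to_single_fast5, failing early if it is not installed."""
    command = multi_to_single_command(multi_dir, single_dir)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("multi_to_single_fast5 is not on PATH")
    subprocess.run(command, check=True)


# Show the command that would be run (hypothetical directory names).
print(" ".join(multi_to_single_command("multi_fast5s", "single_fast5s")))
```

After unpacking, deepbinner classify can be pointed at the single-read directory. The original multi-read files are left untouched by this step as well.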

Performance

Deepbinner lives up to its name by using a deep neural network. It's therefore not particularly fast, but should be fast enough to keep up with a typical MinION run. If you want to squeeze out a bit more performance, try adjusting the 'Performance' options. Read more here for a detailed description of these options. In my tests, it can classify about 15 reads/sec using 12 threads (the default). Giving it more threads helps a little, but not much.

Building TensorFlow from source may give better performance (because it can then use all available instruction sets on your CPU). Running TensorFlow on a GPU will definitely give better Deepbinner performance: my tests on a Tesla K80 could classify over 100 reads/sec.
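
To put those rates in perspective, the arithmetic below estimates classification time for a run. The run size of one million reads is a hypothetical figure chosen for illustration. The rates are the ones quoted above, and since the GPU figure is "over 100 reads/sec", the GPU estimate is an upper bound on time.

```python
def classification_hours(read_count, reads_per_second):
    """Estimated wall-clock hours to classify `read_count` reads."""
    return read_count / reads_per_second / 3600


reads = 1_000_000  # hypothetical run size
for label, rate in (("CPU, 12 threads", 15), ("Tesla K80 GPU", 100)):
    print(f"{label}: {classification_hours(reads, rate):.1f} h")
```

For this hypothetical run, that works out to roughly 18.5 hours on the CPU versus about 2.8 hours on the GPU.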

Training

You can train your own neural network with Deepbinner, but you'll need two things:

  • Lots of training data using the same barcoding and sequencing kits. More is better, so ideally from more than one sequencing run.
  • A fast computer to train on, ideally with TensorFlow running on a big GPU.

If you can meet those requirements, then read on in the Deepbinner training instructions!

Contributing

As always, the wider community is welcome to contribute to Deepbinner by submitting issues or pull requests.

I also have a particular need for one kind of contribution: training reads! The lab where I work has mainly used R9.4/R9.5 flowcells with the SQK-LSK108 kit. If you have other types of reads that you can share, I'd be interested (see here for more info).

Acknowledgments

I would like to thank James Ferguson from the Garvan Institute. We met at the Nanopore Day Melbourne event in February 2018 where I saw him present on raw signal detection of barcodes. It was then that the seeds of Deepbinner were sown!

I'm also in debt to Matthew Croxen for sharing his SQK-RBK004 rapid barcoding reads with me – they were used to build Deepbinner's pre-trained model for that kit.

License

GNU General Public License, version 3
