Important note: this is a toy dataset that should not be used to draw any biological conclusions! It was produced artificially so that it returns a small number of bins in a short amount of time but that is all.
Important note II (for people familiar with git): This repository contains both input files and partially computed output files, so you can run any step in any order. However, if you are using the git checkout, running the steps will overwrite some of the existing files and it will look as if changes were made: many of SemiBin's operations are probabilistic, and running them twice will return different results.
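If you want to throw those changes away and return the checkout to its committed state, plain git commands are enough. A minimal demonstration in a throwaway repository (inside the actual tutorial checkout you would only need the `git status --short` and `git restore` lines):

```shell
# Demonstration in a scratch repository; in the real tutorial checkout
# you would only run the last three commands.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email you@example.com
git config user.name you
echo original > data.csv
git add data.csv
git commit -qm 'initial'
echo overwritten > data.csv   # simulates a SemiBin run rewriting an output file
git status --short            # lists data.csv as modified
git restore data.csv          # discard the change; the committed file is back
```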
conda create -n semibin_tutorial
conda install -n semibin_tutorial -c conda-forge -c bioconda semibin
conda activate semibin_tutorial
Installing any recent version of the semibin package installs both a SemiBin command (corresponding to the old SemiBin1) and the new SemiBin2 command. You can test that the installation works by running
SemiBin2 check_install
Step 1. Generate features (15 sec)
SemiBin2 generate_sequence_features_single \
--input-fasta single_sample_binning/single.fasta \
--input-bam single_sample_binning/single.sorted.bam \
--output single_output
Step 2. (optional) Train a model (2-10 mins)
SemiBin2 train_self \
--data single_output/data.csv \
--data-split single_output/data_split.csv \
--output single_output
The time taken by this step can vary quite a lot depending on your hardware. You can add --epochs 1 for testing (but the model will not be very good!).
Step 3 (option 1: use a pretrained model). Binning (30 sec)
With a pretrained model, you do not need to train a model.
SemiBin2 bin_short \
--environment human_gut \
--data single_output/data.csv \
--input-fasta single_sample_binning/single.fasta \
--output single_output
Step 3 (option 2: use the model trained in Step 2). Binning (30 secs)
SemiBin2 bin_short \
--model single_output/model.h5 \
--data single_output/data.csv \
--input-fasta single_sample_binning/single.fasta \
--output single_output
Do everything with a single command. If you provide an environment (e.g., using --environment human_gut), then that environment will be used:
SemiBin2 single_easy_bin \
--environment human_gut \
--input-fasta single_sample_binning/single.fasta \
--input-bam single_sample_binning/single.sorted.bam \
--output easy_out
Otherwise, a new model will be trained:
SemiBin2 single_easy_bin \
--input-fasta single_sample_binning/single.fasta \
--input-bam single_sample_binning/single.sorted.bam \
--output easy_out
There is also a multi_easy_bin command for multi-sample binning. These are wrapper commands over the more complex pipelines above. They are provided for convenience, but the internal process is exactly the same as running SemiBin2 step by step.
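For reference, a multi-sample call might look like the following. This is a sketch only: the flag names are assumed by analogy with single_easy_bin above, so confirm them with SemiBin2 multi_easy_bin --help before relying on them. The command is assembled as a string so it can be inspected first:

```shell
# Sketch of a multi_easy_bin invocation (flags assumed by analogy with
# single_easy_bin; confirm with `SemiBin2 multi_easy_bin --help`).
# Assembled as a string for inspection; execute it with: eval "$cmd"
cmd='SemiBin2 multi_easy_bin \
    --input-fasta multi_sample_binning/concatenated.fa.gz \
    --input-bam multi_sample_binning/*.bam \
    --output multi_easy_out'
echo "$cmd"
```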
If you have long reads, replace the bin_short subcommand with the bin_long subcommand. Or, if you are using the easy binning subcommands, add --sequencing-type long_reads:
SemiBin2 single_easy_bin \
--environment human_gut \
--sequencing-type long_reads \
--input-fasta single_sample_binning/single.fasta \
--input-bam single_sample_binning/single.sorted.bam \
--output easy_out_long
Within the basic framework above, we now consider a few different variations, which can lead to better results.
You can train from many samples using the --train-from-many flag. This will (1) take longer and (2) lead to better models.
In this case, we are using the same sample twice, as a demonstration only:
SemiBin2 train_self \
--train-from-many \
--data single_output/data.csv single_output/data.csv \
--data-split single_output/data_split.csv single_output/data_split.csv \
--output single_output
This can be meaningful if you have either co-assembled the samples or performed cross-mappings (i.e., mapped reads from multiple samples to your sample of interest).
You still use the generate_sequence_features_single subcommand to generate the features:
SemiBin2 generate_sequence_features_single \
--input-fasta coassembly_binning/coassembly.fasta \
--input-bam coassembly_binning/*.bam \
--output coassembly_output
This is a more complex approach, but it can obtain very good results.
Step 1. Generate a combined FASTA file
SemiBin2 concatenate_fasta \
--input-fasta multi_sample_binning/S*.fna \
--output multi_output
Step 2. Map all your samples to that combined FASTA file (use an external tool).
For completeness, we provide a script to do it with NGLess, but you can use other tools: [How to map to the concatenated FASTA file using NGLess]
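As an illustration of what this step needs to produce (one sorted BAM file per sample), here is a sketch using minimap2 and samtools; the tool choice and the reads/ file layout are assumptions, and any aligner that yields a sorted BAM per sample works. The loop writes the commands to a file for inspection instead of executing them:

```shell
# Write one mapping pipeline per sample to a file (dry run). minimap2,
# samtools, and the reads/${sample}_?.fq.gz layout are assumptions.
# Inspect the file, then execute it with: sh mapping_commands.txt
for sample in S1 S2 S3 S4 S5 ; do
    echo "minimap2 -ax sr multi_sample_binning/concatenated.fa.gz reads/${sample}_1.fq.gz reads/${sample}_2.fq.gz | samtools sort -o multi_sample_binning/${sample}.sorted.bam -"
done > mapping_commands.txt
cat mapping_commands.txt
```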
Step 3. Generate features for multi-sample binning
Once you have produced a sorted BAM file per sample, you can use generate_sequence_features_multi to generate features:
SemiBin2 generate_sequence_features_multi \
--input-fasta multi_sample_binning/concatenated.fa.gz \
--input-bam multi_sample_binning/*.bam \
--output multi_output
SemiBin2 will generate data.csv and data_split.csv for every sample.
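Before moving on, it can be worth checking that both files really were produced for each sample. A small helper (the multi_output/samples/<sample>/ layout is taken from the step above):

```shell
# Report whether both feature files exist for a given sample
# (directory layout as produced by generate_sequence_features_multi above).
check_sample () {
    for f in data.csv data_split.csv ; do
        if [ -f "multi_output/samples/$1/$f" ] ; then
            echo "OK $1/$f"
        else
            echo "MISSING $1/$f"
        fi
    done
}
for sample in S1 S2 S3 S4 S5 ; do
    check_sample "$sample"
done
```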
Step 4. Training
You need to train a model for every sample. For example, for sample S1:
SemiBin2 train_self \
--data multi_output/samples/S1/data.csv \
--data-split multi_output/samples/S1/data_split.csv \
--output S1_output
This will take a long time. You can add --epochs 1 for testing (but the model will not be very good!).
You can also run this in a loop with bash:
for sample in S1 S2 S3 S4 S5 ; do
    SemiBin2 train_self \
        --data multi_output/samples/${sample}/data.csv \
        --data-split multi_output/samples/${sample}/data_split.csv \
        --output ${sample}_output
done
Step 5. Binning
Again, you run it separately for every sample:
SemiBin2 bin_short \
--input-fasta multi_sample_binning/S1.fna \
--model S1_output/model.h5 \
--data multi_output/samples/S1/data.csv \
--output output
or in a loop:
for sample in S1 S2 S3 S4 S5 ; do
    # write each sample's bins to its own output directory so the
    # samples do not overwrite each other
    SemiBin2 bin_short \
        --input-fasta multi_sample_binning/${sample}.fna \
        --model ${sample}_output/model.h5 \
        --data multi_output/samples/${sample}/data.csv \
        --output ${sample}_output
done