BigDataBiology/SemiBin_tutorial

SemiBin2 tutorial

Important note: this is a toy dataset that should not be used to draw any biological conclusions! It was generated artificially so that the tutorial yields a small number of bins in a short amount of time, and nothing more.

Important note II (for people familiar with git): This repository contains both input files and partially computed output files, which lets you run any step in any order. However, if you are working in a git checkout, running the steps will overwrite some of the tracked files, so git will report changes. This is expected: many of SemiBin's operations are probabilistic, so running it twice returns different results.

Creating an environment with SemiBin2 installed

conda create -n semibin_tutorial
conda install -n semibin_tutorial -c conda-forge -c bioconda semibin
conda activate semibin_tutorial

Installing any recent version of the semibin package installs both a SemiBin command (corresponding to the old SemiBin1) and a new SemiBin2 command. You can test that the installation works by running:

SemiBin2 check_install
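
Before running check_install, it can help to confirm that the SemiBin2 executable is actually on your PATH (for instance, that the conda environment is active). The snippet below is a minimal sketch of such a check; the function name semibin2_available is a hypothetical helper, not part of SemiBin.

```shell
# Hypothetical helper: succeed only if the SemiBin2 executable
# is reachable from the current shell.
semibin2_available() {
    command -v SemiBin2 >/dev/null 2>&1
}

if semibin2_available; then
    echo "SemiBin2 found at: $(command -v SemiBin2)"
else
    echo "SemiBin2 not found; try: conda activate semibin_tutorial"
fi
```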

How to run SemiBin2

Single sample mode

Step 1. Generate features (15 sec)

SemiBin2 generate_sequence_features_single \
    --input-fasta single_sample_binning/single.fasta \
    --input-bam single_sample_binning/single.sorted.bam \
    --output single_output

Step 2. (optional) Train a model (2-10 mins)

SemiBin2 train_self \
    --data single_output/data.csv \
    --data-split single_output/data_split.csv \
    --output single_output

The time taken by this step can vary by quite a lot depending on your hardware.

You can add --epochs 1 for testing (but the model will not be very good!).

Step 3 (option 1: use pretrained model). Binning (30 sec)

With a pretrained model, you do not need to train a model.

SemiBin2 bin_short \
    --environment human_gut \
    --data single_output/data.csv \
    --input-fasta single_sample_binning/single.fasta \
    --output single_output

Step 3 (option 2: use model trained in step 2). Binning (30 sec)

SemiBin2 bin_short \
    --model single_output/model.h5 \
    --data single_output/data.csv \
    --input-fasta single_sample_binning/single.fasta \
    --output single_output
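
The two options for step 3 differ only in whether bin_short receives --model or --environment. As an illustration, the sketch below picks between them depending on whether a self-trained model.h5 exists. The helper name model_args and its fallback argument are assumptions made for this example, and it assumes paths without spaces.

```shell
# Hypothetical helper: print the model-selection arguments for bin_short.
# $1: directory that may contain a self-trained model.h5
# $2: pretrained environment to fall back on (defaults to human_gut)
model_args() {
    if [ -f "$1/model.h5" ]; then
        echo "--model $1/model.h5"
    else
        echo "--environment ${2:-human_gut}"
    fi
}

# Show which arguments would be used for the tutorial's output directory.
model_args single_output
```

The printed arguments can then be spliced into the command unquoted, as in: SemiBin2 bin_short $(model_args single_output) --data single_output/data.csv --input-fasta single_sample_binning/single.fasta --output single_output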

Easy mode(s)

Do everything with a single command. If you provide an environment (e.g., using --environment human_gut), then that environment will be used:

SemiBin2 single_easy_bin \
    --environment human_gut \
    --input-fasta single_sample_binning/single.fasta \
    --input-bam single_sample_binning/single.sorted.bam \
    --output easy_out

Otherwise, a new model will get trained:

SemiBin2 single_easy_bin \
    --input-fasta single_sample_binning/single.fasta \
    --input-bam single_sample_binning/single.sorted.bam \
    --output easy_out

There is also a multi_easy_bin command for multi-sample binning.

These are wrapper commands over the more complex pipelines above. They are provided for convenience but the internal process is exactly the same as running SemiBin2 step-by-step.

Long-reads algorithms

If you have long reads, replace the bin_short subcommand with the bin_long subcommand.

Or, if you are using the easy binning subcommands, add --sequencing-type long_reads:

SemiBin2 single_easy_bin \
    --environment human_gut \
    --sequencing-type long_reads \
    --input-fasta single_sample_binning/single.fasta \
    --input-bam single_sample_binning/single.sorted.bam \
    --output easy_out_long
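
If you script the step-by-step pipeline for both kinds of data, the choice between bin_short and bin_long can be made in one place. The following is a minimal sketch; the helper name bin_subcommand and the labels short_reads and long_reads are this example's own convention (the latter chosen to mirror the --sequencing-type value used above).

```shell
# Hypothetical helper: map a sequencing-type label to the binning subcommand.
bin_subcommand() {
    case "$1" in
        short_reads) echo "bin_short" ;;
        long_reads)  echo "bin_long" ;;
        *)
            echo "unknown sequencing type: $1" >&2
            return 1
            ;;
    esac
}

# Print the subcommand that would be used for long reads.
bin_subcommand long_reads
```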

Variations

Within the basic framework above, we now consider a few variations that can lead to better results.

Variation 1: Training a model from multiple samples

You can train from many samples using the --train-from-many flag. This will (1) take longer and (2) lead to better models.

In this case, we are using the same sample twice, as a demonstration only:

SemiBin2 train_self \
    --train-from-many \
    --data single_output/data.csv single_output/data.csv \
    --data-split single_output/data_split.csv single_output/data_split.csv \
    --output single_output
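
With real data you would pass one data.csv and one data_split.csv per sample, in the same order. The sketch below assembles such a command line from a list of feature directories and prints it for inspection rather than running it. The function name train_from_many_cmd is hypothetical, and directory names are assumed to contain no spaces.

```shell
# Hypothetical helper: print a train_self command that trains from many samples.
# $1: output directory; remaining arguments: per-sample feature directories.
train_from_many_cmd() {
    output_dir=$1
    shift
    data_files=""
    split_files=""
    for dir in "$@"; do
        # Keep both lists in the same sample order.
        data_files="$data_files $dir/data.csv"
        split_files="$split_files $dir/data_split.csv"
    done
    echo "SemiBin2 train_self --train-from-many --data$data_files --data-split$split_files --output $output_dir"
}

# Reproduce the demonstration above (the same sample listed twice).
train_from_many_cmd single_output single_output single_output
```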

Variation 2: Bin a single set of contigs using multiple samples for abundance estimation.

This can be meaningful if you have either co-assembled the samples or performed cross-mappings (i.e., mapped reads from multiple samples to your sample of interest).

You still use the generate_sequence_features_single subcommand to generate the features.

SemiBin2 generate_sequence_features_single \
    --input-fasta coassembly_binning/coassembly.fasta \
    --input-bam coassembly_binning/*.bam \
    --output coassembly_output
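
Because the BAM files are passed through a shell glob, a pattern that matches nothing is handed to the program as a literal string. A small guard, sketched below, can catch that before the feature-generation step starts. The function name require_files is a hypothetical helper for this example.

```shell
# Hypothetical helper: succeed only if every argument is an existing,
# non-empty file (an unmatched glob arrives as a literal and fails here).
require_files() {
    found=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            found=$((found + 1))
        else
            echo "missing or empty: $f" >&2
        fi
    done
    [ "$found" -gt 0 ] && [ "$found" -eq "$#" ]
}

if require_files coassembly_binning/*.bam 2>/dev/null; then
    echo "all BAM inputs present"
else
    echo "BAM inputs missing"
fi
```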

Variation 3: Multi-sample binning (1 min)

This is a more complex approach, but can obtain very good results.

Step 1. Generate a combined FASTA file

SemiBin2 concatenate_fasta \
    --input-fasta multi_sample_binning/S*.fna \
    --output multi_output

Step 2. Map all your samples to that combined FASTA file (use an external tool).

For completeness, we provide a script to do it with NGLess, but you can use other tools: [How to map to the concatenated FASTA file using NGLess]

Step 3. Generate features for multi-sample binning

Once you have produced a sorted BAM file per sample, you can use generate_sequence_features_multi to generate features:

SemiBin2 generate_sequence_features_multi \
    --input-fasta multi_sample_binning/concatenated.fa.gz \
    --input-bam multi_sample_binning/*.bam \
    --output multi_output

SemiBin2 will generate data.csv and data_split.csv for every sample, under multi_output/samples/<sample>/ (for example, multi_output/samples/S1/data.csv).
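
Before moving on to training, you may want to confirm that both files exist for each sample. The sketch below lists any sample for which one of them is missing. It assumes the multi_output/samples/<sample>/ layout used by the training commands in this tutorial, and samples_missing_features is a hypothetical helper name.

```shell
# Hypothetical helper: print the names of samples that lack
# data.csv or data_split.csv.
# $1: feature output directory; remaining arguments: sample names.
samples_missing_features() {
    features_dir=$1
    shift
    for sample in "$@"; do
        sample_dir="$features_dir/samples/$sample"
        if [ ! -f "$sample_dir/data.csv" ] || [ ! -f "$sample_dir/data_split.csv" ]; then
            echo "$sample"
        fi
    done
}

# Prints nothing when all five tutorial samples have their feature files.
samples_missing_features multi_output S1 S2 S3 S4 S5
```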

Step 4. Training

You need to train a model for every sample. For example, for sample S1:

SemiBin2 train_self \
    --data multi_output/samples/S1/data.csv \
    --data-split multi_output/samples/S1/data_split.csv \
    --output S1_output

This will take a long time. You can add --epochs 1 for testing (but the model will not be very good!).

You can also run this in a loop with bash:

for sample in S1 S2 S3 S4 S5 ; do
    SemiBin2 train_self \
        --data multi_output/samples/${sample}/data.csv \
        --data-split multi_output/samples/${sample}/data_split.csv \
        --output ${sample}_output
done
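
The loop above hard-codes the sample names. As an alternative, the names can be derived from the per-sample FASTA files, as in the sketch below. It assumes the S*.fna naming used in this tutorial and sample names without whitespace; list_samples is a hypothetical helper name.

```shell
# Hypothetical helper: print one sample name per S*.fna file in a directory.
list_samples() {
    for f in "$1"/S*.fna; do
        # Skip the literal pattern when the glob matches nothing.
        [ -e "$f" ] || continue
        basename "$f" .fna
    done
}

# Dry run: show which feature file each training run would read.
for sample in $(list_samples multi_sample_binning); do
    echo "would train on multi_output/samples/${sample}/data.csv"
done
```

Replacing the echo line with the SemiBin2 train_self command shown above turns the dry run into the real loop.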

Step 5. Binning

Again, you run it separately for every sample:

SemiBin2 bin_short \
    --input-fasta multi_sample_binning/S1.fna \
    --model S1_output/model.h5 \
    --data multi_output/samples/S1/data.csv \
    --output S1_output

Or in a loop, writing each sample's bins to its own directory so that samples do not overwrite one another:

for sample in S1 S2 S3 S4 S5 ; do
    SemiBin2 bin_short \
        --input-fasta multi_sample_binning/${sample}.fna \
        --model ${sample}_output/model.h5 \
        --data multi_output/samples/${sample}/data.csv \
        --output ${sample}_output
done
