This is a work-in-progress collection of scripts for the refinement of metagenome-assembled genomes (MAGs) using a combination of emergent self-organising maps (ESOM) and machine-learning algorithms. It is not intended to work as a single all-in-one package, as there are too many stages of refinement that require user intervention and judgement calls. Instead, this repository is a collection of scripts that are called together to walk through a series of pre-computed bins and coverage files to arrive at a set of refined bins.
The documentation will improve as scripts are finalised and the workflow is validated.
The idea behind this analysis is that you have obtained a metagenomic assembly and are trying to identify clusters of contigs that belong to the same organism, or group of closely related organisms. There are a number of excellent automated pipelines for the recovery of these organism bins (MAGs), and the outputs of these pipelines are a good place to begin this workflow.
Four binning tools I frequently use are listed below, with an example invocation sketched after the list:
- MetaBAT (Kang et al., 2015)
- MaxBin (Wu et al., 2014)
- CONCOCT (Alneberg et al., 2014)
- GroopM (Imelfort et al., 2014)
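For illustration, initial bins like those used below are typically produced by running one of these tools over the assembly and per-sample coverage. A minimal sketch with MetaBAT 2, assuming sorted BAM files from the mapping step below (the output names here are hypothetical):

# Summarise per-contig coverage from the sorted BAM files, then bin
jgi_summarize_bam_contig_depths --outputDepth metabat_depth.txt sample1.bam sample2.bam sample3.bam sample4.bam
metabat2 -i data/spades_assembly.m1000.fna -a metabat_depth.txt -o initial_bins/bin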
This workflow employs biology-agnostic clustering techniques to evaluate the evidence for bin formation and membership; it is not intended as a replacement for any of these tools. These scripts are for refining problematic bins and verifying the quality of bins obtained from these software suites.
Note - technically you could perform binning with this workflow, but I wouldn't recommend it.
This example uses the initial bins produced in the Genomics Aotearoa Metagenomics Summer School.
Start by mapping the reads to the assembly to produce the coverage table required for bin refinement. There are two binned data sets here: the raw bins (bin_*.fna) and those that have been through DAS_Tool refinement.
# Build the Bowtie2 index of the assembly
bowtie2-build data/spades_assembly.m1000.fna data/spades_assembly.m1000

# Map each sample, then sort the alignments and compute per-base depth
for i in {1..4};
do
    bowtie2 --sensitive-local --threads 4 -x data/spades_assembly.m1000 -1 data/sample${i}_R1.fastq.gz -2 data/sample${i}_R2.fastq.gz > sample${i}.sam
    samtools view -bS sample${i}.sam | samtools sort -o sample${i}.bam
    samtools depth -a sample${i}.bam > sample${i}.depth.txt
done
# Create a per-contig summary of the depths
python bin/compute_depth_profile.py -o results/depth.parquet sample{1..4}.depth.txt
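As a sanity check, the per-contig mean depth for a single sample can be computed directly from the samtools depth output. This is an illustrative sketch only; compute_depth_profile.py handles all four samples and writes the parquet table used downstream:

# Columns of samtools depth output are contig, position, depth
awk '{ sum[$1] += $3; n[$1]++ } END { for (c in sum) print c, sum[c] / n[c] }' sample1.depth.txt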
Fortunately, the refined bins have a different file extension from the raw versions, so the two sets are easy to separate with a wildcard.
# Raw bins: compute per-contig tetranucleotide (k=4) frequency profiles
python bin/compute_kmer_profile.py -k 4 -o results/raw_bins.parquet -f results/raw_bins.fna -t 4 data/bin_*.fna
# Project the combined k-mer and coverage profiles into a t-SNE ordination
python bin/project_ordination.py -n yeojohnson -w 0.5 --store_features results/raw_bins.matrix.tsv -k results/raw_bins.parquet -c results/depth.parquet -o results/raw_bins.tsne.parquet
# Identify the core contigs of each bin from the ordination
python bin/identify_bin_cores.py --threshold 0.8 --plot_traces -i results/raw_bins.tsne.parquet -o results/raw_bins.tsne_core.parquet
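The same three commands are then repeated over the refined bins. A sketch, assuming the DAS_Tool outputs carry a .fa extension; adjust the wildcard to match the actual extension of your refined bins:

# Refined bins (wildcard extension is an assumption)
python bin/compute_kmer_profile.py -k 4 -o results/refined_bins.parquet -f results/refined_bins.fna -t 4 data/*.fa
python bin/project_ordination.py -n yeojohnson -w 0.5 --store_features results/refined_bins.matrix.tsv -k results/refined_bins.parquet -c results/depth.parquet -o results/refined_bins.tsne.parquet
python bin/identify_bin_cores.py --threshold 0.8 --plot_traces -i results/refined_bins.tsne.parquet -o results/refined_bins.tsne_core.parquet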