Upcoming to 1.6

Javier Tamames edited this page Aug 22, 2022 · 16 revisions

How to add a new assembler

From version 1.6, SqueezeMeta allows you to connect assemblers other than the ones shipped with the distribution (Megahit, Spades, Canu and Flye). Here I will walk you through a practical example of how to do it, plugging in the IDBA-UD assembler (https://github.com/loneknightpy/idba) (Peng et al., Bioinformatics 2012, 28(11):1420–1428; https://doi.org/10.1093/bioinformatics/bts174).

I will assume that you have already installed the IDBA-UD software and put it somewhere on your system, for instance in the /software/idba directory (my choice, but you can put it wherever you want).

What you need to do is to create a script to run the assembler. Your script will be called by the SqueezeMeta pipeline, as you will see in a moment. The script can be written in any language you want. As I am an old-fashioned, Perl-native man, I will show you my Perl script. A Python one will be very similar. In a demonstration of sheer creativity, I called my script “assembly_idba.pl”, and it looks like this:

#!/usr/bin/perl

use strict;

print "Running IDBA assembly\n";

#-- By default, SqueezeMeta will pass the following arguments to your script:

my $projectdir=$ARGV[0]; # First argument: Directory of the project

my $sample=$ARGV[1]; # Second argument: Name of the sample (in sequential mode) or of the project (in the other modes)

my $par1name=$ARGV[2]; # Third argument: Name of the pair1 file

my $par2name=$ARGV[3]; # Fourth argument: Name of the pair2 file

my $numthreads=12; #-- In addition, we can define other parameters, for instance number of threads

#-- IDBA wants data as an interlaced fasta file

#-- Fortunately, they provide a fq2fa script that converts our fastq files to that format

#-- But our fastq files are gzipped, so the first thing to do is to gunzip them

#-- We define $g1 and $g2 as variables containing the names of the gunzipped files

#-- Simply remove the ".gz" extension to get the gunzipped name

my $g1=$par1name; $g1=~s/\.gz$//;

my $g2=$par2name; $g2=~s/\.gz$//;

my $fastafile="temp.fasta"; #-- And we define $fastafile as the resulting interlaced fasta file

#-- Now, we gunzip the files and run the fq2fa script

my $merge_command="gunzip $par1name; gunzip $par2name; /software/idba/bin/fq2fa --merge $g1 $g2 $fastafile";

system($merge_command);

#-- And then we can run the IDBA assembler, just providing the input filename ($fastafile)

#-- We could add any desired assembler options to this command line

#-- For instance, we added the number of threads

#-- The results will be stored in a directory we named "tempidba"

my $assembly_command="/software/idba/bin/idba -r $fastafile --num_threads $numthreads -o tempidba";

system($assembly_command);

#-- Finally, we have to move the resulting fasta file to the "results" directory of the SqueezeMeta project

#-- IDBA names the file "scaffold.fa"

#-- Keep in mind that the file must be named "01.project.fasta"

my $mv_command="mv tempidba/scaffold.fa $projectdir/results/01.$sample.fasta";

system($mv_command);

#-- To finish, we clean up things we don't need anymore (the temporary directory and the gunzipped read files)

my $rm_command="rm -r tempidba; rm $g1; rm $g2";

system($rm_command);

print "All done here! Have fun!\n";

Keep in mind that when SqueezeMeta calls your script, it passes four arguments: the project directory, the sample or project name, and the two read files (paired-end, gzipped fastq or fasta files). This is probably all you need to know to call the assembler.

As you can see in the script, I just run fq2fa, a formatting script provided by IDBA-UD, to put the reads in the format the assembler wants. Then I run the assembler, and finally I move the resulting contig file to the results directory of the SqueezeMeta project. This is very important because the rest of the pipeline will look for the contig file there. Also, keep in mind that the contig file must be named 01.project.fasta (where project is your project name).
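Since the wrapper can be written in any language, here is a minimal Python sketch of the same steps, as a rough counterpart to the Perl script above. It is an illustration, not something shipped with SqueezeMeta: the file name assembly_idba.py and the helper names strip_gz and build_commands are hypothetical, and the /software/idba/bin location is the same assumption used in the text. The script would presumably need to be executable so that the pipeline can run it directly.

```python
#!/usr/bin/env python3
"""Hypothetical Python counterpart of assembly_idba.pl (a sketch only)."""
import subprocess
import sys
from pathlib import Path

IDBA_BIN = Path("/software/idba/bin")  # assumed install location, as in the text
NUM_THREADS = 12                       # extra parameter, as in the Perl script


def strip_gz(name):
    """Return the name gunzip leaves behind: the input without its .gz suffix."""
    return name[:-3] if name.endswith(".gz") else name


def build_commands(projectdir, sample, pair1, pair2,
                   fasta="temp.fasta", outdir="tempidba"):
    """Build, in order, the commands that the Perl script runs."""
    g1, g2 = strip_gz(pair1), strip_gz(pair2)
    return [
        ["gunzip", pair1, pair2],
        [str(IDBA_BIN / "fq2fa"), "--merge", g1, g2, fasta],
        [str(IDBA_BIN / "idba"), "-r", fasta,
         "--num_threads", str(NUM_THREADS), "-o", outdir],
        # The pipeline looks for the contigs in results/01.<name>.fasta
        ["mv", f"{outdir}/scaffold.fa", f"{projectdir}/results/01.{sample}.fasta"],
        ["rm", "-r", outdir],
        ["rm", g1, g2],
    ]


def main(argv):
    print("Running IDBA assembly")
    for cmd in build_commands(*argv[1:5]):
        # check=True stops at the first failing step instead of carrying on
        subprocess.run(cmd, check=True)
    print("All done here! Have fun!")


# Only run when the four arguments described in the text are supplied
if __name__ == "__main__" and len(sys.argv) == 5:
    main(sys.argv)
```

One design difference from the Perl version: each command is passed to subprocess.run as a list with check=True, so a failure in gunzip or fq2fa aborts the script rather than letting the later steps run on missing files.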

OK, that is the machinery to run the IDBA assembler. Now, how do you plug it into SqueezeMeta? The first thing to do is to move your script to the place where all the other assembly scripts are, which is the installpath/lib/SqueezeMeta directory (where installpath is the installation directory of SqueezeMeta). You will see other scripts for running assemblers there, like assembly_megahit.pl, assembly_spades.pl, etc. Then, edit the SqueezeMeta_conf.pl file in the scripts directory of the SqueezeMeta installation. You will see a line like this:

%assemblers = ("megahit","assembly_megahit.pl","spades", "assembly_spades.pl","canu","assembly_canu.pl","flye", "assembly_flye.pl");

This line is a hash (a dict in Python) that tells SqueezeMeta the names of the available assemblers and the associated scripts for running them. Just add yours. Remember that the name you specify here is the one you will use to select the assembler when running SqueezeMeta:

%assemblers = ("megahit","assembly_megahit.pl","spades", "assembly_spades.pl","canu","assembly_canu.pl","flye", "assembly_flye.pl","idba","assembly_idba.pl");
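For readers more at home in Python, the following sketch shows what that flat Perl list amounts to: items are read in pairs, the first of each pair being the assembler name and the second the script that runs it. This only illustrates the data structure. It is not how SqueezeMeta looks the script up internally, and the function name script_for is hypothetical.

```python
# The flat list as written in SqueezeMeta_conf.pl: name, script, name, script, ...
flat = [
    "megahit", "assembly_megahit.pl",
    "spades", "assembly_spades.pl",
    "canu", "assembly_canu.pl",
    "flye", "assembly_flye.pl",
]

# Perl turns such a list into a hash by pairing consecutive items;
# the equivalent Python dict is built from the even and odd positions.
assemblers = dict(zip(flat[0::2], flat[1::2]))

# Registering the new assembler is just one more name/script pair.
assemblers["idba"] = "assembly_idba.pl"


def script_for(name):
    """Return the script registered for an assembler name (illustrative helper)."""
    if name not in assemblers:
        raise ValueError(f"unknown assembler: {name}")
    return assemblers[name]


print(script_for("idba"))
```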

Save it, and you are done. Now you can run a SqueezeMeta project using your new “idba” assembler:

SqueezeMeta.pl -m coassembly -f mydir -s mysamples.samples -p idba_test -a idba

And you will see this:

SqueezeMeta v1.6.0, March 2022 - (c) J. Tamames, F. Puente-Sánchez CNB-CSIC, Madrid, SPAIN

Please cite: Tamames & Puente-Sanchez, Frontiers in Microbiology 9, 3349 (2019). doi: https://doi.org/10.3389/fmicb.2018.03349

Run started Thu May 26 17:45:13 2022 in coassembly mode

2 metagenomes found: SRR1927149_s SRR1929485_s

Now creating directories

Reading configuration from SqueezeMeta_conf.pl

[0 seconds]: STEP1 -> RUNNING ASSEMBLY: 01.run_all_assemblies.pl (idba)

contigs: 45831 n50: 645 max: 85917 mean: 540 total length: 24792136 n80: 322

All done here! Have fun!
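If you want to double-check that your script left the contigs where the rest of the pipeline expects them, a small sketch like the following may help. It assumes the naming rule described above (results/01.project.fasta inside the project directory) and that the project directory is named after the project, as with -p idba_test; the helper names contig_file and count_contigs are hypothetical.

```python
from pathlib import Path


def contig_file(projectdir, project):
    """Path where the pipeline expects the assembly: results/01.<project>.fasta."""
    return Path(projectdir) / "results" / f"01.{project}.fasta"


def count_contigs(fasta_path):
    """Count sequences in a fasta file by counting its header lines."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Example for the run shown above (project "idba_test" in the current directory)
path = contig_file("idba_test", "idba_test")
if path.exists():
    print(f"{path}: {count_contigs(path)} contigs")
else:
    print(f"{path} not found: check the mv step of your assembly script")
```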