Welcome to the cath-tools-genomescan wiki. This area will be used to provide extra help with using the tools in this project.
Please note: any code quoted in this section is provided under the same license as the rest of this project (i.e. you are welcome to use it for pretty much anything, but you do so at your own risk).
Thanks very much to @alexdthomas (see #12) for this approach to running cath-genomescan in parallel.
Important note: please do not just use the code below without understanding what it is doing. The following will launch 48 (potentially large) processes concurrently, which may well cause problems on a machine without enough resources (e.g. a standard desktop). Please make sure this script makes sense for your own environment.
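If you are adapting this to a different machine, one way to avoid over-subscribing it is to derive the job count from the CPUs actually available instead of hard-coding 48. A minimal sketch using `nproc` from GNU coreutils (`NJOBS` is a hypothetical variable name; you would substitute it for the literal 48 used further down):

```shell
# Derive the parallel job count from the machine rather than hard-coding it.
# NJOBS is a hypothetical name -- substitute it wherever 48 appears below.
NJOBS=$(nproc)
echo "running up to $NJOBS jobs in parallel"
```

On a cluster node you may instead want the CPU count allocated by the scheduler (often exposed via a queue-specific environment variable) rather than the physical core count.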
I was running cath-genomescan.pl on all of the genes predicted in my metagenome samples. It was taking quite a while, mostly due to hmmsearch, which, despite its default behaviour of using all available CPUs, usually only uses a few. I wanted to figure out how to parallelize it and came up with this strategy:
- added an argument to `hmmsearch` in `cath-genomescan.pl` to only use 1 CPU (`--cpu 1`)
- split my amino acid FASTA files into the number of CPUs I wanted to use (in this case 48)
- wrote the `cath-genomescan.pl` commands for each of the sub-files to a text file
- submitted that text file to be run in parallel (cpus=48)
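The first step (pinning hmmsearch to one CPU) means editing the hmmsearch command line inside cath-genomescan.pl. The exact line varies between versions of the script, so the `hmmsearch ` match pattern below is an assumption; inspect your copy before changing it. The substitution itself, demonstrated on a sample line:

```shell
# Demo of the edit on a sample line. The match pattern 'hmmsearch ' is an
# assumption -- check your copy of cath-genomescan.pl for the real command
# line, then apply the same substitution with sed -i (after taking a backup).
echo 'system( "hmmsearch $options" );' | sed 's/hmmsearch /hmmsearch --cpu 1 /'
# -> system( "hmmsearch --cpu 1 $options" );
```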
This seemed to work pretty well and got me the results I was looking for very quickly. I of course had to clean up some intermediate files and parse/aggregate the results into single files. I figured I'd share the strategy. I'm not sure how easily it can be built into cath-genomescan.pl itself, but at the very least an argument to set the number of CPUs used by hmmsearch could be helpful.
Here is the nasty bash code I used inside a loop over my directories. I've tried to clean it up a little to make it easier to read.
```shell
# get the headers from the faa file
grep "^>" min1000_faa > min1000_faa_headers.txt

# split into 48 chunks
split --number=l/48 min1000_faa_headers.txt --additional-suffix=_headers_split

# remove the leading ">" from the headers
sed -i 's/>//g' *_headers_split

# for each of the header chunks
for i in *_headers_split; do
    # use pullseq to get the aa sequences for each chunk (https://github.com/bcthomas/pullseq)
    pullseq -i min1000_faa -n "$i" > "$i.faa"
    # append the cath-genomescan.pl command for that chunk to a text file
    echo "/path/to/cath-tools-genomescan/apps/cath-genomescan.pl -i $PWD/$i.faa -l /path/to/cath-tools-genomescan/data/funfam-hmm3.lib -o $PWD/$i.faa.out" >> cath_genomescan_cluster_commands.txt
done

# now send the cath-genomescan.pl commands to parallel with 48 cpus and pipe to
# qsub to submit to the cluster (${sample} is set by the surrounding loop over
# sample directories)
echo "parallel -j48 < $PWD/cath_genomescan_cluster_commands.txt" | qsub -V -N cath_${sample}
```
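The clean-up and aggregation step mentioned earlier can be scripted too. A minimal sketch, assuming the file-naming pattern from the loop above and a hypothetical combined-output filename; whether the per-chunk `.out` files need any header de-duplication depends on the cath-genomescan output format, so check one before blindly concatenating:

```shell
# Concatenate the per-chunk results into a single file, then remove the
# intermediate split files. "min1000_combined.out" is a hypothetical name;
# the *_headers_split* patterns match the files created in the loop above.
cat *_headers_split.faa.out > min1000_combined.out
rm -f *_headers_split *_headers_split.faa *_headers_split.faa.out
```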