Welcome to the cath-tools-genomescan wiki. This area will be used to provide extra help with using the tools in this project.
Please note: any code quoted in this section is provided under the same license as the rest of this project (i.e. you are welcome to use it for pretty much anything, but you do so at your own risk).
Thanks very much to @alexdthomas (see #12) for this approach to running cath-genomescan in parallel.
Important note: please do not just use the code below without understanding what it is doing. The following will launch 48 (potentially large) processes concurrently, which may well cause problems on a machine without enough resources (e.g. a standard desktop). Please make sure this script makes sense for your own environment.
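If you are adapting this to a different machine, one way to avoid over-subscribing it is to derive the job count from the CPUs actually available instead of hard-coding 48. A minimal sketch using `nproc` from GNU coreutils (`NJOBS` is a hypothetical variable name; you would substitute it for the literal 48 used further down):

```shell
# Derive the parallel job count from the machine rather than hard-coding it.
# NJOBS is a hypothetical name -- substitute it wherever 48 appears below.
NJOBS=$(nproc)
echo "running up to $NJOBS jobs in parallel"
```

On a cluster node you may instead want the CPU count allocated by the scheduler (often exposed via a queue-specific environment variable) rather than the physical core count.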
I was running cath-genomescan.pl on all of the genes predicted in my metagenome samples. It was taking quite a while, mostly due to hmmsearch, which, despite its default behaviour of using all available CPUs, usually only uses a few. I wanted to figure out how to parallelize it and came up with this strategy:
- added an argument to `hmmsearch` in `cath-genomescan.pl` to only use 1 CPU (`--cpu 1`)
- split my amino acid FASTA files into the number of CPUs I wanted to use (in this case 48)
- wrote the `cath-genomescan.pl` commands for each of the sub-files to a text file
- submitted that text file to be run in parallel (cpus=48)
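The first step (pinning hmmsearch to one CPU) means editing the hmmsearch command line inside cath-genomescan.pl. The exact line varies between versions of the script, so the `hmmsearch ` match pattern below is an assumption; inspect your copy before changing it. The substitution itself, demonstrated on a sample line:

```shell
# Demo of the edit on a sample line. The match pattern 'hmmsearch ' is an
# assumption -- check your copy of cath-genomescan.pl for the real command
# line, then apply the same substitution with sed -i (after taking a backup).
echo 'system( "hmmsearch $options" );' | sed 's/hmmsearch /hmmsearch --cpu 1 /'
# -> system( "hmmsearch --cpu 1 $options" );
```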
This seemed to work pretty well and got me the results I was looking for very quickly. I of course had to clean up some intermediate files and parse/aggregate the results into single files. I figured I'd share the strategy. I'm not sure how easily it can be built into cath-genomescan.pl itself, but at the very least an argument to set the number of CPUs used by hmmsearch could be helpful.
Here is the nasty bash code I used inside a loop over my directories. I've tried to clean it up a little to make it easier to read.
```shell
# get the headers from the faa file
grep "^>" min1000_faa > min1000_faa_headers.txt

# split into 48 chunks
split --number=l/48 min1000_faa_headers.txt --additional-suffix=_headers_split

# remove the leading ">" from the headers
sed -i 's/>//g' *_headers_split

# for each of the header chunks
for i in *_headers_split; do
    # use pullseq to get the aa sequences for each chunk (https://github.com/bcthomas/pullseq)
    pullseq -i min1000_faa -n "$i" > "$i.faa"
    # append the cath-genomescan.pl command for that chunk to a text file
    echo "/path/to/cath-tools-genomescan/apps/cath-genomescan.pl -i $PWD/$i.faa -l /path/to/cath-tools-genomescan/data/funfam-hmm3.lib -o $PWD/$i.faa.out" >> cath_genomescan_cluster_commands.txt
done

# now send the cath-genomescan.pl commands to parallel with 48 cpus and pipe to
# qsub to submit to the cluster (${sample} is set by the surrounding loop over
# sample directories)
echo "parallel -j48 < $PWD/cath_genomescan_cluster_commands.txt" | qsub -V -N cath_${sample}
```
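The clean-up and aggregation step mentioned earlier can be scripted too. A minimal sketch, assuming the file-naming pattern from the loop above and a hypothetical combined-output filename; whether the per-chunk `.out` files need any header de-duplication depends on the cath-genomescan output format, so check one before blindly concatenating:

```shell
# Concatenate the per-chunk results into a single file, then remove the
# intermediate split files. "min1000_combined.out" is a hypothetical name;
# the *_headers_split* patterns match the files created in the loop above.
cat *_headers_split.faa.out > min1000_combined.out
rm -f *_headers_split *_headers_split.faa *_headers_split.faa.out
```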