Parallelisation of taxonomy_ranks #1

HuoJnx · 2022-11-02T12:14:03Z

Hello. I think taxonomy_ranks is a very convenient tool for lineage annotation, but it is a little slow, so I wrote a script to parallelize it, and it works. It ran about 4 times faster for 21579 queries on a 48-thread server. I hope the script helps anyone who needs it. ^_^

Code

taxaranks_parallel(){

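    ## stop after error
    ## (call the function in a subshell, e.g. `(taxaranks_parallel file)`,
    ## so that `set -e` and `cd` do not affect the calling shell)
    set -e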
    
    ## get current directory
    current_dir=$(pwd)
    
    ## parse the input path (quoted so that paths with spaces work)
    input=$1
    dir=$(realpath "$(dirname "$input")")
    base=$(basename "$input")
    real_input="${dir}/${base}"
    echo "Input is $input."
    
    ## go to sub_dir
    sub_dir="${dir}/split_${base}"
    rm -rf "$sub_dir"; mkdir -p "$sub_dir"
    cd "$sub_dir"
    echo "Created temporary directory $sub_dir."
    
    ## get parameters for splitting, then split
    ## (-n l/N makes N chunks without splitting lines)
    threads=$(nproc)
    need_length=3
    split -a "$need_length" -d -n "l/${threads}" "$real_input"
    echo "Have $threads threads, split the file into $threads parts."
    
    ## run taxaranks in parallel
    echo "Annotating..."
    ls .|parallel "taxaranks -i {} -o {}.lineage -t"
    
    ## merge
    merge_file="../${base}.lineage"
    merge_file_with_head="../${base}.lineage.with_head"
    
    #### drop the first line of each file, then merge in chunk order
    #### (a single awk avoids parallel jobs appending to one file at the same time)
    rm -f "$merge_file"; awk 'FNR>1' *.lineage > "$merge_file"
    
    #### add the header line back to the merged file
    head_line=$(head -n1 "$(ls *.lineage | head -n1)")
    { printf '%s\n' "$head_line"; cat "$merge_file"; } > "$merge_file_with_head"
    rm -f "$merge_file"; mv "$merge_file_with_head" "$merge_file"
    
    ## back to the previous working directory
    cd "$current_dir"
    
    ## remove the sub_dir (an absolute path, so this works from anywhere)
    rm -rf "$sub_dir"
    echo "Removed temporary directory."
    
    ## prompt
    echo "All finished."
}
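
The merge step can be tried on its own. The following is only a sketch, not part of the original script: the chunk file names and their contents are made up for illustration, and it uses a single awk call (rather than parallel appends) to keep the header from the first chunk and skip it in the others.

```shell
#!/usr/bin/env bash
# Sketch: merge per-chunk outputs that each start with the same header line.
# File names and contents are invented for this demonstration.
set -eu

tmp=$(mktemp -d)
printf 'id\tlineage\n1\tA\n2\tB\n' > "$tmp/x000.lineage"
printf 'id\tlineage\n3\tC\n'       > "$tmp/x001.lineage"

# NR counts lines across all files, FNR restarts at 1 for every file.
# NR==1 keeps the header of the first chunk; FNR>1 keeps every data line.
merged=$(awk 'NR==1 || FNR>1' "$tmp"/x*.lineage)
printf '%s\n' "$merged"

rm -rf "$tmp"
```

Because the shell expands the glob in sorted order, the chunks are concatenated in the order that split numbered them, so the merged rows follow the order of the input file.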

Example

Without parallelization

With parallelization
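
A comparison like the one labelled above could be timed roughly as follows. This is a sketch under assumptions: `taxids.txt` is a hypothetical input file, the function `taxaranks_parallel` is assumed to have been defined in the current shell beforehand, and the `-i`, `-o` and `-t` flags are simply the ones used in the script above.

```shell
#!/usr/bin/env bash
# Hypothetical input file with one query per line (the name is an assumption).
input="taxids.txt"

if command -v taxaranks >/dev/null 2>&1 \
   && command -v taxaranks_parallel >/dev/null 2>&1 \
   && [ -f "$input" ]; then
    # Serial run, same flags as inside the function above
    start=$(date +%s)
    taxaranks -i "$input" -o "${input}.serial.lineage" -t
    end=$(date +%s)
    echo "serial:   $((end - start)) s"

    # Parallel run; the subshell keeps `set -e` and `cd` out of this shell
    start=$(date +%s)
    ( taxaranks_parallel "$input" )
    end=$(date +%s)
    echo "parallel: $((end - start)) s"
else
    echo "taxaranks, taxaranks_parallel or $input not found; nothing to time"
fi
```

The serial output goes to a separate file name so that it is not overwritten by the `${input}.lineage` file the function writes.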


linzhi2013 · 2022-11-02T19:38:30Z

Hi HuoJnx,

Thanks a lot for your suggestion!

I will post it on the main page of the project.

Cheers
Guanliang


HuoJnx · 2022-11-03T01:01:27Z

Wow! I'm happy to be of help! ☺️

linzhi2013 added a commit that referenced this issue Nov 2, 2022

Create parallelize_taxon.sh

ab5aa38

Copy from #1
