3.2. Setup on a virtual cloud machine

In order to use ProkEvo, the computational platform needs to have HTCondor, Pegasus WMS, and Miniconda. While these can be found on the majority of computational platforms, here we provide detailed instructions on how to install ProkEvo and its dependencies on a fresh virtual cloud machine. Anvil is the Holland Computing Center’s cloud computing resource, similar to Amazon AWS. We used a CentOS 7.8 Anvil compute instance with 32 CPUs, 60GB of RAM, and 160GB of disk space. Note that Amazon AWS provides an image with HTCondor already installed, in case that is the researcher's preferred cloud platform.

The first step after logging in to the machine is to clone the ProkEvo repo:

[centos@npavlovikj-prokevo ~]$ git clone https://github.com/npavlovikj/ProkEvo.git
[centos@npavlovikj-prokevo ~]$ cd ProkEvo/cloud

The cloud directory contains the install_dependencies_vm.sh script, which installs all the required dependencies and needs to be run first with root privileges:

[centos@npavlovikj-prokevo cloud]$ sudo ./install_dependencies_vm.sh

And that's it! This command installs HTCondor, Pegasus WMS, and Miniconda, and the researcher can then use ProkEvo as usual.
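
A quick way to confirm the installation succeeded is to check that each dependency is available on the command line (the exact versions printed will depend on what the script installed):

[centos@npavlovikj-prokevo cloud]$ condor_version
[centos@npavlovikj-prokevo cloud]$ pegasus-version
[centos@npavlovikj-prokevo cloud]$ conda --version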

1. Downloading raw Illumina reads from NCBI

To download raw Illumina paired-end reads from NCBI, ProkEvo requires as input only a list of SRA ids stored in the file sra_ids.txt. In this repo, as an example, we provide the file sra_ids.txt with a few Salmonella enterica subsp. enterica serovar Enteritidis genomes:

[centos@npavlovikj-prokevo cloud]$ cat sra_ids.txt 
SRR5160663
SRR8385633
SRR9984383
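
To analyze a different set of genomes, one way to create the list is directly from the shell, e.g. with a heredoc (the accessions below are just the example ids from this repo and should be replaced with your own):

[centos@npavlovikj-prokevo cloud]$ cat > sra_ids.txt << 'EOF'
SRR5160663
SRR8385633
SRR9984383
EOF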

Once a list of SRA ids is created, the next step is to submit ProkEvo.

2. Using already downloaded raw reads

ProkEvo supports using raw Illumina reads available on the local system. In order to use this feature, a tabular file rc.txt with the name of each sample and its local location should be created. There are multiple ways a researcher can do this; the command we use is:

while read line
do
  echo "${line}_1.fastq file:///absolute_path_to_fastq_files/${line}_1.fastq site=\"local\"" >> rc.txt
  echo "${line}_2.fastq file:///absolute_path_to_fastq_files/${line}_2.fastq site=\"local\"" >> rc.txt
done < sra_ids.txt

where sra_ids.txt is the file with the SRA ids and absolute_path_to_fastq_files is the absolute path to the reads.

After this, the rc.txt file should look like:

SRR5160663_1.fastq file:///home/centos/ProkEvo/SRR5160663_1.fastq site="local"
SRR8385633_1.fastq file:///home/centos/ProkEvo/SRR8385633_1.fastq site="local"
SRR9984383_1.fastq file:///home/centos/ProkEvo/SRR9984383_1.fastq site="local"
SRR5160663_2.fastq file:///home/centos/ProkEvo/SRR5160663_2.fastq site="local"
SRR9984383_2.fastq file:///home/centos/ProkEvo/SRR9984383_2.fastq site="local"
SRR8385633_2.fastq file:///home/centos/ProkEvo/SRR8385633_2.fastq site="local"

Please note that the absolute path to the raw reads on our system is "/home/centos/ProkEvo/"; this location will be different for you.
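
As a quick sanity check before submitting, rc.txt should contain two lines (forward and reverse read) for every SRA id in sra_ids.txt, so comparing the line counts of the two files is an easy way to catch missing entries:

[centos@npavlovikj-prokevo cloud]$ wc -l sra_ids.txt rc.txt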

Run ProkEvo!

Once the input files are specified, the next step is to submit ProkEvo using the provided submit.sh script:

[centos@npavlovikj-prokevo cloud]$ ./submit.sh 

And that's it! The submit script sets the current directory as the working directory where all temporary and final outputs are stored. Running ./submit.sh prints lots of useful information on the command line, including how to check the status of the workflow and how to remove it if necessary.
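
For example, if the workflow needs to be stopped, Pegasus provides the pegasus-remove command, which takes the run directory printed by submit.sh (the path below is the run directory from our example and will differ on your system):

[centos@npavlovikj-prokevo cloud]$ pegasus-remove /home/centos/ProkEvo/cloud/centos/pegasus/pipeline/20210107T192451+0000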

Monitoring ProkEvo

Once the workflow is submitted, its status can be checked with the pegasus-status command:

[centos@npavlovikj-prokevo cloud]$ pegasus-status -l /home/centos/ProkEvo/cloud/centos/pegasus/pipeline/20210107T192451+0000 
STAT  IN_STATE  JOB                                                                                   
Run   01:12:04  pipeline-0 ( /home/centos/ProkEvo/cloud/centos/pegasus/pipeline/20210107T192451+0000 )
Run   01:03:34   ┗━ex_spades_run_ID0000007                                                            
Summary: 2 Condor jobs total (R:2)

UNRDY READY   PRE  IN_Q  POST  DONE  FAIL %DONE STATE   DAGNAME                                                                                
   15     0     0     1     0    29     0  64.4 Running */home/centos/ProkEvo/cloud/centos/pegasus/pipeline/20210107T192451+0000/pipeline-0.dag
Summary: 1 DAG total (Running:1)

Briefly, this command shows the currently running jobs, as well as how much of the pipeline has been completed. As can be seen in the output above, at the time the command was run one SPAdes job was running, and the pipeline was 64.4% done with no failed jobs. Depending on when the pegasus-status command is run, the output shown will differ.
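
Instead of re-running the command manually, the standard watch utility can be used to refresh the status periodically, for example every 60 seconds:

[centos@npavlovikj-prokevo cloud]$ watch -n 60 pegasus-status -l /home/centos/ProkEvo/cloud/centos/pegasus/pipeline/20210107T192451+0000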

Once the pipeline has finished, researchers can run commands such as pegasus-analyzer and pegasus-statistics to obtain statistics about the workflow, such as the number of jobs that failed/succeeded, run time of tasks, etc.
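
For example, both commands take the workflow run directory used above, and pegasus-statistics -s all reports statistics at all levels (workflow, jobs, and tasks):

[centos@npavlovikj-prokevo cloud]$ pegasus-statistics -s all /home/centos/ProkEvo/cloud/centos/pegasus/pipeline/20210107T192451+0000
[centos@npavlovikj-prokevo cloud]$ pegasus-analyzer /home/centos/ProkEvo/cloud/centos/pegasus/pipeline/20210107T192451+0000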

Output

All the output files are stored in the outputs directory, which is created in the directory from which ProkEvo is submitted:

[centos@npavlovikj-prokevo cloud]$ ls outputs/
fastbaps_baps.csv              sabricate_ncbi_output.csv               SRR5160663_prokka_output.tar.gz         SRR9984383_plasmidfinder_output.tar.gz
fastqc_summary_all.txt         sabricate_plasmidfinder_output.csv      SRR5160663_quast_output                 SRR9984383_prokka_output
fastqc_summary_final.txt       sabricate_resfinder_output.csv          SRR5160663_spades_output                SRR9984383_prokka_output.tar.gz
mlst_output.csv                sabricate_vfdb_output.csv               SRR8385633_plasmidfinder_output.tar.gz  SRR9984383_quast_output
roary_output                   sistr_all.csv                           SRR8385633_prokka_output                SRR9984383_spades_output
roary_output.tar.gz            sistr_all_merge.csv                     SRR8385633_prokka_output.tar.gz         sub-pipeline.dax
sabricate_argannot_output.csv  SRR5160663_plasmidfinder_output.tar.gz  SRR8385633_quast_output
sabricate_card_output.csv      SRR5160663_prokka_output                SRR8385633_spades_output
[centos@npavlovikj-prokevo cloud]$
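
The per-isolate results that end in .tar.gz (e.g. the Prokka annotation for each genome) are compressed archives and can be unpacked with tar, for example:

[centos@npavlovikj-prokevo cloud]$ tar -xzf outputs/SRR5160663_prokka_output.tar.gz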