3.2. Setup on virtual cloud machine
In order to use ProkEvo, the computational platform needs to have HTCondor, Pegasus WMS and Miniconda installed. While these can be found on the majority of computational platforms, here we provide detailed instructions on how to install ProkEvo and its dependencies on a fresh virtual cloud machine. Anvil is the Holland Computing Center’s cloud computing resource, similar to Amazon AWS. We used a CentOS 7.8 Anvil compute instance with 32 CPUs, 60GB of RAM and 160GB of disk space. Note that Amazon AWS provides an image with HTCondor preinstalled, in case that is the researcher's preferred cloud platform.
The first step after logging in to the machine is to clone the ProkEvo repo:
[centos@npavlovikj-prokevo ~]$ git clone https://github.com/npavlovikj/ProkEvo.git
[centos@npavlovikj-prokevo ~]$ cd ProkEvo/cloud
The cloud working directory contains the install_dependencies_vm.sh script, which installs all the required dependencies and needs to be run first as a root user:
[centos@npavlovikj-prokevo cloud]$ sudo ./install_dependencies_vm.sh
And that's it! This command installs HTCondor, Pegasus WMS and Miniconda for the user, and the researcher can use ProkEvo as usual.
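As an optional sanity check (not part of ProkEvo), the version of each dependency can be printed, assuming the install script placed the tools on the default PATH:
# Optional check: print the version of each installed dependency
condor_version
pegasus-version
conda --version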
To download raw Illumina paired-end reads from NCBI as input, ProkEvo requires only a list of SRA ids stored in the file sra_ids.txt. As an example, this repo provides an sra_ids.txt file with a few Salmonella enterica subsp. enterica serovar Enteritidis genomes:
[centos@npavlovikj-prokevo cloud]$ cat sra_ids.txt
SRR5160663
SRR8385633
SRR9984383
Once a list of SRA ids is created, the next step is to submit ProkEvo.
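To analyze a different set of genomes, the same file can be built by hand or with a one-line command. A minimal sketch, where the listed accessions are only placeholders for the researcher's own SRA ids:
# Replace the example accessions with your own SRA ids, one per line
printf '%s\n' SRR5160663 SRR8385633 SRR9984383 > sra_ids.txt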
ProkEvo also supports using raw Illumina reads already available on the local system. In order to use this feature, a tabular file rc.txt with the name of each sample and its local location should be created. There are multiple ways a researcher can do this. The command we use is:
while read line
do
    echo ''${line}'_1.fastq file:///absolute_path_to_fastq_files/'${line}'_1.fastq site="local"' >> rc.txt
    echo ''${line}'_2.fastq file:///absolute_path_to_fastq_files/'${line}'_2.fastq site="local"' >> rc.txt
done < sra_ids.txt
where sra_ids.txt is the file with the SRA ids and absolute_path_to_fastq_files is the absolute path to the reads.
After this, the rc.txt file should look like:
SRR5160663_1.fastq file:///home/centos/ProkEvo/SRR5160663_1.fastq site="local"
SRR8385633_1.fastq file:///home/centos/ProkEvo/SRR8385633_1.fastq site="local"
SRR9984383_1.fastq file:///home/centos/ProkEvo/SRR9984383_1.fastq site="local"
SRR5160663_2.fastq file:///home/centos/ProkEvo/SRR5160663_2.fastq site="local"
SRR9984383_2.fastq file:///home/centos/ProkEvo/SRR9984383_2.fastq site="local"
SRR8385633_2.fastq file:///home/centos/ProkEvo/SRR8385633_2.fastq site="local"
Please note that the absolute path to the raw reads on our system is "/home/centos/ProkEvo/", and this location will be different for you.
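Before submitting, it can be useful to verify that every file listed in rc.txt actually exists on the local system. The short check below is only an illustration and not part of ProkEvo; it strips the file:// prefix from the second column and tests each path:
# Report any fastq file referenced in rc.txt that is missing on the local system
while read -r name url rest
do
    path="${url#file://}"
    [ -f "$path" ] || echo "Missing: $path"
done < rc.txt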
Once the input files are specified, the next step is to submit ProkEvo using the provided submit.sh script:
[centos@npavlovikj-prokevo cloud]$ ./submit.sh
And that's it! The submit script sets the current directory as the working directory where all temporary and final outputs are stored. Running ./submit.sh prints lots of useful information on the command line, including how to check the status of the workflow and remove it if necessary.
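If the workflow ever needs to be cancelled, it can be removed with pegasus-remove, pointing it at the run directory that submit.sh reports (shown here with the run directory from our example; yours will differ):
[centos@npavlovikj-prokevo cloud]$ pegasus-remove /home/centos/ProkEvo/cloud/centos/pegasus/pipeline/20210107T192451+0000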
Once the workflow is submitted, its status can be checked with the pegasus-status command:
[centos@npavlovikj-prokevo cloud]$ pegasus-status -l /home/centos/ProkEvo/cloud/centos/pegasus/pipeline/20210107T192451+0000
STAT IN_STATE JOB
Run 01:12:04 pipeline-0 ( /home/centos/ProkEvo/cloud/centos/pegasus/pipeline/20210107T192451+0000 )
Run 01:03:34 ┗━ex_spades_run_ID0000007
Summary: 2 Condor jobs total (R:2)
UNRDY READY PRE IN_Q POST DONE FAIL %DONE STATE DAGNAME
15 0 0 1 0 29 0 64.4 Running */home/centos/ProkEvo/cloud/centos/pegasus/pipeline/20210107T192451+0000/pipeline-0.dag
Summary: 1 DAG total (Running:1)
Briefly, this command shows the currently running jobs, as well as how much of the pipeline has been completed. As can be seen in the output above, at the time the command was run, one SPAdes job was running, and the pipeline was 64.4% done with no failed jobs. Depending on when the pegasus-status command is run, the shown output will differ.
Once the pipeline has finished, researchers can run commands such as pegasus-analyzer and pegasus-statistics to obtain statistics about the workflow, such as the number of jobs that failed/succeeded, run time of tasks, etc.
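For example, using the same run directory as above (yours will differ), the two commands can be invoked as shown below; the -s all option asks pegasus-statistics to print all available statistics:
[centos@npavlovikj-prokevo cloud]$ pegasus-analyzer /home/centos/ProkEvo/cloud/centos/pegasus/pipeline/20210107T192451+0000
[centos@npavlovikj-prokevo cloud]$ pegasus-statistics -s all /home/centos/ProkEvo/cloud/centos/pegasus/pipeline/20210107T192451+0000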
All the output files are stored in the outputs directory, which is located in the directory ProkEvo is submitted from:
[centos@npavlovikj-prokevo cloud]$ ls outputs/
fastbaps_baps.csv sabricate_ncbi_output.csv SRR5160663_prokka_output.tar.gz SRR9984383_plasmidfinder_output.tar.gz
fastqc_summary_all.txt sabricate_plasmidfinder_output.csv SRR5160663_quast_output SRR9984383_prokka_output
fastqc_summary_final.txt sabricate_resfinder_output.csv SRR5160663_spades_output SRR9984383_prokka_output.tar.gz
mlst_output.csv sabricate_vfdb_output.csv SRR8385633_plasmidfinder_output.tar.gz SRR9984383_quast_output
roary_output sistr_all.csv SRR8385633_prokka_output SRR9984383_spades_output
roary_output.tar.gz sistr_all_merge.csv SRR8385633_prokka_output.tar.gz sub-pipeline.dax
sabricate_argannot_output.csv SRR5160663_plasmidfinder_output.tar.gz SRR8385633_quast_output
sabricate_card_output.csv SRR5160663_prokka_output SRR8385633_spades_output
[centos@npavlovikj-prokevo cloud]$
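The per-sample archives ending in .tar.gz can be unpacked with tar, and the aggregated tables are plain CSV files. For instance (illustration only, not part of the workflow):
[centos@npavlovikj-prokevo cloud]$ tar -xzf outputs/SRR5160663_prokka_output.tar.gz
[centos@npavlovikj-prokevo cloud]$ head outputs/mlst_output.csv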