diff --git a/GCP RAPT.md b/GCP RAPT.md deleted file mode 100644 index 9c26a5b..0000000 --- a/GCP RAPT.md +++ /dev/null @@ -1,142 +0,0 @@ -# Google Cloud Platform (GCP) RAPT – Documentation - -This page contains instruction and examples for running *GCP RAPT*. `run_rapt_gcp.sh` is a shell script provides a command line interface to run *GCP RAPT*. Some basic knowledge of Unix/Linux commands, [SKESA](https://github.com/ncbi/SKESA), and [PGAP](https://github.com/ncbi/pgap) is useful in completing this tutorial. -Please see our [wiki page](https://github.com/ncbi/rapt/wiki) for References, Licenses, FAQs, and In-depth Documentation and Examples. - - -## System requirements -*GCP RAPT* is designed to run on the Google Cloud Platform (GCP), it has no special hardware requirements for the local machine (the one where `run_rapt_gcp.sh` runs). It can be conveniently invoked from the Google Cloud Shell or any computer with the following prerequisites: -- gcloud SDK installed (automatically enabled in Cloud Shell) -- gsutil tool installed (automatically enabled in Cloud Shell) -- Cloud Life Sciences API enabled for your project - for help see [Quick start using a Cloud Shell](https://github.com/ncbi/rapt/wiki/GCP%20RAPT%20In-depth%20Documentation%20and%20Examples.md) -- Access to a Google storage bucket for your data - for help see [Quick start using a Cloud Shell](https://github.com/ncbi/rapt/wiki/GCP%20RAPT%20In-depth%20Documentation%20and%20Examples.md) - -*GCP RAPT* will bring up and shut down Google instances as needed.
- -## Quick start -Here are instructions to execute RAPT once your system is set up. Additional instructions are available on our [wiki page](wiki/GCP%20RAPT%20In-depth%20Documentation%20and%20Examples.md). -1. In a browser, sign into [GCP](https://console.cloud.google.com/) -2. Invoke a Cloud Shell -3. Download the latest release by executing the following commands: - - ``` - ~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v0.5.1/rapt-v0.5.1.tar.gz - ~$ tar -xzf rapt.tar.gz && rm -f rapt.tar.gz - ``` -4. Run `run_rapt_gcp.sh help` to see the *GCP RAPT* usage information. - -### Try an example -To run RAPT, you need Illumina-sequenced reads for the genome you wish to assemble and annotate. These can be in a fasta file in a Google storage bucket, or they can be in a run in SRA (an accession).
-Important: Only reads sequenced on **Illumina machines** can be used by RAPT. - -#### Starting from an SRA run
-To demonstrate how to run RAPT, we are going to use SRR3496277, a set of reads available in SRA for *Mycoplasma pirum*.
-This example takes about 1 hour. - -Run the following command, where [gs://your_results_bucket](https://cloud.google.com/storage/docs/creating-buckets) is the Google storage bucket where the outputs and logs will be copied when the job finishes. - -```bash -~$ ./run_rapt_gcp.sh submitacc SRR3496277 --bucket gs://your_results_bucket
-``` - - -If the job is successfully created, the script will print out execution information similar to the following: -``` -RAPT job has been created successfully. ----------------------------------------------------- -Job-id: 5541b09bb9 -Output storage: gs://your_results_bucket/5541b09bb9 -GCP account: 1111111111111-compute@developer.gserviceaccount.com -GCP project: example ----------------------------------------------------- - -[**Attention**] RAPT jobs may take hours to finish. Progress of this job can be viewed in GCP stackdriver log viewer at: - - https://console.cloud.google.com/logs/viewer?project=strides-documentation-testing&filters=text:5541b09bb9 - -For current status of this job, run: - - run_rapt_gcp.sh joblist | fgrep 5541b09bb9 - -For technical details of this job, run: - - run_rapt_gcp.sh jobdetails 5541b09bb9 -~$ -``` - - -Check the status of the jobs executed under this project, run: -```bash -~$ ./run_rapt_gcp.sh joblist - -GCP Account: [1111111111111-compute@example.gserviceaccount.com] -Project: [example] -JOB_ID USER LABEL SRR STATUS START_TIME END_TIME OUTPUT_URI -5541b09bb9 tester SRR3496277 Running 2020-07-10T18:52:30 gs://your_results_bucket/2565f37562 -~$ -``` - - -The results for the job will be available in the bucket you specified after the job is marked 'Done'. Please note that some runs may take up to 24 hours. - -#### Starting from fastq or fasta file
-You can use a fastq or a fasta file produced by Illumina sequencers as input to RAPT. This file can contain paired-end reads, with the two reads of a pair adjacent to each other in the file or single-end reads. Note that the quality scores are not necessary. The file needs to be copied to the Google storage bucket before you run `run_rapt_gcp.sh`. - -The genus species of the sequenced organism needs to be provided on the command line. The strain is optional. -Here is an example command using a file available in the bucket named your_input_bucket: - -```bash -~$ ./run_rapt_gcp.sh submitfastq gs://your_input_bucket/M_pirum_25960.fastq -b gs://your_results_bucket --label M_pirum_25960 --organism "Mycoplasma pirum" --strain "ATCC 25960" -``` - - -If the job is successfully created, the script will print out execution information similar to the following: - -``` -RAPT job has been created successfully. ----------------------------------------------------- -Job-id: b2ac02d7c7 -Output storage: gs://your_results_bucket/b2ac02d7c7 -GCP account: 1111111111111-compute@developer.gserviceaccount.com -GCP project: example ----------------------------------------------------- - -[**Attention**] RAPT jobs may take hours to finish. Progress of this job can be viewed in GCP stackdriver log viewer at: - - https://console.cloud.google.com/logs/viewer?project=strides-documentation-testing&filters=text:b2ac02d7c7 - -For current status of this job, run: - - run_rapt_gcp.sh joblist | fgrep b2ac02d7c7 - -For technical details of this job, run: - - run_rapt_gcp.sh jobdetails b2ac02d7c7 - -~$ -``` - - -To get more execution details and examples in our [wiki page](https://github.com/ncbi/rapt/wiki/GCP%20RAPT%20In-depth%20Documentation%20and%20Examples.md). -- Setting up GCP with step by step guide -- Using fastq files as input - -If you have other questions, please visit our [FAQs page](https://github.com/ncbi/rapt/wiki/FAQ.md). 
- -### Review the output -*GCP RAPT* generates a tarball named `output.tar.gz` in your designated bucket, under a "directory" named after the 10-character job-id assigned at the start of the execution (i.e. "2894b72f9f"). The tarball contains the following files: -1. concise.log is file with the log of major stages and status of your RAPT run
-2. verbose.log is a detailed log file of all the actions and console outputs that RAPT performed for your run
-3. skesa.out.fa: multifasta files of the assembled contigs produced by SKESA
-4. ani-tax-report.txt and ani-tax-report.xml: Taxonomy verification results in text or XML format
-5. PGAP annotation results in multiple formats:
- * annot.gbk: annotated genome in GenBank flat file format
- * annot.gff: annotated genome in GFF3 format
- * annot.sqn: annotated genome in ASN format
- * annot.faa: multifasta file of the proteins annotated on the genome
- * annot.fna: multifasta file of the transcripts annotated on the genome
- * calls.tab: tab-delimited file of the coordinates of detected foreign sequence. Empty if no foreign contaminant was found. - -Along with the tarball there is also a `run.log` file generated automatically by the Google Life Sciences Pipeline where RAPT is invoked. This file catches all output to stdout and stderr by anything, and may be helpful to identify the problem should any happens. - - diff --git a/RAPT_context4.png b/RAPT_context4.png deleted file mode 100644 index 9d0f1d0..0000000 Binary files a/RAPT_context4.png and /dev/null differ diff --git a/RAPT_context_Apr2022.png b/RAPT_context_Apr2022.png new file mode 100644 index 0000000..e24b50a Binary files /dev/null and b/RAPT_context_Apr2022.png differ diff --git a/README.md b/README.md index 9b5e60d..3de940e 100644 --- a/README.md +++ b/README.md @@ -1,18 +1,24 @@ # Read Assembly and Annotation Pipeline Tool (RAPT) -RAPT is a NCBI pipeline designed for assembling and annotating short genomic sequencing reads obtained from bacterial or archaeal isolates. RAPT consists of two major components, [SKESA](https://github.com/ncbi/SKESA) and [PGAP](https://github.com/ncbi/pgap). SKESA is a *de novo* assembler for microbial genomes based on DeBruijn graphs. PGAP is a prokaryotic genome annotation pipeline that combines *ab initio* gene prediction algorithms with homology-based methods. RAPT takes an SRA run or a fasta or fastq file of Illumina reads as input and produces an assembled and annotated genome. +RAPT is an NCBI pipeline designed for **assembling and annotating short genomic sequencing reads** obtained from **bacterial or archaeal isolates** *de novo*. It takes an SRA run or a fasta or fastq file of Illumina reads as input and produces an assembled and annotated genome of **quality comparable to RefSeq** in a couple of hours. 
+RAPT consists of three major components, the genome assembler [SKESA](https://github.com/ncbi/SKESA), the taxonomic assignment tool [ANI](https://pubmed.ncbi.nlm.nih.gov/29792589/) and the Prokaryotic Genome Annotation Pipeline ([PGAP](https://github.com/ncbi/pgap)). + +With RAPT you will:
+* **assemble your reads** into contigs
+* **assign a scientific name** to the assembly
+* **predict coding and non-coding genes** *de novo*, including anti-microbial resistance (AMR) genes and virulence factors, based on expert-curated data such as hidden Markov models and conserved domain architectures
If you are new to RAPT, please visit our [wiki page](https://github.com/ncbi/rapt/wiki) for detailed information, and watch a [short webinar](https://www.youtube.com/watch?v=7trM1pKAVXQ). -![RAPT](RAPT_context4.png) +![RAPT](RAPT_context_Apr2022.png) To use the latest version, download the RAPT command-line interface with the following commands: ``` -~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v0.5.1/rapt-v0.5.1.tar.gz +~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v0.5.2/rapt-v0.5.2.tar.gz ~$ tar -xzf rapt.tar.gz && rm -f rapt.tar.gz ``` -There should be two scripts in your directory now, `run_rapt_gcp.sh` and `run_rapt.py`, corresponding to two variations of RAPT: Google Cloud Platform (GCP) RAPT and Standalone RAPT. [GCP RAPT](https://github.com/ncbi/rapt/wiki/GCP%20RAPT%20In-depth%20Documentation%20and%20Examples) is designed to run on GCP and is for users with GCP accounts (please note this is different from a gmail account), while [Stand-alone RAPT](https://github.com/ncbi/rapt/wiki/Standalone%20RAPT%20In-depth%20Documentation%20and%20Recommendations) can run on any computing environments meeting a few pre-requisites. +There should be two scripts in your directory now, `run_rapt_gcp.sh` and `run_rapt.py`, corresponding to two variations of RAPT: Google Cloud Platform (GCP) RAPT and Standalone RAPT. [GCP RAPT](https://github.com/ncbi/rapt/wiki/GCP_RAPT_doc) is designed to run on GCP and is for users with GCP accounts (please note this is different from a gmail account), while [Stand-alone RAPT](https://github.com/ncbi/rapt/wiki/Standalone_RAPT_doc) can run on any computing environments meeting a few pre-requisites. 
-For instructions on running RAPT, please go to their respective documentation pages: [GCP RAPT](https://github.com/ncbi/rapt/wiki/GCP%20RAPT%20In-depth%20Documentation%20and%20Examples) or [Stand-alone RAPT](https://github.com/ncbi/rapt/wiki/Standalone%20RAPT%20In-depth%20Documentation%20and%20Recommendations). +For instructions on running RAPT, please go to their respective documentation pages: [GCP RAPT](https://github.com/ncbi/rapt/wiki/GCP_RAPT_doc) or [Stand-alone RAPT](https://github.com/ncbi/rapt/wiki/Standalone_RAPT_doc). diff --git a/Standalone RAPT.md b/Standalone RAPT.md deleted file mode 100644 index 85f0bac..0000000 --- a/Standalone RAPT.md +++ /dev/null @@ -1,101 +0,0 @@ -# Standalone RAPT – Documentation - -This page contains instruction and examples for running RAPT.`run_rapt.py` is a python script that provides an interface to run the *Standalone RAPT* on a local machine. "Local" means the same machine as where `run_rapt.py` is executed. It could be a physical machine on premise, or more conveniently, a cloud VM instance. -Some basic knowledge of Unix/Linux commands, [SKESA](https://github.com/ncbi/SKESA), and [PGAP](https://github.com/ncbi/pgap) is useful in completing this tutorial. -Please see our [wiki page](https://github.com/ncbi/rapt/wiki) for References, Licenses, FAQs, and In-depth Documentation and Examples. - - -## System requirements - -The machine must satisfy the following prerequisites: - -* At least 4GB memory per CPU core
-* At least 8 CPU cores and 32 GB memory
-* Linux OS preferred, Windows 10 (pro or enterprise version) will also work but extra configuration is required
-* Internet connection
-* Container runner installed (currently supports Docker/Podman/Singularity), Docker is recommended
-* Python installed
-* 100GB free storage space on disk
- - -### Additional tips if using Windows 10 (pro/enterprise version) -1. Right now it seems to work only on a real physical machine (L0, metal) with CPUs that support virtualization (such as Intel VT-x technology); make sure this feature is enabled in the BIOS
-2. Windows 10 only, must be at least Professional or Enterprise version (hypervisor capability)
-3. Install python and Docker Desktop
-4. Start Docker service with hyper-V enabled
-5. Make sure Docker has switched to 'Linux containers'. It should do so by default if hyper-V is up and running.
- -## Quick start -Here are instructions to execute RAPT once your system is set up. Additional instructions are available on our [wiki page](https://github.com/ncbi/rapt/wiki/Standalone%20RAPT%20In-depth%20Documentation%20and%20Recommendations.md). -1. Go to your machine or instance command line
-2. Download the latest release by executing the following commands:
- - ``` - ~$ curl -sSLo rapt.tar.gz https://github.com/ncbi/rapt/releases/download/v0.5.1/rapt-v0.5.1.tar.gz - ~$ tar -xzf rapt.tar.gz && rm -f rapt.tar.gz - ``` -3. Run `./run_rapt.py -h` to see the *Stand-alone RAPT* usage information
- -### Try an example -To run RAPT, you need Illumina-sequenced reads for the genome you wish to assemble and annotate. These can be in a fasta file on the machine where you wish to run RAPT, or they can be in a run in the NCBI Sequence Read Archive (SRA).
-Important: Only reads sequenced on **Illumina machines** can be used by RAPT. - -#### Starting from an SRA run
-To demonstrate how to run RAPT, we are going to use SRR3496277, a set of reads available in SRA for *Mycoplasma pirum*.
-This example takes about 1 hour to complete (time may vary depending on the configuration of the computer). - -Run the following command; the outputs and logs will be located in the current directory when the job finishes. - -``` -~$ ./run_rapt.py -a srr3496277 -RAPT is now running, it may take a long time to finish. To see the progress, track the verbose log file /path/to/current_dir/raptout_e26d552147/verbose.log. -~$ -``` - - -All output files and logs will be located in a subdirectory named `raptout_xxxxxxxxxx` under the current directory. `xxxxxxxxxx` is the RUNID generated by `run_rapt.py`, unique to each launch. Please note that some runs may take up to 24 hours.
- -#### Starting from fastq or fasta file
-You can use a fastq or a fasta file produced by Illumina sequencers as input to RAPT. This file can contain paired-end reads, with the two reads of a pair adjacent to each other in the file or single-end reads. Note that the quality scores are not necessary. The file needs to be on the local file system. -The genus species of the sequenced organism needs to be provided on the command line. The strain is optional. -Here is an example command using a file already on your computer: - -```bash -~$ ./run_rapt.py -q path/to/srr3496277.fastq --organism "Mycoplasma pirum" --strain "ATCC 25960" -RAPT is now running, it may take a long time to finish. To see the progress, track the verbose log file /home/username/raptout_d3e7956148/verbose.log. -~$ -``` - - -To get more execution details and examples, see our [wiki page](https://github.com/ncbi/rapt/wiki/Standalone%20RAPT%20In-depth%20Documentation%20and%20Recommendations.md). -- Help Documentation
-- Reference data location
-- Advanced Options - -If you have other questions, please visit our [FAQs page](https://github.com/ncbi/rapt/wiki/FAQ.md). - -### Review the output
- -RAPT generates 11 output files if completes normally without error. The default location of result output is in the current directory. Each run of RAPT will create a subdirectory bearing the name raptout_ where is a random 10-character string. The --tag JOBID switch can be used to specify a human-readable job id which will be appended after the random RUNID for easy recognition. - -To store the output in location other than the current directory, use the -o or --output-dir switch to specify the desired location: -```bash -~$ ./run_rapt.py -q path/to/srr3496277.fastq --organism "Mycoplasma pirum" --strain "ATCC 25960" --output-dir path/to/output-dir -``` - - -All messages from RAPT execution are logged, with time stamps, in a file named `verbose.log` in the output directory. A simpler version log file, `concise.log`, is also created with only entries mark the main stages and status. Below is the list of expected output files: - -1. concise.log
-2. verbose.log -3. skesa.out.fa: multifasta files of the assembled contigs produced by SKESA
-4. ani-tax-report.txt and ani-tax-report.xml: Taxonomy verification results in text or XML format
-5. PGAP annotation results in multiple formats:
- * annot.gbk: annotated genome in GenBank flat file format
- * annot.gff: annotated genome in GFF3 format
- * annot.sqn: annotated genome in ASN format
- * annot.faa: multifasta file of the proteins annotated on the genome
- * annot.fna: multifasta file of the transcripts annotated on the genome
- * calls.tab: tab-delimited file of the coordinates of detected foreign sequence. Empty if no foreign contaminant was found. - -See a [detailed description of the annotation output files](https://github.com/ncbi/pgap/wiki/Output-Files) for more information. \ No newline at end of file diff --git a/dist/CHANGELOG.md b/dist/CHANGELOG.md index 5b59048..13c693e 100644 --- a/dist/CHANGELOG.md +++ b/dist/CHANGELOG.md @@ -1,3 +1,7 @@ +### Release v0.5.2 +* PGAP at 2022-04-14.build6021 +* Add `--auto-correct-tax` switch + ### Release v0.5.1 * PGAP at 2022-02-10.build5872 diff --git a/dist/README.txt b/dist/README.txt index feda682..a26429a 100644 --- a/dist/README.txt +++ b/dist/README.txt @@ -1,4 +1,4 @@ -Read Assembly and Annotation Pipeline Tool (RAPT) v0.5.1 +Read Assembly and Annotation Pipeline Tool (RAPT) v0.5.2 RAPT is a NCBI pipeline designed for assembling and annotating Illumina genome sequencing reads obtained from bacterial or archaeal isolates. RAPT consists of two major NCBI components, SKESA and PGAP. SKESA is a de-novo assembler for microbial genomes based on DeBruijn graphs. PGAP is a prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology based methods. RAPT takes an Illumina SRA run or a fasta file as input and produces an assembled and annotated genome. 
diff --git a/dist/release-notes.txt b/dist/release-notes.txt index aa2138c..91c08df 100644 --- a/dist/release-notes.txt +++ b/dist/release-notes.txt @@ -1,9 +1,9 @@ -RELEASE: v0.5.1 -DATE: 03-18-2022 -BUILD: rapt-37347638 +RELEASE: v0.5.2 +DATE: 05-04-2022 +BUILD: rapt-38092134 SKESA: 2.5.0 -PGAPX: 2022-02-10.build5872 +PGAPX: 2022-04-14.build6021 DESCRIPTION: -This release updates PGAP to 2022-02-10.build5872 +This release updates PGAP to 2022-04-14.build6021, and implemented --auto-correct-tax switch: when specified, in the case that the genome sequence is misassigned or contaminated and ANI predicts an organism with HIGH confidence, the system will use the predicted organism for PGAP instead of the one provided by the user. diff --git a/dist/run_rapt.py b/dist/run_rapt.py index 29b017a..1452ed5 100755 --- a/dist/run_rapt.py +++ b/dist/run_rapt.py @@ -12,9 +12,9 @@ ##to be compatible with python2 from abc import ABCMeta, abstractmethod -IMAGE_URI="ncbi/rapt:v0.5.1" +IMAGE_URI="ncbi/rapt:v0.5.2" -RAPT_VERSION="rapt-37347638" +RAPT_VERSION="rapt-38092134" DEFAULT_REF_DIR = '.rapt_refdata' @@ -23,7 +23,7 @@ FLG_SKESA_ONLY = 'skesa_only' FLG_NO_REPORT = 'no_report' FLG_STOP_ON_ERRORS = 'stop_on_errors' - +FLG_AUTO_CORRECT_TAX = 'auto_correct_tax' CONCISE_LOG='concise.log' VERBOSE_LOG='verbose.log' @@ -590,6 +590,8 @@ def main(args, parser): parser.add_argument('--stop-on-errors', dest=ARGDEST_FLAGS, action='append_const', const=FLG_STOP_ON_ERRORS, help='Do not run PGAP annotation pipeline when the genome sequence is misassigned or contaminated') + parser.add_argument('--auto-correct-tax', dest=ARGDEST_FLAGS, action='append_const', const=FLG_AUTO_CORRECT_TAX, help='If the genome sequence is misassigned or contaminated and ANI predicts an organism with HIGH confidence, use it for PGAP instead of the one provided by the user') + parser.add_argument('-o', '--output-dir', dest=ARGDEST_OUTDIR, help='Directory to store results and logs. 
If omitted, use current directory') ##general switches diff --git a/dist/run_rapt_gcp.sh b/dist/run_rapt_gcp.sh index ac6ecf6..3474aa6 100755 --- a/dist/run_rapt_gcp.sh +++ b/dist/run_rapt_gcp.sh @@ -1,8 +1,8 @@ #!/usr/bin/env bash ###############################* Global Constants *################################## -IMAGE_URI="ncbi/rapt:v0.5.1" -RAPT_VERSION="rapt-37347638" +IMAGE_URI="ncbi/rapt:v0.5.2" +RAPT_VERSION="rapt-38092134" GCP_LOGS_VIEWER="https://console.cloud.google.com/logs/viewer" @@ -44,6 +44,7 @@ OPT_JOBTIMEOUT="--timeout" FLG_SKESA_ONLY="--skesa-only" FLG_NOREPORT="--no-usage-reporting" FLG_STOPONERRORS="--stop-on-errors" +FLG_AUTO_CORRECT_TAX="--auto-correct-tax" FLG_USE_CSV="--csv" ####################################### Utilities #################################### @@ -72,8 +73,8 @@ Usage: ${script_name} [options] Job creation commands: ${CMD_ACXN} <${OPT_BUCKET}|${OPT_BUCKET_L} URL> [${OPT_LABEL} LABEL] - [${FLG_SKESA_ONLY}] [${FLG_NOREPORT}] [${FLG_STOPONERRORS}] [${OPT_VMTYPE} TYPE] - [${OPT_BDSIZE} NUM] [${OPT_JOBTIMEOUT} SECONDS] + [${FLG_SKESA_ONLY}] [${FLG_NOREPORT}] [${FLG_STOPONERRORS}] [${FLG_AUTO_CORRECT_TAX}] + [${OPT_VMTYPE} TYPE] [${OPT_BDSIZE} NUM] [${OPT_JOBTIMEOUT} SECONDS] Submit a job to run RAPT on an SRA run accession (sra_acxn). @@ -131,6 +132,11 @@ Job creation commands: Optional. Do not run PGAP annotation pipeline when the genome sequence is misassigned or contaminated. + ${FLG_AUTO_CORRECT_TAX} + Optional. If the genome sequence is misassigned or contaminated and ANI predicts + an organism with HIGH confidence, use it for PGAP instead of the one provided by + the user. + ${OPT_REGIONS} Optional, comma-separated. Specify in which GCP region(s) RAPT should run. @@ -359,6 +365,9 @@ parse_opts() flags+=("stop_on_errors") ;; + ${FLG_AUTO_CORRECT_TAX}) + flags+=("auto_correct_tax") + ;; ${FLG_USE_CSV}) format="csv" ;;
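
Review note: the `run_rapt.py` hunk above registers `--auto-correct-tax` with `action='append_const'`, so the new switch is collected into the same list as `--stop-on-errors`. The sketch below is a minimal, standalone illustration of that argparse pattern, not the script's actual code. The flag constants are copied from the diff; the value of `ARGDEST_FLAGS` is not shown in the diff, so `'flags'` is an assumption, and `build_parser` and `parse_flags` are hypothetical helper names introduced only for this example.

```python
import argparse

# Flag constants as they appear in the diff.
FLG_STOP_ON_ERRORS = 'stop_on_errors'
FLG_AUTO_CORRECT_TAX = 'auto_correct_tax'

# Assumption: the real value of ARGDEST_FLAGS is not visible in the diff.
ARGDEST_FLAGS = 'flags'


def build_parser():
    """Hypothetical reduced parser holding only the two switches shown in the hunk."""
    parser = argparse.ArgumentParser(prog='run_rapt_sketch')
    parser.add_argument(
        '--stop-on-errors', dest=ARGDEST_FLAGS,
        action='append_const', const=FLG_STOP_ON_ERRORS,
        help='Do not run PGAP when the genome sequence is misassigned or contaminated')
    parser.add_argument(
        '--auto-correct-tax', dest=ARGDEST_FLAGS,
        action='append_const', const=FLG_AUTO_CORRECT_TAX,
        help='Use the organism predicted by ANI with HIGH confidence for PGAP')
    return parser


def parse_flags(argv):
    """Return the list of flag constants selected on the command line."""
    args = build_parser().parse_args(argv)
    # With append_const the destination stays None when no switch is given.
    return getattr(args, ARGDEST_FLAGS) or []


if __name__ == '__main__':
    print(parse_flags(['--stop-on-errors', '--auto-correct-tax']))
```

Because both switches share one destination, a downstream consumer only needs a membership test such as `FLG_AUTO_CORRECT_TAX in flags`. How the real script forwards the list to the container is not shown in this diff.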