FAQ
As of December 2024, NCBI's pilot tool, the Read assembly and Annotation Pipeline Tool (RAPT), will no longer be available. We encourage you to check out NCBI’s suite of assembly and annotation tools, including the genome assembler SKESA, the taxonomic assignment tool ANI, and the Prokaryotic Genome Annotation Pipeline (PGAP).
At this time, RAPT only supports reads produced on the Illumina sequencing platform. Reads can be provided to RAPT as FASTA or FASTQ files, or as SRA run accessions (starting with the SRR, DRR or ERR prefix).
No. RAPT is only designed to work on data sequenced from bacterial or archaeal isolates.
At this time, RAPT only supports SKESA. If you wish to annotate an already assembled genome, please use PGAP.
At the moment, there are two variations of RAPT: Google Cloud Platform (GCP) RAPT and Standalone RAPT.
Please see the respective documentation pages for prerequisites and instructions: GCP RAPT, Standalone RAPT.
For each run of the pipeline, multiple reports are generated: one at the beginning and one at the end of each phase of RAPT. These reports help us measure our impact on the community, which in turn helps us get funds, so please report your usage. For more information, see the NCBI privacy policy. What we collect will look like this:
1 34.86.175.158 8bd35abb-8a04-4984-9d88-59c349824819 2020-07-10T12:24:44 rapt_start
1 34.86.175.158 8bd35abb-8a04-4984-9d88-59c349824819 2020-07-10T12:25:55 skesa_success
1 34.86.175.158 8bd35abb-8a04-4984-9d88-59c349824819 2020-07-10T12:54:22 ani_start
1 34.86.175.158 8bd35abb-8a04-4984-9d88-59c349824819 2020-07-10T13:24:88 ani_success
1 34.86.175.158 8bd35abb-8a04-4984-9d88-59c349824819 2020-07-10T13:25:44 pgap_start
1 34.86.175.158 8bd35abb-8a04-4984-9d88-59c349824819 2020-07-10T19:46:11 pgap_success
1 34.86.175.158 8bd35abb-8a04-4984-9d88-59c349824819 2020-07-10T19:46:11 rapt_exit
Although we recommend always reporting information back to NCBI, because this helps us build a better product by understanding usage and errors, you can disable reporting by adding the `--no-usage-reporting` flag to the `run_rapt_gcp.sh` or the `run_rapt.py` job submission command.
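The sample report lines above are whitespace-separated, so they are easy to inspect yourself. The sketch below parses three of them and computes the elapsed time between `rapt_start` and `rapt_exit`. The field names (`first_field`, `ip`, `run_id`) are guesses made for illustration, since this FAQ does not document what each column means.

```python
from datetime import datetime

# Three of the sample report lines shown above, copied verbatim.
SAMPLE = """\
1 34.86.175.158 8bd35abb-8a04-4984-9d88-59c349824819 2020-07-10T12:24:44 rapt_start
1 34.86.175.158 8bd35abb-8a04-4984-9d88-59c349824819 2020-07-10T13:25:44 pgap_start
1 34.86.175.158 8bd35abb-8a04-4984-9d88-59c349824819 2020-07-10T19:46:11 rapt_exit
"""


def parse_report_line(line):
    """Split one report line into a dict. Field names are assumptions."""
    first_field, ip, run_id, timestamp, event = line.split()
    return {
        "first_field": first_field,  # meaning not documented in this FAQ
        "ip": ip,
        "run_id": run_id,
        "time": datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S"),
        "event": event,
    }


records = [parse_report_line(line) for line in SAMPLE.splitlines() if line.strip()]
time_by_event = {record["event"]: record["time"] for record in records}

# Wall-clock time between the first and last report of the run.
total = time_by_event["rapt_exit"] - time_by_event["rapt_start"]
print(f"Run took {total}")
```

Keying the timestamps by event name keeps the example short. It assumes each event appears once per run, which holds for the sample shown here.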
The taxonomy check step indicates that the organism for my input data is misassigned. What does it mean, and what should I do?
The taxonomy check done within RAPT with the Average Nucleotide Identity (ANI) tool compares the set of contigs assembled by RAPT to type strain assemblies available in GenBank. A misassignment indicates that the short read sequences provided to RAPT as input come from a different organism than the one provided. If you agree with the ANI assessment and wish to use the ANI-chosen scientific name in the downstream steps, re-run RAPT with the flag `--auto-correct-tax`. This will guarantee the best annotation quality possible.
I am not confident in the taxonomic classification of the organism I sequenced, so the scientific name I can provide is only a guess. Is it acceptable?
Yes! The taxonomy check done within RAPT with ANI can assign a scientific name to your assembly based on its best-matching assembly in GenBank that is of well-defined origin. If you run RAPT with the flag `--auto-correct-tax`, the scientific name determined by ANI will override the scientific name you provide on input, resulting in a more accurate annotation. The scientific name in the final results will be the ANI-chosen name.
Can I make sure RAPT stops if the taxonomy check indicates my sample may be misassigned or contaminated?
Yes. Add the flag `--stop-on-errors` to the `run_rapt_gcp.sh` or the `run_rapt.py` job submission command, and RAPT will stop if the taxonomy check indicates that the species assigned to the reads is incorrect or that the read set is contaminated.
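If you launch RAPT from a script, it can help to assemble the optional flags in one place. The sketch below appends the three flags named in this FAQ to a base command you supply. The helper `add_faq_flags` and the placeholder `YOUR_INPUT_ARGUMENTS` are hypothetical. The arguments that select your input are not covered here and should be taken from the GCP RAPT or Standalone RAPT documentation.

```python
import shlex


def add_faq_flags(base_command, auto_correct_tax=False,
                  stop_on_errors=False, no_usage_reporting=False):
    """Return a copy of base_command with the selected FAQ flags appended."""
    command = list(base_command)
    if auto_correct_tax:
        command.append("--auto-correct-tax")
    if stop_on_errors:
        command.append("--stop-on-errors")
    if no_usage_reporting:
        command.append("--no-usage-reporting")
    return command


# Hypothetical base command: replace the placeholder with your real input arguments.
base = ["./run_rapt.py", "YOUR_INPUT_ARGUMENTS"]
cmd = add_faq_flags(base, stop_on_errors=True)

# Print the command for review instead of running it.
print(shlex.join(cmd))
```

Building the command as a list, rather than a single string, avoids quoting problems if you later pass it to `subprocess.run`.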
The cost of running RAPT increases roughly linearly with the size of the genome assembled from the read set provided on input. Below are examples of inputs and their runtimes on GCP `n1-highmem-8` (8-CPU) machines. For reference, the dollar cost of renting such machines can be derived from the current Google virtual machine cost structure.
SRA run | Species | Size of genome produced (Mb) | Runtime (min) |
---|---|---|---|
SRR11101319 | Campylobacter jejuni | 1.9 | 58 |
SRR11147196 | Listeria monocytogenes | 3 | 113 |
SRR4457405 | Clostridium perfringens | 3.7 | 97 |
ERR4436589 | Acinetobacter baumannii | 4 | 104 |
SRR12431019 | Salmonella enterica | 4.8 | 109 |
ERR2116816 | Enterobacter cloacae | 4.9 | 124 |
ERR4338267 | Salmonella enterica | 5.1 | 127 |
SRR6048050 | Pseudomonas aeruginosa | 6.5 | 171 |
SRR11046561 | Klebsiella oxytoca | 6.6 | 173 |
ERR1974692 | Pseudomonas aeruginosa | 7.3 | 193 |
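To get a rough feel for the "roughly linear" relationship, you can fit a straight line to the table above. The sketch below uses an ordinary least-squares fit on the ten rows and estimates the runtime for a genome size of your choosing. It is an illustration based on these ten runs only, not a runtime guarantee, and the 5.5 Mb example size is arbitrary.

```python
# (genome size in Mb, runtime in minutes) pairs copied from the table above.
RUNS = [
    (1.9, 58), (3.0, 113), (3.7, 97), (4.0, 104), (4.8, 109),
    (4.9, 124), (5.1, 127), (6.5, 171), (6.6, 173), (7.3, 193),
]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(RUNS)


def estimate_runtime_min(genome_size_mb):
    """Estimated runtime in minutes on the machine type used for the table."""
    return slope * genome_size_mb + intercept


print(f"Slope: {slope:.1f} min per Mb, intercept: {intercept:.1f} min")
print(f"Estimated runtime for a 5.5 Mb genome: {estimate_runtime_min(5.5):.0f} min")
```

Multiplying the estimated runtime by the hourly price of the machine gives an approximate cost per run.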
See the PGAP FAQs
Please open an issue, after checking that your question was not addressed in previously opened issues.
Why does my run occasionally not finish, producing no logs or messages in the terminal, even though the pipeline still seems to be running?
You are most likely running the pipeline on a remote machine over ssh, and the connection has been interrupted. When working on a remote machine, use the nohup utility or a terminal multiplexer, such as tmux or screen, to allow `run_rapt.py` to continue if the ssh connection is interrupted.
One possible reason is failure to connect to SRA. Such failures are reported in the log file `run.log` with the line `SRA connection check failed with code 1, abort.` and are typically transient. Please retry.
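If you run many jobs, you may want to detect this failure automatically before deciding to retry. The sketch below scans the text of a log for the message quoted above. The function name and the first line of the sample log are hypothetical. Only the failure line itself comes from this FAQ.

```python
# Marker taken from the log line quoted in this FAQ.
FAILURE_MARKER = "SRA connection check failed"


def sra_connection_failed(log_text):
    """Return True if any line of the log reports the SRA connection failure."""
    return any(FAILURE_MARKER in line for line in log_text.splitlines())


# Hypothetical log content: only the second line is taken from this FAQ.
sample_log = (
    "example line standing in for earlier log output\n"
    "SRA connection check failed with code 1, abort.\n"
)

if sra_connection_failed(sample_log):
    print("Transient SRA connection failure detected; retry the run.")
```

In practice you would read the text from your own `run.log`, for example with `pathlib.Path("run.log").read_text()`, using whatever output location your run was configured with.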
What are the default options for --machine-type TYPE, --boot-disk-size NUM, and --timeout SECONDS, and why would I change them?
--machine-type TYPE
Default is "n1-highmem-8" (refer to the Google Cloud documentation), which is suitable for most jobs. The larger the machine, the faster the job will be.
There is a point of diminishing returns, which will vary per user and their cost/time preferences.
--boot-disk-size NUM
Optional. Set the size (in GB) of the boot disk for the virtual machine. Default size is 128. The larger the boot disk, the faster the job will be.
There is a point of diminishing returns, which will vary per user and their cost/time preferences.
--timeout SECONDS
Optional. Set the timeout (in seconds) for the job. Default is 86400 (24 hours). If you have a job that does not complete in this time, you can increase the timeout and/or use a larger machine type.
My run is marked 'Failed', and the message in the log is "Execution failed: selecting resources: selecting region and zone: no available zones: us-central1: CPUS quota too low"
This message is caused by insufficient compute quota available to your GCP project in the us-central1 region for RAPT to execute with the default machine, n1-highmem-8. This commonly occurs when using a "free GCP" account. The first step is to view your quotas. On the line “Compute Engine API – CPUs”, select “All quotas” and find the region(s) where non-zero quota is available. Use the `--regions` parameter to specify a region where you have an allotted quota. You must also select a machine size that is equal to or lower than your quota limit, using the `--machine-type` parameter.
Please note: if you use an instance rather than command shell, the instance is counted as a machine.
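The quota arithmetic can be sketched as follows: subtract the CPUs already used by your own instance from the regional CPU quota, then pick the largest machine that fits. The vCPU counts listed are the standard N1 high-memory sizes. The helper name and the example quota values are hypothetical, so read the real numbers from your quota page.

```python
# vCPU counts of the standard N1 high-memory machine types.
N1_HIGHMEM_VCPUS = [2, 4, 8, 16, 32, 64, 96]


def largest_machine_within_quota(cpu_quota, cpus_in_use=0):
    """Return the largest n1-highmem type that fits the remaining CPU quota.

    Returns None when not even the smallest type fits.
    """
    available = cpu_quota - cpus_in_use
    fitting = [vcpus for vcpus in N1_HIGHMEM_VCPUS if vcpus <= available]
    if not fitting:
        return None
    return f"n1-highmem-{max(fitting)}"


# Hypothetical example: a quota of 8 CPUs in the chosen region.
print(largest_machine_within_quota(8))
# Same quota, but a 2-CPU instance of your own is already running there.
print(largest_machine_within_quota(8, cpus_in_use=2))
```

The value returned is what you would pass to `--machine-type`. If the result is smaller than the machine size recommended elsewhere in this FAQ, consider choosing a region with more quota instead.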
Yes, by using larger machines. By default, RAPT runs on `n1-highmem-8` machines. You can run RAPT on a larger machine by adjusting the `--machine-type` parameter. In our hands, the runtime decreases by 30% on average when switching from `n1-highmem-8` machines to `n1-highmem-16`, but the cost increases by about 40%, based on the current Google virtual machine cost structure.
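The two percentages are consistent with each other if the larger machine costs about twice as much per hour, which is an assumption made here for illustration and should be checked against current pricing. A worked version of the arithmetic, using the 109-minute Salmonella enterica run from the runtime table as a baseline:

```python
def relative_cost(price_ratio, runtime_reduction):
    """Cost of a run on the larger machine relative to the smaller one.

    price_ratio: hourly price of the larger machine / smaller machine.
    runtime_reduction: fractional runtime saved, e.g. 0.30 for 30%.
    """
    return price_ratio * (1.0 - runtime_reduction)


PRICE_RATIO = 2.0         # assumption: n1-highmem-16 costs twice n1-highmem-8 per hour
RUNTIME_REDUCTION = 0.30  # average reduction reported in this FAQ

cost_ratio = relative_cost(PRICE_RATIO, RUNTIME_REDUCTION)

baseline_runtime_min = 109  # SRR12431019 on n1-highmem-8, from the runtime table
faster_runtime_min = baseline_runtime_min * (1.0 - RUNTIME_REDUCTION)

print(f"Relative cost: {cost_ratio:.2f}x")
print(f"Runtime: {baseline_runtime_min} min -> {faster_runtime_min:.1f} min")
```

Whether paying roughly 40% more to save about half an hour is worthwhile depends on how many runs you have and how urgent they are.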
By default, RAPT runs on `n1-highmem-8` machines. We do not recommend running RAPT on smaller machines.
Follow the set-up instructions for running in a Cloud Shell.
- On the GCP screen from the last step, click "Compute Engine", or navigate to the "Compute Engine" section by clicking on the navigation menu with the "hamburger icon" (three horizontal lines) in the top left corner.
- Click on the blue "CREATE INSTANCE" button on the top bar.
- Create an instance with the default parameters. Give your instance a name for tracking and enable access to all Cloud APIs. Also note the estimated cost for your records.
- Click the blue "Create" button. This will create and start the VM.
- SSH into your instance.