diff --git a/README.md b/README.md index 4b7fab7d53..83c092b021 100644 --- a/README.md +++ b/README.md @@ -39,12 +39,14 @@ ADAM does much more than just k-mer counting. Running the ADAM CLI without argum ``` $ adam-submit - e 888~-_ e e e - d8b 888 \ d8b d8b d8b - /Y88b 888 | /Y88b d888bdY88b - / Y88b 888 | / Y88b / Y88Y Y888b - /____Y88b 888 / /____Y88b / YY Y888b -/ Y88b 888_-~ / Y88b / Y888b + e 888~-_ e e e + d8b 888 \ d8b d8b d8b + /Y88b 888 | /Y88b d888bdY88b + / Y88b 888 | / Y88b / Y88Y Y888b + /____Y88b 888 / /____Y88b / YY Y888b + / Y88b 888_-~ / Y88b / Y888b + +Usage: adam-submit [ --] Choose one of the following commands: @@ -62,8 +64,11 @@ CONVERSION OPERATIONS anno2adam : Convert a annotation file (in VCF format) to the corresponding ADAM format adam2vcf : Convert an ADAM variant to the VCF ADAM format fasta2adam : Converts a text FASTA sequence file into an ADAMNucleotideContig Parquet file which represents assembled sequences. + adam2fasta : Convert ADAM nucleotide contig fragments to FASTA files features2adam : Convert a file with sequence features into corresponding ADAM format wigfix2bed : Locally convert a wigFix file to BED format + fragments2reads : Convert alignment records into fragment records. + reads2fragments : Convert alignment records into fragment records. PRINT print : Print an ADAM formatted file @@ -72,9 +77,7 @@ PRINT print_tags : Prints the values and counts of all tags in a set of records listdict : Print the contents of an ADAM sequence dictionary allelecount : Calculate Allele frequencies - buildinfo : Display build information (use this for bug reports) view : View certain reads from an alignment-record file. - ``` You can learn more about a command, by calling it without arguments or with `--help`, e.g. 
@@ -84,16 +87,22 @@ $ adam-submit transform Argument "INPUT" is required INPUT : The ADAM, BAM or SAM file to apply the transforms to OUTPUT : Location to write the transformed data in ADAM/Parquet format + -add_md_tags VAL : Add MD Tags to reads based on the FASTA (or equivalent) file passed to this option. + -aligned_read_predicate : Only load aligned reads. Only works for Parquet files. + -cache : Cache data to avoid recomputing between stages. -coalesce N : Set the number of partitions written to the ADAM output directory + -concat VAL : Concatenate this file with and write the result to -dump_observations VAL : Local path to dump BQSR observations to. Outputs CSV format. -force_load_bam : Forces Transform to load from BAM/SAM. -force_load_fastq : Forces Transform to load from unpaired FASTQ. -force_load_ifastq : Forces Transform to load from interleaved FASTQ. -force_load_parquet : Forces Transform to load from Parquet. + -force_shuffle_coalesce : Even if the repartitioned RDD has fewer partitions, force a shuffle. -h (-help, --help, -?) : Print help -known_indels VAL : VCF file including locations of known INDELs. If none is provided, default consensus model will be used. -known_snps VAL : Sites-only VCF giving location of known SNPs + -limit_projection : Only project necessary fields. Only works for Parquet files. -log_odds_threshold N : The log-odds threshold for accepting a realignment. Default value is 5.0. -mark_duplicate_reads : Mark duplicate reads -max_consensus_number N : The maximum number of consensus to try realigning a target region to. Default @@ -101,6 +110,10 @@ Argument "INPUT" is required -max_indel_size N : The maximum length of an INDEL to realign to. Default value is 500. -max_target_size N : The maximum length of a target region to attempt realigning. Default length is 3000. + -md_tag_fragment_size N : When adding MD tags to reads, load the reference in fragments of this size. 
+ -md_tag_overwrite : When adding MD tags to reads, overwrite existing incorrect tags. + -paired_fastq VAL : When converting two (paired) FASTQ files to ADAM, pass the path to the second file + here. -parquet_block_size N : Parquet block size (default = 128mb) -parquet_compression_codec [UNCOMPRESSED | SNAPPY | GZIP | LZO] : Parquet compression codec -parquet_disable_dictionary : Disable dictionary encoding @@ -109,11 +122,17 @@ Argument "INPUT" is required -print_metrics : Print metrics to the log on completion -realign_indels : Locally realign indels present in reads. -recalibrate_base_qualities : Recalibrate the base quality scores (ILLUMINA only) + -record_group VAL : Set converted FASTQs' record-group names to this value; if empty-string is passed, + use the basename of the input file, minus the extension. -repartition N : Set the number of partitions to map data to + -single : Saves OUTPUT as single file -sort_fastq_output : Sets whether to sort the FASTQ output, if saving as FASTQ. False by default. Ignored if not saving as FASTQ. -sort_reads : Sort the reads by referenceId and read position - ``` + -storage_level VAL : Set the storage level to use for caching. + -stringency VAL : Stringency level for various checks; can be SILENT, LENIENT, or STRICT. Defaults + to LENIENT +``` The ADAM `transform` command allows you to mark duplicates, run base quality score recalibration (BQSR) and other pre-processing steps on your data. 
diff --git a/adam-cli/src/main/scala/org/bdgenomics/adam/cli/ADAMMain.scala b/adam-cli/src/main/scala/org/bdgenomics/adam/cli/ADAMMain.scala
index d8002340c8..9389eab00e 100644
--- a/adam-cli/src/main/scala/org/bdgenomics/adam/cli/ADAMMain.scala
+++ b/adam-cli/src/main/scala/org/bdgenomics/adam/cli/ADAMMain.scala
@@ -67,7 +67,6 @@ object ADAMMain {
           PrintTags,
           ListDict,
           AlleleCount,
-          BuildInformation,
           View
         )
       )
diff --git a/adam-cli/src/main/scala/org/bdgenomics/adam/cli/BuildInformation.scala b/adam-cli/src/main/scala/org/bdgenomics/adam/cli/BuildInformation.scala
deleted file mode 100644
index 8ad727eae2..0000000000
--- a/adam-cli/src/main/scala/org/bdgenomics/adam/cli/BuildInformation.scala
+++ /dev/null
@@ -1,39 +0,0 @@
-/**
- * Licensed to Big Data Genomics (BDG) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The BDG licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.bdgenomics.adam.cli
-
-import org.bdgenomics.utils.cli._
-
-object BuildInformation extends BDGCommandCompanion {
-  val commandName: String = "buildinfo"
-  val commandDescription: String = "Display build information (use this for bug reports)"
-
-  def apply(cmdLine: Array[String]): BDGCommand = {
-    new BuildInformation()
-  }
-}
-
-class BuildInformation() extends BDGCommand {
-  val companion = BuildInformation
-
-  def run() = {
-    val properties = org.bdgenomics.adam.core.util.BuildInformation.asString();
-    println("Build information:\n" + properties);
-  }
-
-}
diff --git a/adam-cli/src/test/scala/org/bdgenomics/adam/cli/ADAMMainSuite.scala b/adam-cli/src/test/scala/org/bdgenomics/adam/cli/ADAMMainSuite.scala
index d0f4a65112..a5183e0762 100644
--- a/adam-cli/src/test/scala/org/bdgenomics/adam/cli/ADAMMainSuite.scala
+++ b/adam-cli/src/test/scala/org/bdgenomics/adam/cli/ADAMMainSuite.scala
@@ -48,16 +48,6 @@ class ADAMMainSuite extends FunSuite {
     assert(out.contains("flatten"))
   }

-  test("buildinfo argument via main") {
-    val stream = new ByteArrayOutputStream()
-    Console.withOut(stream) {
-      ADAMMain.main(Array("buildinfo"))
-    }
-    val out = stream.toString()
-    assert(!out.contains("Usage"))
-    assert(out.contains("Build information"))
-  }
-
   test("command groups is empty when called via apply") {
     val stream = new ByteArrayOutputStream()
     Console.withOut(stream) {
@@ -82,7 +72,7 @@ class ADAMMainSuite extends FunSuite {
   test("add new command group to default command groups") {
     val stream = new ByteArrayOutputStream()
     Console.withOut(stream) {
-      val commandGroups = defaultCommandGroups.union(List(CommandGroup("NEW COMMAND GROUP", List(BuildInformation))))
+      val commandGroups = defaultCommandGroups.union(List(CommandGroup("NEW COMMAND GROUP", List(Flatten))))
       new ADAMMain(commandGroups)(Array())
     }
     val out = stream.toString()
@@ -107,7 +97,7 @@ class ADAMMainSuite extends FunSuite {
     Console.withOut(stream) {
       val module = new AbstractModule with ScalaModule {
         def configure() = {
-          bind[List[CommandGroup]].toInstance(List(CommandGroup("SINGLE COMMAND GROUP", List(BuildInformation))))
+          bind[List[CommandGroup]].toInstance(List(CommandGroup("SINGLE COMMAND GROUP", List(Flatten))))
         }
       }
       val injector = Guice.createInjector(module)
@@ -117,7 +107,7 @@ class ADAMMainSuite extends FunSuite {
     val out = stream.toString()
     assert(out.contains("Usage"))
     assert(out.contains("SINGLE"))
-    assert(out.contains("buildinfo"))
+    assert(out.contains("flatten"))
   }

   test("custom module with new command group added to default command groups") {
@@ -125,7 +115,7 @@ class ADAMMainSuite extends FunSuite {
     Console.withOut(stream) {
       val module = new AbstractModule with ScalaModule {
         def configure() = {
-          bind[List[CommandGroup]].toInstance(defaultCommandGroups.union(List(CommandGroup("NEW COMMAND GROUP", List(BuildInformation)))))
+          bind[List[CommandGroup]].toInstance(defaultCommandGroups.union(List(CommandGroup("NEW COMMAND GROUP", List(Flatten)))))
         }
       }
       val injector = Guice.createInjector(module)
@@ -136,21 +126,4 @@ class ADAMMainSuite extends FunSuite {
     assert(out.contains("Usage"))
     assert(out.contains("NEW"))
   }
-
-  test("buildinfo from custom module argument via apply") {
-    val stream = new ByteArrayOutputStream()
-    Console.withOut(stream) {
-      val module = new AbstractModule with ScalaModule {
-        def configure() = {
-          bind[List[CommandGroup]].toInstance(List(CommandGroup("SINGLE COMMAND GROUP", List(BuildInformation))))
-        }
-      }
-      val injector = Guice.createInjector(module)
-      val commandGroups = injector.instance[List[CommandGroup]]
-      new ADAMMain(commandGroups).apply(Array("buildinfo"))
-    }
-    val out = stream.toString()
-    assert(!out.contains("Usage"))
-    assert(out.contains("Build information"))
-  }
 }
diff --git a/docs/source/01_intro.md b/docs/source/01_intro.md
index 9ee5faca6b..97ff17bea0 100644
--- a/docs/source/01_intro.md
+++ b/docs/source/01_intro.md
@@ -93,64 +93,73 @@ just as easily use this Avro IDL description as the basis for a Python
project. ADAM does much more than just k-mer counting. Running the ADAM CLI without arguments or with `--help` will display available commands, e.g. -$ adam +$ adam-submit ``` - e 888~-_ e e e - d8b 888 \ d8b d8b d8b - /Y88b 888 | /Y88b d888bdY88b - / Y88b 888 | / Y88b / Y88Y Y888b - /____Y88b 888 / /____Y88b / YY Y888b -/ Y88b 888_-~ / Y88b / Y888b + e 888~-_ e e e + d8b 888 \ d8b d8b d8b + /Y88b 888 | /Y88b d888bdY88b + / Y88b 888 | / Y88b / Y88Y Y888b + /____Y88b 888 / /____Y88b / YY Y888b + / Y88b 888_-~ / Y88b / Y888b + +Usage: adam-submit [ --] Choose one of the following commands: ADAM ACTIONS - compare : Compare two ADAM files based on read name - findreads : Find reads that match particular individual or comparative criteria depth : Calculate the depth from a given ADAM file, at each variant in a VCF count_kmers : Counts the k-mers/q-mers from a read dataset. + count_contig_kmers : Counts the k-mers/q-mers from a read dataset. transform : Convert SAM/BAM to ADAM format and optionally perform read pre-processing transformations adam2fastq : Convert BAM to FASTQ files plugin : Executes an ADAMPlugin + flatten : Convert a ADAM format file to a version with a flattened schema, suitable for querying with tools like Impala CONVERSION OPERATIONS - bam2adam : Single-node BAM to ADAM converter (Note: the 'transform' command can take SAM or BAM as input) vcf2adam : Convert a VCF file to the corresponding ADAM format anno2adam : Convert a annotation file (in VCF format) to the corresponding ADAM format adam2vcf : Convert an ADAM variant to the VCF ADAM format fasta2adam : Converts a text FASTA sequence file into an ADAMNucleotideContig Parquet file which represents assembled sequences. 
- reads2ref : Convert an ADAM read-oriented file to an ADAM reference-oriented file - mpileup : Output the samtool mpileup text from ADAM reference-oriented data + adam2fasta : Convert ADAM nucleotide contig fragments to FASTA files features2adam : Convert a file with sequence features into corresponding ADAM format wigfix2bed : Locally convert a wigFix file to BED format + fragments2reads : Convert alignment records into fragment records. + reads2fragments : Convert alignment records into fragment records. PRINT print : Print an ADAM formatted file print_genes : Load a GTF file containing gene annotations and print the corresponding gene models flagstat : Print statistics on reads in an ADAM file (similar to samtools flagstat) - viz : Generates images from sections of the genome print_tags : Prints the values and counts of all tags in a set of records listdict : Print the contents of an ADAM sequence dictionary - summarize_genotypes : Print statistics of genotypes and variants in an ADAM file allelecount : Calculate Allele frequencies - buildinfo : Display build information (use this for bug reports) view : View certain reads from an alignment-record file. ``` You can learn more about a command, by calling it without arguments or with `--help`, e.g. ``` -$ adam transform +$ adam-submit transform Argument "INPUT" is required INPUT : The ADAM, BAM or SAM file to apply the transforms to OUTPUT : Location to write the transformed data in ADAM/Parquet format + -add_md_tags VAL : Add MD Tags to reads based on the FASTA (or equivalent) file passed to this option. + -aligned_read_predicate : Only load aligned reads. Only works for Parquet files. + -cache : Cache data to avoid recomputing between stages. -coalesce N : Set the number of partitions written to the ADAM output directory + -concat VAL : Concatenate this file with and write the result to -dump_observations VAL : Local path to dump BQSR observations to. Outputs CSV format. 
+ -force_load_bam : Forces Transform to load from BAM/SAM. + -force_load_fastq : Forces Transform to load from unpaired FASTQ. + -force_load_ifastq : Forces Transform to load from interleaved FASTQ. + -force_load_parquet : Forces Transform to load from Parquet. + -force_shuffle_coalesce : Even if the repartitioned RDD has fewer partitions, force a shuffle. -h (-help, --help, -?) : Print help -known_indels VAL : VCF file including locations of known INDELs. If none is provided, default consensus model will be used. -known_snps VAL : Sites-only VCF giving location of known SNPs + -limit_projection : Only project necessary fields. Only works for Parquet files. -log_odds_threshold N : The log-odds threshold for accepting a realignment. Default value is 5.0. -mark_duplicate_reads : Mark duplicate reads -max_consensus_number N : The maximum number of consensus to try realigning a target region to. Default @@ -158,27 +167,28 @@ Argument "INPUT" is required -max_indel_size N : The maximum length of an INDEL to realign to. Default value is 500. -max_target_size N : The maximum length of a target region to attempt realigning. Default length is 3000. + -md_tag_fragment_size N : When adding MD tags to reads, load the reference in fragments of this size. + -md_tag_overwrite : When adding MD tags to reads, overwrite existing incorrect tags. + -paired_fastq VAL : When converting two (paired) FASTQ files to ADAM, pass the path to the second file + here. -parquet_block_size N : Parquet block size (default = 128mb) -parquet_compression_codec [UNCOMPRESSED | SNAPPY | GZIP | LZO] : Parquet compression codec -parquet_disable_dictionary : Disable dictionary encoding -parquet_logging_level VAL : Parquet logging level (default = severe) -parquet_page_size N : Parquet page size (default = 1mb) -print_metrics : Print metrics to the log on completion - -qualityBasedTrim : Trims reads based on quality scores of prefix/suffixes across read group. 
- -qualityThreshold N : Phred scaled quality threshold used for trimming. If omitted, Phred 20 is used. -realign_indels : Locally realign indels present in reads. -recalibrate_base_qualities : Recalibrate the base quality scores (ILLUMINA only) + -record_group VAL : Set converted FASTQs' record-group names to this value; if empty-string is passed, + use the basename of the input file, minus the extension. -repartition N : Set the number of partitions to map data to + -single : Saves OUTPUT as single file -sort_fastq_output : Sets whether to sort the FASTQ output, if saving as FASTQ. False by default. Ignored if not saving as FASTQ. -sort_reads : Sort the reads by referenceId and read position - -trimBeforeBQSR : Performs quality based trim before running BQSR. Default is to run quality based - trim after BQSR. - -trimFromEnd N : Trim to be applied to end of read. - -trimFromStart N : Trim to be applied to start of read. - -trimReadGroup VAL : Read group to be trimmed. If omitted, all reads are trimmed. - -trimReads : Apply a fixed trim to the prefix and suffix of all reads/reads in a specific read - group. + -storage_level VAL : Set the storage level to use for caching. + -stringency VAL : Stringency level for various checks; can be SILENT, LENIENT, or STRICT. Defaults + to LENIENT ``` The ADAM transform command allows you to mark duplicates, run base quality score recalibration (BQSR) and other pre-processing steps on your data.