- Author
-
Hiroyuki Mishima (missy at be.to / hmishima at nagasaki-u.ac.jp)
- Copyright
-
Hiroyuki Mishima, 2010-2011
- License
-
The MIT license. See the LICENSE file.
See Dindel’s web page: www.sanger.ac.uk/resources/software/dindel/ .
A rakefile for the Genome Analysis Toolkit (GATK) workflow.
See the GATK web page: www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit .
This workflow describes “Better - sample-level realignment with known indels and recalibration”. See www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v2#Better:_sample-level_realignment_with_known_indels_and_recalibration .
A combined rakefile calling both the GATK and Dindel workflows.
The newest UnifiedGenotyper implements the Dindel algorithm and can report both SNVs and indels. This workflow is based on the newest recommended workflow on the GATK web page. Rakefile.invoke is now divided into Rakefile.invoke.Gatk and Rakefile.invoke.Picard. Each file contains Gatk- or Picard-specific methods. This made sharing Rakefile.invoke between different workflows easier.
-
Rake dry-run: rake -n
-
Rake run: rake
-
Pwrake dry-run: pwrake NODEFILE=nodefile -n
-
Pwrake run: pwrake NODEFILE=nodefile
-
For details of Pwrake, see github.com/masa16/Pwrake/ and bioruby.open-bio.org/wiki/Workflows
Workflow directories consist of the following files:
Rakefile: the main rakefile. At the start, the target files of each workflow step are defined in constants, and these constants are used to define the :default task. This makes it easy to get an overview of the workflow and to set break points in workflow execution. Each workflow step can be defined using the “rule” method if the dependency of the step follows naming rules such as file extensions (suffixes). The “file” method can also be used; it defines dependencies using fixed filenames instead of rules. You can define dependencies flexibly using regular Ruby syntax such as Enumerable#each.
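The layout described above might look like the following minimal sketch. The sample names, file names, and command names (dedup-command, merge-command) are hypothetical placeholders and are not taken from the actual GATK or Dindel rakefiles.

```ruby
require "rake"

# Hypothetical sample names; a real workflow would list its own inputs.
SAMPLES = %w[sampleA sampleB]

# Target files of each workflow step, defined in constants at the start.
DEDUP_BAMS = SAMPLES.map { |s| "#{s}.dedup.bam" }
MERGED_BAM = "merged.bam"

# Pointing :default at an earlier constant (e.g. DEDUP_BAMS) stops the
# workflow at that step, which serves as a break point.
task :default => MERGED_BAM

# "rule": the dependency follows a naming rule (x.dedup.bam needs x.bam).
# The Proc maps a target name to the name of its source file.
rule ".dedup.bam" => [proc { |name| name.sub(/\.dedup\.bam\z/, ".bam") }] do |t|
  sh "dedup-command #{t.source} > #{t.name}" # placeholder command
end

# "file": the dependency is given by fixed filenames.
file MERGED_BAM => DEDUP_BAMS do |t|
  sh "merge-command #{t.prerequisites.join(' ')} > #{t.name}" # placeholder command
end
```

Running rake -n against a file like this prints the planned steps without executing the placeholder commands.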
Rakefile.invoke: referred to from Rakefile. The command-line options of the tools to be invoked are described in this file. Each invoke method should receive a Task object (“t” is often used as the parameter name). An optional Hash object can be passed if the method needs extra information.
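An invoke method of this shape could look like the sketch below. The tool name, its --memory option, and the :memory key are hypothetical. The command string is built in a separate method here so that it can be inspected without running anything; that split is a choice made for this illustration only.

```ruby
require "rake"

module RakefileInvoke
  extend Rake::FileUtilsExt # makes Rake's sh available inside this module

  # Hypothetical fixed values, kept in constants rather than hard-coded.
  SORT_TOOL      = "sort-tool"
  DEFAULT_MEMORY = "2g"

  # Build the command line from a Task object and an optional Hash.
  def self.sort_command(t, opts = {})
    memory = opts.fetch(:memory, DEFAULT_MEMORY)
    "#{SORT_TOOL} --memory #{memory} #{t.source} > #{t.name}"
  end

  # The invoke method itself: receives the Task object "t" from the Rakefile.
  def self.sort(t, opts = {})
    sh sort_command(t, opts)
  end
end
```

From a rule body in the Rakefile, this would be called as RakefileInvoke.sort(t) or, with extra information, RakefileInvoke.sort(t, :memory => "4g").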
Defines helper methods that simplify Rakefile descriptions. These methods are defined at the top level.
“objfile” is an Array of String objects (filenames). “dependency” is a Hash that is expected to have only one key. To replace the “.bam” file extension (suffix) with “.dedup.bam”, “dependency” should be {“.bam” => “.dedup.bam”}. Note that the dot needs no escaping: you do not have to write “\.” to indicate a dot.
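A minimal sketch of a method with this behavior is shown below. It is an illustration written from the description above, not the helper's actual source; the filenames are made up.

```ruby
# Replace the suffix of every filename in objfile.
# dependency is a one-key Hash: {old_suffix => new_suffix}.
def suffix(objfile, dependency)
  old_suffix, new_suffix = dependency.first
  # Regexp.escape makes "." match only a literal dot.
  pattern = /#{Regexp.escape(old_suffix)}\z/
  objfile.map { |name| name.sub(pattern) { new_suffix } }
end

bams = ["sampleA.bam", "sampleB.bam"]
dedup_bams = suffix(bams, ".bam" => ".dedup.bam")
# dedup_bams is ["sampleA.dedup.bam", "sampleB.dedup.bam"]
```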
The “rule” method of Rake accepts an Array of Proc objects (objects wrapping a code block or procedure) to define dependent files. This method returns a Proc object that replaces the suffix. “dependency” is the same as in the suffix method.
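Such a Proc-returning helper could be sketched as follows. The method name suffix_proc is hypothetical, and the sketch assumes that the Hash key is the suffix to be replaced and the value is its replacement.

```ruby
# Hypothetical name. Returns a Proc that replaces the suffix of one filename.
# dependency is a one-key Hash: {old_suffix => new_suffix}.
def suffix_proc(dependency)
  old_suffix, new_suffix = dependency.first
  pattern = /#{Regexp.escape(old_suffix)}\z/
  proc { |name| name.sub(pattern) { new_suffix } }
end

to_bam = suffix_proc(".dedup.bam" => ".bam")
to_bam.call("sampleA.dedup.bam") # returns "sampleA.bam"
```

Because Rake calls the Proc with the target name, a rule for “.dedup.bam” targets could be written under the same assumption as rule ".dedup.bam" => [suffix_proc(".dedup.bam" => ".bam")].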
A node file (passed to Pwrake as NODEFILE=nodefile) that contains the line “localhost 16” allows up to 16 processes to run simultaneously on localhost. Multiple lines are allowed in this file.
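To make the format concrete, the sketch below reads such a file into a Hash of host name to process count. The second host, node01, is hypothetical, and the parser is only an illustration of the line format; it is not how Pwrake itself reads the file.

```ruby
# Sample node file content; "node01" is a hypothetical second host.
NODEFILE_TEXT = <<~EOS
  localhost 16
  node01 8
EOS

# Illustrative parser (not Pwrake's own code): host name => process count.
def parse_nodefile(text)
  text.each_line
      .map(&:split)                 # "localhost 16" -> ["localhost", "16"]
      .reject(&:empty?)             # skip blank lines
      .map { |host, count| [host, Integer(count)] }
      .to_h
end

nodes = parse_nodefile(NODEFILE_TEXT)
# nodes is {"localhost" => 16, "node01" => 8}
```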
As a summary of the agile workflow development, the general procedure for describing new workflows in Pwrake is given below.
Describe file dependencies in Rakefile.
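    task "output.dat" => "input.dat" do |t|
      RakefileInvoke::generate_target t
    end
Define the RakefileInvoke::generate_target method in Rakefile.invoke.
    module RakefileInvoke
      extend Rake::FileUtilsExt  # makes sh callable inside the module

      # Build the target (t.name) from its first prerequisite.
      def self.generate_target(t)
        sh "command-line #{t.prerequisites.first} > #{t.name}"
      end
    end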
Parameter adjustments require modifications to Rakefile.invoke only. Similarly, changes in file dependencies require modification to Rakefile only.
-
In Rakefile and Rakefile.invoke, all the fixed values in the command-line should be given using constants instead of hard coding.
-
In Rakefile, the rule method is useful if the order of tasks can be defined by file naming rules such as file name extensions.
-
For a syntax check, the -n (dry-run) option of the Pwrake/Rake command is useful.
-
To check the correctness of a generated command line, it can be displayed by replacing the sh method with the puts method in Rakefile.invoke.
-
Redirecting the standard output and the standard error to files is a good practice for troubleshooting.
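The last two tips can be combined as in the sketch below. The “.log” naming for the standard error file and the command_for helper are assumptions made for this illustration.

```ruby
module RakefileInvoke
  # Placeholder command name, kept in a constant instead of hard-coded.
  TOOL = "command-line"

  # Standard output goes to the target file, standard error to a log file.
  def self.command_for(t)
    "#{TOOL} #{t.prerequisites.first} > #{t.name} 2> #{t.name}.log"
  end

  def self.generate_target(t)
    # sh command_for(t)  # normal execution
    puts command_for(t)  # debugging: only show the generated command line
  end
end
```

Once the printed command line looks correct, the puts line is swapped back for the sh line.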