Merge pull request #88 from trstickland/doc_rewrite

Doc rewrite
sanger-pathogens · Oct 3, 2018 · e26f4c6
2 parents 32fd793 + c614628
commit e26f4c6

Showing 2 changed files with 69 additions and 14 deletions.
diff --git a/README.md b/README.md
@@ -5,15 +5,20 @@ A portable, scalable eukaryotic genome annotation pipeline implemented in Nextfl
 [![License: ISC](https://img.shields.io/badge/License-ISC-brightgreen.svg)](https://github.com/sanger-pathogens/companion/blob/master/LICENSE)  
 [![status](https://img.shields.io/badge/NAR-10.1093%2Fnar.gkw292-brightgreen.svg)](https://doi.org/10.1093/nar/gkw292)
 
+
 ## Content
-  * [Introduction](#introduction)
-  * [Installation](#installation)
-    * [Required dependencies](#required-dependencies)
-  * [Usage](#usage)
-    * [Preparing reference annotations](#preparing-reference-annotations)
-  * [License](#license)
-  * [Feedback/Issues](#feedbackissues)
-  * [Citation](#citation)
+ * [Introduction](#introduction)
+ * [Dependencies](#dependencies)
+   * [Docker](#docker)
+ * [Installation](#installation)
+ * [Usage](#usage)
+   * [Local copy of Companion](#local-copy-of-companion)
+   * [Running Companion direct from a repository](#running-companion-direct-from-a-repository)
+   * [Preparing reference annotations](#preparing-reference-annotations)
+ * [License](#license)
+ * [Feedback/Issues](#feedbackissues)
+ * [Citation](#citation)
+
 
 ## Introduction
 This software is a comprehensive computational pipeline for the annotation of eukaryotic genomes (like protozoan parasites). It performs the following tasks:
@@ -93,7 +98,7 @@ docker pull sangerpathogens/companion
 To create a local copy of companion, you can download this repo from github (if you are familiar with github, you may
 of course prefer to _clone_ or _fork_ it).
 ```
-curl https://github.com/trstickland/companion/archive/master.zip  # or click the green button on the guthub web page
+curl -L -o companion-master.zip https://github.com/sanger-pathogens/companion/archive/master.zip  # or click the green button on the GitHub web page
 unzip companion-master.zip
 mv companion-master my-companion-project # renaming it to something meaningful to you is a good idea
 ```
@@ -106,9 +111,19 @@ nextflow run my-companion-project -profile docker
 The argument `-profile docker` instructs nextflow to run the sangerpathogens/companion docker image for the dependencies.
 
 Have a look at the `nextflow.config` file to see the definition of the docker profile, and how the docker image is specified.
-You will also find file names, paths, parameters, etc. that you can edit to perform your own runs.
+You will also find file names, paths, parameters, etc. that you can edit to perform your own runs.  The following warrant
+a special mention:
+
+*inseq*  The input FASTA file  (`${baseDir}/example-data/L_donovani.1.fasta` in the example parameter file included with the distribution)
+
+*ref_dir* The directory containing reference genomes (`${baseDir}/example-data/references` in the example file)
+
+*dist_dir* The directory that will contain the newly created output files (`${baseDir}/example-data-output` in the example file)
 
-### running Companion direct from a repository
+*run_snap* We recommend SNAP is disabled, as it has not provided useful results in this pipeline (`false` in the example file)
+
+
+### Running Companion direct from a repository
 
 If you run nextflow with the name of a github repository, it will pull the contents of the repository and run with those.
 This command will do the same as the "local copy" example above:
@@ -123,10 +138,50 @@ edit the files in your local copy, and nextflow will work from your (now differe
 If you are familiar with repositories, and the workflow appropriate to using them, this can be a very convenient way of
 working;   otherwise it can become quite confusing, and you may find it easier to work with a simple local copy.
 
-
 ### Preparing reference annotations
 
-The reference annotations used in the pipeline need to be pre-processed before they can be used. See the the [GitHub wiki](https://github.com/sanger-pathogens/companion/wiki/Preparing-reference-data-sets) for more details. There are also pre-generated reference sets for various parasite species/families.
+The reference annotations used in the pipeline need to be pre-processed before they can be used.  Only a few pre-generated
+reference sets for various parasite species/families are included in the distribution as examples.
+
+To add a reference organism, you will need:
+
+- a descriptive name of the organism
+- a short abbreviation for the organism
+- the genome sequence in a single FASTA file
+- a structural gene annotation in GFF3 format (see below for details)
+- functional GO annotation in GAF 1.0 format, on the gene level
+- a pattern matching chromosome headers, describing how to extract chromosome numbers from them
+- an [AUGUSTUS](http://bioinf.uni-greifswald.de/augustus/) model, trained on reference genes
+
+Insert these file names, etc., where `<placeholders>` appear in the steps below:
+
+1. Create a new data directory (i.e. the equivalent of the `example-data` directory included in the distribution)
+1. Edit `nextflow.config` (and any config files that are referenced) and change parameters such as
+`inseq` and `ref_dir` to your new data directory.
+1. Copy the new reference genome (FASTA) into `<new_data_dir>/genomes`
+1. Copy GFF3 and GAF files into `<new_data_dir>/genomes`
+1. Copy Augustus model files into `data/augustus/species/<species_name>/`
+1. Create new directory `<new_data_dir>/references/<short_name>/`
+1. Add new section to `amber-test-data/references/references-in.json`, using the
+short name (same as the directory name in the previous step); in this section add
+the names/paths of the files copied (above), a descriptive name, and
+a pattern for matching chromosomes in the FASTA files (in this example, <short_name>_<n>, where _n_ is any integer).
+```
+"<short_name>" : {   "gff"                : "../genomes/<gff3_filename>.gff3",
+                     "genome"             : "../genomes/<ref_genome_name>.fasta",
+                     "gaf"                : "../genomes/<ref_annot_filename>.gaf",
+                     "name"               : "<Descriptive Name of Reference Genome>",
+                     "augustus_model"     : "../../data/augustus/species/<species_name>/",
+                     "chromosome_pattern" : "<short_name>_(%d+)"
+                  }
+```
+8. Finally, change directory to `<new_data_dir>/references` (you _must_ execute the following command in this directory)
+and run `../../bin/update_references.lua`.  This writes the file `<new_data_dir>/references/references.json`.
+
+You can now run _Companion_, and the new reference will be included.
+
+Further documentation on preparing reference data can be found in the [GitHub wiki](https://github.com/sanger-pathogens/companion/wiki/Preparing-reference-data-sets).
+
 
 ## License
 Companion is free software, licensed under [ISC](https://github.com/sanger-pathogens/companion/blob/master/LICENSE).
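
The `references-in.json` section added in the README hunk above could also be generated rather than typed by hand. The following is a minimal Python sketch, not part of Companion: `build_reference_entry` and `chromosome_number` are hypothetical helper names, the example values are made up, and reading `%d+` as "one or more digits" is an assumption based on Lua pattern syntax.

```python
import json
import re


def build_reference_entry(short_name, gff3_filename, ref_genome_name,
                          ref_annot_filename, descriptive_name, species_name):
    """Return a one-key dict shaped like the README's references-in.json section.

    Hypothetical helper: the keys and relative paths mirror the README example.
    """
    return {
        short_name: {
            "gff": f"../genomes/{gff3_filename}.gff3",
            "genome": f"../genomes/{ref_genome_name}.fasta",
            "gaf": f"../genomes/{ref_annot_filename}.gaf",
            "name": descriptive_name,
            "augustus_model": f"../../data/augustus/species/{species_name}/",
            # "%d+" is copied verbatim from the README's example pattern.
            "chromosome_pattern": f"{short_name}_(%d+)",
        }
    }


def chromosome_number(header, short_name):
    """Approximate '<short_name>_(%d+)' with a Python regex.

    Assumption: '%d+' means one or more digits, so the regex digit class is
    used here. Returns the chromosome number as an int, or None if no match.
    """
    match = re.search(re.escape(short_name) + r"_(\d+)", header)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # All values below are illustrative placeholders, not real data sets.
    entry = build_reference_entry(
        "Lxx", "Lxx_annotation", "Lxx_genome", "Lxx_go",
        "Example reference genome", "example_species",
    )
    print(json.dumps(entry, indent=2))
    print(chromosome_number("Lxx_12", "Lxx"))
```

The printed JSON fragment would still need to be merged into the existing `references-in.json` by hand, and `update_references.lua` run from the references directory as the README describes. The regex check is only a quick way to confirm that the FASTA headers follow the `<short_name>_<n>` convention before running the pipeline.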

diff --git a/params_default.config b/params_default.config
@@ -7,7 +7,7 @@ params {
     ref_species = "LmjF.1"
 
     // output directory, if desired
-    // dist_dir = "${baseDir}"
+    dist_dir = "${baseDir}/example-output"
 
     // enable/disable parts of the pipeline
     run_exonerate          = false