exhibit doc up to auspice server

ktmeaton · May 15, 2020 · c711b5b · c711b5b
1 parent 4f056f5
commit c711b5b
Show file tree

Hide file tree

Showing 2 changed files with 135 additions and 14 deletions.
diff --git a/docs/exhibit/exhibit_dhsi2020.rst b/docs/exhibit/exhibit_dhsi2020.rst
@@ -16,8 +16,8 @@ Note: Requires `conda <https://docs.conda.io/projects/conda/en/latest/user-guide
 
 ------------
 
-Phylogenetics Pipeline
---------------------------
+Data Download
+-------------
 
 Download the samples and reference found in the `Morelli et al. 2010 pulication <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2999892/>`_.
 
@@ -36,6 +36,11 @@ Check that there are 15 samples to be downloaded.
 
       wc -l morelli2010/sqlite_import/assembly_for_download.txt
 
+------------
+
+Basic Phylogeny Pipeline
+------------------------
+
 Run the full pipeline, including sample download, aligning to a reference genome, model selection, and phylogenetics.
 
 **Shell script**::
@@ -47,27 +52,75 @@ Run the full pipeline, including sample download, aligning to a reference genome
         --sqlite_select_command_asm "\"SELECT AssemblyFTPGenbank FROM Master WHERE BioSampleComment LIKE '%Morelli%'\"" \
         -resume
 
-Prepare a metadata file for NextStrain. Afterwards, manually clean up dates to remove uncertainty characters and change to the format 2000-XX-XX. Also separate out columns that have multiple entries (ex. AssemblyTotalLength) by retaining the first semi-colon separated value.
+------------
+
+TimeTree Metadata
+-----------------
+
+Prepare a metadata file for timetree.
 
 **Shell script**::
 
       mkdir -p morelli2010/nextstrain/
 
       scripts/sqlite_NextStrain_tsv.py \
         --database results/ncbimeta_db/update/latest/output/database/yersinia_pestis_db.sqlite \
-        --query "SELECT BioSampleAccession,AssemblyFTPGenbank,SRARunAccession,BioSampleStrain,BioSampleCollectionDate,BioSampleHost,BioSampleGeographicLocation,BioSampleBiovar,PubmedArticleTitle,PubmedAuthorsLastName,AssemblyContigCount,AssemblyTotalLength,NucleotideGenes,NucleotideGenesTotal,NucleotidePseudoGenes,NucleotidePseudoGenesTotal,NucleotiderRNAs,AssemblySubmissionDate,SRARunPublishDate,BioSampleComment FROM Master WHERE (BioSampleComment LIKE '%Morelli%' AND TRIM(AssemblyFTPGenbank) > '')" \
+        --query "SELECT BioSampleAccession,AssemblyFTPGenbank,BioSampleStrain,BioSampleCollectionDate,BioSampleGeographicLocation,BioSampleBiovar FROM Master WHERE (BioSampleComment LIKE '%Morelli%' AND TRIM(AssemblyFTPGenbank) > '')" \
         --no-data-char ? \
         --output morelli2010/nextstrain/metadata_nextstrain.tsv
 
         head -n 1 morelli2010/nextstrain/metadata_nextstrain.tsv | \
-          awk -F "\t" '{print "name\t"$0}' \
+          awk -F "\t" '{print "strain\t"$0}' \
           > morelli2010/nextstrain/metadata_nextstrain_edit.tsv
 
         tail -n +2 morelli2010/nextstrain/metadata_nextstrain.tsv  | \
           awk -F "\t" '{split($2,ftpSplit,"/"); name=ftpSplit[10]"_genomic"; print name"\t"$0}' \
           >> morelli2010/nextstrain/metadata_nextstrain_edit.tsv
 
-Estimate a time-scaled phylogeny.
+Afterwards, change the BioSampleCollectionDate column to 'date'.
+
+**Shell script**::
+
+      sed -i 's/BioSampleCollectionDate/date/g' morelli2010/nextstrain/metadata_nextstrain_edit.tsv
+
+- remove uncertainty characters ex. <
+- change format to 2000-XX-XX.
+
+Edit the BioSampleGeographicLocation column so that:
+
+- Change everything just to country name
+- USSR to Russia
+
+Add a line for the Reference Genome that just says "Reference" under the strain column, and is question marks for all remaining columns. We will let the program infer the metadata and see how close it gets.
+
+Geocode the GeographicLocation column to get lat lon coordinates.
+
+**Shell script**::
+
+      scripts/geocode_NextStrain.py \
+         --in-tsv morelli2010/nextstrain/metadata_nextstrain_edit.tsv \
+         --loc-col BioSampleGeographicLocation \
+         --out-tsv morelli2010/nextstrain/metadata_nextstrain_geocode.tsv \
+         --out-lat-lon morelli2010/nextstrain/lat_longs.tsv \
+         --div country
+
+Replace the division name 'country' with our column name 'BioSampleGeographicLocation' in the lat lon file.
+Edit country names in the lat lon file to match our original metadata.
+
+**Shell script**::
+
+      sed -i 's/country/BioSampleGeographicLocation/g' morelli2010/nextstrain/lat_longs.tsv
+      sed -i 's/Iran/Kurdistan/g' morelli2010/nextstrain/lat_longs.tsv
+      sed -i 's/United States of America/USA/g' morelli2010/nextstrain/lat_longs.tsv
+      sed -i 's/Republic of the Congo/Congo/g' morelli2010/nextstrain/lat_longs.tsv
+
+------------
+
+TimeTree Phylogeny
+------------------
+
+
+Estimate a time-scaled phylogeny. Re-root with strain Pestoides F (Accession: GCA_000016445.1_ASM1644v1).
 
 **Shell script**::
 
@@ -77,7 +130,57 @@ Estimate a time-scaled phylogeny.
           --vcf-reference morelli2010/reference_genome/GCF_000009065.1_ASM906v1_genomic.fna \
           --metadata morelli2010/nextstrain/metadata_nextstrain_edit.tsv \
           --timetree \
-          --root residual \
+          --root GCA_000016445.1_ASM1644v1_genomic \
           --coalescent opt \
           --output-tree morelli2010/nextstrain/tree.nwk \
           --output-node-data morelli2010/nextstrain/branch_lengths.json;
+
+------------
+
+Ancestral Traits
+----------------
+
+Reconstruction of ancestral traits.
+Note: Investigate the  --sampling-bias-correction option.
+
+**Shell script**::
+
+          augur traits \
+              --tree morelli2010/nextstrain/tree.nwk \
+              --metadata morelli2010/nextstrain/metadata_nextstrain_edit.tsv \
+              --columns BioSampleGeographicLocation BioSampleBiovar AssemblyTotalLength \
+              --confidence \
+              --output morelli2010/nextstrain/traits.json
+
+------------
+
+Export
+------
+
+Export the json files for an auspice server.
+
+**Shell script**::
+
+          augur export v2 \
+              --tree morelli2010/nextstrain/tree.nwk \
+              --metadata morelli2010/nextstrain/metadata_nextstrain_edit.tsv \
+              --node-data morelli2010/nextstrain/branch_lengths.json morelli2010/nextstrain/traits.json \
+              --auspice-config morelli2010/nextstrain/auspice_config.json \
+              --output morelli2010/nextstrain/morelli2010.json \
+              --lat-longs morelli2010/nextstrain/lat_longs.tsv
+
+
+------------
+
+Visualize
+---------
+
+Start up an auspice server to view the results in a browser.
+
+[error] Uncaught error in app.listen(). Code: ENOTFOUND
+
+Is an error that is frequently encountered. Deactivating and activating the conda environment has been known to help. As well as installing the actual nextstrain conda environment from their documentation.
+
+**Shell script**::
+
+      auspice view --datasetDir auspice
diff --git a/scripts/geocode_NextStrain.py b/scripts/geocode_NextStrain.py
@@ -28,7 +28,7 @@
 #                            Argument Parsing                           #
 #-----------------------------------------------------------------------#
 
-parser = argparse.ArgumentParser(description='Extract Assembly and SRA metadata from an NCBImeta sqlite database to create the tsv input file for NextStrain..',
+parser = argparse.ArgumentParser(description='Geocode a location string into latitude and longtitude coordinates.',
                                  add_help=True)
 
 # Argument groups for the program
@@ -57,6 +57,12 @@
                     dest = 'outLatLon',
                     required = True)
 
+parser.add_argument('--div',
+                    help = 'Constrain all lat lon to division level [country].',
+                    action = 'store',
+                    dest = 'forceDiv',
+                    required = False)
+
 
 
 # Retrieve user parameters
@@ -66,6 +72,7 @@
 col_name = args['colName']
 out_path = args['outPath']
 out_lat_lon = args['outLatLon']
+force_div = args['forceDiv']
 
 #------------------------------------------------------------------------------#
 #                            Error Catching                                    #
@@ -108,8 +115,8 @@
 LOC_DIVISIONS_REV = LOC_DIVISIONS[:]
 LOC_DIVISIONS_REV.reverse()
 
-# Count number of lines in input file
-total_line_count=0
+# Count number of lines in input file (substract 1 for header)
+total_line_count=0-1
 for line in in_file:
     total_line_count += 1
 in_file.close()
@@ -189,15 +196,26 @@
 for geo_loc in geo_loc_dict:
     # Skip if latitude/longitude is empty
     if geo_loc_dict[geo_loc]['latitude'] == NO_DATA_CHAR: continue
-    # Write the highest resolution division and lat lon to different tsv
-    for loc_div in LOC_DIVISIONS_REV:
-        # Once a location is found, write that and break out
+    # If the division level is free to vary
+    if not force_div:
+        # Write the highest resolution division and lat lon to different tsv
+        for loc_div in LOC_DIVISIONS_REV:
+            # Once a location is found, write that and break out
+            if  geo_loc_dict[geo_loc]['address'][loc_div] != NO_DATA_CHAR:
+                out_lat_lon_file.write(loc_div + DELIM +
+                    geo_loc_dict[geo_loc]['address'][loc_div] + DELIM +
+                    geo_loc_dict[geo_loc]['latitude'] + DELIM +
+                    geo_loc_dict[geo_loc]['longitude'] + "\n")
+                break
+    else:
+        # Write the forced division and lat lon to different tsv
+        loc_div = force_div
         if  geo_loc_dict[geo_loc]['address'][loc_div] != NO_DATA_CHAR:
             out_lat_lon_file.write(loc_div + DELIM +
                 geo_loc_dict[geo_loc]['address'][loc_div] + DELIM +
                 geo_loc_dict[geo_loc]['latitude'] + DELIM +
                 geo_loc_dict[geo_loc]['longitude'] + "\n")
-            break
+
 
 #------------------------------------------------------------------------------#
 #                            Clean Up                                          #