Skip to content

Commit

Permalink
exhibit doc up to auspice server
Browse files Browse the repository at this point in the history
  • Loading branch information
ktmeaton committed May 15, 2020
1 parent 4f056f5 commit c711b5b
Show file tree
Hide file tree
Showing 2 changed files with 135 additions and 14 deletions.
117 changes: 110 additions & 7 deletions docs/exhibit/exhibit_dhsi2020.rst
Original file line number Diff line number Diff line change
Expand Up @@ -16,8 +16,8 @@ Note: Requires `conda <https://docs.conda.io/projects/conda/en/latest/user-guide

------------

Phylogenetics Pipeline
--------------------------
Data Download
-------------

Download the samples and reference found in the `Morelli et al. 2010 pulication <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2999892/>`_.

Expand All @@ -36,6 +36,11 @@ Check that there are 15 samples to be downloaded.

wc -l morelli2010/sqlite_import/assembly_for_download.txt

------------

Basic Phylogeny Pipeline
------------------------

Run the full pipeline, including sample download, aligning to a reference genome, model selection, and phylogenetics.

**Shell script**::
Expand All @@ -47,27 +52,75 @@ Run the full pipeline, including sample download, aligning to a reference genome
--sqlite_select_command_asm "\"SELECT AssemblyFTPGenbank FROM Master WHERE BioSampleComment LIKE '%Morelli%'\"" \
-resume

Prepare a metadata file for NextStrain. Afterwards, manually clean up dates to remove uncertainty characters and change to the format 2000-XX-XX. Also separate out columns that have multiple entries (ex. AssemblyTotalLength) by retaining the first semi-colon separated value.
------------

TimeTree Metadata
-----------------

Prepare a metadata file for timetree.

**Shell script**::

mkdir -p morelli2010/nextstrain/

scripts/sqlite_NextStrain_tsv.py \
--database results/ncbimeta_db/update/latest/output/database/yersinia_pestis_db.sqlite \
--query "SELECT BioSampleAccession,AssemblyFTPGenbank,SRARunAccession,BioSampleStrain,BioSampleCollectionDate,BioSampleHost,BioSampleGeographicLocation,BioSampleBiovar,PubmedArticleTitle,PubmedAuthorsLastName,AssemblyContigCount,AssemblyTotalLength,NucleotideGenes,NucleotideGenesTotal,NucleotidePseudoGenes,NucleotidePseudoGenesTotal,NucleotiderRNAs,AssemblySubmissionDate,SRARunPublishDate,BioSampleComment FROM Master WHERE (BioSampleComment LIKE '%Morelli%' AND TRIM(AssemblyFTPGenbank) > '')" \
--query "SELECT BioSampleAccession,AssemblyFTPGenbank,BioSampleStrain,BioSampleCollectionDate,BioSampleGeographicLocation,BioSampleBiovar FROM Master WHERE (BioSampleComment LIKE '%Morelli%' AND TRIM(AssemblyFTPGenbank) > '')" \
--no-data-char ? \
--output morelli2010/nextstrain/metadata_nextstrain.tsv

head -n 1 morelli2010/nextstrain/metadata_nextstrain.tsv | \
awk -F "\t" '{print "name\t"$0}' \
awk -F "\t" '{print "strain\t"$0}' \
> morelli2010/nextstrain/metadata_nextstrain_edit.tsv

tail -n +2 morelli2010/nextstrain/metadata_nextstrain.tsv | \
awk -F "\t" '{split($2,ftpSplit,"/"); name=ftpSplit[10]"_genomic"; print name"\t"$0}' \
>> morelli2010/nextstrain/metadata_nextstrain_edit.tsv

Estimate a time-scaled phylogeny.
Afterwards, change the BioSampleCollectionDate column to 'date'.

**Shell script**::

sed -i 's/BioSampleCollectionDate/date/g' morelli2010/nextstrain/metadata_nextstrain_edit.tsv

- remove uncertainty characters ex. <
- change format to 2000-XX-XX.

Edit the BioSampleGeographicLocation column so that:

- Change everything just to country name
- USSR to Russia

Add a line for the Reference Genome that just says "Reference" under the strain column, and is question marks for all remaining columns. We will let the program infer the metadata and see how close it gets.

Geocode the GeographicLocation column to get lat lon coordinates.

**Shell script**::

scripts/geocode_NextStrain.py \
--in-tsv morelli2010/nextstrain/metadata_nextstrain_edit.tsv \
--loc-col BioSampleGeographicLocation \
--out-tsv morelli2010/nextstrain/metadata_nextstrain_geocode.tsv \
--out-lat-lon morelli2010/nextstrain/lat_longs.tsv \
--div country

Replace the division name 'country' with our column name 'BioSampleGeographicLocation' in the lat lon file.
Edit country names in the lat lon file to match our original metadata.

**Shell script**::

sed -i 's/country/BioSampleGeographicLocation/g' morelli2010/nextstrain/lat_longs.tsv
sed -i 's/Iran/Kurdistan/g' morelli2010/nextstrain/lat_longs.tsv
sed -i 's/United States of America/USA/g' morelli2010/nextstrain/lat_longs.tsv
sed -i 's/Republic of the Congo/Congo/g' morelli2010/nextstrain/lat_longs.tsv

------------

TimeTree Phylogeny
------------------


Estimate a time-scaled phylogeny. Re-root with strain Pestoides F (Accession: GCA_000016445.1_ASM1644v1).

**Shell script**::

Expand All @@ -77,7 +130,57 @@ Estimate a time-scaled phylogeny.
--vcf-reference morelli2010/reference_genome/GCF_000009065.1_ASM906v1_genomic.fna \
--metadata morelli2010/nextstrain/metadata_nextstrain_edit.tsv \
--timetree \
--root residual \
--root GCA_000016445.1_ASM1644v1_genomic \
--coalescent opt \
--output-tree morelli2010/nextstrain/tree.nwk \
--output-node-data morelli2010/nextstrain/branch_lengths.json;

------------

Ancestral Traits
----------------

Reconstruction of ancestral traits.
Note: Investigate the --sampling-bias-correction option.

**Shell script**::

augur traits \
--tree morelli2010/nextstrain/tree.nwk \
--metadata morelli2010/nextstrain/metadata_nextstrain_edit.tsv \
--columns BioSampleGeographicLocation BioSampleBiovar AssemblyTotalLength \
--confidence \
--output morelli2010/nextstrain/traits.json

------------

Export
------

Export the json files for an auspice server.

**Shell script**::

augur export v2 \
--tree morelli2010/nextstrain/tree.nwk \
--metadata morelli2010/nextstrain/metadata_nextstrain_edit.tsv \
--node-data morelli2010/nextstrain/branch_lengths.json morelli2010/nextstrain/traits.json \
--auspice-config morelli2010/nextstrain/auspice_config.json \
--output morelli2010/nextstrain/morelli2010.json \
--lat-longs morelli2010/nextstrain/lat_longs.tsv


------------

Visualize
---------

Start up an auspice server to view the results in a browser.

[error] Uncaught error in app.listen(). Code: ENOTFOUND

Is an error that is frequently encountered. Deactivating and activating the conda environment has been known to help. As well as installing the actual nextstrain conda environment from their documentation.

**Shell script**::

auspice view --datasetDir auspice
32 changes: 25 additions & 7 deletions scripts/geocode_NextStrain.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@
# Argument Parsing #
#-----------------------------------------------------------------------#

parser = argparse.ArgumentParser(description='Extract Assembly and SRA metadata from an NCBImeta sqlite database to create the tsv input file for NextStrain..',
parser = argparse.ArgumentParser(description='Geocode a location string into latitude and longtitude coordinates.',
add_help=True)

# Argument groups for the program
Expand Down Expand Up @@ -57,6 +57,12 @@
dest = 'outLatLon',
required = True)

parser.add_argument('--div',
help = 'Constrain all lat lon to division level [country].',
action = 'store',
dest = 'forceDiv',
required = False)



# Retrieve user parameters
Expand All @@ -66,6 +72,7 @@
col_name = args['colName']
out_path = args['outPath']
out_lat_lon = args['outLatLon']
force_div = args['forceDiv']

#------------------------------------------------------------------------------#
# Error Catching #
Expand Down Expand Up @@ -108,8 +115,8 @@
LOC_DIVISIONS_REV = LOC_DIVISIONS[:]
LOC_DIVISIONS_REV.reverse()

# Count number of lines in input file
total_line_count=0
# Count number of lines in input file (substract 1 for header)
total_line_count=0-1
for line in in_file:
total_line_count += 1
in_file.close()
Expand Down Expand Up @@ -189,15 +196,26 @@
for geo_loc in geo_loc_dict:
# Skip if latitude/longitude is empty
if geo_loc_dict[geo_loc]['latitude'] == NO_DATA_CHAR: continue
# Write the highest resolution division and lat lon to different tsv
for loc_div in LOC_DIVISIONS_REV:
# Once a location is found, write that and break out
# If the division level is free to vary
if not force_div:
# Write the highest resolution division and lat lon to different tsv
for loc_div in LOC_DIVISIONS_REV:
# Once a location is found, write that and break out
if geo_loc_dict[geo_loc]['address'][loc_div] != NO_DATA_CHAR:
out_lat_lon_file.write(loc_div + DELIM +
geo_loc_dict[geo_loc]['address'][loc_div] + DELIM +
geo_loc_dict[geo_loc]['latitude'] + DELIM +
geo_loc_dict[geo_loc]['longitude'] + "\n")
break
else:
# Write the forced division and lat lon to different tsv
loc_div = force_div
if geo_loc_dict[geo_loc]['address'][loc_div] != NO_DATA_CHAR:
out_lat_lon_file.write(loc_div + DELIM +
geo_loc_dict[geo_loc]['address'][loc_div] + DELIM +
geo_loc_dict[geo_loc]['latitude'] + DELIM +
geo_loc_dict[geo_loc]['longitude'] + "\n")
break


#------------------------------------------------------------------------------#
# Clean Up #
Expand Down

0 comments on commit c711b5b

Please sign in to comment.