update metadata formatting
ktmeaton committed May 15, 2020
1 parent a73d376 commit 4f056f5
Showing 1 changed file with 35 additions and 0 deletions.
35 changes: 35 additions & 0 deletions docs/exhibit/exhibit_dhsi2020.rst
@@ -46,3 +46,38 @@ Run the full pipeline, including sample download, aligning to a reference genome
--skip_sra_download \
--sqlite_select_command_asm "\"SELECT AssemblyFTPGenbank FROM Master WHERE BioSampleComment LIKE '%Morelli%'\"" \
-resume

Prepare a metadata file for NextStrain. Afterwards, manually clean up the dates: remove uncertainty characters and convert them to the format 2000-XX-XX. For columns that hold multiple entries (e.g. AssemblyTotalLength), retain only the first semicolon-separated value.

**Shell script**::

mkdir -p morelli2010/nextstrain/

# Export the selected columns to a TSV; "?" is quoted so the shell cannot glob-expand it.
scripts/sqlite_NextStrain_tsv.py \
--database results/ncbimeta_db/update/latest/output/database/yersinia_pestis_db.sqlite \
--query "SELECT BioSampleAccession,AssemblyFTPGenbank,SRARunAccession,BioSampleStrain,BioSampleCollectionDate,BioSampleHost,BioSampleGeographicLocation,BioSampleBiovar,PubmedArticleTitle,PubmedAuthorsLastName,AssemblyContigCount,AssemblyTotalLength,NucleotideGenes,NucleotideGenesTotal,NucleotidePseudoGenes,NucleotidePseudoGenesTotal,NucleotiderRNAs,AssemblySubmissionDate,SRARunPublishDate,BioSampleComment FROM Master WHERE (BioSampleComment LIKE '%Morelli%' AND TRIM(AssemblyFTPGenbank) > '')" \
--no-data-char "?" \
--output morelli2010/nextstrain/metadata_nextstrain.tsv

# Prepend a "name" column to the header line.
head -n 1 morelli2010/nextstrain/metadata_nextstrain.tsv | \
awk -F "\t" '{print "name\t"$0}' \
> morelli2010/nextstrain/metadata_nextstrain_edit.tsv

# Build each name from the 10th "/"-separated field of the FTP path in column 2
# (AssemblyFTPGenbank), with "_genomic" appended.
tail -n +2 morelli2010/nextstrain/metadata_nextstrain.tsv | \
awk -F "\t" '{split($2,ftpSplit,"/"); name=ftpSplit[10]"_genomic"; print name"\t"$0}' \
>> morelli2010/nextstrain/metadata_nextstrain_edit.tsv
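
The manual cleanup described before the script could be partly automated. The following is only a sketch under assumptions: `clean_metadata` is a hypothetical helper (not part of the repository), the uncertain dates are assumed to contain a four-digit year somewhere in the field (e.g. `~1954?`), and all dates are reduced to year precision in the `2000-XX-XX` style. Fields without a year, such as the `?` no-data character, are left unchanged.

```shell
# Hypothetical helper, not part of the repository.
# Usage: clean_metadata DATE_COL MULTI_COLS < in.tsv > out.tsv
#   DATE_COL   : 1-based column number of the collection date
#   MULTI_COLS : comma-separated column numbers that may hold several
#                semicolon-separated values
clean_metadata() {
  awk -F "\t" -v OFS="\t" -v date_col="$1" -v multi_cols="$2" '
    BEGIN { n = split(multi_cols, cols, ",") }
    NR == 1 { print; next }              # pass the header through unchanged
    {
      # Retain only the first semicolon-separated value in the listed columns.
      for (k = 1; k <= n; k++) sub(/;.*/, "", $(cols[k]))
      # Assumption: the first run of four digits in the date field is the year.
      if (match($date_col, /[0-9][0-9][0-9][0-9]/))
        $date_col = substr($date_col, RSTART, 4) "-XX-XX"
      print
    }'
}

# Example call (column positions and the output file name are assumptions based
# on the query order above, shifted by one for the prepended "name" column):
# clean_metadata 6 13 < morelli2010/nextstrain/metadata_nextstrain_edit.tsv \
#   > morelli2010/nextstrain/metadata_nextstrain_clean.tsv
```

Inspecting the result by hand is still advisable, since date fields that do not follow the assumed pattern pass through untouched.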

Estimate a time-scaled phylogeny.

**Shell script**::

augur refine \
--tree morelli2010/iqtree/iqtree.core-filter0_bootstrap.treefile \
--alignment morelli2010/snippy_multi/snippy-core.full_CHROM.fasta \
--vcf-reference morelli2010/reference_genome/GCF_000009065.1_ASM906v1_genomic.fna \
--metadata morelli2010/nextstrain/metadata_nextstrain_edit.tsv \
--timetree \
--root residual \
--coalescent opt \
--output-tree morelli2010/nextstrain/tree.nwk \
--output-node-data morelli2010/nextstrain/branch_lengths.json;
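
Before running the command above, it may help to confirm that every sequence has a metadata row. This is a sketch, not part of the original workflow: `missing_metadata` is a hypothetical helper, and it assumes that the FASTA headers of the alignment carry the same names as the tree tips and that the names sit in the first column of the metadata file, as in `metadata_nextstrain_edit.tsv`.

```shell
# Hypothetical helper, not part of the repository.
# Usage: missing_metadata ALIGNMENT.fasta METADATA.tsv
# Prints every FASTA record name that has no row in the metadata file.
missing_metadata() {
  awk -F "\t" '
    # First file (metadata): remember each name in column 1, skipping the header.
    NR == FNR { if (FNR > 1) seen[$1] = 1; next }
    # Second file (FASTA): check each header line against the remembered names.
    /^>/ {
      name = substr($0, 2)
      sub(/[ \t].*/, "", name)           # drop any description after the name
      if (!(name in seen)) print name
    }
  ' "$2" "$1"
}

# Example call (paths taken from the commands above):
# missing_metadata morelli2010/snippy_multi/snippy-core.full_CHROM.fasta \
#   morelli2010/nextstrain/metadata_nextstrain_edit.tsv
```

Any names printed would need a metadata row added (or a name corrected) before the dates can be used for that sequence.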
