identify_founders.Rnw

\documentclass[a4paper]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{tabularx,ragged2e,booktabs,caption}
\newcolumntype{C}[1]{>{\Centering}m{#1}}
\renewcommand\tabularxcolumn[1]{C{#1}}

\title{identify-founders: Notes and explanations}
\author{Paul Edlefsen}
\date{}

\begin{document}

\maketitle

<<echo = FALSE>>=
## R packages needed
library( "xtable" )
library( "coin" )
library( "ggplot2" )

## for prettyPrintPValuesTo4Digits
source( "~/src/from-git/projects/pedlefse/rapporttemplates/rapporttemplates-util.R" )

# Setup for prettier Sweave output.
old.continue.option <- options( continue = " " )
@ 

\section{Code TODOs}
\begin{itemize}
  \item Add calls to PFitter.R to get tests for star-like phylogeny, plus timing estimations (formatteddays has est and CI)
    \begin{itemize}
      \item use distance computation from clusterInformativeSites.R
      \item try calling separately on putative founder clusters (does each founder appear Poisson?).
     \end{itemize}
  \item Try recursion after a single cut, as an alternative to the full cutting using dynamicTreeCut.
  \item Add BEAST.
  \item Add comparison to distribution of clade-similar seqs in curated LANL alignment (see Morgane and Josh\'s method).
    \begin{itemize}
      \item could do this with a profile, same way.
    \end{itemize}
  \item Add longitudinal analysis: try running separately by time, then merging the times (?).
\end{itemize}

\section{Background on methods}

\subsection{Founder identification}


\textbf{Ireen E. Kiwelu}'s article \cite{Kiwelu:2012aa} describes a pipeline for founder identification using phyml and nj:
\begin{quote}
Decision on multiplicity of HIV-1 infection was made based on branching topology and clustering of analyzed viral quasispecies. The HIV-1 single variant was defined as an infection of the same HIV-1 subtype in which the viral quasispecies form a distinct phylogenetic cluster with an aLRT support value $\geq$ 0.95 in ML analysis, and/or a bootstrap value $\geq$ 80\% in NJ analysis. The HIV-1 multiple variant was defined as an infection resulting from more than one variant of the same or different HIV-1 subtypes in which the viral quasispecies formed more than one distinct cluster each supported with an aLRT support value $\geq$ 0.95 in ML analysis, and/or a bootstrap value $\geq$ 80\% in NJ analysis. Mean pairwise genetic distances between the clusters were computed for subjects infected with multiple HIV-1 variants.

Nucleotide differences were visualized by the highlighter tool from the Los Alamos HIV-1 Sequence Database website by comparing the HIV-1 proviral DNA and viral RNA sequences to the consensus of the earliest sequences created in BioEdit.
\end{quote}

\textbf{Ireen E. Kiwelu}'s article \cite{Kiwelu:2012aa} data:
\begin{quote}
  All generated sequences were submitted to GenBank. The accession numbers are JX070938–JX072916.
\end{quote}


\textbf{Vladimir Novitsky}'s article \cite{Novitsky:2011aa} outlines a pipeline for founder identification in early acute infection:
\begin{quote}
Analysis of viral distances

Viral pairwise ML-corrected distances were analyzed using DIVEIN [23]. The Kimura-2-parameters (K2P)-corrected and Hamming distances were analyzed in MEGA v4 [24]. The MRCA and pairwise ML-corrected distances to MRCA were estimated in DIVEIN [23]. The MRCA, a hypothetical viral sequence that represents the most recent viral variant from which a subject's viral quasispecies are descended, was reconstructed in DIVEIN [23] by the joint maximum likelihood procedure [25]. The majority consensus sequence, another hypothetical viral sequence that indicates the most abundant nucleotide in the multiple sequence alignment at each position, was built in BioEdit [26]. The pairwise K2P-corrected and Hamming distances to the consensus sequence were quantified in MEGA v4 [24].

Poisson fitness

Poisson fitness analysis was performed using the online tool at Los Alamos National Laboratory at http://www.hiv.lanl.gov/content/sequence​/POISSON_FITTER/poisson_fitter.html [27]. The tool analyzes frequency of Hamming distances by computing the best fitting Poisson distribution through ML and evaluating results of the Goodness of Fit test (GOF).

Highlighter plots

Highlighter plots were generated by Highlighter at www.hiv.lanl.gov, a visualization tool of aligned nucleotide sequences that highlights nucleotide polymorphisms and marks APOBEC signatures.

Recombination analysis

Recombination analysis was performed using package RDP3, a computer program for statistical identification and characterization of recombination events in DNA sequences [28]. RDP3 utilizes a range of non-parametric recombination detection methods including BOOTSCAN, MAXCHI, CHIMAERA, 3SEQ, GENECONV, SISCAN, PHYLPRO and VISRD [29]–[34]. RDP3 treats every sequence within the analyzed alignment as a potential recombinant, and systematically screens sequence triplets and/or quartets to identify sequences that contain a recombinant and two sequences that could serve as parents, and performs a statistical evaluation of recombination signal [28]. Such an approach eliminates the need for reference sequences, which makes analysis of viral quasispecies from epidemiologically unlinked patients more practical.

Estimation of tMRCA

Analysis was performed using Bayesian inference with a Markov Chain Monte Carlo (MCMC) method implemented in BEAST v.1.5.4 [35]. Longitudinal viral quasispecies dated according to the day of sampling from a subset of six acutely infected individuals were utilized to identify the rate of viral evolution within HIV-1 subtype C gag and env. For acutely infected individuals, the MRCA of viral quasispecies sampled at a given time was constrained to a uniform calibration prior bounded between the time of sample collection in relation to the estimated time of seroconversion (lower bound) and the same value plus 30 days as an average time period between the time of infection and seroconversion. The rate of evolution over the entire tree was estimated as the meanRate parameter for each case of acute HIV-1 infection. The geometric mean evolutionary rate in HIV-1C gag was estimated at 1.23E-05 (95\% CI 8.07E-06 – 1.64E-05) substitutions per site per day. The geometric mean evolutionary rate in HIV-1C env gp120 was estimated at 3.71E-05 (95\% CI 1.97E-05 – 5.45E-05) substitutions per site per day. The estimated rate of gag and env gp120 evolution was applied to determine tMRCA for five “undetermined” cases in the study, and the tMRCA was estimated as the treeModel.rootHeight parameter. The gag and env gp120 sequence data were analyzed using the evolutionary model selected by the Akaike information criterion in jModeltest 0.1.1 [36] and a relaxed molecular clock (uncorrelated lognormal) under the Yule model. The value of the effective sampling size (ESS) was controlled to be above 200, and the length of the MCMC chain was at least 20,000,000.

The decision on multiplicity of HIV infection

The decision on multiplicity of HIV infection was made based on the model's fit with or failure to explain the observed extent of viral sequence heterogeneity. The model's fit was associated with transmission of a single viral variant, while the model's failure was interpreted as either transmission of multiple viral variants, or the result of rapid immune selection driving the observed level of viral diversification [1], [37]–[39].
\end{quote}

\textbf{Vladimir Novitsky}'s other article \cite{Novitsky:2013aa} describes the BEAST parameters for estimating evolution rates, in detail:
\begin{quote}
To estimate evolutionary rates in HIV-1C gag and env gp120 we used a comprehensive BEAST package v.1.7 (Drummond and Rambaut, 2007) that incorporates phylogenetic uncertainty into estimates of evolutionary parameters (Gray et al., 2011) as described previously (Stadler et al.). Briefly, the optimal evolutionary model estimated by jModelTest (Posada, 2008) was used for inferring the maximum likelihood (ML) tree by PhyML (Guindon et al., 2009; Guindon et al., 2010). The ML tree was used as a starting tree in Bayesian analysis. The independent HKY+G model was used for each codon, which allowed us to estimate relative rates of evolution at each codon position and codon rate ratio values of the 1st to 3rd codon position, and 2nd to 3rd codon position. We employed a relaxed molecular clock approach that explicitly models and estimates the level of rate variation among lineages (Drummond et al., 2006). The posterior distribution for the coefficient of variation that does not include zero indicates that the relaxed clock model provides a better fit to the data than the strict clock model (Drummond et al., 2006; Gray et al., 2011). The prior on the HKY transition/transversion parameter (κ) was Gamma distribution with a shape of 0.05 and a scale of 20. The prior on the alpha shape parameter (α) was an exponential distribution with a mean of 1. The ucld.stdev (S) parameter of the lognormal relaxed clock had an exponential prior with mean of 0.333. Viral evolutionary rates before and after the ART were estimated in a subset of six subjects who started ART during the follow-up period (Table 1) by assigning temporal partitions corresponding to the sampling intervals before and after initiation of ART.
\end{quote}

\textbf{Vladimir Novitsky}'s other article \cite{Novitsky:2013aa} has data:
\begin{quote}
A total of 1,963 HIV-1 subtype C sequences used in this study were deposited to GenBank previously with the following accession numbers: GQ275453–GQ275667, GQ275750– GQ276051, GQ276137–GQ276247, GQ870874–GQ870904, GQ276248–GQ276284, GQ276418–GQ276698, GQ276766–GQ276795, GQ276797–GQ277080, GQ277179– GQ277222, GQ277224–GQ277293, GQ277295–GQ277374, GQ277404–GQ277443, GQ277446–GQ277569, GQ375107–GQ375128, GQ870874–GQ870939, GQ870957– GQ871036, and GQ871050–GQ871183. Additionally, 2,219 HIV-1C sequences were deposited to GenBank within this study: accession numbers KC628761–KC630979. The supplementary Table S1 includes information for all viral sequences used in this study including sequence ID, former sequence ID (if any), GenBank accession numbers, patient code, source of amplification, date of sampling, and the number of days from estimated seroconversion.
\end{quote}

 Elena E. Giorgi's article \cite{Giorgi:2010aa} describes the Poisson Fitter algorithm.
 Elena E. Giorgi's article \cite{Giorgi:2010aa} has data, but the first one is missing.
\begin{quote}
 Figure 2 shows the graphics obtained by analyzing a fragment of the NEF HIV-1 gene (169 base pairs) from patient CH40 \cite{Keele:2008aa,fischer2010rapid}. The data have been published \cite{fischer2010rapid} and submitted to the NCBI Sequence Read Archive http://www.ncbi.nlm.nih.gov/Traces/sra under accession number SRA020793. This sample was obtained through deep sequencing [6,20] and yielded a little over 4,000 sequences, though our tool can easily handle ten times as many sequences: because it works only with counts of pairwise distances, it can handle samples of almost any reasonable size, though very large jobs will slow the server. The left panel in Figure ​Figure22 shows the pairwise HD frequency counts (black), the best fitting Poisson distribution (blue), and the expected counts if it were a star-phylogeny (red) on a logarithmic scale. The fact that the red line and the black line are indistinguishable confirms that the sample follows a star-like phylogeny. Because the Poisson fit is very sensitive to deviations in the upper tail of the distribution, the tool outputs graphics in the logarithmic scale whenever the sample size is above 100; this helps visualize possible deviations at the higher distances. Though the values are discrete, lines are used for better visualization. In the right panel a histogram of the frequency counts is shown together with, in red, the best fitting Poisson distribution. In this case the sample yielded a good fit (P = 0.981) and a time to MRCA of 34 days 95\% CI = (31, 38).
\end{quote}
  
The Fischer article \cite{Fischer:2010aa} did deep sequencing (454) longitudinally in three subjects [SUMA0874 (SUMA), WEAU0575 (WEAU) and CHAVI-700-01-004-0 (CH40)].
\begin{quote}
The sequence read data for this study have been submitted to the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra) under accession number SRA020793.
\end{quote}
NOTE: This is SRA code is MISSING from the Read Archive!

Raabya's paper \cite{Rossenkhan:2012aa} on non-structural genes:
\begin{quote}
HIV-1 Subtyping
Nucleotide sequences were codon aligned using the MUSCLE
algorithm implemented in Mega 5 [28] followed by manual
adjustment in Bioedit [29]. For HIV-1 subtyping, three sequences
per patient were randomly selected from the pool of generated
viral quasispecies. The selection criteria included earliest timepoint
available and sequence length with more than 90\% coverage of
the targeted region (5,041 to 6,310 nt position in HXB2). To
determine phylogenetic relationships and clustering patterns of
generated viral sequences, the phylogenetic tree reconstruction
was performed using Neighbor joining, Maximum likelihood, and
Bayesian methods. A standard set of HIV-1 subtype references
from LANL was included in the analysis. The sequence
CPZ.CM.98.CAM3.AF115393 was used as an outgroup. The
online Rega-2 subtyping tool was used in parallel for confirmation
[30].

HIV-1C Viral Diversity and Diversification
Intra-patient Diversity Analysis
The intra-patient mean pairwise distances were measured in
Mega 5 using the Maximum Composite Likelihood (MCL) model,
following nucleotide sequences alignment using the MUSCLE
algorithm implemented in Mega 5 [28] and manual adjustment in
Bioedit [29]. Mean values were calculated at each sampling
timepoint and the averages of these were used in each time ‘‘bin’’
(0–90, 91–180, 181–500 days p/s). The viral diversity for each
non-structural gene was computed as a mean value with 95\%
Confidence Interval (CI). HIV-1 RNA load in plasma was
measured routinely at each study visit over the first year of
follow-up. The study schedule included weekly visits for the first
two months, bi-weekly for the next two months, and monthly for
the next eight months for acutely infected individuals, or monthly
for individuals identified within Fiebig stage IV-V. After the first
year, quarterly study visits were scheduled for all subjects. Details
of HIV-1 RNA load measurement have been presented elsewhere
[21,22,31]. Mean values for the analyzed time intervals, 0–90 days
p/s, 91–180 days p/s, and 181–500 days p/s, and for 100–300
days p/s as described previously, were also included [21]. Potential
associations between intra-patient viral diversity and HIV-1 RNA
load at the three time intervals, 0–90 days p/s, 91–180 days p/s,
and 181–500 days p/s, were assessed.

Statistical Methods
Sigma plot version 11 was used for descriptive statistics to
summarize medians, means, and standard deviations. Comparisons
between groups were made using t-test and Mann Whitney
Rank sum test for continuous and binary outcomes respectively.
All reported p-values are 2-sided. Regression analysis was
performed using linear regression and the Spearman Rank Test.
\end{quote}
  
  Raabya's paper \cite{Rossenkhan:2012aa} has data:
  \begin{quote}
    GenBank Accession Numbers
Sequences have been assigned GenBank accession numbers
JQ895561–JQ896230
    \end{quote}
http://www.ncbi.nlm.nih.gov/nuccore/?term=JQ895561%3AJQ896230%5Baccn%5D

Josh and Morgane's paper \cite{Herbeck:2011aa} describes a pipeline for founder identification:
\begin{quote}
Nucleotide sequences were aligned with Clustal W, version 1.8 (70), and manually edited with MacClade, version 4.08 (44). Alignments are available at http://mullins.lab.microbiol.washington.edu/publications/herbeck\_2011/. Alignments of phylogenetically informative nucleotide sites omit mutations that occur only once, which are possibly introduced by polymerase-induced errors during PCR. This informative-sites (InSites) approach (http://indra.mullins.microbiol.washington.edu/DIVEIN/insites.html) results in slightly decreased estimates of nucleotide diversity relative to single-template amplification methods (61) although standard methods of PCR/cloning have been shown to produce measures of population structure and genetic diversity equivalent to those found with single-genome amplification methods (31). An insertion or deletion that spanned multiple sites was counted as a single informative site. APOBEC3G/APOBEC3F (APOBEC3F/G)-induced mutations were evaluated using Hypermut, version 2.0 (http://www.hiv.lanl.gov/content/sequence/HYPERMUT/hypermut.html), in intrahost datasets by taking the consensus sequence at visit 1 as a reference; one putative APOBEC-induced G-to-A hypermutated sequence in subject S1 was identified and excluded from subsequent analyses. Maximum-likelihood phylogenetic trees were reconstructed using the general time-reversible model of substitution with gamma distribution in PhyML (version 2.4.5) (23). Potential N-linked glycosylation sites (PNGS) in Env were predicted using N-GLYCOSITE (76). All Env sequences were evaluated for CCR5 or CXCR4 coreceptor specificity using the position-specific site matrix (PSSM) web tool (30) (http://indra.mullins.microbiol.washington.edu/webpssm). For each individual with five or more sequenced time points, the rate of nucleotide diversity increase was estimated using univariate linear regression analysis. Overall rates of diversity increase were calculated by pooling all data points and, alternatively, by estimating the mean of rates calculated separately for each individual. Intrahost phylogenies were reconstructed to identify distinct lineages, taken to indicate multiple founders, replicating within each individual. Using sequences from the first time point examined (visit 1), we examined the distribution of pairwise genetic diversity and Hamming distances (HD; the uncorrected count of nucleotide differences between two sequences) for all nucleotide sites and for phylogenetically informative sites.

Identifying signatures of sequence evolution. (i) Neutrality tests.
Two statistical tests of neutral evolution implemented in the DnaSP software (60) were used. Tajima's D (69) is based on the difference between two estimates of θ (θ = 2Neμ in a haploid population, where Ne is effective population size, and μ is the mutation rate per generation); one estimate is based on the number of segregating nucleotide sites (θW), and the other is based on the average pairwise distance (π, θπ). In a population of constant size in neutral equilibrium, the two estimates of θ will be statistically indistinguishable, and values of D are near zero. Deviations from zero (the null hypothesis of neutral evolution) can reflect selective or demographic processes. The D* of Fu and Li (18) compares θW to θ based on the total number of mutations on a genealogy. D and D* were calculated for genome nucleotide alignments at each time point. Bonferroni corrections for 36 tests (P = 0.05) were done after P values were estimated from null distributions created from 104 simulations under a neutral coalescent model with no recombination, conditioned on the sample size and level of variation in the observed data (60).

(ii) Neutrality tests across multiple loci.
D and D* were calculated separately for each gene, with the heuristic assumption of free recombination among genes (67) and no recombination within them (i.e., to test for deviations from neutrality among genes, we assumed that genes are independent and that selection operating on Gag will not affect Env). D and D* values were then compared across env, gag, nef, and pol, with statistical correction for 144 tests after estimating P values as above (60). Quantitative evidence of distinct evolutionary processes among loci were evaluated using a Hudson-Kreitman-Aguadé (HKA) test (28), treating each time point and gene alignment (env, gag, nef, and pol) as a separate population. The HKA test is based on the fact that selection acting on a specific locus will violate the neutral condition where Ne is equivalent across loci, with statistical significance estimated with a χ2 goodness-of-fit test (28).

(iii) Assessing positive selection.
We tested for evidence of positive selection in all nine HIV-1 genes in each individual with five or more time points. First, we measured the ratio of nonsynonymous (dN) to synonymous (dS) substitutions, dN/dS, or ω, (20, 56) using HyPhy (http://www.datamonkey.org/) (53). The fixed-effects likelihood (FEL) method with the general reversible nucleotide substitution model (REV) was used, and sites with ω of >1 and P of <0.1 were considered to be under positive selection. Second, we tested for directional positive selection using the method of Liu et al. (41), which compares the accumulation rate of amino acid mutations to the expected rate if the accumulation were due to genetic drift alone (determined by simulation).

In silico protein sequence analysis. (i) Epitope repertoires.
HLA-specific HIV-1 epitopes were predicted in all protein sequences using Epipred (25; http://atom.research.microsoft.com/bio/epipred.aspx) and NetMHC (10, 49). Epipred identifies known and potential CTL epitope motifs using 2-digit HLA information; we accepted all epitope motifs with a posterior probability of >0.5. NetMHC predicts binding of peptides to 4-digit HLA alleles; we accepted both strong and weak binders.

(ii) Comparison of each proteome to the consensus at visit 1.
For each individual, we derived a consensus from sequences found at visit 1 (in the event of two founder viruses, two respective consensus sequences were derived). Each sequence from later visits was compared to the visit 1 consensus, and we tracked the frequency of all amino acid mutations longitudinally.

(iii) Population frequency of specific amino acids.
For each amino acid mutation, we calculated the frequency of the mutant and consensus amino acid in circulating HIV-1 sequences, using a data set of independent HIV-1 clade B sequences (comprised of 200 sequences from Env, 125 from Gag, 227 from Pol, 514 from Nef, 184 from Rev, 286 from Tat, 327 from Vif, 225 from Vpr, and 203 from Vpu) (57). We defined an amino acid mutation as a forward (putative escape) mutation when there was a decrease of >50\% between the database frequencies of the visit 1 consensus amino acid and the mutated amino acid. Conversely, a reversion corresponded to an increase of >50\% between the database frequencies of the visit 1 consensus amino acid and the mutated amino acid.
\end{quote}

Josh and Morgane's paper \cite{Herbeck:2011aa} has data:
\begin{quote}
Alignments are available at https://mullinslab.microbiol.washington.edu/publications/herbeck\_2011/
    \end{quote}
NOTE: The alignments are not at that site.  Asked Josh, awaiting reply.
    
NOTE: Justine Sunshine's paper \cite{Sunshine:2015aa} apparently did not submit the data.  Asked Jim, awaiting reply.

There are two participants from HPTN 053 studied in Li Hua Ping's article \cite{Ping2013}:
\begin{quote}
Sequences obtained from those amplicons were used to generate phylogenetic trees using the program BEAST (Bayesian Evolutionary Analysis by Sampling Trees) [14]. BEAST uses a Bayesian Markov Chain Monte Carlo approach, implemented using the program BEAST v.1.6.1 [14], which was used to estimate the time to the most recent common ancestor (MRCA) for each sample. For Case A, this analysis was performed using just the major viral population. We used a fixed mutation rate (1.5×10−5) which we have found useful in defining time since transmission for a number of recent transmission events (unpublished data). We also used tip dating given the known dates of sampling, which allowed an output in days rather than generations. Accession numbers for the env gene sequences used in the BEAST analysis are KC634109-KC634205.

A second approach, the Poisson Fitter [15], was also used to estimate the extent of early evolution in Case A. The Poisson Fitter was implemented at XX %http://www.hiv.lanl.gov/content/sequence​/POISSON_FITTER/poisson_fitter.html 
using the sequences from the major viral population. A mutation rate of 2.16×10−5 per site per generation was used in the model.
\end{quote}
  
Li Hua Ping's article \cite{Ping2013} data:
\begin{quote}
  Accession numbers for the env gene sequences used in the BEAST analysis are KC634109-KC634205.
\end{quote}
http://www.ncbi.nlm.nih.gov/nuccore/?term=KC634109%3AKC634205%5Baccn%5D

Sarah Sterrett's article (with me and Katie Bar) \cite{Sterrett:2014aa} used the Keele approach, and has data:
\begin{quote}
Phylogenetic Analyses
Sequences were aligned with ClustalW and hand-checked using
MacClade 4.08. The composite maximum-likelihood (ML)
phylogenetic tree with ML bootstrap support was estimated
using RAxML-HPC-SSE3 version 7.6.3 with a Γ model of rate
heterogeneity and ML estimate of the α parameter [23]. All
other phylogenetic trees were generated by the neighbor-joining
(NJ) method using ClustalW or ML method using PhyML v.3.
Sequences were analyzed by a previously described model of
neutral virus evolution [2, 24]. Under this model, low-diversity
sequence lineages display a star-like phylogeny and a Poisson
distribution of mutations and coalesce to an unambiguous
T/F sequence [2, 24]. Sequences with evidence of APOBEC3G-mediated
hypermutation were removed from model
calculations and figures. All sequences were deposited in
GENBANK (KJ952241–KJ953713).
\end{quote}
http://www.ncbi.nlm.nih.gov/nuccore?term=KJ952241%3AKJ953713%5Baccn%5D

Chunlai Jiang's article \cite{Jiang:2011aa} has an x4-tropic subject ZP6248 has data:
\begin{quote}
The sequences determined in this study were deposited in GenBank under the accession numbers EU575573-EU575592 (env) and JN400469-JN400492 (half genome).
\end{quote}
http://www.ncbi.nlm.nih.gov/nuccore/?term=JN400469%3AJN400492%5Baccn%5D


Oschenbauer's article \cite{Oschenbauer-2012aa} has data:
\begin{quote}
GenBank accession numbers reporting the inferred T/F proviral DNA sequence (5′-U3 through U5-3′) from each subject are shown in Table 1. 

We also submitted to GenBank the sequences of all near-genome-length
9-kb SG amplicons not previously reported by Salazar-Gonzales (55), i.e.,
those derived from subjects 700010106 (JN944898 to JN944903), RHPA4259
(JN944918 to JN944926), THRO4156 (JN944929 to JN944934), and
REJO4541 (JN944911 to JN944915). Furthermore, the consensus sequences
for all 10 R-U5-encompassing 800-bp SG amplicons have been submitted
to GenBank (with the following accession numbers: CH106_RU5_con,
JN944896; CH40_RU5_con, JN944904; CH58_RU5_con, JN944906;
CH77_RU5_con, JN944908; REJO_RU5_con, JN944910; RHPA_RU5_con,
JN944916; SUMA_RU5_con, JN944927; THRO_RU5_con, JN944929;
TRJO_RU5_con, JN944935; and WITO_RU5_con, JN944937).
\end{quote}

Parrish and Wilen's article \cite{Parrish:2012aa} has data:
\begin{quote}
The inference and cloning of T/F Envs and IMCs from SGA-derived viral sequences has been described (Figure S1) [10], [12], [13], [25], [37]. To ensure efficient expression of cloned subtype C Envs for pseudotyping, the sense primer used for amplification of the corresponding rev1-vpu-env cassette lacked the rev initiation codon (underlined) (5′-CACCGGCTTAGGCATCTCCTATAGCAGGAAGAA-3′) [39].

Since chronic HIV-1 infections represent complex quasispecies of genetic variants, it is impossible to predict, based on sequence analysis alone, which members of this quasispecies are functional and which are defective or partially defective. To generate biologically relevant chronic controls, we thus targeted viral variants for both Env and IMC construction that exhibited evidence of a recent clonal expansion. Viral RNA was extracted from the plasma of chronically infected individuals and subjected to SGA and direct amplicon sequencing as described [12], [13], with the following modifications: 5′ half genome amplification: 1st round sense primer 2010ForRC 5′- GTCTCTCTAGGTRGACCAGAT -3′, 1st round antisense primer 2010Rev1C 5′- AAGCAGTTTTAGGYTGRCTTCCTGGATG -3′, 2nd round sense primer 2010R1C 5′- TAGGTRGACCAGATYWGAGCC -3′ and 2nd round antisense primer 2010Rev2C 5′- CTTCTTCCTGCCATAGGAAAT -3′; 3′ half genome: 1st round sense primer 07For7 5′- CAAATTAYAAAAATTCAAAATTTTCGGGTTTATTACAG -3′, 1st round antisense primer 2.R3.B6R 5′- TGAAGCACTCAAGGCAAGCTTTATTGAGGC-3′, 2nd round sense primer VIF1 5′- GGGTTTATTACAGGGACAGCAGAG -3′ and 2nd round antisense primer Low2C 5′- TGAGGCTTAAGCAGTGGGTTCC -3′. Thermal cycling conditions were identical to [13] except that 60°C was used for primer annealing. Sequences were then aligned using ClustalW [81] and subjected to phylogenetic analysis using PhyML [82]. Phylogenetic trees were inspected for clusters of closely related viruses, or “rakes”, which are indicative of a recent clonal expansion. (Figures 1, S2 and S3). In five subjects (Table 1), at least one env amplicon was identical in sequence to the inferred “rake” consensus and thus selected for cloning using the pcDNA3.1 Directional Topo Expression kit (Invitrogen). In two subjects, observable “rakes” were limited to only two closely related sequences, which encoded Env proteins that differed by a single amino acid. In these cases, the amplicon that matched the within patient consensus at this ambiguous site was cloned. In the remaining six subjects, the consensus sequences of the clonal expansion “rakes” were chemically synthesized and cloned (designated .synR1 in Table 1). IMCs from chronically infected subjects (CH256, CH432, CH457, and CH534) were generated using the same approach. 3′ and 5′ half-genome SGA was performed using viral RNA from subjects with evidence of clonal expansion as determined by env sequencing. 3′ and 5′ half genome sequences were used to construct neighbor joining trees (Figure S3), and clusters of closely related sequences were selected for further analysis. A consensus sequence of the members of such “rakes” was generated using Consensus Maker (hiv.lanl.gov). 3′ and 5′ half genome sequences were confirmed to be identical in their 1,040 bp overlapping regions, chemically synthesized in fragments bordered by unique restriction enzymes, and ligated together to construct infectious proviral clones.

Infection stage	Subject	env clone accession numbers	env SGA accession numbers	5' half accession numbers
Acute	20258279	HQ595763 and HQ595764 (env)	JQ753730-JQ753770	n/a
	2833264	HQ595757 (env)	JQ754177-JQ754189	n/a
	21197826	HQ595753 (env)	JQ754190-JQ754202	n/a
	21283649	HQ595756 (env)	JQ754203-JQ754229	n/a
	20927783	HQ595750 (env)	JQ754230-JQ754240	n/a
	1245045	HQ595742 (env)	JQ754241-JQ754250	n/a
	19157834	HQ595743 (env)	JQ754251-JQ754286	n/a
	2935054	HQ595758 (env)	JQ754287-JQ754304	n/a
\end{quote}

Kamini Gounder's article \cite{Gounder:2015aa} has a pipeline that includes getting epitopes.
TODO: READ Gounder PIPELINE

Kamini Gounder's article \cite{Gounder:2015aa} has acute (maybe also longitudinal?) data with HLA types for 22 subjects:
\begin{quote}
  Full-length gag nucleotide sequences obtained in this study have been submitted to the GenBank database under accession numbers: KM192366—KM193012.
  \end{quote}
http://www.ncbi.nlm.nih.gov/nuccore?term=KM192366%3AKM193012%5Baccn%5D

\subsubsection{LANL HIV Acute Infection Studies}
From http://www.hiv.lanl.gov/content/sequence/HIV/SI_alignments/datasets.html, we found these:

\textit{Special Interest Data: Transmitted Virus Envelopes Prior to Seroconversion}\\
Set of 567 subtype B env sequences from 37 subjects prior to seroconversion.\\

Goeffrey S. Gottlieb's article \cite{Gottlieb:2008aa} (Jim Mullins, senior author) on the MACS cohort.

Goeffrey S. Gottlieb's article has data curated at LANL, and they are also on genbank:
\begin{quote}
  Sequences were deposited in GenBank and were assigned accession numbers AF138652-AF138657 and EU184091-EU184657. 
\end{quote}


\textit{Sequence Alignments for Keele et al. 2008 Identification and Characterization of Transmitted and Early Founder Virus Envelopes in Primary HIV-1 Infection}
Set of 3449 subtype B env sequences from 102 subjects during early infection. A CHAVI study.

The Supplementary Information for Brandon Keele's article \cite{Keele:2008aa} outlines a pipeline for founder identification:
\begin{quote}
The intersequence HDs are not independent, but, because of the star phylogeny, they are the pairwise sums of a set of independent Poisson distributed variates. The form of their distribution, including the (singular) covariance matrix, is, therefore, known up to one unknown parameter, the $\lambda$ of the underlying Poisson distribution. We estimated this parameter by fitting the observed data to the expected form by using a maximum likelihood method and assessed the goodness of fit by using a $\Chi^2$ goodness-of-fit test statistic calculated from a singular value decomposition of the covariance matrix. When the data were consistent with a Poisson distribution, we used the $\lambda$ of the best fitting distribution to estimate a divergence time from the most recent common ancestor (MRCA) based on the estimated number of generations required to achieve the observed distribution. One can, in fact, show the following relationship for $\lambda$:
\begin{equation}
\lambda( t ) = \epsilon N_B \left( \frac{5}{8} t \frac{1+\phi}{\phi} + \frac{1-\phi}{\phi^2} \right)
\end{equation}

Therefore, once we obtain a best fitting Poisson distribution, we calculate its mean $\lambda^*$ and use the above time–dependency relationship to estimate time since MRCA (in days) as follows:
\begin{equation}
t = \frac{ 8 \phi }{ 5( 1 + \phi ) } \left( \frac{ \lambda^* }{ \epsilon N_B } - \frac{ 1 - \phi }{ \phi^2 } \right) \hbox{,}
\end{equation}
where NB is the sequence length in base pairs and

\begin{equation}
\phi = \sqrt{ 1 + \frac{ 8 }{ R_0 } } \hbox{.}
\end{equation}

Furthermore, the fraction of identical sequences expected at that time is:
\begin{equation}
\exp \bigL[ - \epsilon N_B \left( \frac{5}{8} t \frac{1+\phi}{\phi} + \frac{1-\phi}{\phi^2} \right) + O( \epsilon^2 N_b ) \bigR] \hbox{.}
\end{equation}

The change in the Poisson distribution over time illustrates the increasing diversity expected under the model (Fig. S8). It is apparent that as time increases, the number of identical sequences decreases, and the frequency distribution of the intersequence HDs at various times after infection shifts to higher HD values.
  
Bayesian Analysis. The time, in days, to the MRCA for each patient was also estimated by using a Bayesian Markov Chain Monte Carlo (MCMC) approach, implemented in BEAST v1.4.1 (14, 15). The mean substitution rate was fixed at $2.16 × 10^{-5}$ substitutions per site per generation, and all analyses were carried out by using the general time reversible (GTR) substitution model with invariant sites and gamma-distributed rate heterogeneity (four gamma categories). The substitution and rate heterogeneity models were unlinked across codon positions, and we assumed exponential population growth and a relaxed (uncorrelated exponential) molecular clock. This model was used for analysis of the viral sequence alignment of each patient, and the MCMC algorithm was run for at least 107 (14) generations (logging every 1,000 generations; burn in was set to 10\% of the original chain length), with additional runs carried out if the effective sample size for the estimate was $<$ 100. The results were visualized in TRACER (16). We repeated this analysis with the five free parameters of the GTR model fixed at values estimated by using the combined data from all acute patients inferred to be infected with a single viral strain by using the HyPhy package (17) and with alternative demographic and evolutionary models (relaxed uncorrelated molecular clock with logistic population growth and strict molecular clock with exponential population growth). Estimates and confidence intervals for the MRCA times were similar for the alternative relaxed clock models but $\approx$ 25\% lower according to a strict molecular clock (data not shown).  
  
Hypermutated Samples. Enrichment for APOBEC3G/F mutations violates the assumption of constant mutation rate across positions, because the editing performed by these enzymes are baseand context-sensitive. Enrichment for mutations with APOBEC3G/F signatures was assessed by using Hypermut 2.0 (www.hiv.lanl.gov), which compares each sequence in the sample to the consensus sequences. Hypermut detects an enrichment for G $\rightarrow$ A mutations that occur in the context of the APOBEC3G/F signature pattern, where the G is followed by either G or A, then by a base that is not C (in International Union of Pure and Applied Chemistry code, the pattern GRD where the first G changes to A). A contingency table is constructed, which is then used to obtain Fisher exact P values that test whether the extent of hypermutation is more than would be expected by chance. Single sequences that yielded a P value of $\leq$ 0.05 were considered significantly hypermutated and, therefore, were not included in other analyses; there were six such single sequences in the 102-patient dataset. In seven other cases, APOBEC3G/F patterns of substitution were enriched throughout the patient sequence set, but no one sequence was significantly hypermutated. In these cases, rather than testing each sequence separately, we assessed whether or not the entire sample was enriched overall for APOBEC3G/F signature mutations by collapsing all observed mutations into one sequence that represented all within-patient mutations, and then we used the Hypermut 2.0 program to compare this artificially constructed sequence to the consensus. Removing the APOBEC3G/F-mediated mutations in each of the 13 patients that showed enrichment for hypermutation showed that these samples otherwise conformed to our evolutionary model under no selection.

Viral Recombination. Twenty-four subjects of 102 studied were found to have been infected by two or more viruses. Highlighter tracings suggested that 16 of these subjects had HIV-1 env sequences with clear evidence of viral recombination. This interpretation was confirmed by statistical analyses with GARD \cite{KosakovskyPond01102006} or Recco \cite{maydt2006recco} recombination identification tools.

Replicative Fitness and Virus Outgrowth. It is possible that either immediately after transmission or in the first several of rounds of replication, a fitter form of the virus evolves and eventually grows out to become a dominant strain that we then detect. We argue that because of the short duration of time before plasma sampling, the long generation time of the virus, the low substitution rate in env, and the low R0, this scenario is unlikely. Even if advantageous nucleotide (amino acid) replacements were to occur, their frequency based on random mutation throughout the 2.6-kb env gene would be expected to be exceedingly small, because most mutations are neutral or deleterious. Moreover, such viral mutants would be unlikely to replace the prevalent viruses unless the fitness advantage was large. For example, descendants of a virus at 20\% replicative disadvantage compared to another still have more than a 5\% chance of occurring in a sample of size 20, 10 generations (or roughly 20 days) later. Similarly, given the expected rate of one mutation every 20 replications, and an R0 of 6, one expects the first mutation to occur after two generations of exponential growth. For such a form to be the only one, with 95\% confidence, to leave descendants sampled in a sample of size 20 at 20 days later, it needs to have a reproductive advantage of more than 70\%. The likelihood of any single random nucleotide substitution in a gene the size of env to confer such a selective advantage in anything other than a rare individual is remote and cannot plausibly explain the common finding of low diversity env lineages observed in 98 of 102 subjects in the present study.
  
  Comprehensive Summary of Observed Evolution Patterns, Timing Estimates, and Statistics Comparing Results from a Model of Random Evolution in Acute Infection Under No Selection and Those Obtained By Using a Bayesian Method Implemented in BEAST. Dataset S1, Dataset S2, Dataset S3, and Dataset S4 contain demographic and sample information corresponding to each study subject along with summaries of model parameters and statistics, including the estimated time from the MRCA of the sequences found in each sample. The confidence intervals in the table do not take into account uncertainty in the viral mutation rate or the possibility of selection against a proportion of the nonsynonymous mutations. In each section of the table, the estimates of the time to the MRCA obtained by using BEAST were derived from all of the sequence data, whereas the estimates based on the random evolution model were based on all sequences or edited sequence sets (see below). For the low diversity patients that conform to a random evolution model (Dataset S1), the estimated days from the MRCA are generally comparable to Bayesian estimates by using BEAST. In these cases, both models suggest the observed diversity profile could be expected to have emerged within a reasonable estimate of the time since infection given the Fiebig stage of the sample, and, thus, the infection scenario is compatible with evolution from a single transmitted variant. Samples with high levels of APOBEC3G/F-mediated mutations are summarized in Dataset S2. Again, the timing estimates according to the two models are roughly comparable when all substitutions are considered but with BEAST giving higher estimates. Substitutions with APOBEC3G/F signatures were then excluded for a second analysis by using the Poisson/ Star model; when this was done, the observed sequence variation became compatible with this model and the estimated days to the MRCA was reduced. For patients with low diversity that significantly deviated from this model (Dataset S3), we noted in the table whether the best explanation for the deviation was transmission of multiple closely related strains, selection, or early stochastic events. Finally, for the 21 subjects with high diversity and multiple infecting strains (Dataset S4), the timing estimates from a random evolution model are based on times to the MRCA restricted to sequences from the dominant clade in the subject, because this provided a strategy to test whether each such clade plausibly represented the outgrowth of a single transmitted virus, which they did. Attempts to apply this model to the full sample dataset in these cases would clearly seriously violate the model assumptions, so we did not attempt this analysis on the high diversity samples. We did, however, model the minimum time required to achieve the maximum HD between sequences in the sample, as discussed in the main text, and this analysis is also summarized for each high diversity patient in Dataset S4; in all 21 cases, this analysis indicates that the MRCA existed in the donor before transmission and multiple viruses were transmitted. The BEAST estimates were derived from all of the data and indicated that the MRCA long preceded the date of infection, as expected for infection with multiple strains. This provided further corroborative support for the conclusion that multiple strains were transmitted based on the maximum HD described above. The finding that the Bayesian estimates of the time to the MRCA were consistently higher than the simulations based on the random evolution model can be interpreted as follows. The rate of early diversification of the virus depends on the time profile of production of virus from an infected cell, and not only on the average generation time and the number of cells newly infected in each generation. In this work, we have not investigated the parameters of the virus production profiles that govern viral diversification but merely set a weighted arithmetic average generation time as 2 days (13), both in the Bayesian model and in the simulations presented. The two, however, assume different profiles, and we expect this to lead to differences in the calculated rate of viral diversification. As an additional cautionary note, estimates based on both models neglect the effect of recombination although this is expected to have minimal effect, particularly for estimates based on highly homogeneous sequences exhibiting star-like phylogenies.

Power Studies. To better understand our likelihood of missing infrequent transmitted variants, we did a power study to explore the probability of sampling limitations. We show that with a sample of at least n $=$ 20 plasma vRNA sequences (which was the case for 77 of 102 subjects), we could be 95\% confident that a given missed variant comprised $<$ 15\% of the virus population (Fig. S9). For 36 samples, for which n $\geg$ 30, we could be 95\% confident not to have missed any variant that comprised at least 10\% of the total viral population. For 25 samples in which the number of sequences available was between 10 and 20, we had at least an 80\% chance to detect a variant represented in $\geq$ 15\% of the population.

\end{quote}

Brandon Keele's article \cite{Keele:2008aa} data (first timepoint only):
\begin{quote}
  Plasma HIV-1 RNA sequences were derived by SGA-direct amplicon sequencing (17) and can be accessed at GenBank (EU574937–EU579293) with edited env alignments at http://www.hiv.lanl.gov/content/sequence/HIV/USER_ALIGNMENTS/keele.html. A model of random virus diversification and power calculations for estimating the likelihood of detecting infrequent transmitted env variants are presented in SI Text (see also Figs. S8 and S9). Env genes were molecularly cloned and analyzed by Env pseudotype cell entry assay (24).
 \end{quote}
 
 Brandon Keele's article also has longitudinal data (10 ppts):
 \begin{quote}
 We also examined early virus diversification directly by studying 10 subjects sampled longitudinally (Dataset S5). This analysis included a total of 1,045 complete env sequences (median of 89 per subject; range, 57–203). Seven subjects had evidence of infection by one virus and three subjects by more than one virus. The model assumes that before the onset of immune selection, virus evolves randomly with the proportion of sequences identical to the transmitted virus(es) declining as depicted by the blue symbols in Fig. 1A. For subject 1058, whose plasma was sampled three times between Fiebig stages I and IV when viral loads were increasing from 2,737 to 26,162 to 550,000 RNA copies/ml, the proportion of identical viral sequences declined from 89\% to 59\% to 45\% (Fig. 4), consistent with model projections. In all ten subjects, we found the proportion of identical env sequences to decline in a manner that closely approximated the model until Fiebig stages V-VI, when an abrupt decline in the proportion of identical sequences coincided with selection for CTL escape mutants. This is illustrated for subject WEAU0575 (Fig. S6), where 29 of 30 env sequences mutated in a single confirmed HLA-restricted CTL epitope over a period of 29 days as the patient progressed from Fiebig stage II to V. Importantly, in none of these 10 individuals did we find evidence of a transmitted virus lineage that was lost during the acute infection period before peak viremia, nor did we find evidence, aside from CTL escape variants, of a predominant viral lineage that appeared de novo.
 \end{quote}
 TODO: Add these data to BakeOff

\textit{Sequence Alignment from Li et al. 2010\\ High Multiplicity Infection by HIV-1 in Men Who Have Sex with Men} 
Hui Li and Katie Bar's article \cite{Li:2010aa} on MSM multiple variant transmissions has data on LANL.
\begin{quote}
HIV-1 Env Diversity Analysis
A total of 1307 full-length env genes encoding gp160 were
sequenced from plasma vRNA (median of 40 sequences per
subject; range 23–89). In a composite neighbor-joining (NJ)
phylogenetic tree (Fig. 1), viral sequences formed distinct
patient-specific monophyletic lineages, each with high statistical
support. Sequences from known sexual partners, including two
acute-to-acute (AD77 to AD75 and AD83 to 04013240) and two
chronic-to-acute (LACU9000 to HOBR0961 and AD18 to AD17)
transmission pairs, also clustered significantly together (Fig. 1). All
sequences were HIV-1 subtype B. Among the 28 acutely infected
subjects, maximum within-patient env diversities ranged from
0.12\% to 6.82\% (Table 2). Sequences from 22 of these individuals
had distinctly lower env diversities (<0.75\%) compared with env
diversities from six others (>1.25\%). The latter diversity is
inconsistent with single virus transmission within the time frame
of acute and early infection (Fiebig stage I–V) [8,9,10,34], while
env diversity <0.75\% is consistent either with single variant
transmission or with transmission of two or more closely related
viruses. Phylogenetic and Highlighter analyses of env sequences
distinguished between these possibilities for each subject (Fig. 2;
see also www.hiv.lanl.gov/content/sequence/HIV/USER_ALIGNMENTS/Li).
Fig. 2A shows sequences from a subject
(04013440) who was infected by a single virus, Fig. 2B a subject
(04013211) infected by two viruses differing by only 4 nucleotides
out of 2619 (0.15\%), Fig. 2C a subject (04013383) infected by two
viruses differing by 65 of 2547 nucleotides (2.55\%), and Fig. 2D a
subject (04013448) infected by four viruses differing by as many as
47 of 2655 nucleotides (1.79\%) with additional sequences showing
recombination between the transmitted/founder lineages. Altogether,
we determined that 10 of 28 subjects (36\%) had been
productively infected by more than one virus (Table 2).

Model Analysis of HIV-1 Diversification
We next analyzed the env sequences using a mathematical model of
random virus evolution that we previously described [10,13,34].
Sequences resulting from multivariant transmission, APOBEC
hypermutation, early stochastic mutation, selection by cytotoxic Tcells,
or recombination violate model predictions (Table 2) [10,34].
Once these confounders were accounted for, lineage-specific env
sequences from each subject conformed to model predictions and
coalesced to most recent common ancestor sequences at or near the
time of virus transmission estimated from clinical histories and
laboratory staging. These results thus corroborated a large body of
evidence supporting the SGA-direct sequencing strategy for
identifying transmitted/founder viruses [8,9,10,11,12,13,35].
...

Sequence Alignments
All the sequence alignments were initially made with ClustalW and
then hand-checked using MacClade 4.08 to improve the alignments
according to the codon translation. Consensus sequences were
generated for each individual. The full sequence alignment is
available as a supplemental data file (www.hiv.lanl.gov/content/
sequence/HIV/USER_ALIGNMENTS/Li) and sequences are
deposited in GenBank (accession numbers: GU330247–GU331770).
Sequence Diversity Analysis
Complete env sequences (n = 1307) were derived from 30
individuals, and 59 (U5, gag and pol) and 39 (vif, vpu, tat, rev, env, nef,
U3, and R) half genome sequences (n = 188) were derived from PBMC
and plasma at two different time points from subject AD17. We
analyzed sequences for maximum sequence diversity and then
visually inspected each set of sequences using neighbor-joining (NJ)
and Highlighter tools (www.hiv.lanl.gov). Phylogenetic trees were
generated by the neighbor-joining method using ClustalW or PAUP.

Hypermutated Samples
Enrichment for APOBEC3G/F mutations violates the assumption
of constant mutation rate across positions as the editing
performed by these enzymes are base and context sensitive.
Enrichment for mutations with APOBEC3G/F signatures was
assessed using Hypermut 2.0 (www.hiv.lanl.gov). Sequences that
yielded a p-value of 0.05 or lower were considered significantly
hypermutated and excluded from subsequent analyses.

Recombination Analyses
Recombination was evaluated using GARD [58] and Recco
[37] and by visual inspection of Highlighter plots. The minimum
number of recombination events required to explain sequence
datasets was estimated using the four-gamete method of Hudson
and Kaplan [38] as implemented in DNASP v5.00.07 [59].
Recombinant sequences reported in Table 2 were identified by
Highlighter analysis and confirmed by Hudson-Kaplan, GARD and
Recco analyses.

Mathematical Model
The model employed in the present study has been described
[10,12,34] as have measured parameters of early virus expansion
[52,53,54]. Under this model, with no selection pressure and fast
expansion, one can expect small samples from homogeneous virus
populations to have evolved from a founder strain in a star-like
phylogeny with all sequences coalescing at the founder [10,13].
Occasional deviations from a star phylogeny are, however,
expected. The sampling of 10 sequences, for example, from a
later generation of an exponentially growing population with sixfold
growth per generation (R0= 6) has about 3\% chance of
including at least one pair sharing the first four generations, a 19\%
chance of including sequences that share three, and a 75\% chance
of sharing two. Using a point mutation rate of about 1 per 5
generations for the full-length 9 kb HIV-1 genome [10,34], there
is about 75\% chance of finding among ten sequences two that
share one mutation, about 20\% chance of finding two sequences
that share a pair of mutations, and <2\% chance of sharing more
than that. These probabilities are slightly enhanced by early
stochastic events that can lead to the virus producing less than six
descendants in some generations but are diminished by the
chances that mutations cause a fitness disadvantage that results in
early purifying selection, as previously observed [10,60]. We found
examples of such early stochastic mutations leading to deviations
from star phylogeny in several subjects (Table 2).
Statistical Analyses and Power Calculations
We calculated the statistical significance of differences in rates of
single versus multivariant HIV-1 transmission using Fisher’s exact
test. Differences were considered statistically significant at a value
of p#0.05. To estimate the likelihood of missing infrequently
represented transmitted variants, we described previously a power
study that estimated the probability of sampling low frequency
plasma viral sequences [10]. In a sample set of at least n = 20,
there is a 95\% probability that a given missed variant comprises
less than 14\% of the virus population. For a sample set of 30, there
is a 95\% probability not to miss a variant that comprises at least
10\% of the total viral population. And for a sample set of 80, there
is a 95\% probability not to miss a variant that comprises at least
4\% of the total viral population.

We also considered the possibility that the number of
transmitted/founder viruses detected could be influenced by the
clinical stage (Fiebig stage) of the subjects at the time of virus
sampling, because differences in virus replication rates could lead
to increasing differences in virus frequencies with time. If this were
the case and some viruses were outcompeted, the prediction would
be that at later Fiebig stages the numbers of transmitted/founder
virus lineages would be less than at earlier Fiebig stages. Our
model (which is based on previously estimated parameters of an
HIV-1 generation time of 2 days, a reproductive ratio [R0] of 6,
and a reverse transcriptase error rate of 2.1661025 and assumes
that the initial virus replicates exponentially infecting R0 new cells
at each generation and diversifies under a model of evolution that
assumes no selection) predicts that descendants of a transmitted
virus at 45\% replicative disadvantage compared to another
transmitted virus, still have more than a 5\% chance of occurring
in a sample size of 20, ten generations ($\sim$ 20 days) later. In humans,
the eclipse period, defined as the time between HIV-1 transmission
and first detection of virus in the plasma, has been estimated
to be approximately 10–14 days, and the eclipse period plus Fiebig
stages I and II, approximately 22–26 days [10,33]. In the present
study, the numbers of subjects in Fiebig stages I/II, III, IV and V
were 14, 2, 6 and 6, respectively, and this relative distribution was
similar in the three other studies included in the combined analysis
[8,9,10]. As described in the main text, there was no significant
correlation between clinical stage and multiplicity of infection
(Fisher’s exact p = 0.53).
\end{quote}

\textit{Special Interest Data Set 5\\Acute infection sequences: subtype C env}
Set of 1505 full-length env sequences from 69 patients in early stages of infection with HIV-1 subtype C. For each patient, there are approximately 22 sequences plus a consensus sequence.

M.-R. Abrahams's article \cite{Abrahams:2009aa} studies CHAVI 001 and CAPRISA 002 participants, and outlines a pipeline for founder identification.
\begin{quote}
Differences in sequences were visualized by using neighbor-joining trees (MEGA 3.1) (43) and Highlighter nucleotide transition and transversion plots (www.hiv.lanl.gov). Pairwise DNA distances were computed by using MEGA 3.1.
Conformance of intraparticipant sequence diversity to a mathematical model of random evolution was evaluated as described by Keele et al. (16) and Lee et al. (H. Y. Lee, E. E. Giorgi, B. F. Keele, B. Gaschen, G. S. Athreya, J. F. Salazar-Gonzalez, K. T. Pham, P. A. Geopfert, J. M. Kilby, M. S. Saag, E. L. Delwart, M. P. Busch, B. H. Hahn, G. M. Shaw, B. T. Korber, T. Bhattacharya, and A. S. Perelson, submitted for publication) whereby exponential viral replication from a single lineage is assumed to fix mutations at a constant rate, in the absence of positive selection, using the following parameters: an HIV-1 generation time of 2 days (24), a reproductive ratio of 6 (i.e., assuming that the virus replicates exponentially, for each currently infected cell six new cells will be infected in the next generation) (42), and a replication error rate of 2.16 􏰃 10􏰁5 substitutions per site per replication cycle (23).
Time of divergence from the most recent common ancestor (MRCA) was estimated by using BEAST (i.e., Bayesian Evolutionary Analysis Sampling Trees, v1.4.7) (5, 6) with a relaxed (uncorrelated exponential) molecular clock and general time-reversible substitution model, with relative substitution rate parameters estimated by using HyPhy (31), as described previously (16). We used a gamma distribution with four categories and a proportion of invariant sites to model rate heterogeneity across sites. Substitution rates were unlinked across codon positions with a mean fixed at 2.16 􏰃 10􏰁5 substitutions per site per generation (23).
Sequences were analyzed for evidence of APOBEC3G-induced hypermutation by using the Hypermut 2.0 tool (www.hiv.lanl.gov). Sequences with a P value of 􏰐0.1 were considered enriched for mutations consistent with APOBEC3G signatures. In sequence sets showing evidence of enrichment for APOBEC3G-driven G-to-A transitions but with no single significantly hypermutated sequence, hypermutation was tested for after superimposition of all mutations within that sequence set onto a single representative sequence.
We used randomization to test whether there was evidence of clustering of mutations within 10-amino-acid stretches putatively associated with cytotoxic-T-lymphocyte (CTL) immune responses (14). Sites were classified as mutated in a given patient if there was at least one sequence from the patient differing from the patient consensus at that site. For each mutated site, we calculated its nearest-neighbor distance as the distance between the site and the closest mutated site to the left or right of it on the intrapatient sequence alignment. We then compared the number of mutations with a nearest neighbor within 10 amino acids to the number expected by chance by randomizing the locations of the mutated sites 1,000 times. A P value was calculated as the fraction of randomized datasets for which the proportion of mutated sites within 10 amino acids of another mutated site was equal to or greater than the observed proportion. The null hypothesis here is random distribution of mutated sites, which is consistent with the model of neutral drift of the infecting virus in acute infection, proposed by Keele et al. (16). Rejection of the null hypothesis suggests clustering of the mutations on a scale consistent with escape from CTL responses.
Multivariant transmission was considered if (i) within-patient env diversity was heterogeneous, with multimodal distribution of pairwise Hamming distances (HDs; that is, the number of differing sites between sequences) and structure within the phylogenetic trees, and (ii) if these deviations from the model of random evolution from a single founder virus could not be accounted for by APOBEC3G mutations, immune pressure, or stochastic events. For individuals infected with more than one variant, our expectation was that when multiple variants were transmitted, the sequences in the recipient should coalesce at a time predating the estimated time of infection. The number of infecting variants was enumerated, after accounting for recombination, by identification of distinct
lineages on phylogenetic trees, together with examination of Highlighter transition and transversion plots (www.hiv.lanl.gov).
N-linked glycosylation sites were identified by using the N-glycosite program (www.hiv.lanl.gov). HXB2 env protein amino acid locations for putative epitopes were obtained by using the HIV sequence locator tool (www.hiv.lanl.gov). Motif Scan (www.hiv.lanl.gov) was used to detect HLA anchor residue motifs within putative epitopes and to search for matching potential epitopes.

Recombination analysis. The GARD (www.datamonkey.org/GARD/) (17, 18), RAP Beta version (www.hiv.lanl.gov), and RDP version 3.27 (http://darwin .uvigo.es/rdp/rdp.html) (25) tools were used to detect recombination in intraparticipant sequence sets.

Modeling the probability of multivariant infection. We have assumed that established infection with a single genotypic variant signifies that a single virus particle was involved in the transmission event and that, in the setting of low probability of transmission, this represents the minimal infectious dose. The Poisson distribution was used to model the frequency of transmission of one, two, or more variants under the assumption that the transmission of each variant occurs with the same probability, i.e., as independent events. A left truncated Poisson model was used (since the zero events, i.e., no transmission, are not observed), and the model fitted to all of the data with a maximum-likelihood method; this then allowed the frequency of zero events to be estimated given the distribution of one, two, etc., variants being detected. A corresponding transmission probability was estimated as the sum of all probabilities for observing one or more variants.
\end{quote}
RAP Beta is at http://www.hiv.lanl.gov/content/sequence/RAP/RAP.html
RAP Beta version
Purpose: This tool identifies recombinants in a aligned set of nucleotide sequences.

M.-R. Abrahams's article \cite{Abrahams:2009aa} data:
\begin{quote}
Plasma was obtained from 69 individuals experiencing early HIV-1 subtype C infection. Twenty-six were identified through prospective monthly monitoring of HIV-negative women using rapid HIV tests and RNA PCR \cite{Loggerenberg:2008aa} (CAPRISA 002 cohort, Durban, South Africa); 22 were identified (i) with negative rapid HIV or enzyme immunoassay assay (EIA) antibody test results but positive HIV RNA and/or p24 antigen test results or (ii) who were antibody positive but antibody negative or Western blot indeterminate within the previous 45 days (CHAVI 001 cohorts, South Africa and Kamazu Central Hospital Sexually Transmitted Disease Clinic, Malawi). An additional 21 individuals enrolled from the Sexually Transmitted Disease Clinic in Malawi were identified as HIV PCR positive with a rapid HIV or EIA antibody-negative test or who had indeterminate HIV Western blot results (30).
...
The GenBank accession numbers for the 1,505 env sequences are FJ443128 to FJ444632. % corrected from The GenBank accession numbers for the 1,505 env sequences are FJ443128 to FJ444362.  CORRECTION: GenBank accession numbers, line 2: “FJ444362” should read “FJ444632.”
\end{quote}
% http://www.ncbi.nlm.nih.gov/nuccore?term=FJ443128%3AFJ444632%5Baccn%5D

\textit{Sequence Alignments for Salazar et al.\\Deciphering human immunodeficiency virus type 1 transmission and early envelope diversification}
Set of 129 full-length genome sequences from 12 patients in early infection, plus longitudinal sequences from 3 patients.

Jesus S. Salazar-Gonzalez's article \cite{Salazar-Gonzalez:2009aa} applies the Keele methodology:
The model used in the present study has been fully described (11). Under this model, with no selection pressure and fast expansion, one can expect small samples from homogeneous virus populations to have evolved from a founder strain in a star-like phylogeny, with all sequences coalescing at the founder (11, 21). Occasional deviations from a star phylogeny are, however, expected. The sampling of 10 sequences, for example, from a later generation of an exponentially growing population with sixfold growth per generation (R0 = 6) has ∼4\% chance of including a pair sharing the first four generations, 19\% chance of including sequences that share three, and 75\% chance of sharing two. With a predicted rate of mutations of approximately one per five generations for the full-length 9-kb HIV-1 genome, there is ∼75\% chance of finding among 10 sequences 2 that share one mutation, ∼20\% chance of finding 2 sequences that share a pair of mutations, and <2\% chance of sharing more than that. These probabilities are slightly enhanced by early stochastic events that can lead to the virus producing less than six descendants in some generations but are diminished by the chances that mutations cause a fitness disadvantage that results in early purifying selection, as previously observed (11, 24).

Jesus S. Salazar-Gonzalez's article \cite{Salazar-Gonzalez:2009aa} has both cross-sectional and longitudinal data in LANL and in genbank:
\begin{quote}
 The full sequence alignment is available as a supplemental data file (www.hiv.lanl.gov/content/sequence/hiv/user_alignments/Salazar), and sequences are available from GenBank/EMBL/DDBJ under accession nos. FJ495804–FJ496214, EU578998–EU579004, EU579006–EU579015, EU579018–EU579019, EU576471–EU576521, and FJ919955–FJ919967.
\end{quote}
TODO: Use the longitudinal data for bake off (SEE BELOW THE DATA ARE ALSO IN ctlfit \cite{Kessinger:2013aa;Ganusov:2011aa}

Code including some simulations about ctl escapes are in ctlfit \cite{Kessinger:2013aa;Ganusov:2011aa}

\textit{Sequence Alignment and Data from Bar et al. 2010 \\Wide Variation in the Multiplicity of HIV-1 Infection Among Injection Drug Users}
Katie Bar's article \cite{Bar:2010aa} has data at LANL.
\begin{quote}
Fasta file contains 393 sequences covering the 3' half of the HIV-1 genome. Sequences correspond to accession numbers GU562001-GU562291 and GU938194-GU938295.
\end{quote}

\textit{Sequence alignments and annotations from Baalwa et al. 2013}
\begin{quote}
Sequences were aligned using ClustalW version 2.11 (Larkin et al., 2007). Phylogenetic trees were constructed by maximum-likelihood estimation using PhyML version 3.0 (Guindon et al., 2010) or by the neighbor joining method using ClustalW (Larkin et al., 2007). Maximum within patient sequence diversity was determined using Phylogenetic Analysis Using Parsimony (PAUP) version 4.0 (Rogers and Swofford, 1999). Subtype assignments were based on the REGA (www.dbpartners.stanford.edu/RegaSubtyping/) and RIP (www.hiv.lanl.gov) tools and maximum likelihood phylogenetic tree analyses. Recombination breakpoint locations were identified by SimPlot (Lole et al., 1999). Highlighter plots (www.hiv.lanl.gov) were used to display nucleotide substitutions in the viral quasispecies as compared to the T/F sequence. T/F sequences were inferred based on a model of early random virus evolution as previously described (Keele et al., 2008; Lee et al., 2009; Ochsenbauer et al., 2012; Salazar-Gonzalez et al., 2009).
\end{quote}

\textbf{Joshua Baalwa}'s article \cite{Baalwa:2013aa} has data at LANL:
\begin{quote}
JX202785-JX202944.fasta   10 patients, 5' half genomes
JX202945-JX203145.fasta   10 patients, 3' half genomes
JX203146-JX203167.fasta   2 patients, 5' half genomes
JX203168-JX203216.fasta   3 patients, 3' half genomes
JX236668-JX236679.fasta   12 patients, full genomes
Tab-delimited text file containing annotation data, such as sampling date, subtype, genome coordinates, patient code, risk factor, Fiebig stage, etc.
\end{quote}

\subsection{LANL HIV Longitudinal Studies}

Katie Bar's article \cite{Bar:2012aa} has three subjects in fibig II caught longitudinally!!
\begin{quote}
  
\end{quote}
TODO: USE FOR BAKEOFF

synmut: Zanini and Neher's article \cite{Zanini:2013aa} collected all the longitudinal LANL special sets data including Shankarappa and determined a filter to exclude ``complex populations''. Get on git:
git clone git://git.tuebingen.mpg.de/synmut.git
cd synmut/
git checkout remotes/origin/publication
TODO: Model my stuff after this reproducible research in python.  Nicely done.

\cite{Neher:2010aa} looks at recombination in Shankarappa (MACS cohort) data.

Shankarappa data: 11 from MACS cohort \cite{Shankarappa:1999aa}
data is in synmut: synmut/data/longitudinal/Shankarappa/


\subsection{Recombination detection}
See http://www.bioinf.manchester.ac.uk/recombination/programs.shtml

% http://web.cbio.uct.ac.za/~darren/rdp.html
RDP4 \cite[martin2015rdp4] is a program that runs multiple recombination-detection algorithms.

Ireen E. Kiwelu's article \cite{Kiwelu:2012aa} describes recombination detection across subtypes using RIP and REGA, and SimPlot.
\begin{quote}
All DNA sequences (n = 1979) generated in this study were screened for evidence of inter-subtype recombination by the Recombination Identification Program (RIP 3.0) [60] and the REGA HIV-1 Subtyping Tool-Version 2.0 [61]. The identified recombinant viruses were further analyzed for subtype assignment and breakpoints identification by SimPlot software v3.5.1 [62]. The bootscanning method in SimPlot was parameterized with sliding window size of 350 bp, step increment 3 bp, and Kimura 2-parameter distance model for inferring phylogenetic trees by the NJ method. HIV-1 reference subtype sequences retrieved from the Los Alamos HIV Sequence Database were used. Identified breakpoints were visually inspected in BioEdit. To confirm the HIV-1 subtypes in the inter-subtype recombinant viruses, nucleotide sequences on both sides of the breakpoint were analyzed independently by re-constructing phylogenetic trees using the splits at the putatively identified breakpoints.
\end{quote}
REGA (http://dbpartners.stanford.edu:8080/RegaSubtyping/stanford-hiv/typingtool/) seems to be more about subtyping than within-host recombination.

The Immonen article \cite{Immonen:2014aa} uses the phi test defined in \cite{Bruen:2006aa} to do recombination detection.

\subsection{The HIV \textit{in vivo} mutation rate $\epsilon$, reproductive rate R0, and generation time.}

From \cite[Giorgi:2010aa]:
\begin{quote}
A Poisson distribution is fitted to the observed pairwise HD distribution using a maximum likelihood method (see \cite{Lee:2009aa} for details). A $\Chi^2$ goodness of fit (GOF) is performed to test whether the HD distribution significantly diverges from a Poisson (small P-values indicate a bad fit). The test takes into account the non-independence of pairwise HD distances by comparing the observed frequencies to the expected ones if the sample were to follow a star phylogeny. Prior to the onset of positive selection, the population is assumed to undergo a rapid expansion during which the basic reproductive number $R0 > 1$. Therefore, when the sample yields a GOF P-value above 0.05 (indicating a non-significant divergence from a Poisson distribution), we can estimate the time since the MRCA using the parameters characterizing intrahost HIV evolution. Following Stafford et al. \cite[stafford2000modeling], we set $R0 = 6$ for acute HIV-1 infection samples. We assume a constant mutation rate across lineages, which we fix at an average value of $\epsilon = 2.16 × 10^{-5}$ per site and per replication cycle. This value has been adjusted from what was originally derived for HIV-1 by Mansky et al. \cite{Mansky:1995aa} by considering only base substitutions and ignoring insertions and deletions. However, it has not been corrected for possible extra purifying selection in vivo. Using these parameters, we use the mean of the fitted Poisson distribution to infer the time since the MRCA \cite{Lee:2009aa}.
\end{quote}

There is code R for Poisson Fitter! From http://www.hiv.lanl.gov/tmp/POISSON_FITTER/pp429jC8Td/PFitter.R
From the supplemental information for \cite{Keele:2008aa}:
\begin{quote}
We assume a homogeneous infection in
which the virus grows exponentially with no selection pressure,
no recombination, no occurrence of back mutations and a
constant mutation rate across positions and across lineages.
Under this scenario, the HD frequency distribution is given by
a Poisson distribution whose mean depends linearly on the
number of generations since the founder strain. We used previously
estimated parameters of HIV-1 generation time (2 days)
re \cite{markowitz2003novel}, reproductive ratio (R0, 6) \cite{stafford2000modeling}, and reverse transcriptase
point mutation rate ($\epsilon = 2.16 × 10^{-5}$) \cite{Mansky:1995aa} and assumed that the
initial virus replicated exponentially by infecting exactly R0 new
cells at each generation, which, for simplicity, we assumed
happened in two equal bursts at $\tau$ and $2\tau$. The reverse transcriptase
error rate estimate \cite{Mansky:1995aa}  is based on sequencing virus
produced in vitro after a single round of replication. If a mutation
occurs that is lethal with regard to viral production, it would not
be detected in this assay, and such mutations may be similarly
reduced in the natural, in vivo situation. On the other hand, lethal
mutations that were not infectious would be retained in the
single round of replication assay but may be selected against in
vivo; hence, the mutation rate we are using in the model will have
a bias toward being greater than the substitution rate we observe
in vivo, potentially resulting in slight underestimates of the time
to the MRCA.\end{quote}

From \cite{Mansky:1995aa}, we get an estimate of the HIV-1 mutation rate of $3.4 \times 10^{−5}$, and from \cite{Abram2010} we get a range estimates from $1.4 to 4.2 \times 10^{−5}$.
\begin{quote}
Although it is widely believed that HIV-1 replication is exceptionally error-prone, we show here that the in vivo forward mutation rate of HIV-1, measured by the inactivation of lacZα, is approximately $1.4 \times 10^{−5}$ mutations/bp/cycle. This rate is approximately 20-fold lower than the in vitro error rate for RT measured using a DNA template (9, 10, 29, 30, 62, 68) and 17-fold lower than the rate measured using an RNA template (29, 30). Our reported mutation rate for HIV-1 in HOS cells is also about 3-fold lower than earlier reported rates in HeLa and CEM cells (42, 46) and is similar to the average rate reported for other retroviruses (1.5 × 10−5 mutations/bp/cycle) (69). ... This suggests that our measured mutation rate may be 2- to 3-fold lower than the actual overall mutation rate. The mutation rate of HIV-1 was previously determined, using different constraints, to be $3.4 \times 10^{−5}$ mutations/bp/cycle (46). When the rate was recalculated using the 174-nt target, the mutation rate was $2.4 \times 10^{−5}$ mutations/bp/cycle, which is similar to our result (for details, see Text S3 and Table S3 in the supplemental material).
\end{quote}

\section{Estimates of multiplicity}
\cite{Gounder:2015aa, Sterrett:2014aa, Li:2010aa, Gottlieb:2008aa, Abrahams:2009aa, Keele:2008aa} 

\section{Ancestral Sequence Reconstruction}
There is something called ASR at datamonkey. see http://www.datamonkey.org/help/asr.php

\section*{Acknowledgements}
Research reported in this document was partially supported by the Bill and Melinda Gates Foundation Award Number OPP1110049.

\hspace{.2in}
\bibliographystyle{biorefs}
\bibliography{HIVFoundersAndTiming}
\newpage


<<echo = FALSE>>=
# (un)Setup for prettier Sweave output.
options( continue = old.continue.option$continue )
@ 

\end{document}