Showing 2 changed files with 78 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
# Background | ||
|
||
## Topic | ||
|
||
* Plague - Modern disease, active field of biosurveillance. | ||
* Plague - Historical pandemics with severe but variable mortality and differing geographic distributions. | ||
* Main phylogenetic questions of interest: | ||
1. Topology + branch lengths | ||
2. Evolutionary rate, variation, divergence times | ||
3. Ancestral trait reconstruction (host, geo, virulence) | ||
|
||
## Problem | ||
|
||
[@OptimalRatesPhylogenetic]: | ||
"Faced with a deluge of sequence data, the question of how to select data most appropriate for a given phylogenetic problem has become a major topic of interest (Salichos and Rokas 2013; Pisani et al. 2015; Shen et al. 2017)." | ||
However | ||
"This question is not new to phylogenomics and has been a driving question in the theory of phylogenetic experimental design for over two decades (Graybeal 1993; Xia et al. 2003)." | ||
|
||
[@reddyWhyPhylogenomicData2017]: | ||
"Continued data collection for large-scale phylogenetic studies, however, has not resulted in a consistent resolution of the deep branches of the bird tree. Specifically, the Jarvis et al. (2014) “total evidence nucleotide tree” (TENT; Fig. 1a), based on 42 Mbp of data extracted from 48 complete avian genomes, and the Prum et al. (2015) (Fig. 1b) tree, based on 0.4 Mbp of data from 259 loci obtained by sequence capture (anchored hybrid enrichment) and sampled for 198 bird species, exhibit a number of conflicts." | ||
|
||
"Both Jarvis et al. (2014) and Prum et al. (2015) report strong support for all of their relationships." | ||
|
||
"The conflicts between the Jarvis TENT and Prum tree are surprising given the size of the data matrices analyzed in each study" | ||
|
||
"adding taxa usually improves phylogenetic accuracy (reviewed by Heath et al. 2008)." | ||
|
||
"Although there are cases where adding taxa reduces support and/or results in decreased phylogenetic accuracy (e.g., Poe and Swofford 1999; Sanderson and Wojciechowski 2000; Braun and Kimball 2002; Meiklejohn et al. 2014)" | ||
|
||
The questions of interest listed above are highly sensitive to taxon sampling bias. | ||
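A rough first look at taxon sampling bias is to tabulate how samples are distributed across regions and collection dates. The sketch below is illustrative only: the `records` list is made-up placeholder metadata, not real plague genome data, and the `composition` helper is a hypothetical name introduced here.

```python
from collections import Counter

# Hypothetical (country, collection year) records; placeholders, not real data.
records = [
    ("China", 1962),
    ("China", 1975),
    ("China", 2005),
    ("Madagascar", 1998),
    ("USA", 1992),
    ("Peru", 2010),
]


def composition(records):
    """Return the fraction of samples per country and the sample count per decade."""
    total = len(records)
    by_country = Counter(country for country, _ in records)
    # Bin each collection year into its decade, e.g. 1998 -> 1990.
    by_decade = Counter((year // 10) * 10 for _, year in records)
    fractions = {country: n / total for country, n in by_country.items()}
    return fractions, dict(by_decade)


fractions, decades = composition(records)
for country, share in sorted(fractions.items(), key=lambda kv: -kv[1]):
    print(f"{country}: {share:.0%}")
```

A heavily skewed share for one region, or empty decades, would flag a dataset whose phylogenetic results deserve a closer look before interpretation.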
|
||
### Topology + Branch Lengths | ||
|
||
* Adding more taxa breaks up long branches, improves ancestral information, thereby generally increasing accuracy. | ||
* However, adding more taxa increases the complexity of fully resolving all branches of a phylogenetic tree topology, demanding resolution of more hypothetical ancestral relationships from the same data. | ||
|
||
### Evolutionary rate, variation, divergence times | ||
|
||
* Addition of taxa can increase the probability of introducing new rate heterogeneities and biases, thereby adding long branches and potential model violations to an erstwhile tractable phylogenetic problem (Reddy et al. 2017). | ||
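For context, a common first approximation of the evolutionary rate under a strict clock is a root-to-tip regression: genetic distance from the root is regressed on sampling date, the slope estimates the rate, and the x-intercept estimates the date of the root. Rate heterogeneity introduced by newly added taxa would show up as scatter around this line. The sketch below assumes made-up sampling dates and distances and a hypothetical helper name; it is not the method used in the cited studies.

```python
def root_to_tip_regression(dates, distances):
    """Ordinary least-squares fit of root-to-tip distance against sampling date.

    Returns (rate, root_date): the slope in substitutions/site/year and the
    date at which the fitted line crosses zero distance.
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    rate = sxy / sxx
    intercept = mean_d - rate * mean_t
    root_date = -intercept / rate
    return rate, root_date


# Hypothetical tips: sampling years and root-to-tip distances (subs/site).
# Values are constructed for illustration and carry no biological meaning.
dates = [1900, 1950, 2000]
distances = [0.010, 0.015, 0.020]

rate, root_date = root_to_tip_regression(dates, distances)
print(f"rate = {rate:.2e} subs/site/year, root date = {root_date:.0f}")
```

With real data the points rarely fall on a line, and a weak correlation is itself a sign that a strict clock is a poor fit.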
|
||
## Previous Work | ||
|
||
## Project Objectives |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
Background | ||
==================== | ||
|
||
Topic | ||
------------ | ||
Plague has an impressively long and expansive history as a human pathogen, dating back to at least the Neolithic `(Rascovan et al., 2019) <https://doi.org/10.1016/j.cell.2018.11.005>`_, with an established presence on every continent except Oceania `("Plague", n.d.) <https://www.who.int/news-room/fact-sheets/detail/plague>`_. Accompanying this prolific global presence is unnervingly high mortality, from the infamous medieval Black Death to 21st-century outbreaks in which case fatality rates range from 22% to 71% (Bertherat, 2019). As a result, plague maintains its status as a disease that is not only of vital importance to public health initiatives but also one that brings together diverse researchers with interests spanning the modern period, history, and prehistory. | ||
|
||
While human outbreaks involving catastrophic fatalities may appear to forge the identity of plague as a disease, it is, first and foremost, a zoonosis of rodents (CITE). The global distribution of plague is therefore highly influenced by the distribution of these rodent reservoir species, forming defined disease zones or plague foci (CITE). While many of these zones are actively monitored and charted (WHO/PED, 2016), plague activity is often difficult to observe and nearly impossible to eradicate (CITE). The challenges inherent to constantly monitoring foci activity reveal one of plague’s most defining epidemiological traits: prolonged dormancy periods lasting multiple decades followed by inexplicable re-emergences (Bertherat et al., 2007; Cabanel et al., 2013). This tumultuous behavior of disease zones, cycling between active and inactive, leads to pronounced anxiety over the perceived invisibility of plague where records are lacking, an apprehension shared by researchers of both modern and ancient plague (CITE). | ||
|
||
As there are indeed numerous points in history where plague records are lacking, these gaps drive a need for a new form of evidence that can promise a window into the past, revealing unseen processes. In this regard, genome sequencing of pathogenic DNA offers great promise, as reconstructing genetic relationships can reveal spatial connections between outbreaks otherwise thought to be unrelated (CITE). Furthermore, as the accumulation of mutations over time can be modelled, genomic data can also be used to estimate the timing of key events in history (CITE). Because genomic data records the life history of the plague pathogen and its ancestral population, and because the disease constantly evades surveillance efforts, the field of genomics provides an alluring way to render visible what was once invisible. | ||
|
||
Problem | ||
------------ | ||
However, while genomic evidence is powerful in its potential, it must also be approached with a critical mindset. Plague genomic data has thus far not been produced by a global sequencing initiative in which isolates are drawn to produce representative and “unbiased” samples. Instead, independent sequencing datasets must be carefully stitched together, while bearing in mind that they were originally generated to serve the priorities and agendas of the local institutions that required them. Failing to reflect on the processes that drive who generates publicly available data has led to the emergence, and eventual realization, of several analytical consequences. | ||
|
||
The most prominent issue discussed in the field is the discovery of strong geographic biases in plague genomic data, such as the over-representation of East Asian regions, specifically the People's Republic of China (CITE). This bias has been cited as the driving force behind findings suggesting an East Asian origin of plague, a conclusion that is now heavily contested (CITE). Conversely, the under-representation of African countries in the current literature poses strict limits on the hypotheses that can be evaluated using genomic data. Considering that 97% of worldwide human cases of plague occur in African countries, this under-representation also reveals a surprising disconnect. However, while these issues may be mentioned in passing in the literature, there is not yet routine use of a critical framework for evaluating the comparative data that is employed. | ||
|
||
Compounding the lack of critical frameworks is the data revolution that has resulted from technological advances in high-throughput sequencing. The previous decade has witnessed databases grow from containing 20 plague genomes to over 1500 (CITE: NCBI?). While this unprecedented growth promises tremendous research potential, it also creates methodological and theoretical obstacles. As researchers grapple with an upcoming transition to Big Data, which favors distant treatment of numerous samples over close engagement with the few, it is more important than ever to engage in dialogue about critical and effective use of publicly available datasets. | ||
|
||
Previous Work | ||
------------- | ||
While there is growing awareness in the field, this form of critical thought remains sparse and limited. Discussions on geographic bias are appearing (Spyrou et al., 2019) and methodological frameworks have emerged to meet the challenge of the data revolution (Zhou et al., 2020). Similar conversations are also appearing within other fields, such as work on the tuberculosis clinical reference set (Borrell et al., 2019), which proposed a framework for reducing bias by expanding diversity. The conversation covers both academic research and commercial product development (CITE), indicating a topical problem that spans numerous fields but still leaves ample room for expansion. | ||
|
||
Amongst the progressive work that is ongoing, several outstanding issues arise. The EnteroBase project (Zhou et al., 2020) has made it easier than ever to organize and access plague data, but leaves the task of figuring out how best to use this voluminous data to the user. Similarly, only a few geographic biases have been identified, and only on a portion of the available data; no one has yet taken a comprehensive, systematic approach to investigating these concerns. Furthermore, a rigorous assessment remains to be done of the analytical consequences of using possibly biased data to reconstruct the timing and spread of plague outbreaks. Finally, once biased results have been identified, there remains a need for strategies to mitigate these consequences. | ||
|
||
Herein lies the opportunity to provide an updated and detailed investigation of the composition of the global ‘history’ of plague, as it has been told from the genomic perspective. Specifically, this project investigates the following four questions: | ||
|
||
Project Objectives | ||
------------------ | ||
1. What is the composition of plague genomic data with regard to geography and collection date? | ||
2. How has this composition changed over the past 10 years? | ||
3. Does a compositional bias (geographic or temporal) have consequences for reconstructing the spread and timing of plague’s re-emergences? | ||
4. If consequences arise, why do they arise, and how may they be rectified? | ||
|
||
Overall, this project seeks to reveal processes that may have unknowingly shaped plague research to date, to provide reasonable strategies for responsible data re-use, and to contribute unique perspectives on the global history of plague through novel synthesis of under-utilized datasets.