-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathx20.finestructure.Rmd
109 lines (76 loc) · 5.01 KB
/
x20.finestructure.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
---
title: "Haplotype-based population structure with fineStructure"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE,fig.retina=2)
library(tidyverse)
```
We used [fineStructure](https://people.maths.bris.ac.uk/~madjl/finestructure/finestructure_info.html) to look for fine-scale patterns of population structure based on our phased genotype data.
For this analysis we used the phased haps format files from shapeit. These were converted into chromopainter format with `impute2chromopainter.pl` per scaffold. To use linked mode, we generated uniform recombination rate map files using `makeuniformrecfile.pl`. These steps are captured in the snakemake script [0.prepare_chromopainter.smk](data/hpc/fineStructure/0.prepare_chromopainter.smk)
In the chromopainter EM parameter estimation step we used 30% of samples (20). Next, fineStructure was run with 2000000 iterations of which half were assigned to burn in iteration and half to sampling iteration. These steps are captured in [1.fs_computation.sh](data/hpc/fineStructure/1.fs_computation.sh)
This process produces a finestructure tree xml file which we parsed using the R code provided as part of the fineStructure program to produce a standard `phylo` object. This was then plotted using ggtree to produce the finestructure tree shown below.
```{r}
source("data/hpc/fineStructure/FinestructureRcode/FinestructureLibrary.R")
some.colors<-MakeColorYRP() # these are yellow-red-purple
some.colorsEnd<-MakeColorYRP(final=c(0.2,0.2,0.2)) # as above, but with a dark grey final for capped values
chunkfile<-"data/hpc/fineStructure/adigitifera_linked.chunkcounts.out" ## chromopainter chunkcounts file
mcmcfile<-"data/hpc/fineStructure/adigitifera_linked_mcmc.xml" ## finestructure mcmc file
treefile<-"data/hpc/fineStructure/adigitifera_linked_tree.xml" ## finestructure tree file
###### READ IN THE CHUNKCOUNT FILE
dataraw<-read.table(chunkfile,row.names=1,header=T,skip=1) # read in the pairwise coincidence
###### READ IN THE MCMC FILES
mcmcxml<-xmlTreeParse(mcmcfile) ## read into xml format
mcmcdata<-as.data.frame.myres(mcmcxml) ## convert this into a data frame
###### READ IN THE TREE FILES
treexml<-xmlTreeParse(treefile) ## read the tree as xml format
ttree<-extractTree(treexml) ## extract the tree into ape's phylo format
```
```{r, fig.width=6,fig.height=8}
library(ggtree)
tree_data <- data.frame(sample_id=ttree$tip.label) %>%
tidyr::extract(sample_id,into="site",regex="([^_]*)",remove = FALSE) %>%
tidyr::extract(sample_id,into="sample_label",regex="([^_]*_[^_]*_[^_]*)_",remove = FALSE) %>%
mutate(loc = case_when(
site=="AI" ~ "IN",
site=="BR" ~ "IN",
site=="AR" ~ "NO",
site=="RS1" ~ "SO",
site=="RS2" ~ "SO",
site=="RS3" ~ "SO",
)) %>%
filter(sample_id!="BR_5_121_S125_L004")
source("scripts/color_scheme.R")
ttree_dt <- ttree %>%
drop.tip("BR_5_121_S125_L004")
tree_plot <- ttree_dt %>%
ggtree() %<+% tree_data +
geom_tippoint(aes(color=loc)) +
geom_tiplab(aes(label=sample_label),align = F,size=3,hjust=-.1) +
theme(legend.position = "bottom", legend.title = element_blank()) +
scale_color_manual(values=myCol)
```
```{r}
library(ggpubr)
require(aplot)
hm_data <- dataraw[ttree_dt$tip.label,ttree_dt$tip.label]
hmf_data <- hm_data %>%
as.data.frame() %>%
rownames_to_column("ID") %>%
mutate(col_order = row_number()) %>%
pivot_longer(-c(ID,col_order),names_to = "col")
hmf_data$ID <- factor(hmf_data$ID, levels=unique(rownames(hm_data)))
hmf_data$ID <- as.numeric(hmf_data$ID)
hmf_data$col <- factor(hmf_data$col, levels=unique(rownames(hm_data)))
hmf_data$col <- as.numeric(hmf_data$col)
hmp <- ggplot(hmf_data) +
geom_tile(aes(y=ID,x=col,fill=value)) +
scale_fill_viridis_c() +
theme_minimal() +
theme(axis.ticks = element_blank(), axis.text = element_blank(), panel.grid = element_blank()) +
xlab("") + ylab("")
insert_left(hmp + theme(legend.title = element_blank()),tree_plot + guides(color="none") + xlim(NA,5), width = 0.3)
ggsave("figures/fig-s3.jpg",width = 12,height = 10)
```
All nodes in this tree had bootstrap support of 0.99 or better. The tree clearly distinguishes our three locations as distinct clades, but also identifies some structure within these clades. Two small clades (RS1_M11_840, RS1-2_422) and (BR-4-088, BR-4-077) were identified which contained samples with relatively low sequencing depth and presumably reflect small biases due to genotyping error that fineStructure is able to identify as structure. Within the inshore samples however there did seem to be a genuine split between the two inshore locations AI (Adele Island) and BR (Beagle Reef).
Although this split was clearly evident from the fineStructure tree it was not evident in PCA analyses (smartpca, PCAngsd), nor was it visible in patterns of IBD sharing. Since these differences between AI and BR were evidently small relative to the differences between major geographically defined populations (IN, NO, SO) we chose to ignore them for our other analyses.