-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.Rmd
87 lines (76 loc) · 5.29 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
---
title: "dialectR: Doing Dialectometry in R"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
dialectR is an R package for doing dialectometry, the quantitative study of dialects. The analyses offered in this package rely upon variants of edit distance to compute the aggregate distance between phonetic transcriptions of dialects, which has been shown in prior studies to correspond with perceptual data of mutual intelligibility.
The developmental version of dialectR can be downloaded with devtools:
```{r, eval=FALSE}
devtools::install_github("b05102139/dialectR")
```
As a preliminary example, we will examine data of Dutch dialects, which was transcribed in the Goeman-Taeldeman-van-Reenen-Project. The data is provided in the package, and can be loaded like this:
```{r}
library(dialectR)
data(Dutch)
```
This dataset provides data of 562 concepts over 613 sites of Dutch speaking areas, where the concepts should be columns and the sites should be rows.
In dialectometry, an important step before further analyses can be done is to compute the aggregate edit distance between dialect sites. While an in depth discussion of edit distance is beyond the scope of this introduction, we briefly remark that it is a metric of distance between strings, which is computed by how many insertions, deletions, and substitutions it takes for one string to transform into another. Due to the existence of multiple entries, missing entries, and the requirement to normalize for the difference in length between entries in the data however, dialectR provides a specialized version of edit distance which meets these needs:
```{r}
dialectR::leven("koguma", "kokoimo")
dialectR::leven("koguma", "kokoimo", alignment_normalization = TRUE)
dialectR::leven("koguma/goguma", "kokoimo", alignment_normalization = TRUE, delim = "/")
```
The code above shows respectively the plain edit distance between two strings; the length-normalized distance; and the possibility of accounting for multiple responses in one site, which is a common situation when collecting dialect data.
The interest of such a metric is shown when the difference between sites are aggregated. Assuming the same function arguments as the above, we can also perform an aggregate calculation of site and site distance:
```{r, cache = TRUE}
distDutch <- dialectR::distance_matrix(Dutch, "leven", alignment_normalization = TRUE)
distDutch[1:3,1:3]
```
This can then be projected onto geography. We provide two such analyses in the package: one which depends on multidimensional scaling by mapping three dimensions onto RGB values and mixing them evenly, and one which utilizes the results of hierarchical clustering. The following two plots show that the dialect groupings of these two analyses largely converge. First we present the results of multidimensional scaling:
```{r dutch_mds, fig.height=5.5, fig.width=6, fig.align='center'}
dutch_points <- get_points(system.file("extdata", "DutchKML.kml", package="dialectR"))
dutch_polygons <- get_polygons(system.file("extdata", "DutchKML.kml", package="dialectR"))
mds_map(distDutch, dutch_points, dutch_polygons)
```
And here we present that of hierarchical clustering:
```{r dutch_cluster, fig.height=5.5, fig.width=6, fig.align='center'}
cluster_map(distDutch, cluster_num = 6, method = "ward.D2", kml_points = dutch_points, kml_polygon = dutch_polygons)
```
In addition to such transcription-based methods, we also provide an acoustic-based method which is capable of computing the distance between audio data:
```{r}
i_audio <- system.file("extdata", "i.wav", package="dialectR")
e_audio <- system.file("extdata", "e.wav", package="dialectR")
acoustic_distance(i_audio, e_audio)
```
The validity of this distance can be shown if we apply it to audio recordings of IPA vowels. Using the [recordings](http://www.phonetics.ucla.edu/course/chapter1/vowels.html) provided by Peter Ladefoged, we show how the distance between IPA vowels can be used to reproduce the acoustic vowel space:
```{r}
# we assume all the vowels are downloaded to a single folder
vowel_dist <- sapply(1:12, function(x){
sapply(1:12, function(y){
acoustic_distance(list.files("C:/Users/USER/Downloads/ipa_vowels", full.names = TRUE)[x],
list.files("C:/Users/USER/Downloads/ipa_vowels", full.names = TRUE)[y])
})
})
```
```{r, echo=FALSE}
vowel_names <- c("ɛ", "a", "e", "i", "ɪ", "o", "ɑ", "ɔ", "ø", "u", "y", "ʏ")
row.names(vowel_dist) <- vowel_names
colnames(vowel_dist) <- vowel_names
```
The distance matrix generated from this can be seen below:
```{r, comment=""}
vowel_dist[2:4,2:4]
```
Now we are in a place to apply multidimensional scaling on the data:
```{r, fig.width=6, fig.height=6, fig.align='center', eval=FALSE}
vowel_mds <- cmdscale(vowel_dist, k = 3)
plot(-vowel_mds[,2], vowel_mds[,1], xlab = "", ylab = "")
text(-vowel_mds[,2], vowel_mds[,1], cex = 1.2, labels = vowel_names, pos = 4)
```
```{r, echo=FALSE, fig.width=7, fig.height=7, fig.align='center'}
knitr::include_graphics("./README_files/figure-gfm/acoustic_vowel_plot.png")
```
As can be seen, the distance between the vowels largely correlates with conventional charts of the acoustic vowel space.
dialectR remains in active development. If you would like to use dialectR in your research and have any concerns, ideas, or questions, do feel free to contact us.