---
title: 'Validation: BCT'
author: "Paschalis Agapitos"
---
```{r, echo=FALSE}
# if not installed, uncomment and run
# install.packages("stylo")
# install.packages("here")
```
This notebook validates the Bootstrap Consensus Tree (BCT) method used in the paper *A Stylometric Analysis of Seneca's Disputed Plays: Authorship Verification of Octavia and Hercules Oetaeus*.
```{r}
library(stylo)
library(here)
```
## Setting the working directory
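Set the working directory to `validation_PCA_BCT/results/BCT/MFCs-4grams/`, so that the resulting plots are written there automatically. The parent directory `validation_PCA_BCT/` holds the data and the code to validate the PCA and BCT methods.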
```{r}
# Set the working directory. The path below assumes the repository lives under
# ~/Documents/projects/seneca_paper/. If the call raises an error, set the directory manually:
# Session > Restart R, then Session > Set Working Directory > Choose Directory... >
# `seneca_stylometry/2_validation/validation_PCA_BCT/results/BCT/MFCs-4grams/`
# This directory is chosen so that the resulting plots are written there automatically.
setwd(file.path("~", "Documents", "projects", "seneca_paper", "seneca_stylometry", "2_validation", "validation_PCA_BCT", "results", "BCT", "MFCs-4grams"))
getwd() # verify the directory
```
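The hard-coded path above only works if the repository sits in the same place on your machine. As a minimal sketch, the hypothetical helper `set_wd_if_exists()` (not part of *Stylo* or of this repository) checks that a directory exists before switching to it, so a wrong path produces a message instead of stopping the notebook with an error.

```{r}
# Hypothetical helper: change the working directory only if the path exists.
set_wd_if_exists <- function(path) {
  if (!dir.exists(path)) {
    message("Directory not found, working directory unchanged: ", path)
    return(FALSE)
  }
  setwd(path)
  TRUE
}

# Same path as in the chunk above; adjust it to your local copy of the repository.
results_dir <- file.path("~", "Documents", "projects", "seneca_paper", "seneca_stylometry",
                         "2_validation", "validation_PCA_BCT", "results", "BCT", "MFCs-4grams")
wd_was_set <- set_wd_if_exists(results_dir)
```

If `wd_was_set` is `FALSE`, fall back to the manual RStudio steps described in the comments above.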
## Bootstrap Consensus Tree - MFC 4-grams
### Load and prepare corpus for the validation of BCT
Load and prepare the correct corpus for the validation of BCT, following the same preprocessing steps.
```{r}
# load the corpus for the validation of BCT
raw_corpus_bct <- load.corpus(files = "all",
                              corpus.dir = file.path("..", "..", "..", "..", "validation_corpora", "validation_corpus_BCT"),
                              encoding = "UTF-8")
# tokenize the corpus (lowercased, Latin with u/v normalization)
tokenized_corpus_bct <- txt.to.words.ext(raw_corpus_bct,
                                         corpus.lang = "Latin.corr",
                                         preserve.case = FALSE)
```
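To make the preprocessing concrete, here is a rough base-R approximation of what tokenizing with `preserve.case = FALSE` and a `u/v > u` Latin setting amounts to. It is an illustration only: `toy_tokenize_latin()` is a hypothetical function, and *Stylo*'s actual splitting rules for `Latin.corr` may differ in detail.

```{r}
# Hypothetical sketch, not Stylo's tokenizer: lowercase, map v to u, split on non-letters.
toy_tokenize_latin <- function(text) {
  text <- tolower(text)                        # corresponds to preserve.case = FALSE
  text <- gsub("v", "u", text, fixed = TRUE)   # u/v normalization
  tokens <- unlist(strsplit(text, "[^a-z]+"))  # split on anything that is not a-z
  tokens[nzchar(tokens)]                       # drop empty strings
}

toy_tokens <- toy_tokenize_latin("Arma virumque cano, Troiae qui primus ab oris")
toy_tokens
```

The point of the normalization is that spelling variants such as `virumque` and `uirumque` end up as the same token before any features are counted.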
#### Remove the pronouns
Remove pronouns from the corpus.
```{r}
# remove pronouns from the tokenized corpus
corpus_no_pronouns_bct <- delete.stop.words(tokenized_corpus_bct,
                                            stop.words = stylo.pronouns(corpus.lang = "Latin.corr"))
# list of pronouns removed
stylo.pronouns(corpus.lang = "Latin.corr")
```
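Conceptually, deleting stop words is a filter applied to every text in the corpus. The sketch below shows that idea in base R on a made-up corpus; `toy_corpus` and `toy_stop_words` are invented for illustration and are not *Stylo*'s Latin pronoun list.

```{r}
# Toy corpus: a named list of token vectors, the same shape as a tokenized corpus.
toy_corpus <- list(
  text_A = c("ego", "arma", "cano"),
  text_B = c("tu", "quoque", "ego")
)
# Illustrative stop words only; the real list comes from stylo.pronouns().
toy_stop_words <- c("ego", "tu")

# Keep every token that is not in the stop-word list.
toy_filtered <- lapply(toy_corpus, function(tokens) tokens[!tokens %in% toy_stop_words])
toy_filtered
```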
#### Extracting the features (character 4-grams)
Extract character 4-grams, keep the 5,000 most frequent ones, and build a table of their relative frequencies.
```{r}
# Extract character 4-grams
corpus_char_4grams_bct <- txt.to.features(corpus_no_pronouns_bct,
                                          features = "c",
                                          ngram.size = 4)
# Create a frequency list of the 5000 most frequent 4-grams
frequent_features_4grams_bct <- make.frequency.list(corpus_char_4grams_bct,
                                                    head = 5000)
# Create a table of relative frequencies for those 4-grams
freqs_4grams_bct <- make.table.of.frequencies(corpus_char_4grams_bct,
                                              features = frequent_features_4grams_bct,
                                              relative = TRUE)
```
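For readers unfamiliar with character n-grams, the following base-R sketch slides a window of four characters over a short token sequence. It assumes that word boundaries are kept as a visible separator (here `_`); `toy_char_ngrams()` is a hypothetical helper, and the exact string format that *Stylo* produces may differ.

```{r}
# Hypothetical helper: overlapping character n-grams from a vector of tokens.
toy_char_ngrams <- function(tokens, n = 4) {
  chars <- strsplit(paste(tokens, collapse = "_"), "")[[1]]  # "_" marks word boundaries
  if (length(chars) < n) return(character(0))
  starts <- seq_len(length(chars) - n + 1)
  vapply(starts, function(i) paste(chars[i:(i + n - 1)], collapse = ""), character(1))
}

toy_4grams <- toy_char_ngrams(c("arma", "uirumque", "cano"), n = 4)
head(toy_4grams)

# Relative frequencies (proportions) of each 4-gram in this toy text
toy_rel_freqs <- prop.table(table(toy_4grams))
```

Because the window crosses word boundaries, n-grams such as `rma_` capture word endings, which is part of why character n-grams are used as stylistic features.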
```{r}
# BCT on character 4-grams: 100 to 2000 most frequent 4-grams in steps of 100,
# consensus strength 0.5 (see the replication steps for the GUI settings)
bct_results_4grams <- stylo(frequencies = freqs_4grams_bct,
                            distance.measure = "wurzburg", # Cosine Delta
                            analysis.type = "BCT",
                            mfw.min = 100, mfw.max = 2000, increment = 100,
                            custom.graph.title = "Who is the author?",
                            write.pdf.file = TRUE,
                            gui = TRUE) # the GUI opens only to double-check the settings; in principle nothing needs to be changed
```
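The `"wurzburg"` distance is labelled Cosine Delta in the chunk above. As it is commonly described, Cosine Delta z-scores each feature across the texts and then takes the cosine distance between the resulting vectors. The sketch below computes that on a made-up frequency matrix; `toy_cosine_delta()` and `toy_freqs` are hypothetical, and this is an illustration of the idea, not *Stylo*'s implementation.

```{r}
# Hypothetical sketch of Cosine Delta: z-score the features, then cosine distance.
toy_cosine_delta <- function(freqs) {
  z <- scale(freqs)                               # z-score each column (feature) across texts
  z <- z[, colSums(is.na(z)) == 0, drop = FALSE]  # drop features with zero variance
  norms <- sqrt(rowSums(z^2))                     # vector length of each text
  cosine_similarity <- (z %*% t(z)) / (norms %o% norms)
  as.dist(1 - cosine_similarity)
}

# Invented frequencies: rows are texts, columns are features.
toy_freqs <- matrix(c(5, 3, 1, 2,
                      4, 3, 2, 2,
                      1, 2, 6, 5),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("text_A", "text_B", "text_C"),
                                    c("f1", "f2", "f3", "f4")))
toy_distances <- toy_cosine_delta(toy_freqs)
toy_distances
```

In this toy matrix `text_A` and `text_B` have similar profiles, so their distance is smaller than the distance from either of them to `text_C`.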
## REPLICATION STEPS ON STYLO'S GUI (BCT validation)
Note: `stylo_config.txt` records the full set of parameters used in this experiment. It can be found in `seneca_stylometry/2_validation/validation_PCA_BCT/results/BCT/MFCs-4grams/`.

-----------------------------------
1) Run the `stylo()` chunk above, which opens the Graphical User Interface (GUI) of *Stylo*.
2) In the GUI, select the following, if they are not already selected:
    * INPUT & LANGUAGE
        - INPUT: `plain text`
        - LANGUAGE: `Latin (u/v > u)`
    * FEATURES
        - FEATURES: `chars`, `ngram size: 4`
        - MFW SETTING: `Minimum: 100`, `Maximum: 2000`, `Increment: 100`, `Start at freq. rank: 1`
        - CULLING: `Delete pronouns: YES`
    * STATISTICS
        - STATISTICS: `Consensus Tree`, `Consensus strength: 0.5`
        - DELTA DISTANCE: `Cosine Delta`
    * SAMPLING
        - `No sampling`
    * OUTPUT
        - GRAPHS: `Onscreen`, `PDF`
-----------------------------------
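
As a quick sanity check on the MFW settings listed above, the following base-R sketch enumerates the feature-set sizes that the bootstrap procedure steps through. It assumes one tree is built per step; the variable names are illustrative, and the final line is only a rough reading of what a consensus strength of 0.5 means for this number of trees.

```{r}
# Feature-set sizes implied by Minimum 100, Maximum 2000, Increment 100
toy_mfw_steps <- seq(from = 100, to = 2000, by = 100)
toy_n_trees <- length(toy_mfw_steps)  # one tree per step (assumption stated above)

# With a consensus strength of 0.5, a grouping has to recur in roughly
# this many of the individual trees to appear in the consensus tree.
toy_consensus_strength <- 0.5
toy_trees_needed <- toy_consensus_strength * toy_n_trees

toy_mfw_steps
toy_n_trees
toy_trees_needed
```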