2_validation/validation_PCA_BCT/code/valid_PCA.Rmd

---
title: 'Validation: PCA'
author: "Paschalis Agapitos"
---

Aim of this notebook is to validate the Principal Component method used in the paper *A Stylometric Analysis of Seneca's Disputed Plays: Authorship Verification of Octavia and Hercules Oetaeus*.

```{r, echo=FALSE}
# if not installed, uncomment and run
# install.packages("stylo")
# install.packages("here")
```

## Setting the working directory

Set the working directory to `validation_PCA_BCT/`. This directory holds the data and the code to validate the PCA and BCT methods.

```{r}
# to set the working directory and ensure compatibility across all of the operating systems (IF THE CODE BELOW DOES NOT RAISE ANY ERROR):
# Session > Restart R > Set Working Directory > Choose Directory... > set to `seneca_stylometry/2_validation/validation_PCA_BCT/results/PCA/MFCs-4grams/PCA-corr/`
# we choose this file in order to write the resulting plots automatically there
setwd(file.path("~", "Documents", "projects", "seneca_paper", "seneca_stylometry", "2_validation", "validation_PCA_BCT", "results", "PCA", "MFCs-4grams", "PCA-corr"))
getwd() # verify directory
```

```{r}
library(stylo)
```


## Importing the corpus and tokenization
Import and tokenize the corpus. The tokenization follows the rules of the parameter `stylo.Latin.corr` to handle the variation between "u/v". Uppercase letters are converted to lowercase to minimize variations in orthography.
```{r}
# load the corpus for the validation of PCA
raw_corpus <- load.corpus(files = "all", 
                          corpus.dir = file.path("..", "..", "..", "..", "..", "validation_corpora", "validation_corpus_PCA"), 
                          encoding = "UTF-8")

# tokenize the corpus
tokenized_corpus <- txt.to.words.ext(raw_corpus, corpus.lang = "Latin.corr", preserve.case = FALSE)
```

## Remove the pronouns
Remove pronouns from the corpus as they may be connected to the genre of the text.
```{r}
# remove pronouns from the tokenized corpus
corpus_no_pronouns <- delete.stop.words(tokenized_corpus, stop.words = stylo.pronouns(corpus.lang = "Latin.corr"))

# list of pronouns removed
stylo.pronouns(corpus.lang = "Latin.corr")
```

## Character 4-grams

### Extracting the features (character 4-grams)
Extract character 4-grams and add them to a frequency table.
```{r}
# extract character 4-grams
corpus_char_4grams <- txt.to.features(corpus_no_pronouns, 
                                      features = "c", 
                                      ngram.size = 4)

# create a frequency list of the 4-grams
frequent_features_4grams <- make.frequency.list(corpus_char_4grams, 
                                                head = 2000)

# create a table of frequencies for the 4-grams
freqs_4grams <- make.table.of.frequencies(corpus_char_4grams, 
                                          features = frequent_features_4grams, 
                                          relative = TRUE)
```

## Principal Component Analysis - Correlation matrix (MFCs 4grams)

Apply PCA using a correlation matrix to visualize the results with MFCs 4-grams. We will look at the top 100 to 2000 MFCs 4-grams with an increment of 100 in each iteration.

Cosine Delta will be used as a distance metric.

```{r}
# PCA correlation - top 100-2000-100 incr.100 MFCs 4-grams
results_pca_4grams_cor <- stylo(frequencies = freqs_4grams, 
                                analysis.type = "PCR",
                                mfw.min = 100, mfw.max = 2000, increment = 100, 
                                distance.measure = "wurzburg", # Cosine Delta
                                custom.graph.title = "Who is the author?", 
                                pca.visual.flavour = "classic", 
                                write.pdf.file = TRUE, 
                                gui = TRUE) #GUI is set to TRUE to double check the parameters. In principle no additional change should be implemented. Hit "OK" and the results will be displayed after the PCA is done.
```
## REPLICATION STEPS ON STYLO'S GUI (PCA-Correlation validation)

Note: `stylo_config.txt` provides a detailed version of the parameters used throughout this experiment. It can be found in `seneca_stylometry/2_validation/validation_PCA_BCT/results/PCA/MFCs-4grams/PCA-corr/`.

-----------------------------------
1) Run the script in the cell below, which will opent the Graphic User Interface (GUI) of *Stylo*
2) On the GUI select the following, if they are not already selected:

* INPUT & LANGUAGE
  - INPUT: `plain text` 
  - LANGUAGE: `Latin (u/v > u)`
* FEATURES
  - FEATURES: `chars`, `ngram size: 4`
  - MFW SETTING: `Minimum: 100`, `Maximum: 2000`, `Increment: 100`, `Start at freq. rank: 1`
  - CULLING: `Delete pronouns: YES`
* STATISTICS
  - STATISTICS: `PCA (corr.)`
  - DELTA DISTANCE: `Cosine Delta`
* SAMPLING
  - `No sampling`
* OUTPUT
  - GRAPHS: `Onscreen`, `PDF`
  
-----------------------------------