example_lda.Rmd

---
title: "R Notebook"
output: html_notebook
---

# ml_word_visualisations

## Structure of repository

Folder [lda](./lda) consists of the files to run lda experiments 
1. [utils.R](./lda/utils.R) which includes functions for testing lda topics 
2. [main.R](./lda/main.R) which includes methods for creating and testing lda models from different libraries

Folder [data](./data) is there to store your data. <br>
Folder [results](./results) stores the results of your experiment

## Example 

### LDA
#### 0. Preparation
  #### 0.1 Set (hyper)parameters

```{r}
data <- read_csv("./data/suicide_test.csv")
view(data)
data2 <- na.omit(data)
view(data2)
write_csv(data2,"./data/depression_anxiety_final.csv")
```

**Data** 
```{r}
data_dir <- "./data/suicide_test.csv"
data_col <- "suicide"
id_col <- "id"
group_var <- NULL # now necessary, but only used for t-test
cor_var <- "IDASSuicidality"
```

**Document Term Matrix** 
```{r}
ngram_window <- c(1,3)
stopwords <- stopwords::stopwords("en", source = "snowball")
removalword <- "" # just possible with one word
occ_rate <- 0
removal_num_most <- 200
removal_num_least <- 4
removal_mode <- "threshold" # "relative"
split <- 1
```


**LDA** 
```{r}
model_type <- "mallet" # or "mallet"
num_topics <- 20
num_top_words <- 10
num_train_iterations <- 2000
num_pred_iterations <- 200
pred_mode <- "mallet" # or "custom" for mallet
```


**Analysis**
```{r}
cor_var <- "PHQtot" # grouping variable for t-test, to be predicted variable for other
control_vars <- c("PHQtot")#, "GADtot") # vector of variables to control analysis with if test_method is linear_regression
test_method <- "textTrain_regression" # linear_regression, logistic_regression, t-test
```

**Miscellaneous** 
```{r}
seed <- 1234
```

##### 0.2 Create directory to save all computations
All objects created within the pipeline are created in the directory below. These include
- Document Term Matrix
- model
- predictions
- analysis results
```{r}
save_dir <- paste0("./results/",
            model_type,"_",
            data_col, "_",
            #"embed_", embedding_model)#,
            num_topics, 
            "_most_",removal_num_most, 
            "_least_", removal_num_least, 
            "_occ_", occ_rate, 
            "_pred_", pred_mode)
if (!dir.exists("./results")) {
  dir.create("./results")
}
```

##### 0.3 Imports
```{r}
library(textmineR)
library(tidyverse)
library(dplyr)
library(textmineR)
library(mallet)
library(rJava)
library(tokenizers)
library(text2vec)
library(quanteda)
source("./topic_modeling/lda/main.R")
source("./topic_modeling/lda/wordclouds.R")

```

#### 1. Compute Document Term Matrix
```{r}
dtms <- get_dtm(data_dir = data_dir,
                id_col = id_col,
                data_col = data_col,
                group_var = group_var, # used if t-test, is binary, so this is optional
                cor_var = cor_var, # used for regression, correlation
                ngram_window = ngram_window,
                stopwords = stopwords,
                removalword = removalword,
                occ_rate = occ_rate,
                removal_mode = removal_mode,
                removal_rate_most = removal_num_most,
                removal_rate_least = removal_num_least,
                split=split,
                seed=seed,
                save_dir=save_dir)
```
#### 2. Create LDA Model
```{r}
model <- get_lda_model(model_type=model_type,
                        dtm=dtms$train_dtm,
                        num_topics=num_topics,
                        num_top_words=num_top_words,
                        num_iterations = num_train_iterations,
                        seed=seed,
                        save_dir=save_dir)
```

#### 3. Create Predictions
```{r}
preds <- get_lda_preds(model = model,
                        num_iterations=num_pred_iterations,
                        data = dtms$train_data,
                        dtm = dtms$train_dtm,
                        group_var = NULL,
                        seed=seed,
                        mode=pred_mode,
                        save_dir = save_dir)
```

#### 4. Analysis
##### 4.1 textTrain_regression

```{r}

test <- get_lda_test(model=model,
                    preds=preds,
                    data=dtms$train_data,
                    group_var = cor_var,
                    control_vars = c(cor_var),
                    test_method = "textTrain_regression",
                    seed=seed,
                    save_dir=save_dir)
```


##### 4.2 Linear Regression

```{r}
test <- get_lda_test(model=model,
                    preds=preds,
                    data=dtms$train_data,
                    group_var = "IDASSuicidality",
                    control_vars = c("IDASSuicidality"),
                    test_method = "linear_regression",
                    seed=seed,
                    save_dir=save_dir)
view(test)
```

```{r}
summary <- model[["summary"]]
print(summary)

```

```{r}

plot_wordclouds(model = model,
                model_type = "mallet",
                test = test,
                test_type = "linear_regression",
                cor_var = "IDASSuicidality",
                plot_topics_idx = NULL,
                p_threshold = 0.05,
                color_negative_cor = scale_color_gradient(low = "darkgreen", high = "green"),
                color_positive_cor = scale_color_gradient(low = "darkred", high = "red"),
                scale_size=TRUE,
                save_dir=save_dir,
                seed=seed)
```


```{r}
lda_vis <- create_lda_vis(model=model,
                          data=dtms$train_data,
                          data_col="suicide",
                          save_dir=save_dir)
```