---
title: "CWB manifesto corpus from scratch"
output: html_document
editor_options: 
  chunk_output_type: console
---

## Problem statement

Collecting PDF versions of manifestos, extracting the text, cleaning the data and building a manifesto corpus from them would be a nice finger exercise. However, the corpus of the Manifesto Project is an established resource used by many researchers. Because it is licensed data, a corpus built on it cannot be shared freely. The code that builds the corpus can be shared, though, and that is what I want to do here.

## Getting started

```{r}
if (!"manifestoR" %in% unname(installed.packages()[,"Package"])){
  install.packages("manifestoR")
}
library(manifestoR)
library(RcppCWB) # dev version v0.5.0.9010 or higher
library(polmineR) # dev version 
library(tokenizers)
library(dplyr)
```


## Downloading the manifesto dataset

To access the Manifesto Database, an API key is required. Here it is read from a local file.

```{r}
mp_setapikey("~/.credentials/manifesto_api_key.txt")
```


```{r}
de <- mp_corpus(countryname == "Germany")
```

```{r}
mpds <- mp_maindataset()
party_ids <- mpds %>%
  select(country, countryname, party, partyname, partyabbrev) %>%
  distinct(party, .keep_all = TRUE)
parties <- setNames(party_ids$partyabbrev, party_ids$party)
```
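
The `parties` object built above is a named character vector: party codes serve as names, abbreviations as values. A minimal sketch of how such a lookup behaves, using made-up codes and abbreviations (`toy_ids`, `toy_parties` and `toy_lookup` are hypothetical names that only mimic the columns selected above):

```{r}
# Hypothetical stand-in for the distinct party rows of the main dataset
toy_ids <- data.frame(
  party = c(101, 102, 103),
  partyabbrev = c("AAA", "BBB", "CCC"),
  stringsAsFactors = FALSE
)

# Names are the party codes (coerced to character), values the abbreviations
toy_parties <- setNames(toy_ids$partyabbrev, toy_ids$party)

# Index with character codes: numeric indexing would select by position instead
toy_lookup <- unname(toy_parties[c("103", "101", "999")])
toy_lookup
```

A code that is missing from the lookup yields `NA`, which `sprintf()` would later write into a tag as the string "NA", so it may be worth checking for missing abbreviations before building the tags.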


This is a fairly condensed way to turn the metadata and text of the corpus into a data frame.

```{r}
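# `party_ids` is reused here, now as a character vector with one party code per document
party_ids <- as.character(unname(sapply(lapply(de,  `[[`, "meta"), `[[`, "party")))
df <- data.frame(
  party_id = party_ids,
  party = unname(parties[party_ids]), # abbreviation looked up by party code
  date = as.character(unname(sapply(lapply(de,  `[[`, "meta"), `[[`, "date"))),
  language = as.character(unname(sapply(lapply(de,  `[[`, "meta"), `[[`, "language"))),
  # collapse the text units of each document into one string
  txt = unlist(sapply(lapply(lapply(de,  `[[`, "content"), `[[`, "text"), paste, collapse = "\n")),
  stringsAsFactors = FALSE # keep text columns as character (matters for R < 4.0)
)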
```
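
The nested `sapply(lapply(...))` calls may be easier to follow with a small accessor function. The sketch below runs on a toy list that only imitates the `meta`/`content` nesting accessed above; `get_meta()`, `toy_docs` and `toy_df` are hypothetical names and not part of manifestoR.

```{r}
# Toy documents imitating the nesting accessed above (not real manifestoR objects)
toy_docs <- list(
  doc_a = list(
    meta = list(party = 101, date = 201309, language = "german"),
    content = data.frame(
      text = c("First quasi-sentence.", "Second one."),
      stringsAsFactors = FALSE
    )
  ),
  doc_b = list(
    meta = list(party = 102, date = 201709, language = "german"),
    content = data.frame(text = "Only one.", stringsAsFactors = FALSE)
  )
)

# Hypothetical helper: pull one metadata field from every document as character
get_meta <- function(docs, field) {
  unname(vapply(docs, function(d) as.character(d$meta[[field]]), character(1)))
}

toy_df <- data.frame(
  party_id = get_meta(toy_docs, "party"),
  date = get_meta(toy_docs, "date"),
  language = get_meta(toy_docs, "language"),
  # Collapse the text units of each document into a single string
  txt = unname(vapply(
    toy_docs, function(d) paste(d$content$text, collapse = "\n"), character(1)
  )),
  stringsAsFactors = FALSE
)
toy_df
```

Using `vapply()` with `character(1)` makes the helper fail early if a document lacks the requested field, instead of silently returning a list.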


```{r}
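# Split each document into sentences, then each sentence into tokens,
# keeping case and punctuation
sentences <- tokenize_sentences(df$txt)
tok <- lapply(sentences, tokenize_words, lowercase = FALSE, strip_punct = FALSE)
# One token per line, each sentence wrapped in <s> ... </s>
body <- lapply(tok, function(doc) unlist(lapply(doc, function(s) c("<s>", s, "</s>"))))

# Opening <text> tag per document, carrying the metadata as attributes
tags <- sprintf(
  "<text party_id='%s' party='%s' date='%s' language='%s'>",
  df$party_id, df$party, df$date, df$language
)
# Combine opening tag, body and closing tag; SIMPLIFY = FALSE always returns a list
xml <- mapply(c, tags, body, "</text>", SIMPLIFY = FALSE, USE.NAMES = FALSE)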
```
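
To make the target format concrete, here is a minimal sketch of the verticalized layout for one short text: one token per line, sentences wrapped in `<s>` tags, the document wrapped in a `<text>` tag. It deliberately uses crude base R splitting as a stand-in for `tokenize_sentences()` and `tokenize_words()`, and the text and attribute values are made up.

```{r}
# Toy input; in the pipeline above this comes from df$txt
toy_txt <- "Wir wollen Frieden. Wir wollen Bildung!"

# Stand-in sentence splitter: put a line break after ., ! or ? and split there
toy_sentences <- strsplit(
  gsub("([.!?])\\s+", "\\1\n", toy_txt), "\n", fixed = TRUE
)[[1]]

# Stand-in word tokenizer: pad punctuation with spaces, then split on whitespace
toy_tokens <- lapply(
  toy_sentences,
  function(s) strsplit(trimws(gsub("([[:punct:]])", " \\1 ", s)), "\\s+")[[1]]
)

# One token per line, sentences in <s>...</s>, the document in <text>...</text>
toy_vrt <- c(
  "<text party_id='101' party='AAA' date='201309' language='german'>",
  unlist(lapply(toy_tokens, function(s) c("<s>", s, "</s>"))),
  "</text>"
)
toy_vrt
```

The stand-in tokenizer splits at every punctuation character, so it will treat abbreviations, dates and hyphenated words differently from the tokenizers package. It is only meant to show the shape of the output.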

```{r}
vrt_dir <- file.path(tempdir(), "vrt")
dir.create(vrt_dir)
```

```{r}
writeLines(text = unlist(xml), con = file.path(vrt_dir, "manifestos.vrt"))
```

Faster alternatives here are `readr::write_lines()` or `data.table::fwrite()`.
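
A hedged sketch of the readr variant on toy data: it assumes readr may or may not be installed (it is not loaded elsewhere in this document) and falls back to `writeLines()` otherwise. `toy_lines` and `toy_file` are hypothetical names.

```{r}
# Toy lines standing in for unlist(xml)
toy_lines <- c("<text>", "<s>", "Hallo", "Welt", "</s>", "</text>")
toy_file <- tempfile(fileext = ".vrt")

if (requireNamespace("readr", quietly = TRUE)) {
  # Assumption: readr is available; arguments are passed by position
  readr::write_lines(toy_lines, toy_file)
} else {
  # Base R fallback, same result but slower for very long vectors
  writeLines(toy_lines, con = toy_file)
}

# Either writer should produce one vector element per line
toy_readback <- readLines(toy_file, encoding = "UTF-8")
identical(toy_readback, toy_lines)
```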

```{r}
data_dir <- file.path(tempdir(), "data_dir")
dir.create(data_dir)
```

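The next chunk encodes the VRT file as a CWB corpus, then builds and compresses the index files for the `word` attribute. The new corpus still needs to be loaded afterwards.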

```{r}
cwb_encode(
  corpus = "MANIFESTOS",
  registry = registry(),
  vrt_dir = vrt_dir,
  data_dir = data_dir,
  encoding = "utf8",
  p_attributes = "word",
  s_attributes = list(text = c("party_id", "party", "date", "language"), s = character()),
  verbose = FALSE, quietly = TRUE
)

p_attr <- "word"
cwb_makeall(corpus = "MANIFESTOS", p_attribute = p_attr, registry = registry(), quietly = TRUE)
cwb_huffcode(corpus = "MANIFESTOS", p_attribute = p_attr, registry = registry(), quietly = TRUE)
cwb_compress_rdx(corpus = "MANIFESTOS", p_attribute = p_attr, registry = registry(), quietly = TRUE)
```

```{r}
cl_load_corpus(corpus = "MANIFESTOS", registry = registry())
cqp_load_corpus(corpus = "MANIFESTOS", registry = registry())
```

This is a rudimentary check (here via `polmineR::count()` rather than low-level RcppCWB functions) of whether the corpus can be used. How often does a token occur?

```{r}
polmineR::count("MANIFESTOS", query = "Digitalisierung")
```
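
One way to cross-check such a count is to count the token directly in the VRT lines, independently of CWB. The sketch below does this on made-up lines; for the real corpus the lines would be read from the `manifestos.vrt` file written earlier, and the result should in principle agree with the corpus count for a plain, case-sensitive token query. All `toy_*` names are hypothetical.

```{r}
# Toy VRT lines; for the real corpus, read the VRT file written above instead
toy_vrt_lines <- c(
  "<text party='AAA'>", "<s>", "Digitalisierung", "ist", "wichtig", ".", "</s>",
  "<s>", "Die", "Digitalisierung", "kommt", ".", "</s>", "</text>"
)

# Drop structural lines (tags start with "<"), keep token lines
toy_token_lines <- toy_vrt_lines[!startsWith(toy_vrt_lines, "<")]

# Exact, case-sensitive matches of the token
toy_n <- sum(toy_token_lines == "Digitalisierung")
toy_n
```

Note the simplification: any token that itself starts with `<` would be dropped along with the tags, so this is a plausibility check rather than an exact reimplementation of the corpus count.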

```{r, render = knit_print}
kwic(
  "MANIFESTOS", query = "Krieg",
  s_attributes = c("text_party", "text_date"),
  left = c(s = 1L), right = c(s = 1L)
)
```


```{r}
mp_cite()
```