---
title: "CWB manifesto corpus from scratch"
output: html_document
editor_options:
  chunk_output_type: console
---
## Problem statement
Downloading manifesto PDFs, extracting the text, cleaning the data and building a manifesto corpus from them would be a great finger exercise. But the corpus of the Manifesto Project is an established resource used by many researchers. Because the data is licensed, a corpus built on it cannot be shared freely. The code to build the corpus can be shared, however, and that is the purpose of this document.
## Getting started
```{r}
# Install manifestoR only if it is not available yet
if (!"manifestoR" %in% unname(installed.packages()[, "Package"])) {
  install.packages("manifestoR")
}
library(manifestoR)
library(RcppCWB) # dev version v0.5.0.9010 or higher
library(polmineR) # dev version
library(tokenizers)
library(dplyr)
```
## Downloading the manifesto dataset
To access the Manifesto Project Database, an API key is required.
```{r}
mp_setapikey("~/.credentials/manifesto_api_key.txt")
```
```{r}
de <- mp_corpus(countryname == "Germany")
```
```{r}
mpds <- mp_maindataset()
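# One row per party: keep the first occurrence of each party id
party_ids <- mpds %>%
  select(country, countryname, party, partyname, partyabbrev) %>%
  distinct(party, .keep_all = TRUE)
# Lookup vector mapping party ids (names) to party abbreviations (values)
parties <- setNames(party_ids$partyabbrev, party_ids$party)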
```
The following is a fairly condensed way to turn the corpus into a data frame with one row per manifesto.
```{r}
# Collect the metadata of all documents once, then extract single fields from it
meta <- lapply(de, `[[`, "meta")
get_meta <- function(field) as.character(unname(sapply(meta, `[[`, field)))
party_ids <- get_meta("party") # overwrites the data frame of the same name from above
df <- data.frame(
  party_id = party_ids,
  party = unname(parties[party_ids]),
  date = get_meta("date"),
  language = get_meta("language"),
  # Collapse the text units of each manifesto into a single string
  txt = unlist(sapply(lapply(lapply(de, `[[`, "content"), `[[`, "text"), paste, collapse = "\n")),
  stringsAsFactors = FALSE
)
```
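To see what the nested `lapply()`/`sapply()` calls above do, here is a minimal sketch on a hand-made toy list. The object `toy_corpus`, its two entries and the lookup vector `toy_parties` are invented for illustration. The sketch only assumes that each document is a list with a `meta` and a `content` element, as the code above implies.

```{r}
# Toy stand-in for the manifesto corpus (hypothetical data, not real manifestos)
toy_corpus <- list(
  list(
    meta = list(party = 1, date = 202109, language = "german"),
    content = data.frame(text = c("Erster Satz.", "Zweiter Satz."))
  ),
  list(
    meta = list(party = 2, date = 202109, language = "german"),
    content = data.frame(text = "Nur ein Satz.")
  )
)

# Lookup vector: names are party ids (coerced to character), values are abbreviations
toy_parties <- setNames(c("AAA", "BBB"), c(1, 2))

# Step 1: pull the metadata list out of every document
toy_meta <- lapply(toy_corpus, `[[`, "meta")

# Step 2: pick one field per document and coerce it to character
toy_party_ids <- as.character(sapply(toy_meta, `[[`, "party"))

# Step 3: collapse the text units of each document into one string
toy_txt <- sapply(
  lapply(lapply(toy_corpus, `[[`, "content"), `[[`, "text"),
  paste, collapse = "\n"
)

toy_df <- data.frame(
  party_id = toy_party_ids,
  party = unname(toy_parties[toy_party_ids]),
  txt = toy_txt,
  stringsAsFactors = FALSE
)
toy_df
```

Indexing the named vector with character ids (`toy_parties[toy_party_ids]`) is what translates numeric party codes into readable abbreviations.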
```{r}
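# Split each manifesto into sentences, then each sentence into tokens
sentences <- tokenize_sentences(df$txt)
tok <- lapply(sentences, tokenize_words, lowercase = FALSE, strip_punct = FALSE)
# Wrap every sentence in <s> ... </s>, one token per line
body <- lapply(tok, function(doc) unlist(lapply(doc, function(s) c("<s>", s, "</s>"))))
# Opening <text> tag carrying the document metadata as attributes
tags <- sprintf(
  "<text party_id='%s' party='%s' date='%s' language='%s'>",
  df$party_id, df$party, df$date, df$language
)
# Combine opening tag, tokens and closing tag per document
xml <- mapply(c, tags, body, rep("</text>", times = length(tags)), SIMPLIFY = FALSE)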
```
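The chunk above produces verticalized text (VRT) as input for CWB: one token per line, wrapped in `<s>` and `<text>` elements whose attributes carry the metadata. A minimal sketch with hand-tokenized toy input shows the resulting layout. The tokens and attribute values are made up, and `toy_to_vrt()` is a hypothetical helper, not part of any package.

```{r}
# Hypothetical helper: wrap the pre-tokenized sentences of one document in VRT markup
toy_to_vrt <- function(sentences, party, date) {
  # Each sentence becomes: <s>, one token per line, </s>
  body <- unlist(lapply(sentences, function(s) c("<s>", s, "</s>")))
  open_tag <- sprintf("<text party='%s' date='%s'>", party, date)
  c(open_tag, body, "</text>")
}

# Toy document with two already tokenized sentences (invented data)
toy_sentences <- list(
  c("Das", "ist", "ein", "Satz", "."),
  c("Hier", "ist", "noch", "einer", ".")
)

toy_vrt <- toy_to_vrt(toy_sentences, party = "AAA", date = "202109")
writeLines(toy_vrt)
```

Characters such as `&`, `<` or `'` inside tokens or attribute values can clash with the XML markup, so escaping them before writing the file may be necessary.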
```{r}
vrt_dir <- file.path(tempdir(), "vrt")
dir.create(vrt_dir)
```
```{r}
writeLines(text = unlist(xml), con = file.path(vrt_dir, "manifestos.vrt"))
```
Faster alternatives here are `readr::write_lines()` or `data.table::fwrite()`.
```{r}
data_dir <- file.path(tempdir(), "data_dir")
dir.create(data_dir)
```
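The VRT file is now encoded as a CWB corpus: `cwb_encode()` writes the binary corpus data and the registry entry, and the subsequent calls build and compress the index files for the positional attribute. The new corpus still needs to be loaded afterwards.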
```{r}
cwb_encode(
  corpus = "MANIFESTOS",
  registry = registry(),
  vrt_dir = vrt_dir,
  data_dir = data_dir,
  encoding = "utf8",
  p_attributes = "word",
  s_attributes = list(text = c("party_id", "party", "date", "language"), s = character()),
  verbose = FALSE, quietly = TRUE
)
p_attr <- "word"
cwb_makeall(corpus = "MANIFESTOS", p_attribute = p_attr, registry = registry(), quietly = TRUE)
cwb_huffcode(corpus = "MANIFESTOS", p_attribute = p_attr, registry = registry(), quietly = TRUE)
cwb_compress_rdx(corpus = "MANIFESTOS", p_attribute = p_attr, registry = registry(), quietly = TRUE)
```
```{r}
cl_load_corpus(corpus = "MANIFESTOS", registry = registry())
cqp_load_corpus(corpus = "MANIFESTOS", registry = registry())
```
This is a rudimentary check (using `polmineR::count()`) of whether the corpus can be used: how often does a token occur?
```{r}
polmineR::count("MANIFESTOS", query = "Digitalisierung")
```
```{r, render = knit_print}
kwic(
  "MANIFESTOS", query = "Krieg",
  s_attributes = c("text_party", "text_date"),
  left = c(s = 1L), right = c(s = 1L)
)
```
```{r}
mp_cite()
```