# Building Document-Feature Matrices {#sec-quanteda-dfms}
## Objectives
Collecting documents in a text corpus, changing the unit of analysis in this corpus (if necessary), and tokenising and processing texts are central aspects of each text analysis project. Once the tokens object has been processed, many statistical analyses of texts require a document-feature matrix (dfm), a mathematical matrix that describes the frequency of features (e.g., words or phrases) that occur in a collection of documents.
In this chapter, we introduce you to the assumptions we make when creating and analysing texts through document-feature matrices. We describe the nature of a dfm object and the importance of sparsity. You will learn how to create a dfm, how to subset or group a dfm, how to select features, and how to weight, trim, and smooth a dfm. We also show how to match dfms for machine learning approaches, and how to convert dfm objects for further use in other R packages. Finally, we explain the intuition behind feature co-occurrence matrices (fcm), when and why to use them, and how to construct an fcm from a **quanteda** tokens object. At the end of the chapter, readers will have a solid understanding of creating dfms and fcms, the workhorses for most statistical analyses of textual data.
## Methods
The **document-feature matrix** is the core object of many analyses of texts. It is a structured representation of textual data in the form of a matrix. After tokenising texts, we treat "text as data" by tabulating the counts of features by documents. Each row in a dfm represents a document, while each column represents a unique feature from across all the documents. The cells of the matrix indicate the frequency of a feature in each document. By converting raw texts into a matrix, we move textual data into the same format as many other forms of quantitative data. This format allows us to apply various statistical techniques and machine learning tools. Many of these methods have well-understood properties that allow us to generate probability statements and calculate measures of uncertainty [@benoit2020text].
::: {.callout-note appearance="simple"}
## Why document-feature matrix rather than document-term matrix?
Many textbooks and software packages speak of "document-term matrices". We prefer the name "document-feature matrix" since a dfm does not only contain terms, but can also contain features such as punctuation characters, numbers, symbols, or emojis. "Document-term matrix" implies that such features are uninformative. However, as we describe in @sec-quanteda-tokens, punctuation characters, numbers, or emojis can provide relevant information about texts and should remain in the matrix representation of the text corpus.
:::
Document-feature matrices allow us to analyse texts systematically. Somewhat ironically, we first need to destroy the structure of texts and make it impossible to "read" the documents. While we still know the order and context of words in tokens objects, document-feature matrices only provide information about the frequencies of features in a document. We no longer know at what position in a given document a certain feature appeared. Yet, only this simplification of texts allows us to apply statistical methods and analyse "text as data."
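To make this loss of word order concrete, here is a minimal base R sketch. The two toy "documents" and all object names are made up for illustration and are not part of **quanteda**. Both documents contain the same words in a different order, so their feature counts are identical even though their meanings differ.

```{r}
# two hypothetical, already tokenised documents: same words, different order
toks_a <- c("dog", "bites", "man")
toks_b <- c("man", "bites", "dog")
# shared vocabulary (these would be the columns of a dfm)
vocab_ab <- sort(unique(c(toks_a, toks_b)))
# count how often each vocabulary entry appears in each document
counts_a <- tabulate(match(toks_a, vocab_ab), nbins = length(vocab_ab))
counts_b <- tabulate(match(toks_b, vocab_ab), nbins = length(vocab_ab))
# the count representations are identical, although the word order differs
identical(counts_a, counts_b)
```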
**Note:** _Maybe add a few sentences on numerical representations of real-world phenomena, such as survey responses, economic indicators, or conflict data_ (see Benoit 2020: 464-465)?
Let us explain the structure of a dfm with a simple example. Consider two short documents:
- Document 1: "I love quanteda. I love text analysis."
- Document 2: "Text analysis with quanteda is fun."
A matrix representation of these documents would look as follows:
| | i | love | quanteda | . | text | analysis | with | is | fun |
|------------|---|------|----------|---|------|----------|------|----|-----|
| **Doc 1** | 2 | 2 | 1 | 2 | 1 | 1 | 0 | 0 | 0 |
| **Doc 2** | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
: A document-feature matrix without tokens removal, but lower-casing of all features {#tbl-dfm-raw}
@tbl-dfm-raw shows a dfm consisting of two rows (one per document), in which each feature has its own column. The numbers report how often a feature appeared in a document. We can treat this dfm like any other quantitative data set. We can no longer tell at which position "love" appeared in Document 1, but we can immediately say that this feature appeared twice in Document 1, and that "love" was not included in Document 2. Since we only know feature frequencies across documents but not their precise positions in a document, we call the process of transforming texts into a matrix the **"bag-of-words" approach**.
_Maybe add a few more general sentences about dfms here._
Let's clarify a few key terms by inspecting @tbl-dfm-raw. First, the dfm consists of 2 **documents** and 9 **features**. Document 1 contains 6 **types** (i.e., unique tokens). The types are `i`, `love`, `quanteda`, `text`, `analysis`, and `.`. Document 1 consists of 9 **tokens** since `i`, `love`, and `.` appear twice, while `quanteda`, `text`, and `analysis` appear only once. In Document 2, the number of types is identical to the number of tokens: the processed document contains 7 tokens, and each token appears only once. The dfm has a **sparsity** of 27.8% since 5 out of the 18 cells have zero counts.
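These numbers can be checked by hand. The following base R sketch enters the toy dfm from the table above as a plain matrix (the object name `m_toy` is ours, and this is only an illustration, not how **quanteda** stores a dfm) and computes tokens, types, and sparsity.

```{r}
# toy dfm from the table above, entered by hand as a base R matrix
m_toy <- matrix(
  c(2, 2, 1, 2, 1, 1, 0, 0, 0,
    0, 0, 1, 1, 1, 1, 1, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("doc1", "doc2"),
    c("i", "love", "quanteda", ".", "text", "analysis", "with", "is", "fun")
  )
)
# tokens per document: sum of all counts in a row
rowSums(m_toy)
# types per document: number of features with a non-zero count
rowSums(m_toy > 0)
# sparsity: share of cells with a zero count (5 out of 18)
mean(m_toy == 0)
```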
The number of types, tokens, and features, as well as a dfm's sparsity, depend on the processing steps we implemented during tokenisation (@sec-quanteda-tokens). For example, if we conduct the common steps of removing stopwords and punctuation characters, the document-feature matrix will be considerably smaller. @tbl-dfm-processed shows the dfm after removing punctuation characters and the words "i", "with", and "is", which are included in most stopword lists. @grimmer22textasdata [57-59] provide more examples of why the "default" processing steps might be problematic.
| | love | quanteda | text | analysis | fun |
|-------------------|------|----------|------|----------|-----|
| **Doc 1** | 2 | 1 | 1 | 1 | 0 |
| **Doc 2** | 0 | 1 | 1 | 1 | 1 |
: A document-feature matrix after removing punctuation characters and stopwords {#tbl-dfm-processed}
The sparsity was reduced from 28% to 20%, and the number of columns changed from nine to five. At the same time, removing these stopwords and punctuation did not result in a considerable loss of information about the content of each document. It is important to repeat, however, that stopword removal is not always recommended and that you should inspect the features included in a stopword list. We removed the pronoun "I". In many cases, the removal of pronouns will not be an issue, but in other cases pronouns can tell us a lot about a text. For instance, @crisp21voteseeking show that Japanese candidates mention first-person pronouns more often in their personal manifestos if they face stronger intra-party competition. The usage of pronouns is a signal of candidates' campaign strategies and their degree of personalisation.
::: {.callout-warning appearance="simple"}
## Selecting features in your tokens object rather than the dfm
We strongly recommend conducting all processing steps involving the removal of features during the tokenisation step rather than applying the same functions to a dfm. Multi-word expressions can be easily detected in a tokens object, but this information is no longer available in a dfm. Similarly, we strongly recommend applying dictionaries (@sec-exploring-dictionaries) to the tokens object to identify possible multi-word expressions correctly.
There are, however, a few processing steps that rely on feature frequencies. Users might want to remove words appearing in fewer or more than a certain number or percentage of documents. These so-called trimming operations, covered in detail in the Applications section, only work on dfm objects since we need the word counts to filter by frequencies.
:::
The dfm in @tbl-dfm-raw reports counts of features. This is the default when a dfm is created. Oftentimes, we might prefer **weighting feature frequencies**. For example, many users might want to normalise a dfm by calculating the proportion of feature counts within each document. This process is called **document normalisation** as it homogenises the counts for each document. As such, it addresses the (potential) issue of varied document length in a corpus, allowing for comparing relative frequencies as opposed to raw counts. For example, two mentions of "terrible" in a short hotel review consisting of 100 words have a larger weight than three mentions of "terrible" in a review of 500 words. When normalising the documents, the relative frequency of "terrible" would be 0.02 (or 2%) in the short review, and 0.006 (0.6%) in the longer review, highlighting the much higher prevalence of this negative term in the first review.
Boolean weighting is another type of weighting. Applying boolean weights implies that all non-zero counts are recoded as 1. This weighting scheme is useful when users simply want to know whether or not a feature appeared in a document without being interested in the absolute frequencies of the term.
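The arithmetic behind both weighting schemes is simple enough to sketch in base R. The matrix below is a made-up stand-in for the two hotel reviews (the `other` column lumps together all remaining tokens, and the object names are ours). It is meant to illustrate the idea, not to reproduce **quanteda**'s implementation.

```{r}
# hypothetical counts: "terrible" plus all other tokens in two reviews
m_reviews <- matrix(
  c(2, 98,
    3, 497),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("short_review", "long_review"), c("terrible", "other"))
)
# document normalisation: divide each row by the document length
m_prop <- m_reviews / rowSums(m_reviews)
m_prop[, "terrible"]
# boolean weighting: recode all non-zero counts as 1
m_bool <- (m_reviews > 0) * 1
m_bool
```

After normalisation, each row sums to 1, so documents of different lengths become directly comparable.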
A popular, and slightly more complex, weighting scheme is called **tf-idf** or **term frequency-inverse document frequency**. Tf-idf is a very popular approach in information retrieval and often used in machine learning (@sec-ml-classifiers). Tf-idf up-weights terms that appear more often in a given document and down-weights terms that are common across documents. As a term appears in more documents, its weight approaches zero. While tf-idf is a very useful approach for identifying "unique" words that define a given document, tf-idf may be problematic when working with a topic-specific text corpus. Tf-idf will assign low weights to words appearing across documents. When we apply tf-idf to a corpus of debates on climate protection, tf-idf will eliminate climate-related features that appear in every document, even if they appear in different frequencies across documents. To sum up, while tf-idf can be useful for certain classification tasks, users should proceed with great caution when applying tf-idf weighting.
Let's go through a simple example to understand how to calculate tf-idf scores. Adapting the hotel review example above, suppose Document 1 contains the word "horrible" 2 times and consists of 100 tokens. Document 2 contains the word "horrible" 3 times but consists of 500 tokens.
Let's use **quanteda**'s default calculation for an example.
To calculate tf-idf we need to know term frequencies and inverse document frequencies.
$$
\text{tf-idf} = \text{tf} \times \text{idf}
$$
When using the counts of the term frequencies for "horrible", Document 1 has a $tf = 2$, while the $tf$ of "horrible" for Document 2 equals $3$.
Note that we can also calculate tf with normalised frequencies, i.e.,
$$
\text{tf} = \frac{\text{Number of times term appears in the document}}{\text{Total number of terms in the document}}
$$
In this case, the $tf$ would be $\frac{2}{100}$ for Document 1 and $\frac{3}{500}$ for Document 2. The `scheme_tf` argument in **quanteda**'s `dfm_tfidf()` allows you to select `count` or `prop`.
Next we calculate the inverse document frequency (idf) as:
$$
\text{idf} = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with the term}}\right)
$$
Since "horrible" appears in both Document 1 and Document 2:
$$
\text{idf for "horrible"} = \log\left(\frac{2}{2}\right) = \log(1) = 0
$$
In the last step, we multiply each document's $tf$ for "horrible" with the $idf$ of the same term.
$$
\text{tf-idf} = \text{tf} \times \text{idf}
$$
For Document 1: $\text{tf-idf for "horrible" in Document 1} = 2 \times 0 = 0$
For Document 2: $\text{tf-idf for "horrible" in Document 2} = 3 \times 0 = 0$
If Document 1 contained "horrible" 2 times, but Document 2 did not mention "horrible" at all, $idf$ would change to $\text{idf for "horrible"} = \log\left(\frac{2}{1}\right)$ because "horrible" is included in only one of our two documents. The tf-idf for Document 2 would remain 0, but the tf-idf for "horrible" in Document 1 changes to $2 \times \log_{10}(2) \approx 0.602$ with **quanteda**'s default base-10 logarithm (with the natural logarithm, the score would be $2 \times \ln(2) \approx 1.386$). The example highlights that terms appearing in all documents always receive a tf-idf score of 0, no matter how (in)frequent they are. Keyness analysis (@sec-exploring-freqs) is another approach to identifying words more "unique" to one group of documents, and, in contrast to tf-idf, features appearing across all documents are not down-weighted to 0 if their expected frequencies differ.
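The worked example can be reproduced with a few lines of base R. The helper `tfidf_sketch()` below is a hypothetical function written for this illustration. It assumes count-based term frequencies and a base-10 logarithm, and it is not **quanteda**'s `dfm_tfidf()`.

```{r}
# hypothetical helper: tf-idf with raw counts as tf and log(N / docfreq) as idf
tfidf_sketch <- function(m, base = 10) {
  docfreq <- colSums(m > 0)                    # number of documents containing each feature
  idf <- log(nrow(m) / docfreq, base = base)   # inverse document frequency
  sweep(m, MARGIN = 2, STATS = idf, FUN = "*") # multiply each column by its idf
}
# case 1: "horrible" appears in both documents
m_both <- matrix(
  c(2, 98,
    3, 497),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("horrible", "other"))
)
tfidf_sketch(m_both)[, "horrible"]
# case 2: "horrible" appears only in Document 1
m_one <- m_both
m_one["doc2", "horrible"] <- 0
tfidf_sketch(m_one)[, "horrible"]
```

In the first case both scores are 0 because the idf is $\log_{10}(2/2) = 0$. In the second case the score for Document 1 is $2 \times \log_{10}(2)$, while Document 2 stays at 0 because its term frequency is 0.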
The **trimming** of a document-feature matrix is another important step to reduce the number of features and the sparsity of the dfm. The dfm is reduced in size based on the document frequency or term frequency. Usually, we trim features based on minimum frequencies, but we can also exclude features in terms of maximum frequencies. Combining trimming operations for minimum and maximum frequencies will return features that fall within this range. Let's consider the following example. If we have a large text corpus and want to remove very infrequent terms, we could trim based on the minimum term frequencies and document frequencies. We could, for instance, only keep terms that occur at least 10 times across all documents and also appear in at least 2 documents. In this case, the minimum term frequency equals $10$ and the minimum document frequency equals $2$. Of course, we can also trim features based on relative frequencies. The Applications section below provides several examples of trimming dfms.
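The difference between term frequency and document frequency is easy to see on the toy dfm from earlier in this chapter. The base R sketch below re-enters that matrix by hand and keeps only features with a term frequency of at least 2 and a document frequency of at least 2. The thresholds and object names are chosen for illustration only.

```{r}
# toy dfm from earlier in the chapter, entered by hand as a base R matrix
m_trim <- matrix(
  c(2, 2, 1, 2, 1, 1, 0, 0, 0,
    0, 0, 1, 1, 1, 1, 1, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("doc1", "doc2"),
    c("i", "love", "quanteda", ".", "text", "analysis", "with", "is", "fun")
  )
)
termfreq <- colSums(m_trim)     # how often a feature appears across all documents
docfreq <- colSums(m_trim > 0)  # in how many documents a feature appears
# keep features that meet both minimum thresholds
keep <- termfreq >= 2 & docfreq >= 2
m_trim[, keep]
```

Note that `i` and `love` occur twice overall but only in one document, so they are dropped by the document frequency threshold.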
**Smoothing** is another frequently used form of weighting a dfm. It implies that zero counts are changed to a constant other than zero. If we set a smoothing parameter of 1, the constant of 1 is added to all cells in the dfm: zero counts are changed to 1, cells with the value 1 get the value 2 etc.
Smoothing a dfm is required for statistical models that struggle with zero probabilities. For example, a zero probability of a feature in a Naive Bayes classifier (@sec-ml-classifiers) would change the entire document's probability to 0, even if other terms indicate it belongs to that class.
Next, we move to the **feature co-occurrence matrix**, which measures the co-occurrence of features. Feature co-occurrence matrices are the input for many word embedding models, which will be covered extensively in @sec-lms-embeddings. The intuition behind fcms is simple: instead of counting how often a feature appears in a document (the intuition behind dfms), we want to know which features appear together within a user-defined context window.
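A small numerical sketch shows why a single zero is so harmful. The feature counts below are invented, and the calculation is a deliberately simplified stand-in for the likelihood term of a Naive Bayes classifier, not a full classifier.

```{r}
# hypothetical counts of three features in the training documents of one class
counts_class <- c(great = 3, clean = 2, terrible = 0)
# a new document containing each of the three features once
new_doc <- c("great", "clean", "terrible")
# without smoothing: one zero probability wipes out the whole product
p_raw <- counts_class / sum(counts_class)
prod(p_raw[new_doc])
# with add-one smoothing: every feature keeps a small non-zero probability
p_smooth <- (counts_class + 1) / sum(counts_class + 1)
prod(p_smooth[new_doc])
```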
_Add content on fcm here._
## Applications
In this section, you will learn how to create a dfm, how to identify the number of types and tokens, and how to assess its sparsity. You will learn how to weight, trim, group, and subset a document-feature matrix, and how to create and modify a feature co-occurrence matrix.
### Creating a Document-Feature Matrix from a Tokens Object
We start with replicating the example mentioned above. Suppose we have two documents in a text corpus. We first tokenise the corpus and then create a `dfm()`.
```{r}
#| echo: false
#| include: false
library("quanteda")
```
```{r}
# create example corpus
corp <- corpus(
  c(doc1 = "I love quanteda. I love text analysis.",
    doc2 = "Text analysis with quanteda is fun."))
# tokenise corpus without any further processing
toks <- tokens(corp)
# inspect tokens object
print(toks)
# create a document-feature matrix
dfmat <- dfm(toks)
# inspect dfm: first three documents, first four features
print(dfmat, max_ndoc = 3, max_nfeat = 4)
```
### Types, Tokens, Features, and Sparsity
We can access the number of features, tokens, types, and the sparsity of a dfm with built-in **quanteda** functions. We convert the corpus of US inaugural speeches to a document-feature matrix and retrieve these statistics.
```{r}
dfmat_inaug <- data_corpus_inaugural |>
tokens() |>
dfm()
# sparsity() and nfeat() are reported on the dfm-level
sparsity(dfmat_inaug)
nfeat(dfmat_inaug)
# ntype() and ntoken() are reported on the document-level
# show the number of types for first five documents
ntype(dfmat_inaug) |> head(n = 5)
# show the number of tokens for first five documents
ntoken(dfmat_inaug) |> head(n = 5)
```
::: {.callout-note appearance="simple"}
## `dfm()`'s default behaviour
When converting a tokens object into a document-feature matrix, `dfm()` considers all features included in the tokens object. It does not remove any features, but, by default, all tokens are transformed to lower case, even if you have not used `tokens_tolower()`.
```{r}
toks_example <- tokens("Transforming a tokens object to a dfm.")
# by default, dfm() transforms all tokens to lowercase
dfm(toks_example)
# can change this behaviour by using dfm(x, tolower = FALSE)
dfm(toks_example, tolower = FALSE)
```
:::
In almost all applications we can think of, lower-casing tokens is a reasonable and recommended processing choice. The sparsity of a dfm and the number of types (unique tokens) increases drastically when you treat upper- and lower-case tokens separately. If you want to make sure certain tokens (e.g., `Labour` vs. `labour`) are treated separately, we suggest replacing these tokens during the tokenisation process (@sec-quanteda-tokensadvanced) rather than preserving the original case of _all_ tokens.
```{r}
# transform all features to lowercase
dfmat_lower <- data_corpus_inaugural |>
  tokens() |>
  dfm(tolower = TRUE) # default
# number of features
nfeat(dfmat_lower)
# do not change tokens to lowercase
dfmat_unchanged <- data_corpus_inaugural |>
  tokens() |>
  dfm(tolower = FALSE) # keep original case
# number of features
nfeat(dfmat_unchanged)
```
The number of features increased from `r nfeat(dfmat_lower) |> format(big.mark = ",")` to `r nfeat(dfmat_unchanged) |> format(big.mark = ",")` when we do not convert tokens to lower case. In the second case, upper-case tokens are treated as different features from their lower-case counterparts.
### Feature Weighting
When you have created a document-feature matrix, you can apply weights using `dfm_weight()` and adjust the `scheme` argument. Let's weight the document-feature matrix of inaugural speeches by proportion to normalise the documents.
```{r}
dfmat_inaug <- data_corpus_inaugural |>
tokens() |>
dfm()
# document normalisation using dfm_weight(x, scheme = "prop")
dfmat_inaug_prop <- dfm_weight(dfmat_inaug, scheme = "prop")
# inspect dfm: first three documents, first four features
print(dfmat_inaug_prop, max_ndoc = 3, max_nfeat = 4)
```
We can transform non-zero counts to 1 with `dfm_weight(x, scheme = "boolean")`.
```{r}
# apply boolean weighting
dfmat_inaug_boolean <- dfm_weight(dfmat_inaug, scheme = "boolean")
# inspect dfm: first three documents, first four features
print(dfmat_inaug_boolean, max_ndoc = 3, max_nfeat = 4)
```
After applying `dfm_weight(x, scheme = "boolean")`, the cells contain only 0 (document does not include feature) or 1 (document contains feature at least once) rather than absolute or relative frequencies.
You can apply tf-idf weighting with `dfm_tfidf()`. Let's first transform `dfmat_inaug` to tf-idf using counts, rather than normalised frequencies, for the term frequencies.
```{r}
# apply tf-idf weighting; use count-based measure of term frequencies
dfmat_tfidfcount <- dfm_tfidf(dfmat_inaug, scheme_tf = "count")
# inspect dfm: first four documents, first five features
print(dfmat_tfidfcount, max_nfeat = 5, max_ndoc = 4)
```
The output reveals that features appearing in all documents, including `of` and `the`, receive a score of 0. This "feature" of tf-idf down-weights all terms appearing across documents, which might be problematic since these terms might also include meaningful ones. For example, it is very likely that every politician will mention "climate" in parliamentary debates about environmental protection. Tf-idf, in turn, would assign the word "climate" a score of 0 for all speakers.
You can adjust the way **quanteda** calculates tf-idf, for example by using normalised term frequencies (`dfm_tfidf(x, scheme_tf = "prop")`).
```{r}
# use normalised counts (= prop) for term frequencies
dfmat_tfidfprop <- dfm_tfidf(dfmat_inaug, scheme_tf = "prop")
```
You can also change the base of the logarithm. **quanteda**'s default is base 10, which mirrors the tf-idf implementation in @manning08iir. Other R packages use different default bases. **tidytext** [@tidytextJOSS] applies the natural log, whereas **tm** [@tmpackage] uses base 2. Below we show how to replicate other packages' tf-idf scores.
```{r}
#| eval: false
# mirror tidytext's implementation
dfm_tfidf(dfmat_inaug,
scheme_tf = "prop",
scheme_df = "inverse",
base = exp(1))
# mirror tm's implementation
dfm_tfidf(dfmat_inaug,
scheme_tf = "prop",
scheme_df = "inverse",
base = 2)
```
### Trimming
As mentioned above, most dfms are very sparse. Sparsity above 90% is the norm in most text analysis applications, especially if documents are short. Most features simply do not appear in a document. Very high levels of sparsity can result in convergence issues for topic models or unsupervised scaling approaches such as Wordfish [@slapin08wordfish]. In turn, if we keep the most frequent terms in our dfm, many models might be heavily influenced by these very frequent terms, and potential differences across documents will not be uncovered. Let's take the following example: as we have seen above, a large proportion of the tokens in a document are punctuation characters and stopwords. If we keep these features in our dfm, these terms will drive the results and, possibly, lead us to underestimate differences in topic prevalence or policy positions. @sec-ml-unsupervised-scaling and @sec-ml-topicmodels cover these issues in more detail.
The function `dfm_trim()` allows you to exclude very infrequent or very frequent terms from your dfm before proceeding with the analysis to avoid these issues or (at least) to assess the validity of the results when you exclude very frequent and infrequent terms. We can trim features in terms of absolute frequencies, proportions, ranks, or quantiles. In addition, we can reduce the size based on document frequencies ("In how many documents does a feature appear?") or term frequencies ("How often does a feature appear across all documents?"). We explain all of these arguments in the example below. We use the corpus of hotel reviews as an example; the printed summary of each trimmed dfm reports the remaining number of features and the sparsity, which provides an overview of the reduction.
```{r}
#| eval: false
# tokenise corpus of hotel reviews without any tokens removal
toks_hotels <- tokens(TAUR::data_corpus_TAhotels)
# no processing apart from lower-casing terms (quanteda's default)
dfmat_hotels <- dfm(toks_hotels)
# keep only features occurring at least 10 times (min_termfreq = 10)
# and in at least 10 documents (min_docfreq = 10)
dfm_trim(dfmat_hotels, min_termfreq = 10, min_docfreq = 10)
# keep only features occurring at least 10 times (min_termfreq = 10)
# and in at least 40% of the documents (min_docfreq = 0.4);
# docfreq_type = "prop" treats min_docfreq as a proportion, not a count
dfm_trim(dfmat_hotels, min_termfreq = 10, min_docfreq = 0.4,
         docfreq_type = "prop")
# keep only features occurring at most 10 times (max_termfreq = 10)
# and in at most 200 documents (max_docfreq = 200)
dfm_trim(dfmat_hotels, max_termfreq = 10, max_docfreq = 200)
# keep only features occurring at most 10 times and
# in at most 3/4 of the documents (docfreq_type = "prop")
dfm_trim(dfmat_hotels, max_termfreq = 10, max_docfreq = 0.75,
         docfreq_type = "prop")
# keep only features at or above the 20th percentile of term frequencies
# (min_termfreq = 0.2 with termfreq_type = "quantile")
# and appearing in at most 2 documents
dfm_trim(dfmat_hotels, min_termfreq = 0.2, max_docfreq = 2,
         termfreq_type = "quantile")
```
### Grouping and Subsetting
By default, a dfm contains as many documents as the input tokens object. Yet, in some cases you might want to **combine documents in a dfm by a grouping variable**. Instead of analysing each speech delivered by a politician, we might want to aggregate speeches to the level of parties. If we want to understand differences between hotel reviews with low and high ratings, we could group our dfm by rating rather than review.
The function `dfm_group()` sums up cell frequencies within groups, determined by a document level variable and creates new "documents" with the group labels. Let's introduce `dfm_group()` with an example: we transform the corpus of `r ndoc(TAUR::data_corpus_TAhotels) |> format(big.mark = ",")` hotel reviews (`data_corpus_TAhotels`) to a dfm, and then group this dfm by the "Rating" document-level variable, indicating the reviewer's satisfaction with the hotel (ranging from 1 to 5 stars). The grouped dfm should therefore consist of five "documents", with each document summing up cell frequencies of all documents with the same rating.
```{r}
# tokenise the corpus and create a document feature matrix in one go
dfmat_hotels <- TAUR::data_corpus_TAhotels |>
tokens(remove_punct = TRUE) |> # remove punctuation
tokens_remove(pattern = stopwords("en")) |> # remove stopwords
dfm() # create dfm
# inspect dfm: first five documents and first five features
print(dfmat_hotels, max_ndoc = 5, max_nfeat = 5)
# now group this dfm consisting of 20,491 reviews
# based on Rating document-level variable
dfmat_hotels_grouped <- dfm_group(dfmat_hotels, groups = Rating)
# inspect the grouped dfm: all documents and first five features
print(dfmat_hotels_grouped, max_ndoc = 5, max_nfeat = 5)
```
The grouped dfm consists of only `r ndoc(dfmat_hotels_grouped)` documents, while the number of features is the same as in the review-level dfm (`dfmat_hotels`). The total count of each feature does not change because `dfm_group()` simply sums up the cell counts. Inspecting the grouped dfm, we see that `nice` appears `r dfm_select(dfmat_hotels_grouped, "nice") |> dfm_subset(Rating == "1") |> sum()` times in 1-star reviews, while the term appears `r dfm_select(dfmat_hotels_grouped, "nice") |> dfm_subset(Rating == "5") |> sum() |> format(big.mark = ",")` times in 5-star reviews.
In @sec-quanteda-corpus, we introduced `corpus_subset()` to filter documents based on one or more document-level variables. We can do the same with dfms by using `dfm_subset()`. For example, we can keep only 1-star reviews to investigate word frequencies in very negative reviews.
```{r}
# get overview of reviews by rating
table(dfmat_hotels$Rating)
# subset dfmat_hotels by keeping only 1-star reviews
dfmat_hotels_1star <- dfm_subset(dfmat_hotels, Rating == "1")
# check that subsetting worked as expected by
# creating a cross-table of the Rating docvars
table(dfmat_hotels_1star$Rating)
```
::: {.callout-tip appearance="simple"}
## Subset your objects wisely
Tokenisation is the bottleneck operation in processing texts for quantitative analyses. If you know that you want to exclude certain documents from all analyses (for instance, all documents published prior to a specific date), we recommend filtering out these documents from the corpus object using `corpus_subset()` _before_ tokenisation rather than excluding the documents from the tokens object or dfm. Having said that, keep in mind that excluding documents may lead to selection biases and other problems [@grimmer22textasdata: ch. 3].
:::
### Converting a dfm Object for Further Use
Document-feature matrices are the input for many statistical analyses of textual data. The **quanteda** package infrastructure includes many text models that work directly with a **quanteda** dfm object. Other packages require a dfm in a slightly different format. The function `convert()` provides easy conversion from a dfm to the document-term representations used in other text analysis packages.
**quanteda**'s `convert()` function allows users to convert a dfm object to the following formats.
- [`lda`](https://cran.r-project.org/web/packages/lda/lda.pdf): a list with components "documents" and "vocab" as needed by the function `lda.collapsed.gibbs.sampler` from the **lda** package [@lda].
- [`tm`](https://cran.r-project.org/web/packages/tm/index.html): a DocumentTermMatrix from the **tm** package [@JSSv025i05].
- [`stm`](https://cran.r-project.org/web/packages/stm/index.html): the format for the **stm** package [@stm] for structural topic models. Note: the **stm** package also allows you to input a **quanteda** object directly, and no conversion is required.
- [`topicmodels`](https://cran.r-project.org/web/packages/topicmodels/index.html): the `dtm` format as used by the **topicmodels** package [@topicmodels] for Latent Dirichlet Allocation (LDA) models and Correlated Topics Models (CTM).
In addition, the dfm can be converted to a `data.frame` or to a `tripletlist` consisting of `document`, `feature`, and `frequency`.
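As a minimal sketch, the chunk below converts a tiny dfm to a `data.frame` and to a `tripletlist`. The two example texts and the object names are made up for illustration. Converting to the formats of other packages works analogously, but requires the respective package to be installed.

```{r}
library("quanteda")
# small example dfm (texts invented for illustration)
dfmat_conv <- c(doc1 = "I love quanteda.",
                doc2 = "Text analysis is fun.") |>
  tokens() |>
  dfm()
# convert to a data frame: one row per document, one column per feature,
# plus a leading column with the document identifier
df_conv <- convert(dfmat_conv, to = "data.frame")
df_conv
# convert to a triplet list of documents, features, and frequencies
triplet_conv <- convert(dfmat_conv, to = "tripletlist")
str(triplet_conv)
```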
Many recently developed R packages, such as **keyATM** [@keyatm] for keyword-assisted topic models, **LSX** [@lsx] for latent semantic scaling, or **conText** [@context] for 'a la Carte' on Text Embedding Regression, rely on **quanteda** and allow you to input processed `tokens()` or `dfm()` objects.
The popular **tidytext** package [@tidytextJOSS] includes the function `cast_dfm()` which allows you to cast a data frame to a **quanteda** dfm (see also @sec-furthertidy).
### Creating and Processing a Feature Co-Occurrence Matrix
_To be added._ Also reference @sec-lms-embeddings which will explain how to create word embeddings based on an fcm object.
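Until this section is complete, the following chunk gives a minimal sketch of how an fcm can be constructed with **quanteda**'s `fcm()` function. It reuses the two example sentences from the beginning of the chapter. The window size of 2 and the processing choices (removing punctuation, lower-casing) are assumptions made for illustration, not recommendations.

```{r}
library("quanteda")
# tokenise the two example sentences; remove punctuation and lower-case tokens
toks_fcm <- c(doc1 = "I love quanteda. I love text analysis.",
              doc2 = "Text analysis with quanteda is fun.") |>
  tokens(remove_punct = TRUE) |>
  tokens_tolower()
# count how often features co-occur within a window of 2 tokens
fcmat_window <- fcm(toks_fcm, context = "window", window = 2)
# alternatively, count co-occurrences within entire documents
fcmat_doc <- fcm(toks_fcm, context = "document")
# an fcm is a feature-by-feature matrix
dim(fcmat_window)
print(fcmat_window)
```

In contrast to a dfm, both the rows and the columns of an fcm are features, so the matrix is square.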
## Advanced
We introduce two more operations for dfm objects, which are useful---and required---for many analyses, but slightly more advanced: smoothing and matching dfms.
### Smoothing
As described in the Methods section above, smoothing a dfm implies that zero counts in a dfm are changed to a constant other than zero. Smoothing is required for models that do not work with zero probabilities, for instance Naive Bayes classifiers. We return to the minimal dfm object created above to show how to smooth your `dfm()`.
```{r}
# print dfm object
print(dfmat)
# smooth dfm by adding a constant of 1 to all cells
dfm_smooth(dfmat, smoothing = 1)
```
```{r}
#| eval: false
# equivalent to:
dfm_weight(dfmat, smoothing = 1)
```
```{r}
# smooth dfm by adding a constant of 0.5 to all cells
dfm_smooth(dfmat, smoothing = 0.5)
```
### Matching Two Document-Feature Matrices
Finally, we show how to match the features of two dfms. Matching features is required for some machine learning classifiers that can only consider features occurring in the training set for predictions of documents in a second dfm. More details are provided in @sec-ml-classifiers.
Let's split the corpus of US inaugural speeches into two objects: one corpus containing speeches delivered between 1945 and 1990, and another one including speeches delivered since 1990.
Afterwards, we match the feature set of one dfm with the features existing in the other dfm.
```{r}
# corpus consisting of speeches between 1945 and 1990
corp_pre1990 <- corpus_subset(data_corpus_inaugural, Year > 1945 & Year <= 1990)
# check that subsetting worked as expected
# by inspecting the Year document-level variable
docvars(corp_pre1990, "Year")
# corpus consisting of speeches since 1990
corp_post1990 <- corpus_subset(data_corpus_inaugural, Year > 1990)
# check that subsetting worked as expected
# by inspecting the Year document-level variable
docvars(corp_post1990, "Year")
# create two dfms
dfmat_pre1990 <- corp_pre1990 |>
tokens() |>
dfm()
dfmat_post1990 <- corp_post1990 |>
tokens() |>
dfm()
# match dfms: make the features of dfmat_pre1990 identical to those of
# dfmat_post1990 (drops other features, pads missing ones with zeros)
dfmat_pre1990matched <- dfm_match(dfmat_pre1990, features = featnames(dfmat_post1990))
# inspect dfm
print(dfmat_pre1990matched)
# match dfms: make the features of dfmat_post1990 identical to those of
# dfmat_pre1990 (drops other features, pads missing ones with zeros)
dfmat_post1990matched <- dfm_match(dfmat_post1990, features = featnames(dfmat_pre1990))
# inspect dfm
print(dfmat_post1990matched)
```
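What `dfm_match()` does is easiest to see on a tiny example. The sketch below uses two made-up dfms (the texts and the object names `dfmat_train` and `dfmat_test` are hypothetical) and checks that the matched dfm has exactly the same features, in the same order, as the dfm it was matched to.

```{r}
library("quanteda")
# hypothetical "training" and "test" dfms
dfmat_train <- dfm(tokens(c(d1 = "good hotel", d2 = "bad hotel")))
dfmat_test <- dfm(tokens(c(d3 = "good location")))
# match the test dfm to the features of the training dfm
dfmat_test_matched <- dfm_match(dfmat_test, features = featnames(dfmat_train))
# "location" is dropped; "hotel" and "bad" are added with zero counts
print(dfmat_test_matched)
# feature sets are now identical
identical(featnames(dfmat_test_matched), featnames(dfmat_train))
```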
## Further Reading
- One of the first, and possibly most famous bag-of-words applications: @Most63
- A deep dive into weighting document-feature matrices: @manning08iir [ch. 6]
- Document-feature matrices, processing, and why the default options are often not ideal: @grimmer22textasdata [ch. 5]
## Exercises
Add some here.