---
title: "Data Science: Text Mining with R"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: scroll
    css: style.css
---
# Intro {.sidebar}
This dashboard contains the materials for the course [S41: Data Science: Text Mining with R](https://utrechtsummerschool.nl/courses/social-sciences/data-science-introduction-to-text-mining-with-r).
---
<!-- <center> -->
<!-- ![](logo1.png){width=100%} -->
<!-- </center> -->
<!-- --- -->
<!-- <!-- ADD COURSE INFO -->
<!-- <!-- Instructor: FILL -->
<!-- <!-- Study load: FILL -->
<!-- <!-- Assessment: FILL -->
<!-- --- -->
Course director: [Ayoub Bagheri](https://www.uu.nl/staff/ABagheri)
Instructors:
- [Qixiang Fang](https://www.uu.nl/staff/QFang)
- [Luka van der Plas](https://www.uu.nl/medewerkers/LPvanderPlas)
- [Hugh Mee Wong](https://www.uu.nl/medewerkers/HMWong)
- [Ayoub Bagheri](https://www.uu.nl/staff/ABagheri)
<!-- - [Pablo Mosteiro](https://www.uu.nl/medewerkers/PJMosteiroRomero)\ -->
<!-- - [Laurence Frank](https://www.uu.nl/medewerkers/QFang)\ -->
Study load: 1.5 ECTS
Location: [Koningsberger Building, Room 224](https://www.uu.nl/en/victor-j-koningsberger-building)
---
# Quick Overview
## Column 1
### Outline
From the social sciences to the humanities and healthcare, much of today's data is contained in text. However, text is unstructured information that is difficult to process automatically. Text mining creates a more structured representation of a text, making its content more accessible to researchers. This course provides a comprehensive introduction to text mining with R. It has a strong practical focus: students gain experience in applying text mining to real data from, for example, the social science and healthcare domains, and in interpreting the results. Through lectures and labs, students learn the skills necessary to design, implement, and understand their own text mining pipeline. Topics covered include regular expressions, text preprocessing, text classification and clustering, and word embedding approaches for text data.
After completing the course, participants will be able to:
* Understand and explain the fundamental approaches to text mining;
* Understand and apply current methods for analyzing texts;
* Understand how text is handled, manipulated, preprocessed and cleaned;
* Define a text mining pipeline given a practical data science problem;
* Implement generic text mining tools such as regular expressions, text clustering, text classification, sentiment analysis, and word embeddings.
The course starts at a very basic level and builds up gradually. By the end of the course, participants will have mastered text mining skills with R.
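To make the idea of a "more structured representation" concrete, here is a minimal sketch in base `R`. It is an illustration only and not part of the course materials: the two example sentences are made up, and a document-term matrix (word counts per document) is just one common structured representation among several.

```r
# Two toy documents (made up for this illustration; not course data)
docs <- c(
  "Text mining turns raw text into data.",
  "Raw text is messy; text mining helps!"
)

# Preprocess: lowercase, then replace everything that is not a letter or a space
clean <- tolower(docs)
clean <- gsub("[^a-z ]", " ", clean)

# Tokenize on runs of spaces and drop any empty strings
tokens <- lapply(strsplit(clean, " +"), function(x) x[x != ""])

# Vocabulary: every distinct word across all documents
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in each document
counts <- vapply(
  tokens,
  function(x) as.integer(table(factor(x, levels = vocab))),
  integer(length(vocab))
)

# Document-term matrix: one row per document, one column per word
dtm <- t(counts)
rownames(dtm) <- paste0("doc", seq_along(docs))
colnames(dtm) <- vocab

dtm
```

Each row is a document and each column a word, so `dtm["doc1", "text"]` gives the number of times "text" occurs in the first sentence.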
### Prerequisites
Participants should have a basic knowledge of scripting in R.
### Requirements
Participants are requested to bring their own laptop for the lab meetings.
### Certificate
Participants will receive a certificate at the end of the course.
### Additional references
1. Jurafsky, D., & Martin, J. H. (2024). Speech and language processing (3rd ed.). Find online chapters [here](https://web.stanford.edu/~jurafsky/slp3/)
2. Eisenstein, J. (2018). Natural language processing. Find online chapters [here](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
3. Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach. O'Reilly Media, Inc. Find the book [here](https://www.tidytextmining.com/)
## Column 2
### Daily schedule
| Start time | End time | Type |
|:-----------|:---------|:----------|
| 09:00 | 10:30 | Lecture |
| |**Break** | |
| 10:45 | 11:45 | Practical |
| 11:45 | 12:30 | Discussion|
| |**[Lunch](https://www.uu.nl/en/vening-meineszgebouw-a)** | |
| 14:00 | 15:30 | Lecture |
| |**Break** | |
| 15:45 | 16:30 | Practical |
| 16:30 | 17:00 | Discussion|
# How to Prepare
## Column 1
### Preparing yourself and your machine for the course
If you have no experience with `R` or another programming language, you will need to catch up before and during the course.
Some good sources are:
- The first two chapters of [Introduction to R on DataCamp](https://www.datacamp.com/courses/free-introduction-to-r)
- Install `R`, play around, and [read the workflow basics chapter in Hadley Wickham's R for Data Science](http://r4ds.had.co.nz/workflow-basics.html#workflow-basics)
- Interactive `R` course: install `R` as in the previous point, then type the following lines in the console one by one:
```r
install.packages("swirl")
library(swirl)
swirl()
```
and follow the guide to run the `R Programming: The basics of programming in R` interactive course.
### System requirements
Bring a laptop computer to the course and make sure that you have full write access and administrator rights to the machine. We will explore programming and compiling in this course, so you need full access to your machine. Some corporate laptops give their users only limited access; we therefore advise you to bring a personal laptop computer if you have one.
### **1. Install `R`**
`R` can be obtained [here](https://cran.r-project.org). We won't use `R` directly in the course, but rather call `R` through `RStudio`. `R` must still be installed, because `RStudio` does not work without it.
### **2. Install `RStudio` Desktop**
`RStudio` is an Integrated Development Environment (IDE). It can be obtained as stand-alone software [here](https://www.rstudio.com/products/rstudio/download/#download). The free and open-source `RStudio Desktop` version is sufficient.
### **3. Start `RStudio` and install the following packages**
Execute the following lines of code in the console window:
```{r eval=FALSE, echo = TRUE}
install.packages(c("ggplot2", "tidyverse", "dplyr", "magrittr", "xlsx",
                   "wordcloud", "stringr", "caret", "knitr", "rmarkdown",
                   "plotly", "e1071", "SnowballC", "devtools", "rpart", "proxy",
                   "topicmodels", "tidyr", "dbscan", "text2vec", "tidytext",
                   "tensorflow", "keras"),
                 dependencies = TRUE)
```
If you are not sure where to execute code, use the following figure to identify the console:
<center>
<img src="console.png" alt="The console pane in RStudio" width="70%">
</center>
Just copy and paste the installation command and press the return key. When asked
```{r eval = FALSE, echo = TRUE}
Do you want to install from sources the package which needs
compilation? (Yes/no/cancel)
```
type `Yes` in the console and press the return key.
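Once the installation has finished, one way to confirm that nothing was skipped is to compare the requested package names with what `R` can find on your machine. The snippet below is a minimal sketch using base `R` only; the variable names `pkgs` and `missing_pkgs` are chosen for this illustration.

```r
# Same package names as in the installation command above
pkgs <- c("ggplot2", "tidyverse", "dplyr", "magrittr", "xlsx",
          "wordcloud", "stringr", "caret", "knitr", "rmarkdown",
          "plotly", "e1071", "SnowballC", "devtools", "rpart", "proxy",
          "topicmodels", "tidyr", "dbscan", "text2vec", "tidytext",
          "tensorflow", "keras")

# Names of requested packages that are not in any of your R libraries
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

if (length(missing_pkgs) == 0) {
  message("All course packages are installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```

If any packages are reported as missing, rerunning `install.packages()` for just those names and reading the error messages is usually the quickest way to find out what went wrong.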
### Required `R` knowledge
The following is the minimum you should know about `R` before starting the first practical:
- What is `R` (a fancy calculator) and what is an `.R` file (a recipe for calculations)
- What is an `R` package (a set of functions you can download to use in your own code)
- How to run `R` code in `RStudio`
- What is a variable `x <- 10`
- What is a function `y <- fun(x = 10)`
- Understand what the following statements do (tip: you may run them in `R` line by line)
```r
y <- "What?"
x <- "R!"
z <- paste(x, "No, text mining is the best.", y)
rep(z, 3)
1:10
sample(1:20, 4)
sample(1:20, 40, replace = TRUE)
z <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)
z^2
z == 2
z > 2
install.packages("dplyr")
library(dplyr)
```
- Be able to read the help file of any function (e.g., type `?plot` in the console)
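As a minimal sketch of the variable and function items in the list above: `fun` in that list is only a placeholder, so the example below defines a small function of its own. The name `add_one` is made up for this illustration.

```r
# A variable stores a value under a name
x <- 10

# A function takes inputs (arguments) and returns an output.
# `add_one` is a hypothetical example, not a function from any package.
add_one <- function(x) {
  x + 1
}

# Call the function with a named argument and store the result
y <- add_one(x = 10)
y

# Built-in functions are called in the same way
nchar("text mining")
toupper("text mining")
```

Typing a function's name without parentheses, for example `add_one`, prints its definition, which is a handy complement to reading help files.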
## Column 2
### What if the steps to the left do not work for me?
If all else fails and you have insufficient rights to your machine, the following web-based service offers a solution.
- Open a free account on [rstudio.cloud](https://rstudio.cloud). You can run your own cloud-based `RStudio` environment there. You will need internet access to use this service.
# Monday
## Column 1
### Materials
We adapt the course as we go. To ensure that you work with the latest iteration of the course materials, we advise all course participants to access the materials online. All lectures are in HTML format. Practicals are walkthrough files that guide you through the exercises; use the show/hide code button in front of each question when you need a tip.
Here you will find the materials for Monday:
- Part 1: Introduction
    - [Lecture 1](Monday/Lectures/Lecture 1/Lecture_1.html)
    - [Lecture 1 Handout](Monday/Lectures/Lecture 1/Lecture_1_handout.pdf)
    <!-- - [Impractical 1](Monday/Practicals/Practical 1/Impractical_1.html) -->
    - [Practical 1](Monday/Practicals/Practical 1/Practical_1.html)
    - [Data for practical 1](Monday/Practicals/Practical 1/data.zip)
- Part 2: Text preprocessing
    - [Lecture 2](Monday/Lectures/Lecture 2/Lecture_2.pdf)
    - [Lecture 2 Handout](Monday/Lectures/Lecture 2/Lecture_2_Handout.pdf)
    <!-- - [Impractical 2](Monday/Practicals/Practical 2/Impractical_2.html) -->
    - [Practical 2](Monday/Practicals/Practical 2/Practical_2.html)
    - [Data for practical 2](Monday/Practicals/Practical 2/data.zip)
    - [Data (lecture and practical)](Monday/Practicals/Practical 2/Data (lecture and practical).zip)
## Column 2
### Additional references
- Chapters 1, 2, 3 of Ref 1
- Chapter 1 of Ref 2
- Chapters 1, 3, 4, 5 of Ref 3
# Tuesday
## Column 1
### Materials
We adapt the course as we go. To ensure that you work with the latest iteration of the course materials, we advise all course participants to access the materials online. All lectures are in HTML format. Practicals are walkthrough files that guide you through the exercises; use the show/hide code button in front of each question when you need a tip.
Here you will find the materials for Tuesday:
- Part 3: Text representation & classification
    - [Lecture 3](Tuesday/Lectures/Lecture 3/Lecture_3.html)
    - [Lecture 3 Handout](Tuesday/Lectures/Lecture 3/Lecture_3_Handout.pdf)
    <!-- - [Impractical 3](Tuesday/Practicals/Practical 3/Impractical_3.html) -->
    - [Practical 3](Tuesday/Practicals/Practical 3/Practical_3.html)
    - [Data for practical 3](Tuesday/Practicals/Practical 3/news_dataset.zip)
- Part 4: Sentiment analysis
    - [Lecture 4](Tuesday/Lectures/Lecture 4/Lecture_4.html)
    - [Lecture 4 Handout](Tuesday/Lectures/Lecture 4/Lecture_4_Handout.pdf)
    <!-- - [Impractical 4](Tuesday/Practicals/Practical 4/Impractical_4.html) -->
    - [Practical 4](Tuesday/Practicals/Practical 4/Practical_4.html)
    - [Data for practical 4](Tuesday/Practicals/Practical 4/data.zip)
## Column 2
### Additional references
- Chapters 4, 5, 20 of Ref 1
- Chapters 2, 3, 4 of Ref 2
- Chapter 2 of Ref 3
# Wednesday
## Column 1
### Materials
We adapt the course as we go. To ensure that you work with the latest iteration of the course materials, we advise all course participants to access the materials online. All lectures are in HTML format. Practicals are walkthrough files that guide you through the exercises; use the show/hide code button in front of each question when you need a tip.
Here you will find the materials for Wednesday:
- Part 5: Feature selection & text clustering
    - [Lecture 5](Wednesday/Lectures/Lecture 5/Lecture_5.html)
    - [Lecture 5 Handout](Wednesday/Lectures/Lecture 5/Lecture_5_Handout.pdf)
    <!-- - [Impractical 5]() -->
    - [Practical 5](Wednesday/Practicals/Practical 5/Practical_5.html)
    - [Data for practical 5](Wednesday/Practicals/Practical 5/data.zip)
- Part 6: Topic modeling
    - [Lecture 6](Wednesday/Lectures/Lecture 6/Lecture_6.html)
    - [Lecture 6 Handout](Wednesday/Lectures/Lecture 6/Lecture_6_Handout.pdf)
    <!-- - [Impractical 6](Wednesday/Practicals/Practical 5/Impractical_5.html) -->
    - [Practical 6](Wednesday/Practicals/Practical 6/Practical_6.html)
    - [Data for practical 6](Wednesday/Practicals/Practical 6/data.zip)
## Column 2
### Additional references
- Chapters 6 and 7 of Ref 1
- Chapters 5 and 14 of Ref 2
- Chapter 6 of Ref 3
# Thursday
## Column 1
### Materials
We adapt the course as we go. To ensure that you work with the latest iteration of the course materials, we advise all course participants to access the materials online. All lectures are in HTML format. Practicals are walkthrough files that guide you through the exercises; use the show/hide code button in front of each question when you need a tip.
Here you will find the materials for Thursday:
- Part 7: Word embedding
    - [Lecture 7](Thursday/Lectures/Lecture 7/Lecture_7.pdf)
    - [Lecture 7 Handout](Thursday/Lectures/Lecture 7/Lecture_7.pdf)
    <!-- - [Impractical 7](Wednesday/Practicals/Practical 6/Impractical_6.html) -->
    - [Practical 7](Thursday/Practicals/Practical 7/Practical_7.html)
    - [Data for practical 7](Thursday/Practicals/Practical 7/data.zip)
- Part 8: Deep learning for text
    - [Lecture 8](Thursday/Lectures/Lecture 8/Lecture_8.html)
    - [Lecture 8 Handout](Thursday/Lectures/Lecture 8/Lecture_8.pdf)
    <!-- - [Impractical 8](Thursday/Practicals/Practical 4/Impractical_4.html) -->
    - [Practical 8](Thursday/Practicals/Practical 8/Practical_8.html)
    - [Data for practical 8](Thursday/Practicals/Practical 8/data.zip)
## Column 2
### Additional references
- Chapters 4, 5, 20 of Ref 1
- Chapters 2, 3, 4 of Ref 2
- Chapter 2 of Ref 3
# Archive
On the last day of the course, all the materials will be available for download as a single zip file:
[Download the Materials](All materials/TM with R_Materials.zip)
We wish all the participants success with their Text Mining projects!