---
title: "Regex Course"
author: "Leon Werner"
date: "25 11 2023"
output: html_document
---
Content mostly based on the Text Analytics I (HWS 22/23) exercise by Marlene Lutz
and partly on https://campus.datacamp.com/courses/intermediate-regular-expressions-in-r/regular-expressions-writing-custom-patterns?ex=2
# Background
Regular expressions (regex) let you search text for patterns.
For example, you can:

- look for valid email addresses
- look for .de websites
- look for variable names in a big dataset

This is an R Markdown file. It is structured into executable code chunks and normal text (like this).
The next block is a code chunk, but ignore it for now.
```{r setup, include=FALSE}
# this is only to set up the code, don't touch this cell
# remove variables
rm(list = ls())
# load required packages, install only if not installed yet
p_needed = c(
  "tidyverse",
  "knitr",
  "devtools"
)
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)] # all packages that are not installed yet
# install the missing packages
if (length(p_to_install) > 0) {
  install.packages(p_to_install)
}
# install regexplain from GitHub
if (!("regexplain" %in% packages)) {
  devtools::install_github("gadenbuie/regexplain")
}
# load all required packages
pkg_success <- sapply(p_needed, require, character.only = TRUE)
pkg_fail <- names(pkg_success)[which(!pkg_success)]
# set chunk options for the markdown file
knitr::opts_chunk$set(echo = TRUE)
```
# Recap of R
The course starts here.
```{r recap R numbers 1, include=TRUE}
# this is a comment
this is code # but it doesn't work, put a hashtag before it to mark this line as a comment!
# run code either line by line using ctrl + enter or chunk by chunk using the green play button
# you will see the results in the console
1 # this is the value 1 (also code). It will be "printed" (included in the output)
print(1) # most of the time we don't have to write print
2+3 # use R as a calculator, the result will be in the output
2*3
2**3 # power, same as 2^3
```
We don't really know which number belongs to which calculation.
Maybe we should put some annotation in the output. This is where we need strings.
```{r recap R strings 1, include=TRUE}
"This is a string, a sequence of characters, it is enclosed in double quotation marks"
"Strings can include most special characters like !§$%&/()=? and numbers 1234567890"
'Strings can also be enclosed in single quotation marks'
"Or one inside the other 'like this'"
#"Cannot use the same" "type of quotation marks twice" # if uncommented, this line will cause an error
"we could describe outputs, like: now we will print the number 1"
1
```
```{r recap R numbers 2, include=TRUE}
a = 5 # store a value by assigning it to an "object" or "variable" a; variable names have to start with a letter
"value of a"
a # get value of a
get("a") # get value of a
"value of 2*a"
2*a # work with the value of the variable
a <- 2 # assign a different value (using <- instead of =)
"new value of a"
a
"sequence of numbers"
10:13
"a is a variable of class"
class(a)
```
```{r recap R lists, include=TRUE}
list1 = c(a,2,a,4,5) # create a "list" or "vector" of values using c()
"value of list1"
list1
"new value of list1"
list1 = 1:5
list1
# select some values from the list using indices within []
"selected values from list1"
list1[1]
list1[2:4]
"list1 is a variable of class"
class(list1)
"list1[1] is a variable of class"
class(list1[1])
```
Strings are also values. We can assign them to variables, too.
```{r recap R strings 2, include=TRUE}
hello = "Hello" # assign string to variable
world = "World!"
string_list = c(hello, world) # put string variables into one list
"List"
string_list
helloworld = paste(hello, world) # we can also paste the strings together
"combined (pasted) strings"
helloworld
"helloworld is a variable of class"
class(helloworld)
```
```{r recap R bool, include=TRUE}
TRUE
FALSE # these are the two boolean values
bool_list = c(TRUE,FALSE,FALSE,TRUE,FALSE)
list1[bool_list] # we can use a list of boolean values to choose from a list (instead of indices)
"TRUE is of class"
class(TRUE)
```
# Regex
Most of the functions we use come from stringr, which is part of the tidyverse; grep is part of base R.
Both str_detect and grep check whether a text contains a pattern.
Try to figure out why we split the string first.
```{r regex1, include=T}
a_txt = "This is an example text consisting of 20 words with an average of 7 chars per word, test 1234123 1233 2312 1231 "
# we can split strings using str_split (e.g. by space " ")
"split output"
a_txt_split = str_split(a_txt," ")
a_txt_split # this is a nested list, let's unlist it
"unlisted"
a_txt_split = unlist(a_txt_split)
a_txt_split
# we can use grep to get all elements in a list that fit our re (or pattern)
# here we look for the occurrence of "a" in the string
re = "a"
"outputs"
detect_output = str_detect(a_txt_split, pattern = re) # returns a list of booleans
detect_output
grep_output = grep(pattern = re, x = a_txt_split) # returns a list of indices
grep_output
# we can use both to select from the original list
"list selection"
a_txt_split[detect_output]
a_txt_split[grep_output]
```
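The two return types above (booleans vs. indices) have close relatives that return the matching elements directly. The following is a small sketch on a made-up vector `words` (not part of the course data); `grepl` and `grep(..., value = TRUE)` are base R, `str_subset` comes from stringr.

```{r sketch grep variants, include=T}
library(stringr)
words = c("apple", "kiwi", "banana", "plum") # hypothetical example vector
has_a = grepl("a", words)                    # base R counterpart of str_detect: booleans
has_a
with_a = grep("a", words, value = TRUE)      # returns the matching elements instead of indices
with_a
with_a_stringr = str_subset(words, "a")      # stringr counterpart of grep(..., value = TRUE)
with_a_stringr
```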
The next functions stop after the first match in each element (and then move on to the next element in the list).
They all have versions with the suffix `_all` that continue through all occurrences of the pattern.
```{r start str_extract, include=T}
#This extracts whatever matched the pattern
str_extract(a_txt, pattern = re)
"version with all"
str_extract_all(a_txt, pattern = re)
```
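To make the difference in return values easier to see, here is a short sketch with a pattern that matches more than one character. The string `digits_txt` is invented for illustration. Note that `str_extract_all` returns a list with one entry per input string, so `[[1]]` is needed to get the plain character vector.

```{r sketch extract return types, include=T}
library(stringr)
digits_txt = "room 12, floor 3, seat 405"                    # hypothetical example string
first_number = str_extract(digits_txt, "[0-9]+")             # only the first run of digits
first_number
all_numbers = str_extract_all(digits_txt, "[0-9]+")[[1]]     # every run of digits in the string
all_numbers
```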
```{r start str_replace, include=T}
#This replaces whatever matched the pattern
replacement = "" #replace with nothing (delete)
str_replace(a_txt, pattern = re, replacement = replacement)
str_replace(a_txt_split, pattern = re, replacement = replacement)
"version with all"
str_replace_all(a_txt, pattern = re, replacement = replacement)
str_replace_all(a_txt_split, pattern = re, replacement = replacement)
```
# Handy shiny-app
```{r regexplain, include=T}
regexplain::regexplain_gadget()
```
# Exercise 1
In this task you are asked to create regular expressions that meet the specified conditions.
__a)__ Write a regular expression that returns all integer numbers from a text that are surrounded by whitespace.
```{r t1a R, include=T}
a_txt = "This is an example text consisting of 20 words with an average of 7 chars per word, test 1234123 1233 2312 1231 "
```
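Conditions like "surrounded by" can be expressed with lookarounds, which check the neighbouring characters without making them part of the match. The sketch below uses a made-up string with hyphens instead of whitespace, so it is an illustration of the mechanism and not the solution. It also shows why lookarounds help: a plain pattern consumes the separators, so every second item is skipped.

```{r sketch lookarounds, include=T}
library(stringr)
hyphen_txt = "-a-b-c-"                                                   # hypothetical example string
consuming = str_extract_all(hyphen_txt, "-[a-z]-")[[1]]                  # hyphens are part of each match
consuming
with_lookarounds = str_extract_all(hyphen_txt, "(?<=-)[a-z](?=-)")[[1]]  # hyphens are only checked
with_lookarounds
```

In the first call the hyphen after `a` is used up by the first match, so `b` has no hyphen left in front of it.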
__b)__ Write a regular expression that returns all valid years that are surrounded by whitespace in a text. A valid year is a 4-digit number in the range from 0000 to 2022. __(2 pts)__
```{r t1b R, include=T}
b_txt = "test 10001 0000 0100 0001 1111 0011 1234 1999 test 2000 test 2001 test 2010 test 2019 test 2022 test3 2023 test 2024 test test 9999 "
```
__c)__ Write a regular expression that returns all dates in the format YYYY-MM-DD or YYYY/MM/DD from a given text. Make sure that YYYY is a valid year (see task __b)__), MM is a valid month (01 to 12) and DD is a valid day (01 to 31). There is no need to reject dates that do not exist in the calendar, e.g. XXXX-02-31.
```{r t1c R, include=T}
c_txt = "NOT VALID 12001-11-11 also not valid 2001-11-123 and not x2001-11-12-11 VALID 2022-12-31 2022/12/31 2022-09-31 2022-12-05 not valid 2023-12-31 2022-13-31 2022-12-32 1 2012/10-20 2012-10/20"
```
__d)__ Assume you are given a list ``l`` of strings like the one below. Using regular expressions, return a list that contains all elements from ``l`` that **don't contain both the letter ``a`` AND the letter ``e``** and store the result in a variable ``l_filtered``.

__Example:__ _given the list_
``l = c("apple", "cucumber", "tomato", "zucchini", "pumpkin", "pear", "raspberry", "blueberry")``
_you should return_
``l_filtered = c("cucumber", "tomato", "zucchini", "pumpkin", "blueberry")``.
```{r t1d, include=T}
# example list
l = c("apple", "cucumber", "tomato", "zucchini", "pumpkin", "pear", "raspberry", "blueberry")
```
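Filtering for elements that do *not* match a pattern can be done by negating the output of `str_detect`. The following sketch uses an invented vector `ids` and a different condition (no digits) than the task. It assumes a stringr version that supports the `negate` argument (1.4.0 or later, as far as I know).

```{r sketch negated filter, include=T}
library(stringr)
ids = c("abc", "a1", "xyz9", "hello")                      # hypothetical example vector
no_digit = ids[!str_detect(ids, "[0-9]")]                  # negate the boolean vector with !
no_digit
no_digit_negate = str_subset(ids, "[0-9]", negate = TRUE)  # same result via the negate argument
no_digit_negate
```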
__e)__ For the given string ``s`` with 4 lines, change the _whole_ word ``pot`` (i.e. ``pottery`` should not be changed) to ``1234`` only if it is at the start of a line.
```{r t1e, include=T}
s = "
pottery clot pot
pot dot plot hot
spot rot pot got
not shot forgot"
```
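For strings with several lines it matters how `^` is interpreted. By default, stringr treats `^` as the start of the whole string; wrapping the pattern in `regex(..., multiline = TRUE)` makes it match at the start of every line. A small sketch on a made-up string `cat_txt`:

```{r sketch multiline, include=T}
library(stringr)
cat_txt = "cat sat\ncat nap\nthe cat"                                               # hypothetical three-line string
default_mode = str_replace_all(cat_txt, "^cat", "dog")                              # ^ = start of the string
default_mode
multiline_mode = str_replace_all(cat_txt, regex("^cat", multiline = TRUE), "dog")   # ^ = start of each line
multiline_mode
```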
# Exercise 2
```{r loading data, include=T}
# load data
load("Thesis_data_changed.Rda") # data from my thesis (text entries are in randomized order, demographic variables and edit time removed)
# column names in data
colnames(df)
# we can use regex in select() via matches()
IV03 = df %>% select(matches("IV03_01"))
df_without_IV03 = df %>% select(-matches("IV03_01"))
```
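If the data file is not at hand, the same column selection can be tried on a toy data frame. The data frame `toy` and its column names below are invented for illustration and only loosely mimic the naming scheme of `df`.

```{r sketch select matches, include=T}
library(dplyr)
# hypothetical data frame, not the thesis data
toy = data.frame(IV01_01 = 1:2, IV01_02 = 3:4, TIME_01 = c(10, 20), id = c("a", "b"))
iv_cols = toy %>% select(matches("^IV"))       # keep columns whose name starts with IV
colnames(iv_cols)
no_second = toy %>% select(-matches("_02$"))   # drop columns whose name ends in _02
colnames(no_second)
```

Keep in mind that `matches()` ignores case by default.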
a) remove all TIME variables from df
```{r t2a, include=T}
```
b) create a second dataframe with only the first items from all variables
```{r t2b, include=T}
```
c) how many red dogs are mentioned in string_var_2 of df?
```{r t2c, include=T}
```
d) how many mentions of "Geld" or "Job" are there in IV03_01?
How many answers contain multiple mentions of "Job"?
```{r t2d, include=T}
```
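Counting questions like these differ from detection: `str_detect` only says whether a pattern occurs, while `str_count` says how often it occurs in each string. A sketch on made-up answers (the vector `answers` is not taken from the data):

```{r sketch str_count, include=T}
library(stringr)
answers = c("Job und Geld", "Job, Job, Job", "nichts")   # hypothetical text answers
per_answer = str_count(answers, "Geld|Job")              # matches per answer
per_answer
total_job = sum(str_count(answers, "Job"))               # total number of "Job" mentions
total_job
multi_job = sum(str_count(answers, "Job") > 1)           # answers with more than one "Job"
multi_job
```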
e) use ctrl+F to find all comments in this file that contain the string "regex"
# Exercise 3
See if a "FOUND" is followed by more than one "CHECK" before the next "FOUND" in the file Output.txt
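One possible way to approach this, sketched on an invented stand-in string because the actual layout of Output.txt is not shown here: split the text at every "FOUND" and count the "CHECK"s in each piece. For the real file the text would first have to be read in, e.g. with `readLines()` and `paste(collapse = "\n")`.

```{r sketch found check, include=T}
library(stringr)
log_txt = "FOUND a CHECK\nFOUND b CHECK CHECK\nFOUND c"   # hypothetical stand-in for the file content
segments = str_split(log_txt, "FOUND")[[1]][-1]           # one piece per FOUND; drop the text before the first one
checks_per_found = str_count(segments, "CHECK")           # number of CHECKs after each FOUND
checks_per_found
any_multiple = any(checks_per_found > 1)                  # is any FOUND followed by more than one CHECK?
any_multiple
```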
# Solutions
```{r Incomplete solutions Exercise 1, include=T}
# note: inside an R string every regex backslash has to be doubled (\\s, \\d, \\b)
#t1a = "(?<=\\s)[0-9]+(?=\\s)"
#t1b = "(?<=\\s)([0-1]\\d{3}|20[0-1]\\d|202[0-2])(?=\\s)"
#t1c = "(?:\\b(?:[0-1]\\d{3}|20[0-1]\\d|202[0-2])-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\\d|3[01])\\b)|(?:\\b(?:[0-1]\\d{3}|20[0-1]\\d|202[0-2])/(?:0[1-9]|1[0-2])/(?:0[1-9]|[12]\\d|3[01])\\b)"
#t1c2 = "\\b(?:[0-1]\\d{3}|20[0-1]\\d|202[0-2])[-/](?:0[1-9]|1[0-2])(?:(?<=-..)-|(?<=/..)/)(?:0[1-9]|[12]\\d|3[01])\\b"
# (?:...) makes the groups non-capturing, so the whole match (including the separators) is returned
# the \\b word boundaries pin the match to the 4/2/2 digit pattern
# a backreference with [/-] is shorter but doesn't return the result in a nice format
#t1d = "a.*e|e.*a" # a followed by e or e followed by a
#l_filtered = l[!str_detect(l, t1d)]
# \\b is a word boundary
#t1e = "^pot\\b", "1234" # this is wrong: without multiline mode ^ only matches the start of the whole string
```