---
title: "Web Scraping & Text cleaning"
author: "Hamza Saadi"
date: "September 20, 2018"
output: html_document
---
<center>
![](Proposition_10.jpg)
</center>
```{r message=FALSE,warning=FALSE,cache=T}
library(tidyverse) # data manipulation and visualization
library(rvest)     # web scraping (read_html, html_nodes, ...)
```
# Part I: Scrape the data from the Lkeria website
## Website introduction
The Lkeria website contains a lot of articles that we may find useful to analyse.
A first look at the `Actualité` section shows articles on a range of different topics.
<center>
![](First_page.png)
</center>
When we select an article, we see a page like the one below, containing several pieces of information.
<center>
![](second_page.png)
</center>
## Web Scraping
This is an example of what an article page's source code looks like:
<center>
![](html_sourcepage.png)
</center>
#
Here we're going to create a function to scrape all available links from a Lkeria page.
```{r message=FALSE,warning=FALSE,cache =TRUE}
Scrap_links <- function(url){   # return all unique href links found on a page
  url %>%
    read_html() %>%             # download and parse the HTML page
    html_nodes("a") %>%         # select all <a> anchor nodes
    html_attr("href") %>%       # extract their href attributes
    unique()                    # drop duplicates
}
```
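As a quick sanity check, we can call it on the first listing page. This is a minimal sketch (not evaluated here), assuming the `-P1` suffix used in the next chunk and a reachable site:
```{r message=FALSE,warning=FALSE,eval=FALSE}
# Quick check (not run by default): list the first few links found on page 1 of "Actualité"
head(Scrap_links("https://www.lkeria.com/actualité-P1"), 5)
```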
#
Since all the `Actualité` listing pages on the Lkeria website share one root URL, which is:
```{r message=FALSE,warning=FALSE,cache =TRUE}
url <- "https://www.lkeria.com/actualité-P" # The root Link.
links <- lapply(paste0(url, # list_apply function "Scrap_link"
c(1:72)), # on all avaible "Actualité"
Scrap_links) # links on Lkeria website to return list of vectors of links
```
#
Let's take a look at the links:
```{r message=FALSE,warning=FALSE,cache =TRUE}
head(links[[1]],10) # print the first 10 links scraped from the first page
```
#
Since it's a list of vectors of links, we need to keep only the links that match a specific pattern.
```{r message=FALSE,warning=FALSE,cache =TRUE}
# Define indices to use later on.
indices <- lapply(X = links,               # apply grep to each vector of links
                  FUN = grep,
                  pattern = "^actualité/") # return the indices of links that start with "actualité/"
head(indices[[1]])                         # indices of matching links on the first page
```
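To see why the `^` anchor matters, here is a tiny check on made-up href values (hypothetical strings, not actual site output): only the relative link that starts with `actualité/` is kept.
```{r message=FALSE,warning=FALSE,cache=TRUE}
# Hypothetical href values to illustrate the pattern: only the first one matches
grep("^actualité/",
     c("actualité/immobilier/exemple-article",  # relative article link -> kept
       "credit-immobilier",                     # other internal link   -> dropped
       "https://www.lkeria.com/actualité-P2"))  # pagination link       -> dropped
```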
Now that we have all the links and indices we need, let's wrap it up:
```{r message=FALSE,warning=FALSE,cache =TRUE}
# Function to combine the list of links with its indices:
scrap_lkeria <- function(x,l,i){  # x: page index, l: list of links, i: list of indices
  return(l[[x]][i[[x]]])          # keep only the article links on page x
}
Thelinks <- sapply(c(1:72),       # apply scrap_lkeria to every page
                   FUN = scrap_lkeria,
                   USE.NAMES = F,
                   l = links,
                   i = indices) %>%
  unlist(use.names = F)           # flatten the list of vectors into one character vector
head(Thelinks)
```
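Since `purrr` is loaded with the tidyverse, the same combination of `links` and `indices` could also be written with `map2`; a sketch of that alternative (not evaluated here):
```{r message=FALSE,warning=FALSE,eval=FALSE}
# Alternative sketch with purrr: same result as scrap_lkeria + sapply above
Thelinks_alt <- map2(links, indices, ~ .x[.y]) %>% # subset each page's links by its indices
  unlist(use.names = F)                            # flatten into a character vector
```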
#
So far so good, but we still need to prepend the domain to turn these into usable links.
```{r message=FALSE,warning=FALSE,cache =TRUE}
#Easy Step
Thelinks <- paste0("https://www.lkeria.com/",Thelinks)
head(Thelinks)
print(paste0("Number of links: ",length(Thelinks)))
```
#
Getting the links was the easy part; now let's start scraping the articles using `Thelinks`. But first we need to define the patterns we'll use for scraping and substitution with *regular expressions*.
### Comments:
- `<.*?>` To remove all HTML tags.
- `http[^[:space:]]*` To remove all URLs that start with http or https.
- `\\(adsbygoogle=window.adsbygoogle\\|\\|\\[\\]\\).push\\(\\{\\}\\);` To remove Google ads.
- `\\[.*?\\]` In case there are tags like `[caption][/caption]` etc.
- Also remove the writers' names, since only a few names appear (listed in the pattern below).
```{r message=FALSE,warning=FALSE,cache =TRUE}
the_pattern <- paste("<.*?>",
                     "http[^[:space:]]*",
                     "\\(adsbygoogle=window.adsbygoogle\\|\\|\\[\\]\\).push\\(\\{\\}\\);",
                     "\\[.*?\\]",
                     "Idir Zidane",
                     "Izouaouen Noreddine",
                     "Lotfi Ramdani",
                     "Nabil Walid",
                     "Rédaction Lkeria",
                     "Walid Nsaibia",
                     sep = "|")
# Define the Scrap_articles function:
Scrap_articles <- function(url){
  html <- url %>%                                # read the HTML page from the URL input
    read_html()
  title <- html %>%                              # get the title from
    html_nodes(".content-title") %>%             # the .content-title node
    html_text() %>%                              # convert from HTML to text
    iconv(from = "UTF-8","latin1", sub="") %>%   # drop characters that can't be converted to latin1
    str_trim()                                   # remove leading and trailing white spaces
  raw_info <- html %>%                           # get all available text from
    html_nodes(".content-meta") %>%              # the .content-meta node
    html_text() %>%                              # convert from HTML to text
    str_split(pattern = "le|par")                # split on the pattern to separate date and writer
  date <- raw_info[[1]][2] %>%                   # get the date
    str_trim()                                   # remove leading and trailing white spaces
  writer <- raw_info[[1]][3] %>%                 # get the writer's name
    str_trim()                                   # remove leading and trailing white spaces
  intro <- html %>%                              # get the introduction section from
    html_nodes(".col-xs-9") %>%                  # the .col-xs-9 node
    html_text() %>%                              # convert from HTML to text
    str_replace_all(pattern = the_pattern,       # remove the unwanted patterns and
                    replacement = " ") %>%       # replace them with white space
    iconv(from = "UTF-8","latin1", sub="") %>%   # drop characters that can't be converted to latin1
    str_trim()                                   # remove leading and trailing white spaces
  if((length(intro) == 0) ||                     # check in case intro is empty
     ( str_detect(intro,"Article modifié le") || # check in case intro is the same as text (defined below)
       nchar(intro) <= 1)){                      # check in case intro is only white space
    intro <- NA                                  # force intro to be NA
  }
  text <- (html %>%
    html_nodes(".content") %>%                   # select the .content node from the HTML page
    html_text() %>%                              # get the text
    str_split(pattern = "Article modifié le"))[[1]][1] %>%       # split on the pattern and keep the article body
    str_replace_all(pattern = the_pattern, replacement = " ") %>% # remove unwanted patterns
    iconv(from = "UTF-8","latin1", sub="") %>%   # drop characters that can't be converted to latin1
    str_trim()                                   # remove leading and trailing white spaces
  return(c(title,date,writer,intro,text,url))    # return all retrieved variables
}
secure_scrap <- possibly(.f = Scrap_articles,    # create a safe version that returns NAs
                         otherwise = c(NA,NA,NA,NA,NA,NA), # in case there are dead links
                         quiet = T)              # skip warnings and messages
```
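To make the effect of `the_pattern` concrete, here is a small test on a made-up snippet (the string below is hypothetical, not actual site content): the tags, the shortcode, the URL and the writer's name should all disappear.
```{r message=FALSE,warning=FALSE,cache=TRUE}
# Hypothetical snippet to illustrate the cleanup
sample_html <- "<p>Exemple</p> [caption]photo[/caption] voir https://example.com par Lotfi Ramdani"
sample_html %>%
  str_replace_all(pattern = the_pattern, replacement = " ") %>%
  str_squish()   # collapse the extra white space left behind
```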
#
After defining the functions we need, let's start the scraping process:
```{r message=FALSE,warning=FALSE,cache =TRUE}
start.time <- Sys.time()            # record the start time
DF <- Thelinks %>%
  map(.f = secure_scrap) %>%        # scrape each link in Thelinks with secure_scrap
  map(function(x) data.frame(t(x),stringsAsFactors = F)) %>% # turn each result into a one-row data frame
  bind_rows()                       # combine the list of data frames into one
names(DF) <- c("Title",             # define column names
               "Publication.Date",
               "Writer",
               "Introduction",
               "Text",
               "Link")
end.time <- Sys.time()              # record the end time
time.taken <- end.time - start.time # elapsed time
print(time.taken)
```
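Because `secure_scrap` returns a row of NAs for any dead link, it is worth counting how many links failed before saving. A minimal check, relying on the fact that a failed link leaves every column NA, including `Link`:
```{r message=FALSE,warning=FALSE,cache=TRUE}
# Number of links that came back as all-NA rows (dead or unreadable pages)
sum(is.na(DF$Link))
```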
```{r message=FALSE,warning=FALSE,cache =TRUE}
glimpse(DF)
write.csv(x = DF,                  # object to save
          file = "ScapedFile.csv", # file name
          row.names = F)           # drop row names
```
# Part II: Data Cleaning & Manipulation
From the previous section we have a complete data set of articles, but it still contains some NAs and unclean text.
```{r message=FALSE,warning=FALSE,cache =TRUE}
DF <- read.csv("ScapedFile.csv", # Since it's small datasets we can use base function
encoding = "latin1", # "latin1" to be able to read charachters with "accents"
stringsAsFactors = F) # Drop factors Option
glimpse(DF) # take a glimpse
```
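Before cleaning, a quick look at how many NAs each column contains (a simple check, not part of the original workflow):
```{r message=FALSE,warning=FALSE,cache=TRUE}
# Count missing values per column to see what needs cleaning
colSums(is.na(DF))
```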
### Comments:
Now we need to do some manipulation and cleaning:
* Transform `Publication.Date` to Date format so it becomes useful.
* Remove rows with NA values in the `Publication.Date` variable.
* Create a variable that takes `Introduction` if `Text` is NA, and `Text` otherwise.
* Extract a new variable `Type` from `Link` (the chunk below splits on `/`; a regex-based sketch follows this list).
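For the `Type` extraction, the chunk below splits each link on `/` and keeps the fifth segment; an equivalent regex-based sketch (not evaluated here, and assuming links have the form `https://www.lkeria.com/actualité/<type>/<slug>`) could look like this:
```{r message=FALSE,warning=FALSE,eval=FALSE}
# Regex sketch: capture the 5th path segment of each link (same idea as the strsplit below)
str_match(DF$Link, "^(?:[^/]*/){4}([^/]+)")[,2]
```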
```{r message=FALSE,warning=FALSE,cache =TRUE}
DF$Publication.Date <- DF$Publication.Date %>% # transform Publication.Date from
  as.Date(format="%d/%m/%Y")                   # character to Date type
DF <- DF[!is.na(DF$Publication.Date),]         # keep only rows with a non-NA date
DF$Full.Text <- DF$Text                        # create Full.Text as a copy of Text
DF$Full.Text[is.na(DF$Full.Text)] <- DF$Introduction[is.na(DF$Full.Text)] # fill NAs with Introduction
DF$Full.Text <- DF$Full.Text %>%               # remove residue of HTML/CSS code
  str_replace_all("\\{[^\\}]+\\}|\\.[A-Za-z0-9_]+|#[#A-Za-z0-9_]+|\\W", " ")
sum(is.na(DF$Full.Text))                       # check if there are NAs left in Full.Text
DF$Type <- sapply(DF$Link,                     # create a new variable Type from Link
                  FUN = function(x) strsplit(x,split = "/")[[1]][5], # take the 5th path segment
                  USE.NAMES = F)               # drop names
DF <- DF %>%                                   # drop Introduction, Text and Link
  select(Title,Publication.Date,Writer,Full.Text,Type)
glimpse(DF)                                    # take a glimpse
table(DF$Type)                                 # count observations per Type
write.csv(DF,"CleanedArticles.csv",row.names = F) # always save your progress
Col_class <- sapply(DF,class,USE.NAMES = F)    # record each column's class
```