-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathLab06_Tutorial_function-purrr.html
576 lines (445 loc) · 14 KB
/
Lab06_Tutorial_function-purrr.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
<!DOCTYPE html>
<html lang="" xml:lang="">
<head>
<title>Lab06_function-purrr</title>
<meta charset="utf-8" />
<meta name="author" content="曾子軒 Dennis Tseng" />
<script src="libs/header-attrs-2.13/header-attrs.js"></script>
<link href="libs/remark-css-0.0.1/default.css" rel="stylesheet" />
<link href="libs/remark-css-0.0.1/metropolis.css" rel="stylesheet" />
<link href="libs/remark-css-0.0.1/metropolis-fonts.css" rel="stylesheet" />
</head>
<body>
<textarea id="source">
class: center, middle, inverse, title-slide
# Lab06_function-purrr
## Writing Functions and using purrr
### 曾子軒 Dennis Tseng
### 台大新聞所 NTU Journalism
### 2022/05/26
---
<style type="text/css">
.remark-slide-content {
padding: 1em 1em 1em 1em;
font-size: 28px;
}
.my-one-page-font {
padding: 1em 1em 1em 1em;
font-size: 20px;
/*xaringan::inf_mr()*/
}
</style>
# 今日重點
- Web Scraping
- function
- purrr
---
# Web Scraping - HTML
- 基本架構
- 試錯: 先拿單一網址練習,確定可以再寫迴圈
- 迴圈外: 迴圈外放空的 tibble、index(現在在哪)、length(迴圈長度)
- 迴圈內: 切分讀 html, 讀每個 nodes, 把讀到的 nodes 合併, index 和訊息, **以及休息**
---
# Web Scraping - HTML
```r
# test
library(tidyverse)
library(httr)
library(rvest)
url_test <- "https://tw.feature.appledaily.com/charity/projlist/1"
html <- url %>%
curl::curl(handle = curl::new_handle("useragent" = "Mozilla/5.0")) %>% read_html()
link <- html %>%
html_nodes(".artcatdetails") %>% html_attr("href")
link_detail <- html %>%
html_nodes(".details") %>% html_attr("href")
```
---
# Web Scraping - HTML
```r
# loop
p_read_html <- possibly(read_html, otherwise = NULL)
df_case_list <- tibble()
for(i in 1:10){
url <- str_c("https://tw.feature.appledaily.com/charity/projlist/", i)
html <- url %>%
curl::curl(handle = curl::new_handle("useragent" = "Mozilla/5.0")) %>% p_read_html()
link <- html %>% html_nodes(".artcatdetails") %>% html_attr("href")
link_detail <- html %>% html_nodes(".details") %>% html_attr("href")
df_case_list_tmp <- dtibble(link = link, link_detail = link_detail)
df_case_list <- df_case_list %>% bind_rows(df_case_list_tmp)
print(i)
Sys.sleep(20)
}
```
---
# Web Scraping - HTML
- 考慮
- 如果網址掛了怎麼辦?
- 缺 nodes 產生空值?長度不一樣?
- 如果跑到一半網路斷了怎麼辦?
- 怎樣才比較不容易被發現是爬蟲?
- 怎樣才比較不會出事被告?
- 爬很久怎麼辦?
---
# Web Scraping - HTML
- 對策
- 網址掛了 -> 寫 `ifelse` 處理網址掛掉
- 缺 nodes -> 不管結果如何都要補空值
- 網路斷了 -> 可以考慮把錯誤存起來
- 網路斷了 -> 列印 index 並儲存當下進度
- 不被發現/吿 -> 對人家有禮貌一點
- 爬很久 -> 用 map 爬快點
---
```r
library(polite)
session <- bow("https://tw.feature.appledaily.com/charity/projlist/1", force = TRUE)
result <- scrape(session) %>%
html_node(".artcatdetails") %>%
html_text()
```
---
class: inverse, center, middle
# [範例 - 蘋果日報基金會](https://github.com/p4css/R4CSS_TA_1102/scrape_appledoundation.R)
---
# 寫函數
- Datacamp 上的 [Introduction to Writing Functions in R
](https://learn.datacamp.com/courses/introduction-to-writing-functions-in-r)
- 為什麼要寫函數: DRY, [Do not Repeat Yourself](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)
- 把很多行程式碼組成有意義的區塊
- 有了函數之後,就算有些值改變,只需要更改使用者輸入的參數即可
- 避免複製貼上時出錯
- 時機若對就要自己寫函數: 當你一直複製貼上的時候
- 寫函數的步驟
- 思考你想透過這個函數達成什麼,通常就是複製貼上的部分
- 想一個合適貼切而且易懂的名字,不要寫 `function_01()`
- 列出使用者輸入的參數,放在 `function(){}` 的括號裡面
- 列出函數內部的運作,放在 `function(){}` 的中括號裡面
---
# 寫函數
- 一個不那麼生活化的例子,但很易懂
- 你有個特殊的運算需求,但沒人寫過相關函數
```r
ten_time_add_two <- function(x){
y = x * 10 + 2
print(y)
}
ten_time_add_two(1)
```
```
## [1] 12
```
---
# 寫函數
- 一個生活化的例子
- 你有個特殊的評論需求,但沒人寫過相關函數
```r
bad_reply <- function(name_seller){
value_bad <- "真的很差勁"
stringr::str_c(name_seller, value_bad)
}
bad_reply("賣家b04701103")
```
```
## [1] "賣家b04701103真的很差勁"
```
```r
bad_reply(c("賣家b04701103","賣家r09342011","賣家ooxx"))
```
```
## [1] "賣家b04701103真的很差勁" "賣家r09342011真的很差勁"
## [3] "賣家ooxx真的很差勁"
```
---
# 寫函數 - 注意事項
- 取名字
- 不要太長也不要太短 e.g. `g()`, `first_add3_then_time10()`
- 不要沒意義 e.g. `elegant_function()`
- 不要不一致 e.g. `get_second()` & `GET.THIRD()`
- 相同開頭(prefix) 為佳 e.g. `remove_second()` & `remove_third()`
- 不要惡搞已經有的函數 e.g. `sum <- function(x){mean(x)}`
- 參數 argument
- 使用者輸入的東西跟你想得很不一樣
- 你希望輸入是日期但可能會有數字、字串等等
- 可以加上 `if else` 語句判斷後再執行以免出錯
- 可以不只一個 argument,也可以有預設值
---
# 寫函數 - 注意事項
```r
bad_reply <- function(name_seller, score = 1, value_bad = "真的很差勁"){
if(!is.integer(score)){
print("請重新輸入: score 為 1 - 5 的正整數")
} else {
stringr::str_c(name_seller, value_bad, " 怒給", score, "星")
}
}
bad_reply("賣家b04701103", score = 2)
bad_reply("賣家r09342011", "沒有誠信")
bad_reply("賣家ooxx", score = 1.2)
bad_reply("賣家ooxx", score = "爛")
```
---
# 寫函數 - 注意事項
- 環境
- global, 整個大環境
- local, 函數裡面的環境, 不影響大環境
```r
y = 1
add_five <- function(x){y = x + 5; return(y)}
add_five(3)
```
```
## [1] 8
```
```r
y
```
```
## [1] 1
```
---
# 寫函數 - 注意事項
- 最重要的就是 Do not Repeat Yourself 同時也避免複製貼上出錯
- 名字要好好取,要有意義、有一致性、不要太長也不要太短
- 參數可以有很多個,也可以有預設值
- 使用者輸入的跟你想的可能不一樣,可能需要處理例外的 branch
- 函數的環境跟整個大環境是分開的
---
# purrr
- `map(.x, .f, ...)`
- Apply a function to each element of a list or atomic vector
- 前面放對象,後面放函數 e.g.
<img src="photo/Lab09_map_description.png" width="65%" height="65%" />
ref: [數據科學中的 R 語言](https://bookdown.org/wangminjie/R4DS/purrr.html#purrr-1)
---
# purrr
- 問:底下的 list 要怎麼取每個同學的平均分數?
```r
exams <- list(
student1 = c(100,80,70),
student2 = c(90,60,50),
student3 = c(20,90,55)
)
exams
```
```
## $student1
## [1] 100 80 70
##
## $student2
## [1] 90 60 50
##
## $student3
## [1] 20 90 55
```
---
# purrr
```r
# base
mean(exams$student1)
mean(exams$student2)
mean(exams$student3)
# purrr
exams %>% map(mean)
exams %>% map_dbl(mean)
exams %>% map_df(mean)
```
---
# purrr
- 圖解
<img src="photo/Lab09_map_description02.png" width="65%" height="65%" />
---
# purrr
- 換一張圖解
<img src="photo/Lab09_map_01.jpeg" width="45%" height="45%" />
ref: [@_ColinFay](https://twitter.com/_ColinFay/status/1045257504446443520)
---
# purrr
- 前面放對象,後面放函數 e.g. `map(.x, .f, ...)`
- 前面的對象可以是 vector,也可以是 list
- 函數可以正規表達,也可以用匿名函數
```r
# .x 放 vector
url <- str_c("https://tw.feature.appledaily.com/charity/projdetail/", c("A5135", "A5134", "A5133"))
url %>% map(read_html)
# .x 放 list
exams %>% map(mean)
```
---
# purrr
- 前面放對象,後面放函數 e.g. `map(.x, .f, ...)`
- 前面的對象可以是 vector,也可以是 list
- 函數可以正規表達,也可以用匿名函數
```r
# .f
exams %>% map(function(x){(x + 5)^2})
exams %>% map(~(. + 5)^2)
exams %>% map(. %>% mean() %>% sqrt())
exams %>% map(. %>% mean() %>% `^`(3))
```
---
# Anonymous Function 匿名函數
- function
- 平常寫函數
- 但為了方便也可以不要寫完整,一次性使用
- `.` 點點代表前面的變數/資料
- 匿名函數的形式
- `~ function(x){sqrt(mean(x))}`
- `~ (sqrt(mean(.)))`
- `~ . %>% mean() %>% sqrt()`
---
# purrr
- 有 `map(.x)`, `map2(.x, .y)`, `pmap(.l)`
<img src="photo/Lab09_map_03.jpeg" width="45%" height="45%" /><img src="photo/Lab09_map2_02.jpeg" width="45%" height="45%" />
---
# purrr
- 有 `map(.x)`, `map2(.x, .y)`, `pmap(.l)`
```r
weight <- c(1,1,2)
weight_diff <- list(c(1, 1, 2), c(1, 2, 1), c(2, 1, 1))
mean_weight <- function(x, y){
if(length(x)==length(y)){sum(x*y)/sum(y)}
else{print("length - bad")}
}
map(exams, ~mean_weight(x=., y=weight))
map2(exams, weight, mean_weight)
```
---
# purrr
- 運用在 web scraping 的好時機
- 一次爬**多個網址**的時候
- 舉例來說,爬文章列表不需要 `map`,但是爬每篇文章的時候就可以用
```r
df_zec_main[j:k,] %>%
pull(title_link) %>%
str_c("https://www.zeczec.com", .) %>%
map(function(x){
x %>%
curl::curl(handle = curl::new_handle("useragent" = "Mozilla/5.0")) %>%
read_html()
})
```
---
class: inverse, center, middle
# 結束了!
</textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "github",
"highlightLines": true,
"countIncrementalSlides": false,
"self_contained": true
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
window.dispatchEvent(new Event('resize'));
});
(function(d) {
var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
if (!r) return;
s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
d.head.appendChild(s);
})(document);
(function(d) {
var el = d.getElementsByClassName("remark-slides-area");
if (!el) return;
var slide, slides = slideshow.getSlides(), els = el[0].children;
for (var i = 1; i < slides.length; i++) {
slide = slides[i];
if (slide.properties.continued === "true" || slide.properties.count === "false") {
els[i - 1].className += ' has-continuation';
}
}
var s = d.createElement("style");
s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
var deleted = false;
slideshow.on('beforeShowSlide', function(slide) {
if (deleted) return;
var sheets = document.styleSheets, node;
for (var i = 0; i < sheets.length; i++) {
node = sheets[i].ownerNode;
if (node.dataset["target"] !== "print-only") continue;
node.parentNode.removeChild(node);
}
deleted = true;
});
})();
(function() {
"use strict"
// Replace <script> tags in slides area to make them executable
var scripts = document.querySelectorAll(
'.remark-slides-area .remark-slide-container script'
);
if (!scripts.length) return;
for (var i = 0; i < scripts.length; i++) {
var s = document.createElement('script');
var code = document.createTextNode(scripts[i].textContent);
s.appendChild(code);
var scriptAttrs = scripts[i].attributes;
for (var j = 0; j < scriptAttrs.length; j++) {
s.setAttribute(scriptAttrs[j].name, scriptAttrs[j].value);
}
scripts[i].parentElement.replaceChild(s, scripts[i]);
}
})();
(function() {
var links = document.getElementsByTagName('a');
for (var i = 0; i < links.length; i++) {
if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
links[i].target = '_blank';
}
}
})();
// adds .remark-code-has-line-highlighted class to <pre> parent elements
// of code chunks containing highlighted lines with class .remark-code-line-highlighted
(function(d) {
const hlines = d.querySelectorAll('.remark-code-line-highlighted');
const preParents = [];
const findPreParent = function(line, p = 0) {
if (p > 1) return null; // traverse up no further than grandparent
const el = line.parentElement;
return el.tagName === "PRE" ? el : findPreParent(el, ++p);
};
for (let line of hlines) {
let pre = findPreParent(line);
if (pre && !preParents.includes(pre)) preParents.push(pre);
}
preParents.forEach(p => p.classList.add("remark-code-has-line-highlighted"));
})(document);</script>
<script>
slideshow._releaseMath = function(el) {
var i, text, code, codes = el.getElementsByTagName('code');
for (i = 0; i < codes.length;) {
code = codes[i];
if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
text = code.textContent;
if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
/^\$\$(.|\s)+\$\$$/.test(text) ||
/^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
code.outerHTML = code.innerHTML; // remove <code></code>
continue;
}
}
i++;
}
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement('script');
script.type = 'text/javascript';
script.src = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
if (location.protocol !== 'file:' && /^https?:/.test(script.src))
script.src = script.src.replace(/^https?:/, '');
document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
</body>
</html>