chapter_6.html

<!DOCTYPE html>
<html lang="" xml:lang="">
  <head>
    <title>chapter_6.knit</title>
    <meta charset="utf-8" />
    <meta name="author" content="Pac_B" />
    <script src="libs/header-attrs-2.17/header-attrs.js"></script>
    <link href="libs/remark-css-0.0.1/default.css" rel="stylesheet" />
    <script src="libs/htmlwidgets-1.5.4/htmlwidgets.js"></script>
    <link href="libs/datatables-css-0.0.0/datatables-crosstalk.css" rel="stylesheet" />
    <script src="libs/datatables-binding-0.27/datatables.js"></script>
    <script src="libs/jquery-3.6.0/jquery-3.6.0.min.js"></script>
    <link href="libs/dt-core-1.12.1/css/jquery.dataTables.min.css" rel="stylesheet" />
    <link href="libs/dt-core-1.12.1/css/jquery.dataTables.extra.css" rel="stylesheet" />
    <script src="libs/dt-core-1.12.1/js/jquery.dataTables.min.js"></script>
    <link href="libs/crosstalk-1.2.0/css/crosstalk.min.css" rel="stylesheet" />
    <script src="libs/crosstalk-1.2.0/js/crosstalk.min.js"></script>
    <link rel="stylesheet" href="css/Custumed_Style.css" type="text/css" />
    <link rel="stylesheet" href="css/zh-CN.css" type="text/css" />
  </head>
  <body>
    <textarea id="source">

class: center, middle
&lt;span style="font-size: 50px;"&gt;**第六章**&lt;/span&gt; &lt;br&gt;
&lt;span style="font-size: 50px;"&gt;__如何探索数据: __&lt;/span&gt; &lt;br&gt;
&lt;span style="font-size: 40px;"&gt;描述性统计与数据可视化基础&lt;/span&gt;&lt;br&gt;
&lt;span style="font-size: 30px;"&gt;胡传鹏&lt;/span&gt; &lt;br&gt;
&lt;span style="font-size: 20px;"&gt; &lt;/span&gt; &lt;br&gt;
&lt;span style="font-size: 30px;"&gt;2023-04-04&lt;/span&gt; &lt;br&gt;
&lt;span style="font-size: 20px;"&gt; Made with Rmarkdown&lt;/span&gt; &lt;br&gt;

---
#回顾
##函数及用法
##-函数名及语句
##-group_by()
##for loop
##练习题
&lt;br&gt;&lt;br&gt;
#本节课内容
##探索性数据分析
##--描述性统计
##--数据可视化基础
##----ggplot2介绍
##----ggplot2可视化基础

---
#回顾
##函数及用法
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;重新了解函数：
&lt;/font&gt;

```r
df.mt.raw &lt;-  read.csv('./data/match/match_raw.csv',
                       header = T, sep=",", stringsAsFactors = FALSE) 
```
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;以用的最多的read.csv()这一函数为例，它包括两部分：“read.csv”——函数名；“()”中的输入参数。&lt;br&gt;
&amp;emsp;&amp;emsp;比如，“header=…”：是否要使用第一行作为列名；&lt;br&gt;
&amp;emsp;&amp;emsp;再如，“sep=…” ：指定分割符什么 &lt;br&gt;

&lt;/font&gt;

---
#回顾
##函数及用法
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;重新了解函数：
&lt;/font&gt;

```r
df.mt.raw &lt;-  read.csv('./data/match/match_raw.csv',
                       header = T, sep=",", stringsAsFactors = FALSE) 
```
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;以用的最多的read.csv()这一函数为例，它包括两部分：“read.csv”——函数名；“()”中的输入参数。&lt;br&gt;
&amp;emsp;&amp;emsp;比如，“header=…”：是否要使用第一行作为列名；&lt;br&gt;
&amp;emsp;&amp;emsp;再如，“sep=…” ：指定分割符什么 &lt;br&gt;
&lt;br&gt;
&amp;emsp;&amp;emsp;为什么第一个文件名没有相应的argument？。
&lt;/font&gt;

---
#回顾
##函数及用法
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;重新了解函数：
&lt;/font&gt;

完整输入arguments

```r
result1 &lt;- functionName(arg1 = value1, arg2 = value2, arg3 = value3)
result1 &lt;- functionName(arg3 = value3, arg1 = value1, arg2 = value2)
```

省略arguments,按照顺序输入argument的值

```r
result2 &lt;- functionName(value1, value2, value3)

result2 &lt;- functionName(value3, value1, value2) # will not return expected
```

省略有默认值的arguments

```r
result3 &lt;- functionName(arg1 = value1)
```

---
#回顾
##函数及用法
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;比如，完整地写出argument应该是这样。
&lt;/font&gt;

```r
df.mt.raw &lt;-  read.csv(file = './data/match/match_raw.csv',
                       header = T, 
                       sep=",",
                       stringsAsFactors = FALSE) 
```

---
#回顾
##函数及用法
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;argument: “TRUE ~ NA_real_”?&lt;br&gt;
&lt;/font&gt;

![p1](./picture/chp6/real.png)
&lt;br&gt;
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;首先，查找case_when()，但是我们并未发现对这一语句的描述，这可能是因为该argument已经在最新的语句中有了替代。&lt;br&gt;
&amp;emsp;&amp;emsp;因此，我们直接查找Na_real_，可以发现它表示这个默认值是一个缺失值（NA），并且是数值类型（real），那再结合case_when()的作用就很好理解了，那些“TRUE”，也即不在给出的任何条件中的值，被赋为NA。&lt;br&gt;
&amp;emsp;&amp;emsp;同学们也可以尝试使用case_when()文档中提供的语句重新写。
&lt;/font&gt;

---
#回顾
##函数及用法
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;关于函数“dplyr::group_by()"的作用: 比较操作前后的结果。
&lt;/font&gt;   

```r
library("tidyverse")
```

```
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
```

```r
group &lt;- df.mt.raw %&gt;% 
  group_by(Shape)
group#注意看数据框的第二行，有Groups:   Shape [4]的信息
```

```
## # A tibble: 25,920 × 16
## # Groups:   Shape [4]
##    Date        Prac    Sub   Age Sex   Hand  Block   Bin Trial Shape Label Match
##    &lt;chr&gt;       &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt;
##  1 02-May-201… Exp    7302    22 fema… R         1     1     1 immo… immo… mism…
##  2 02-May-201… Exp    7302    22 fema… R         1     1     2 mora… mora… mism…
##  3 02-May-201… Exp    7302    22 fema… R         1     1     3 immo… immo… mism…
##  4 02-May-201… Exp    7302    22 fema… R         1     1     4 mora… mora… mism…
##  5 02-May-201… Exp    7302    22 fema… R         1     1     5 immo… immo… match
##  6 02-May-201… Exp    7302    22 fema… R         1     1     6 immo… immo… match
##  7 02-May-201… Exp    7302    22 fema… R         1     1     7 mora… mora… match
##  8 02-May-201… Exp    7302    22 fema… R         1     1     8 mora… mora… match
##  9 02-May-201… Exp    7302    22 fema… R         1     1     9 mora… mora… mism…
## 10 02-May-201… Exp    7302    22 fema… R         1     1    10 immo… immo… mism…
## # … with 25,910 more rows, and 4 more variables: CorrResp &lt;chr&gt;, Resp &lt;chr&gt;,
## #   ACC &lt;int&gt;, RT &lt;dbl&gt;
```
---
#回顾
##函数及用法
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;当某一列作为group_by分类的对象时，生成的数据框具有此分类列的信息，这个信息是数据框的一个基本信息，你甚至无法删除这一列。
&lt;/font&gt;   

```r
group &lt;- group %&gt;% 
  select(-Shape)
```

```
## Adding missing grouping variables: `Shape`
```

```r
#注意返回的提示，因为数据框已经按照此列分类，所以不能再检索到此列。
```

---
# 回顾
## 函数及用法
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;对于group_by()函数的作用，我们可以对比不使用它的效果。
&lt;/font&gt;


```r
library("tidyverse")
df.mt.raw &lt;-  read.csv('./data/match/match_raw.csv',
                       header = T, sep=",", 
                       stringsAsFactors = FALSE) 
group &lt;- df.mt.raw %&gt;% 
  group_by(., Shape) %&gt;% 
  summarise(n())
DT::datatable(group)
```

<div id="htmlwidget-5280007504acbafd7a75" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-5280007504acbafd7a75">{"x":{"filter":"none","vertical":false,"data":[["1","2","3","4"],["immoralOther","immoralSelf","moralOther","moralSelf"],[6480,6480,6480,6480]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>Shape<\/th>\n      <th>n()<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-right","targets":2},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>
---
# 回顾
## 函数及用法
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;对于group_by()函数的作用，我们可以对比不使用它的效果。
&lt;/font&gt;

```r
ungroup &lt;- df.mt.raw %&gt;% 
  summarise(n())
DT::datatable(ungroup)
```

<div id="htmlwidget-ae03e1bc3d34241f96e3" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-ae03e1bc3d34241f96e3">{"x":{"filter":"none","vertical":false,"data":[["1"],[25920]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>n()<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-right","targets":1},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

---
# 回顾
## 函数及用法
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;从对比中可以看出，使用了group_by()后，实际上对数据框进行了分组，summarise针对的是每一个按照Shape列拆分开的数据框。
&lt;/font&gt;

```r
remove(group, ungroup)
```

---

# 回顾
## 函数及用法
&lt;font size=5&gt;
&amp;emsp;&amp;emsp; 理解函数的最好办法：创建函数
&lt;/font&gt;

```r
simpleFun &lt;- function(a = 1, b = 100){
  sum &lt;- a + b
  product &lt;- a * b
  
  res &lt;- c(sum, product)
  
  return(res)
}
```


```r
simpleFun()
```

```
## [1] 101 100
```


```r
simpleResult &lt;- simpleFun(a=1, b=5)
```
---

# 回顾
## for loop
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;我们看一下for loop语句最基础的结构：&lt;br&gt;
&amp;emsp;&amp;emsp; __for (variable in sequence) {statement}__ &lt;br&gt;
&amp;emsp;&amp;emsp;其中，variable是循环变量，sequence是一个向量或列表，statement是要重复执行的语句。在每次循环中，variable会被赋值为sequence中的下一个元素，然后执行statement。&lt;br&gt;
比如，我如果想计算1到100的和：
&lt;/font&gt;

```r
sum &lt;- 0
for (i in 1:100) {
  sum &lt;- sum + i
}
print(sum)
```

```
## [1] 5050
```

---
# 回顾
## for loop
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;理解了这一点我们再来看使用for loop如何批量读取文件。
&lt;/font&gt;

```r
files &lt;- list.files(file.path("data/match"), 
                    pattern = "data_exp7_rep_match_.*\\.out$")

df_list &lt;- list()

for (i in seq_along(files)) {
  df &lt;- read.table(file.path("data/match", files[i]), header = TRUE) 
  df_list[[i]] &lt;- df
}
```


&lt;font size=5&gt;
&amp;emsp;&amp;emsp;1.利用list.files读取所有的文件保存成value&lt;br&gt;
&amp;emsp;&amp;emsp;2.创建新的df_list存储结果&lt;br&gt;
&amp;emsp;&amp;emsp;3.for (i in seq_along(files))……部分是，对files里的第一个值执行read.table()，并将结果放进df_list中。接着对第二个值执行，直至循环结束整个files
&lt;/font&gt;
---
# 回顾
## 练习

定义一下函数，其有两个整数型的arguments (a, b)，第一个比第二个小，函数的输出是计算从a到b的所有整数之和。

```
# 回顾
## 练习题
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;我们在实际使用这些函数时，是带着目的去使用的。想要计算击中率虚报率，我们首先要知道什么情况是击中，也要让计算机知道什么情况是击中，一旦分类了，计算的时候也需要进行判断。但在分类判断之前，我们要先告诉计算机什么是正确情况什么是错误情况，哪些是这个被试做的，哪些不是，所以也需要分组。&lt;br&gt;
&amp;emsp;&amp;emsp;梳理到这，我们有了基本的思路：1.分组，使用group_by。2.告诉计算机什么是hit，什么是miss，使用summarise函数。3.利用判断语句进行分类计算，我们这里使用了ifelse。&lt;br&gt;
&amp;emsp;&amp;emsp;当然，具体执行时，我们也需要考虑缺失值等细节，但总体的思路不会有变。
&lt;br&gt;
&amp;emsp;&amp;emsp;以上次的练习为例。首先我们需要加载数据。
&lt;/font&gt;


```r
# 读取原始数据
df.mt.raw &lt;-  read.csv('./data/match/match_raw.csv',
                       header = T, sep=",", stringsAsFactors = FALSE) 
```
---
# 回顾
## 练习题
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;接着，我们选出需要的几列，包括自变量和因变量，以及分组变量。使用select()函数
&lt;/font&gt;

```r
df.mt.clean &lt;- df.mt.raw %&gt;%
  dplyr::select(Sub, Block, Bin,  # block and bin
                Shape, Match, # 自变量
                ACC, RT, # 反应结果
                )
```
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;因为数据中存在缺失值，所以也需要除去它们。
&lt;/font&gt;

```r
df.mt.clean &lt;- df.mt.raw %&gt;%
  dplyr::select(Sub, Block, Bin,  # block and bin
                Shape, Match, # 自变量
                ACC, RT, # 反应结果
                ) %&gt;% 
  tidyr::drop_na()
```

---
# 回顾
## 练习题
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;在计算击中率误报率等的时候，我们针对的是被试的在某一特定实验条件下的反应。所以我们需要通过bin、block、shape、sub进行分组，这样进行计算时，就是每一个条件组下分别计算。不会出现在虚报的实验条件下计算正确拒绝的情况。
&lt;/font&gt;

```r
df.mt.clean &lt;- df.mt.raw %&gt;%
  dplyr::select(Sub, Block, Bin,  # block and bin
                Shape, Match, # 自变量
                ACC, RT, # 反应结果
                ) %&gt;% 
  tidyr::drop_na()  %&gt;% #删除缺失值
  dplyr::group_by(Sub, Block, Bin, Shape)
```
---
# 回顾
## 练习题
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;接下来就是要使用summarise函数来给击中、虚报等分类，并使用ifelse函数根据分类计算概率
&lt;/font&gt;

```r
df.mt.clean &lt;- df.mt.raw %&gt;%
  dplyr::select(Sub, Block, Bin,  # block and bin
                Shape, Match, # 自变量
                ACC, RT, # 反应结果
                ) %&gt;% 
  tidyr::drop_na()  %&gt;% #删除缺失值
  dplyr::group_by(Sub, Block, Bin, Shape) %&gt;%
  dplyr::summarise(
      hit = length(ACC[Match == "match" &amp; ACC == 1]),
      fa = length(ACC[Match == "mismatch" &amp; ACC == 0]),
      miss = length(ACC[Match == "match" &amp; ACC == 0]),
      cr = length(ACC[Match == "mismatch" &amp; ACC == 1]),
      Dprime = qnorm(
        ifelse(hit / (hit + miss) &lt; 1,
               hit / (hit + miss),
               1 - 1 / (2 * (hit + miss))
              )
           ) - qnorm(
        ifelse(fa / (fa + cr) &gt; 0,
               fa / (fa + cr),
               1 / (2 * (fa + cr))
              )
                    ))
```

```
## `summarise()` has grouped output by 'Sub', 'Block', 'Bin'. You can override
## using the `.groups` argument.
```

---
# 回顾
## 练习题
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;最后使用ungroup、select等函数对数据进行清理，除去计算中的过程变量，并进行长宽转换方便查看。
&lt;/font&gt;

```r
df.mt.clean &lt;- df.mt.raw %&gt;%
  dplyr::select(Sub, Block, Bin,  # block and bin
                Shape, Match, # 自变量
                ACC, RT, # 反应结果
                ) %&gt;% 
  tidyr::drop_na()  %&gt;% #删除缺失值
  dplyr::group_by(Sub, Block, Bin, Shape) %&gt;%
  dplyr::summarise(
      hit = length(ACC[Match == "match" &amp; ACC == 1]),
      fa = length(ACC[Match == "mismatch" &amp; ACC == 0]),
      miss = length(ACC[Match == "match" &amp; ACC == 0]),
      cr = length(ACC[Match == "mismatch" &amp; ACC == 1]),
      Dprime = qnorm(
        ifelse(hit / (hit + miss) &lt; 1,
               hit / (hit + miss),
               1 - 1 / (2 * (hit + miss))
              )
           ) - qnorm(
        ifelse(fa / (fa + cr) &gt; 0,
               fa / (fa + cr),
               1 / (2 * (fa + cr))
              )
                    )) %&gt;% 
  dplyr::ungroup() %&gt;%
  select(-"hit",-"fa",-"miss",-"cr") %&gt;%
  dplyr::group_by(Sub, Shape)  %&gt;%
  tidyr::pivot_wider(names_from = Shape,
                     values_from = Dprime) 
```

```
## `summarise()` has grouped output by 'Sub', 'Block', 'Bin'. You can override
## using the `.groups` argument.
```

---
class: center, middle
&lt;span style="font-size: 50px;"&gt;**第六章**&lt;/span&gt; &lt;br&gt;
&lt;span style="font-size: 50px;"&gt;__如何探索数据: __&lt;/span&gt; &lt;br&gt;
&lt;span style="font-size: 40px;"&gt;描述性统计与数据可视化基础&lt;/span&gt;&lt;br&gt;
---

# 探索性数据分析
## What is exploratory data analysis?
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;在介绍描述性统计和可视化之前，我们先了解一个概念：探索性数据分析(Exploratory Data Analysis, EDA)。&lt;br&gt;
&amp;emsp;&amp;emsp;In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their maincharacteristics, often with visual methods (Wikipedia).
&lt;/font&gt;
&lt;div style="text-align:center;"&gt;
  &lt;img src="https://blog.escueladedatosvivos.ai/content/images/2020/12/main_img.png" alt="layer" style="width:70%; height:auto;" /&gt;
&lt;/div&gt;

---
# 探索性数据分析
## What is exploratory data analysis?
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;要进行EDA，首先要了解自己的数据，并提出有质量的问题。但是提出有质量的问题前，我们可以先从几个基础的简单问题开始：&lt;br&gt;
&amp;emsp;&amp;emsp;有哪些变量，类型如何？变量的值是如何变化的？变量之间有什么关系？
&lt;/font&gt;

```r
# 检查是否已安装 pacman
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman") }   # 如果未安装，则安装包

# 加载所需要的R包
pacman::p_load("tidyverse")
# 读取数据
df.pg.raw &lt;- read.csv("./data/penguin/penguin_rawdata.csv",
                      header = TRUE, sep=",", stringsAsFactors = FALSE)
df.mt.raw &lt;-  read.csv('./data/match/match_raw.csv',
                       header = T, sep=",", stringsAsFactors = FALSE) 
```

---
# 探索性数据分析
## 有哪些变量，类型如何？


```r
colnames(df.mt.raw)                # 查看列名，观察有哪些变量
```

```
##  [1] "Date"     "Prac"     "Sub"      "Age"      "Sex"      "Hand"    
##  [7] "Block"    "Bin"      "Trial"    "Shape"    "Label"    "Match"   
## [13] "CorrResp" "Resp"     "ACC"      "RT"
```

```r
DT::datatable(head(df.mt.raw, 3))  # 了解数据内容
```

<div id="htmlwidget-62e1f1ef7922915c5aa9" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-62e1f1ef7922915c5aa9">{"x":{"filter":"none","vertical":false,"data":[["1","2","3"],["02-May-2018_14:23:06","02-May-2018_14:23:08","02-May-2018_14:23:10"],["Exp","Exp","Exp"],[7302,7302,7302],[22,22,22],["female","female","female"],["R","R","R"],[1,1,1],[1,1,1],[1,2,3],["immoralSelf","moralOther","immoralOther"],["immoralSelf","moralOther","immoralOther"],["mismatch","mismatch","mismatch"],["n","n","n"],["m","n","n"],[0,1,1],[0.7561,0.7043,0.9903]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>Date<\/th>\n      <th>Prac<\/th>\n      <th>Sub<\/th>\n      <th>Age<\/th>\n      <th>Sex<\/th>\n      <th>Hand<\/th>\n      <th>Block<\/th>\n      <th>Bin<\/th>\n      <th>Trial<\/th>\n      <th>Shape<\/th>\n      <th>Label<\/th>\n      <th>Match<\/th>\n      <th>CorrResp<\/th>\n      <th>Resp<\/th>\n      <th>ACC<\/th>\n      <th>RT<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-right","targets":[3,4,7,8,9,15,16]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

---
# 探索性数据分析
## 了解数据

```r
# 前几节课提过的summary()函数
# 这里使用datatable()是为了方便在ppt呈现
DT::datatable(summary(df.mt.raw))
```

<div id="htmlwidget-ccf2bcd5032d6b453dd9" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-ccf2bcd5032d6b453dd9">{"x":{"filter":"none","vertical":false,"data":[["","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",""],["","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",""],["    Date","    Date","    Date","    Date","    Date","    Date","    Prac","    Prac","    Prac","    Prac","    Prac","    Prac","     Sub","     Sub","     Sub","     Sub","     Sub","     Sub","     Age","     Age","     Age","     Age","     Age","     Age","    Sex","    Sex","    Sex","    Sex","    Sex","    Sex","    Hand","    Hand","    Hand","    Hand","    Hand","    Hand","    Block","    Block","    Block","    Block","    Block","    Block","     Bin","     Bin","     Bin","     Bin","     Bin","     Bin","    Trial","    Trial","    Trial","    Trial","    Trial","    Trial","   Shape","   Shape","   Shape","   Shape","   Shape","   Shape","   Label","   Label","   Label","   Label","   Label","   Label","   Match","   Match","   Match","   Match","   Match","   Match","  CorrResp","  CorrResp","  CorrResp","  CorrResp","  CorrResp","  CorrResp","    Resp","    Resp","    Resp","    Resp","    Resp","    Resp","     ACC","     ACC","     ACC","     ACC","     ACC","     ACC","      RT","      RT","      RT","      RT","      RT","      RT"],["Length:25920      ","Class :character  ","Mode  :character  ",null,null,null,"Length:25920      ","Class :character  ","Mode  :character  ",null,null,null,"Min.   : 7302  ","1st Qu.: 7313  ","Median : 7324  ","Mean   : 8853  ","3rd Qu.: 7336  ","Max.   :73370  ","Min.   :18.00  ","1st Qu.:19.00  ","Median :20.00  ","Mean   :20.83  ","3rd Qu.:22.00  ","Max.   :28.00  ","Length:25920      ","Class :character  ","Mode  :character  ",null,null,null,"Length:25920      ","Class :character  ","Mode  :character  ",null,null,null,"Min.   :1.000  ","1st Qu.:1.000  ","Median :1.000  ","Mean   :1.611  ","3rd Qu.:2.000  ","Max.   :3.000  ","Min.   :1.000  ","1st Qu.:1.000  ","Median :2.000  ","Mean   :2.417  ","3rd Qu.:3.000  ","Max.   :5.000  ","Min.   : 1.00  ","1st Qu.: 6.75  ","Median :12.50  ","Mean   :12.50  ","3rd Qu.:18.25  ","Max.   :24.00  ","Length:25920      ","Class :character  ","Mode  :character  ",null,null,null,"Length:25920      ","Class :character  ","Mode  :character  ",null,null,null,"Length:25920      ","Class :character  ","Mode  :character  ",null,null,null,"Length:25920      ","Class :character  ","Mode  :character  ",null,null,null,"Length:25920      ","Class :character  ","Mode  :character  ",null,null,null,"Min.   :-1.0000  ","1st Qu.: 1.0000  ","Median : 1.0000  ","Mean   : 0.7959  ","3rd Qu.: 1.0000  ","Max.   : 2.0000  ","Min.   :0.1060  ","1st Qu.:0.6104  ","Median :0.7019  ","Mean   :0.7150  ","3rd Qu.:0.8053  ","Max.   :1.1831  "]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>Var1<\/th>\n      <th>Var2<\/th>\n      <th>Freq<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"columnDefs":[{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

---
# 探索性数据分析
## 了解你的数据
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;更进一步，如果我想知道变量的平均数、中位数和标准差等统计量应该怎么办？
&lt;/font&gt;&lt;br&gt;

```r
# 使用psych包中的describe()函数
DT::datatable(psych::describe(df.mt.raw))
```

<div id="htmlwidget-fd1b9c01a7f53dc0fa1e" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-fd1b9c01a7f53dc0fa1e">{"x":{"filter":"none","vertical":false,"data":[["Date*","Prac*","Sub","Age","Sex*","Hand*","Block","Bin","Trial","Shape*","Label*","Match*","CorrResp*","Resp*","ACC","RT"],[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],[25920,25920,25920,25920,25920,25920,25920,25920,25920,25920,25920,25920,25920,25262,25920,25920],[12218.6378086412,1,8852.5925925922,20.8333333333334,3.26851851851848,1.97685185185185,1.61111111111111,2.41666666666666,12.5,2.5,2.5,1.5,1.5,5.49564563375821,0.795949074074075,0.714950381944444],[6999.30300668651,0,9931.83282454277,2.47585664394741,0.654151598830282,0.150376806082763,0.803172843163891,1.36170415422526,6.92232008612875,1.11805555638763,1.11805555638763,0.500009645340819,0.500009645340819,0.508936839277808,0.463647210061554,0.150839365248012],[12220.5,1,7324,20,3,2,1,2,12.5,2.5,2.5,1.5,1.5,5,1,0.7019],[12219.0297550138,1,7324.23958333333,20.5277777777781,3.32638888888891,2,1.51388888888903,2.27083333333354,12.5,2.5,2.5,1.5,1.5,5.49524987629897,0.90055941358024,0.70775078125],[8920.0629,0,16.3086,1.4826,0,0,0,1.4826,8.8956,1.4826,1.4826,0.7413,0.7413,0,0,0.14336742],[1,1,7302,18,1,1,1,1,1,1,1,1,1,1,-1,0.106],[24362,1,73370,28,4,2,3,5,24,4,4,2,2,9,2,1.1831],[24361,0,66068,10,3,1,2,4,23,3,3,1,1,8,3,1.0771],[0.00220558477379076,null,6.34183273983716,1.09360106540283,-0.83747464140177,-6.34184841681397,0.815447754334393,0.670628948711956,0,0,0,0,0,-0.0870917896362933,-2.16997928633046,0.35980327619413],[-1.19185000099972,null,38.2203865539374,0.348239273835886,1.58332877329944,38.220515903065,-0.96867852976512,-0.815503330725602,-1.2043124771982,-1.36012654076884,-1.36012654076884,-2.00007715900539,-2.00007715900539,-0.333591646518246,4.12510599124667,0.237084145756972],[43.4747703279714,0,61.6895926023372,0.0153782882178081,0.00406313178492967,0.000934035446223226,0.0049887474305451,0.00845795292803207,0.0429966063183225,0.00694457840751138,0.00694457840751138,0.00310570987885453,0.00310570987885453,0.00320206411027458,0.00287985188687711,0.000936908539937426]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>vars<\/th>\n      <th>n<\/th>\n      <th>mean<\/th>\n      <th>sd<\/th>\n      <th>median<\/th>\n      <th>trimmed<\/th>\n      <th>mad<\/th>\n      <th>min<\/th>\n      <th>max<\/th>\n      <th>range<\/th>\n      <th>skew<\/th>\n      <th>kurtosis<\/th>\n      <th>se<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-right","targets":[1,2,3,4,5,6,7,8,9,10,11,12,13]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

```r
# 需要注意的是，describe()函数不会帮你处理缺失值，它会跳过缺失值。
```

---
# 探索性数据分析
## 了解你的数据
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;更进一步，如果我想知道变量的平均数、中位数和标准差等统计量应该怎么办？
&lt;/font&gt;&lt;br&gt;

```r
# 使用dplyr包中的summarise()函数
df.mt.raw %&gt;%
  summarise(mean_RT = mean(RT),
            sd_RT = sd(RT),
            n_values = n())
```

```
##     mean_RT     sd_RT n_values
## 1 0.7149504 0.1508394    25920
```

```r
# summarise函数不会忽略缺失值，如果计算的列中有缺失值，会有报错。
```

---
# 探索性数据分析
## 变量的值是如何变化的？可视化
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;可视化的部分我们介绍一个最常用的包：ggplot2。&lt;br&gt;
&amp;emsp;&amp;emsp;所谓gg源于“grammar of graphics”，即图形语法。&lt;br&gt;
&amp;emsp;&amp;emsp;ggplot2绘图的核心在于使用图层去描述和构建图形。&lt;br&gt;
&amp;emsp;&amp;emsp;我们在这里给出一个示例，探究体温和健康的关系，并简单了解一下ggplot2的语法。
&lt;/font&gt;
&lt;div style="text-align:center;"&gt;
  &lt;img src="https://picb.zhimg.com/v2-1ea8eef8abdab39c4e5cfcc0285f9d95_720w.jpg?source=172ae18b" alt="layer" style="width:60%; height:auto;" /&gt;
&lt;/div&gt;

---
# 可视化
## 柱状图
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;比方说，我们想要看看被试回答正确率的情况。
&lt;/font&gt;

```r
ggplot2::ggplot(data=df.mt.raw,       # 指定数据
                aes(x=ACC)) +         # 确定映射到x轴的变量
  geom_bar() +                        # 绘制直方图
  theme_classic()                     # 设定绘图风格
```

![](chapter_6_files/figure-html/unnamed-chunk-28-1.png)&lt;!-- --&gt;
---
# 可视化
## 柱状图
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;消除图形与x轴之间的空白。
&lt;/font&gt;

```r
ggplot2::ggplot(data=df.mt.raw,         # 指定数据
                aes(x=ACC)) +           # 确定映射到x轴的变量
  geom_bar() +                          # 绘制直方图
  scale_y_continuous(expand=c(0, 0)) +  # x轴在 y=0 处  
  theme_classic()                       # 设定绘图风格
```

![](chapter_6_files/figure-html/unnamed-chunk-29-1.png)&lt;!-- --&gt;
---

# 可视化
## 直方图
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;对于连续变量，我们可以使用直方图进行可视化。比如说，我们想看看被试的反应时分布。
&lt;/font&gt;

```r
ggplot2::ggplot(data=df.mt.raw,      # 指定数据
                aes(x=RT)) + # 确定映射到x轴的变量
  geom_histogram() +                   # 绘制直方图
  stat_bin(bins = 40) +                # 设定连续变量分组数量
  scale_x_continuous(name = "RT") +    # 命名x轴
  scale_y_continuous(expand=c(0, 0)) + # x轴在 y=0 处 
  theme_classic()                      # 设定绘图风格
```

![](chapter_6_files/figure-html/unnamed-chunk-30-1.png)&lt;!-- --&gt;
---
# 可视化
## 密度图
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;同样的我们可以使用密度图来描述反应时的分布情况。
&lt;/font&gt;

```r
ggplot2::ggplot(data=df.mt.raw,         # 指定数据
                aes(x=RT)) +  # 确定映射到x轴的变量
  geom_density() +                      # 绘制密度曲线
  scale_x_discrete(name="RT") +         # 命名x轴
  scale_y_continuous(expand=c(0, 0)) +  # x轴在 y=0 处 
  theme_classic()                       # 设定绘图风格
```

![](chapter_6_files/figure-html/unnamed-chunk-31-1.png)&lt;!-- --&gt;
---
# 可视化
## 图层叠加


```r
ggplot2::ggplot(data=df.mt.raw,             # 指定数据
                aes(x=RT,                   # x轴的变量
                    y=after_stat(density),  # y轴对应的是密度曲线
                    alpha=0.8)) +           # 透明度
  geom_histogram() +                        # 绘制直方图
  geom_density() +                          # 绘制密度曲线
  guides(alpha=FALSE) +                     # 隐藏透明度alpha设置带来的图例
  theme_classic()                           # 设定绘图风格
```

![](chapter_6_files/figure-html/unnamed-chunk-32-1.png)&lt;!-- --&gt;

---
# 可视化
## 箱线图
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;除了单个变量的可视化，我们可以尝试将两个变量的关系可视化。
&amp;emsp;&amp;emsp;这里我们利用箱线图看看不同Label的RT如何。
&lt;/font&gt;

```r
ggplot2::ggplot(data=df.mt.raw,  # 指定数据
                aes(x=Label,     # 确定映射到xy轴的变量
                    y=RT)) +                 
  geom_boxplot() +               # 绘制箱线图
  theme_classic()                # 设定绘图风格
```

![](chapter_6_files/figure-html/unnamed-chunk-33-1.png)&lt;!-- --&gt;

---
# 可视化
## 散点图
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;而对于两个连续变量，我们可以使用散点图。比如，我们可以看看被试在做penguin问卷前后体温的关系。
&lt;/font&gt;

```r
ggplot2::ggplot(data=df.pg.raw,                  # 指定数据
                aes(x=Temperature_t1,            # 确定映射到xy轴的变量
                    y=Temperature_t2)) +
  geom_point() +                                 # 绘制散点图
  scale_x_continuous(name = "Temperature_t1") +  # 修改X轴的名称
  scale_y_continuous(name = "Temperature_t2") +  # 修改Y轴的名称
  theme_classic()                                # 设定绘图风格
```

![](chapter_6_files/figure-html/unnamed-chunk-34-1.png)&lt;!-- --&gt;

---
# 可视化
## 散点图
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;事实上，当我们进行探索时往往需要先对数据进行处理，再进行可视化。以下我们想看看手机使用和焦虑是否存在关系。
&lt;/font&gt;

```r
# 利用管道符，可以帮助我们更简洁地合并数据处理和可视化的过程。
df.pg.raw %&gt;% 
  dplyr::mutate(stress_ave=rowMeans(.[,c("stress1", "stress2", "stress3","stress4", "stress5", 
                                         "stress6","stress7", "stress8", "stress9","stress10", 
                                         "stress11", "stress12","stress13", "stress14")]),
                phone_ave=rowMeans(.[,c("phone1","phone2","phone3","phone4","phone5",
                                        "phone6","phone7","phone8","phone9")])
                ) %&gt;% 
  ggplot(aes(x=stress_ave, 
             y=phone_ave)) +
  geom_point() +
  geom_smooth(method="lm") +    # 在散点图上叠加回归线，语法可以查找帮助文档
  theme_classic()
```

![](chapter_6_files/figure-html/unnamed-chunk-35-1.png)&lt;!-- --&gt;

---

# 可视化
## ggplot2小组

* 化繁为简：大量的默认值
* 精准定制：所有元素均可控
* 易于叠加：丰富的信息
* 日益丰富的生态系统 https://r-graph-gallery.com/

---

# 可视化
## Explore data with DataExplorer
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;DataExplorer可能是是一个不错的工具
&lt;/font&gt;

```r
# load DataExplorer
pacman::p_load("DataExplorer")

DataExplorer::plot_str(df.pg.raw)
```
---

# 可视化
## Explore data with DataExplorer
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;DataExplorer可能是是一个不错的工具
&lt;/font&gt;

```r
DataExplorer::plot_str(df.mt.raw)
```
---

# 可视化
## Explore data with DataExplorer
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;DataExplorer可能是是一个不错的工具
&lt;/font&gt;

```r
DataExplorer::plot_intro(df.mt.raw)
```

![](chapter_6_files/figure-html/unnamed-chunk-38-1.png)&lt;!-- --&gt;
---
# 可视化
## Explore data with DataExplorer
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;DataExplorer可能是是一个不错的工具
&lt;/font&gt;

```r
DataExplorer::plot_missing(df.mt.raw)
```

![](chapter_6_files/figure-html/unnamed-chunk-39-1.png)&lt;!-- --&gt;
---

# 可视化
## Explore data with DataExplorer
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;DataExplorer可能是是一个不错的工具
&lt;/font&gt;

```r
DataExplorer::plot_bar(df.mt.raw)
```

![](chapter_6_files/figure-html/unnamed-chunk-40-1.png)&lt;!-- --&gt;
---

# 可视化
## Explore data with DataExplorer
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;DataExplorer可能是是一个不错的工具
&lt;/font&gt;

```r
DataExplorer::plot_bar(df.mt.raw, by="Match")
```

![](chapter_6_files/figure-html/unnamed-chunk-41-1.png)&lt;!-- --&gt;
---

# 可视化
## Explore data with DataExplorer
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;DataExplorer可能是是一个不错的工具
&lt;/font&gt;

```r
DataExplorer::plot_histogram(df.mt.raw)
```

![](chapter_6_files/figure-html/unnamed-chunk-42-1.png)&lt;!-- --&gt;
---

# 可视化
## Explore data with DataExplorer
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;DataExplorer可能是是一个不错的工具
&lt;/font&gt;

```r
DataExplorer::plot_qq(df.pg.raw[,2:10])
```

![](chapter_6_files/figure-html/unnamed-chunk-43-1.png)&lt;!-- --&gt;
---

# 可视化
## Explore data with DataExplorer
&lt;font size=5&gt;
&amp;emsp;&amp;emsp;DataExplorer可能是是一个不错的工具
&lt;/font&gt;

```r
DataExplorer::plot_correlation(na.omit(df.pg.raw[, 2:30]))
```

![](chapter_6_files/figure-html/unnamed-chunk-44-1.png)&lt;!-- --&gt;
---

#练习
## 1. 读取match数据，对自己感兴趣的变量进行描述性统计。
## 2. 读取match数据，对不同shape的击中率进行分组绘图，可使用boxplot观察差异。
## 3. 读取penguin数据，选择自己感兴趣的两个变量进行处理并画出散点图。
## 4. 对两个数据中自己感兴趣的变量们做探索性数据分析。
#探索
## 在本章的例子中，我们举例了反应时的分布。但其实我们是对所有被试的所有反应时绘制了总的分布，那么我们能不能找到一个办法绘制出每一个被试的反应时分布呢？


    </textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"highlightLines": true,
"highlightStyle": "github",
"countIncrementalSlides": false
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
  window.dispatchEvent(new Event('resize'));
});
(function(d) {
  var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
  if (!r) return;
  s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
  d.head.appendChild(s);
})(document);

(function(d) {
  var el = d.getElementsByClassName("remark-slides-area");
  if (!el) return;
  var slide, slides = slideshow.getSlides(), els = el[0].children;
  for (var i = 1; i < slides.length; i++) {
    slide = slides[i];
    if (slide.properties.continued === "true" || slide.properties.count === "false") {
      els[i - 1].className += ' has-continuation';
    }
  }
  var s = d.createElement("style");
  s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
  d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
  var deleted = false;
  slideshow.on('beforeShowSlide', function(slide) {
    if (deleted) return;
    var sheets = document.styleSheets, node;
    for (var i = 0; i < sheets.length; i++) {
      node = sheets[i].ownerNode;
      if (node.dataset["target"] !== "print-only") continue;
      node.parentNode.removeChild(node);
    }
    deleted = true;
  });
})();
// add `data-at-shortcutkeys` attribute to <body> to resolve conflicts with JAWS
// screen reader (see PR #262)
(function(d) {
  let res = {};
  d.querySelectorAll('.remark-help-content table tr').forEach(tr => {
    const t = tr.querySelector('td:nth-child(2)').innerText;
    tr.querySelectorAll('td:first-child .key').forEach(key => {
      const k = key.innerText;
      if (/^[a-z]$/.test(k)) res[k] = t;  // must be a single letter (key)
    });
  });
  d.body.setAttribute('data-at-shortcutkeys', JSON.stringify(res));
})(document);
(function() {
  "use strict"
  // Replace <script> tags in slides area to make them executable
  var scripts = document.querySelectorAll(
    '.remark-slides-area .remark-slide-container script'
  );
  if (!scripts.length) return;
  for (var i = 0; i < scripts.length; i++) {
    var s = document.createElement('script');
    var code = document.createTextNode(scripts[i].textContent);
    s.appendChild(code);
    var scriptAttrs = scripts[i].attributes;
    for (var j = 0; j < scriptAttrs.length; j++) {
      s.setAttribute(scriptAttrs[j].name, scriptAttrs[j].value);
    }
    scripts[i].parentElement.replaceChild(s, scripts[i]);
  }
})();
(function() {
  var links = document.getElementsByTagName('a');
  for (var i = 0; i < links.length; i++) {
    if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
      links[i].target = '_blank';
    }
  }
})();
// adds .remark-code-has-line-highlighted class to <pre> parent elements
// of code chunks containing highlighted lines with class .remark-code-line-highlighted
(function(d) {
  const hlines = d.querySelectorAll('.remark-code-line-highlighted');
  const preParents = [];
  const findPreParent = function(line, p = 0) {
    if (p > 1) return null; // traverse up no further than grandparent
    const el = line.parentElement;
    return el.tagName === "PRE" ? el : findPreParent(el, ++p);
  };

  for (let line of hlines) {
    let pre = findPreParent(line);
    if (pre && !preParents.includes(pre)) preParents.push(pre);
  }
  preParents.forEach(p => p.classList.add("remark-code-has-line-highlighted"));
})(document);</script>

<script>
slideshow._releaseMath = function(el) {
  var i, text, code, codes = el.getElementsByTagName('code');
  for (i = 0; i < codes.length;) {
    code = codes[i];
    if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
      text = code.textContent;
      if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
          /^\$\$(.|\s)+\$\$$/.test(text) ||
          /^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
        code.outerHTML = code.innerHTML;  // remove <code></code>
        continue;
      }
    }
    i++;
  }
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
  var script = document.createElement('script');
  script.type = 'text/javascript';
  script.src  = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
  if (location.protocol !== 'file:' && /^https?:/.test(script.src))
    script.src  = script.src.replace(/^https?:/, '');
  document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
  </body>
</html>