This repository has been archived by the owner on Dec 28, 2023. It is now read-only.
forked from moderndive/ModernDive_book
91-appendixA.Rmd
# (APPENDIX) Appendix {-}
# Statistical Background {#appendixA}
## Basic statistical terms
### Mean
The mean is the most commonly reported measure of center. It is commonly called the "average" though this term can be a little ambiguous. The mean is the sum of all of the data elements divided by how many elements there are. If we have $n$ data points, the mean is given by: $$Mean = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
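As a minimal sketch, the formula can be checked in R against the built-in `mean()` function. The vector `x` holds made-up values used only for illustration.

```r
# Hypothetical data values, chosen only for illustration
x <- c(2, 4, 4, 5, 10)

# Mean by the formula: sum of the elements divided by how many there are
mean_by_hand <- sum(x) / length(x)

# Base R's built-in function should give the same value
mean_builtin <- mean(x)

mean_by_hand  # 5
```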
### Median
The median is calculated by first sorting a variable's data from smallest to largest. After sorting the data, the middle element in the list is the **median**. If the middle falls between two values, then the median is the mean of those two values.
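The two cases described above (a single middle element, or a middle that falls between two values) can be sketched in R as follows. The vectors `odd_n` and `even_n` are made-up values for illustration.

```r
# Hypothetical data values, chosen only for illustration
odd_n  <- c(7, 1, 5)       # three values: there is a single middle element
even_n <- c(7, 1, 5, 3)    # four values: the middle falls between two elements

# Odd number of values: sort, then take the middle element
sorted_odd <- sort(odd_n)                                # 1 5 7
median_odd <- sorted_odd[(length(sorted_odd) + 1) / 2]   # 5

# Even number of values: sort, then average the two middle elements
sorted_even <- sort(even_n)                              # 1 3 5 7
mid <- length(sorted_even) / 2
median_even <- mean(sorted_even[c(mid, mid + 1)])        # mean of 3 and 5 is 4

# Base R's median() handles both cases
median(odd_n)   # 5
median(even_n)  # 4
```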
### Standard deviation
We will next discuss the **standard deviation** of a sample dataset pertaining to one variable. The formula can be a little intimidating at first, but it is important to remember that it is essentially a measure of how far we expect a given data value to be from the mean:
$$Standard \, deviation = \sqrt{\frac{(x_1 - Mean)^2 + (x_2 - Mean)^2 + \cdots + (x_n - Mean)^2}{n - 1}}$$
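As a rough illustration, the formula can be computed step by step in R and compared with the built-in `sd()` function, which also divides by $n - 1$. The vector `x` is made up for this example.

```r
# Hypothetical data values, chosen only for illustration
x <- c(2, 4, 4, 5, 10)
n <- length(x)

# Squared distance of each value from the mean: 9 1 1 0 25
squared_deviations <- (x - mean(x))^2

# Sum the squared deviations, divide by n - 1, then take the square root
sd_by_hand <- sqrt(sum(squared_deviations) / (n - 1))

# Base R's built-in function for the sample standard deviation
sd_builtin <- sd(x)

sd_by_hand  # 3
```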
### Five-number summary
The **five-number summary** consists of five values: minimum, first quartile (25^th^ percentile), median (50^th^ percentile), third quartile (75^th^ percentile), and maximum. The quartiles are calculated as:
- first quartile ($Q_1$): the median of the first half of the sorted data
- third quartile ($Q_3$): the median of the second half of the sorted data
The _interquartile range_ (IQR) is defined as $Q_3 - Q_1$ and measures how spread out the middle 50% of values are. The median and the quartiles are not influenced by the presence of outliers in the way that the mean and standard deviation are. The five-number summary is, thus, recommended for skewed datasets.
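The calculation described above can be sketched in R. This example assumes an even number of made-up values so that the data split cleanly into two halves; with an odd number of values, conventions differ on how the middle value is treated.

```r
# Hypothetical data values (an even number of them), chosen only for illustration
x <- c(8, 3, 1, 6, 2, 7, 5, 4)
sorted_x <- sort(x)
n <- length(sorted_x)

# Split the sorted data into a lower half and an upper half
lower_half <- sorted_x[1:(n / 2)]        # 1 2 3 4
upper_half <- sorted_x[(n / 2 + 1):n]    # 5 6 7 8

q1 <- median(lower_half)   # 2.5
q3 <- median(upper_half)   # 6.5

five_num <- c(min(x), q1, median(x), q3, max(x))
iqr <- q3 - q1             # 4

# Base R's fivenum() agrees with the by-hand result for these data
fivenum(x)  # 1.0 2.5 4.5 6.5 8.0
```

Note that `quantile()` and `IQR()` use a different interpolation rule by default, so they can report slightly different quartiles than the "median of each half" approach.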
### Distribution
The **distribution** of a variable describes the general pattern of its values: which values occur and how frequently each one appears. It shows how the data vary and gives some information about where a typical element in the data might fall. Distributions are most easily seen through data visualization.
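As a small illustration of "how frequently elements appear", a frequency table can be built in R with `table()`. The values in `x` are made up for this sketch.

```r
# Hypothetical data values, chosen only for illustration
x <- c(3, 1, 2, 3, 3, 2, 5)

# Count how many times each distinct value appears
freq <- table(x)

# The value 3 appears most often (three times)
freq[["3"]]  # 3

# A histogram would display the same information visually:
# hist(x)
```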
### Outliers
**Outliers** correspond to values in the dataset that fall far outside the range of "ordinary" values. In a boxplot (by default), they correspond to values below $Q_1 - (1.5 \times IQR)$ or above $Q_3 + (1.5 \times IQR)$.
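The boxplot rule can be sketched in R as follows. The data are made up, and the quartiles are taken from `fivenum()`, which for an even number of values matches the median of each half of the sorted data.

```r
# Hypothetical data values with one unusually large value
x <- c(1, 2, 3, 4, 5, 6, 7, 30)

# Quartiles from the five-number summary
five_num <- fivenum(x)
q1 <- five_num[2]    # 2.5
q3 <- five_num[4]    # 6.5
iqr <- q3 - q1       # 4

# Fences: values beyond these are flagged as outliers
lower_fence <- q1 - 1.5 * iqr   # -3.5
upper_fence <- q3 + 1.5 * iqr   # 12.5

outliers <- x[x < lower_fence | x > upper_fence]
outliers  # 30

# boxplot.stats() applies the same default rule (coef = 1.5)
boxplot.stats(x)$out  # 30
```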
Note that these terms (aside from **Distribution**) only apply to quantitative variables.