---
title: "Regression analysis"
output:
  html_document:
    fig_caption: no
    number_sections: yes
    toc: yes
    toc_float: false
    collapsed: no
---
```{r regression1-1, echo=FALSE}
options(width = 105)
knitr::opts_chunk$set(dev='png', dpi=300, cache=FALSE)
pdf.options(useDingbats = TRUE)
klippy::klippy(position = c('top', 'right'))
```
<p><span style="color: #00cc00;">NOTE: This page has been revised for Winter 2024, but may undergo further edits.</span></p>
# Introduction #
*Regression analysis* aims at constructing a relationship between a single dependent or response variable and one or more independent or predictor variables, and is one of the more widely used methods in data analysis. Although the computations and analysis that underlie regression analysis appear more complicated than those for other procedures, simple analyses are quite straightforward. (Most of this discussion was taken from `https://pjbartlein.github.io/GeogDataAnalysis/regression1.html`.)
The general model that underlies regression analysis is
*data = predictable component + unpredictable component*
"Data" in this case are the observed values of the dependent variable, the predictable component consists of the predictions generated by the regression equation, and the unpredictable component consists of the "residuals" or unpredictable parts of the data. The general idea in regression analysis is to move information into the predictable component, leaving the unpredictable component with no information or pattern.
# Regression basics #
Details of regression analysis, including:
- the regression model and alternative representations
- other quantities in regression analysis (fitted values, residuals, sums of squares)
- "fitting" the regression equation
can be found [[here]](https://pjbartlein.github.io/REarthSysSci/topics/regression1.pdf).
[[Back to top]](regression1.html)
# Example #
The simplest regression model is the bivariate one, in which there is one response or dependent variable, and one predictor or independent variable, and the relationship between the two is represented by a straight line.
Read an example data set:
```{r regression1-2}
# read regrex1.csv
# modify the following path to reflect local files
csv_path <- "/Users/bartlein/projects/RESS/data/csv_files/"
csv_name <- "regrex1.csv"
csv_file <- paste(csv_path, csv_name, sep="")
regrex1 <- read.csv(csv_file)
```
## Fitting a regression equation (or linear model) ##
Building a bivariate linear regression model to represent the relationship between two variables by a straight line involves determining the coefficients of that line, a process known as "fitting" the regression line. First plot the data!
```{r regression1-3 }
plot(y ~ x, data=regrex1)
```
Fit the model. The `lm()` (linear model) function creates an object that contains the coefficients of the regression equation, fitted values, residuals, etc., which are saved here in the object `ex1_lm`.
```{r regression1-4 }
# fit the model
ex1_lm <- lm(y ~ x, data=regrex1)
```
Examine the model. "Printing" the object gives a very short summary, while the `summary()` function produces a few more details, and the `attributes()` function reveals what's contained in the `ex1_lm` object.
```{r regression1-5 }
# examine the model object
print(ex1_lm)
summary(ex1_lm)
attributes(ex1_lm)
```
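The components listed by `attributes()` can also be pulled out with accessor functions; for example, `coef()` returns the fitted intercept and slope, and `confint()` their confidence intervals (a quick sketch, assuming `ex1_lm` from the chunk above):
```r
# extract the fitted coefficients (intercept b0 and slope b1)
coef(ex1_lm)
# confidence intervals for the coefficients
confint(ex1_lm)
```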
The regression line can be visualized by adding it to the existing plot using the `abline()` function.
```{r regression1-6 }
# plot the regression line
plot(y ~ x, data=regrex1)
abline(ex1_lm, col="red", lwd=2)
```
The individual "deviations" (one per observation) that are being minimized in the analysis can be visualized using the `segments()` function, which, as might be expected, plots a line segment between two points, i.e. between the predicted value of `y` (retrieved from `ex1_lm` using the `fitted()` function) and the observed value, at each `x` value.
```{r regression1-7, include=TRUE}
plot (y ~ x, data=regrex1)
abline(ex1_lm, col="red", lwd=2)
# plot deviations
segments(regrex1$x, fitted(ex1_lm), regrex1$x, regrex1$y, col="red")
```
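The quantity being minimized by the fitting procedure is the sum of these squared deviations, which can be computed directly from the model object (a quick check, assuming `ex1_lm` and `regrex1` from above):
```r
# sum of squared residuals -- the quantity minimized by least squares
sum(resid(ex1_lm)^2)
# equivalently, the squared differences between observed and fitted values
sum((regrex1$y - fitted(ex1_lm))^2)
```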
## Examining the regression equation ##
Once the regression equation has been fit to the data, the next step is to examine the results and the significance of several statistics. First, the standard output:
```{r regression1-8 }
# examine the model object
ex1_lm
summary(ex1_lm)
```
The fit of the regression model can also be displayed by plotting confidence intervals (which allow variability in the regression line to be visually assessed) and prediction intervals (which allow variability in the data to be assessed).
```{r regression1-9 }
# get and plot prediction intervals and confidence intervals
pred_data <- data.frame(x = 1:25)
pred_int <- predict(ex1_lm, int="p", newdata=pred_data)
conf_int <- predict(ex1_lm, int="c", newdata=pred_data)
# plot the data
ylim <- range(regrex1$y, pred_int, na.rm=TRUE)
plot(regrex1$x, regrex1$y, ylim=ylim)
pred_ex1 <- pred_data$x
matlines(pred_ex1, pred_int, lty=c(1,2,2), col="black")
matlines(pred_ex1, conf_int, lty=c(1,2,2), col="red")
```
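The difference between the two kinds of intervals can also be seen at a single value of the predictor: the prediction interval is wider, because it includes the scatter of individual observations around the line, not just the uncertainty in the line itself. A short sketch (the value `x = 10` here is arbitrary):
```r
# confidence vs. prediction intervals at a single new value of x
new_x <- data.frame(x = 10)
predict(ex1_lm, newdata = new_x, interval = "confidence")   # uncertainty in the fitted line
predict(ex1_lm, newdata = new_x, interval = "prediction")   # uncertainty in a new observation
```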
## A second example ##
Side-by-side comparisons of results of regression analyses provide a means for understanding just what is being described by the summary statistics. A second example shows how the regression equation and goodness-of-fit statistics vary as a function of the strength of the relationship between the response and predictor variable.
Read the second example data set:
```{r regression1-10}
# read regrex2.csv
# modify the following path to reflect local files
csv_path <- "/Users/bartlein/projects/RESS/data/csv_files/"
csv_name <- "regrex2.csv"
csv_file <- paste(csv_path, csv_name, sep="")
regrex2 <- read.csv(csv_file)
```
```{r regression1-11 }
summary(regrex2)
```
The variables `x` and `y1` are the same as in the previous example, so as a second example look at the relationship between `y2` and `x`:
```{r regression1-12, fig.width=9 }
opar <- par(mfcol = c(1,2))
plot(y1 ~ x, data=regrex2, main="Example 1")
plot(y2 ~ x, data=regrex2, main="Example 2")
```
Get the second regression:
```{r regression1-13 }
ex2_lm <- lm(y2 ~ x, data=regrex2)
summary(ex2_lm)
```
And compare it with the first:
```{r regression1-14 }
summary(ex1_lm)
```
Note that the intercept and slope parameter estimates are identical, but all of the measures of goodness-of-fit or of the uncertainty of the parameter values (their standard errors) and significance (t-tests and p-values) are "better" in the second example. Compare the two regression lines and note the differences in the deviations that are being minimized.
```{r regression1-15, fig.width=9 }
opar <- par(mfcol = c(1,2))
plot(y1 ~ x, data=regrex2, main="Example 1")
abline(ex1_lm, col="red", lwd=2)
segments(regrex2$x, fitted(ex1_lm), regrex2$x, regrex2$y1, col="red")
plot(y2 ~ x, data=regrex2, main="Example 2")
abline(ex2_lm, col="blue", lwd=2)
segments(regrex2$x, fitted(ex2_lm), regrex2$x, regrex2$y2, col="blue")
```
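The goodness-of-fit statistics that differ between the two examples can also be pulled directly out of the `summary()` objects for a numerical side-by-side comparison (a sketch, assuming `ex1_lm` and `ex2_lm` from above):
```r
# compare R-squared and residual standard error for the two examples
c(ex1 = summary(ex1_lm)$r.squared, ex2 = summary(ex2_lm)$r.squared)
c(ex1 = summary(ex1_lm)$sigma, ex2 = summary(ex2_lm)$sigma)
```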
## Residual plots and case-wise statistics ##
There are a number of residual diagnostic plots that can be generated to examine the regression results as well as to assess the importance of any assumption violations.
```{r regression1-16, fig.height=7, fig.width=9}
# standard regression diagnostics (4-up)
oldpar <- par(mfrow = c(2, 2))
plot(ex1_lm, which=c(1,2,4,5))
par(oldpar)
```
The residual scatter plot (upper left) shows the relationship between the residuals and the fitted values, and should show no systematic pattern (e.g., a "megaphone" or "arch") that would signal model inadequacy (i.e., it should be a featureless cloud of points); the red lowess curve helps summarize any trend that might be apparent. The normal QQ plot (upper right) provides a quick look at the distribution of the residuals, which should be approximately normal (i.e., the individual values should plot along a straight line); points that plot off the line are labeled with their case numbers (i.e., rows in the data frame). The two plots on the bottom, Cook's distance (lower left) and leverage (lower right), provide information on the influence of each observation on the overall regression.
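The case-wise statistics behind the bottom two panels can also be obtained directly as vectors, one value per observation (a sketch, using the first example):
```r
# case-wise influence measures for ex1_lm
head(cooks.distance(ex1_lm))   # Cook's distance
head(hatvalues(ex1_lm))        # leverage (hat values)
head(rstandard(ex1_lm))        # standardized residuals
```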
[[Back to top]](regression1.html)
# Iterative fitting (i.e. brute-force optimization) of a regression equation #
Regression equations can also be fit by perturbing (or iteratively considering different) regression parameter values, and choosing the combination that minimizes the sum of squares of residuals. The following code sets up a range of potential regression coefficient/parameter values that will be iteratively considered, keeping track of the sum of squares of deviations for each combination (expressed as the residual standard error for comparison with the results produced by `lm()`). (The optimal combination is then the one that minimizes the sum of squares.)
```{r regression1-17 }
# fit a linear regression equation by
# minimizing the sum of squared residuals
# uses regrex2 (y1 is the same as y in regrex1)
# set some constant values
n <- length(regrex2$y1)  # number of observations
K <- 1                   # number of predictors
# intercept values range
n_b0 <- 11
b0_min <- 1.0 # 2.00 # 2.24
b0_max <- 3.0 # 2.40 # 2.25
# slope values range
n_b1 <- 11
b1_min <- 0.0 # 0.4 # 0.46
b1_max <- 1.0 # 0.5 # 0.47
# coefficients
b0 <- seq(b0_min, b0_max, len=n_b0)
b1 <- seq(b1_min, b1_max, len=n_b1)
# space for residual standard-error values
rse <- matrix(nrow=n_b0*n_b1, ncol=3)
colnames(rse) <- c("b0", "b1", "rse")
# matrix for residual sum of squares
rss <- matrix(NA, nrow=n_b0, ncol=n_b1)
```
Now iterate over the combinations of intercept and slope values, keeping track of the residual standard errors and drawing the line produced by each combination.
```{r regression1-18 }
plot(y1 ~ x, data=regrex2)
m <- 0
for (j in 1:n_b0) {
for (k in 1:n_b1) {
m <- m + 1
sse <- 0.0
for (i in 1:n) {
sse <- sse + (regrex2$y1[i] - b0[j] - b1[k]*regrex2$x[i])^2
}
rss[j, k] <- sse
rse[m,1] <- b0[j]
rse[m,2] <- b1[k]
rse[m,3] <- sqrt(sse/(n-K-1))
abline(b0[j], b1[k], col="gray")
}
}
# find the coefficients that minimize the rse
m_min <- which.min(rse[,3])
print(m_min)
print(c(rse[m_min,1],rse[m_min,2],rse[m_min,3]))
# plot the line for the optimal coefficients
abline(rse[m_min,1],rse[m_min,2], col="purple", lwd=2)
# plot the OLS regression line
abline(ex1_lm, col="red", lwd=2)
```
The `which.min()` function determines the row in the `rse` matrix with the minimum (i.e. "optimal") residual standard error, and the corresponding parameter values are then printed out. The regression line generated by those values is plotted in purple, and can be compared with the OLS estimate in red. Pretty close.
It's useful to see the residual-sum-of-squares surface, which shows graphically how the goodness-of-fit varies as the regression coefficient values change.
```{r regression1-19 }
image(b0, b1, rss, col = gray((9:64)/64))
```
The values in the `rss` matrix can be inspected using the `View()` function.
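The same brute-force idea can be carried out with a general-purpose optimizer rather than a fixed grid. The following sketch (not part of the workflow above; the function `rss_fun` is introduced here just for illustration) uses `optim()` to minimize the residual sum of squares directly:
```r
# minimize the residual sum of squares numerically with optim()
rss_fun <- function(b, x, y) sum((y - b[1] - b[2] * x)^2)
opt <- optim(c(0, 0), rss_fun, x = regrex2$x, y = regrex2$y1)
opt$par   # compare with coef(ex1_lm)
```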
[[Back to top]](regression1.html)
# Examining the regression equation #
Examination of the regression equation involves an assessment of the "strength" of the relationship between the response variable and the predictor variable(s). Two hypothesis tests are considered:
- a test of the null hypothesis that the proportion of the variance of the response variable "explained" by the predictor variable(s) is zero (an *F*-test, analogous to the one in ANOVA of the hypothesis that the group means are not different)
- individual tests of the null hypotheses that the estimated regression coefficients (*b<sub>0</sub>*, *b<sub>1</sub>*) are equal to zero.
The two hypotheses are tested by examining the regression equation and related statistics:
- decomposition of individual deviations
- how strong is the relationship? (*F*-test, *R<sup>2</sup>*)
- are the coefficients significant? (*t*-tests and standard errors of the predictions)
The fit of the regression model can also be displayed by plotting confidence intervals (which allow variability in the regression line to be visually assessed) and prediction intervals (which allow variability in the data to be assessed).
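Both tests appear in the standard `summary()` output; they can also be extracted individually (a sketch, using the first example):
```r
# F-test via the analysis-of-variance table
anova(ex1_lm)
# coefficient estimates, standard errors, t-values, and p-values
summary(ex1_lm)$coefficients
```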
[[Back to top]](regression1.html)
# Analysis of Variance #
One of the main elements of the output in fitting a regression equation is the *analysis of variance table*, which, as it sounds, has some parallels with "Analysis of Variance" (ANOVA), the procedure for testing for equality of group means. In regression analysis, the decomposition of the variance of the dependent variable is
*variance of response = variance "explained" by regression + residual variance (noise)*
In ANOVA, the total variance of the variable of interest is decomposed as
*total variance = between-group variance + within-group variance*
If ANOVA is thought of as evaluating how much of the variability of a "response" variable can be explained by knowing which group each observation belongs to, then the similarity of the analyses can be seen.
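For a bivariate regression, this decomposition can be verified directly from the fitted model (a sketch, using the first example):
```r
# decomposition of the total sum of squares of y
ss_total <- sum((regrex1$y - mean(regrex1$y))^2)
ss_regr  <- sum((fitted(ex1_lm) - mean(regrex1$y))^2)
ss_resid <- sum(resid(ex1_lm)^2)
c(total = ss_total, regression = ss_regr, residual = ss_resid,
  regression_plus_residual = ss_regr + ss_resid)
```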
[[Back to top]](regression1.html)