matrix-code.Rmd

---
title: "Linear Algebra Code"
description: |
  A Crash Course in Linear Algebra using R and python
author: 
  - name: Alex Stephenson
output:
  distill::distill_article:
    toc: true
    toc_depth: 3
    hightlight: haddock
site: distill::distill_website
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## What's here? {#top}

This page provides code in R and python for doing linear algebra. In python, we make use of the [NumPy](https://numpy.org/doc/stable/index.html) library and the PyTorch library.

```{python}
import numpy as np 
#import torch
```

For tensor processing in R, we'll use the R port of `torch` 

```{r r-packages, eval = F}
library(torch)
```

For more on the theory of Linear Algebra consult the [Linear Algebra Notes](linalg.html). 

## Vectors 

The simplest way to represent vectors in R is by using the vector data structure. 

```{r r-vectors}
x = c(-1.1, 0.0, 3.6, -7.2)
length(x) ## 4 
```

In python:

```{python}
x = np.array([-1.1, 0.0, 3.6, -7.2])
len(x)
```

### Block and stacked vectors 

In addition to creating vectors, we can concatenate vectors together to produce blocked and stacked vectors using the `c()` function.

```{r r-stack}
x = c(1,-2)
y = c(1,1,0)
z = c(x,y)
z
```

In python: 

```{python}
x = np.array([1,-2])
y = np.array([1,1,0])
z = np.concatenate((x,y))
print(z)

```

### Some special vectors 

The Zeros vector is a default behavior of creating a vector with a given length. 

```{r r-zeros}
z = numeric(3)
z
```

In python: 

```{python}
z = np.zeros(3)
z

```

The Ones vector can be made by way of the `rep()` function. 

```{r}
o = rep(1,3)
o
```

In python: 

```{python}
o = np.ones(3)
o
```

### Random Vectors 

For simulation, it is often useful to generate random vectors to check an implementation. Alternatively, we might want to test an identity with some ground truth we already know. To generate a random vector of size $n$ 

```{r}
## Generate a vector of three values from a uniform distribution.
runif(3)
```

In python 

```{python}
## By default this is on the [0,1) interval. 
np.random.random(3)
```

We can generate additional vectors from certain distributions. The most common from a simulation standpoint are standard normal. 
```{r}
rnorm(3)
```

In python 

```{python}
np.random.randn(3)
```
[Return to Top](#top)

## Vector Addition and Multiplication 

If x and y are vectors of the same size, then x+y and x-y give their element wise sum and difference respectively. R by default computes most vector operations element wise. 

```{r}
x = c(1,2,3)
y = c(100,200,300)
x+y
```

In python: 

```{python}
x = np.array([1,2,3])
y = np.array([100,200,300])
x + y
np.add(x,y)
```


### Scalar Multiplication and division 

If a is a number and x is a vector, then we can express the scalar vector product as either `a*x` or `x*a`

```{r}
a = 2 
x = c(1,2,3)
a*x
x*a
```

In python: 

```{python}
a = 2
x = np.array([1,2,3])
a*x
x*a
```

### Elementwise operations 

We can perform elementwise operations on both R vectors and NumPy arrays. 

```{r}
p_i = c(50,40,30)
p_f = c(52,38,33)
out = (p_f - p_i)/p_i
out
```

In python 

```{python}
p_i = np.array([50, 40, 30])
p_f = np.array([52,38,33])
out = (p_f - p_i)/p_i
print(out)
```
### Using what we've learned to confirm the distributive property 
The distributive property $\beta(a+b) = \beta a + \beta b$ holds for any two n-vector *a* and *b* and any scalar $\beta$. 

```{r}
a = c(3,5,6)
b = c(2,4,9)
beta = 5
lhs = beta*(a+b)
rhs = beta*a + beta*b
print(lhs)
print(rhs)
lhs == rhs
```

In python: 

```{python}
a = np.array([3,5,6])
b = np.array([2,4,9])
beta = 5 
lhs = beta*(a+b)
rhs = beta*a + beta*b
print('lhs:', lhs)
print('rhs:', rhs)
lhs == rhs
```

### Inner Product 

The inner product of n-vector x and y is denoted $x^Ty$ 

```{r}
x = c(1,2,3,4)
y = c(3,4,6,7)

## t() is the transpose function in R
t(x)%*% y
```

In python: 

```{python}
x = np.array([1,2,3,4])
y = np.array([3,4,6,7])
np.inner(x,y) 

# Alternatively 
x @ y
```

[Return to Top](#top)

## Matrices 

A matrix $\textbf{X}$ is an $m$ x $n$ data structure that is a rectangular array of scalar numbers. The numbers $x_{ij}$ are components or elements of $\textbf{X}$. The transpose of a matrix is the $n$ x $m$ matrix $\textbf{X}'$

```{r}
## Creating a matrix in R 
X = matrix(seq(1,16,1), 
           nrow = 4, 
           byrow = T)
X
```
We can also create matrices from vectors or from data frames

```{r}
## Equivalent to above but with vectors 
X2 = rbind(1:4, 5:8,9:12,13:16)
X2
```

```{r}
## via a data frame 
df = data.frame(
  x = 1:4,
  y = 5:8,
  z = 9:12,
  w = 13:16
)
X3 = as.matrix(df)
X3
```

In python: 

```{python}
X = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12], [13,14,15,16]])
X
X.shape
```

Some other useful matrices 

```{python}
np.identity(4) 
np.zeros((4,4))
np.ones((4,4))
```

[Return to Top](#top)

### Additional Definitions 

#### Transpose 

The transpose in R 

```{r}
X_transpose = t(X)
X_transpose
```

In python: 

```{python}
X_transpose = X.T
X_transpose
```

#### Diagonal 

A diagonal matrix with all elements not on the diagonal equal to zero is a diagonal matrix. By default, R creates an identity matrix with the `diag()` function.

```{r}
dM = diag(4)
dM
```

#### Trace 

The trace of a matrix is the sum of the diagonal elements 

$$trace(X) = \sum_{i=1}^n x_{ii}$$ 

```{r}
matrix_trace = function(mat){
  return(sum(diag(mat)))
}

matrix_trace(X)
```

#### Inverses 

A matrix is invertible if it is not singular. The inverse is useful because we can use it to solve the system of equations represented by the matrix. An inverse indicates whether the system has a solution. 

```{r}
A = rbind(c(1,-2,3), c(0,2,2), c(-4,-4,-4))
solve(A)
```

In python 

```{python}
A = np.array([[1,-2,3], [0,2,2], [-4,-4,-4]])
np.linalg.inv(A)
```

[Return to Top](#top)

### Matrix Arithmetic 

Addition and subtraction of matrices of the same order are performed element by element. Scalar multiplication is element by element 

```{r}
A = matrix(data = seq(1,9,1), nrow = 3, byrow = T)
B = matrix(data = seq(1,9,1), nrow = 3, byrow = T)
A+B

```

Provided that the number of columns of A equals the number of rows of B, we can multiple A by B. 

```{r}
## To get the appropriate multiplication, we wrap * in %*%
A%*%B
```

Note that we can perform X'X in one of two ways. 

```{r}
t(A)%*%A 

## same but can be slightly faster 
crossprod(A)

```

In python: 

```{python}
A = np.array([[1,2,3], [4,5,6], [7,8,9]])
B = np.array([[1,2,3], [4,5,6], [7,8,9]])
A+B
np.add(A,B)

# Multiplication 
np.matmul(A,B)

np.matmul(A.T, A)
```

[Return to Top](#top)

## Norms 

### Conceptual Definition 

Informally, the norm of a vector tells us how *big* it is. Formally, a norm is a function $\lVert \cdot \rVert$ that maps a vector to a scalar which satisfies:

1. Given any vector **x** $\lVert \alpha x \rVert = \lVert \alpha \rVert \lvert x \rVert$. 
2. For any vectors **x**, **y**, norms satisfy the triangle inequality $\lVert x + y \rVert \leq \lVert x \rVert + \lVert y \rVert$
3. The norm of a vector $\lVert x \rVert > 0,\forall x \neq 0$

### Euclidean ($\ell_2$) Norm 

The most common norm is the $\ell_2$ norm or Euclidean norm which is defined as $\lVert x \rVert_2 = \sqrt{\sum_{i=1}^n x_i^2}$

```{r}
x = c(3,-4)
l2 = sqrt(crossprod(x,x))
l2

## We can also call R's built in norm() function 
norm(x, "2")
```

In python 

```{python}
x = np.array([3,-4])
np.linalg.norm(x)
```

### Manhattan ($\ell_1$) Norm 

We sometimes take the Manhattan norm ($\ell_1$) which sums the absolute values of a vectors elements $\lVert x \rVert_1 = \sum_{i=1}^n|x_i|$

```{r}
x = c(3,-4)
l1 = function(x){
  return(sum(abs(x)))
}
l1(x)

```

In python 

```{python}
x = np.array([3,-4])
np.linalg.norm(x, 1)

```

### Infinity ($\ell_\infty$) Norm 

The infinity norm or max norm ($\ell_\infty$) is common in machine learning applications defined as $\lVert x \rVert_\infty = \max_i |x_i|$, which simplifies to the absolute value of the element with the largest magnitude in the vector. 

```{r}
x = matrix(c(3,-4), byrow = T, nrow = 2)
norm(x, type ="I")

l_inf = function(x){
  return(max(abs(x)))
}
l_inf(c(3,-4))
```

In python 

```{python}
x = np.array([3,-4])
max(abs(x))
```

### Distance 

The distance between two vectors is $d(x,y) = \lVert x-y \rVert$. 
```{r}
l2 = function(x, y){
  d = x - y 
  return(sqrt(t(d)%*%d))
}
u = c(1.8, 4, 6, 24)
v = c(2,4,6,8)
l2(u,v)
```

In python 

```{python}
u = np.array([1.8, 4, 6, 24])
v = np.array([2,4,6,8])
np.linalg.norm(u-v)
```

[Return to Top](#top)

## Taylor Approximation 

The (first order) Taylor approximation of some function $f: \textbf{R}^n \rightarrow \textbf{R}$ at the point $z$ is the affine function $\hat{f}(x) = f(z) + \nabla f(z)'(x-z)$

In python 

```{python}
## Suppose f(x) = x1 + exp(x2 - x1)
f = lambda x: x[0] + np.exp(x[1]-x[0])
grad_f = lambda z: np.array([1 - np.exp(z[1]-z[0]), np.exp(z[1] - z[0])])
z = np.array([1,2])

f_hat = lambda x: f(z) + grad_f(z) @ (x-z)
f([1,2]), f_hat([1,2])
f([.96, 1.98]), f_hat([.96, 1.98])
```

[Return to Top](#top)

## Root Mean Square 

The root mean square of a vector $x$ is $\frac{\lVert x\rVert}{\sqrt{n}}$. 

```{r}
rms = function(x){
  num = sqrt(t(x)%*%x)
  den = sqrt(length(x))
  return(num/den)
}
t = c(0,1.01, 0.01)
x = cos(10*t) - 2*sin(11*t)
rms(x)
```

In python 

```{python}
rms = lambda x: (sum(x**2)**0.5)/(len(x)**0.5)
t = np.array([0,1.01, 0.01])
x = np.cos(10*t) - 2*np.sin(11*t)
rms(x)
```

[Return to Top](#top)

## Common Transformations 

### Demeaning a Vector 

A demeaned vector is the result of performing ${x - E[x]\textbf{1}}$. If we want this to be normalized, we then divide by the standard deviation $\sigma_x = \frac{\lVert x - E[x]\textbf{1}\rVert}{\sqrt{n}}$. 

```{r, warning = F}
scale_h = function(x){
  demean = x - as.numeric(mean(x))
  demean = as.vector(demean)
  std = (sqrt(crossprod(demean)))/sqrt(length(x))
  return(demean/std)
}

x = c(2,4,5,7)
scale_h(x)
```

In python 

```{python}
def scale_h(x):
  demean = x - sum(x)/len(x)
  std = np.linalg.norm(x - sum(x)/len(x))/(len(x)**0.5)
  return demean / std

x = np.array([2,4,5,7])
scale_h(x)
```

### Correlation 

The correlation coefficient between two vectors $a$ and $b$ is defined $\rho = \frac{\tilde{x}^T\tilde{y}}{\lVert \tilde{x} \rVert \lVert \tilde{y} \rVert}$ presuming that both standard deviations are non-zero. The $\tilde{x}$ means that we have taken a demeaned version of $a$. 

```{r}
corr_c = function(x,y){
  x_tilde = x - as.numeric(mean(x))
  y_tilde = y - as.numeric(mean(y))
  x_tn = sqrt(crossprod(x_tilde))
  y_tn = sqrt(crossprod(y_tilde))
  return((t(x_tilde)%*%y_tilde)/ (x_tn*y_tn))
}

x= c(4.4, 9.4, 15.4, 12.4, 10.4, 1.4, -4.6, -5.6, -0.6, 7.4)
y = c(6.2, 11.2, 14.2, 14.2, 8.2, 2.2, -3.8, -4.8, -1.8, 4.2)
corr_c(x,y)
```

In python 

```{python}
def corr_c(x,y):
  x_tilde = x - sum(x)/len(x)
  y_tilde = y - sum(x)/len(x)
  denom = np.linalg.norm(x_tilde) * np.linalg.norm(y_tilde)
  return (x_tilde @ y_tilde)/denom 

x = np.array([4.4, 9.4, 15.4, 12.4, 10.4, 1.4, -4.6, -5.6, -0.6, 7.4])
y = np.array([6.2, 11.2, 14.2, 14.2, 8.2, 2.2, -3.8, -4.8, -1.8, 4.2])
corr_c(x,y)
```

[Return to Top](#top)

## Ordinary Least Squares

### OLS in Matrix Form 

The simplest linear model expresses the dependence of a dependent or response variable y on independent variables $x_1,.., x_p$ and is usually written $y = X\beta + \epsilon$. See the [Lecture Notes](stats.html) for more details on the properties of this model. 

Define the design matrix as the $n \times p$ matrix of independent variables $x_1,..,x_p$ and assume that the first columns is a column of ones and that the design matrix has full rank. Then the usual OLS estimator is defined as $(X'X)^{-1}X'Y$

```{r}
beta_estimator = function(X,y){
  X = cbind(rep(1,nrow(X)), X)
  betas = solve(t(X)%*%X)%*%t(X)%*%y
  return(betas)
}

## example data 
set.seed(123)
x1 = rnorm(10000)
x2 = rnorm(10000)
y = 2*x1 + 4*x2 + runif(10000)

X = cbind(x1, x2)
beta_estimator(X,y)

```
 
In python: 

```{python}
x1 = np.random.default_rng(seed=123).normal(0, 1, size =1000)
x2 = np.random.default_rng().normal(0, 1, size =1000)
ones = np.ones(1000)
y = 2*x1 + 4*x2 + np.random.default_rng().uniform(size = 1000)

X = np.concatenate((ones, x1, x2)).reshape((-1,3), order = 'F')

# Alternatively we can make a matrix with column_stack()
X = np.column_stack((ones, x1, x2))

# (X'X)^-1X'y
np.linalg.inv(X.T @ X) @ X.T @ y
```

### Frisch-Waugh-Lovell (FWL)

FWL states that any coefficient in an OLS fit is equivalent to the coefficient estimated from a bivariate model in which the residualized outcome is regressed onto the residualized component of the predictors, and the residuals are taken from models regressing the outcome and the coefficient on all other coefficients in the OLS fit separately. 

```{r}
fwl = function(){
  ### Regress X1 on X2 (Partialling out/orthogonalization)
  
  ### Compute the residuals x1_tilde (variation not explained by X2)
  
  ### regress Y on X2 
  
  ### Compute residuals y_tilde (variation not explained by X2)
  
  ### regress y_tilde on x1_tilde
}
```

In python

```{python}
def FWL():
  pass
```


[Return to Top](#top)

## Covariance Matrices and Standard Errors

### Classical Standard Errors 

The least squares solution gives us point estimates for coefficients, but if we want to do inference, we need to get standard errors. See the [Lecture Notes](lectures.html) for more details. 

To get standard errors, we must first calculate the covariance matrix of our estimates and then take the square root of the diagonal. 

```{r}
beta_estimator = function(X,y){
  X = cbind(rep(1,nrow(X)), X)
  betas = solve(t(X)%*%X)%*%t(X)%*%y
  return(betas)
}

betas_and_std_errors = function(X,y){
  betas = beta_estimator(X,y)
  ## get the design matrix again
  X = cbind(rep(1,nrow(X)), X)
  residuals = y - X %*% betas 
  
  ## Degree of freedom calculation 
  p = ncol(X) - 1 
  df = nrow(X) - p - 1 
  
  ## Residual variance 
  res_var = sum(residuals^2) / df 
  
  ## Covariance matrix of estimate 
  ## cov(\hat{\beta}|X) = (X'X)^-1X'cov(\epsilon|X)X(X'X)^-1
  beta_cov = res_var * solve(t(X)%*%X)
  
  ## Standard errors are square root of diagonal 
  return(list(beta = betas, se = sqrt(diag(beta_cov))))
}

## example data 
## To keep consistent with python examples I use pre-generated 
## random variables created in python with:
# x1 = np.random.default_rng(seed=123).normal(0, 1, size =1000)
# x2 = np.random.default_rng().normal(0, 1, size =1000)
# ones = np.ones(1000)
# y = 2*x1 + 4*x2 + np.random.default_rng().uniform(size = 1000)

x1 = read.csv("x1.csv", header = F) |>
  unlist()
x2 = read.csv("x2.csv", header = F) |>
  unlist()
y = read.csv("y.csv", header = F) |>
  unlist()
X = cbind(x1, x2)

betas_and_std_errors(X,y)
```

In python 

```{python}
def betas_and_se(X,y):
  betas = np.linalg.inv(X.T @ X) @ X.T @ y
  residuals = y - X @ betas
  df = X.shape[0] - X.shape[1]
  res_var = np.sum(residuals**2) / df
  cov_mat = res_var * np.linalg.inv(X.T @ X)
  se = np.sqrt(np.diag(cov_mat))
  return betas, se


x1 = np.loadtxt("x1.csv", delimiter = ",", dtype = float)
x2 = np.loadtxt("x2.csv", delimiter = ",", dtype = float)
y = np.loadtxt("y.csv", delimiter = ",", dtype = float)

# One way to make a matrix from vectors 
# X = np.concatenate((ones, x1, x2)).reshape((-1,3), order = 'F')

# Alternatively we can make a matrix with column_stack()
X = np.column_stack((ones, x1, x2))

betas, se = betas_and_se(X,y)
print("betas:", betas)
print("SE:", se)
```

### Sandwich Standard Errors 

```{r}
betas_and_std_errors_sandwich = function(X,y){
  betas = beta_estimator(X,y)
  ## get the design matrix again
  X = cbind(rep(1,nrow(X)), X)
  residuals = y - X %*% betas 

  ## Degree of freedom calculation 
  p = ncol(X) - 1 
  df = nrow(X) - p - 1 
  
  
  ## HC1 or Eicker-Huber_White Variance Estimator 
  ## This is a way of creating a diagonal matrix from a matrix
  ## with one column in R. 
  u2 = matrix(diag(as.vector(residuals^2)), ncol = nrow(X))
  beta_cov = (nrow(X)/df) * solve(t(X)%*%X) %*% t(X) %*% u2 %*% X  %*% solve(t(X)%*%X)
  
  ## Standard errors are square root of diagonal 
  return(list(beta = betas, se = sqrt(diag(beta_cov))))
}

## example data 
x1 = read.csv("x1.csv", header = F) |>
  unlist()
x2 = read.csv("x2.csv", header = F) |>
  unlist()
y = read.csv("y.csv", header = F) |>
  unlist()
X = cbind(x1, x2)

betas_and_std_errors_sandwich(X,y)
```

In python 

```{python}
def betas_and_std_errors_sandwich(X,y):
  ## Calculate Beta coefficients
  betas = np.linalg.inv(X.T @ X) @ X.T @ y
  
  ## Get residuals, degree of freedom, and squared residuals
  residuals = y - X @ betas
  df = X.shape[0] - X.shape[1]
  u2 = residuals**2

  ## apply the HC1 formula with appropriate correction 
  beta_cov = (X.shape[0]/df) * np.linalg.inv(X.T @ X) @ X.T @ np.diag(u2) @ X @ np.linalg.inv(X.T @ X)
  se = np.sqrt(np.diag(beta_cov))
  return betas, se

x1 = np.loadtxt("x1.csv", delimiter = ",", dtype = float)
x2 = np.loadtxt("x2.csv", delimiter = ",", dtype = float)
y = np.loadtxt("y.csv", delimiter = ",", dtype = float)

X = np.column_stack((ones, x1, x2))

betas, se = betas_and_std_errors_sandwich(X,y)
print("betas:", betas)
print("SE:", se)
```

[Return to Top](#top)

## Principal Components Analysis 

Suppose we have a collection of points in $\mathbb{R}^n$ and we want to encode these points to represent a lower-dimensional version of them. If $X'$ is a $n \times p$ matrix, then the first principal component of $X'$ is the linear combination of the $p$ variables $y'_1 = (X-\bar{X})'a_1$ s.t $V(y_1)'$ is maximized subject to the constraint that $a_1'a_1 =1$. Subsequent principal components are defined successively in a similar way. 

```{r}
options(scipen=999)

scale_and_center = function(x){
  ## center columns 
  x_s = x - mean(x)
  
  ## return scaled columns 
  return(x_s/sd(x))
}

prcomp_by_hand = function(A) {
  ## Calculate mean of each column
  C = apply(A, 2, scale_and_center)
  
  ## Calculate covariance matrix of centered matrix
  V = cov(C)
  ## Eigendecomposition of covariance matrix
  eig = eigen(V, symmetric = F)
  ## Transpose eigenvectors
  eig.t = t(eig$vectors)
  ## calculate new dataset
  A.new = eig.t %*% t(C)
  df.new = t(A.new)
  return(list(points = df.new, vectors = eig$vectors))
}

results = prcomp_by_hand(USArrests)
results$vectors 
head(results$points)
```

Using built-in function in R.

```{r}
## As Brian Ripley pointed out on R-help back in 2003
## using different compilers on the same machine and 
## the same version of R may give different signs for the eigenvectors. 
## The moral is, don't rely on the signs of eigenvectors! 
## (This is on the help page.)
t = prcomp(USArrests, center = T,scale = T)
head(-1*t$rotation)
head(-1*t$x) 
```


In python 

```{python}
np.set_printoptions(suppress=True)

def scale(mat):
  center = mat - np.mean(mat, axis = 0)
  scale = center / np.std(mat, axis = 0, ddof = 1)
  return scale

def pca(mat):
  center = mat - np.mean(mat, axis = 0)
  scale = center / np.std(mat, axis = 0, ddof = 1)
  cov_mat = np.cov(scale, rowvar = False)
  vals, vec = np.linalg.eig(cov_mat)
  
  ## Sort eigen vectors and eigen values in order 
  idx = (-vals).argsort()
  vals = vals[idx]
  vec = vec[:, idx]
  ## Calculate new dataset 
  A_new = (vec.T @ scale.T).T
  return vec, A_new

## Test with the same UArrests dataset 

## We need to do a bit of cleaning of the raw dataset to 
## turn it into an appropriate matrix
usarrests = np.loadtxt("USArrests.csv", delimiter = ",", dtype = str, skiprows = 1)
usarrests = np.delete(usarrests, (0), axis = 1)
usarrests = usarrests.astype(dtype = "float")
vec, transformed = pca(usarrests)
print('Principal Components:', vec)
transformed[:6]

```

[Return to Top](#top)

## K-Means 

[Return to Top](#top)

## Tensors 

Tensors are generic $n^{th}$ order arrays. Vectors are a 1st order tensor. Matrices are a second order tensor. 

```{r, eval = F}
t1 = torch_tensor(1)
```

In python using `PyTorch`

```{python, eval = F}
t1 = torch.tensor(1.0)
t1
```

[Return to Top](#top)