Spatial Regression Models

Startup

Import data

library(sf)
WIfinal = st_read("Data/Spatial/wi_final_census2_random4.shp")

## Reading layer `wi_final_census2_random4' from data source `D:\GIS\TDM\Transport-Demand-Modelling\Data\Spatial\wi_final_census2_random4.shp' using driver `ESRI Shapefile'
## Simple feature collection with 417 features and 34 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -88.5423 ymin: 42.84136 xmax: -87.79183 ymax: 43.54352
## CRS:            NA

class(WIfinal) #it is a spatial feature dataset

## [1] "sf"         "data.frame"

View(WIfinal)

FIPS	MSA	TOT_POP	POP_16	POP_65	WHITE_	BLACK_	ASIAN_	HISP_	MULTI_RA	MALES	FEMALES	MALE1664	FEM1664	EMPL16	EMP_AWAY	EMP_HOME	EMP_29	EMP_30	EMP16_2	EMP_MALE	EMP_FEM	OCC_MAN	OCC_OFF1	OCC_INFO	HH_INC	POV_POP	POV_TOT	HSG_VAL	BLACK1	BLACK_R	PCTBLACK	PCTBLCK	polyid	geometry
55131430100	Milwaukee	5068	1248	429	5005	5	6	32	17	2610	2458	1763	1628	2817	2690	127	1852	838	2854	1563	1291	477	456	44	58295	5057	185	157200	5	2201	0.860631	0.000987	1	MULTIPOLYGON (((-88.28074 4…
55089610100	Milwaukee	8003	1812	667	7720	35	36	129	59	3999	4004	2760	2764	4476	4237	239	2930	1307	4544	2386	2158	817	700	96	55124	7160	164	145900	35	26	0.005959	0.004373	2	MULTIPOLYGON (((-87.8117 43…
55131410100	Milwaukee	4393	1026	534	4320	2	19	19	27	2198	2195	1446	1387	2389	2316	73	1636	680	2418	1306	1112	466	352	23	51769	4327	211	129800	2	97	0.030012	0.000455	3	MULTIPOLYGON (((-88.16157 4…
55131400101	Milwaukee	7687	1801	703	7509	6	7	106	57	3943	3744	2652	2531	4296	4137	159	2637	1500	4358	2360	1998	736	896	61	62083	7682	224	162600	6	320	0.141892	0.000781	4	MULTIPOLYGON (((-88.16078 4…
55131420104	Milwaukee	5086	1065	821	4957	64	0	11	0	2485	2601	1598	1602	2701	2632	69	1767	865	2787	1479	1308	423	510	48	51858	5086	160	156000	64	40	0.010384	0.012584	5	MULTIPOLYGON (((-88.21622 4…
55131420102	Milwaukee	7619	1943	534	7253	0	143	103	15	3891	3728	2643	2499	4016	3898	118	2557	1341	4066	2187	1879	815	678	86	51844	7468	296	128800	0	2258	0.868852	0.000000	6	MULTIPOLYGON (((-88.16137 4…

Project your spatial data

R does not know in which coordinate system our data is (CRS = _NA_), and assumes it is in WGS84 (read more about it here). WGS84 is the global coordinate system for GPS data, for instance.
But it is not projected in a plan, as a cartesian XY, so if we want do deal with distances in meters (and not angular distances), we should project the data in another CRS, the 3857 (see more).

st_crs(WIfinal) = 4326 #set the crs to WGS84
WIfinal = st_transform(WIfinal, 3857) #project the data

Let’s look at the distribution of Hispanic people with a map, using a quantile classification.

summary(WIfinal$HISP_)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    44.0    91.0   225.1   192.0  3613.0

library(tmap) #tmap package
tm_shape(WIfinal) + 
  tm_polygons(style = "quantile", col = "HISP_", title= "Hispanic people", legend.hist = TRUE) +
  tm_legend(outside = TRUE, text.size = .8)

You can use tmap() as interactive view mode. Example, using the same command.

tmap_mode("view") #you can choose "plot" (as above) or "view"

## tmap mode set to interactive viewing

tm_shape(WIfinal) + 
  tm_polygons(style = "quantile", col = "HISP_", title= "Hispanic people")

Global spatial autocorrelation

For polygon geometry

Neighbors

The first step requires that we define “neighboring” polygons. This could refer to contiguous polygons, polygons within a certain distance band, or it could be non-spatial in nature and defined by social, political or cultural “neighbors”.

Here, we’ll adopt a contiguous neighbor definition where we’ll accept any contiguous polygon that shares at least on vertex (this is the “queen” case and is defined by setting the parameter queen=TRUE). If we required that at least one edge be shared between polygons then we would set queen=FALSE (rook neighbours).

library(spdep)
neighbors <- poly2nb(WIfinal, queen=TRUE) #queen
neighbors

## Neighbour list object:
## Number of regions: 417 
## Number of nonzero links: 2628 
## Percentage nonzero weights: 1.511309 
## Average number of links: 6.302158

neighbors_rook <- poly2nb(WIfinal, queen=F) #rook
neighbors_rook

## Neighbour list object:
## Number of regions: 417 
## Number of nonzero links: 2368 
## Percentage nonzero weights: 1.361788 
## Average number of links: 5.678657

Weights

Next, we need to assign weights to each neighboring polygon. In this case, each neighboring polygon will be assigned equal weight (style="W"). Style can take values “W”, “B”, “C”, “U”, “minmax” and “S”.
Use ?nb2listw to see more details.

weightsW = nb2listw(neighbors, style="W")
weightsW

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 417 
## Number of nonzero links: 2628 
## Percentage nonzero weights: 1.511309 
## Average number of links: 6.302158 
## 
## Weights style: W 
## Weights constants summary:
##     n     nn  S0       S1       S2
## W 417 173889 417 140.0808 1714.318

Moran’s I test

The correlation score is between -1 and 1. Much like a correlation coefficient:

1 determines perfect positive spatial autocorrelation (so your data is clustered)
0 identifies the data is randomly distributed, and
-1 represents negative spatial autocorrelation (so dissimilar values are next to each other).

To get the Moran’s I value use the moran.test() function.

moran.test(WIfinal$HISP_, weightsW)

## 
##  Moran I test under randomisation
## 
## data:  WIfinal$HISP_  
## weights: weightsW    
## 
## Moran I statistic standard deviate = 29.937, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.8194219433     -0.0024038462      0.0007535949

What can you say about this result?

Moran test with Monte Carlo simulation

Note that the p-value computed from the moran.test function is not computed from an Monte Carlo simulation but analytically instead. This may not always prove to be the most accurate measure of significance.
To test for significance using the Monte Carlo simulation method instead, use the moran.mc function.

#for a Monte Carlo simulation with 600 rounds
moran.mc(WIfinal$HISP_, weightsW, nsim=599)

## 
##  Monte-Carlo simulation of Moran I
## 
## data:  WIfinal$HISP_ 
## weights: weightsW  
## number of simulations + 1: 600 
## 
## statistic = 0.81942, observed rank = 600, p-value = 0.001667
## alternative hypothesis: greater

Plot the distribution (note that this is a density plot instead of a histogram).

plot(moran.mc(WIfinal$HISP_, weightsW, nsim=599), main="", las=1) #density plot

Local spatial autocorrelation

Local Moran

Moran scatterplot

moran.plot(WIfinal$HISP_, listw = weightsW)

Notice how the plot is split in 4 quadrants. The top right corner belongs to areas that have high level of Hispanic people and are surrounded by other areas that have above the average level of Hispanic people This are the high-high locations. The bottom left corner belongs to the low-low areas. These are areas with low level of Hispanic people and surrounded by areas with below average levels of Hispanic people. Both the high-high and low-low represent clusters.
A high-high cluster is what you may refer to as a hot spot. And the low-low clusters represent cold spots. In the opposite diagonal we have spatial outliers. They are not outliers in the standard sense, extreme observations, they are outliers in that they are surrounded by areas that are very unlike them. So you could have high-low spatial outliers, areas with high levels of Hispanic people and low levels of surrounding Hispanic people, or low-high spatial outliers, areas that have themselves low levels of Hispanic people (or whatever else you are mapping) and that are surrounded by areas with above average levels of Hispanic people.

Local Moran statistics

localmoranstats <- localmoran(WIfinal$HISP_, weightsW)
summary(localmoranstats)

##        Ii                E.Ii               Var.Ii            Z.Ii            Pr(z > 0)     
##  Min.   :-0.23179   Min.   :-0.002404   Min.   :0.0536   Min.   :-0.75376   Min.   :0.0000  
##  1st Qu.: 0.03026   1st Qu.:-0.002404   1st Qu.:0.1332   1st Qu.: 0.08241   1st Qu.:0.3558  
##  Median : 0.09557   Median :-0.002404   Median :0.1558   Median : 0.24318   Median :0.4039  
##  Mean   : 0.81942   Mean   :-0.002404   Mean   :0.1650   Mean   : 2.20017   Mean   :0.3845  
##  3rd Qu.: 0.14393   3rd Qu.:-0.002404   3rd Qu.:0.1874   3rd Qu.: 0.36976   3rd Qu.:0.4672  
##  Max.   :33.54955   Max.   :-0.002404   Max.   :0.9455   Max.   :92.01262   Max.   :0.7745

The outputs of this statistics’ table are defined as:

Ii: local moran statistic. One for each area.
E.Ii: expectation of local moran statistic
Var.Ii: variance of local moran statistic
Z.Ii: standard deviate of local moran statistic
Pr(): p-value of local moran statistic

Let’s map the local moran statistics Ii and p-value

moranmap <- cbind(WIfinal, localmoranstats) #first, bind the statistics to the original data
names(moranmap)[39] <- "Pvalue" #change the name of this variable to make it easier to call it

tm_shape(moranmap) +
  tm_polygons(col = "Ii", style = "pretty", title = "local Moran's statistic")

## Variable(s) "Ii" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

tm_shape(moranmap) +
  tm_polygons(
    col = "Pvalue",
    breaks = c(-Inf, 0.01, 0.05, 0.1, 0.15, Inf),
    palette = "-Blues",
    title = "local Moran's I p-values")

A positive value for Ii indicates that the unit is surrounded by units with similar values.
Notice that with this variable we only have significant values at a confidence level of 90% (pvalue > 0.1).

LISA clusters

“Everything is related to everything else, but near things are more related than distant things.” - The First Law of Geography (Tobler)

In order to produce the LISA map we need to do some previous work. First we are going to create some new variables that we are going to need.
First we scale the variable of interest. When we scale HISP_ what we are doing is re-scaling the values so that the mean is zero.
We also want to account for the spatial dependence of our values, so we create a spatial lag variable with lag.listw(). Spatial lag is when the dependent variable y in place i is affected by the independent variables in both place i and j. This will be important to keep in mind when considering spatial regression. With spatial lag in ordinary least square regression, the assumption of uncorrelated error terms is violated, because near things will have associated error terms. Similarly, the assumption of independent observations is also violated, as the observations are influenced by the other observations near them. As a result, the estimates are biased and inefficient. Spatial lag is suggestive of a possible diffusion process – events in one place predict an increased likelihood of similar events in neighboring places.

#scale the variable of interest and save it to a new column
moranmap$HISP_scale <- as.vector(scale(moranmap$HISP_))
summary(moranmap$HISP_scale)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.49380 -0.39729 -0.29420  0.00000 -0.07266  7.43120

#create a spatial lag variable and save it to a new column
moranmap$HISP_lag <- lag.listw(weightsW, moranmap$HISP_scale)
summary(moranmap$HISP_lag)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.473697 -0.347207 -0.264430  0.004558 -0.074120  4.905265

Then we need to create a variable to distinguish in which quadrant each observation is (recall the Moran scatterplot!).

library(tidyverse)
siglevel = 0.15 #we can change to different levels
moranmap <- moranmap %>%  mutate(quad_sig = ifelse(moranmap$HISP_scale > 0 & 
                              moranmap$HISP_lag > 0 & 
                              moranmap$Pvalue <= siglevel, 
                     "high-high",
                     ifelse(moranmap$HISP_scale <= 0 & 
                              moranmap$HISP_lag <= 0 & 
                              moranmap$Pvalue <= siglevel, 
                     "low-low", 
                     ifelse(moranmap$HISP_scale > 0 & 
                              moranmap$HISP_lag <= 0 & 
                              moranmap$Pvalue <= siglevel, 
                     "high-low",
                     ifelse(moranmap$HISP_scale <= 0 & 
                              moranmap$HISP_lag > 0 & 
                              moranmap$Pvalue <= siglevel,
                     "low-high", 
                     "non-significant")))))
moranmap$quad_sig = factor(moranmap$quad_sig)
table(moranmap$quad_sig)

## 
##       high-high non-significant 
##              31             386

And now let’s put the results in a map!

# palcolor = c("red",rgb(1,0,0,alpha=0.4),rgb(0,0,1,alpha=0.4),"blue", "white") #define the color palette for the 5 categories (HH, HL, LH, LL, NS)
palcolor = c("red", "white") #the color palette for only 2 categories
tmap_mode("plot")
tm_shape(moranmap)+
  tm_polygons(col = "quad_sig", palette = palcolor, title = "local moran statistic")+
  tm_legend(outside = TRUE)

What can you conclude?
Now, try with other variable, for example the % of Black people per tract (PCTBLCK).

See more here, here, and here

For points

Some of these analysis may also be performed for points, instead of polygons. How? ##### Install packages Install and load ape package

# install.packages("ape")
library(ape)

Prepare data

It does not deal with ordered factors, zeros, or infinite distances.
So we need to clean data first.

str(TABLE)
TABLE$classfactor<-as.numeric(TABLE$CLASS) #make ordered factors as numeric
TABLE$classfactor<-factor(TABLE$classfactor)
TABLEmoran<-TABLE
TABLEmoran$geometry<-NULL #drop geometry
TABLEmoran<-na.omit(TABLEmoran) #remove cases with NA
TABLEmoran<-TABLEmoran[TABLEmoran$Orig_Lat!=0,] #Remove cases with Lat/Lon equals to zero

Distances matrix, from coordinates (Lat Long)

To calculate Moran’s I, we will need to generate a matrix of inverse distance weights. In the matrix, entries for pairs of points that are close together are higher than for pairs of points that are far apart.

We can first generate a distance matrix, then take inverse of the matrix values and replace the diagonal entries with zero:

DISTANCES <- as.matrix(dist(cbind(TABLEmoran$Orig_Long, TABLEmoran$Orig_Lat)))
DISTANCESinv <- 1/DISTANCES
diag(DISTANCESinv) <- 0 #diagonal as zero
DISTANCESinv[is.infinite(DISTANCESinv)] <- 0 #remove infinite distances

We have created a matrix where each off-diagonal entry [ i, j] in the matrix is equal to 1/(distance between point i and point j). Note that this is just one of several ways in which we can calculate an inverse distance matrix.

Moran’s I test

We can now calculate Moran’s I using the command Moran.I.

#First attempt
Moran.I(TABLEmoran$classfactor, DISTANCESinv)
#Remove distances over 15 km
DISTANCESbin <- (DISTANCES > 0 & DISTANCES <= 15000)

#Second attempt
Moran.I(TABLEmoran$classfactor, DISTANCESbin) #Moran’s I =0.012, p = .001

Note: The result (observed) is the Moran’s I value, and if it is enough close to zero, we can affirm (with p=…) that ther is not a spatial pattern, suggesting an aleatory distribution in space. Tf the result was close to 1 or -1, it would suggest a pattern in distribuition in space.

See more here