# Methods {#sec-methods}
```{r setup, file = "R/chapter_start.R", include = FALSE, cache = FALSE}
# a number of commands need to run at the beginning of each chapter. This
# includes loading libraries that I always use, as well as options for
# displaying numbers and text.
library(mlogit)
library(wesanderson)
library(kableExtra)
library(ggspatial)
```
The nutrition access literature is extensive and approaches the problem from numerous
angles, including public health, urban science and economics, and social justice.
In general, researchers have sought to link spatial access to nutrition with health outcomes such as obesity and caloric intake.
Comprehensive, though somewhat dated, reviews of this literature are available in
@beaulac2009 and @walker2010. More recent work has extended the description
and refinement of the measures used to evaluate food access and control for
confounding variables. @widener2014 considered that temporal access to quality food is as important as spatial access.
@aggarwal2014 suggested that spatial access was not as important as store choice, given that most people were not observed to shop at the nearest vendor.
By contrast, @chen2016 compared spatial access to quality food vendors with
observed food expenditures and showed that poor access explained obesity even when controlling for consumption. @cooksey-stowers2017 jointly examined spatial
proximity and the quality of offerings and showed that the latter might be more predictive of obesity rates.
What has not been frequently attempted in the nutrition access literature,
however, is a serious comparison of multiple alternative policies to address the
problem, which would require a multi-dimensional analysis of spatial access,
store quality, and observed tradeoffs between the two. @macfarlane2021b
illustrated the potential for a utility-based model of access to establish
relationships between urban green space access and health, and then continued
that methodology into a policy analysis of park space during COVID-19 [@macfarlane2022a]. Applying this methodology to nutrition access is motivated both by these previous attempts and by
the lack of clear policy solutions [@wright2016].
This section describes how we construct a model of access to grocery stores in
communities in Utah. We first describe the theoretical model, and then describe
data collection efforts to estimate this model and apply it.
## Model
A typical model of destination choice [@recker1978] can be described as a random
utility maximization model where the utility of an individual $i$ choosing a
particular destination $j$ is

$$ U_{ij} = \beta_{s}f(k_{ij}) + \beta_{x}(X_j) $$ {#eq-utility}

where $f(k_{ij})$ is a function of the travel impedance or costs from $i$ to $j$ and
$X_{j}$ represents the location attributes of $j$. The coefficients $\beta$ can
be estimated given sufficient data revealing the choices of individuals. The
probability that an individual at location $i$ will choose alternative $j$ from a
choice set $J$ can be estimated with a multinomial logit model (MNL)
[@mcfadden1974],

$$ P_i(j) = \frac{\exp(U_{ij})}{\sum_{j' \in J}{\exp(U_{ij'})}}$$ {#eq-mnl}

The
overall fit of the model can be described with the Akaike Information Criterion
(AIC) --- which should be minimized --- or by the McFadden likelihood ratio
$\rho^2_0 = 1 - \ln\mathcal{L} / \ln\mathcal{L}_0$. In this ratio
$\ln{\mathcal{L}}$ is the model log-likelihood and $\ln{\mathcal{L}_0}$ the
log-likelihood of an alternative model where all destinations are equally
likely; a higher $\rho^2_0$ value indicates more explanatory power relative to
this null, random chance only model.
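
As a minimal sketch of @eq-mnl and the $\rho^2_0$ statistic, the following R code computes choice probabilities for a single hypothetical individual facing three stores. The utility values and the observed choice are invented for illustration and are not estimates from this study.

```r
# Hypothetical systematic utilities U_ij for one individual and three stores
u <- c(store_a = -1.2, store_b = -0.4, store_c = -2.0)

# Multinomial logit probabilities: exp(U_ij) / sum over j' of exp(U_ij')
mnl_prob <- function(u) {
  eu <- exp(u - max(u))  # subtracting the maximum avoids numerical overflow
  eu / sum(eu)
}
p <- mnl_prob(u)

# Log-likelihood if this individual was observed choosing store_b
ll_model <- log(p[["store_b"]])
# Null model: every alternative is equally likely
ll_null <- log(1 / length(u))
# McFadden likelihood ratio index relative to the equal-shares null
rho2_0 <- 1 - ll_model / ll_null
```

With a full estimation dataset, the two log-likelihoods would be summed over all observed choices before taking the ratio.
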
The idea of using destination choice logsums as accessibility terms is not new,
and the theory for doing so is described in @ben-akiva1985 [p. 301]. Effectively,
the natural logarithm of the denominator in @eq-mnl represents the consumer
surplus --- or total benefit --- available to person $i$:
$$ CS_i = \ln\left(\sum_{j \in J} \exp(U_{ij})\right)$$ {#eq-cs}
A difference in logsum measures may exist for a number of reasons that affect
the utility functions described in @eq-utility. For example, individuals at
different locations or with different mobility will see different impedance
values $k_{ij}$ and therefore different utilities. Changes to the attributes of the
destinations $X_j$ will likewise affect the utility.
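
The logsum in @eq-cs and the two sources of difference described above can be illustrated with a short R sketch. All utility values here are hypothetical; the example only shows the mechanics of comparing logsums between locations and between scenarios.

```r
# Consumer surplus as the logsum of the utilities in a choice set
logsum <- function(u) {
  m <- max(u)
  m + log(sum(exp(u - m)))  # numerically stable log(sum(exp(u)))
}

# Hypothetical utilities of the same three stores seen from two locations
u_near <- c(-0.5, -1.0, -2.5)  # low travel impedance
u_far  <- c(-2.0, -2.5, -4.0)  # higher impedance to every store

cs_near <- logsum(u_near)
cs_far  <- logsum(u_far)

# Improving the attributes of one store raises the far location's logsum
u_far_improved <- u_far + c(0, 0.8, 0)
cs_gain <- logsum(u_far_improved) - cs_far
```

Because every utility in `u_far` is exactly 1.5 lower than in `u_near`, the two logsums also differ by 1.5; an improvement to a single store produces a smaller gain that depends on how likely that store was to be chosen.
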
Despite the relative maturity of this theory, applications of utility-based
access in the literature are still rare, outside of public transport forecasting
analyses [@geurs2010]. The rarity is likely explained by an unfamiliarity with
destination choice models and the ready availability of simpler methods on one
hand [@logan2019], and the difficulty in obtaining a suitable estimation dataset
for particular land uses on the other [@kaczynski2016]. This second limitation
has been eased somewhat by a new methodology developed by @macfarlane2022a,
relying on commercial location-based services data to estimate the affinity for
simulated agents to visit destinations of varying attributes and distances.
## Data
In this research, we develop a unique dataset to estimate the destination choice
utility coefficients for grocery store choice in three different communities in
Utah. The three communities were selected to maximize potential observed
differences in utility between community residents. The three communities are
Utah County, West Salt Lake County, and San Juan County. Note that in this
document we refer to the second community as "Salt Lake" even though it covers
neither the entirety of Salt Lake County nor Salt Lake City; rather, we focus
on communities in the western part of the valley, such as Magna, Kearns, and
West Valley City. The communities are shown in a wider context in @fig-communities.
```{r communities, message = FALSE}
#| label: fig-communities
#| fig-cap: Location of study regions in Utah.
tar_load(all_groceries)
tar_load(ut)
ggplot(ut) +
geom_sf(lwd = 0) +
ggspatial::annotation_map_tile("cartolight", zoom = 7) +
geom_sf(color = "black", fill = NA) +
ggspatial::annotation_scale() +
ggspatial::annotation_north_arrow(style = ggspatial::north_arrow_minimal) +
geom_sf(data = all_groceries |> filter(!is.na(total_registers)),
aes(color = county))+
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()) +
scale_color_manual("Region", values = wesanderson::wes_palette('Darjeeling1'))
```
@tbl-acsdata shows several key population statistics based on 2021 American
Community Survey data for block groups in the three communities of interest.
Utah County is a largely suburban county with high incomes and a low percentage
of minority individuals. The Salt Lake region is more dense with somewhat lower
incomes and household sizes but a high share of minority individuals. San Juan
County is primarily rural, with a few small communities and a large reservation
for the Navajo Tribe.
```{r acsdata}
#| label: tbl-acsdata
#| tbl-cap: Demographic Statistics of Study Regions
tar_load(neighbor_acs)
df <- neighbor_acs |>
mutate(county = relevel(factor(county), ref = "Utah")) |>
group_by(county) |>
summarise(
`Total population` = sum(population),
`Total households` = sum(households),
`Housing units per sq. km` = weighted.mean(density, housing_units),
`Median income` = Hmisc::wtd.quantile(income, weights = households, probs = .5),
`Percent minority individuals` = weighted.mean(100 - white, population)
) |>
data.table::transpose(make.names = 'county', keep.names = 'measure')
if(!knitr::pandoc_to("docx")){
kbl(df, col.names = c("", "Utah", "Salt Lake", "San Juan"), digits = 0,
format.args = list(big.mark = ','), booktabs = TRUE) |>
kable_styling()
}
```
Estimating the utility model described in @eq-mnl for grocery stores requires
three interrelated data elements:
1. An inventory of grocery store attributes $X_j$;
2. A representative travel impedance matrix $K$ composed of all combinations of
origin $i$ and destination $j$;
3. A database of observed person flows between $i$ and $j$ by which to estimate
the $\beta$ coefficients.
We describe each of these elements in turn in the following sections.
### Store Attributes
```{r nems}
tar_load(nems_groceries)
```
The store attributes were collected using the Nutritional Environment Measures
Survey --- Stores (NEMS-S) tool [@glanz2007]. This tool was developed to reveal
significant differences in the availability and cost of healthy foods in an
environment, and has been validated for this purpose. Beyond superficial
attributes such as the store category (dollar store, convenience store, ethnic
market, etc.) and the number of registers, the NEMS-S collects detailed
information about numerous store offerings such as the availability of produce,
dairy products, lean meats, juices, and canned and dry goods of various specific
types. Of particular interest to the survey are availability and price
differentials of lower-fat alternatives: for example, the survey instrument
requests the shelf space allocated to milk products of various fat levels and
the price of each product.
Student research assistants collected the store attributes by visiting grocery
stores, dollar stores, ethnic markets, and other food markets in the three
communities of interest described above. Stores were identified using
internet-based maps combined with in-person validation and observation. The
student researchers completed the NEMS-S instrument with the aid of a digital
survey and a tablet computer. Each researcher who collected data was trained to
use the survey at a control store in Provo, and the training data helped to
reduce the risk of surveyor bias. The store attributes were collected in the
spring of 2021 for Utah County and spring of 2022 for Salt Lake and San Juan
Counties. In Utah and Salt Lake Counties, we included dollar stores and grocery
stores but did not include convenience stores. Given the rural nature of San
Juan County, we made two adjustments to capture the entirety of the nutrition
environment. First, we included convenience stores and trading posts if they
were the only food market in a community. We also included full-service grocery
stores in Cortez, Colorado, and Farmington, New Mexico in the San Juan data
collection, as community conversations made it clear that many residents will
drive these long distances for periodic shopping with greater availability and
lower prices.
Using the information in the NEMS-S survey, two measures of a store can be
calculated: an availability score based on whether stores stock particular items
as well as lower-calorie options; and a cost score describing the spread between
prices of these options. These score values are described in @lunsford2021, and
we developed an R package to compute the scores; this package is available at
<https://github.com/byu-transpolab/nemsr>. In the availability score, each store
is given a value for whether or not there are more healthful options available
in the store, such as low-calorie chips, or low-fat milk. If the store does not
have a more healthful option in a category it receives a lower score, so stores
with more availability of healthful food options will receive a higher
availability score. For the cost score, the measure is the price spread between
healthful and less healthful options: if whole wheat bread is cheaper
than white bread, the store receives positive points; if the
prices are the same, zero points are awarded; and if the wheat bread is more
expensive, the store receives negative points. Thus a store with higher
availability and cost scores will have both more healthful options and a more
advantageous pricing scheme towards those options.
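
The sign logic of the cost score can be sketched in a few lines of R. This is an illustration only: the prices are invented, and the sketch assumes one point per item pair, whereas the actual NEMS-S point values are those defined in @lunsford2021 and implemented in the `nemsr` package.

```r
# Hypothetical shelf prices for paired healthful / regular items at one store
pairs <- data.frame(
  item      = c("bread", "milk", "chips"),
  healthful = c(2.49, 3.19, 3.99),
  regular   = c(2.99, 3.19, 3.49)
)

# Assumed unit scoring: +1 when the healthful option is cheaper,
# 0 when the prices are equal, -1 when it is more expensive
pairs$points <- sign(pairs$regular - pairs$healthful)
cost_score <- sum(pairs$points)
```
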
One important store attribute that the NEMS-S instrument does not collect or
compute directly is a measure of the cost of common goods that can be compared
across stores. We therefore used the data collected from the NEMS-S instrument
to construct a market basket-based affordability measure that could be compared
across stores, following the approach of @hedrick2022. This market basket score
is based on the US Department of Agriculture (USDA) 2021 Thrifty Food Plan
[@fns2021], which calculates a reference market basket for a family of four.
Because this market basket contains more (and sometimes different) items than
what the NEMS-S instrument requests, we chose relevant items from our NEMS-S
data as replacements. For example, the USDA market basket contains a certain
amount of poultry, but the NEMS-S instrument collects the per-pound cost of ground
beef at various fat contents. For any stores that were missing any of the
elements in the market basket, we first substituted another ingredient that
would fit the nutritional requirements. If no substitute was available, we
included the average price of the missing good at other stores in that community
multiplied by 1.5 as a penalty for not stocking the product. The final market
basket score is the total cost of all foods in the market basket. These costs
can then be compared across stores to gauge relative affordability.
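
The missing-item penalty can be sketched as follows. The store names, items, and prices are hypothetical, and the sketch assumes a quantity of one for each item, whereas the actual basket weights items by the Thrifty Food Plan quantities.

```r
# Hypothetical prices of three basket items at four stores in one community;
# NA marks an item that a store neither stocks nor has a substitute for
prices <- data.frame(
  store = c("A", "B", "C", "D"),
  milk  = c(3.00, 3.50, NA, 4.00),
  bread = c(2.00, 2.50, 3.00, NA),
  beef  = c(5.00, 6.00, 7.00, 6.00)
)
items <- c("milk", "bread", "beef")

# Replace each missing price with 1.5 times the community mean for that item
penalized <- prices
for (item in items) {
  community_mean <- mean(prices[[item]], na.rm = TRUE)
  missing <- is.na(penalized[[item]])
  penalized[[item]][missing] <- 1.5 * community_mean
}

# Market basket score: total cost of all items (unit quantities assumed)
penalized$market <- rowSums(penalized[items])
```

In this example store C is charged 1.5 times the mean milk price of the other three stores, so its basket cost is higher than it would be under simple mean imputation.
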
```{r tbl-nems}
#| label: tbl-nems
#| tbl-cap: Grocery Store Attributes
tar_load(nems_groceries)
balnems <- nems_groceries |>
ungroup() |>
transmute(Type = ifelse(type == "Trading Post", "Other", type),
Pharmacy = pharmacy,
`Ethnic market` = ethnic,
`Other merchandise sold` = merch,
`Registers (incl. self checkout)` = total_registers,
`NEMS-S availability score` = availability,
`NEMS-S cost score` = cost,
`Market basket cost` = market,
County = factor(county, levels = c("Utah", "Salt Lake", "San Juan")))
if(!knitr::pandoc_to("docx")){
datasummary_balance(~County, data = balnems) |>
kable_styling(latex_options = "scale_down")
}
```
@tbl-nems presents the store attribute data collected for each community. Utah
County generally has the largest average store size (as measured by the number
of checkout registers) while having the lowest market basket cost, the highest
availability of healthful food (measured by the NEMS-S availability score) and
the lowest price difference between healthy and unhealthy food (the NEMS-S cost
score). San Juan County has the smallest average stores, highest costs, and the
lowest availability of healthy options, and Salt Lake falls in between.
#### Imputation of Missing Store Data
We collected detailed store attributes for a complete census of stores in Utah
County, San Juan County, and a portion of Salt Lake County using the NEMS-S
survey instrument.
These attributes form the basis of the choice models used to determine access and
provide a complete picture of access in those communities, assuming people do not
leave the communities for grocery trips.
But understanding access in other parts of Salt Lake County (including how stores
outside of the West Salt Lake County area might shape access inside that community)
requires us to impute the measured attributes onto the stores that we did not
directly measure.
To do this, we used web-based mapping databases (including OpenStreetMap and
Google Maps) to obtain a list of grocery stores, dollar stores, and appropriate
convenience stores throughout the state. From this search, we were able to
determine each store's location, brand name, and store type, which we also
collected in the manual data assembly efforts. Using this information, we built
a multiple imputation model using the `mice` package for R [@mice]. The
predictor variables in the imputation included the store brand and type, as well
as the average income and housing density in the nine closest block groups to
the store location (based on population-weighted block group centroids and
Euclidean distances).
```{r marketimp}
#| label: fig-marketimp
#| fig-cap: Imputed market price values for 12 random grocery stores.
#| out-width: "5in"
tar_load(imputed_groceries)
combined <- mice::complete(imputed_groceries, "long") |>
filter(type == "Grocery Store") |>
as_tibble()
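# Set the seed before sampling so the same stores are drawn on every render
set.seed(42)
# Sample 12 distinct stores among ids that do not match SL, UT, or SJ, so the
# figure always shows the 12 stores named in the caption
candidate_ids <- unique(combined$id[!grepl("SL|UT|SJ", combined$id)])
sampled_store_ids <- sample(candidate_ids, 12)
combined |>
  filter(id %in% sampled_store_ids) |>
  arrange(id) |>
  ggplot(aes(x = market)) +
  geom_density() + facet_wrap(~id) + theme_bw() +
  xlab("Imputed Market Basket Price [$]") +
  ylab("Density")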
```
Thirty iterations of the multiple imputation algorithm were run for each of ten
independent imputations. @fig-marketimp shows the density of the ten imputed
market basket prices for a randomly selected set of 12 stores. As the figure
reveals, there is some general peaking in the predicted market price for most
stores, but the imputation model still predicts a wide range of possible prices
for most stores. When using the imputed data for analysis, we take the mean of
the ten predictions for continuous values, and the mode for discrete values.
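
The pooling step described above (mean for continuous values, mode for discrete values) might look like the following base R sketch. The data frame is a hypothetical stand-in for the long-format output of `mice::complete()`, reduced to two stores and three imputations; the column names are assumptions.

```r
# Hypothetical long-format imputations: two stores, three imputations each
imputed <- data.frame(
  id        = rep(c("s1", "s2"), each = 3),
  market    = c(100, 110, 120, 90, 95, 100),
  registers = c(4, 4, 6, 2, 3, 3)
)

# Most frequent value; ties resolve to the smallest value
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# One row per store: mean of continuous values, mode of discrete values
market_mean   <- tapply(imputed$market, imputed$id, mean)
register_mode <- tapply(imputed$registers, imputed$id, stat_mode)
pooled <- data.frame(
  id        = names(market_mean),
  market    = as.numeric(market_mean),
  registers = as.numeric(register_mode)
)
```

Collapsing the imputations to a single value is convenient for downstream analysis, though it discards the between-imputation variability visible in @fig-marketimp.
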
### Travel Impedances {#sec-mcls}
The second element of the utility equation in @eq-utility is the travel
impedance between $i$ and $j$. Many possibilities for representing this
impedance exist, from basic Euclidean distance to complex network paths. A
primary purpose of the model we are developing in this research is to study
comparative tradeoffs between infrastructure-focused and environment-focused
improvements to the nutrition access of households. It is therefore essential
that we use a travel impedance measure that can combine and compare the cost of
traveling by multiple modes so that highway improvements and transit / active
transport improvements can be compared in the same basic model.
```{r utilities}
tar_load(utilities)
```
Just as the log-sum of a destination choice model is a measure that sums the
utility of multiple destination attributes and costs in a rigorous manner, the
log-sum of a mode choice model combines the utilities of all available travel
modes. In this study we assert the following mode choice utility equations:
\begin{align*}
V_{\mathrm{auto}, ij} &= `r utilities$CAR['ivtt']`(t_{\mathrm{auto}, ij})\\
V_{\mathrm{bus}, ij} &= `r utilities$TRANSIT['constant']` `r utilities$TRANSIT['ivtt']`(t_{\mathrm{bus}, ij}) `r utilities$TRANSIT['wait']`(t_{\mathrm{wait}, ij}) `r utilities$TRANSIT['wait']`(t_{\mathrm{access}, ij})\\
V_{\mathrm{walk}, ij} &= `r utilities$WALK['constant']` `r utilities$WALK['ivtt']`(t_{\mathrm{walk}, ij}) `r utilities$WALK['short_distance']`(d_{ij<1.5}) `r utilities$WALK['long_distance']`(d_{ij>1.5})
\end{align*}

where $t$ is the in-vehicle travel time in minutes for each mode
between $i$ and $j$. The transit utility function additionally includes the wait
time for transit as well as the time necessary to access the transit mode on
both ends by walking. The walk utility includes a per-mile distance disutility
that increases for distances greater than 1.5 miles. These equations and
coefficients are adapted from a statewide mode choice model for home-based non-work
trips in urban and rural regions developed for UDOT research [@barnes2021].
The log-sum, or total weighted impedance by all modes, is therefore

$$
k_{ij} = \ln\left(e^{V_{\mathrm{auto}, ij}} + e^{V_{\mathrm{bus}, ij}} + e^{V_{\mathrm{walk},ij}}\right)
$$ {#eq-mcls}

In this implementation, $i$ is the population-weighted centroid of a 2020 Census
block group, and $j$ is an individual grocery store. We measure the travel times
from each $i$ to each $j$ using the `r5r` implementation of the R5 routing
engine [@pereira2021; @conway2017; @conway2018; @conway2019]. This algorithm
uses common data elements --- OpenStreetMap roadway and active transport
networks alongside General Transit Feed Specification (GTFS) transit service
files --- to simulate multiple realistic route options by all requested modes.
We obtained OpenStreetMap networks and the Utah Transit Authority GTFS file
valid for May 2023 and requested the minimum total travel time by each mode of
auto, transit, and walking for a departure between 8 AM and 9 AM on May 10,
2023. The total allowable trip time by any mode was set to 120 minutes, and the
walk distance was capped at 10 kilometers; if a particular $i,j$ pair exceeded
these limits, the mode was presumed to be unavailable and contributed
no utility to the log-sum.
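
A minimal sketch of @eq-mcls, including the treatment of unavailable modes, is given below. The coefficients are placeholders chosen for illustration, not the values adapted from @barnes2021, and the walk distance terms are omitted for brevity. The sketch assumes that an unavailable mode is represented by a missing travel time.

```r
# Placeholder coefficients for illustration only (not the study's values)
b_auto_ivtt  <- -0.05
b_bus_const  <- -1.5
b_bus_ivtt   <- -0.05
b_bus_wait   <- -0.10
b_walk_const <- -1.0
b_walk_ivtt  <- -0.08

# Mode choice logsum k_ij for one origin-destination pair (times in minutes)
mode_choice_logsum <- function(t_auto, t_bus, t_wait, t_access, t_walk) {
  v <- c(
    auto = b_auto_ivtt * t_auto,
    bus  = b_bus_const + b_bus_ivtt * t_bus + b_bus_wait * (t_wait + t_access),
    walk = b_walk_const + b_walk_ivtt * t_walk
  )
  # A mode with a missing travel time is unavailable; exp(-Inf) = 0, so it
  # adds nothing to the sum inside the logarithm
  v[is.na(v)] <- -Inf
  log(sum(exp(v)))
}

k_all    <- mode_choice_logsum(10, 25, 5, 8, 40)
k_no_bus <- mode_choice_logsum(10, NA, NA, NA, 40)
```

Removing a mode can only lower the logsum, so `k_no_bus` is smaller than `k_all`, but it remains above the auto utility alone because walking is still available.
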
### Mobile Device Data
The final element of destination utility presented in @eq-utility is the
set of coefficients, which are often estimated from household travel surveys in a
travel demand context. It is unlikely, however, that typical household diaries
would include enough trips to grocery stores and similar destinations to create
a representative sample.
Emerging mobile device data, however, could reveal the typical home locations
for people who are observed in the space of a particular store. @macfarlane2022a
present a method for estimating destination choice models from such data, which
we repeat in this study. We provided a set of geometric polygons for the grocery
stores of interest to StreetLight Data, Inc., a commercial location-based
services aggregator and reseller. StreetLight Data in turn provided data on the
number of mobile devices observed in each polygon grouped by the inferred
residence block group of those devices during summer 2022. We then created a
simulated destination choice estimation dataset for each community by
sampling 10,000 block group - grocery store "trips" from the StreetLight
dataset. Each sampled trip defined a "chosen" alternative; we then sampled ten additional
stores from the same community at random (drawing a separate sample for each
simulated trip) to serve as the non-chosen alternatives. Random
sampling of alternatives is a common practice that results in unbiased
estimates, though the standard errors of the estimates might be larger than
could be obtained through a more carefully designed sampling scheme
[@train2009].
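
The construction of the estimation dataset can be sketched in base R as follows. The flows table, store identifiers, and column names are hypothetical, and two details are assumptions rather than documented choices: trips are drawn with probability proportional to the observed device counts, and the chosen store is excluded from the pool of sampled alternatives.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical flows: devices observed between block groups and stores
flows <- data.frame(
  block_group = c("bg1", "bg1", "bg2", "bg2", "bg3"),
  store       = c("s01", "s02", "s01", "s03", "s04"),
  devices     = c(50, 20, 10, 15, 5)
)
all_stores <- sprintf("s%02d", 1:15)  # all stores in the same community
n_trips <- 100  # the study draws 10,000
n_alts  <- 10   # non-chosen alternatives per trip

# Draw simulated trips in proportion to observed devices (an assumption)
rows  <- sample(nrow(flows), n_trips, replace = TRUE, prob = flows$devices)
trips <- flows[rows, c("block_group", "store")]

# For each trip, sample ten other stores as the non-chosen alternatives
choice_sets <- lapply(seq_len(n_trips), function(i) {
  chosen <- trips$store[i]
  alts <- sample(setdiff(all_stores, chosen), n_alts)
  data.frame(trip = i, block_group = trips$block_group[i],
             store = c(chosen, alts),
             chosen = c(TRUE, rep(FALSE, n_alts)))
})
choice_data <- do.call(rbind, choice_sets)
```

The resulting long table has eleven rows per simulated trip, which is the layout that packages such as `mlogit` expect once store attributes and impedances are joined on.
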