-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdraft 210314 1000.Rmd
272 lines (214 loc) · 10.7 KB
/
draft 210314 1000.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
---
title: "Who Lives in the New Orleans Area?"
author: "Daniel Moscoe"
date: "3/14/2021"
output:
pdf_document: default
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(readxl)
xlsx_src <- "C:/Users/dmosc/OneDrive/Documents/academic/CUNY SPS/DATA 607/Proj2/NO_short_data.xlsx"
csv_dest <- "C:/Users/dmosc/OneDrive/Documents/academic/CUNY SPS/DATA 607/Proj2/combined.csv"
```
#### Procedure Overview
In this section of the project, we examine demographic data from the New Orleans Metropolitan Area for the period 2000-2019. The data are contained in an Excel spreadsheet. Some of the features of the data in the spreadsheet enhance human readability. For example, geographic regions are listed as column headers, even though each of these regions represents a level of a variable. Variables, like different age groups, are listed as row headers. While this arrangement makes the table easier to read at a glance, it means the table does not have a "tidy" structure. In this section, we import the data from the Excel sheet, transform it into a tidy form, and conduct some exploratory analysis.
Each of the tables from the Excel sheet required similar transformations:
* Transpose the table so that observations reside in rows and variables in columns.
* Repair row names.
* Repair column names.
* Drop empty rows and columns.
* Convert data to a numeric form.
While the transformation of each table conformed to this rough outline, each table also required some special manipulation.
The product of the transformation of these individual tables is one large table, `combined`, that contains all the information from all the imported tables.
#### Procedure Detail
Import the raw data from the Excel sheet. The path to the local copy of the Excel file is stored as `xlsx_src`. (Note that the `readxl` package requires a local file and will not import data directly from a URL.) Each table is named for the position of its upper-left cell.
```{r message = FALSE, warning = FALSE}
S1A14 <- read_excel(xlsx_src, range = "A14:C18")
S1A22 <- read_excel(xlsx_src, range = "A22:U48")
S1A53 <- read_excel(xlsx_src, range = "A53:B68")
S1A72 <- read_excel(xlsx_src, range = "A72:J88")
S1A91 <- read_excel(xlsx_src, range = "A91:I104")
S1A108 <- read_excel(xlsx_src, range = "A108:I110")
```
We begin by transforming the most complex table, `S1A22`, in order to demonstrate completely all the elements of the general transformation procedure described above.
```{r}
###S1A22###
#TRANSPOSE
S1A22 <- S1A22 %>%
t() %>%
data.frame()
#FIX ROW NAMES
S1A22 <- rownames_to_column(S1A22)
S1A22$rowname <- gsub('.[0-9]+', NA, S1A22$rowname)
S1A22 <- fill(S1A22, rowname)
#FIX COLNAMES
S1A22 <- rename(S1A22, "parish" = rowname,
"year" = X1,
"f_wht_not_hisp" = X3, #f_ indicates fraction, not count
"f_blk_not_hisp" = X4,
"f_hisp_any" = X5,
"f_asn_not_hisp" = X6,
"age_btw_0_4" = X9,
"age_btw_5_9" = X10,
"age_btw_10_14" = X11,
"age_btw_15_19" = X12,
"age_btw_20_24" = X13,
"age_btw_25_29" = X14,
"age_btw_30_34" = X15,
"age_btw_35_39" = X16,
"age_btw_40_44" = X17,
"age_btw_45_49" = X18,
"age_btw_50_54" = X19,
"age_btw_55_59" = X20,
"age_btw_60_64" = X21,
"age_btw_65_69" = X22,
"age_btw_70_74" = X23,
"age_btw_75_79" = X24,
"age_btw_80_84" = X25,
"age_geq_85" = X26)
#DROP EMPTY ROWS AND COLUMNS
S1A22 <- S1A22[-c(1),-c(3,8,9)]
S1A22 <- remove_rownames(S1A22)
#CONVERT TO NUMERIC
S1A22[2:length(S1A22[1,])] <- sapply(S1A22[2:length(S1A22[1,])], as.numeric)
```
Choosing variable names carefully will be important here, since using these names consistently across tables will allow us to more easily construct `combined`.
```{r}
###S1A14###
#TRANSPOSE
S1A14 <- S1A14 %>%
t() %>%
data.frame()
#FIX ROW NAMES
S1A14 <- rownames_to_column(S1A14)
S1A14 <- cbind(parish = "Orleans", S1A14)
#FIX COLNAMES
S1A14 <- rename(S1A14, "year" = rowname,
"wht_not_hisp" = X2,
"blk_not_hisp" = X1,
"hisp_any" = X3,
"asn_not_hisp" = X4)
#DROP EMPTY ROWS AND COLUMNS
S1A14 <- S1A14[-c(1),-c(3,5)]
S1A14 <- remove_rownames(S1A14)
#CONVERT TO NUMERIC
S1A14[2:length(S1A14[1,])] <- sapply(S1A14[2:length(S1A14[1,])], as.numeric)
```
```{r}
###S1A53###
#FIX ROW NAMES
S1A53 <- cbind(parish = "Orleans", S1A53)
#FIX COLNAMES
S1A53 <- rename(S1A53,
"year" = Year,
"blk_not_hisp" = ...2)
#CONVERT TO NUMERIC
S1A53[2:length(S1A53[1,])] <- sapply(S1A53[2:length(S1A53[1,])], as.numeric)
```
```{r}
###S1A72###
S1A72 <- pivot_longer(S1A72, `Orleans`:`New Orleans Metro`)
#FIX COLNAMES
S1A72 <- rename(S1A72, "parish" = name,
"year" = ...1,
"hisp_any" = value)
#IMPROVE CONSISTENCY ACROSS TABLES
S1A72$parish[S1A72$parish == "St. John the Baptist"] <-
"St. John"
#DROP EMPTY ROWS AND COLUMNS AND REARRANGE
S1A72 <- S1A72[-c(1:9),]
S1A72 <- S1A72[,c(2,1,3)]
#CONVERT TO NUMERIC
S1A72[2:length(S1A72[1,])] <- sapply(S1A72[2:length(S1A72[1,])], as.numeric)
```
```{r}
###S1A91###
#TRANSPOSE
S1A91 <- S1A91 %>%
t() %>%
data.frame()
#FIX ROW NAMES
S1A91 <- rownames_to_column(S1A91)
S1A91$rowname <- gsub('.[0-9]+', NA, S1A91$rowname)
S1A91 <- fill(S1A91, rowname)
#FIX COLNAMES
S1A91 <- rename(S1A91, "parish" = rowname,
"year" = X1,
"f_hisp_cub" = X3,
"f_hisp_dom" = X4,
"f_hisp_mex" = X5,
"f_hisp_pr" = X6,
"f_hisp_hon" = X7,
"f_hisp_gua" = X8,
"f_hisp_nic" = X9,
"f_hisp_sal" = X10,
"f_hisp_ca" = X11,
"f_hisp_sa" = X12,
"f_hisp_oth" = X13)
#DROP EMPTY ROWS AND COLUMNS
S1A91 <- S1A91[-c(1,3,5,7,9),-3]
S1A91 <- remove_rownames(S1A91)
#CONVERT TO NUMERIC
S1A91[2:length(S1A91[1,])] <- sapply(S1A91[2:length(S1A91[1,])], as.numeric)
```
```{r}
###S1A108###
#TRANSPOSE
S1A108 <- S1A108 %>%
t() %>%
data.frame()
#FIX ROW NAMES
S1A108 <- rownames_to_column(S1A108)
S1A108$rowname <- gsub('.[0-9]+', NA, S1A108$rowname)
S1A108 <- fill(S1A108, rowname)
#FIX COLNAMES
S1A108 <- rename(S1A108, "parish" = rowname,
"year" = X1,
"age_btw_0_18" = X2)
#DROP EMPTY ROWS AND COLUMNS
S1A108 <- S1A108[-c(1),]
S1A108 <- remove_rownames(S1A108)
#CONVERT TO NUMERIC
S1A108[2:length(S1A108[1,])] <- sapply(S1A108[2:length(S1A108[1,])], as.numeric)
```
Now that each table conforms to a tidy format, each variable name is used consistently whenever it appears across tables, and each observation is led by a parish and a year in the first two columns, we are ready to join the tables. Since we don't want to lose any information, we use `full_join` on `parish` and `year` on one table after the other. The result is a large table containing all the information, called `combined`. The dataframe `combined` is then exported to a `csv` [file](https://raw.githubusercontent.com/sconnin/New_Orleans_Demographics/466abeb2b61d435b1982d23a51b86284e70890db/combined.csv). The path for this file is stored as `csv_dest`.
```{r}
combined <-
full_join(S1A14, S1A22, by = c("parish", "year")) %>%
full_join(S1A53, by = c("parish", "year")) %>%
full_join(S1A72, by = c("parish", "year")) %>%
full_join(S1A91, by = c("parish", "year")) %>%
full_join(S1A108, by = c("parish", "year"))
write_csv(combined, csv_dest, na = "")
```
A first glance at `combined` shows a significant proportion of missing values. It's interesting to compare `combined` to the original Excel tables, which contained no empty cells. Combining the data from each table together helps highlight gaps in the dataset that were hidden by the "full" appearance of the Excel tables. For example, the dataset is missing year-by-year counts of the white, Asian, and African American populations in most or all parishes.
With a tidy table containing all the data, we can now proceed to some exploratory analysis.
#### Analysis
*How did the number of African American and Hispanic residents in select parishes change during the period 2000-2019?*
This question allows us to explore some of the more complete regions of this dataset. While we might have asked this same question about the other demographic groups described in the dataset, complete year-by-year data is only available for Hispanic residents (in all parishes) and African American residents (in Orleans parish).
```{r}
combined %>%
filter(parish %in% c("Orleans", "Jefferson", "New Orleans Metro")) %>%
ggplot() +
geom_col(mapping = aes(x = year, y = hisp_any/1000), fill = "light blue") +
facet_wrap(~parish) +
labs(title = "Hispanic Population in Selected Regions, 2000-2019",
x = "Year",
y = "Population (thousands)")
```
The plot shows that in the three parishes with the greatest numbers of Hispanic residents, the number of Hispanic residents has increased in the period from 2000 to 2019. In the New Orleans Metro area broadly, the number of Hispanic residents has nearly doubled. In Jefferson and Orleans, increases have been more modest. Measurements are missing for the period from 2001-2005.
```{r}
combined %>%
filter(parish == "Orleans") %>%
ggplot() +
geom_col(mapping = aes(x = year, y = blk_not_hisp/1000), fill = "light blue") +
labs(title = "African American Population in Orleans Parish, 2000-2019",
x = "Year",
y = "Population (thousands)")
```
By contrast, the number of African Americans in Orleans Parish has declined significantly during the period from 2000 to 2019. The decline is sharpest during the period for which data is missing, which includes Hurricane Katrina. The graph above is consistent with the claim that Hurricane Katrina caused a very large fraction--well over half--of African American residents to leave Orleans Parish. While some have returned, the total African American population remains about 30% smaller than it was in 2000.
#### Source
The Excel file analyzed in this section is available [here](https://github.com/sconnin/New_Orleans_Demographics/blob/main/New%20Orleans%20Who%20Lives%20data%20tables.xlsx?raw=true). This file was provided by [The Data Center](https://www.datacenterresearch.org/data-resources/who-lives-in-new-orleans-now/) in response to a request.