---
title: "Summative Assessment"
author: "2659993 (Student ID)"
date: "`r format(Sys.time(), '%d/%m/%Y')`"
output: html_document
---
# Gender Influence on Mental Health in the UK Adult Population, 2018-2019
## 1. Introduction
The links between mental health and gender have become increasingly evident in
recent years, although most research in the field still minimises or ignores
gender differences in mental health (Patalay and Demkowicz, 2023). Spencer and
Broome (2023) call for an urgent effort to better understand young women's mental
health, as an NHS Digital survey shows that rates of diagnosed mental illness rise as
children approach adulthood: 15% of 7-10 year-olds have a diagnosable disorder,
compared with 22% of 17-24 year-olds (Newlove-Delgado et al., 2022). There is
also evidence that the decline in mental health during the Covid-19 outbreak was
twice as large for women as for men in the UK (Etheridge and Spantig, 2022).

Hence, this essay addresses the following research question: "Is there a
gender gap in mental health in the UK adult population in 2018-2019?". To answer
it, we use data from Wave 10 of Understanding Society (University of
Essex, 2023b). Understanding Society is the UK Household Longitudinal Study, a survey of
the members of approximately 40,000 households (at Wave 1). It collects data on subjects
such as health, work, education, income, family and social life, with both objective and
subjective indicators. Each wave contains the answers of sample members in a particular year
(University of Essex, 2023a).
Our outcome variable is the 12-Item Short Form Survey (SF-12) Mental Component Summary
(MCS). This derived variable converts valid answers from the questions in the
"Self-Completion SF12 Module" of the individual survey questionnaire into a
single mental functioning score, on a continuous scale ranging from 0
(low functioning) to 100 (high functioning) (University of Essex, 2023a).
```{r setup, include=FALSE}
library(tidyverse)
library(gtsummary)
library(texreg)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(lmtest)
library(car)
library(miceadds)
library(sandwich)
# Read in wave10_indresp
wave10_indresp <- read.delim("Datasets/UKDA-6614-tab/tab/ukhls/j_indresp.tab",
sep = "\t", header = TRUE)
# Selecting variables to be used
project_dataset <- wave10_indresp %>%
select(j_sf12mcs_dv, j_age_dv, j_sex_dv, j_urban_dv, j_marstat,
j_qfhigh_dv, j_jbstat, j_fimnnet_dv, j_ethn_dv, j_sclonely,
j_rach16_dv, j_howlng, j_hidp)
# Setting negative values as missing
project_dataset <- project_dataset %>%
mutate(j_sf12mcs_dv = case_when(j_sf12mcs_dv %in% c(-9, -8, -7) ~ as.double(NA),
TRUE ~ j_sf12mcs_dv)) %>%
mutate(j_age_dv = case_when(j_age_dv %in% c(-9) ~ as.integer(NA),
TRUE ~ j_age_dv)) %>%
mutate(j_urban_dv = case_when(j_urban_dv %in% c(-9) ~ as.integer(NA),
TRUE ~ j_urban_dv)) %>%
mutate(j_sex_dv = case_when(j_sex_dv %in% c(-9, 0) ~ as.integer(NA),
TRUE ~ j_sex_dv)) %>%
mutate(j_fimnnet_dv = case_when(j_fimnnet_dv %in% c(-9) ~ as.double(NA),
TRUE ~ j_fimnnet_dv)) %>%
mutate(j_marstat = case_when(j_marstat %in% c(-9, -8, -7, -2, -1) ~ as.integer(NA),
TRUE ~ j_marstat)) %>%
mutate(j_qfhigh_dv = case_when(j_qfhigh_dv %in% c(-9, -8) ~ as.integer(NA),
TRUE ~ j_qfhigh_dv)) %>%
mutate(j_jbstat = case_when(j_jbstat %in% c(-9, -2, -1) ~ as.integer(NA),
TRUE ~ j_jbstat)) %>%
mutate(j_ethn_dv = case_when(j_ethn_dv %in% c(-9) ~ as.integer(NA),
TRUE ~ j_ethn_dv)) %>%
mutate(j_sclonely = case_when(j_sclonely %in% c(-9, -8, -7, -2, -1) ~ as.integer(NA),
TRUE ~ j_sclonely)) %>%
mutate(j_rach16_dv = case_when(j_rach16_dv %in% c(-9, -8) ~ as.integer(NA),
TRUE ~ j_rach16_dv)) %>%
mutate(j_howlng = case_when(j_howlng %in% c(-9, -7, -2, -1) ~ as.integer(NA),
TRUE ~ j_howlng)) %>%
mutate(j_qfhigh_dv = case_when(j_qfhigh_dv %in% c(96) ~ as.integer(17),
TRUE ~ j_qfhigh_dv)) %>%
mutate(j_jbstat = case_when(j_jbstat %in% c(97) ~ as.integer(12),
TRUE ~ j_jbstat)) %>%
mutate(j_ethn_dv = case_when(j_ethn_dv %in% c(97) ~ as.integer(18),
TRUE ~ j_ethn_dv))
# View(project_dataset)  # interactive check only; skip when knitting
# Turning int variables into factor
project_dataset <- mutate(project_dataset, j_urban_dv = factor(j_urban_dv,
levels = 1:2,
labels = c(
"Urban area", "Rural area")),
j_sex_dv = factor(j_sex_dv, levels = 1:2, labels = c(
"Male", "Female")), j_sclonely = factor(j_sclonely,
levels = 1:3,
labels = c(
"Hardly ever or never", "Some of the time",
"Often")),
j_marstat = factor(j_marstat, levels = 1:9, labels = c(
"Single", "Married", "Civil Partner (legal)",
"Separated legally married",
"Divorced", "Widowed",
"Separated from Civil Partner",
"A former Civil Partner",
"Surviving Civil Partner")), j_qfhigh_dv =
factor(j_qfhigh_dv, levels = 1:17, labels = c(
"Higher degree",
"1st degree or equivalent",
"Diploma in HE",
"Teaching qual. (not PGCE)",
"Nursing/other med. qual.",
"Other higher degree", "A level",
"Welsh baccalaureate",
"I'nationl baccalaureate", "AS level",
"Highers (Scot.)",
"Cert. 6th year studies",
"GCSE/O level", "CSE",
"Standard/o/lower",
"Other school cert.",
"None of the listed")),
j_jbstat = factor(j_jbstat, levels = 1:12, labels = c(
"Self employed", "Paid employment (FT/PT)",
"Unemployed",
"Retired", "On maternity leave",
"Family care or home",
"Full-time student",
"LT sick or disabled",
"Govt training scheme",
"Unpaid, family business",
"On apprenticeship",
"Doing something else")),
j_ethn_dv = factor(j_ethn_dv, levels = 1:18, labels =
c("british/english/scottish/welsh/northern irish",
"irish", "gypsy or irish traveller",
"any other white background",
"white and black caribbean",
"white and black african",
"white and asian",
"any other mixed background",
"indian", "pakistani", "bangladeshi",
"chinese",
"any other asian background",
"caribbean", "african",
"any other black background",
"arab", "any other ethnic group")),
j_rach16_dv = factor(j_rach16_dv, levels = 1:2, labels =
c("Yes", "No")))
glimpse(project_dataset)
# Renaming our variables so it's easier to understand
project_dataset_fv <- project_dataset %>%
rename(sf12mcs = j_sf12mcs_dv, age = j_age_dv, urban_rural = j_urban_dv, sex =
j_sex_dv, monthly_income = j_fimnnet_dv, marital_status = j_marstat,
education = j_qfhigh_dv, job_status = j_jbstat, ethnic_group = j_ethn_dv,
lonely = j_sclonely, parental_resp = j_rach16_dv, hours_hw = j_howlng)
glimpse(project_dataset_fv)
# Reducing / re-aggregating categories of factor variables to make descriptive analysis and modelling more manageable
levels(project_dataset_fv$marital_status) = c("Single","Married", "Married",
"Divorced",
"Divorced",
"Widowed",
"Divorced",
"Divorced",
"Widowed")
levels(project_dataset_fv$job_status) = c("Employed (self employed or paid employment [FT/PT])", "Employed (self employed or paid employment [FT/PT])",
"Unemployed", "Retired", "Maternity leave; family care or home", "Maternity leave; family care or home", "Other", "Other",
"Other", "Other", "Other", "Other")
levels(project_dataset_fv$ethnic_group) = c("British (English, Scottish, Welsh), Northern Irish, Irish or other white background", "British (English, Scottish, Welsh), Northern Irish, Irish or other white background", "Other ethnic group",
"British (English, Scottish, Welsh), Northern Irish, Irish or other white background", "Other ethnic group",
"Other ethnic group", "Other ethnic group",
"Other ethnic group", "Other ethnic group",
"Other ethnic group", "Other ethnic group",
"Other ethnic group", "Other ethnic group",
"Other ethnic group", "Other ethnic group",
"Other ethnic group", "Other ethnic group",
"Other ethnic group")
levels(project_dataset_fv$education) = c("Higher education degree or similar",
"Higher education degree or similar",
"Higher education degree or similar",
"Higher education degree or similar",
"Higher education degree or similar",
"Higher education degree or similar",
"College (A levels or similar)",
"College (A levels or similar)",
"College (A levels or similar)",
"College (A levels or similar)",
                                         "College (A levels or similar)", "School (GCSE or similar)", "School (GCSE or similar)",
                                         "School (GCSE or similar)", "School (GCSE or similar)", "School (GCSE or similar)", "None")
# View(project_dataset_fv)  # interactive check only; skip when knitting
```
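The variable-by-variable recoding above is explicit but repetitive. A more concise alternative is sketched below: since at this point every column is still numeric and all the missing-value codes in the raw data are negative, the negative codes can be set to `NA` in a single `across()` call. This is an illustrative sketch only (note that the substantive recodes, such as `96` to `17`, would still need their own `mutate()` steps), and it is not evaluated when knitting:

```{r across-sketch, eval=FALSE}
# Sketch: recode every negative missing-value code to NA in one step.
# Assumes all columns are numeric and all negative values are missing codes.
project_dataset <- project_dataset %>%
  mutate(across(everything(), ~ replace(.x, which(.x < 0), NA)))
```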
## 2. Data
### 2.1 Variables
The following variables displayed in Table 1 were selected from the individual
response questionnaire to be used in the analysis:
```{r table1, echo=FALSE}
table1 <- tribble(
~"Variable", ~"Description",
"SF-12 Mental Component Summary (MCS)", "Single mental functioning score, ranging from 0 to 100",
"Sex", "Sex (male/female)",
"Age", "Age, in completed years, at the time of interview",
"Residential area", "Binary indicator classifying the address as falling into an urban or rural area",
"Marital status", "Legal marital status",
"Education", "Highest educational qualification",
"Job status", "Current employment situation",
"Individual monthly income", "Total estimated net monthly income in pounds (£)",
"Ethnic group", "Ethnic group",
"Loneliness", "How often feels lonely",
"Parental responsibility", "Whether responsible for child under 16",
"Hours on housework per week", "Hours spent on housework per week (e.g. cooking, cleaning, doing the laundry)"
)
knitr::kable((table1), booktabs = TRUE,
caption = "Table 1: Variable names and definitions")
```
It is important to note that some variables which the literature identifies as
influencing mental health wellbeing are not included in the selection. Consumption
of alcohol is associated with impacts on mental health (Newton-Howes and Boden, 2016),
but the corresponding variable in Understanding Society Wave 10, *dklm* ("How many times in the
last four weeks have you had an alcoholic drink?"), has over 90% "inapplicable"
answers. A similar situation is observed with *ypnpal* (number of close friends - the
question was only asked to respondents aged 16-21) and *aidhrs* (number of hours
spent per week on caring), both relevant to mental health wellbeing
(Etheridge and Spantig, 2022; Madia et al., 2023). Potential implications for the
final model are discussed later.
### 2.2 Cleaning the data
```{r dim original dataset, echo=TRUE}
dim(project_dataset_fv)
```
The Understanding Society Wave 10 dataset has responses from 34,318 individuals.
To proceed with the analysis, some filters are applied to make sure only relevant
observations are included. The criteria set for filtering are as follows, and each
one was performed individually to check how many observations are dropped at each
stage:
- Include only people aged 16 or above, since the focus is on the adult population
+ Only 3 observations are dropped.
- For individual monthly income, exclude negative values (monthly income below 0)
+ A drop of 15 observations.
- For individual monthly income, exclude very high values (monthly income > 25000)
+ Five observations are removed.
- Exclude missing values within independent variables
+ Age: no drops.
+ Sex: two observations removed.
+ Residential area: 24 observations are dropped.
+ Individual monthly income: no drops.
+ Marital status: 1,000 observations removed.
+ Job status: 30 observations are excluded.
+ Ethnic group: 11 observations dropped.
+ Loneliness: A drop of 1,385 observations.
+ Parental responsibility: no drops.
+ Hours on housework per week: 1,008 observations are removed.
+ Education: 4,148 observations are dropped (13%). This is a considerable
reduction. Looking at the data documentation, we can see that 13.57% of this variable's
observations were originally recorded as "inapplicable". It is not possible to know the exact
reason for the missingness, but we can speculate that those respondents might not have had access
to formal education. According to the Office for National Statistics, 18.2% of the adult
population in England and Wales reported having no qualifications at all in the 2021
Census (Office for National Statistics, 2021). Nevertheless, the survey offers the option of
answering "None of the above", which probably captures people with no qualifications
(18.85% of observations give that answer). Hence, while we cannot be sure what the missing
answers represent, we decide to drop them.
- Exclude missing values within the dependent variable
+ 391 observations are excluded.
After applying these filters, the final sample size falls to 26,296 observations
(a reduction of roughly 23.4% from the original sample).
```{r cleaning, include=FALSE}
project_dataset_fv_clean <- project_dataset_fv %>%
filter(age >= 16)
project_dataset_fv_clean <- project_dataset_fv_clean %>%
filter(monthly_income >= 0)
project_dataset_fv_clean <- project_dataset_fv_clean %>%
filter(monthly_income < 25000)
project_dataset_fv_clean <- project_dataset_fv_clean %>%
drop_na(age) %>%
drop_na(sex) %>%
drop_na(urban_rural) %>%
drop_na(monthly_income) %>%
drop_na(marital_status) %>%
drop_na(job_status) %>%
drop_na(ethnic_group) %>%
drop_na(lonely) %>%
drop_na(parental_resp) %>%
drop_na(hours_hw) %>%
drop_na(education)
project_dataset_fv_clean <- project_dataset_fv_clean %>%
drop_na(sf12mcs)
```
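The stage-by-stage checks described above (how many observations each filter removes) can be tallied in one pass before filtering. The sketch below is illustrative and not evaluated when knitting; the column names with the `na_` prefix are generated by `across()`:

```{r drop-counts, eval=FALSE}
# Sketch: count how many observations each filter would remove on its own.
project_dataset_fv %>%
  summarise(
    under_16        = sum(age < 16, na.rm = TRUE),
    negative_income = sum(monthly_income < 0, na.rm = TRUE),
    income_over_25k = sum(monthly_income >= 25000, na.rm = TRUE),
    across(c(age, sex, urban_rural, monthly_income, marital_status,
             job_status, ethnic_group, lonely, parental_resp,
             hours_hw, education, sf12mcs),
           ~ sum(is.na(.x)), .names = "na_{.col}"))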
```{r dim cleaned sample, echo=TRUE}
dim(project_dataset_fv_clean)
```
### 2.3 Summary statistics
Table 2 shows summary statistics for the variables, grouped by sex (the main independent
variable of the analysis). The dependent variable (SF-12 Mental Component Summary), age,
individual monthly income and hours on housework per week are continuous (numerical)
variables, so the mean, standard deviation and range are reported for them. All
other variables are categorical and display the count of observations and their
proportion (as a percentage) of the sample.
Additionally, in order to make descriptive analysis and modelling more manageable,
we reduced the number of levels in some factor variables by aggregating them. These
variables are marital status (9 levels aggregated to 4), education (17 levels to 4),
job status (12 levels to 5) and ethnic group (18 levels to 2).
```{r summary statistics table, echo=FALSE}
project_dataset_fv_clean %>%
select(sf12mcs, sex, age, urban_rural, marital_status, education, job_status,
monthly_income, ethnic_group, lonely, parental_resp, hours_hw) %>%
tbl_summary(by = sex,
type = list(sf12mcs ~ "continuous2", age ~ "continuous2",
urban_rural ~ "categorical", marital_status ~ "categorical",
education ~ "categorical", job_status ~ "categorical",
monthly_income ~ "continuous2", ethnic_group ~ "categorical",
lonely ~ "categorical", parental_resp ~ "categorical",
hours_hw ~ "continuous2"),
statistic = list(all_continuous() ~ c("{mean} ({sd})",
"{min} - {max}"), all_categorical() ~ "{n} ({p}%)"),
digits = all_continuous() ~ 2,
label = list(sf12mcs ~ "SF-12 Mental Component Summary (MCS)",
age ~ "Age", urban_rural ~ "Residential area",
marital_status ~ "Marital status", education ~ "Education",
job_status ~ "Job status", monthly_income ~ "Individual monthly income (£)",
ethnic_group ~ "Ethnic group", lonely ~ "Loneliness",
parental_resp ~ "Parental responsibility",
hours_hw ~ "Hours on housework per week"),
missing_text = "(Missing)") %>%
add_overall() %>%
modify_header(label ~ "**Variable**") %>%
modify_footnote(all_stat_cols() ~ "Mean (SD) or Frequency (%)") %>%
modify_caption("Table 2: Summary statistics for variables used in the analysis") %>%
bold_labels()
```
The mean score on the SF-12 Mental Component Summary scale is
46.67 for women and 49.16 for men - a considerable difference. To test whether this
difference holds in the general population, a t-test is performed between the
dependent variable and the key independent variable (a "dummy" variable -
categorical with two levels - which is why a t-test is performed instead of ANOVA or
correlation tests).
## 3. Inferential Statistics
```{r t-test, echo=TRUE}
t.test(sf12mcs ~ sex, data = project_dataset_fv_clean)
```
The significance level adopted is the conventional value of 0.05. The t-test
evaluates the null hypothesis that there is no difference between the mean
Mental Component Summary scores of men and women in the population. The
p-value obtained is smaller than 0.05, so we reject the null hypothesis.
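Statistical significance alone does not convey how large the difference is. One common way to quantify it is Cohen's d (the mean difference divided by the pooled standard deviation); this is an illustrative sketch, not part of the original analysis, and is not evaluated when knitting:

```{r cohens-d, eval=FALSE}
# Sketch: effect size of the male-female MCS difference (Cohen's d).
by_sex <- project_dataset_fv_clean %>%
  group_by(sex) %>%
  summarise(m = mean(sf12mcs), s = sd(sf12mcs), n = n())
pooled_sd <- with(by_sex,
                  sqrt(((n[1] - 1) * s[1]^2 + (n[2] - 1) * s[2]^2) /
                       (sum(n) - 2)))
cohens_d <- with(by_sex, (m[1] - m[2]) / pooled_sd)
cohens_d
```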
## 4. Visualising the output variable
Plot 1 is a histogram of the distribution of the SF-12 MCS variable for the whole
sample. The distribution is concentrated towards the higher end of the scale, with a peak
of observations around scores of 56-57. Note that the plots here use a linear scale: when
plotting the MCS score on a logarithmic scale, the distribution appeared even more
skewed than on the linear scale, and we had to filter out one
observation with an MCS score of zero. Finally, the range of values of the
dependent variable is small (i.e. the maximum values are not considerably larger
than the minimum values), so a linear scale is more appropriate in this case.
```{r plot univariate distribution SF-12 MCS, echo=FALSE, message=FALSE}
library(ggplot2)
project_dataset_fv_clean %>%
ggplot(aes(sf12mcs)) +
geom_histogram() +
labs(title = "Plot 1: Distribution of SF-12 Mental Component Summary",
subtitle = "Scores for adult population (aged 16 and over)",
x = "SF-12 MCS score",
y = "Count",
caption = "Source: Understanding Society Wave 10 (2018-2019)") +
theme_minimal()
```
```{r univariate distribution SF-12 MCS with log scale, include=FALSE}
# Plotting with a log scale
project_dataset_fv_clean %>%
filter(sf12mcs > 0) %>%
ggplot(aes(sf12mcs)) +
geom_histogram() +
labs(title = "Plot 1: Distribution of SF-12 Mental Component Summary",
subtitle = "Logarithmic scale",
x = "SF-12 MCS score",
y = "Count",
caption = "Source: Understanding Society Wave 10 (2018-2019)") +
theme_minimal() +
scale_x_log10()
```
The density plot (Plot 2) shows the SF-12 MCS score distribution by sex. The
distribution for men is shifted slightly further to the right than for women,
indicating that men tend to have higher SF-12 MCS scores than women. The peak
density is also higher for males than for females, and occurs at approximately
the same value of the MCS.
```{r bivariate distribution by sex, echo=FALSE}
median <- project_dataset_fv_clean %>%
group_by(sex) %>%
summarise(median = median(sf12mcs))
project_dataset_fv_clean %>%
ggplot(aes(sf12mcs)) +
geom_density(aes(fill = sex), alpha = 0.5) +
geom_vline(data = median, aes(xintercept = median, color = sex),
linewidth = 0.5) +
labs(title = "Plot 2: Distribution of SF-12 Mental Component Summary by sex",
subtitle = "Scores for adult population (aged 16 and over)",
x = "SF-12 MCS score",
y = "Density",
caption = "Source: Understanding Society Wave 10 (2018-2019)") +
theme_minimal() +
theme(legend.title = element_blank())
```
The vertical lines represent the median SF-12 MCS scores both for men and women.
The median for men is higher than for women.
## 5. Analysis - linear regression and model building
Having demonstrated with the t-test that there is a statistically significant difference
in the average MCS score of males and females, we now test whether the difference
persists after controlling for other characteristics of the UK adult population. These
characteristics (or variables) were selected from the literature review, which indicated
the factors most commonly influencing mental health and gender differences. To do this,
Ordinary Least Squares (OLS) regression models are estimated. OLS estimates the line
that best describes the linear relationship between a predictor and an outcome variable
(in bivariate regression models), or a multidimensional plane in multiple regression
models (Fogarty, 2019).
Therefore, five regression models are presented in Table 3:
- Model 1 (M1, the bivariate model): the baseline model with the dependent variable
(SF-12 MCS) and the key independent variable of the analysis (sex).
- Model 2 (M2, demographic factors model): M1 plus the age, marital status,
highest level of education and ethnic group variables.
- Model 3 (M3, economic factors model): M2 plus the job status and
individual net monthly income variables.
- Model 4 (M4, residential factors model): M3 plus the residential area variable.
- Model 5 (M5, social and care factors model): M4 plus the loneliness,
parental responsibility and hours per week on housework variables.
```{r model 1, include=FALSE}
m1 <- project_dataset_fv_clean %>%
lm(sf12mcs ~ sex, data = .)
m1 %>% summary()
```
```{r model 2, include=FALSE}
m2 <- project_dataset_fv_clean %>%
lm(sf12mcs ~ sex + age + marital_status + education + ethnic_group, data = .)
m2 %>% summary()
```
```{r model 3, include=FALSE}
m3 <- project_dataset_fv_clean %>%
lm(sf12mcs ~ sex + age + marital_status + education + ethnic_group + job_status +
monthly_income, data = .)
m3 %>% summary()
```
```{r model 4, include=FALSE}
m4 <- project_dataset_fv_clean %>%
lm(sf12mcs ~ sex + age + marital_status + education + ethnic_group + job_status +
monthly_income + urban_rural, data = .)
m4 %>% summary()
```
```{r model 5, include=FALSE}
m5 <- project_dataset_fv_clean %>%
lm(sf12mcs ~ sex + age + marital_status + education + ethnic_group + job_status +
monthly_income + urban_rural + lonely + parental_resp + hours_hw, data = .)
m5 %>% summary()
```
```{r regression results table texreg, results='asis', echo=FALSE}
texreg::htmlreg(list(m1, m2, m3, m4, m5),
groups = list("Marital status (Ref: Single)" = 4:6,
"Education (Ref: Higher education degree or similar)" = 7:9,
"Ethnic group (Ref: British (English, Scottish, Welsh), Northern Irish, Irish or other white background)" = 10,
"Job status (Ref: Employed (self employed or paid employment [FT/PT]))" = 11:14,
"Residential area (Ref: Urban area)" = 16,
"Loneliness (Ref: Hardly ever or never)" = 17:18,
"Parental responsibility (Ref: Yes)" = 19),
caption = "Table 3: Regression results for SF-12 MCS (dependent variable)",
caption.above = TRUE,
custom.model.names = c("M1", "M2", "M3", "M4", "M5"),
custom.coef.names = c("Intercept", "Female", "Age (Years)",
"Married", "Divorced", "Widowed",
"College (A levels or similar)", "School (GCSE or similar)",
"None", "Other ethnic group", "Unemployed", "Retired",
"Maternity leave; family care or home", "Other",
"Individual monthly income (£)", "Rural area", "Some of the time", "Often",
"No", "Hours on housework per week"))
```
Model 1 presents the relationship between the key independent and dependent variables. As
demonstrated previously with the inferential test, men tend to have higher average SF-12
MCS scores than women. In this bivariate model, the average difference is 2.49 points.
The adjusted R-squared value of 0.013 indicates this model explains 1.3% of the variance
in the SF-12 MCS scores.
In Model 2 the relationship between the main pair of variables is adjusted by
adding demographic variables: age, marital status, highest level of education
and ethnic group. In the literature, age has a positive correlation with mental health
wellbeing, and the gender gap in mental health decreases as people get older (Kiely
et al., 2019). This is also observed in this model: for each one-year increase in age,
MCS scores are expected to increase by 0.13 points on average. Being married has been
associated with better mental health (Strohschein et al., 2005) and decreased odds of
using mental health services (Dolja‑Gore et al., 2018) in Canada and Australia,
respectively. This model reflects something similar for the UK population: married
status is associated with a higher MCS score (by 1.74 points), and divorced people
score lower on the scale compared to the reference category (single). Widowed
status is not statistically significant.
Higher levels of education have been significantly linked to a protective effect
against anxiety and depression (Bjelland et al., 2008). In Model 2, people with no
educational qualifications scored on average 1.05 points lower on the MCS scale
compared with people holding degrees (the reference category). The college and school
levels were not statistically significant. When it comes to race and ethnicity, black
and minority ethnic communities appear to be at greater risk of developing mental
illness in the UK (Bamford et al., 2021). Nevertheless, in this model there was no
statistically significant difference in MCS scores between white/British communities
(the reference) and other minorities. Overall, controlling for demographic
variables in Model 2 reduced the mental health gender gap, which is
expected given the multifactorial nature of mental health wellbeing.
In Model 3, economic factors are introduced. There is strong evidence in the literature
that job insecurity is associated with mental distress (Chum et al., 2022), whilst income
deprivation is a key driver of mental health inequalities, with women's mental health
potentially more sensitive to poverty (Thomson et al., 2023). In the model, being
unemployed, on maternity leave / family care or home, or in other employment situations
appears to have a negative impact on mental health functioning compared to self-employment
and paid employment (ref.). Conversely, being retired seems to be associated
with higher MCS scores. As for individual monthly income, despite being statistically
significant, its positive association with the MCS score is very marginal: an increase
of £1,000 in monthly income represents an average increase of only 0.14 points on the
mental wellbeing scale.
It is interesting to note that the School (GCSE or similar) category of the highest
educational qualification variable is now statistically significant, with a higher
average score than the higher education degree level, while the category indicating
no educational qualifications is no longer significant. Furthermore, the other ethnic
group category becomes statistically significant in Model 3, with a higher average
score than the white/British ethnicity level. These changes between Models 2 and 3 are
rather odd, as they contradict what the wider literature indicates. Despite that, the
adjusted R-squared for Model 3 indicates it explains approximately 10% of the variation
in the SF-12 MCS scores.
Model 4 accounts for residential area characteristics. In the UK, people living in
urban areas are more likely to suffer poor physical and mental health than rural
populations (Harriss and Hawton, 2011). This is reflected in the model: people residing
in rural regions score on average one point (1.0) higher on the MCS scale
than those living in urban areas (ref.). The adjusted R-squared and the gender gap in
the MCS scale remain roughly the same as in Model 3, though.
In Model 5, the variables that account for frequency of feeling lonely, parental
responsibility and hours of housework per week are added. Loneliness is correlated
with worse mental and physical functioning (Lee et al., 2019). People who often feel
lonely have MCS scores roughly 16 points lower on average than those who hardly
ever or never feel lonely (ref.). It is also reported that the disproportionate
burden of unpaid child and house care on women is negatively associated with their
mental health. However, the variables for parental responsibility and hours on
housework both appear not to be statistically significant in Model 5. This most
comprehensive model explains 30% of the variation in the MCS scale, and the gender gap
in mental health is estimated at 1.37 points.
Important to note in Model 5 that the married and divorced status are no longer
significant as they were in previous models, whereas widowed is now statistically
significant. Something similar seems to happen in the education variable with the
College category. The individual monthly income variable has also become not
significant in this last model.
These changes between models may indicate that some of the assumptions of the linear
model are being violated. Therefore, in the next section the robustness of
the model is tested before the final discussion and conclusions.
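As a quick check on the headline figures quoted above (the gender gap and the adjusted R-squared), both can be read directly from the fitted model object. A minimal sketch, assuming the female coefficient is labelled `sexFemale` in the model output (the exact label depends on how the factor was coded earlier in this document):

```{r gender gap check, echo=TRUE}
# Sketch: extract the estimated gender gap in the MCS scale and the
# adjusted R-squared from Model 5 (the row name "sexFemale" is an
# assumption about the factor coding)
coef(summary(m5))["sexFemale", "Estimate"]
summary(m5)$adj.r.squared
```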
```{r regression results table tab_model(), include=FALSE}
# Another way of showing regression results with tab_model()
tab_model(m1, m2, m3, m4, m5,
dv.labels = c("M1", "M2", "M3", "M4", "M5"),
show.ci = FALSE,
p.style = "stars",
collapse.se = TRUE,
pred.labels = c("Intercept", "Female", "Age (Years)",
"Marital status [Married]", "Marital status [Divorced]",
"Marital status [Widowed]",
"Education [College (A levels or similar)]",
"Education [School (GSCE or similar)]",
"Education [None]", "Ethnic group [Other ethnic group]",
"Job status [Unemployed]", "Job status [Retired]",
"Job status [Maternity leave; family care or home]", "Job status [Other]",
"Individual monthly income (£)", "Residential area [Rural area]",
"Loneliness [Some of the time]", "Loneliness [Often]",
"Parental responsibility [No]", "Hours on housework per week"))
```
```{r regression results table tbl_regression(), include=FALSE}
# Another option for regression tables using functions from the gtsummary package
rt1 <- m1 %>% tbl_regression(estimate_fun = function(x) style_number(x, digits = 2),
intercept = TRUE,
label = list(sex ~ "Sex")) %>%
modify_column_hide(column = ci) %>%
modify_column_unhide(column = std.error) %>%
add_significance_stars()
rt2 <- m2 %>% tbl_regression(estimate_fun = function(x) style_number(x, digits = 2),
intercept = TRUE,
label = list(sex ~ "Sex", age ~ "Age (Years)",
marital_status ~ "Marital status", education ~ "Education",
ethnic_group ~ "Ethnic group")) %>%
modify_column_hide(column = ci) %>%
modify_column_unhide(column = std.error) %>%
add_significance_stars()
rt3 <- m3 %>% tbl_regression(estimate_fun = function(x) style_number(x, digits = 2),
intercept = TRUE,
label = list(sex ~ "Sex", age ~ "Age (Years)",
marital_status ~ "Marital status", education ~ "Education",
job_status ~ "Job status", monthly_income ~ "Individual monthly income (£)",
ethnic_group ~ "Ethnic group")) %>%
modify_column_hide(column = ci) %>%
modify_column_unhide(column = std.error) %>%
add_significance_stars()
rt4 <- m4 %>% tbl_regression(estimate_fun = function(x) style_number(x, digits = 2),
intercept = TRUE,
label = list(sex ~ "Sex", age ~ "Age (Years)",
urban_rural ~ "Residential area",
marital_status ~ "Marital status", education ~ "Education",
job_status ~ "Job status", monthly_income ~ "Individual monthly income (£)",
ethnic_group ~ "Ethnic group")) %>%
modify_column_hide(column = ci) %>%
modify_column_unhide(column = std.error) %>%
add_significance_stars()
rt5 <- m5 %>% tbl_regression(estimate_fun = function(x) style_number(x, digits = 2),
intercept = TRUE,
label = list(sex ~ "Sex", age ~ "Age (Years)",
urban_rural ~ "Residential area",
marital_status ~ "Marital status", education ~ "Education",
job_status ~ "Job status", monthly_income ~ "Individual monthly income (£)",
ethnic_group ~ "Ethnic group", lonely ~ "Loneliness",
parental_resp ~ "Parental responsibility",
hours_hw ~ "Hours on housework per week")) %>%
modify_column_hide(column = ci) %>%
modify_column_unhide(column = std.error) %>%
add_significance_stars()
tbl_merge(list(rt1, rt2, rt3, rt4, rt5))
```
## 6. OLS Assumptions and Model Diagnostics
Linear regression (and specifically OLS) makes a number of assumptions about the data,
variables and estimates. If these assumptions are not satisfied, we cannot be sure our
predictions are correct (Fogarty, 2019).
Best Linear Unbiased Estimator (BLUE) assumptions are crucial to ensure estimates
minimise error variance (best or efficient, i.e. standard errors are not biased upwards
or downwards) and reflect on average the true population parameter (i.e. coefficients
are not biased). In order for the model to be BLUE, it is important to check if the
four assumptions from the Gauss–Markov theorem hold (Fogarty, 2019). These are:
1. Predictor variables are independent of the error term (exogenous variables)
2. Correct functional form (a linear relationship between predictors and outcome variable)
3. Errors have constant variance (homoscedasticity)
4. No autocorrelation between errors (independent observations)
In addition to the BLUE assumptions, there are three other OLS assumptions with
implications for regression results (Fogarty, 2019). Although not as crucial as the
BLUE ones, we also test for them here:
5. Errors are normally distributed
6. There is no multicollinearity between variables (predictors are not functions of
one another)
7. There are no influential data points (influential outliers)
In this report, the author opted to follow the order of the tutorial exercises to
test the assumptions.
### 6.1 Influential outliers (Additional assumption #7)
```{r influential outliers, echo=TRUE}
m5 %>%
plot(which = 5) %>%
title(main = "Plot 3: Residuals vs. Leverage for M5")
```
Influential data points are assessed using Cook's distance, "which measures
the effect of each observation on the regression coefficients" (Fogarty, 2019, p.222).
If a point is flagged by Cook's distance, it is considered an influential outlier.
Points with a Cook's distance greater than 0.5 are considered highly influential,
whereas those with a distance equal to or greater than 1 are extreme influential
outliers (Fogarty, 2019). No Cook's distance contour lines appear in the plot, thus
there are no influential outliers in our data.
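The visual check can be complemented numerically. A minimal sketch computing the largest Cook's distance for M5 and counting observations above the two thresholds discussed above:

```{r cooks distance check, echo=TRUE}
# Sketch: largest Cook's distance and counts above the 0.5 and 1 thresholds
cd <- cooks.distance(m5)
max(cd)       # largest single influence value
sum(cd > 0.5) # highly influential points
sum(cd >= 1)  # extreme influential outliers
```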
### 6.2 Correct functional form (BLUE #2)
```{r non-linearity, echo=TRUE}
m5 %>%
plot(which = 1) %>%
title(main = "Plot 4: Residuals vs. Fitted for M5")
```
In the Residuals vs. Fitted plot, it is possible to see a non-linear pattern in Model 5
(the red line is not completely flat, indicating the conditional mean of the error term
is not 0). To confirm a potential non-linearity in the model, we also run a Ramsey RESET
test, where the null hypothesis is that the functional form of the model is correct.
```{r RESET test, echo=TRUE}
resettest(m5, power = 2:3, type = "fitted")
```
The RESET test checks whether a new version of our model, with the predicted values
entered in a non-linear form, "explains significantly more variance in the outcome
variable than our original model" (Fogarty, 2019, p.219). As the p-value is below 0.05,
the null hypothesis is rejected. Therefore, we conclude the model's functional form is
non-linear.
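One illustrative response to a rejected RESET test (a sketch only, not applied in the final model) is to allow a simple non-linear term, such as a quadratic in age, and re-run the test. The formula below mirrors the Model 5 specification used elsewhere in this document:

```{r reset follow-up sketch, echo=TRUE}
# Sketch: refit Model 5 with a squared age term and check whether the
# RESET test still rejects the functional form
m5_age2 <- lm(sf12mcs ~ sex + age + I(age^2) + marital_status + education +
                ethnic_group + job_status + monthly_income + urban_rural +
                lonely + parental_resp + hours_hw,
              data = project_dataset_fv_clean)
resettest(m5_age2, power = 2:3, type = "fitted")
```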
### 6.3 Errors normally distributed (Additional assumption #5)
```{r q-q plot, echo=TRUE}
m5 %>%
plot(which = 2) %>%
title(main = "Plot 5: Q-Q Residuals for M5")
```
Normally distributed errors are desirable because they allow us to generalise the
estimates from the sample data to the population. However, violating normality does not
mean the estimates are wrong, only that we cannot confidently extend them to statements
about the general population (Fogarty, 2019).
The Q-Q plot shows the errors in M5 are not normally distributed: if they were, the
points would sit on the 45-degree line. Instead, we observe a departure, especially at
the lower end of the distribution. The distribution of the SF-12 MCS variable (Plot 1)
was skewed to the right, hence this is not unexpected.
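The visual impression from the Q-Q plot can be complemented with simple moment-based summaries of the residuals, computed here with base R only. This is a sketch: values near 0 for skewness and near 3 for kurtosis would indicate approximate normality.

```{r residual moments sketch, echo=TRUE}
# Sketch: skewness and kurtosis of the M5 residuals using base R
r <- residuals(m5)
c(skewness = mean((r - mean(r))^3) / sd(r)^3,
  kurtosis = mean((r - mean(r))^4) / sd(r)^4)
```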
### 6.4 Errors have constant variance (BLUE #3)
```{r heteroscedasticity, echo=TRUE}
m5 %>%
bptest()
```
The Residuals vs. Fitted plot may suggest a pattern in the spread of the residuals,
although this is not obvious. To verify whether the model presents heteroscedasticity,
we run the Breusch–Pagan test, whose null hypothesis is that the model has constant
error variance (homoscedasticity). The test returns a p-value below the level of
significance (p < 0.05), hence we reject the null hypothesis and conclude we have an
issue of heteroscedasticity.
Heteroscedasticity makes estimates inefficient and biases standard errors downwards.
When standard errors are smaller than they should be, variables may appear statistically
significant when in fact they are not (Fogarty, 2019).
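A common remedy is to re-estimate the standard errors with a heteroscedasticity-consistent covariance matrix. A minimal sketch, assuming the lmtest and sandwich packages are loaded (the same `coeftest()`/`vcovHC()` pairing is used later in this document):

```{r robust SE sketch, echo=TRUE}
# Sketch: HC0 robust standard errors for M5; the coefficients are
# unchanged, only the standard errors (and hence p-values) are recomputed
coeftest(m5, vcov = vcovHC(m5, type = "HC0"))
```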
### 6.5 No multicollinearity between variables (Additional assumption #6)
```{r multicollinearity, echo=TRUE}
m5 %>%
vif()
```
Multicollinearity is present when two or more independent variables are functions of
one another. The variance inflation factor (VIF) measures how much common variance
there is between predictors. A VIF value of 1 represents no multicollinearity, and
anything below 5 does not cause concern; values equal to or higher than 10 likely
indicate problems (Fogarty, 2019).
The results of the VIF test for M5 are low, therefore multicollinearity is not a problem
in our model.
### 6.6 Exogenous variables (BLUE #1)
Endogeneity problems occur when predictor variables are correlated with the error term,
most commonly through omitted-variable bias (OVB), i.e. when key predictor variable(s)
have been excluded from the model. Since there is no test that can tell which variables
are missing from our model, the best approach to diagnosing this issue is reviewing the
theory on the subject (Fogarty, 2019).
As mentioned in section 2.1 (Variables), some of the key variables the literature on
mental health highlights as relevant have a high level of missingness within the
Understanding Society Wave 10 dataset (the case of *dklm* - number of times in the last
four weeks respondent has had an alcoholic drink -, *ypnpal* - number of close friends -
and *aidhrs* - number of hours spent per week on caring). Furthermore, the variable in
our model measuring the frequency of loneliness may have an endogenous relationship with
the dependent variable SF-12 MCS. Our hypothesis is that people who often feel lonely
have lower scores on the MCS scale (low-functioning mental health). Alternatively, those
with poorer mental health might be more likely to suffer from illnesses such as
depression or anxiety, which may lead them to feel lonely more frequently.
Based on those reflections, our model is likely to have endogeneity issues.
### 6.7 No autocorrelation (independent observations) (BLUE #4)
One of the assumptions of the Gauss–Markov theorem is that residuals should be
independent (i.e. knowing something about a particular realisation of a residual should
not help predict the value of another observation). Problems of autocorrelation are
associated with spatial, longitudinal and time series datasets (Fogarty, 2019).
To check for autocorrelation, we should look at the data documentation. Understanding
Society is a panel survey of UK households with yearly interviews. In this project
we are looking at data from the individual questionnaire rather than the household
questionnaire. Despite that, we might have some clustering, as all adults in the same
household respond to the individual questionnaire and are thus more likely to be
correlated. This may not be a big issue, as there is a large number of households and
few people per household. Nevertheless, here we apply cluster-robust standard errors
to estimate a new model. We assume residuals are correlated within a group (variable
*j_hidp*, which identifies the household of each respondent) but independent between
groups.
```{r cluster robust standard errors, echo=TRUE}
project_dataset_fv_clean %>%
lm.cluster(data = .,
sf12mcs ~ sex + age + marital_status + education + ethnic_group +
job_status + monthly_income + urban_rural + lonely + parental_resp +
hours_hw,
cluster = "j_hidp") %>%
summary()
```
As expected, there is not a considerable difference from our original model.
Furthermore, there is also a possibility of spatial or temporal clustering. It might be
that there is some autocorrelation between people in the same city/region/area; here we
have only controlled for urban versus rural areas. As for clustering over time, we
assume this does not occur in our data as we are looking at one specific wave of the
Understanding Society survey. Thus our data is cross-sectional (one point in time) and
not longitudinal.
All in all, we conclude we do not have autocorrelation issues. Table 4 summarises the
results of model diagnostics.
```{r model diagnostics, echo=FALSE}
table4 <- tribble(
~"Assumption", ~"Issue?", ~"Plan",
"**BLUE Assumptions**", "", "",
"1. Predictor variables are independent of the error term (exogeneity)", "Yes", "Drop *lonely* variable?",
"2. Correct functional form", "Yes", "Take logarithm of dependent variable?",
"3. Errors have constant variance", "Yes", "Correct with robust standard errors",
"4. No autocorrelation between errors (independent observations)", "No", "NA",
"**Additional Assumptions**", "", "",
"5. Errors are normally distributed", "Yes", "Take logarithm of dependent variable?",
"6. There is no multicollinearity between variables", "No", "NA",
"7. There are no influential data points", "No", "NA"
)
knitr::kable((table4), booktabs = TRUE,
caption = "Table 4: Model diagnostics results for M5")
```
### 6.8 Refining for a final model
To correct for endogeneity, we dropped the *lonely* variable and compared the regression
results with the original model. Firstly, the adjusted R-squared decreased from 30% to
10%. Secondly, there are changes in the significance of some variables/categories:
for example, parental responsibility and monthly income are now statistically
significant, whereas college education and being widowed are not. Finally, re-running
some of the tests for the other assumptions, the Q-Q plot appears further from the
45-degree line than when the model includes the variable, indicating we would be less
confident generalising the estimates to statements about the population.
Furthermore, some of the literature on the subject has included similar variables to
measure loneliness in their analyses (Etheridge and Spantig, 2022). Based on these
factors, we decided to keep *lonely* in the model, as removing it seemed to cause
more omitted-variable bias.
For the incorrect functional form and non-normally distributed errors, we tried a
semi-log version of our model by transforming the dependent variable. Re-running the
tests (plots and RESET test results below), the semi-log model seems to violate these
assumptions more than the original model does. This is not unexpected, since in
section 4 the histogram of the log version of SF-12 MCS was more skewed than the one
on the linear scale. Hence, we conclude the linear model is more suitable in this case.
```{r refining, include=FALSE}
# Correction 1: Predictors not independent of the error term (endogeneity) -> drop lonely variable
m5_drop <- project_dataset_fv_clean %>%
lm(sf12mcs ~ sex + age + marital_status + education + ethnic_group + job_status +
monthly_income + urban_rural + parental_resp + hours_hw, data = .)
m5_drop %>% summary()
m5_drop %>%
plot(which = 2)
m5_drop %>%
bptest()
resettest(m5_drop, power = 2:3, type = "fitted")
m5_drop %>%
vif()
# Conclusions - Adj. R-squared drops to 0.1. Maybe should keep it based on literature?
# Correction 2: Incorrect functional form + errors not normally distributed -> log of sf12mcs
m5_log <- project_dataset_fv_clean %>%
filter(sf12mcs > 0) %>%
mutate(log_sf12mcs = log(sf12mcs)) %>%
lm(log_sf12mcs ~ sex + age + marital_status + education + ethnic_group + job_status +
monthly_income + urban_rural + lonely + parental_resp + hours_hw, data = .)
m5_log %>%
plot(which = 1)
resettest(m5_log, power = 2:3, type = "fitted")
m5_log %>%
plot(which = 2)
# Conclusions - does not look good. Maybe we should keep the linear scale?
# Correction 3: Heteroscedasticity -> correct with robust standard errors
m5_SE <- coeftest(m5, vcov = vcovHC(m5, type = "HC0"))
# Conclusions - ok
```
```{r comparing m5 and m5_log, include=FALSE, message=FALSE, results='asis'}
texreg::htmlreg(list(m5, m5_log))
```
```{r comparing m5 and m5_SE, include=FALSE, message=FALSE, results='asis'}
texreg::htmlreg(list(m5, m5_SE))
```
```{r plots and reset test for m5_log, echo=FALSE}
m5_log %>%
plot(which = 1) %>%
title(main = "Plot 6: Residuals vs. Fitted for Semi-Log M5")
resettest(m5_log, power = 2:3, type = "fitted")
m5_log %>%
plot(which = 2) %>%
title(main = "Plot 7: Q-Q Residuals for Semi-Log M5")