title: "Simulation" author: "Ishu Jaswani, Cristian Leo" date: "2022-11-15" output: html_document: toc: true toc_float: true header-includes:
- \usepackage{sectsty}
- \allsectionsfont{\color{cyan}}
knitr::opts_chunk$set(echo=TRUE,warning = FALSE, message = FALSE)
The population considered is the number of students in New York State in the 2017-2018 academic year (NYSED Data Site, 2018). Based on this population we want a sample with a statistical power of 90% for an effect size of 0.1. Since research question 3 will be answered with an RCT, we can obtain the minimum sample size needed to achieve the stated statistical power by running an ANOVA power test with 4 groups, corresponding to the 4 independent variables in our analysis: treatment or control, level of income, gender, and level of education. Based on the test, 356 students per group are required to achieve the desired statistical power.
“2018 NY STATE Higher Education Enrollment Data | NYSED Data Site.” Accessed December 5, 2022. https://data.nysed.gov/highered-enrollment.php?year=2018&state=yes.
Uni_students_NY <- 1236841
library(pwr)
sample_h1 <- pwr.t.test(d = 0.2/2, sig.level = 0.05, power = 0.9, type = "two.sample", alternative = "greater")$n; sample_h1 # per-group sample size based on the power of the test and the effect size
sample_ratio <- sample_h1/Uni_students_NY; sample_ratio
sample_anova <- pwr.anova.test(k = 4,f = 0.1,sig.level = 0.05,power = 0.9)$n;sample_anova
#356 in each group
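As a quick sensitivity check (a sketch, not part of the original design), we can vary the assumed effect size f to see how strongly the required per-group sample size depends on it; the alternative f values below are purely illustrative.
# Illustrative only: required per-group n for a range of assumed ANOVA effect sizes
sapply(c(0.10, 0.15, 0.25), function(f) round(pwr.anova.test(k = 4, f = f, sig.level = 0.05, power = 0.9)$n, 1))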
What is your household income level?
1) 0-50k 2) 50-150k 3) 150-300k 4) 300k+
We assume the household income of most students is around 150k-300k, since college tuition is expensive. We also assume that most students do not take a loan to attend college and therefore do not come from households with extremely low incomes.
set.seed(224) # randomly generate numbers to remain constant
income_house <- rnorm(n = 360,mean = 150,sd = 100) # normal distribution
max(income_house);min(income_house)
library(Hmisc);library(dplyr);library(forcats)
income_house_sust <- cut(income_house, breaks = c(-Inf,50,150,300,Inf), labels = c("0-50k", "50-150k", "150-300k", "300k+"))
fct_count(income_house_sust)
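As a quick check of this assumption (descriptive only), we can look at the share of simulated respondents falling in each income bin.
round(prop.table(table(income_house_sust)), 2) # share of respondents per household-income bin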
What is your gender?
- Male
- Female
Considering only male and female respondents, we simulated a binomial distribution with an equal probability of males and females.
set.seed(3)
gender <- factor(rbinom(n=360, size=1, p=0.5), levels = c(0, 1), labels = c("Male", "Female")) # binomial distribution
fct_count(gender)
What is your education level?
1) Undergraduate 2) Graduate 3) MBA 4) PhD
Since there are more undergraduates, we skewed the distribution towards undergraduates: we used a normal distribution centered at 1 with a standard deviation of 2 and binned the values so that undergraduates make up the largest share and the remaining levels receive progressively fewer observations.
set.seed(2)
edu <- rnorm(n = 360,mean = 1,sd = 2) # normal distribution
max(edu);min(edu)
library(Hmisc);library(dplyr)
edu_sust <- cut(edu, breaks = c(-Inf,1,2,3,Inf), labels = c("Undergraduate", "Graduate", "MBA", "PHD"))
fct_count(edu_sust)
How often do you shop for new clothing? Answers: Once a week or more; About once a month; A few times a year; Once a year; Never
We simulated a normal distribution because we expect that few people never shop for new clothing and few shop once a week or more; most respondents fall between about once a month and a few times a year.
# normal dist
set.seed(1)
often_tr <- rnorm(n = 360,mean = 2.5,sd = 1) # normal distribution
max(often_tr);min(often_tr)
library(Hmisc);library(dplyr)
often_clothing <- cut(often_tr, breaks = c(-1,1,2,3,4,6), labels = c("Once a week or more","About once a month","A few times a year","Once a year","Never"))
fct_count(often_clothing)
Please indicate your level of agreement with the following statement: When given the option, I choose to purchase sustainable products. Answers: Strongly disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Strongly agree
We used a normal distribution and assigned values so that the distribution is skewed towards "Somewhat agree" and "Strongly agree", since we assume that, holding other parameters constant, people given the option would choose to purchase sustainable products.
library(dplyr);library(Hmisc);library(forcats)
set.seed(1)
agree <- rnorm(n = 360,mean = 4,sd = 1.5)
max(agree);min(agree)
library(Hmisc);library(dplyr)
sust_agree <- cut(agree, breaks = c(-1,1,2,3,4,Inf), labels = c("Strongly disagree", "Somewhat disagree", "Neither agree nor disagree", "Somewhat agree", "Strongly agree"))
fct_count(sust_agree)
Please indicate your level of agreement with the following statement: I am willing to pay more for a product that is sustainably produced. Answers: Strongly disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Strongly agree
We assume the distribution on this Likert scale follows a trend similar to question 5: people think sustainable products are more expensive, but see paying more as a further step towards protecting the environment, so in general they would be willing to pay more.
set.seed(2)
pay <- rnorm(n = 360,mean = 4,sd = 1.5)
max(pay);min(pay)
library(Hmisc);library(dplyr)
sust_pay <- cut(pay, breaks = c(-1,1,2,3,4,Inf), labels = c("Strongly disagree", "Somewhat disagree", "Neither agree nor disagree", "Somewhat agree", "Strongly agree"))
fct_count(sust_pay)
Please indicate your level of agreement with the following statement: When retail shopping, I look at the tags and other available information on the article of clothing to determine if it was produced sustainably. Answers: Strongly disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Strongly agree
set.seed(2)
info_tags <- rnorm(n = 360,mean = 1,sd = 2)
max(info_tags);min(info_tags)
library(Hmisc);library(dplyr)
info_tags_sust <- cut(info_tags, breaks = c(-Inf,1,2,3,4,Inf), labels = c("Strongly disagree", "Somewhat disagree", "Neither agree nor disagree", "Somewhat agree", "Strongly agree"))
fct_count(info_tags_sust)
Please indicate your level of agreement with the following statement: When a company adopts sustainable business practices, I will go out of my way to buy their products. Answers: Strongly disagree; Somewhat disagree; Neither agree nor disagree; Somewhat agree; Strongly agree
Question 5 aims to identify to what extent sustainability is a purchasing criterion. On a scale from 1 to 5, where 1 is "Strongly disagree" and 5 is "Strongly agree", we assume that most responses will fall slightly to the right of the middle of the scale. Accordingly, our assumed mean is 3.3 and our assumed standard deviation is 2. We used a random normal distribution to obtain a distribution concentrated around the mean with tails at the extremes.
set.seed(3)
extra_effort <- rnorm(n = 360,mean = 3.3,sd = 2)
max(extra_effort);min(extra_effort)
library(Hmisc);library(dplyr);library(forcats)
sust_extra_effort <- cut(extra_effort, breaks = c(-Inf,1,2,3,4,Inf), labels = c("Strongly disagree", "Somewhat disagree", "Neither agree nor disagree", "Somewhat agree", "Strongly agree"))
fct_count(sust_extra_effort)
Treatment group: For the shirt pictured here (insert picture of sustainable shirt with description), how much is this item worth? Answers: Continuous scale ($0 to $50)
Control group: For the shirt pictured here (insert picture of non-sustainable shirt), how much is this item worth? Answers: Continuous scale ($0 to $50)
Question 6 is a customized question that shows a t-shirt with a description. In the treatment group the description mentions that the t-shirt is made from sustainable fabrics and that the company follows environmentally sustainable practices. In the control group the same t-shirt is displayed, but the description mentions that the t-shirt and the company that produced it are not sustainable. Survey takers can drag a point on a bar from $0 to $50 to indicate how much they think the given t-shirt is worth. We assume that respondents in the treatment group will report a higher price on average. Therefore the control group distribution will be right skewed, with a lower mean and prices concentrated at lower values, while the treatment group will show the opposite. To simulate this, we used a chi-squared distribution with 20 degrees of freedom for the control group and a skewed random normal distribution with a skewness parameter of 0.5, generated with the rsnorm function from the "fGarch" library, for the treatment group.
# control
set.seed(3)
cont_price <-as.integer(rchisq(n = 180,df = 20)); min(cont_price);max(cont_price)
hist(cont_price)
# treatment
library(fGarch)
set.seed(2)
treat_price <- as.integer(rsnorm(180, 40, 5, 0.5)); min(treat_price);max(treat_price)
hist(as.integer(treat_price))
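To make the assumed treatment effect visible, here is a minimal descriptive comparison of the two simulated price distributions (group means plus a side-by-side boxplot); this only checks the simulation, it is not an analysis.
c(control_mean = mean(cont_price), treatment_mean = mean(treat_price)) # treatment should be higher by design
boxplot(list(control = cont_price, treatment = treat_price), ylab = "Perceived price ($)")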
Treatment group: Would you be willing to buy the same shirt for $40? Answers: Yes; No
Control group: Would you be willing to buy the same shirt for $40? Answers: Yes; No
Following question 6, question 7 is also a customized question. While question 6 measures the perceived price of the t-shirt, question 7 measures the willingness to buy at the same price in both the treatment and control groups. We assume that the proportion of "Yes" in the treatment group is higher than in the control group. Moreover, we posit a meaningful difference of roughly 40 percentage points, large enough to be compelling for companies to switch their production from non-sustainable to sustainable. To simulate the results we used a random binomial distribution with probability 0.3 in the control group and 0.6999 in the treatment group.
# control
set.seed(3)
cont_buy <- as.factor(rbinom(n = 180,size = 1,prob = 0.3))
cont_buy <- fct_recode(cont_buy,"Yes"="1","No"="0")
fct_count(cont_buy)
# treatment
set.seed(12029)
treat_buy <- as.factor(rbinom(n = 180,size = 1,prob = 0.6999))
treat_buy <- fct_recode(treat_buy,"Yes"="1","No"="0")
fct_count(treat_buy)
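A quick descriptive check that the simulated groups show the assumed gap of roughly 40 percentage points in willingness to buy:
c(control = mean(cont_buy == "Yes"), treatment = mean(treat_buy == "Yes")) # share of "Yes" per group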
Do you believe the quality of sustainably produced clothing is higher than non-sustainably produced clothing? Answers: Yes; No; I don’t know
Question 8 allows us to gain deeper insight into why a sustainable product is worth more than a non-sustainable one. We assume that one of the main reasons is higher product quality. Therefore we expect a higher proportion of "Yes" answers in both the treatment and control groups. To achieve this, we simulated the data with a random normal distribution with mean 1 and standard deviation 0.8, and then binned the generated numbers into the possible answers: numbers up to 1 mean "Yes", 1 to 2 mean "No", and greater than 2 mean "I don't know".
set.seed(2)
quality <- rnorm(n = 360,mean = 1,sd = 0.8)
max(quality);min(quality)
library(Hmisc);library(dplyr)
quality_sust <- cut(quality, breaks = c(-Inf,1,2,Inf), labels = c("Yes","No","I don't Know"))
fct_count(quality_sust)
main_data <- data.frame(index = 1:360,often_clothing,sust_agree,sust_pay,info_tags_sust,sust_extra_effort,quality_sust,edu_sust,gender,income_house_sust)
colnames(main_data)
set.seed(20)
main_data <- main_data %>%
mutate(z = row_number() %in% sample(n(), n() / 2) )
skimr::skim(main_data)
main_data <- main_data |>
arrange(z)
main_data <- main_data |>
mutate(price = c(cont_price,treat_price),buy=c(cont_buy,treat_buy))
main_data
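Before the analysis, a minimal balance check on the random assignment (a sketch using the columns created above): the split should be 180/180, and covariates such as gender should be roughly balanced across arms.
table(main_data$z) # treatment/control split
prop.table(table(main_data$z, main_data$gender), margin = 1) # gender shares within each arm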
Do sustainable labels influence people’s perception about products’ perceived price?
In the first research question we examine whether the sustainability of a product affects customers' perception of its price. To do so we analyze the answers to the willingness-to-buy question of the survey (question 7) with a proportion test.
main_data |>
select(z,buy)|>
table()
rq1 <- prop.test(x = c(128,54),n=c(180,180),alternative = "greater")
effect_size_rq1 <- prop.test(x = c(128,54),n=c(180,180),alternative = "greater")$estimate; (effect_size_rq1[1]-effect_size_rq1[2])*100 # in percentage%
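The counts passed to prop.test above are hard-coded from the simulated table; as a sketch, the same "Yes" shares can be derived directly from main_data (assuming buy kept its Yes/No labels when the two vectors were combined).
with(main_data, prop.table(table(z, buy), margin = 1)) # share of answers within each assignment group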
Are people willing to pay a higher price for a product if it comes with a sustainable label? If so, to what degree?
The second research question aims to provide an actionable insight for companies: the goal of this study is to compel companies to shift their operations towards sustainable practices.
rq2 <- t.test(x=main_data[main_data$z == 1, ]$price,y=main_data[main_data$z == 0, ]$price,alternative = "greater")
effect_size_rq2 <- t.test(x=main_data[main_data$z == 1, ]$price,y=main_data[main_data$z == 0, ]$price,alternative = "greater")$estimate; (effect_size_rq2[1]-effect_size_rq2[2])
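To complement the raw dollar difference, here is a hedged sketch of a standardized effect size (Cohen's d) for the same comparison; pooled_sd is a helper name introduced only for this illustration.
treat_p <- main_data[main_data$z == 1, ]$price
cont_p <- main_data[main_data$z == 0, ]$price
# pooled standard deviation across the two arms
pooled_sd <- sqrt(((length(treat_p) - 1) * var(treat_p) + (length(cont_p) - 1) * var(cont_p)) / (length(treat_p) + length(cont_p) - 2))
(mean(treat_p) - mean(cont_p)) / pooled_sd # Cohen's d for the price difference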
Do gender, education and household income level affect the influence of sustainability tags on people's willingness to purchase?
library(tidyverse);library(broom)
main_data_hilo <- main_data %>%
mutate(price_hilo = factor(ifelse(price >= mean(price), 1, 0), levels = c(0, 1)))
model <- glm(price_hilo~z+edu_sust+income_house_sust+gender,data=main_data_hilo, family = "binomial")
pred <- predict(model, type='response')
pred_hilo <- ifelse(pred>0.5, 1, 0)
prop.table(table(pred = pred_hilo, true = main_data_hilo$price_hilo))
summary(model)
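As a rough gauge of fit (in-sample only, so optimistic), the share of correctly classified high/low observations can be read off the same predictions.
mean(pred_hilo == as.numeric(as.character(main_data_hilo$price_hilo))) # in-sample accuracy of the high/low classifier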
main_data$z <- as.factor(main_data$z)
model <- lm(price~z+edu_sust +income_house_sust+gender,data=main_data)
summary(model)
model_anov <- aov(price~z+edu_sust +income_house_sust+gender,data=main_data)
t1 <- TukeyHSD(model_anov,which = "income_house_sust" ,conf.level=.95)$income_house_sust
t2 <- TukeyHSD(model_anov,which = "z" ,conf.level=.95)$z
t3 <- TukeyHSD(model_anov,which = "edu_sust" ,conf.level=.95)$edu_sust
t4 <- TukeyHSD(model_anov,which = "gender" ,conf.level=.95)$gender
t5 <- rbind(t2,t1,t3,t4)
plot(TukeyHSD(model_anov,which = "income_house_sust" ,conf.level=.95))
tidy(anova(object = model))
class(tidy(model))
tidy(model) |>
select(estimate,p.value) |>
mutate(i = seq(1,9,1)) |>
filter(i==1) |>
select(estimate,p.value)
set.seed(4172)
B <- 1000 # number of experiments to replicate
# put above experiment into function for replication
experiment <- function(n = 360) {
y_0 <- rnorm(n = n, mean = 150, sd = 100) %>% round(digits = 1)
tibble(y_0)
}
# generate all B replications in one long draw of size 360 * B, then label each replication with an index i
income_house <- experiment(n = 360 * B) %>% mutate(i = rep(seq(B), each = 360) )
max(income_house$y_0); min(income_house$y_0)
library(Hmisc);library(dplyr)
income_house_sust <- cut(income_house$y_0, breaks = c(-Inf,50,150,300,Inf), labels = c("0-50k", "50-150k", "150-300k", "300k+"))
fct_count(income_house_sust)
fct_count(factor(as.vector(replicate(1000, rbinom(n = 360, size = 1, p = 0.5))), levels = c(0, 1), labels = c("Male", "Female"))) # male/female counts pooled across 1,000 replications
rep(seq(1000),each = 360)
library(tidyverse)
library(broom)
B <- 1000 # number of experiments to replicate
# put above experiment into function for replication
analyze_experiment <- function(y_1 , y_0) {
prop.test(x= c(y_1,y_0), n=c(360/2,360/2), alternative = 'greater') %>%
tidy() |>
mutate(effect = estimate1-estimate2) |>
select(effect_size = effect, upper_ci = conf.high, p = p.value)
}
experiment <- function(n = 360) {
y_0 <- factor(rbinom(n = n/2, size = 1, p=0.5), levels = c(0, 1), labels = c("Male", "Female"))
y_1 <- factor(rbinom(n = n/2, size = 1, p=0.5), levels = c(0, 1), labels = c("Male", "Female"))
tibble(y_0, y_1 )
}
# generate all B replications in one long draw (180 + 180 per replication), then label each replication with an index i
Gender_simulation <- experiment(n = 360 * B) %>% mutate(i = rep(seq(B), each = 180) )
# run test on each experiment
x <- Gender_simulation %>% group_by(i,y_1) |> summarise(no_rows = length(y_1))
y <- Gender_simulation %>% group_by(i,y_0) |> summarise(no_rows = length(y_0))
Gender_simulation <- bind_cols(x, y) # bind_cols repairs the duplicated names to no_rows...3 / no_rows...6 used below
results <- Gender_simulation |>
filter(y_1 =="Male") |>
select(no_rows...3, no_rows...6) |> # column 3 holds the y_1 (treatment half) counts, column 6 the y_0 (control half) counts
mutate(i = rep(seq(B), each = 1) ) |>
group_by(i) |>
summarise(analyze_experiment(no_rows...3,no_rows...6))
mean(results$p < 0.05)
summary(results$effect_size)
summary(results$upper_ci)
library(tidyverse)
library(broom);library(dplyr)
set.seed(2)
B <- 1000 # number of experiments to replicate
# put above experiment into function for replication
analyze_experiment <- function(y_1 , y_0) {
prop.test(x= c(y_0,y_1), n=c(360/2,360/2), alternative = 'greater') %>%
tidy() |>
mutate(effect = estimate1-estimate2) |>
select(effect_size = effect, upper_ci = conf.high, p = p.value)
}
experiment <- function(n = 360) {
y_0 <- factor(rbinom(n = n/2,size = 1,prob = 0.3), levels = c(0, 1), labels = c("Yes", "No"))
y_1 <- factor(rbinom(n = n/2,size = 1,prob = 0.6999), levels = c(0, 1), labels = c("Yes", "No"))
tibble(y_0, y_1 )
}
buy_simulation <- experiment(n = 360 * B) %>% mutate(i = rep(seq(B), each = 360/2) )
# run test on each experiment
x <- buy_simulation %>% group_by(i,y_1) |> summarise(no_rows = length(y_1))
y <- buy_simulation %>% group_by(i,y_0) |> summarise(no_rows = length(y_0))
buy_simulation <- bind_cols(x, y) # bind_cols repairs the duplicated names to no_rows...3 / no_rows...6 used below
results <- buy_simulation |>
filter(y_1 =="Yes") |>
select(no_rows...3, no_rows...6) |> # column 3 and column 6 hold the "Yes" counts from y_1 and y_0 respectively
mutate(i = rep(seq(B), each = 1) ) |>
group_by(i) |>
summarise(analyze_experiment(no_rows...3,no_rows...6))
mean(results$p > 0.05)
summary(results$effect_size)
summary(results$upper_ci)
set.seed(4172)
B_2 <- 1000 # number of experiments to replicate
# put above experiment into function for replication
experiment <- function(n = 360) {
y_0 <- as.integer(rchisq(n = n/2,df = 20))
y_1 <- as.integer(rsnorm(n/2, 40, 5, 0.5))
tibble(y_0, y_1 )
}
analyze_experiment <- function(y_1, y_0) {
t.test(x = y_1, y = y_0, alternative = 'greater') %>% # one-sided test matching the main analysis (treatment > control)
tidy() %>%
select(effect = estimate, upper_ci = conf.high, p = p.value)
}
# generate all B_2 replications in one long draw (180 control + 180 treatment per replication), then label each replication with an index i
d_rep <- experiment(n = 360 * B_2) %>% mutate(i = rep(seq(B_2), each = 360/2) )
# run test on each experiment
results <- d_rep %>% group_by(i) %>% summarise( analyze_experiment(y_1, y_0) )
# same summaries as in notes; but can do whatever you want
mean(results$p < 0.05)
summary(results$effect)
summary(results$upper_ci)
set.seed(224) # randomly generate numbers to remain constant
income_house <- rnorm(n = 360,mean = 180,sd = 100) # normal distribution
max(income_house);min(income_house)
library(Hmisc);library(dplyr)
income_house_sust <- cut(income_house, breaks = c(-Inf,50,180,360,Inf), labels = c("0-50k", "50-180k", "180-360k", "360k+"))
fct_count(income_house_sust)
set.seed(3)
gender <- factor(rbinom(n=360, size=1, p=0.5), levels = c(0, 1), labels = c("Male", "Female")) # binomial distribution
fct_count(gender)
set.seed(2)
edu <- rnorm(n = 360,mean = 2.5,sd = 2) # normal distribution
max(edu);min(edu)
library(Hmisc);library(dplyr)
edu_sust <- cut(edu, breaks = c(-Inf,1,2,3,Inf), labels = c("Undergraduate", "Graduate", "MBA", "PHD"))
fct_count(edu_sust)
# normal dist
set.seed(1)
often_tr <- rnorm(n = 360,mean = 2.5,sd = 2) # normal distribution
max(often_tr);min(often_tr)
library(Hmisc);library(dplyr)
often_clothing <- cut(often_tr, breaks = c(-1,1,2,3,4,6), labels = c("Once a week or more","About once a month","A few times a year","Once a year","Never"))
fct_count(often_clothing)
library(dplyr);library(Hmisc);library(forcats)
set.seed(1)
agree <- rnorm(n = 360,mean = 2.5,sd = 2)
max(agree);min(agree)
library(Hmisc);library(dplyr)
sust_agree <- cut(agree, breaks = c(-1,1,2,3,4,Inf), labels = c("Strongly disagree", "Somewhat disagree", "Neither agree nor disagree", "Somewhat agree", "Strongly agree"))
fct_count(sust_agree)
# control
set.seed(3)
cont_price <-rnorm(180, 30, 15); min(cont_price);max(cont_price)
# treatment
library(fGarch)
set.seed(2)
treat_price <- as.integer(rnorm(180, 30, 15)); min(treat_price);max(treat_price)
hist(as.integer(treat_price))
# control
set.seed(3)
cont_buy <- as.factor(rbinom(n = 180,size = 1,prob = 0.5))
cont_buy <- fct_recode(cont_buy,"Yes"="1","No"="0")
fct_count(cont_buy)
# treatment
set.seed(12029)
treat_buy <- as.factor(rbinom(n = 180,size = 1,prob = 0.5))
treat_buy <- fct_recode(treat_buy,"Yes"="1","No"="0")
fct_count(treat_buy)
set.seed(2)
quality <- rnorm(n = 360,mean = 1.5,sd = 0.8)
max(quality);min(quality)
library(Hmisc);library(dplyr)
quality_sust <- cut(quality, breaks = c(-Inf,1,2,Inf), labels = c("Yes","No","I don't Know"))
fct_count(quality_sust)
main_data <- data.frame(index = 1:360,often_clothing,sust_agree,sust_pay,info_tags_sust,sust_extra_effort,quality_sust,edu_sust,gender,income_house_sust)
colnames(main_data)
set.seed(20)
main_data <- main_data %>%
mutate(z = row_number() %in% sample(n(), n() / 2) )
skimr::skim(main_data)
main_data <- main_data |>
arrange(z)
main_data <- main_data |>
mutate(price = c(cont_price,treat_price),buy=c(cont_buy,treat_buy))
main_data
main_data |>
select(z,buy)|>
table()
rq1 <- prop.test(x = c(82,74),n=c(180,180),alternative = "greater")
effect_size_rq1 <- prop.test(x = c(82,74),n=c(180,180),alternative = "greater")$estimate; (effect_size_rq1[1]-effect_size_rq1[2])*100 # in percentage%
main_data[main_data$z == 1, ]$price
rq2 <- t.test(x=main_data[main_data$z == 1, ]$price,y=main_data[main_data$z == 0, ]$price,alternative = "greater")
effect_size_rq2 <- t.test(x=main_data[main_data$z == 1, ]$price,y=main_data[main_data$z == 0, ]$price,alternative = "greater")$estimate; (effect_size_rq2[1]-effect_size_rq2[2])
main_data <- main_data %>%
mutate(price_hilo = factor(ifelse(price >= mean(price), 1, 0), levels = c(0, 1)))
model <- glm(price_hilo~z+edu_sust+income_house_sust+gender,data=main_data, family = "binomial")
pred <- predict(model, type='response')
pred_hilo <- ifelse(pred>0.5, 1, 0)
prop.table(table(pred = pred_hilo, true = main_data$price_hilo))
summary(model)