---
title: "week3-causal inference"
author: "Zihao Wang"
date: "2017-06-01"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
In this part, we focus on the statistical power curve. However, the x-axis is no longer $\bar Y_{i}$ but the standardized effect $\frac{\bar Y_{i}}{\sqrt{\sigma_{i}^2+\sigma_{0}^2}}$. Since scaling the statistic makes the common factor cancel between numerator and denominator, we expect to obtain a power curve identical to the one from week two.
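As a quick sanity check of the cancellation argument (a sketch of my own, not part of the original code: the helper `theta` just restates the standardized statistic used below), scaling both the mean difference and the standard deviations by the same constant leaves the statistic unchanged:

```r
# Standardized statistic: (Ybar_i - Ybar_0) / sqrt(s_i^2/n_i + s_0^2/n_0)
theta <- function(ybar1, ybar0, s1, s0, n1, n0) {
  (ybar1 - ybar0) / sqrt(s1^2 / n1 + s0^2 / n0)
}
s <- 3  # common scaling factor
theta(0.4, 0, 1, 1, 1000, 1000)
theta(s * 0.4, s * 0, s * 1, s * 1, 1000, 1000)  # identical value: s cancels
```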
```{r}
simulation <- function(N, K, n, Ybar, sigma, M) {
  # Draw M simulated sample means and sample variances for each of the K arms.
  # Each arm has n[i] = N/K observations, so sd(Ybar.sim) = sigma[i] * sqrt(K/N).
  Ybar.sim <- matrix(nrow = K, ncol = M)
  variance.sim <- matrix(nrow = K, ncol = M)
  for (i in 1:K) {
    for (j in 1:M) {
      Ybar.sim[i, j] <- rnorm(1, Ybar[i], sigma[i] * sqrt(K / N))
      variance.sim[i, j] <- rchisq(1, n[i] - 1) * sigma[i]^2 / (n[i] - 1)
    }
  }
  # Neyman variance of each treatment-vs-control contrast (arm 1 is control).
  # Note: the treatment arm's variance must be divided by n[i + 1], not n[i].
  Vneyman.sim <- matrix(nrow = K - 1, ncol = M)
  for (i in 1:(K - 1)) {
    for (j in 1:M) {
      Vneyman.sim[i, j] <- variance.sim[i + 1, j] / n[i + 1] + variance.sim[1, j] / n[1]
    }
  }
  # Standardized statistics and two-sided normal p-values.
  theta.sim <- matrix(nrow = K - 1, ncol = M)
  for (i in 1:(K - 1)) {
    for (j in 1:M) {
      theta.sim[i, j] <- (Ybar.sim[i + 1, j] - Ybar.sim[1, j]) / sqrt(Vneyman.sim[i, j])
    }
  }
  pvalue.sim <- matrix(nrow = K - 1, ncol = M)
  for (i in 1:(K - 1)) {
    for (j in 1:M) {
      pvalue.sim[i, j] <- 2 * pnorm(-abs(theta.sim[i, j]))
    }
  }
  # Combine the K - 1 p-values by taking their minimum.
  pcombine.sim <- as.numeric()
  for (j in 1:M) pcombine.sim[j] <- min(pvalue.sim[, j])
  pcombine.sim
}
```
```{r}
basecase <- simulation(10000, 10, rep(1000, 10), rep(0, 10), rep(1, 10), 10000)
hist(basecase)
# Empirical 5% critical value of the combined (minimum) p-value under the null.
cutoff <- sort(basecase)[0.05 * 10000]
```
```{r}
statistics_power <- function(N, K, n, Ybar, sigma, M, epsilon) {
  # Empirical 5% critical value of the combined p-value under the null.
  cutoff <- sort(simulation(N, K, n, rep(0, K), rep(1, K), M))[0.05 * M]
  prob <- as.numeric()
  for (i in 1:length(epsilon)) {
    # Shift the last arm's mean to epsilon[i] and estimate the rejection rate.
    pcombine <- simulation(N, K, n, c(Ybar[1:(length(Ybar) - 1)], epsilon[i]), sigma, M)
    prob[i] <- length(which(pcombine < cutoff)) / length(pcombine)
  }
  # Plot power against the standardized effect epsilon / sqrt(sigma_K^2 + sigma_1^2).
  plot(epsilon / sqrt(sigma[K]^2 + sigma[1]^2), prob,
       xlab = "Standardized effect", ylab = "Power")
}
```
Here, I define a function that plots power against the standardized effect $\frac{\bar Y_{i}}{\sqrt{\sigma_{i}^2+\sigma_{0}^2}}$. The resulting curve is the same as the plot of power against $\bar Y_{i}$, which supports the idea that the standardized statistic, obtained by dividing $\bar Y_{i}$ by the corresponding square root of the variance, behaves just as well as $\bar Y_{i}$ itself.
```{r}
statistics_power(10000,10,rep(1000,10),rep(0,10),rep(1,10),10000,seq(0,1,0.01))
```
```{r}
statistics_power(10000,20,rep(500,20),rep(0,20),rep(1,20),10000,seq(0,1,0.01))
statistics_power(10000,40,rep(250,40),rep(0,40),rep(1,40),10000,seq(0,1,0.01))
```
From the above, increasing the number of arms $K$ does not change the general shape of the power curve. However, the larger $K$ is, the more slowly the curve rises. This is natural: with more groups in hand, it becomes harder to reject the null hypothesis when the alternative is true, because the minimum of more p-values is smaller under the null, so the critical value must shrink accordingly.
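A rough way to quantify this: if the $K-1$ p-values were independent Uniform(0,1) under the null, their minimum would follow a Beta(1, K-1) distribution, so the 5% critical value of the combined statistic shrinks as $K$ grows. This is only an approximation here, since all contrasts share the control arm and are not exactly independent:

```r
# 5th percentile of min(p_1, ..., p_m) for m independent uniform p-values,
# i.e. qbeta(0.05, 1, m) = 1 - 0.95^(1/m); it shrinks as the number of arms grows.
m <- c(10, 20, 40) - 1        # number of treatment-vs-control comparisons
cutoffs <- qbeta(0.05, 1, m)
round(cutoffs, 4)             # -> 0.0057 0.0027 0.0013
```

These values are close to the empirical cutoffs the simulations produce, which is why a fixed true effect yields lower power at larger $K$.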
```{r}
K <- seq(5, 20, 1)
result <- as.numeric()
for (i in 1:length(K)) {
  # Empirical 5% critical value under the null for this K.
  cutoff <- sort(simulation(10000, K[i], rep(floor(10000 / K[i]), K[i]),
                            rep(0, K[i]), rep(1, K[i]), 10000))[500]
  # Bisection search for the smallest epsilon with power >= 0.8; stop once the
  # bracket is narrow rather than waiting for the Monte Carlo estimate to hit
  # 0.8 exactly, which may never happen.
  upper <- 1
  lower <- 0
  while (upper - lower > 1e-3) {
    mid <- (upper + lower) / 2
    pcombine <- simulation(10000, K[i], rep(floor(10000 / K[i]), K[i]),
                           c(rep(0, K[i] - 1), mid), rep(1, K[i]), 10000)
    prob <- length(which(pcombine < cutoff)) / length(pcombine)
    if (prob > 0.8) upper <- mid else lower <- mid
  }
  result[i] <- (upper + lower) / 2
}
```
```{r}
plot(seq(5,20,1),result,xlab = "Number of the arms", ylab = "The corresponding epsilon value",main = "The relationship between K and epsilon")
```
Here, I study the relationship between the number of arms $K$ and $\epsilon$, where $\epsilon$ is the smallest value for which the corresponding statistical power exceeds 0.8.
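The search above is a bisection: power is monotone increasing in $\epsilon$, so repeatedly halving the bracket homes in on the crossing point. A minimal sketch of the same idea on a deterministic function (the helper name `bisect` is mine, not part of the original code):

```r
# Find the point where an increasing function f crosses `target` on [lower, upper].
bisect <- function(f, target, lower = 0, upper = 1, tol = 1e-6) {
  while (upper - lower > tol) {
    mid <- (lower + upper) / 2
    if (f(mid) > target) upper <- mid else lower <- mid
  }
  (lower + upper) / 2
}
bisect(function(x) x^2, 0.8)  # ~ sqrt(0.8) ~ 0.8944
```

With a Monte Carlo power estimate the function value is noisy, so a tolerance on the bracket width is a safer stopping rule than demanding an exact hit on 0.8.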
```{r}
par(mfrow = c(1, 3))
plot(seq(5, 20, 1), result)
plot(sqrt(seq(5, 20, 1)), result)
plot(log(seq(5, 20, 1)), result)
```
```{r}
K.1 <- seq(2, 50, 1)
result.1 <- as.numeric()
for (i in 1:length(K.1)) {
  # Empirical 5% critical value under the null for this K.
  cutoff <- sort(simulation(10000, K.1[i], rep(floor(10000 / K.1[i]), K.1[i]),
                            rep(0, K.1[i]), rep(1, K.1[i]), 10000))[500]
  # Bisection search for the smallest epsilon with power >= 0.8, stopping on
  # bracket width as above.
  upper <- 1
  lower <- 0
  while (upper - lower > 1e-3) {
    mid <- (upper + lower) / 2
    pcombine <- simulation(10000, K.1[i], rep(floor(10000 / K.1[i]), K.1[i]),
                           c(rep(0, K.1[i] - 1), mid), rep(1, K.1[i]), 10000)
    prob <- length(which(pcombine < cutoff)) / length(pcombine)
    if (prob > 0.8) upper <- mid else lower <- mid
  }
  result.1[i] <- (upper + lower) / 2
}
```
```{r}
plot(K.1,result.1,xlab = "Number of the arms", ylab = "The corresponding epsilon value",main = "The relationship between K and epsilon")
```
```{r}
par(mfrow=c(1,3))
plot(K.1,result.1)
plot(sqrt(K.1),result.1)
plot(log(K.1),result.1)
```
From the above, I find that the required $\epsilon$ increases with $K$, and among the three candidate relationships the linear relationship between $\epsilon$ and $\sqrt{K}$ is the most pronounced.
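The visual comparison can be backed up numerically by comparing $R^2$ across the three candidate models. A sketch on synthetic data (an assumption of mine: $\epsilon$ grows like a multiple of $\sqrt{K}$ plus noise, mimicking the observed pattern; the same three fits could be run on `result.1` directly):

```r
set.seed(1)
K.grid <- 2:50
# Synthetic "required epsilon" values generated from a sqrt relationship.
eps <- 0.05 * sqrt(K.grid) + rnorm(length(K.grid), sd = 0.005)
r2 <- c(linear = summary(lm(eps ~ K.grid))$r.squared,
        sqrt   = summary(lm(eps ~ sqrt(K.grid)))$r.squared,
        log    = summary(lm(eps ~ log(K.grid)))$r.squared)
round(r2, 3)  # the sqrt model should attain the highest R^2
```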