\tightsection{Counterfactual testing \jc{This part must be significantly reduced!}}
\label{sec:counterfactualtesting}
In this paper we make use of a methodology we call ``counterfactual testing'' (CT) to produce unbiased estimates of the impact of CDN selection algorithms on quality outcomes. CT is a relative of A/B testing, a well-known method for estimating the effects of decisions (see, for example, \cite{kohavi2009online} for a case study of its use at Microsoft). Like A/B testing, it relies crucially on randomization of decisions; unlike A/B testing, it permits more flexible retrospective evaluation of decision-making algorithms, at the cost of reduced statistical efficiency (i.e., more data are needed to reach a given precision). \henry{It is quite likely that this method is used to test treatment targeting strategies in clinical trials, and we should cite related literature if it exists, but I cannot find any.}

The idea is that we can accurately evaluate the counterfactual for sessions whose counterfactual decisions happen to match their actual decisions, since we observe their outcomes. By dropping from our dataset every session whose counterfactual decision does not match its actual decision, we obtain a subset over which average quality under the counterfactual decisions can be computed directly. Assuming the actual decisions are random, the choice to drop a session in this way is also random, so the procedure is equivalent to randomly dropping some of the data. A simple reweighting step, which corrects for the fact that sessions with different counterfactual decisions are dropped with different probabilities, then recovers unbiased estimates. This procedure works for arbitrary sets of counterfactual decisions, as long as the actual decisions in the dataset were taken randomly.

Let $J \subset \mathbb{N}$ be a set of indices for the possible decisions, which we assume for simplicity is the same for every session. Let $d_i \in J$ be the decision made for the $i$th session, $0 \leq i < N$, and let $\hat{d}_i \in J$ be the simulated (counterfactual) decision for the same session. Let $q_i \in \mathbb{R}$ be the quality outcome actually observed for that session, and let $\hat{q}_i \in \mathbb{R}$ be the (unknown) quality outcome we would have observed had decision $\hat{d}_i$ been taken. We assume that the $d_i$ were chosen mutually independently and that $d_i$ is independent of all observable attributes of session $i$. We also assume that each $d_i$ has a known distribution $P_i$ and that $P_i$ assigns strictly positive probability to every possible decision. All of these assumptions are met if, for example, the decisions are chosen uniformly at random for each session. The most restrictive assumption is that there are no interactions among the effects of decisions; that is, we assume $q_i$ is independent of $d_j$ and $\hat{q}_i$ is independent of $\hat{d}_j$ for $i \neq j$. This assumption is certainly violated to some extent, though a similar assumption is made in A/B testing.

Let $\bar{Q}$ be the average quality in the counterfactual situation, $\frac{1}{N} \sum_{i < N} \hat{q}_i$. Let $\hat{Q}$ be the average quality reported by CT, and let $1[x] = 1$ if $x$ is true and $0$ otherwise. Then $\hat{Q}$ is computed as follows:
\begin{align*}
\hat{Q} &= \frac{1}{N} \sum_{i < N} \frac{1}{P_i(\hat{d}_i)} 1[d_i = \hat{d}_i] q_i
\end{align*}
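For concreteness, the following sketch shows how $\hat{Q}$ could be computed from per-session logs. It is illustrative only: the function and variable names are ours and do not describe any production system.
\begin{verbatim}
import numpy as np

def ct_estimate(actual, counterfactual, quality, prob_cf):
    """Counterfactual testing (CT) estimate of average quality.

    actual         -- actual decisions d_i, one per session
    counterfactual -- simulated decisions \hat{d}_i
    quality        -- observed quality outcomes q_i
    prob_cf        -- P_i(\hat{d}_i): probability the randomized
                      policy assigned to the counterfactual decision
    """
    actual = np.asarray(actual)
    counterfactual = np.asarray(counterfactual)
    quality = np.asarray(quality, dtype=float)
    prob_cf = np.asarray(prob_cf, dtype=float)

    keep = (actual == counterfactual)   # 1[d_i = \hat{d}_i]
    weights = 1.0 / prob_cf             # importance weights
    # Mismatched sessions contribute zero; matched ones are reweighted.
    return np.mean(keep * weights * quality)
\end{verbatim}
If decisions are drawn uniformly at random over $|J|$ options, \texttt{prob\_cf} is $1/|J|$ for every session, so each retained session simply receives weight $|J|$.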
Sessions for which counterfactual decisions do not match actual decisions are dropped, and the remaining sessions are reweighted by the inverse of the probability that they are retained (i.e., that $d_i = \hat{d}_i$). It is straightforward to verify that $\hat{Q}$ is an {\it unbiased} estimate of $\bar{Q}$ for any $N$; it is essentially an estimate of $\bar{Q}$ using {\it importance sampling}, which is unbiased \cite{something}. The key steps below use the fact that $q_i = \hat{q}_i$ whenever $d_i = \hat{d}_i$, and that $d_i$ is independent of $\hat{q}_i$:
\begin{align*}
\E[\hat{Q}] &= \E[\frac{1}{N} \sum_{i < N} \frac{1}{P_i(\hat{d}_i)} 1[d_i = \hat{d}_i] q_i] \\
&= \frac{1}{N} \sum_{i < N} \frac{1}{P_i(\hat{d}_i)} \E[1[d_i = \hat{d}_i] q_i] \\
&= \frac{1}{N} \sum_{i < N} \frac{1}{P_i(\hat{d}_i)} \E[1[d_i = \hat{d}_i] \hat{q}_i] \\
&= \frac{1}{N} \sum_{i < N} \frac{1}{P_i(\hat{d}_i)} P_i(\hat{d}_i) \hat{q}_i \\
&= \frac{1}{N} \sum_{i < N} \hat{q}_i \\
&= \bar{Q}
\end{align*}
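This identity can also be checked numerically. The sketch below is purely illustrative (the synthetic quality table and the uniform decision distribution are assumptions made for the example, not properties of our system): it fixes a set of counterfactual decisions, repeats the randomized assignment of actual decisions many times, and compares the average of the resulting CT estimates to $\bar{Q}$.
\begin{verbatim}
import numpy as np

def ct_estimate(d, d_hat, q, p):
    # Same estimator as above, in compact form.
    return np.mean((d == d_hat) / p * q)

rng = np.random.default_rng(0)
N, K = 10_000, 3                             # sessions and |J|
q_table = rng.uniform(0, 5, size=(N, K))     # hypothetical quality of every decision
d_hat = rng.integers(K, size=N)              # fixed counterfactual decisions
q_bar = q_table[np.arange(N), d_hat].mean()  # true counterfactual average

estimates = []
for _ in range(200):                         # repetitions of the randomization
    d = rng.integers(K, size=N)              # uniform random actual decisions
    q = q_table[np.arange(N), d]             # outcomes actually observed
    estimates.append(ct_estimate(d, d_hat, q, 1.0 / K))

print(q_bar, np.mean(estimates))             # the two values should be close
\end{verbatim}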
Unbiasedness means that CT does not on average (i.e. over many hypothetical repetitions of randomized decision-making) over- or underestimate $\bar{Q}$. However, like A/B testing, it is still subject to estimation error, which depends on the sample size. By dropping many sessions to compute $\hat{Q}$, CT reduces the effective sample size, exacerbating estimation error. Confidence intervals reported in this paper incorporate this additional uncertainty.
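One standard way to obtain intervals that reflect this reduced effective sample size, not necessarily the exact procedure behind the numbers reported in this paper, is a nonparametric bootstrap over sessions, sketched below under the same assumptions as the earlier code.
\begin{verbatim}
import numpy as np

def ct_bootstrap_ci(actual, counterfactual, quality, prob_cf,
                    n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the CT estimate.

    Resampling whole sessions (matched and mismatched alike) preserves
    the extra variance caused by dropping mismatched sessions.
    """
    rng = np.random.default_rng(seed)
    actual = np.asarray(actual)
    counterfactual = np.asarray(counterfactual)
    quality = np.asarray(quality, dtype=float)
    prob_cf = np.asarray(prob_cf, dtype=float)
    n = len(actual)

    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(n, size=n)   # resample sessions with replacement
        stats[b] = np.mean((actual[idx] == counterfactual[idx])
                           / prob_cf[idx] * quality[idx])
    return tuple(np.quantile(stats, [alpha / 2, 1 - alpha / 2]))
\end{verbatim}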