% Copyright 2016 Thomas E. Vaughan
%
% The present document is distributed under the terms of the GNU Free
% Documentation License.
%
% Other source code released with the document is distributed under the terms
% of the GNU Lesser General Public License.
\newcommand{\doctitle}{Linear Regression, the Best-Fit Curve, and C++}
\documentclass[twocolumn]{article}
\usepackage{amsmath} % for \text
\usepackage[english]{babel} % language selection
\usepackage{amsfonts} % for \mathbb
% for formatting figure captions
\usepackage[margin=10pt,font={sf},labelfont=bf]%
{caption} % for formatting figure captions
\usepackage{fancyhdr}
\usepackage{graphicx}
\usepackage{lastpage}
\usepackage{natbib}
\usepackage{tikz}
\usepackage{times}
\usepackage{vmargin}
\usepackage{xcolor}
\usepackage[colorlinks=true,citecolor=blue,hyperfootnotes=false]
{hyperref} % This must be the last package.
\selectlanguage{english} % Configure babel.
% vmargin setup
\setpapersize{USletter}
\setmarginsrb%
{0.375in}% left
{0.375in}% top
{0.375in}% right
{0.5in}% bottom
{2\baselineskip}% headheight
{2\baselineskip}% headsep
{3\baselineskip}% footheight
{4\baselineskip}% footskip
% mydate macro
\newcommand{\mydate}{%
\number\year\space%
\ifcase\month\or%
Jan\or\ Feb\or\ Mar\or\ Apr\or\ May\or\ Jun\or%
Jul\or\ Aug\or\ Sep\or\ Oct\or\ Nov\or\ Dec
\fi\space%
\number\day%
}
% fancyhdr settings
\pagestyle{fancy}
\lhead{\sffamily\textbf{\doctitle}}
\chead{}
\rhead{\sffamily \thepage~of~\pageref{LastPage}}
\renewcommand{\headrulewidth}{1pt}
\renewcommand{\footrulewidth}{1pt}
\lfoot{%
\footnotesize\sffamily
\begin{minipage}{0.95\textwidth}
Copyright\ \copyright\ \ 2016\ \ Thomas E.\ Vaughan.
PDF image generated on \mydate.
Permission is granted to copy, distribute and/or modify this document under
the terms of the GNU Free Documentation License, Version 1.3 or any later
version published by the Free Software Foundation; with no Invariant
Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the
license is included in the section entitled ``GNU Free Documentation
License''.
\end{minipage}%
}
\cfoot{}
\rfoot{%
\begin{minipage}{0.05\textwidth}
\begin{flushright}
\includegraphics[width=0.85\textwidth]{logo}
\end{flushright}
\end{minipage}%
}
\begin{document}
\thispagestyle{fancy}
%\begin{abstract}
%
%This is the abstract.
%
%\end{abstract}
\begin{figure}
\begin{center}
\includegraphics[width=0.95\columnwidth]{sinusoid}
\caption{%
Different linear regressions (red and green curves) fitting the same
simulated data (blue symbols). The red curve represents a linear
regression on the first three functions ($1$, $\cos(2\pi x)$, and
$\sin(2\pi x)$) of the Fourier basis, and the green curve represents a
linear regression on the first four functions ($1$, $x$, $x^2$, and
$x^3$) of the polynomial basis. The 100 data points are spaced evenly
over one period of a model in which Gaussian noise has been added to
the output of $\sin(2\pi x)$. The standard deviation of the model's
    Gaussian noise is $0.3$. The best-fit coefficients are displayed in the
legend of the graph.%
}
\label{fig:sinusoid}
\end{center}
\end{figure}
\section{Introduction}
\emph{Linear regression} allows one to find the best-fit curve through a set of
points in the plane. The ``linear'' in ``linear regression'' refers not to the
shape of the best-fit curve (for it need not be a straight line) but to an
aspect of the expression that represents the curve. The function corresponding
to the best-fit curve must be expressed as a \emph{linear combination} of basis
functions; an extremely convenient and useful fact is that each basis function
may be a \emph{non}-linear function of its variable. So, if the curve can be
expressed as a linear combination of functions of the same variable, and if, by
``best'', one means that the coefficients of the linear combination are chosen
to minimize the sum of squared deviations between the points and the curve,
then one can find the best-fit curve by way of linear regression.

For example, in the $x$-$y$ plane, a curve $y = f(x)$ might correspond to a
function whose expression is
\begin{equation}
f(x) = c_1 b_1(x) + c_2 b_2(x).
\end{equation}
In this case, one calls $x$ the \emph{independent variable} and $y$ the
\emph{dependent variable}. The expression for $f(x)$ is the linear combination
of two expressions, $b_1(x)$ and $b_2(x)$. The coefficients of the linear
combination are $c_1$ and $c_2$. Each expression that is linearly combined
corresponds to a function---in this case, to $b_1$ or $b_2$---that is called
a \emph{basis function}. For linear regression with the model in the present
example, there must be at least three points to constrain the trajectory of
the curve---in general, at least one point more than the number of
coefficients to be solved for. Suppose that
one has a set $\{(x_1, y_1),$ $(x_2, y_2),$ $(x_3, y_3)\}$ of exactly three
measurements, each of which corresponds to a point in the $x$-$y$ plane. The
linear regression determines $c_1$ and $c_2$ so that the sum of squared
deviations between the points and the curve is minimized. That is, the
regression finds $c_1$ and $c_2$ such that the sum
\begin{equation}
\left[y_1 - f(x_1)\right]^2 + \left[y_2 - f(x_2)\right]^2 + \left[y_3 -
f(x_3)\right]^2
\end{equation}
is minimized.
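
As a purely illustrative C++ sketch, the following function computes that sum
of squared deviations for a given pair of coefficients; the basis functions,
the point type, and the function name are hypothetical placeholders rather
than parts of any particular library. The sketch shows the quantity that the
regression minimizes, not the regression itself.
\begin{small}
\begin{verbatim}
// Illustrative only: the sum of squared
// deviations for f(x) = c1*b1(x) + c2*b2(x),
// with placeholder basis functions and a
// placeholder point type.
#include <vector>

struct point { double x, y; };

double b1(double)   { return 1.0; }
double b2(double x) { return x; }

double sum_sq_dev(std::vector<point> const& pts,
                  double c1, double c2)
{
   double s = 0.0;
   for (auto const& p : pts) {
      double const f = c1*b1(p.x) + c2*b2(p.x);
      double const d = p.y - f;
      s += d*d;
   }
   return s;
}
\end{verbatim}
\end{small}
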
The red curve in Figure~\ref{fig:sinusoid} shows a linear regression for a set
of 100 simulated measurements. The first three functions of the Fourier basis
are used to estimate from the measurements the underlying oscillation's
vertical offset $B$, amplitude $A$, and phase $\phi$. This choice of basis
amounts to the hypothesis that the variation underlying the noise in the data
is a sinusoidal oscillation with unknown amplitude, offset, and phase but with
known period or wavelength. For the case of unit wavelength, the desired form
of the oscillation is expressed as
\begin{equation}
f(x) = B + A \sin(2\pi x + \phi).
\end{equation}
Expanding this according to the appropriate trigonometric identity, one finds
\begin{equation}
f(x) = B + A \sin(\phi) \cos(2\pi x) + A \cos(\phi) \sin(2\pi x).
\end{equation}
Using the first three functions of the Fourier basis, however, one constructs
$f(x)$ as a linear combination and seeks $c_1$, $c_2$, and $c_3$ such that
\begin{equation}
f(x) = c_1 + c_2 \cos(2\pi x) + c_3 \sin(2\pi x);
\end{equation}
so one has
\begin{eqnarray}
B &=& c_1\\
A \sin(\phi) &=& c_2\\
A \cos(\phi) &=& c_3,
\end{eqnarray}
and therefore
\begin{eqnarray}
A &=& \sqrt{c_2^2 + c_3^2}\\
\phi &=& \arctan\left(\frac{c_2}{c_3}\right).
\end{eqnarray}
In the example of Figure~\ref{fig:sinusoid}, the data derive from $\sin(2\pi
x)$ plus Gaussian noise. The best fit on the Fourier basis might ideally (with
no noise) find $B = 0$, $A = 1$, and $\phi = 0$. With 100~simulated points and
a standard deviation of 0.3~in the noise, the example shows a best-fit $B =
0.00182$, $A = \sqrt{0.0217^2 + 0.975^2} = 0.975$, and $\phi =
\arctan(\tfrac{0.0217}{0.975}) = 1.27^\circ$.
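
In code, the recovery of the offset, the amplitude, and the phase from the
fitted coefficients is direct. The following C++ sketch uses the best-fit
values quoted above; \texttt{std::atan2} is used instead of \texttt{std::atan}
so that the quadrant of the phase is handled correctly.
\begin{small}
\begin{verbatim}
// Offset, amplitude, and phase from the
// best-fit coefficients quoted in the text.
#include <cmath>
#include <cstdio>

int main()
{
   double const c1 = 0.00182; // B
   double const c2 = 0.0217;  // A*sin(phi)
   double const c3 = 0.975;   // A*cos(phi)
   double const A   = std::hypot(c2, c3);
   double const phi = std::atan2(c2, c3);
   std::printf("B=%g A=%g phi=%g rad\n",
               c1, A, phi);
   return 0;
}
\end{verbatim}
\end{small}
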
The green curve in Figure~\ref{fig:sinusoid} also shows a linear regression for
the same set of points. The first four functions of the polynomial basis are
used to estimate the first four polynomial coefficients. This choice of basis
amounts to the hypothesis that the variation underlying the noise in the data
is a polynomial function of at most third degree. One sees, from the goodness
of the fit, that a third-degree polynomial fit is a reasonable approximation of
sinusoidal variation over a single wavelength.

The main benefits of linear regression include determinism in the time required
to compute the fit and certainty that the solution is in fact the best
solution. The main problem, however, is that linear regression cannot always be
used. Returning to the example, one notes that if one wants to fit a sinusoidal
model to a set of points in the plane, then the phase can be extracted from the
fit, but the wavelength cannot be extracted. The wavelength cannot be found by
linear regression because the wavelength is not the coefficient of a basis
function. However, if the wavelength is known, then the linear regression can
be performed, and the phase can be extracted from the ratio $c_2/c_3$.

The phrase ``independent variable'' has at least two different
meanings. Although each basis function is expressed in terms of the same
independent variable, the output of each basis function serves in a different
sense as a separate independent variable. So ``independent variable'' has two
senses distinguished here:
\begin{itemize}
\item In Sense~1, the context of a curve in the plane, the abscissa
represents the independent variable.
\item However, in Sense~2, the context of \emph{multiple linear regression},
each term multiplied by a coefficient represents an independent variable.
\end{itemize}

In multiple linear regression, there are multiple independent variables, each
multiplied by a coefficient, and the coefficients are determined by minimizing
the sum of squared deviations. The present document describes a special case
of multiple linear regression. In this case, each of the multiple independent
variables (in Sense 2) is really the output of a basis function of the same
underlying independent variable (in Sense 1). So each basis function transforms
the same underlying independent variable into a new, different independent
variable, for the purpose of the regression.

The phrase ``\htmladdnormallink{linear
regression}{https://en.wikipedia.org/wiki/Linear\_regression}'' is associated
with a wide variety of mathematical procedures. The present document describes
a C++-based approach to the computation of a linear regression. There are
different ways to compute the same regression. Documented here is the
\htmladdnormallink{method recommended in
Version~3.2.8}{http://eigen.tuxfamily.org/dox/group\_\_TutorialLinearAlgebra.html\#TutorialLinAlgLeastsquares}
of the \htmladdnormallink{Eigen header-only C++
library}{http://eigen.tuxfamily.org/index.php?title=Main\_Page}. This is the
method involving Jacobi \htmladdnormallink{singular-value
decomposition}{https://en.wikipedia.org/wiki/Singular\_value\_decomposition}.
\section{General Formulation}
Suppose that one wants to find by linear regression the column matrix
\begin{equation}
\mathbf{c} =
\begin{pmatrix}
c_1\\
\vdots\\
c_n
\end{pmatrix}
\end{equation}
of parameters of the function
\begin{equation}
f_{\mathbf{c}}(x) = \sum_{i=1}^{n} c_i \: b_i(x),
\end{equation}
where the basis functions are $b_1,$ $\ldots,$ $b_n$. For a set $\{(x_1,y_1),$
$\ldots,$ $(x_m,y_m)\}$ of $m$ measured points, the sum of squared
deviations---each between the fit and one of the measurements---is
\begin{equation}
\sum_{j=1}^{m} \left[ y_j - \sum_{i=1}^{n} c_i \: b_i(x_j) \right]^2.
\end{equation}
To find the best fit, one takes the derivative of this expression with respect
to each of the coefficients and sets that derivative to zero. Taking the
derivative with respect to $c_k$ and setting the result to zero (and then
dividing both sides of the equation by two), one finds
\begin{equation}
\sum_{j=1}^{m} \left[ y_j - \sum_{i=1}^{n} c_i \: b_i(x_j) \right] b_k(x_j)
= 0.
\end{equation}
So, for every $k$ between $1$ and $n$ (inclusive), one has
\begin{equation}
\sum_{j=1}^{m} \sum_{i=1}^{n} c_i \: b_i(x_j) \: b_k(x_j) = \sum_{j=1}^{m}
y_j \: b_k(x_j).
\label{eq:deriv-result}
\end{equation}
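
For a concrete instance, consider the two-coefficient model of the
introduction with three measured points ($n = 2$, $m = 3$). Abbreviating
$S_{ik} = \sum_{j=1}^{3} b_i(x_j) \: b_k(x_j)$ and
$t_k = \sum_{j=1}^{3} y_j \: b_k(x_j)$ (notation introduced here only for this
example), Equation~\ref{eq:deriv-result} reduces to the pair of simultaneous
linear equations
\begin{eqnarray}
  S_{11} \: c_1 + S_{21} \: c_2 &=& t_1\\
  S_{12} \: c_1 + S_{22} \: c_2 &=& t_2,
\end{eqnarray}
which one can solve for $c_1$ and $c_2$.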
\subsection{The Straightforward and Fast Approach}
Equation~\ref{eq:deriv-result} is the same as the matrix equation,
\begin{equation}
\boldsymbol{\beta} \mathbf{c} = \mathbf{d}
\label{eq:beta}
\end{equation}
that is,
\begin{equation}
\begin{pmatrix}
\beta_{11} & \cdots & \beta_{n1}\\
\vdots & & \vdots\\
\beta_{1n} & \cdots & \beta_{nn}
\end{pmatrix}
\begin{pmatrix}
c_1\\
\vdots\\
c_n
\end{pmatrix}
=
\begin{pmatrix}
\sum_{j=1}^{m} y_j \: b_1(x_j)\\
\vdots\\
\sum_{j=1}^{m} y_j \: b_n(x_j)
\end{pmatrix},
\end{equation}
where
\begin{equation}
\beta_{ik} = \beta_{ki} = \sum_{j=1}^m b_i(x_j) \: b_k(x_j).
\end{equation}
Inverting $\boldsymbol{\beta}$ to solve for $\mathbf{c}$ is the
computationally fastest approach, but in some cases it is not so numerically
robust as one might prefer.
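
As an illustrative sketch of this direct approach in C++ with Eigen (the
function name, the use of \texttt{std::function}, and the data layout are
assumptions made only for the example), one might build $\boldsymbol{\beta}$
and $\mathbf{d}$ and then solve Equation~\ref{eq:beta} as follows:
\begin{small}
\begin{verbatim}
// Direct normal-equations fit (illustrative).
#include <functional>
#include <vector>
#include <Eigen/Dense>

typedef std::function<double(double)> basis_func;

Eigen::VectorXd
fit_direct(std::vector<basis_func> const& b,
           std::vector<double> const& x,
           std::vector<double> const& y)
{
   int const n = static_cast<int>(b.size());
   int const m = static_cast<int>(x.size());
   Eigen::MatrixXd beta =
      Eigen::MatrixXd::Zero(n, n);
   Eigen::VectorXd d = Eigen::VectorXd::Zero(n);
   for (int j = 0; j < m; ++j) {
      for (int k = 0; k < n; ++k) {
         d(k) += y[j] * b[k](x[j]);
         for (int i = 0; i < n; ++i) {
            beta(i, k) +=
               b[i](x[j]) * b[k](x[j]);
         }
      }
   }
   // Inversion, as described in the text; one
   // could instead factor beta, for example
   // via beta.ldlt().solve(d).
   return beta.inverse() * d;
}
\end{verbatim}
\end{small}
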
\subsection{Numerical Robustness}
Another approach picks up where the previous approach ends. One notices that
\begin{equation}
\boldsymbol{\beta} = \mathbf{B}^{\text{T}} \mathbf{B}
\end{equation}
and
\begin{equation}
\mathbf{d} = \mathbf{B}^{\text{T}} \mathbf{y},
\end{equation}
where
\begin{equation}
\mathbf{B} =
\begin{pmatrix}
b_1(x_1) & \cdots & b_n(x_1)\\
\vdots & & \vdots\\
b_1(x_m) & \cdots & b_n(x_m)
\end{pmatrix}
\end{equation}
and
\begin{equation}
\mathbf{y} =
\begin{pmatrix}
y_1\\
\vdots\\
y_m
\end{pmatrix}.
\end{equation}
With these definitions, one sees that Equation~\ref{eq:beta} becomes
\begin{equation}
\mathbf{B}^{\text{T}} \mathbf{B} \mathbf{c} = \mathbf{B}^{\text{T}}
\mathbf{y}
\end{equation}
or, equivalently,
\begin{equation}
\mathbf{B}^{\text{T}} [\mathbf{B} \mathbf{c} - \mathbf{y}] =
\mathbf{0}_{n \times 1},
\label{eq:B}
\end{equation}
where $\mathbf{0}_{n \times 1}$ is the zero-valued column matrix with $n$ rows.
This suggests that $\mathbf{B} \mathbf{c}$ approximates $\mathbf{y}$ as
closely as possible in the least-squares sense. What one wants, then, is the
solution $\mathbf{c}$ to the overdetermined system expressed in
Equation~\ref{eq:B}. The system is said to be overdetermined because
$\mathbf{B}$ has more rows than columns; that is, $m > n$. The solution can be
computed by way of the singular-value
decomposition (SVD) of $\mathbf{B}$. The numerical stability of this approach,
which sacrifices some speed, is better than that of the previous approach. One
may find the details of the SVD approach in a textbook or in the Wikipedia page
on SVD.
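
In outline: if $\mathbf{B} = \mathbf{U} \boldsymbol{\Sigma}
\mathbf{V}^{\text{T}}$ is a singular-value decomposition of $\mathbf{B}$, then
the least-squares solution for $\mathbf{c}$ is
\begin{equation}
  \mathbf{c} = \mathbf{V} \boldsymbol{\Sigma}^{+} \mathbf{U}^{\text{T}}
  \mathbf{y},
\end{equation}
where $\boldsymbol{\Sigma}^{+}$ is obtained from $\boldsymbol{\Sigma}$ by
transposing it and replacing each non-zero singular value with its reciprocal.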

However, one might prefer the first, straightforward approach (the direct
solution to Equation~\ref{eq:beta}) in a circumstance in which one needs as
much speed as possible and knows that the numerical accuracy will be
sufficient for every $\boldsymbol{\beta}$ that one will use.
\subsection{Numerically Robust Implementation Via Eigen}
If, in a C++ program that includes the headers for Eigen~3, the variable
\texttt{B} of type \texttt{MatrixXd} contains the value of every basis function
evaluated at every measured $x$, and if the variable \texttt{y} of type
\texttt{VectorXd} contains every measured $y$, then the following C++ expression
returns the solution (of type \texttt{VectorXd}):
\begin{small}
\begin{verbatim}
B.jacobiSvd(ComputeThinU | ComputeThinV).solve(y)
\end{verbatim}
\end{small}
The flags, combined by the bitwise-OR operation, indicate that only the thin
decomposition necessary for solving is computed. (The full singular-value
decomposition is not necessary for solving the system.)
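
For concreteness, here is a minimal, self-contained sketch (with noiseless
placeholder data and hypothetical names, not drawn from any published program)
that builds \texttt{B} and \texttt{y} for the three-function Fourier basis of
the introduction and solves for the coefficients:
\begin{small}
\begin{verbatim}
// Illustrative fit of the three-function
// Fourier basis via Eigen's Jacobi SVD.
#include <cmath>
#include <cstdio>
#include <Eigen/Dense>

int main()
{
   using namespace Eigen;
   int const m = 100; // measurements
   int const n = 3;   // basis functions
   double const pi = 4.0 * std::atan(1.0);
   MatrixXd B(m, n);
   VectorXd y(m);
   for (int j = 0; j < m; ++j) {
      double const x = double(j) / m;
      B(j, 0) = 1.0;
      B(j, 1) = std::cos(2.0 * pi * x);
      B(j, 2) = std::sin(2.0 * pi * x);
      // In practice, y holds measurements.
      y(j) = std::sin(2.0 * pi * x);
   }
   VectorXd const c =
      B.jacobiSvd(ComputeThinU | ComputeThinV)
       .solve(y);
   std::printf("c1=%g c2=%g c3=%g\n",
               c(0), c(1), c(2));
   return 0;
}
\end{verbatim}
\end{small}
With the noiseless placeholder data, the computed coefficients should be, up
to rounding error, $c_1 = 0$, $c_2 = 0$, and $c_3 = 1$.
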
%\bibliographystyle{plainnat}
%
%\begin{thebibliography}{}
%
%\bibitem[LastName1 et al.(Year)LastName1, LastName2, and LastName3]{CiteTag}
%LastName1, Initials1, Initials2 LastName2, and Initials3 LastName3\ \
%``Title of Article''\ \ {\it Title of Journal}, {\bf Volume}: Pages, Year.
%
%\end{thebibliography}
\newpage
\input{fdl-1.3}
\end{document}