Module 00 Interlude
We have some data. We want to model it.
- First we need to make an assumption, or hypothesis, about the structure of the data and the relationship between the variables.
- Then we can apply that hypothesis to our data to make predictions.
$$hypothesis(data) = predictions$$
Let’s start with a very simple and intuitive hypothesis on how the price of a spaceship can be predicted based on the power of its engines.
We will consider that the more powerful the engines are, the more expensive the spaceship is.
Furthermore, we will assume that the price increase is proportional to the power increase. In other words, we will look for a linear relationship between the two variables.
This means that we will formulate the price prediction with a linear equation that you might already be familiar with:

$$\hat{y} = ax + b$$

We add the ˆ symbol (pronounced "hat") over the $y$ to specify that it is a *predicted* value, as opposed to an actual value of $y$ observed in the data.
Now, how can we generate a set of predictions on an entire dataset? Let’s consider a dataset containing $m$ data points. For each data point $x^{(i)}$, our hypothesis produces a prediction $\hat{y}^{(i)}$:

$$\hat{y}^{(i)} = ax^{(i)} + b \quad \text{for } i = 1, \dots, m$$
Where:
- $\hat{y}^{(i)}$ is the $i^{th}$ component of the vector $\hat{y}$,
- $x^{(i)}$ is the $i^{th}$ component of the vector $x$.
Which can be expressed as:

$$\hat{y}^{(i)} = \theta_0 + \theta_1 x^{(i)}$$
For example, the prediction for the third data point would be $\hat{y}^{(3)} = \theta_0 + \theta_1 x^{(3)}$.
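As a concrete sketch, here is how these per-example predictions could be computed with NumPy. The parameter values and the engine-power data below are made up purely for illustration:

```python
import numpy as np

# Illustrative parameter values (theta_0: intercept, theta_1: slope).
theta_0, theta_1 = 2.0, 3.0

# A tiny, made-up vector of engine powers (one value per spaceship).
x = np.array([1.0, 2.0, 4.0])

# One prediction per example: y_hat[i] = theta_0 + theta_1 * x[i]
y_hat = np.array([theta_0 + theta_1 * x_i for x_i in x])

print(y_hat)  # [ 5.  8. 14.]
```

Note that this computes the predictions one by one; we will see below how to do it all at once.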
You might have two questions at the moment:
- **WTF is that weird symbol?** This strange symbol, $\theta$, is called "theta".
- **Why use this notation instead of $a$ and $b$, like we’re used to?** Despite seeming more complicated at first, the theta notation is actually meant to simplify your equations later on. Why? $a$ and $b$ are fine for a model with two parameters, but you will soon need to build more complex models that take into account more variables than just $x$. You could add more letters, like this: $$\hat{y} = ax_1 + bx_2 + cx_3 + \dots + yx_{25} + z$$ But how do you go beyond 26 parameters? And how easily can you tell which parameter is associated with, let’s say, $x_{19}$? That’s why it becomes handier to describe all your parameters with the theta notation and indices. With $\theta$, you just have to increment the number to name each parameter: $$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_{2468} x_{2468}$$ Easy, right?
As you know, vectors and matrices can be multiplied to perform linear combinations. Let’s do a little linear algebra trick to optimize our calculation and use matrix multiplication. If we add a column full of 1’s to our vector of examples $x$, we get the following matrix, which we will call $X'$:

$$X' = \begin{bmatrix} 1 & x^{(1)} \\ 1 & x^{(2)} \\ \vdots & \vdots \\ 1 & x^{(m)} \end{bmatrix}$$

We also store our two parameters in a vector $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}$.
We can then rewrite our hypothesis as:

$$\hat{y} = X' \cdot \theta$$
Therefore, the calculation of each $\hat{y}^{(i)}$ is the dot product between the $i^{th}$ row of $X'$ and the vector $\theta$:

$$\hat{y}^{(i)} = 1 \cdot \theta_0 + x^{(i)} \cdot \theta_1$$
We can now get to the same result as in the previous exercise with just a single multiplication between our brand new $X'$ matrix and our $\theta$ vector.
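A minimal NumPy sketch of this vectorized hypothesis, using the same made-up parameter values and engine powers as before (the variable names are illustrative):

```python
import numpy as np

theta = np.array([2.0, 3.0])   # [theta_0, theta_1], illustrative values
x = np.array([1.0, 2.0, 4.0])  # made-up engine powers

# Build X' by prepending a column of 1's to x.
X_prime = np.column_stack((np.ones(x.shape[0]), x))

# The whole vector of predictions in a single matrix multiplication.
y_hat = X_prime.dot(theta)

print(y_hat)  # [ 5.  8. 14.]
```

The column of 1’s is what lets the intercept $\theta_0$ ride along in the multiplication: each row of $X'$ dotted with $\theta$ gives $1 \cdot \theta_0 + x^{(i)} \cdot \theta_1$.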
In further Interludes, we will use the following convention:
- Capital letters represent matrices (e.g.: $X$)
- Lower case letters represent vectors and scalars (e.g.: $x^{(i)}$, $y$)
How good is our model? It is hard to say just by looking at the plot. We can clearly observe that certain regression lines seem to fit the data better than others, but it would be convenient to find a way to measure it.
To evaluate our model, we are going to use a metric called a loss function (sometimes called a cost function). The loss function tells us how bad our model is: how much it costs us to use it, how much information we lose when we use it. If the model is good, we won’t lose that much; if it’s terrible, we have a high loss!
The metric you choose will deeply impact the evaluation (and therefore also the training) of your model.
A frequent way to evaluate the performance of a regression model is to measure the distance between each predicted value ($\hat{y}^{(i)}$) and the corresponding actual value ($y^{(i)}$), square it, and average the squared distances over the whole dataset:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
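As a sketch, this loss can be computed example by example like so, assuming a $\frac{1}{2m}$ scaling factor (a common convention: the extra $\frac{1}{2}$ simplifies derivative calculations later). The predictions and prices below are made-up numbers:

```python
import numpy as np

y_hat = np.array([5.0, 8.0, 14.0])  # predictions (illustrative)
y = np.array([6.0, 7.0, 12.0])      # actual prices (made-up data)
m = y.shape[0]

# Squared distance between each prediction and its true value,
# summed over the examples and scaled by 1 / (2m).
loss = sum((y_hat[i] - y[i]) ** 2 for i in range(m)) / (2 * m)

print(loss)  # 1.0
```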
In the last exercise, we implemented the loss function in two subfunctions. It worked, but it’s not very pretty. What if we could do it all in one step, with linear algebra?
As we did with the hypothesis, we can use a vectorized equation to improve the calculations of the loss function.
So now let’s look at how squaring and averaging can be performed (more or less) in a single matrix multiplication! If we treat $\hat{y}$ and $y$ as vectors, we can write:

$$J(\theta) = \frac{1}{2m}\,(\hat{y} - y)\cdot(\hat{y} - y)$$
Now, if we apply the definition of the dot product:

$$(\hat{y} - y)\cdot(\hat{y} - y) = \sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

which is exactly the sum of squared distances between predictions and actual values.
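The same value can be obtained with a single dot product; here is a minimal sketch with the same made-up predictions and prices as above:

```python
import numpy as np

y_hat = np.array([5.0, 8.0, 14.0])  # predictions (illustrative)
y = np.array([6.0, 7.0, 12.0])      # actual prices (made-up data)
m = y.shape[0]

diff = y_hat - y
# (y_hat - y) . (y_hat - y) is exactly the sum of squared distances.
loss = diff.dot(diff) / (2 * m)

print(loss)  # 1.0
```

Both versions return the same number; the vectorized one simply lets the linear algebra library do the looping for you.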