Learning from Data

Yaser Abu-Mostafa

assets/screenshot_2017-10-31_21-30-29.png

Topics:

 1. The Learning Problem
 2. Is Learning Feasible?
 3. The Linear Model I
 4. Error and Noise
 5. Training versus Testing
 6. Theory of Generalization
 7. The VC Dimension
 8. Bias-Variance Tradeoff
 9. The Linear Model II
10. Neural Networks
11. Overfitting
12. Regularization
13. Validation
14. Support Vector Machines
15. Kernel Methods
16. Radial Basis Functions
17. Three Learning Principles
18. Epilogue

The storyline:

  • What is learning? - 1-4
  • Can we learn? - 5-8
  • How to do it? - 9-10
  • How to do it well? - 11-16
  • The philosophy - 17-18

Chapter 1: The Learning Problem

Example of machine learning

Predicting how a viewer will rate a movie

Netflix offered $1 million for a 10% improvement in its rating predictions

3 conditions for ML to be applicable:

  • a pattern exists
    • if no pattern, ML cannot help
  • we cannot pin it down mathematically
  • we have data on it

We can describe each viewer with a vector of features. We can also create a similar vector for the movie. When we take their inner product, we get a measure of how much the user is likely to like the movie.

This approach is a problem because you have to get those vectors - watch the movie, interview the user etc

ML will reverse engineer the process. It starts with the ratings and then comes up with vectors for the movies and users (starting from arbitrary initial vectors for both)

Components of Learning

Metaphor: credit approval. The banks have historical data about customers - age, gender, annual salary, etc

Formalization:

  • Input: Χ
    • customer vector
  • Output: y
    • to extend or deny credit
  • Target function: ƒ
    • ƒ: Χ → y
    • this function has a binary co-domain (y can only be 1 or 0)
    • ƒ is the target function, that is what we have to find
  • Data: (x1, y1), (x2, y2), …, (xN, yN)
    • this is the historical data

We use the “Data” to get an approximation of the “target function” called “hypothesis”

  • Hypothesis: g: Χ → y

So, we use the training examples to find the hypothesis g which approximates the target function, using the learning algorithm to select g from the hypothesis set Η. The hypothesis set can be discrete or continuous, finite or infinite. In general, the hypothesis set is continuous and infinite (very infinite!) - but we will still be able to learn.

With theory, we will be able to put a number on the sophistication of the hypothesis set.

assets/screenshot_2017-11-01_00-04-14.png

The red components are what we can choose

Why do we need the hypothesis set? Why not let the learning algorithm select from universal set?

  • this will help us formalize learning, we will see this later

So, the components of learning are:

  • The hypothesis set:
    • Η = {h}, g ∈ H
  • The learning algorithm
  • Together, they form the learning model

A simple model - the perceptron

Input is: x → {x1, …, xd}

Approve credit if:

Σ wixi > threshold, i → 1 to d

else deny

This linear formula h ∈ H can be written as:

h(x) = sign ( Σ wixi - threshold ), i → 1 to d

If h is +ve, we approve credit else we deny credit

We see that h is a function of w and threshold

If the data is linearly separable, the perceptron learning algorithm can find a separating line starting from any initial line

We can rename -threshold to w0 and call it the bias - we add an artificial coordinate x0 whose value is always 1. Note: w0 is not 1; it is a free parameter that we learn, while x0 is always 1. Now, the formula:

h(x) = sign ( Σ wixi ), i → 0 to d

Vector form:

h(x) = sign (wTx)

w is the column vector [w0, …, wd]T and x is the column vector [x0, …, xd]T

The perceptron learning algorithm takes a misclassified point and updates the weights such that they behave better on that point i.e. w ← w + ynxn

Consider these 2 cases of a point being misclassified. In the first case, the inner product is negative but y is positive. So, the update rule moves w towards x (it adds x to w), their inner product becomes positive, and the prediction is now positive.

In the second case, the inner product is positive but y is negative. So, the update rule moves w away from x (it adds -x to w) and now the inner product is negative as well.

assets/screenshot_2017-11-01_00-18-52.png

The problem with this approach is that fixing the classification of one point may break points that were classified correctly before. Still, if you keep repeating this process and the data is linearly separable, you are guaranteed to eventually classify all points correctly.

(If it is not linearly separable, you can map it to a space where they are linearly separable)
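A minimal sketch of the perceptron learning algorithm in Python (my own toy setup, not from the course; the prepended column of ones is the artificial coordinate x0):

#+BEGIN_SRC python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron learning algorithm. X is N x d raw features, y holds +/-1 labels."""
    X = np.column_stack([np.ones(len(X)), X])   # add the artificial coordinate x0 = 1
    w = np.zeros(X.shape[1])                    # start from any line, e.g. all-zero weights
    for _ in range(max_iters):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:             # every point classified correctly: done
            break
        n = np.random.choice(misclassified)     # pick any misclassified point
        w = w + y[n] * X[n]                     # the update rule: w <- w + y_n * x_n
    return w

# hypothetical toy data: two linearly separable clouds in 2D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[+2, +2], size=(20, 2)),
               rng.normal(loc=[-2, -2], size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
print(pla(X, y))
#+END_SRC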

Types of learning

The basic premise of learning

“using a set of observations to uncover an underlying process”

  • Supervised Learning
  • Unsupervised learning
    • with only the inputs (no labels), we can still find a higher-level representation of the data
  • Reinforcement learning

Puzzle

Consider this problem: you are given 6 training examples, each labeled with the correct output. For the new one, what is the label?

assets/screenshot_2017-11-01_00-48-36.png

It is impossible to predict correctly - this is because the target function can be anything!

  • It can be -1
    • if you take the function to be -1 if top left square is black
  • It can be +1
    • if you take the function to be +1 if pattern is symmetric

We have this problem in machine learning also. But this does not mean that learning is impossible, as will be shown in the next lecture

Chapter 2 - Is learning feasible?

Review

  • Learning is used when
    • there is a pattern, we cannot write mathematical formula for it, we have data
  • Notation:
    • we don’t know the target function y = ƒ(x)
    • we have the data set: (x1, y1), (x2, y2), …, (xN, yN)
    • we have a learning algorithm that finds g from the hypothesis set such that g ≈ ƒ
  • we were stuck at the puzzle, where the random function could be anything, how are we to learn?

This lecture will address this question

Probability to the rescue

Consider an experiment: A bin has red and green marbles.

assets/screenshot_2017-11-01_01-20-55.png

We need to find the probability of picking a red marble, let’s call it μ

P[red] = μ

So, P[green] = 1 - μ

We pick N marbles independently (recall N is for # of data points) so, sample: GRGGGGRRGG

Let the fraction of R in sample: ν

Does ν (sample frequency) say anything about μ (the actual frequency in bin)? The short answer is not on the face of it, but it does give us some bounds.

ν is likely to be close to μ

What does ν say about μ

In a big sample, large N, ν is close to μ (within ε)

Formally:

P[|ν - μ| > ε] ≤ 2e^(-2ε²N)

This is Hoeffding's inequality. It is the starting point that will eventually lead us to the VC dimension.

Hoeffding's inequality in words:

  • "μ = ν" is probably approximately correct (PAC)

The bound decays exponentially in N, but tightening the tolerance is costly: the N required grows as 1/ε².

This is the inequality you are looking for when you want to see how closely your sample represents the actual truth. That is, say, you are checking for nulls in a database attribute, how many samples should you take to be 90% confident of your estimation

This formula is valid for every N and ε > 0, and doesn't depend on μ. However, there is a tradeoff involved: smaller ε requires larger N.

Also note: ν ≈ μ ⇒ μ ≈ ν. Although it is μ that determines ν, the symmetric statement lets us use the observed ν to infer the unknown μ.
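For instance, the database-sampling question above can be answered directly from the inequality. A small sketch (the 5% / 90% numbers are just an illustration):

#+BEGIN_SRC python
import math

def hoeffding_bound(eps, N):
    """Upper bound on P[|nu - mu| > eps] for a sample of size N."""
    return 2 * math.exp(-2 * eps**2 * N)

def samples_needed(eps, delta):
    """Smallest N with 2*exp(-2*eps^2*N) <= delta,
    i.e. |nu - mu| <= eps with confidence at least 1 - delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

# e.g. estimate the fraction of nulls in a column to within 5%, with 90% confidence
print(samples_needed(eps=0.05, delta=0.10))   # 600 samples are enough
print(hoeffding_bound(eps=0.05, N=600))       # ~0.0996 <= 0.10
#+END_SRC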

Connection to learning

How is the bin related to learning?

  • Bin:
    • the unknown is a number μ
  • Learning:
    • the unknown is a function ƒ: X → y

We can think of the bin as the input space. Each marble is a point x ∈ X. So, all the possible credit applicants are represented in the bin

assets/screenshot_2017-11-01_01-42-13.png

Now, let's say we have a hypothesis. For all the applicants that our hypothesis gets right, we mark a green marble in the bin; where it is wrong, a red one. What we want is for the sample frequency of errors (ν) to allow us to say something about the error rate over the entire input space (μ)

Reiterating:

  • green if
    • h(x) = f(x)
  • red if:
    • h(x) ≠ f(x)

assets/screenshot_2017-11-01_01-46-25.png

Here, we don't have all the points in the input space to check; we have only a sample of data points. The sample comes from a probability distribution P on X that makes some points more likely than others - this distribution generates the dataset. We are not restricting P on X; we don't even need to know what it is, because Hoeffding's inequality does not depend on the probability distribution (or on μ)

We aren’t done yet, what we have discussed is, for this h, ν ≈ μ within some ε

However, this is just verification of h, not learning. Currently we chose some h and verified that it makes sense. We don’t want to pick the h ourselves, we want the algorithm to do it for us - we need to choose h from H

This is simple: we already have a probability distribution that gives us some data points from the input space (bin). We can test each h on these points, choose the h that gets the most of them right, and invoke Hoeffding's to get our bound.

Notation:

  • both μ and ν depend on h
  • ν is “in sample”, called Ein
  • μ is “out of sample”, called Eout
  • Ein and Eout are actually Ein(h) and Eout(h) (they are functions of h) because they depend on h

Hoeffding’s becomes

P[|Ein(h) - Eout(h)| > ε] ≤ 2e^(-2ε²N)

A dilemma and a solution

We still have a problem!

We cannot just have multiple h and apply Hoeffding’s to them. Why? Consider this:

Take a coin and flip it 10 times. If we get 10 heads and we choose the h which always predicts heads, ν is 1, but that doesn't mean μ is also 1

The probability of 10 heads in 10 tosses is: 1/2^10

If we toss 1000 fair coins 10 times each, the probability that some coin will get 10 heads is:

P[some coin gets 10 heads] = 1 - P[no coin gets 10 heads]
= 1 - P[a particular coin doesn't get 10 heads]^1000
= 1 - (1 - P[a particular coin gets 10 heads])^1000
= 1 - (1 - 1/2^10)^1000
≈ 0.63

So, it is more likely than not that some coin will get 10 heads. Those 10 heads are no indication at all of the real probability of getting heads for that coin. Similarly, we cannot pick an h that happens to be right on every sample point and declare, via Hoeffding's, that we have a perfect system. Hoeffding's applies to each hypothesis individually: for each one there is a small probability (say half a percent) that its sample is misleading, but each hypothesis can be misleading about a different aspect. If these "off" probabilities are disjoint, they add up across hypotheses and we end up with a bad (loose) guarantee.
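A quick simulation of the 1000-coins argument (my own sketch):

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(1)
runs = 1000
hits = 0
for _ in range(runs):
    flips = rng.integers(0, 2, size=(1000, 10))    # 1000 fair coins, 10 flips each
    if (flips.sum(axis=1) == 10).any():            # did some coin come up heads every time?
        hits += 1
print(hits / runs)   # close to 1 - (1 - 1/2**10)**1000, i.e. roughly the 0.63 computed above
#+END_SRC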

We need to find a way to deal with multiple h/bins.

An easy solution:

  • recall we had: P[|Ein(h) - Eout(h)| > ε] ≤ 2e^(-2ε²N)
  • we wanted to make a statement about Eout based on Ein

What we want now is:

  • P[|Ein(g) - Eout(g)| > ε] ≤ ??
    • so, we want to choose the best hypothesis g and want a bound for that choice
    • by plain logic (not using Hoeffding’s or anything), since g ∈ H, we have:
    • P[|Ein(g) - Eout(g)| > ε] ≤ P[ |Ein(h1) - Eout(h1)| > ε or |Ein(h2) - Eout(h2)| > ε or … or |Ein(hM) - Eout(hM)| > ε ]
      • where M is the number of h ∈ H or the cardinality of H
      • this is valid because g ∈ H, g is one of the h-s
    • This is the union bound, which assumes no overlap, this is the worst case, it cannot get worse than this
    • using Hoeffding’s we get:
    • P[|Ein(g) - Eout(g)| > ε] ≤ Σ P[|Ein(hm) - Eout(hm)| > ε], m → 1 to M
      • P[|Ein(g) - Eout(g)| > ε] ≤ Σ 2e^(-2ε²N), m → 1 to M

assets/screenshot_2017-11-01_02-38-03.png

But the problem is:

  • P[|Ein(g) - Eout(g)| > ε] ≤ 2Me^(-2ε²N)

And M is the cardinality of H, which is generally ∞, so the bound becomes meaningless (P[something] ≤ ∞). So, the more sophisticated the model you use, the more loosely Ein tracks Eout

Chapter 3 - The Linear Model I

Review

  • We ended with a problem, the loose tracking of Eout(g) by Ein(g)
  • Since g has to be one of h1, …, hM we conclude that: (union bound)
    • if | Ein(g) - Eout(g) | > ε then:
      • | Ein(h1) - Eout(h1) | > ε or
      • … or
      • | Ein(hM) - Eout(hM) | > ε
    • This gives us an added M factor
    • The union bound is exact when the events are disjoint. E.g. on a die roll, P[we get a 5 or we get a 4] = P[5] + P[4]

We need a tighter bound on the tracking of Eout(g) by Ein(g). The union bound assumes the worst case of no overlap between the events. It is reasonable when the bad events are essentially disjoint, like in the coin flipping scenario, but we get a smaller number when they overlap substantially.

We have established the principle that, through learning, you can generalize. We will later establish that even when the cardinality of H is infinite, we can still generalize - the theory of generalization.

The Input representation

  • We’ll work with the MNIST like dataset. 16x16 grey level pixels.

assets/screenshot_2017-11-01_16-53-12.png

This is the raw input. 16x16 - 256 pixels

  • ‘raw’ input x = (x0, x1, …, x256)

Recall, we added the mandatory x0 to make the formula better

Our parameters: (w0, w1, …, w256) - a 257-dimensional parameter space (256 pixels plus the constant coordinate)

We can reduce the parameters, i.e. we can extract features:

  • intensity and symmetry x = (x0, x1, x2)
    • we have discarded a lot of irrelevant information, though we may also have lost some relevant information

Plotting just 5 and 1:

assets/screenshot_2017-11-01_16-57-47.png

Linear Classification

  • we’ll generalize perceptron to linearly non separable case
  • What does PLA do?
    • at each iteration, it takes one misclassified point and corrects its classification
    • the data is not linearly separable, so it will not terminate

assets/screenshot_2017-11-01_16-59-47.png

  • we see that the error jumps around a lot
  • also note that Eout is being tracked nicely by Ein

Final perceptron boundary (after stopping at 1000 iterations):

assets/screenshot_2017-11-01_17-01-54.png

  • we can make a modification - “Pocket algorithm”
    • Just make a few iterations and keep the h that had the lowest Ein so far.
    • The errors now look like this:

assets/screenshot_2017-11-01_17-03-41.png

We get this boundary:

assets/screenshot_2017-11-01_17-04-51.png
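A minimal sketch of the pocket idea in Python (my own sketch, not the course's code; X is assumed to already include the x0 = 1 column):

#+BEGIN_SRC python
import numpy as np

def pocket(X, y, max_iters=1000):
    """PLA that 'pockets' the best weight vector seen so far (lowest Ein)."""
    w = np.zeros(X.shape[1])
    best_w = w.copy()
    best_err = np.mean(np.sign(X @ w) != y)
    for _ in range(max_iters):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:
            return w                              # separable after all: done
        n = np.random.choice(misclassified)
        w = w + y[n] * X[n]                       # ordinary PLA update
        err = np.mean(np.sign(X @ w) != y)        # Ein of the updated weights
        if err < best_err:                        # keep them only if they are the best so far
            best_w, best_err = w.copy(), err
    return best_w
#+END_SRC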

Linear Regression

  • we’ll generalize the target function from being a binary function (a classifier) to a real valued function

Let’s discuss the credit problem again

Linear regression input:

  • let’s say we have d input features:
    • annual salary, years in residence, years in job, current debt etc

Linear regression output: h(x) = Σ wixi = wTx

  • summation over i → 0 to d

Linear regression dataset:

  • (x1, y1), …, (xN, yN)
  • yn ∈ R, is the credit line for customer xn

Error:

  • we can use the squared error: (h(x) - f(x))²
  • this squared error has good analytical properties - helps with differentiation etc

assets/screenshot_2017-11-01_17-50-48.png

When we plot linear regression, we never plot x0 because that is always 1.

The “line” or “hyperplane” is 1 dimension short of the space you are working in. E.g., with the mandatory x0, one input feature, and the output, the plot is 2-dimensional and the hypothesis is a 1D line

assets/screenshot_2017-11-01_17-54-53.png

The red is the in sample error Ein

We had:

assets/screenshot_2017-11-01_17-50-48.png

Which can be re-written in matrix form as:

assets/screenshot_2017-11-01_17-57-32.png

Minimizing Ein:

We have only w as the variable. We find the minimum of Ein(w) by taking its gradient and setting it equal to 0 (a vector of zeros)

assets/screenshot_2017-11-01_18-00-54.png

X-dagger is the “pseudo-inverse”: X is not a square matrix (it is a very tall matrix), but X-dagger still acts as its inverse, because if you multiply X-dagger with X, you get the identity matrix.

*actually d+1 and not d, because we also have constant x0 ⬇️

X is Nxd, so XT is dxN. Thus XTX is (dxN)(Nxd) = dxd. Inverting a dxd matrix is easy (the number of features is generally small), so we have: (XTX)-1XT = (dxd)(dxN) = dxN → X-dagger. Then X-dagger · y = (dxN)(Nx1) = dx1 → our weights, one per feature (plus w0)

y is the target vector and X is the input data matrix

This is “one step learning”
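A minimal sketch of the one-step learning in Python (hypothetical toy data; in practice np.linalg.pinv or np.linalg.lstsq is the numerically safer way to form the pseudo-inverse):

#+BEGIN_SRC python
import numpy as np

def linear_regression(X, y):
    """One-step learning: w = X-dagger y with X-dagger = (X^T X)^-1 X^T."""
    X = np.column_stack([np.ones(len(X)), X])    # add the constant coordinate x0 = 1
    X_dagger = np.linalg.inv(X.T @ X) @ X.T      # the pseudo-inverse, shape (d+1) x N
    return X_dagger @ y                          # the (d+1)-dimensional weight vector

# hypothetical toy data: y ~ 3 + 2*x + small noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 3 + 2 * x[:, 0] + rng.normal(scale=0.1, size=100)
print(linear_regression(x, y))   # roughly [3, 2]
#+END_SRC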

You can use linear regression for classification, it doesn’t matter if you set y to be ±1, the algo will still learn. You just use sign(wTx) as the output - similar to using 0 as the threshold

assets/screenshot_2017-11-01_18-16-38.png

Linear regression applied like so 🔝 is called linear classification. There is a problem here: 🔝

The value for the red region is highly negative for the lower points. But their target value is just -1. So, the linear regression reports high error due to this and the boundary is pushed towards the center

So, one can use linear regression to give a jumpstart to the perceptron - good initial weights to start from.

Nonlinear transformation

  • Sometimes we need to transform data to make them linearly separable
  • Sometimes we want non-linear features; e.g. in the credit example, we don't want the years spent at one residence to enter linearly, we want something like: 0 for x < 1 year and 1 for x > 5 years, etc
  • Non linear transformations remain within the realm of linear models.

This is because the variables of the function are the weights, not the features. So, we can do anything with the features, we are still in the linear realm.

So, we can do this transformation:

(x1, x2) → (x1², x2²)

assets/screenshot_2017-11-01_18-27-34.png

We can now use perceptron in the transformed space. We cannot use arbitrarily complex transformation, there is a catch about which we’ll see later
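A minimal sketch of this idea in Python (hypothetical circular data; the perceptron is the same PLA as in Chapter 1):

#+BEGIN_SRC python
import numpy as np

def pla(X, y, max_iters=10000):
    """Same perceptron learning algorithm as before (x0 column added inside)."""
    X = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        bad = np.where(np.sign(X @ w) != y)[0]
        if len(bad) == 0:
            break
        n = np.random.choice(bad)
        w = w + y[n] * X[n]
    return w

def transform(X):
    """(x1, x2) -> (x1^2, x2^2); the model stays linear in the weights."""
    return X**2

# hypothetical data: +1 inside the unit circle, -1 outside (not linearly separable in X space)
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1, 1.0, -1.0)

w = pla(transform(X), y)   # in Z space the boundary z1 + z2 = 1 is a straight line
print(w)
# the final hypothesis lives back in X space: g(x) = sign(w0 + w1*x1^2 + w2*x2^2)
#+END_SRC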

Chapter 4 - Error and Noise

Review

  • Linear models use the “signal” wTx (which is vector form for Σ wixi)
  • One step learning: w = (XTX)-1XT · y
  • Non linear transformation:
    • wTx is linear in w
    • any transformation x → z preserves this linearity and so can be used

Nonlinear transformation (continued)

We had Φ transformation last time:

assets/screenshot_2017-11-01_18-36-09.png

Note, Φ is a non-linear transformation of the input space; that is to say, straight and parallel lines drawn in the input space won't remain straight and parallel in the transformed space. Each point in the input space maps to exactly one point in the transformed space, but not every point in the transformed space need correspond to a point in the input space (the image of Φ may not cover all of Z)

Note, we can transform from a d-dimensional space to a d′-dimensional space; there is no limit really

We learn the weights in the transformed space, the targets remain in the original space.

The learned hypothesis is still g,

assets/screenshot_2017-11-01_19-00-41.png

Note, Φ(x) gives us z

Error measures

They try to answer this:

  • what does it mean for “h ≈ ƒ”?
  • Error measure: E(h, ƒ)
  • If error is 0, h perfectly represents ƒ and we are golden.
  • It is almost always a “pointwise definition”:
    • e(h(x), ƒ(x))
  • example: squared error: e(h(x), ƒ(x)) = (h(x) - ƒ(x))²
  • another example: e(h(x), ƒ(x)) = || h(x) ≠ ƒ(x) ||
    • || h(x) ≠ ƒ(x) || returns 1 if h ≠ f else 0

How do we go from the pointwise to overall? We take pointwise and average it to get to overall

  • Thus, in-sample error:

assets/screenshot_2017-11-01_19-11-39.png

  • Also, out of sample error:
    • this is by logic, the definition of Eout

assets/screenshot_2017-11-01_19-12-47.png

It is the expectation of pointwise error e on data points x

So, the learning diagram becomes:

How to choose error measure

It depends on the system.

  • For some systems, false positive would be expensive. For other systems, false negatives would be expensive

For classification, we can create a table:

For supermarket:

assets/screenshot_2017-11-01_19-18-56.png

A false negative is expensive: if you were given the discount (you are a legitimate customer) and the system says you can't use it, it is horrible and you lose the customer. A false positive is okay: someone who wasn't entitled to the discount tricks the system into giving it - only a small loss

For CIA, false positives are disastrous,

assets/screenshot_2017-11-01_19-21-20.png

Sometimes we don’t have the ideal error measure, so we use alternative error measures

  • if the noise is gaussian, we can use squared error
  • in binary, we can use cross entropy error measure (this is what we used in logistic regression, the Bernoulli thingy)

Sometimes, we use friendly measures:

  • squared error for linear regression is friendly because it has a closed-form solution. The cross-entropy error measure in classification turns out to be convex, so we can find the global minimum, etc.

Noisy targets

Very important: the target function is not always a “function” in the true mathematical sense, it is noisy. This is because the inputs we observe cannot capture ALL the information that plays a part in determining the outcome

So, we use target distribution:

  • Instead of y = ƒ(x) we get:
    • P(y|x)

Earlier, x was generated by a probability distribution (say P(x)) and y was deterministic given x. But now y is non-deterministic and is generated by a probability distribution too. So, (x, y) is generated by a joint distribution:

  • P(x, y) = P(x)P(y|x)
  • we can model noisy target as:
    • Noisy target = deterministic target “ƒ(x) = E(y|x)” + noise “y - f(x)”

There is no loss of generality in assuming a probability distribution P(y|x) over y, because even if the target is indeed a deterministic function, we can model it as:

P(y|x) = 0 everywhere except for y = f(x)

So, discrete case:

  • Probability is 1 for y=f(x), 0 elsewhere

So, continuous case:

  • delta function at y=f(x), 0 elsewhere

Noisy target function?

  • The target function is noisy because we aren't able to model it completely. We are trying to model it with limited parameters (the features), not taking into account all the factors that it depends on. Hence, the target function that we are trying to learn looks noisy. The true, underlying process generating the data is not noisy; our limited-feature view of it is.
  • Or is it inherently noisy? The function that generated the data in the input space, that true function ƒ is noisy itself?

Our updated learning diagram:

assets/screenshot_2017-11-01_19-49-22.png

Now, the target function is replaced by a distribution. And we also have an error measure lying between learning algorithm and final hypothesis. It is the prize we pay for not being perfect in our approximation of the target function.

Note:

  • P(x)
    • This is the probability of generating the dataset that we train on
    • We introduced this to accommodate Hoeffding’s
    • it quantifies the relative importance of x - it is not what we are interested in
    • this also dictates what region of the input space you get samples from. The probability distribution could be such that it never gives you points from particular regions of input space. Currently, our model will be agnostic of that, but in Bayesian learning, the confidence of the model will be lower in those regions where it hasn’t seen examples from
    • also, once you have P(x), you draw samples from it independently, i.e. i.i.d - independent identically distributed
    • P(x) is not a problem if it has a long tail or if it is a delta function or whatever; as long as it is used in both training and testing.
  • P(y|x)
    • This is the probability the target function gives y for an input x
    • This is because the target function is inherently noisy
      • this is what we are trying to learn

Both these PDs together generate our dataset; P(x, y) So, we must always remember that P(x, y) is actually the mixture of concepts, of 2 PDs which are inherently different

Preamble to the theory

What we know so far:

  • we know that learning is feasible
    • Eout(g) ≈ Ein(g)
  • However, this is not really learning. It is just validation that our “selected” hypothesis gives us a measure of how it will perform out of sample. This is just good generalization
  • This guarantee that Ein is a good proxy for Eout is important for us to learn. It is a building block for learning.
  • Real learning is actually:
    • g ≈ f
      • i.e. Eout(g) ≈ 0
      • this measure how far are you from the target function

The 2 questions of learning

  • We need Eout(g) ≈ 0
  • This is achieved thru:
    • Eout(g) ≈ Ein(g) and Ein(g) ≈ 0
    • It's like saying: I can do well on the sample dataset, and I know this means I will do well outside the sample too

assets/screenshot_2017-11-01_20-17-38.png

We know that the answer to 1 is “Yes”. But we are left with the M factor on the right hand side that we have to deal with. The answer to 2 is what ML algorithms are supposed to do! They give us a good fit on the sample data.

The theory will achieve this for us:

  • characterize feasibility of learning for infinite M
    • we will have a single parameter that tells us the sophistication of the hypothesis set
  • characterize the tradeoff b/w model complexity and how well our Ein(g) tracks Eout(g)

assets/screenshot_2017-11-01_20-24-10.png

Chapter 5 - Training versus Testing

Review

assets/screenshot_2017-11-02_01-08-23.png

Ein is the pointwise error averaged over the N sample points. Eout has a logical definition: the expected pointwise error over the whole input space (each point weighted by its probability).

Also, last time we were confused about the noisiness of the target function. The point is that the target function is inherently noisy; it is not a function in the true mathematical sense. It has a probability distribution over the outcome y given the input, so: y = ƒ(x) + noise, or: y ~ P(y|x)

Earlier, when we considered y to be deterministic, we generated the sample by P(x) But now, y also comes from the PD, so we have the dataset generated from 2 PDs P(x, y) = P(x)P(y|x)

Eout(h) is now Ex,y[e(h(x), y)]

From training to testing

Say you have a final exam. You get some practice problems - this is training for the final exam. If you were handed the final exam questions themselves to practice on, doing well on them would not mean you had learned. The end goal is low Eout, which only happens if you actually study and learn the material.

We have:

  • Testing
    • How well you do in the final exam (Ein) tracks how well you do in the wild (Eout)

assets/screenshot_2017-11-02_01-22-00.png

  • Training
    • How well you do in the practice does not track very well how you do in the wild
    • There is a factor of M that comes in here
    • M represents “how much you explore”, how many cases are possible

assets/screenshot_2017-11-02_01-23-13.png

We need to replace M with a friendlier quantity, if we are able to do it, we are in business.

Recall, M is cardinality of hypothesis set. Our learning algorithm is free to choose any hypothesis it wants from the set and so to invoke Hoeffding’s inequality, we had to add Ps of all bad events i.e. we had:

P[B1 or … or BM] where B is the bad event: | Ein(hm) - Eout(hm) | > ε

By the union bound: P[B1 or … or BM] ≤ P[B1] + … + P[BM]

We treated the bad events as if they were disjoint, but in reality they are related and overlap a lot

assets/screenshot_2017-11-02_01-34-00.png

We could account for these overlaps exactly for, e.g., the perceptron, but it would be a nightmare. We want to extract from a given hypothesis set H a number that characterizes this overlap and gives us a good bound

Illustrative examples to show overlap

  • Consider a perceptron

assets/screenshot_2017-11-02_01-37-12.png

assets/screenshot_2017-11-02_01-38-00.png

Also, we have Ein,

assets/screenshot_2017-11-02_01-38-35.png

Here, we apply Hoeffding’s inequality and we know that Ein tracks Eout etc Now, consider another perceptron, slightly different:

assets/screenshot_2017-11-02_01-40-09.png

The green line is another perceptron. In both the cases, Eout will be slightly different. Also, Ein will be different if there is any point that falls in the yellow region.

So, | Ein(h1) - Eout(h1) | ≈ | Ein(h2) - Eout(h2) |: if one exceeds ε, the other (almost certainly) does as well. But the union bound ignores this strong correlation and treats the events as if they were unrelated. So, just counting the number of hypotheses does not give us a tight bound. We can improve it.

Instead of the whole input space, we can consider only a finite set of input points, and we distinguish between perceptrons only if they classify those input points differently. So, given a set of red and blue points, we can count all possible classifications of them. This is a good proxy for the complexity of the hypothesis set: if the hypothesis set is powerful, it can classify the points in all possible ways; if it is not so powerful, it may not be able to achieve some classifications. The classifications of the points that the hypothesis set can produce are called dichotomies.

Dichotomies is a proxy for the number of hypothesis. It is based only on the input points and not on the general input space. They are “mini hypotheses”

A hypothesis is a function h: X → {-1, +1} which takes in the full input space and gives the classification

A dichotomy is also a function h: {x1, x2, …, xN} → {-1, +1} which takes in only the input points and gives the classification (each x, say x1 is a vector of all input features for first data point)

Number of hypotheses: |H| can be ∞. Number of dichotomies: |H(x1, x2, …, xN)| ≤ 2^N, with equality if H is maximally expressive

This is a candidate for replacing M. We can also define “m”, the growth function which gives us the most dichotomies on N points (given a hypothesis set)

assets/screenshot_2017-11-02_01-59-07.png

The subscript H is because the growth function is defined for a given hypothesis set.

We also know that mH(N) ≤ 2^N

Growth function for the perceptron:

  • N = 3, i.e. mH(3) = 8
  • recall this is the maximum dichotomies possible
  • N = 4, i.e. mH(4) = 14
  • this is because we cannot get this combination with our perceptron:

assets/screenshot_2017-11-02_02-03-29.png

Growth functions for simple cases of H

Example 1: positive rays

  • It is defined on the real line
  • it has a point, everything on right is +1, on left is -1

assets/screenshot_2017-11-02_02-06-36.png

H is the set of h: R → {-1, +1}. For N points, we get N + 1 dichotomies, so mH(N) = N + 1 (all -1, all +1, and the threshold in any of the N - 1 positions in between)

Example 2: Positive intervals

  • here, we have an interval
  • everything within is +1, everything else is -1

assets/screenshot_2017-11-02_02-07-58.png

H is the set of h: R → {-1, +1}. We get a different dichotomy by choosing 2 of the N + 1 slots (between and around the N points) as the end points of the interval:

mH(N) = (N+1)C2 We need to add 1 for the case where both end points fall in the same slot (the interval contains no points, the "blue" region is the empty set)

So, mH(N) = (N+1)C2 + 1 or, mH(N) = N²/2 + N/2 + 1

More powerful than the last one!
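A quick brute-force check of this count (my own sketch; it enumerates the possible end-point slots directly):

#+BEGIN_SRC python
from itertools import combinations_with_replacement

def interval_dichotomies(N):
    """Count distinct dichotomies of N points on a line achievable by positive intervals
    (+1 inside the interval, -1 outside), trying every pair of end-point slots."""
    seen = set()
    # an end point can sit in any of the N+1 slots between/around the N points
    for lo, hi in combinations_with_replacement(range(N + 1), 2):
        d = tuple(+1 if lo <= i < hi else -1 for i in range(N))  # points inside [lo, hi)
        seen.add(d)
    return len(seen)

for N in range(1, 8):
    print(N, interval_dichotomies(N), N * (N + 1) // 2 + 1)   # matches N^2/2 + N/2 + 1
#+END_SRC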

Example 3: convex sets

  • we define a convex region as our +1 area
  • a convex region is the region where a line segment connecting 2 points on the region lie entirely inside the region H is the set of h: R → {-1, +1} h(x) = +1 is convex

What is the growth function? We can put our N points on the perimeter of a circle, and thus we can get ANY of the 2^N possible classifications, e.g.:

assets/screenshot_2017-11-02_02-16-09.png

So, the growth function is mH(N) = 2^N. Since the hypothesis set achieves all 2^N dichotomies, we say that H shatters the N points

We have 3 growth functions:

assets/screenshot_2017-11-02_02-19-01.png

Note that the growth function is not a perfectly tight measure of complexity either: convex sets are a complex hypothesis set, but not the most complex imaginable, yet they already have the maximal 2^N growth function

So, talking about replacing M with mH(N)… We had:

assets/screenshot_2017-11-02_02-22-39.png

We can replace M with mH(N) if mH(N) is a polynomial. This is because then we can increase N and get the RHS to be very small, so small that the bound makes sense. The only criterion is that mH(N) should be a polynomial, so that the decaying exponential can defeat it (it cannot defeat M right now, since M is infinite)

So, once we are able to prove that our hypothesis set’s growth is polynomial, we will be able to learn using that hypothesis set. We may need a lot of data points, but we will be able to generalize to the entire input space given the finite (albeit large) input points

Key notion: Break point

Defined as a k for which the growth function falls short of 2^k: mH(k) < 2^k. “If no data set of size k can be shattered by H, we call k a break point for H”

  • For perceptrons it is 4.

So, break point for

  • positive rays:
    • k = 2
    • For 2 points, we cannot get this: “<blue> <red>” (+1 on the left, -1 on the right)
    • we can get the other three: “<red> <red>”, “<blue> <blue>”, “<red> <blue>” - so mH(2) = 3 < 2²
  • positive intervals
    • using the definition mH(k) < 2^k we have: k = 3
    • we cannot get: <blue> <red> <blue>
  • convex sets
    • never fails, so, k = ∞

So, we have this result:

  • No break point ⇒ mH(N) = 2^N
  • Any break point ⇒ mH(N) is polynomial in N

So, to be able to learn with a given hypothesis set, we just need to prove that it has a break point

Example: Given 3 points, and given that the break point is 2, we can have only 4 dichotomies

assets/screenshot_2017-11-02_02-40-43.png

Adding any more (out of the 2³ = 8 possible) is not allowed, because it would shatter some set of 2 points (and that is not allowed, as the break point is 2)

Chapter 6 - Theory of Generalization

The existence of a break point restricts the number of dichotomies drastically.

We have 2 things to do now:

  • Prove that a growth function mH(N) with a break point is polynomial
  • Prove that we can replace M with mH(N)

Recall, to show that mH(N) grows polynomially, we need to show that it is bounded above by a polynomial in the general case. It is! mH(N) is bounded above by a polynomial of order N^(k-1), where k is the break point for the hypothesis set

  • examples:
    • H is positive rays
      • k is 2, so, mH(N) should be bounded by polynomial of power 1. Which is true, since mH(N) is N + 1
    • H is positive intervals
      • k is 3, so mH(N) should be bounded by N². Which is true, since mH(N) = (N+1)C2 + 1, which is of order N²
    • H is 2D perceptrons
      • k is 4, so mH(N) should be bounded by N³. We did not know mH(N) for it, but we know that it is bounded above by N³

So, as long as there is a break point, we will be able to get a polynomial and thus learn.

This result, that we can replace M with the growth function (with some technical modifications: P[|Ein(g) - Eout(g)| > ε] ≤ 4 mH(2N) e^(-ε²N/8)), is called the Vapnik-Chervonenkis inequality, and it is the most important theoretical result in machine learning.

Chapter 7 - The VC dimension

Review

  • We saw that mH(N) is bounded above by a polynomial in N of degree k - 1, where k is the break point for the hypothesis set
  • Also, that we can replace M with mH(N) and thus learn according to VC inequality

Thus, we can learn for any hypothesis set which has a break point. This lecture will give us the notion of the “VC dimension”, which characterizes the complexity of the hypothesis set

The definition

  • It is a quantity defined for a hypothesis set H, denoted by dvc(H)
  • It is the most points that H can shatter
  • it is the “largest” value of N for which mH(N) = 2^N
  • it is k - 1, where k is the smallest break point
  • N ≤ dvc(H) ⇒ H can shatter N points
  • N > dvc(H) ⇒ N is a break point for H (anything above dvc(H) is a break point)
  • in terms of the break point k:
    • for any hypothesis set with VC dimension dvc(H), the growth function is bounded by a polynomial of degree dvc(H)
  • examples:
    • H is positive rays
      • dvc(H) is 1
    • H is 2D perceptrons
      • k is 4, so dvc(H) is 3
    • H is convex sets
      • there is no break point, so dvc(H) is infinite
      • this is the most pessimistic case: the shattering argument places the points on a neat circle, which real data won't do. The VC dimension is a worst-case measure for the hypothesis set
      • with an infinite dvc(H), the VC bound gives no generalization guarantee, even though in practice we may still learn

Reiterating, we have established that if dvc(H) is finite (aka k exists), g ∈ H will generalize And it will generalize independently of learning algorithm, input distribution, independent of target function, for any hypothesis set

VC dimension of perceptrons in any dimension

  • for 2D, dvc(H) was 3
  • for 3D, dvc(H) is 4
  • in general, dvc(H) = d + 1 where d is the dimension
  • I.e., the perceptron in d dimensions can shatter d+1 points completely
  • Also, d + 1 is the number of parameters in the model - (w0, w1, …, wd)

Interpreting the VC dimension

  • the #parameters are analog degrees of freedom
  • the VC dimension makes them binary degrees of freedom - it is based on the #dichotomies possible, #of points you can shatter
  • examples:
    • H is positive rays
      • dvc(H) is 1, so 1 parameter, 1 degree of freedom - which is what we have!
    • H is positive intervals
      • dvc(H) is 2, so 2 parameters, 2 degrees of freedom - which is what we have!

Actually, it’s not just parameters, it is degrees of freedom. A parameter may not contribute to the degree of freedom, then it won’t count.

Consider a 1D perceptron,

assets/screenshot_2017-11-03_00-10-42.png

It has 2 parameters (variables): 1 weight and 1 bias/threshold. Now, we can take the output and feed it into another perceptron, and repeat, 3 times, and that is the output. Thus, we have this:

assets/screenshot_2017-11-03_00-17-41.png

Here, we have 4*2 = 8 parameters, but we still have 2 degrees of freedom, so dvc(H) is still 2. You don't care about the internal structure; you just ask yourself: how many points can I shatter? If the smallest break point is k, then dvc(H) is k - 1. That's all.

So, dvc(H) measures the effective number of parameters

  • Number of data points needed:
    • we had this statement according to the VC inequality:

assets/screenshot_2017-11-03_00-29-40.png

Here, we can see that the RHS behaves like N^dvc · e^(-N): a polynomial in N of degree dvc (the VC dimension) times a decaying exponential, where N is the number of samples. Plotting this bound as a function of N:

assets/screenshot_2017-11-03_00-32-02.png

We see that the polynomial dominates initially (and the bound is absurd, e.g. a "probability" bound greater than 10), but then the exponential kills it and we get tighter and tighter bounds. This is the reason that, if we use a linear regression model on a large number of sample points and we get some error on the sample points, we'll get nearly the same error out of sample as well - with high probability. This is the case of the RHS (δ) being very small, i.e. being very confident in the statement |Ein - Eout| ≤ ε, while ε itself may still be large. That is to say, we are 99% sure that the in-sample error will be within 60% of the out-of-sample error - "very confident that it is a bad system".

Rule of thumb:

  • N ≥ 10dvc(H)
  • You need 10 times the dvc(H) number of examples
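To see where a number like this comes from, here is a small sketch that solves the VC generalization bound for N by fixed-point iteration. The specific form N ≥ (8/ε²) ln(4((2N)^dvc + 1)/δ) is my reconstruction of the bound used in the course, so treat the constants as assumptions; the point is only that the formal bound is far more pessimistic than the practical 10·dvc rule.

#+BEGIN_SRC python
import math

def samples_needed(dvc, eps, delta, n=1000.0, iters=100):
    """Iterate N = (8/eps^2) * ln(4 * ((2N)^dvc + 1) / delta) to a fixed point.
    Uses m_H(N) ~ N^dvc + 1 (assumed form of the VC bound, see the lead-in)."""
    for _ in range(iters):
        n = (8 / eps**2) * math.log(4 * ((2 * n)**dvc + 1) / delta)
    return math.ceil(n)

# dvc = 3, track Eout within eps = 0.1, with confidence 95%
print(samples_needed(dvc=3, eps=0.1, delta=0.05))   # on the order of 30,000 examples
#+END_SRC

The formal bound asks for tens of thousands of examples where the rule of thumb says about 30; this is the looseness of the VC bound referred to later.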

Generalization bounds

  • We can rearrange things:
  • We had this from the VC inequality:

assets/screenshot_2017-11-03_00-45-24.png

From the RHS, we can get ε in terms of δ. δ is just how much certainty you want in the statement you make…

assets/screenshot_2017-11-03_00-46-35.png

So, we have:

  • ε depends on N, δ, mH - growth function of hypothesis set - we call the RHS Ω
  • if mH is large, dvc(H) is large, the guarantee on generalization(ε) is worse
  • if δ is small, i.e. you want to be very sure of the statement you make, ε gets worse
  • if N is large, ε becomes smaller
  • eventually, the log is killed by linear N (just like polynomial is killed by exponential)

We can re-write as:

With probability ≥ 1 - δ

assets/screenshot_2017-11-03_00-52-00.png

Ω is a function of

  • N → goes down with it
    • the more examples, the tighter bound on generalization
  • H → goes up with it
    • more complex H, higher the VC dimension dvc(H) and so less generalization
  • δ → goes up with smaller delta
    • as you want to be more sure about the confidence in your statement, less generalization

We can simplify further as:

  • remove the modulus: the direction we care about is Eout being bigger than Ein (if Eout happens to be smaller, we are happy anyway)
  • so, Eout - Ein ≤ Ω
    • Eout - Ein is called the generalization error
  • moving Ein to RHS,
    • Eout ≤ Ein + Ω(N, H, δ)
    • So, the RHS bounds the Eout
    • We have control on the RHS, Ein we are controlling and Ω is dependent on H, N, δ etc
    • If we choose a more complex H, Ein goes down, but Ω goes up.
      • Hence, there is a tradeoff involved here

This form will also give us the groundwork for regularization. Earlier, we used just Ein as a proxy for Eout. It seems here, that we can use Ein + Ω as the proxy and that may give us a better handle on Eout.

It is not always possible to compute the VC dimension dvc(H) of a hypothesis set exactly, e.g. for neural nets. We can say it is bounded above by the #parameters, but it can be much, much lower, as we saw with the chained perceptrons.

Chapter 8 - Bias Variance Tradeoff

Review

  • We saw that dvc(H) is just the max number of points a hypothesis set can shatter
  • It is k-1, where k is the break point
  • We have a rule of thumb: N ≥ 10 dvc(H)
  • Also, we rearranged the generalization bound to:
    • Eout ≤ Ein + Ω
    • Ω is a function of:
      • δ - what is the probability of error in the statement we make, I.e., what is the probability that Ein tracks Eout within ε
      • ε - how loosely does Ein track Eout

Bias-Variance tradeoff

  • It is a stand alone theory which gives us a different angle on generalization (as contrasted with VC analyses)
  • The tradeoff is b/w approximation and generalization
    • Small Eout is what we want; good approximation of ƒ out of sample
    • More complex H → better chance of approximating ƒ - we have a lot of options
    • Less complex H → better chance of generalizing - with fewer hypotheses, it is easier to zoom in on the right one; with too many, we have a hard time finding it
    • The hypothesis with only the target function in it!
    • Ideal is H = {ƒ}
  • The quantification of this tradeoff in VC dimension analysis is:
    • Eout ≤ Ein + Ω
      • Ein is approximation, because we are trying to fit a target function to input dataset
      • Ω is generalization

Bias-Variance provides an alternate way of looking at this same tradeoff

  • It decomposes Eout into:
    • How well H can approximate ƒ - how flexible is your H
    • How well we can zoom in on a good h ∈ H
  • This applies to real valued targets. We’ll use squared-error because it helps with derivation
  • We start with Eout
  • We keep D in the notation because the Es depend on the dataset. Earlier, in VC analysis also it was present, but we never wrote it because it wasn’t used
  • Eout(g(D)) = Ex[(g(D)(x) - ƒ(x))²]
  • Expected error over entire space 🔝
  • To remove dependency on a particular D, we take ED on both sides
  • ED[Eout(g(D))] = ED[Ex[(g(D)(x) - ƒ(x))²]]
  • we now manipulate this equation 🔝
  • Let's focus on just Ex[(g(D)(x) - ƒ(x))²]
  • we define an “average” hypothesis - bar

assets/screenshot_2017-11-27_21-16-03.png

  • what is gbar?
    • imagine you have many datasets; gbar is the average of the hypotheses you would choose from each of them.

assets/screenshot_2017-11-27_21-19-17.png

So, we have this derivation:

assets/screenshot_2017-11-27_21-37-30.png

Adding and subtracting gbar 🔝

assets/screenshot_2017-11-27_21-37-59.png

We get this 🔝 on rearrangement.

Now, the second term is independent of D, so ED of it is itself Also, in the 3rd term, the first half is gbar - gbar so that is 0 as well

So,we get:

assets/screenshot_2017-11-27_21-39-56.png

This 🔝 is a very important equation. The first term tells you how far the hypothesis you got from your particular D is from gbar, the average hypothesis. Gbar plays the role of the best you can do with this hypothesis set, because averaging over many datasets cancels out the dataset-to-dataset fluctuations.

There are 2 hops: the hop from your hypothesis to the best you can get, and the hop from the best you can get to the actual target function. The second term measures how far the best you can possibly get is from the target function itself.

  • The 1st term is variance
    • This is how much away you are from the best h in you H. You cannot get gbar because you don’t have all the possible datasets, you have only the one you got
  • The 2nd term is bias
    • Because your hypothesis set may be biased away from the target function because your best is so much away from the target function

Thus, recall we started with:

assets/screenshot_2017-11-27_21-48-21.png

So, we get: ED[Eout(g(D))] = Ex[bias(x) + var(x)] = bias + var

So, we have broken down the out of sample error into it’s bias and variance components. So, if we have an Eout of 0.3, we can say that 0.05 is because of bias and 0.25 is because of variance

The tradeoff

We have:

assets/screenshot_2017-11-27_21-51-28.png

Pictorial representation:

assets/screenshot_2017-11-27_21-52-44.png

If we have a H with only 1 function, we have no variance and we always choose that. But it may not be the best, our g ( = gbar) will be biased away from the target function

assets/screenshot_2017-11-27_21-53-27.png

If we have a very complicated H, we won’t be able to choose the best - the target function because we have a lot of variance. We get a different D each time to learn etc. And depending on that D, we might not choose ƒ. The centroid of the red region is gbar. It would be quite close to ƒ. (The difference b/w them would be the bias, which is close to 0 here)

If H goes up, bias goes down, variance goes up

Let’s take an example

  • Let’s take our target function to be a sine curve.
  • ƒ(x) = sin(π x), ƒ:[-1, 1] → ℜ
  • Our dataset is only 2 datapoints.
  • We have 2 hypothesis sets, H0 and H1
  • H0 is the constant model - h(x) = b
  • H1 is the linear model - h(x) = ax + b
  • Which one is better?
    • Better for what?

Approximation

The linear line is better. Since we know the target function, a line is a better fit than a constant. (we are using mean squared error)

assets/screenshot_2017-11-27_22-03-35.png

The errors are in yellow:

assets/screenshot_2017-11-27_22-04-05.png

Eout in this case is 0.2

For H0, we have the constant only. So, the hypothesis will be g(x) = 0 and the error will be huge:

assets/screenshot_2017-11-27_22-05-36.png

Eout in this case is 0.5

So, the linear model wins. We can do better with a 3rd order polynomial, even better with a 17th etc. This is because we have no question of zooming in, we have all the information

Learning

We get 2 examples now. We don’t know the target function, we just have independently picked examples. So, we have:

assets/screenshot_2017-11-28_08-35-20.png

So, H0 and H1 will be:

assets/screenshot_2017-11-28_08-36-33.png

We can compute the error (the yellow regions) etc, but it depends on which 2 points we get, on our dataset. This is why we had the bias-variance tradeoff. Now, we can do a bias-variance decomposition on these two hypothesis sets.

So, we do a simulation in which we take 2 points on random from the target function and fit our constant hypothesis to them We get this:

assets/screenshot_2017-11-28_08-40-47.png

The gray lines are all the various g-s we choose depending on the input dataset D We can take the average of it, to get g-bar

assets/screenshot_2017-11-28_08-41-34.png

G-bar is just what we had in the Approximating case. It shows us that the average hypothesis is inherently better because it “cancels out the variance”. The output of our learning process is one of the gray lines, not g-bar. The gray shaded region is the variance around g-bar - the std dev around g-bar.

The error b/w the green line and the target function will give you the bias. The gray region will be the variance

If we have the H1, the straight line, we get this (on the same dataset as in case H0):

assets/screenshot_2017-11-28_08-47-39.png

Note we have a lot of variance.

On average we get:

assets/screenshot_2017-11-28_08-48-45.png

The variance depends on x; the gray region is the std dev. When we want a single number, we take the expected value of the squared width of the gray region. So, we have better approximation but very bad (high) variance. So, given only 2 points, we are better off with a constant.

So, if asked what is better at approximating a sine curve, a line or a constant? - A line of course! But, if asked what is better at approximating 2 points coming from some unknown curve? - A constant is better

Let’s quantify this:

assets/screenshot_2017-11-28_09-31-39.png

Here, the bias is exactly what we had for the approximating case. The variance is 0.25

Whereas for H1,

assets/screenshot_2017-11-28_09-32-31.png

The bias is 0.21. This is slightly higher than in the approximating case where it was 0.20. This is to be expected because here, we get only 2 points at a time and not the whole sine curve. The variance is huge.

So, H0 is better we see.
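A small simulation of this experiment (my own sketch; it should roughly reproduce the numbers above: bias ≈ 0.5 vs 0.21, variance ≈ 0.25 vs much larger):

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-1, 1, 1001)          # grid for estimating E_x[...]
f = np.sin(np.pi * xs)                 # the target function

def fit_constant(x, y):
    return np.full_like(xs, y.mean())          # least-squares constant through the 2 points

def fit_line(x, y):
    a = (y[1] - y[0]) / (x[1] - x[0])          # line through the 2 points
    return a * (xs - x[0]) + y[0]

for name, fit in [("H0 (constant)", fit_constant), ("H1 (line)", fit_line)]:
    g = np.empty((10000, xs.size))
    for r in range(g.shape[0]):                # many datasets D of size 2
        x = rng.uniform(-1, 1, size=2)
        g[r] = fit(x, np.sin(np.pi * x))
    g_bar = g.mean(axis=0)                     # the average hypothesis
    bias = np.mean((g_bar - f) ** 2)           # E_x[(gbar(x) - f(x))^2]
    var = np.mean((g - g_bar) ** 2)            # E_x[E_D[(g(x) - gbar(x))^2]]
    print(f"{name}: bias={bias:.2f} var={var:.2f} Eout={bias + var:.2f}")
#+END_SRC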

In any ML scenario, we are matching the model complexity to the data resources, not the target complexity; we don’t know what the target function is.

ML idea

Let’s say you have 100 datapoints. Is there any utility in taking 10 points at random and trying to learn a hypothesis? Doing this for many random samplings of 10 points and then averaging the hypothesis would give us something close to gbar. This is a way of dealing with variance, no?

Learning curves

Plot of Eout and Ein as a function of N. The data set D is of size N. Expected out-of-sample error: ED[Eout(g(D))]

  • it depends on the dataset. If you want it to be agnostic of any particular dataset, we “integrate” D out and get expected value wrt D

Expected in-sample error: ED[Ein(g(D))]

  • This measures how well you fit the data

How do these vary with N?

assets/screenshot_2017-11-28_09-42-51.png

This is for the simple model.

assets/screenshot_2017-11-28_12-47-32.png

The same thing holds for the complex model; the graph is just shifted towards the right. The point on the x-axis up to which Ein stays zero roughly corresponds to the VC dimension: up to there, the hypothesis set can fit everything perfectly (the points can be shattered).

VC analysis vs bias-variance

In VC analysis, we had Eout which comprised of in-sample error and generalization error (Ω) So, the figure looks like this:

assets/screenshot_2017-11-28_12-51-50.png

One thing to note is that in the VC analysis, the generalization error bound isn't really this small, it is much, much higher; it just follows this shape. The bound is not that tight.

If we plot the same thing for the BV analysis, we get:

assets/screenshot_2017-11-28_12-52-51.png

Here, the bias is the error in g-bar, that is the best you can do. As N goes up, the variance decreases, and we get closer and closer to g-bar.

Linear regression case

Let’s say we have a noisy target y = w*Tx + noise D is {(x1, y1), (x2, y2), …, (xN, yN)} The input points in D are picked independently.

Learning solution: w = (XTX)-1XTy

In sample error: Xw - y

Out-of-sample error - same input points + new noise Xw - y’

On analysis we get is:

assets/screenshot_2017-11-28_17-03-37.png

σ² is the variance of the noise, and that is the best you can do, which makes sense since you can't capture the noise. Also, for N up to d + 1 the points can be fit perfectly (shattered). As N grows, the variance dies out but the bias (σ²) persists, and Eout gets lower and lower towards it

Best approximation error = σ2

assets/screenshot_2017-11-28_17-09-52.png

Here, we see that we have an exact formula for the expected generalization error, and that it is proportional to dVC/N - so the N needed for a given level of generalization is proportional to dVC

Chapter 9 - The Linear Model II

Review

  • We took the expected value of the Eout wrt the identity of dataset D, size fixed at N to get rid of variation due to datasets. And we got a clean decomposition of Eout into bias + variance terms.
  • Bias is how far is your best hypothesis in your hypothesis set away from ƒ
  • Variance is how close you get to that best hypothesis, given the particular dataset you happen to have.
  • g(D)(x) → gbar(x) → ƒ(x)
  • N ∝ dVC

Non linear transforms

Summary from earlier

We used non-linear transforms earlier, when we applied (x1, x2) → (z1, z2, …, zd), so z = Φ(x). Example: z = (1, x1, x2, x1², x2², x1x2) - this is the quadratic transformation. Our final hypothesis g(x) will always live in X space.

assets/screenshot_2017-11-28_17-35-19.png

We take the first in classification, 2nd in regression

Price we pay

In terms of generalization 🔝

Earlier, in X space, the free parameters were the components of w. In the transformed space, the free parameters are the components of w-tilde, of which there are more. So, the VC dimension increases and we pay a price in generalization.

assets/screenshot_2017-11-29_08-28-29.png

However, we can do this to reduce the parameters:

assets/screenshot_2017-11-29_08-27-23.png

Here, we are reducing the parameters to just 1. This won’t work because we looked at the data and did the “learning” ourselves. We decided that we need a circle and don’t need the cross terms etc. We have to pay the price in terms of dVC for what we start with (in our mind here), which is the full quadratic form.

Looking at the data before choosing the model is hazardous to your Eout. You effectively explore a very large hypothesis space but account for only a small one in your analysis. This is data snooping.

This is why you cannot arbitrarily go to very high dimensions during transformations, because the dVC gets high and generalization suffers.

Logistic regression

We have seen so far:

  • Linear perceptron
  • Linear regression

Now we will see Logistic regression.

The model

This will be our third linear model. Being linear means that we compute the signal as being a linear combination of the weights. And then we use the signal to do classification, estimation etc

assets/screenshot_2017-11-29_09-36-10.png

The perceptron is this:

assets/screenshot_2017-11-29_09-36-48.png

The staircase pattern is there because it is a step function: we look at the sign and return +1/-1

All the linear models differ in what they do to the signal. Perceptron looks at the sign 🔝

Regression uses s as is. We can say we apply the identity function.

assets/screenshot_2017-11-29_09-40-19.png

Logistic regression squashes it to be b/w 0 and 1 by applying a nonlinearity to it (logistic function θ).

assets/screenshot_2017-11-29_09-41-22.png

This is somewhat b/w perceptron and linear regression

The logistic function θ looks like this:

assets/screenshot_2017-11-29_09-42-46.png

This is a soft threshold. There are many formulas that give us this, we will use:

assets/screenshot_2017-11-29_09-43-18.png

So, we have: h(x) = θ(s), s = wTx. The signal s, can be thought of as a “risk score”

Since during training, we have a binary label on the datapoints in D, we don’t have a probability, we get:

assets/screenshot_2017-11-29_10-15-14.png

Since the target function itself is noisy, it gives y = +1 or y = -1 with a certain underlying probability, and that probability is what we are trying to learn.

Error measure

We can either do a cost-benefit analysis and come up with a very plausible error measure. Or we can use something that is friendly to the optimizer.

For each (x, y), y is generated by probability ƒ(x) We can have a plausible error measure based on likelihood:

We will grade different hypothesis according to the likelihood that they generated the data. This gives us a way of grading the different hypotheses.

Let’s assume that the datapoints in D were generated by h. (ie h = ƒ). This is the reverse of what we were doing earlier. Earlier, we extracted the most probable hypothesis given the data. That was clean. Now we are asking what is the probability of the data given the hypothesis. There are some data snooping concerns. The Bayesian’s get around this by starting with a prior and then using likelihood to get a posterior.

assets/screenshot_2017-11-29_10-28-44.png

What we need to do is compute the LHS in 🔝. Then we select the h which has the greatest LHS

So, we do this: We replace h(x) in that formula 🔝 with:

h(x) = θ(wTx)

Also, we see:

θ(-s) = 1 - θ(s)

So, we get:

P(y | x) = θ(y · wTx)

This is for one point, for all the points in the dataset, we get the likelihood:

assets/screenshot_2017-11-29_10-49-54.png

So, since we find the w that maximizes the likelihood of getting this dataset, we learn something about the underlying probability distribution that generated this dataset.

How do we maximize the likelihood?

We have:

assets/screenshot_2017-11-29_11-42-07.png

Recall, we are maximizing wrt w

Instead, we can maximize log-likelihood

assets/screenshot_2017-11-29_11-42-47.png

We can also do this: ⬇️

assets/screenshot_2017-11-29_11-43-18.png

Also, we can do this:

assets/screenshot_2017-11-29_11-43-47.png

We can substitute θ and get this simplification:

assets/screenshot_2017-11-29_11-44-43.png

So, we can call the term inside the Σ the pointwise error e

assets/screenshot_2017-11-29_11-45-37.png

The exponent -ynwTxn is nice because if the risk score is high (positive) and yn is negative, we get -(-)( +), which is positive, so the exponent is large and the error is high, etc.

This is called the cross-entropy error. (b/w h and ƒ)

Learning algorithm

Since we know the error measure, we can minimize it

We had in linear regression:

assets/screenshot_2017-11-29_11-49-58.png

We had a closed-form solution in linear regression, the pseudo-inverse, the one step learning.

We cannot find a closed form solution for the logistic regression case. We need an iterative solution - gradient descent We have a convex error surface in logistic regression, so we can optimize it perfectly. We won’t have this 🔝 when we have neural networks etc.

We have:

assets/screenshot_2017-11-29_11-58-17.png

We need the direction, v-hat. Earlier, we just said that the direction of steepest descent is along the negative gradient of the surface at that point. An alternative way to arrive at the same idea is:

assets/screenshot_2017-11-29_11-59-37.png

We can replace w(1) with the formula

assets/screenshot_2017-11-29_12-00-13.png

Using Taylor series expansion of the first term, we get:

assets/screenshot_2017-11-29_12-00-48.png

If the surface were linear, we wouldn’t have the O(η2) term; the first-order approximation would be exact.

We now just need the direction v-hat that makes this first-order term as negative as possible. So, we have:

assets/screenshot_2017-11-29_12-02-37.png

Size of η?

assets/screenshot_2017-11-29_12-03-27.png

η should increase with the slope. So, we can do this instead:

assets/screenshot_2017-11-29_12-05-33.png

So, we moved from a fixed step size to a fixed learning rate. We can manipulate that as well, with momentum etc.
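Putting the pieces together, a minimal batch gradient descent for logistic regression might look like the sketch below (Python/NumPy assumed; the gradient is the standard one obtained by differentiating the cross-entropy error):

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of Ein(w): -(1/N) * sum_n y_n x_n / (1 + exp(y_n w^T x_n))."""
    s = y * (X @ w)
    return -(X * (y / (1.0 + np.exp(s)))[:, None]).mean(axis=0)

def logistic_regression_gd(X, y, eta=0.1, steps=1000):
    """Batch gradient descent with a fixed learning rate eta."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= eta * gradient(w, X, y)     # move against the gradient
    return w
```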

Summary

assets/screenshot_2017-11-29_12-06-30.png

We have seen 3 linear models now:

assets/screenshot_2017-11-29_12-09-11.png

Chapter 10 - Neural Networks

In logistic regression we have a convex error surface, so we could use gradient descent and minimize the error. This is not the case with Neural Networks.

Stochastic gradient descent

We had GD, which minimizes the in sample error

assets/screenshot_2017-11-29_13-31-24.png

We need to evaluate the hypothesis at every point in D, so this is expensive. Each step uses the full dataset, i.e., one epoch (an epoch is one pass through the whole dataset)

This was “batch GD” 🔝

We can also have stochastic GD. Pick 1 example (xn, yn) at a time. Apply GD to e(h(xn), yn)

This works because the “average” direction in which we move is:

assets/screenshot_2017-11-29_13-35-56.png

assets/screenshot_2017-11-29_13-37-33.png

So, at every step, we are moving in this expected direction plus noise. This is randomized GD, aka stochastic GD
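A minimal sketch of one SGD epoch for logistic regression (assuming the same NumPy setup as above; the fixed η = 0.1 is the rule of thumb mentioned below):

```python
import numpy as np

def sgd_epoch(w, X, y, eta=0.1, rng=None):
    """One epoch of stochastic GD: visit the examples in random order,
    updating on the pointwise error ln(1 + exp(-y_n w^T x_n)) each time."""
    rng = rng or np.random.default_rng()
    for n in rng.permutation(len(y)):
        grad = -y[n] * X[n] / (1.0 + np.exp(y[n] * (X[n] @ w)))
        w = w - eta * grad
    return w
```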

Benefits of SGD

  • cheaper computation
    • with the same expected movement
  • randomization
    • so you don’t get stuck in local minima always
  • simple
    • simplest possible optimization to GD
  • Rule of thumb: η = 0.1 works

SGD in action

In the Netflix competition, we had user vectors uik for user i, and movie vectors vjk for movie j. We also had a rating rij for user i and movie j

assets/screenshot_2017-11-29_13-48-26.png

We can define an error measure as:

assets/screenshot_2017-11-29_13-48-47.png

So, we have 2k parameters, and we can use SGD to move in this 2k dimensional space. We nudge them little by little to go towards the rating.
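A sketch of one such SGD nudge (Python/NumPy; the sizes, learning rate, and variable names are hypothetical, not from the lecture):

```python
import numpy as np

def rating_sgd_step(U, V, i, j, r_ij, eta=0.01):
    """One SGD step on the pointwise squared error (r_ij - u_i . v_j)^2,
    nudging user vector U[i] and movie vector V[j] toward the rating."""
    err = r_ij - U[i] @ V[j]          # residual of the predicted rating
    du = -2.0 * err * V[j]            # gradient wrt the user factors
    dv = -2.0 * err * U[i]            # gradient wrt the movie factors
    U[i] -= eta * du
    V[j] -= eta * dv

# Hypothetical setup: 100 users, 50 movies, k = 10 factors each.
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(100, 10))
V = rng.normal(scale=0.1, size=(50, 10))
rating_sgd_step(U, V, i=3, j=7, r_ij=4.0)
```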

Neural Network Model

The building blocks are perceptrons; neural networks are combinations of them. The case where a single perceptron failed was the 4 points arranged funnily (so that no single line separates them).

assets/screenshot_2017-11-29_13-51-57.png

But this can be solved by 2 perceptrons

assets/screenshot_2017-11-29_13-52-12.png

We can combine them using OR/AND gates which are manifested by the weights on them

assets/screenshot_2017-11-29_13-53-11.png

Note the step function just takes the sign to give +1/-1

Now, once we have OR and AND, we can create layers of these to get more complex function approximators or logic gates etc

We can get XOR with:

assets/screenshot_2017-11-29_13-57-18.png

So, we have the 2 perceptrons h1 and h2; we AND each with the negation of the other and then OR the results: f = (h1 AND NOT h2) OR (NOT h1 AND h2)

So, the full multilayer perceptron that implemented the function that we wanted(where the single perceptron failed) is:

assets/screenshot_2017-11-29_14-01-30.png

3 layers, feedforward architecture.
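To make the construction concrete, here is a sketch of XOR built from hard-threshold units, assuming h1 and h2 are the outputs (±1) of the two perceptrons on the slide:

```python
def sign(s):
    return 1.0 if s > 0 else -1.0

# Single perceptrons on +/-1 inputs: the gate is set by the bias weight.
def OR(a, b):  return sign(a + b + 1.5)   # -1 only when both inputs are -1
def AND(a, b): return sign(a + b - 1.5)   # +1 only when both inputs are +1

def XOR(h1, h2):
    """(h1 AND NOT h2) OR (NOT h1 AND h2): the upper layers of the 3-layer
    net; h1 and h2 come from the first layer of perceptrons."""
    return OR(AND(h1, -h2), AND(-h1, h2))

for h1 in (-1, +1):
    for h2 in (-1, +1):
        print(h1, h2, "->", XOR(h1, h2))   # +1 exactly when h1 != h2
```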

Now, you can get whatever surface you want to approximate:

assets/screenshot_2017-11-29_14-03-06.png

There are 2 red flags: generalization and optimization. For generalization, we know exactly what we are dealing with, and we can reason rationally about it. For optimization, what we can do is use soft thresholds throughout, and then give the final answer after hard-thresholding it. Soft thresholds are better because they are twice differentiable, so we can use gradient descent readily.

The neural network

We have this:

assets/screenshot_2017-11-29_16-28-08.png

θ is any non-linearity that you want. We will use tanh, which goes from -1 to +1 (unlike the logistic function, which goes from 0 to 1). Each of these units could use a different non-linearity as well.

The input layer has x The hidden layers 1 ≤ l ≤ L output layer l = L

The non-linearity θ can be: tanh (hyperbolic tangent)

assets/screenshot_2017-11-29_16-37-24.png

For small signals this function behaves approximately linearly; for large signals it behaves like a hard threshold.

The parameters of the neural network, the weights have elaborate indices.

assets/screenshot_2017-11-29_16-40-40.png

d(l) is the dimension of layer l. The inputs start with index 0 because each layer has the mandatory constant input x0 = 1

assets/screenshot_2017-11-29_16-43-50.png

So, 🔝 the value in the next layer is just the signal (computed from the weights+values of previous layer) and passed thru the nonlinear function θ.

So, we start applying this from the input layer all the way to the output layer to get x1(L)

assets/screenshot_2017-11-29_16-47-29.png
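A compact forward-propagation sketch (Python/NumPy; weights[l] is assumed to be a matrix whose row 0 multiplies the constant x0 = 1 of the previous layer):

```python
import numpy as np

def forward(x, weights):
    """Forward propagation with theta = tanh at every layer."""
    for W in weights:
        x = np.concatenate(([1.0], x))   # prepend the mandatory x0 = 1
        s = W.T @ x                      # signals s^(l) of the next layer
        x = np.tanh(s)                   # outputs x^(l) = theta(s^(l))
    return x                             # x^(L), the network output
```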

Applying SGD

All the weights w = {wij(l)} determine h(x). The error on example (xn, yn) is e(h(xn), yn); as a function of the weights we can write it as e(w)

assets/screenshot_2017-11-29_16-52-04.png

We need the gradient of the e(h(xn), yn)

assets/screenshot_2017-11-29_16-51-55.png

How do we compute this?

We can find ∂e(w)/∂wij(l) for any single weight by the chain rule. The problem is that doing it separately for every weight is too slow; we need a faster way to compute the whole gradient.

We can use the backpropagation algorithm to get the entire gradient in one shot

Backpropagation algorithm

We can write using chain rule:

assets/screenshot_2017-11-29_17-09-54.png

The 2nd term is simple. The 1st term needs thought; let’s call it δj(l). So the partial derivative is a product of two terms: the x from the layer the weight comes out of and the δ of the layer it feeds into. This is an attractive property.

We get δ at the output layer and then use it to compute the δs of the inner layers, and so on.

This is just a normal application of the chain rule.

assets/screenshot_2017-11-29_17-17-39.png
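A sketch of backpropagation under those conventions (tanh at every layer, squared pointwise error; the layer bookkeeping here is my own, not the lecture's notation):

```python
import numpy as np

def backprop(x, y, weights):
    """Return de/dw for every layer. weights[l] maps the (bias-augmented)
    output of layer l to the signals of layer l+1."""
    # Forward pass, remembering each layer's bias-augmented output.
    xs = [np.concatenate(([1.0], x))]
    for W in weights:
        xs.append(np.concatenate(([1.0], np.tanh(W.T @ xs[-1]))))
    x_out = xs[-1][1:]                          # network output x^(L)

    # Delta at the output layer, then propagate it backwards.
    delta = 2.0 * (x_out - y) * (1.0 - x_out**2)
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads[l] = np.outer(xs[l], delta)       # de/dw_ij = x_i * delta_j
        if l > 0:
            x_prev = xs[l][1:]                  # drop the bias coordinate
            delta = (1.0 - x_prev**2) * (weights[l][1:] @ delta)
    return grads
```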

The hidden layers perform a nonlinear transformation. But the transformation is learned, and we already charge generalization for it: the VC dimension is ≈ the number of weights

Chapter 11 - Overfitting

Review

  • We can combine perceptrons to get more complex boundaries
  • The optimization is difficult, so we introduced neural networks that softened thresholds of the perceptrons.
  • We used chain rule to get derivative of error wrt weights and so we can use SGD to minimize the error.

What is Overfitting?

Let’s say we have a simple target function. We generate 5 data points with random noise in them.

assets/screenshot_2017-11-29_17-33-53.png

Overfitting is when you go farther than you should have. If you try to fit the above ƒ with a 4th order polynomial, we will overfit.

When training a neural network, if you plot Ein and Eout at each step and you get something like so:

assets/screenshot_2017-11-29_17-38-31.png

Overfitting is when Ein goes down and at the same time, Eout is going up. So, we should stop at that moment and report that. This is called early stopping

Overfitting is “fitting the data more than warranted” The culprit is when we are fitting the noise.

Case-study to investigate effects of fitting noise on overfitting

Let’s have a 10th order target function. We generate 15 datapoints + noise

assets/screenshot_2017-11-29_17-50-28.png

We also have a second target function, which is a 50th order polynomial. For it we generate 15 noiseless datapoints lying on the curve.

assets/screenshot_2017-11-29_17-51-53.png

We will have 2 models for each target

  • 2nd order polynomial
  • 10th order polynomial

For noisy low-order target

assets/screenshot_2017-11-29_18-14-46.png

This is a blatant case of overfitting

For noiseless high-order target

assets/screenshot_2017-11-29_18-17-14.png

So, here too, it is overfitting. Even when we are fitting a 10th order polynomial to a 50th order target function

The reason is that the second case also has “noise” - not the traditional stochastic noise, but something that plays the same role. We need to match the hypothesis complexity to the data resources, not to the target function complexity.

A detailed experiment

Impact of noise level and target complexity

y = f(x) + ε(x)

ε(x) has variance σ², the noise “energy”

f(x) is a Qth order target polynomial, normalized so that the noise level is meaningful - we’ll use Legendre polynomials. So, the things affecting overfitting are:

assets/screenshot_2017-11-29_18-29-08.png

We would like to understand the dependency b/w these on overfitting

We fit the data using our 2 hypothesis sets - H2, the 2nd order polynomials, and H10, the 10th order polynomials - and compare the out-of-sample errors. Overfit measure: Eout(g10) - Eout(g2). A positive value means the more complex model is doing worse, i.e., overfitting; a negative value means the larger model is doing better, i.e., no overfitting.
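A rough Monte-Carlo sketch of this experiment (Python/NumPy). The lecture normalizes the target and uses Legendre polynomials; this sketch uses plain random polynomials, so the numbers will differ, but the sign of the overfit measure tells the same story:

```python
import numpy as np

rng = np.random.default_rng(0)

def overfit_measure(N=15, sigma=0.5, Q=20, trials=200):
    """Average of Eout(g10) - Eout(g2) over random noisy targets."""
    total = 0.0
    for _ in range(trials):
        f = np.polynomial.Polynomial(rng.normal(size=Q + 1))   # random target
        x = rng.uniform(-1, 1, N)
        y = f(x) + sigma * rng.normal(size=N)                  # noisy data
        x_test = rng.uniform(-1, 1, 1000)
        y_test = f(x_test) + sigma * rng.normal(size=1000)
        e = {deg: np.mean((np.polynomial.Polynomial.fit(x, y, deg)(x_test)
                           - y_test) ** 2) for deg in (2, 10)}
        total += e[10] - e[2]
    return total / trials   # positive => the complex model overfits

print(overfit_measure())
```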

When we run this for 10s of millions of examples, we get this:

  • Impact of σ2
  • N vs. σ2

assets/screenshot_2017-11-29_18-34-23.png

(Red means overfitting.) So the more σ², i.e., the more noise, the more N we need to beat overfitting.

For all the simulations 🔝, the target complexity is fixed at 20th order polynomial

This was okayish, no surprises. What about the other case where we were getting overfitting without noise?

assets/screenshot_2017-11-29_18-36-38.png

Here we fix the noise level at 0.1 but change the target function complexity. We see somewhat the same pattern: overfitting happens with higher complexity of ƒ (not as pronounced as before, because this “noise” has a pattern we can partly capture, whereas truly random noise takes a lot of N to average out), but with more N we can beat it. Also, overfitting does not start until the order of the target polynomial gets past about 10; the target needs to be sufficiently complex that our models can no longer capture it. Thus there appears to be another factor, apart from conventional noise, that affects overfitting.

The role of noise

The first kind, with energy σ², is called “stochastic noise”. The second kind isn’t random; it is deterministic, but our hypothesis set cannot capture it, so it might as well be noise. We’ll call it “deterministic noise”

assets/screenshot_2017-11-29_18-42-28.png

Deterministic noise

The part of ƒ that H cannot capture → f(x) − h*(x)

assets/screenshot_2017-11-29_20-45-30.png

Main differences with stochastic noise:

  • depends on H
  • fixed for a given x (the difference b/w f(x) and h*(x) at that point)

They behave exactly the same as far as learning is concerned.

So, if we have only 10 points and the hypothesis is very complex, it will try to fit in the noise - either stochastic or deterministic - and you overfit

Earlier, when we did the Bias-Variance decomposition, we got the bias term and the variance term. Adding noise to the mix and repeating the analysis, we get:

assets/screenshot_2017-11-29_21-27-39.png The 3rd term is the stochastic noise. The second term, the bias, is the deterministic noise. The noise terms increase the variance (the 1st term)

Dealing with overfitting

Two cures:

  • Regularization - putting the brakes
  • Validation - checking the bottom line - learning to not look at Ein when deciding when to stop but getting an estimate for Eout

This will allow us to use the 4th order polynomial but improve dramatically:

assets/screenshot_2017-11-29_21-36-36.png

Chapter 12 - Regularization

Review:

assets/screenshot_2017-11-30_09-38-46.png

The 1st cure to overfitting is Regularization, 2nd is validation

Regularization - informal

Regularization has a mathematical basis - function approximation is an ill-posed problem because many functions fit the data, so you impose smoothness constraints

It also has heuristic basis.

Earlier, we studied how a linear model fails on fitting a sine curve with only 2 points.

assets/screenshot_2017-11-30_12-22-09.png

We can apply regularization here and restrict large offsets and steep slopes. This means we sacrifice a little on the fit of the datapoints.

assets/screenshot_2017-11-30_12-23-39.png

This shows that the variance is reduced.

Quantifying it:

assets/screenshot_2017-11-30_12-32-42.png

So, with regularization, the total score is better than in the constant model. There is a small increase in the bias, the g-bar is different as well. This is because we are handicapping the flexibility of the hypothesis and so we don’t get the same g-bar as earlier. The slight increase in bias can be considered a side-effect of the treatment.

The g-bar is roughly the best hypothesis you could get out of H: it is what you would obtain if you drew infinitely many datasets of 2 datapoints each and averaged all the hypotheses you learned.

Regularization allows you to choose a model “in-between” the constant and linear model - a sort of a restricted linear model

What regularization did we apply?

Regularization - formal

Let’s define a hypothesis set: HQ = polynomials of order Q

The non-linear transformation that produces this polynomial: z

assets/screenshot_2017-11-30_12-43-05.png

You take the input x, and combine and evaluate these polynomials. This is the nonlinear transformation. This means, we are converting each feature into higher powers till power Q.

The hypothesis set is simply all linear combinations of Legendre polynomials up to order Q

assets/screenshot_2017-11-30_12-45-26.png

We can use linear regression to get the solution This is simple, we know the drill:

  1. We apply the transformation on the input points

assets/screenshot_2017-11-30_12-47-50.png

  2. We write Ein(w), in vector form

assets/screenshot_2017-11-30_12-48-51.png

  3. We know the formula

assets/screenshot_2017-11-30_12-51-08.png

What if we constrain the weights?

  • we can consider H2 as a constrained version of H10
    • But this is a hard constraint, with wq = 0 for q > 2

A softer constraint would be:

assets/screenshot_2017-11-30_12-52-22.png

Here, we are decreasing the number of possible hypotheses, so the dVC decreases and our chances of generalization improve. So, the optimization problem changes to:

assets/screenshot_2017-11-30_12-53-43.png

Pictorially, we have our old error surface:

assets/screenshot_2017-11-30_12-56-28.png

The lowest error lies at the center of this ellipse. This is what is reported by linear regression without regularization. With the constraint, we have:

assets/screenshot_2017-11-30_12-57-29.png

So, we have to now pick a point that is as close to wlin as possible and still respects the constraint. So, it will lie on the circle. Also, if C is too large, we will have something like this:

assets/screenshot_2017-11-30_12-58-59.png

Here, the solution will be wlin, as the regularization was too liberal.

Now, consider a non-optimal point:

assets/screenshot_2017-11-30_13-01-44.png

Here, the gradient of Ein will be orthogonal to the ellipse. Also, the normal to the circle is the w vector itself (the vector from the origin to the point w is orthogonal to the circle at that point). Thus, we can write:

assets/screenshot_2017-11-30_13-03-35.png

We can put proportionality constant as:

assets/screenshot_2017-11-30_13-04-39.png

So, the last equation looks like the minimization of:

assets/screenshot_2017-11-30_13-05-11.png

So, we moved from a constrained optimization to an ordinary unconstrained one. λ depends on C (among other things): the larger C is (the looser the constraint), the smaller λ needs to be, so the two are inversely related.

Augmented error

We are now minimizing the augmented error - that is, the error with the regularization term attached to it

assets/screenshot_2017-11-30_13-08-58.png These 🔝 two optimizations are fundamentally the same.

The bottom formulation lends itself to VC analysis, as we are restricting our hypothesis set explicitly, we have a lower dVC and better chances at generalization.

The solution is simple:

We have:

assets/screenshot_2017-11-30_13-11-42.png

So, we get gradient:

assets/screenshot_2017-11-30_13-11-59.png

So, this is the solution with regularization.
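In code, the regularized one-step solution is a single line (Python/NumPy sketch; Z is the matrix of transformed inputs):

```python
import numpy as np

def w_reg(Z, y, lam):
    """Weight-decay linear regression: w_reg = (Z^T Z + lambda I)^(-1) Z^T y.
    lam = 0 recovers the plain pseudo-inverse solution w_lin."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
```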

Applying regularization on the previous problem:

assets/screenshot_2017-11-30_13-15-58.png

We see that regularization increases the bias a little and reduces the variance. With too much of it, we go from overfitting to underfitting.

There is an optimal value of λ:

assets/screenshot_2017-11-30_13-31-25.png

Weight decay

This regularization we studied, wTw ≤ C is called weight decay.

assets/screenshot_2017-11-30_13-18-36.png

Why is this called weight decay?

We had in GD:

assets/screenshot_2017-11-30_13-19-36.png

But now, we have gradient due to wTw also

assets/screenshot_2017-11-30_13-20-26.png

We see that λ plays the role of decaying the weights at each step. This applies in neural networks as well; there, wTw becomes the sum of squares of all the weights in the network

assets/screenshot_2017-11-30_13-22-45.png
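The update rule itself is tiny; a sketch (the exact scaling of λ, e.g. whether it is divided by N, depends on how the augmented error is written):

```python
def weight_decay_step(w, grad_Ein, eta=0.1, lam=0.01):
    """One step on Eaug(w) = Ein(w) + lam * w^T w:
    w <- w * (1 - 2*eta*lam) - eta * grad_Ein.
    The factor (1 - 2*eta*lam) is what 'decays' the weights."""
    return w * (1.0 - 2.0 * eta * lam) - eta * grad_Ein
```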

Apart from constraining the total size of the weights, we can also decide that some weights are more important than others, using an importance factor γq. If γq is small, that weight is barely penalized; if it is large, there is emphasis on shrinking that weight.

assets/screenshot_2017-11-30_13-28-39.png

Example:

assets/screenshot_2017-11-30_13-25-37.png

The algorithm will try to find a lower order fit

assets/screenshot_2017-11-30_13-26-02.png

In neural networks, we give different layers different γ’s

One famous family of regularizers is the Tikhonov regularizer

assets/screenshot_2017-11-30_13-28-16.png

In the above form:

assets/screenshot_2017-11-30_13-28-56.png

We apply the regularization only to the diagonal terms, i.e., w1², w2², etc. In the Tikhonov regularizer, written in matrix form, you can have off-diagonal terms as well and get all sorts of regularization effects - weight decay, low-order fit, high-order fit, etc.

Practical rules:

  • stochastic noise is high frequency
  • deterministic noise is also nonsmooth
  • Constrain learning towards smoother hypotheses (because the noise is not smooth, so we harm the noise by going smooth)

This works because it stops the hypothesis from trying to fit the high-frequency noise (stochastic or deterministic). Generally, small weights correspond to smoother hypotheses.

The holy grail of machine learning is finding Ein which is a great proxy for Eout. If we get that, we can minimize Ein and go home. But there is this slack, this bound etc. Eaug is better at being a proxy for Eout compared to Ein.

Choosing a regularizer

In neural networks, you can do this ⬇️ to eliminate small weights and leave the large ones alone.

assets/screenshot_2017-11-30_14-41-25.png

So, this means we will be near the logical part (±1) of tanh and not the linear part. This is good because then the neural network can implement OR/AND gates and combine them to approximate any function

assets/screenshot_2017-11-30_14-43-57.png

Early stopping is also a form of regularization. The optimal λ is chosen using validation.

assets/screenshot_2017-11-30_14-46-44.png

When there is no noise, we don’t need any λ. The more noise there is, the more λ we need.

This is for stochastic noise 🔝

We get the exact same thing for deterministic noise

assets/screenshot_2017-11-30_14-47-44.png

This should seal the correspondence in your mind: as far as overfitting and its cures are concerned, both kinds of noise behave exactly the same way.

Chapter 13 - Validation

Review

We got the Eaug to minimize now:

assets/screenshot_2017-11-30_14-50-16.png

The general form of regularization was:

assets/screenshot_2017-11-30_14-50-30.png

The choice of λ is made using validation

The validation set

Validation versus regularization

We have, Eout(h) = Ein(h) + overfit penalty

Both regularization and validation deal with this “overfit penalty” 🔝

Regularization gets a handle on the overfit penalty: it estimates the penalty and minimizes Eaug. Validation tries to estimate Eout directly.

Analyzing the Eout estimate

On out-of-sample points (x,y), we have e(h(x), y) as:

  • squared error = (h(x)-y)2
  • binary error = ⟦h(x) ≠ y⟧ (1 if h(x) and y disagree, 0 otherwise)

If we take the expected value of this error wrt x, we get Eout: Eout(h) = Ex[e(h(x), y)]

There is a lot of variance because we have only one point here: var[e(h(x), y)] = σ²

To reduce the variance, we move from a single point to a validation set of K points.

The error on validation set is:

assets/screenshot_2017-11-30_15-31-10.png

These folks 🔝 weren’t used in training. Also, they are more than 1, so we hope this is a good estimate for Eout

So, Eout(h) = E[Eval(h)] which is:

assets/screenshot_2017-11-30_15-32-41.png

The variance of Eval(h): using the same analysis as earlier, the covariance (cross) terms vanish since the validation points were chosen independently, and we get:

assets/screenshot_2017-11-30_15-37-53.png

So, we get:

assets/screenshot_2017-11-30_15-38-08.png

Since K is taken out of N, we cannot increase it indefinitely.

One idea: K is put back into N

Divide D into 2 parts, N-K to train and K to validate

assets/screenshot_2017-11-30_16-58-27.png

Train on the N-K points to get g⁻ and validate it; then train on the full N points to get g, and report g. We no longer have a direct estimate of Eout(g), but we know it will be no worse than the estimate for g⁻, due to the behavior of the learning curves

assets/screenshot_2017-11-30_15-46-03.png

This is good because we get to train on the whole set. But bad because the validation error we are reporting is the error on a different hypothesis than the one we finally use. So, our error estimate is bad.

Rule of thumb: K = 20% of N, N/5

Why ‘validation’

When we use validation error for early stopping, we introduce an optimistic bias into our estimate of Eout. This is because Eval has a certain variance, and we ignore Eval when it is high but take it when it is low.

Model selection

The main purpose of validation set is to help with model selection. We have M models to choose from.

We will use Dtrain to train and get a hypothesis gm⁻ for each model. We evaluate each using Dval and pick the best.

assets/screenshot_2017-11-30_17-26-38.png When you pick the best 🔝, there’s an optimistic bias

The output is gm* which is the best model trained on the complete D

Eval(gm*) is a biased estimate of Eout(gm*)

assets/screenshot_2017-11-30_17-30-15.png

The bias diminishes with growing K

  • The curves are going up because we have fewer datapoints left for training
  • the 2 curves are getting closer together because the Eval estimate of Eout is getting better and better

We can quantify the optimistic bias. If you think about it, selecting the best hypothesis from the M options (by picking the model with the lowest Eval) is just regular training: Dval is being used for “training” on the set of finalists.

Hval = {g1, g2, …, gM}

Using Hoeffding and VC!

assets/screenshot_2017-11-30_17-56-32.png

So, we have an added term of ln M. Hence, if we use validation with 100 points to select among 20 candidate models (or parameter values), the bias is larger than when selecting among 2 (only by a log factor, but still)

Data contamination: We have Ein, Etest, Eval

If we use the data to make choices, we are contaminating it and reducing its ability to estimate real performance.

Training set: completely contaminated. Validation set: somewhat contaminated. Test set: totally clean.

Cross validation

We have the following chain of reasoning.

Eout(g) → this is what we want to estimate; g is our hypothesis trained on the full N datapoints
Eout(g⁻) → this is our proxy for Eout(g); g⁻ is obtained by training on only N-K points
Eval(g⁻) → this is our estimate of Eout(g⁻), computed on the K validation points; once we use it to pick the best of the M models, it carries an optimistic bias

assets/screenshot_2017-11-30_18-10-27.png

We want small K so that g⁻ is fairly close to g. We want large K so that Eval(g⁻) is a good proxy for Eout(g⁻) (due to reduced variance)

Leave one out

We will use N-1 points for training. This means g⁻ will be very close to g. But Eval(g⁻), computed on a single point, is not a good estimate of Eout(g⁻) due to huge variance. So, let’s see

We have Dn: the dataset D with the nth point removed

assets/screenshot_2017-11-30_20-47-51.png

So, the final hypothesis learned from Dn is gn⁻. We have: en = Eval(gn⁻) = e(gn⁻(xn), yn). On its own, each en is a very noisy estimate of Eout.

If we do this once for each of the N points and average the N estimates, the variance drops considerably - we’ll call the average ECV.

assets/screenshot_2017-11-30_20-52-35.png

So, we have the best of both worlds, large K and small K
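A small leave-one-out sketch (Python/NumPy; the fit/predict interface is made up for illustration):

```python
import numpy as np

def loo_cv_error(X, y, fit, predict):
    """E_cv: N training sessions on N-1 points, averaging the errors e_n."""
    errors = []
    for n in range(len(y)):
        mask = np.arange(len(y)) != n
        g_minus = fit(X[mask], y[mask])              # g_n^-: trained without point n
        errors.append((predict(g_minus, X[n]) - y[n]) ** 2)
    return np.mean(errors)

# Choosing between the constant and the linear model on 3 points:
fit_const  = lambda X, y: np.mean(y)
pred_const = lambda g, x: g
fit_lin    = lambda X, y: np.polyfit(X, y, 1)
pred_lin   = lambda g, x: np.polyval(g, x)

X = np.array([-1.0, 0.2, 1.0]); y = np.array([0.1, 0.4, 0.3])
print(loo_cv_error(X, y, fit_const, pred_const))
print(loo_cv_error(X, y, fit_lin, pred_lin))
```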

Demonstration: Let’s use CV to choose between a linear and constant model for this dataset

assets/screenshot_2017-11-30_20-57-22.png

We get ECV:

assets/screenshot_2017-11-30_20-57-43.png

Same for constant model:

assets/screenshot_2017-11-30_20-58-04.png

It turns out ECV for constant model is lower, we’ll choose that. That would be the right choice, the datapoints were indeed generated by a constant target function with noise. Using ECV will give you the right answer on average.

Another example: We can use CV to find out what order of polynomial to use for digit classification

assets/screenshot_2017-11-30_21-00-51.png

We’ll use 500 points for training and the rest for testing. The nonlinear transformation is a polynomial transform of up to 5th order

assets/screenshot_2017-11-30_21-02-01.png

We’ll have 500 training sessions, each with 499 points. We get this curve:

assets/screenshot_2017-11-30_21-02-51.png

We note how ECV is a good proxy for Eout. CV tells us to stop around 5-7; we can stop at 6. Here’s the plot:

assets/screenshot_2017-11-30_21-04-36.png

The CV result is smooth, good Eout

There is a problem with “Leave one out”: you’ll have N training sessions (with N-1 points each time). So, in practice we leave more than one point out - K points at a time.

assets/screenshot_2017-11-30_21-07-21.png

This works nicely if N is big: the number of training sessions goes down, and we don’t take a huge hit on the number of training points (N-K is still large if N is big)

Rule of thumb: 10-fold Cross validation works nicely in practice.
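The k-fold version is the same idea with K points left out per session; a sketch reusing the fit/predict interface from the leave-one-out example above:

```python
import numpy as np

def k_fold_cv_error(X, y, fit, predict, k=10, seed=0):
    """Split D into k folds; validate on each fold once, train on the rest."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errs = []
    for val in np.array_split(idx, k):
        train = np.setdiff1d(idx, val)
        g = fit(X[train], y[train])
        errs.append(np.mean((predict(g, X[val]) - y[val]) ** 2))
    return np.mean(errs)
```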

Chapter 14 - Support Vector Machines

Review

The whole lecture in a slide:

assets/screenshot_2017-11-30_21-12-26.png

SVM - Classification King

Maximizing the margin

For linearly separable data, we can have different separating lines. The ones with the largest margin are the best.

We can see that as the margin increases, the separating line looks better and better (even though Ein is the same in all cases), because a bigger margin tolerates noise better

assets/screenshot_2017-11-30_22-08-54.png

Can we solve for w that maximizes the margin? Recall the dichotomies, the growth function? We recall that the perceptron can shatter 3 points completely. So, we have 8 dichotomies:

assets/screenshot_2017-11-30_22-12-51.png

Each dichotomy’s max margin:

assets/screenshot_2017-11-30_22-13-43.png

If we accept dichotomies with a mandatory minimum margin, we reject some dichotomies, so the growth function decreases, dVC decreases, we have better generalization chances. This is just like regularization.

Finding w with large margin

Let xn be the nearest data point to the line/plane/hyperplane: wTx=0 (this is the separating surface) The distance b/w the separating plane and the nearest point is the margin

2 preliminary technicalities:

  • normalize w
    • we’ll want the min distance to be 1, so we scale distances down accordingly. So, |wTxn| = 1
  • We’ll pull out w0, the bias term, and call it b. So, the vector w is now just [w1, …, wd]
  • So, the equation of the plane becomes: wTx+b = 0

Now, we can compute the distance between the nearest point xn and the plane wTx+b = 0 where |wTxn + b| = 1 So, we have:

assets/screenshot_2017-12-01_17-12-35.png

Vector w is ⊥ to the plane in the X space (eg: for y=x, the vector [1, -1] is ⊥ to the line) Proof: Take 2 points on the plane, x' and x". Since they lie on the plane, we have:

wTx' + b = 0 and wTx" + b = 0

So, we have the difference: wT(x’-x”) = 0

This 🔝 means that w vector is ⊥ to (x’-x”) vector Since x’-x” lies on the plane, w is ⊥ to the plane

assets/screenshot_2017-12-01_17-15-47.png

Now the distance b/w xn and plane is:

  • Take any point x on the plane
  • Projection of xn - x on w is the required distance of xn from plane

assets/screenshot_2017-12-01_17-18-09.png

So, the distance is w·(xn-x) divided by the magnitude of w. Also, w·(xn-x) is just wT(xn-x). We take the absolute value, so:

assets/screenshot_2017-12-01_17-20-26.png

Or:

assets/screenshot_2017-12-01_17-20-46.png

We can add and subtract b to get:

assets/screenshot_2017-12-01_17-21-02.png

The 2nd group of terms is 0 since x lies on the plane, and the 1st group was scaled to be 1. So, we have the distance (the margin) as:

assets/screenshot_2017-12-01_17-21-34.png

Hence, the resulting optimization problem:

assets/screenshot_2017-12-01_17-23-06.png

We are maximizing the margin, subject to the scaling of w such that it makes the nearest point be at a distance of 1

assets/screenshot_2017-12-01_17-27-26.png

Also, instead of maximizing 1/|w|, we can equivalently minimize:

assets/screenshot_2017-12-01_17-28-02.png

So, the problem becomes:

assets/screenshot_2017-12-01_17-29-36.png

The Solution

We’ll use quadratic programming to solve this

This is related to the optimization problem we had earlier with Regularization:

assets/screenshot_2017-12-01_17-34-07.png

We’ll form the Lagrangian by putting the constraints into the objective (moving the 1 to the other side so each constraint reads yn(wTxn + b) - 1 ≥ 0, and multiplying each by a multiplier αn):

assets/screenshot_2017-12-01_17-46-23.png

We are minimizing this 🔝 wrt w, b and maximizing wrt αn ≥ 0. We take the gradient of the Lagrangian wrt w and b and set it to 0:

assets/screenshot_2017-12-01_17-48-02.png

So, we get:

assets/screenshot_2017-12-01_17-48-21.png

Substituting this back into the original Lagrangian:

assets/screenshot_2017-12-01_17-48-59.png we get:

assets/screenshot_2017-12-01_17-49-50.png

So, we have:

assets/screenshot_2017-12-01_17-50-12.png

Which is the same as:

assets/screenshot_2017-12-01_17-50-28.png

Expanding the formula to see what matrices are passed to QP:

assets/screenshot_2017-12-01_17-51-17.png

Along with the conditions:

assets/screenshot_2017-12-01_17-51-35.png

This is passed to the QP package, which gives us the α-s back. (What we hand the QP solver is a convex function, which it can optimize reliably.)

So, we have our α-s: α = [α1, …, αN]. We can get w by:

assets/screenshot_2017-12-01_17-52-48.png

The α vector is mostly zeros. For the interior points, the constraint yn(wTxn + b) ≥ 1 is satisfied with room to spare (it is not tight), so those αn must be 0

assets/screenshot_2017-12-01_17-54-49.png

This is similar to the case in regularization where the C was so large that we actually could get to wlin and λ was 0 there. For the cases where we actually had to compromise, we had a positive λ, so is the case here with our α-s.

assets/screenshot_2017-12-01_17-55-41.png

The points xn with αn > 0 are the support vectors. They are the circled ones in the diagram:

assets/screenshot_2017-12-01_17-59-45.png

The support vectors achieve the margin for them: yn(wTxn + b) = 1

Also, since α-s for non support vectors are 0, we can do:

assets/screenshot_2017-12-01_18-01-52.png

Since the support vectors effectively carry the parameters of the model, fewer support vectors means a lower effective dVC and better generalization. So, getting the best separator using SVMs has a generalization dividend as well. Also, b is simple to get: apply “yn(wTxn + b) = 1” to any support vector.
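For concreteness, a sketch of the whole pipeline using the cvxopt QP solver (an assumption - any QP package will do; the data must be linearly separable for the hard-margin problem to have a solution, and a tiny ridge is added to Q for numerical stability):

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    """Solve the dual: min 1/2 a^T Q a - 1^T a, s.t. y^T a = 0, a >= 0,
    with Q_nm = y_n y_m x_n^T x_m; then recover w and b."""
    N = len(y)
    Yx = y[:, None] * X
    Q = Yx @ Yx.T + 1e-8 * np.eye(N)               # tiny ridge for stability
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(Q), matrix(-np.ones(N)),
                     matrix(-np.eye(N)), matrix(np.zeros(N)),
                     matrix(y.reshape(1, -1).astype(float)), matrix(0.0))
    alpha = np.array(sol['x']).ravel()
    w = ((alpha * y)[:, None] * X).sum(axis=0)     # w = sum_n alpha_n y_n x_n
    sv = int(np.argmax(alpha))                     # any support vector
    b = y[sv] - w @ X[sv]                          # from y_n (w^T x_n + b) = 1
    return w, b, alpha
```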

Nonlinear transforms

Currently, we are talking only about linearly separable case, where all the points are strictly linearly separable. If they aren’t linearly separable, we can apply a non-linear transformation and get them to be linearly separable in that higher dimension space.

Recall, we had:

assets/screenshot_2017-12-01_17-49-50.png

Here, we are passing the inner products of pairs of x vectors to the QP - just inner products of two d-dimensional vectors, where d is the dimensionality of the X space. So, if we apply a nonlinear transformation to a million-dimensional space, we’ll have to take inner products of two million-dimensional vectors, but the optimization problem won’t get any harder.

assets/screenshot_2017-12-01_18-11-11.png

The w-s will belong to the z space. The SVs live in the Z space. In the X space, the Z-space hyperplane with the largest margin looks like this say, with the “pre-images” of support vectors highlighted (Support Vectors are defined in the Z-space):

assets/screenshot_2017-12-01_18-15-27.png The margins are in the Z space

Here, we have only 4 support vectors, so, the w is only defined by 4 parameters in the Z space. This after going to a million dimensional space! So, good generalization.

Generalization result:

assets/screenshot_2017-12-01_18-18-26.png

Actually, we need to run several runs and get the expected values:

assets/screenshot_2017-12-01_18-19-18.png

So, SVMs are awesome because you get to go to high dimensions without paying much for the computation of doing so and without paying in terms of generalization.

Kernel methods take this one step further, you won’t need to pay for getting the inner product as well. This will enable you to go to infinite dimensional spaces.

Chapter 15 - Kernel Methods

Review

SVMs are classifiers with maximum margin. They allow us to go to higher dimensional spaces without paying for generalization and complexity.

This means, we get a complex final hypothesis (g) but our hypothesis set is “simple”

The kernel trick

The only thing we need from the Z space is the inner product. If we can get that somehow, we can use the Z space

assets/screenshot_2017-12-01_19-26-00.png We need z in one more place as well: our final hypothesis. We have: g(x) = sign(wTz + b)

But we know w is:

assets/screenshot_2017-12-01_19-29-31.png

So, this reduces to inner product as well.

What about the b? That is solvable by taking any SV and using this equation:

assets/screenshot_2017-12-01_19-30-34.png

So, here as well, we need just the inner product.

This raises a very interesting possibility. If we can get the inner product of z, we can go to that space – even if we do not know what the z space is – even if z space is infinite dimensional.

The inner product we are talking about, zTz', let's assume it is given to us by a function. For any 2 points in the X space, x and x' ∈ X, we have:

zTz' = K(x, x')

where K is the kernel. The kernel will correspond to some Z space.

Example:

Let’s have a 2nd order transformation φ. So the z vector is: z = φ(x) = [1, x1, x2, x1², x2², x1x2]

Hence, K will be:

assets/screenshot_2017-12-01_19-36-16.png

The trick: is it possible to compute this kernel (evaluate K) without transforming x and x’?

Let’s consider K to be (1 + xTx')². This is a function of x and x', and it gives us an inner product in some Z space (we don’t yet know which)

K(x, x') = (1 + xTx')² = (1 + x1x1' + x2x2')²

Squaring, we get:

assets/screenshot_2017-12-01_19-40-33.png

This would look like the inner product of a 2nd order non-linear transformation if we didn’t have the 2s. It actually still is one, but the x vector is transformed not to (1, x1, x2, x1², x2², x1x2) but to:

assets/screenshot_2017-12-02_09-20-50.png

Hence, K(x, x') = (1 + xTx')² does compute an inner product for a 2nd order transformation without converting to those coordinates first. If we replace the power 2 with power 100, we get the inner product for the 100th-order Z space, without ever having to go there.

This is the polynomial kernel
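A quick numerical check of this kernel trick (Python/NumPy sketch):

```python
import numpy as np

def poly_kernel(x, x_prime, Q=2):
    """K(x, x') = (1 + x^T x')^Q: the Z-space inner product, computed in X space."""
    return (1.0 + x @ x_prime) ** Q

# Explicit 2nd-order transform with the sqrt(2) scaling that makes them match.
phi = lambda x: np.array([1.0, np.sqrt(2)*x[0], np.sqrt(2)*x[1],
                          x[0]**2, x[1]**2, np.sqrt(2)*x[0]*x[1]])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(poly_kernel(x, xp), phi(x) @ phi(xp))   # the two numbers agree
```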

Soft-margin SVM