## Transcript

An interesting example of how to use the Approximation Theorem is to consider the situation where we have a collection of points, and we want to find the polynomial that comes closest to passing through them. Calculus tackles this problem using derivatives, but this is linear algebra, so we want to use vectors and matrices to find our answer. And I mentioned the Approximation Theorem because we are going to set up a vector y of our data values, and have our subspace be the set of all polynomials of the appropriate degree, translated into Rn, of course. So the idea will be that we are looking for the vector, or polynomial, that is closest to y.

The technique we use is best derived while doing an example. So let’s try to find the equation of the form y = a + bt, the equation of a line, that best fits the points {(1, 7), (2, 10), (3, 12)}. Well, the first thing we need to decide is what we mean by “best fit”. In general, we want to look at the difference between our data points and our equation, and make that difference as small as possible. So first, we’d better calculate the difference. Our data point (1, 7) will be compared with the point (1, a + b) on the line. Calculating the difference between them, we get the point (0, 7 – a – b). Our data point (2, 10) will be compared with the point (2, a + 2b) on the line. Taking the difference, we get the point (0, 10 – a – 2b). Our data point (3, 12) will be compared with the point (3, a + 3b) on the line. Taking their difference, we get the point (0, 12 – a – 3b).

Now, since the first component of the difference will always be 0, we can ignore it. So, we want to minimize the value (7 – a – b) + (10 – a – 2b) + (12 – a – 3b), right? Well, almost. The problem with this idea is that a large negative value and a large positive value could cancel out to create a small overall sum. We could look at minimizing the absolute values of the differences, but it is actually much easier to simply square all the values to ensure that they are positive. And thus, we have the derivation of the name for our technique: the Method of Least Squares.

So, we are trying to find values for a and b that will minimize the value (7 – a – b)-quantity-squared + (10 – a – 2b)-quantity-squared + (12 – a – 3b)-quantity-squared. Now, this is the point at which we wish to switch into vector mode, as we can think of this as the following: (the norm of the vector [7 – a – b; 10 – a – 2b; 12 – a – 3b])-squared. We can also separate out this vector until it looks as if we’re taking the norm-squared of (the vector [7; 10; 12] – the quantity (a[1; 1; 1] + b[1; 2; 3])).
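This sum-of-squares objective can be written down directly in a few lines of Python (a minimal sketch; the `data` list and the function name `squared_error` are ours):

```python
# Data points from the example: (1, 7), (2, 10), (3, 12).
data = [(1, 7), (2, 10), (3, 12)]

def squared_error(a, b):
    """Sum of squared differences between the data and the line y = a + b*t."""
    return sum((y - (a + b * t)) ** 2 for t, y in data)
```

Minimizing this function of a and b is exactly the problem the rest of the derivation solves.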

So if we try to start generalizing this process instead of thinking specifically about this example, let’s go ahead and set y = [7; 10; 12]. This will be our vector of data values. We can call 1 the vector whose components are all 1, that is, [1; 1; 1], and let’s say the vector t is the vector [1; 2; 3]. This is our vector of t values from our data points. So now we’re trying to find scalars a and b that will minimize the square of the norm of (y – (a1 + bt)). So if we let S be the span of the vectors {1, t}, then we are looking for some s in our span that minimizes the square of the norm of y – s. By the Approximation Theorem, we know that this s is proj(S)(y), so should we find an orthogonal basis for S and proceed to find proj(S)(y)? Well, actually, no. For one thing, that would produce an answer in the coordinates of the orthogonal basis, when we want our answer to be in coordinates with respect to our set {1, t}. (That set is a basis so long as we pick appropriate t values, which, in this course, we will always do, so we don’t need to worry about it.)

But the more important reason is that we don’t have to. Instead, consider that if a1 + bt = proj(S)(y), then y – (a1 + bt) would equal y – proj(S)(y), which equals perp(S)(y). Now, the book refers to this value as the error vector e, since it is the vector that measures the difference between our data values y and the equation values a1 + bt. But I’ve just shown that e is the same as perp(S)(y), which means that e must be in S-perp. Specifically, this means that 1 dotted with e will equal 0, and t dotted with e will equal 0. So again, recalling that e is y – the quantity (a1 + bt), this means that we get that 1 dotted with (y – the quantity (a1 + bt)) = 0, and t dotted with (y – (a1 + bt)) = 0.

The next step is to introduce a matrix. So we’re going to let X be the matrix whose columns are the vectors 1 and t, and now we’ll create a vector a whose entries are our scalars a and b. Well, now our quantity a1 + bt can be written as Xa, and we can write y – (a1 + bt) as y – Xa. And this means that we get that (X-transpose)(y – Xa) equals, we’ll write out X-transpose, times (y – Xa). But of course, each entry of a matrix-vector product is simply the dot product of a row with the column, so we can think of this as the vector whose first entry is the first row, [1, 1, 1], dotted with (y – Xa), and whose second entry is our second row, [1, 2, 3], dotted with (y – Xa). Well, of course, these rows are just our vectors 1 and t, so we’ve got that our first entry is 1 dotted with (y – Xa), and our second entry is t dotted with (y – Xa), which, from our earlier information about our error vector, we know both equal 0.

So we’ve done all this work to find that (X-transpose)(y – Xa) equals the 0 vector. If we go ahead and distribute our X-transpose, then we can write this equation as (X-transpose)y – (X-transpose)Xa = 0. At long last, we can solve for a. First, we’ll get that (X-transpose)Xa = (X-transpose)y. Now, if (X-transpose)X is invertible, then we could say that a = (((X-transpose)X)-inverse)(X-transpose)y. The question becomes, is (X-transpose)X invertible? Well, this will go back to us choosing the appropriate t values, and so again, in this course, you can always assume that the answer is yes. And so, at long last, we have found our vector a. Oh, except for actually calculating (((X-transpose)X)-quantity-inverse)(X-transpose)y.
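The whole recipe, solve (X-transpose)Xa = (X-transpose)y, can be sketched with exact arithmetic using Python’s fractions module (the helper names are ours; rather than computing the inverse explicitly, this sketch solves the normal equations directly, which gives the same a whenever (X-transpose)X is invertible):

```python
from fractions import Fraction

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def solve(A, b):
    """Solve Ax = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    M = [[Fraction(v) for v in A[i]] + [Fraction(b[i])] for i in range(n)]
    for i in range(n):
        # Find a nonzero pivot in column i and swap it into row i.
        p = next(r for r in range(i, n) if M[r][i] != 0)
        M[i], M[p] = M[p], M[i]
        # Eliminate column i from every other row.
        for r in range(n):
            if r != i and M[r][i] != 0:
                f = M[r][i] / M[i][i]
                M[r] = [x - f * y for x, y in zip(M[r], M[i])]
    return [M[i][-1] / M[i][i] for i in range(n)]

def least_squares(X, y):
    """Solve the normal equations (X^T X) a = X^T y for the vector a."""
    Xt = transpose(X)
    Xty = [sum(r * v for r, v in zip(row, y)) for row in Xt]
    return solve(matmul(Xt, X), Xty)
```

For the line example, least_squares([[1, 1], [1, 2], [1, 3]], [7, 10, 12]) returns [14/3, 5/2].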

Let’s leave the abstract world and actually find the solution to the example we started with. So, this does involve a lot of calculations. First, we have to compute (X-transpose)X. So that’s going to be the matrix [1, 1, 1; 1, 2, 3] times the matrix [1, 1; 1, 2; 1, 3], which equals [3, 6; 6, 14]. Next, we use the matrix inverse algorithm to compute ((X-transpose)X)-inverse. I’ll write the steps here, and you can pause if you would like to look at them closer. But we’ll see that our answer is that ((X-transpose)X)-inverse equals the matrix [7/3, -1; -1, 1/2]. Now we can use this to calculate (((X-transpose)X)-inverse)(X-transpose), and we’ll get the matrix [4/3, 1/3, -2/3; -1/2, 0, 1/2]. Finally, we’ll multiply on the right by y to get that (((X-transpose)X)-inverse)(X-transpose)y equals the vector [14/3; 5/2].

And so we see that the line that is the best approximation for our data points {(1, 7), (2, 10), (3, 12)} is y = 14/3 + (5/2)t.

Generalizing this example, we see that to find the equation a0 + a1t + and so on through an(t-to-the-n) that is the best fit for data points {(t1, y1) through (tk, yk)}, we want to set the vector y equal to [y1 through yk], we’ll set the vector a = [a0 through an], and we’ll set the vector 1 still to be a vector whose entries are all 1. The vector t will be the vector whose entries are [t1 through tk]. Now, we’ll also create a vector t-squared, whose entries will be [t1-squared through tk-squared], and so on, eventually getting to the vector t-to-the-n, whose entries are [t1-to-the-n through tk-to-the-n]. Then we will set our matrix X equal to the matrix whose columns are our vectors 1, t, all the way through to t-to-the-n. And then, using the same technique as in our example, we will find that the vector a will be equal to (((X-transpose)X)-inverse)(X-transpose)y.
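Building this X is mechanical, so here is a one-function sketch (the name `design_matrix` is ours):

```python
def design_matrix(ts, n):
    """Matrix whose columns are 1, t, t^2, ..., t^n evaluated at each t-value."""
    return [[t ** j for j in range(n + 1)] for t in ts]
```

For the line example, design_matrix([1, 2, 3], 1) reproduces the X with columns 1 and t.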

For example, let’s find the equation of the form a + bt + c(t-squared) that best fits the data points {(-1, 4), (0, 7), (1, 12)}. So we’ll set a to be our vector of unknown scalars [a; b; c]. y will be our vector of data points [4; 7; 12]. And then we’ll have a vector 1 = [1; 1; 1], a vector t equal to our t values, which are [-1; 0; 1], and our vector t-squared, which will be a vector whose entries are the squares of our t values, so this ends up being [1; 0; 1].

Now we can create our matrix X. Its first column will be the vector [1; 1; 1], its second column will be our t vector, [-1; 0; 1], and its third column will be our t-squared vector, [1; 0; 1]. Now we can begin our calculations. First, we’ll calculate (X-transpose)X, which is written here, and turns out to be the matrix [3, 0, 2; 0, 2, 0; 2, 0, 2]. The next step is to find ((X-transpose)X)-inverse using the matrix inverse algorithm. Again, I’ll put this calculation on the screen, so you can pause if you would like to have a closer look at it. But we’ll see that ((X-transpose)X)-inverse = [1, 0, -1; 0, 1/2, 0; -1, 0, 3/2]. And now we calculate that (((X-transpose)X)-inverse)(X-transpose) is this matrix product, which equals [0, 1, 0; -1/2, 0, 1/2; 1/2, -1, 1/2]. And at long last, we get that our unknown vector a will be (((X-transpose)X)-inverse)(X-transpose)y, so it’s our matrix times our y vector, and it ends up equaling [7; 4; 1].

This means that the equation 7 + 4t + t-squared is the equation of the form a + bt + c(t-squared) that best fits the data points {(-1, 4), (0, 7), (1, 12)}.
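As a quick check: with three data points and three unknown coefficients, the best-fitting quadratic can match the data exactly, and indeed it does here:

```python
def q(t):
    # The fitted quadratic 7 + 4t + t^2 from the example.
    return 7 + 4 * t + t ** 2

# It passes through all three data points, so the error vector is 0.
values = [q(-1), q(0), q(1)]
```

Since the error vector is the zero vector, y was already in the span S for this example.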

There are still a couple of things to note about this method. The first is that the matrix X, and thus the matrix (((X-transpose)X)-inverse)(X-transpose), depends only on the form of the equation we are trying to fit and on the experimental t values, not on the y values. You can imagine a situation where an experiment is run several times, so the t values stay constant but the y values vary. So we could reuse our matrix (((X-transpose)X)-inverse)(X-transpose) for each of our y vectors, which cuts down on a lot of the work.
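For instance, hard-coding the matrix (((X-transpose)X)-inverse)(X-transpose) computed in the line-fitting example lets us fit every re-run with a single matrix-vector product (the second data set here is hypothetical, just to illustrate the reuse):

```python
from fractions import Fraction as F

# ((X^T X)^{-1}) X^T from the line example; it depends only on the
# t-values 1, 2, 3, not on the measured y-values.
M = [[F(4, 3), F(1, 3), F(-2, 3)],
     [F(-1, 2), F(0), F(1, 2)]]

def fit(y):
    # One matrix-vector product per experimental run.
    return [sum(m * v for m, v in zip(row, y)) for row in M]

a_run1 = fit([7, 10, 12])  # original data: gives a = 14/3, b = 5/2
a_run2 = fit([6, 11, 13])  # hypothetical second run at the same t-values
```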

Another thing to note is that if we want one of the ai values in our goal equation to be 0, we need to leave the corresponding vector t-to-the-i out when creating X. If we don’t, our technique may (in fact, probably will) present us with a solution where the ai value is not 0. So basically, we want to make sure that our matrix X is a custom fit for our equation.

Here’s an example to show what I mean. Let’s try to find the equation of the form at + b(t-cubed) that best fits the data points {(-2, -10), (-1, -3), (0, 1), (1, 5), (2, 12)}. So for equations of the form at + b(t-cubed), we will want our matrix X to have columns t and t-cubed, so no 1 and no t-squared because those values don’t appear in our equation. The vector t is, of course, our t values, [-2; -1; 0; 1; 2], and our vector t-cubed will be the cube of all those values, so we’ll get it to be [-8; -1; 0; 1; 8].

Now with X determined, we continue with our calculations as before. First, we compute (X-transpose)X, and we’ll see that it’s the matrix [10, 34; 34, 130]. Next, we need to find ((X-transpose)X)-inverse using the matrix inverse algorithm. You can see the calculations here, and we’ll see that ((X-transpose)X)-inverse equals the matrix [130/144, -34/144; -34/144, 10/144], which we can write as the matrix (1/72)[65, -17; -17, 5]. Now we calculate (((X-transpose)X)-inverse)(X-transpose), as seen here, and we’ll get the matrix (1/12)[1, -8, 0, 8, -1; -1, 2, 0, -2, 1]. And at long last, we’ll multiply this on the right by y, and we’ll get that a = (((X-transpose)X)-inverse)(X-transpose)y, which is (1/12)[42; 6], or [7/2; 1/2].

And so we’ve found that the equation (7/2)t + (1/2)(t-cubed) is the equation of the form at + b(t-cubed) that best fits the data points {(-2, -10), (-1, -3), (0, 1), (1, 5), (2, 12)}.
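We can verify this answer using the orthogonality condition from earlier: the error vector e = y – Xa must be orthogonal to every column of X. A short exact check (the variable names are ours):

```python
from fractions import Fraction as F

t  = [-2, -1, 0, 1, 2]
t3 = [v ** 3 for v in t]            # the t-cubed column
y  = [-10, -3, 1, 5, 12]
a, b = F(7, 2), F(1, 2)             # the fitted coefficients

# Error vector e = y - (a*t + b*t^3)
e = [yi - (a * ti + b * ci) for yi, ti, ci in zip(y, t, t3)]

def dot(u, v):
    return sum(x * z for x, z in zip(u, v))
```

Here dot(t, e) and dot(t3, e) both come out to 0, as required. Note that e is not orthogonal to the all-ones vector, and it doesn’t have to be, since 1 is not a column of this X; that is exactly the point of tailoring X to the form of the equation.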