## Transcript — Introduction

Over the next two lectures, we will look at a real-world application of what we have been doing. We begin by proving a very useful theorem.

The Approximation Theorem: Let W be a finite-dimensional subspace of an inner product space V. If v is a vector in V, then the length of (v – w) is greater than the length of (v minus the projection of v onto W) for all vectors w in W not equal to the projection.

Proof: To prove this theorem, we want to relate v – w to v minus the projection. We can do this by starting with v – w, and then adding and subtracting the projection. In particular, we get that v – w = v minus the projection plus the projection minus w. Observe that v minus the projection is the perpendicular of v onto W, and hence, by definition, is orthogonal to the projection and to w. Hence, using Theorem 9.2.2, we get (the length of (v – w))-squared = (the length of ((v minus the projection) + (the projection minus w)))-squared, which equals (the length of (v minus the projection))-squared plus (the length of (the projection minus w))-squared. We can make this smaller by removing the positive quantity (the length of (the projection minus w))-squared, and hence, we have (the length of (v – w))-squared is greater than (the length of (v minus the projection))-squared as required.

At first glance, the statement of this theorem may seem weird and confusing. Take a minute to think about the statement, and try to figure out what it really is saying. Thinking about the geometric interpretation can really help. What the theorem really says is that the vector in W that is closest to the vector v in V is the projection of v onto W. This is demonstrated geometrically in the picture below.

Onto our real-world application: Assume a scientist wants to relate two quantities x and y, and by theoretical reasons, they know that they are related by an equation of the form, say, y = a0 + a1x + a2x-squared. The scientist wants to determine values of the coefficients a0, a1, and a2 so that this will best model the real world. To do this, the scientist performs some experiments involving the two quantities. Every time they do the experiment, they will get quantities yi and xi. Putting this into their equation, they get a linear equation yi = a0 + a1xi + a2xi-squared. Because of experimentation error, the scientist will want to perform the experiment a lot more than three times. They will end up with a system of linear equations which has more equations than unknowns. Such a system is said to be overdetermined—that is, because there is, in some sense, too many equations to be satisfied. For this reason, and because of experimentation error, this system is most likely going to be inconsistent. But then how is the scientist going to determine appropriate values for a0, a1, and a2? The only thing they can do is to find the values of a0, a1, and a2 that is closest to being a solution for their system. To figure out how to do this, let’s change the problem into a linear algebra problem.

Let’s begin by representing the system in the form Xa = y, where X is the coefficient matrix, y is the vector of y-values, and a is the vector of unknowns. We want to find the vector a that is closest to being a solution. That is, we want to find the vector a such that the length of (Xa – y) is as small as possible. We should be able to use the Approximation Theorem to do this. We notice that Xa is a vector in the columnspace of X, and so we are really trying to minimize the distance from y to the columnspace of X. By the Approximation Theorem, this is the projection of y onto the columnspace of X. Hence, we want to find a vector a such that Xa is equal to the projection of y onto the columnspace of X. Notice that this is guaranteed to be a consistent system, so we now can find our desired vector a.

But although this works, it is not particularly nice. Notice that to find the projection, we need an orthogonal basis for the columnspace of X. That is, we first need to find a basis for the columnspace of X, and then apply the Gram-Schmidt procedure to make it orthogonal. After all of that work, we still need to find the projection, and then solve the system. So, although this method will work, we want to simplify this to make it easier. But how? Notice that we are really trying to minimize the length of (Xa – y), which is the same as the length of (y – Xa). So, subtracting both sides of the equation above from y, we get y – Xa is equal to y minus the projection of y onto the columnspace. But y minus the projection is the perpendicular of y onto the columnspace, which is orthogonal to the columnspace. By the Fundamental Theorem of Linear Algebra, the orthogonal complement of the columnspace is the left nullspace. Hence, we have that y – Xa is a vector in the nullspace of X-transpose, and so, by definition of the left nullspace, we have that (X-transpose)(y – Xa) is the 0 vector. And so rearranging this, we get (X-transpose)Xa = (X-transpose)y.

We now have a simple system of linear equations to solve, which we know is consistent by construction. We call this system the normal system. This is quite remarkable. We have started with the inconsistent system Xa = y, and by simply multiplying both sides by X-transpose, we not only get a consistent system, but the solution of the new system best approximates a solution to the original inconsistent system. This proves another important result—that math is awesome. Before we look at some examples of this, we make one more note that the normal system need not have a unique solution. If it does have infinitely many solutions, then each of the solutions will minimize the length of (Xa – y).

## Examples

Example: Determine the vector x that is closest to being a solution for the system 3x1 – x2 = 4, x1 + 2x2 = 0, 2x1 + x2 = 1. Solution: Let A = [3, -1; 1, 2; 2, 1], y = [4; 0; 1], and x = [x1; x2]. We want to find the vector x that minimizes the length of (Ax – y). So, from the work we just did, we just need to solve the normal system (A-transpose)Ax = (A-transpose)y. We find that (A-transpose)A is invertible, and hence, doing the calculations, we get x = (((A-transpose)A)-inverse)(A-transpose)y, which equals [87/83; -56/83]. Thus, x1 = 87/83 and x2 = -56/83. As usual, when doing an approximation, we should check to see how good of an approximation we got. Plugging these values into the original equations, we get 3x1 – x2 = 3.82, x1 + 2x2 = -0.3, and 2x1 + x2 = 1.42. So, we can see that, in this case, it was a pretty good approximation.

Example: Determine the vector x that minimizes the length of (Ax – b) for the system 2x1 + x2 = -5, -2x1 + x2 = 8, and 2x1 + 3x2 = 1. Solution: Let A = [2, 1; -2, 1; 2, 3], b = [-5; 8; 1], and x = [x1, x2]. Then the normal system gives x = (((A-transpose)A)-inverse)(A-transpose)b, which equals [-25/8; 9/4]. Thus, x1 = -25/8 and x2 = 9/4. Plugging these values into the original equations, we get 2x1 + x2 = -4, -2x1 + x2 = 8.5, and 2x1 + 3x2 = 0.5.

Example: Determine the vector x that minimizes the length of (Ax – b) for the system 3x1 – 2x2 = 1, x1 + 4x2 = 1, and 2x1 + x2 = -2. Solution: Putting this into the normal system, we get x = (((A-transpose)A)-inverse)(A-transpose)b, which equals [0; 0]. Grrr! What happened? This is obviously not a good approximation. Take a minute to see if you can figure out why we got such a bad approximation. Thinking about our geometric interpretation, we see that the error in the approximation is the distance that v is from W. The distance will be greatest when v is orthogonal to W. Indeed, in this example, we see that b is orthogonal to the columnspace of A.

The method we have derived in this lecture for finding the best approximation to a solution of an overdetermined system is called the method of least squares.

In the next lecture, we will use the method of least squares to find polynomials of best fit for a given set of data. This ends this lecture.