Linear Regression and Mean Squared Error
Math Review
When I first heard about machine learning, I imagined a computer that was rewarded every time it gave the right answer. Maybe there were electric carrots and sticks that no one had bothered to tell me about? While I now know as little as I did then about computer hardware, I have learned that machine learning is fundamentally a mathematical process.
Luckily, we’ve been learning about the very mathematical ideas that make machine learning work for years! We’ll review the basics of these concepts and then jump in to linear regression, arguably the foundation of neural networks.
Derivatives
Imagine a function \(\mathcal{L}: \mathbb{R} \to \mathbb{R}\). (Instead of the usual \(f\), we’ll use \(\mathcal{L}\) for reasons that will soon become clear.) The mapping notation means that \(\mathcal{L}\) takes a single real number as input and outputs a single real number. In general, mathematicians tell us to be careful about whether we can differentiate a function. But, we’re computer scientists so we’ll risk it for the biscuit.
Let \(z \in \mathbb{R}\) be the input to \(\mathcal{L}\). The derivative of \(\mathcal{L}\) with respect to its input \(z\) is mathematically denoted by \(\frac{\partial}{\partial z}[\mathcal{L}(z)]\).
Formally, the derivative is defined as \[ \frac{\partial}{\partial z}[\mathcal{L}(z)] = \lim_{h \to 0} \frac{\mathcal{L}(z + h) - \mathcal{L}(z)}{h}. \] If we were to plot \(\mathcal{L}\), the derivative at a point \(z\) would be the slope of the tangent line to the curve at that point.
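To make the definition concrete, here is a minimal sketch (the choice of \(\mathcal{L}(z) = z^2\), the point \(z = 3\), and the step size are mine, purely for illustration) that approximates the derivative by plugging a small \(h\) into the difference quotient:

```python
# Approximate the derivative of L(z) = z^2 at z = 3 with the difference quotient.
def L(z):
    return z ** 2

z, h = 3.0, 1e-6
approx = (L(z + h) - L(z)) / h   # difference quotient with a small h
print(approx)                    # ~6.0, matching the exact derivative 2z = 6
```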
Here are several examples of functions and their derivatives that you might remember from calculus.
| Function: \(\mathcal{L}(z)\) | Derivative: \(\frac{\partial}{\partial z}[\mathcal{L}(z)]\) |
|---|---|
| \(z^2\) | \(2z\) |
| \(z^a\) | \(a z^{a-1}\) |
| \(az + b\) | \(a\) |
| \(\ln(z)\) | \(\frac{1}{z}\) |
| \(e^z\) | \(e^z\) |
Chain Rule and Product Rule
While differentiating a single simple function is easy, we're not always so lucky. Modern machine learning chains many complicated functions together. Fortunately, two rules let us handle these combinations modularly.
Let \(g: \mathbb{R} \to \mathbb{R}\) be another function. Consider the composite function \(g(\mathcal{L}(z))\).
By the chain rule, the derivative of \(g(\mathcal{L}(z))\) with respect to \(z\) is \[ \frac{\partial }{\partial z}[g(\mathcal{L}(z))] = \frac{\partial g}{\partial \mathcal{L}}(\mathcal{L}(z)) \, \frac{\partial}{\partial z}[\mathcal{L}(z)], \] where \(\frac{\partial g}{\partial \mathcal{L}}\) is the derivative of \(g\) with respect to its own argument, evaluated at \(\mathcal{L}(z)\).
Often, we will also multiply functions together. The product rule tells us that \[ \frac{\partial }{\partial z}[g(z) \mathcal{L}(z)] = g(z) \frac{\partial}{\partial z}[\mathcal{L}(z)] + \mathcal{L}(z) \frac{\partial}{\partial z}[g(z)]. \]
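As a quick check, the sketch below verifies both rules numerically against a finite-difference approximation. The particular functions \(\mathcal{L}(z) = z^2\) and \(g(z) = e^z\) are my own illustrative choices:

```python
import math

# Illustrative functions: L(z) = z^2 and g(z) = exp(z), with their exact derivatives.
L = lambda z: z ** 2
dL = lambda z: 2 * z
g = lambda z: math.exp(z)
dg = lambda z: math.exp(z)

z, h = 0.7, 1e-6

# Chain rule: d/dz g(L(z)) = g'(L(z)) * L'(z)
chain_exact = dg(L(z)) * dL(z)
chain_numeric = (g(L(z + h)) - g(L(z))) / h

# Product rule: d/dz [g(z) L(z)] = g(z) L'(z) + L(z) g'(z)
prod_exact = g(z) * dL(z) + L(z) * dg(z)
prod_numeric = (g(z + h) * L(z + h) - g(z) * L(z)) / h

print(chain_exact, chain_numeric)  # the two values agree closely
print(prod_exact, prod_numeric)    # so do these
```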
Gradients
In machine learning, we process high-dimensional data so we are interested in functions with multivariate input. Consider \(\mathcal{L}: \mathbb{R}^d \to \mathbb{R}\). The output of the function is still a real number but the input consists of \(d\) real numbers. We will use the vector \(\mathbf{z} \in \mathbb{R}^d\) to represent all \(d\) inputs \(z_1, \ldots, z_d\).
Instead of the derivative, we will use the partial derivative. The partial derivative with respect to \(z_i\) is denoted by \(\frac{\partial}{\partial z_i}[\mathcal{L}(\mathbf{z})]\). In effect, the partial derivative tells us how \(\mathcal{L}\) changes when we change \(z_i\) while keeping all other inputs fixed.
The gradient stores all the partial derivatives in a vector. The \(i\)th entry of this vector is given by the partial derivative of \(\mathcal{L}\) with respect to \(z_i\). In mathematical notation, \[ \nabla_\mathbf{z} \mathcal{L} = \left[\begin{smallmatrix} \frac{\partial}{\partial z_1}[\mathcal{L}(\mathbf{z})] \\ \vdots \\ \frac{\partial}{\partial z_d}[\mathcal{L}(\mathbf{z})] \\ \end{smallmatrix}\right] \]
Just like the derivative in one dimension, the gradient contains information about the slope of \(\mathcal{L}\) with respect to each of the \(d\) dimensions in its input.
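The sketch below computes a gradient numerically by perturbing one coordinate at a time; the function \(\mathcal{L}(\mathbf{z}) = z_1^2 + 3 z_2\) is a made-up example whose gradient we know exactly:

```python
import numpy as np

# Illustrative function: L(z) = z_1^2 + 3*z_2, so the exact gradient is [2*z_1, 3].
def L(z):
    return z[0] ** 2 + 3 * z[1]

def numerical_gradient(L, z, h=1e-6):
    """Approximate each partial derivative by perturbing one coordinate at a time."""
    grad = np.zeros_like(z)
    for i in range(len(z)):
        z_plus = z.copy()
        z_plus[i] += h
        grad[i] = (L(z_plus) - L(z)) / h
    return grad

z = np.array([2.0, -1.0])
print(numerical_gradient(L, z))  # ~[4.0, 3.0]
```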
Vector and Matrix Multiplication
Vector and matrix multiplication lives at the heart of deep learning. In fact, deep learning really started to take off when researchers realized that the Graphics Processing Unit (GPU), designed for the matrix math behind rendering video games, could also accelerate the matrix multiplications and gradient updates of training neural networks.
Consider two vectors \(\mathbf{u} \in \mathbb{R}^d\) and \(\mathbf{v} \in \mathbb{R}^d\). We will use \(\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^d u_i v_i\) to denote the inner product of \(\mathbf{u}\) and \(\mathbf{v}\). The \(\ell_2\)-norm of \(\mathbf{v}\) is given by \(\|\mathbf{v}\|_2 = \sqrt{\mathbf{v} \cdot \mathbf{v}}\).
Consider two matrices \(\mathbf{A} \in \mathbb{R}^{d \times m}\) and \(\mathbf{B} \in \mathbb{R}^{m \times n}\) where \(d \neq n\). We can only multiply two matrices when their inner dimensions agree. Because the number of columns in \(\mathbf{A}\) equals the number of rows in \(\mathbf{B}\), we can compute \(\mathbf{AB}\). However, because the number of columns in \(\mathbf{B}\) does not equal the number of rows in \(\mathbf{A}\), the product \(\mathbf{BA}\) is not defined.
When we can multiply two matrices, the \((i,j)\) entry in \(\mathbf{AB}\) is given by the inner product between the \(i\)th row of \(\mathbf{A}\) and the \(j\)th column of \(\mathbf{B}\). The resulting dimensions of the matrix product will be the number of rows in \(\mathbf{A}\) by the number of columns in \(\mathbf{B}\).
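Here is a small NumPy sketch of these shape rules; the dimensions \(d = 2\), \(m = 3\), \(n = 4\) are arbitrary illustrative choices:

```python
import numpy as np

d, m, n = 2, 3, 4            # illustrative dimensions with d != n
A = np.random.randn(d, m)
B = np.random.randn(m, n)

AB = A @ B
print(AB.shape)              # (2, 4): rows of A by columns of B

# Entry (i, j) is the inner product of row i of A with column j of B.
i, j = 0, 1
print(np.isclose(AB[i, j], A[i, :] @ B[:, j]))  # True

# B @ A is undefined: the inner dimensions (n = 4 and d = 2) do not agree.
try:
    B @ A
except ValueError as err:
    print("B @ A is not defined:", err)
```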
Inverse Matrices
If we have a scalar equation \(ax = b\), we can simply solve for \(x\) by dividing both sides by \(a\). In effect, we are applying the inverse of \(a\) to \(a\) i.e., \(\frac1{a} a =1\). The same principle applies to matrices. The \(n \times n\) identity matrix generalizes the scalar identity \(1\). This identity matrix is denoted by \(\mathbf{I}_{n \times n} \in \mathbb{R}^{n \times n}\): the on-diagonal entries \((i,i)\) are 1 while the off-diagonal entries \((i,j)\) for \(i\neq j\) are 0.
Consider the matrix equation \(\mathbf{Ax} = \mathbf{b}\) where \(\mathbf{A} \in \mathbb{R}^{d \times d}\), \(\mathbf{x} \in \mathbb{R}^d\), and \(\mathbf{b} \in \mathbb{R}^d\). (Notice that the inner dimensions of \(\mathbf{A}\) and \(\mathbf{x}\) agree so their multiplication is well-defined, and the resulting vector is the same dimension as \(\mathbf{b}\).)
If we want to solve for \(\mathbf{x}\), we can use the matrix inverse. For a matrix \(\mathbf{A}\), we use \(\mathbf{A}^{-1}\) to denote its inverse. The inverse is defined so that \(\mathbf{A}^{-1} \mathbf{A} = \mathbf{I}_{d \times d}\). (Not every matrix has an inverse, but we will assume here that \(\mathbf{A}\) does.) Then, we can solve for \(\mathbf{x}\) by multiplying both sides of the equation by \(\mathbf{A}^{-1}\): \[ \mathbf{A}^{-1} \mathbf{Ax} = \mathbf{A}^{-1} \mathbf{b}. \] Since \(\mathbf{A}^{-1} \mathbf{A} = \mathbf{I}_{d \times d}\), we have that \(\mathbf{I}_{d \times d} \mathbf{x} = \mathbf{x} = \mathbf{A}^{-1} \mathbf{b}\).
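The sketch below solves a small made-up system both ways: by forming the inverse explicitly, and with `np.linalg.solve`, which computes the same solution without explicitly inverting \(\mathbf{A}\) (the usual choice in practice):

```python
import numpy as np

# Illustrative invertible 3x3 system A x = b.
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

x_inv = np.linalg.inv(A) @ b     # the textbook route: x = A^{-1} b
x_solve = np.linalg.solve(A, b)  # solves the system without forming A^{-1}

print(np.allclose(x_inv, x_solve))   # True: both give the same x
print(np.allclose(A @ x_solve, b))   # True: the solution satisfies A x = b
```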
Univariate Linear Regression
Now that we’ve reviewed the basic building blocks of machine learning, we can dive into linear regression. Linear regression is a simple and powerful tool that we will use to understand the basics of deep learning.
Goal
We will begin our study of deep learning in the supervised setting. In this setting, we are given labelled data with input features and an outcome. Formally, we will have \(n\) labelled observations \((x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\). In general, we will have \(y \in \mathbb{R}\). For simplicity, we will assume for now that \(x \in \mathbb{R}\).
Our goal is to process the data and learn a function that approximates the outcomes. In mathematical notation, we want to learn a function \(f: \mathbb{R} \to \mathbb{R}\) so that \(f(x^{(i)}) \approx y^{(i)}\) for the \(n\) labelled observations \(i \in \{1,\ldots,n\}\).
Before we dive into the specific way we will accomplish this with linear regression, let’s discuss the general deep learning framework. This three-step framework gives a flexible scaffolding that we will use to understand almost every topic in this course.
The three-step framework includes:
• The Model: The function that we’ll use to process the input and produce a corresponding output.
• The Loss: The function that measures the quality of the outputs from our model. (Without loss of generality, we will assume that lower is better.)
• The Optimizer: The method of updating the model to improve the loss.
With these general concepts in mind, we’ll explore linear regression.
Linear Model
As its name suggests, linear regression uses a linear model to process the input into an approximation of the output. Let \(w \in \mathbb{R}\) be a weight parameter. The linear model (for one-dimensional inputs) is given by \(f(x) = wx\).
Unlike many deep learning models, we can visualize the linear model since it is given by a line. In the plot, we have the \(n=10\) data points plotted in 2 dimensions. There is one linear model \(f(x) = 2x\) that closely approximates the data and another linear model \(f(x)=\frac12 x\) that does not approximate the data.
Our goal is to learn how to find a linear model that fits the data well. Before we can do this though, we will need to define what it means for a model to “fit the data well”.
Mean Squared Error Loss
Our goal for the loss function is to measure how closely our model's predictions fit the data. Intuitively, we could start with the difference between the prediction and the true outcome, \(f(x^{(i)})-y^{(i)}\).
The issue with this approach is that \(f(x^{(i)})-y^{(i)}\) can be very negative (and hence small) even when \(f(x^{(i)})\) is far from \(y^{(i)}\). A natural fix is to take the absolute value \(|f(x^{(i)}) - y^{(i)}|\). The benefit of the absolute value is that the loss is \(0\) if and only if \(f(x^{(i)}) = y^{(i)}\). However, the absolute value function is not differentiable at \(0\), and differentiability is a property we'll need for optimization. Instead, we use the squared loss:
\[ \mathcal{L}(w) = \frac1{n} \sum_{i=1}^n (f(x^{(i)}) - y^{(i)})^2. \]
Here, we use the mean squared error loss: the average squared difference between the prediction and the true output over the dataset. Unlike the absolute value function, the squared function is differentiable everywhere. In addition, the squared error penalizes predictions that are far from the true output more heavily.
The plot above compares the squared function to the absolute value function. While both are \(0\) if and only if their input is \(0\), the squared function is differentiable everywhere and penalizes large errors more.
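The following sketch computes the mean squared error of the linear model \(f(x) = wx\) on a small made-up dataset (the numbers are mine, chosen to roughly follow \(y = 2x\)), echoing the two example models from the plot:

```python
import numpy as np

def mse_loss(w, x, y):
    """Mean squared error of the linear model f(x) = w * x on a dataset."""
    predictions = w * x
    return np.mean((predictions - y) ** 2)

# Toy data (made up for illustration) that roughly follows y = 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 2.2, 3.9, 6.1, 7.8])

print(mse_loss(2.0, x, y))  # small: f(x) = 2x fits the data well
print(mse_loss(0.5, x, y))  # large: f(x) = x/2 fits the data poorly
```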
Exact Optimization
We now have our model and loss function: the linear model and mean squared error loss. The question becomes how to update the weights of the model to minimize the loss. In particular, we want to find \(w\) that minimizes \(\mathcal{L}(w)\). While the language we’re using is new, the problem is not. We’ve actually been studying how to do this since pre-calculus!
The squared loss is convex (a bowl facing up versus the downward facing cave of concave); see the plot above for a ‘proof’ by picture. In this case, we know there is only one minimum. Not only that but we can find the minimum by setting the derivative to \(0\)!
As such, our game plan is to set \(\frac{\partial \mathcal{L}}{\partial w}\) to \(0\) and solve for \(w\). Recall that \(f(x) = wx\). We will use the linearity of the derivative, the chain rule, and the power rule to compute the derivative of \(\mathcal{L}\) with respect to \(w\):
\[ \frac{\partial}{\partial w}[\mathcal{L}(w)] = \frac1{n} \sum_{i=1}^n \frac{\partial}{\partial w} [(f(x^{(i)}) - y^{(i)})^2] = \frac1{n} \sum_{i=1}^n 2(f(x^{(i)}) - y^{(i)}) \frac{\partial}{\partial w} [(f(x^{(i)}) - y^{(i)})] = \frac1{n} \sum_{i=1}^n 2(w x^{(i)} - y^{(i)}) x^{(i)}. \]
Setting the derivative to \(0\) and solving for \(w\), we get \(\frac2{n} \sum_{i=1}^n w (x^{(i)})^2 = \frac2{n} \sum_{i=1}^n y^{(i)} x^{(i)}\) and so \[ w = \frac{\sum_{i=1}^n y^{(i)} x^{(i)}}{\sum_{i=1}^n (x^{(i)})^2}. \]
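As a quick sanity check, here is a small sketch (using the same made-up data as before, roughly following \(y = 2x\)) that computes this \(w\) and confirms that nudging it in either direction does not decrease the loss:

```python
import numpy as np

# Toy data (made up for illustration), roughly following y = 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 2.2, 3.9, 6.1, 7.8])

# Closed-form minimizer: w = sum_i y_i x_i / sum_i x_i^2.
w_star = np.sum(y * x) / np.sum(x ** 2)
print(w_star)  # close to 2

# Sanity check: the loss at w_star is no larger than at nearby values of w.
mse = lambda w: np.mean((w * x - y) ** 2)
print(mse(w_star) <= mse(w_star + 0.01), mse(w_star) <= mse(w_star - 0.01))  # True True
```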
This is the exact solution to the univariate linear regression problem! We can now use this formula to find the best linear model for our data. But we’re not done with linear regression yet. We assumed that the input was one-dimensional; however, we often have high-dimensional data.
Multivariate Linear Regression
Consider the more general setting where the input is \(d\)-dimensional. As before, we observe \(n\) training observations \((\mathbf{x}^{(1)}, y^{(1)}), \ldots, (\mathbf{x}^{(n)}, y^{(n)})\) but now \(\mathbf{x}^{(i)} \in \mathbb{R}^d\). We will generalize the ideas from univariate linear regression to the multivariate setting.
Linear Model
Instead of using a single weight \(w \in \mathbb{R}\), we will use \(d\) weights \(\mathbf{w} \in \mathbb{R}^d\). Then the model is given by \(f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}\).
Instead of using a line to fit the data, we use a hyperplane. While visualizing the model is difficult in high dimensions, we can still see the model when \(d=2\).
In the plot above, we have \(n=10\) data points in 3 dimensions. There is one linear model \(f(\mathbf{x}) = \begin{bmatrix} 2 \\ \frac12 \end{bmatrix} \cdot \mathbf{x}\) that closely approximates the data and another linear model \(f(\mathbf{x}) = \begin{bmatrix} \frac12 \\ 0 \end{bmatrix} \cdot \mathbf{x}\) that does not approximate the data.
Mean Squared Error
Since the output of \(f\) is still a single real number, we do not have to change the loss function. However, we can use our linear algebra notation to write the mean squared error in an elegant way.
Let \(\mathbf{X} \in \mathbb{R}^{n \times d}\) be the data matrix whose \(i\)th row is \((\mathbf{x}^{(i)})^\top\). Similarly, let \(\mathbf{y} \in \mathbb{R}^n\) be the target vector whose \(i\)th entry is \(y^{(i)}\). We can write the mean squared error loss as \[ \mathcal{L}(\mathbf{w}) = \frac1{n} \| \mathbf{X} \mathbf{w} - \mathbf{y} \|_2^2. \]
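The sketch below checks, on randomly generated data (sizes and values are arbitrary illustrative choices), that the matrix form of the loss agrees with the per-example average of squared errors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 3                      # illustrative sizes
X = rng.normal(size=(n, d))       # rows are the transposed inputs x^(i)
y = rng.normal(size=n)
w = rng.normal(size=d)

# Matrix form of the loss: (1/n) ||Xw - y||_2^2.
loss_matrix = np.sum((X @ w - y) ** 2) / n

# Per-example form: average of (w . x^(i) - y^(i))^2.
loss_mean = np.mean([(w @ X[i] - y[i]) ** 2 for i in range(n)])

print(np.isclose(loss_matrix, loss_mean))  # True: the two expressions agree
```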
Exact Optimization
Just like computing the derivative and setting it to \(0\), we can compute the gradient and set it to the zero vector \(\mathbf{0} \in \mathbb{R}^d\). In mathematical notation, we will set \(\nabla_\mathbf{w} \mathcal{L}(\mathbf{w}) = \mathbf{0}\) and solve for \(\mathbf{w}\).
We will leave the derivation as an exercise on the homework. The final solution is \(\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\), provided \(\mathbf{X}^\top \mathbf{X}\) is invertible.
This is the exact solution to the multivariate linear regression problem!
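Here is a short sketch of the closed-form solution on synthetic data (the sizes, ground-truth weights, and noise level are made up for illustration). It solves the linear system rather than explicitly inverting \(\mathbf{X}^\top \mathbf{X}\), which computes the same \(\mathbf{w}\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])             # hypothetical ground-truth weights
y = X @ w_true + 0.01 * rng.normal(size=n)      # targets with a little noise

# Closed-form solution: w = (X^T X)^{-1} X^T y, computed via a linear solve.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(w_hat)                                                      # close to w_true
print(np.allclose(w_hat, np.linalg.lstsq(X, y, rcond=None)[0]))   # matches least squares
```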
Conclusion
We’ve learned about the basic building blocks of machine learning and used them to understand linear regression. We’ve seen how to define the model, loss, and optimizer. We’ve also seen how to solve the linear regression problem in both the univariate and multivariate settings.
There are two big questions that our discussion of linear regression raises:
• What if the data does not have a linear relationship?
• What happens when our model is too complex to solve for the exact solution?
We will answer these questions in future lectures as we explore logistic regression, neural networks, and gradient descent.