Building Linear Regression (Least Squares) with Linear Algebra
A fully linear-algebra-based approach to solving the linear regression problem using Excel or numpy.
With so many sophisticated packages in Python and R at our disposal, we rarely work through the math behind an algorithm each time we have to fit a bunch of data points. But it is sometimes useful to learn the math and solve an algorithm from scratch manually, so that we can build intuition for how it works in the background. During my coursework, one of the statistics lectures involved solving for the intercept, coefficients and R-Square value of a multiple linear regression with just matrix multiplication in Excel, using linear algebra. Before that, I had always used statsmodels OLS in Python or lm() in R to get the intercept and coefficients, and a glance at the R-Square value would tell me how good a fit it was.
Since completing the course, I had forgotten how to solve it using Excel, so I wanted to brush up on the concepts and also write this post so that it could be useful to others as well.
I have worked through this entire post using numpy in my Kaggle notebook. Please review and upvote my notebook if you found this post useful!
Math behind solving Linear Regression using Matrix Multiplication
Let us take a simple linear regression to begin with. We want to find the best fit line through a set of data points: (x1, y1), (x2, y2), … (xn, yn). But what does the best fit mean?
If we can find a slope and an intercept for a single line that passes through all the data points, then that is the best fit line. But in most cases, such a line does not exist! So we settle for a line such that, when a segment is drawn parallel to the y-axis from each data point to the regression line (this segment measures the error of that data point), the sum of the squares of all such errors is minimized. Simple, eh?
In the diagram, the errors of the data points are the red, blue, green, yellow, and purple segments respectively. To formulate this as a matrix problem, consider the linear equation below, where Beta 0 ($\beta_0$) is the intercept and Beta ($\beta_1$) is the slope:

$$y_i = \beta_0 + \beta_1 x_i, \quad i = 1, \dots, n$$
To simplify this notation, we will absorb Beta 0 into the Beta vector. This is done by adding an extra column with 1’s in the X matrix and an extra entry in the Beta vector. Consequently, the matrix form will be:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad \text{i.e.} \quad y = X\beta$$
Then the least squares problem in matrix form is:

$$\min_{\beta}\ \|y - X\beta\|^2$$
Let us consider our initial equation:

$$X\beta = y$$

Multiplying both sides by the transpose of the X matrix:

$$X^T X \beta = X^T y$$

Solving for the Beta vector gives:

$$\beta = (X^T X)^{-1} X^T y$$

where $X^T$ is the transpose of the $X$ matrix and $(X^T X)^{-1}$ is the inverse of $X^T X$; the resulting $\beta$ vector contains the intercept and the slope that minimize the sum of squared errors.
Ufff that is a lot of equations. But it will be simple enough to follow when we solve it with a simple case below.
Solving a toy simple linear regression problem
For simplicity, we will start with a simple linear regression problem which has 4 data points: (1, 1), (2, 3), (3, 3) and (4, 5). So X = [1, 2, 3, 4] and y = [1, 3, 3, 5]. When we convert them into the matrix form described above, we get:

$$X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}, \quad y = \begin{bmatrix} 1 \\ 3 \\ 3 \\ 5 \end{bmatrix}$$
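Working through $\beta = (X^T X)^{-1} X^T y$ by hand:

$$X^T X = \begin{bmatrix} 4 & 10 \\ 10 & 30 \end{bmatrix}, \quad X^T y = \begin{bmatrix} 12 \\ 36 \end{bmatrix}, \quad (X^T X)^{-1} = \frac{1}{20}\begin{bmatrix} 30 & -10 \\ -10 & 4 \end{bmatrix}$$

$$\beta = \frac{1}{20}\begin{bmatrix} 30 \cdot 12 - 10 \cdot 36 \\ -10 \cdot 12 + 4 \cdot 36 \end{bmatrix} = \begin{bmatrix} 0 \\ 1.2 \end{bmatrix}$$

So the intercept is 0, the slope is 1.2, and the best fit line is y = 1.2x.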
Here is numpy code to implement this simple solution (a minimal sketch of the normal-equation computation above):
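```python
import numpy as np

# Data points: (1, 1), (2, 3), (3, 3), (4, 5)
x = np.array([1, 2, 3, 4])
y = np.array([1, 3, 3, 5])

# Add a column of 1's to X for the intercept term
X = np.column_stack((np.ones_like(x), x))

# Normal equation: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y

print(beta)  # ~[0.0, 1.2] -> intercept 0, slope 1.2
```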
Multiple linear regression
Solving multiple linear regression is quite similar to simple linear regression, and we follow the same 6 steps (a numpy sketch of the steps appears after the list):
- Add a new column at the beginning of the X matrix with all 1’s for the intercept
- Take the transpose of the X matrix
- Multiply the X transpose and X matrices
- Find the inverse of this matrix
- Multiply the X transpose with the y matrix
- Multiply the two matrices from steps 4 and 5 to find the intercept and the coefficients
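Here is a minimal numpy sketch of these 6 steps. The two-predictor dataset below is made up purely for illustration; it is not the Volkswagen data used later in the post:

```python
import numpy as np

# Hypothetical data: 5 observations of 2 predictors (for illustration only)
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 5.0]])
y = np.array([3.0, 4.0, 8.0, 9.0, 12.0])

# Step 1: add a column of 1's at the beginning for the intercept
X = np.column_stack((np.ones(len(X_raw)), X_raw))

# Steps 2 and 3: transpose of X, then X^T X
XtX = X.T @ X

# Step 4: inverse of X^T X
XtX_inv = np.linalg.inv(XtX)

# Step 5: X^T y
Xty = X.T @ y

# Step 6: beta = (X^T X)^(-1) X^T y -> [intercept, coef_1, coef_2]
beta = XtX_inv @ Xty
print(beta)
```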
For solving multiple linear regression, I have taken a Volkswagen car dataset from Kaggle.
I have computed all the calculations manually in Excel, taking the first 300 rows from the Volkswagen dataset and keeping only the numerical variables. The regression gives an R-Square score of 0.77.
I urge you to download the Excel workbook and follow the calculations (the formatting of the math font on Google Sheets is not good; you can download and view it in MS Excel for better readability). In the sheet “Explanation”, I have matrix-multiplied X_Transpose and X. This gives us the quantities we need to calculate the model parameters and the R-Square value.
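As a cross-check on the workbook’s arithmetic, here is a short numpy sketch of the standard R-Square computation once beta is known (the function name and variables are mine, for illustration):

```python
import numpy as np

def r_square(X, y, beta):
    """R-Square = 1 - (residual sum of squares / total sum of squares)."""
    y_hat = X @ beta                        # predicted values
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot
```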
Please refer to section 3 of the Kaggle notebook, where I have solved this problem using matrix multiplication.
References:
- Complete Business Statistics book
- My course work