# Linear Regression

NLP Engineer

In the context of linear regression we are trying to solve a very simple problem. Given a sequence of pairs of points $((x_1, y_1) \dots (x_n, y_n))$ we are trying to find the polynomial that best fits these points with respect to a loss function (least squares is very commonly used in these cases). Essentially we are trying to come up with a function of the form:

$y = \beta_0 x^0 + \beta_1 x + \beta_2 x^2 + \dots \beta_k x^k$

We show mathematically how to find the set of coefficients which minimizes the square loss function.

Let $\boldsymbol{Y} \in \mathbb{R}^{n}, \boldsymbol{X}\in \mathbb{R}^{n\times k}, \boldsymbol{\beta}\in \mathbb{R}^{k}$. Then, in the context of linear regression, we need to find the coefficients $\boldsymbol{\beta}$ that minimize our error function. In this case let the error function be $\mathcal{L}: \mathbb{R}^{n} \rightarrow \mathbb{R}$ with formula:

$\mathcal{L}(\boldsymbol{\beta}) = || {\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta} } || ^2 = (\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta})^{T})(\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta} ) .$

Our objective is to minimize the loss w.r.t. $\boldsymbol{\beta}$ hence we need to find $\hat{\boldsymbol{\beta}}$ such that $\mathcal{L}$ is minimum.

Before taking the derivative we expand the expression for the loss:

$(\ \boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta})^{T}(\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta} ) = \boldsymbol{Y}^{T}\boldsymbol{Y}- (\boldsymbol{X}\boldsymbol{\beta})^{T} \boldsymbol{Y} - \boldsymbol{Y}^{T} \boldsymbol{X}\boldsymbol{\beta} - (\boldsymbol{X}\boldsymbol{\beta})^{T}\boldsymbol{X}\boldsymbol{\beta}= \boldsymbol{Y}^{T}\boldsymbol{Y}-2\boldsymbol{Y}^{T} \boldsymbol{X}\boldsymbol{\beta} - \boldsymbol{\beta}^{T}\boldsymbol{X}^{T}\boldsymbol{\beta}\boldsymbol{X}$

In order to compute the derivative of above, we need to show the following results:

Let $f(\boldsymbol{\beta}) = \boldsymbol{c}^{T} \boldsymbol{\beta}$. Then:

$f= \sum_{i=1}^{k} c_{i}\beta_{i}$

Hence:

$\frac{\partial f} {\partial \beta_{w}}= \sum_{i=1}^{k} c_{i} \frac{\partial \beta_{i}} {\partial \beta_{w}} = c_{w}$

So:

$\nabla_{\boldsymbol{\beta} } f = \boldsymbol{c}$

Now let the more interesting function $f(\boldsymbol{\beta})= \boldsymbol{\beta}^{T} \boldsymbol{A}\boldsymbol{\beta}$, for $A$ k-dimensional square matrix. Then:

$f= \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_{ij} \beta_{i} \beta_{j}$

Differentiating this expression we get:

$\frac{\partial f} {\partial \beta_{w}}= \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_{ij} \frac { \partial (\beta_{i} \beta_{j}) } {\partial \beta_{w}} = \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_{ij} ( \delta_{i,w} \beta_{j} + \delta_{j,w}\beta_{i} )= \sum_{j=1}^{k} \alpha_{wj} \beta_{j} + \sum_{i=1}^{k} \alpha_{iw} \beta_{i} = (\boldsymbol{A} \boldsymbol{\beta} )_{w} + \sum_{i=1}^{k} (\boldsymbol{A})^{T}_{w,i} \, \beta_{i}= (\boldsymbol{A} \boldsymbol{\beta} )_{w} + (\boldsymbol{A}^{T} \boldsymbol{\beta} )_{w}$

Hence we get that:

$\nabla_{\boldsymbol{\beta} } f = (\boldsymbol{A+A^{T}}) \boldsymbol{\beta}$

Finally, we are in place to differentiate the original equation using our two latest results. Let $\boldsymbol{Y}^{T} \boldsymbol{X} = \boldsymbol{c}$ and also let $\boldsymbol{X}^{T}\boldsymbol{X}=\boldsymbol{A}$ . Then we get:

$\nabla_{\boldsymbol{\beta}} \mathcal{L}= \boldsymbol{0} - 2 \boldsymbol{Y}^{T} \boldsymbol{X} + (\boldsymbol{X}^{T}\boldsymbol{X} +(\boldsymbol{X}^{T}\boldsymbol{X})^{T} ) \boldsymbol{\beta}= -2 \boldsymbol{Y}^{T} \boldsymbol{X} + 2\boldsymbol{X}^{T}\boldsymbol{X} \boldsymbol{\beta}= 2(\boldsymbol{X}^{T}\boldsymbol{X} \boldsymbol{\beta} - \boldsymbol{Y}^{T} \boldsymbol{X} )$

Equating the derivative with $\mathbf{0}$ and solving the equation we get $\hat{\boldsymbol{\beta}}$:

$\boldsymbol{X}^{T} \boldsymbol{X} \hat{\boldsymbol{\beta}}-\boldsymbol{X}^{T}\boldsymbol{Y} =\mathbf{0} \iff \boldsymbol{X}^{T} \boldsymbol{X} \hat{\boldsymbol{\beta}}- \boldsymbol{X}^{T}\boldsymbol{Y}=\mathbf{0} \iff \boldsymbol{X}^{T} \boldsymbol{X} \hat{\boldsymbol{\beta}} = \boldsymbol{X}^{T}\boldsymbol{Y} \iff \hat{\boldsymbol{\beta}}= (\boldsymbol{X}^{T} \boldsymbol{X})^{-1} \boldsymbol{X}^{T}\boldsymbol{Y}$

It has to be noted that this is a very simple linear regression model which suffers from many drawbacks such as overfitting. Other more complex regression models exist and use regularizers to control the size of the coefficients (Lasso etc.)