
Linear Regression

· 3 min read
Panos Karagiannis
NLP Engineer

In the context of linear regression we are trying to solve a very simple problem. Given a sequence of pairs of points $(x_1, y_1), \dots, (x_n, y_n)$, we want to find the polynomial that best fits these points with respect to a loss function (least squares is the most common choice). Essentially, we are looking for a function of the form:

$$y = \beta_0 x^0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k$$
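To make this concrete, here is a minimal NumPy sketch (my own illustration, not from the original post, with made-up data and coefficients) that builds the polynomial design matrix and evaluates the model:

```python
import numpy as np

# Hypothetical sample points x_1 ... x_n.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
k = 2  # polynomial degree, chosen for illustration

# Design matrix with columns x^0, x^1, ..., x^k (one row per sample).
X = np.vander(x, N=k + 1, increasing=True)

# Some assumed coefficients beta_0, ..., beta_k.
beta = np.array([1.0, -2.0, 0.5])

# Model prediction y = X @ beta.
y = X @ beta
print(y)
```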

Below we show mathematically how to find the set of coefficients that minimizes the squared loss function.

Let $\boldsymbol{Y} \in \mathbb{R}^{n}$, $\boldsymbol{X} \in \mathbb{R}^{n \times k}$, $\boldsymbol{\beta} \in \mathbb{R}^{k}$. Then, in the context of linear regression, we need to find the coefficients $\boldsymbol{\beta}$ that minimize our error function. In this case let the error function be $\mathcal{L}: \mathbb{R}^{k} \rightarrow \mathbb{R}$ with formula:

$$\mathcal{L}(\boldsymbol{\beta}) = \|\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}\|^2 = (\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta})^{T}(\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}).$$
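As a quick sanity check of this definition, here is a short sketch (my own, with random made-up data) that evaluates the loss both as a squared norm and as the quadratic form; the two values coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 10, 3
X = rng.normal(size=(n, k))   # design matrix
Y = rng.normal(size=n)        # targets
beta = rng.normal(size=k)     # candidate coefficients

residual = Y - X @ beta

# Squared-norm form of the loss.
loss_norm = np.sum(residual ** 2)

# Equivalent quadratic form (Y - X beta)^T (Y - X beta).
loss_quadratic = residual @ residual

print(loss_norm, loss_quadratic)  # the two values agree
```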

Our objective is to minimize the loss with respect to $\boldsymbol{\beta}$; hence we need to find the $\hat{\boldsymbol{\beta}}$ at which $\mathcal{L}$ attains its minimum.

Before taking the derivative we expand the expression for the loss:

$$(\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta})^{T}(\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}) = \boldsymbol{Y}^{T}\boldsymbol{Y} - (\boldsymbol{X}\boldsymbol{\beta})^{T}\boldsymbol{Y} - \boldsymbol{Y}^{T}\boldsymbol{X}\boldsymbol{\beta} + (\boldsymbol{X}\boldsymbol{\beta})^{T}\boldsymbol{X}\boldsymbol{\beta} = \boldsymbol{Y}^{T}\boldsymbol{Y} - 2\boldsymbol{Y}^{T}\boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\beta}^{T}\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{\beta}$$

In order to compute the derivative of the above expression, we first need the following two results.

Let $f(\boldsymbol{\beta}) = \boldsymbol{c}^{T}\boldsymbol{\beta}$. Then:

$$f = \sum_{i=1}^{k} c_{i}\beta_{i}$$

Hence:

$$\frac{\partial f}{\partial \beta_{w}} = \sum_{i=1}^{k} c_{i} \frac{\partial \beta_{i}}{\partial \beta_{w}} = c_{w}$$

So:

$$\nabla_{\boldsymbol{\beta}} f = \boldsymbol{c}$$
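We can verify this identity numerically with a finite-difference check (an illustrative sketch of my own, using random vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4
c = rng.normal(size=k)
beta = rng.normal(size=k)

f = lambda b: c @ b  # f(beta) = c^T beta

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad = np.array([
    (f(beta + eps * e) - f(beta - eps * e)) / (2 * eps)
    for e in np.eye(k)
])

print(np.allclose(grad, c))  # expected: True
```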

Now consider the more interesting function $f(\boldsymbol{\beta}) = \boldsymbol{\beta}^{T}\boldsymbol{A}\boldsymbol{\beta}$, where $\boldsymbol{A}$ is a $k \times k$ matrix. Then:

$$f = \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_{ij}\,\beta_{i}\beta_{j}$$

Differentiating this expression we get:

$$\frac{\partial f}{\partial \beta_{w}} = \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_{ij} \frac{\partial(\beta_{i}\beta_{j})}{\partial \beta_{w}} = \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_{ij}\left(\delta_{i,w}\beta_{j} + \delta_{j,w}\beta_{i}\right) = \sum_{j=1}^{k} \alpha_{wj}\beta_{j} + \sum_{i=1}^{k} \alpha_{iw}\beta_{i} = (\boldsymbol{A}\boldsymbol{\beta})_{w} + (\boldsymbol{A}^{T}\boldsymbol{\beta})_{w}$$

Hence we get that:

$$\nabla_{\boldsymbol{\beta}} f = (\boldsymbol{A} + \boldsymbol{A}^{T})\boldsymbol{\beta}$$
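The same kind of finite-difference check works here (again my own sketch, with a random, not necessarily symmetric matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
k = 4
A = rng.normal(size=(k, k))   # not necessarily symmetric
beta = rng.normal(size=k)

f = lambda b: b @ A @ b       # f(beta) = beta^T A beta

eps = 1e-6
grad = np.array([
    (f(beta + eps * e) - f(beta - eps * e)) / (2 * eps)
    for e in np.eye(k)
])

print(np.allclose(grad, (A + A.T) @ beta))  # expected: True
```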

Finally, we are in a position to differentiate the original expression using these two results. Let $\boldsymbol{c} = \boldsymbol{X}^{T}\boldsymbol{Y}$ (so that $\boldsymbol{c}^{T}\boldsymbol{\beta} = \boldsymbol{Y}^{T}\boldsymbol{X}\boldsymbol{\beta}$) and let $\boldsymbol{A} = \boldsymbol{X}^{T}\boldsymbol{X}$. Then we get:

$$\nabla_{\boldsymbol{\beta}} \mathcal{L} = \boldsymbol{0} - 2\boldsymbol{X}^{T}\boldsymbol{Y} + \left(\boldsymbol{X}^{T}\boldsymbol{X} + (\boldsymbol{X}^{T}\boldsymbol{X})^{T}\right)\boldsymbol{\beta} = -2\boldsymbol{X}^{T}\boldsymbol{Y} + 2\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{\beta} = 2(\boldsymbol{X}^{T}\boldsymbol{X}\boldsymbol{\beta} - \boldsymbol{X}^{T}\boldsymbol{Y})$$
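And one more finite-difference check, this time of the full loss gradient (my own sketch with random data):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 12, 3
X = rng.normal(size=(n, k))
Y = rng.normal(size=n)
beta = rng.normal(size=k)

loss = lambda b: np.sum((Y - X @ b) ** 2)

eps = 1e-6
grad_fd = np.array([
    (loss(beta + eps * e) - loss(beta - eps * e)) / (2 * eps)
    for e in np.eye(k)
])

# Closed-form gradient derived above: 2 (X^T X beta - X^T Y).
grad_closed = 2 * (X.T @ X @ beta - X.T @ Y)
print(np.allclose(grad_fd, grad_closed))  # expected: True
```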

Setting the gradient to $\mathbf{0}$ and solving the equation (assuming $\boldsymbol{X}^{T}\boldsymbol{X}$ is invertible), we get $\hat{\boldsymbol{\beta}}$:

$$\boldsymbol{X}^{T}\boldsymbol{X}\hat{\boldsymbol{\beta}} - \boldsymbol{X}^{T}\boldsymbol{Y} = \mathbf{0} \iff \boldsymbol{X}^{T}\boldsymbol{X}\hat{\boldsymbol{\beta}} = \boldsymbol{X}^{T}\boldsymbol{Y} \iff \hat{\boldsymbol{\beta}} = (\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^{T}\boldsymbol{Y}$$
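In practice the normal equations can be solved directly; the following sketch (mine, with random data) compares that closed-form solution against NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 20, 3
X = rng.normal(size=(n, k))
Y = rng.normal(size=n)

# Closed-form solution via the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's built-in least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(beta_hat, beta_lstsq))  # expected: True
```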

It has to be noted that this is a very simple linear regression model, and it suffers from drawbacks such as overfitting. More sophisticated regression models add regularizers that control the size of the coefficients (e.g. Lasso).
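As one hedged illustration of the regularized approach, ridge regression (which, like Lasso, penalizes large coefficients but keeps a closed-form solution) simply adds $\lambda\boldsymbol{I}$ to $\boldsymbol{X}^{T}\boldsymbol{X}$; the penalty strength below is an arbitrary choice of mine:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 20, 3
X = rng.normal(size=(n, k))
Y = rng.normal(size=n)

lam = 0.1  # arbitrary penalty strength, chosen for illustration

# Ridge regression: minimizes ||Y - X beta||^2 + lam * ||beta||^2,
# which shrinks the coefficients toward zero.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Y)
print(beta_ridge)
```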