In the context of linear regression we are trying to solve a very simple problem. Given a sequence of pairs of points $(x_1, y_1), \dots, (x_n, y_n)$, we are trying to find the polynomial that best fits these points with respect to a loss function (least squares is very commonly used in these cases). Essentially, we are trying to come up with a function of the form:
$$y = \beta_0 x^0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k$$
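Concretely, the polynomial fit can be cast as a linear model by stacking the powers of each $x_i$ into a design matrix. Here is a minimal NumPy sketch (the data values and variable names are illustrative only):

```python
import numpy as np

# Toy data points (x_i, y_i) -- purely illustrative values.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.5, 7.0, 13.5])
k = 2  # degree of the polynomial we want to fit

# Design matrix: row i is (x_i^0, x_i^1, ..., x_i^k), so that
# X @ beta evaluates the candidate polynomial at every x_i at once.
X = np.vander(x, N=k + 1, increasing=True)   # shape (n, k+1)
```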
We show mathematically how to find the set of coefficients which minimizes the square loss function.
Let $Y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times k}$, $\beta \in \mathbb{R}^k$. Then, in the context of linear regression, we need to find the coefficients $\beta$ that minimize our error function. In this case let the error function be $L : \mathbb{R}^k \to \mathbb{R}$ with formula:
$$L(\beta) = \lVert Y - X\beta \rVert^2 = (Y - X\beta)^T (Y - X\beta).$$
Our objective is to minimize the loss with respect to $\beta$, hence we need to find $\hat{\beta}$ such that $L$ is minimized.
Before taking the derivative we expand the expression for the loss:
$$(Y - X\beta)^T (Y - X\beta) = Y^T Y - (X\beta)^T Y - Y^T X\beta + (X\beta)^T X\beta = Y^T Y - 2Y^T X\beta + \beta^T X^T X\beta$$
In order to compute the derivative of the above, we need to show the following results:
Let $f(\beta) = c^T \beta$, for $c \in \mathbb{R}^k$. Then:
$$f = \sum_{i=1}^{k} c_i \beta_i$$
Hence:
$$\frac{\partial f}{\partial \beta_w} = \sum_{i=1}^{k} c_i \frac{\partial \beta_i}{\partial \beta_w} = c_w$$
So:
$$\nabla_\beta f = c$$
Now consider the more interesting function $f(\beta) = \beta^T A \beta$, for $A$ a $k \times k$ square matrix with entries $\alpha_{ij}$. Then:
$$f = \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_{ij} \beta_i \beta_j$$
Differentiating this expression we get:
$$\frac{\partial f}{\partial \beta_w} = \sum_{i=1}^{k}\sum_{j=1}^{k} \alpha_{ij}\,\frac{\partial (\beta_i \beta_j)}{\partial \beta_w} = \sum_{i=1}^{k}\sum_{j=1}^{k} \alpha_{ij}\,(\delta_{i,w}\beta_j + \delta_{j,w}\beta_i) = \sum_{j=1}^{k} \alpha_{wj}\beta_j + \sum_{i=1}^{k} \alpha_{iw}\beta_i = (A\beta)_w + \sum_{i=1}^{k} (A^T)_{w,i}\,\beta_i = (A\beta)_w + (A^T\beta)_w$$
Hence we get that:
$$\nabla_\beta f = (A + A^T)\beta$$
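Both differentiation identities above are easy to sanity-check numerically. The following is a quick finite-difference check (a sketch with made-up random data, assuming NumPy is available):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
c = rng.normal(size=k)
A = rng.normal(size=(k, k))
beta = rng.normal(size=k)

def num_grad(f, b, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at b."""
    g = np.zeros_like(b)
    for w in range(len(b)):
        e = np.zeros_like(b)
        e[w] = eps
        g[w] = (f(b + e) - f(b - e)) / (2 * eps)
    return g

# Gradient of c^T beta should equal c.
print(np.allclose(num_grad(lambda b: c @ b, beta), c))
# Gradient of beta^T A beta should equal (A + A^T) beta.
print(np.allclose(num_grad(lambda b: b @ A @ b, beta), (A + A.T) @ beta))
```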
Finally, we are in a position to differentiate the original expression using our two latest results. Let $c = X^T Y$ (so that $c^T \beta = Y^T X \beta$) and let $A = X^T X$. Then we get:
$$\nabla_\beta L = 0 - 2X^T Y + (X^T X + (X^T X)^T)\beta = -2X^T Y + 2X^T X\beta = 2(X^T X\beta - X^T Y)$$
Equating the derivative with $0$ and solving the equation, we get $\hat{\beta}$:
$$X^T X\hat{\beta} - X^T Y = 0 \iff X^T X\hat{\beta} = X^T Y \iff \hat{\beta} = (X^T X)^{-1} X^T Y$$
assuming $X^T X$ is invertible.
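In code, $\hat{\beta}$ can be computed directly from the normal equations. Below is a brief NumPy sketch on synthetic data (illustrative only); note that in practice one solves the linear system $X^T X \hat{\beta} = X^T Y$ rather than forming the inverse explicitly, and a library least-squares routine can serve as a cross-check:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
X = rng.normal(size=(n, k))                        # synthetic design matrix
Y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

# Normal equations: solve X^T X beta = X^T Y (assumes X^T X is invertible).
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))           # True
```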
It has to be noted that this is a very simple linear regression model, which suffers from drawbacks such as overfitting. Other, more complex regression models exist that use regularizers to control the size of the coefficients (Lasso, etc.).