In the context of linear regression we are trying to solve a very simple problem. Given a sequence of pairs of points $(x_1,y_1),\dots,(x_n,y_n)$, we are trying to find the polynomial that best fits these points with respect to a loss function (least squares is very commonly used in these cases). Essentially, we are trying to come up with a function of the form:
$$y=\beta_0 x^0+\beta_1 x+\beta_2 x^2+\dots+\beta_k x^k$$
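For concreteness, here is a minimal NumPy sketch of this setup (the data, the degree $k$, and the true coefficients are all made up for illustration): the columns of the design matrix are the powers $x^0,\dots,x^k$, and the coefficients are recovered by least squares.

```python
# Minimal sketch: fit a degree-k polynomial by least squares with NumPy.
# The data and the "true" coefficients below are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=x.shape)

k = 2  # polynomial degree
X = np.vander(x, N=k + 1, increasing=True)    # columns: x^0, x^1, ..., x^k
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes ||y - X beta||^2
print(beta)  # close to [1, 2, -3]
```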
Below we show mathematically how to find the set of coefficients that minimizes the squared loss function.
Let $Y\in\mathbb{R}^n$, $X\in\mathbb{R}^{n\times k}$, and $\beta\in\mathbb{R}^k$. Then, in the context of linear regression, we need to find the coefficients $\beta$ that minimize our error function. In this case let the error function be $L:\mathbb{R}^k\to\mathbb{R}$ with formula:
$$L(\beta)=\|Y-X\beta\|^2=(Y-X\beta)^T(Y-X\beta).$$
Our objective is to minimize the loss with respect to $\beta$, hence we need to find $\hat{\beta}=\arg\min_{\beta} L(\beta)$.
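As a sanity check of the notation, here is a direct translation of $L$ into NumPy (the function name and argument shapes are illustrative, not from the original text):

```python
# Direct translation of L(beta) = ||Y - X beta||^2; names are illustrative.
import numpy as np

def squared_loss(beta: np.ndarray, X: np.ndarray, Y: np.ndarray) -> float:
    residual = Y - X @ beta            # Y - X beta, a vector in R^n
    return float(residual @ residual)  # (Y - X beta)^T (Y - X beta)
```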
Before taking the derivative, we expand the expression for the loss:
$$(Y-X\beta)^T(Y-X\beta)=Y^TY-(X\beta)^TY-Y^TX\beta+(X\beta)^TX\beta=Y^TY-2Y^TX\beta+\beta^TX^TX\beta,$$
where we used the fact that $(X\beta)^TY$ is a scalar, so $(X\beta)^TY=\bigl((X\beta)^TY\bigr)^T=Y^TX\beta$.
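The expansion is easy to verify numerically; a quick check on random data (shapes chosen arbitrarily) might look like:

```python
# Numerically verify ||Y - X beta||^2 == Y^T Y - 2 Y^T X beta + beta^T X^T X beta.
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 3  # arbitrary shapes for the check
X = rng.normal(size=(n, k))
Y = rng.normal(size=n)
beta = rng.normal(size=k)

lhs = (Y - X @ beta) @ (Y - X @ beta)
rhs = Y @ Y - 2 * Y @ (X @ beta) + beta @ X.T @ X @ beta
assert np.isclose(lhs, rhs)
```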
In order to compute the derivative of the expanded loss, we first need to show the following two results.
Let $f(\beta)=c^T\beta$ for a fixed vector $c\in\mathbb{R}^k$. Then:
$$f=\sum_{i=1}^{k}c_i\beta_i$$
Hence:
$$\frac{\partial f}{\partial \beta_w}=\sum_{i=1}^{k}c_i\frac{\partial \beta_i}{\partial \beta_w}=c_w$$
So:
$$\nabla_\beta f=c$$
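This identity can be spot-checked with central finite differences; the sketch below uses made-up values of $c$ and $\beta$:

```python
# Finite-difference check that the gradient of f(beta) = c^T beta is c.
import numpy as np

rng = np.random.default_rng(2)
k = 4  # arbitrary dimension
c = rng.normal(size=k)
beta = rng.normal(size=k)

eps = 1e-6
grad = np.array([
    (c @ (beta + eps * e) - c @ (beta - eps * e)) / (2 * eps)
    for e in np.eye(k)  # e ranges over the standard basis vectors
])
assert np.allclose(grad, c)
```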
Now consider the more interesting function $f(\beta)=\beta^TA\beta$, where $A$ is a $k\times k$ matrix with entries $\alpha_{ij}$. Then:
$$f=\sum_{i=1}^{k}\sum_{j=1}^{k}\alpha_{ij}\beta_i\beta_j$$
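Before differentiating, one can confirm numerically that the double sum agrees with the matrix form $\beta^TA\beta$ (random $A$ and $\beta$, dimension made up):

```python
# Check that sum_ij a_ij * beta_i * beta_j equals beta^T A beta.
import numpy as np

rng = np.random.default_rng(3)
k = 3  # arbitrary dimension
A = rng.normal(size=(k, k))
beta = rng.normal(size=k)

double_sum = sum(A[i, j] * beta[i] * beta[j]
                 for i in range(k) for j in range(k))
assert np.isclose(double_sum, beta @ A @ beta)
```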
Differentiating the double sum, we get: