The Effective Number of Parameters

The more complex a model is, the larger its number of parameters. Many parameters are good for reducing training error, but they make the model vulnerable to overfitting. In this case, we can lower model complexity by imposing a penalty on the model. With this penalty term, we keep only the betas that are truly necessary in the model. How is the number of necessary betas (parameters) decided? It depends on the properties of the matrix that is multiplied by the output vector.

The degrees of freedom are related to the eigenvalues of the smoother matrix.

https://stats.stackexchange.com/questions/114434/what-is-meant-by-effective-parameters-in-machine-learning

$$\hat{y}=Sy \\ df(S)=trace(S)$$

Smoother matrix: https://math.stackexchange.com/questions/2784061/how-to-interpret-the-smoother-matrix
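
As a minimal sketch of y_hat = S y and df(S) = trace(S) (my own illustration, assuming numpy, not from the text): a k-nearest-neighbour running-mean smoother puts weight 1/k on the k nearest training points in each row of S, so its effective number of parameters is trace(S) = N/k.

```python
import numpy as np

# A k-nearest-neighbour running mean is a linear smoother: y_hat = S y
# for a fixed N x N matrix S that depends only on the inputs x.
rng = np.random.default_rng(0)
N, k = 50, 5
x = np.sort(rng.uniform(0, 1, N))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)

# Row i of S averages the k training points closest to x_i.
S = np.zeros((N, N))
for i in range(N):
    nearest = np.argsort(np.abs(x - x[i]))[:k]
    S[i, nearest] = 1.0 / k

y_hat = S @ y          # y_hat = S y
df = np.trace(S)       # effective number of parameters
print(df, N / k)       # trace(S) = N/k = 10 for this smoother
```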

The degrees of freedom of a matrix equal the number of independent column vectors (equivalently, the number of non-zero eigenvalues). We can apply an eigenvalue decomposition to the smoother matrix because it is a symmetric matrix.

$$S=X(X^TX+\lambda\Omega)^{-1}X^T =UDU^T \\ SS<S \quad (consequence \; of \; shrinkage) \\ trace(S)=trace(UDU^T)=trace(UU^TD)=trace(D)$$
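
A sketch of the decomposition above, using ridge regression (Omega = I) so that S = X(X^TX + lambda*I)^{-1}X^T is a symmetric smoother (my own illustration, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, lam = 100, 8, 3.0
X = rng.normal(size=(N, p))

# Ridge smoother matrix (Omega = I): symmetric, positive semidefinite.
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

# Eigendecomposition S = U D U^T; shrinkage keeps every eigenvalue in [0, 1].
d, U = np.linalg.eigh(S)
print(d.min() >= -1e-10, d.max() <= 1 + 1e-10)           # True True

# SS "<=" S in the positive semidefinite sense: S - SS has no negative eigenvalues.
print(np.all(np.linalg.eigvalsh(S - S @ S) >= -1e-10))   # True

# df(S) = trace(S) = sum of the eigenvalues, strictly less than p because of shrinkage.
print(np.trace(S), d.sum())                              # equal, and < p = 8
```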

If S is a projection matrix rather than a smoother matrix, trace(D) becomes rank(S), because a projection (idempotent) matrix only has 0 or 1 as eigenvalues. For a smoother, SS is always smaller than S (in the positive semidefinite sense), so each eigenvalue lies between 0 and 1. I think this is the reason for df(S)=trace(S).

If S is an orthogonal projection matrix onto a basis spanned by M parameters, then trace(S)=M=df(S). This quantity replaces d in the Cp statistic.
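
For contrast, a sketch of the projection case (my own illustration): the OLS hat matrix is idempotent, its eigenvalues are exactly 0 or 1, and trace(H) = rank(H) = M, the number of parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 100, 8
X = rng.normal(size=(N, M))

# OLS hat matrix: an orthogonal projection onto the column space of X, so H H = H.
H = X @ np.linalg.solve(X.T @ X, X.T)

eigvals = np.linalg.eigvalsh(H)               # ascending order
print(np.allclose(H @ H, H))                  # idempotent
print(np.allclose(eigvals[-M:], 1.0),
      np.allclose(eigvals[:-M], 0.0))         # eigenvalues are only 0 or 1
print(np.trace(H), np.linalg.matrix_rank(H))  # both equal M = 8
```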

$$Y=f(X)+\epsilon \quad (additive \; model) \\ Var(\epsilon)=\sigma^2_\epsilon$$

Under the above assumption, the following expression holds.

$$\sum^n_{i=1}Cov(\hat{y}_i,y_i)=trace(S)\,\sigma^2_\epsilon \\ df(\hat{y})=\dfrac{\sum^n_{i=1} Cov(\hat{y}_i,y_i)}{\sigma^2_\epsilon}$$
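
A sketch checking the covariance definition by simulation (my own illustration, assuming numpy): holding the smoother S fixed, repeatedly draw y = f(X) + eps, estimate the sum of Cov(y_hat_i, y_i) divided by the noise variance, and compare it with trace(S).

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, lam, sigma = 60, 6, 2.0, 0.5
X = rng.normal(size=(N, p))
f = X @ rng.normal(size=p)                  # true mean function f(X)

# Fixed ridge smoother, reused for every simulated data set.
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

reps = 20000
Y = f + sigma * rng.normal(size=(reps, N))  # reps draws of y = f(X) + eps
Yhat = Y @ S.T                              # each row is y_hat = S y

# df(y_hat) = sum_i Cov(y_hat_i, y_i) / sigma^2, estimated across the draws.
cov_sum = ((Y - Y.mean(0)) * (Yhat - Yhat.mean(0))).mean(0).sum()
print(cov_sum / sigma**2, np.trace(S))      # Monte Carlo estimate ~ trace(S)
```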
The number of parameters d also enters model-selection criteria such as BIC:

$$BIC = -2\,loglik+(\log N)\,d$$
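
A hedged sketch of this formula for a Gaussian linear model (my own illustration; here d counts only the regression coefficients, and sigma^2 is plugged in at its maximum-likelihood estimate):

```python
import numpy as np

def gaussian_bic(y, X):
    """BIC = -2 loglik + (log N) d for an OLS fit, with sigma^2 at its MLE."""
    N, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / N                            # MLE of the noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)  # Gaussian log-likelihood at the fit
    return -2 * loglik + np.log(N) * d

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=200)
# The 2-variable model should usually get the lower (better) BIC.
print(gaussian_bic(y, X[:, :2]), gaussian_bic(y, X))
```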

VC dimension

The Vapnik-Chervonenkis theory provides such a general measure of complexity, and gives associated bounds on the optimism. It measures the complexity of a class of functions by assessing how wiggly its members can be.

The VC dimension of the class $\{f(x,\alpha)\}$ is defined to be the largest number of points that can be shattered by members of $\{f(x,\alpha)\}$.

If a class of functions can shatter three points, that is, separate them into two classes in every one of the $2^3$ possible ways, then its VC dimension is at least 3 (the number of points).
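
A sketch of the shattering idea for lines in the plane (my own illustration): enumerate all 2^3 labelings of three non-collinear points and check, with a small perceptron, that a separating line exists for each, so the VC dimension of linear classifiers in 2D is at least 3.

```python
import numpy as np
from itertools import product

def linearly_separable(points, labels, epochs=1000):
    """Try to find w, b with sign(w.x + b) matching the +/-1 labels (perceptron rule)."""
    X = np.hstack([points, np.ones((len(points), 1))])  # absorb the bias term
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, labels):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary): update
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            return True
    return False

# Three non-collinear points in the plane.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# Shattering: every one of the 2^3 labelings must be realizable by some line.
shattered = all(linearly_separable(pts, np.array(lab))
                for lab in product([-1, 1], repeat=3))
print(shattered)   # True -> lines shatter 3 points, so their VC dimension is at least 3
```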

https://keepmind.net/%EA%B8%B0%EA%B3%84%ED%95%99%EC%8A%B5-vc-dimension/

$$Err_\mathcal{T}\leq \bar{err}+\frac{\epsilon}{2}\left(1+\sqrt{1+\frac{4\bar{err}}{\epsilon}}\right) \quad (binary \; classification) \\ Err_\mathcal{T} \leq \frac{\bar{err}}{(1-c\sqrt{\epsilon})_+} \quad (regression)$$
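
A hedged arithmetic sketch of the classification bound (my own illustration): given a training error err_bar and a value of epsilon (which in turn depends on the VC dimension, the sample size, and the chosen confidence level), the bound is straightforward to evaluate.

```python
import numpy as np

def vc_bound_classification(train_err, eps):
    """Err_T <= err_bar + (eps/2) * (1 + sqrt(1 + 4 * err_bar / eps))."""
    return train_err + 0.5 * eps * (1.0 + np.sqrt(1.0 + 4.0 * train_err / eps))

# Illustrative numbers only; eps itself would come from the VC dimension h,
# the sample size N, and the confidence level.
print(vc_bound_classification(train_err=0.10, eps=0.05))   # 0.2
```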
