Coefficients (Beta)

Dive into $\hat{\beta}$

$\hat{\beta}^{LS}$ contains $\mathbf{y}$. When we assume the error follows a probability distribution, $\mathbf{y}$ also becomes a random variable with uncertainty. Thus $\hat{\beta}^{LS}$ also follows some distribution related to the distribution of the error.

Don't get confused! In the frequentist view, $\beta$ is a constant. However, the estimate $\hat{\beta}=f\{(X_1,Y_1),\dots,(X_n,Y_n)\}$ is a statistic, so it has a distribution.

$$
\hat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\beta+\epsilon) \\
\hat{\beta}\sim N\left(\beta,\,(\mathbf{X}^T\mathbf{X})^{-1}\sigma^2\right)
$$

$$
\hat{\sigma}^2=\dfrac{1}{N-p-1}\sum^N_{i=1}(y_i-\hat{y}_i)^2 \\
(N-p-1)\hat{\sigma}^2 \sim \sigma^2\chi^2_{N-p-1}
$$

(A sum of squares of standard normal variables follows a chi-square distribution, which is where the $\chi^2_{N-p-1}$ comes from.)
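
A quick numerical sketch of these formulas, assuming simulated data (all sizes, names, and values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: N observations, p predictors, plus an intercept column
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
sigma = 1.5
y = X @ beta_true + rng.normal(scale=sigma, size=N)

# Least squares estimate: beta_hat = (X^T X)^{-1} X^T y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Unbiased estimate of sigma^2, with N - p - 1 degrees of freedom
residuals = y - X @ beta_hat
df = N - p - 1
sigma2_hat = residuals @ residuals / df

# Estimated covariance of beta_hat: (X^T X)^{-1} * sigma2_hat
cov_beta_hat = XtX_inv * sigma2_hat

print(beta_hat)                          # should be close to beta_true
print(np.sqrt(np.diag(cov_beta_hat)))    # estimated standard errors
```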

$$
z_j=\dfrac{\hat{\beta}_j}{\widehat{sd}(\hat{\beta}_j)}=\dfrac{\hat{\beta}_j}{\hat{\sigma}\sqrt{v_j}} \sim t(df) \quad s.t. \; df=N-p-1
$$

$$
\widehat{Var}(\hat{\beta})=(X^TX)^{-1}\hat{\sigma}^2, \quad \widehat{Var}(\hat{\beta}_j)=v_j\hat{\sigma}^2 \\
v_j = j\text{-th diagonal element of } (X^TX)^{-1}
$$
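
Continuing the sketch above, the $z_j$ statistics and their two-sided p-values could be computed like this (variables carry over from the previous snippet):

```python
from scipy import stats

# t-statistic for each coefficient: beta_hat_j / (sigma_hat * sqrt(v_j))
se = np.sqrt(np.diag(cov_beta_hat))          # sigma_hat * sqrt(v_j)
z = beta_hat / se                            # ~ t(N - p - 1) under H0: beta_j = 0
p_values = 2 * stats.t.sf(np.abs(z), df=df)  # two-sided p-values
print(z)
print(p_values)
```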

Now that we know the distribution of the test statistic $z_j$, we can test whether a coefficient is zero and construct a confidence interval for it. When we want to test whether a whole subset of coefficients is zero at once, we can use the test statistic below.

$$
F=\dfrac{\text{among-group var}}{\text{within-group var}}=\dfrac{MSR}{MSE}=\dfrac{(RSS_0-RSS_1)/(p_1-p_0)}{RSS_1/(N-p_1-1)}
$$

Here $RSS_1$ is the residual sum of squares of the bigger model with $p_1+1$ parameters and $RSS_0$ that of the smaller nested model with $p_0+1$ parameters. Under the null hypothesis that the dropped coefficients are zero, $F$ follows an $F_{p_1-p_0,\,N-p_1-1}$ distribution, so we can test whether that group of coefficients is zero. This test gives a hint for eliminating some input variables.
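
Continuing the same sketch, an F-test that the last two coefficients are zero (an arbitrary choice for illustration) could look like this:

```python
# Reduced model: drop the last two predictors (H0: their coefficients are zero)
X0 = X[:, :2]
beta0 = np.linalg.lstsq(X0, y, rcond=None)[0]
rss0 = np.sum((y - X0 @ beta0) ** 2)     # RSS_0, smaller model with p0 predictors
rss1 = np.sum(residuals ** 2)            # RSS_1, full model with p1 predictors

p1, p0 = 3, 1                            # numbers of predictors in each model
F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
p_value = stats.f.sf(F, p1 - p0, N - p1 - 1)
print(F, p_value)
```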

Gauss-Markov Theorem

This theorem says that the least squares estimates are good! It relies on the three assumptions below.

  1. Input variables are fixed constants.

  2. $E(\varepsilon_i)=0$

  3. $Var(\varepsilon_i)=\sigma^2<\infty, \quad Cov(\varepsilon_i,\varepsilon_j)=0 \;\; (i\neq j)$

Under these assumptions, the Gauss-Markov theorem says OLS is the best linear unbiased estimate. (Refer to statkwon.github.io)

For any other linear unbiased estimator $\tilde{\beta}$,

$$
E(\hat{\beta})=E(\tilde{\beta})=\beta \\
Var(\tilde{\beta})-Var(\hat{\beta}) \; \text{is positive semi-definite}
$$

Proof

$$
\tilde{\beta}=Cy, \quad C=(X^TX)^{-1}X^T+D, \quad D: \; K\times n \; \text{matrix}
$$

$$
\begin{aligned}
E[\tilde{\beta}]&=E[Cy]\\
&=E\left[\left((X^TX)^{-1}X^T+D\right)(X\beta+\varepsilon)\right]\\
&=\left((X^TX)^{-1}X^T+D\right)X\beta+\left((X^TX)^{-1}X^T+D\right)E[\varepsilon]\\
&=\left((X^TX)^{-1}X^T+D\right)X\beta \quad\quad (E[\varepsilon]=0)\\
&=(X^TX)^{-1}X^TX\beta+DX\beta\\
&=(I_K+DX)\beta
\end{aligned}
$$

So $\tilde{\beta}$ is unbiased if and only if $DX=0$.
$$
\begin{aligned}
Var(\tilde{\beta})&=Var(Cy)\\
&=C\,Var(y)\,C^T\\
&=\sigma^2CC^T\\
&=\sigma^2\left((X^TX)^{-1}X^T+D\right)\left(X(X^TX)^{-1}+D^T\right)\\
&=\sigma^2\left((X^TX)^{-1}X^TX(X^TX)^{-1}+(X^TX)^{-1}X^TD^T+DX(X^TX)^{-1}+DD^T\right)\\
&=\sigma^2(X^TX)^{-1}+\sigma^2(X^TX)^{-1}(DX)^T+\sigma^2DX(X^TX)^{-1}+\sigma^2DD^T\\
&=\sigma^2(X^TX)^{-1}+\sigma^2DD^T \quad\quad (DX=0)\\
&=Var(\hat{\beta})+\sigma^2DD^T \quad\quad (\sigma^2(X^TX)^{-1}=Var(\hat{\beta}))
\end{aligned}
$$

$DD^T$ is a positive semi-definite matrix ($\because z^TDD^Tz=\lVert D^Tz\rVert^2\ge 0$ for any vector $z$), so $Var(\tilde{\beta})-Var(\hat{\beta})$ is positive semi-definite. Hence $\hat{\beta}^{LS}$ is the BLUE (Best Linear Unbiased Estimator); with the additional normality assumption on the errors it is in fact the MVUE (Minimum Variance Unbiased Estimator).
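
A quick Monte-Carlo illustration of the theorem (not part of the proof): a weighted least squares estimator with an arbitrary weight matrix is still linear and unbiased, so by Gauss-Markov its component-wise variances should be no smaller than those of OLS. Everything below is an illustrative simulation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Fixed design, homoscedastic uncorrelated errors: the Gauss-Markov setting
n, sigma = 50, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, -2.0, 0.5])
W = np.diag(rng.uniform(0.5, 2.0, size=n))   # arbitrary weights (not the inverse error covariance)

ols, wls = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    ols.append(np.linalg.solve(X.T @ X, X.T @ y))          # beta_hat (OLS)
    wls.append(np.linalg.solve(X.T @ W @ X, X.T @ W @ y))  # beta_tilde, another linear unbiased estimator

print(np.mean(ols, axis=0), np.mean(wls, axis=0))  # both close to beta (unbiased)
print(np.var(ols, axis=0), np.var(wls, axis=0))    # OLS variances should be the smaller ones
```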

Always good?

$$
\begin{aligned}
Err(x_0)&=E[(Y-\hat{f}(x_0))^2\mid X=x_0]\\
&=\sigma^2_\epsilon+[E\hat{f}(x_0)-f(x_0)]^2+E[\hat{f}(x_0)-E\hat{f}(x_0)]^2\\
&=\sigma^2_\epsilon+Bias^2(\hat{f}(x_0))+Var(\hat{f}(x_0))\\
&=\text{Irreducible Error}+Bias^2+Variance
\end{aligned}
$$

We can imagine biased estimators that move away from plain OLS. By accepting a little more bias, we can reduce the variance by much more, which means we can predict future values more accurately.
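
A minimal simulation sketch of this trade-off, using ridge regression (a biased shrinkage estimator, discussed in the next section) against plain OLS; the correlated design and the constants are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Small n, correlated predictors: the setting where OLS variance hurts
n, p, sigma = 30, 10, 1.0
beta_true = rng.normal(size=p)
cov = 0.9 * np.ones((p, p)) + 0.1 * np.eye(p)   # strongly correlated columns

def estimation_error(lam):
    """One training set; squared error of the (possibly biased) estimate."""
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X @ beta_true + rng.normal(scale=sigma, size=n)
    # Ridge estimate: (X^T X + lam * I)^{-1} X^T y; lam = 0 gives OLS
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return np.sum((beta_hat - beta_true) ** 2)

for lam in [0.0, 1.0, 10.0]:
    mse = np.mean([estimation_error(lam) for _ in range(500)])
    print(f"lambda = {lam:5.1f}  mean squared estimation error = {mse:.3f}")
```

With a design like this, a moderate amount of shrinkage typically trades a little bias for a large drop in variance, lowering the overall error relative to OLS.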

Ridge, Lasso, and Elastic Net
