Dive into $\hat{\beta}$

$\hat{\beta}^{LS}$ contains $\mathbf{y}$. When we assume that the error follows a probability distribution, $\mathbf{y}$ also becomes a random variable with uncertainty. Thus $\hat{\beta}^{LS}$ also follows some distribution related to the distribution of the error.
Don't get confused! In a frequentist view, $\beta$ is a constant. However, the estimate of beta, $\hat{\beta}=f\{(X_1,Y_1),\ldots,(X_n,Y_n)\}$, is a statistic, so it has a distribution.
$$\hat{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\beta+\epsilon) \\
\hat{\beta}\sim N(\beta,(\mathbf{X}^T\mathbf{X})^{-1}\sigma^2)$$

$$\hat{\sigma}^2=\dfrac{1}{N-p-1}\sum^N_{i=1}(y_i-\hat{y}_i)^2 \\
(N-p-1)\hat{\sigma}^2 \sim \sigma^2\chi^2_{N-p-1}$$

The sum of squared normal residuals follows a chi-square distribution.
$$z_j=\dfrac{\hat{\beta}_j}{\widehat{sd}(\hat{\beta}_j)}=\dfrac{\hat{\beta}_j}{\hat{\sigma}\sqrt{v_j}} \sim t_{N-p-1}$$

$$\widehat{Var}(\hat{\beta})=(X^TX)^{-1}\hat{\sigma}^2, \quad \widehat{Var}(\hat{\beta}_j)=v_j\hat{\sigma}^2, \quad v_j = j\text{th diagonal element of } (X^TX)^{-1}$$
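As a quick numerical companion (a minimal sketch on simulated data, not code from this post; the names `X`, `y`, `beta_hat`, etc. are my own), the snippet below computes $\hat{\beta}$, $\hat{\sigma}^2$, the standard errors $\hat{\sigma}\sqrt{v_j}$, and the $t$-statistics $z_j$ directly from these formulas.

```python
# Minimal sketch (assumed simulated data): beta_hat, sigma_hat^2, and z_j from the formulas above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # intercept + p inputs
beta_true = np.array([1.0, 2.0, 0.0, -1.5])
y = X @ beta_true + rng.normal(size=N)                        # y = X beta + eps

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                                  # (X^T X)^{-1} X^T y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (N - p - 1)                      # unbiased estimate of sigma^2

v = np.diag(XtX_inv)                                          # v_j = j-th diagonal of (X^T X)^{-1}
se = np.sqrt(sigma2_hat * v)                                  # sd_hat(beta_hat_j)
z = beta_hat / se                                             # ~ t with N - p - 1 df under H0: beta_j = 0
p_values = 2 * stats.t.sf(np.abs(z), df=N - p - 1)
print(np.column_stack([beta_hat, se, z, p_values]).round(3))
```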
Now that we know the distribution of the test statistic $z_j$, we can test whether a coefficient is zero and construct its confidence interval. When we want to test whether a subset of coefficients is zero, we can use the test statistic below.
$$F=\dfrac{\text{among-group variance}}{\text{within-group variance}}=\dfrac{MSR}{MSE}=\dfrac{(RSS_0-RSS_1)/(p_1-p_0)}{RSS_1/(N-p_1-1)}$$

Under the null hypothesis that the dropped coefficients are zero, $F$ follows an $F_{p_1-p_0,\,N-p_1-1}$ distribution, so we can test whether that group of coefficients is zero. This test gives a hint for eliminating some input variables.
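Here is a hedged sketch of that nested-model comparison (again on simulated data with names I made up, not the post's own code): fit a full model and a reduced model that drops the last two inputs, then form $F$ from the two residual sums of squares.

```python
# Nested-model F-test sketch: H0 says the coefficients of the dropped inputs are zero.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N, p1 = 100, 3
X1 = np.column_stack([np.ones(N), rng.normal(size=(N, p1))])  # full model: intercept + p1 inputs
y = X1 @ np.array([1.0, 2.0, 0.0, -1.5]) + rng.normal(size=N)

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    r = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return r @ r

X0 = X1[:, :-2]                                               # reduced model: drop the last two inputs
p0 = X0.shape[1] - 1
RSS1, RSS0 = rss(X1, y), rss(X0, y)

F = ((RSS0 - RSS1) / (p1 - p0)) / (RSS1 / (N - p1 - 1))       # ~ F(p1 - p0, N - p1 - 1) under H0
print(F, stats.f.sf(F, p1 - p0, N - p1 - 1))                  # F statistic and its p-value
```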
Gauss-Markov Theorem
This theorem says that the least squares estimates are good! It rests on the three assumptions below.
Input variables are fixed constants.
$E(\varepsilon_i)=0$
$Var(\varepsilon_i)=\sigma^2<\infty, \quad Cov(\varepsilon_i,\varepsilon_j)=0 \;\; (i\neq j)$
Under these assumptions, OLS is the best linear unbiased estimate by the Gauss-Markov theorem. (Refer to statkwon.github.io.)
That is, for any other linear unbiased estimator $\tilde{\beta}$,

$$E(\hat{\beta})=E(\tilde{\beta})=\beta \\
Var(\tilde{\beta})-Var(\hat{\beta}) : \text{positive semi-definite}$$

Proof
$$\tilde{\beta}=Cy, \quad C=(X'X)^{-1}X'+D, \quad D: \; K\times n \; \text{matrix}$$
$$\begin{aligned}
\operatorname{E}\left[\tilde{\beta}\right] &= \operatorname{E}[Cy] \\
&= \operatorname{E}\left[\left((X'X)^{-1}X'+D\right)(X\beta+\varepsilon)\right] \\
&= \left((X'X)^{-1}X'+D\right)X\beta + \left((X'X)^{-1}X'+D\right)\operatorname{E}[\varepsilon] \\
&= \left((X'X)^{-1}X'+D\right)X\beta \quad\quad (\operatorname{E}[\varepsilon]=0) \\
&= (X'X)^{-1}X'X\beta + DX\beta \\
&= (I_K+DX)\beta
\end{aligned}$$

For $\tilde{\beta}$ to be unbiased for every $\beta$, we therefore need $DX=0$.

$$\begin{aligned}
\operatorname{Var}\left(\tilde{\beta}\right) &= \operatorname{Var}(Cy) \\
&= C\operatorname{Var}(y)C' \\
&= \sigma^2 CC' \\
&= \sigma^2\left((X'X)^{-1}X'+D\right)\left(X(X'X)^{-1}+D'\right) \\
&= \sigma^2\left((X'X)^{-1}X'X(X'X)^{-1}+(X'X)^{-1}X'D'+DX(X'X)^{-1}+DD'\right) \\
&= \sigma^2(X'X)^{-1}+\sigma^2(X'X)^{-1}(DX)'+\sigma^2 DX(X'X)^{-1}+\sigma^2 DD' \\
&= \sigma^2(X'X)^{-1}+\sigma^2 DD' \quad\quad (DX=0) \\
&= \operatorname{Var}\left(\hat{\beta}\right)+\sigma^2 DD' \quad\quad (\sigma^2(X'X)^{-1}=\operatorname{Var}(\hat{\beta}))
\end{aligned}$$

$DD'$ is a positive semi-definite matrix ($\because a'DD'a=\lVert D'a\rVert^2\ge 0$ for any vector $a$), so $Var(\tilde{\beta})-Var(\hat{\beta})$ is positive semi-definite and $\hat{\beta}^{LS}$ is the BLUE (Best Linear Unbiased Estimator).
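A small numerical check of this argument (my own sketch, not from the original post): build another linear unbiased estimator $\tilde{\beta}=Cy$ with $C=(X'X)^{-1}X'+D$, where choosing $D=A(I-H)$ with the hat matrix $H=X(X'X)^{-1}X'$ guarantees $DX=0$, and verify that $Var(\tilde{\beta})-Var(\hat{\beta})=\sigma^2DD'$ has no negative eigenvalues.

```python
# Numerical sketch of the Gauss-Markov argument on a random design (all names assumed).
import numpy as np

rng = np.random.default_rng(1)
n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
H = X @ np.linalg.inv(X.T @ X) @ X.T               # hat matrix
A = rng.normal(size=(K, n))
D = A @ (np.eye(n) - H)                            # rows of D orthogonal to col(X), so D X = 0
print(np.allclose(D @ X, 0))                       # True: beta_tilde stays unbiased

sigma2 = 1.0
C = np.linalg.inv(X.T @ X) @ X.T + D
var_hat = sigma2 * np.linalg.inv(X.T @ X)          # Var(beta_hat)
var_tilde = sigma2 * C @ C.T                       # Var(beta_tilde) = sigma^2 C C'
diff = var_tilde - var_hat                         # should equal sigma^2 D D'
print(np.all(np.linalg.eigvalsh(diff) >= -1e-10))  # all eigenvalues >= 0: positive semi-definite
```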
Always good?
$$\begin{split}
Err(x_0) ={}& E[(Y-\hat{f}(x_0))^2|X=x_0] \\
= \; & \sigma^2_\epsilon+[E\hat{f}(x_0)-f(x_0)]^2+E[\hat{f}(x_0)-E\hat{f}(x_0)]^2 \\
= \; & \sigma^2_\epsilon+\text{Bias}^2(\hat{f}(x_0))+\text{Var}(\hat{f}(x_0)) \\
= \; & \text{Irreducible Error}+\text{Bias}^2+\text{Variance}
\end{split}$$

We can imagine a biased estimator that moves away from old-school OLS. By accepting a little more bias, we may reduce the variance by much more, which means we predict future values more accurately.
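To see the trade-off concretely, here is a hedged simulation sketch of my own (ad hoc design, noise level, and ridge penalty $\lambda$, not the post's code): it estimates $\text{Bias}^2(\hat{f}(x_0))$ and $\text{Var}(\hat{f}(x_0))$ for OLS and for a ridge-shrunken fit by refitting over many noise draws; with moderate shrinkage the variance drop usually outweighs the added bias.

```python
# Monte Carlo sketch of the bias-variance trade-off at a single point x0 (setup is assumed).
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma, lam = 30, 10, 2.0, 20.0
beta = 0.5 * rng.normal(size=p)                      # true coefficients
X = rng.normal(size=(n, p))                          # fixed design
x0 = rng.normal(size=p)                              # prediction point
f_x0 = x0 @ beta                                     # true value f(x0)

def predict(y, lam):
    """x0' beta_hat for ridge with penalty lam (lam = 0 gives plain OLS)."""
    b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return x0 @ b

preds = {"ols": [], "ridge": []}
for _ in range(5000):                                # new noise draw each repetition
    y = X @ beta + rng.normal(scale=sigma, size=n)
    preds["ols"].append(predict(y, 0.0))
    preds["ridge"].append(predict(y, lam))

for name, f_hat in preds.items():
    f_hat = np.array(f_hat)
    bias2 = (f_hat.mean() - f_x0) ** 2               # Bias^2(f_hat(x0))
    var = f_hat.var()                                # Var(f_hat(x0))
    print(name, round(bias2, 3), round(var, 3), round(bias2 + var, 3))
```

The irreducible error $\sigma^2_\epsilon$ sits on top of both totals, which is why only the bias and variance terms are compared here.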
Ridge, Lasso, and Elastic Net