This part is about selecting the necessary independent variables. With several unnecessary variables, a model loses both predictive power and explanatory power. There are three ways to pick out the valuable variables:
Best Subset Selection
Forward & Backward Stepwise Selection
Forward Stagewise Regression
Best Subset Selection
$k \in \{0, 1, 2, \dots, p\}$
For each subset size $k$, we fit every possible regression with exactly $k$ variables and keep the best one. (The best subset of size 1 is not necessarily contained in the best subset of size 2.)
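To make this concrete, here is a minimal numpy sketch (the function name `best_subset` and the use of the residual sum of squares as the criterion are my own choices, not from the text) that enumerates every subset of a given size and keeps the one with the smallest RSS:

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Fit OLS on every subset of exactly k columns of X and
    return the subset with the smallest residual sum of squares."""
    n, p = X.shape
    best_rss, best_cols, best_beta = np.inf, None, None
    for cols in combinations(range(p), k):
        # intercept column plus the chosen predictors
        Xk = np.column_stack([np.ones(n), X[:, cols]])
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        rss = np.sum((y - Xk @ beta) ** 2)
        if rss < best_rss:
            best_rss, best_cols, best_beta = rss, cols, beta
    return best_cols, best_beta, best_rss
```

Because the loop visits all $\binom{p}{k}$ subsets, this is only feasible for small $p$, which is exactly why the stepwise methods below exist.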
Forward & Backward Stepwise Selection
Forward: Starting from the null model (only the intercept term), we add variables to the model one at a time.
Backward: Starting from the full model, we remove variables from the model one at a time.
In both cases a QR decomposition, updated as variables enter or leave, can lower the computational cost of the repeated fits.
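A rough sketch of the forward variant, assuming numpy and using a plain least squares refit at each step (in practice the refits can share work through an updated QR decomposition, which this sketch does not do; the function name is my own):

```python
import numpy as np

def forward_stepwise(X, y, n_steps):
    """Greedy forward selection: start from the intercept-only model and, at
    each step, add the variable that most reduces the residual sum of squares."""
    n, p = X.shape
    active = []                       # indices of variables already in the model
    for _ in range(min(n_steps, p)):
        best_rss, best_j = np.inf, None
        for j in range(p):
            if j in active:
                continue
            cols = active + [j]
            Xa = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
            rss = np.sum((y - Xa @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        active.append(best_j)
    return active
```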
Forward Stagewise Regression
It is forward-stepwise regression with more constraints (a minimal sketch follows this list):
1) We pick the variable most correlated with the current residual of our fitted model.
2) Regress the residual on this variable (the residual is the target, the variable is the explanatory variable) and calculate the simple regression coefficient.
3) Add this coefficient to the corresponding coefficient of our existing model, update the residual, and repeat.
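A minimal sketch of the stagewise loop described above, assuming the columns of X are centered and standardized (the function name and the fixed iteration count are my own choices):

```python
import numpy as np

def forward_stagewise(X, y, n_iter=1000):
    """Forward stagewise regression: repeatedly regress the current residual on
    the single most correlated variable and add that simple regression
    coefficient to the running coefficient vector."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y - y.mean()                          # intercept-only residual to start
    for _ in range(n_iter):
        corr = X.T @ r                        # (unnormalized) correlations with the residual
        j = np.argmax(np.abs(corr))           # step 1: most correlated variable
        delta = corr[j] / (X[:, j] @ X[:, j]) # step 2: simple regression of r on x_j
        beta[j] += delta                      # step 3: add to the existing coefficient
        r = r - delta * X[:, j]               # update the residual
    return beta
```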
Shrinkage
Variable selection chooses variables in a discrete way: the decision for each variable is only zero or one (remove it or put it in). Because of this all-or-nothing property, the model's variance tends to be high. Shrinkage methods suffer less from this variance because they constrain the coefficients in a continuous way.
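One standard example of continuous shrinkage is ridge regression, which keeps every variable but pulls all coefficients toward zero. A minimal closed-form sketch, assuming numpy (the helper name and the centering choice are mine):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression: shrink coefficients continuously toward zero instead of
    making a 0/1 keep-or-drop decision for each variable."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)        # center so the intercept can be handled separately
    yc = y - y.mean()
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y.mean() - X.mean(axis=0) @ beta
    return intercept, beta
```

Larger values of `lam` shrink the coefficients more strongly, trading a little bias for a reduction in variance.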
By adding some further steps, we can build a more elaborate model. Many of these algorithms emphasize the relationship between the input variables and the residual.
$\hat{y}$ lies in $\mathrm{col}(X)$. Viewed as a linear combination, it is expressed as $\hat{y} = \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_p X_p$. The residual is $\varepsilon = y - \hat{y}$.
Our aim is to minimize $\|\varepsilon\|^2 = \varepsilon \cdot \varepsilon$.
What does this minimization mean in terms of variable selection?
Suppose $X_1$ is an existing variable and $X_2$ is a new variable. In this case $y$ and $\hat{\beta}_1 X_1$ are fixed, while $\hat{\beta}_2 X_2$ is not yet fixed. Assume the norm of $\hat{\beta}_2 X_2$ is fixed and only its direction changes. What matters then is the relationship between the existing residual and $X_2$: digging into that relationship is the key to deciding whether the new variable should be put in or not, and many methods have been built from this idea.
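In practice this check amounts to computing the correlation of each candidate column with the current residual; a small numpy illustration (the function name is mine, not from the text):

```python
import numpy as np

def residual_correlations(X_candidates, residual):
    """Correlation of each candidate column with the current residual;
    the column with the largest absolute value is the most promising addition."""
    Xc = X_candidates - X_candidates.mean(axis=0)
    rc = residual - residual.mean()
    num = Xc.T @ rc
    den = np.linalg.norm(Xc, axis=0) * np.linalg.norm(rc)
    return num / den
```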
LAR, PCR and PLS are representative examples. These methods build new features or select existing ones step by step, in such a way that the error is reduced.
LAR (Least Angle Regression)
$y = \bar{y} + r$
In this situation, we look for the variable $x_j$ whose correlation with the residual $r$ is highest.
$y = \hat{\beta}_0 + \hat{\beta}_1 x_1 + r$
$\hat{\beta}_1$ moves within $[0, \hat{\beta}_1^{LS}]$ until $\dfrac{x_j \cdot r}{\|x_j\|} < \dfrac{x_k \cdot r}{\|x_k\|}$ for some other variable $x_k$.
The fitted coefficient moves from 0 toward $\hat{\beta}_1^{LS}$ until the correlation between another input variable and the residual becomes larger. This approach brings each coefficient closer to its least squares value while the correlation between every input variable and the residual is reduced, so the method extracts as much information as possible from the data. When $\hat{\beta}_j$ never gets the chance to move toward $\hat{\beta}_j^{LS}$, it stays fixed at 0, as in lasso regression.
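As a quick way to see this behavior, here is a sketch using scikit-learn's Lars estimator on toy data (the toy data and settings are my own, not from the text):

```python
import numpy as np
from sklearn.linear_model import Lars

# Toy data: only the first two columns actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = Lars().fit(X, y)

# coef_path_ records the coefficients at each step: variables enter one at a
# time and their coefficients move toward the least squares solution, while
# irrelevant variables stay at (or near) zero.
print(model.coef_path_)
print(model.coef_)
```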