This part is about selecting the necessary independent variables. When several unnecessary variables are included, the model loses both predictive power and explanatory power. There are three ways to pick out the valuable variables:
Best Subset Selection
Forward & Backward Stepwise Selection
Forward Stagewise Regression
Best Subset Selection
For each subset size $k \in \{0, 1, 2, \dots, p\}$, we fit every possible regression with exactly $k$ variables and keep the best one. (The optimal variable for subset size 1 does not have to appear in the optimal subset of size 2.)
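Below is a minimal sketch of best subset selection for a single subset size, assuming NumPy; the function name and the use of residual sum of squares as the selection criterion are illustrative choices, not something prescribed above.

```python
import numpy as np
from itertools import combinations

def best_subset_selection(X, y, k):
    """Fit every regression with exactly k predictors and return the best one.

    Assumes X (n x p) has no intercept column; one is added here.
    'Best' means the smallest residual sum of squares (RSS).
    """
    n, p = X.shape
    best_rss, best_subset, best_coef = np.inf, None, None
    for subset in combinations(range(p), k):
        # Design matrix: intercept plus the k chosen columns.
        Xk = np.column_stack([np.ones(n), X[:, subset]])
        coef, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        rss = np.sum((y - Xk @ coef) ** 2)
        if rss < best_rss:
            best_rss, best_subset, best_coef = rss, subset, coef
    return best_subset, best_coef, best_rss
```

In practice this would be run for every $k$ and the winners compared with a criterion such as cross-validation, since RSS alone always favors larger subsets.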
Forward & Backward Stepwise Selection
Forward: Starting from the null model (only the intercept term exists), we add variables to the model one at a time.
Backward: Starting from the full model, we remove variables from the model one at a time.
Updating a QR decomposition as variables enter or leave can lower the computational cost of each step.
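A rough sketch of the forward version follows (the backward version mirrors it by dropping the variable whose removal hurts the fit least). For simplicity it refits with least squares at each step rather than updating a QR decomposition; the names are illustrative.

```python
import numpy as np

def forward_stepwise(X, y, max_vars):
    """Greedy forward selection: start from the intercept-only model and
    repeatedly add the variable that most reduces the residual sum of squares."""
    n, p = X.shape
    selected, remaining = [], list(range(p))

    def rss(cols):
        Xc = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        coef, *_ = np.linalg.lstsq(Xc, y, rcond=None)
        return np.sum((y - Xc @ coef) ** 2)

    for _ in range(max_vars):
        # Try each remaining variable and keep the one giving the lowest RSS.
        best_j = min(remaining, key=lambda j: rss(selected + [j]))
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```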
Forward Stagewise Regression
It is forward-stepwise regression with more constraints (a sketch follows the steps below).
1) Pick the variable most correlated with the current residual of our fitted model.
2) Regress the residual on this variable (the residual is the target and this variable is the explanatory variable) and compute the simple regression coefficient.
3) Add this coefficient to that variable's coefficient in the existing model and update the residual.
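A small sketch of those three steps, assuming the columns of X are standardized (mean 0, unit norm) and y is centered; the iteration count is an illustrative choice.

```python
import numpy as np

def forward_stagewise(X, y, n_steps=1000):
    """Forward stagewise regression following the three steps above.

    Assumes each column of X is standardized and y is centered.
    """
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.copy()
    for _ in range(n_steps):
        # 1) Variable most correlated with the current residual.
        corr = X.T @ residual
        j = np.argmax(np.abs(corr))
        # 2) Simple regression coefficient of the residual on X_j.
        delta = corr[j] / (X[:, j] @ X[:, j])
        # 3) Add it to the existing coefficient and update the residual.
        beta[j] += delta
        residual -= delta * X[:, j]
    return beta
```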
Shrinkage
Variable selection chooses variables in a discrete way: the decision for each variable is either zero or one (drop it or keep it). Because of this discreteness, the model variance tends to increase. Shrinkage methods are less prone to this variance because they choose variables in a continuous way.
By adding a few more steps, we can build a more elaborate model. Many algorithms emphasize the relationship between the X variables and the residual.
$\hat{y}$ lies in $\mathrm{col}(X)$. Viewed as a linear combination, it is expressed as $\hat{y} = \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_p X_p$. The residual is $\varepsilon = y - \hat{y}$.
Our aim is to minimize $\lVert \varepsilon \rVert^2 = \varepsilon^\top \varepsilon$.
What does this minimization mean in terms of variable selection?
Suppose $X_1$ is an existing variable and $X_2$ is a new variable. In this case, $y$ and $\hat{\beta}_1 X_1$ are fixed, while $\hat{\beta}_2 X_2$ is not yet fixed. Let's assume the norm of $\hat{\beta}_2 X_2$ is fixed and only its direction changes. What matters is the relationship between the existing residual and $X_2$: digging into this relationship is the key to deciding whether the new variable should be added or not, and many methods have been built on it.
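A short sketch of that decision rule: fit the existing variables, form the residual, and check how strongly a candidate column is correlated with it. The function name and arguments are illustrative.

```python
import numpy as np

def residual_correlation(X_existing, y, x_new):
    """Correlation between the residual of the current fit and a candidate variable.

    Assumes X_existing already contains an intercept column if one is needed.
    A candidate nearly uncorrelated with the residual can barely reduce ||eps||^2,
    so it adds little beyond the existing variables.
    """
    coef, *_ = np.linalg.lstsq(X_existing, y, rcond=None)
    residual = y - X_existing @ coef
    return np.corrcoef(residual, x_new)[0, 1]
```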
LAR, PCR, and PLS are representative examples. These algorithms construct new features, and the feature selection proceeds in such a way as to minimize the error.
LAR (Least Angle Regression)
$y = \hat{y} + r$
In this situation, we find the $\beta_j$ whose $x_j$ has a high correlation with $r$.
The fitted coefficient moves from 0 toward $\hat{\beta}_1^{LS}$ until another input variable becomes more strongly correlated with the current error. This approach moves the coefficients closer to the least squares solution while the correlation between every input variable and the error is reduced, so the method mines as much information as possible from the data. When $\hat{\beta}_j$ cannot keep moving toward $\hat{\beta}_j^{LS}$, $\hat{\beta}_j$ is fixed at 0, as in Lasso regression.
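Rather than re-deriving the step-size algebra, here is a hedged sketch that leans on scikit-learn's Lars and LassoLars implementations to trace the coefficient paths described above; the generated data and parameter values are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lars, LassoLars

# Illustrative data: 5 informative predictors out of 20.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
true_beta = np.zeros(20)
true_beta[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]
y = X @ true_beta + rng.standard_normal(200)

# Plain LAR: coefficients move toward the least squares fit, with variables
# entering in order of their correlation with the current residual.
lar = Lars(n_nonzero_coefs=5).fit(X, y)
print("LAR active variables:", np.flatnonzero(lar.coef_))

# Lasso-modified LAR: a coefficient that cannot keep moving toward its
# least squares value is set to 0, as in Lasso regression.
lasso_lar = LassoLars(alpha=0.1).fit(X, y)
print("LassoLars coefficients:", np.round(lasso_lar.coef_, 2))
```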