Linear method: Regression
This part is about selecting the necessary independent variables. When several unnecessary variables are included, the model loses both predictive power and explanatory power. There are three ways to pick out the valuable variables:
Best Subset Selection
Forward & Backward Stepwise Selection
Forward Stagewise Regression
Best Subset Selection fits every possible regression for each subset size k and keeps the best one. (The optimal subset of size 1 does not have to be contained in the optimal subset of size 2.)
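As a rough sketch of the idea (not from the original notes: the data, the choice of k = 2, and the helper name best_subset are invented for illustration), one can fit an OLS model on every subset of size k and keep the subset with the smallest residual sum of squares:

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Fit OLS on every subset of k columns and return the subset with the lowest RSS."""
    n, p = X.shape
    best_rss, best_cols = np.inf, None
    for cols in combinations(range(p), k):
        Xk = np.column_stack([np.ones(n), X[:, cols]])   # add an intercept column
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        rss = np.sum((y - Xk @ beta) ** 2)
        if rss < best_rss:
            best_rss, best_cols = rss, cols
    return best_cols, best_rss

# toy data: only columns 0 and 2 actually drive y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=100)
print(best_subset(X, y, k=2))   # usually reports (0, 2)
```

Because every subset is refitted, the cost grows combinatorially in the number of variables, which is exactly why the stepwise shortcuts below exist.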
Forward: Starting from the null model (only the intercept term), we add variables to the model one at a time. This process can be sped up with a QR decomposition.
Backward: Starting from the full model, we remove variables from the model one at a time.
QR decomposition can lower our computational cost.
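A minimal forward-stepwise sketch, again as an assumption-laden illustration rather than the notes' own code (it naively refits with numpy's least squares at every step; an efficient version would update a QR factorization instead):

```python
import numpy as np

def forward_stepwise(X, y, n_steps):
    """Greedy forward selection: start from the intercept-only model and, at each
    step, add the variable whose inclusion lowers the residual sum of squares most."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(n_steps):
        best_rss, best_j = np.inf, None
        for j in remaining:
            cols = selected + [j]
            Xc = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
            rss = np.sum((y - Xc @ beta) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

Backward stepwise is the mirror image: start from all columns and drop, at each step, the one whose removal raises the RSS least.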
Forward Stagewise Regression is a more constrained version of forward-stepwise regression:
1) We pick the variable most correlated with the current residual of the fitted model.
2) Treat the residual as the target and this variable as the explanatory variable, and compute the simple regression coefficient.
3) Add this coefficient to that variable's coefficient in the existing model and update the residual (see the sketch after these steps).
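A minimal forward-stagewise sketch under assumptions not stated in the notes (the columns of X are standardized, and the step size 0.1 and the iteration count are arbitrary choices):

```python
import numpy as np

def forward_stagewise(X, y, n_iter=2000, step=0.1):
    """Repeat: find the column most correlated with the current residual,
    regress the residual on it, and add a damped fraction of that simple
    regression coefficient to the running coefficient for that column."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y - y.mean()                 # start from the intercept-only model
    for _ in range(n_iter):
        corr = X.T @ resid               # proportional to correlations if X is standardized
        j = np.argmax(np.abs(corr))      # step 1: most correlated variable
        delta = (X[:, j] @ resid) / (X[:, j] @ X[:, j])   # step 2: coefficient of resid on X_j
        beta[j] += step * delta          # step 3: update the existing coefficient
        resid -= step * delta * X[:, j]  # and the residual
    return beta
```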
Variable selection chooses variables in a discrete way: the decision for each variable is either zero or one (drop it or include it). Because of this property, the model variance tends to increase. Shrinkage methods are less exposed to this variance because they choose variables in a continuous way (a small sketch follows the list below):
Lasso
Ridge
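As a hedged illustration of this continuous choice with scikit-learn (the data and the alpha values below are invented for the example): Ridge's L2 penalty shrinks every coefficient smoothly, while the Lasso's L1 penalty can push irrelevant coefficients exactly to zero, so it shrinks and selects at the same time.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: all coefficients shrink toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: some coefficients become exactly zero

print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))      # the irrelevant columns are driven to (or very near) zero
```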
By adding some further processing, we can build a more elaborate model. Many algorithms place the emphasis on the relationship between the X variables and the residual.
What is the meaning of minimizing $\varepsilon \cdot \varepsilon$ in terms of variable selection?
\begin{split} \min(\varepsilon \cdot \varepsilon) & = \min\big((y-\hat{y}) \cdot (y-\hat{y})\big) \\ & = \min\big((y-(\hat{\beta}_1X_1+\hat{\beta}_2X_2))\cdot(y-(\hat{\beta}_1X_1+\hat{\beta}_2X_2))\big) \\ & \approx \min\big((\hat{\beta}_1X_1 - y)\cdot \hat{\beta}_2X_2\big) \end{split}

The last line keeps only the cross term of the expansion that involves both the existing residual $y-\hat{\beta}_1X_1$ and the new contribution $\hat{\beta}_2X_2$; minimizing it means choosing the new variable whose contribution is most correlated with the existing residual.
LAR, PCR and PLS are representative examples. These algorithms construct new features, and the feature selection proceeds in a way that minimizes this error.
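A small scikit-learn sketch of the derived-feature idea (the data and the component counts are arbitrary choices for illustration, not part of the original notes): PCR regresses on principal components of X, while PLS builds its components using y as well.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=150)

# PCR: new features = top principal components of X, then ordinary regression on them
pcr = make_pipeline(PCA(n_components=3), LinearRegression()).fit(X, y)

# PLS: components are chosen to covary with y, not just to have high variance in X
pls = PLSRegression(n_components=3).fit(X, y)

print(pcr.score(X, y), pls.score(X, y))   # in-sample R^2 of both fits
```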
In the Lasso and Ridge, by contrast, a penalty on the size of the coefficients is imposed on the least-squares objective.
$y$ is regressed on $X_1$. In the view of a linear combination, it is expressed as $\hat{y} = \hat{\beta}_1 X_1$. The residual is $\varepsilon = y - \hat{\beta}_1 X_1$.
Our aim is to minimize $\varepsilon \cdot \varepsilon$.
$X_1$ is an existing variable and $X_2$ is a new variable. In this case, $y$ and $X_1$ are fixed and $X_2$ is an unfixed vector. Let's assume the norm of $X_2$ is fixed and only its direction changes. The important thing is the relationship between the existing residual $\varepsilon$ and $X_2$. Digging into this relationship is the key to deciding whether the new variable should be added or not. Many methods have been developed from this idea.
In this situation, we look for an $X_2$ that has a high correlation with the existing residual $\varepsilon$.
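A small numpy check of this criterion (the data and the four candidate columns are synthetic; column 2 is constructed to matter): the candidate most correlated with the existing residual also leaves the smallest residual sum of squares when it is added.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=n)                 # the existing variable
cands = rng.normal(size=(n, 4))         # candidate new variables
y = 2 * X1 + 1.5 * cands[:, 2] + rng.normal(size=n)

beta1 = (X1 @ y) / (X1 @ X1)            # fit the existing one-variable model
resid = y - beta1 * X1                  # existing residual

for j in range(cands.shape[1]):
    xj = cands[:, j]
    corr = np.corrcoef(resid, xj)[0, 1]
    b = (xj @ resid) / (xj @ xj)                 # coefficient of the residual on the candidate
    rss_after = np.sum((resid - b * xj) ** 2)    # error left after adding the candidate
    print(f"candidate {j}: corr = {corr:+.2f}, remaining RSS = {rss_after:.1f}")
# the candidate with the largest |corr| (here column 2) leaves the smallest RSS
```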
In LAR (Least Angle Regression), the fitted coefficient moves from $0$ toward its least-squares value until another input variable's correlation with the error becomes as large. This approach moves the coefficients closer to the least-squares solution, and the correlation between every input variable and the error is reduced, so the method extracts as much information as possible from the data. When a coefficient cannot keep moving toward its least-squares value, it is fixed at zero, as in the Lasso regression.
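One way to look at this behaviour is the LAR coefficient path, for example with scikit-learn's lars_path on synthetic data (only the function call is real scikit-learn API; the data are made up): method="lar" moves the active coefficients toward their least-squares values until another variable becomes equally correlated with the residual, and method="lasso" adds the modification that fixes a coefficient at zero when it would cross zero.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 4] + rng.normal(size=200)

# alphas: correlation thresholds at each step; active: order in which variables enter;
# coefs[:, i]: coefficient values at step i of the path
alphas, active, coefs = lars_path(X, y, method="lar")
print(active)
print(np.round(coefs, 2))
```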