Statistical Decision Theory
When deciding which model to use, the key consideration is minimizing the prediction error. Let's make this idea precise.
The goal is to find a function $f(X)$ that predicts $Y$ well. To do this we need a loss function $L(Y, f(X))$, which penalizes errors in the prediction of $Y$. The squared error loss $L(Y, f(X)) = (Y - f(X))^2$ is a common choice.
✏️ The goal is minimizing the EPE
Under squared error loss, the expected prediction error is $\mathrm{EPE}(f) = \mathrm{E}\,[Y - f(X)]^2$. Conditioning on $X$, we can minimize the EPE pointwise. In terms of a candidate value $c$, the solution is as follows:

$$
f(x) = \operatorname*{argmin}_{c}\ \mathrm{E}_{Y\mid X}\!\left([Y - c]^2 \mid X = x\right) = \mathrm{E}(Y \mid X = x).
$$
This is a conditional expectation. We did not impose any restriction such as a linearity assumption on $f(x)$; nevertheless, the optimal $f$ in this setting turns out to be a conditional expectation.
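A short sketch of why that pointwise minimum is the conditional mean (the standard decomposition, assuming the conditional variance exists):

$$
\mathrm{E}_{Y\mid X}\!\left([Y - c]^2 \mid X = x\right)
= \mathrm{Var}(Y \mid X = x) + \bigl(\mathrm{E}(Y \mid X = x) - c\bigr)^2,
$$

which is minimized by choosing $c = \mathrm{E}(Y \mid X = x)$.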
What is the conditional expectation? To think about it, we first need a conditional distribution: we put probability assumptions on both the input variable $X$ and the target variable $Y$.
🎲Dice Example 🎲
A die is rolled repeatedly, and let $X$ be the number of times a 1 or a 2 shows up. As long as a 1 or a 2 shows up we keep rolling the die (when neither a 1 nor a 2 shows up, we stop rolling).

Let $Y$ be the number of heads in the coin flips: we flip a coin each time the result of rolling the die is a 1 or a 2.

Our aim is to predict $Y$ by using $X$. When the die has shown a 1 or a 2 a total of $x$ times, how can we predict $Y$? We can explain it through the conditional expectation $\mathrm{E}(Y \mid X = x)$, which is determined by the conditional distribution of $Y$ given $X = x$. In this case the coin is flipped $x$ times, so $\mathrm{E}(Y \mid X = x) = x/2$.

This prediction is reasonable: when $X = 2$ the expected number of heads is 1, and when $X = 4$ it is 2. This predictive rule is exactly a regression model; the function that minimizes the expected squared error is the regression function.
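A small simulation sketch of this example, under the reading above (we keep rolling while the die shows a 1 or a 2, $X$ counts those rolls, and a fair coin is flipped once for each of them); the trial count and seed are arbitrary:

```python
import random
from collections import defaultdict

def one_trial(rng: random.Random) -> tuple[int, int]:
    """One run of the experiment: keep rolling while the die shows a 1 or a 2,
    flipping a fair coin once for each such roll.
    Returns (X, Y): X = number of 1-or-2 rolls, Y = number of heads."""
    x = y = 0
    while rng.randint(1, 6) <= 2:      # the roll showed a 1 or a 2, so we continue
        x += 1
        y += rng.random() < 0.5        # one fair-coin flip for this roll
    return x, y

rng = random.Random(0)
stats = defaultdict(lambda: [0, 0])    # x -> [sum of Y, number of trials]
for _ in range(200_000):
    x, y = one_trial(rng)
    stats[x][0] += y
    stats[x][1] += 1

# The empirical conditional mean should be close to E(Y | X = x) = x / 2.
for x in sorted(stats)[:5]:
    total, count = stats[x]
    print(f"x = {x}: empirical mean of Y = {total / count:.3f}, x / 2 = {x / 2:.3f}")
```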
Thus the best prediction of $Y$ at any point $X = x$ is the conditional mean, when best is measured by average squared error.
✏️ Conditional Expectation and KNN
Expectation is approximated by averaging over sample data;
Conditioning at a point is relaxed to conditioning on some region "close" to the target point.
With some conditions on the joint distribution, we can show that the k-nearest-neighbor average $\hat{f}(x) = \mathrm{Ave}\left(y_i \mid x_i \in N_k(x)\right)$ converges:

As $N, k \to \infty$ with $k/N \to 0$, the estimate $\hat{f}(x)$ converges to the conditional expectation $\mathrm{E}(Y \mid X = x)$. However, as the dimension $p$ grows, the rate of convergence decreases, and $\hat{f}(x)$ approaches the conditional expectation more and more slowly.
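A minimal sketch of this local averaging, on hypothetical simulated data (the true regression function, noise level, and $k$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: Y = f(X) + noise, with f the true regression function.
N = 2_000
X = rng.uniform(0.0, 1.0, size=N)

def f_true(x):
    return np.sin(2 * np.pi * x)

Y = f_true(X) + rng.normal(scale=0.3, size=N)

def knn_estimate(x0: float, k: int) -> float:
    """Approximate E(Y | X = x0) by averaging Y over the k nearest x_i."""
    neighbors = np.argsort(np.abs(X - x0))[:k]
    return float(Y[neighbors].mean())

for x0 in (0.1, 0.25, 0.5):
    print(f"x0 = {x0}: kNN average = {knn_estimate(x0, k=50):.3f}, "
          f"true E(Y|X=x0) = {f_true(x0):.3f}")
```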
✏️ Conditional Expectation and Linear Regression
Linear regression assumes the regression function is approximately linear, $f(x) \approx x^T\beta$, where $\beta$ is a constant (non-random) coefficient vector. Regarding $X$ and $Y$ as random variables and plugging this form into the EPE, the optimal solution for $\beta$ is as follows:

$$
\beta = \left[\mathrm{E}(XX^T)\right]^{-1}\mathrm{E}(XY).
$$
This solution can be interpreted as replacing the expectations by averages over the training data, which yields the same $\beta$ as the least-squares method. In conclusion, both k-nearest neighbors and least squares approximate the conditional expectation by averaging over the training data, but the two approaches make different assumptions:
Least squares assumes $f(x)$ is well approximated by a globally linear function.
k-nearest neighbors assumes $f(x)$ is well approximated by a locally constant function.
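Both assumptions can be seen in a small comparison sketch on hypothetical simulated data (the coefficients, noise level, query point $x_0$, and $k$ are arbitrary choices; least squares averages globally through a linear form, while kNN averages locally):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical zero-mean simulated data with a linear regression function.
N, p = 500, 2
X = rng.normal(size=(N, p))
beta_true = np.array([1.5, -2.0])
Y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Least squares: replace the expectations in beta = E(X X^T)^{-1} E(X Y)
# by averages over the training data.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# k-nearest neighbors: a locally constant average around a query point x0.
x0 = np.array([0.5, -0.5])
k = 25
neighbors = np.argsort(np.linalg.norm(X - x0, axis=1))[:k]
knn_pred = Y[neighbors].mean()

print("least-squares beta:", beta_hat)               # should be close to beta_true
print("linear prediction at x0:", float(x0 @ beta_hat))
print("kNN prediction at x0:", float(knn_pred))
```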
✏️ Several Loss functions
When we predict a binary (categorical) variable $G$, we use the zero-one loss function: $L(G, \hat{G}(X))$ takes the value 1 for a wrong prediction and 0 for a correct prediction.
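A tiny sketch of this loss (the class probabilities below are hypothetical; it tabulates the expected zero-one loss of each possible prediction at a single point $x$):

```python
# Zero-one loss: 1 for a wrong prediction, 0 for a correct one.
def zero_one_loss(g_true: int, g_pred: int) -> int:
    return int(g_true != g_pred)

# Hypothetical conditional class probabilities Pr(G = g | X = x) at some point x.
p = {0: 0.3, 1: 0.7}

# Expected zero-one loss of each constant prediction at this x:
for g_pred in (0, 1):
    expected_loss = sum(p[g] * zero_one_loss(g, g_pred) for g in p)
    print(f"predict {g_pred}: expected zero-one loss = {expected_loss:.1f}")
# Predicting the more probable class (here 1) yields the smaller expected loss.
```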