CV and Bootstrap

Cross Validation
Here $\kappa:\{1,\dots,N\}\to\{1,\dots,K\}$ maps observation $i$ to the fold it was assigned to, and $\hat{f}^{-\kappa(i)}$ is the model fit with the $\kappa(i)$-th fold removed:

$$CV(\hat{f})=\frac{1}{N}\sum^N_{i=1}L\big(y_i, \hat{f}^{-\kappa(i)}(x_i)\big)$$

$$CV(\hat{f},\alpha)=\frac{1}{N}\sum^N_{i=1}L\big(y_i,\hat{f}^{-\kappa(i)}(x_i, \alpha)\big)$$

The second form is used for model selection: compute $CV(\hat{f},\alpha)$ over a grid of tuning parameters $\alpha$ and pick the minimizer.
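As a concrete illustration, here is a minimal from-scratch K-fold CV sketch with squared-error loss. The `fit`/`predict` callables and the least-squares learner are illustrative stand-ins, not fixed by the formulas above.

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, K=5, seed=None):
    """Plain K-fold CV: kappa(i) is the fold index of observation i,
    and f^{-kappa(i)} is the model fit with that fold held out."""
    rng = np.random.default_rng(seed)
    N = len(y)
    kappa = rng.permutation(N) % K          # fold assignment kappa(i)
    losses = np.empty(N)
    for k in range(K):
        test = kappa == k
        model = fit(X[~test], y[~test])     # f^{-k}: fit without fold k
        losses[test] = (y[test] - predict(model, X[test])) ** 2
    return losses.mean()                    # CV(f_hat)

# Ordinary least squares as the (illustrative) learner
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=100)
print(k_fold_cv(X, y, fit, predict, K=5, seed=0))
```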
When the fit is linear in $y$, i.e. $\hat{y}=Sy$ for a smoother matrix $S$ (as in least squares), the leave-one-out CV error satisfies

$$\frac{1}{N}\sum^N_{i=1}\left[y_i-\hat{f}^{-i}(x_i)\right]^2=\frac{1}{N}\sum^N_{i=1}\left[\frac{y_i-\hat{f}(x_i)}{1-S_{ii}}\right]^2$$

On the left side each observation is removed in turn and the model refit; on the right side only the single full-data fit appears. So the leave-one-out CV error can be computed without ever actually refitting the model.
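A quick numeric check of this identity (not from the text): it compares the brute-force leave-one-out error with the $1-S_{ii}$ shortcut for ordinary least squares, where $S=X(X^TX)^{-1}X^T$ is the hat matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ rng.normal(size=p + 1) + rng.normal(size=N)

# Hat (smoother) matrix S of OLS: y_hat = S y, so the fit is linear in y
S = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - S @ y

# Right side: one full fit, residuals rescaled by 1 - S_ii
shortcut = np.mean((resid / (1 - np.diag(S))) ** 2)

# Left side: actually refit N times, each time dropping observation i
brute = 0.0
for i in range(N):
    keep = np.arange(N) != i
    beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    brute += (y[i] - X[i] @ beta) ** 2
brute /= N

print(np.isclose(brute, shortcut))  # True: the two sides agree to rounding
```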
Both inequalities below just express that a least-squares fit is optimal on its own training set: for any candidate $f$,

$$\sum^N_{i=1}(y_i-\hat{f}(x_i))^2\leq \sum^N_{i=1}(y_i-f(x_i))^2$$

and likewise the leave-one-out fit

$$\hat{f}^{(k)}=\arg\min_{f}\sum_{i \neq k} (y_i-f(x_i))^2$$

minimizes the error over the remaining points:

$$\sum_{i \neq k}(y_i-\hat{f}^{(k)}(x_i))^2\leq \sum_{i \neq k}(y_i-f(x_i))^2$$
Bootstrap Methods
Given training data $Z=(z_1,\dots,z_N)$, we draw $B$ bootstrap datasets $Z^{*1},\dots,Z^{*B}$, each by sampling $N$ points from $Z$ with replacement. We want to estimate some aspect of the distribution of a statistic $S(Z)$.
For example, the variance of $S(Z)$ can be estimated by

$$\widehat{\mathrm{Var}}[S(Z)]=\frac{1}{B-1}\sum^B_{b=1}\left(S(Z^{*b})-\bar{S}^*\right)^2$$

where $\bar{S}^*=\frac{1}{B}\sum^B_{b=1}S(Z^{*b})$.
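A minimal sketch of this variance estimate; `bootstrap_var` is a made-up name, and the sample median is just one choice of statistic $S$.

```python
import numpy as np

def bootstrap_var(Z, S, B=1000, seed=None):
    """Estimate Var[S(Z)] from B bootstrap replicates Z*b,
    each drawn from Z with replacement."""
    rng = np.random.default_rng(seed)
    N = len(Z)
    reps = np.array([S(Z[rng.integers(0, N, size=N)]) for _ in range(B)])
    return reps.var(ddof=1)   # 1/(B-1) * sum_b (S(Z*b) - S_bar*)^2

rng = np.random.default_rng(0)
Z = rng.normal(size=100)
print(bootstrap_var(Z, np.median, B=2000, seed=1))
```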
Our estimate of prediction error is:

$$\widehat{\mathrm{Err}}_{boot}=\frac{1}{B}\frac{1}{N}\sum^B_{b=1}\sum^N_{i=1}L\big(y_i,\hat{f}^{*b}(x_i)\big)$$

Here the bootstrap datasets act as the training samples, while the original training set acts as the test sample, and the two have observations in common. This overlap makes $\widehat{\mathrm{Err}}_{boot}$ overly optimistic: many "test" points were seen during training, so it behaves like a partly in-sample (overfit) error estimate. The leave-one-out bootstrap below fixes this, and both estimators are computed in the sketch after the .632 formula.
Let $C^{-i}$ be the set of indices of bootstrap samples $b$ that do not contain observation $i$; scoring $y_i$ only with those models removes the overlap:

$$\widehat{\mathrm{Err}}^{(1)}=\frac{1}{N}\sum^N_{i=1}\frac{1}{|C^{-i}|}\sum_{b\in C^{-i}}L\big(y_i,\hat{f}^{*b}(x_i)\big)$$

Since a bootstrap sample contains on average only about $0.632N$ of the distinct observations ($P(i \in Z^{*b})=1-(1-1/N)^N \approx 1-e^{-1} \approx 0.632$), $\widehat{\mathrm{Err}}^{(1)}$ is biased upward, roughly like twofold cross-validation. The ".632 estimator" pulls it back toward the training error $\overline{\mathrm{err}}$:

$$\widehat{\mathrm{Err}}^{.632}=.368\,\overline{\mathrm{err}}+.632\,\widehat{\mathrm{Err}}^{(1)}$$
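A sketch computing all three quantities ($\widehat{\mathrm{Err}}_{boot}$, $\widehat{\mathrm{Err}}^{(1)}$, $\widehat{\mathrm{Err}}^{.632}$), assuming squared-error loss and an OLS learner as illustrative stand-ins; `bootstrap_errors` is a made-up name.

```python
import numpy as np

def bootstrap_errors(X, y, fit, predict, B=200, seed=None):
    """Err_boot, the leave-one-out bootstrap Err^(1), and Err^.632,
    all with squared-error loss."""
    rng = np.random.default_rng(seed)
    N = len(y)
    loss = np.empty((B, N))              # loss[b, i] = L(y_i, f*b(x_i))
    in_bag = np.zeros((B, N), dtype=bool)
    for b in range(B):
        idx = rng.integers(0, N, size=N)        # bootstrap sample Z*b
        in_bag[b, idx] = True
        model = fit(X[idx], y[idx])
        loss[b] = (y - predict(model, X)) ** 2
    err_boot = loss.mean()                      # test points overlap training
    # Err^(1): for each i, average only over samples b with i not in Z*b
    # (for moderate B each point is out-of-bag in ~36.8% of samples)
    out = ~in_bag
    err1 = np.mean([loss[out[:, i], i].mean() for i in range(N)])
    model = fit(X, y)
    err_bar = np.mean((y - predict(model, X)) ** 2)   # training error
    return err_boot, err1, 0.368 * err_bar + 0.632 * err1

fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(80), rng.normal(size=(80, 5))])
y = X[:, 1] - X[:, 2] + rng.normal(size=80)
print(bootstrap_errors(X, y, fit, predict, B=300, seed=1))
```

On typical runs $\widehat{\mathrm{Err}}_{boot}$ comes out smallest, reflecting the optimism from the training/test overlap discussed above.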