Ensemble

ECOC (Error Correcting Output Codes)

Multiclass classification

  • OVR (One vs. Rest): Red vs. [Blue, Green] - logistic regression

  • OVO (One vs. One): Red vs. Blue - SVM

  • ECOC: random code assignment

In multiclass classification, we need to encode the categorical response into several dummy variables. There are three ways to do this encoding, and ECOC is introduced as a way of adding randomness.

Random code assignment worked as well as the optimally constructed error-correcting codes. The idea is that the redundant "error correcting" bits allow for some inaccuracies, and can improve performance.

  1. Learn a separate classifier for each of the $L=15$ bits of the code.

  2. At a test point $x$, $\hat{p}_l(x)$ is the predicted probability of a one for the $l$-th response.

  3. $\delta_k(x)=\sum^L_{l=1} |C_{kl}-\hat{p}_l(x)|$ is the discriminant for class $k$; classify to the class with the smallest $\delta_k(x)$.
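A minimal sketch of this scheme, assuming integer class labels, a random {0,1} coding matrix $C$, and logistic regression as the per-bit base learner (function names and defaults are illustrative, not from the source):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ecoc_fit(X, y, n_classes, L=15, seed=0):
    """Assign a random L-bit code word to each class (rows of C) and
    learn one binary classifier per bit column. y is assumed to hold
    integer labels 0..n_classes-1."""
    rng = np.random.default_rng(seed)
    C = rng.integers(0, 2, size=(n_classes, L))
    while np.any(C.min(axis=0) == C.max(axis=0)):   # redraw if a bit column is constant
        C = rng.integers(0, 2, size=(n_classes, L))
    clfs = []
    for l in range(L):
        bits = C[y, l]                              # relabel each sample by bit l of its class code
        clfs.append(LogisticRegression(max_iter=1000).fit(X, bits))
    return C, clfs

def ecoc_predict(X, C, clfs):
    """delta_k(x) = sum_l |C_kl - p_hat_l(x)|; classify to the class with the smallest delta."""
    P = np.column_stack([clf.predict_proba(X)[:, 1] for clf in clfs])   # N x L bit probabilities
    delta = np.abs(C[None, :, :] - P[:, None, :]).sum(axis=2)           # N x K discriminants
    return delta.argmin(axis=1)
```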

Boosting and Regularization Paths

Forward-stagewise (FS) regression

Forward-stagewise regression is a constrained version of forward-stepwise regression. At each step, the algorithm identifies the variable most correlated with the current residual. The specific algorithm can be found on page 5 of the paper referenced below, which presents forward-stagewise regression in detail.
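As a rough sketch of that idea, assuming standardized predictors and a centered response (step size and iteration count are illustrative):

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=1000):
    """Incremental forward-stagewise regression: repeatedly find the predictor
    most correlated with the current residual and move its coefficient by eps
    in the direction of that correlation."""
    beta = np.zeros(X.shape[1])
    r = y.astype(float).copy()                # current residual
    for _ in range(n_steps):
        corr = X.T @ r                        # (unnormalized) correlations with the residual
        j = np.argmax(np.abs(corr))           # most correlated predictor
        step = eps * np.sign(corr[j])
        beta[j] += step
        r -= step * X[:, j]                   # update the residual
    return beta
```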

Generalized Additive Models

Effects are often not linear, so each $f_j$ may be a non-linear function.

$$E(Y|X_1,...,X_p)=\alpha+f_1(X_1)+f_2(X_2)+\cdots+f_p(X_p)$$

Tree

$$f(x)=\sum^M_{m=1} c_m I(x\in R_m)$$

Penalized regression

Penalty approach

$$f(x) = \sum^K_{k=1} \alpha_k T_k(x), \quad K=\mathrm{card}(T), \quad T=\{T_k\}$$

$$\min_\alpha \Big\{ \sum^N_{i=1} \Big( y_i-\sum^K_{k=1} \alpha_k T_k(x_i) \Big)^2+\lambda \cdot J(\alpha)\Big\}$$

$$J(\alpha)=\sum^K_{k=1}|\alpha_k|^2 \;(\text{Ridge}), \qquad J(\alpha)=\sum^K_{k=1}|\alpha_k| \;(\text{Lasso})$$

$T=\{T_k\}$ is the dictionary of all possible $J$-terminal-node regression trees. $T$ can be realized on the training data as basis functions in $\mathbb{R}^p$. The coefficients of all trees are estimated by least squares.

However, when the number of basis functions becomes very large, solving the optimization problem with the lasso penalty directly is not feasible. In this case, the following forward-stagewise approach can be used as an approximation to the lasso penalty approach.

Forward Stagewise approach (FS)

  1. Initialization: $\hat{\alpha}_k=0, \;k=1,...,K$.

  2. For $m=1$ to $M$, repeat: (a) $(\beta^*,k^*)=\arg\min_{\beta, k}\sum^N_{i=1}\Big(y_i-\sum^K_{l=1} \hat{\alpha}_l T_l(x_i)-\beta T_k(x_i)\Big)^2$, then (b) $\hat{\alpha}_{k^*}\leftarrow \hat{\alpha}_{k^*} + \varepsilon \cdot \mathrm{sign}(\beta^*)$.

  3. Output $f_M(x)=\sum^K_{k=1} \hat{\alpha}_k T_k(x)$.

This algorithm mirrors forward-stagewise regression: as in the regression case, at each step it adopts the tree most correlated with the current residual. The number of iterations $M$ is inversely related to $\lambda$; when $\lambda=\infty$ the forward-stagewise algorithm stays at its initialization, where every coefficient is zero ($M=0$).

For $\lambda>0$ the coefficients are shrunken relative to the least-squares solution, $|\hat{\alpha}_k(\lambda)|<|\hat{\alpha}_k(0)|$; equivalently, stopping after a finite number of steps $M$ gives $|\hat{\alpha}_k(M)|$ smaller than the corresponding unregularized least-squares values. When $\lambda=0$, the regularization term disappears, so the solution becomes the least-squares solution.

When all of the basis functions $T_k$ are mutually uncorrelated, FS yields exactly the same solution as the lasso with bound parameter $t=\sum_k |\alpha_k|$. (Even when the basis functions are not uncorrelated, FS agrees with the lasso as long as each $\hat{\alpha}_k(\lambda)$ is a monotone function of $\lambda$.) The regularization parameter $\lambda$ is inversely related to the bound $t$. In the corresponding figure, $M=250$ and $\varepsilon=0.01$ are used in the right panel.
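A minimal sketch of the forward-stagewise approach above for squared-error loss, over a dictionary of small trees (the dictionary construction, depth, and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_dictionary(X, y, K=100, max_depth=2, seed=0):
    """Illustrative dictionary T = {T_k}: K shallow trees fit to bootstrap samples."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(K):
        idx = rng.integers(0, len(y), len(y))
        trees.append(DecisionTreeRegressor(max_depth=max_depth).fit(X[idx], y[idx]))
    return trees

def forward_stagewise_trees(X, y, trees, eps=0.01, M=250):
    """Steps 1-3 above: at each iteration find (beta*, k*) by least squares on the
    current residual and move alpha_{k*} by eps * sign(beta*)."""
    T = np.column_stack([t.predict(X) for t in trees])        # basis matrix, T[i, k] = T_k(x_i)
    alpha = np.zeros(T.shape[1])
    r = y - T @ alpha
    for _ in range(M):
        beta = (T.T @ r) / (np.sum(T ** 2, axis=0) + 1e-12)   # beta* for each candidate tree
        rss = np.sum(r ** 2) - beta * (T.T @ r)               # residual SS after adding beta * T_k
        k = np.argmin(rss)
        alpha[k] += eps * np.sign(beta[k])
        r = y - T @ alpha
    return alpha          # f_M(x) = sum_k alpha_k T_k(x)
```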

"Bet on Sparsity" Principle

Usually, shrinkage with the $L_1$ penalty is better suited to sparse situations, where there are few basis functions with nonzero coefficients.

  • Dense scenario: 10,000 data points and a linear combination of a million trees with coefficients from a Gaussian distribution. → Ridge works better than lasso.

  • Sparse scenario: 10,000 data points and a linear combination of 1,000 trees with nonzero coefficients. → Lasso works well.

The degree of sparseness is determined by the true target function and the chosen dictionary $T$. The size of the training set and the noise-to-signal ratio (NSR) also play a role, because larger training sets allow coefficients to be estimated with smaller standard errors. Increasing the size of the dictionary may lead to a sparser representation of our function, but the search becomes more difficult, leading to higher variance.

In this example, the NSR is defined as $Var(Y|\eta(X))/Var(\eta(X))$. The numerator is the unexplained variance of $Y$ and the denominator is the variance of the model $\eta(X)$; the larger the NSR, the larger the share of unexplainable error. The lasso is known to work well in this kind of sparse setting.
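A tiny sketch of estimating this ratio, under the assumption of additive homoscedastic noise so that $Var(Y|\eta(X))$ can be approximated by the residual variance (the function name is illustrative):

```python
import numpy as np

def noise_to_signal_ratio(y, eta_values):
    """NSR = Var(Y | eta(X)) / Var(eta(X)), estimated as the variance of the
    residuals y - eta(X) divided by the variance of the model values eta(X)."""
    return np.var(y - eta_values) / np.var(eta_values)
```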


Regularization Paths, Over-fitting and Margins

The lasso suffers somewhat from the multicollinearity problem (you can check the reason in Ex. 3.28): when the data contain an exact copy $X_j^*=X_j$ of a predictor, the lasso can split the original coefficient $a$ between $X_j^*$ and $X_j$, e.g. as $a/2$ and $a/2$, each half of the original coefficient, so the solution is no longer unique.

Monotone version of the lasso


We create an expanded data matrix $\tilde{X}=[X:-X]$ and solve the constrained problem

$$\min_{\beta_0,\beta_j^+,\beta_j^-}\sum^N_{i=1} \Big(y_i-\beta_0-\Big[\sum^p_{j=1}x_{ij}\beta_j^+-\sum^p_{j=1}x_{ij}\beta_j^- \Big]\Big)^2 \quad \text{s.t.} \;\; \beta_j^+,\beta_j^- \geq 0 \;\; \forall j \;\;\text{and}\;\; \sum^p_{j=1}(\beta_j^++\beta_j^-)\leq s$$
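A rough sketch of the Lagrangian form of this problem, solved as a non-negative lasso on the expanded matrix (this fits the constrained problem above, not the monotone-lasso path itself; `alpha` and the function name are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

def expanded_nonneg_lasso(X, y, alpha=0.1):
    """Non-negative lasso on X_tilde = [X, -X], so the L1 penalty equals
    sum_j (beta_j^+ + beta_j^-)."""
    X_tilde = np.hstack([X, -X])                       # columns for beta_plus and beta_minus
    model = Lasso(alpha=alpha, positive=True,          # positive=True enforces beta^+/- >= 0
                  fit_intercept=True, max_iter=10000).fit(X_tilde, y)
    p = X.shape[1]
    beta_plus, beta_minus = model.coef_[:p], model.coef_[p:]
    return model.intercept_, beta_plus - beta_minus    # net coefficient beta_j = beta_j^+ - beta_j^-
```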

In this setting, the monotone lasso coefficient path $\beta(l)$ for the expanded dataset $\tilde{X}=[X:-X]$ is the solution to the differential equation $\dfrac{\partial \beta}{\partial l}=\rho_{ml}(\beta(l))$, where $l$ measures the $L_1$ arc length of the path and the move direction $\rho_{ml}(\beta(l))$ is standardized to have unit $L_1$ norm.

Hastie, Taylor, Tibshirani & Walther, "Forward stagewise regression and the monotone lasso", Electronic Journal of Statistics, Vol. 1 (2007), 1-29.

The margin of a fitted model $f(x)=\sum_k \alpha_k T_k(x)$ is defined as

$$m(f)=\min_i \dfrac{y_i f(x_i)}{\sum^K_{k=1} |\alpha_k|}$$
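Assuming labels $y_i \in \{-1,+1\}$ and precomputed fitted values $f(x_i)$, this is a one-line computation (names are illustrative):

```python
import numpy as np

def margin(y, f_values, alphas):
    """m(f) = min_i y_i * f(x_i) / sum_k |alpha_k|, for labels y_i in {-1, +1}."""
    return np.min(y * f_values) / np.sum(np.abs(alphas))
```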

Learning Ensembles

Learning an ensemble can be divided into two steps:

  • A finite dictionary $T_L=\{T_1(x),...,T_M(x)\}$ is induced from the training data.

  • α(Ī»)=argminĪ±āˆ‘i=1NL[yi,α0+āˆ‘m=1MαmTm(xi)]+Ī»āˆ‘m=1M∣αm∣\alpha(\lambda)=argmin_\alpha \sum^N_{i=1} L[y_i, \alpha_0+\sum^M_{m=1} \alpha_mT_m(x_i)]+\lambda \sum^M_{m=1} |\alpha_m|

These steps can be seen as a way of post-processing boosting or random forests, because in the first step $T_L$ has already been fitted by gradient boosting or a random forest.
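A minimal sketch of these two steps, post-processing a random forest with the lasso (the dataset, forest size, and `alpha` are illustrative choices, not from the source):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# Step 1: induce a finite dictionary T_L from the training data (here, the trees of a forest).
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=200, max_depth=4, random_state=0).fit(X, y)

# Step 2: post-process -- treat each tree's prediction as a basis function T_m(x)
# and estimate the coefficients alpha_m under an L1 penalty.
T = np.column_stack([tree.predict(X) for tree in forest.estimators_])   # N x M basis matrix
post = Lasso(alpha=1.0, max_iter=10000).fit(T, y)
print("trees with nonzero coefficient:", int(np.sum(post.coef_ != 0)), "of", T.shape[1])
```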

$$f(x)=\int \beta(\gamma)b(x;\gamma)\,d\gamma \approx f_M(x)=\alpha_0+\sum^M_{m=1}\alpha_m b(x;\gamma_m)$$

The (lack of) relevance of each basis function is measured on the training data with the loss function: $Q(\gamma)=\min_{c_0,c_1} \sum^N_{i=1} L(y_i, c_0+c_1 b(x_i;\gamma))$. The global minimizer is $\gamma^*=\arg\min_{\gamma \in \Gamma} Q(\gamma)$, so $Q(\gamma) \geq Q(\gamma^*)$ for all $\gamma$.

$$\sigma=E_S[Q(\gamma)-Q(\gamma^*)]$$

  • Too narrow a $\sigma$ suggests that too many of the $b(x;\gamma_m)$ look alike, and are similar to $b(x;\gamma^*)$.

  • Too wide a $\sigma$ suggests a large spread in the $b(x;\gamma_m)$, but possibly consisting of many irrelevant cases.
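For squared-error loss, $Q(\gamma)$ is simply the residual sum of squares of a simple linear regression of $y$ on the single basis function $b(x;\gamma)$; a minimal sketch (names are illustrative):

```python
import numpy as np

def relevance_Q(y, b_values):
    """Q(gamma) = min_{c0, c1} sum_i (y_i - c0 - c1 * b(x_i; gamma))^2:
    the RSS of a simple linear regression of y on the basis function values."""
    A = np.column_stack([np.ones_like(b_values), b_values])   # columns for c0 and c1
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))
```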

ISLE Ensemble Generation

1. $f_0(x)=\arg\min_c \sum^N_{i=1} L(y_i,c)$

2. For $m=1$ to $M$ do:

$$(a) \;\; \gamma_m=\arg\min_\gamma \sum_{i \in S_m(\eta)} L\big(y_i,f_{m-1}(x_i)+b(x_i;\gamma)\big) \\ (b) \;\; f_m(x) = f_{m-1}(x)+\nu\, b(x;\gamma_m)$$

3. $T_{ISLE}=\{b(x;\gamma_1),b(x;\gamma_2),...,b(x;\gamma_M)\}$

  • Sm(Ī·)S_m(\eta) refers to a subsample of NĆ—Ī·(η∈(0,1])N \times \eta (\eta \in (0,1]) of the training observations, typically without replacement.

  • $\nu \in [0,1]$ introduces memory into the randomization process; the larger $\nu$, the more the procedure avoids $b(x;\gamma)$ similar to those found before.
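A minimal sketch of this generation procedure for squared-error loss, with shallow trees as the basis functions $b(x;\gamma)$ (the defaults $\nu=0.1$ and $\eta=1/2$ follow the recommendation below; other choices are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def isle_generate(X, y, M=200, eta=0.5, nu=0.1, max_depth=3, seed=0):
    """ISLE ensemble generation sketch for squared-error loss:
    (a) fit a small tree b(x; gamma_m) to the current residuals on a subsample S_m(eta),
    (b) update f_m = f_{m-1} + nu * b(x; gamma_m)."""
    rng = np.random.default_rng(seed)
    N = len(y)
    f = np.full(N, float(np.mean(y)))     # f_0(x) = argmin_c sum_i L(y_i, c) for squared loss
    ensemble = []
    for _ in range(M):
        idx = rng.choice(N, size=max(1, int(eta * N)), replace=False)   # subsample S_m(eta)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X[idx], (y - f)[idx])                                   # step (a) on the subsample
        f = f + nu * tree.predict(X)                                     # step (b)
        ensemble.append(tree)
    return ensemble                       # T_ISLE = {b(x; gamma_1), ..., b(x; gamma_M)}
```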

The randomization schemes we have already dealt with are special cases of ISLE ensemble generation:

  • Bagging: sampling with replacement, with $\eta=1$ and $\nu=0$. Sampling without replacement with $\eta=1/2$ is roughly equivalent to sampling with replacement with $\eta=1$, but the former is much more efficient.

  • Random Forest: more randomness is introduced by the random selection of the splitting variable. A smaller $\eta<1/2$ has an effect similar to a smaller $m$ (the number of candidate split variables) in random forests.

  • Gradient Boosting: gradient boosting with shrinkage uses $\eta=1$, but typically does not produce sufficient width $\sigma$.

  • Stochastic Gradient Boosting: follows this recipe exactly.

The Importance Sampled Learning Ensemble (ISLE) is recommended with $\nu=0.1$ and $\eta \leq 1/2$.

Whether a setting is sparse or dense can be assessed with the noise-to-signal ratio (NSR).

Revisit the Gradient Tree Boosting Algorithm

In boosted trees, a single tree is formally expressed as $T(x; \Theta)=\sum^J_{j=1} \gamma_j I(x\in R_j)$ with parameters $\Theta=\{R_j,\gamma_j\}^J_1$. The parameters are found by solving $\hat{\Theta}=\arg\min_\Theta \sum^J_{j=1} \sum_{x_i \in R_j} L(y_i,\gamma_j)$.

Optimizing these parameters jointly is a difficult combinatorial problem, so approximate, greedy solutions are used in practice.

Gradient Tree Boosting Algorithm
  1. Initialize $f_0(x)=\arg\min_\gamma \sum^N_{i=1} L(y_i,\gamma)$

  2. For $m=1$ to $M$:

(a) For $i=1,2,...,N$: $\quad r_{im}=-\Big[\dfrac{\partial L(y_i,f(x_i))}{\partial f(x_i)}\Big]_{f=f_{m-1}}$

(b) Fit a regression tree to the targets $r_{im}$, giving terminal regions $R_{jm}$, $j=1,2,...,J_m$.

(c) For $j=1,2,...,J_m$: $\quad \gamma_{jm}=\arg\min_\gamma \sum_{x_i \in R_{jm}} L(y_i,f_{m-1}(x_i)+\gamma)$

(d) Update $f_m(x)=f_{m-1}(x)+\sum^{J_m}_{j=1}\gamma_{jm}\, I(x\in R_{jm})$

  3. Output $\hat{f}(x)=f_M(x)$
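A minimal sketch of this algorithm for absolute-error loss $L(y,f)=|y-f|$, where the negative gradient is $\mathrm{sign}(y_i-f(x_i))$ and the optimal leaf constant is a median; the shrinkage factor `nu` is an extra assumption, not part of the algorithm as stated above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boost_lad(X, y, M=100, max_depth=3, nu=0.1):
    """Gradient tree boosting sketch for absolute-error loss."""
    f = np.full(len(y), float(np.median(y)))     # f_0 = argmin_gamma sum_i |y_i - gamma|
    ensemble = []
    for _ in range(M):
        r = np.sign(y - f)                                              # (a) pseudo-residuals r_im
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)     # (b) fit tree -> regions R_jm
        leaves = tree.apply(X)
        gamma = {l: float(np.median((y - f)[leaves == l]))              # (c) optimal gamma_jm per leaf
                 for l in np.unique(leaves)}
        f = f + nu * np.array([gamma[l] for l in leaves])               # (d) update f_m on training data
        ensemble.append((tree, gamma))
    return ensemble       # combine with f_0 to evaluate f_M(x) on new points via tree.apply
```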
