Idea
This method combines several classifiers in a sequential way. Correctly classified data does not need further emphasis in the model, so larger weights are put on misclassified data.
Notation
Classifiers: $H=\{h_1,\cdots,h_m\}, \;\text{s.t.}\; h_j:X\rightarrow\{1,-1\}$
Data: $X=\{x_1,\cdots,x_n\}, \; Y=\{y_1,\cdots,y_n\}, \; y_i\in\{1,-1\}$
Matrix: $A_{ij}=y_ih_j(x_i)$, so $A_{ij}=1$ if $h_j$ classifies $x_i$ correctly and $A_{ij}=-1$ if it is wrong
Weight for classifiers: $\lambda \in \mathbb{R}^m$
Weight for data: $w\in\mathbb{R}^n$
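To make the notation concrete, here is a minimal sketch (in Python with NumPy; the function and variable names are illustrative assumptions, not from the source) that builds the matrix $A$ from a set of classifiers and labeled data:

```python
import numpy as np

def build_A(classifiers, X, y):
    """A[i, j] = y_i * h_j(x_i): +1 if h_j classifies x_i correctly, -1 if wrong."""
    n, m = len(X), len(classifiers)
    A = np.empty((n, m))
    for j, h in enumerate(classifiers):
        A[:, j] = y * np.array([h(x) for x in X])
    return A

# Toy usage: two threshold classifiers on 1-D data.
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1, -1, 1, 1])
classifiers = [
    lambda x: 1 if x > 0 else -1,    # correct on all four points
    lambda x: 1 if x > 1.5 else -1,  # wrong on x = 1
]
A = build_A(classifiers, X, y)  # shape (n, m), entries in {+1, -1}
```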
Problem
$$p(\lambda)=\min\{w^TA\lambda : w\in\Delta_n\}, \quad \text{s.t.} \quad \Delta_n=\Big\{w \,\Big|\, \sum_{i=1}^n w_i=1,\; w_i\geq 0\Big\}$$

$$A\lambda=\begin{bmatrix}A_1 & \cdots & A_m\end{bmatrix}\begin{bmatrix}\lambda_1\\ \vdots\\ \lambda_m\end{bmatrix}, \qquad w^TA\lambda=w^TA_1\lambda_1+\cdots+w^TA_m\lambda_m$$

$$\max_{\lambda \in \Delta_m} \Big\{p(\lambda)=\min_{w\in \Delta_n}w^TA\lambda\Big\}$$
$$\min_{w\in \Delta_n}\Big\{f(w)=\max_{\lambda \in \Delta_m}w^TA\lambda\Big\}$$

The above optimization problem is the original problem, and the one below it is its dual.
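Because $w^TA\lambda$ is linear in $w$, its minimum over the simplex $\Delta_n$ is attained at a vertex, so the inner problems reduce to coordinate-wise operations: $p(\lambda)=\min_i (A\lambda)_i$ and $f(w)=\max_j (w^TA)_j$. A small sketch, continuing the toy example above (an illustration under those assumptions, not the source's code):

```python
def p(lam, A):
    """p(lambda) = min over w in the simplex of w^T A lambda = min_i (A lambda)_i."""
    return (A @ lam).min()

def f(w, A):
    """f(w) = max over lambda in the simplex of w^T A lambda = max_j (w^T A)_j."""
    return (w @ A).max()

# Weak duality sanity check: p(lambda) <= f(w) for any feasible pair.
lam = np.ones(A.shape[1]) / A.shape[1]  # uniform classifier weights in Delta_m
w = np.ones(A.shape[0]) / A.shape[0]    # uniform data weights in Delta_n
assert p(lam, A) <= f(w, A)
```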
In this situation, the entry $(A\lambda)_i = y_i\sum_{j=1}^m \lambda_j h_j(x_i)$ is the margin of the data point $x_i$ under the weighted combination of classifiers. If $(A\lambda)_i$ is lower than the other entries, the combination is not good at predicting $x_i$. Because the inner minimum over $w\in\Delta_n$ puts all of its mass on the smallest entry, $p(\lambda)=\min_i(A\lambda)_i$ is the margin of the worst-classified data point. When we choose $\lambda$ to put more weight on classifiers that are correct on those hard points, $p(\lambda)$ becomes bigger.
To solve this problem, we can use the subgradient method or the mirror descent method.
Subgradient method
The optimization problem is defined as:
$$\max_{\lambda \geq 0} p(\lambda)$$
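Since $p(\lambda)=\min_i (A\lambda)_i$ is a minimum of linear functions, it is concave, and the row $A_{i^*,:}$ attaining the minimum is a supergradient at $\lambda$. Here is a projected subgradient ascent sketch for this problem, continuing the code above; the step size and iteration count are arbitrary assumptions:

```python
def subgradient_ascent(A, steps=1000, eta=0.01):
    """Projected subgradient ascent for max over lambda >= 0 of p(lambda)."""
    n, m = A.shape
    lam = np.ones(m) / m  # arbitrary starting point
    for _ in range(steps):
        i_star = (A @ lam).argmin()           # worst-classified data point
        g = A[i_star, :]                      # supergradient of p at lambda
        lam = np.maximum(lam + eta * g, 0.0)  # ascent step, then project onto lambda >= 0
    return lam
```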