Generalized Discriminant Analysis

Generalized LDA

The strength of LDA is its very simplicity.

  • It is a simple prototype classifier: a point is classified into the class with the closest centroid, where distance is measured in the Mahalanobis metric using a pooled covariance estimate (a minimal code sketch follows at the end of this subsection).

  • The decision boundary is linear, so it admits a simple description and implementation. LDA is also informative because it provides a low-dimensional view of the data.

The weakness of LDA is also its simplicity.

  • It is often not enough to describe the data using only two kinds of prototypes (class centroids and a common covariance matrix).

  • A linear decision boundary may not adequately separate the classes.

  • When many features are used, the LDA estimates have high variance and performance degrades. In this case, we need to restrict or regularize LDA.
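As a concrete illustration of the prototype view above, here is a minimal NumPy sketch (the function names are my own, and class priors are deliberately omitted, so this is a simplification rather than a full LDA implementation):

```python
import numpy as np

def fit_lda_prototypes(X, y):
    """Class centroids and the pooled within-class covariance estimate."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    N, J = X.shape[0], len(classes)
    # Pooled covariance: within-class scatter divided by (N - J)
    scatter = sum(
        (X[y == c] - centroids[j]).T @ (X[y == c] - centroids[j])
        for j, c in enumerate(classes)
    )
    return classes, centroids, scatter / (N - J)

def predict_lda_prototypes(X, classes, centroids, pooled_cov):
    """Assign each row of X to the centroid with the smallest Mahalanobis distance."""
    cov_inv = np.linalg.pinv(pooled_cov)
    diffs = X[:, None, :] - centroids[None, :, :]            # shape (N, J, p)
    dists = np.einsum('njp,pq,njq->nj', diffs, cov_inv, diffs)
    return classes[np.argmin(dists, axis=1)]
```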

Flexible Discriminant Analysis

Definition

FDA is devised for nonlinear classification. The idea is to recast LDA as a linear regression on transformed class labels (optimal scoring), which can then be generalized by replacing the linear regression with a more flexible fit.

$$\min_{\beta, \theta}\sum^N_{i=1}(\theta(g_i)-x_i^T\beta)^2$$

$g_i$ is the class label of the $i$-th observation, and $\theta$ is a mapping $G\rightarrow \mathbb{R}^1$. $G$ is the set of class labels (a categorical variable), and $\theta$ assigns each label a quantitative value (score). We call $\theta(g_i)$ the transformed class labels, which can be predicted by linear regression.

$$ASR=\dfrac{1}{N}\sum^L_{l=1}\left[\sum^N_{i=1}(\theta_l(g_i)-x_i^T\beta_l)^2\right]$$

$\theta_l$ and $\beta_l$, $l=1,\dots,L$, are chosen to minimize the Average Squared Residual (ASR).

Matrix Notation

  • $Y$ is an $N\times J$ indicator matrix: $Y_{ij}=1$ if the $i$-th observation falls into the $j$-th class.

  • $\Theta$ is a $J \times K$ matrix whose columns are the $K$ score vectors; its $j$-th row holds the scores assigned to the $j$-th class.

  • $\Theta^*=Y\Theta$ is the matrix of transformed class labels.

  • $ASR(\Theta)=tr(\Theta^{*T}(I-P_X)\Theta^*)/N=tr(\Theta^TY^T(I-P_X)Y\Theta)/N$, where $P_X$ is the projection onto the column space of $X$.

For reference, $\sum^N_{i=1}(y_i-\hat{y}_i)^2=y^T(I-P_X)y$. If the scores $\Theta^*$ have mean zero, unit variance, and are uncorrelated over the $N$ observations ($\Theta^{*T}\Theta^*/N=I_K$), minimizing $ASR(\Theta)$ amounts to finding the $K$ largest eigenvectors $\Theta$ of $Y^TP_XY$ with normalization $\Theta^TD_p\Theta=I_K$, where $D_p=Y^TY/N$ is the diagonal matrix of class proportions.
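To see the identity, write $\hat{y}=P_Xy$ and use the fact that $I-P_X$ is a symmetric, idempotent projection:

$$\sum^N_{i=1}(y_i-\hat{y}_i)^2=\|y-P_Xy\|^2=y^T(I-P_X)^T(I-P_X)y=y^T(I-P_X)y$$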

$$\min_\Theta tr(\Theta^TY^T(I-P_X)Y\Theta)/N=\min_{\Theta}[tr(\Theta^TY^TY\Theta)/N-tr(\Theta^TY^TP_XY\Theta)/N] \\ =\min_\Theta [K-tr(\Theta^{*T}P_X\Theta^*)/N]=\max_\Theta tr(\Theta^{*T}S\Theta^*)/N$$

Here $S=P_X$; in the flexible version below, $P_X$ is replaced by a smoother matrix $S(\lambda)$.

The $\Theta^*$ maximizing this trace consists of the $K$ largest eigenvectors of $S$, by the Courant-Fischer characterization of eigenvalues. Finally, we can recover an optimal $\Theta$ from $\Theta^*=Y\Theta$.

Implementation

  1. Initialize: Form $Y$, the $N\times J$ indicator matrix described above.

  2. Multivariate regression: Set $\hat{Y}=P_XY$ and let $\mathbf{B}$ be the coefficient matrix, so that $\hat{Y}=X\mathbf{B}$.

  3. Optimal scores: Find the eigenvector matrix $\Theta$ of $Y^T\hat{Y}=Y^TP_XY$ with normalization $\Theta^TD_p\Theta=I$.

  4. Update: $\mathbf{B}\leftarrow\mathbf{B}\Theta$.

The final regression fit is a $(J-1)$-vector function $\eta(x)=\mathbf{B}^Tx$. The canonical variates take the following form.

$$U^Tx=D\mathbf{B}^Tx=D\eta(x), \;\;\text{s.t.} \;\;D_{kk}^2=\dfrac{1}{\alpha_k^2(1-\alpha_k^2)}$$

$\alpha_k^2$ is the $k$-th largest eigenvalue computed in step 3 (Optimal scores). We update the coefficient matrix $\mathbf{B}$ using $\Theta$, the eigenvector matrix of $Y^TP_XY$. $U^Tx$ gives the linear canonical variates, and $D\eta(x)$ is the form of the discriminant variates that carries over to the nonparametric case. By replacing $X,P_X$ with a basis expansion $h(X)$ and its smoother matrix $P_{h(X)}=S(\lambda)$, we obtain the nonparametric version. We call this extended version FDA.
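The four linear steps above can be sketched directly in NumPy. This is a minimal illustration, assuming plain least squares with an intercept column for $P_X$; the function name and default choices are mine, not from the text:

```python
import numpy as np

def optimal_scoring_lda(X, y, K=None):
    """LDA via optimal scoring: indicator regression followed by an eigenproblem."""
    N, p = X.shape
    classes, g = np.unique(y, return_inverse=True)
    J = len(classes)
    K = K if K is not None else J - 1

    # 1. Indicator response matrix Y (N x J)
    Y = np.zeros((N, J))
    Y[np.arange(N), g] = 1.0

    # 2. Multivariate regression with an intercept: Y_hat = P_X Y = X B
    X1 = np.column_stack([np.ones(N), X])
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)        # B is (p+1) x J
    Y_hat = X1 @ B

    # 3. Optimal scores: eigenvectors of Y^T P_X Y / N with Theta^T D_p Theta = I,
    #    where D_p is the diagonal matrix of class proportions.
    Dp = Y.sum(axis=0) / N
    Dh = np.diag(1.0 / np.sqrt(Dp))
    M = Y.T @ Y_hat / N
    evals, W = np.linalg.eigh(Dh @ M @ Dh)
    order = np.argsort(evals)[::-1]                    # largest eigenvalues first
    # The top eigenvalue (= 1) belongs to the trivial constant score; skip it.
    alpha2 = evals[order][1:K + 1]
    Theta = (Dh @ W)[:, order][:, 1:K + 1]

    # 4. Update the coefficients: eta(x) = B^T [1, x] with B <- B Theta
    B = B @ Theta

    # Canonical variates: U^T x = D eta(x), D_kk^2 = 1 / (alpha_k^2 (1 - alpha_k^2))
    D = np.diag(1.0 / np.sqrt(alpha2 * (1.0 - alpha2)))
    return B, D, alpha2
```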

Implementation

  1. Initialize: Form $\Theta_0$ such that $\Theta_0^TD_p\Theta_0=I$, and set $\Theta_0^*=Y\Theta_0$.

  2. Multivariate nonparametric regression: Fit $\hat{\Theta}_0^*=S(\lambda)\Theta_0^*$, giving $\eta(x)=\mathbf{B}^Th(x)$.

  3. Optimal scores: Find the eigenvector matrix $\Phi$ of $\Theta_0^{*T}\hat{\Theta}_0^*=\Theta_0^{*T}S(\lambda)\Theta_0^*$. The optimal scores are $\Theta=\Theta_0\Phi$.

  4. Update: $\eta(x)\leftarrow\Phi^T\eta(x)$.

With this implementation, we obtain $\Phi$ and update $\eta(x)$. The final $\eta(x)$ is used to compute the canonical distance $\delta(x,j)$, which is all that is needed for classification.

$$\delta(x,j)=||D(\hat{\eta}(x)-\bar{\eta}^j)||^2$$

$\bar{\eta}^j$ is the mean of the fitted vectors $\hat{\eta}(x_i)$ over the training observations in class $j$.
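A minimal sketch of this classification rule, assuming the fitted vectors $\hat{\eta}(x_i)$ for the training data and the diagonal matrix $D$ are already available (the function and argument names are illustrative):

```python
import numpy as np

def fda_classify(eta_new, eta_train, y_train, D):
    """Classify by the canonical distance delta(x, j) = ||D(eta(x) - eta_bar^j)||^2."""
    classes = np.unique(y_train)
    # Class means eta_bar^j of the fitted vectors on the training data
    eta_bar = np.array([eta_train[y_train == c].mean(axis=0) for c in classes])
    # Scaled differences D (eta(x) - eta_bar^j), one per class
    diffs = (eta_new[:, None, :] - eta_bar[None, :, :]) @ D.T
    delta = (diffs ** 2).sum(axis=2)                   # shape (n_new, J)
    return classes[np.argmin(delta, axis=1)]
```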

Penalized Discriminant Analysis

$$ASR(\{\theta_l,\beta_l\}^L_{l=1})=\dfrac{1}{N}\sum^L_{l=1}\left[\sum^N_{i=1}(\theta_l(g_i)-h^T(x_i)\beta_l)^2+\lambda\beta_l^T\Omega\beta_l\right]$$

When we choose $\eta_l(x)=h^T(x)\beta_l$, the relevant quantities become:

  • $h^T(x_i)=[h_1^T(x_i) \; | \;h_2^T(x_i) \;| \; \cdots \; | \; h_p^T(x_i)]$, where each $h_j$ is a vector of up to $N$ natural-spline basis functions.

  • $S(\lambda)=H(H^TH+\Omega(\lambda))^{-1}H^T$

  • $ASR_p(\Theta)=tr(\Theta^TY^T(I-S(\lambda))Y\Theta)/N$

  • $\Sigma_{wthn}+\Omega$: the penalized within-group covariance of $h(x_i)$

  • $\Sigma_{btwn}$: the between-group covariance of $h(x_i)$

  • Find $\arg\max_{u} u^T\Sigma_{btwn}u\;$ s.t. $\;u^T(\Sigma_{wthn}+\Omega)u=1$; the solution $u$ is a penalized canonical variate.

  • $D(x,\mu)=(h(x)-h(\mu))^T(\Sigma_{wthn}+\lambda\Omega)^{-1}(h(x)-h(\mu))$ is the penalized Mahalanobis distance between $x$ and a class centroid $\mu$, used for classification.
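As a sketch, the penalized canonical variates above can be computed from a generalized symmetric eigenproblem. The code below assumes SciPy is available and uses illustrative names for the basis matrix $H$ and penalty matrix $\Omega$; `scipy.linalg.eigh` normalizes each eigenvector $u$ so that $u^T(\Sigma_{wthn}+\lambda\Omega)u=1$:

```python
import numpy as np
from scipy.linalg import eigh

def penalized_canonical_variates(H, y, Omega, lam, K):
    """H: N x M basis expansion h(x_i); Omega: M x M penalty; returns top-K variates."""
    classes = np.unique(y)
    N, M = H.shape
    mu = H.mean(axis=0)
    S_wthn = np.zeros((M, M))
    S_btwn = np.zeros((M, M))
    for c in classes:
        Hc = H[y == c]
        mc = Hc.mean(axis=0)
        S_wthn += (Hc - mc).T @ (Hc - mc)               # within-group scatter
        S_btwn += len(Hc) * np.outer(mc - mu, mc - mu)  # between-group scatter
    S_wthn /= N
    S_btwn /= N
    # Maximize u^T S_btwn u subject to u^T (S_wthn + lam * Omega) u = 1
    evals, U = eigh(S_btwn, S_wthn + lam * Omega)
    order = np.argsort(evals)[::-1]
    return U[:, order[:K]], evals[order[:K]]
```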
