Generalized LDA
The strength of LDA lies in its simplicity.
It is a simple prototype classifier: a point is assigned to the class with the closest centroid, where distance is measured in the Mahalanobis metric using a pooled covariance estimate.
Because the decision boundary is linear, LDA admits a simple description and implementation. It is also informative, since it provides a low-dimensional view of the data.
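To make this prototype view concrete, here is a minimal NumPy sketch (the helper names `fit_lda_prototypes` and `predict_lda` are illustrative, not from any particular library) that assigns a point to the class whose centroid is closest in the Mahalanobis metric under a pooled covariance estimate; class priors are ignored for simplicity.

```python
import numpy as np

def fit_lda_prototypes(X, y):
    """Estimate class centroids and a pooled (within-class) covariance matrix."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    N, p = X.shape
    pooled = np.zeros((p, p))
    for c, mu in zip(classes, centroids):
        Xc = X[y == c] - mu
        pooled += Xc.T @ Xc
    pooled /= (N - len(classes))          # unbiased pooled estimate
    return classes, centroids, pooled

def predict_lda(x, classes, centroids, pooled):
    """Assign x to the class whose centroid is closest in the Mahalanobis metric."""
    P = np.linalg.inv(pooled)
    d2 = [(x - mu) @ P @ (x - mu) for mu in centroids]
    return classes[int(np.argmin(d2))]
```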
The weakness of LDA is this same simplicity.
Describing each class by a single prototype (its centroid, together with a common covariance matrix) is often not enough to capture the data.
A linear decision boundary may not adequately separate the classes.
When many features are used, the LDA estimates have high variance and performance degrades. In this case we need to restrict or regularize LDA.
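As one readily available form of regularization (a hedged illustration, not something developed further in this note), scikit-learn's LDA can shrink the pooled covariance estimate:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Shrinking the pooled covariance estimate ("auto" = Ledoit-Wolf) stabilizes LDA
# when the number of features is large relative to the number of observations.
clf = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
# clf.fit(X_train, y_train); clf.predict(X_test)
```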
Flexible Discriminant Analysis
Definition
Flexible discriminant analysis (FDA) is devised for nonlinear classification.
$$\min_{\beta,\theta}\sum_{i=1}^{N}\left(\theta(g_i) - x_i^T\beta\right)^2$$
Here $g_i$ is the class label of the $i$-th observation, and $\theta: \mathcal{G} \mapsto \mathbb{R}^1$ is a function that maps each categorical class label in $\mathcal{G}$ to a quantitative value (a score). We call $\theta(g_i)$ the transformed class labels, which can be predicted by linear regression.
$$\mathrm{ASR} = \frac{1}{N}\sum_{l=1}^{L}\left[\sum_{i=1}^{N}\left(\theta_l(g_i) - x_i^T\beta_l\right)^2\right]$$
The scores $\theta_l$ and coefficients $\beta_l$ are chosen to minimize the average squared residual (ASR).
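A minimal sketch of the quantity being minimized, assuming a single score function ($L = 1$) and an illustrative helper name `asr_single`: given a fixed score assignment, the inner minimization over $\beta$ is just ordinary least squares.

```python
import numpy as np

def asr_single(theta_scores, X, y):
    """Average squared residual for one score assignment theta: class -> R.
    theta_scores: dict mapping class label -> score theta(g)."""
    t = np.array([theta_scores[g] for g in y])      # transformed labels theta(g_i)
    beta, *_ = np.linalg.lstsq(X, t, rcond=None)     # best linear predictor of the scores
    resid = t - X @ beta
    return np.mean(resid ** 2)

# Example: with two classes scored +1 / -1 this is ordinary regression on a
# signed class indicator (an illustrative choice, not an optimal score):
# asr_single({"A": +1.0, "B": -1.0}, X, y)
```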
Matrix Notation
$Y$ is an $N \times J$ indicator matrix, with $Y_{ij} = 1$ if the $i$-th observation falls into the $j$-th class.
$\Theta$ is a $J \times K$ matrix whose columns are the $K$ score vectors for the $J$ classes.
$\Theta^* = Y\Theta$ is the $N \times K$ matrix of transformed class labels.
$$\mathrm{ASR}(\Theta) = \operatorname{tr}\!\left(\Theta^{*T}(I - P_X)\,\Theta^*\right)/N = \operatorname{tr}\!\left(\Theta^T Y^T (I - P_X)\, Y \Theta\right)/N,$$ where $P_X$ is the projection onto the column space of $X$.
For reference, $\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 = y^T(I - P_X)\,y$. If the scores $\Theta^*$ have mean zero, unit variance, and are uncorrelated over the $N$ observations ($\Theta^{*T}\Theta^*/N = I_K$), then minimizing $\mathrm{ASR}(\Theta)$ amounts to finding the $K$ largest eigenvectors $\Theta$ of $Y^T P_X Y$ with normalization $\Theta^T D_\pi \Theta = I_K$, where $D_\pi = Y^T Y/N$ is the diagonal matrix of class proportions.
$$\min_\Theta \operatorname{tr}\!\left(\Theta^T Y^T (I - P_X) Y \Theta\right)/N = \min_\Theta \left[\operatorname{tr}\!\left(\Theta^T Y^T Y \Theta\right)/N - \operatorname{tr}\!\left(\Theta^T Y^T P_X Y \Theta\right)/N\right] = \min_\Theta \left[K - \operatorname{tr}\!\left(\Theta^T S\,\Theta\right)/N\right] = \max_\Theta \operatorname{tr}\!\left(\Theta^T S\,\Theta\right)/N,$$
where $S = Y^T P_X Y$. The $\Theta$ maximizing this trace consists of the $K$ largest eigenvectors of $S$, by the Courant-Fischer characterization of eigenvalues. This is how we find the optimal $\Theta$.
Implementation
1. Initialize: form $Y$, the $N \times J$ indicator matrix described above.
2. Multivariate regression: set $\hat{Y} = P_X Y$, with coefficient matrix $B$ such that $\hat{Y} = XB$.
3. Optimal scores: find the eigenvector matrix $\Theta$ of $Y^T\hat{Y} = Y^T P_X Y$ with normalization $\Theta^T D_\pi \Theta = I$.
4. Update: $B \leftarrow B\Theta$.
The final regression fit is a $(J-1)$-vector function $\eta(x) = B^T x$. The canonical variates take the following form.
$$U^T x = D B^T x = D\eta(x), \quad \text{where } D^2_{kk} = \frac{1}{\alpha_k^2\left(1 - \alpha_k^2\right)}.$$
Here $\alpha_k^2$ is the $k$-th largest eigenvalue computed in step 3 (optimal scores), and $B$ is the coefficient matrix updated with $\Theta$, the eigenvector matrix of $Y^T P_X Y$. Thus $U^T x$ gives the linear canonical variates, and $D\eta(x)$ expresses the same discriminant variates through the regression fit. By replacing $X$ and $P_X$ with a basis expansion $h(X)$ and a nonparametric regression (smoother) operator $P_{h(X)} = S(\lambda)$, the construction extends to a nonparametric version. This extended version is what we call FDA.
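The following NumPy sketch strings the four steps together for the linear case (function and variable names are illustrative; it assumes $X$ is centered and carries no intercept column, so the trivial constant score does not appear):

```python
import numpy as np
from scipy.linalg import eigh

def linear_fda(X, y, K):
    """Optimal scoring with linear regression (a sketch, not a reference implementation)."""
    N = X.shape[0]
    classes, g = np.unique(y, return_inverse=True)
    J = len(classes)
    Y = np.zeros((N, J)); Y[np.arange(N), g] = 1.0        # N x J indicator matrix

    B_reg, *_ = np.linalg.lstsq(X, Y, rcond=None)          # multivariate regression of Y on X
    Yhat = X @ B_reg                                        # P_X Y
    M = Y.T @ Yhat                                          # Y^T P_X Y
    D_pi = np.diag(Y.sum(axis=0) / N)                       # diagonal matrix of class proportions

    # Generalized eigenproblem (1/N) Y^T P_X Y theta = alpha^2 D_pi theta,
    # which enforces the normalization Theta^T D_pi Theta = I.
    alpha2, Theta = eigh(M / N, D_pi)
    order = np.argsort(alpha2)[::-1][:K]                    # keep the K largest eigenvalues
    alpha2, Theta = alpha2[order], Theta[:, order]

    B = B_reg @ Theta                                       # update B <- B Theta
    d = 1.0 / np.sqrt(alpha2 * (1.0 - alpha2))              # D_kk = [alpha_k^2 (1 - alpha_k^2)]^{-1/2}
    return B, d, classes

# Discriminant coordinates of new points: (X_new @ B) * d, i.e. D eta(x).
```

Classification can then proceed by a nearest class centroid rule in these scaled coordinates, which is the canonical distance $\delta(x, j)$ described further below.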
Implementation
1. Initialize: form $\Theta_0$ such that $\Theta_0^T D_\pi \Theta_0 = I$, and set $\Theta_0^* = Y\Theta_0$.
2. Multivariate nonparametric regression: fit $\hat{\Theta}_0^* = S(\lambda)\,\Theta_0^*$, giving fitted functions $\eta(x) = B^T h(x)$.
3. Optimal scores: find the eigenvector matrix $\Phi$ of $\Theta_0^{*T}\hat{\Theta}_0^* = \Theta_0^{*T} S(\lambda)\,\Theta_0^*$. The optimal scores are $\Theta = \Theta_0\Phi$.
4. Update: $\eta(x) \leftarrow \Phi^T \eta(x)$.
With this procedure we obtain $\Phi$ and update $\eta(x)$. The final $\eta(x)$ is used to compute the canonical distance $\delta(x, j)$, which is all that is needed for classification:
$$\delta(x, j) = \left\lVert D\left(\hat{\eta}(x) - \bar{\eta}_j\right)\right\rVert^2,$$
where $\bar{\eta}_j$ is the mean of the fitted values $\hat{\eta}(x_i)$ over the observations in class $j$.
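A hedged sketch of this classification rule, reusing the illustrative `linear_fda` output from the earlier sketch (for the nonparametric version, `X_new @ B` would be replaced by the fitted $\hat{\eta}(x)$):

```python
import numpy as np

def classify_by_canonical_distance(X_new, X_train, y_train, B, d, classes):
    """Nearest fitted-centroid rule using delta(x, j) = ||D(eta(x) - eta_bar_j)||^2."""
    eta_train = (X_train @ B) * d                 # D eta(x_i) for the training data
    eta_new = (X_new @ B) * d                     # D eta(x) for the new points
    centroids = np.array([eta_train[y_train == c].mean(axis=0) for c in classes])
    d2 = ((eta_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[np.argmin(d2, axis=1)]
```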
Penalized Discriminant Analysis
$$\mathrm{ASR}\!\left(\{\theta_l, \beta_l\}_{l=1}^{L}\right) = \frac{1}{N}\sum_{l=1}^{L}\left[\sum_{i=1}^{N}\left(\theta_l(g_i) - h^T(x_i)\,\beta_l\right)^2 + \lambda\,\beta_l^T\Omega\,\beta_l\right]$$
Here we choose $\eta_l(x) = h^T(x)\beta_l$ with the basis expansion $h^T(x_i) = \left[\,h_1^T(x_i)\mid h_2^T(x_i)\mid\cdots\mid h_p^T(x_i)\,\right]$, where each $h_j$ is a vector of up to $N$ natural-spline basis functions, and $\Omega$ is the penalty matrix associated with this basis, chosen so that $\lambda\,\beta_l^T\Omega\,\beta_l$ penalizes roughness of the coordinate functions.
$$S(\lambda) = H\left(H^T H + \Omega(\lambda)\right)^{-1} H^T,$$ where $H$ is the basis matrix with rows $h^T(x_i)$; $S(\lambda)$ plays the role of the smoother (hat) matrix.
$$\mathrm{ASR}_p(\Theta) = \operatorname{tr}\!\left(\Theta^T Y^T (I - S(\lambda))\, Y \Theta\right)/N$$
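A hedged sketch of this penalized machinery; the basis and penalty used in the comment (a global cubic polynomial with a ridge-style $\Omega$) are placeholders standing in for the natural-spline basis and roughness penalty, and all names are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

def penalized_smoother(H, Omega, lam):
    """Smoother matrix S(lambda) = H (H^T H + lam * Omega)^{-1} H^T."""
    M = H.T @ H + lam * Omega
    return H @ np.linalg.solve(M, H.T)

def penalized_optimal_scores(Y, H, Omega, lam, K):
    """Scores minimizing ASR_p: eigenvectors of Y^T S(lambda) Y / N under Theta^T D_pi Theta = I."""
    N = Y.shape[0]
    S = penalized_smoother(H, Omega, lam)
    D_pi = np.diag(Y.sum(axis=0) / N)
    alpha2, Theta = eigh(Y.T @ S @ Y / N, D_pi)
    order = np.argsort(alpha2)[::-1][:K]
    return alpha2[order], Theta[:, order]

# Placeholder basis/penalty (illustrative only): a cubic polynomial in each
# coordinate with a ridge penalty standing in for a spline roughness penalty.
# H = np.hstack([X ** deg for deg in (1, 2, 3)]); Omega = np.eye(H.shape[1])
```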
$\Sigma_{\mathrm{wthn}} + \Omega$: penalized within-class covariance of the $h(x_i)$.
$\Sigma_{\mathrm{btwn}}$: between-class covariance of the $h(x_i)$.
Find $\arg\max_u\, u^T \Sigma_{\mathrm{btwn}}\, u$ subject to $u^T\left(\Sigma_{\mathrm{wthn}} + \Omega\right)u = 1$; such a $u$ is a penalized canonical variate.
$$D(x, \mu) = \left(h(x) - h(\mu)\right)^T\left(\Sigma_{\mathrm{wthn}} + \lambda\Omega\right)^{-1}\left(h(x) - h(\mu)\right)$$
is the corresponding penalized Mahalanobis distance between a point $x$ and a class centroid $\mu$.
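Finally, a hedged sketch of the penalized canonical variates and the penalized Mahalanobis distance; `Sigma_btwn`, `Sigma_wthn`, `Omega`, and `lam` are assumed to be precomputed from the $h(x_i)$ as described above, and the function names are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

def penalized_canonical_variates(Sigma_btwn, Sigma_wthn, Omega, lam, K):
    """u maximizing u^T Sigma_btwn u subject to u^T (Sigma_wthn + lam * Omega) u = 1."""
    # Treats the penalty as lam * Omega, consistent with the distance below.
    evals, U = eigh(Sigma_btwn, Sigma_wthn + lam * Omega)   # generalized eigenproblem
    order = np.argsort(evals)[::-1][:K]
    return U[:, order]       # columns satisfy u^T (Sigma_wthn + lam * Omega) u = 1

def penalized_mahalanobis(h_x, h_mu, Sigma_wthn, Omega, lam):
    """D(x, mu) = (h(x) - h(mu))^T (Sigma_wthn + lam * Omega)^{-1} (h(x) - h(mu))."""
    diff = h_x - h_mu
    return diff @ np.linalg.solve(Sigma_wthn + lam * Omega, diff)
```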