Generalized LDA
The strength of LDA is its very simplicity.
It is a simple prototype classifier: a point is assigned to the class with the closest centroid, where distance is measured in the Mahalanobis metric using a pooled covariance estimate.
The decision boundary is linear, so the classifier is simple to describe and implement. LDA is also informative because it provides a low-dimensional view of the data.
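To make the prototype view concrete, here is a minimal NumPy sketch (the function and array names are illustrative, and equal class priors are assumed): classification just picks the class whose centroid is nearest in the pooled-covariance Mahalanobis metric.

```python
import numpy as np

def lda_nearest_centroid(X, y, X_new):
    """Classify rows of X_new by the nearest class centroid under the
    Mahalanobis metric with a pooled covariance (equal priors assumed)."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    # Pooled (within-class) covariance estimate
    resid = np.vstack([X[y == c] - X[y == c].mean(axis=0) for c in classes])
    Sigma = resid.T @ resid / (len(X) - len(classes))
    Sigma_inv = np.linalg.inv(Sigma)
    # Squared Mahalanobis distance from each new point to each centroid
    diff = X_new[:, None, :] - centroids[None, :, :]          # shape (n, J, p)
    d2 = np.einsum('njp,pq,njq->nj', diff, Sigma_inv, diff)   # shape (n, J)
    return classes[d2.argmin(axis=1)]
```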
The weakness of LDA is also its simplicity.
It is often not enough to describe the data with just two kinds of prototypes (class centroids and a common covariance matrix).
A linear decision boundary may not be able to separate the classes adequately.
When many features are used, the LDA estimates have high variance and performance degrades. In this case we need to restrict or regularize LDA.
Flexible Discriminant Analysis
Definition
FDA is devised for nonlinear classification. It starts from the regression problem

$$\min_{\beta,\theta}\ \sum_{i=1}^{N}\bigl(\theta(g_i) - x_i^T\beta\bigr)^2$$

where $g_i$ is the class label of the $i$th observation and $\theta$ is a function mapping $\mathcal{G} \to \mathbb{R}^1$: it assigns a quantitative value (score) to each categorical label in $\mathcal{G}$. We call $\theta(g_i)$ the transformed class labels, which can be predicted by linear regression on $x$.
$$\mathrm{ASR} = \frac{1}{N}\sum_{l=1}^{L}\left[\sum_{i=1}^{N}\bigl(\theta_l(g_i) - x_i^T\beta_l\bigr)^2\right]$$

The $\theta_l$ and $\beta_l$ are chosen to minimize the average squared residual (ASR).
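As a small illustration of this objective, the sketch below (NumPy; it assumes labels integer-coded as 0..J−1 and a single fixed score vector $\theta$) evaluates the criterion, carrying out the inner minimization over $\beta$ by ordinary least squares.

```python
import numpy as np

def asr_for_scores(X, g, theta):
    """Average squared residual for one candidate score vector theta (length J),
    with the inner minimization over beta done by ordinary least squares."""
    N = len(g)
    targets = theta[g]                       # theta(g_i): the transformed class labels
    Xc = np.column_stack([np.ones(N), X])    # include an intercept
    beta, *_ = np.linalg.lstsq(Xc, targets, rcond=None)
    return np.sum((targets - Xc @ beta) ** 2) / N
```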
Matrix Notation
$Y$ is an $N\times J$ indicator matrix, with $Y_{ij}=1$ if the $i$th observation falls into the $j$th class.
$\Theta$ is a $J\times K$ matrix whose $K$ columns are score vectors, each assigning a score to the $J$ classes.
$\Theta^* = Y\Theta$ is the $N\times K$ matrix of transformed class labels.
$$\mathrm{ASR}(\Theta) = \mathrm{tr}\bigl(\Theta^{*T}(I - P_X)\,\Theta^*\bigr)/N = \mathrm{tr}\bigl(\Theta^T Y^T (I - P_X)\, Y\Theta\bigr)/N,$$ where $P_X$ is the projection onto the column space of $X$.
For reference, $\sum_{i=1}^{N}(y_i - \hat y_i)^2 = y^T(I - P_X)\,y$. If the scores $\Theta^*$ have mean zero, unit variance, and are uncorrelated over the $N$ observations ($\Theta^{*T}\Theta^*/N = I_K$), minimizing $\mathrm{ASR}(\Theta)$ amounts to finding the $K$ largest eigenvectors $\Theta$ of $Y^T P_X Y$ with normalization $\Theta^T D_p \Theta = I_K$, where $D_p = Y^T Y/N$ is the diagonal matrix of class proportions.
$$\min_\Theta\ \mathrm{tr}\bigl(\Theta^T Y^T (I - P_X)\,Y\Theta\bigr)/N = \min_\Theta\Bigl[\mathrm{tr}\bigl(\Theta^T Y^T Y\Theta\bigr)/N - \mathrm{tr}\bigl(\Theta^T Y^T P_X Y\Theta\bigr)/N\Bigr] = \min_\Theta\Bigl[K - \mathrm{tr}\bigl(\Theta^{*T} P_X\,\Theta^*\bigr)/N\Bigr] = \max_\Theta\ \mathrm{tr}\bigl(\Theta^{*T} P_X\,\Theta^*\bigr)/N$$

By the Courant–Fischer characterization of eigenvalues, the maximizing $\Theta$ consists of the $K$ largest eigenvectors of $Y^T P_X Y$ in the $D_p$ metric. Thus we can find an optimal $\Theta$.
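A quick numerical sanity check of this eigen-argument (NumPy, random data, all names illustrative): the top-$K$ eigenvector solution, normalized so that $\Theta^T D_p \Theta = I_K$, attains an ASR no larger than a random score matrix with the same normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, J, K = 200, 5, 4, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
g = rng.integers(0, J, N)
Y = np.zeros((N, J)); Y[np.arange(N), g] = 1.0
Px = X @ np.linalg.pinv(X)                          # projection onto col(X)
Dp = Y.sum(axis=0) / N                              # class proportions (diag of D_p)

def asr(Theta):
    Ts = Y @ Theta                                  # Theta^* = Y Theta
    return np.trace(Ts.T @ (np.eye(N) - Px) @ Ts) / N

# Top-K eigenvectors of Y^T P_X Y in the D_p metric
M = Y.T @ Px @ Y / N
Dmh = np.diag(1.0 / np.sqrt(Dp))                    # D_p^{-1/2}
evals, V = np.linalg.eigh(Dmh @ M @ Dmh)
Theta_opt = Dmh @ V[:, ::-1][:, :K]                 # Theta_opt^T D_p Theta_opt = I_K

# Random competitor with the same normalization
Q, _ = np.linalg.qr(rng.normal(size=(J, K)))        # Q^T Q = I_K
Theta_rand = Dmh @ Q
assert asr(Theta_opt) <= asr(Theta_rand) + 1e-10
```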
Implementation
1. Initialize: Form $Y$, the $N\times J$ indicator response matrix (described above).
2. Multivariate Regression: Set $\hat Y = P_X Y$ and let $B$ be the coefficient matrix, so that $\hat Y = XB$.
3. Optimal scores: Find the eigenvector matrix $\Theta$ of $Y^T\hat Y = Y^T P_X Y$ with normalization $\Theta^T D_p \Theta = I$.
4. Update: $B \leftarrow B\Theta$ (a minimal numerical sketch of these four steps follows below).
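The sketch below (NumPy) walks through the four steps. It assumes $X$ already contains an intercept column and labels coded 0..J−1; the trivial constant score (eigenvalue 1) is dropped, which is also why the scaling $D$ defined next is well defined.

```python
import numpy as np

def optimal_scoring(X, g, K=None):
    """Linear optimal scoring.  Returns the score matrix Theta, the updated
    coefficient matrix B Theta (so eta(x) = (B Theta)^T x), and the alpha_k^2."""
    N, J = len(g), g.max() + 1
    K = K if K is not None else J - 1
    # 1. Indicator response matrix Y (N x J)
    Y = np.zeros((N, J)); Y[np.arange(N), g] = 1.0
    # 2. Multivariate regression of Y on X: Y_hat = P_X Y = X B
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    Y_hat = X @ B
    # 3. Optimal scores: eigenvectors of Y^T Y_hat = Y^T P_X Y with
    #    normalization Theta^T D_p Theta = I (D_p = diag of class proportions)
    Dp = Y.sum(axis=0) / N
    M = Y.T @ Y_hat / N
    A = M / np.sqrt(Dp)[:, None] / np.sqrt(Dp)[None, :]    # D_p^{-1/2} M D_p^{-1/2}
    evals, V = np.linalg.eigh(A)
    order = np.argsort(evals)[::-1][1:K + 1]                # drop the trivial constant score
    alpha2 = evals[order]                                   # alpha_k^2, used for D below
    Theta = V[:, order] / np.sqrt(Dp)[:, None]              # Theta^T D_p Theta = I
    # 4. Update: B <- B Theta
    return Theta, B @ Theta, alpha2
```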
The final regression fit is a $(J-1)$-vector-valued function $\eta(x) = B^T x$. The canonical variates take the form

$$U^T x = D B^T x = D\eta(x), \qquad D_{kk}^2 = \frac{1}{\alpha_k^2\,(1-\alpha_k^2)},$$

where $\alpha_k^2$ is the $k$th largest eigenvalue computed in step 3 (Optimal scores), and $B$ is the coefficient matrix already updated by $\Theta$, the eigenvector matrix of $Y^T P_X Y$. $U^T x$ gives the linear canonical variates, and $D\eta(x)$ is the form that generalizes to a nonparametric version of these discriminant variates: by replacing $X$ and $P_X$ with $h(X)$ and $P_{h(X)} = S(\lambda)$, we obtain the nonparametric extension. We call this extended version FDA.
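Continuing the sketch above, $D$ and the discriminant variates follow directly from the `alpha2` values returned by the optimal-scores step (names are the hypothetical ones from the previous sketch).

```python
import numpy as np

def discriminant_variates(X, B_updated, alpha2):
    """D eta(x) = D B^T x, with D_kk^2 = 1 / (alpha_k^2 (1 - alpha_k^2)).
    Assumes 0 < alpha_k^2 < 1 (the trivial score with alpha^2 = 1 was dropped)."""
    D = np.diag(1.0 / np.sqrt(alpha2 * (1.0 - alpha2)))
    eta = X @ B_updated          # rows are eta(x_i) = B^T x_i
    return eta @ D               # scale the kth coordinate by D_kk
```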
Implementation
1. Initialize: Form $\Theta_0$ such that $\Theta_0^T D_p \Theta_0 = I$, and set $\Theta_0^* = Y\Theta_0$.
2. Multivariate Nonparametric Regression: Fit $\hat\Theta_0^* = S(\lambda)\Theta_0^*$, giving $\eta(x) = B^T h(x)$.
3. Optimal scores: Find the eigenvector matrix $\Phi$ of $\Theta_0^{*T}\hat\Theta_0^* = \Theta_0^{*T} S(\lambda)\,\Theta_0^*$. The optimal scores are $\Theta = \Theta_0\Phi$.
4. Update: $\eta(x) \leftarrow \Phi^T\eta(x)$.
With this procedure we obtain $\Phi$ and update $\eta(x)$. The final $\eta(x)$ is used to compute the canonical distance $\delta(x, j)$, which is all we need for classification:

$$\delta(x, j) = \bigl\lVert D\bigl(\hat\eta(x) - \bar\eta_j\bigr)\bigr\rVert^2,$$

where $\bar\eta_j$ is the mean of the fitted $\hat\eta(x_i)$ over the observations in class $j$.
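Here is a sketch of the whole FDA recipe with one deliberately simple choice of smoother: a ridge-penalized quadratic basis expansion, so that $S(\lambda) = H(H^TH + \lambda P)^{-1}H^T$ with the intercept left unpenalized. The basis $h(x)$, the penalty, and all names are illustrative assumptions, not the only possibility.

```python
import numpy as np

def fda_fit(X, g, lam=1.0, K=None):
    """FDA sketch.  Smoother: ridge-penalized quadratic basis, intercept unpenalized,
    so S(lam) = H (H^T H + lam * P)^{-1} H^T reproduces constants exactly."""
    N, J = len(g), g.max() + 1
    K = K if K is not None else J - 1

    def h(Z):                                                    # basis expansion h(x)
        return np.column_stack([np.ones(len(Z)), Z, Z ** 2])

    H = h(X)
    P = np.eye(H.shape[1]); P[0, 0] = 0.0                        # do not penalize the intercept
    Y = np.zeros((N, J)); Y[np.arange(N), g] = 1.0
    Dp = Y.sum(axis=0) / N
    # 1. Initialize: Theta_0 = D_p^{-1/2} satisfies Theta_0^T D_p Theta_0 = I
    T0 = Y @ np.diag(1.0 / np.sqrt(Dp))                          # Theta_0^* = Y Theta_0
    # 2. Multivariate nonparametric regression: Theta_hat_0^* = S(lam) Theta_0^*
    B = np.linalg.solve(H.T @ H + lam * P, H.T @ T0)
    T0_hat = H @ B                                               # eta(x) = B^T h(x)
    # 3. Optimal scores: eigendecompose Theta_0^{*T} Theta_hat_0^*; the trivial
    #    constant score has eigenvalue 1 and is dropped
    evals, V = np.linalg.eigh(T0.T @ T0_hat / N)
    order = np.argsort(evals)[::-1][1:K + 1]
    alpha2, Phi = evals[order], V[:, order]
    # 4. Update: eta(x) <- Phi^T eta(x), i.e. the coefficients become B Phi
    B = B @ Phi
    D = np.diag(1.0 / np.sqrt(alpha2 * (1.0 - alpha2)))
    eta_bar = np.array([(h(X[g == j]) @ B).mean(axis=0) for j in range(J)])
    return {"h": h, "B": B, "D": D, "eta_bar": eta_bar}

def fda_predict(model, X_new):
    """Classify by the canonical distance delta(x, j) = ||D (eta(x) - eta_bar_j)||^2."""
    eta = model["h"](X_new) @ model["B"]
    diff = eta[:, None, :] - model["eta_bar"][None, :, :]        # shape (n, J, K)
    delta = ((diff @ model["D"]) ** 2).sum(axis=2)
    return delta.argmin(axis=1)
```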
Penalized Discriminant Analysis

$$\mathrm{ASR}\bigl(\{\theta_l,\beta_l\}_{l=1}^{L}\bigr) = \frac{1}{N}\sum_{l=1}^{L}\left[\sum_{i=1}^{N}\bigl(\theta_l(g_i) - h^T(x_i)\,\beta_l\bigr)^2 + \lambda\,\beta_l^T\Omega\,\beta_l\right]$$

Here we choose $\eta_l(x) = h(x)^T\beta_l$ with $h^T(x_i) = \bigl[\,h_1^T(x_i)\mid h_2^T(x_i)\mid\cdots\mid h_p^T(x_i)\,\bigr]$, where each $h_j$ is a vector of up to $N$ natural-spline basis functions, and $\Omega$ is the corresponding roughness penalty matrix. The smoother matrix then becomes

$$S(\lambda) = H\bigl(H^T H + \Omega(\lambda)\bigr)^{-1}H^T$$
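In matrix form $S(\lambda)$ is straightforward to build; a small sketch (NumPy; $H$ and $\Omega(\lambda)$ are whatever basis and penalty matrices the problem supplies):

```python
import numpy as np

def penalized_smoother(H, Omega_lam):
    """S(lambda) = H (H^T H + Omega(lambda))^{-1} H^T for a basis matrix H (N x M)
    and a penalty matrix Omega(lambda) (M x M)."""
    return H @ np.linalg.solve(H.T @ H + Omega_lam, H.T)
```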
$$\mathrm{ASR}_p(\Theta) = \mathrm{tr}\bigl(\Theta^T Y^T (I - S(\lambda))\,Y\Theta\bigr)/N$$
$\Sigma_{\mathrm{wthn}} + \Omega$: the penalized within-group covariance of the $h(x_i)$.
$\Sigma_{\mathrm{btwn}}$: the between-group covariance of the $h(x_i)$.
Find $\arg\max_u\ u^T\Sigma_{\mathrm{btwn}}\,u$ subject to $u^T(\Sigma_{\mathrm{wthn}} + \Omega)\,u = 1$; such a $u$ is a penalized canonical variate.
$$D(x,\mu) = \bigl(h(x) - h(\mu)\bigr)^T\bigl(\Sigma_{\mathrm{wthn}} + \lambda\Omega\bigr)^{-1}\bigl(h(x) - h(\mu)\bigr)$$

This penalized Mahalanobis distance to a class centroid $\mu$ is what PDA uses for classification.
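A minimal sketch putting the PDA pieces together (NumPy/SciPy; `H` is the basis-expanded data $h(x_i)$, `g` the integer class labels, `Omega` and `lam` the penalty, all illustrative names; $\Sigma_{\mathrm{wthn}} + \lambda\Omega$ is assumed positive definite):

```python
import numpy as np
from scipy.linalg import eigh

def pda_variates_and_distance(H, g, Omega, lam):
    """Penalized canonical variates u and the penalized Mahalanobis distance
    from each observation to each class centroid (in the h(x) space)."""
    N, J = len(g), g.max() + 1
    mu = np.array([H[g == j].mean(axis=0) for j in range(J)])    # class centroids of h(x_i)
    pri = np.bincount(g, minlength=J) / N                        # class proportions
    mu_bar = pri @ mu                                            # overall (weighted) mean
    # Between- and (penalized) within-group covariances of h(x_i)
    Sigma_btwn = (mu - mu_bar).T @ np.diag(pri) @ (mu - mu_bar)
    resid = H - mu[g]
    Sigma_wthn = resid.T @ resid / (N - J)
    W = Sigma_wthn + lam * Omega
    # Canonical variates: maximize u^T Sigma_btwn u  s.t.  u^T W u = 1
    # (generalized eigenproblem; scipy normalizes eigenvectors so that U^T W U = I)
    evals, U = eigh(Sigma_btwn, W)
    U = U[:, ::-1]                                               # decreasing eigenvalue order
    # Penalized Mahalanobis distance D(x, mu_j) to each class centroid
    diff = H[:, None, :] - mu[None, :, :]                        # shape (N, J, M)
    dist = np.einsum('njp,pq,njq->nj', diff, np.linalg.inv(W), diff)
    return U, dist
```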