Clustering & Nearest Neighbor

Clustering is a method for grouping similar observations together. For example, we can cluster customer types based on several variables that reflect patterns in consumption. In matrix terms, we want to group similar rows of the data matrix. For clustering, prototype methods are usually adopted. Prototype methods assign each observation to its closest prototype (centroid, medoid, etc.). "Closest" is usually defined by Euclidean distance in feature space, after each feature has been standardized to have overall mean 0 and variance 1 in the training sample.
Prototype methods work well for capturing the distribution of each class.
The main issues are how many prototypes to use and where to put them.
🕺 Goal: Partition the observations into clusters so that the pairwise dissimilarities between points assigned to the same cluster are smaller than those between points in different clusters.
Combinatorial algorithms: Work directly on the observed data, without any probability model.
Mixture modeling: Assumes the data are an i.i.d. sample from some population with a probability density function, which is expressed by a parameterized model.
Mode seeking ("bump hunting"): Directly estimates distinct modes of the probability density function, taking a nonparametric perspective.
Dissimilarity
How can we define similarity?
Most clustering methods take a dissimilarity matrix as input. Depending on the attribute type, we choose a different form for the dissimilarity function $d$.
$d_j(x_{ij}, x_{i'j}) = (x_{ij} - x_{i'j})^2$ is the most common per-attribute dissimilarity.
Quantitative variables: $d(x_i, x_{i'}) = l(|x_i - x_{i'}|)$ for a monotone increasing function $l$.
Ordinal variables: replace the $M$ ordered values by $\frac{i - 1/2}{M},\; i = 1, \ldots, M$, then treat them as quantitative.
Categorical variables: use an $M \times M$ symmetric matrix with elements $L_{rr'} = 1$ for all $r \neq r'$ and $L_{rr} = 0$.
Minkowski distance: $d(x, y) = \left\{ \sum_{i=1}^{n} |x_i - y_i|^p \right\}^{1/p}$ ($p = 2$: Euclidean)
How can we combine the dissimilarities of all attributes into a single measure?
They are typically combined as a weighted sum, $D(x_i, x_{i'}) = \sum_{j=1}^{p} w_j\, d_j(x_{ij}, x_{i'j})$ with $\sum_j w_j = 1$. The influence of $X_j$ on $D(x_i, x_{i'})$ depends on its relative contribution to the average measure over all pairs of observations.
Because $w_j \cdot \bar{d}_j$ is the relative influence of the $j$th variable, where $\bar{d}_j$ is the average dissimilarity on that attribute, setting $w_j \sim 1/\bar{d}_j$ gives all attributes equal influence.
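As a concrete illustration, here is a minimal NumPy sketch of this weighting scheme; the function name `weighted_dissimilarity_matrix` and the choice of squared differences for $d_j$ are my own assumptions, not part of the original text.

```python
import numpy as np

def weighted_dissimilarity_matrix(X):
    """Pairwise dissimilarities with w_j ~ 1/d_bar_j, so that every attribute
    has equal average influence. Uses d_j(a, b) = (a - b)^2 per attribute."""
    diffs = (X[:, None, :] - X[None, :, :]) ** 2   # shape (n, n, p): d_j for every pair
    d_bar = diffs.mean(axis=(0, 1))                # d_bar_j: average d_j over all pairs
    w = 1.0 / d_bar                                # w_j ~ 1/d_bar_j
    w = w / w.sum()                                # normalize so the weights sum to 1
    return (diffs * w).sum(axis=2)                 # D(x_i, x_i') = sum_j w_j d_j(x_ij, x_i'j)

# Example: two attributes on very different scales still contribute equally on average
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
print(weighted_dissimilarity_matrix(X))
```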
What are the desired properties of a dissimilarity function?
Symmetry: $d(x, y) = d(y, x)$ 👏 Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
Positive separability: $d(x, y) = 0$ if and only if $x = y$ 👏 Otherwise there would be objects that are different but that you cannot tell apart.
Triangle inequality: $d(x, y) \le d(x, z) + d(z, y)$ 👏 Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
Combinatorial Algorithm
This algorithm assigns each observation to a cluster without regard to a probability model describing the data.
$W(C)$ stands for the within-cluster point scatter, $W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i')=k} d(x_i, x_{i'})$, and $B(C)$ for the between-cluster point scatter. Since their sum (the total point scatter) is constant, minimizing $W(C)$ is equivalent to maximizing $B(C)$. A purely combinatorial algorithm would have to evaluate every possible assignment, which is computationally prohibitive.
The number of possible assignments grows extremely rapidly, so we do not want to adopt this exhaustive approach for large $N$ and $K$.
K-means Algorithm (Combinatorial)
$\bar{x}_k = (\bar{x}_{1k}, \ldots, \bar{x}_{pk})$ is the mean vector associated with the $k$th cluster; the cluster centers are also written as $c_1, \ldots, c_k$.
$N_k = \sum_{i=1}^{N} I(C(i) = k)$, where the assignment $C(i)$ is also written as $\pi(i)$.
After defining a dissimilarity function, we can define the distortion measure that we want to minimize in order to find the closest centroids: $J = \sum_{i=1}^{N} \| x_i - c_{\pi(i)} \|^2$.
However, minimizing this exactly is an NP-hard problem, so we need to adopt an iterative optimization approach.
Algorithm [iterative descent algorithm]
🌻 Initialization: Initialize $k$ cluster centers $\{c_1, \cdots, c_k\}$
🌻 Cluster assignment (calculate the distances): $\pi(i) = \arg\min_{j=1,\cdots,k} \| x_i - c_j \|^2$
🌻 Center adjustment (update the cluster centers): $c_j = \frac{1}{|\{i : \pi(i) = j\}|} \sum_{i : \pi(i) = j} x_i$
Repeat the assignment and adjustment steps until there is no change in $c_j$.
The K-means algorithm is usually used to find clusters in unlabeled data. However, it can also be used to classify a target variable from labeled data (a sketch follows the steps below).
Apply the K-means algorithm to the training data of each class separately, using R prototypes per class.
Assign a class label to each of the K × R prototypes. Here, K is the number of classes and R is the number of prototypes per class.
Classify a new feature x to the class of the closest prototype.
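A minimal sketch of this prototype classifier, assuming scikit-learn's KMeans is available; the class name `PrototypeClassifier` is my own, not a standard API.

```python
import numpy as np
from sklearn.cluster import KMeans

class PrototypeClassifier:
    """Fit R K-means prototypes per class, then classify by the nearest prototype."""
    def __init__(self, R=5):
        self.R = R

    def fit(self, X, y):
        protos, labels = [], []
        for cls in np.unique(y):
            # R prototypes from K-means run on this class's training data only
            km = KMeans(n_clusters=self.R, n_init=10).fit(X[y == cls])
            protos.append(km.cluster_centers_)
            labels += [cls] * self.R              # each prototype keeps its class label
        self.prototypes_ = np.vstack(protos)      # shape (K * R, p)
        self.labels_ = np.array(labels)
        return self

    def predict(self, X):
        # assign each new point to the class of its closest prototype
        d = ((X[:, None, :] - self.prototypes_[None, :, :]) ** 2).sum(axis=2)
        return self.labels_[d.argmin(axis=1)]
```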

The K-means classifier above has the shortcoming that, for each class, the other classes have no say in the positioning of that class's prototypes. Learning Vector Quantization is used to correct the prototypes after they have been selected, using points from all classes.
Learning Vector Quantization (LVQ)
🌻 Initialization: Initialize R prototypes for each class: $m_1(k), \cdots, m_R(k)$
🌻 Sampling and adjusting: Sample a training point $x_i$ randomly with replacement, and let $(j, k)$ index the closest prototype $m_j(k)$ to $x_i$.
🌻 If $g_i = k$ (the labels agree), move the prototype toward the point: $m_j(k) \leftarrow m_j(k) + \epsilon (x_i - m_j(k))$
🌻 If $g_i \neq k$ (the labels disagree), move the prototype away from the point: $m_j(k) \leftarrow m_j(k) - \epsilon (x_i - m_j(k))$
Repeat the second step, decreasing the learning rate $\epsilon$ toward 0 (a code sketch follows).
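Below is a minimal NumPy sketch of these LVQ updates; it assumes the prototypes have already been initialized (e.g., by running K-means within each class), and the function name and geometric learning-rate decay are my own choices.

```python
import numpy as np

def lvq(X, y, prototypes, proto_labels, epsilon=0.1, n_iter=10000, decay=0.999, seed=0):
    """prototypes: (K*R, p) initial prototypes; proto_labels: (K*R,) class of each prototype."""
    prototypes = prototypes.astype(float)
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        i = rng.integers(len(X))                          # sample a training point with replacement
        d = ((prototypes - X[i]) ** 2).sum(axis=1)
        j = d.argmin()                                    # index of the closest prototype
        sign = 1.0 if proto_labels[j] == y[i] else -1.0   # toward if labels agree, away otherwise
        prototypes[j] += sign * epsilon * (X[i] - prototypes[j])
        epsilon *= decay                                  # shrink the learning rate toward 0
    return prototypes
```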
Implementation From Scratch
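The original code for this section is not reproduced here, so the following is a minimal from-scratch NumPy sketch of the K-means loop described above; the function name `kmeans` and the convergence check are my own choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: alternate the cluster assignment and center adjustment steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # initialize k cluster centers
    for _ in range(n_iter):
        # cluster assignment: pi(i) = argmin_j ||x_i - c_j||^2
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        # center adjustment: c_j = mean of the points assigned to cluster j
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):                     # stop when the centers no longer move
            break
        centers = new_centers
    return assign, centers
```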

Gaussian Mixtures
Setting:
$Z \in \{e_1, e_2, \cdots, e_K\}$, $e_1 = [1\; 0\; \cdots\; 0]^T$ (a one-hot indicator of the cluster)
$p(X \mid Z = e_k) = N(X \mid \mu_k, \Sigma_k)$ = pdf of a normal distribution
$p(Z = e_k) = \pi_k$ = probability of selecting each cluster
$p(X) = \sum_{Z} p(Z)\, p(X \mid Z) = \sum_{k=1}^{K} \pi_k\, N(X \mid \mu_k, \Sigma_k)$
$p(X, Z = e_k) = p(Z)\, p(X \mid Z) = \pi_k\, N(X \mid \mu_k, \Sigma_k)$
$p(Z = e_k \mid X) = \dfrac{p(Z)\, p(X \mid Z)}{p(X)} = \dfrac{\pi_k\, N(X \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, N(X \mid \mu_j, \Sigma_j)} = r(z_{nk})$

Algorithm [EM Algorithm]
🌻 Initialization: Initialize the means $\mu_k$, covariances $\Sigma_k$, and mixing coefficients $\pi_k$.
🌻 E step: Evaluate $r(z_{nk}) = \dfrac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j N(x_n \mid \mu_j, \Sigma_j)}$
🌻 M step: Re-estimate the parameters using $r(z_{nk})$
$\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} r(z_{nk})\, x_n$
$\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} r(z_{nk})\, (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T$
$\pi_k^{new} = \frac{N_k}{N}$, where $N_k = \sum_{n=1}^{N} r(z_{nk})$
🌻 Evaluation
Evaluate the log likelihood
$\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k) \right\}$
Check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to the E step. $r(z_{nk})$ is called the responsibility. A code sketch follows.
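A minimal NumPy/SciPy sketch of this EM loop, using scipy.stats.multivariate_normal for the Gaussian densities; the function name, the random initialization, and the small ridge added to the covariances for numerical stability are my own choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, tol=1e-6, seed=0):
    n, p = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, K, replace=False)]             # initialize means from random points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(p) for _ in range(K)])
    pi = np.full(K, 1.0 / K)                            # equal mixing coefficients
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E step: responsibilities r(z_nk)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])     # shape (n, K)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate the parameters using r(z_nk)
        Nk = r.sum(axis=0)
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(p)
        pi = Nk / n
        # Evaluation: log likelihood; stop when it no longer improves
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, r
```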

K-medoids
K-means clustering has the following two flaws.
[Lack of Robustness] Squared Euclidean distance places the highest influence on the largest distances, leading to a lack of robustness against outliers.
[Only for Numerical Variables] It is impossible to compute an arithmetic mean for a categorical feature.
K-medoids can be used to handle these two problems. In this approach:
$r_{ij} = 1$ if $x_i$ belongs to cluster $j$ (otherwise the value is 0).
$D(x_i, \mu_j)$ is a general dissimilarity between $x_i$ and the cluster center $\mu_j$, which is restricted to be one of the observations (the medoid).
Algorithm
🌻 Initialization: a set of cluster centers $\{m_1, \ldots, m_K\}$ and an assignment $C$
🌻 Center (medoid) update: for each cluster, find the observation that minimizes the total dissimilarity to the other points in that cluster,
$i_k^* = \arg\min_{i: C(i)=k} \sum_{C(i')=k} D(x_i, x_{i'})$, and set $m_k = x_{i_k^*}$
🌻 Cluster assignment: $C(i) = \arg\min_{1 \le k \le K} D(x_i, m_k)$
Iterate the update and assignment steps until the assignments do not change.
In this algorithm, we do not need to compute cluster means, so we can also use categorical variables. We just need to keep track of the indices $i_k^*$ (a code sketch follows).
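A minimal sketch of this K-medoids loop, operating directly on a precomputed dissimilarity matrix D (which can come from categorical data); the function name and the random initialization are my own choices.

```python
import numpy as np

def kmedoids(D, K, n_iter=100, seed=0):
    """D: (n, n) precomputed dissimilarity matrix; returns assignments and medoid indices."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=K, replace=False)        # initial medoid indices i_k*
    assign = D[:, medoids].argmin(axis=1)                 # initial assignment C(i)
    for _ in range(n_iter):
        # medoid update: point with the smallest total dissimilarity within each cluster
        for k in range(K):
            members = np.where(assign == k)[0]
            if len(members) == 0:
                continue                                  # keep the old medoid if a cluster is empty
            medoids[k] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        # cluster assignment: C(i) = argmin_k D(x_i, m_k)
        new_assign = D[:, medoids].argmin(axis=1)
        if np.array_equal(new_assign, assign):            # stop when the assignments do not change
            break
        assign = new_assign
    return assign, medoids
```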
Hierarchical Clustering
Hierarchy in clustering means that at the lowest level each cluster contains only a single point, while at the highest level there is only one cluster containing all of the data. Hierarchical clustering comes in two flavors:
Agglomerative (bottom-up): starts at the bottom and, at each level, recursively merges a selected pair of clusters into a single cluster.
Divisive (top-down): starts at the top and, at each level, recursively splits one of the existing clusters at that level into two new clusters.
A dendrogram is usually used as the visualization tool. I only cover the bottom-up (agglomerative) approach in this chapter. The algorithm is as follows.
Algorithm
🌻 Assigning: Assign each data point to its own cluster, $g_1 = \{x_1\}, \ldots, g_m = \{x_m\}$
Let $G = \{g_1, \ldots, g_m\}$
🌻 Do:
Find the two clusters to merge: $(i, j) = \arg\min_{1 \le i, j \le |G|,\; i \neq j} D(g_i, g_j)$
Merge the two clusters into a new cluster: $g \leftarrow g_i \cup g_j$
Remove the merged clusters: $G \leftarrow G \setminus \{g_i\}$, $G \leftarrow G \setminus \{g_j\}$
Add the new cluster: $G \leftarrow G \cup \{g\}$
🌻 While $|G| > 1$


Single Linkage: $d_{SL}(G, H) = \min_{i \in G,\, i' \in H} d_{ii'}$
Complete Linkage: $d_{CL}(G, H) = \max_{i \in G,\, i' \in H} d_{ii'}$
Group Average: $d_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}$, where $N_G, N_H$ are the numbers of observations in each group. (This method is similar in spirit to K-means clustering.) A usage sketch follows.
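For a quick illustration, the agglomerative procedure and these three linkage rules are available in SciPy's scipy.cluster.hierarchy module; a minimal usage sketch on made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])  # two made-up blobs

# 'single', 'complete', and 'average' correspond to the three linkage rules above
Z = linkage(X, method='average')                 # merge history, one row per merge
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into 2 clusters
print(labels)
```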

Nearest Neighbor
Nearest neighbor methods classify or predict the target value of a new point $x$ by leveraging the data points near that new point, effectively weighting each training point by its distance to $x$. (A minimal sketch is given below.)
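A minimal k-NN classification sketch in NumPy; the function name and the simple majority vote (which assumes integer class labels) are my own choices.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=1):
    """Classify each row of X_new by majority vote among its k nearest training points."""
    d = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)   # squared Euclidean distances
    nearest = np.argsort(d, axis=1)[:, :k]                             # indices of the k closest points
    votes = y_train[nearest]                                           # their labels, shape (n_new, k)
    # majority vote; ties are broken toward the smallest label (np.bincount + argmax)
    return np.array([np.bincount(v).argmax() for v in votes])
```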
Our goal is to choose the number of neighbors $k$ large enough to reduce the error, yet small enough to give an accurate local estimate. I am going to show that, within the class of $k$-NN rules, the single nearest neighbor rule is admissible, by showing that there exists no $k$-NN rule with $k \neq 1$ that has a lower probability of error against all distributions for the $n$-sample problem.
Consider the situation where the prior probabilities are $\eta_1 = \eta_2 = \frac{1}{2}$ and the data points of the two classes are well separated. There are only two categories, so each data point is assigned to category 1 or 2. The problem can also be thought of as a binomial situation in which each trial is either a success or a failure. Under this condition, I will show that the 1-NN rule is strictly better than the $k$-NN rule in the case where the densities are clearly separated.
The probability that $j$ individuals come from category 1: $\binom{n}{j} \left(\frac{1}{2}\right)^n$
$P_e(1; n)$, the error probability of the 1-NN rule, is the probability that all $n$ points lie in category 2: $\left(\frac{1}{2}\right)^n$
$P_e(k; n) = \left(\frac{1}{2}\right)^n \sum_{j=0}^{k_0} \binom{n}{j}$, where $k = 2k_0 + 1$. It is the probability that $k_0$ or fewer points lie in category 1.
When the new data point $x$ actually belongs to category 1 and the classes are well separated, the $k$-NN rule misclassifies $x$ only if at most $k_0$ of the $n$ sample points lie in category 1, so that category 2 wins the majority vote among the $k = 2k_0 + 1$ nearest neighbors. This is the binomial event of $k_0$ or fewer "category 1" outcomes out of $n$ trials.
$x^*$ is the nearest neighbor of $x$, and $k^*$ is the category to which $x^*$ belongs. The conditional NN risk is $r(x, x^*) = p_1(x)\, p_2(x^*) + p_2(x)\, p_1(x^*)$, the probability that $x$ and $x^*$ carry different category labels (with $p_k(\cdot)$ the posterior class probabilities used below).
Here $\nu$ is the probability measure of $x$, and we consider points $x$ whose every neighborhood has nonzero measure, so that the nearest neighbor $x^*$ converges to $x$ as $n$ grows.
Bayesian Procedure
In a Bayesian view, $p(k \mid x) = p_k(x) = \dfrac{p(x \mid k)\, p(k)}{p(x)}$ is the probability that the individual $x$ belongs to class $k$, where $p(k)$ is the prior probability. $L(k, j)$ is the loss incurred by assigning an individual from category $k$ to category $j$. The conditional loss of assigning $x$ to category $j$ is $r_j(x) = \sum_{k} p_k(x)\, L(k, j)$.
The conditional loss $r_j(x)$ is minimized when the individual is assigned to the category $j$ for which $p_j(x)$ is largest (for 0-1 loss, $r_j(x) = 1 - p_j(x)$). The conditional Bayes risk is $r^*(x) = \min_j r_j(x) = \min[p_1(x), p_2(x)]$ in the two-category case.
In the limit, the nearest neighbor $x^*$ converges to $x$, so the conditional NN risk becomes $r(x) = 2 p_1(x)(1 - p_1(x)) = 2 r^*(x)(1 - r^*(x))$ by the symmetry of $r^*(x)$.
Comparison of errors
The overall NN risk $R$ is the expectation of the conditional risk over $x$: $R = E[r(X)] = E\left[ 2 r^*(X)\left(1 - r^*(X)\right) \right]$.
Since $E[r^*(X)^2] \ge \left(E[r^*(X)]\right)^2 = R^{*2}$, it implies $R \le 2 R^*(1 - R^*)$. When the problem is extended from 2 classes to $K$ classes, the inequality becomes $R^* \le R \le R^* \left( 2 - \frac{K}{K-1} R^* \right)$.
Practical Issues
How can we decide the number of clusters?
Gap Statistic
We can leverage the within-cluster dissimilarity $W_K$.
$K \in \{1, 2, \ldots, K_{max}\}$, $W = \{W_1, W_2, \ldots, W_{K_{max}}\}$
$G(K) = E_n^*[\log W_K] - \log W_K$ is the gap between the observed $\log W_K$ and its expectation under a uniform reference distribution, and $s'_{K+1}$ is the corresponding simulation standard error; we choose $K^* = \arg\min_K \{ K \mid G(K) \ge G(K+1) - s'_{K+1} \}$, i.e., the smallest $K$ whose gap is within one standard error of the gap at $K+1$. A code sketch follows.
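A rough sketch of the gap statistic computation using scikit-learn's KMeans, a uniform reference distribution over the bounding box of the data, and the K-means inertia as $W_K$; the function name and the number of reference samples B are my own choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, K_max=8, B=10, seed=0):
    """Return the chosen K*: the smallest K with G(K) >= G(K+1) - s'_{K+1}."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def log_Wk(data, K):
        # within-cluster dispersion via K-means inertia (sum of squared distances to centers)
        return np.log(KMeans(n_clusters=K, n_init=10).fit(data).inertia_)

    G, s = np.zeros(K_max), np.zeros(K_max)
    for K in range(1, K_max + 1):
        ref = np.array([log_Wk(rng.uniform(lo, hi, size=X.shape), K) for _ in range(B)])
        G[K - 1] = ref.mean() - log_Wk(X, K)           # gap: E*[log W_K] - log W_K
        s[K - 1] = ref.std() * np.sqrt(1 + 1.0 / B)    # s'_K, adjusted simulation standard error
    for K in range(1, K_max):
        if G[K - 1] >= G[K] - s[K]:                    # G(K) >= G(K+1) - s'_{K+1}
            return K
    return K_max
```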


Reference [Gap statistic]
Silhouette value
We can also use the Silhouette value $S_i = \dfrac{b_i - a_i}{\max(a_i, b_i)}$. Here $a_i$ is the average distance from the $i$th data point to the other points in the same cluster, and $b_i$ is the minimum, over the other clusters, of the average distance from the $i$th point to the points in that cluster. When the groups are well clustered, $b_i - a_i$ should be large, so $S_i$ is close to 1. A usage sketch follows.
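A quick usage sketch: scikit-learn's silhouette_score returns the mean of the $S_i$ values, which can be compared across candidate numbers of clusters; the sample data here is made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 4, 8)])   # three made-up blobs

# pick the K whose clustering gives the highest mean silhouette value
for K in range(2, 6):
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(X)
    print(K, silhouette_score(X, labels))
```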
