Clustering & Nearest Neighbor

Clustering is a method that groups similar observations together. For example, we can cluster customer types based on several variables reflecting patterns in consumption. In matrix terms, we want to group similar rows of the data matrix. Prototype methods are usually adopted for clustering: each observation is assigned to its closest prototype (centroid, medoid, etc.). "Closest" is usually defined by Euclidean distance in feature space, after each feature has been standardized to have overall mean 0 and variance 1 in the training sample.

  • Prototype methods work well for capturing the distribution of each class.

  • How many prototypes to use and where to put them are the main issues.

🕺 Goal: Partition the observations into clusters so that the pairwise dissimilarities between points in the same cluster are smaller than those between points in different clusters.

  • Combinatorial algorithms: work directly on the observed data without any probability model.

  • Mixture modeling: assume the data are an i.i.d. sample from some population with a probability density function, expressed by a parameterized model.

  • Mode seeking ("bump hunting"): directly estimate distinct modes of the probability density function, taking a nonparametric perspective.

Dissimilarity

How can we define dissimilarity?

D(x_i,x_{i'})=\sum^p_{j=1}d_j(x_{ij},x_{i'j})

Most clustering methods take a dissimilarity matrix as input. Depending on the attribute type, we choose a different form for the function d_j.

  • The most common choice is the squared difference: d_j(x_{ij},x_{i'j})=(x_{ij}-x_{i'j})^2

  • Quantitative variables: d(x_i,x_{i'})=l(|x_i-x_{i'}|) for some function l of the absolute difference.

  • Ordinal variables: replace the M ordered values by \frac{i-1/2}{M}, \; i=1,...,M and treat them as quantitative.

  • Categorical variables: an M \times M loss matrix with elements L_{rr'}=1 for all r \neq r' (and L_{rr}=0).

  • Minkowski distance: d(x,y)=\{\sum^n_{i=1} |x_i-y_i|^p\}^{1/p} (p=2: Euclidean)

How can we combine the dissimilarities across all attributes?

D(x_i,x_{i'})=\sum^p_{j=1}w_j \cdot d_j(x_{ij},x_{i'j}); \; \sum^p_{j=1}w_j=1

The influence of X_j on D(x_i,x_{i'}) depends upon its relative contribution to the average measure over all pairs of observations.

\bar{D}=\dfrac{1}{N^2}\sum^N_{i=1}\sum^N_{i'=1}D(x_i,x_{i'})=\sum^p_{j=1}w_j\cdot \bar{d}_j, \; \bar{d}_j=\dfrac{1}{N^2}\sum^N_{i=1}\sum^N_{i'=1}d_j(x_{ij},x_{i'j})

Because w_j\cdot \bar{d}_j is the relative influence of the jth variable, setting w_j \sim 1/\bar{d}_j gives all attributes equal influence.
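As a quick illustration of this weighting idea, here is a minimal sketch (the toy feature matrix X and the use of squared differences per attribute are my assumptions, not from the text above):

import numpy as np

def weighted_dissimilarity(X):
    # Pairwise dissimilarity with w_j ~ 1/d_bar_j so that every attribute has equal influence
    d = (X[:, None, :] - X[None, :, :]) ** 2   # d[i, i', j], shape (N, N, p)
    d_bar = d.mean(axis=(0, 1))                # average dissimilarity of each attribute
    w = 1.0 / d_bar
    w = w / w.sum()                            # normalize so the weights sum to 1
    return (d * w).sum(axis=2)                 # D(x_i, x_i') = sum_j w_j d_j(x_ij, x_i'j)

X = np.random.rand(6, 3)                       # 6 observations, 3 attributes
D = weighted_dissimilarity(X)                  # 6 x 6 dissimilarity matrix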

What are the desired properties of a dissimilarity function?

  • Symmetry: d(x,y)=d(y,x) 👏 Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."

  • Positive separability: d(x,y)=0 if and only if x=y 👏 Otherwise there are objects that are different but that you cannot tell apart.

  • Triangle inequality: d(x,y) \leq d(x,z)+d(z,y) 👏 Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."

Combinatorial Algorithm

This algorithm assigns each observation to a cluster without regard to a probability model describing the data.

W(C)=\dfrac{1}{2}\sum^K_{k=1}\sum_{C(i)=k}\sum_{C(i')=k} d(x_i,x_{i'})
T=\dfrac{1}{2}\sum^N_{i=1}\sum^N_{i'=1} d_{ii'}=\dfrac{1}{2}\sum^K_{k=1} \sum_{C(i)=k} \Big( \sum_{C(i')=k} d_{ii'}+\sum_{C(i')\neq k}d_{ii'} \Big) \\ T=W(C)+B(C),\;B(C)=\dfrac{1}{2} \sum^K_{k=1} \sum_{C(i)=k} \sum_{C(i')\neq k} d_{ii'}

W(C) is the within-cluster scatter and B(C) the between-cluster scatter. Our aim is to minimize W(C), which is equivalent to maximizing B(C) because the total T is fixed. A purely combinatorial algorithm would have to evaluate every possible assignment, which is computationally hopeless: the number of distinct assignments is

S(N,K)=\dfrac{1}{K!}\sum^K_{k=1}(-1)^{K-k} {K \choose k} k^N

This value grows extremely rapidly, so exhaustive search over all assignments is infeasible except for very small N and K.
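As a sanity check, the count can be evaluated directly from the formula above (a small helper; the particular N and K are only for illustration):

from math import comb, factorial

def S(N, K):
    # Number of ways to partition N items into K non-empty clusters
    return sum((-1) ** (K - k) * comb(K, k) * k ** N for k in range(1, K + 1)) // factorial(K)

print(S(10, 4))   # 34105
print(S(19, 4))   # around 10^10 -- already hopeless to enumerate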

K-means Algorithm (Combinatorial)

d(x_i,x_{i'})=\sum^p_{j=1}(x_{ij}-x_{i'j})^2=||x_i-x_{i'}||^2 \\ W(C)=\dfrac{1}{2} \sum^K_{k=1} \sum_{C(i)=k} \sum_{C(i')=k} ||x_i-x_{i'}||^2=\sum^K_{k=1}N_k \sum_{C(i)=k} ||x_i-\bar{x}_k||^2
  • \bar{x}_k=(\bar{x}_{1k},...,\bar{x}_{pk}) is the mean vector associated with the kth cluster. The cluster centers are also written as c_1,...,c_k.

  • N_k=\sum^N_{i=1}I(C(i)=k); the assignment C(i) is also written as \pi(i).

After we define the dissimilarity function, we can define the distortion measure that we want to minimize in order to find the closest centroids.

J=\sum^N_{i=1}\sum^k_{j=1}I(\pi(i)=j)\,||x_i-c_j||^2

However, minimizing this exactly is an NP-hard problem, so we adopt an iterative optimization approach.

Algorithm [Iterative Descent]

🌻 Initialization: Initialize k cluster centers \{c_1,\cdots,c_k\}

🌻 Cluster assignment (calculate the distances): \pi(i)=argmin_{j=1,\cdots,k}||x_i-c_j||^2

🌻 Center adjustment (update the cluster centers): c_j=\dfrac{1}{|\{i:\pi(i)=j\}|}\sum_{i:\pi(i)=j}x_i

Repeat the assignment and adjustment steps until there is no change in c_j.

The K-means algorithm is usually used to find clusters in unlabeled data. However, it can also be used to classify a target variable from labeled data, as in the sketch after the following steps.

  1. Apply the K-means algorithm to the training data of each class separately, using R prototypes per class.

  2. Assign a class label to each of the K \times R prototypes. Here K is the number of classes and R is the number of prototypes per class.

  3. Classify a new feature vector x to the class of the closest prototype.
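A minimal sketch of this prototype classifier, assuming scikit-learn's KMeans is available (the helper names and the choice R=5 are illustrative, not from the original text):

import numpy as np
from sklearn.cluster import KMeans

def fit_prototypes(X_train, y_train, R=5):
    # Run K-means separately within each class and label the resulting prototypes
    prototypes, labels = [], []
    for cls in np.unique(y_train):
        km = KMeans(n_clusters=R, n_init=10, random_state=0).fit(X_train[y_train == cls])
        prototypes.append(km.cluster_centers_)
        labels.extend([cls] * R)
    return np.vstack(prototypes), np.array(labels)

def classify(x, prototypes, labels):
    # Assign x to the class of the closest prototype
    return labels[np.argmin(((prototypes - x) ** 2).sum(axis=1))]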

This K-means classification approach has the shortcoming that, for each class, the other classes do not have a say in the positioning of that class's prototypes. Learning Vector Quantization corrects the prototypes after they have been selected, using observations from all classes.

Learning Vector Quantization (LVQ)

🌻 Initialization: Initialize R prototypes for each class: m_1(k),\cdots,m_R(k)

🌻 Sampling and adjusting: Sample a training point x_i randomly with replacement, and let (j,k) index the closest prototype m_j(k) to x_i.

🌻 If g_i=k (the classes agree), m_j(k) \leftarrow m_j(k)+\epsilon(x_i-m_j(k))

🌻 If g_i \neq k, m_j(k) \leftarrow m_j(k)-\epsilon(x_i-m_j(k))

Repeat the sampling step, decreasing the learning rate \epsilon toward 0 with each iteration.
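A minimal LVQ sketch following these steps, assuming the prototypes and their class labels have already been initialized (e.g., by class-wise K-means); the linear learning-rate decay is an illustrative choice:

import numpy as np

def lvq(X, y, prototypes, proto_labels, epsilon=0.1, n_iter=10000, seed=0):
    rng = np.random.default_rng(seed)
    protos = prototypes.copy().astype(float)
    for t in range(n_iter):
        i = rng.integers(len(X))                            # sample a training point with replacement
        j = np.argmin(((protos - X[i]) ** 2).sum(axis=1))   # index of the closest prototype
        step = epsilon * (1 - t / n_iter)                   # decrease the learning rate toward 0
        if proto_labels[j] == y[i]:
            protos[j] += step * (X[i] - protos[j])          # move the prototype toward the point
        else:
            protos[j] -= step * (X[i] - protos[j])          # move the prototype away from the point
    return protos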

Implementation From Scratch

import numpy as np
from PIL import Image
from random import seed, sample
import matplotlib.pyplot as plt

img_m = np.array(Image.open('data/myself.png'))  # load the image as a pixel array
img_m = np.delete(img_m,obj=3,axis=2)            # drop the alpha channel, keeping RGB


def d(x, y, method='l2'):
    if method == 'l2': return (x-y)**2
    raise ValueError('unknown method: ' + str(method))

class clustering():
    def __init__(self, data, data_dim):
        self.reshaped = data.reshape(-1,data_dim) # Flatten our data
        self.data_dim = data_dim
        
    def init_mean(self, num_cluster):
        seed(2020) # Random seed for reproducible result
        self.unique = np.unique(self.reshaped, axis=0)
        idx = sample(range(0,len(self.unique)), num_cluster) # Index sampling 
        self.ux = self.unique[idx] # Initializing value within data matrix
        ## Poor initialization
        # self.ux = np.array([np.random.uniform(0,1,self.data_dim) for num in range(num_cluster)])
        self.num_cluster = num_cluster

    def kmeans(self):
        # Second step: make a cluster variable "r"
        self.u_pre = (self.ux)+0.1 # Offset so the first convergence check always fails
        while np.sum(abs(self.u_pre - self.ux))>0.08:
            self.u_pre = (self.ux).copy()
            r = np.array([np.argmin(np.sum(d(x,self.ux,'l2'), axis=1)) for x in self.reshaped]) # r = array([k0,k2,k1,k1,...])
            # Third step: make a mean vector "ux"
            self.ux = [] # ux = [[k0 mean vec], [k1 mean vec], ...]
            for k in range(0,self.num_cluster):
                rk = (r==k).astype(int).reshape(-1,1) # rk = [[1],[0],[0],[0],...]] for np multiplication
                # Avoiding loop statement leads to reducing the time complexity
                u = (self.reshaped*rk).sum(axis=0)/(rk).sum(axis=0) # This solution is already given in our lecture.
                (self.ux).append(list(u)) # Binding u together.
            self.ux = np.array(self.ux)
        return(r, self.ux)

def kmeans_vis(data, num_cluster):
    km = clustering(data, 3)
    km.init_mean(num_cluster)
    r, uk = km.kmeans()
    reps = data.copy().reshape(-1,3)
    for i in range(num_cluster):
        reps[r==i] = uk[i]

    result = reps.reshape(data.shape[0], data.shape[1],3)
    plt.imshow(result, interpolation='nearest')
    plt.show()
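A possible way to call the function above on the loaded image (the number of clusters here is just an example):

kmeans_vis(img_m, 5)   # replaces each pixel with the mean color of its cluster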
Result

Gaussian Mixtures

Setting:

X|Z=e_k \sim N(\mu_k,\Sigma_k), \quad \text{where } \mu_1,...,\mu_K\in \mathbb{R}^d, \; \Sigma_1,...,\Sigma_K\in \mathbb{R}^{d \times d}
  • Z\in\{e_1,e_2,\cdots,e_K\}, \;e_1=[1 \;0 \; \cdots \; 0]^T

  • p(X|Z=e_k)=N(X|\mu_k,\Sigma_k), the pdf of a normal distribution

  • p(Z=e_k)=\pi_k, the probability of selecting each cluster

  • p(X)= \sum_Z p(Z)\, p(X|Z)=\sum^K_{k=1} \pi_k\, N(X|\mu_k,\Sigma_k)

  • p(X,Z=e_k)=p(Z=e_k)\, p(X|Z=e_k)=\pi_k N(X|\mu_k,\Sigma_k)

  • p(Z=e_k|X)=\dfrac{p(Z=e_k)\, p(X|Z=e_k)}{p(X)}=\dfrac{\pi_k\, N(X|\mu_k,\Sigma_k)}{\sum^K_{j=1} \pi_j\, N(X|\mu_j,\Sigma_j)}=r(z_{nk})

# Imports assumed by the GMM/EM code below
from scipy.stats import multivariate_normal as mvn
from sklearn.preprocessing import StandardScaler

def random_mat(n):
    # Random symmetric positive semi-definite matrix
    x = np.random.normal(0, 0.1, size=(n, n))
    return x@np.transpose(x)

def params_init():
    params = {'pi': np.random.uniform(0,1),
    'mu': np.array([np.random.normal(0, 0.1, size=(5,)) for i in range(2)]), # k=2
    'sigma': np.array([random_mat(5)+np.identity(5) for i in range(2)])} #k=2
    return params

# Draw one set of initial parameters and build the two component distributions
init = params_init()
mu = init['mu']; sigma = init['sigma']
dist = [mvn(mean=mu[i], cov=sigma[i]) for i in range(2)]

# PCA (df is assumed to be the input DataFrame of features)
scaler = StandardScaler()
scaler.fit(df)
X_scaled = scaler.transform(df)
cov_matrix = np.cov(X_scaled.T)
eigvals, eigvecs = np.linalg.eig(cov_matrix)
eigvec_mat = (eigvecs.T[:5]).T  # first five eigenvectors as columns
X_pca = (X_scaled @ eigvec_mat / eigvals[:5]).astype(float)  # project onto them and rescale by the eigenvalues

obj = []

def e_step(x, params, i):
  pi = params['pi']; mu = params['mu']; sigma = params['sigma']
  dist = [mvn(mean=mu[k], cov=sigma[k]) for k in range(2)]
  w = [pi, 1-pi]  # mixing weights of the two components
  r_vec = [[( w[k] * dist[k].pdf(x[n]) ) / ((w[0] * dist[0].pdf(x[n])) + (w[1] * dist[1].pdf(x[n]))) for n in range(x.shape[0])] for k in [0,1]]
  clst = [ int(r_vec[0][n] < r_vec[1][n]) for n in range(x.shape[0])]
  # Originally, calculating the log-likelihood is the step after m_step,
  # but for computational convenience I put it in this e_step.
  log_likelihood = sum([np.log( w[0] * dist[0].pdf(x[n]) + w[1] * dist[1].pdf(x[n])) for n in range(x.shape[0])])
  obj.append(log_likelihood)
  return np.array(r_vec), clst

def m_step(x, params, r_vec):
  N = r_vec.sum(axis=1)  # effective number of points per component
  elem = np.dot(r_vec,x)
  mu = np.array([elem[0]/N[0] , elem[1]/N[1]])
  pi = (N/N.sum())[0]
  sigma = [(1/N[k]) * np.array([ r_vec[k][n] * (np.outer(x[n] - params['mu'][k], x[n] - params['mu'][k]))
  for n in range(x.shape[0])]).sum(axis=0)
  for k in range(2)]
  params = {'pi':pi, 'mu':mu, 'sigma':np.array(sigma)}
  return params
  
i = 0
params = params_init()
while i<50:
    i += 1
    r_vec, clst = e_step(X_pca, params, i)
    params = m_step(X_pca, params, r_vec)
     
fig, ax = plt.subplots()
ax.plot(range(i), obj, marker='.', color='r')
ax.grid(True)
ax.set_xlabel('# of iteration')
ax.set_ylabel('log likelihood')
ax.set_title('log likelihood')
plt.show()
Algorithm [EM Algorithm]

🌻 Initialization: Initialize the means \mu_k, covariances \Sigma_k, and mixing coefficients \pi_k.

🌻 E step. Evaluate r(z_{nk})=\dfrac{\pi_k N(x_n|\mu_k,\Sigma_k)}{\sum^K_{j=1} \pi_j N(x_n|\mu_j, \Sigma_j)}

🌻 M step. Re-estimate the parameters using r(z_{nk})

\mu_k^{new}=\dfrac{1}{N_k}\sum^N_{n=1}r(z_{nk})x_n

\Sigma_k^{new}=\dfrac{1}{N_k}\sum^N_{n=1}r(z_{nk})(x_n-\mu_k^{new})(x_n-\mu_k^{new})^T

\pi_k^{new}=\dfrac{N_k}{N}, \quad \text{where } N_k=\sum^N_{n=1}r(z_{nk})

🌻 Evaluation

Evaluate the log likelihood

\ln p(X|\mu,\Sigma,\pi)=\sum^N_{n=1}\ln\Big\{\sum^K_{k=1}\pi_k N(x_n|\mu_k,\Sigma_k)\Big\}

Check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to the E step. The quantity r(z_{nk}) is called the responsibility.

K-medoids

K-means clustering has the following two flaws.

  • [Lack of robustness] Squared Euclidean distance places the highest influence on the largest distances, so the cluster means are not robust to outliers.

  • [Only for numerical variables] It is impossible to compute an arithmetic mean for a categorical feature.

To handle these two problems, K-medoids can be used. Recall the objective being minimized:

J=\sum^n_{i=1}\sum^K_{j=1}r_{ij}D(x_i,\mu_j)

  • r_{ij}=1 if x_i belongs to cluster j (otherwise the value is 0).

  • In D(x_i,\mu_j), \mu_j is the center of cluster j; in K-medoids it is restricted to be one of the observations assigned to that cluster, and D can be any dissimilarity.

Algorithm

🌻 Initialization: a set of cluster centers \{m_1,...,m_K\} and an initial assignment C

🌻 Center update (find the medoid of each cluster):

i^*_k=argmin_{i:C(i)=k} \sum_{C(i')=k}D(x_i,x_{i'}), \quad m_k=x_{i^*_k}

🌻 Cluster assignment: C(i)=argmin_{1 \leq k \leq K}D(x_i,m_k)

Iterate Assigning/Updating until the assignments do not change.

In this algorithm we never need to compute a cluster mean, so categorical variables can be used as well; we only need to keep track of the medoid indices i_k^*.
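A compact sketch of this procedure on a precomputed dissimilarity matrix D (the random initialization and the helper name are my assumptions):

import numpy as np

def k_medoids(D, K, n_iter=100, seed=0):
    # D: (N, N) dissimilarity matrix; returns assignments C and the medoid indices
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    medoids = rng.choice(N, size=K, replace=False)
    for _ in range(n_iter):
        C = np.argmin(D[:, medoids], axis=1)                # assign each point to its closest medoid
        new_medoids = medoids.copy()
        for k in range(K):
            members = np.where(C == k)[0]
            if len(members) > 0:
                # i*_k: the member with the smallest total dissimilarity to its cluster
                new_medoids[k] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return C, medoids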

Hierarchical Clustering

Hierarchy in clustering means that at the lowest level each cluster contains a single point, while at the highest level there is one cluster containing all of the data. Hierarchical clustering comes in two flavors:

  • Agglomerative (bottom-up): starts at the bottom and at each level recursively merges a selected pair of clusters into a single cluster.

  • Divisive (top-down): starts at the top and at each level recursively splits one of the existing clusters into two new clusters.

A dendrogram is usually used as the visualization tool. I only cover the bottom-up (agglomerative) approach in this chapter. The algorithm is as follows.

Algorithm

🌻 Assigning: Assign each data point to its own cluster, g_1 = \{x_1\},...,g_m=\{x_m\}

Let G=\{g_1,...,g_m\}

🌻 Do:

  • Find the two closest clusters to merge: (i,j)=argmin_{1\leq i < j \leq |G|} D(g_i,g_j)

  • Merge the two clusters into a new cluster: g \leftarrow g_i \cup g_j

  • Remove the merged clusters: G \leftarrow G \setminus \{g_i\}, \; G \leftarrow G \setminus \{g_j\}

  • Add the new cluster: G \leftarrow G \cup \{g\}

🌻 While |G|>1

Example - Georgia Tech 2021 Fall ISYE 6740 George Lan

  • Single linkage: d_{SL}(G,H)=\min_{i\in G, i' \in H} d_{ii'}

  • Complete linkage: d_{CL}(G,H)=\max_{i\in G, i' \in H} d_{ii'}

  • Group average: d_{GA}(G,H)=\dfrac{1}{N_GN_H}\sum_{i \in G} \sum_{i' \in H} d_{ii'}, where N_G, N_H are the numbers of observations in each group. (This criterion is the closest in spirit to K-means clustering.)

A clustering like the one in the example above can be recovered with single linkage, because it merges clusters based on the minimum pairwise distance.
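These linkages are available in SciPy; a small sketch of agglomerative clustering (the toy data and the cut into 3 clusters are assumptions for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

X = np.random.rand(20, 2)                        # toy data
Z = linkage(X, method='single')                  # also: 'complete', 'average'
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
dendrogram(Z)                                    # the usual visualization
plt.show()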

Nearest Neighbor

Nearest neighbor methods classify or predict the target value of a new point x by leveraging the training data near that point. The influence of each data point is regulated by its distance to x.

Our goal is to choose the number of neighbors k large enough to minimize the error, yet small enough to give an accurate local estimate. I am going to show that, within the class of k-NN rules, the single nearest neighbor rule is admissible, by showing that there exists no k-NN rule with k \neq 1 that has a lower probability of error against all distributions for the n-sample problem.

Consider the situation where the prior probabilities are \eta_1 = \eta_2 = \dfrac{1}{2} and the data points of the two classes are well separated. There are only two categories, so each data point is assigned to category 1 or 2. The problem can be viewed as a binomial situation in which each trial is either a success or a failure. Under this condition, I will show that the 1-NN rule is strictly better than the k-NN rule in the case where the densities are clearly separated.

  • The probability that j individuals come from category 1: (\dfrac{1}{2})^n {n \choose j}

  • P_\epsilon(1;n), the probability of the error event that all points lie in category 2: (\dfrac{1}{2})^n

  • P_\epsilon(k;n)=(\dfrac{1}{2})^n\sum^{k_0}_{j=0} {n \choose j}, \;\; \text{s.t.} \; k=2k_0+1. This is the probability that k_0 or fewer points lie in category 1.

When the new data point x actually belongs to category 1, it is assigned to category 1 only if at most k_0 of its k=2k_0+1 nearest neighbors are in category 2; otherwise an error occurs. The error probability P_\epsilon(k;n) above is therefore the probability that k_0 or fewer of the n points lie in category 1.
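A quick numerical check of these error probabilities, simply evaluating the formulas above for a few small n:

from math import comb

def P_err(k, n):
    # P_e(k; n) = (1/2)^n * sum_{j=0}^{k0} C(n, j), with k = 2*k0 + 1
    k0 = (k - 1) // 2
    return (0.5 ** n) * sum(comb(n, j) for j in range(k0 + 1))

for n in [5, 10, 20]:
    print(n, P_err(1, n), P_err(3, n))   # the 1-NN error is never larger than the 3-NN error here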

x^* is the nearest neighbor of x, and k^* is the category to which x^* belongs. The conditional NN risk r(x,x^*) is as follows.

r(x,x^*)=E[L(k,k^*)|x,x^*] =P(k\neq k^*|x,x^*) \\= P(k=1|x)P(k^*=2|x^*)+P(k=2|x)P(k^*=1|x^*)\\=p_1(x)p_2(x^*)+p_1(x^*)p_2(x)
As n grows, the nearest neighbor x^* converges to x, since P(x_o \neq x^*)=(1-\nu(x_o))^n \rightarrow 0; hence r(x,x^*)\rightarrow 2p_1(x)p_2(x).

Here \nu(x) denotes a nonzero probability measure of a neighborhood of the point x.

Bayesian Procedure

In a Bayesian view, p(k|x)=p_k(x)=\dfrac{p(x|k)p(k)}{p(x)}, the probability that the individual x belongs to class k, where p(k) is the prior probability. L(k,j) is the loss incurred by assigning an individual from category k to category j. The conditional loss is as follows.

r_{j}(x)=\sum^K_{k=1}p_k(x)L(k,j)

The conditional loss r_{j}(x) is minimized when the individual is assigned to the category j for which p_j(x) is highest. The conditional Bayes risk is as follows.

r^*(x)=\min_{j}r_{j}(x)=\min\{p_1(x),p_2(x)\}=\min\{p_1(x),1-p_1(x)\}, \quad R^*=E[r^*(x)]

r(x)=2p_1(x)(1-p_1(x))=2r^*(x)(1-r^*(x)), by the symmetry of p(1-p) about p=1/2.

Comparison of errors

The overall NN risk R is as follows.

\begin{align} R &= \lim_n E[r(x,x^*)]=E[\lim_n r(x,x^*)] =E[r(x)] \\&=E[2p_1(x)p_2(x)]=E[2r^*(x)(1-r^*(x))] \\ & = 2R^*(1-R^*)-2Var( r^*(x)) \end{align}

This implies R \leq 2R^*(1-R^*). When the problem is extended from 2 classes to K classes, the inequality changes slightly.

R^*\leq R \leq R^*\Big(2-\dfrac{K}{K-1}R^*\Big)
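As a quick illustrative check with made-up numbers: for K=2 the upper bound reduces to 2R^*(1-R^*), so a Bayes risk of R^*=0.05 gives R \leq 2 \cdot 0.05 \cdot 0.95 = 0.095, i.e., the asymptotic NN risk is at most roughly twice the Bayes risk.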


Practical Issues

How can we decide the number of clusters?

Gap Statistic

We can leverage the within-cluster dissimilarity W_K.

  • K\in \{1,2,...,K_{max}\}, \; W=\{W_1,W_2,...,W_{K_{max}}\}

  • K^*=argmin_K\{K \mid G(K) \geq G(K+1)-s'_{K+1}\}, where G(K) compares \log W_K with its expectation under a uniform reference distribution and s'_{K+1} is the corresponding standard-error term.

Reference [Gap statistic]
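A rough sketch of the gap statistic computation, assuming scikit-learn's KMeans and using its inertia as W_K, with B reference datasets drawn uniformly over the bounding box of the data (a simplified version of the procedure in the reference above):

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, K_max=8, B=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, s = [], []
    for K in range(1, K_max + 1):
        logW = np.log(KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X).inertia_)
        # log W_K for each uniform reference dataset
        ref = [np.log(KMeans(n_clusters=K, n_init=10, random_state=seed)
                      .fit(rng.uniform(lo, hi, size=X.shape)).inertia_) for _ in range(B)]
        gaps.append(np.mean(ref) - logW)                  # G(K)
        s.append(np.std(ref) * np.sqrt(1 + 1 / B))        # s'_K
    for K in range(1, K_max):
        if gaps[K - 1] >= gaps[K] - s[K]:                 # smallest K with G(K) >= G(K+1) - s'_{K+1}
            return K, gaps
    return K_max, gaps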

Silhouette value

We can also use the silhouette value S_i = \dfrac{b_i-a_i}{\max(a_i,b_i)}. Here a_i is the average distance from the ith data point to the other points in the same cluster, and b_i is the minimum average distance from the ith point to the points in a different cluster. When the groups are well clustered, b_i-a_i should be large.
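For example, with scikit-learn (X and the cluster labels are assumed to come from a clustering fitted earlier, e.g., KMeans):

from sklearn.metrics import silhouette_score, silhouette_samples

avg_sil = silhouette_score(X, labels)        # mean S_i over all points
per_point = silhouette_samples(X, labels)    # individual S_i values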
