Clustering & Nearest Neighbor

Clustering is the method that binding similar group together. We can cluster customer type based on several variables reflecting some pattern in consumption. In a mathematical expression, we need to bind similar rows of the data matrix. For clustering, usually prototype methods are adopted. Prototype methods are the way to assign each observation to its closest prototype (centroid, medoid, etc.) "Closest" is usually defined by Euclidean distance in a feature space, after each feature has been standardized to have overall mean 0 and variance 1 in the training sample.
It works well for capturing the distribution of each class.
How many prototypes to use and where to put them are the main issues.
🕺 Goal: Partition the observations into clusters, so that the pairwise dissimilarities between points assigned to the same cluster becomes smaller than points in different clusters.
Combinatorial Algorithm: Work without any probability model.
Mixture modeling: Assume the data is an i.i.d sample from some population with probability function. The density function is expressed by a parameterized model.
Mode seeking("bump hunters"): Directly estimate distinct modes of the probability density function. It takes a nonparametric perspective.
Dissimilarity
How can we define the similarity?
Most of clustering methods take a dissimilarity matrix. By the attribute type, we can differently choose the shape of this function .
Quantitative variables.
Ordinal variables.
Categorical variables: matrix with elements for all
Minkowski distance: (p=2: Euclidian)
How can we integrate all dissimilarity of attributes?
The influence of on depends upon its relative contribution to the average measure over all pairs of observation.
Because is the relative influence of variable, so setting would give all attributes equal influence.
What is desired properties of dissimilarity function?
Symmetry: 👏 Other wise you could claim "Alex looks like Bob, but Bob looks nothin Alex"
Positive separability: 👏Otherwise there are objects that are different, but you cannot tell apart
Triangular inequality: 👏 Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl
Combinatory Algorithm
This algorithm assigns each observation to a cluster without regard to a probability model describing the data.
stands for within-cluster and for between-cluster. Our aim is to minimize and maximize . Combinatory Algorithm needs to calculate all possible situations leading to an enormous inefficient computational speed.
This value rapidly increases so we don't want to adopt this algorithm for large
K-means Algorithm(Combinatory)
is the mean vector associated with cluster. Also it can be expressed as
, is also expressed as
After we define our dissimilarity function, we can define the distort measure that we want to minimize for finding the closest centroids.
However, it is NP-hard problem so we need to adopt an iterative optimization way.
K-means algorithm is usually used to find clusters from our unlabeled data. However, it also can be used to classify target variable from a labeled data.
Apply K-means algorithm to the training data in each class separately, using prototypes per class.
Assign a class label to each of the prototypes. In this case, is the number of classes, and is the number of prototypes per class.
Classify a new feature to the class of the closest prototype.

Learning Vector Quantization is used to correct the prototypes after we select prototypes. This method has the shortcoming that for each class, the other classes don't have a say in the positioning of the prototypes for that class.
Implementation From Scratch
import numpy as np
from PIL import Image
from random import *
import matplotlib.pyplot as plt
img_m = np.array(Image.open('data/myself.png'))
img_m = np.delete(img_m,obj=3,axis=2)
#train =
#test = np.array([[1,1,5],[1,2,3],[2,3,10],[2,10,20],[2,10,40]])
def d(x,y,method=['l2']):
if method=='l2': return (x-y)**2
class clustering():
def __init__(self, data, data_dim):
self.reshaped = data.reshape(-1,data_dim) # Flatten our data
self.data_dim = data_dim
def init_mean(self, num_cluster):
seed(2020) # Random seed for reproducible result
self.unique = np.unique(self.reshaped, axis=0)
idx = sample(range(0,len(self.unique)), num_cluster) # Index sampling
self.ux = self.unique[idx] # Initializing value within data matrix
## Poor initialization
# self.ux = np.array([np.random.uniform(0,1,self.data_dim) for num in range(num_cluster)])
self.num_cluster = num_cluster
def kmeans(self):
# Second step: make a cluster variabe "r"
self.u_pre = (self.ux)+0.1 # Just to make these difference in first step
while np.sum(abs(self.u_pre - self.ux))>0.08:
self.u_pre = (self.ux).copy()
r = np.array([np.argmin(np.sum(d(x,self.ux,'l2'), axis=1)) for x in self.reshaped]) # r = array([k0,k2,k1,k1,...])
# Third step: make a mean vector "ux"
self.ux = [] # ux = [[k0 mean vec], [k1 mean vec], ...]
for k in range(0,self.num_cluster):
rk = (r==k).astype(int).reshape(-1,1) # rk = [[1],[0],[0],[0],...]] for np multiplication
# Avoiding loop statement leads to reducing the time complexity
u = (self.reshaped*rk).sum(axis=0)/(rk).sum(axis=0) # This solution is already given in our lecture.
(self.ux).append(list(u)) # Binding u together.
self.ux = np.array(self.ux)
return(r, self.ux)
def kmeans_vis(data, num_cluster):
km = clustering(data, 3)
km.init_mean(num_cluster)
r, uk = km.kmeans()
reps = data.copy().reshape(-1,3)
for i in range(num_cluster):
reps[r==i] = uk[i]
result = reps.reshape(data.shape[0], data.shape[1],3)
plt.imshow(result, interpolation='nearest')
plt.show()

Gaussian Mixtures
Setting :
def random_mat(n):
x = np.random.normal(0, 0.1, size=(n, n))
return x@np.transpose(x)
def params_init():
params = {'pi': np.random.uniform(0,1),
'mu': np.array([np.random.normal(0, 0.1, size=(5,)) for i in range(2)]), # k=2
'sigma': np.array([random_mat(5)+np.identity(5) for i in range(2)])} #k=2
return params
mu = params_init()['mu']; sigma = params_init()['sigma']
dist = [mvn(mean=mu[i], cov=sigma[i]) for i in range(2)]
# PCA
scaler = StandardScaler()
scaler.fit(df)
X_scaled = scaler.transform(df)
cov_matrix = np.cov(X_scaled.T)
eigvals, eigvecs = np.linalg.eig(cov_matrix)
eigvec_mat = (eigvecs.T[:][:5]).T
X_pca = (X_scaled @ eigvec_mat / eigvals[:5]).astype(float)
obj = []
def e_step(x, params, i):
pi = params['pi']; mu = params['mu']; sigma = params['sigma']
dist = [mvn(mean=mu[i], cov=sigma[i]) for i in range(2)]
r_vec = [[( pi * dist[k].pdf(x[i]) ) / ((pi * dist[0].pdf(x[i]))+ ((1-pi)*dist[1].pdf(x[i]))) for i in range(x.shape[0])] for k in [0,1]]
clst = [ int(r_vec[0][i] < r_vec[1][i]) for i in range(x.shape[0])]
# Originally, calculating log-likelihood is the step after m_step
# But for computational convenience, I put it in this e_step.
log_likelihood = sum([np.log( pi * dist[0].pdf(x[i]) + (1-pi)*dist[1].pdf(x[i])) for i in range(x.shape[0])])
obj.append(log_likelihood)
return np.array(r_vec), clst
def m_step(x, params, r_vec):
N = r_vec.sum(axis=1)
elem = np.dot(r_vec,x)
mu = np.array([elem[0]/N[0] , elem[1]/N[1]])
pi = (N/N.sum())[0]
sigma = [(1/N[k]) * np.array([ r_vec[k][i] * (np.outer(X_pca[i] - params['mu'][k],X_pca[i] - params['mu'][k]))
for i in range(X_pca.shape[0])]).sum(axis=0)
for k in range(2)]
params = {'pi':pi, 'mu':mu, 'sigma':np.array(sigma)}
return params
i = 0
params = params_init()
while i<50:
i += 1
r_vec, clst = e_step(X_pca, params,i)
params = m_step(X_pca, params, r_vec)
fig, ax = plt.subplots()
ax.plot(range(i), obj, marker='.', color='r')
ax.grid(True)
ax.set_xlabel('# of iteration')
ax.set_ylabel('log likelihood')
ax.set_title('log likelihood')
plt.show()


K-medoids
K-means cluster has following two flaws.
[Lack of Robustness] Euclidean distance places the highest influence on the largest distance leading to lack of robustness.
[Only For Numerical Variables] It is impossible to compute arithmetic mean for a categorical feature.
To handle with these two problems, K-medoids can be used. This way
if belongs to cluster. (otherwise the value becomes )
, is the mean vector in cluster.
In this algorithm, we don't need to compute the cluster center so we can use categorical variables also. We just need to keep track of the indices
Hierarchical Clustering
Hierarchy in cluster mean that at the lowest level each cluster contains only one single point but at the highest level there is only one cluster containing all of the data. Hierarchical clustering composes of two steps:
Agglomerative(bottom-up): It starts at the bottom and at each level it recursively merge a selected pair of clusters into a single cluster.
Divisive(top-down): It starts at the top and at each level it recursively split one of the existing clusters at that level into two new clusters.
For visualization tool, dendrogram is usually used. I only handle bottom-up model in this chapter. The algorithms is as follows.


Single Linkage:
Complete Linkage:
Group Average: , are the number of observations in each group. (This method is similar with K-means clustering)

Nearest Neighbor
Nearest Neighbor is the way to classify or predict a target data of a new point X by leveraging data near by this new point. This method regulates the weight of each data by distance.
Our goal is to determine the number of class to be large enough to minimize an error, and to be small to give an accurate estimation. I'm gonna show that the class of rules, the single nearest neighbor rule is admissible by showing that there exists no rule, , which has lower probability of error against all distributions for the sample problem.
Let me consider the situation that the prior probabilities , and data points in each class are well separated. Also, the number of category is only two, so each data point is assigned into category 1 or 2. This problem can also be thought as the binomial situation, which each trial has just success or fail. Under this condition, I'll show that rule is strictly better than the rule in the case where the densities are clearly separated.
The probability that individuals come from category 1:
is the error that all points lie in category 2:
. It means the probability that or fewer points lie in category 1
When the new data point is actually in a category 1, only if the nearest points or fewer points are in a category 2, this point is assigned into a category 1. This is the situation of choosing points as 2 and other points as 1.
is the nearest neighbor point to . is the category to which belongs. The conditional NN risk is as follows.
is a nonzero probability measure at the points .
Bayesian Procedure
In a Bayesian view, , which means the probability that the individual is assigned into a class . is a prior probability. is the loss incurred by assigning an individual from category to category . The conditional loss is as follows.
The conditional loss becomes a minimum when the individual is assigned to the category for which is lowest. The conditional Bayes risk is as follows.
by the symmetry of
Comparison of errors
The overall NN risk R is as follows.
It implies . When the problem is expended from 2-class into K-class, the inequality changes a bit.
Practical Ossues
How can we decide the number of clusters?
Gap Statistic
We can leverage the within cluster dissimilarity


Reference [Gap statistic]
Silhouette value
We can also use a Silhouette value . is the average distance from the th data point to the other points in the same cluster, and is the minimum average distance from the th point to points in a different cluster. When our groups are well clustered, has to become big.

Last updated
Was this helpful?