Clustering is a method that binds similar observations into groups. For example, we can cluster customer types based on several variables reflecting patterns in consumption. In mathematical terms, we bind together similar rows of the data matrix. For clustering, prototype methods are usually adopted. A prototype method assigns each observation to its closest prototype (centroid, medoid, etc.). "Closest" is usually defined by Euclidean distance in the feature space, after each feature has been standardized to have mean 0 and variance 1 in the training sample.
Prototype methods work well for capturing the distribution of each class.
How many prototypes to use and where to put them are the main issues.
🔹 Goal: partition the observations into clusters so that the pairwise dissimilarities between points assigned to the same cluster are smaller than those between points in different clusters.
Combinatorial algorithms: work directly on the observed data, without any probability model.
Mixture modeling: assume the data is an i.i.d. sample from some population described by a probability density function, which is expressed by a parameterized model.
Mode seeking ("bump hunters"): directly estimate distinct modes of the probability density function, taking a nonparametric perspective.
W(C) stands for the within-cluster scatter and B(C) for the between-cluster scatter. Our aim is to minimize W(C) and maximize B(C). A combinatorial algorithm would have to evaluate every possible assignment, which makes the computation prohibitively slow.
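For reference, with $d(x_i,x_{i'})$ denoting the pairwise dissimilarity and $C(i)$ the cluster assignment of observation $i$, these quantities are commonly defined as

$$W(C)=\frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\sum_{C(i')=k}d(x_i,x_{i'}),\qquad B(C)=\frac{1}{2}\sum_{k=1}^{K}\sum_{C(i)=k}\sum_{C(i')\neq k}d(x_i,x_{i'}).$$

The total point scatter $T=W(C)+B(C)$ does not depend on the assignment, so minimizing $W(C)$ is equivalent to maximizing $B(C)$.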
$$S(N,K)=\frac{1}{K!}\sum_{k=1}^{K}(-1)^{K-k}\binom{K}{k}k^{N}$$
This value grows extremely rapidly in $N$ and $K$, so we do not want to enumerate all assignments for large $N$ and $K$. Since the problem is NP-hard, we instead adopt an iterative (greedy descent) optimization.
Algorithm [Iterative Descent Algorithm]
🔹 Initialization: initialize $k$ cluster centers $\{c_1,\cdots,c_k\}$.
🔹 Cluster assignment (compute the distances): $\pi(i)=\arg\min_{j=1,\cdots,k}\lVert x_i-c_j\rVert^2$
🔹 Center adjustment (update the cluster centers): $c_j=\frac{1}{|\{i:\pi(i)=j\}|}\sum_{i:\pi(i)=j}x_i$
Repeat the assignment and adjustment steps until there is no change in $c_j$.
The K-means algorithm is usually used to find clusters in unlabeled data. However, it can also be used as a classifier for a labeled target variable.
Apply the K-means algorithm to the training data of each class separately, using R prototypes per class.
Assign a class label to each of the $K\times R$ prototypes; here $K$ is the number of classes and $R$ is the number of prototypes per class.
Classify a new feature vector $x$ to the class of the closest prototype (see the sketch below).
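As a sketch of this classification scheme (separate from the from-scratch clustering code below), here is a minimal version using scikit-learn's `KMeans`; the Iris data and the choice of $R=5$ prototypes per class are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Illustrative labeled data; any (X, y) pair works.
X, y = load_iris(return_X_y=True)
R = 5  # prototypes per class (assumed value)

prototypes, proto_labels = [], []
for k in np.unique(y):
    # Run K-means separately on the training points of class k.
    km = KMeans(n_clusters=R, n_init=10, random_state=0).fit(X[y == k])
    prototypes.append(km.cluster_centers_)
    proto_labels.append(np.full(R, k))
prototypes = np.vstack(prototypes)           # shape (K*R, p)
proto_labels = np.concatenate(proto_labels)  # class label attached to each prototype

def classify(x_new):
    # Assign the new point to the class of its closest prototype.
    dists = np.linalg.norm(prototypes - x_new, axis=1)
    return proto_labels[np.argmin(dists)]

print(classify(X[0]), y[0])
```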
Learning Vector Quantization (LVQ) is used to correct the prototypes after they have been selected. The K-means classification scheme above has the shortcoming that, for each class, the other classes have no say in the positioning of that class's prototypes; LVQ addresses this by letting training points from all classes attract or repel the nearest prototype.
Learning Vector Quantization (LVQ)
🔹 Initialization: initialize $R$ prototypes for each class: $m_1(k),\cdots,m_R(k)$, $k=1,\cdots,K$.
🔹 Sampling and adjusting: sample a training point $x_i$ randomly with replacement, and let $(j,k)$ index the closest prototype $m_j(k)$ to $x_i$.
🔹 If $g_i = k$: $m_j(k)\leftarrow m_j(k)+\epsilon\,(x_i-m_j(k))$
🔹 If $g_i \neq k$: $m_j(k)\leftarrow m_j(k)-\epsilon\,(x_i-m_j(k))$
Repeat the second step, decreasing the learning rate $\epsilon$ toward zero with each iteration.
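A minimal NumPy sketch of this procedure (LVQ1); the two-blob data, the random prototype initialization, and the linearly decaying learning-rate schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def lvq_fit(X, y, R=2, eps0=0.3, n_iter=2000):
    """LVQ1 sketch: R prototypes per class, learning rate shrinking toward 0."""
    classes = np.unique(y)
    # Initialize each class's R prototypes as randomly chosen training points of that class.
    protos = np.vstack([X[y == k][rng.choice((y == k).sum(), R, replace=False)] for k in classes])
    proto_y = np.repeat(classes, R)
    for t in range(n_iter):
        eps = eps0 * (1 - t / n_iter)                         # decaying learning rate
        i = rng.integers(len(X))                              # sample a training point with replacement
        j = np.argmin(np.linalg.norm(protos - X[i], axis=1))  # index of the closest prototype
        if proto_y[j] == y[i]:
            protos[j] += eps * (X[i] - protos[j])             # move the prototype toward x_i
        else:
            protos[j] -= eps * (X[i] - protos[j])             # move the prototype away from x_i
    return protos, proto_y

# Illustrative data: two Gaussian blobs.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(lvq_fit(X, y))
```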
Implementation From Scratch
import numpy as np
from PIL import Image
from random import seed, sample                         # used for reproducible prototype initialization
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal as mvn     # used in the EM section below
from sklearn.preprocessing import StandardScaler        # used in the PCA section below
img_m = np.array(Image.open('data/myself.png'))
img_m = np.delete(img_m,obj=3,axis=2)
#train =
#test = np.array([[1,1,5],[1,2,3],[2,3,10],[2,10,20],[2,10,40]])
def d(x, y, method='l2'):
    # Element-wise squared difference; summing over the feature axis gives squared L2 distance.
    if method == 'l2': return (x - y)**2
class clustering():
def __init__(self, data, data_dim):
self.reshaped = data.reshape(-1,data_dim) # Flatten our data
self.data_dim = data_dim
def init_mean(self, num_cluster):
seed(2020) # Random seed for reproducible result
self.unique = np.unique(self.reshaped, axis=0)
idx = sample(range(0,len(self.unique)), num_cluster) # Index sampling
self.ux = self.unique[idx] # Initializing value within data matrix
## Poor initialization
# self.ux = np.array([np.random.uniform(0,1,self.data_dim) for num in range(num_cluster)])
self.num_cluster = num_cluster
def kmeans(self):
        # Second step: make a cluster assignment variable "r"
        self.u_pre = (self.ux) + 0.1 # Offset so the first convergence check does not pass immediately
while np.sum(abs(self.u_pre - self.ux))>0.08:
self.u_pre = (self.ux).copy()
r = np.array([np.argmin(np.sum(d(x,self.ux,'l2'), axis=1)) for x in self.reshaped]) # r = array([k0,k2,k1,k1,...])
# Third step: make a mean vector "ux"
self.ux = [] # ux = [[k0 mean vec], [k1 mean vec], ...]
for k in range(0,self.num_cluster):
rk = (r==k).astype(int).reshape(-1,1) # rk = [[1],[0],[0],[0],...]] for np multiplication
                # Avoiding an explicit loop over data points reduces the running time
u = (self.reshaped*rk).sum(axis=0)/(rk).sum(axis=0) # This solution is already given in our lecture.
(self.ux).append(list(u)) # Binding u together.
self.ux = np.array(self.ux)
return(r, self.ux)
def kmeans_vis(data, num_cluster):
km = clustering(data, 3)
km.init_mean(num_cluster)
r, uk = km.kmeans()
reps = data.copy().reshape(-1,3)
for i in range(num_cluster):
reps[r==i] = uk[i]
result = reps.reshape(data.shape[0], data.shape[1],3)
plt.imshow(result, interpolation='nearest')
plt.show()
def random_mat(n):
x = np.random.normal(0, 0.1, size=(n, n))
return x@np.transpose(x)
def params_init():
params = {'pi': np.random.uniform(0,1),
'mu': np.array([np.random.normal(0, 0.1, size=(5,)) for i in range(2)]), # k=2
'sigma': np.array([random_mat(5)+np.identity(5) for i in range(2)])} #k=2
return params
init = params_init(); mu = init['mu']; sigma = init['sigma']  # use a single initialization draw
dist = [mvn(mean=mu[i], cov=sigma[i]) for i in range(2)]       # frozen components (also rebuilt inside e_step each iteration)
# PCA: project the standardized features onto the top principal components
scaler = StandardScaler()
scaler.fit(df)                      # df is assumed to be the feature DataFrame loaded earlier
X_scaled = scaler.transform(df)
cov_matrix = np.cov(X_scaled.T)
eigvals, eigvecs = np.linalg.eig(cov_matrix)
order = np.argsort(eigvals)[::-1]   # sort eigenpairs by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
eigvec_mat = eigvecs[:, :5]         # keep the top 5 principal directions
X_pca = (X_scaled @ eigvec_mat / eigvals[:5]).astype(float)  # component scores divided by the eigenvalues (original scaling choice)
obj = []
def e_step(x, params, i):
    # i: iteration counter passed by the training loop (not used here)
    pi = params['pi']; mu = params['mu']; sigma = params['sigma']
    dist = [mvn(mean=mu[k], cov=sigma[k]) for k in range(2)]
    weights = [pi, 1 - pi]  # mixing weights of the two components
    denom = [pi*dist[0].pdf(x[n]) + (1 - pi)*dist[1].pdf(x[n]) for n in range(x.shape[0])]
    r_vec = [[weights[k]*dist[k].pdf(x[n]) / denom[n] for n in range(x.shape[0])] for k in [0, 1]]
    clst = [int(r_vec[0][n] < r_vec[1][n]) for n in range(x.shape[0])]
    # Originally, the log-likelihood would be evaluated after the M step,
    # but for computational convenience it is computed here in the E step.
    log_likelihood = sum([np.log(denom[n]) for n in range(x.shape[0])])
    obj.append(log_likelihood)
    return np.array(r_vec), clst
def m_step(x, params, r_vec):
    N = r_vec.sum(axis=1)                        # effective number of points per component
    elem = np.dot(r_vec, x)
    mu = np.array([elem[0]/N[0], elem[1]/N[1]])  # responsibility-weighted means
    pi = (N/N.sum())[0]                          # mixing weight of component 0
    sigma = [(1/N[k]) * np.array([r_vec[k][n] * np.outer(x[n] - mu[k], x[n] - mu[k])
                                  for n in range(x.shape[0])]).sum(axis=0)
             for k in range(2)]                  # responsibility-weighted covariances, using the updated means
    params = {'pi': pi, 'mu': mu, 'sigma': np.array(sigma)}
    return params
i = 0
params = params_init()
while i<50:
i += 1
r_vec, clst = e_step(X_pca, params,i)
params = m_step(X_pca, params, r_vec)
fig, ax = plt.subplots()
ax.plot(range(i), obj, marker='.', color='r')
ax.grid(True)
ax.set_xlabel('# of iteration')
ax.set_ylabel('log likelihood')
ax.set_title('log likelihood')
plt.show()
Algorithm [EM Algorithm for Gaussian Mixtures]
🔹 Initialization: initialize the means $\mu_k$, covariances $\Sigma_k$, and mixing coefficients $\pi_k$.
🔹 E step: evaluate $r(z_{nk})=\dfrac{\pi_k\,\mathcal{N}(x_n\mid\mu_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_n\mid\mu_j,\Sigma_j)}$
🔹 M step: re-estimate the parameters using $r(z_{nk})$.
Check for convergence of either the parameters or the log-likelihood; if the convergence criterion is not satisfied, return to the E step. Here $r(z_{nk})$ denotes the responsibility.
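For reference, the M step re-estimates the parameters from the responsibilities via the standard Gaussian-mixture updates (this is what `m_step` above computes for $K=2$):

$$N_k=\sum_{n=1}^{N}r(z_{nk}),\quad \mu_k=\frac{1}{N_k}\sum_{n=1}^{N}r(z_{nk})\,x_n,\quad \Sigma_k=\frac{1}{N_k}\sum_{n=1}^{N}r(z_{nk})\,(x_n-\mu_k)(x_n-\mu_k)^{\top},\quad \pi_k=\frac{N_k}{N}.$$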
K-medoids
K-means clustering has the following two flaws.
[Lack of robustness] Squared Euclidean distance places the greatest influence on the largest distances, so the procedure is not robust to outliers.
[Only for numerical variables] It is impossible to compute an arithmetic mean for categorical features.
To handle these two problems, K-medoids can be used. The objective is
$$J=\sum_{i=1}^{n}\sum_{j=1}^{K}r_{ij}\,D(x_i,\mu_j)$$
where $r_{ij}=1$ if $x_i$ belongs to cluster $j$ (and 0 otherwise), $D(x_i,\mu_j)$ is an arbitrary dissimilarity, and $\mu_j$ is the center of cluster $j$, restricted to be one of the observations (a medoid).
Algorithm
🔹 Initialization: a set of cluster centers $\{m_1,\ldots,m_K\}$ (each an actual observation) and an assignment $C$.
🔹 Center update: for a given assignment $C$, find the observation in each cluster that minimizes total dissimilarity to the other members, $i_k^{*}=\arg\min_{\{i:\,C(i)=k\}}\sum_{C(i')=k}D(x_i,x_{i'})$, and set $m_k=x_{i_k^{*}}$.
🔹 Assignment update: $C(i)=\arg\min_{1\le k\le K}D(x_i,m_k)$
Iterate the updating and assigning steps until the assignments do not change.
In this algorithm we never compute an explicit cluster mean, so categorical variables (with a suitable dissimilarity) can be used as well; we only need to keep track of the medoid indices $i_k^{*}$. A minimal implementation sketch follows.
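Here is a minimal NumPy sketch that alternates the two updates on a precomputed dissimilarity matrix `D`, so any dissimilarity (including one defined for categorical data) can be plugged in; the random two-blob data and `K = 2` are illustrative assumptions.

```python
import numpy as np

def kmedoids(D, K, n_iter=100, seed=0):
    """K-medoids on a precomputed dissimilarity matrix D (n x n)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, K, replace=False)   # medoid indices i_k
    assign = np.argmin(D[:, medoids], axis=1)   # initial assignment C(i)
    for _ in range(n_iter):
        # Center update: within each cluster, pick the member with the
        # smallest total dissimilarity to the other members.
        new_medoids = medoids.copy()
        for k in range(K):
            members = np.where(assign == k)[0]
            if len(members) == 0:
                continue
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[k] = members[np.argmin(within)]
        # Assignment update: C(i) = argmin_k D(x_i, m_k)
        new_assign = np.argmin(D[:, new_medoids], axis=1)
        if np.array_equal(new_assign, assign) and np.array_equal(new_medoids, medoids):
            break
        medoids, assign = new_medoids, new_assign
    return medoids, assign

# Illustrative use: Euclidean dissimilarities on random 2-D points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(kmedoids(D, K=2))
```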
Hierarchical Clustering
Hierarchy in clustering means that at the lowest level each cluster contains a single point, while at the highest level there is a single cluster containing all of the data. Hierarchical clustering comes in two flavors:
Agglomerative (bottom-up): start at the bottom and, at each level, recursively merge a selected pair of clusters into a single cluster.
Divisive (top-down): start at the top and, at each level, recursively split one of the existing clusters into two new clusters.
A dendrogram is usually used as the visualization tool. I only cover the bottom-up (agglomerative) approach in this chapter. The algorithm is as follows.
Algorithm
🔹 Assigning: assign each data point to its own cluster, $g_1=\{x_1\},\ldots,g_m=\{x_m\}$,
and let $G=\{g_1,\ldots,g_m\}$.
🔹 Do:
Find the two closest clusters to merge: $(i,j)=\arg\min_{1\le i<j\le|G|}D(g_i,g_j)$
Merge the two clusters into a new cluster: $g\leftarrow g_i\cup g_j$
Remove the merged clusters: $G\leftarrow G\setminus\{g_i\}$, $G\leftarrow G\setminus\{g_j\}$
Add the new cluster: $G\leftarrow G\cup\{g\}$
🔹 While $|G|>1$
Single linkage: $d_{SL}(G,H)=\min_{i\in G,\,i'\in H}d_{ii'}$
Group average: $d_{GA}(G,H)=\frac{1}{N_G N_H}\sum_{i\in G}\sum_{i'\in H}d_{ii'}$, where $N_G$ and $N_H$ are the numbers of observations in each group. (This method behaves similarly to K-means clustering.)
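As a short illustration of the bottom-up procedure and the two linkage functions above, here is a sketch using SciPy's `linkage` and `dendrogram`; the random two-group data are an illustrative assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Illustrative data: two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (15, 2)), rng.normal(4, 0.5, (15, 2))])

Z = linkage(X, method='single')     # single linkage: merge on minimum pairwise distance
# linkage(X, method='average') would give the group-average variant instead

labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram into 2 clusters
print(labels)

dendrogram(Z)  # visualize the merge hierarchy
plt.show()
```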
Nearest Neighbor
Nearest neighbor methods classify or predict the target of a new point $x$ by leveraging the training data nearest to that point; the influence of each training point is governed by its distance to $x$.
Our goal is to choose the number of neighbors $k$ large enough to minimize the error, yet small enough to give an accurate local estimate. I will show that, within the class of $k$-NN rules, the single nearest neighbor rule is admissible, by showing that there exists no $k$-NN rule with $k\neq 1$ that has lower probability of error against all distributions for the $n$-sample problem.
Consider the situation where the prior probabilities are $\eta_1=\eta_2=\frac{1}{2}$ and the data points in each class are well separated. There are only two categories, so each data point is assigned to category 1 or 2. The problem can then be viewed as a binomial one, in which each trial is either a success or a failure. Under this condition, I will show that the 1-NN rule is strictly better than the $k$-NN rule when the densities are clearly separated.
The probability that $j$ of the $n$ individuals come from category 1: $\binom{n}{j}\left(\tfrac{1}{2}\right)^n$
$P_e(1;n)$, the probability of error for the 1-NN rule, is the probability that all $n$ points lie in category 2: $\left(\tfrac{1}{2}\right)^n$
$P_e(k;n)=\left(\tfrac{1}{2}\right)^n\sum_{j=0}^{k_0}\binom{n}{j}$, where $k=2k_0+1$; this is the probability that $k_0$ or fewer of the $n$ points lie in category 1.
When the new point $x$ actually belongs to category 1, $x$ is assigned to category 1 only if at most $k_0$ of its $k$ nearest neighbors are in category 2. With well-separated densities, an error therefore occurs exactly when $k_0$ or fewer of the $n$ sample points lie in category 1, which gives the sum above.
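A quick numerical check of these formulas (with an arbitrary illustrative sample size $n=10$) confirms that $P_e(1;n)\le P_e(k;n)$ for every odd $k>1$:

```python
from math import comb

def P_e(k, n):
    """Error probability of the k-NN rule (k = 2*k0 + 1) under the
    equal-prior, well-separated two-category setup above."""
    k0 = (k - 1) // 2
    return (0.5 ** n) * sum(comb(n, j) for j in range(k0 + 1))

n = 10  # illustrative sample size
for k in [1, 3, 5, 7]:
    print(k, P_e(k, n))
# P_e(1; n) = (1/2)^n is the smallest value, consistent with the admissibility of 1-NN.
```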
Let $x'$ be the nearest neighbor of $x$, and let $k'$ be the category to which $x'$ belongs. The conditional NN risk $r(x,x')$ is the probability of error given $x$ and $x'$, i.e., the probability that the labels of $x$ and $x'$ disagree.
$\nu(x)$ denotes a nonzero probability measure at the point $x$.
Bayesian Procedure
In the Bayesian view, $p(k\mid x)=p_k(x)=\frac{p(x\mid k)\,p(k)}{p(x)}$ is the probability that the individual $x$ belongs to class $k$, where $p(k)$ is the prior probability. $L(k,j)$ is the loss incurred by assigning an individual from category $k$ to category $j$. The conditional loss is
$$r_j(x)=\sum_{k=1}^{K}p_k(x)\,L(k,j).$$
The conditional loss is minimized when the individual is assigned to the category $j$ for which $r_j(x)$ is lowest; under 0-1 loss this is the category with the largest posterior $p_j(x)$. The conditional Bayes risk is $r^{*}(x)=\min_j r_j(x)$.
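As a small sketch of this decision rule, suppose we are given posteriors $p_k(x)$ and a loss matrix $L$ (both made up for illustration); we pick the category minimizing $r_j(x)$, which under 0-1 loss is the category with the largest posterior.

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])    # illustrative posteriors p_k(x), k = 1..3
L = np.ones((3, 3)) - np.eye(3)  # 0-1 loss: L(k, j) = 1 if k != j else 0

r = p @ L              # r_j(x) = sum_k p_k(x) * L(k, j)
j_star = np.argmin(r)  # Bayes decision: category minimizing r_j(x)
print(r, j_star)       # with 0-1 loss, j_star is also the argmax of p
```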
We can also use the silhouette value $S_i=\frac{b_i-a_i}{\max(a_i,b_i)}$, where $a_i$ is the average distance from the $i$th data point to the other points in the same cluster, and $b_i$ is the minimum, over the other clusters, of the average distance from the $i$th point to the points in that cluster. When the groups are well clustered, $b_i-a_i$ should be large. A small from-scratch computation follows.
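This sketch computes the silhouette value of every point directly from the definition (the two-blob data and labels are illustrative assumptions; scikit-learn's `silhouette_score` reports the average of the same quantity).

```python
import numpy as np

def silhouette_values(X, labels):
    """S_i = (b_i - a_i) / max(a_i, b_i) for every point i."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    S = np.zeros(len(X))
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False                       # exclude the point itself from a_i
        a_i = D[i, same].mean()
        b_i = min(D[i, labels == c].mean()    # average distance to each other cluster
                  for c in np.unique(labels) if c != labels[i])
        S[i] = (b_i - a_i) / max(a_i, b_i)
    return S

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(3, 0.5, (10, 2))])
labels = np.array([0] * 10 + [1] * 10)
print(silhouette_values(X, labels).mean())  # close to 1 for well-separated clusters
```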
Symmetry: $d(x,y)=d(y,x)$. Otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
Positive separability: $d(x,y)=0$ if and only if $x=y$. Otherwise there would be objects that are different but cannot be told apart.
Triangle inequality: $d(x,y)\le d(x,z)+d(z,y)$. Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
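As a quick spot check (not a proof), the following verifies these three properties for Euclidean distance on randomly sampled points:

```python
import numpy as np

rng = np.random.default_rng(0)
d = lambda x, y: np.linalg.norm(x - y)  # Euclidean distance

for _ in range(1000):
    x, y, z = rng.normal(size=(3, 4))
    assert np.isclose(d(x, y), d(y, x))              # symmetry
    assert d(x, y) > 0 and np.isclose(d(x, x), 0.0)  # positive separability (x != y almost surely)
    assert d(x, y) <= d(x, z) + d(z, y) + 1e-12      # triangle inequality
print("all three metric axioms hold on the sampled points")
```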
Result
Example - Georgia Tech 2021 Fall ISYE 6740 George Lan
This type of cluster structure can be recovered by single linkage, because it merges clusters based on the minimum pairwise distance.