The naive K-medoids method needs to compare the distance between every pair of points in the data set, so its time complexity is enormous. To avoid this issue, we can use the PAM algorithm.
PAM: Partitioning Around Medoids.
[ Notation ]
O: The set of all objects.
S: The set of selected objects (medoids).
U: The set of unselected objects, U = O − S.
D_p: The dissimilarity between p and the closest object in S.
E_p: The dissimilarity between p and the second-closest object in S.
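For concreteness, here is a minimal sketch of how D_p and E_p could be computed for a single point p against the selected set S, assuming squared Euclidean dissimilarity as in the code later in this post (d_and_e is a hypothetical helper name, not part of the implementation below):

import numpy as np

def d_and_e(p, S):
    # Squared Euclidean dissimilarities from p to every selected object in S.
    d = np.sort(np.sum((np.asarray(S) - np.asarray(p)) ** 2, axis=1))
    Dp = d[0]                        # distance to the closest object in S
    Ep = d[1] if len(d) > 1 else Dp  # second closest (falls back when |S| = 1)
    return Dp, Ep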
[ Process ]
BUILD PHASE
1. [Initialization] Put the point for which the sum of the distances to all other objects is minimal into S.
2. For each i ∈ U, consider it as a candidate for S.
3. For each j ∈ U − {i}, compute D_j.
4. If D_j > d(j, i), object j contributes to the selection of i; let C_ji = max{D_j − d(j, i), 0}.
5. Compute the total gain g_i = ∑ C_ji, summed over j ∈ U − {i}.
6. Choose the object i that maximizes g_i.
7. Let S := S ∪ {i} and U := U − {i}.
8. Repeat steps 2–7 until k objects have been selected.
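Steps 2–7 can also be written compactly against a precomputed dissimilarity matrix. The sketch below only restates the gain computation; dist (an n×n matrix of pairwise dissimilarities) and the index lists S and U are assumptions for illustration, while the post's own loop-based implementation follows.

import numpy as np

def build_step(dist, S, U):
    # dist: n x n dissimilarity matrix; S, U: lists of selected / unselected indices.
    Dj = dist[np.ix_(U, S)].min(axis=1)         # D_j for every j in U (step 3)
    best_i, best_gain = None, 0.0
    for pos, i in enumerate(U):                 # candidate i in U (step 2)
        Cji = np.maximum(Dj - dist[U, i], 0.0)  # contributions C_ji (step 4)
        Cji[pos] = 0.0                          # j = i contributes nothing
        gain = Cji.sum()                        # total gain g_i (step 5)
        if gain > best_gain:
            best_gain, best_i = gain, i         # keep the maximizer (step 6)
    return best_i                               # caller moves i from U to S (step 7); None means no gain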
import numpy as np

def dist_others(i, j, target='dist'):
    # Squared Euclidean dissimilarities between i and j.
    # target: 'idx' (index of the best row), 'dist' (smallest value),
    #         'second_min' (second-smallest value, i.e. E_p).
    i = np.asarray(i, dtype=float); j = np.asarray(j, dtype=float)
    if i.ndim == 1:                                   # promote 1-D input to a 2-D array
        i = np.expand_dims(i, axis=0)
    if j.ndim == 1:
        j = np.expand_dims(j, axis=0)
    if (len(i) > 1) and (len(j) > 1):                 # Multi to Multi
        # Total squared distance from each row of i to all rows of j;
        # used for the BUILD initialization (point with minimal total distance).
        totals = np.array([np.sum((elem1 - j) ** 2) for elem1 in i])
        idx = int(np.argmin(totals)); dist = float(np.min(totals))
        return {'idx': idx, 'dist': dist}.get(target)
    elif (len(i) == 1) and (len(j) > 1):              # One to Multi
        sum1 = np.sum((i - j) ** 2, axis=1)           # distance to every row of j
        idx = int(np.argmin(sum1)); dist = float(np.min(sum1))
        srt = np.sort(sum1)                           # keep ties, so E_p may equal D_p
        second_min = float(srt[1]) if len(srt) > 1 else dist
        return {'idx': idx, 'dist': dist, 'second_min': second_min}.get(target)
    elif (len(i) == 1) and (len(j) == 1):             # One to One
        dist = float(np.sum((i - j) ** 2))
        # 'second_min' falls back to dist so callers with a single medoid still work
        return {'idx': 'No index in one to one case', 'dist': dist, 'second_min': dist}.get(target)
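A quick usage sketch for the three call patterns (the arrays are arbitrary toy values, not data from the post):

X   = np.array([[0., 0.], [1., 0.], [4., 3.], [5., 3.]])  # toy objects
sel = np.array([[0., 0.], [5., 3.]])                       # two selected medoids

dist_others(X, X, 'idx')              # multi-to-multi: row with the smallest total distance
dist_others(X[1], sel, 'dist')        # one-to-multi: D_p for p = X[1]
dist_others(X[1], sel, 'second_min')  # one-to-multi: E_p for p = X[1]
dist_others(X[1], X[2], 'dist')       # one-to-one: squared distance between two points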
# step2
def build(obj, sel):
    # One BUILD step: return the index in obj (= U) of the candidate i that
    # maximizes the total gain g_i, or 'Stop' if no candidate gives any gain.
    n = len(obj)
    best_gain = 0; sel_i = 0
    for i in range(n):                                 # candidate i in U
        gi = 0
        for j in range(n):                             # every other object j in U
            if j == i:
                continue
            Dj  = dist_others(obj[j], sel,    'dist')  # D_j: closest selected object
            dji = dist_others(obj[j], obj[i], 'dist')  # d(j, i)
            gi += max(Dj - dji, 0)                     # contribution C_ji
        if gi > best_gain:                             # keep the best candidate so far
            best_gain = gi; sel_i = i
    if best_gain == 0:                                 # no gain left: stop selecting
        return 'Stop'
    return sel_i                                       # caller moves obj[sel_i] into sel
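A minimal usage sketch, continuing the toy X above (the variable names are assumptions; the caller is responsible for moving the chosen object from U into S):

first = dist_others(X, X, 'idx')       # initialization: point with minimal total distance
sel = X[[first]]                       # S = {initial medoid}
obj = np.delete(X, first, axis=0)      # U = O - S
nxt = build(obj, sel)                  # index in obj of the best next medoid
if nxt != 'Stop':
    sel = np.vstack([sel, obj[nxt]])   # add obj[nxt] to S
    obj = np.delete(obj, nxt, axis=0)  # and remove it from U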
SWAP PHASE
This phase swaps an element of S with an element of U (and vice versa) whenever doing so improves the quality of the clustering.
1. Consider all pairs (i, h) ∈ S × U.
2. For each j ∈ U, compute the contribution K_jih of the swap:
   If d(j, i) > D_j:
     if d(j, h) ≥ D_j, then K_jih = 0
     if d(j, h) < D_j, then K_jih = d(j, h) − D_j
     In both subcases, K_jih = min{d(j, h) − D_j, 0}.
   If d(j, i) = D_j:
     if d(j, h) < E_j, then K_jih = d(j, h) − D_j (K can be negative)
     if d(j, h) ≥ E_j, then K_jih = E_j − D_j (K ≥ 0)
     In both subcases, K_jih = min{d(j, h), E_j} − D_j.
3. Compute the total result of the swap, T_ih = ∑ K_jih, summed over j ∈ U.
4. Select the pair (i, h) ∈ S × U that minimizes T_ih.
5. If T_ih < 0, the swap is carried out, D_p and E_p are updated for every object p, and the algorithm returns to step 1. If min T_ih ≥ 0 (no swap improves the clustering), the algorithm halts.
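Using the same hypothetical dissimilarity-matrix convention as in the BUILD sketch (dist, S, U, and the name swap_cost are assumptions for illustration), the case analysis in step 2 translates directly; the post's own dist_others-based implementation follows.

def swap_cost(dist, S, U, i, h):
    # T_ih: total effect of swapping medoid i (in S) with candidate h (in U).
    total = 0.0
    for j in U:
        d_to_S = sorted(dist[j, s] for s in S)
        Dj = d_to_S[0]                             # closest medoid
        Ej = d_to_S[1] if len(d_to_S) > 1 else Dj  # second closest medoid
        if dist[j, i] > Dj:                        # j is not served by medoid i
            total += min(dist[j, h] - Dj, 0.0)     # K_jih = min{d(j,h) - D_j, 0}
        else:                                      # d(j, i) = D_j: j is served by i
            total += min(dist[j, h], Ej) - Dj      # K_jih = min{d(j,h), E_j} - D_j
    return total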
def swap(obj, sel, i, h):
    # T_ih for swapping obj[i] (a current medoid, also present in sel) with obj[h]
    # (an unselected object); the contributions K_jih are summed over every j.
    n = len(obj); Tih = 0
    for j in range(n):
        dji = dist_others(obj[j], obj[i], 'dist')        # d(j, i)
        Dj  = dist_others(obj[j], sel,    'dist')        # D_j
        djh = dist_others(obj[j], obj[h], 'dist')        # d(j, h)
        if dji > Dj:                                     # j is not served by medoid i
            Kjih = min(djh - Dj, 0)
        else:                                            # d(j, i) = D_j: j is served by i
            Ej = dist_others(obj[j], sel, 'second_min')  # E_j
            Kjih = min(djh, Ej) - Dj
        Tih += Kjih
    return Tih
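Putting the two phases together, one possible driver (pam is a hypothetical function, not part of the original code; it reuses dist_others, build, and swap above, and tracks medoids as row indices into the full data set X so that swap's i and h arguments refer to actual rows):

def pam(X, k):
    X = np.asarray(X, dtype=float)
    # BUILD phase: start from the point with the smallest total distance to all others.
    med_idx = [dist_others(X, X, 'idx')]
    while len(med_idx) < k:
        rest = [m for m in range(len(X)) if m not in med_idx]
        i = build(X[rest], X[med_idx])
        if i == 'Stop':
            break
        med_idx.append(rest[i])                  # map the index in U back to a row of X
    # SWAP phase: keep applying the most beneficial swap while it lowers the cost.
    while True:
        rest = [m for m in range(len(X)) if m not in med_idx]
        if not rest:
            break
        pairs = [(swap(X, X[med_idx], i, h), i, h) for i in med_idx for h in rest]
        T, i, h = min(pairs)                     # the pair (i, h) minimizing T_ih
        if T >= 0:                               # no swap improves the clustering: halt
            break
        med_idx[med_idx.index(i)] = h            # carry out the swap
    return X[med_idx]                            # the k medoid coordinates

Tracking indices rather than coordinates keeps the swap bookkeeping simple; for the toy X above, pam(X, 2) would return two rows of X as the final medoids.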
Note: in practice, the overhead of numpy's expand_dims turned out to be much larger than handling the one-dimensional case with a plain for statement.