K-means-clustering

by Hai Dang

Jan 03, 2019

Welcome to my next blog

Hello, this is blog about K means clustering.

K means clustering: We usually use this algorithm for unsupervised learning problem. With the unlabeled data, e.g we have multiple images of apple and mango, but we don’t know which one is apple and which one is mango. So, basically, we will create two clusters based on size, shape, … for these two kind of fruit and decide the most typical one will be the center of each cluster -> this is K means clustering.

Input: \(\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N] \in \mathbb{R}^{d \times N}\) data point and \(K\) number of clusters \(K < N\).
Output: \(\mathbf{Y} = [y _ {i1}, y _ {i2}, \dots, y _ {iK}] \) is label vector for each vector data point \(\mathbf{x}_i\). If \(\mathbf{x}_i\) is belong to to cluster \(k\) mean \(\mathbf{y} _ {ik} = 1\) and \(y _ {ij} = 0, \forall j \neq k \). This representation is call one-hot. We use this representation because two reasons: Firstly, K means clustering algorithm can not operate on label data directly, it should be numeric to calculate e.g the loss function below. Secondly, it works well in non-ordinal relationship problems e.g fruit problems. More detail here.

\[y_{ik} \in {0, 1},~~~ \sum_{k = 1}^K y_{ik} = 1 \]

Output: \(\mathbf{M} = [\mathbf{m}_1, \mathbf{m}_2, \dots \mathbf{m}_K]\) is the set of centroids (center of the cluster).
Loss function(total distance of every point to every centroids)

\[\mathcal{L}(\mathbf{Y}, \mathbf{M}) = \sum _ {i=1}^N \sum _ {j=1}^K y _ {ij} |\mathbf{x}_i - \mathbf{m}_j|_2^2\]

\(|\mathbf{x}_i - \mathbf{m}_k|_2^2\) is square of Euclidean distance, it hard to solve the derivative Euclidean function, so in the most case we take the power of two of this function.
\(y _ {ik} |\mathbf{x}_i - \mathbf{m}_k|_2^2 = \sum _ {j=1}^K y _ {ij} |\mathbf{x} _ i - \mathbf{m} _ j|_2^2 \) : Above we have, \(\mathbf{x}_i\) belongs to cluster \(k\) then the label of \(\mathbf{x}_i\) in cluster \(k\) is 1 and 0 in other clusters, so the total distance of \(\mathbf{x}_i\) to every centroids is equal the distance of \(\mathbf{x}_i\) to the cluster it is assigned.
Then, we have the loss function is the sum of all distance above. So, we have to minimize the loss function means: \[\mathbf{Y}, \mathbf{M} = \arg\min_{\mathbf{Y}, \mathbf{M}} \sum _ {i=1}^N \sum _ {j=1}^K y _ {ij} |\mathbf{x}_i - \mathbf{m}_j|_2^2 \]

\[\text{subject to:} ~~ y_{ij} \in {0, 1}~~ \forall i, j;~~~ \sum_{j = 1}^K y_{ij} = 1~~\forall i\]

Large data set: What happens if we have a data set with millions of object? mapreduce

Unknown \(k\): In case we don’t have the \(k\), we can use elbow method to determine number of clusters in a data set.

Input: \(X = \begin{Bmatrix} x_1 , x_2 \ldots x_n \end{Bmatrix} \) objects, \(P_j = \begin{Bmatrix} x_i , x _ {i + 1} \ldots x_j \end{Bmatrix} \) set of objects in cluster j, \(M = \begin{Bmatrix} 1 , 2 \ldots k \end{Bmatrix}\) the potential value for k
Ouput: optimal \(k\) (the \(k\) is that the value of WCSS at this \(k\) will only change slightly, the elbow of the graph ).
Algorithm: \[\mathsf { WCSS } = \sum _ { { j = 1 } }^{i \in \mathsf { M }}\sum _ { { P_j } \in \mathsf { Cluster } _j } \mathsf { distance } \left( { P } _ { { j } } , { C } _ { j } \right) ^ { 2 }\]
Performance: Let draw an imaginary line with two value of \(\mathsf { WCSS }\). The elbow of the graph will be the point that has largest distance to the imaginary line.
Function: \(R ^ {m*n}\)

5-tuples: T, E, P, A, F

Task: Find the center of each clusters
Experience: 1500 points and 3 clusters
Performance: compare the result with expected center we have created from the beginning.
Algorithm: K means clustering
Function: \((a, b) = f(A)\) \((a, b)\) is the coordinate of the center of cluster \(A\).
You can change the coordinate of test center point and increase the number of point in each cluster as well as number of cluster to check the accuracy.
The result also depend on separation between these clusters, if the clusters are quite mixed, the result will have a significant error.

Center point: \([2, 2], [8, 5], [3, 6], [9,8]\)

Input data