K-Means¶

1. Basic Concepts of K-Means¶

Cluster:
- A group of similar data points. K-Means divides the dataset into K distinct clusters.
Centroid:
- The central point of a cluster, representing the average position of all points in the cluster.
K Value:
- Specifies the number of clusters. K-Means requires a predefined K, which indicates how many clusters the dataset will be divided into.

K-Means algorithm iteratively proceeds through the following steps:

Initialize Centroids:
- Randomly select K initial centroids.
Assign Data Points to Clusters:
- Assign each data point to the nearest centroid, forming K clusters.
Update Centroids:
- Calculate the new centroid for each cluster by averaging all the points in the cluster.
Repeat Iterations:
- Repeat steps 2 and 3 until the centroids no longer change significantly, or until a predefined maximum number of iterations is reached.
Stopping Criteria:
- The algorithm converges and stops when the centroids no longer move.

Simple and Easy to Implement:
- K-Means is straightforward and easy to understand and implement.
Fast and Efficient:
- K-Means is fast and efficient, especially on large datasets, due to its low time complexity.
Easy to Interpret and Visualize:
- The clusters generated by K-Means are easy to interpret, and the results can be visualized in 2D or 3D plots.

Need to Predefine K:
- The number of clusters K must be predefined, which can be difficult to determine in advance.
Sensitive to Initial Centroids:
- K-Means is sensitive to the initial placement of centroids, which can lead to different clustering results.
Only Detects Spherical Clusters:
- K-Means assumes clusters are spherical, so it performs poorly on clusters with complex shapes.
Sensitive to Outliers:
- Outliers can significantly affect the centroid positions, leading to poor clustering results.

Elbow Method:
- Plot the SSE (Sum of Squared Errors) for each possible K value and choose the K at the "elbow" point, where the SSE starts to decrease more slowly.
Silhouette Coefficient:
- Use the silhouette coefficient to evaluate the quality of clustering, and select the K with the highest silhouette score.

Image Compression:
- K-Means can be used for image compression by clustering pixels into different color clusters, reducing the number of colors.
Customer Segmentation:
- K-Means is used in marketing to segment customers into different groups based on their behavior.
Document Classification:
- K-Means can be applied to cluster documents into different topics or categories.