K-Means¶
1. Basic Concepts of K-Means¶
-
Cluster:
- A group of similar data points. K-Means divides the dataset into
K
distinct clusters.
- A group of similar data points. K-Means divides the dataset into
-
Centroid:
- The central point of a cluster, representing the average position of all points in the cluster.
-
K Value:
- Specifies the number of clusters. K-Means requires a predefined
K
, which indicates how many clusters the dataset will be divided into.
- Specifies the number of clusters. K-Means requires a predefined
2. How K-Means Algorithm Works¶
K-Means algorithm iteratively proceeds through the following steps:
-
Initialize Centroids:
- Randomly select
K
initial centroids.
- Randomly select
-
Assign Data Points to Clusters:
- Assign each data point to the nearest centroid, forming
K
clusters.
- Assign each data point to the nearest centroid, forming
-
Update Centroids:
- Calculate the new centroid for each cluster by averaging all the points in the cluster.
-
Repeat Iterations:
- Repeat steps 2 and 3 until the centroids no longer change significantly, or until a predefined maximum number of iterations is reached.
-
Stopping Criteria:
- The algorithm converges and stops when the centroids no longer move.
3. Advantages of K-Means¶
-
Simple and Easy to Implement:
- K-Means is straightforward and easy to understand and implement.
-
Fast and Efficient:
- K-Means is fast and efficient, especially on large datasets, due to its low time complexity.
-
Easy to Interpret and Visualize:
- The clusters generated by K-Means are easy to interpret, and the results can be visualized in 2D or 3D plots.
4. Disadvantages of K-Means¶
-
Need to Predefine K:
- The number of clusters
K
must be predefined, which can be difficult to determine in advance.
- The number of clusters
-
Sensitive to Initial Centroids:
- K-Means is sensitive to the initial placement of centroids, which can lead to different clustering results.
-
Only Detects Spherical Clusters:
- K-Means assumes clusters are spherical, so it performs poorly on clusters with complex shapes.
-
Sensitive to Outliers:
- Outliers can significantly affect the centroid positions, leading to poor clustering results.
5. Methods to Select K Value¶
-
Elbow Method:
- Plot the SSE (Sum of Squared Errors) for each possible
K
value and choose theK
at the "elbow" point, where the SSE starts to decrease more slowly.
- Plot the SSE (Sum of Squared Errors) for each possible
-
Silhouette Coefficient:
- Use the silhouette coefficient to evaluate the quality of clustering, and select the
K
with the highest silhouette score.
- Use the silhouette coefficient to evaluate the quality of clustering, and select the
6. Applications of K-Means¶
-
Image Compression:
- K-Means can be used for image compression by clustering pixels into different color clusters, reducing the number of colors.
-
Customer Segmentation:
- K-Means is used in marketing to segment customers into different groups based on their behavior.
-
Document Classification:
- K-Means can be applied to cluster documents into different topics or categories.