Skip to main content

k-Means

Definition​

The k-Means algorithm is an unsupervised machine learning technique used for clustering data into k groups based on similarities in their features. It partitions the data into clusters by minimizing the sum of squared distances between data points and their respective cluster centroids

Practice​

kMeans(data, k):
// Step 1: Initialization
centroids = randomly_initialize_centroids(data, k)
clusters = []
repeat:
// Step 2: Assignment
clusters = assign_points_to_clusters(data, centroids)
// Step 3: Update centroids
centroids = update_centroids(clusters)
until convergence
return clusters

randomly_initialize_centroids(data, k):
centroids = []
randomly_select k points from data and add them to centroids
return centroids

assign_points_to_clusters(data, centroids):
clusters = initialize_empty_clusters(k)
for each point in data:
nearest_centroid = find_nearest_centroid(point, centroids)
add point to nearest_centroid's cluster in clusters
return clusters

update_centroids(clusters):
centroids = []
for each cluster in clusters:
new_centroid = calculate_mean(cluster.points)
add new_centroid to centroids
return centroids

find_nearest_centroid(point, centroids):
min_distance = infinity
nearest_centroid = null
for each centroid in centroids:
distance = calculate_distance(point, centroid)
if distance < min_distance:
min_distance = distance
nearest_centroid = centroid
return nearest_centroid

calculate_distance(point1, point2):
// Euclidean distance
return sqrt(sum((point1[i] - point2[i])^2) for i in dimensions)

initialize_empty_clusters(k):
clusters = []
for i from 1 to k:
create a new cluster and add it to clusters
return clusters