

K-Means Clustering || Machine Learning


    In our previous articles, we discussed various classification and regression algorithms that are commonly used in machine learning. Now we are moving on to the clustering series.

    Clustering is an unsupervised machine learning technique. It divides a dataset into groups such that the members of the same group have similar features. Commonly used clustering algorithms include K-Means clustering, Hierarchical clustering, Density-based clustering, and Model-based clustering. In this article, we are going to discuss K-Means clustering in detail.

K-Means Clustering

    K-Means is one of the simplest and most commonly used iterative unsupervised learning algorithms. We randomly initialize K centroids in the data (the value of K is found using the Elbow method, which is discussed later in this article) and update these centroids iteratively until their positions no longer change. Let's go through the steps involved in K-Means clustering for a better understanding.

    1) Select the number of clusters for the dataset ( K )
    2) Select K points as the initial centroids
    3) Using the Euclidean distance or Manhattan distance, assign each point to its nearest centroid, thus creating K groups
    4) Compute the new centroid of each group (the mean of the points assigned to it)
    5) Reassign every data point to its nearest new centroid, then repeat steps 4 and 5 until the positions of the centroids no longer change.
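    The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the scikit-learn implementation used later in this article: the function name kmeans_sketch, the toy array X_toy, and the choice of picking K random data points as the initial centroids are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain K-Means following the steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 3: Euclidean distance from every point to every centroid,
        # then assign each point to the nearest one.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: the new centroid of each group is the mean of its points
        # (an empty group keeps its previous centroid).
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups of 2-D points.
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                  [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
labels, centroids = kmeans_sketch(X_toy, k=2)
print(labels)
print(centroids)
```

    On this toy data the first three points end up in one cluster and the last three in the other. Which cluster is numbered 0 depends on the random initial centroids.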

    Finding the optimal number of clusters is an important part of this algorithm. A commonly used method for finding the optimal K value is the Elbow Method.

Elbow Method

    In the Elbow method, we vary the number of clusters ( K ) from 1 to 10. For each value of K, we calculate the WCSS ( Within-Cluster Sum of Squares ), which is the sum of the squared distances between each point and the centroid of its cluster, added up over all clusters. The WCSS is largest when K = 1 and decreases as the number of clusters increases. When we plot the WCSS against K, the curve drops rapidly at first and then bends sharply at one point, creating an elbow shape; beyond this point the curve runs almost parallel to the X-axis. The K value at this bend is taken as the optimal number of clusters.
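    To make the WCSS definition concrete, here is a small hand-checkable sketch. The helper name wcss and the four example points are made up for illustration; the centroids are supplied by hand rather than found by K-Means.

```python
import numpy as np

def wcss(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    diffs = X - centroids[labels]      # offset of each point from its centroid
    return float(np.sum(diffs ** 2))

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# K = 1: a single centroid at the overall mean (5, 1).
one_cluster = wcss(points, np.array([0, 0, 0, 0]), np.array([[5.0, 1.0]]))

# K = 2: one centroid per pair, at (0, 1) and (10, 1).
two_clusters = wcss(points, np.array([0, 0, 1, 1]),
                    np.array([[0.0, 1.0], [10.0, 1.0]]))

print(one_cluster, two_clusters)  # 104.0 4.0
```

    With one cluster, every point is 5 units away horizontally and 1 unit vertically from the centroid, giving 4 x (25 + 1) = 104. With two clusters, only the vertical offset remains, giving 4 x 1 = 4, which shows how sharply the WCSS falls when K matches the structure of the data.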

Now let's implement K-Means clustering using Python

    About Dataset (Download) - The dataset we are using here is the Mall Customers data. It is unlabeled data that contains details of the customers of a mall ( features like genre, age, annual income (k$), and spending score ). Our aim is to cluster the customers based on two relevant features: annual income and spending score.

First of all, we have to import essential libraries.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn

Now let's import the dataset and slice the important features 

dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values
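    Selecting columns by position works only as long as the column order in the file stays the same. As a hedged alternative, the same two features can be selected by name. The sketch below uses a small stand-in DataFrame, and the column names in it are an assumption about the layout of Mall_Customers.csv; check them against dataset.columns before relying on them.

```python
import pandas as pd

# Stand-in rows with the assumed column layout of Mall_Customers.csv;
# the values are made up purely for illustration.
dataset_demo = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Genre': ['Male', 'Female', 'Female'],
    'Age': [30, 42, 25],
    'Annual Income (k$)': [20, 55, 90],
    'Spending Score (1-100)': [70, 45, 15],
})

by_position = dataset_demo.iloc[:, [3, 4]].values
by_name = dataset_demo[['Annual Income (k$)', 'Spending Score (1-100)']].values

print((by_position == by_name).all())  # True
print(by_name.shape)                   # (3, 2)
```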

    We have to find the optimal K value for clustering the data. We use the Elbow method to do this.

from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    # inertia_ is the WCSS of the fitted model
    wcss.append(kmeans.inertia_)

    The "init" argument is the method for initializing the centroids. We have calculated the WCSS value for each K value. Now we have to plot the WCSS against the K value.

plt.plot(range(1, 11), wcss)
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

The graph will be an elbow-shaped curve of WCSS against the number of clusters.
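    The elbow can also be read from the numbers rather than by eye. The sketch below is an illustration on synthetic data generated with make_blobs, not on the Mall Customers file; the names X_demo and wcss_demo are made up for the example, and n_init is set explicitly only to keep the behaviour the same across scikit-learn versions. It prints the percentage drop in WCSS for each extra cluster, and the elbow is roughly where these drops become small.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (not the Mall Customers file): 200 points in 5 blobs.
X_demo, _ = make_blobs(n_samples=200, centers=5, cluster_std=0.6, random_state=42)

wcss_demo = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    model.fit(X_demo)
    wcss_demo.append(model.inertia_)

# Percentage drop in WCSS when moving from K-1 to K clusters.
for k in range(2, 11):
    drop = 100 * (wcss_demo[k - 2] - wcss_demo[k - 1]) / wcss_demo[k - 2]
    print(f"K={k}: WCSS={wcss_demo[k - 1]:.1f}, drop={drop:.1f}%")
```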

    In the graph, the elbow forms at K = 5, so the optimal number of clusters is 5. Now let's train the model on the dataset with 5 clusters.

kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)
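    fit_predict returns one cluster index (0 to K-1) per row of the input. A short sketch of what can be done with these labels follows, assuming a tiny made-up array X_demo of (annual income, spending score) pairs in place of the real data; all names ending in _demo are hypothetical, and n_init is set explicitly only for consistency across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for X: (annual income, spending score) pairs.
X_demo = np.array([[15, 39], [16, 6], [17, 40], [18, 6],
                   [78, 90], [80, 88], [79, 12], [81, 10]])

kmeans_demo = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
y_demo = kmeans_demo.fit_predict(X_demo)

# Number of customers in each cluster.
print(np.bincount(y_demo))

# Assign a new, unseen customer to the nearest learned centroid.
new_customer = np.array([[16, 38]])
print(kmeans_demo.predict(new_customer))
```

    The new customer lands in the same cluster as the first row of X_demo, since both have a low income and a similar spending score.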

Now let's plot all the clusters using matplotlib

plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 60, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 60, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 60, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 60, c = 'violet', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 60, c = 'yellow', label = 'Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 100, c = 'black', label = 'Centroids')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
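    The five scatter calls differ only in the cluster index, colour, and label, so they can also be written as a loop. The helper below is a hypothetical sketch (plot_clusters and the _demo arrays are not part of the original code) that draws the same kind of figure for any number of clusters.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_clusters(X, labels, centers, colors):
    """Draw one scatter series per cluster, then the centroids on top."""
    fig, ax = plt.subplots()
    for cluster_id, color in enumerate(colors):
        mask = labels == cluster_id
        ax.scatter(X[mask, 0], X[mask, 1], s=60, c=color,
                   label=f'Cluster {cluster_id + 1}')
    ax.scatter(centers[:, 0], centers[:, 1], s=100, c='black', label='Centroids')
    ax.set_xlabel('Annual Income (k$)')
    ax.set_ylabel('Spending Score (1-100)')
    ax.legend()
    return ax

# Tiny made-up example: two clusters and their centroids.
X_demo = np.array([[15.0, 39.0], [17.0, 40.0], [78.0, 90.0], [80.0, 88.0]])
labels_demo = np.array([0, 0, 1, 1])
centers_demo = np.array([[16.0, 39.5], [79.0, 89.0]])
ax = plot_clusters(X_demo, labels_demo, centers_demo, ['red', 'blue'])
```

    With the variables from this article, the call would be plot_clusters(X, y_kmeans, kmeans.cluster_centers_, ['red', 'blue', 'green', 'violet', 'yellow']), followed by plt.show() to display the figure.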


Full code -

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values

# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Training the K-Means model on the dataset
kmeans = KMeans(n_clusters = 5, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X)

# Visualising the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 60, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 60, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 60, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 60, c = 'violet', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 60, c = 'yellow', label = 'Cluster 5')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 100, c = 'black', label = 'Centroids')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

Now you have an idea of how K-Means clustering works. Stay tuned for upcoming posts on interesting topics in Machine Learning. Do practice to get a clear understanding.

Happy Reading!
