
Complete Guide to Clustering Techniques

In a bid to become a grandmaster on Kaggle, I started publishing notebooks, beginning with a simple data mining algorithm. The first algorithm I am taking up for explanation is clustering.

Please upvote my Kaggle notebook if you like this post.

Introduction

Clustering is an unsupervised learning technique: you take the entire dataset and find the “groups of similar entities” within it. There are no labels in the dataset.

It is useful for organizing a very large dataset into meaningful clusters that can be acted upon. For example, take an entire customer base of more than 1M records and try to group the customers into high-value customers, low-value customers, and so on.

What questions does clustering typically answer?

  • What types of pages are there on the Web?
  • What types of customers are there in my market?
  • What types of people are there on a social network?
  • What types of e-mails are there in my inbox?
  • What types of genes does the human genome have?

From clustering to classification

Clustering is the basis of classification problems. Initially, say we have a large, ungrouped set of users on a new social media platform. We know for certain that the number of groups will not be equal to the number of users, and that it will be reasonably finite. Even though users vary at a fine-grained level, they can be reasonably grouped into clusters. Each of these clusters becomes a class once we know which group each user falls into.

Partition clustering

If, before clustering starts, we assume that the data will fall into a fixed number of clusters and then partition the data into that many clusters, it is called partition clustering. In partition clustering, the number of clusters is known before the clustering is performed. k-Means is one of the most popular partition clustering techniques, where the data is partitioned into k unique clusters.

k-Means clustering

  • Let X = {x1, x2, x3, … xn} be N data points that need to be clustered into K clusters.
  • K falls between 1 and N, where:
    - If K = 1, the whole dataset is a single cluster, and the mean of the entire data is the cluster center we are looking for.
    - If K = N, each data point individually represents its own cluster.
    Typically, K falls strictly between 1 and N.

Formulation as an optimization problem

Let M = {m1, m2, m3, … mk} be the cluster means of the K clusters. Each m is the representative of one individual cluster that we are looking for.
The objective is to find a representative for each cluster that approximates its data points best, i.e., for which the error of approximation is minimal.

The objective function we are trying to minimize is the sum of squared distances between each data point and its cluster's representative.
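Using the notation above, this objective can be written as follows (a standard formulation of the k-means objective; C_k here denotes the set of points assigned to cluster k, a symbol not used elsewhere in this post):

J(M) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - m_k \rVert^2

Each point contributes the squared distance to the mean of the cluster it is assigned to, and k-Means searches for the means M that make this sum as small as possible.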

Implementation of k-Means using SKLearn
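The original notebook's code is not reproduced here, so below is a minimal sketch of how this could look with scikit-learn, assuming a synthetic dataset generated with make_blobs in place of the original data (the choice of 5 clusters matches the question that follows):

```python
# Minimal sketch: k-Means with scikit-learn on synthetic data
# (a stand-in for the original notebook's dataset).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate 1,000 points scattered around 5 centers.
X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

# Partition the data into K = 5 clusters.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # the cluster means m1, ..., mK
print(kmeans.inertia_)          # sum of squared distances to closest center
```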

But is 5 clusters good enough?

As said above, we need to specify exactly how many clusters we are looking for. The number of clusters is very difficult to guess intuitively; it can only be guessed if we know the data well.

Solution to the number-of-clusters problem

We run the k-Means algorithm with different numbers of clusters and plot the goodness of fit using the inertia attribute from sklearn, which gives the sum of squared distances of samples to their closest cluster center. The point where the inertia stops decreasing sharply suggests a reasonable number of clusters.
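A sketch of what that could look like, reusing the X array from the k-Means snippet above:

```python
# Sketch: run k-Means for several values of K and plot the inertia
# (sum of squared distances of samples to their closest cluster center).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Inertia")
plt.show()
```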

Limitations of k-means clustering

k-Means clustering can only separate clusters with linear boundaries, which means it will fail to recognize more complicated cluster shapes.

This can be illustrated with the make_moons dataset from sklearn, as shown below:
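A minimal sketch of that experiment, fitting a two-cluster k-Means to make_moons (the noise and sample-size parameters are assumptions):

```python
# Sketch: k-Means on the two-moons dataset, where linear boundaries fail.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)

# The straight k-Means boundary cuts across both half-moons
# instead of following their shape.
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels)
plt.show()
```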

The answer to this can be found in understanding Hierarchical Clustering.

Hierarchical Clustering

The natural world is full of hierarchies: food chains, organizational structures, the biological classification of species, and so on. Bottom-up hierarchical clustering is also known as agglomerative clustering.

The key hyperparameter in agglomerative clustering is the linkage, which defines the distance between two clusters. It plays a role similar to the cluster means M in k-Means clustering. It can be defined in several ways:

  • Single linkage: the distance between the two closest points of the two clusters.
  • Complete linkage: the distance between the two farthest points of the two clusters.
  • Average linkage: in between single and complete linkage; the average distance over all pairs of points is taken. This is robust to noise.

Implementation of agglomerative clustering using SKLearn

For the same moons dataset, we can now see that AgglomerativeClustering with single linkage recovers good clusters.
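A minimal sketch of that, reusing the X_moons array from the k-Means example above:

```python
# Sketch: agglomerative clustering with single linkage on the moons data.
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

agg = AgglomerativeClustering(n_clusters=2, linkage="single")
labels = agg.fit_predict(X_moons)

# With single linkage, each half-moon ends up in its own cluster.
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels)
plt.show()
```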

Spectral clustering

Spectral clustering works on similarity graphs, where each node represents an entity and the weight on each edge represents the similarity between the two entities it connects. Consider a graph where every node is connected to every other node by a weighted edge. If we want to split it into two clusters, we clearly want to cut the edges that have the lowest weights.

Implementation of Spectral clustering using SKLearn

Applied to the same make_moons dataset, spectral clustering also recovers the two half-moons, as shown below.
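A minimal sketch, again on the same moons data; building the similarity graph from nearest neighbors (affinity="nearest_neighbors") is an assumption here, not necessarily what the original notebook used:

```python
# Sketch: spectral clustering on the moons data, building the similarity
# graph from nearest neighbors.
import matplotlib.pyplot as plt
from sklearn.cluster import SpectralClustering

spec = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                          random_state=42)
labels = spec.fit_predict(X_moons)

# Spectral clustering also separates the two half-moons cleanly.
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels)
plt.show()
```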

Comparing and contrasting different clustering techniques

That is the end of my notebook explaining clustering techniques. I would welcome your feedback and suggestions.

