How K-Means Clustering Algorithm Works
In today’s world, where machine learning models implementation is so easy to find anywhere over the internet. It becomes paramount for all machine learning enthusiasts to get their hands dirty on topics related to it.
There are many fascinating topics of supervised and unsupervised learning or even reinforcement learning come up. But my favorite is the k-means clustering algorithm.
As the name suggests, it is a clustering algorithm.
Learn the popular clustering algorithm k-means clustering along with the implementation in python. #datascience #unsupervisedlearning #machinelearning #kmeansclustering #python
Even if you don’t know what is clustering, still, it is ok.
By the end of this article, you will learn everything you need to know about k-means clustering.
After reading this article, you don’t need to brush up on k-means clustering topics before attending any data scientist job interview.
Excited to learn 🙂
We are too 🙂
Great, before starting the article, let’s look at the topics you are going to learn in this article. Only if you read the complete article 😀
I am not kidding. It’s true. It will give you a better idea about the entire article flow.
Let’s split the k-means clustering into two parts,
Let’s learn clustering first. Then we will use this knowledge to understand k-means clustering.
What is clustering?
A cluster is a group of similar entities that are kept together. Their similarity decided by the feature they possess and how closely associated compared with the other entities to this feature.
Let’s say we have two points in a 2-d graph. Using the euclidean distance, we can measure how close these two points are located.
Likewise, using various similarity measures, we can find how close/similar the data points are.
All similar data points form clusters or groups. Creating these clusters in a meaningful way is called clustering.
In the machine learning world, clustering is the process in which we segregate a heap of data points into clusters on the basis of their features.
We will, discusses these features in the upcoming sections.
Clustering Real Life Example
One good real-life example for clustering is the world map. If you see here, each color represents a cluster. These clusters are created based on meaningful similarities.
For now, let’s say this similarity is distance. If you take any place in the cluster, it is closer to the center of that cluster compared with other clusters.
This is one of the main rules for creating clusters using any clustering algorithms.
Any point in the cluster should be closer to that cluster’s center and far from any other cluster.
In a more technical way, we can say the intra distance between the same points should be smaller compared with the inter points distance of different clusters.
- Intra Distance: Distance between the same cluster points.
- Inter Distance: Distance between different cluster points.
If the above statements are not clear, please go to the how to evaluate clusters section of this article. We provided an excellent visual example for this.
We hope the above sentence is clear by now. If not, read this sentence again. Once you have given the complete reading of the article.
How is clustering different from classification?
As a data science beginner, the difference between clustering and classification is confusing. So as the initial step, let us understand the fundamental difference between classification and clustering.
let us say we have four categories:
In this scenario, clustering would make 2 clusters. The one who lives on land and the other one lives in water.
So the entities of the first cluster would be dogs and cats. Similarly, for the second cluster, it would be sharks and goldfishes.
But in classification, it would classify the four categories into four different classes. One for each category. So dogs would be classified under the class dog, and similarly, it would be for the rest.
But in clustering, despite distinctions, we cannot classify them because we don’t have labels for them. And that is why clustering is an unsupervised learning algorithm.
In real life we can expect high volume of data without labels, Because of such great use, clustering technique have may real-time situations to help. Let us understand that.
Below are the listed clustering applications.
Clustering is widely used in recommendation engines to make clusters one’s likes and dislikes.
It clubs the pixels with similar values and segments them out from the rest of the image.
People with similar choices are clustered and studied in one category. It helps the firm in ways like promoting things to the right audience, taking the right feedback.
Various Clustering Algorithms
There are various clustering algorithms. Usage is dependent on their use cases. Below are the listed clustering algorithms.
- Agglomerative Clustering
- Gaussian Mixture Models
- Spectral Clustering
But in this article, we are focusing exclusively on K-Means algorithm.
How K-Means Clustering Works
Let us understand the basic intuition of this unsupervised learning algorithm.
The intuition of the algorithm
Let us start by understanding what does this “k” means in K-means. K is a free parameter that is for addressing the number of clusters we want to have out of the given data points.
From all the content mentioned above, what we understand from a cluster is that we intend to have only those entities in one cluster who are similar to each other.
The same is for K means clustering. It is a clustering algorithm that aims to have similar entities in one cluster.
Well, you may ask, how does this algorithm decide whether an entity would lie in it or not?
So the answer to it is that it calculates the distance between its data points to the centroid of that cluster and aims to minimize the sum of all the distances(the distance of each data point from the centroid).
In short it uses smilarity measures to decide that.
One small thing that we need to understand is that the more the number of clusters, the less would be the sum of the distance of all the data points from the centroid.
This is because of the very reason that the number of data points in each cluster would decrease with an increase in the number of clusters.
And at a point where the number of clusters is equal to the number of data points, the sum of distance becomes zeros because the centroid is the data point itself!
Now let us see how it works. Please refer to the below image for all the steps.
Here we are having a few data points, which we want to cluster. So we would start by picking the number of clusters we want to have for this case.
Let us select 2 for this instance. And then randomly selecting a point considering it to be the centroid of the cluster.
We have successfully marked the centers of these clusters. Now we will be marking all the points with respective colors on the basis of the distance they have from the centroid.
After marking all the data points, we will now be computing the centroid of this cluster again. We are doing it because initially, we had picked the centroid randomly. Then to remove error, if any, we are doing it.
The centroid of the cluster is computed by finding a point within the cluster that would be equidistant from all the data points.
Now since we have computed the centroid again and we know it is not the same as it was before so we would iterate the process again and would find the points nearest to this centroid for each cluster.
Now we have got the result again. One may ask when shall we stop the iteration of this finding the centroid and then placing the data points accordingly? Well, you have to do it till the time when the position of the centroids doesn’t change.
We marked the two clusters.
In this case, it was easy, so we were able to get the results in 2 iterations only.
We had also talked about the random initialization that we are putting ourselves into. With this a problem we have is that it can land us up with some really bad clusters which won’t be of any use.
How To Evaluate Clusters
Let us also understand different evaluation metrics for clustering. In classification evaluation metrics helps in understanding how good the build is performing on the unseen data. In the same way we are having ways to determine the performance of the clusters created.
Of many, we would discuss 2 criteria for evaluating clusters.
- Dunn Index
If you recall, we have discussed that it is very important for us to have similar entities in our cluster. So what it does basically calculates the sum of distances of all the entities present in the cluster.
Here comes the concept of inter and intra cluster distance. Intra cluster distance is handled by inertia, and that is the distance between the data points which are inside one cluster. Inter cluster distance means the distance between 2 different clusters.
So dunn index is the ratio of the minimum inter cluster distance to the maximum of intra cluster distance.
So more will be the value of the dunn index better would be the clusters in terms of being separable.
How K-Means++ Clustering Works
So to pull ourselves out of this random initialization trap, we have kmeans++.
Let us also see how this thing really works.
- Just like K means, here too we select the centroid randomly but the twist here is that there we used to select centroid for all the clusters and here we would be selecting the centroid randomly for only one cluster.
- Now we would be computing the distance between every data point from that cluster.
- Now comes the selecting of the cluster, here we would be choosing our second cluster by seeing which data point is the farthest from our centroid. Usually, we take the square of the distance just to be on a safer side.
- Now repeat the above steps until the desired number(k) of clusters have been selected.
We would be having a look at the implementation of this and along with that would look at how can we decide the right amount of clusters for the same.
Key Differences Between K means and Kmeans++
Methods to identify “K” in K means clustering
How to identify the best “K” ?
There are several methods to find an optimal number of clusters for KMeans clusters. To popular methods are
- Elbow Method
- Silhouette Method
It is to calculate the sum the distances from data points to centroids and aims at minimising the sum to an optimal value.
The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation).
Here we would be looking at the Elbow method. Details about the same are mentioned in the problem statement below.
K-means Clustering Implementation in Python
It is a problem to cluster people on the basis of their spending scores and income. In this problem, you will understand the dataset.
Also you will learn about how the elbow method determines the right number of cluster. At the we will learn the python implementation K-Means clustering and plotting the clusters
You can download the dataset from here.
Before jumping into this, we need to understand what exactly is wcss doing?
Wcss stands for the within-cluster sum of squares. Which is just a high-fi name for finding the sum of distances of all the data points to the centroid of the cluster.
In the code segment below, it would be starting off with 1 cluster and would go till 10. Always remember we want the sum of this distance to be as minimum as possible in a way where the number of data points in that cluster remains constant.
Here we are able to see a considerable decline in the value of WCSS after cluster 5. So this means that the optimal number of clusters is 5.
In this article we explained or provided a brief idea about k-means clustering. Also explained how clustering is different from classification, how we can evaluate clusters.
This gives the complete flow of how the K means algorithms works.
In that we had also seen more about the random initialisation trap and how can we use kmeans++ to pull ourselves out of it.
Lastly we had taken a look at a clustering based problem statement, which involved the concepts of choosing the right number of clusters and how to visualise it.
Recommended Machine Learning Courses
Cluster Analysis With Python
Unsupervised Learning Algorithms
A to Z Machine Learning with Python
Read More …